[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on large input

Hadoop QA (JIRA) Tue, 07 Apr 2009 02:05:38 -0700

    [ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696442#action_12696442 ]


Hadoop QA commented on PIG-733:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404769/PIG-733-v2.patch
  against trunk revision 759376.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/console

This message is automatically generated.

> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem 
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-733
>                 URL: https://issues.apache.org/jira/browse/PIG-733
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-733-v2.patch, PIG-733.patch
>
>
> Order by has a sampling job which samples the input and creates a sorted list 
> of sample items. Currently the number of items sampled is 100 per map task, 
> so if the input is large and results in many maps (say 50,000), the sample is 
> big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
> computes quantile boundaries and weighted probabilities for repeating values 
> in each map by reading the samples file from dfs. In queries with many maps 
> (on the order of 50,000) the dfs read of the sample file fails with a 
> "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
> dfs file being read simultaneously by many dfs clients (in this case all 
> maps) causes the clients to be closed. However, on the pig side, loading the 
> sample from each map in the final map reduce job and computing the quantile 
> boundaries and weighted probabilities is inefficient. We should do this 
> computation through a FindQuantiles udf in the same map reduce job which 
> produces the sorted samples. This way less data is written to dfs, and in 
> the final map reduce job the WeightedRangePartitioner just needs to load the 
> computed information.
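
For illustration only, here is a minimal Python sketch of the kind of computation the description proposes moving into the sampling job. It is not the code in PIG-733-v2.patch; the function name find_quantiles, its parameters, and the even-split rule are assumptions made for this example. Given the sorted sample and the number of partitions, it returns the quantile boundaries plus, for every key whose samples straddle more than one partition, the fraction of that key's samples falling in each partition. At 100 samples per map and 50,000 maps the raw sample holds 5,000,000 items, while this summary holds only the boundaries and the straddling keys.

```python
from collections import defaultdict


def find_quantiles(sorted_samples, num_partitions):
    """Summarise a sorted sample for a range partitioner (hypothetical helper).

    Returns (boundaries, weights):
      boundaries: the first sample key of partitions 1..num_partitions-1
      weights:    {key: {partition: probability}} for keys spread over
                  more than one partition
    """
    n = len(sorted_samples)
    if n == 0 or num_partitions < 1:
        return [], {}

    # Assumption: sample i is assigned to partition (i * num_partitions) // n,
    # which splits the sorted sample into near-equal contiguous ranges.
    boundaries = []
    for p in range(1, num_partitions):
        first_index = (p * n + num_partitions - 1) // num_partitions
        boundaries.append(sorted_samples[min(first_index, n - 1)])

    # Count how many samples of each key land in each partition.
    counts = defaultdict(lambda: defaultdict(int))
    for i, key in enumerate(sorted_samples):
        counts[key][(i * num_partitions) // n] += 1

    # Only keys that repeat across a partition boundary need weights.
    weights = {}
    for key, per_partition in counts.items():
        if len(per_partition) > 1:
            total = sum(per_partition.values())
            weights[key] = {p: c / total for p, c in per_partition.items()}
    return boundaries, weights
```

With a summary like this, a partitioner would send a non-repeating key to the range its value falls in, and pick a partition at random according to the stored probabilities for a key listed in weights.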

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
