[
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696442#action_12696442
]
Hadoop QA commented on PIG-733:
-------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12404769/PIG-733-v2.patch
against trunk revision 759376.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified
tests.
Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac
compiler warnings.
-1 findbugs. The patch appears to introduce 5 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/console
This message is automatically generated.
> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>
> Key: PIG-733
> URL: https://issues.apache.org/jira/browse/PIG-733
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.2.0
> Reporter: Pradeep Kamath
> Assignee: Pradeep Kamath
> Fix For: 0.3.0
>
> Attachments: PIG-733-v2.patch, PIG-733.patch
>
>
> Order by has a sampling job which samples the input and creates a sorted list
> of sample items. Currently the number of items sampled is 100 per map task.
> So if the input is large, resulting in many maps (say 50,000), the sample is
> big. This sorted sample is stored on dfs. The WeightedRangePartitioner
> computes quantile boundaries and weighted probabilities for repeating values
> in each map by reading the samples file from DFS. In queries with many maps
> (in the order of 50,000) the dfs read of the sample file fails with
> "FileSystem closed" error. This seems to point to a dfs issue wherein a big
> dfs file being read simultaneously by many dfs clients (in this case, all
> the maps) causes the clients to be closed. However, on the Pig side, loading
> the sample in each map of the final map reduce job and computing the quantile
> boundaries and weighted probabilities there is inefficient. We should do this
> computation through a FindQuantiles udf in the same map reduce job that
> produces the sorted samples. This way less data is written to dfs, and in
> the final map reduce job, the WeightedRangePartitioner needs only to load the
> computed information.
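The quantile computation described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of picking partition boundaries from a sorted sample, not Pig's actual FindQuantiles or WeightedRangePartitioner code; the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class QuantileSketch {

    // Given a sorted list of sampled keys, pick (numPartitions - 1) boundary
    // keys so that each range partition receives a roughly equal share of the
    // sampled keys. A real implementation would also track weighted
    // probabilities for keys that repeat across boundaries.
    static <T> List<T> quantileBoundaries(List<T> sortedSamples, int numPartitions) {
        List<T> boundaries = new ArrayList<>();
        int n = sortedSamples.size();
        for (int i = 1; i < numPartitions; i++) {
            // Index of the i-th quantile within the sorted sample.
            boundaries.add(sortedSamples.get((int) ((long) i * n / numPartitions)));
        }
        return boundaries;
    }

    public static void main(String[] args) {
        List<Integer> samples = new ArrayList<>();
        for (int i = 0; i < 100; i++) samples.add(i);
        // With 100 evenly spread samples and 4 partitions, the boundaries
        // fall at the 25th, 50th, and 75th sample.
        System.out.println(quantileBoundaries(samples, 4)); // [25, 50, 75]
    }
}
```

Computing these boundaries once, in the sampling job, means the final job's partitioner only has to read the small boundary list rather than the full sample from every map.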
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.