[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-07 Thread Giridharan Kesavan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696416#action_12696416
 ] 

Giridharan Kesavan commented on PIG-733:


I'm going to resubmit the patch to Hudson.

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-733-v2.patch, PIG-733.patch


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task, 
 so a large input that requires many maps yields a big sample (say 50,000 
 maps x 100 items = 5,000,000 sample records). This sorted sample is stored 
 on dfs. The WeightedRangePartitioner computes quantile boundaries and 
 weighted probabilities for repeated values in each map by reading the 
 samples file from DFS. In queries with many maps (on the order of 50,000), 
 the dfs read of the sample file fails with a FileSystem closed error. This 
 seems to point to a dfs issue wherein a big dfs file being read 
 simultaneously by many dfs clients (in this case all maps) causes the 
 clients to be closed. However, on the Pig side, loading the sample in each 
 map of the final map reduce job and computing the quantile boundaries and 
 weighted probabilities there is inefficient. We should do this computation 
 through a FindQuantiles udf in the same map reduce job which produces the 
 sorted samples. This way less data is written to dfs, and in the final map 
 reduce job the WeightedRangePartitioner needs to just load the computed 
 information.
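
For illustration, here is a minimal sketch of the proposed scheme, assuming a 
simplified setting: the sampling job computes the quantile boundaries once 
from the full sorted sample, and the partitioner in the final job only 
binary-searches that small precomputed list. FindQuantiles and 
WeightedRangePartitioner are the real names used above, but the class and 
methods below are hypothetical illustrations, not Pig's actual code; the 
weighted spreading of keys equal to a repeated boundary is noted but omitted.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the FindQuantiles idea; not Pig's implementation.
public class QuantileSketch {

    // Quantile boundaries: the i-th entry is the first key of partition i+1.
    // This runs once, in the job that already has the full sorted sample, so
    // only numPartitions - 1 values (not the whole sample) need to go to dfs.
    static <T extends Comparable<T>> List<T> findQuantiles(
            List<T> sortedSample, int numPartitions) {
        List<T> boundaries = new ArrayList<>();
        int n = sortedSample.size();
        for (int i = 1; i < numPartitions; i++) {
            boundaries.add(sortedSample.get((int) ((long) i * n / numPartitions)));
        }
        return boundaries;
    }

    // What the final job's partitioner would do after loading the boundaries:
    // a binary search in a small list instead of re-reading the full sample.
    // A key equal to a boundary that repeats across partitions would instead
    // be routed randomly using the precomputed weights (omitted here).
    static <T extends Comparable<T>> int getPartition(T key, List<T> boundaries) {
        int pos = Collections.binarySearch(boundaries, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        // Toy sample: 20 sorted integers split across 4 partitions.
        List<Integer> sample = new ArrayList<>();
        Random rand = new Random(42);
        for (int i = 0; i < 20; i++) sample.add(rand.nextInt(100));
        Collections.sort(sample);

        List<Integer> boundaries = findQuantiles(sample, 4);
        System.out.println("boundaries = " + boundaries);
        System.out.println("partition(50) -> " + getPartition(50, boundaries));
    }
}
{code}

With this shape, what each map of the final job loads is proportional to the 
number of partitions rather than to (number of sampling maps x 100) records, 
which is the saving the description above is after.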

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696442#action_12696442
 ] 

Hadoop QA commented on PIG-733:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404769/PIG-733-v2.patch
  against trunk revision 759376.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/console

This message is automatically generated.




[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696244#action_12696244
 ] 

Pradeep Kamath commented on PIG-733:


Tests are not included in this patch since there are existing tests for order 
by.

All core unit tests passed, and findbugs gave the same number of warnings (665) 
with and without the patch (output below). The excess warnings flagged by Hudson 
for the first patch have been addressed in the new version of the patch 
(PIG-733-v2.patch).

{noformat}
=== CORE UNIT TESTS OUTPUT WITH PATCH
[prade...@afterside:/tmp/PIG-733/trunk]


test-core:
[mkdir] Created dir: /tmp/PIG-733/trunk/build/test/logs
[junit] Running org.apache.pig.test.TestAdd
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.056 sec
...
[junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema
[junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.629 sec
[junit] Running org.apache.pig.test.TestUnion
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 49.94 sec

test-contrib:

BUILD SUCCESSFUL
Total time: 77 minutes 47 seconds

=== FINDBUGS OUTPUT WITH PATCH
[prade...@afterside:/tmp/PIG-733/trunk]

[prade...@chargesize:/tmp/PIG-733/trunk]ant 
-Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs
Buildfile: build.xml
...
findbugs:
[mkdir] Created dir: /tmp/PIG-733/trunk/build/test/findbugs
 [findbugs] Executing findbugs from ant task
 [findbugs] Running FindBugs...
 [findbugs] Warnings generated: 665
 [findbugs] Calculating exit code...
 [findbugs] Setting 'bugs found' flag (1)
 [findbugs] Exit code set to: 1
 [findbugs] Java Result: 1
 [findbugs] Output saved to 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml
 [xslt] Processing 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml to 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.html
 [xslt] Loading stylesheet 
/homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl

=== FINDBUGS OUTPUT WITHOUT PATCH
[prade...@chargesize:/tmp/svncheckout/trunk]ant 
-Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs
Buildfile: build.xml

check-for-findbugs:

...
findbugs:
[mkdir] Created dir: /tmp/svncheckout/trunk/build/test/findbugs
 [findbugs] Executing findbugs from ant task
 [findbugs] Running FindBugs...
 [findbugs] Warnings generated: 665
 [findbugs] Calculating exit code...
 [findbugs] Setting 'bugs found' flag (1)
 [findbugs] Exit code set to: 1
 [findbugs] Java Result: 1
 [findbugs] Output saved to 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml
 [xslt] Processing 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml to 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.html
 [xslt] Loading stylesheet 
/homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl
{noformat}




[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694855#action_12694855
 ] 

Hadoop QA commented on PIG-733:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404402/PIG-733.patch
  against trunk revision 759376.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 207 javac compiler warnings (more 
than the trunk's current 200 warnings).

-1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/18/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/18/console

This message is automatically generated.
