[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-627:
-----------------------------------
    Attachment: streaming-fix.patch

Some fixes in the patch streaming-fix.patch:
* The split operator wasn't always playing nicely with the way we run the pipeline one extra time in the mapper's or reducer's close function when a stream operator is present.
* Moved the MR optimizer that sets "stream in map" and "stream in reduce" to the end of the queue.
* PhyPlanVisitor forgot to pop some walkers it pushed on the stack. That can cause the NoopFilterRemoval stage to fail, because it looks in the wrong plan.
* Setting the job name to the script name by default came in through the last merge, but no longer worked.

PERFORMANCE: multi-query optimization
-------------------------------------
                Key: PIG-627
                URL: https://issues.apache.org/jira/browse/PIG-627
            Project: Pig
         Issue Type: Improvement
   Affects Versions: 0.2.0
           Reporter: Olga Natkovich
        Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch

Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:

A = load 'data' as (a, b, c);
B = filter A by a > 5;
store B into 'output1';
C = group B by b;
store C into 'output2';

This script results in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed, and filtered twice, which is unnecessary and costly.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
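The redundant work described in PIG-627 can be illustrated with a toy sketch (plain Python, not Pig internals; all names here are made up). Running the two stores as independent jobs scans the input once per job, while a multi-query plan runs the shared load+filter prefix once and feeds both outputs from it:

```python
# Toy model of the PIG-627 problem: two stores sharing a load+filter prefix.
data = [(1, 'x', 10), (7, 'y', 20), (9, 'x', 30)]
reads = 0

def load_and_filter():
    """Shared prefix: load 'data', filter by a > 5."""
    global reads
    reads += 1                      # count how often the input is scanned
    return [row for row in data if row[0] > 5]

# Without multi-query optimization: each store re-runs the shared prefix.
output1 = load_and_filter()                     # map-only job
output2 = {}
for row in load_and_filter():                   # map-reduce job re-reads input
    output2.setdefault(row[1], []).append(row)
print(reads)                                    # input scanned twice

# With multi-query optimization: the shared prefix runs once and is split.
reads = 0
filtered = load_and_filter()
output1 = filtered
output2 = {}
for row in filtered:
    output2.setdefault(row[1], []).append(row)
print(reads)                                    # input scanned once
```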
[jira] Updated: (PIG-765) to implement jdiff
[ https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giridharan Kesavan updated PIG-765:
-----------------------------------
    Attachment: pig-765.patch

This patch implements jdiff.

to implement jdiff
------------------
                Key: PIG-765
                URL: https://issues.apache.org/jira/browse/PIG-765
            Project: Pig
         Issue Type: Improvement
         Components: build
           Reporter: Giridharan Kesavan
           Assignee: Giridharan Kesavan
        Attachments: pig-765.patch
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-627:
-----------------------------------
    Attachment: merge-041409.patch

merge-041409.patch contains the latest merge from trunk to branch.

PERFORMANCE: multi-query optimization
-------------------------------------
                Key: PIG-627
                URL: https://issues.apache.org/jira/browse/PIG-627
[jira] Commented: (PIG-573) Changes to make Pig run with Hadoop 19
[ https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698910#action_12698910 ]

Kevin Weil commented on PIG-573:
--------------------------------
What is the current status of this patch with pig 0.2? Since PIG-563 went into 0.20, all that should be necessary is applying this single patch to the 0.20 release source, right?

Changes to make Pig run with Hadoop 19
--------------------------------------
                Key: PIG-573
                URL: https://issues.apache.org/jira/browse/PIG-573
            Project: Pig
         Issue Type: Task
   Affects Versions: 0.2.0
           Reporter: Pradeep Kamath
           Assignee: Pradeep Kamath
        Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch

This issue tracks changes to Pig code to make it work with Hadoop-0.19.x
Re: Ajax library for Pig
Would you want to contribute this to the Pig project or release it separately? Either way, keep us posted on your progress. It sounds interesting.

Alan.

On Apr 9, 2009, at 9:28 PM, nitesh bhatia wrote:

  Hi

  Thanks for the reply. This will be the architecture:

  1. Pig would be installed on some dedicated server machine (say P) with Hadoop support.
  2. In front of it will be a web server (say S).
     2.1 The web server will consist of a dedicated Tomcat server (say St) for handling DWR servlets.
     2.2 PigScript.js is the proposed JavaScript library.
     2.3 If the user is using a server other than Tomcat for the presentation layer (say httpd for PHP or IIS for ASP.NET), that server (say Su) will sit in front of St.

  - Connections between Su and St will be done through PigScript.js.
  - Connections between St and P will be done through DWR.
  - To get results from the server, this system will use reverse-Ajax calls (i.e. async calls from server to browser, an inbuilt feature of DWR).

  DWR is under Apache License V2.

  --nitesh

  On Wed, Apr 8, 2009 at 9:11 PM, Alan Gates ga...@yahoo-inc.com wrote:

    Sorry if these are silly questions, but I'm not very familiar with some of these technologies. So what you propose is that Pig would be installed on some dedicated server machine and a web server would be placed in front of it. Then client libraries would be developed that make calls to the web server. Would these client-side libraries include presentation in the browser, both for users submitting queries and receiving results? Also, Pig currently does not have a server mode, so any web server would have to spin off threads that run a Pig job. If the above is what you're proposing, I think it would be great. Opening up Pig to more users by making it browser accessible would be nice.

    Alan.

    On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote:

      Hi

      Since Pig is getting a lot of usage in industry and universities, how about adding front-end support for Pig? The plan is to write a jQuery/Dojo-style general JavaScript/Ajax library which can be used over any server technology (PHP, JSP, ASP, etc.) to call Pig functions over the web. Direct Web Remoting (DWR - http://directwebremoting.org ), an open source project at java.net, provides functionality that allows JavaScript in a browser to interact with Java on a server. Can we write a JavaScript library exclusively for Pig using DWR? I am not sure about licensing issues. The major advantages I can point to are:
      - Use of Pig over HTTP rather than SSH.
      - User management becomes easy, as it can be handled by any CMS.

      --nitesh

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat

Life is never perfect. It just depends where you draw the line.

visit: http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
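Alan's point that Pig has no server mode — so a web front end would have to spin off a thread per submitted job — can be sketched roughly as below. This is a hypothetical illustration, not a real Pig API: `run_pig_script` is a stand-in for whatever would actually launch pig.jar or a Pig job.

```python
# Sketch of a front end that runs each submitted Pig script on its own
# worker thread, since Pig has no server mode. run_pig_script is a
# placeholder, NOT a real Pig API.
import threading
import queue

results = queue.Queue()

def run_pig_script(user, script):
    # Placeholder: in reality this would launch a Pig process/job.
    return f"{user}: ran {len(script.splitlines())} statement(s)"

def submit(user, script):
    """Spin off one thread per Pig job submission and collect its result."""
    def worker():
        results.put(run_pig_script(user, script))
    t = threading.Thread(target=worker)
    t.start()
    return t

threads = [submit("user1", "A = load 'data';\nstore A into 'out1';"),
           submit("user2", "B = load 'data';")]
for t in threads:
    t.join()
```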
Re: Ajax library for Pig
Each pig program submission should involve a separate piglatin interpreter.

On Tue, Apr 14, 2009 at 2:32 PM, nitesh bhatia niteshbhatia...@gmail.com wrote:

  Hi

  Currently I have one doubt: how can this system be designed so that multiple users can run the same Pig installation? In the current scenario, each user executes their own copy of pig.jar in a shell and accesses Hadoop. But under this system, multiple users will log in to some domain and have separate sessions. Now suppose user1 submits a Pig script or accesses Pig, and then user2 also accesses the Pig shell. How will this system work for multiple users? I am not sure what the optimal solution would be.

  --nitesh

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
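Ted's suggestion of a separate interpreter per submission can be sketched as follows. `PigLatinInterpreter` is hypothetical (a stand-in for something like Pig's Java-side server object); the point is only that each submission gets fresh, isolated state rather than sharing one shell.

```python
# Sketch of "a separate piglatin interpreter per submission".
# PigLatinInterpreter is a made-up stand-in, not a real Pig class.
class PigLatinInterpreter:
    def __init__(self, user):
        self.user = user
        self.statements = []       # state isolated to this submission

    def execute(self, line):
        self.statements.append(line)
        return f"{self.user} queued: {line}"

def handle_submission(user, script):
    """Each submission gets a fresh interpreter -- no shared session state."""
    interp = PigLatinInterpreter(user)
    return [interp.execute(stmt) for stmt in script]

out1 = handle_submission("user1", ["A = load 'data';"])
out2 = handle_submission("user2", ["B = load 'data';", "dump B;"])
```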
[jira] Created: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
ava.lang.OutOfMemoryError: Java heap space
------------------------------------------
                Key: PIG-766
                URL: https://issues.apache.org/jira/browse/PIG-766
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.2.0
        Environment: Hadoop-0.18.3 (cloudera RPMs). mapred.child.java.opts=-Xmx1024m
           Reporter: Vadim Zaliva

My pig script always fails with the following error:

Java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
    at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
    at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698996#action_12698996 ]

Alan Gates commented on PIG-766:
--------------------------------
It isn't overall data size that matters; it is the size of a given key. So if you have a 2G data set with only one key (that is, every row has that key), you'll hit this problem (assuming you can't fit 2G in memory on your data nodes). Pig does try to spill to avoid this, but it has a hard time knowing when and how much to spill, and thus often runs out of memory. But I think you're right that this isn't in the join; from the stack it looks like it's trying to write data out of the map task. Do you have very large rows in this data?
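Alan's point — that memory pressure comes from the size of a single key's bag, not total data size — can be shown with a small sketch. This is a plain-Python illustration of key skew under GROUP BY, not Pig's actual grouping code:

```python
# Illustration of key skew: if every row shares one key, grouping produces
# a single bag that must hold the entire data set, regardless of total size.
from collections import defaultdict

rows = [("k", i) for i in range(100_000)]   # every row has the same key

bags = defaultdict(list)
for key, value in rows:
    bags[key].append(value)                 # one bag accumulates everything

largest_bag = max(len(bag) for bag in bags.values())
print(len(bags), largest_bag)               # 1 key, bag of 100000 values
```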
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699000#action_12699000 ]

Vadim Zaliva commented on PIG-766:
----------------------------------
I have at most 17M rows in my dataset. At some point I am doing a GROUP BY, and the longest row has about 500,000 tuples.
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699017#action_12699017 ]

Olga Natkovich commented on PIG-766:
------------------------------------
I asked a member of the Hadoop team to take a look. A possible problem is that there is a single record that does not fit into the combiner buffer. Hopefully we will get some help with this.
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699019#action_12699019 ]

Olga Natkovich commented on PIG-766:
------------------------------------
I got confirmation from Hadoop dev that this is a case of one huge record that is larger than the combiner buffer, which means it is over 90 MB. Does this sound right for your data? Is it possible you have data corruption? Do you have another data set to try this query with?
[jira] Issue Comment Edited: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699021#action_12699021 ]

Vadim Zaliva edited comment on PIG-766 at 4/14/09 6:53 PM:
-----------------------------------------------------------
I know for sure that at some point I have a record of 500K tuples, but with tuples being 40-50 bytes each, that is far from 90M. Even if this were the case, I do not see how it could cause an OutOfMemory exception in java.util.Arrays.copyOf(). Even if this record is, say, 200MB, given a JVM total heap size of 1GB, that could happen only if all of this memory is already used and 200MB is not left. How can I increase the combiner buffer size? What are the possible remedies/workarounds for this problem?
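Vadim's back-of-the-envelope estimate checks out arithmetically: 500K tuples at 40-50 bytes each is roughly 19-24 MB, well under the ~90 MB combiner-buffer threshold Olga cited. A quick sanity check:

```python
# Sanity check of the sizes discussed above: 500K tuples at 40-50 bytes each.
tuples = 500_000
low, high = tuples * 40, tuples * 50        # total bytes, low/high estimate

mb = 1024 * 1024
print(low / mb, high / mb)                  # roughly 19-24 MB

combiner_buffer_mb = 90                     # threshold Olga cited
assert high < combiner_buffer_mb * mb       # far from the 90 MB limit
```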
[jira] Commented: (PIG-750) Use combiner when a mix of algebraic and non-algebraic functions are used
[ https://issues.apache.org/jira/browse/PIG-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699023#action_12699023 ]

Amir Youssefi commented on PIG-750:
-----------------------------------
Other use cases we need to have in unit tests:

1) foreach X generate SUM(a) * AVG(b), ...
2) foreach X generate 1 / SUM(a)

Currently, the suggested workaround is to calculate all algebraic functions in one foreach and then calculate the remaining expressions/mixes in a second foreach. This way the combiner is used for the first foreach and we get the combiner speed-up.

Use combiner when a mix of algebraic and non-algebraic functions are used
-------------------------------------------------------------------------
                Key: PIG-750
                URL: https://issues.apache.org/jira/browse/PIG-750
            Project: Pig
         Issue Type: Improvement
           Reporter: Amir Youssefi
           Priority: Minor

Currently Pig uses the combiner when all of a, b, c, ... are algebraic (e.g. SUM, AVG, etc.) in a foreach:

foreach X generate a, b, c, ...

It would be a performance improvement to use the combiner when a mix of algebraic and non-algebraic functions is used as well.
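The two-foreach workaround described above might look like the following Pig Latin sketch (relation and field names are made up for illustration):

```pig
-- Hypothetical sketch of the suggested workaround.
A = load 'data' as (a:int, b:int);
X = group A all;
-- First foreach: algebraic functions only, so Pig can use the combiner.
Y = foreach X generate SUM(A.a) as sum_a, AVG(A.b) as avg_b;
-- Second foreach: the non-algebraic mix, computed from those results.
Z = foreach Y generate sum_a * avg_b, 1.0 / sum_a;
```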
[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699026#action_12699026 ]

Santhosh Srinivasan commented on PIG-766:
-----------------------------------------
You can specify the I/O sort buffer size on the command line as:

java -Dio.sort.mb=200 -cp pig.jar:/path_to_hadoop_site.xml

Reference: http://hadoop.apache.org/core/docs/current/hadoop-default.html