[ 
https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878336#action_12878336
 ] 

Ashutosh Chauhan commented on PIG-1442:
---------------------------------------

This looks like a variant of PIG-1446 and PIG-1448. PigCombiner attaches the 
tuple to the roots of the combine plan but never detaches it. PODemux also 
attaches the tuple to the inner plan but never detaches it. Note that 
PigCombiner may also contain multiple pipelines, depending on the number of 
operations done inside the ForEach, resulting in problems similar to those 
described in PIG-1448.
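
To make the pattern concrete, here is a minimal Java sketch of an 
attach-without-detach leak. The classes Root and Plan and the methods 
processLeaky and processWithDetach are hypothetical stand-ins written for 
this illustration; they are not Pig's actual PigCombiner, PODemux or plan 
classes, and this is not the actual patch. It only shows why every root that 
received a tuple keeps it reachable until something clears the reference.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: hypothetical stand-ins, not Pig's real classes. */
public class AttachDetachSketch {

    /** Stand-in for a plan root that keeps a reference to its current input. */
    public static class Root {
        private Object input; // strong reference: the tuple stays reachable

        void attachInput(Object tuple) { this.input = tuple; }
        void detachInput()             { this.input = null; }
        boolean holdsInput()           { return input != null; }
    }

    /** Stand-in for a combine plan with one root per pipeline. */
    public static class Plan {
        private final List<Root> roots = new ArrayList<>();

        public Plan(int pipelines) {
            for (int i = 0; i < pipelines; i++) {
                roots.add(new Root());
            }
        }

        void attachInput(Object tuple) {
            for (Root r : roots) r.attachInput(tuple);
        }

        void detachInput() {
            for (Root r : roots) r.detachInput();
        }

        /** Number of roots that still reference a tuple. */
        public int rootsHoldingInput() {
            int n = 0;
            for (Root r : roots) {
                if (r.holdsInput()) n++;
            }
            return n;
        }
    }

    /** Attaches the tuple and never detaches it: every root keeps it alive. */
    public static int processLeaky(Plan plan, Object tuple) {
        plan.attachInput(tuple);
        // ... the pipelines would run here ...
        return plan.rootsHoldingInput();
    }

    /** Attaches the tuple and detaches it once the pipelines are done. */
    public static int processWithDetach(Plan plan, Object tuple) {
        plan.attachInput(tuple);
        try {
            // ... the pipelines would run here ...
        } finally {
            plan.detachInput(); // release the reference even on failure
        }
        return plan.rootsHoldingInput();
    }

    public static void main(String[] args) {
        Object bigTuple = new byte[1024]; // placeholder for a large tuple
        System.out.println("leaky:    " + processLeaky(new Plan(2), bigTuple));
        System.out.println("detached: " + processWithDetach(new Plan(2), bigTuple));
    }
}
```

With several pipelines, the number of lingering references grows with the 
number of roots, which is the multi-pipeline concern raised above.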

> java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
> ---------------------------------------------------------------
>
>                 Key: PIG-1442
>                 URL: https://issues.apache.org/jira/browse/PIG-1442
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0, 0.7.0
>         Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev 
> (18/may)
> Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0
>            Reporter: Dirk Schmid
>
> As mentioned by Ashutosh, this is a reopen of 
> https://issues.apache.org/jira/browse/PIG-766 because there is still a 
> problem that causes Pig to scale only with the available memory.
> For convenience, here is the last entry of the PIG-766 JIRA ticket:
> {quote}1. Are you getting the exact same stack trace as mentioned in the 
> jira?{quote} Yes the same and some similar traces:
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:2786)
>       at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>       at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>       at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at 
> org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
>       at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at 
> org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>       at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
>       at 
> org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
>       at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
>       at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
>       at 
> org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>       at 
> org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at 
> org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
>       at 
> org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at 
> org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
>       at 
> org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at 
> org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>       at 
> org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:155)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:242)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:170)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.ArrayList.<init>(ArrayList.java:112)
>       at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
>       at 
> org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>       at 
> org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at 
> org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
>       at 
> org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
>       at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
>       at 
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:221)
>       at 
> org.apache.pig.builtin.Distinct.getDistinctFromNestedBags(Distinct.java:138)
>       at org.apache.pig.builtin.Distinct.access$200(Distinct.java:40)
>       at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:103)
>       at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:96)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:209)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:250)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:341)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:289)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:217)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:207)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> {noformat}
> {quote}
> 2. Which operations are you doing in your query - join, group-by, any other ?
> 3. What load/store func are you using to read and write data? PigStorage or 
> your own ?
> 4. What is your data size and memory available to your tasks?
> 5. Do you have very large records in your dataset, like hundreds of MB for 
> one record ?
> It would be great if you can paste here the script from which you get this 
> exception.
> {quote}
> When we started testing the transformation (see below), the 
> OutOfMemoryError first occurred with input datasets of about 150MB.
> Increasing the memory for the child VMs by setting {{mapred.child.java.opts}} 
> to {{600m}} fixed it for a while.
> With larger input datasets the problem reappears.
> *Input-Data:*
> A CSV file, ~14GB dataset, ~100,000,000 records per dataset, ~145 bytes per 
> record
> *Example:*
> {noformat} 
>   USER_ID                       REQUEST_DATE    SESSION                       
>           COMPANY SERVICENAME  SECTION_1  SECTION_2  SECTION_3  SECTION_4  
> SECTION_5  SECTION_6     SECTION SECTION_NEW
>   ac14263e-22082-2263455080-9   2010-03-02      
> ac14263e-22082-2263455080-9-1273015305  ABC     (NULL)       main       
> (NULL)     (NULL)     (NULL)     (NULL)     abc/main/mail /main/mail
>   ...
>   ...
> {noformat} 
> *The Pig-Script*
> {code}
> A = LOAD 'full_load' USING PigStorage('\t');
> B = FOREACH A GENERATE $4 AS servicename, $3 AS company, $2 AS session,
>                        $0 AS user_id, $5 AS section_1, $6 AS section_2,
>                        $7 AS section_3, $8 AS section_4, $9 AS section_5,
>                        $10 AS section_6, $11 AS section;
>
> /* 1st aggregation */
> S0 = GROUP B BY (servicename, company);
> S0_A = FOREACH S0 {
>            unique_clients = DISTINCT B.user_id;
>            visits = DISTINCT B.session;
>            GENERATE FLATTEN(group), COUNT(B) AS pi_count,
>                     COUNT(unique_clients) AS unique_clients_count,
>                     COUNT(visits) AS visit_count;
>        }
> S0_B = FOREACH S0_A GENERATE servicename, company, '' AS section_1,
>                              '' AS section_2, '' AS section_3,
>                              '' AS section_4, '' AS section_5,
>                              '' AS section_6, '' AS section, pi_count,
>                              unique_clients_count, visit_count, 0 AS level;
>
> /* 2nd aggregation */
> S1 = GROUP B BY (servicename, company, section_1);
> S1_A = FOREACH S1 {
>            unique_clients = DISTINCT B.user_id;
>            visits = DISTINCT B.session;
>            GENERATE FLATTEN(group), COUNT(B) AS pi_count,
>                     COUNT(unique_clients) AS unique_clients_count,
>                     COUNT(visits) AS visit_count;
>        }
> S1_B = FOREACH S1_A GENERATE servicename, company, section_1,
>                              '' AS section_2, '' AS section_3,
>                              '' AS section_4, '' AS section_5,
>                              '' AS section_6, '' AS section, pi_count,
>                              unique_clients_count, visit_count, 1 AS level;
>
> /* 3rd - 7th aggregation may follow here */
>
> /* build result */
> X = UNION S0_B, S1_B;
> STORE X INTO 'result' USING PigStorage ('\t');
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
