[ https://issues.apache.org/jira/browse/PIG-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Praveen Rachabattuni updated PIG-4313:
--------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: PIG-4059

> StackOverflowError in LIMIT operation on Spark
> ----------------------------------------------
>
>                 Key: PIG-4313
>                 URL: https://issues.apache.org/jira/browse/PIG-4313
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Carlos Balduz
>            Assignee: Carlos Balduz
>             Fix For: spark-branch
>
>         Attachments: PIG-4313.patch
>
>
> The current implementation of the LIMIT operation does not take the 
> limit declared in the script into account when iterating over the 
> tuples: POOutputConsumerIterator will iterate over all of them even 
> after the limit has been reached. It uses POLimit's getNextTuple method 
> to fetch the next result, which returns a Result with a STATUS_EOP 
> status once the limit has been reached. Since POOutputConsumerIterator 
> recursively calls its readNext method whenever STATUS_EOP is returned, 
> a StackOverflowError is thrown for large datasets.
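> For reference, a script of roughly this shape (a hypothetical repro, 
> inferred from the ToDate/bytesToDateTime frames in the trace below, not 
> taken from the ticket) triggers the error on any sufficiently large input:
> {code}
> A = LOAD 'input' USING PigStorage(',') AS (d:datetime);
> B = LIMIT A 10;
> DUMP B;
> {code}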
> I have solved this by creating a subclass of POOutputConsumerIterator 
> that takes the limit into account when reading results, so that it 
> stops once it has found the desired number of STATUS_OK results (a 
> sketch of the approach follows the trace below).
> {code:java}
> 2014-11-07 13:35:33,715 [Result resolver thread-0] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, master): java.lang.StackOverflowError:
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatterBuilder$MatchingParser.parseInto(DateTimeFormatterBuilder.java:2793)
>         org.joda.time.format.DateTimeFormatterBuilder$Composite.parseInto(DateTimeFormatterBuilder.java:2695)
>         org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:846)
>         org.apache.pig.builtin.ToDate.extractDateTime(ToDate.java:124)
>         org.apache.pig.builtin.Utf8StorageConverter.bytesToDateTime(Utf8StorageConverter.java:541)
>         org.apache.pig.impl.util.CastUtils.convertToType(CastUtils.java:61)
>         org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:339)
>         org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:282)
>         arq.hadoop.pig.loaders.PigStorage.getNext(PigStorage.java:49)
>         org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>         org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
>         org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>         scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>         scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         org.apache.pig.backend.hadoop.executionengine.spark.converter.LimitConverter$LimitFunction$POOutputConsumerIterator.readNext(LimitConverter.java:98)
>         ...
>         (~1000 lines)
> {code}
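> Below is a minimal, self-contained sketch of the approach, not the 
> actual patch: the class and member names (LimitAwareIterator, okCount, 
> getNextStatus) are hypothetical, and Pig's POStatus constants are 
> stubbed. The idea is that the recursion on STATUS_EOP becomes a loop, 
> and reading stops outright once the desired number of STATUS_OK results 
> has been seen.
> {code:java}
> import java.util.Arrays;
> import java.util.Iterator;
> 
> // Hypothetical sketch, not PIG-4313.patch: a limit-aware iterator that
> // loops instead of recursing when the upstream operator answers
> // STATUS_EOP, and that stops pulling input once the limit is reached.
> public class LimitAwareIterator implements Iterator<String> {
>     // Stand-ins for Pig's POStatus constants.
>     private static final byte STATUS_OK = 0, STATUS_EOP = 1, STATUS_EOF = 2;
> 
>     private final Iterator<String> input; // upstream tuples
>     private final long limit;             // LIMIT value from the script
>     private long okCount = 0;             // STATUS_OK results so far
>     private String current;               // tuple from the last STATUS_OK
> 
>     public LimitAwareIterator(Iterator<String> input, long limit) {
>         this.input = input;
>         this.limit = limit;
>         readNext();
>     }
> 
>     // Mimics POLimit.getNextTuple(): STATUS_OK until the limit is
>     // reached, STATUS_EOP for every remaining tuple, STATUS_EOF when
>     // the input is exhausted.
>     private byte getNextStatus() {
>         if (!input.hasNext()) {
>             return STATUS_EOF;
>         }
>         String tuple = input.next();
>         if (okCount >= limit) {
>             return STATUS_EOP;
>         }
>         current = tuple;
>         return STATUS_OK;
>     }
> 
>     // The fix: iterate rather than recurse on STATUS_EOP, and end the
>     // iteration as soon as `limit` STATUS_OK results have been seen.
>     private void readNext() {
>         current = null;
>         while (okCount < limit) {
>             byte status = getNextStatus();
>             if (status == STATUS_OK) {
>                 okCount++;
>                 return;
>             }
>             if (status == STATUS_EOF) {
>                 return;
>             }
>             // STATUS_EOP: loop to the next input (the buggy code recursed here).
>         }
>     }
> 
>     @Override
>     public boolean hasNext() {
>         return current != null;
>     }
> 
>     @Override
>     public String next() {
>         String result = current;
>         readNext();
>         return result;
>     }
> 
>     public static void main(String[] args) {
>         // With a limit of 2 over five tuples, this prints t1 and t2 and
>         // never touches t3..t5; the recursive variant would have burned
>         // one stack frame per remaining post-limit tuple.
>         Iterator<String> limited = new LimitAwareIterator(
>                 Arrays.asList("t1", "t2", "t3", "t4", "t5").iterator(), 2);
>         while (limited.hasNext()) {
>             System.out.println(limited.next());
>         }
>     }
> }
> {code}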



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
