Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
Thanks for the clarification on the partitioning.

I did what you suggested and tried reading in individual part-* files --
some of them are ~1.7 GB in size and that's where it's failing. When I
increase the number of partitions before writing to disk, it seems to work.
It would be nice if this were somehow handled automatically!

Thanks,

Rok

On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu  wrote:

> HadoopRDD tries to split the input into partitions of roughly 64 MB each,
> which is why you end up with 1916+ partitions (assuming ~100 KB per row,
> the dataset is about 80 GB in total).
>
> I think it's very unlikely that a single object or batch is bigger than
> 2 GB. Maybe there is a bug in how the pickled file is split -- could you
> create an RDD for each part file and see which file causes the issue
> (maybe several of them do)?
>
> On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar  wrote:
> > Hi, thanks for the quick answer -- I suppose this is possible, though I
> > don't understand how it could come about. The largest individual RDD
> > elements are ~1 MB (most are smaller) and the RDD is composed of 800k of
> > them. The file is saved in 134 parts, but is being read back in using some
> > 1916+ partitions (I don't actually know why -- where does this number come
> > from?). How can I check whether any objects/batches exceed 2 GB?
> >
> > Thanks,
> >
> > Rok
> >
> >
> > On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu  wrote:
> >>
> >> Maybe it's caused by an integer overflow -- is it possible that one
> >> object or batch is bigger than 2 GB (after pickling)?
> >>
> >> On Tue, Jan 27, 2015 at 7:59 AM, rok  wrote:
> >> > I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> >> > without problems. When I try to read it back in, it fails with:
> >> >
> >> > Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times,
> >> > most recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> >> > e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
> >> >
> >> > [full stack trace snipped; see the original message at the end of the thread]
> >> >
> >> > Not really sure where to start looking for the culprit -- any
> >> > suggestions most welcome. Thanks!
> >> >
> >> > Rok
> >
> >
>


Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
HadoopRDD tries to split the input into partitions of roughly 64 MB each,
which is why you end up with 1916+ partitions (assuming ~100 KB per row,
the dataset is about 80 GB in total).

I think it's very unlikely that a single object or batch is bigger than
2 GB. Maybe there is a bug in how the pickled file is split -- could you
create an RDD for each part file and see which file causes the issue
(maybe several of them do)?
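Something like this rough sketch should narrow it down (run from the pyspark
shell; the directory and the part-file naming are placeholders for your
actual output):

    # Load each part-* file on its own and force a full read with count().
    # Whichever file raises the NegativeArraySizeException is the bad one.
    base = "hdfs:///path/to/pickled/output"  # placeholder for the saveAsPickleFile directory
    for i in range(134):                     # you said the RDD was saved in 134 parts
        part = "%s/part-%05d" % (base, i)
        try:
            n = sc.pickleFile(part).count()
            print("%s ok (%d rows)" % (part, n))
        except Exception as e:
            print("%s failed: %s" % (part, e))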

On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar  wrote:
> Hi, thanks for the quick answer -- I suppose this is possible, though I
> don't understand how it could come about. The largest individual RDD
> elements are ~1 MB (most are smaller) and the RDD is composed of 800k of
> them. The file is saved in 134 parts, but is being read back in using some
> 1916+ partitions (I don't actually know why -- where does this number come
> from?). How can I check whether any objects/batches exceed 2 GB?
>
> Thanks,
>
> Rok
>
>
> On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu  wrote:
>>
> >> Maybe it's caused by an integer overflow -- is it possible that one
> >> object or batch is bigger than 2 GB (after pickling)?
>>
>> On Tue, Jan 27, 2015 at 7:59 AM, rok  wrote:
> >> > I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> >> > without problems. When I try to read it back in, it fails with:
>> >
> >> > Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times,
> >> > most recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> >> > e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
> >> >
> >> > [full stack trace snipped; see the original message at the end of the thread]
>> >
>> >
> >> > Not really sure where to start looking for the culprit -- any
> >> > suggestions most welcome. Thanks!
>> >
>> > Rok
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD-pickleFile-tp21395.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
>




Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
Hi, thanks for the quick answer -- I suppose this is possible, though I
don't understand how it could come about. The largest individual RDD
elements are ~1 MB (most are smaller) and the RDD is composed of 800k of
them. The file is saved in 134 parts, but is being read back in using some
1916+ partitions (I don't actually know why -- where does this number come
from?). How can I check whether any objects/batches exceed 2 GB?

Thanks,

Rok


On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu  wrote:

> Maybe it's caused by an integer overflow -- is it possible that one object
> or batch is bigger than 2 GB (after pickling)?
>
> On Tue, Jan 27, 2015 at 7:59 AM, rok  wrote:
> > I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> > without problems. When I try to read it back in, it fails with:
> >
> > Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times,
> > most recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> > e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
> >
> > [full stack trace snipped; see the original message at the end of the thread]
> >
> >
> > Not really sure where to start looking for the culprit -- any suggestions
> > most welcome. Thanks!
> >
> > Rok
> >
> >
> >
> >
> >
>


Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread Davies Liu
Maybe it's caused by an integer overflow -- is it possible that one object
or batch is bigger than 2 GB (after pickling)?
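You could estimate that with something like this (a rough sketch; it only
measures individual elements, not the batches that saveAsPickleFile groups
them into, and `rdd` stands for the RDD you saved):

    import pickle

    # Pickled size, in bytes, of the single largest element in the RDD.
    max_bytes = rdd.map(lambda x: len(pickle.dumps(x, 2))).max()
    print(max_bytes)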

On Tue, Jan 27, 2015 at 7:59 AM, rok  wrote:
> I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> without problems. When I try to read it back in, it fails with:
>
> Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most
> recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
>
> [full stack trace snipped; see the original message below]
>
>
> Not really sure where to start looking for the culprit -- any suggestions
> most welcome. Thanks!
>
> Rok
>
>
>
>
>




NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread rok
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
without problems. When I try to read it back in, it fails with:

Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most
recent failure: Lost task 401.3 in stage 0.0 (TID 449,
e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
        org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
        org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:98)
        org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1875)
        org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1848)
        org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
        org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:330)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)


Not really sure where to start looking for the culprit -- any suggestions
most welcome. Thanks!
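For context, the calls involved are just the standard save/load pair (run
from the pyspark shell; the path is illustrative):

    # Saving works without problems.
    # batchSize defaults to 10 objects per pickle batch.
    rdd.saveAsPickleFile("hdfs:///user/rok/mydata")

    # Reading it back in is what fails with the NegativeArraySizeException.
    reloaded = sc.pickleFile("hdfs:///user/rok/mydata")
    reloaded.count()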

Rok




