Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile
Thanks for the clarification on the partitioning. I did what you suggested and tried reading in the individual part-* files -- some of them are ~1.7 GB in size, and those are the ones that fail. When I increase the number of partitions before writing to disk, it seems to work. It would be nice if this were somehow corrected automatically!

Thanks,

Rok

On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu wrote:
> HadoopRDD will try to split the file into ~64 MB partitions, which is how you end up with
> 1916+ partitions (assuming ~100 KB per row, the data is about 80 GB in total).
>
> I think there is very little chance that a single object or batch is bigger than 2 GB.
> Maybe there is a bug in how the pickled file is split -- could you create an RDD for each
> file and see which file (or files) causes the issue?
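A minimal sketch of the workaround described above, with a toy stand-in RDD and a placeholder HDFS path (neither is from the actual job) -- the point is simply to raise the partition count before saveAsPickleFile so that no part-* file, and no single pickled record, gets anywhere near 2 GB:

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-before-pickle-save")

    # Toy stand-in for the real ~800k-element RDD of ~1 MB objects; the path is a placeholder.
    rdd = sc.parallelize(range(800000)).map(lambda i: (i, "x" * 1000))

    # More partitions before writing means smaller part-* files and smaller pickled
    # records, keeping everything comfortably under the 2 GB BytesWritable limit.
    # batchSize (default 10) controls how many elements are pickled into one record.
    rdd.repartition(1024).saveAsPickleFile("hdfs:///tmp/mydata-pickled", batchSize=10)

After writing, the per-file sizes can be checked with something like hadoop fs -du on the output directory.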
Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile
HadoopRDD will try to split the file into ~64 MB partitions, which is how you end up with 1916+ partitions (assuming ~100 KB per row, the data is about 80 GB in total).

I think there is very little chance that a single object or batch is bigger than 2 GB.

Maybe there is a bug in how the pickled file is split -- could you create an RDD for each file and see which file (or files) causes the issue?

On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar wrote:
> hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how
> it could come about. The largest individual RDD elements are ~1 MB in size (most are smaller)
> and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in
> using some 1916+ partitions (I don't know why, actually -- where does this number come from?).
> How can I check if any objects/batches are exceeding 2 GB?
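A sketch of the per-file check suggested above, assuming the pyspark shell (where sc is predefined), a placeholder output path, and the usual part-00000 ... part-00133 naming -- counting each part file in isolation should reveal which one(s) throw the exception:

    # The path and the part-file naming are assumptions about this job, e.g. taken
    # from a `hadoop fs -ls` listing of the saveAsPickleFile output directory.
    part_files = ["hdfs:///tmp/mydata-pickled/part-%05d" % i for i in range(134)]

    bad_parts = []
    for path in part_files:
        try:
            # count() forces a full read of just this one part file.
            n = sc.pickleFile(path).count()
            print("%s: %d records" % (path, n))
        except Exception as exc:
            print("%s failed: %s" % (path, exc))
            bad_parts.append(path)

    print("problem files: %s" % bad_parts)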
Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile
hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how it could come about. The largest individual RDD elements are ~1 MB in size (most are smaller) and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in using some 1916+ partitions (I don't know why, actually -- where does this number come from?). How can I check if any objects/batches are exceeding 2 GB?

Thanks,

Rok

On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu wrote:
> Maybe it's caused by an integer overflow -- is it possible that one object or batch is
> bigger than 2 GB (after pickling)?
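On where the 1916+ number comes from: the partition count of the loaded RDD is driven by the Hadoop input splits rather than by the 134 part files, and it can be inspected directly. A small sketch, assuming the pyspark shell and a placeholder path:

    rdd = sc.pickleFile("hdfs:///tmp/mydata-pickled")

    # The count comes from the Hadoop input splits (~64 MB each), not from the
    # number of part-* files, which is why 134 files show up as 1916+ partitions.
    print(rdd.getNumPartitions())

    # minPartitions is only a lower bound: it can ask for more splits, never fewer.
    print(sc.pickleFile("hdfs:///tmp/mydata-pickled", minPartitions=500).getNumPartitions())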
Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile
Maybe it's caused by an integer overflow -- is it possible that one object or batch is bigger than 2 GB (after pickling)?

On Tue, Jan 27, 2015 at 7:59 AM, rok wrote:
> I've got a dataset saved with saveAsPickleFile using pyspark -- it saves without problems.
> When I try to read it back in, it fails with a java.lang.NegativeArraySizeException in
> org.apache.hadoop.io.BytesWritable (full message and stack trace below).
>
> Not really sure where to start looking for the culprit -- any suggestions are most welcome.
> Thanks!
>
> Rok
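One way to check this from the Python side, assuming `rdd` stands in for the in-memory RDD before it was written to disk (the name is a placeholder, and the factor of 10 is just the default batchSize of saveAsPickleFile, not anything measured from this job):

    import pickle

    # Pickled size of each element, in bytes (protocol 2, as far as I know the same
    # protocol PySpark's pickle serializer used at the time).
    max_elem = rdd.map(lambda x: len(pickle.dumps(x, 2))).max()
    print("largest pickled element: %d bytes" % max_elem)

    # saveAsPickleFile pickles elements in batches (batchSize=10 by default), so a
    # crude upper bound on one written record is ~10x the largest element.
    print("rough worst-case batch: %d bytes (2 GB limit: %d)" % (10 * max_elem, 2 ** 31 - 1))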
NegativeArraySizeException in pyspark when loading an RDD pickleFile
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves without problems. When I try to read it back in, it fails with:

Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most recent failure: Lost task 401.3 in stage 0.0 (TID 449, e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
        org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
        org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:98)
        org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1875)
        org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1848)
        org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
        org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:330)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

Not really sure where to start looking for the culprit -- any suggestions are most welcome. Thanks!

Rok

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD-pickleFile-tp21395.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
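For reference, a minimal sketch of the save/load round trip involved here, with toy data and placeholder paths rather than the real 80 GB dataset:

    from pyspark import SparkContext

    sc = SparkContext(appName="pickle-file-roundtrip")

    # Toy stand-in data and a placeholder path, not the actual job.
    data = sc.parallelize(range(1000)).map(lambda i: {"id": i, "payload": "x" * 100})

    # Elements are pickled in batches and written as SequenceFile byte records.
    data.saveAsPickleFile("hdfs:///tmp/pickle-demo")

    # This is the read that raises NegativeArraySizeException on the real dataset.
    loaded = sc.pickleFile("hdfs:///tmp/pickle-demo")
    print(loaded.count())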