Thanks for the clarification on the partitioning. I did what you suggested and tried reading in individual part-* files -- some of them are ~1.7 GB in size, and those are the ones that fail. When I increase the number of partitions before writing to disk, so that each part file is smaller, reading back seems to work. It would be nice if this were somehow handled automatically!
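For anyone hitting the same thing, here is a rough sketch of the workaround, not a tested fix. It picks a partition count so that the average part file stays well below the ~1.7 GB that failed for me, then repartitions before saving. The helper names and the 256 MB target are my own assumptions; `total_bytes` is an estimate you supply yourself (Davies' back-of-envelope figure was 80G), and `rdd` is assumed to be an existing PySpark RDD.

```python
def partitions_for(total_bytes, max_part_bytes=256 * 1024**2):
    """Smallest partition count that keeps the *average* part at or
    below max_part_bytes. The 256 MB default is an arbitrary choice."""
    if total_bytes < 0 or max_part_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_part_bytes > 0")
    # Integer ceiling division, always at least one partition.
    return max(1, (total_bytes + max_part_bytes - 1) // max_part_bytes)


def save_in_small_parts(rdd, path, total_bytes, max_part_bytes=256 * 1024**2):
    """Hypothetical helper: repartition, then write with saveAsPickleFile.

    repartition() shuffles the data; it evens out partition sizes but
    does not guarantee that every part ends up under the target.
    """
    n = partitions_for(total_bytes, max_part_bytes)
    rdd.repartition(n).saveAsPickleFile(path)
    return n
```

With the 80G estimate this gives 320 parts instead of the 134 I had, i.e. `save_in_small_parts(rdd, path, 80 * 1024**3)`. Since this only bounds the average part size, skewed data may still need a smaller target.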
Thanks,
Rok

On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu <dav...@databricks.com> wrote:
> HadoopRDD will try to split the file as 64M partitions in size, so you
> got 1916+ partitions.
> (assume 100k per row, they are 80G in size).
>
> I think it has very small chance that one object or one batch will be
> bigger than 2G.
> Maybe there are a bug when it split the pickled file, could you create
> a RDD for each file, then see which file is cause the issue (maybe some
> of them)?
>
> On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar <rokros...@gmail.com> wrote:
> > hi, thanks for the quick answer -- I suppose this is possible, though I
> > don't understand how it could come about. The largest individual RDD
> > elements are ~ 1 Mb in size (most are smaller) and the RDD is composed of
> > 800k of them. The file is saved in 134 parts, but is being read in using
> > some 1916+ partitions (I don't know why actually -- how does this number
> > come about?). How can I check if any objects/batches are exceeding 2Gb?
> >
> > Thanks,
> >
> > Rok
> >
> > On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu <dav...@databricks.com> wrote:
> >>
> >> Maybe it's caused by integer overflow, is it possible that one object
> >> or batch bigger than 2G (after pickling)?
> >>
> >> On Tue, Jan 27, 2015 at 7:59 AM, rok <rokros...@gmail.com> wrote:
> >> > I've got an dataset saved with saveAsPickleFile using pyspark -- it
> >> > saves without problems. When I try to read it back in, it fails with:
> >> >
> >> > Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times,
> >> > most recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> >> > e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
> >> >
> >> > org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
> >> > org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:98)
> >> > org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
> >> > org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >> > org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >> > org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1875)
> >> > org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1848)
> >> > org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >> > org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >> > org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
> >> > org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
> >> > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> >> > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> >> > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >> > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:330)
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >> > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >> > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> >> > org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
> >> >
> >> > Not really sure where to start looking for the culprit -- any
> >> > suggestions most welcome. Thanks!
> >> >
> >> > Rok
> >> >
> >> > --
> >> > View this message in context:
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD-pickleFile-tp21395.html