I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.
[1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

Davies

On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> Hi,
>
> I want to process some files; they're kind of big, dozens of
> gigabytes each. I get them as an array of bytes, and there's a
> structure inside them.
>
> I have a header which describes the structure. It could be like:
> Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ...
> This structure appears N times in the file.
>
> So I know the size of each block, since it's fixed. There's no
> separator between blocks.
>
> If I were doing this with MapReduce, I could implement a new
> RecordReader and InputFormat to read each block, because I know their
> size, and I'd fix the split size in the driver (block size x 1000, for
> example). That way I'd know that each mapper's split contains only
> complete blocks, with no piece of the last block spilling into the
> next split.
>
> Spark works with RDDs and partitions. How could I resize each
> partition to do that? Is it possible? I guess Spark doesn't use the
> RecordReader and related classes for these tasks.
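For the fixed-size layout described in the question, a minimal PySpark sketch along these lines might work. The file path, field order, and endianness are assumptions for illustration; binaryRecords hands back each fixed-length record as a complete byte string, so no record straddles a partition boundary, which is the guarantee a custom RecordReader would otherwise provide.

```python
from pyspark import SparkContext
import struct

sc = SparkContext(appName="FixedSizeRecords")

# Example layout from the question:
# Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte) = 29 bytes per record.
RECORD_SIZE = 8 + 16 + 4 + 1

# Each element of the RDD is one whole 29-byte record (hypothetical path).
records = sc.binaryRecords("hdfs:///data/blocks.bin", RECORD_SIZE)

def parse(raw):
    # "<q16sic": little-endian 8-byte int, 16-byte char field,
    # 4-byte int, 1-byte char -- adjust to the real header description.
    num1, text, num2, flag = struct.unpack("<q16sic", raw)
    return (num1, text.rstrip(b"\x00"), num2, flag)

parsed = records.map(parse)
print(parsed.take(5))
```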