If you are only concerned about large partition sizes, you can specify the number of partitions as an additional parameter when loading files from HDFS.
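For example, in the spark-shell (where sc is already defined; the path and partition count here are placeholders):

    // Request at least 400 partitions instead of one per HDFS block:
    val lines = sc.textFile("hdfs:///path/to/files", 400)
    lines.partitions.size  // should be >= 400

Note this only controls how many splits you get, not where the split boundaries fall, so it doesn't by itself solve the fixed-size-record problem below.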
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser <kras...@gmail.com> wrote:
> You can also use your InputFormat/RecordReader in Spark, e.g. using
> newAPIHadoopFile. See here:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
> -Sven
>
> On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I want to process some files; they're kind of big, dozens of
>> gigabytes each. I read them as an array of bytes, and there's a
>> structure inside them.
>>
>> I have a header which describes the structure. It could be like:
>> Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ...
>> This structure appears N times in the file.
>>
>> So I know the size of each block, since it's fixed. There's no
>> separator between one block and the next.
>>
>> If I were doing this with MapReduce, I could implement a new
>> RecordReader and InputFormat to read each block, because I know
>> their size, and I'd fix the split size in the driver (block size x
>> 1000, for example). That way, each mapper's split would contain
>> only complete blocks, with no piece of the last block spilling into
>> the next split.
>>
>> Spark works with RDDs and partitions. How could I resize each
>> partition to do that? Is it possible? I guess Spark doesn't use
>> the RecordReader and related classes for these tasks.
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
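To expand on Sven's suggestion below: since the records are fixed size with no separators, you may not even need to write your own InputFormat. Hadoop ships FixedLengthInputFormat, which can be plugged into newAPIHadoopFile directly. A minimal sketch (assuming Hadoop 2.3+ on the classpath, and a hypothetical record length of 29 bytes, i.e. 8 + 16 + 4 + 1):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

    // Record length is hypothetical: 8 + 16 + 4 + 1 = 29 bytes per block.
    val conf = new Configuration(sc.hadoopConfiguration)
    FixedLengthInputFormat.setRecordLength(conf, 29)

    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/files",        // placeholder path
      classOf[FixedLengthInputFormat],
      classOf[LongWritable],          // byte offset of each record
      classOf[BytesWritable],         // one complete fixed-size block
      conf)

    // Each value is a whole block; splits never cut a record in half.
    val blocks = records.map { case (_, bytes) => bytes.copyBytes() }

One caveat: the record reader reuses the same BytesWritable instance, so copy the bytes (as with copyBytes() above) before caching or collecting the RDD.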