Re: Define size partitions

2015-01-30 Thread Davies Liu
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case. [1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords Davies On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process some
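
A minimal PySpark sketch of this suggestion (the path, record length, and field layout are illustrative, derived from the fixed-width header described in the original message):

    import struct
    from pyspark import SparkContext

    sc = SparkContext(appName="fixed-width-records")

    # Hypothetical fixed record length: 8 + 16 + 4 + 1 = 29 bytes,
    # matching the Number/Char layout described in the question.
    RECORD_LENGTH = 29

    # binaryRecords splits the input into records of exactly RECORD_LENGTH bytes,
    # so each RDD element is one raw record (a byte string).
    records = sc.binaryRecords("hdfs:///data/input.bin", RECORD_LENGTH)

    # Unpack each record; ">" assumes big-endian, adjust to the file's real byte order.
    parsed = records.map(lambda rec: struct.unpack(">q16si1s", rec))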

Re: Define size partitions

2015-01-30 Thread Sven Krasser
You can also use your InputFormat/RecordReader in Spark, e.g. using newAPIHadoopFile. See here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . -Sven On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process
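
For reference, a hedged sketch of how a custom Hadoop InputFormat can be plugged in from PySpark via newAPIHadoopFile (the com.example.FixedLengthInputFormat class is hypothetical; any InputFormat emitting LongWritable/BytesWritable pairs would work the same way):

    # Assumes the custom InputFormat jar is already on the classpath (e.g. via --jars).
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/input.bin",
        "com.example.FixedLengthInputFormat",    # hypothetical custom InputFormat
        "org.apache.hadoop.io.LongWritable",     # key class emitted by the format
        "org.apache.hadoop.io.BytesWritable")    # value class: the raw record bytes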

Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about big partition sizes, you can specify the number of partitions as an additional parameter while loading files from HDFS. On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote: You can also use your InputFormat/RecordReader in Spark, e.g. using
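
A sketch of that approach (the partition counts are illustrative): loaders such as textFile accept a minimum number of partitions, and an RDD loaded another way can be repartitioned after the fact:

    # minPartitions hints how many splits to create when reading from HDFS.
    text = sc.textFile("hdfs:///data/input.txt", minPartitions=200)

    # For APIs without such a parameter (e.g. binaryRecords), repartition afterwards.
    records = sc.binaryRecords("hdfs:///data/input.bin", 29).repartition(200)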

Define size partitions

2015-01-30 Thread Guillermo Ortiz
Hi, I want to process some files; they're kind of big, dozens of gigabytes each. I get them as an array of bytes and there's a structure inside of them. I have a header which describes the structure. It could be like: Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ..
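
For what it's worth, a fixed-width layout like the one described above can be expressed as a struct format string so the record size does not have to be counted by hand (the ">" byte order and the field types are assumptions):

    import struct

    # Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), as described above.
    FMT = ">q16si1s"                        # ">" assumes big-endian; use "<" if the data is little-endian
    RECORD_LENGTH = struct.calcsize(FMT)    # 8 + 16 + 4 + 1 = 29 bytes

    def parse(rec):
        """Unpack one raw record (a byte string) into a tuple of typed fields."""
        return struct.unpack(FMT, rec)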