I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.
[1]
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
Davies
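
A minimal PySpark sketch of how sc.binaryRecords could be applied here; the path is made up and the 29-byte record length is an assumption derived from the layout described in the original message at the bottom of the thread:

from pyspark import SparkContext

sc = SparkContext(appName="binary-records-sketch")

# Record length assumed from the header layout described in the original
# message: Number(8) + Char(16) + Number(4) + Char(1) = 29 bytes per record.
RECORD_LENGTH = 29

# binaryRecords splits the input into fixed-size records; each RDD element
# is the raw bytes of one record.
records = sc.binaryRecords("hdfs:///path/to/files", RECORD_LENGTH)

# Individual records can then be decoded field by field (see the struct
# sketch after the original message below) before further processing.
print(records.count())
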
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
You can also use your InputFormat/RecordReader in Spark, e.g. using
newAPIHadoopFile. See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
-Sven
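
A rough PySpark sketch of that route, using Hadoop's built-in FixedLengthInputFormat purely as a stand-in for a custom InputFormat/RecordReader (a custom one would be written in Java or Scala, put on the classpath, and referenced by class name in the same way); the path and record length are assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="custom-inputformat-sketch")

# FixedLengthInputFormat is only a stand-in here; a custom
# InputFormat/RecordReader would be referenced the same way.
rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/files",
    "org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat",
    "org.apache.hadoop.io.LongWritable",   # key: byte offset of the record
    "org.apache.hadoop.io.BytesWritable",  # value: the raw record bytes
    conf={"fixedlengthinputformat.record.length": "29"},  # assumed record size
)

# Drop the offsets and keep only the record payloads.
records = rdd.values()
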
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
If you are only concerned about big partition sizes, you can specify the number
of partitions as an additional parameter when loading files from HDFS.
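
A short sketch of that, with an illustrative path and partition count:

from pyspark import SparkContext

sc = SparkContext(appName="partitioning-sketch")

# textFile accepts a minimum number of partitions directly.
lines = sc.textFile("hdfs:///path/to/files", minPartitions=200)

# For loaders without such a parameter (e.g. binaryRecords), the RDD can be
# repartitioned after loading to keep individual partitions manageable.
records = sc.binaryRecords("hdfs:///path/to/files", 29).repartition(200)
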
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote:
Hi,
I want to process some files; they're kind of big, dozens of gigabytes
each one. I get them as an array of bytes and there's a structure inside
of them.
I have a header which describes the structure. It could be like:
Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ..
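
A small sketch of how that layout could map onto a struct format string for decoding each record; the field names and the big-endian byte order are assumptions:

import struct

# Assumed mapping of the described layout to a struct format string:
#   Number(8 bytes) -> 'q', Char(16 bytes) -> '16s',
#   Number(4 bytes) -> 'i', Char(1 byte)   -> 'c'
# '>' assumes big-endian; the real byte order depends on how the files were written.
RECORD_FORMAT = ">q16sic"  # 29 bytes total

def decode(record_bytes):
    # Turn one fixed-width record (29 raw bytes) into typed Python fields.
    num1, chars, num2, flag = struct.unpack(RECORD_FORMAT, record_bytes)
    return num1, chars, num2, flag

# In Spark this would typically be applied per record, e.g. records.map(decode).
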