If you are only concerned about large partition sizes, you can specify the number of partitions as an additional parameter when loading files from HDFS.
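For example, in the spark-shell (where sc is already defined; the path and partition count here are placeholders):

    // Request at least 400 partitions instead of one per HDFS block:
    val lines = sc.textFile("hdfs:///path/to/files", 400)
    lines.partitions.size  // should be >= 400

Note this only controls how many splits you get, not where the split boundaries fall, so it doesn't by itself solve the fixed-size-record problem below.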
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser <kras...@gmail.com> wrote:
> You can also use your InputFormat/RecordReader in Spark, e.g. using
> newAPIHadoopFile. See here:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
> -Sven
>
> On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I want to process some files; they're kind of big, dozens of
>> gigabytes each. I read them as an array of bytes, and there's a
>> structure inside them.
>>
>> I have a header which describes the structure. It could be like:
>> Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ...
>> This structure appears N times in the file.
>>
>> So I know the size of each block, since it's fixed. There's no
>> separator between one block and the next.
>>
>> If I were doing this with MapReduce, I could implement a new
>> RecordReader and InputFormat to read each block, because I know
>> their size, and I'd fix the split size in the driver (block size x
>> 1000, for example). That way, each mapper's split would contain
>> only complete blocks, with no piece of the last block spilling into
>> the next split.
>>
>> Spark works with RDDs and partitions. How could I resize each
>> partition to do that? Is it possible? I guess Spark doesn't use
>> the RecordReader and related classes for these tasks.
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
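To expand on Sven's suggestion below: since the records are fixed size with no separators, you may not even need to write your own InputFormat. Hadoop ships FixedLengthInputFormat, which can be plugged into newAPIHadoopFile directly. A minimal sketch (assuming Hadoop 2.3+ on the classpath, and a hypothetical record length of 29 bytes, i.e. 8 + 16 + 4 + 1):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

    // Record length is hypothetical: 8 + 16 + 4 + 1 = 29 bytes per block.
    val conf = new Configuration(sc.hadoopConfiguration)
    FixedLengthInputFormat.setRecordLength(conf, 29)

    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/files",        // placeholder path
      classOf[FixedLengthInputFormat],
      classOf[LongWritable],          // byte offset of each record
      classOf[BytesWritable],         // one complete fixed-size block
      conf)

    // Each value is a whole block; splits never cut a record in half.
    val blocks = records.map { case (_, bytes) => bytes.copyBytes() }

One caveat: the record reader reuses the same BytesWritable instance, so copy the bytes (as with copyBytes() above) before caching or collecting the RDD.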