You can use sc.hadoopFile (or any of its variants) to do what you want.
They even let you reuse your existing Hadoop InputFormats, so you should
be able to mimic your old MR usage just fine.  sc.textFile is just a
convenience method that sits on top.
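For example, something along these lines (a rough, untested sketch --
MyBinaryInputFormat, BinaryKey, BinaryRecord and BusinessObject are
placeholders for your own classes, and the path is made up; if your
InputFormat is written against the newer org.apache.hadoop.mapreduce API,
use newAPIHadoopFile instead):

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class BinaryIngest {
    public static void main(String[] args) {
      JavaSparkContext sc =
          new JavaSparkContext(new SparkConf().setAppName("binary-ingest"));

      // Reuse the same (mapred) InputFormat your MR job already uses.
      // Spark creates one partition per input split, so you keep the
      // same parallelism you get from your mappers today.
      JavaPairRDD<BinaryKey, BinaryRecord> records =
          sc.hadoopFile("hdfs:///path/to/binary/files",
                        MyBinaryInputFormat.class,
                        BinaryKey.class,
                        BinaryRecord.class);

      // Build the business objects from each record, as your mapper does.
      JavaRDD<BusinessObject> objects =
          records.values().map(rec -> new BusinessObject(rec));

      // ... further processing on `objects` ...
      sc.stop();
    }
  }

One caveat: Hadoop RecordReaders often reuse the same Writable instance
for every record, so convert each record into your own object (as above)
before caching or collecting the RDD.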

imran

On Fri, May 8, 2015 at 12:03 PM, tog <guillaume.all...@gmail.com> wrote:

> Hi
>
> I have an application that currently runs using MR. It starts by
> extracting information from a proprietary binary file that is copied to
> HDFS and creating business objects from the information extracted from
> the binary files. Later those objects are used for further processing,
> again with MR jobs.
>
> I am planning to move towards Spark and I clearly see that I could use
> JavaRDD<businessObjects> for parallel processing. However, it is not yet
> obvious to me what the process would be to generate this RDD from my
> binary file in parallel.
>
> Today I get parallelism from the split assigned to each map task of a
> job. Can I mimic such a thing using Spark? All the examples I have seen
> so far use text files, for which I guess the partitions are based on a
> given number of contiguous lines.
>
> Any help or pointers would be appreciated.
>
> Cheers
> Guillaume
>
>
>
> --
> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>
