Hi,

I have an application that currently runs using MapReduce (MR). It starts by
extracting information from a proprietary binary file that is copied to
HDFS: the application first creates business objects from the information
extracted from the binary files, and those objects are then used for further
processing by subsequent MR jobs.

I am planning to move towards Spark, and I can clearly see that I could use
a JavaRDD<businessObjects> for parallel processing. However, it is not yet
obvious to me how to generate this RDD from my binary file in parallel.

Today I rely on the parallelism given by the split assigned to each map
task of a job. Can I mimic such a thing using Spark? All the examples I
have seen so far use text files, for which I guess the partitions are
based on a given number of contiguous lines.
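To make the question concrete, here is a minimal sketch of the split arithmetic my current MR setup effectively gives me. The record and split sizes are hypothetical placeholders (the real values come from our proprietary format); the point is that each split is a byte range aligned to record boundaries, and each map task processes one such range in parallel. This is the behaviour I would like to reproduce when building the RDD:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hypothetical sizes for illustration only.
    static final long RECORD_SIZE = 128;    // fixed record length in bytes
    static final long TARGET_SPLIT = 1024;  // desired split size in bytes

    /** Compute {start, length} byte ranges, aligned to record boundaries. */
    static List<long[]> computeSplits(long fileLength) {
        // Round the target split size down to a whole number of records,
        // so no record straddles two splits.
        long recordsPerSplit = Math.max(1, TARGET_SPLIT / RECORD_SIZE);
        long splitBytes = recordsPerSplit * RECORD_SIZE;
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitBytes) {
            long length = Math.min(splitBytes, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 10,000-byte file yields ten splits; the last one is shorter.
        for (long[] s : computeSplits(10_000)) {
            System.out.println(s[0] + "," + s[1]);
        }
    }
}
```

Ideally, each of these byte ranges would become one partition of the resulting RDD, with the business objects deserialized inside the partition.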

Any help or pointers would be appreciated.

Cheers
Guillaume


-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
