Hi, I have an application that currently runs using MR. It starts by extracting information from a proprietary binary file that has been copied to HDFS, building business objects from the data extracted from those binary files. Those objects are then used for further processing, again via MR jobs.
I am planning to move towards Spark, and I can clearly see that I could use a JavaRDD<BusinessObject> for parallel processing. However, it is not yet obvious to me how to generate this RDD from my binary file in parallel. Today my parallelism is based on the split assigned to each map task of a job. Can I mimic such a thing using Spark? All the examples I have seen so far use text files, for which I assume the partitions are based on a given number of contiguous lines.

Any help or pointers would be appreciated.

Cheers,
Guillaume

-- PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
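To make the question concrete, here is a rough sketch (plain Java, no Spark or Hadoop dependencies, and all names hypothetical) of the record-aligned split computation I rely on today: dividing a binary file of fixed-length records into byte ranges that each start and end on a record boundary, which is the contract each MR map task gets from its split. My hope is that the same logic could be reused in Spark, for example through a custom Hadoop InputFormat handed to `newAPIHadoopFile`:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the record-aligned split computation that a custom
// FileInputFormat (or any partitioning scheme) would need for a binary file
// made of fixed-length records. Class and method names are assumptions,
// not part of any real API.
public class BinarySplits {

    /** One split: a byte offset and length, aligned to record boundaries. */
    public static final class Split {
        public final long offset;
        public final long length;
        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Divide a file of fixed-length records into at most numSplits chunks,
     * each starting and ending on a record boundary -- the same guarantee
     * an MR split gives each map task.
     */
    public static List<Split> computeSplits(long fileSize, int recordLength, int numSplits) {
        long totalRecords = fileSize / recordLength;
        // Ceiling division so no split is left with zero records unnecessarily.
        long recordsPerSplit = (totalRecords + numSplits - 1) / numSplits;
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < totalRecords; start += recordsPerSplit) {
            long count = Math.min(recordsPerSplit, totalRecords - start);
            splits.add(new Split(start * recordLength, count * recordLength));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. a 1000-byte file of 100-byte records, split 3 ways
        for (Split s : computeSplits(1000, 100, 3)) {
            System.out.println("offset=" + s.offset + " length=" + s.length);
        }
    }
}
```

Each resulting (offset, length) pair could then be processed independently, exactly as a map task would consume its split today; whether that maps cleanly onto Spark partitions is precisely what I am asking.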