Philip, if there are easily detectable line groups you might define your
own InputFormat. Alternatively you can consider using mapPartitions() to
get access to an entire data partition instead of row-at-a-time; you'd
still have to handle records that straddle partition boundaries. A third
approach is indeed to pre-process with an appropriate mapper/reducer.
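
To answer the newAPIHadoopFile part directly: if each record ends with a
known delimiter (a blank line, say), you can reuse Hadoop's TextInputFormat
with a custom record delimiter instead of writing your own InputFormat.
A minimal sketch, assuming blank-line-separated records and a hypothetical
input path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("multiline-records"))

// Copy the existing Hadoop config and override the record delimiter.
// With "\n\n", each blank-line-separated block becomes one record, and
// the InputFormat itself handles records that span split boundaries.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

val records = sc.newAPIHadoopFile(
  "hdfs:///path/to/multiline-data.txt",  // hypothetical path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, text) => text.toString }  // Text objects are reused by
                                           // Hadoop, so copy to String

If the record boundaries aren't a fixed delimiter, that's when a custom
InputFormat or the mapPartitions() route becomes necessary.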

I have a file that consists of multi-line records. Is it possible to read
in multi-line records with a method such as SparkContext.newAPIHadoopFile?
Or do I need to pre-process the data so that all the data for one element
is on a single line?

Thanks,
Philip
