Philip, if there are easily detectable line groups, you might define your own InputFormat. Alternatively, you can use mapPartitions() to get access to an entire data partition instead of one row at a time; you'd still have to handle records that straddle partition boundaries. A third approach is indeed to pre-process the data with an appropriate mapper/reducer.
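As a sketch of the mapPartitions() approach, here is plain Python that groups an iterator of lines into multi-line records; the grouping function is the kind of thing you would pass to RDD.mapPartitions. The record-start test used below (lines beginning with "REC") and the function names are illustrative assumptions, not part of any Spark API, and the partition-boundary stitching is deliberately left to the caller, as noted above.

```python
def group_records(lines, is_record_start):
    """Group an iterator of lines into multi-line records.

    Yields each record as a list of lines. A record begins at any line
    for which is_record_start(line) is True. Note: the first group
    yielded may be a partial record whose start lies in the previous
    partition, and the last group may continue into the next partition;
    the caller must stitch these boundary fragments together.
    """
    current = []
    for line in lines:
        if is_record_start(line) and current:
            yield current          # previous record is complete
            current = [line]       # start a new record
        else:
            current.append(line)
    if current:
        yield current              # possibly partial trailing record

# In Spark this would be applied per partition, e.g. (hypothetical names):
#   rdd.mapPartitions(lambda it: group_records(it, starts_new_record))
```

Note also that if your records are separated by a fixed delimiter string, Hadoop's TextInputFormat respects the `textinputformat.record.delimiter` configuration setting, which may let you read multi-line records through newAPIHadoopFile without writing a custom InputFormat.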
Sent while mobile. Pls excuse typos etc.

> I have a file that consists of multi-line records. Is it possible to read in multi-line records with a method such as SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so that all the data for one element is on a single line?
>
> Thanks,
> Philip