Yes, that binary files function looks interesting, thanks for the tip. Some follow-up questions:
- I always wonder what people mean by 'small' files and 'large' files. Is there a rule of thumb for when each applies? Are small files those that fit completely in memory on a node, and large files those that do not?
- If it works similarly to wholeTextFiles, it will give me tuples like this:

(/base/id1/file1, contentA)
(/base/id1/file2, contentB)
...
(/base/id2/file1, contentC)
(/base/id2/file2, contentD)
...

Since I want to end up with tuples like:

(id1, parsedContentA ++ parsedContentB ++ ...)
(id2, parsedContentC ++ parsedContentD ++ ...)

would reduceByKey be the best function to accomplish this? Will using DataFrames give me any benefits here? This will involve some shuffling of the parsedContent values, which are List[(Timestamp, RecordData)], right? But I guess that cannot really be avoided.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/use-case-reading-files-split-per-id-tp28044p28086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
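
P.S. To make the second question concrete, this is roughly what I have in mind. It is only a sketch: RecordData, parse, and the one-record-per-line "epochMillis,value" format are placeholders I made up for illustration, and I am assuming binaryFiles hands back (path, PortableDataStream) pairs with the id as the parent directory name.

```scala
import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type; a stand-in for the real one.
case class RecordData(value: String)

object GroupFilesById {
  // Take the parent directory name as the id, e.g. "/base/id1/file1" -> "id1".
  def idFromPath(path: String): String = {
    val parts = path.split('/').filter(_.nonEmpty)
    parts(parts.length - 2)
  }

  // Placeholder parser: assumes one "epochMillis,value" record per line.
  def parse(bytes: Array[Byte]): List[(Timestamp, RecordData)] =
    new String(bytes, "UTF-8").split("\n").toList.filter(_.trim.nonEmpty).map { line =>
      val Array(ts, value) = line.split(",", 2)
      (new Timestamp(ts.trim.toLong), RecordData(value.trim))
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("group-files-by-id")
      .setIfMissing("spark.master", "local[*]") // local fallback for trying it out
    val sc = new SparkContext(conf)

    // Parse each file on the executor that reads it, then merge the lists per id.
    val perId: RDD[(String, List[(Timestamp, RecordData)])] =
      sc.binaryFiles("/base/*/*")
        .map { case (path, stream) => (idFromPath(path), parse(stream.toArray())) }
        .reduceByKey(_ ++ _)

    perId.take(5).foreach { case (id, records) =>
      println(s"$id -> ${records.size} records")
    }
    sc.stop()
  }
}
```

The reduceByKey(_ ++ _) step is where I expect the shuffle of the parsed lists to happen, which is the part I am asking about.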