Hi Ruben,

You may try sc.binaryFiles, which is designed for reading many small files: it maps each file path to an input stream. Each stream holds only the path and some configuration, so shuffling them is cheap. However, I'm not sure whether Spark takes data locality into account when dealing with these streams.
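A minimal sketch of the idea (Scala; assumes a live SparkContext `sc`, and the directory path and the id-from-filename parsing are placeholders for your layout):

```scala
import org.apache.spark.input.PortableDataStream

// binaryFiles returns RDD[(String, PortableDataStream)]; each stream carries
// just the path plus Hadoop configuration, so these pairs shuffle cheaply.
val files = sc.binaryFiles("hdfs:///data/per-id/")  // placeholder path

// Hypothetical: derive an id from the file name, group the (still unopened)
// streams by id, and only read the bytes where each group is processed.
val byId = files
  .map { case (path, stream) =>
    (path.split("/").last.split("-")(0), stream)    // placeholder id parsing
  }
  .groupByKey()
  .mapValues(_.map(s => s.toArray().length))        // bytes are read here
```

The point is that the expensive part (opening and reading each file) is deferred until after the shuffle; only the lightweight stream handles move between executors.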
Hope this helps

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/use-case-reading-files-split-per-id-tp28044p28075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
