I think writing to HDFS and reading it back again is totally reasonable. In fact, in my experience, writing to HDFS and reading back in gives you a good opportunity to handle some other issues as well:
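For concreteness, here's a minimal sketch of the round-trip in question (the helper name and path are just placeholders, assuming a live SparkContext):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object HdfsRoundTrip {
  // write an intermediate RDD to disk and read it back; note that
  // sc.objectFile hands you a fresh RDD with no partitioner attached
  def roundTrip[T: ClassTag](sc: SparkContext, rdd: RDD[T], path: String): RDD[T] = {
    rdd.saveAsObjectFile(path) // one object file per partition
    sc.objectFile[T](path)     // may also re-split files on read
  }
}
```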
a) Instead of just writing as an object file, I've found it's helpful to write in a format that is a little more readable. JSON if efficiency doesn't matter :) or you could use something like Avro, which at least has a good set of command line tools.

b) When developing, I hate it when I introduce a bug in step 12 of a long pipeline and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there.

c) Writing to disk is also a good opportunity to do a little crude "auto-tuning" of the number of partitions. You can look at the size of each partition on HDFS, and then adjust the number of partitions.

And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it:

https://github.com/apache/spark/pull/4449

Getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think it's important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to HadoopRDD & friends -- but I'll go into more discussion on that thread.

On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers <ko...@tresata.com> wrote:

> there is a way to reinstate the partitioner, but that requires
> sc.objectFile to read exactly what i wrote, which means sc.objectFile
> should never split files on reading (a feature of hadoop file inputformat
> that gets in the way here).
>
> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i just realized the major limitation is that i lose partitioning info...
>>
>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> so finally i can resort to:
>>>> rdd.saveAsObjectFile(...)
>>>> sc.objectFile(...)
>>>> but that seems like a rather broken abstraction.
>>>>
>>> This seems like a fine solution to me.
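To make points (b) and (c) above a little more concrete, here's a rough sketch. The helper names, the `_SUCCESS` marker convention, and the 128 MB target are my own choices, and I use the local filesystem API as a stand-in for the HDFS FileSystem calls:

```scala
import java.nio.file.{Files, Path}

object PipelineHelpers {
  // (b) skip a pipeline step whose output is already sitting on disk,
  // using the _SUCCESS marker as evidence the previous run completed
  def alreadyDone(outputDir: Path): Boolean =
    Files.exists(outputDir.resolve("_SUCCESS"))

  // (c) crude "auto-tuning": pick a partition count from the total
  // bytes on disk so each partition lands near a target size
  def choosePartitions(totalBytes: Long,
                       targetBytesPerPartition: Long = 128L * 1024 * 1024): Int =
    math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt)
}
```

So before re-running step 12, you'd check `alreadyDone` on step 11's output directory, and when reading back you'd repartition to `choosePartitions(sizeOnDisk)`.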