Imran, great, I will take a look at the pull request. Seems we are interested in similar things.
On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid <iras...@cloudera.com> wrote:

> I think writing to HDFS and reading it back again is totally reasonable.
> In fact, in my experience, writing to HDFS and reading back in actually
> gives you a good opportunity to handle some other issues as well:
>
> a) Instead of just writing as an object file, I've found it's helpful to
> write in a format that is a little more readable: JSON if efficiency
> doesn't matter :), or you could use something like Avro, which at least
> has a good set of command-line tools.
>
> b) When developing, I hate it when I introduce a bug in step 12 of a long
> pipeline and need to re-run the whole thing. If you save to disk, you can
> write a little application logic that realizes step 11 is already sitting
> on disk, and just restart from there.
>
> c) Writing to disk is also a good opportunity to do a little crude
> "auto-tuning" of the number of partitions. You can look at the size of
> each partition on HDFS, and then adjust the number of partitions.
>
> And I completely agree that losing the partitioning info is a major
> limitation -- I submitted a PR to help deal with it:
>
> https://github.com/apache/spark/pull/4449
>
> Getting narrow dependencies with partitioners can lead to pretty big
> performance improvements, so I do think it's important to make it easily
> accessible to the user. Though now I'm thinking that maybe this API is a
> little clunky, and this should get rolled into the other changes you are
> proposing to HadoopRDD & friends -- but I'll go into more discussion on
> that thread.
>
>
> On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> There is a way to reinstate the partitioner, but that requires
>> sc.objectFile to read exactly what I wrote, which means sc.objectFile
>> should never split files on reading (a feature of the Hadoop file
>> InputFormat that gets in the way here).
>>
>> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> I just realized the major limitation is that I lose partitioning info...
>>>
>>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> So finally I can resort to:
>>>>> rdd.saveAsObjectFile(...)
>>>>> sc.objectFile(...)
>>>>> but that seems like a rather broken abstraction.
>>>>
>>>> This seems like a fine solution to me.
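For reference, a minimal sketch of the pattern discussed above: save an intermediate result to HDFS, skip the recomputation when the output is already there, pick a partition count from the bytes on disk, and reinstate a partitioner after reading back (since sc.objectFile does not preserve one). The path, key/value types, and per-partition size target are made up for illustration.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SaveAndResume {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-and-resume"))
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Hypothetical intermediate output of "step 11" in a longer pipeline.
    val step11Path = "hdfs:///tmp/pipeline/step11"

    val step11 =
      if (fs.exists(new Path(step11Path))) {
        // Crude auto-tuning: derive a partition count from the bytes already
        // sitting on HDFS, assuming roughly 128 MB per partition.
        val totalBytes = fs.getContentSummary(new Path(step11Path)).getLength
        val numPartitions = math.max(1, (totalBytes / (128L << 20)).toInt)

        // objectFile does not carry partitioner info, so reinstate it
        // explicitly if downstream stages rely on co-partitioning.
        sc.objectFile[(String, Int)](step11Path)
          .partitionBy(new HashPartitioner(numPartitions))
      } else {
        // Stand-in for the real computation of step 11.
        val computed = sc
          .parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
          .partitionBy(new HashPartitioner(4))
        computed.saveAsObjectFile(step11Path)
        computed
      }

    // Step 12 and beyond build on step11 without recomputing it.
    println(step11.count())
    sc.stop()
  }
}
```

Note that partitionBy here reshuffles the data, so it only restores the co-partitioning guarantee, not the cost of the shuffle; avoiding that extra shuffle by exposing the partitioner on read is what the linked PR is about.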