Imran,
Great, I will take a look at the pull request. It seems we are interested in
similar things.


On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid <iras...@cloudera.com> wrote:

> I think writing to HDFS and reading it back again is totally reasonable.
> In fact, in my experience, writing to HDFS and reading back in gives you a
> good opportunity to handle some other issues as well:
>
> a) Instead of just writing as an object file, I've found it's helpful to
> write in a format that is a little more readable: JSON if efficiency
> doesn't matter :), or something like Avro, which at least has a good set of
> command-line tools (see the sketch after this list).
>
> b) When developing, I hate it when I introduce a bug in step 12 of a long
> pipeline and need to re-run the whole thing. If you save to disk, you can
> add a little application logic that notices step 11 is already sitting on
> disk and just restarts from there.
>
> c) Writing to disk is also a good opportunity to do a little crude
> "auto-tuning" of the number of partitions: look at the size of each
> partition on HDFS and adjust the partition count accordingly.
>
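> A minimal sketch of (a) and (b) together -- the helper name and record
> format here are hypothetical, and this is just one way to wire it up:
>
>   import org.apache.hadoop.fs.{FileSystem, Path}
>   import org.apache.spark.SparkContext
>   import org.apache.spark.rdd.RDD
>
>   // Save a step's output as plain text (e.g. one JSON record per line)
>   // and skip recomputation if that step is already sitting on HDFS.
>   def saveOrLoadStep(sc: SparkContext, path: String,
>                      compute: => RDD[String]): RDD[String] = {
>     val fs = FileSystem.get(sc.hadoopConfiguration)
>     if (fs.exists(new Path(path))) {
>       sc.textFile(path)               // step already on disk: restart here
>     } else {
>       compute.saveAsTextFile(path)    // readable output, easy to inspect
>       sc.textFile(path)
>     }
>   }
>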
> And I completely agree that losing the partitioning info is a major
> limitation -- I submitted a PR to help deal with it:
>
> https://github.com/apache/spark/pull/4449
>
> Getting narrow dependencies with partitioners can lead to pretty big
> performance improvements, so I do think it's important to make them easily
> accessible to the user. Though now I'm thinking that maybe this API is a
> little clunky, and that it should get rolled into the other changes you are
> proposing to Hadoop RDD & friends -- but I'll go into more discussion on
> that thread.
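>
> As a rough illustration of why that matters (leftRaw and rightRaw are
> hypothetical pair RDDs):
>
>   import org.apache.spark.HashPartitioner
>
>   val part = new HashPartitioner(128)
>   val left  = leftRaw.partitionBy(part)   // shuffled once, keeps partitioner
>   val right = rightRaw.partitionBy(part)
>   val joined = left.join(right)           // co-partitioned: no extra shuffle
>   // After a save/load round trip the partitioner is gone, so the same
>   // join would have to shuffle both sides again.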
>
>
>
> On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> There is a way to reinstate the partitioner, but that requires
>> sc.objectFile to read back exactly what I wrote, which means sc.objectFile
>> should never split files on reading (a feature of Hadoop's FileInputFormat
>> that gets in the way here).
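>>
>> One rough sketch of what "reinstating" could look like (this is not the
>> API from the PR above, and it is only safe if the files come back unsplit,
>> in order, and in the same number of partitions the partitioner expects):
>>
>>   import scala.reflect.ClassTag
>>   import org.apache.spark.{Partition, Partitioner, TaskContext}
>>   import org.apache.spark.rdd.RDD
>>
>>   // Wraps a parent RDD one-to-one and simply declares a known partitioner.
>>   // Correctness is entirely the caller's responsibility.
>>   class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
>>       prev: RDD[(K, V)], part: Partitioner)
>>     extends RDD[(K, V)](prev) {
>>
>>     override val partitioner: Option[Partitioner] = Some(part)
>>
>>     override protected def getPartitions: Array[Partition] = prev.partitions
>>
>>     override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
>>       prev.iterator(split, context)
>>   }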
>>
>> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> I just realized the major limitation is that I lose the partitioning
>>> info...
>>>
>>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>>
>>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> So finally I can resort to:
>>>>> rdd.saveAsObjectFile(...)
>>>>> sc.objectFile(...)
>>>>> but that seems like a rather broken abstraction.
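>>>>>
>>>>> A concrete round trip (with a hypothetical path and key/value types)
>>>>> shows what gets lost:
>>>>>
>>>>>   rdd.saveAsObjectFile("hdfs:///tmp/step11")
>>>>>   val restored = sc.objectFile[(String, Int)]("hdfs:///tmp/step11")
>>>>>   restored.partitioner   // None -- the partitioning info is gone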
>>>>>
>>>>>
>>>> This seems like a fine solution to me.
>>>>
>>>>
>>>
>>
>
