Note that I have 1+ TB of data. If I didn't mind things being slow, I
wouldn't be using Spark.
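If recomputation and caching the full dataset are both off the table, one pattern is to do both tasks in the same single pass over each partition (in Spark terms, roughly `foreachPartition` for the write plus accumulators for the statistics). Below is a minimal plain-Python sketch of that idea only; `partitions`, `write_partition`, and the stats fields are illustrative names, not Spark API:

```python
# Single-pass sketch: while writing each partition (task A), fold the
# per-partition statistics (task B) so the upstream data is traversed
# only once. This emulates the Spark foreachPartition + accumulator
# pattern with plain Python lists standing in for partitions.

def process_partitions(partitions, write_partition):
    stats = {"count": 0, "total": 0}   # task B: running statistics
    for part in partitions:
        write_partition(part)          # task A: persist the records
        stats["count"] += len(part)    # task B: update stats in the same pass
        stats["total"] += sum(part)
    return stats

# Usage: two toy "partitions" written to an in-memory sink.
sink = []
stats = process_partitions([[1, 2], [3, 4, 5]], sink.append)
```

The point of the design is that each partition is read exactly once; the cost is that the statistics must be expressible as a fold you can merge across partitions, which is exactly what Spark accumulators require.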

On 17 November 2017 at 11:06, Sebastian Piu <sebastian....@gmail.com> wrote:

> If you don't want to recalculate, you need to hold the results somewhere.
> If you need to save it anyway, why don't you do that and then read it back
> to compute your stats?
>
> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <ferdonl...@gmail.com> wrote:
>
>> Dear Spark users
>>
>> Is it possible to take the output of a transformation (RDD/Dataframe) and
>> feed it to two independent transformations without recalculating the first
>> transformation and without caching the whole dataset?
>>
>> Consider the case of a very large dataset (1+TB) which suffered several
>> transformations and now we want to save it but also calculate some
>> statistics per group.
>> So the best processing approach would be: for each partition, do task A,
>> then do task B.
>>
>> I don't see a way of instructing Spark to proceed that way without
>> caching to disk, which seems unnecessarily heavy. And if we don't cache,
>> Spark recalculates every partition all the way from the beginning. In
>> either case huge file reads happen.
>>
>> Any ideas on how to avoid it?
>>
>> Thanks
>>
>> Fernando
>>
>
