Note that I have 1+ TB of data. If I didn't mind things being slow, I wouldn't be using Spark.
On 17 November 2017 at 11:06, Sebastian Piu <sebastian....@gmail.com> wrote:

> If you don't want to recalculate, you need to hold the results somewhere.
> If you need to save it anyway, why don't you do that and then read it back
> to compute your stats?
>
> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <ferdonl...@gmail.com> wrote:
>
>> Dear Spark users,
>>
>> Is it possible to take the output of a transformation (RDD/DataFrame) and
>> feed it to two independent transformations, without recalculating the
>> first transformation and without caching the whole dataset?
>>
>> Consider the case of a very large dataset (1+ TB) which has gone through
>> several transformations, and now we want to save it but also calculate
>> some statistics per group. So the best processing order would be: for
>> each partition, do task A, then do task B.
>>
>> I don't see a way of instructing Spark to proceed that way without
>> caching to disk, which seems unnecessarily heavy. And if we don't cache,
>> Spark recalculates every partition all the way from the beginning. In
>> either case, huge file reads happen.
>>
>> Any ideas on how to avoid this?
>>
>> Thanks
>>
>> Fernando
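To make the "for each partition: do task A, do task B" idea concrete, here is a minimal plain-Python sketch of the single-pass pattern (the dataset, file names, and stats structure are all hypothetical stand-ins, not Spark APIs). In Spark itself the closest equivalent is doing the write and the stats bookkeeping inside one `foreachPartition` action, with per-group stats gathered via accumulators or merged afterwards; this sketch just shows the shape of that single pass:

```python
# Hypothetical single-pass sketch (plain Python, no Spark): each
# "partition" is visited exactly once; task A saves the records and
# task B updates per-group statistics in the same pass, so nothing is
# recomputed and nothing beyond the running stats is cached.
import os
import tempfile
from collections import defaultdict

# Toy dataset: three partitions of (group, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("a", 4)],
    [("b", 5)],
]

out_dir = tempfile.mkdtemp()
stats = defaultdict(lambda: {"count": 0, "sum": 0})

for i, part in enumerate(partitions):
    # Task A: persist this partition (stand-in for a real output writer).
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        for group, value in part:
            f.write(f"{group},{value}\n")
            # Task B: per-group statistics computed in the same pass.
            stats[group]["count"] += 1
            stats[group]["sum"] += value

print(dict(stats))
# → {'a': {'count': 3, 'sum': 8}, 'b': {'count': 2, 'sum': 7}}
```

The trade-off is that the "statistics" side is limited to what can be computed streaming per partition and merged afterwards; anything requiring a second full pass over the data brings back the original recompute-or-cache dilemma.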