A back-of-a-beermat calculation says that with, say, 20 boxes, saving 1 TB
should take approximately 15 minutes (with a replication factor of 1, since
you don't need it higher for ephemeral data that is relatively easy to
regenerate).
This isn't much if the whole job takes hours.
Note that I have 1+ TB; if I didn't mind things being slow, I
wouldn't be using Spark.
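The beermat estimate above can be sanity-checked with a couple of lines of arithmetic: the figures (1 TB, 20 boxes, 15 minutes, replication factor 1) are from the post, and the implied per-node write throughput comes out at roughly what a single spinning disk sustains.

```python
# Back-of-the-beermat check: what per-node write throughput does
# "1 TB across 20 boxes in ~15 minutes" imply?
# With a replication factor of 1, each byte is written exactly once.
total_bytes = 1e12      # 1 TB (figure from the post)
nodes = 20
seconds = 15 * 60

per_node_bps = total_bytes / nodes / seconds
print(f"{per_node_bps / 1e6:.1f} MB/s per node")  # ~55.6 MB/s
```

At ~56 MB/s per node the 15-minute figure is plausible for commodity disks, which is the point being made: persisting the intermediate result is cheap relative to an hours-long job.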
On 17 November 2017 at 11:06, Sebastian Piu wrote:
If you don't want to recalculate, you need to hold the results somewhere; if
you need to save them anyway, why don't you do that, then read them back and
get your stats?
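The save-then-reread pattern suggested here can be sketched in plain Python standing in for Spark (the Spark equivalents would be `df.write.parquet(path)` and `spark.read.parquet(path)`); `expensive_transform` and the file path are made up for illustration.

```python
# Sketch: compute an expensive intermediate result once, persist it,
# and let each independent downstream computation re-read the saved
# copy instead of recomputing the whole chain.
import json
import os
import tempfile

def expensive_transform(records):
    # Placeholder for the costly chain of transformations.
    return [r * 2 for r in records]

path = os.path.join(tempfile.mkdtemp(), "intermediate.json")

# 1. Compute once and persist.
with open(path, "w") as f:
    json.dump(expensive_transform(range(5)), f)

# 2. Two independent consumers read the saved copy; nothing is recomputed.
with open(path) as f:
    stats_a = sum(json.load(f))   # first downstream job
with open(path) as f:
    stats_b = max(json.load(f))   # second downstream job

print(stats_a, stats_b)  # 20 8
```

Unlike `cache()`, this does not pin the dataset in executor memory; the trade-off is one full write and two reads, which the throughput estimate above the thread suggests is modest.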
On Fri, 17 Nov 2017 at 10:03, Fernando Pereira wrote:
Dear Spark users,
Is it possible to take the output of a transformation (RDD/DataFrame) and
feed it to two independent transformations without recalculating the first
transformation and without caching the whole dataset?
Consider the case of a very large dataset (1+TB) which suffered several