The DAG was formatted strangely; the intended formatting was:

PCollectionA -> PCollectionB -> PCollectionC
                             \-> PCollectionD

Jan

On 5/15/19 11:19 AM, Jan Lukavský wrote:
Hi,

I think this thread is another manifestation of a problem discussed recently in [1]. Long story short - users in certain situations might legitimately need finer control over how their pipeline is translated into the runner's operators. Caching is another such case, where the pipeline itself doesn't contain enough information to make the right optimization decision - consider for example this DAG of operations:

 PCollectionA -> PCollectionB -> PCollectionC
                              \-> PCollectionD

That is, PCollectionB is consumed by two downstream transforms - those producing PCollectionC and PCollectionD. The logical conclusion would be that we need to cache PCollectionB, but what if the transform that produced PCollectionB is of the "cheap explosion" type - where PCollectionB is significantly bigger than PCollectionA yet can be produced very cheaply from the elements of PCollectionA? Then it would make sense to cache PCollectionA instead. But you cannot know that just from the DAG. There are many more examples like this. Maybe we could think about how to support this, which might help widen the user base.
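To make that concrete, here is a hypothetical sketch of such a DAG in the Beam Java SDK (a fragment, not a complete class; the "Explode" ParDo and the input path are made up for illustration):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // A -> B: B is much larger than A but cheap to recompute from it.
    PCollection<String> a = p.apply(TextIO.read().from("/tmp/input.txt")); // placeholder path
    PCollection<String> b = a.apply("Explode", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(@Element String line, OutputReceiver<String> out) {
        for (String token : line.split(" ")) {
          out.output(token); // one input element fans out into many outputs
        }
      }
    }));

    // B is consumed twice, so by default the Spark runner caches it,
    // even though recomputing B from A might be cheaper than materializing it.
    PCollection<Long> c = b.apply("CountAll", Count.globally());
    PCollection<KV<String, Long>> d = b.apply("CountPerToken", Count.perElement());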

Jan

[1] https://www.mail-archive.com/user@beam.apache.org/msg03809.html

On 5/15/19 10:39 AM, Robert Bradshaw wrote:
Just to clarify, do you need direct control over what to cache, or
would it be OK to let Spark decide the minimal set of RDDs to cache as
long as we didn't cache all intermediates?

From: Augusto Ribeiro <augusto....@gmail.com>
Date: Wed, May 15, 2019 at 8:37 AM
To: <user@beam.apache.org>

Hi Kyle,

Thanks for the help. It seems I have no choice other than to use Spark directly, since my job creates immense memory pressure if I can't control what gets cached.

Best regards,
Augusto

On 14 May 2019, at 18:40, Kyle Weaver <kcwea...@google.com> wrote:

Minor correction: Slack channel is actually #beam-spark

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555


From: Kyle Weaver <kcwea...@google.com>
Date: Tue, May 14, 2019 at 9:38 AM
To: <user@beam.apache.org>

Hi Augusto,

Right now the default behavior is to cache all intermediate RDDs that are consumed more than once by the pipeline. This can be disabled with `options.setCacheDisabled(true)` [1], but there is currently no way for the user to specify to the runner that it should cache certain RDDs, but not others.
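For completeness, a minimal sketch in Java of turning that caching off; this assumes nothing beyond the standard PipelineOptionsFactory setup:

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    SparkPipelineOptions options =
        PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setCacheDisabled(true); // skip caching of re-consumed RDDs entirely

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline, then p.run().waitUntilFinish();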

There has recently been some discussion on the Slack (#spark-beam) about implementing such a feature, but no concrete plans as of yet.

[1] https://github.com/apache/beam/blob/81faf35c8a42493317eba9fa1e7b06fb42d54662/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L150

Thanks

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +16502035555


From: augusto....@gmail.com <augusto....@gmail.com>
Date: Tue, May 14, 2019 at 5:01 AM
To: <user@beam.apache.org>

Hi,

I guess the title says it all. Right now it seems that Beam caches all the intermediate RDD results for my pipeline when using the Spark runner, which leads to very inefficient memory usage. Is there any way to control this?

Best regards,
Augusto
