Re: understanding spark shuffle file re-use better

Jacek Laskowski Sun, 17 Jan 2021 06:07:05 -0800

Hi,

An interesting question, and I must admit I'm not sure how to answer it
myself :)

Off the top of my head, I'd **guess** that unless you cache the first
query, these two queries would share nothing. With caching, there's a phase
in query execution where a canonicalized version of the query is used to
look up any matching cached queries.
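
To illustrate the cache lookup idea, here is a toy model in plain Python. It
is not Spark code: `Attr`, `Plan`, `canonicalize`, `build_query` and
`ToyCache` are all hypothetical names made up for this sketch. It only shows
how a lookup keyed on a canonicalized plan can match two queries that were
built independently, i.e. a match by equality rather than by identity.

```python
from dataclasses import dataclass
from itertools import count

# Toy model only: hypothetical stand-ins, not Spark classes.


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int  # unique per query instance, like a fresh expression ID


@dataclass(frozen=True)
class Plan:
    op: str              # e.g. "Scan", "Aggregate"
    attrs: tuple         # output attributes
    children: tuple = ()


def canonicalize(plan, mapping=None):
    """Renumber expression IDs by order of first appearance, so that two
    structurally equal plans compare equal."""
    if mapping is None:
        mapping = {}
    children = tuple(canonicalize(c, mapping) for c in plan.children)
    attrs = tuple(
        Attr(a.name, mapping.setdefault(a.expr_id, len(mapping)))
        for a in plan.attrs
    )
    return Plan(plan.op, attrs, children)


_ids = count()


def build_query():
    """Build the 'same' query from scratch; every call gets fresh IDs."""
    key, total = Attr("key", next(_ids)), Attr("total", next(_ids))
    scan = Plan("Scan", (key,))
    return Plan("Aggregate", (key, total), (scan,))


class ToyCache:
    """Cache keyed on the canonicalized plan, not on object identity."""

    def __init__(self):
        self._entries = {}

    def cache(self, plan, data):
        self._entries[canonicalize(plan)] = data

    def lookup(self, plan):
        return self._entries.get(canonicalize(plan))


first, second = build_query(), build_query()
cache = ToyCache()
cache.cache(first, "materialized rows")

print(first == second)       # raw plans differ (different expression IDs)
print(cache.lookup(second))  # canonical forms match, so the entry is found
```

In this toy, the second query never touches the first query's objects, yet it
still finds the cached entry because both canonicalize to the same plan.
Without the `cache.cache(...)` call, the lookup finds nothing, which matches
the guess above that nothing is shared unless the first query is cached.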

Again, I'm not really sure, and if I had to answer it (e.g. as part of an
interview) I'd say nothing would be shared / re-used.

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Jan 13, 2021 at 5:39 PM Koert Kuipers <ko...@tresata.com> wrote:

> is shuffle file re-use based on identity or equality of the dataframe?
>
> for example if i run the exact same code twice to load data and do
> transforms (joins, aggregations, etc.) but without re-using any actual
> dataframes, will i still see skipped stages thanks to shuffle file re-use?
>
> thanks!
> koert
>
