Re: understanding spark shuffle file re-use better

2021-02-18 Thread Mandloi87
Increase or decrease the number of data partitions: Since a data partition represents the quantum of data to be processed together by a single Spark task, there could be situations: (a) where the existing number of data partitions is not sufficient to maximize the usage of available
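As a rough spark-shell sketch of the two knobs described above (the input path `sales.parquet` and the partition counts are invented for illustration):

```scala
// spark-shell sketch; "sales.parquet" is a made-up input path.
val df = spark.read.parquet("sales.parquet")
df.rdd.getNumPartitions              // the current partition count

// Increase partitions (full shuffle) when too few partitions leave cores idle.
val widened = df.repartition(200)

// Decrease partitions (no shuffle) when many tiny partitions add task overhead.
val narrowed = df.coalesce(8)
```

`repartition` performs a full shuffle, while `coalesce` only merges existing partitions without one, which is why it is used here only to shrink the count.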

Re: understanding spark shuffle file re-use better

2021-02-12 Thread Attila Zsolt Piros
A much better one-liner (easier to understand in the UI because it will be 1 simple job with 2 stages):
```
spark.read.text("README.md").repartition(2).take(1)
```
Attila Zsolt Piros wrote
> No, it won't be reused.
> You should reuse the dataframe for reusing the shuffle blocks (and cached
>
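To see the re-use itself, one option (a sketch assuming a spark-shell session with the Spark UI open) is to repeat the action on the same DataFrame:

```scala
val df = spark.read.text("README.md").repartition(2)

df.take(1)  // first job: the shuffle map stage runs and writes shuffle files
df.take(1)  // second job: the same stage should appear as "skipped" in the UI
```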

Re: understanding spark shuffle file re-use better

2021-02-11 Thread Attila Zsolt Piros
No, it won't be reused. You should reuse the dataframe for reusing the shuffle blocks (and cached data). I know this because the two actions will lead to building two separate DAGs, but I will show you a way you can check this on your own (with a small, simple Spark application). For
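The test application itself is cut off here; the following is only a guess at the kind of check meant, comparing a re-used DataFrame against a structurally equal but separately built one in spark-shell while watching the Spark UI:

```scala
val a = spark.read.text("README.md").repartition(2)
val b = spark.read.text("README.md").repartition(2)  // equal plan, different object

a.take(1)  // runs the shuffle and writes shuffle files
b.take(1)  // a separate DAG is built: nothing is skipped, the shuffle runs again
a.take(1)  // same DataFrame object as before: the shuffle map stage is skipped
```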

Re: understanding spark shuffle file re-use better

2021-01-17 Thread Jacek Laskowski
Hi, An interesting question that I must admit I'm not sure how to answer myself, actually :) Off the top of my head, I'd **guess** that unless you cache the first query, these two queries would share nothing. With caching, there's a phase in query execution when a canonicalized version of a query is
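The message is truncated, but the caching route it hints at can be sketched as follows (my example, under the assumption that the canonicalized-plan matching mentioned above is what lets a separately built query hit the cache):

```scala
val q1 = spark.read.text("README.md").groupBy("value").count()
q1.cache()
q1.count()  // materializes the cached result

// Built independently of q1, but with a matching canonicalized plan, so the
// physical plan should show an InMemoryTableScan instead of a fresh scan.
val q2 = spark.read.text("README.md").groupBy("value").count()
q2.explain()
```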

understanding spark shuffle file re-use better

2021-01-13 Thread Koert Kuipers
Is shuffle file re-use based on identity or equality of the dataframe? For example, if I run the exact same code twice to load data and do transforms (joins, aggregations, etc.) but without re-using any actual dataframes, will I still see skipped stages thanks to shuffle file re-use? Thanks! Koert
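The scenario in the question, as a hedged spark-shell sketch (the paths, join key, and aggregation are invented for illustration):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical pipeline: the same load + join + aggregation, coded twice,
// with no DataFrame object shared between the two runs.
def pipeline(): DataFrame = {
  val users  = spark.read.parquet("/data/users")
  val events = spark.read.parquet("/data/events")
  users.join(events, "user_id").groupBy("country").count()
}

pipeline().collect()  // shuffles for the join and the groupBy run here
pipeline().collect()  // identical code, brand-new DataFrames: skipped stages?
```

Per the replies above, the answer is no: the second run builds its own DAG, so textual equality of the code is not enough; the same DataFrame object has to be re-used (or the data cached) for stages to be skipped.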