No, it won't be reused.
You should reuse the DataFrame (the same Dataset instance) if you want the
shuffle blocks (and cached data) to be reused.
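
As a minimal sketch of what that reuse looks like (the file name and the
`df` variable are just for illustration), keep a single reference and run
both actions against it:

```
scala> val df = spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count
scala> df.cache()   // optional: also persist the computed rows, not just the shuffle output
scala> df.take(1)   // first action: all stages run, shuffle files get written
scala> df.take(1)   // second action on the same instance: the shuffle stages should show as "skipped" in the UI
```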

I know this because the two actions will lead to two separate DAGs being
built, but I will show you a way to check this on your own (with a small
and simple Spark application).

For this, the spark-shell can be used, too. Start it in a directory where a
simple text file is available ("README.md" in my case).

After this, the one-liner is:

```
scala> spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count.take(1)
```

Now, if you check the Stages tab in the UI, you will see 3 stages.
After re-executing the same line of code, the Stages tab shows that the
number of stages has doubled: every stage ran again, none were skipped.
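
If you prefer to verify this programmatically instead of in the UI, a
sketch like the following (the `completedStages` counter is made up for
this check) counts completed stages via a SparkListener:

```
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import java.util.concurrent.atomic.AtomicInteger

// hypothetical counter, just for this experiment
val completedStages = new AtomicInteger(0)

spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    completedStages.incrementAndGet()
  }
})

// run the one-liner twice, then read the counter; listener events are
// delivered asynchronously, so give it a moment before checking
println(completedStages.get)
```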

So the shuffle files are not reused across the two actions.

Finally, you can delete the file and re-execute our small test. Now it will
produce:

``` 
org.apache.spark.sql.AnalysisException: Path does not exist:
file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
```

So the file would have been opened again to load the data (even in the
3rd run).
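
If you do want the data (and not just the shuffle output) to survive such
re-execution, a sketch of the persist-based variant (again with names just
for illustration, and assuming the cached partitions are not evicted) is:

```
val df = spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count
df.persist()  // MEMORY_AND_DISK is the default storage level for Datasets
df.count()    // materializes every partition into the cache

// from here on, actions on the same df instance are served from the cache;
// even deleting README.md should not matter, as long as nothing is evicted
df.take(1)
```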


