A much better one-liner (easier to follow in the UI, because it produces one simple job with two stages):
```
spark.read.text("README.md").repartition(2).take(1)
```

Attila Zsolt Piros wrote:
> No, it won't be reused.
> You should reuse the DataFrame for reusing the shuffle blocks (and cached
> data).
>
> I know this because the two actions will lead to building two separate
> DAGs, but I will show you a way to check this on your own (with a small,
> simple Spark application).
>
> For this you can even use the spark-shell. Start it in a directory where a
> simple text file is available ("README.md" in my case).
>
> After this the one-liner is:
>
> ```
> scala> spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count.take(1)
> ```
>
> Now if you check the Stages tab in the UI you will see 3 stages.
> After re-executing the same line of code, you will see in the Stages tab
> that the number of stages has doubled.
>
> So shuffle files are not reused.
>
> Finally, you can delete the file and re-execute our small test. Now it will
> produce:
>
> ```
> org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
> ```
>
> So the file would have been opened again for loading the data (even in the
> 3rd run).
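For contrast, here is a minimal sketch of the DataFrame reuse described above (assuming the same spark-shell session and the same README.md): keep a single DataFrame reference and run both actions against it. The second action can then pick up the existing shuffle files (and, with cache(), the cached partitions), which shows up in the UI as skipped stages rather than a doubled stage count.

```
val df = spark.read.text("README.md")
  .selectExpr("length(value) as l", "value")
  .groupBy("l")
  .count()
  .cache()   // optional: also keep the computed partitions in memory

df.take(1)   // first action: runs the full DAG and writes the shuffle files
df.take(1)   // second action on the same reference: the shuffle output (and
             // cache) is reused, so the earlier stages show up as skipped
```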