A much better one-liner (easier to follow in the UI, because it produces one simple job with two stages):
```
spark.read.text("README.md").repartition(2).take(1)
```

Attila Zsolt Piros wrote:
> No, it won't be reused.
> You should reuse the DataFrame for reusing the shuffle blocks (and cached
> data).
>
> I know this because the two actions will lead to building two separate
> DAGs, but I will show you a way to check this on your own (with a small,
> simple Spark application).
>
> For this you can even use the spark-shell. Start it in a directory where a
> simple text file is available ("README.md" in my case).
>
> After this the one-liner is:
>
> ```
> scala> spark.read.text("README.md").selectExpr("length(value) as l", "value").groupBy("l").count.take(1)
> ```
>
> Now if you check the Stages tab in the UI you will see 3 stages.
> After re-executing the same line of code, you will see in the Stages tab
> that the number of stages has doubled.
>
> So shuffle files are not reused.
>
> Finally, you can delete the file and re-execute our small test. Now it will
> produce:
>
> ```
> org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/Users/attilazsoltpiros/git/attilapiros/spark/README.md;
> ```
>
> So the file would have been opened again for loading the data (even in the
> 3rd run).
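For contrast, here is a minimal sketch of the DataFrame reuse described above (assuming the same spark-shell session and the same README.md): keep a single DataFrame reference and run both actions against it. The second action can then pick up the existing shuffle files (and, with cache(), the cached partitions), which shows up in the UI as skipped stages rather than a doubled stage count.

```
val df = spark.read.text("README.md")
  .selectExpr("length(value) as l", "value")
  .groupBy("l")
  .count()
  .cache()   // optional: also keep the computed partitions in memory

df.take(1)   // first action: runs the full DAG and writes the shuffle files
df.take(1)   // second action on the same reference: the shuffle output (and
             // cache) is reused, so the earlier stages show up as skipped
```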