Hi! I would like to know the difference between the following
transformations when they are executed right before writing an RDD to a file:

    1. coalesce(1, shuffle = true)
    2. coalesce(1, shuffle = false)

Code example:

    val input = sc.textFile(inputFile)
    val filtered = input.filter(doSomeFiltering)
    val mapped = filtered.map(doSomeMapping)

    mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
    vs
    mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does this compare with collect()? I'm fully aware that Spark's save
methods will store the output with an HDFS-style directory structure; however,
I'm more interested in the data partitioning aspects of collect() and of
shuffled vs. non-shuffled coalesce().
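
For reference, here is a small local-mode sketch I used to compare the two
plans (the app name and data are made up for illustration). With shuffle =
false, coalesce is a narrow dependency, so the upstream work collapses into
the single output task; with shuffle = true, a shuffle boundary is inserted
and the upstream stage keeps its original parallelism:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local-mode sketch; "coalesce-demo" and the data are hypothetical.
val conf = new SparkConf().setAppName("coalesce-demo").setMaster("local[4]")
val sc = new SparkContext(conf)

// Start from 4 partitions so the coalesce actually reduces parallelism.
val rdd = sc.parallelize(1 to 100, 4)

// shuffle = false: narrow dependency; one task computes the single output
// partition and also runs any upstream filter/map logic.
val narrow = rdd.coalesce(1, shuffle = false)

// shuffle = true: a shuffle is inserted; the upstream stage still runs with
// 4 tasks, and only the final stage has a single partition.
val shuffled = rdd.coalesce(1, shuffle = true)

val narrowParts = narrow.partitions.length
val shuffledParts = shuffled.partitions.length

// toDebugString shows the extra ShuffledRDD stage only in the shuffle = true plan.
println(narrow.toDebugString)
println(shuffled.toDebugString)

sc.stop()
```

Both lineages end in a single partition; the difference is only in how that
partition is produced.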

Thanks, Paweł.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffled-vs-non-shuffled-coalesce-in-Apache-Spark-tp23377.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
