The Dataset is defined as a case class with many fields and nested structures (Map, List of another case class, etc.). The Dataset is only about 1 TB when saved to disk as a Parquet file, but when joining it, the shuffle write size grows to roughly 12 TB. Is there a way to cut this down without changing the schema? If not, what is the best practice for designing complex schemas?
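For reference, here is a minimal sketch of the kind of setup described; the case class names, fields, and paths are hypothetical and only illustrate the pattern of a nested Dataset being joined:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nested schema (names are illustrative only)
case class Item(id: Long, tags: List[String])
case class Record(key: Long, attrs: Map[String, String], items: List[Item])

object NestedJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-join").getOrCreate()
    import spark.implicits._

    // Parquet-backed Datasets of the nested case class
    val left  = spark.read.parquet("/path/to/records").as[Record]
    val right = spark.read.parquet("/path/to/other").as[Record]

    // Joining on the key shuffles the full nested rows, which is where
    // the shuffle write size can far exceed the on-disk Parquet size
    val joined = left.joinWith(right, left("key") === right("key"))
    joined.write.parquet("/path/to/output")
  }
}
```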