Re: Why is shuffle write size so large when joining Dataset with nested structure?
Hi Takeshi,

Thank you for your comment. I changed it to an RDD and it's a lot better.

Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro wrote:
> I think this is just the overhead of representing nested elements as internal rows at runtime [...]
Re: Why is shuffle write size so large when joining Dataset with nested structure?
Hi,

I think this is just the overhead of representing nested elements as internal rows at runtime (e.g., null bits are stored for each nested element). Moreover, in the Parquet format, nested data is stored column-wise and highly compressed, so on disk it becomes very compact.

But I'm not sure about a better approach in this case.

// maropu

On Sat, Nov 26, 2016 at 11:16 AM, taozhuo wrote:
> The Dataset is defined as a case class with many fields with nested structure (a Map, a List of another case class, etc.) [...]

--
---
Takeshi Yamamuro
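[A rough sketch of the overhead described above. This is plain Python, not Spark internals: per-record serialization stands in for the row-oriented shuffle representation, and compressing a column-wise layout stands in for Parquet. The record fields are invented for illustration.]

```python
import pickle
import zlib

# Hypothetical nested records, loosely mirroring a case class with a Map
# and a List of nested records (all names are made up for illustration).
records = [
    {"id": i, "tags": {"k1": "v1", "k2": "v2"},
     "events": [{"ts": i * 10 + j, "kind": "click"} for j in range(5)]}
    for i in range(1000)
]

# Shuffle-like cost: each record is serialized independently, row by row,
# so structure (keys, nesting, null tracking) is paid for per record.
row_bytes = sum(len(pickle.dumps(r)) for r in records)

# Parquet-like cost: values stored column by column, then compressed;
# the repeated structure and similar values compress extremely well.
columnar = {
    "id": [r["id"] for r in records],
    "tags": [r["tags"] for r in records],
    "events": [r["events"] for r in records],
}
col_bytes = len(zlib.compress(pickle.dumps(columnar)))

print(row_bytes, col_bytes)
```

On a run of this sketch the row-at-a-time total comes out many times larger than the compressed columnar total, which matches the 1T-on-disk vs 12T-shuffle gap in spirit, if not in exact ratio.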
Why is shuffle write size so large when joining Dataset with nested structure?
The Dataset is defined as a case class with many fields, several of which have nested structure (a Map, a List of another case class, etc.). The size of the Dataset is only 1T when saved to disk as a Parquet file, but when it is joined, the shuffle write size becomes as large as 12T. Is there a way to cut it down without changing the schema? If not, what is the best practice when designing complex schemas?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-when-joining-Dataset-with-nested-structure-tp28136.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
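[One common mitigation that does not change the stored schema: project down to only the columns the join needs before the shuffle, and join the wide rows back afterwards. A plain-Python sketch of why that helps, with invented field names and per-record serialization standing in for shuffle write:]

```python
import pickle

# Hypothetical wide rows with nested structure; suppose only "id" is the
# join key and only "score" is needed on the other side of the join.
rows = [
    {"id": i, "score": i * 0.5,
     "attrs": {"a": "x" * 20, "b": "y" * 20},
     "children": [{"c": j} for j in range(10)]}
    for i in range(1000)
]

# Shuffling the full rows pays for every nested field.
full_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Projecting first (roughly select("id", "score") in Dataset terms)
# shuffles only what the join actually consumes.
slim_bytes = sum(len(pickle.dumps({"id": r["id"], "score": r["score"]}))
                 for r in rows)

print(full_bytes, slim_bytes)
```

The projected total is a fraction of the full-row total; whether this is applicable depends on whether the join really needs the nested fields on both sides.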
Re: shuffle write size
Can anyone shed some light on this?

On Tue, Mar 17, 2015 at 5:23 PM, Chen Song wrote:
> I have a MapReduce job that reads from three logs and joins them on some key column [...]

--
Chen Song
shuffle write size
I have a MapReduce job that reads from three logs and joins them on some key column. The underlying data is protobuf messages in sequence files. Between mappers and reducers, the raw byte arrays for the protobuf messages are shuffled. Roughly, for 1G of input from HDFS, there is 2G of data output from the map phase.

I am testing Spark jobs (v1.3.0) on the same input. I found that shuffle write is 3-4 times the input size. I tried passing both protobuf Message objects and Array[Byte], but neither gives a good shuffle write size.

Is there any good practice for shuffling

* protobuf messages
* raw byte arrays

Chen
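[A sketch of why shuffling parsed message objects tends to inflate shuffle write relative to shuffling the already-encoded bytes. This is plain Python, not protobuf or Spark: a small struct-packed record stands in for an encoded message, and pickle stands in for a generic object serializer such as Java serialization. The wire format and field names are invented for illustration.]

```python
import pickle
import struct

# A made-up compact wire format standing in for an encoded message:
# two little-endian 32-bit ints plus a length-prefixed string.
def encode(user_id, score, name):
    raw = name.encode()
    return struct.pack("<iiB", user_id, score, len(raw)) + raw

def decode(buf):
    user_id, score, n = struct.unpack_from("<iiB", buf)
    return {"user_id": user_id, "score": score,
            "name": buf[9:9 + n].decode()}

wire = encode(42, 7, "chen")

# Shuffling the parsed object through a generic serializer repeats
# field names and object structure for every record...
parsed_size = len(pickle.dumps(decode(wire)))

# ...while shuffling the raw bytes keeps the original compact encoding.
raw_size = len(pickle.dumps(wire))

print(raw_size, parsed_size)
```

In Spark terms, the analogous options are usually to keep records as Array[Byte] and decode after the shuffle, or to register an efficient serializer (e.g., Kryo via spark.serializer) for the message type; which wins depends on the data, so this is a direction to test rather than a guarantee.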