
Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Grega Kešpret

Hi,

I am seeing different shuffle write sizes when using SchemaRDD (versus normal RDD). I'm doing the following:

    case class DomainObj(a: String, b: String, c: String, d: String)
    val logs: RDD[String] = sc.textFile(...)
    val filtered: RDD[String] = logs.filter(...)
    val myDomainObjects:

Re: Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Michael Armbrust

Spark SQL always uses a custom configuration of Kryo under the hood to improve shuffle performance:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala

Michael

On Sun, Sep 21, 2014 at 9:04 AM, Grega Kešpret gr...@celtra.com
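
For readers who want to check whether serialization explains the gap, here is a minimal sketch of switching a plain RDD job to Kryo so that its shuffle write size can be compared with the SchemaRDD run. It assumes the plain RDD job is running with Spark's default Java serializer. The object name, app name, and local master are hypothetical, and this is not the custom Kryo setup that SparkSqlSerializer uses internally, only the stock KryoSerializer.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical driver object, for illustration only.
    object KryoShuffleSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kryo-shuffle-sketch") // hypothetical app name
          .setMaster("local[*]")             // assumption: local test run
          // Replace the default Java serializer with Kryo for shuffled data.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

        val sc = new SparkContext(conf)
        try {
          // Run the same RDD job as before here, then compare the
          // "Shuffle Write" column in the web UI against the SchemaRDD run.
        } finally {
          sc.stop()
        }
      }
    }

If the two sizes move closer together with this setting, the difference is likely down to the serializer rather than to the query plan. Any remaining gap may come from how Spark SQL represents rows internally, which this sketch does not reproduce.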