GitHub user ConeyLiu opened a pull request: https://github.com/apache/spark/pull/19586
[SPARK-22367][CORE] Separate the serialization of class and object for iterator

## What changes were proposed in this pull request?

All records in an iterator are of the same class, so there is no need to write the class information for every record. We only need to write the class information once, at the beginning of serialization, and read it once during deserialization.

In this patch, we separate the serialization of the class and the object for an iterator serialized by Kryo. This improves serialization and deserialization performance and reduces the serialized size.

Test case:
```scala
import scala.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumed definition: the original snippet does not show Person.
case class Person(id: String, age: Int)

val conf = new SparkConf().setAppName("Test for serialization")
val sc = new SparkContext(conf)

val random = new Random(1)
val data = sc.parallelize(1 to 1000000000).map { i =>
  Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
}.persist(StorageLevel.OFF_HEAP)

// First count: computes the records and serializes them into the off-heap store.
var start = System.currentTimeMillis()
data.count()
println("First time: " + (System.currentTimeMillis() - start))

// Second count: reads the cached blocks back, so it measures deserialization.
start = System.currentTimeMillis()
data.count()
println("Second time: " + (System.currentTimeMillis() - start))
```

Test result:

Size of the serialized data: before: 34.3GB, after: 17.5GB

Time (ms):

| before (calculation + serialization) | before (deserialization) | after (calculation + serialization) | after (deserialization) |
| ------ | ------ | ------ | ------ |
| 63869 | 21882 | 45513 | 15158 |
| 59368 | 21507 | 51683 | 15524 |
| 66230 | 21481 | 62163 | 14903 |
| 62399 | 22529 | 52400 | 16255 |
| 137564.2 | 136990.8 | 1.004186 |

## How was this patch tested?

Existing UT.
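For readers who want to see the idea in isolation, the following is a minimal sketch and not the code in this patch. It assumes the Kryo library is on the classpath and that every record has the same runtime class. The `ClassOnceSketch` object, the `Person` class, and the record-count prefix are hypothetical and exist only for illustration. The baseline calls `writeClassAndObject` for each record, while the alternative calls `writeClass` once and then `writeObject` for each record.

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

object ClassOnceSketch {
  // Hypothetical record type. The no-arg constructor lets Kryo's default
  // instantiation create it without extra configuration.
  class Person(var id: String, var age: Int) {
    def this() = this(null, 0)
  }

  private def newKryo(): Kryo = {
    val kryo = new Kryo()
    // Unregistered classes are identified by their full class name.
    kryo.setRegistrationRequired(false)
    // Resolve class names with the loader that defines Person.
    kryo.setClassLoader(classOf[Person].getClassLoader)
    kryo
  }

  // Baseline: class information is written in front of every record.
  def writePerRecord(records: Seq[Person]): Array[Byte] = {
    val kryo = newKryo()
    val out = new Output(4096, -1) // -1 lets the buffer grow without limit
    out.writeInt(records.size)     // assumption: length prefix for this sketch
    records.foreach(r => kryo.writeClassAndObject(out, r))
    out.toBytes
  }

  // Alternative: class information is written once, then only the objects.
  def writeClassOnce(records: Seq[Person]): Array[Byte] = {
    val kryo = newKryo()
    val out = new Output(4096, -1)
    out.writeInt(records.size)
    kryo.writeClass(out, classOf[Person])
    records.foreach(r => kryo.writeObject(out, r))
    out.toBytes
  }

  // Reads the layout produced by writeClassOnce: count, class, then objects.
  def readClassOnce(bytes: Array[Byte]): Seq[Person] = {
    val kryo = newKryo()
    val in = new Input(bytes)
    val count = in.readInt()
    val cls = kryo.readClass(in).getType.asInstanceOf[Class[Person]]
    Vector.fill(count)(kryo.readObject(in, cls))
  }
}
```

With Kryo's default auto-reset behavior, the per-record variant repeats the class name for each unregistered record, so its output should be larger than the class-once output as soon as there is more than one record. The sketch only holds when all records share one concrete class. A stream with mixed subclasses would still need per-record class information.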
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark kryo

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19586

----
commit c681e81f9d49b3558c91a3b981504159bbeff910
Author: Xianyang Liu <xianyang....@intel.com>
Date:   2017-10-26T06:37:04Z

    serialize object and class seperately for iterator

commit 640ad5e1d12d1137f4c979a1e75dbdbd713e14de
Author: Xianyang Liu <xianyang....@intel.com>
Date:   2017-10-26T06:42:58Z

    Merge remote-tracking branch 'spark/master' into kryo

----