Xianyang Liu created SPARK-22367:
------------------------------------

             Summary: Separate the serialization of class and object for iteraor
                 Key: SPARK-22367
                 URL: https://issues.apache.org/jira/browse/SPARK-22367
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Xianyang Liu


Becuase they are all the same class for an iterator.  So there is no need write 
class information for every record in the iterator. We only need write the 
class information once at the serialization beginning, also only need read the 
class information once for deserialization.

In this patch, we separate the serialization of class and object for an 
iterator serialized by Kryo. This can improve the performance of the 
serialization and deserialization, and save the space.

Test case:
```scala
    val conf = new SparkConf().setAppName("Test for serialization")
    val sc = new SparkContext(conf)

    val random = new Random(1)
    val data = sc.parallelize(1 to 1000000000).map { i =>
      Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
    }.persist(StorageLevel.OFF_HEAP)

    var start = System.currentTimeMillis()
    data.count()
    println("First time: " + (System.currentTimeMillis() - start))

    start = System.currentTimeMillis()
    data.count()
    println("Second time: " + (System.currentTimeMillis() - start))

```

Test result:

The size of serialized:
before: 34.3GB
after: 17.5GB

| before(cal+serialization)| before(deserialization)| after(cal+serialization)| 
after(deserialization) |
| ------| ------ | ------ | ------ | 
| 63869| 21882|  45513| 15158|
| 59368| 21507|  51683| 15524|
| 66230| 21481|  62163| 14903|
| 62399| 22529|  52400| 16255|

| 137564.2 | 136990.8 | 1.004186 | 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to