GitHub user ConeyLiu opened a pull request:

    https://github.com/apache/spark/pull/19586

    [SPARK-22367][CORE] Separate the serialization of class and object for iterator

    ## What changes were proposed in this pull request?
    
    All records in an iterator are instances of the same class, so there is no
    need to write the class information for every record. It only needs to be
    written once at the start of serialization and read once at the start of
    deserialization.
    
    In this patch, we separate the serialization of the class from the
    serialization of the objects for an iterator serialized by Kryo. This can
    speed up both serialization and deserialization and reduce the size of the
    serialized data.
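
    The idea above can be illustrated with a minimal, self-contained sketch.
    It does not use Kryo or any code from this patch: it uses plain
    `java.io` streams, and the `Person` type, the `ClassOncePerIterator`
    object and the one-byte "has next" marker are all assumptions made for
    illustration. It only shows why writing the class name once per iterator
    produces fewer bytes than writing it before every record.

    ```scala
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    // Hypothetical record type, used only for this illustration.
    final case class Person(id: String, age: Int)

    object ClassOncePerIterator {
      private val className = classOf[Person].getName

      private def writeFields(out: DataOutputStream, p: Person): Unit = {
        out.writeUTF(p.id)
        out.writeInt(p.age)
      }

      // Baseline layout: the class name is repeated before every record.
      def writePerRecord(records: Iterator[Person]): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new DataOutputStream(bytes)
        records.foreach { p =>
          out.writeBoolean(true) // marker: another record follows
          out.writeUTF(className)
          writeFields(out, p)
        }
        out.writeBoolean(false) // marker: end of iterator
        out.flush()
        bytes.toByteArray
      }

      // Layout sketched from the description: class name once, then only the fields.
      def writeClassOnce(records: Iterator[Person]): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new DataOutputStream(bytes)
        out.writeUTF(className)
        records.foreach { p =>
          out.writeBoolean(true)
          writeFields(out, p)
        }
        out.writeBoolean(false)
        out.flush()
        bytes.toByteArray
      }

      // Reads the class name once, then decodes records until the end marker.
      def readClassOnce(data: Array[Byte]): List[Person] = {
        val in = new DataInputStream(new ByteArrayInputStream(data))
        val name = in.readUTF()
        require(name == className, s"unexpected class: $name")
        val result = List.newBuilder[Person]
        while (in.readBoolean()) {
          result += Person(in.readUTF(), in.readInt())
        }
        result.result()
      }
    }
    ```

    In this sketch the saving per record equals the encoded length of the
    class name, so it grows with the number of records in the iterator.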
    
    Test case:
    ```scala
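        import scala.util.Random

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.storage.StorageLevel

        // Person is not defined in this snippet; it is presumably a simple
        // record type with a String field and an Int field.
        val conf = new SparkConf().setAppName("Test for serialization")
        val sc = new SparkContext(conf)
    
        val random = new Random(1)
        val data = sc.parallelize(1 to 1000000000).map { i =>
          Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
        }.persist(StorageLevel.OFF_HEAP)
    
        // First count: computes the RDD and serializes it into off-heap storage.
        var start = System.currentTimeMillis()
        data.count()
        println("First time: " + (System.currentTimeMillis() - start))
    
        // Second count: deserializes the cached data.
        start = System.currentTimeMillis()
        data.count()
        println("Second time: " + (System.currentTimeMillis() - start))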
    ```
    
    Test result:
    
    Size of the serialized data:
    before: 34.3GB
    after: 17.5GB
    
    | before (cal + serialization, ms) | before (deserialization, ms) | after (cal + serialization, ms) | after (deserialization, ms) |
    | ------ | ------ | ------ | ------ |
    | 63869 | 21882 | 45513 | 15158 |
    | 59368 | 21507 | 51683 | 15524 |
    | 66230 | 21481 | 62163 | 14903 |
    | 62399 | 22529 | 52400 | 16255 |
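
    To compare the runs more easily, the four timings per column in the table
    above can be averaged. The sketch below only restates the numbers from
    the table and the reported sizes. It assumes the timings are
    milliseconds, since the test prints differences of
    `System.currentTimeMillis()`. The `BenchmarkSummary` object and its
    member names are hypothetical.

    ```scala
    object BenchmarkSummary {
      // Timings copied from the table above (assumed to be milliseconds).
      val beforeSer: Seq[Int] = Seq(63869, 59368, 66230, 62399)
      val beforeDeser: Seq[Int] = Seq(21882, 21507, 21481, 22529)
      val afterSer: Seq[Int] = Seq(45513, 51683, 62163, 52400)
      val afterDeser: Seq[Int] = Seq(15158, 15524, 14903, 16255)

      // Serialized sizes reported in the description, in GB.
      val sizeBeforeGb = 34.3
      val sizeAfterGb = 17.5

      def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

      def main(args: Array[String]): Unit = {
        println(f"cal + serialization: ${mean(beforeSer)}%.1f -> ${mean(afterSer)}%.1f")
        println(f"deserialization:     ${mean(beforeDeser)}%.1f -> ${mean(afterDeser)}%.1f")
        println(f"size ratio (after / before): ${sizeAfterGb / sizeBeforeGb}%.2f")
      }
    }
    ```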
    
    | 137564.2 | 136990.8 | 1.004186 | 
    
    ## How was this patch tested?
    
    Existing unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark kryo

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19586
    
----
commit c681e81f9d49b3558c91a3b981504159bbeff910
Author: Xianyang Liu <xianyang....@intel.com>
Date:   2017-10-26T06:37:04Z

    serialize object and class seperately for iterator

commit 640ad5e1d12d1137f4c979a1e75dbdbd713e14de
Author: Xianyang Liu <xianyang....@intel.com>
Date:   2017-10-26T06:42:58Z

    Merge remote-tracking branch 'spark/master' into kryo

----

