Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi @cloud-fan, @jerryshao. The problem with `writeClass` and `readClass` 
can be solved by registering the classes `Vector`, `DenseVector` and 
`SparseVector`. The following are the test results:
    ```scala
        val conf = new SparkConf().setAppName("Vector Register Test")
        conf.registerKryoClasses(
          Array(classOf[Vector], classOf[DenseVector], classOf[SparseVector]))
        val sc = new SparkContext(conf)
    
        val sourceData = sc.sequenceFile[LongWritable, VectorWritable](args(0))
          .map { case (k, v) =>
            val vector = v.get()
            val tmpVector = new Array[Double](v.get().size())
            for (i <- 0 until vector.size()) {
              tmpVector(i) = vector.get(i)
            }
            Vectors.dense(tmpVector)
          }
    
        sourceData.persist(StorageLevel.OFF_HEAP)
        var start = System.currentTimeMillis()
        sourceData.count()
        println("First: " + (System.currentTimeMillis() - start))
        start = System.currentTimeMillis()
        sourceData.count()
        println("Second: " + (System.currentTimeMillis() - start))
    
        sc.stop()
    ```
    
    
    Results (before vs. after registering the classes):
    Serialized size:  before: 38.4GB     after: 30.5GB
    First time:       before: 93318ms    after: 80708ms
    Second time:      before: 5870ms     after: 3382ms
    
    These classes are very common in ML, as are `Matrix`, `DenseMatrix` and 
`SparseMatrix`. I'm not sure whether we should register them in core directly, 
because that could introduce an extra jar dependency. Could you give some 
advice? Or should we just add a reminder to the ML docs?
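
    If registration is left to users, the reminder in the docs could be 
accompanied by something like the sketch below. It assumes the 
`org.apache.spark.ml.linalg` package (the `mllib` package has classes with the 
same names); the object name and app name are made up for illustration.

```scala
import org.apache.spark.SparkConf
// Assumption: the ml.linalg variants are the ones in use.
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector, Matrix, SparseMatrix, SparseVector, Vector}

// Hypothetical helper object, not part of Spark.
object RegisterLinalgClasses {
  // Builds a SparkConf with the vector and matrix classes registered for Kryo.
  def buildConf(): SparkConf = {
    val conf = new SparkConf(false).setAppName("Linalg Kryo Registration")
    conf.registerKryoClasses(Array[Class[_]](
      classOf[Vector], classOf[DenseVector], classOf[SparseVector],
      classOf[Matrix], classOf[DenseMatrix], classOf[SparseMatrix]))
  }
}
```

    This keeps the ML dependency on the application side, so core would not 
need to know about these classes.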
    
    The cause appears to be Kryo's behavior: if a class is not registered, it 
writes the full class name instead of the class ID.
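
    A minimal way to check this on a single object is sketched below. 
`Sample` is a hypothetical stand-in for a vector class, and the object and 
method names are made up; the sketch only assumes that `KryoSerializer` can be 
used without a running `SparkContext`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical payload class, standing in for a vector type.
final case class Sample(values: Array[Double])

object KryoRegistrationSize {
  // Serializes one Sample with Spark's KryoSerializer and returns the byte count.
  def serializedSize(register: Boolean): Int = {
    // loadDefaults = false, so external spark.* system properties are ignored.
    val conf = new SparkConf(false)
    if (register) conf.registerKryoClasses(Array[Class[_]](classOf[Sample]))
    val instance = new KryoSerializer(conf).newInstance()
    instance.serialize(Sample(Array(1.0, 2.0, 3.0))).remaining()
  }

  def main(args: Array[String]): Unit = {
    println(s"unregistered: ${serializedSize(register = false)} bytes")
    println(s"registered:   ${serializedSize(register = true)} bytes")
  }
}
```

    The registered case should come out smaller, since only a short class ID 
is written in place of the class name, and the saving grows with the length of 
the fully qualified name.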


---
