[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-05 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Thanks for the suggestion. I have opened a new PR to solve this problem, so I am closing this one. --- - To unsubscribe, e-mail:

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19586 Also, in `ml`, if we want to register classes before running algorithms, other classes such as `LabeledPoint` and `Instance` also need to be registered. There are also some classes temporarily defined in some

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19586 We can configure the classes to register via `spark.kryo.classesToRegister`. Does this need to be added to the Spark code?
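
A minimal sketch of the configuration route mentioned above, assuming Kryo is selected through `spark.serializer`. The object name `KryoConfigSketch` is hypothetical, and the fully qualified class names are assumptions, since the thread refers to the classes only by short name.

```scala
import org.apache.spark.SparkConf

object KryoConfigSketch {
  // Assumed fully qualified names; adjust to the classes your job actually uses.
  val classesToRegister: Seq[String] = Seq(
    "org.apache.spark.ml.linalg.DenseVector",
    "org.apache.spark.ml.feature.LabeledPoint"
  )

  def buildConf(): SparkConf = {
    // loadDefaults = false keeps the sketch independent of JVM system properties.
    val conf = new SparkConf(false)
    // Kryo registration only matters when Kryo is the active serializer.
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // The setting takes a comma-separated list of class names.
    conf.set("spark.kryo.classesToRegister", classesToRegister.mkString(","))
    conf
  }
}
```

The same two keys could presumably be passed with `--conf` on `spark-submit` instead of being set in code.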

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-03 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19586 also cc @WeichenXu123

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-03 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19586 You can call `SparkConf#registerKryoClasses` manually. Maybe we can also register these ml classes automatically in `KryoSerializer.newKryo` via reflection. cc @yanboliang @srowen
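
A minimal sketch of the manual route, assuming the classes in question are the `org.apache.spark.ml.linalg` vector types (the thread does not say whether the `ml` or `mllib` package is meant) and that the Spark ML artifacts are on the classpath. The object name `KryoRegistrationSketch` is hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

object KryoRegistrationSketch {
  def buildConf(): SparkConf = {
    // loadDefaults = false keeps the sketch independent of JVM system properties.
    val conf = new SparkConf(false)
    // registerKryoClasses appends the class names to spark.kryo.classesToRegister
    // and switches spark.serializer to KryoSerializer.
    conf.registerKryoClasses(Array[Class[_]](
      classOf[Vector],
      classOf[DenseVector],
      classOf[SparseVector]
    ))
  }
}
```

Because `registerKryoClasses` returns the `SparkConf` itself, the call can be chained with other setters before the conf is handed to a `SparkContext` or `SparkSession` builder.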

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-03 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Hi @cloud-fan, @jerryshao. The `writeClass` and `readClass` problem can be solved by registering the classes `Vector`, `DenseVector`, and `SparseVector`. The following are the test results: ```scala val

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-02 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19586 I think this problem will go away after mllib migrates to Spark SQL completely. For now, I think we can make the serializer config per-job and set this special serializer for ml jobs.

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-02 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 OK, I can understand your concern. There is a huge GC problem for the K-means workload; it accounts for about 10-20%. The source data is cached in memory, and performance is even worse when the

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-01 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586 I tend to agree with @cloud-fan. I think you can implement your own serializer outside of Spark, specialized for your application; that will definitely be more efficient than the built-in

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-01 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Hi @cloud-fan, in most cases the data type should be the same, so I think this optimization is valuable because it can save considerable space and CPU resources. What about setting a flag for the

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-01 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19586 For these cases, they can write their own serializer and set it via `spark.serializer`. I don't think Spark should have built-in support for them because it's not general.
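
A minimal sketch of pointing `spark.serializer` at a user-provided serializer. The class name `com.example.SameTypeSerializer` is purely hypothetical and is not defined here; it stands in for an application-specific subclass of `org.apache.spark.serializer.Serializer` that would have to be available on the driver and executor classpaths.

```scala
import org.apache.spark.SparkConf

object CustomSerializerConfSketch {
  // Hypothetical class name used only for illustration.
  val serializerClassName: String = "com.example.SameTypeSerializer"

  def buildConf(): SparkConf = {
    // loadDefaults = false keeps the sketch independent of JVM system properties.
    val conf = new SparkConf(false)
    // Spark instantiates the named class when the application starts,
    // so the setting applies to the whole application rather than to one job.
    conf.set("spark.serializer", serializerClassName)
    conf
  }
}
```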

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-01 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Currently, I use it directly. Maybe this is suitable for some special cases where all the data has the same type, such as ml.

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586 Using configuration does not seem so elegant. Also, configuration is application-based, so how would you turn this feature on or off at runtime? Sorry, I cannot give you good advice; maybe kryo's

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Hi @jerryshao, thanks for the reminder; it doesn't support that. I'm sorry I did not take that into account. How about using configuration to determine whether we should use

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19586 @ConeyLiu what about the example below? Does your implementation support this? ```scala trait Base { val name: String } case class A(name: String) extends Base case class

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-30 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19586 Hi @cloud-fan, thanks for reviewing. There are some errors related to `UnsafeShuffleWrite` that still need to be fixed. I am not familiar with this code, so I need some time.

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-30 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19586 OK to test