Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Thanks for the suggestion, I raised a new PR to solve this problem. Closing it
now.
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19586
And in `ml`, if we want to register classes before running algos, some other
classes like `LabeledPoint` and `Instance` also need to be registered.
And there are some classes temporarily defined in some
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19586
We can configure which classes to register via
`spark.kryo.classesToRegister`; does it need to be added into Spark code?
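As a sketch, the registration could be done purely through configuration, with no Spark code change; the `ml.linalg` class names below are an assumption about which classes would need registering:

```scala
// Sketch: register ml vector classes via configuration only.
// The listed class names are assumed candidates, not a confirmed set.
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister",
    "org.apache.spark.ml.linalg.DenseVector," +
    "org.apache.spark.ml.linalg.SparseVector")
```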
---
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/19586
also cc @WeichenXu123
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19586
You can call `SparkConf#registerKryoClasses` manually, maybe we can also
register these ml classes automatically in `KryoSerializer.newKryo` via
reflection.
cc @yanboliang @srowen
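A minimal sketch of the manual route, assuming the ml vector classes are the ones that need registering:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

// Sketch: register the ml vector classes by hand before creating the
// SparkContext; registerKryoClasses appends to spark.kryo.classesToRegister.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[Vector], classOf[DenseVector], classOf[SparseVector]))
```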
---
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Hi @cloud-fan, @jerryshao. The problem with `writeClass` and `readClass` can
be solved by registering the classes `Vector`, `DenseVector`, and
`SparseVector`. The following are the test results:
```scala
val
```
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19586
I think this problem will go away after mllib migrates to Spark SQL
completely. For now I think we can make the serializer config job-wise and set
this special serializer for ml jobs.
---
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
OK, I can understand your concern. There is a huge GC problem for the K-means
workload; it occupies about 10-20%. The source data is cached in
memory, and there is even worse performance when the
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/19586
I tend to agree with @cloud-fan. I think you can implement your own
serializer outside of Spark to be more specialized for your application; that
will definitely be more efficient than the built-in
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Hi @cloud-fan, for most cases the data types should be the same. So I think
this optimization is valuable, because it can save space and CPU resources
considerably. What about setting a flag for the
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19586
For these cases, they can write their own serializer and set it via
`spark.serializer`. I don't think Spark should have built-in support for them
because it's not general.
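For illustration, plugging in an application-specific serializer would just be a configuration change; the class name below is hypothetical:

```scala
// Sketch: point spark.serializer at a hypothetical application-specific
// serializer. The class (com.example.MyVectorSerializer) must extend
// org.apache.spark.serializer.Serializer and be on the classpath.
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "com.example.MyVectorSerializer")
```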
---
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Currently, I use it directly. Maybe this is suitable for some special cases
which have data of the same type, such as ml.
---
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/19586
Using configurations seems not so elegant; also, configuration is
application-based, so how would you turn this feature off/on at runtime? Sorry,
I cannot give you good advice; maybe Kryo's
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Hi @jerryshao, thanks for the reminder; it doesn't support that. I'm sorry I
did not take it into account. How about using a configuration to determine
whether we should use
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/19586
@ConeyLiu what about the below example, does your implementation support
this?
```scala
trait Base { val name: String }
case class A(name: String) extends Base
case class
```
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19586
Hi @cloud-fan, thanks for reviewing. There are some errors related to
`UnsafeShuffleWrite` that need to be fixed further. I am not familiar with this
code, so I need some time.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19586
OK to test
---