Hao, 

I’d say there are a few possible ways to achieve that:
1. Use KryoSerializer.
  The flaw of KryoSerializer is that the current version of Kryo (2.21) has an
issue with internal state, so it might not work for some objects. Spark gets its
Kryo dependency transitively through chill, so this won’t be resolved quickly.
Kryo doesn’t work for me (I have classes I have to transfer, but I don’t have
their codebase). A rough config sketch is below, after the list.

2. Wrap it into something you have control over and make that something
serializable.
  The flaw is kind of obvious: it’s really hard to write serialization for
complex objects. See the wrapper sketch below the list for one variant.

3. Tricky algo: don’t do anything that might end up as a reshuffle.
  That’s the way I took. The flow is that we have a CSV file as input; we parse
it and create objects that we cannot serialize / deserialize, and thus cannot
transfer over the network. Currently we’ve worked around it so that these
objects are processed only in the partitions where they’ve been born (see the
mapPartitions sketch below the list).
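
For reference, a rough sketch of (1), assuming Spark 1.2+ (registerKryoClasses
appeared there); the class is a made-up stand-in for a third-party class you
cannot modify:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the third-party class.
class ThirdPartyThing(val payload: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Use Kryo instead of Java serialization for data that is shuffled or cached.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional, but lets Kryo write compact class ids instead of full class names.
  .registerKryoClasses(Array(classOf[ThirdPartyThing]))

val sc = new SparkContext(conf)

Note this only covers the data itself; closures are still serialized with Java
serialization.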
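
For (2), one variant that keeps the hand-written part small is to ship only the
raw data and rebuild the object lazily on whichever JVM touches it. All names
below are made up, and sc is assumed to be an existing SparkContext:

// Hypothetical third-party class: not Serializable, no no-arg constructor.
class ThirdPartyThing(val id: Int, val name: String)

// Serializable wrapper: only the raw line travels over the network; the real
// object is rebuilt on first access on whatever executor uses it.
class ThirdPartyWrapper(val rawLine: String) extends Serializable {
  @transient lazy val thing: ThirdPartyThing = {
    val parts = rawLine.split(',')
    new ThirdPartyThing(parts(0).toInt, parts(1))
  }
}

// The wrapper can be shuffled or broadcast freely, because only rawLine is
// actually written; 'thing' is rebuilt after the trip.
val wrapped = sc.textFile("input.csv").map(line => new ThirdPartyWrapper(line))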
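
And a sketch of (3), roughly the shape of our flow (again assuming an existing
SparkContext sc): the objects are created and consumed inside mapPartitions, and
only serializable results leave the partition. The class and the computation are
again made up:

// Hypothetical non-serializable object born from one CSV line.
class ThirdPartyThing(val fields: Array[String]) {
  def score: Double = fields.length.toDouble  // placeholder computation
}

val scores = sc.textFile("input.csv").mapPartitions { lines =>
  lines.map { line =>
    val obj = new ThirdPartyThing(line.split(','))  // created in this partition
    obj.score                                       // only the Double leaves it
  }
}
// 'scores' contains plain Doubles, so later shuffles / collects are safe.
println(scores.sum())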

Hope this helps.

On 07 Aug 2015, at 12:39, Hao Ren <inv...@gmail.com> wrote:

> Is there any workaround to distribute a non-serializable object for an RDD 
> transformation or a broadcast variable?
> 
> Say I have an object of class C which is not serializable. Class C is in a 
> jar package, and I have no control over it. Now I need to distribute it 
> either by RDD transformation or by broadcast. 
> 
> I tried to subclass class C with the Serializable interface. It works for 
> serialization, but deserialization does not work, since there is no 
> parameter-less constructor for class C, and deserialization fails with an 
> invalid constructor exception.
> 
> I think it's a common use case. Any help is appreciated.
> 
> -- 
> Hao Ren
> 
> Data Engineer @ leboncoin
> 
> Paris, France

Eugene Morozov
fathers...@list.ru



