Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Because it happens to reference something outside the closure's scope, which in turn references other objects (that you don't need) and so on, resulting in a lot of things you don't want being serialized along with your task. But sure, it is debatable and it's more my personal opinion. 2014-04-17 23:28
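The capture chain described above can be illustrated outside Spark with plain Java serialization (a hypothetical sketch, not Spark code): an anonymous inner class holds an implicit reference to its enclosing instance, so serializing the "closure" drags the whole outer object along, or fails outright when that object is not serializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureCapture {

    interface SerializableRunnable extends Runnable, Serializable {}

    // Stand-in for a driver-side object that is NOT serializable.
    static class Driver {
        byte[] hugeStateYouDontNeed = new byte[1024];

        // Anonymous inner class in an instance method: captures Driver.this
        // implicitly, so serializing the task tries to serialize Driver too.
        Runnable makeLeakyTask() {
            return new SerializableRunnable() {
                public void run() { System.out.println(hugeStateYouDontNeed.length); }
            };
        }

        // Anonymous class in a static context: no enclosing instance,
        // only the captured int is serialized.
        static Runnable makeCleanTask() {
            final int len = 1024;
            return new SerializableRunnable() {
                public void run() { System.out.println(len); }
            };
        }
    }

    // Returns true if the object survives Java serialization.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {  // NotSerializableException extends IOException
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new Driver().makeLeakyTask()));  // false
        System.out.println(serializes(Driver.makeCleanTask()));        // true
    }
}
```

The same mechanism is why Spark tasks sometimes fail with NotSerializableException on objects the function never mentions by name.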

Re: RDD collect help

2014-04-18 Thread Flavio Pompermaier
Ok thanks. However it turns out that there's a problem with that and it's not so safe to use Kryo serialization with Spark: Exception in thread Executor task launch worker-0 java.lang.NullPointerException at

Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Indeed, serialization is always tricky when you want to work on objects that are more sophisticated than simple POJOs. And you can sometimes have unexpected behaviour when using the deserialized objects. In my case I had trouble when serializing/deserializing Avro specific records with lists. The

Re: RDD collect help

2014-04-17 Thread Eugen Cepoi
You have two kinds of serialization: data and closures. They both use Java serialization by default. This means that if in your function you reference an object outside of it, it gets serialized with your task. To enable Kryo serialization for closures, set the spark.closure.serializer property. But usually I don't, as it allows me to detect such
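For reference, the two serializers mentioned above were configured as ordinary Spark properties in that era (names as documented for Spark 0.9/1.x; spark.closure.serializer was removed in later releases):

```properties
# "Data" serializer, used when shuffling or collecting RDD elements
spark.serializer          org.apache.spark.serializer.KryoSerializer

# "Closure" serializer, used for the task functions themselves
# (Java serialization by default, as the message above notes)
spark.closure.serializer  org.apache.spark.serializer.JavaSerializer
```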

RDD collect help

2014-04-14 Thread Flavio Pompermaier
Hi to all, in my application I read objects that are not serializable, and I cannot modify their sources. So I tried a workaround: creating a dummy class that extends the unmodifiable one but implements Serializable. All attributes of the parent class are Lists of objects (some of them are
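A sketch of that workaround (hypothetical class names) shows the catch that likely explains the trouble: when a Serializable subclass extends a non-serializable parent, Java serialization does not write the parent's fields at all. On deserialization it just invokes the parent's no-arg constructor, so the parent's state (e.g. its Lists) silently comes back empty.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class DummySubclass {

    // Stand-in for the unmodifiable, non-serializable third-party class.
    static class Unmodifiable {
        List<String> items = new ArrayList<>();
        // Java serialization requires an accessible no-arg constructor
        // on the first non-serializable superclass.
        public Unmodifiable() {}
    }

    // The "dummy class" workaround from the message above.
    static class Dummy extends Unmodifiable implements Serializable {}

    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new ObjectOutputStream(buf).writeObject(obj);
        return (T) new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        Dummy d = new Dummy();
        d.items.add("data");
        Dummy back = roundTrip(d);
        System.out.println(d.items.size());     // 1
        System.out.println(back.items.size());  // 0 -- parent state was lost
    }
}
```

So the trick compiles and even serializes without errors, but the inherited fields never make the trip.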

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Thanks Eugen for the reply. Could you explain to me why I have the problem? Why doesn't my serialization work? On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, as an easy workaround you can enable Kryo serialization http://spark.apache.org/docs/latest/configuration.html
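The workaround from the quoted reply is a single property on the configuration page linked above: switch the data serializer to Kryo, which does not require classes to implement Serializable.

```properties
# Enable Kryo for RDD data serialization (see the linked configuration page)
spark.serializer  org.apache.spark.serializer.KryoSerializer
```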

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Sure. As you have pointed out, those classes don't implement Serializable, and Spark uses Java serialization by default (when you do collect, the data from the workers will be serialized, collected by the driver and then deserialized on the driver side). Kryo (as most other decent serialization libs)

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Ok, that's fair enough. But why do things work up to the collect? During map and filter, are objects not serialized? On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Sure. As you have pointed out, those classes don't implement Serializable, and Spark uses Java serialization by default

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Nope, those operations are lazy: they create the RDDs but don't trigger any action. The computation is launched by operations such as collect, count, save to HDFS, etc. And even if they were not lazy, no serialization would happen. Serialization occurs only when data will be transferred
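The lazy-transformation versus eager-action split described above is not unique to Spark. As a rough analogy (plain Java streams, not Spark), intermediate operations like map and filter only build a pipeline; nothing runs until a terminal operation is invoked:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipeline {
    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();

        // Like RDD transformations: this only *describes* the computation.
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4)
                .map(x -> { evaluations.incrementAndGet(); return x * 10; })
                .filter(x -> x > 15);

        System.out.println(evaluations.get());  // 0 -- nothing has run yet

        // Like collect/count: a terminal operation triggers the work.
        List<Integer> result = pipeline.collect(Collectors.toList());

        System.out.println(evaluations.get());  // 4
        System.out.println(result);             // [20, 30, 40]
    }
}
```

In Spark the same idea applies, with the extra twist from this thread: the action is also the first point where data crosses a process boundary, which is why serialization errors only surface at collect.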