Thanks Cédric, I learnt something :-) and it solved my issue. A few additional questions then:
In my script, should Serializable.isAssignableFrom(filterClosure.class) return true only when I call dehydrate() on it? (This is not the case.) Would there be a way to automatically create "dehydrated" closures in a script? Or should I catch all calls to map() on JavaRDD to make sure the closure is dehydrated before calling the actual method?

On 26 July 2015 at 11:07, Cédric Champeau <[email protected]> wrote:
> A closure keeps a reference to its owner/thisObject, which is in your
> case the script. The script is not serializable. If you dehydrate the
> closure (call closure.dehydrate()) it will not keep a reference to the
> script anymore and it should be serializable.
>
> 2015-07-26 11:57 GMT+02:00 Jeff MAURY <[email protected]>:
> > So it may be an object stored in your task that is not.
> >
> > Jeff
> >
> > On 26 Jul 2015 11:42, "tog" <[email protected]> wrote:
> >> Thanks Jeff for your quick answer.
> >>
> >> Yes, the tasks shall be serializable and I believe they are.
> >>
> >> My test script has 2 tasks (doing the same job): one is a closure, the
> >> other is an org.apache.spark.api.java.function.Function - and according
> >> to a small test in my script both are serializable for Java/Groovy.
> >>
> >> I am a bit puzzled/stuck here.
> >>
> >> On 26 July 2015 at 10:34, Jeff MAURY <[email protected]> wrote:
> >>> Spark distributes tasks on cluster nodes, so the task needs to be
> >>> serializable. It appears that your task is a Groovy closure, so you
> >>> must make it serializable.
> >>>
> >>> Jeff
> >>>
> >>> On Sun, Jul 26, 2015 at 11:12 AM, tog <[email protected]> wrote:
> >>>> Hi
> >>>>
> >>>> I am starting to play with Apache Spark using Groovy. I have a small
> >>>> script that I use for that purpose.
> >>>>
> >>>> When the script is transformed into a class and launched with java,
> >>>> it works fine, but it fails when run as a script.
> >>>>
> >>>> Any idea what I am doing wrong? Maybe some of you have already come
> >>>> across that problem.
> >>>>
> >>>> $ groovy -version
> >>>> Groovy Version: 2.4.3 JVM: 1.8.0_40 Vendor: Oracle Corporation OS: Mac OS X
> >>>>
> >>>> $ groovy GroovySparkWordcount.groovy
> >>>> class org.apache.spark.api.java.JavaRDD
> >>>> true
> >>>> true
> >>>> Caught: org.apache.spark.SparkException: Task not serializable
> >>>> org.apache.spark.SparkException: Task not serializable
> >>>>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
> >>>>     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
> >>>>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
> >>>>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
> >>>>     at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
> >>>>     at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
> >>>>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> >>>>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> >>>>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
> >>>>     at org.apache.spark.rdd.RDD.filter(RDD.scala:310)
> >>>>     at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
> >>>>     at org.apache.spark.api.java.JavaRDD$filter$0.call(Unknown Source)
> >>>>     at GroovySparkWordcount.run(GroovySparkWordcount.groovy:27)
> >>>> Caused by: java.io.NotSerializableException: GroovySparkWordcount
> >>>> Serialization stack:
> >>>>     - object not serializable (class: GroovySparkWordcount, value: GroovySparkWordcount@57c6feea)
> >>>>     - field (class: GroovySparkWordcount$1, name: this$0, type: class GroovySparkWordcount)
> >>>>     - object (class GroovySparkWordcount$1, GroovySparkWordcount$1@3db1ce78)
> >>>>     - field (class: org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, name: f$1, type: interface org.apache.spark.api.java.function.Function)
> >>>>     - object (class org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, <function1>)
> >>>>     at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> >>>>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
> >>>>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
> >>>>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
> >>>>     ... 12 more
> >>>
> >>> --
> >>> Jeff MAURY
> >>>
> >>> "Legacy code" often differs from its suggested alternative by actually
> >>> working and scaling.
> >>> - Bjarne Stroustrup
> >>>
> >>> http://www.jeffmaury.com
> >>> http://riadiscuss.jeffmaury.com
> >>> http://www.twitter.com/jeffmaury
> >>
> >> --
> >> PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net

--
PGP KeyID: 2048R/EA31CFC9 subkeys.pgp.net
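[Editor's note on the first question above: groovy.lang.Closure itself implements Serializable, so Serializable.isAssignableFrom(filterClosure.class) is true whether or not dehydrate() is ever called; the owner reference only bites when the object graph is actually walked at serialization time. The mechanism Cédric describes can be sketched in plain Java (hypothetical CaptureDemo/Pred names, for illustration only: an anonymous inner class stands in for the closure holding its owner, a capture-free lambda for the dehydrated closure):]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Stand-in for org.apache.spark.api.java.function.Function: a
    // Serializable functional interface (hypothetical name).
    interface Pred extends Serializable { boolean test(String s); }

    Pred innerClassPred() {
        // Anonymous inner class created in an instance method: it carries a
        // hidden this$0 field pointing at the enclosing CaptureDemo -- the
        // same role GroovySparkWordcount plays in the stack trace above.
        return new Pred() {
            public boolean test(String s) { return s.contains("a"); }
        };
    }

    static Pred freePred() {
        // A lambda that touches no enclosing state captures nothing,
        // analogous to a dehydrated closure.
        return s -> s.contains("a");
    }

    static boolean serializes(Object o) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o);
            return true;
        } catch (IOException e) {  // NotSerializableException lands here
            return false;
        }
    }

    public static void main(String[] args) {
        Pred captured = new CaptureDemo().innerClassPred();
        // Both pass an isAssignableFrom-style check...
        System.out.println(captured instanceof Serializable);   // true
        System.out.println(freePred() instanceof Serializable); // true
        // ...but only the capture-free one survives actual serialization.
        System.out.println(serializes(captured));               // false
        System.out.println(serializes(freePred()));             // true
    }
}
```

[So the static type check can never distinguish the two cases; wrapping or intercepting the JavaRDD calls to apply dehydrate() before handing the closure over is the kind of approach that would address it.]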
