Re: Serialization Exception
I am guessing one of two things might work:

1. Define the pattern SPACE inside process(), or
2. Mark the streamingContext and inputDStream fields as transient.

The problem is that a function like PairFunction needs to be serialized so it can be sent to the tasks. The whole closure of the function is serialized along with it, and somehow that closure is capturing the whole WordCountProcessorKafkaImpl.

On Mon, Jun 29, 2015 at 5:14 AM, Spark Enthusiast wrote:
> For prototyping purposes, I created a test program injecting dependencies
> using Spring.
>
> Nothing fancy. This is just a rewrite of KafkaDirectWordCount. When I run
> this, I get the following exception:
>
> Exception in thread "main" org.apache.spark.SparkException: Task not serializable
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
>   at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
>   at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
>   at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
>   at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
>   at org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
>   at org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
>   at com.olacabs.spark.examples.WordCountProcessorKafkaImpl.process(WordCountProcessorKafkaImpl.java:45)
>   at com.olacabs.spark.examples.WordCountApp.main(WordCountApp.java:49)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.NotSerializableException: Object of
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being
> serialized possibly as a part of closure of an RDD operation. This is
> because the DStream object is being referred to from within the closure.
> Please rewrite the RDD operation inside this DStream to avoid this. This
> has been enforced to avoid bloating of Spark tasks with unnecessary
> objects.
> Serialization stack:
>   - object not serializable (class: org.apache.spark.streaming.api.java.JavaStreamingContext, value: org.apache.spark.streaming.api.java.JavaStreamingContext@7add323c)
>   - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl, name: streamingContext, type: class org.apache.spark.streaming.api.java.JavaStreamingContext)
>   - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl, com.olacabs.spark.examples.WordCountProcessorKafkaImpl@29a1505c)
>   - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, name: this$0, type: class com.olacabs.spark.examples.WordCountProcessorKafkaImpl)
>   - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1@c6c82aa)
>   - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
>   - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
>   at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
>   ... 23 more
>
> Can someone help me figure out why?
>
> Here is the code:
>
> public interface EventProcessor extends Serializable {
>     void process();
> }
>
> public class WordCountProcessorKafk
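The capture described above can be reproduced without Spark at all: a Java anonymous inner class carries a hidden this$0 reference to its enclosing instance, so serializing the Function also tries to serialize everything the outer object holds. A minimal, Spark-free sketch (all class names here are illustrative stand-ins, not from the original program):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureCaptureDemo implements Serializable {

    // Stand-in for JavaStreamingContext: NOT Serializable.
    static class Context { }

    private final Context context = new Context();

    // Stand-in for Spark's serializable Function interface.
    interface SerFn extends Serializable { String apply(String s); }

    // Anonymous inner class: its hidden this$0 field drags in the whole
    // enclosing instance, including the non-serializable context field.
    SerFn capturing() {
        return new SerFn() {
            @Override public String apply(String s) { return s.toUpperCase(); }
        };
    }

    // Static nested class: no outer reference, so it serializes cleanly.
    static class Upper implements SerFn {
        @Override public String apply(String s) { return s.toUpperCase(); }
    }

    // True if Java serialization of o succeeds, false if it throws
    // (NotSerializableException is a subclass of IOException).
    static boolean serializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializable(new ClosureCaptureDemo().capturing())); // false
        System.out.println(serializable(new Upper()));                          // true
    }
}
```

This is the same shape as the serialization stack in the exception (this$0 pointing at WordCountProcessorKafkaImpl, which holds the JavaStreamingContext); a static nested class, or keeping only serializable/transient state in the outer class, avoids the hidden reference.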
Serialization Exception
For prototyping purposes, I created a test program injecting dependencies using Spring.

Nothing fancy. This is just a rewrite of KafkaDirectWordCount. When I run this, I get the following exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
  at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
  at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
  at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
  at org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
  at com.olacabs.spark.examples.WordCountProcessorKafkaImpl.process(WordCountProcessorKafkaImpl.java:45)
  at com.olacabs.spark.examples.WordCountApp.main(WordCountApp.java:49)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:483)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Serialization stack:
  - object not serializable (class: org.apache.spark.streaming.api.java.JavaStreamingContext, value: org.apache.spark.streaming.api.java.JavaStreamingContext@7add323c)
  - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl, name: streamingContext, type: class org.apache.spark.streaming.api.java.JavaStreamingContext)
  - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl, com.olacabs.spark.examples.WordCountProcessorKafkaImpl@29a1505c)
  - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, name: this$0, type: class com.olacabs.spark.examples.WordCountProcessorKafkaImpl)
  - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1@c6c82aa)
  - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
  - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
  ... 23 more

Can someone help me figure out why?

Here is the code:

public interface EventProcessor extends Serializable {
    void process();
}

public class WordCountProcessorKafkaImpl implements EventProcessor {
    private static final Pattern SPACE = Pattern.compile(" ");

    @Autowired
    @Qualifier("streamingContext")
    JavaStreamingContext streamingContext;

    @Autowired
    @Qualifier("inputDStream")
    JavaPairInputDStream<String, String> inputDStream;

    @Override
    public void process() {
        // Get the lines, split them into words, count the words and print
        JavaDStream<String> lines = inputDStream.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });
        JavaDStre
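The other fix suggested in the reply, marking the non-serializable fields transient, can also be sketched in plain Java. The Context class below is a hypothetical stand-in for JavaStreamingContext, not part of the original program:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientFieldDemo implements Serializable {

    // Stand-in for JavaStreamingContext: NOT Serializable.
    static class Context { }

    // Without 'transient', writeObject() below would fail with
    // NotSerializableException, exactly as in the stack trace above.
    transient Context streamingContext = new Context();
    String name = "word-count";

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new TransientFieldDemo()); // succeeds
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
        TransientFieldDemo copy = (TransientFieldDemo) in.readObject();
        System.out.println(copy.name);              // "word-count"
        System.out.println(copy.streamingContext);  // null: transient fields are not restored
    }
}
```

The caveat is that a transient field comes back null after deserialization, which is only acceptable here because the streaming context is used on the driver and never inside the function shipped to executors.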
Re: PairRDD serialization exception
I have the exact same error. I am running a PySpark job in yarn-client mode. It works well in standalone mode, but I need to run it in yarn-client mode. Other people have reported the same problem when bundling jars and extra dependencies. I am pointing PySpark to a specific Python executable bundled with external dependencies. However, since the job runs fine in standalone mode, I see no reason why it should give me this error while saving to S3 in yarn-client mode.

Thanks. Any help or direction would be appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PairRDD-serialization-exception-tp21999p22019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: PairRDD serialization exception
Hi Sean,

Below are the sbt dependencies that I am using. I gave it another try after removing the "provided" keyword, which failed with the same error. What confuses me is that the stack trace appears only after a few of the stages have already run to completion.

object V {
  val spark = "1.2.0-cdh5.3.0"
  val esriGeometryAPI = "1.2"
  val csvWriter = "1.0.0"
  val hadoopClient = "2.5.0"
  val scalaTest = "2.2.1"
  val jodaTime = "1.6.0"
  val scalajHTTP = "1.0.1"
  val avro = "1.7.7"
  val scopt = "3.2.0"
  val breeze = "0.8.1"
  val config = "1.2.1"
}

object Libraries {
  val EEAMessage = "com.waterloopublic" %% "eeaformat" % "1.0-SNAPSHOT"
  val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
  val spark = "org.apache.spark" % "spark-core_2.10" % V.spark % "provided"
  val hadoopClient = "org.apache.hadoop" % "hadoop-client" % V.hadoopClient % "provided"
  val esriGeometryAPI = "com.esri.geometry" % "esri-geometry-api" % V.esriGeometryAPI
  val scalaTest = "org.scalatest" %% "scalatest" % V.scalaTest % "test"
  val csvWriter = "com.github.tototoshi" %% "scala-csv" % V.csvWriter
  val jodaTime = "com.github.nscala-time" %% "nscala-time" % V.jodaTime % "provided"
  val scalajHTTP = "org.scalaj" %% "scalaj-http" % V.scalajHTTP
  val scopt = "com.github.scopt" %% "scopt" % V.scopt
  val breeze = "org.scalanlp" %% "breeze" % V.breeze
  val breezeNatives = "org.scalanlp" %% "breeze-natives" % V.breeze
  val config = "com.typesafe" % "config" % V.config
}

There are only a few more things left to try (like reverting to Spark 1.1) before I run out of ideas completely. Please share your insights.

..Manas

On Wed, Mar 11, 2015 at 9:44 AM, Sean Owen wrote:
> This usually means you are mixing different versions of code. Here it
> is complaining about a Spark class. Are you sure you built against the
> exact same Spark binaries, and are not including them in your app?
>
> On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar wrote:
> > (This is a repost. Maybe a simpler subject will fetch more attention among
> > experts)
> >
> > Hi,
> > I have a CDH 5.3.2 (Spark 1.2) cluster.
> > I am getting a local class incompatible exception for my Spark application
> > during an action.
> > All my classes are case classes (to the best of my knowledge).
> >
> > Appreciate any help.
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> > to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
> > Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> > java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
> > class incompatible: stream classdesc serialVersionUID = 8789839749593513237,
> > local class serialVersionUID = -4145741279224749316
> >   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> >   at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> >   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> >   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> >   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> >   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> >   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> >   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> >   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> >   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> >   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> >   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> >   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> >   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:56)
> >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >   at java.lang.Thread.run(Thread.java:745)
> >
> > Thanks
> > Manas
> > Manas Kar
> >
> > View this message in context: PairRDD serialization exception
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
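The mismatched serialVersionUID values in the trace arise because, when a class does not declare one explicitly, Java computes it from the compiled class's shape (fields, methods, interfaces), so two different builds of the same class, such as PairRDDFunctions from two Spark versions, can disagree. A small illustration (the classes here are hypothetical, not Spark's):

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class SerialUidDemo {

    // No explicit serialVersionUID: the JVM derives one from the class's
    // structure, and recompiling a changed class silently changes it,
    // producing InvalidClassException on deserialization across builds.
    static class Implicit implements Serializable {
        int x;
    }

    // Explicit serialVersionUID: stable across recompiles as long as the
    // class remains serialization-compatible.
    static class Explicit implements Serializable {
        private static final long serialVersionUID = 1L;
        int x;
    }

    public static void main(String[] args) {
        // Prints the JVM-computed UID for Implicit, then 1 for Explicit.
        System.out.println(ObjectStreamClass.lookup(Implicit.class).getSerialVersionUID());
        System.out.println(ObjectStreamClass.lookup(Explicit.class).getSerialVersionUID());
    }
}
```

This is why Sean's advice below matters: if the jar on the executors contains a PairRDDFunctions compiled from a different Spark build than the driver's, the two computed UIDs differ and deserialization fails mid-job.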
Re: PairRDD serialization exception
This usually means you are mixing different versions of code. Here it is complaining about a Spark class. Are you sure you built against the exact same Spark binaries, and are not including them in your app?

On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar wrote:
> (This is a repost. Maybe a simpler subject will fetch more attention among
> experts)
>
> Hi,
> I have a CDH 5.3.2 (Spark 1.2) cluster.
> I am getting a local class incompatible exception for my Spark application
> during an action.
> All my classes are case classes (to the best of my knowledge).
>
> Appreciate any help.
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
> Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
> class incompatible: stream classdesc serialVersionUID = 8789839749593513237,
> local class serialVersionUID = -4145741279224749316
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
>   at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
> Thanks
> Manas
> Manas Kar
>
> View this message in context: PairRDD serialization exception
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
PairRDD serialization exception
(This is a repost. Maybe a simpler subject will fetch more attention among experts)

Hi,
I have a CDH 5.3.2 (Spark 1.2) cluster.
I am getting a local class incompatible exception for my Spark application during an action.
All my classes are case classes (to the best of my knowledge).

Appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 346, datanode02):
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc serialVersionUID = 8789839749593513237, local class serialVersionUID = -4145741279224749316
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Thanks
Manas
-
Manas Kar

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PairRDD-serialization-exception-tp21999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.