Re: Serialization Exception

2015-06-30 Thread Tathagata Das
I am guessing one of these two things might work.

1. Define the pattern SPACE inside the process() method, or
2. Mark the streamingContext and inputDStream fields as transient.

The problem is that a function like PairFunction needs to be serialized
so that it can be sent to the tasks. The whole closure of the function is
serialized, and somehow that closure is capturing the whole
WordCountProcessorKafkaImpl object.
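
A minimal, untested sketch of what those two suggestions could look like on the
class from the original post (only the field and class names are taken from the
post; everything else is an assumption):

public class WordCountProcessorKafkaImpl implements EventProcessor {

    // Suggestion 2: transient fields are skipped when this object is pulled
    // into a task closure and serialized, so the non-serializable
    // JavaStreamingContext never travels with the task.
    @Autowired
    @Qualifier("streamingContext")
    transient JavaStreamingContext streamingContext;

    @Autowired
    @Qualifier("inputDStream")
    transient JavaPairInputDStream<String, String> inputDStream;

    @Override
    public void process() {
        // Suggestion 1: a local pattern, so the functions defined in this
        // method do not need an instance field (and hence a reference to
        // this object) to reach it.
        final Pattern SPACE = Pattern.compile(" ");
        // ... map / flatMap / reduceByKey as in KafkaDirectWordCount ...
    }
}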


On Mon, Jun 29, 2015 at 5:14 AM, Spark Enthusiast wrote:

> For prototyping purposes, I created a test program injecting dependencies
> using Spring.
>
> Nothing fancy. This is just a re-write of KafkaDirectWordCount. When I run
> this, I get the following exception:
>
> Exception in thread "main" org.apache.spark.SparkException: Task not
> serializable
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
> at
> org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
> at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
> at
> org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
> at
> org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
> at
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl.process(WordCountProcessorKafkaImpl.java:45)
> at com.olacabs.spark.examples.WordCountApp.main(WordCountApp.java:49)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.NotSerializableException: Object of
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being
> serialized  possibly as a part of closure of an RDD operation. This is
> because  the DStream object is being referred to from within the closure.
> Please rewrite the RDD operation inside this DStream to avoid this.  This
> has been enforced to avoid bloating of Spark tasks  with unnecessary
> objects.
> Serialization stack:
> - object not serializable (class:
> org.apache.spark.streaming.api.java.JavaStreamingContext, value:
> org.apache.spark.streaming.api.java.JavaStreamingContext@7add323c)
> - field (class:
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl, name:
> streamingContext, type: class
> org.apache.spark.streaming.api.java.JavaStreamingContext)
> - object (class
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl,
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl@29a1505c)
> - field (class:
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, name: this$0,
> type: class com.olacabs.spark.examples.WordCountProcessorKafkaImpl)
> - object (class
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1,
> com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1@c6c82aa)
> - field (class:
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name:
> fun$1, type: interface org.apache.spark.api.java.function.Function)
> - object (class
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1,
> )
> at
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
> ... 23 more
>
>
>
> Can someone help me figure out why?
>
>
> Here is the code:
>
> public interface EventProcessor extends Serializable {
> void process();
> }
>
>
> public class WordCountProcessorKafk

Serialization Exception

2015-06-29 Thread Spark Enthusiast
For prototyping purposes, I created a test program injecting dependencies using 
Spring. 

Nothing fancy. This is just a re-write of KafkaDirectWordCount. When I run 
this, I get the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
    at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
    at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
    at 
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
    at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
    at 
org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
    at 
org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
    at 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl.process(WordCountProcessorKafkaImpl.java:45)
    at com.olacabs.spark.examples.WordCountApp.main(WordCountApp.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: Object of 
org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized  
possibly as a part of closure of an RDD operation. This is because  the DStream 
object is being referred to from within the closure.  Please rewrite the RDD 
operation inside this DStream to avoid this.  This has been enforced to avoid 
bloating of Spark tasks  with unnecessary objects.
Serialization stack:
    - object not serializable (class: 
org.apache.spark.streaming.api.java.JavaStreamingContext, value: 
org.apache.spark.streaming.api.java.JavaStreamingContext@7add323c)
    - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl, 
name: streamingContext, type: class 
org.apache.spark.streaming.api.java.JavaStreamingContext)
    - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl, 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl@29a1505c)
    - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, 
name: this$0, type: class 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl)
    - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1@c6c82aa)
    - field (class: 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, 
type: interface org.apache.spark.api.java.function.Function)
    - object (class 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
    at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 23 more



Can someone help me figure out why?


Here is the code:

public interface EventProcessor extends Serializable {
    void process();
}
public class WordCountProcessorKafkaImpl implements EventProcessor {

    private static final Pattern SPACE = Pattern.compile(" ");

    @Autowired
    @Qualifier("streamingContext")
    JavaStreamingContext streamingContext;

    @Autowired
    @Qualifier("inputDStream")
    JavaPairInputDStream<String, String> inputDStream;

    @Override
    public void process() {
        // Get the lines, split them into words, count the words and print
        JavaDStream<String> lines = inputDStream.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });
JavaDStre
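
The message is truncated in the archive, but the failing capture is already
visible in the map() call above: the anonymous Function is a non-static inner
class, so it carries a hidden this$0 reference to WordCountProcessorKafkaImpl
and, through it, to the non-serializable JavaStreamingContext. A small,
untested sketch of an alternative that avoids the capture altogether (the
ExtractValue name is my own, not from the post) is a static nested function
class:

    // A static nested class has no hidden reference to the enclosing
    // instance, so only this small object is serialized into the task.
    static class ExtractValue implements Function<Tuple2<String, String>, String> {
        @Override
        public String call(Tuple2<String, String> tuple2) {
            return tuple2._2();
        }
    }

    // ...and inside process():
    JavaDStream<String> lines = inputDStream.map(new ExtractValue());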

Re: PairRDD serialization exception

2015-03-12 Thread dylanhockey
I have the exact same error. I am running a pyspark job in yarn-client mode.
It works well in standalone mode, but I need to run it in yarn-client mode.

Other people have reported the same problem when bundling jars and extra
dependencies. I'm pointing pyspark to a specific python executable bundled
with external dependencies. However, since the job runs fine in standalone
mode, I see no reason why it should give me this error while saving to S3
in yarn-client mode.

Thanks. Any help or direction would be appreciated.






Re: PairRDD serialization exception

2015-03-11 Thread Manas Kar
Hi Sean,
Below are the sbt dependencies that I am using.

I gave it another try after removing the "provided" keyword, which failed with
the same error.
What confuses me is that the stack trace appears after a few of the stages
have already run completely.

  object V {
    val spark           = "1.2.0-cdh5.3.0"
    val esriGeometryAPI = "1.2"
    val csvWriter       = "1.0.0"
    val hadoopClient    = "2.5.0"
    val scalaTest       = "2.2.1"
    val jodaTime        = "1.6.0"
    val scalajHTTP      = "1.0.1"
    val avro            = "1.7.7"
    val scopt           = "3.2.0"
    val breeze          = "0.8.1"
    val config          = "1.2.1"
  }

  object Libraries {
    val EEAMessage      = "com.waterloopublic" %% "eeaformat" % "1.0-SNAPSHOT"
    val avro            = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
    val spark           = "org.apache.spark" % "spark-core_2.10" % V.spark % "provided"
    val hadoopClient    = "org.apache.hadoop" % "hadoop-client" % V.hadoopClient % "provided"
    val esriGeometryAPI = "com.esri.geometry" % "esri-geometry-api" % V.esriGeometryAPI
    val scalaTest       = "org.scalatest" %% "scalatest" % V.scalaTest % "test"
    val csvWriter       = "com.github.tototoshi" %% "scala-csv" % V.csvWriter
    val jodaTime        = "com.github.nscala-time" %% "nscala-time" % V.jodaTime % "provided"
    val scalajHTTP      = "org.scalaj" %% "scalaj-http" % V.scalajHTTP
    val scopt           = "com.github.scopt" %% "scopt" % V.scopt
    val breeze          = "org.scalanlp" %% "breeze" % V.breeze
    val breezeNatives   = "org.scalanlp" %% "breeze-natives" % V.breeze
    val config          = "com.typesafe" % "config" % V.config
  }

There are only a few more things left to try (like reverting to Spark 1.1)
before I run out of ideas completely.
Please share your insights.

..Manas

On Wed, Mar 11, 2015 at 9:44 AM, Sean Owen wrote:

> This usually means you are mixing different versions of code. Here it
> is complaining about a Spark class. Are you sure you built against the
> exact same Spark binaries, and are not including them in your app?
>
> On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar wrote:
> > (This is a repost. Maybe a simpler subject will fetch more attention
> > among experts)
> >
> > Hi,
> >  I have a CDH5.3.2 (Spark 1.2) cluster.
> >  I am getting a local class incompatible exception for my Spark
> > application during an action.
> > All my classes are case classes (to the best of my knowledge).
> >
> > Appreciate any help.
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due
> > to stage failure: Task 0 in stage 3.0 failed 4 times, most recent
> failure:
> > Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> > java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions;
> local
> > class incompatible:stream classdesc serialVersionUID =
> 8789839749593513237,
> > local class serialVersionUID = -4145741279224749316
> > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> > at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> > at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> > at
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> > at
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> > at org.apache.spark.scheduler.Task.run(Task.scala:56)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> >
> >
> > Thanks
> > Manas
> > Manas Kar
> >
> > 
>


Re: PairRDD serialization exception

2015-03-11 Thread Sean Owen
This usually means you are mixing different versions of code. Here it
is complaining about a Spark class. Are you sure you built against the
exact same Spark binaries, and are not including them in your app?
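
One quick way to see which spark-core each side is actually loading (my own
suggestion, not something from this thread) is a tiny stand-alone check that
prints the serialVersionUID the local classpath computes for the class named
in the exception, so it can be compared with the "stream classdesc" value in
the executor's stack trace:

import java.io.ObjectStreamClass;
import org.apache.spark.rdd.PairRDDFunctions;

public class SerialVersionCheck {
    public static void main(String[] args) {
        // serialVersionUID as computed from the PairRDDFunctions class on
        // *this* classpath; compare it with the two values reported in the
        // InvalidClassException (stream classdesc vs. local class).
        long uid = ObjectStreamClass.lookup(PairRDDFunctions.class).getSerialVersionUID();
        System.out.println("local PairRDDFunctions serialVersionUID = " + uid);
    }
}

If the value printed with the application jar on the classpath differs from the
value printed with only the cluster's Spark jars, the app is bundling a
different spark-core than the one the executors run.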

On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar wrote:
> (This is a repost. Maybe a simpler subject will fetch more attention among
> experts)
>
> Hi,
>  I have a CDH5.3.2 (Spark 1.2) cluster.
>  I am getting a local class incompatible exception for my Spark application
> during an action.
> All my classes are case classes (to the best of my knowledge).
>
> Appreciate any help.
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
> Lost task 0.3 in stage 3.0 (TID 346, datanode02):
> java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
> class incompatible:stream classdesc serialVersionUID = 8789839749593513237,
> local class serialVersionUID = -4145741279224749316
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Thanks
> Manas
> Manas Kar
>
> 




PairRDD serialization exception

2015-03-11 Thread manasdebashiskar
(This is a repost. Maybe a simpler subject will fetch more attention among
experts.)

Hi,
 I have a CDH5.3.2 (Spark 1.2) cluster.
 I am getting a local class incompatible exception for my Spark
application during an action.
All my classes are case classes (to the best of my knowledge).

Appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 3.0 (TID 346, datanode02):
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
class incompatible:stream classdesc serialVersionUID = 8789839749593513237,
local class serialVersionUID = -4145741279224749316
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Thanks
Manas



