Re: Using Spark Context as an attribute of a class cannot be used
That's an interesting question for which I do not know the answer. Probably a question for someone with more knowledge of the internals of the shell interpreter...

On Mon, Nov 24, 2014 at 2:19 PM, aecc wrote:
> However, I still don't understand why this object should be serialized
> and shipped. aaa.s and sc are both the same object,
> org.apache.spark.SparkContext@1f222881. However, this:
>
>     aaa.s.parallelize(1 to 10).filter(_ == myNumber).count
>
> needs to be serialized, and this:
>
>     sc.parallelize(1 to 10).filter(_ == myNumber).count
>
> does not.

-- Marcelo
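The puzzle above can be reproduced with plain Java serialization, no Spark involved, since a Spark task closure is shipped with the same JVM machinery. A minimal sketch, with all names (Holder, SerIntPredicate) invented for illustration: a lambda that reads a field through an instance drags the whole instance into its captured state, while a lambda over a local copy of the value captures only the value itself. Judging from the stack trace shown further down the thread, this is what happens in the shell: the task closure ends up referencing the repl line object whose fields include "aaa", so the non-serializable wrapper rides along.

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntPredicate;

public class CaptureDemo {

    // Deliberately NOT serializable: a stand-in for the AAA wrapper.
    static class Holder {
        final int n;
        Holder(int n) { this.n = n; }
    }

    // A lambda type that, like a Spark task closure, must be serializable.
    interface SerIntPredicate extends IntPredicate, Serializable {}

    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Reading holder.n inside the lambda captures the whole Holder instance,
    // so serializing the closure fails with NotSerializableException.
    static boolean fieldAccessSerializes() {
        Holder holder = new Holder(5);
        SerIntPredicate viaField = x -> x == holder.n;
        return serializes(viaField);
    }

    // Copying the value into a local first means only an int is captured.
    static boolean localCopySerializes() {
        Holder holder = new Holder(5);
        int n = holder.n;
        SerIntPredicate viaLocal = x -> x == n;
        return serializes(viaLocal);
    }

    public static void main(String[] args) {
        System.out.println("via field: " + fieldAccessSerializes());
        System.out.println("via local: " + localCopySerializes());
    }
}
```

The same trick works in the shell: copying the value into a fresh local val right before the transformation keeps the wrapper out of the closure.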
Re: Using Spark Context as an attribute of a class cannot be used
Ok, great, I'm going to do it that way, thanks :). However, I still don't understand why this object should be serialized and shipped.

aaa.s and sc are both the same object, org.apache.spark.SparkContext@1f222881. However, this:

    aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

needs to be serialized, and this:

    sc.parallelize(1 to 10).filter(_ == myNumber).count

does not.

2014-11-24 23:13 GMT+01:00 Marcelo Vanzin:
> If you want to do the same thing, your "AAA" needs to be serializable
> and you need to mark all non-serializable fields as "@transient". [...]

-- Alessandro Chacón
Re: Using Spark Context as an attribute of a class cannot be used
On Mon, Nov 24, 2014 at 1:56 PM, aecc wrote:
> I checked sqlContext, they use it in the same way I would like to use my
> class, they make the class Serializable with transient. [...] will I get
> performance issues when doing this because now the class will be
> serialized for some reason that I still don't understand?

If you want to do the same thing, your "AAA" needs to be serializable, and you need to mark all non-serializable fields as "@transient". The only performance penalty you'll pay is the serialization / deserialization of the "AAA" instance, which will most probably be really small compared to the actual work the task will be doing.

Unless your class is holding a whole lot of data, in which case you should start thinking about using a broadcast instead.

-- Marcelo
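The @transient approach can be sketched without Spark: Scala's @transient annotation compiles down to Java's transient keyword, so plain Java serialization shows the behavior. All names below (FakeContext, Wrapper) are hypothetical stand-ins for the SparkContext and the AAA wrapper, not Spark APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {

    // Hypothetical stand-in for SparkContext: not serializable.
    static class FakeContext {}

    // AAA-style wrapper: serializable as a whole, with the heavyweight field
    // marked transient so serialization simply skips it.
    static class Wrapper implements Serializable {
        transient FakeContext ctx;
        final String tag;
        Wrapper(FakeContext ctx, String tag) { this.ctx = ctx; this.tag = tag; }
    }

    static Wrapper roundTrip(Wrapper w) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(w);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Wrapper) in.readObject();
        }
    }

    // After a round trip, the transient field is dropped (comes back null)
    // while the rest of the wrapper survives intact.
    static boolean contextDroppedOnRoundTrip() {
        try {
            Wrapper copy = roundTrip(new Wrapper(new FakeContext(), "driver-only"));
            return copy.ctx == null && "driver-only".equals(copy.tag);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("context dropped: " + contextDroppedOnRoundTrip());
    }
}
```

This also illustrates the cost Marcelo describes: the serialized wrapper is tiny because the heavyweight field is dropped. The flip side is that code running on an executor must never touch the transient field, since it deserializes as null.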
Re: Using Spark Context as an attribute of a class cannot be used
Yes, I'm running this in the shell. In my compiled jar it works perfectly; the issue is that I need to do this in the shell. Are there any workarounds available?

I checked sqlContext: they use it in the same way I would like to use my class, making the class Serializable with transient fields. Does this somehow affect the whole pipeline of data movement? I mean, will I get performance issues when doing this, because now the class will be serialized for some reason that I still don't understand?

2014-11-24 22:33 GMT+01:00 Marcelo Vanzin:
> Try compiling your code and running it outside the shell to see how it
> goes. I'm not sure whether there's a workaround for this when trying
> things out in the shell - maybe declare an `object` to hold your
> constants? [...]

-- Alessandro Chacón
Re: Using Spark Context as an attribute of a class cannot be used
Hello,

On Mon, Nov 24, 2014 at 12:07 PM, aecc wrote:
> This is the stacktrace:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not
> serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
> - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", type: "class
> $iwC$$iwC$$iwC$$iwC$AAA")

Ah. Looks to me like you're trying to run this in spark-shell, right?

I'm not 100% sure of how it works internally, but I think the Scala repl works a little differently than regular Scala code in this regard. When you declare a "val" in the shell, it behaves differently than a "val" inside a method in a compiled Scala class: the former behaves like an instance variable, the latter like a local variable. So this is probably why you're running into this.

Try compiling your code and running it outside the shell to see how it goes. I'm not sure whether there's a workaround for this when trying things out in the shell - maybe declare an `object` to hold your constants? Never really tried, so YMMV.

-- Marcelo
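The "object to hold your constants" idea can likewise be sketched in plain Java: a static member is reached without capturing any enclosing instance, which is the same reason a Scala object can sidestep the repl's wrapper chain. A hedged sketch under those assumptions, with ConstantsDemo and its members invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntPredicate;

public class ConstantsDemo {

    // Static members are reached without capturing any enclosing instance,
    // so a closure that uses them stays trivially serializable.
    static final int MY_NUMBER = 5;

    // A lambda type that, like a Spark task closure, must be serializable.
    interface SerIntPredicate extends IntPredicate, Serializable {}

    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T value) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    // Serialize a closure over the static constant, deserialize it, and apply
    // it to a probe value. Nothing but the constant reference is captured.
    static boolean matches(int probe) {
        try {
            SerIntPredicate revived = roundTrip((SerIntPredicate) (x -> x == MY_NUMBER));
            return revived.test(probe);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("matches 5: " + matches(5));
        System.out.println("matches 4: " + matches(4));
    }
}
```

In shell terms the analogue would be keeping constants in a Scala object (marked Serializable to be safe) instead of in top-level vals, so the closure no longer drags in the repl line object.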
Re: Using Spark Context as an attribute of a class cannot be used
If, instead of myNumber, I actually use the literal value 5, the exception is not thrown. E.g.:

    aaa.s.parallelize(1 to 10).filter(_ == 5).count

works perfectly.
Re: Using Spark Context as an attribute of a class cannot be used
Marcelo Vanzin wrote:
> Do you expect to be able to use the spark context on the remote task?

Not at all. What I want to create is a wrapper around the SparkContext, to be used only on the driver node. I would like this "AAA" wrapper to have several attributes, such as the SparkContext and other configuration for my project.

I tested with -Dsun.io.serialization.extendedDebugInfo=true. This is the stacktrace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
    - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", type: "class $iwC$$iwC$$iwC$$iwC$AAA")
    - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@24e57dcb)
    - field (class "$iwC$$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@178cc62b)
    - field (class "$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC", $iwC$$iwC@1e9f5eeb)
    - field (class "$iwC", name: "$iw", type: "class $iwC$$iwC")
    - object (class "$iwC", $iwC@37d8e87e)
    - field (class "$line18.$read", name: "$iw", type: "class $iwC")
    - object (class "$line18.$read", $line18.$read@124551f)
    - field (class "$iwC$$iwC$$iwC", name: "$VAL15", type: "class $line18.$read")
    - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@2e846e6b)
    - field (class "$iwC$$iwC$$iwC$$iwC", name: "$outer", type: "class $iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@4b31ba1b)
    - field (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", name: "$outer", type: "class $iwC$$iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", <function1>)
    - field (class "org.apache.spark.rdd.FilteredRDD", name: "f", type: "interface scala.Function1")
    - root object (class "org.apache.spark.rdd.FilteredRDD", FilteredRDD[3] at filter at <console>:20)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)

I actually don't understand much about this stack trace. If you can help me, I would appreciate it. Using @transient didn't work either.

Thanks a lot.
Re: Using Spark Context as an attribute of a class cannot be used
Do you expect to be able to use the spark context on the remote task? If you do, that won't work. You'll need to rethink what it is you're trying to do, since SparkContext is not serializable and it doesn't make sense to make it so.

If you don't, you could mark the field as @transient. But the two examples you posted shouldn't be creating a reference to the "aaa" variable in the serialized task. You could use -Dsun.io.serialization.extendedDebugInfo=true to debug these things.

On Mon, Nov 24, 2014 at 10:15 AM, aecc wrote:
> Hello guys,
>
> I'm using Spark 1.0.0 and Kryo serialization. In the Spark Shell, when I
> create a class that contains as an attribute the SparkContext, in this way:
>
>     class AAA(val s: SparkContext) { }
>     val aaa = new AAA(sc)
>
> and I execute any action using that attribute [...] it returns a
> NotSerializableException. [...]

-- Marcelo
Using Spark Context as an attribute of a class cannot be used
Hello guys,

I'm using Spark 1.0.0 and Kryo serialization.

In the Spark shell, when I create a class that contains the SparkContext as an attribute, in this way:

    class AAA(val s: SparkContext) { }
    val aaa = new AAA(sc)

and I execute any action using that attribute, like:

    val myNumber = 5
    aaa.s.textFile("FILE").filter(_ == myNumber.toString).count

or

    aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

it returns a NotSerializableException:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$AAA
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any thoughts on how to solve this issue, or a workaround for it? I'm developing an API that will need to use this SparkContext several times in different places, so it needs to be accessible.

Thanks a lot for the cooperation.