Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream
Missed the $; it should be:

    export SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar:$SPARK_CLASSPATH

Thanks
Best Regards

On Fri, Jan 2, 2015 at 4:57 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try: export SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar:SPARK_CLASSPATH Also put the jar overriding the spark's version at the end. Thanks Best Regards

On Fri, Jan 2, 2015 at 4:46 PM, Christophe Billiard christophe.billi...@gmail.com wrote: ...
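When chasing these NoSuchMethodError / AbstractMethodError conflicts, one quick check (a sketch, not from the original thread; runnable in spark-shell) is to print which jar each suspect class was actually loaded from:

    // getCodeSource may be null for bootstrap classes, hence the Option
    def whereIs(c: Class[_]): Unit =
      println(c.getName + " -> " + Option(c.getProtectionDomain.getCodeSource).map(_.getLocation))

    whereIs(classOf[com.typesafe.config.Config])
    whereIs(classOf[akka.actor.ActorSystem])

If these resolve to the spark assembly rather than the jars prepended via SPARK_CLASSPATH, the ordering fix above has not taken effect.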
Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream
Thank you Akhil for your idea.

In spark-env.sh, I set:

    export SPARK_CLASSPATH=/home/christophe/Development/spark-streaming3/config-1.2.1.jar

When I run bin/compute-classpath.sh I get Spark's classpath:

    /home/christophe/Development/spark-streaming3/config-1.2.1.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

A few errors later, Spark's classpath looks like:

    /home/christophe/Development/spark-streaming3/config-1.2.1.jar:akka-stream-experimental_2.10-1.0-M2.jar:reactive-streams-1.0.0.M3.jar::/home/christophe/Development/spark-streaming3/conf:/home/christophe/Development/spark-streaming3/lib/spark-assembly-1.1.1-hadoop2.4.0.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-api-jdo-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-rdbms-3.2.1.jar:/home/christophe/Development/spark-streaming3/lib/datanucleus-core-3.2.2.jar

And the error is now:

Exception in thread "main" java.lang.NoSuchMethodError: akka.actor.ExtendedActorSystem.systemActorOf(Lakka/actor/Props;Ljava/lang/String;)Lakka/actor/ActorRef;
    at akka.stream.scaladsl.StreamTcp.<init>(StreamTcp.scala:147)
    at akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:140)
    at akka.stream.scaladsl.StreamTcp$.createExtension(StreamTcp.scala:32)
    at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:654)
    at akka.actor.ExtensionId$class.apply(Extension.scala:79)
    at akka.stream.scaladsl.StreamTcp$.apply(StreamTcp.scala:134)
    at akka.http.HttpExt.bind(Http.scala:33)
    at SimpleAppStreaming3$.main(SimpleAppStreaming3.scala:250)
    at SimpleAppStreaming3.main(SimpleAppStreaming3.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I am trying to add akka-actor_2.10-2.3.7.jar to Spark's classpath, and the error gets worse (Spark can't even start anymore):

[ERROR] [01/02/2015 12:08:14.807] [sparkDriver-akka.actor.default-dispatcher-4] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
java.lang.AbstractMethodError
    at akka.actor.ActorCell.create(ActorCell.scala:580)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [01/02/2015 12:08:14.811] [sparkDriver-akka.actor.default-dispatcher-3] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem [sparkDriver]
java.lang.AbstractMethodError: akka.event.slf4j.Slf4jLogger.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My guess is that akka-actor_2.10-2.3.7.jar is overriding Spark's version of akka-actor. But if I don't override it, I can't use akka-http/akka-stream. Is there a way to work around this problem? Thanks, Best regards

On Thu, Jan 1, 2015 at 9:28 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It's a typesafe config jar conflict; you will need to put the jar with the getDuration method in the first position of your classpath. Thanks Best Regards On Wed, Dec 31, 2014 at 4:38 ...
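For what it's worth, Spark 1.1 also documents an experimental setting that gives user-added jars precedence over Spark's own classes on the executors; whether it helps with a driver-side conflict like this one is untested:

    spark.files.userClassPathFirst   true    # in conf/spark-defaults.conf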
pyspark executor PYTHONPATH
Hi, I am running spark 1.1.0 on yarn. I have a custom set of modules installed under the same location on each executor node and I am wondering how I can pass the executors the PYTHONPATH so that they can use those modules. I've tried this:

spark-env.sh: export PYTHONPATH=/tmp/test/
spark-defaults.conf: spark.executorEnv.PYTHONPATH=/tmp/test/

/tmp/test/pkg contains __init__.py and mod.py; mod.py is:

    def test(x):
        return x

from the pyspark shell I can import the module pkg.mod without any issues:

    >>> import pkg.mod
    >>> print pkg.mod.test(1)
    1

also the path is correctly set:

    >>> print os.environ['PYTHONPATH']
    /usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/:/tmp/test/
    >>> print sys.path
    ['', '/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip', '/usr/lib/spark/python', '/tmp/test/', ... ]

it even is seen by the executors:

    >>> sc.parallelize(range(4)).map(lambda x: os.environ['PYTHONPATH']).collect()
    ['/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/', '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/', '/u01/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/', '/u02/yarn/local/usercache/user/filecache/24/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar:/tmp/test/:/tmp/test/']

yet it fails when actually using the module on the executor:

    >>> sc.parallelize(range(4)).map(pkg.mod.test).collect()
    ...
    ImportError: No module named mod
    ...

any idea how to achieve this? I don't want to use sc.addPyFile as these are big packages and they are installed everywhere anyway...

thank you,
Antony.
Is it possible to do incremental training using ALSModel (MLlib)?
Hi all, I'm curious about MLlib and whether it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without a complete re-training phase. I've checked out these resources, but could not find any info on how to solve this:

https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html

My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter
Re: DAG info
Do you have some example code of what you are trying to do? Robin
JdbcRdd for Python
Hi All - Is JdbcRdd currently supported? Having trouble finding any info or examples?
Re: pyspark executor PYTHONPATH
ok, I see now what's happening - pkg.mod.test is serialized by reference and there is nothing actually trying to import pkg.mod on the executors, so the reference is broken. so how can I get pkg.mod imported on the executors?

thanks,
Antony.

On Friday, 2 January 2015, 13:49, Antony Mayi antonym...@yahoo.com wrote: ...
KafkaReceiver executor in spark streaming job on YARN suddenly killed by ResourceManager
Hi, guys. I tried to run a spark streaming job with kafka on YARN. My business logic is very simple: just listen on a kafka topic and write the dstream to hdfs on each batch iteration. After launching, the streaming job works well for a few hours, but then it suddenly dies: the ResourceManager sends SIGTERM to the kafka receiver executor, and the other executors also die without any error messages in their logs. No OutOfMemory exception, no other exceptions... they just die. Do you know about this issue? Please help me to resolve this problem. My environment is described below.

* Scala : 2.10
* Spark : 1.1
* YARN : 2.5.0-cdh5.2.1
* CDH : 5.2.1

Thanks in advance. Have a nice day~
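One way to recover the missing error detail (assuming YARN log aggregation is enabled on this cluster) is to pull the aggregated container logs after the application dies; if they show containers being killed for running beyond physical memory limits, raising spark.yarn.executor.memoryOverhead is the usual remedy:

    yarn logs -applicationId <application id>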
MLLIB and Openblas library in non-default dir
Hi, I have compiled the OpenBlas library into a nonstandard directory and I want to inform the Spark app about it via -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so, which is a standard option in netlib-java (https://github.com/fommil/netlib-java). I tried 2 ways:

1. via the --conf parameter:

    bin/spark-submit -v --class org.apache.spark.examples.mllib.LinearRegression --conf -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt

2. via the --driver-java-options parameter:

    bin/spark-submit -v --driver-java-options -Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt

How can I force spark-submit to propagate the info about the non-standard placement of the openblas library to the netlib-java lib?

thanks, Tomas
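Note that --conf expects key=value pairs, so a bare -D... there is likely ignored. A sketch (untested here) of routing the property through Spark's extraJavaOptions settings on both driver and executors:

    bin/spark-submit -v \
      --conf "spark.driver.extraJavaOptions=-Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so" \
      --conf "spark.executor.extraJavaOptions=-Dcom.github.fommil.netlib.NativeSystemBLAS.natives=/usr/local/lib/libopenblas.so" \
      --class org.apache.spark.examples.mllib.LinearRegression \
      examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar data/mllib/sample_libsvm_data.txt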
Re: sparkContext.textFile does not honour the minPartitions argument
Thanks everyone. I studied the source code and realized minPartitions is passed over to Hadoop's InputFormat, and it's up to the InputFormat implementation to use the parameter as a hint. Thanks, Aniket

On Fri, Jan 2, 2015, 7:13 AM Rishi Yadav ri...@infoobjects.com wrote: Hi Ankit, the optional number-of-partitions value is there to increase the number of partitions, not reduce it from the default value. On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to read a file into a single partition but it seems like sparkContext.textFile ignores the passed minPartitions value. I know I can repartition the RDD but I was curious to know if this is expected or if this is a bug that needs to be further investigated?
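Since minPartitions is only a hint to the InputFormat, the reliable way to end up with exactly one partition is to coalesce after loading; a minimal sketch (the path is illustrative):

    val single = sc.textFile("hdfs:///some/file.txt").coalesce(1)
    println(single.partitions.length) // 1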
Re: NoClassDefFoundError when trying to run spark application
do you assemble the uber jar? you can use sbt assembly to build the jar and then run. It should fix the issue
different akka versions and spark
i noticed spark 1.2.0 bumps the akka version. since spark uses its own akka version, does this mean it can co-exist with another akka version in the same JVM? has anyone tried this? we have some spark apps that also use akka (2.2.3) and spray. if different akka versions cause conflicts then spark 1.2.0 would not be backwards compatible for us... thanks. koert
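For apps that must keep their own akka, one approach people use is shading. A rough sketch with the maven-shade-plugin (untested here, and known to be fragile because akka resolves classes by name from reference.conf, which relocation does not rewrite):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- move the app's akka 2.2.3 out of the way of Spark's akka -->
              <relocation>
                <pattern>akka</pattern>
                <shadedPattern>shaded.akka</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>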
Re: (send this email to subscribe)
There is no need to include user@spark.apache.org in a subscription request. FYI

On Fri, Jan 2, 2015 at 7:36 AM, Pankaj pankajnaran...@gmail.com wrote:
(send this email to subscribe)
Re: Spark or Tachyon: capture data lineage
Agreed with Jerry. Aside from Tachyon, seeing this for general debugging would be very helpful. Haoyuan, is the feature you are referring to related to https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the toDebugString() method useful (but it renders execution as a tree and not as a more general DAG, and therefore doesn't always capture the flow in the way I'd like to review it). Example:

    a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1] > 1000)
    b = a.join(a)
    print b.toDebugString()

    (16) PythonRDD[19] at RDD at PythonRDD.scala:43
     |  MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
     |  ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
     +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
        |  PythonRDD[14] at RDD at PythonRDD.scala:43
        |  UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
        |  PythonRDD[11] at RDD at PythonRDD.scala:43
        |  ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
        |  PythonRDD[12] at RDD at PythonRDD.scala:43
        |  ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315

Best, -Sven

On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li haoyuan...@gmail.com wrote: ...
Re: Submitting spark jobs through yarn-client
Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath, but there is no __app__.jar in the directory pointed to by PWD. Any ideas?

On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: ...
Submitting spark jobs through yarn-client
I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers:

1) I'm newing up a spark context directly, manually adding all the lib jars from Spark and Hadoop to the setJars() method on the SparkConf.
2) I'm using SparkSubmit.main() to pass the classname and jar containing my code.

When yarn tries to create the container, I get an exception in the driver: "Yarn application already ended, might be killed or not able to launch application master". When I look into the logs for the nodemanager, I see NoClassDefFoundError: org/apache/spark/Logging.

Looking closer @ the contents of the nodemanagers, I see that the spark yarn jar was renamed to __spark__.jar and placed in the app cache, while the rest of the libraries I specified via setJars() were all placed in the file cache. Any ideas as to what may be happening? I even tried adding the spark-core dependency and uber-jarring my own classes so that the dependencies would be there when Yarn tries to create the container.
Re: Spark or Tachyon: capture data lineage
Jerry, Great question. Spark and Tachyon capture lineage information at different granularities. We are working on a Spark/Tachyon integration for this. Hope to get it ready to be released soon. Best, Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote: ...

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Publishing streaming results to web interface
Hi, New to spark so just feeling my way in using it on a standalone server under linux. I'm using scala to store running count totals of certain tokens in my streaming data and publishing a top 10 list, e.g.

    (TokenX,count)
    (TokenY,count)
    ...

At the moment this is just being printed to stdout using print(), but I want to be able to view these running counts from a web page - I'm just not sure where to start with this. Can anyone advise or point me to examples of how this might be achieved? Many Thanks, Thomas
Spark or Tachyon: capture data lineage
Hi spark developers, I was thinking it would be nice to extract the data lineage information from a data processing pipeline. I assume that spark/tachyon keeps this information somewhere. For instance, a data processing pipeline uses datasources A and B to produce C. C is then used by another process to produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so useful if there were a way to capture this information when we are using spark/tachyon and to query this data lineage information. For example: give me the datasets that produce E. It should give me a graph like (A and B) -> C -> E. Is this something already possible with spark/tachyon? If not, do you think it is possible? Does anyone mind sharing their experience in capturing the data lineage in a data processing pipeline? Best Regards, Jerry
Re: JdbcRdd for Python
yeah.. i went through the source, and unless i'm missing something it's not.. agreed, i'd love to see it implemented!

On Fri, Jan 2, 2015 at 3:59 PM, Tim Schweichler tim.schweich...@healthination.com wrote: ...
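For reference, the Scala API does exist as org.apache.spark.rdd.JdbcRDD; a minimal sketch of its shape (the connection string, table, and bounds are made up) that a Python port would need to mirror:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // The SQL must contain two '?' placeholders that Spark fills with partition bounds.
    val people = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:postgresql://localhost/testdb", "user", "pass"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      1L, 1000L, 4,
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

In the meantime, pyspark users typically pull rows with mapPartitions and a DB driver, or load via a file export.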
Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration
Well that's confusing. I have the same issue. So you're saying I have to compile Spark with Yarn set to true to make it work with Hadoop 2.2.0 in Standalone mode?
Re: JdbcRdd for Python
Doesn't look like it is at the moment. If that's the case I'd love to see it implemented.

From: elliott cordo elliottco...@gmail.com
Date: Friday, January 2, 2015 at 8:17 AM
To: user@spark.apache.org
Subject: JdbcRdd for Python

Hi All - Is JdbcRdd currently supported? Having trouble finding any info or examples?
Re: Submitting spark jobs through yarn-client
So looking @ the actual code - I see where it looks like --class 'notused' --jar null is set in ClientBase.scala when yarn is being run in client mode. One thing I noticed is that the jar is being set by trying to grab the jar's uri from the classpath resources - in this case I think it's finding the spark-yarn jar instead of spark-assembly, so when it tries to run ExecutorLauncher.scala, none of the core classes (like org.apache.spark.Logging) are going to be available on the classpath. I hope this is the root of the issue. I'll keep this thread updated with my findings.

On Fri, Jan 2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: ...
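If the wrong jar is indeed being picked up, one thing worth trying (a sketch; the assembly path is illustrative) is to point YARN at the spark assembly explicitly via spark.yarn.jar, the successor to the old SPARK_JAR env variable:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("webapp-spark") // hypothetical name
      .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly-1.2.0-hadoop2.4.0.jar")
    val sc = new SparkContext(conf)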
Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream
Like before I get a java.lang.NoClassDefFoundError: akka/stream/FlowMaterializer$

This can be solved using the assembly plugin. You need to enable the assembly plugin in global plugins: in C:\Users\infoshore\.sbt\0.13\plugins, add a line to plugins.sbt:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.0")

and then add the following lines in build.sbt:

    import AssemblyKeys._ // put this at the top of the file

    seq(assemblySettings: _*)

Also at the bottom don't forget to add the merge strategy:

    mergeStrategy in assembly := {
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
      case "log4j.properties" => MergeStrategy.discard
      case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
      case "reference.conf" => MergeStrategy.concat
      case _ => MergeStrategy.first
    }

Now run sbt assembly; that will create a jar which can be run without the --jars option, as it will be an uber jar containing all the jars.

Also, a NoSuchMethodError is thrown when the compiled and runtime versions differ. What is the version of spark you are using? You need to use the same version in build.sbt. Here is your build.sbt:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" // exclude("com.typesafe", "config")
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.1"
    libraryDependencies += "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.3"
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
    libraryDependencies += "org.apache.cassandra" % "cassandra-thrift" % "2.0.5"
    libraryDependencies += "joda-time" % "joda-time" % "2.6"

and your error is

    Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J
        at akka.stream.StreamSubscriptionTimeoutSettings$.apply(FlowMaterializer.scala:256)

I think there is a version mismatch in the jars you use at runtime. If you need more help add me on skype: pankaj.narang

---Pankaj
Re: Submitting spark jobs through yarn-client
.. and looking even further, it looks like the actual command that's executed starting up the JVM to run org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected, but I don't see where to set these properties or why they aren't making it through.

On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: ...
Re: Is it possible to do incremental training using ALSModel (MLlib)?
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4981

On Fri, Jan 2, 2015 at 8:28 PM, Peng Cheng rhw...@gmail.com wrote: ...
Re: Is it possible to do incremental training using ALSModel (MLlib)?
I was under the impression that ALS wasn't designed for it - the famous ebay online recommender uses SGD. However, you can try using the previous model as a starting point, and gradually reduce the number of iterations after the model stabilizes. I have never verified this idea, so you need to at least cross-validate it before putting it into production.

On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: ...
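A sketch of that periodic-retraining idea (note that MLlib's ALS in 1.x has no warm-start parameter, so the previous model cannot literally seed the next run; the only lever is running fewer iterations once things have stabilized, and all numbers below are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    def retrain(ratings: RDD[Rating], firstRun: Boolean) = {
      // cheaper refreshes after the first full train
      val iterations = if (firstRun) 20 else 5
      ALS.train(ratings, 10 /* rank */, iterations, 0.01 /* lambda */)
    }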
Re: Publishing streaming results to web interface
Try and see if this helps: http://zeppelin-project.org/

-Sathish

On Fri Jan 02 2015 at 8:20:54 PM Pankaj Narang pankajnaran...@gmail.com wrote: ...
Re: Publishing streaming results to web interface
Thomas, Spark does not provide any web interface for this directly. There might be third party apps providing dashboards, but I am not aware of any for this purpose. You can use methods that save this data to the file system instead of printing it to the screen; some of the methods you can use on an RDD are saveAsObjectFile and saveAsTextFile. You can then read these files to show the counts on a web interface in any language of your choice. Regards, Pankaj
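A minimal sketch of that file-based approach for the original top-10 use case, assuming tokenCounts is the DStream[(String, Long)] of running totals (the output path is also an assumption):

    import java.io.PrintWriter

    tokenCounts.foreachRDD { rdd =>
      // take the 10 largest counts on the driver and overwrite a file the web server exposes
      val top10 = rdd.top(10)(Ordering.by((p: (String, Long)) => p._2))
      val out = new PrintWriter("/var/www/html/top10.txt")
      try top10.foreach { case (token, count) => out.println(token + "," + count) }
      finally out.close()
    }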
Re: How to convert String data to RDD.
Please see http://search-hadoop.com/m/JW1q53L9PJ

On Fri, Jan 2, 2015 at 4:31 PM, RP hadoo...@outlook.com wrote: ...
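The usual trick is to wrap the string in a one-element RDD; a sketch against the code in the original post:

    val jsonDataRDD = sc.parallelize(Seq(jsonURLData)) // one JSON object per element
    val json1 = sqlContext.jsonRDD(jsonDataRDD)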
How to convert String data to RDD.
Hello Guys, Spark noob here. I am trying to create an RDD from JSON data fetched from a URL. My URL parsing function gives me the JSON in string format. How do I convert the JSON string to an RDD so that I can use it in SparkSQL?

    // get json data in string format
    val jsonURLData = getUrlAsString("https://somehost.com/test.json")
    val jsonDataRDD = ?
    val json1 = sqlContext.jsonRDD(jsonDataRDD)

Thanks, RP
Re: FlatMapValues
OK this is how I solved it. Not elegant at all but it works and I need to move ahead at this time. Converting to a pair RDD is now not required.

    reacRdd.map(line => line.split(',')).map(fields => {
      if (fields.length >= 10 && !fields(0).contains("VAERS_ID")) {
        (fields(0)+","+fields(1)+"\t"+fields(0)+","+fields(3)+"\t"+fields(0)+","+fields(5)+"\t"+fields(0)+","+fields(7)+"\t"+fields(0)+","+fields(9))
      } else {
        ("")
      }
    }).flatMap(str => str.split('\t')).filter(line => line.toString.length() > 0).saveAsTextFile("/data/vaers/msfx/reac/" + outFile)

From: Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID
To: Hitesh Khamesra hiteshk...@gmail.com
Cc: Kapil Malik kma...@adobe.com; Sean Owen so...@cloudera.com; user@spark.apache.org
Sent: Thursday, January 1, 2015 12:39 PM
Subject: Re: FlatMapValues

thanks let me try that out

From: Hitesh Khamesra hiteshk...@gmail.com
To: Sanjay Subramanian sanjaysubraman...@yahoo.com
Cc: Kapil Malik kma...@adobe.com; Sean Owen so...@cloudera.com; user@spark.apache.org
Sent: Thursday, January 1, 2015 9:46 AM
Subject: Re: FlatMapValues

How about this.. apply flatMap on each line, and in that function parse each line and return all the columns as per your need.

On Wed, Dec 31, 2014 at 10:16 AM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote:

hey guys

Some of u may care :-) but this is just to give u a background on where I am going with this. I have an IOS medical side effects app called MedicalSideFx. I built the entire underlying data layer aggregation using hadoop and currently the search is based on lucene. I am re-architecting the data layer by replacing hadoop with Spark and integrating FDA data, Canadian sidefx data and vaccines sidefx data.

@Kapil, sorry but flatMapValues is being reported as undefined. To give u a complete picture of the code (it's inside IntelliJ but that's only for testing; the real code runs on sparkshell on my cluster):
https://github.com/sanjaysubramanian/msfx_scala/blob/master/src/main/scala/org/medicalsidefx/common/utils/AersReacColumnExtractor.scala

If u were to assume the dataset as

    025003,Delirium,8.10,Hypokinesia,8.10,Hypotonia,8.10
    025005,Arthritis,8.10,Injection site oedema,8.10,Injection site reaction,8.10

this present version of the code, the flatMap works but only gives me values

    Delirium
    Hypokinesia
    Hypotonia
    Arthritis
    Injection site oedema
    Injection site reaction

What I need is

    025003,Delirium
    025003,Hypokinesia
    025003,Hypotonia
    025005,Arthritis
    025005,Injection site oedema
    025005,Injection site reaction

thanks

sanjay

From: Kapil Malik kma...@adobe.com
To: Sean Owen so...@cloudera.com; Sanjay Subramanian sanjaysubraman...@yahoo.com
Cc: user@spark.apache.org
Sent: Wednesday, December 31, 2014 9:35 AM
Subject: RE: FlatMapValues

Hi Sanjay,

Oh yes .. on flatMapValues, it's defined in PairRDDFunctions, and you need to import org.apache.spark.rdd.SparkContext._ to use them (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions)

@Sean, yes indeed flatMap / flatMapValues both can be used.

Regards,
Kapil

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: 31 December 2014 21:16
To: Sanjay Subramanian
Cc: user@spark.apache.org
Subject: Re: FlatMapValues

From the clarification below, the problem is that you are calling flatMapValues, which is only available on an RDD of key-value tuples. Your map function returns a tuple in one case but a String in the other, so your RDD is a bunch of Any, which is not at all what you want. You need to return a tuple in both cases, which is what Kapil pointed out.

However it's still not quite what you want. Your input is basically [key value1 value2 value3] so you want to flatMap that to (key,value1) (key,value2) (key,value3). flatMapValues does not come into play.

On Wed, Dec 31, 2014 at 3:25 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com wrote:

My understanding is as follows

STEP 1 (This would create a pair RDD)
===

    reacRdd.map(line => line.split(',')).map(fields => {
      if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
        (fields(0),(fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
      } else {
      }
    })

STEP 2
===

Since the previous step created a pair RDD, I thought the flatMapValues method would be applicable. But the code does not even compile, saying that flatMapValues is not applicable to RDD :-(

    reacRdd.map(line => line.split(',')).map(fields => {
      if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
        (fields(0),(fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
      } else {
      }
    }).flatMapValues(skus => skus.split('\t')).saveAsTextFile("/data/vaers/msfx/reac/" + outFile)

SUMMARY
===

when a dataset looks like the ...
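For the record, Sean's suggestion can be written more directly as a single flatMap over the split fields (a sketch; the column indices follow the snippets above):

    reacRdd.map(_.split(','))
      .filter(f => f.length >= 10 && !f(0).contains("VAERS_ID"))
      .flatMap(f => Seq(1, 3, 5, 7, 9).map(i => f(0) + "," + f(i)))
      .saveAsTextFile("/data/vaers/msfx/reac/" + outFile)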
Re: SparkSQL 1.2.0 sources API error
Most of the time a NoSuchMethodError means wrong classpath settings, and some jar file is overridden by a wrong version. In your case it could be netty.

On 1/3/15 1:36 PM, Niranda Perera wrote: ...
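One way to chase this down (a suggestion, not a verified fix) is to inspect the dependency tree and exclude the stale Netty 3.x that conflicts with the version akka-remote expects; which dependency drags it in varies, so the coordinates below are only an example:

    mvn dependency:tree -Dincludes=org.jboss.netty,io.netty

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>0.1</version>
        <exclusions>
            <exclusion>
                <groupId>org.jboss.netty</groupId>
                <artifactId>netty</artifactId>
            </exclusion>
        </exclusions>
    </dependency>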
SparkSQL 1.2.0 sources API error
Hi all, I am evaluating the spark sources API released with Spark 1.2.0. But I'm getting a java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V error running the program.

Error log:

15/01/03 10:41:30 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
    at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:283)
    at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:240)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
    at scala.util.Try$.apply(Try.scala:161)
    at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
    at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
    at scala.util.Success.flatMap(Try.scala:200)
    at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
    at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
    at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
    at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
    at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
    at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Following is my simple Java code:

    public class AvroSparkTest {
        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf()
                    .setMaster("local[2]")
                    .setAppName("avro-spark-test")
                    .setSparkHome("/home/niranda/software/spark-1.2.0-bin-hadoop1");

            JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
            JavaSQLContext sqlContext = new JavaSQLContext(sparkContext);

            JavaSchemaRDD episodes = AvroUtils.avroFile(sqlContext,
                    "/home/niranda/projects/avro-spark-test/src/test/resources/episodes.avro");
            episodes.printSchema();
        }
    }

Dependencies:

    <dependencies>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-avro_2.10</artifactId>
            <version>0.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.2.0</version>
        </dependency>
    </dependencies>

I'm using Java 1.7, IntelliJ IDEA and Maven as the build tool. What might cause this error and what may be the remedy?

Cheers
--
Niranda
Re: different akka versions and spark
Please see http://akka.io/news/2014/05/22/akka-2.3.3-released.html which points to http://doc.akka.io/docs/akka/2.3.3/project/migration-guide-2.2.x-2.3.x.html

Cheers

On Fri, Jan 2, 2015 at 9:11 AM, Koert Kuipers ko...@tresata.com wrote: ...
Spark-1.2.0 build error
Hi, I get the following error when I build spark using sbt:

    [error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
    [error] Use 'last' for the full log.

Any help please?
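Exit code 128 comes from git itself (typically a network/proxy failure or a corrupted staging clone); a first step (suggestion only) is to run the clone by hand to see the real error, and to remove the stale staging copy before retrying sbt:

    git clone https://github.com/ScrapCodes/sbt-pom-reader.git
    rm -rf /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader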