spark streaming kafka output
Hi, Is there any example code implementing a Kafka output for Spark Streaming? My use case requires all output to be dumped back to a Kafka cluster after the data is processed. What are the guidelines for implementing such a function? I heard that foreachRDD will create one producer instance per batch; if so, will that hurt performance? Thanks, Weide
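Not an official recipe, but a common pattern is to write from foreachRDD via foreachPartition, holding the producer in a lazily initialized singleton object so that each executor JVM creates one producer and reuses it across batches, rather than one per batch or per record. A minimal sketch against the Scala 0.8-era Kafka producer API; the broker list and topic are placeholders, and it assumes Spark Streaming and the Kafka client are on the classpath:

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

// Helper object: the lazy val is initialized at most once per executor JVM,
// so the producer is shared across batches instead of recreated per batch.
object ProducerHolder {
  lazy val producer: Producer[String, String] = {
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092,broker2:9092") // placeholder brokers
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    new Producer[String, String](new ProducerConfig(props))
  }
}

def writeToKafka(out: DStream[String], topic: String): Unit = {
  out.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val p = ProducerHolder.producer // reused, not created per batch
      records.foreach(r => p.send(new KeyedMessage[String, String](topic, r)))
    }
  }
}
```

The key point is that the closure passed to foreachPartition runs on the workers, so the producer must be created (or looked up) there; it cannot be serialized from the driver.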
spark streaming question
Hi, It might be a very general question to ask here, but I'm curious to know why Spark Streaming can achieve better throughput than Storm, as claimed in the Spark Streaming paper. Does it depend on certain use cases and/or data sources? What drives the better performance in Spark Streaming's case, or put another way, what makes Storm less performant than Spark Streaming? Also, to guarantee exactly-once semantics when node failures happen, Spark makes replicas of RDDs and checkpoints so that data can be recomputed on the fly, while Trident uses transactional objects to persist state and results; it's not obvious to me which approach is more costly, and why. Can anyone share some experience here? Thanks a lot, Weide
what's local[n]
In the Spark Kafka example, it says `./bin/run-example org.apache.spark.streaming.examples.KafkaWordCount local[2] zoo01,zoo02,zoo03 my-consumer-group topic1,topic2 1`. Can anyone tell me what local[2] represents? I thought the master URL should be something like spark://hostname:port. Also, the thread count is specified as 1 in the Kafka example; what happens when the thread count goes above 1? Does that mean multiple Kafka consumers will be created on different Spark workers? I'm not sure how the code maps to runtime worker thread allocation. Is there any documentation on that? Weide
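To illustrate: `local[2]` is itself a valid master URL, an alternative to `spark://host:port`, meaning "run Spark locally in a single JVM with 2 worker threads" (streaming needs at least 2: one for the receiver, one to process data). The trailing `1` is the number of consumer threads inside the single Kafka receiver, passed as a per-topic map; it does not spread consumers across workers. A sketch of roughly what the example does, as of the Spark 0.9.x API (written from memory, not a verified excerpt of the example):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// "local[2]" plays the role of the master URL, just like "spark://host:7077".
val ssc = new StreamingContext("local[2]", "KafkaWordCount", Seconds(2))

// topic -> consumer-thread count: these threads live inside one receiver,
// which occupies a single core on one worker.
val topicMap = Map("topic1" -> 1, "topic2" -> 1)
val lines = KafkaUtils.createStream(
  ssc, "zoo01,zoo02,zoo03", "my-consumer-group", topicMap)
```

To consume in parallel across workers, the usual approach is to create several input streams (one receiver each) and union them, rather than raising the thread count of one receiver.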
spark run issue
Hi, I'm trying to run the kafka-word-count example in Spark 2.9.1. I encountered an exception when initializing the Kafka consumer/producer config. I'm using Scala 2.10.3 and a Maven build against the Spark Streaming Kafka library that comes with Spark 2.9.1. Has anyone seen this exception before? Thanks,

producer:
[java] The args attribute is deprecated. Please use nested arg elements.
[java] Exception in thread "main" java.lang.NoClassDefFoundError: scala/Tuple2$mcLL$sp
[java]   at kafka.producer.ProducerConfig.&lt;init&gt;(ProducerConfig.scala:56)
[java]   at com.turn.apache.KafkaWordCountProducer$.main(HelloWorld.scala:89)
[java]   at com.turn.apache.KafkaWordCountProducer.main(HelloWorld.scala)
[java] Caused by: java.lang.ClassNotFoundException: scala.Tuple2$mcLL$sp
[java]   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
[java]   at java.security.AccessController.doPrivileged(Native Method)
[java]   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
[java]   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
[java]   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
[java]   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
[java]   ... 3 more
[java] Java Result: 1
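For what it's worth, `scala/Tuple2$mcLL$sp` is a @specialized variant of Tuple2 generated by the Scala compiler, so a NoClassDefFoundError here usually means the Scala library on the runtime classpath, or a Kafka jar built against a different Scala version, does not match what Spark was compiled with. A hedged build sketch pinning everything to the same Scala 2.10 line for a Spark 0.9.1-era project, shown here as sbt rather than Maven (artifact names assumed from that era's naming convention):

```scala
// build.sbt sketch: keep the Scala version and all _2.10 artifacts aligned.
scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "0.9.1",
  // Pulls in a Kafka client compiled against the same Scala 2.10,
  // instead of adding a separately downloaded Kafka jar by hand:
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "0.9.1"
)
```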
Re: spark run issue
Hi Tathagata, I figured out the reason. I was adding the wrong Kafka lib alongside the version Spark uses. Sorry for spamming. Weide

On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am a little confused about the version of Spark you are using. Are you using Spark 0.9.1, which uses Scala 2.10.3? TD

On Sat, May 3, 2014 at 6:16 PM, Weide Zhang weo...@gmail.com wrote: Hi, I'm trying to run the kafka-word-count example in Spark 2.9.1. I encountered an exception when initializing the Kafka consumer/producer config. I'm using Scala 2.10.3 and a Maven build against the Spark Streaming Kafka library that comes with Spark 2.9.1. Has anyone seen this exception before? Thanks
Re: spark run issue
Hi Tathagata, I actually have a separate question: what is the lib_managed folder inside the Spark source folder used for? Are those the libraries required for Spark Streaming to run? Do they need to be added to the Spark classpath when starting the Spark cluster? Weide

On Sat, May 3, 2014 at 7:08 PM, Weide Zhang weo...@gmail.com wrote: Hi Tathagata, I figured out the reason. I was adding the wrong Kafka lib alongside the version Spark uses. Sorry for spamming. Weide

On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am a little confused about the version of Spark you are using. Are you using Spark 0.9.1, which uses Scala 2.10.3? TD
docker image build issue for spark 0.9.1
Hi, I tried to build the Docker image for Spark 0.9.1 but got the following error. Has anyone had experience resolving this issue?

The following packages have unmet dependencies:
 tzdata-java : Depends: tzdata (= 2012b-1) but 2013g-0ubuntu0.12.04 is to be installed
E: Unable to correct problems, you have held broken packages.
2014/05/02 14:16:25 The command [/bin/sh -c apt-get install -y tzdata-java less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a non-zero code: 100

It seems related to the openjdk-7-jre-headless installation, but I don't know how to resolve it. Thanks a lot, Weide
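One frequent cause of this kind of "unmet dependencies" error in Docker builds is a stale apt package index cached in an earlier image layer: the index still lists tzdata 2012b-1 while the mirror now serves 2013g. A possible fix, untested against this exact image, is to run apt-get update in the same RUN step as the install and pull tzdata explicitly:

```shell
# Refresh the package index in the same layer as the install, so apt's
# view of available versions matches what the mirrors actually serve.
apt-get update && \
apt-get install -y tzdata tzdata-java less openjdk-7-jre-headless \
    net-tools vim-tiny sudo openssh-server
```

In a Dockerfile this would be a single `RUN apt-get update && apt-get install -y ...` line; splitting update and install into separate RUN instructions is what lets the cached index go stale.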
Re: docker image build issue for spark 0.9.1
Yes, the Docker script is there inside the Spark source package. It already sets up the master and the workers to run in separate Docker containers. In my scenario it is mainly used for easy deployment and development.

On Fri, May 2, 2014 at 2:30 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Don't have any tips for you, Weide, but I was just learning about Docker and it sounds very cool. Are you trying to build both master and worker containers that you can easily deploy to create a cluster? I'm interested in knowing how Docker is used in this case. Nick

On Fri, May 2, 2014 at 5:17 PM, Weide Zhang weo...@gmail.com wrote: Hi, I tried to build the Docker image for Spark 0.9.1 but got the following error. Has anyone had experience resolving this issue?