Re: Strange behaviour of different SSCs with same Kafka topic
Yes, I am running this in local mode and the SSCs run in the same JVM. So, if I deploy this on a cluster, would such behaviour be gone? Also, is there any way I can start the SSCs on a local machine but in different JVMs? I couldn't find anything about this in the documentation.

The inter-mingling of data seems to be gone after I made some of those external classes Scala objects, keeping the maps as static state. Is that a good idea as far as performance is concerned?

Thanks
Gagan B Mishra

On Tue, Apr 22, 2014 at 1:59 AM, Tathagata Das wrote:

Are you by any chance starting two StreamingContexts in the same JVM? That could explain a lot of the weird mixing of data that you are seeing. It's not a supported usage scenario to start multiple StreamingContexts simultaneously in the same JVM.

TD

On Thu, Apr 17, 2014 at 10:58 PM, gaganbm wrote:

It happens at a normal data rate too, i.e., let's say 20 records per second. Apart from that, I am also getting some more strange behaviour. Let me explain.

I establish two SSCs and start them one after another. In the SSCs I get the streams from Kafka sources and do some manipulations, like adding a Record_Name, for example, to each of the incoming records. This Record_Name is different for the two SSCs, and I get this field from some other class, not related to the streams.

The expected behaviour is that all records in SSC1 get tagged with the field RECORD_NAME_1 and all records in SSC2 with the field RECORD_NAME_2. The two SSCs have nothing to do with each other, as far as I can tell. Strangely enough, however, I find many records in SSC1 tagged with RECORD_NAME_2 and vice versa. Is it some kind of serialization issue? That is, does the class which provides this RECORD_NAME get serialized and reconstructed, and then something weird happens inside? I am unable to figure it out.
So, apart from the skewed frequency and volume of records in the two streams, I am getting this inter-mingling of data between the streams. Can you help me with how to use external data to manipulate the RDD records?

Thanks and regards
Gagan B Mishra
Programmer
560034, Bangalore
India

On Tue, Apr 15, 2014 at 4:09 AM, Tathagata Das wrote:

Does this happen at a low event rate for that topic as well, or only at a high volume rate?

TD

On Wed, Apr 9, 2014 at 11:24 PM, gaganbm wrote:

I am really at my wits' end here.

I have different streaming contexts, let's say 2, both listening to the same Kafka topics. I establish the KafkaStream by setting a different consumer group for each of them. Ideally, I should be seeing the Kafka events in both the streams, but what I am getting is really unpredictable. Only one stream gets a lot of events and the other one gets almost nothing, or very little compared to the first. The frequency is also very skewed: I get a lot of events in one stream continuously, and only after some duration do I get a few events in the other one. I don't know where I am going wrong. I can see consumer fetcher threads for both the streams that listen to the Kafka topics.

I can give further details if needed. Any help will be great.

Thanks
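[Editor's note: a minimal sketch of the per-JVM singleton effect discussed in this thread. The names `RecordTagger` and the record-name strings are hypothetical, not from the original code. A Scala `object` is a single instance per JVM, so two StreamingContexts started in one local JVM share any mutable state it holds, which is one plausible mechanism for the mixing described above.]

```scala
// Hypothetical illustration: a Scala object is one instance per JVM.
object RecordTagger {
  // Mutable shared state: both streaming contexts running in the same
  // JVM see (and can overwrite) this single map.
  val names = scala.collection.mutable.Map[Int, String]()
}

object TwoContextsSketch {
  def main(args: Array[String]): Unit = {
    // Simulate "SSC1" and "SSC2" each configuring the shared singleton.
    RecordTagger.names(1) = "RECORD_NAME_1"
    RecordTagger.names(2) = "RECORD_NAME_2"

    // If the code driving context 2 ever writes to context 1's entry
    // (e.g. through a shared key or a race), context 1 silently picks
    // up the other context's value:
    RecordTagger.names(1) = "RECORD_NAME_2"
    println(RecordTagger.names(1)) // now tags SSC1's records with SSC2's name
  }
}
```

Running each streaming application as its own process (e.g. its own spark-submit or `java` invocation) gives each StreamingContext its own JVM, and therefore its own copy of every singleton, which avoids this class of interference entirely.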
Need clarification of joining streams
I wanted some clarification on the behaviour of joining streams. As I understand it, the join works per batch. I am reading data from two Kafka streams and then joining them based on some keys. But what happens if one stream hasn't produced any data in that batch duration, and the other has some? Or, let's say, one stream is getting data at a higher rate, and the other one is less frequent. How does it behave in such cases?

Thanks
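[Editor's note: a hedged sketch of the per-batch join semantics asked about above, using `queueStream` as a self-contained stand-in for the two Kafka streams (with Kafka you would use `KafkaUtils.createStream` instead); class and path names here are illustrative. For each batch interval, the RDD from one stream is inner-joined with the RDD from the same interval of the other, so a batch in which one side produced nothing joins against an empty RDD and emits nothing for that interval.]

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("join-sketch"),
      Seconds(2))

    // Stand-ins for the two Kafka streams of (key, value) pairs.
    val qa = mutable.Queue(ssc.sparkContext.parallelize(Seq("k1" -> "a1")))
    val qb = mutable.Queue.empty[org.apache.spark.rdd.RDD[(String, String)]]
    val a = ssc.queueStream(qa)
    val b = ssc.queueStream(qb) // yields empty batches: no data arrived

    // Per-batch inner join: a batch where b is empty joins to an empty
    // RDD, and a's records for that interval are simply not emitted.
    a.join(b).print()

    // Widening one side with a window keeps a slow stream's records
    // joinable across several batch intervals, at the cost of
    // re-joining them each batch:
    a.window(Seconds(10)).join(b).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```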
KafkaReceiver error when starting ssc (Actor name not unique)
Hi All,

I am getting this exception when calling ssc.start() to start the streaming context:

ERROR KafkaReceiver - Error receiving data
akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not unique!
        at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
        at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
        at akka.actor.ActorCell.reserveChild(ActorCell.scala:338)
        at akka.actor.dungeon.Children$class.makeChild(Children.scala:186)
        at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
        at akka.actor.ActorCell.attachChild(ActorCell.scala:338)
        at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518)
        at org.apache.spark.streaming.dstream.NetworkReceiver.actor$lzycompute(NetworkInputDStream.scala:94)
        at org.apache.spark.streaming.dstream.NetworkReceiver.actor(NetworkInputDStream.scala:94)
        at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:122)
        at org.apache.spark.streaming.scheduler.NetworkInputTracker$ReceiverExecutor$$anonfun$8.apply(NetworkInputTracker.scala:173)
        at org.apache.spark.streaming.scheduler.NetworkInputTracker$ReceiverExecutor$$anonfun$8.apply(NetworkInputTracker.scala:169)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I tried cleaning up the ZooKeeper and Kafka temp/cached files, but still the same. Any help on this?
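[Editor's note: an `InvalidActorNameException` for `NetworkReceiver-0` typically means a second receiver actor is being registered under a name already taken in the same JVM, e.g. a previous StreamingContext was not fully stopped before a new one started, or two contexts run in one JVM. A hedged sketch of the stop-before-restart pattern, with illustrative names; in Spark 0.9, `StreamingContext.stop()` also stops the underlying SparkContext unless told otherwise.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RestartSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("restart-sketch")

    val ssc1 = new StreamingContext(conf, Seconds(2))
    // ... set up Kafka streams on ssc1 ...
    ssc1.start()

    // Before creating another context in the same JVM, stop the first
    // one completely so its receiver actors are deregistered and the
    // actor name NetworkReceiver-0 becomes free again.
    ssc1.stop()

    val ssc2 = new StreamingContext(conf, Seconds(2))
    // ... set up streams on ssc2 ...
    ssc2.start()
    ssc2.awaitTermination()
  }
}
```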
RDD Collect returns empty arrays
I am getting strange behaviour with the RDDs. All I want is to persist the RDD contents in a single file. saveAsTextFile() saves them in multiple text files, one per partition, so I tried rdd.coalesce(1, true).saveAsTextFile(). This fails with the exception:

org.apache.spark.SparkException: Job aborted: Task 75.0:0 failed 1 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)

Then I tried collecting the RDD contents in an array and writing the array to the file manually. Again, that fails: it gives me empty arrays, even when data is there.

    // The below saves the data in multiple text files, so the data is there for sure.
    rdd.saveAsTextFile(resultDirectory)

    // The below simply prints size 0 for all the RDDs in a stream. Why?!
    val arr = rdd.collect()
    println("SIZE of RDD " + rdd.id + " " + arr.size)

Kindly help! I am clueless on how to proceed.
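[Editor's note: a hedged sketch of the collect-and-write approach described above, done inside `foreachRDD` so that `collect()` runs on each batch's materialized RDD rather than on a stale or unprocessed handle. The directory and helper names are hypothetical. This is only suitable for small batches, since `collect()` pulls all records into driver memory.]

```scala
import java.io.{File, PrintWriter}
import org.apache.spark.streaming.dstream.DStream

object SingleFileSketch {
  // Collect each batch at the driver and write one plain file per RDD.
  def saveBatches(stream: DStream[String], dir: String): Unit = {
    stream.foreachRDD { rdd =>
      val records = rdd.collect()
      println("SIZE of RDD " + rdd.id + " " + records.size)
      if (records.nonEmpty) {
        val out = new PrintWriter(new File(dir, "batch-" + rdd.id + ".txt"))
        try records.foreach(out.println) finally out.close()
      }
    }
  }
}
```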
Re: rdd.saveAsTextFile problem
Hi Folks,

Is this issue resolved? If yes, could you please throw some light on how to fix it? I am facing the same problem when writing to text files. When I do

    stream.foreachRDD(rdd => {
      rdd.saveAsTextFile("some path")
    })

this works fine for me, but it creates multiple text files for each partition within an RDD. So I tried the coalesce option to merge my results into a single file for each RDD:

    stream.foreachRDD(rdd => {
      rdd.coalesce(1, true).saveAsTextFile("some path")
    })

This fails with:

org.apache.spark.SparkException: Job aborted: Task 75.0:0 failed 1 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)

I am using Spark Streaming 0.9.0. Any clue what's going wrong when using coalesce?
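[Editor's note: the "unread block data" failure appears only once `coalesce(1, true)` forces a shuffle, which in Spark 0.9 commonly surfaced classpath or serialization mismatches between driver and executors; checking that the same Spark and application jars are on every node is the usual first step. As a workaround that avoids the shuffle entirely, the per-partition output files can be merged after the job, e.g. with Hadoop's `FileUtil.copyMerge` (available in Hadoop 1.x/2.x; paths below are hypothetical).]

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MergeOutputSketch {
  // Merge the part-* files of one saveAsTextFile output directory into
  // a single file, without shuffling inside the Spark job itself.
  def mergeTo(srcDir: String, destFile: String): Unit = {
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(
      fs, new Path(srcDir),
      fs, new Path(destFile),
      false, // keep the source directory
      conf, null)
  }
}
```

Usage would be to call `mergeTo("some path", "some path/merged.txt")` after each `saveAsTextFile` completes, keeping the streaming job itself shuffle-free.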