Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread gaganbm
Yes. I am running this in local mode and the SSCs run in the same JVM.
So, if I deploy this on a cluster, will such behavior go away? Also, is
there any way I can start the SSCs on a local machine but in different JVMs?
I couldn't find anything about this in the documentation.

The inter-mingling of data seems to be gone after I made some of those
external classes Scala objects, keeping static maps and the like. Is
that a good idea as far as performance is concerned?
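
For reference, this is roughly the pattern I mean (a sketch only; all names
are made up): a singleton object holding per-stream state. In local mode both
SSCs share one JVM, so this object, and its map, is shared by both of them.

object RecordNames {
  private val names = scala.collection.mutable.Map[Int, String]()

  def register(streamId: Int, name: String): Unit = synchronized {
    names(streamId) = name
  }

  def nameFor(streamId: Int): String = synchronized {
    names.getOrElse(streamId, "UNKNOWN")
  }
}

The synchronized lookups are cheap at my data rates; my worry is whether
relying on JVM-wide singleton state like this is sound.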

Thanks

Gagan B Mishra


On Tue, Apr 22, 2014 at 1:59 AM, Tathagata Das [via Apache Spark User List]
ml-node+s1001560n4556...@n3.nabble.com wrote:

 Are you by any chance starting two StreamingContexts in the same JVM? That
 could explain a lot of the weird mixing of data that you are seeing. It's
 not a supported usage scenario to start multiple StreamingContexts
 simultaneously in the same JVM.
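
 For example, something like this (an untested sketch; the ZooKeeper quorum
 and topic names are placeholders) keeps both streams in one context, with a
 different consumer group per stream:

 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka.KafkaUtils

 // One StreamingContext; two Kafka input streams in different consumer groups.
 val ssc = new StreamingContext("local[4]", "kafka-app", Seconds(5))

 val stream1 = KafkaUtils.createStream(ssc, "zkhost:2181", "group-1", Map("my-topic" -> 1))
 val stream2 = KafkaUtils.createStream(ssc, "zkhost:2181", "group-2", Map("my-topic" -> 1))

 // ... per-stream transformations ...

 ssc.start()
 ssc.awaitTermination()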

 TD


 On Thu, Apr 17, 2014 at 10:58 PM, gaganbm [hidden email] wrote:

 It happens at a normal data rate too, let's say 20 records per second.

 Apart from that, I am also getting some more strange behavior. Let me
 explain.

 I establish two SSCs and start them one after another. In the SSCs I get the
 streams from Kafka sources and do some manipulations, like adding a
 Record_Name field to each of the incoming records. This Record_Name is
 different for the two SSCs, and I get this field from some other class, not
 related to the streams.

 Now, the expected behavior is that all records in SSC1 get tagged with the
 field RECORD_NAME_1 and all records in SSC2 get tagged with the field
 RECORD_NAME_2. The two SSCs have nothing to do with each other, as far as I
 can tell.

 However, strangely enough, I find many records in SSC1 tagged with
 RECORD_NAME_2 and vice versa. Is it some kind of serialization issue?
 That is, does the class which provides this RECORD_NAME get serialized and
 reconstructed, and then something weird happens inside? I am unable to
 figure it out.

 So, apart from the skewed frequency and volume of records in the two
 streams, I am getting this inter-mingling of data between the streams.

 Can you help me with how to use external data to manipulate the RDD
 records?
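
 One way I could imagine doing this (an untested sketch; stream1 and the
 record name are placeholders) is to capture the per-context value in a
 local val, or broadcast it, rather than reading a field of a shared class
 from inside the closure:

 // Capture the value per context so each closure carries its own copy.
 val recordName = "RECORD_NAME_1" // fixed when the stream is set up
 val tagged = stream1.map(record => (recordName, record))

 // Or distribute it explicitly as a broadcast variable:
 val nameBc = ssc.sparkContext.broadcast("RECORD_NAME_1")
 val tagged2 = stream1.map(record => (nameBc.value, record))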

 Thanks and regards

 Gagan B Mishra


 Programmer
 560034, Bangalore
 India


 On Tue, Apr 15, 2014 at 4:09 AM, Tathagata Das [via Apache Spark User
 List] [hidden email] wrote:

 Does this happen at a low event rate for that topic as well, or only at a
 high volume rate?

 TD


 On Wed, Apr 9, 2014 at 11:24 PM, gaganbm [hidden email] wrote:

 I am really at my wits' end here.

 I have two different streaming contexts, both listening to the same Kafka
 topics. I establish the KafkaStream in each by setting a different consumer
 group.
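
 Roughly, the setup looks like this (a sketch from memory; the ZooKeeper
 quorum and topic names are anonymized placeholders):

 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka.KafkaUtils

 // Two streaming contexts, each with its own Kafka consumer group,
 // both reading the same topic.
 val ssc1 = new StreamingContext("local[2]", "consumer-a", Seconds(5))
 val ssc2 = new StreamingContext("local[2]", "consumer-b", Seconds(5))

 val streamA = KafkaUtils.createStream(ssc1, "zkhost:2181", "group-a", Map("my-topic" -> 1))
 val streamB = KafkaUtils.createStream(ssc2, "zkhost:2181", "group-b", Map("my-topic" -> 1))

 ssc1.start()
 ssc2.start()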

 Ideally, I should be seeing the Kafka events in both the streams. But what
 I am getting is really unpredictable. Only one stream gets a lot of events,
 and the other one gets almost nothing, or very little compared to the first.
 The frequency is also very skewed: I get a lot of events in one stream
 continuously, and after some time I get a few events in the other one.

 I don't know where I am going wrong. I can see consumer fetcher threads for
 both the streams that listen to the Kafka topics.

 I can give further details if needed. Any help will be great.

 Thanks



Need clarification of joining streams

2014-04-21 Thread gaganbm
I wanted some clarification on the behavior of joining streams.

As I understand it, the join works per batch. I am reading data from two
Kafka streams and then joining them based on some keys. But what happens if
one stream hasn't produced any data in that batch duration and the other
has some? Or, let's say, one stream is getting data at a higher rate and the
other one is less frequent. How does it behave in such a case?
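
To make the question concrete, a sketch of what I am doing (keyOf is a
placeholder for my key extraction; the pair operations need the implicit
conversions from StreamingContext):

import org.apache.spark.streaming.StreamingContext._

// As I understand it, join is batch-wise: each batch of pairs1 is joined
// only with the corresponding batch of pairs2, so an empty batch on either
// side should give an empty joined batch. leftOuterJoin would keep the
// left side's records in that case.
val pairs1 = stream1.map(rec => (keyOf(rec), rec))
val pairs2 = stream2.map(rec => (keyOf(rec), rec))

val joined = pairs1.join(pairs2)          // inner join, per batch
val kept   = pairs1.leftOuterJoin(pairs2) // (key, (rec1, Option[rec2]))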

Thanks


KafkaReceiver Error when starting ssc (Actor name not unique)

2014-04-09 Thread gaganbm
Hi All,

I am getting this exception when calling ssc.start() to start the streaming
context.


ERROR KafkaReceiver - Error receiving data
akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not unique!
    at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
    at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
    at akka.actor.ActorCell.reserveChild(ActorCell.scala:338)
    at akka.actor.dungeon.Children$class.makeChild(Children.scala:186)
    at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
    at akka.actor.ActorCell.attachChild(ActorCell.scala:338)
    at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518)
    at org.apache.spark.streaming.dstream.NetworkReceiver.actor$lzycompute(NetworkInputDStream.scala:94)
    at org.apache.spark.streaming.dstream.NetworkReceiver.actor(NetworkInputDStream.scala:94)
    at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:122)
    at org.apache.spark.streaming.scheduler.NetworkInputTracker$ReceiverExecutor$$anonfun$8.apply(NetworkInputTracker.scala:173)
    at org.apache.spark.streaming.scheduler.NetworkInputTracker$ReceiverExecutor$$anonfun$8.apply(NetworkInputTracker.scala:169)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

I tried cleaning up the ZooKeeper and Kafka temp/cached files, but I still
get the same error.
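
My guess, and it is only a guess, is that a second receiver is trying to
register under the same actor name, which can happen when more than one
StreamingContext starts receivers in the same JVM. A minimal sketch of that
situation and one way around it (app names are hypothetical):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc1 = new StreamingContext("local[2]", "app-1", Seconds(5))
// ... set up streams on ssc1 ...
ssc1.start()

// Stop the first context, releasing its receiver (and the actor name
// NetworkReceiver-0), before starting another one in this JVM.
ssc1.stop()

val ssc2 = new StreamingContext("local[2]", "app-2", Seconds(5))
// ... set up streams on ssc2 ...
ssc2.start()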

Any help on this?






RDD Collect returns empty arrays

2014-03-26 Thread gaganbm
I am getting strange behavior with the RDDs.

All I want is to persist the RDD contents in a single file. 

The saveAsTextFile() call saves them in multiple text files, one per
partition. So I tried rdd.coalesce(1, true).saveAsTextFile(). This fails
with the exception:

org.apache.spark.SparkException: Job aborted: Task 75.0:0 failed 1 times
(most recent failure: Exception failure: java.lang.IllegalStateException:
unread block data) 

Then I tried collecting the RDD contents into an array and writing the
array to the file manually. Again, that fails: I get empty arrays, even
when data is there.

/** The below saves the data in multiple text files, so data is there for sure. **/
rdd.saveAsTextFile(resultDirectory)

/** The below simply prints size 0 for all the RDDs in a stream. Why?! **/
val arr = rdd.collect()
println("SIZE of RDD " + rdd.id + " " + arr.size)

Kindly help! I am clueless about how to proceed.
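
For completeness, this is roughly how I am trying to write the collected
array (a sketch; "stream" is the DStream in question and the output path is
a placeholder). collect() runs on the driver, so the file would be written
on the driver machine:

import java.io.PrintWriter

stream.foreachRDD { rdd =>
  val arr = rdd.collect()
  println("SIZE of RDD " + rdd.id + " " + arr.size)
  if (arr.nonEmpty) {
    val out = new PrintWriter("/tmp/result-" + rdd.id + ".txt")
    try arr.foreach(x => out.println(x)) finally out.close()
  }
}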





Re: rdd.saveAsTextFile problem

2014-03-25 Thread gaganbm
Hi Folks,

Is this issue resolved? If yes, could you please throw some light on how to
fix it?

I am facing the same problem during writing to text files.

When I do 

stream.foreachRDD { rdd =>
  rdd.saveAsTextFile("Some path")
}

This works fine for me, but it creates multiple text files, one for each
partition within an RDD.

So I tried the coalesce option, to merge my results into a single file for
each RDD:

stream.foreachRDD { rdd =>
  rdd.coalesce(1, true).saveAsTextFile("Some path")
}

This fails with:
org.apache.spark.SparkException: Job aborted: Task 75.0:0 failed 1 times
(most recent failure: Exception failure: java.lang.IllegalStateException:
unread block data)

I am using Spark Streaming 0.9.0

Any clue what's going wrong when using coalesce?
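
In case it helps others, a workaround I am considering (an untested sketch;
paths are placeholders): keep the per-partition part files and merge them
afterwards with Hadoop's FileUtil.copyMerge, avoiding coalesce(1) entirely:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

stream.foreachRDD { rdd =>
  val dir = "/tmp/out-" + rdd.id
  rdd.saveAsTextFile(dir)

  // Merge the part-* files of this batch into a single output file.
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  FileUtil.copyMerge(fs, new Path(dir), fs, new Path(dir + ".txt"), false, conf, null)
}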




