Re: Problem in Spark Streaming

2014-06-11 Thread nilmish
I used these options to show the GC timings: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. Following is the output I got on the standard output: 4.092: [GC 4.092: [ParNew: 274752K->27199K(309056K), 0.0421460 secs] 274752K->27199K(995776K), 0.0422720 secs] [Times: user=0.33 sys=0.11,
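A minimal sketch of passing such GC options to the executors through SparkConf (the app name is illustrative, not from the thread):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("HashtagCounter")  // illustrative
    // print GC details and timestamps to each executor's stdout
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");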

Problem in Spark Streaming

2014-06-10 Thread nilmish
I am running a Spark Streaming job to count the top 10 hashtags over the last 5-minute window, querying every 1 sec. It takes approx 1.4 sec (end-to-end delay) to answer most of the queries, but there are a few instances in between when it takes considerably more time (around 15 sec) due to
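A minimal sketch of the windowed count described above (5-minute window, recomputed every 1 second), assuming the hashtags arrive as a JavaDStream<String> named hashtags; all identifiers are illustrative:

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// (hashtag, 1) pairs from the raw hashtag stream
JavaPairDStream<String, Long> ones = hashtags.mapToPair(
    new PairFunction<String, String, Long>() {
        public Tuple2<String, Long> call(String tag) { return new Tuple2<String, Long>(tag, 1L); }
    });

// counts over the last 5 minutes, recomputed every 1 second
JavaPairDStream<String, Long> countPerWindow = ones.reduceByKeyAndWindow(
    new Function2<Long, Long, Long>() {
        public Long call(Long a, Long b) { return a + b; }
    },
    new Duration(5 * 60 * 1000),   // window length: 5 minutes
    new Duration(1000));           // slide interval: 1 second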

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
You can measure the latency from the logs. Search for words like "Total delay" in the logs. This denotes the total end-to-end delay for a particular query.
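As a rough illustration (not from the thread), a JobScheduler log line of the form quoted later in this digest could be parsed like this to pull out the delay values:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// matches e.g. "... JobScheduler - Total delay: 5.432 s for time 1401870454500 ms (execution: 0.593 s)"
Pattern p = Pattern.compile("Total delay: ([\\d.]+) s .*\\(execution: ([\\d.]+) s\\)");
Matcher m = p.matcher(logLine);   // logLine: one line read from the driver log
if (m.find()) {
    double totalDelaySec = Double.parseDouble(m.group(1));
    double executionSec  = Double.parseDouble(m.group(2));
}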

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
How can I measure the data rate per node? I am feeding the data through the Kafka API. I only know the total inflow data rate, which remains almost constant. How can I figure out what amount of data is distributed to the nodes in my cluster? The latency does not keep on increasing indefinitely. It goes up for

Re: Error related to serialisation in spark streaming

2014-06-05 Thread nilmish
Thanks a lot for your reply. I can see the Kryo serialiser in the UI. I have one more query: I wanted to know the meaning of the following log message when running a Spark Streaming job: [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total

Problem understanding log message in SparkStreaming

2014-06-04 Thread nilmish
I wanted to know the meaning of the following log message when running a Spark Streaming job: [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total delay: 5.432 s for time 1401870454500 ms (execution: 0.593 s) According to my understanding,

Re: Error related to serialisation in spark streaming

2014-06-04 Thread nilmish
The error is resolved. I was using a comparator which was not serialisable, which is why it was throwing the error. I have now switched to the Kryo serialiser as it is faster than the Java serialiser. I have set the required config conf.set("spark.serializer",
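The Kryo switch mentioned here is the standard Spark setting; a minimal sketch (the registrator class name is illustrative):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// optionally register frequently-serialised classes through a custom registrator
// conf.set("spark.kryo.registrator", "com.example.MyKryoRegistrator");  // illustrative class name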

Error related to serialisation in spark streaming

2014-06-03 Thread nilmish
I am using the following code segment: countPerWindow.foreachRDD(new Function<JavaPairRDD<String, Long>, Void>() { @Override public Void call(JavaPairRDD<String, Long> rdd) throws Exception { Comparator<Tuple2<String, Long>> comp = new
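Given the serialisation error discussed in this thread, one way to make the comparator safe to ship with the task is to declare it as a named class that also implements java.io.Serializable; a minimal sketch:

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;

// orders (hashtag, count) pairs by count and can be serialised along with the task
class CountComparator implements Comparator<Tuple2<String, Long>>, Serializable {
    @Override
    public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
        return a._2().compareTo(b._2());
    }
}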

Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval, and to do it efficiently. I have already done this by using reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval, taking only the top 10 elements. But this is very slow. So now I am thinking of retaining only

Selecting first ten values in a RDD/partition

2014-05-29 Thread nilmish
I have a DStream which consists of RDDs partitioned every 2 sec. I have sorted each RDD and want to retain only the top 10 values and discard the rest. How can I retain only the top 10 values? I am trying to get the top 10 hashtags. Instead of sorting the entire 5-minute counts (thereby incurring
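One way to keep only the top 10 without fully sorting each RDD is a sketch along these lines, reusing countPerWindow from the windowing sketch and the Serializable CountComparator from the serialisation thread above (both illustrative):

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

countPerWindow.foreachRDD(new Function<JavaPairRDD<String, Long>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> rdd) throws Exception {
        // top() keeps only 10 elements per partition and merges them on the driver,
        // so the RDD is never fully sorted
        List<Tuple2<String, Long>> top10 = rdd.top(10, new CountComparator());
        System.out.println(top10);  // or push the result wherever it is consumed
        return null;
    }
});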

Efficient implementation of getting top 10 hashtags in last 5 mins window

2014-05-16 Thread nilmish
I wanted to know how we can efficiently get the top 10 hashtags in the last 5-minute window. Currently I am using reduceByKeyAndWindow over a 5-minute window and then sorting to get the top 10 hashtags. But it is taking a lot of time. How can we do it efficiently?
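One standard way to make the 5-minute window cheaper is the reduceByKeyAndWindow variant that also takes an inverse function: each slide then only adds the newly arrived batch and subtracts the batch that fell out of the window, instead of recomputing the whole 5 minutes. A sketch, reusing the (hashtag, 1) pairs from the windowing sketch near the top of this digest (jssc and the checkpoint path are illustrative; checkpointing is required for this form):

jssc.checkpoint("/tmp/checkpoint");  // illustrative checkpoint path

JavaPairDStream<String, Long> countPerWindow = ones.reduceByKeyAndWindow(
    new Function2<Long, Long, Long>() {          // add counts entering the window
        public Long call(Long a, Long b) { return a + b; }
    },
    new Function2<Long, Long, Long>() {          // subtract counts leaving the window
        public Long call(Long a, Long b) { return a - b; }
    },
    new Duration(5 * 60 * 1000),                 // window length: 5 minutes
    new Duration(1000));                         // slide interval: 1 second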