I used these flags to show the GC timings: -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamPS
Following is the output I got on standard output:
4.092: [GC 4.092: [ParNew: 274752K->27199K(309056K), 0.0421460 secs]
274752K->27199K(995776K), 0.0422720 secs] [Times: user=0.33 sys=0.11,
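(If it helps anyone: the same flags can be forwarded to the executor JVMs as
well. A minimal sketch, assuming you build the SparkConf yourself; the app
name is made up, but spark.executor.extraJavaOptions is the standard key:)

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("TopHashtags")
    // forward the GC-logging flags to every executor JVM as well
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");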
How can I measure the data rate per node?
I am feeding the data through the Kafka API. I only know the total inflow
data rate, which remains almost constant. How can I figure out how much
data is distributed to each node in my cluster?
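One rough way to see this (a sketch, untested; "lines" stands for whatever
JavaDStream<String> your Kafka receiver gives you): count the records in each
partition per batch with mapPartitionsWithIndex, then divide by the batch
interval to approximate a per-partition rate:

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        // one count per partition; partitions map onto the nodes holding them
        List<Integer> counts = rdd.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<Integer>>() {
                @Override
                public Iterator<Integer> call(Integer index, Iterator<String> it) {
                    int n = 0;
                    while (it.hasNext()) { it.next(); n++; }
                    return Collections.singletonList(n).iterator();
                }
            }, true).collect();
        System.out.println("Records per partition this batch: " + counts);
        return null;
    }
});

This counts per partition rather than strictly per node, but it gives a feel
for how evenly the inflow is being spread across the cluster.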
Latency does not keep on increasing infinitely. It goes up for so
You can measure the latency from the logs. Search for words like "Total delay"
in the logs. This denotes the total end-to-end delay for a particular query.
I am running a Spark Streaming job to count the top 10 hashtags over the last
5-minute window, querying every 1 sec.
It takes approx <1.4 sec (end-to-end delay) to answer most of the queries,
but there are a few instances in between when it takes considerably more
time (around 15 sec) due to
Thanks a lot for your reply. I can see the Kryo serializer in the UI.
I have another query:
I wanted to know the meaning of the following log message when running a
Spark Streaming job:
[spark-akka.actor.default-dispatcher-18] INFO
org.apache.spark.streaming.scheduler.JobScheduler - Total delay
The error is resolved. I was using a comparator which was not serializable,
which is why it was throwing the error.
I have now switched to the Kryo serializer, as it is faster than the Java
serializer. I have set the required config:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
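For reference, the serialization fix can be as small as this (a sketch; the
class name CountComparator is mine): a named comparator that also implements
Serializable, so Spark can ship it with the task closure:

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;

// compares (hashtag, count) pairs by count; Serializable so it can travel
// inside a Spark task closure without a NotSerializableException
public class CountComparator
        implements Comparator<Tuple2<String, Integer>>, Serializable {
    @Override
    public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        return a._2().compareTo(b._2());
    }
}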
I wanted to know the meaning of the following log message when running a
Spark Streaming job:
[spark-akka.actor.default-dispatcher-18] INFO
org.apache.spark.streaming.scheduler.JobScheduler - Total delay: 5.432 s for
time 1401870454500 ms (execution: 0.593 s)
According to my understanding, tota
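(My reading of that line, assuming total delay = scheduling delay + execution
time, which is how the JobScheduler reports batch metrics: the batch for time
1401870454500 ms completed 5.432 s after its batch time, of which only 0.593 s
was actual execution, so it spent roughly 5.432 - 0.593 ≈ 4.84 s waiting
behind earlier batches.)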
I am using the following code segment:

countPerWindow.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>()
{
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd) throws Exception
    {
        // order (hashtag, count) pairs by their count
        Comparator<Tuple2<String, Integer>> comp =
            new Comparator<Tuple2<String, Integer>>()
            {
                public int compare(Tuple2<String, Integer> a,
                                   Tuple2<String, Integer> b)
                {
                    return a._2().compareTo(b._2());
                }
            };
My primary goal: to get the top 10 hashtags for every 5-minute interval.
I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval,
taking only the top 10 elements. But this is very slow.
So now I am thinking of retaining only
I have a DStream which consists of RDDs partitioned every 2 sec. I have sorted
each RDD and want to retain only the top 10 values and discard the rest.
How can I retain only the top 10 values?
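A sketch of one way to do this (untested; it reuses a serializable count
comparator like the CountComparator sketched earlier): skip the full sort and
take just the 10 largest pairs with top(), which makes a single bounded pass
over each RDD:

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

countPerWindow.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd) throws Exception {
        // top() keeps the 10 largest pairs per partition in a bounded
        // priority queue, then merges them on the driver: no global sort
        List<Tuple2<String, Integer>> top10 = rdd.top(10, new CountComparator());
        System.out.println("Top 10 hashtags: " + top10);
        return null;
    }
});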
I am trying to get the top 10 hashtags. Instead of sorting the entire set of
5-minute counts (thereby, incurring th
I wanted to know how we can efficiently get the top 10 hashtags in the last
5-minute window. Currently I am using reduceByKeyAndWindow over a 5-minute
window and then sorting to get the top 10 hashtags. But it is taking a lot of
time. How can we do it efficiently?