And if you want to genuinely “reduce the latency” (still within the boundaries 
of the micro-batch model), you need to design and fine-tune the Parallel 
Programming / Execution Model of your application. The objectives/metrics here are:

 

a) Consume all data within your selected micro-batch window WITHOUT any 
artificial message rate limits.

b) The above will result in a certain size of DStream RDD per micro-batch.

c) The objective now is to process that RDD WITHIN the time of the 
micro-batch (also accounting for temporary message rate spikes etc., which may 
further increase the size of the RDD). This avoids any clogging up of the 
app and processes your messages at the lowest latency possible in a 
micro-batch architecture.

d) You achieve the objective stated in c) by designing, varying and 
experimenting with various aspects of the Spark Streaming Parallel Programming 
and Execution Model – e.g. number of receivers, number of threads per receiver, 
number of executors, number of cores, RAM allocated to executors, and number of 
RDD partitions (which corresponds to the number of parallel threads operating 
on the RDD), etc.
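
As a purely illustrative starting point, the knobs listed in d) map onto
standard submit-time settings. Every number below is a placeholder to
experiment with for your own workload, not a recommendation:

```shell
# Illustrative spark-submit invocation; tune each value per points a)-d) above.
spark-submit \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.streaming.blockInterval=200ms \
  my-streaming-app.jar
# Inside the job, the number of RDD partitions (and hence parallel tasks)
# can be raised with e.g. stream.repartition(32)
```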

 

Re the “unceremonious removal of DStream RDDs” from RAM by Spark Streaming when 
the available RAM is exhausted due to a high message rate – which crashes your 
(by then clogged-up) application – the condition manifests as:

 

Loss was due to java.lang.Exception   

java.lang.Exception: Could not compute split, block
input-4-1410542878200 not found

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Monday, May 18, 2015 12:13 PM
To: 'Dmitry Goldenberg'; 'Akhil Das'
Cc: 'user@spark.apache.org'
Subject: RE: Spark Streaming and reducing latency

 

You can use:

spark.streaming.receiver.maxRate (default: not set)

Maximum rate (number of records per second) at which each receiver will receive 
data. Effectively, each stream will consume at most this number of records per 
second. Setting this configuration to 0 or a negative number will put no limit 
on the rate. See the deployment guide 
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications>
  in the Spark Streaming programming guide for more details.
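
For example, to cap each receiver at 10,000 records per second (an
illustrative figure), set the property in spark-defaults.conf or via --conf:

```
# spark-defaults.conf (or --conf on spark-submit); 10000 is an illustrative cap
spark.streaming.receiver.maxRate  10000
```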

 

 

Another way is to implement a feedback loop in your receivers: monitor the 
performance metrics of your application/job and, based on that, automatically 
adjust the receiving rate. BUT all of these have nothing to do with “reducing 
the latency” – they simply prevent your application/job from clogging up. The 
nastier effect of that clogging is when Spark Streaming starts removing 
in-memory RDDs from RAM before they are processed by the job. That works fine 
in Spark batch (i.e. removing RDDs from RAM based on LRU), but in Spark 
Streaming, when done in this “unceremonious way”, it simply crashes the 
application.
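
The feedback-loop idea can be sketched outside Spark as a simple controller:
compare the last batch's processing time to the batch interval and scale the
permitted ingest rate accordingly. This is a toy illustration of the
principle, not Spark API code (later Spark releases shipped a built-in
variant of this idea as `spark.streaming.backpressure.enabled`); the
function name and the scaling factors are assumptions for the sketch.

```python
def adjust_rate(current_rate, processing_time, batch_interval,
                min_rate=100, max_rate=100_000):
    """Toy rate controller: shrink the permitted ingest rate when a batch took
    longer than its interval (falling behind), grow it when there is headroom."""
    # Ratio < 1 means the batch finished early; > 1 means we are falling behind.
    ratio = processing_time / batch_interval
    if ratio > 1.0:
        new_rate = current_rate / ratio   # back off proportionally
    else:
        new_rate = current_rate * 1.1     # probe upward gently
    # Clamp to sane bounds so the controller never stalls or runs away.
    return max(min_rate, min(max_rate, new_rate))

# Falling behind: 2 s of work in a 1 s batch halves the rate.
print(adjust_rate(10_000, processing_time=2.0, batch_interval=1.0))  # 5000.0
# Headroom: finished in 0.5 s, so the rate creeps up ~10% (≈ 11000).
print(adjust_rate(10_000, processing_time=0.5, batch_interval=1.0))
```

In a real receiver this would run once per completed batch, feeding the new
rate back into whatever throttle governs consumption from the source.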

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Monday, May 18, 2015 11:46 AM
To: Akhil Das
Cc: user@spark.apache.org
Subject: Re: Spark Streaming and reducing latency

 

Thanks, Akhil. So what do folks typically do to expand/contract the capacity? 
Do you plug in some cluster auto-scaling solution to make this elastic?

 

Does Spark have any hooks for instrumenting auto-scaling?

In other words, how do you avoid overwhelming the receivers in a scenario where 
your system's input can be unpredictable, based on users' activity?


On May 17, 2015, at 11:04 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

With receiver-based streaming, you can actually specify 
spark.streaming.blockInterval, which is the interval at which the receiver 
chunks the data it fetches from the source into blocks. The default value is 
200ms, and hence if your batch duration is 1 second, it will produce 5 blocks 
of data. And yes, with Spark Streaming, when your processing time goes beyond 
your batch duration and you are consuming data faster than you can process it, 
you will overwhelm the receiver's memory and it will throw up "block not found" 
exceptions. 
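
The arithmetic behind that: blocks per batch = batch duration / block
interval, and each block becomes one partition (one task) of the batch's RDD.
A quick check (helper name is mine, for illustration):

```python
def blocks_per_batch(batch_duration_ms, block_interval_ms=200):
    """Number of blocks (and hence RDD partitions) a single receiver produces
    per micro-batch: batch duration divided by spark.streaming.blockInterval."""
    return batch_duration_ms // block_interval_ms

print(blocks_per_batch(1000))       # 5 blocks for a 1 s batch at the 200 ms default
print(blocks_per_batch(2000, 500))  # 4 blocks for a 2 s batch at a 500 ms interval
```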




Thanks

Best Regards

 

On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg...@gmail.com> wrote:

I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.

With the discretized streams, streaming is done with batch intervals, i.e.
the consumer has to wait out the interval to get at the new items. If
one wants to reduce latency, it seems the only way to do this would be by
reducing the batch interval window. However, that may lead to a great deal
of churn, with many requests going into Kafka from the consumers,
potentially with no results whatsoever as there's nothing new in the topic
at the moment.

Is there a counter-argument to this reasoning? What are some of the general
approaches to reduce latency folks might recommend? Or, perhaps there are
ways of dealing with this at the streaming API level?

If latency is of great concern, is it better to look into streaming from
something like Flume where data is pushed to consumers rather than pulled by
them? Are there techniques, in that case, to ensure the consumers don't get
overwhelmed with new data?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-reducing-latency-tp22922.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 
