[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486980#comment-14486980 ]

Tathagata Das commented on SPARK-6691:
--------------------------------------

[~jerryshao] This is a very good first attempt! However, from my experience, 
rate control can be quite tricky: it is hard to balance the stability of the 
system against high throughput. So any attempt needs to be made carefully, with 
some basic theoretical analysis of how the system will behave in different 
scenarios (e.g. when the processing load increases at the same data rate, or 
vice versa, or when the cluster size decreases due to failures). I would like 
to see that sort of analysis in the design doc.
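
For instance (the numbers and the update rule below are made up purely for 
illustration, not something from the design doc), even a back-of-the-envelope 
simulation of a naive proportional controller shows the kind of behavior that 
needs to be analyzed when the per-record processing cost doubles at a constant 
data rate:

{code}
// Purely illustrative simulation: a naive controller that moves the ingestion
// rate toward the processing rate observed in the last finished batch.
object RateControlSketch {
  def main(args: Array[String]): Unit = {
    val batchIntervalSec = 1.0
    var ingestionRate = 1000.0       // records/sec currently admitted
    var perRecordCostSec = 0.0005    // processing cost per record

    for (batch <- 1 to 10) {
      // At batch 5 the processing load doubles while the data rate is unchanged.
      if (batch == 5) perRecordCostSec *= 2

      val processingTimeSec = ingestionRate * batchIntervalSec * perRecordCostSec
      val processingRate = ingestionRate * batchIntervalSec / processingTimeSec

      println(f"batch $batch%2d: ingest=$ingestionRate%7.1f/s " +
        f"procRate=$processingRate%7.1f/s procTime=$processingTimeSec%5.2fs")

      // Move toward the observed processing rate, ramping up at most 1.5x per batch.
      ingestionRate = math.min(processingRate, ingestionRate * 1.5)
    }
  }
}
{code}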

Beyond that I will spend time thinking about the pros and cons of this 
approach. Nonetheless, thank you for initiating this. 

> Abstract and add a dynamic RateLimiter for Spark Streaming
> ----------------------------------------------------------
>
>                 Key: SPARK-6691
>                 URL: https://issues.apache.org/jira/browse/SPARK-6691
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: Saisai Shao
>
> Flow control (or rate control) of input data is very important in a streaming 
> system, especially for Spark Streaming, to keep it stable and up to date. An 
> unexpected flood of incoming data, or an ingestion rate beyond the computation 
> power of the cluster, will make the system unstable and increase the delay. 
> With Spark Streaming's job generation and processing pattern, this delay 
> accumulates and can introduce unacceptable exceptions.
> ----
> Currently, in Spark Streaming's receiver-based input stream, there is a 
> RateLimiter in BlockGenerator which controls the ingestion rate of input 
> data, but the current implementation has several limitations (see the 
> configuration example below):
> # The max ingestion rate is set by the user through configuration beforehand; 
> users may not know how to choose an appropriate value before the application 
> is running.
> # This configuration is fixed for the lifetime of the application, which 
> means you need to assume the worst-case scenario to choose a reasonable value.
> # Input streams like DirectKafkaInputStream need to maintain a separate 
> solution to achieve the same functionality.
> # The lack of slow-start control makes the whole system easily fall into 
> large processing and scheduling delays at the very beginning.
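> For illustration, both limits today are static settings chosen before the 
> application starts, and the direct Kafka stream needs its own separate knob 
> (the values below are arbitrary):
> {code}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   // Static per-receiver limit (records/sec) for receiver-based streams,
>   // fixed for the lifetime of the application.
>   .set("spark.streaming.receiver.maxRate", "10000")
>   // The direct Kafka stream ignores the setting above and uses its own
>   // per-partition limit instead.
>   .set("spark.streaming.kafka.maxRatePerPartition", "2000")
> {code}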
> ----
> So here we propose a new dynamic RateLimiter, along with a new interface for 
> the RateLimiter, to improve the stability of the whole system. The goals are 
> (see the interface sketch below):
> * Dynamically adjust the ingestion rate according to the processing rate of 
> previously finished jobs.
> * Offer a uniform solution, not only for receiver-based input streams but 
> also for direct streams like DirectKafkaInputStream and new ones.
> * Slow-start the rate to control network congestion when the job starts.
> * Provide a pluggable framework to make extensions easier to maintain.
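> As a rough sketch only (the names and signatures below are illustrative; the 
> actual interface is defined in the design doc linked below), such a pluggable 
> rate limiter could look like:
> {code}
> // Illustrative sketch only -- see the design doc for the real interface.
> trait DynamicRateLimiter extends Serializable {
>   /** Current maximum ingestion rate, in records per second. */
>   def currentRate: Long
>
>   /** Called when a batch finishes, so the limiter can adjust the rate. */
>   def onBatchCompleted(numRecords: Long, processingTimeMs: Long): Unit
> }
>
> // A trivial implementation that follows the observed processing rate,
> // ramping up slowly from a small initial rate (slow start).
> class ProcessingRateLimiter(initialRate: Long) extends DynamicRateLimiter {
>   @volatile private var rate = initialRate
>
>   override def currentRate: Long = rate
>
>   override def onBatchCompleted(numRecords: Long, processingTimeMs: Long): Unit = {
>     if (processingTimeMs > 0) {
>       val processingRate = numRecords * 1000 / processingTimeMs
>       rate = math.min(processingRate, rate * 2)  // at most double per batch
>     }
>   }
> }
> {code}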
> ----
> Here is the design doc 
> (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
>  and working branch 
> (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
> Any comments would be greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
