[ https://issues.apache.org/jira/browse/SPARK-24707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528967#comment-16528967 ]

Apache Spark commented on SPARK-24707:
--------------------------------------

User 'sidhavratha' has created a pull request for this issue:
https://github.com/apache/spark/pull/21685

> Enable spark-kafka-streaming to maintain min buffer using async thread to 
> avoid blocking kafka poll
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24707
>                 URL: https://issues.apache.org/jira/browse/SPARK-24707
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>    Affects Versions: 2.4.0
>            Reporter: Sidhavratha Kumar
>            Priority: Major
>         Attachments: 40_partition_topic_without_buffer.pdf
>
>
> Currently the Spark Kafka RDD blocks on the Kafka consumer poll. Especially in 
> a Spark-Kafka-streaming job, this poll duration adds to the batch processing 
> time, which results in:
>  * Increased batch processing time (beyond the time taken to actually process 
> records)
>  * Unpredictable batch processing time, depending on how long each poll takes.
> If we poll Kafka in a background thread and maintain a buffer for each 
> partition, the poll time is not added to the batch processing time, and this 
> makes processing time more predictable (based on the time taken to process 
> each record, instead of the extra time taken to poll records from the source).
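> A minimal sketch of the idea (hypothetical class and method names, not the 
> actual implementation in the PR): a daemon thread keeps polling and fills a 
> per-partition buffer up to a minimum level, while the batch drains records 
> from the buffer instead of calling poll() itself.
> {code:scala}
> import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}
> import scala.collection.JavaConverters._
> import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
> import org.apache.kafka.common.TopicPartition
>
> // Hypothetical async prefetcher: buffers records per partition so that
> // batch processing never waits on a live poll(). The KafkaConsumer is
> // touched only from the poller thread (KafkaConsumer is not thread-safe).
> class AsyncPrefetcher[K, V](consumer: KafkaConsumer[K, V], minBuffer: Int) {
>   private val buffers =
>     new ConcurrentHashMap[TopicPartition, LinkedBlockingQueue[ConsumerRecord[K, V]]]()
>
>   private def queueFor(tp: TopicPartition) =
>     buffers.computeIfAbsent(tp, _ => new LinkedBlockingQueue[ConsumerRecord[K, V]]())
>
>   private val poller = new Thread {
>     override def run(): Unit = while (!Thread.currentThread().isInterrupted) {
>       // Poll only while some partition's buffer is below the minimum.
>       val needMore = buffers.isEmpty ||
>         buffers.values.asScala.exists(_.size < minBuffer)
>       if (needMore) {
>         val records = consumer.poll(java.time.Duration.ofMillis(500))
>         records.asScala.foreach { r =>
>           queueFor(new TopicPartition(r.topic, r.partition)).put(r)
>         }
>       } else Thread.sleep(50)
>     }
>   }
>   poller.setDaemon(true)
>   poller.start()
>
>   // Called from the batch: blocks until n buffered records are available,
>   // but never issues a poll() on the batch's critical path.
>   def take(tp: TopicPartition, n: Int): Seq[ConsumerRecord[K, V]] =
>     Iterator.continually(queueFor(tp).take()).take(n).toVector
> }
> {code}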
> For example, we are facing issues where the Kafka poll sometimes takes ~30 
> seconds and sometimes returns within a second. With backpressure enabled, this 
> slows our job down to a great extent. It also makes it difficult to scale our 
> processing or to calculate resource requirements for a future increase in 
> records.
> Even for jobs that do not face varying Kafka poll times, maintaining a buffer 
> for each partition provides a performance improvement, since each batch can 
> then concentrate purely on processing records.
> Ex: 
>  Let's consider:
>  * each kafka poll takes 2 sec on average
>  * batch duration is 10 sec
>  * processing 100 records takes 10 sec
>  * each kafka poll returns 300 records
>  1. Spark Job starts
>  2. Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 
> 12 sec processing time
>  3. Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec 
> processing time
>  4. Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec 
> processing time
>  5. Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 
> 12 sec processing time
> If we poll asynchronously and always maintain 500 records for each partition, 
> only Batch-1 takes 12 sec. After that, all batches complete in 10 sec (unless 
> some rebalancing/failure happens, in which case the buffer is cleared and the 
> next batch takes 12 sec again). The toy simulation below reproduces this 
> arithmetic.
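> A toy simulation of the numbers above (a sketch; all constants are the 
> assumptions stated in the example):
> {code:scala}
> // Synchronous polling: a batch pays the poll cost whenever its partition
> // buffer cannot cover the batch.
> val pollSec = 2          // average poll duration
> val pollRecords = 300    // records returned per poll
> val batchRecords = 100   // records processed per batch
> val processSec = 10      // time to process 100 records
>
> var buffer = 0
> for (batch <- 1 to 4) {
>   val startBuffer = buffer
>   // Poll (and pay pollSec) only if the buffer cannot serve this batch.
>   val pollTime = if (buffer < batchRecords) { buffer += pollRecords; pollSec } else 0
>   buffer -= batchRecords
>   println(s"Batch-$batch (buffer = $startBuffer): ${processSec + pollTime} sec")
> }
> // Batch-1 (buffer = 0): 12 sec
> // Batch-2 (buffer = 200): 10 sec
> // Batch-3 (buffer = 100): 10 sec
> // Batch-4 (buffer = 0): 12 sec
> {code}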


