[ https://issues.apache.org/jira/browse/SPARK-24707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528967#comment-16528967 ]
Apache Spark commented on SPARK-24707:
--------------------------------------

User 'sidhavratha' has created a pull request for this issue:
https://github.com/apache/spark/pull/21685

> Enable spark-kafka-streaming to maintain min buffer using async thread to
> avoid blocking kafka poll
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24707
>                 URL: https://issues.apache.org/jira/browse/SPARK-24707
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>    Affects Versions: 2.4.0
>            Reporter: Sidhavratha Kumar
>            Priority: Major
>         Attachments: 40_partition_topic_without_buffer.pdf
>
> Currently the Spark Kafka RDD blocks on the Kafka consumer poll. In a
> Spark-Kafka-streaming job especially, this poll duration is added to the
> batch processing time, which results in:
> * increased batch processing time (beyond the time actually taken to
> process records)
> * unpredictable batch processing time, since poll time varies.
> If we poll Kafka in a background thread and maintain a buffer for each
> partition, the poll time is no longer added to the batch processing time,
> making processing time more predictable (driven by the time taken to
> process each record rather than the extra time taken to poll records from
> the source).
> For example, we are facing issues where a Kafka poll sometimes takes ~30
> seconds and sometimes returns within a second. With backpressure enabled,
> this slows our job down considerably. It also makes it difficult to scale
> our processing or estimate resource requirements for future growth in
> record volume.
> Even for workloads that do not see varying Kafka poll times, keeping a
> buffer for each partition provides a performance improvement, since each
> batch can concentrate purely on processing records.
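The idea of decoupling the poll from batch processing can be sketched as follows. This is not the implementation in the pull request, just a minimal illustration of the pattern: a background thread keeps a bounded per-partition buffer filled, so the batch only drains already-buffered records and never blocks on the (potentially slow) poll. The `simulatedPoll` method is a stand-in; a real implementation would wrap a `KafkaConsumer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of an async prefetch buffer for one partition. A daemon thread
// continually "polls" the source and fills a bounded queue; consumers of a
// batch drain the queue instead of polling synchronously.
public class AsyncPrefetchBuffer {
    private final BlockingQueue<String> buffer;
    private final Thread poller;
    private volatile boolean running = true;
    private int nextOffset = 0;

    public AsyncPrefetchBuffer(int minBuffer) {
        this.buffer = new ArrayBlockingQueue<>(minBuffer);
        this.poller = new Thread(() -> {
            while (running) {
                // Simulated source poll; a real version would call
                // KafkaConsumer.poll(...) here.
                for (String record : simulatedPoll(50)) {
                    try {
                        buffer.put(record); // blocks while the buffer is full
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        this.poller.setDaemon(true);
        this.poller.start();
    }

    // Stand-in for a Kafka poll: returns the next batch of records in order.
    private synchronized List<String> simulatedPoll(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("record-" + (nextOffset++));
        }
        return records;
    }

    // A batch drains up to n records from the buffer without issuing a poll.
    public List<String> take(int n) throws InterruptedException {
        List<String> out = new ArrayList<>();
        while (out.size() < n) {
            String r = buffer.poll(1, TimeUnit.SECONDS);
            if (r == null) break; // buffer not yet refilled; return what we have
            out.add(r);
        }
        return out;
    }

    public void stop() {
        running = false;
        poller.interrupt();
    }

    public static void main(String[] args) throws Exception {
        AsyncPrefetchBuffer b = new AsyncPrefetchBuffer(500);
        List<String> batch = b.take(100);
        System.out.println("fetched " + batch.size() + " records, first = " + batch.get(0));
        b.stop();
    }
}
```

The bounded queue is what makes the buffer a "min buffer": the poller blocks on `put` once the target depth is reached, so memory stays capped while the batch always finds records ready.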
> Example:
> Let's assume:
> * each Kafka poll takes 2 sec on average
> * batch duration is 10 sec
> * processing 100 records takes 10 sec
> * each Kafka poll returns 300 records
> ## Spark job starts
> ## Batch-1 (100 records) (buffer = 0)   (processing time = 10 sec + 2 sec) => 12 sec processing time
> ## Batch-2 (100 records) (buffer = 200) (processing time = 10 sec)         => 10 sec processing time
> ## Batch-3 (100 records) (buffer = 100) (processing time = 10 sec)         => 10 sec processing time
> ## Batch-4 (100 records) (buffer = 0)   (processing time = 10 sec + 2 sec) => 12 sec processing time
> If we poll asynchronously and always maintain 500 records per partition,
> only Batch-1 takes 12 sec. After that, every batch completes in 10 sec
> (unless a rebalance or failure occurs, in which case the buffer is cleared
> and the next batch takes 12 sec again).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
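The batch-time arithmetic in the example above can be checked with a few lines. The assumptions mirror the example: a batch pays the 2 sec poll cost only when its buffer starts empty, processing a 100-record batch takes 10 sec, and each synchronous poll returns 300 records.

```java
// Replays the timing example: synchronous polling means every batch that
// starts with an empty buffer pays the 2 sec poll cost on top of processing.
public class BatchTiming {
    public static void main(String[] args) {
        int buffer = 0;                        // records buffered before Batch-1
        final int pollSec = 2, processSec = 10;
        final int batchSize = 100, pollReturns = 300;
        for (int batch = 1; batch <= 4; batch++) {
            int time = processSec;
            if (buffer == 0) {                 // empty buffer: must poll synchronously
                time += pollSec;
                buffer += pollReturns;
            }
            buffer -= batchSize;               // batch consumes its records
            System.out.println("Batch-" + batch + ": " + time + " sec");
        }
    }
}
```

Running this prints 12, 10, 10, 12 sec for the four batches, matching the example; with an async poller keeping the buffer at 500, the `buffer == 0` branch would never be hit after startup.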