Sidhavratha Kumar created SPARK-24707:
-----------------------------------------

             Summary: Enable spark-kafka-streaming to maintain min buffer using 
async thread to avoid blocking kafka poll
                 Key: SPARK-24707
                 URL: https://issues.apache.org/jira/browse/SPARK-24707
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Sidhavratha Kumar


Currently Spark Kafka RDD will block on kafka consumer poll. Specially in 
Spark-Kafka-streaming job this poll duration adds into batch processing time 
which result in
 * Increased batch processing time (which is apart from time taken to process 
records)
 * Results in unpredictable batch processing time based on poll time.

If we can poll kafka in background thread and maintain buffer for each 
partition, poll time will not get added into batch processing time, and this 
will make processing time more predicatble (based on time taken to process each 
record, instead of extra time taken to poll records from source)

For ex. we are facing issues where sometime kafka poll is ~30 secs, and 
sometime it returns within second. With backpressure enabled this reduces our 
job speed to great extent. In this situation it is also difficult to scale our 
processing or calculate resource requirement for future increase in records.

Even if someone does not face varying kafka poll time, it will be provide 
performance improvement if some buffer is already maintained for each 
partition, so that each batch can just concentrate on processing records.

Ex : 
Lets consider
 * each kafka poll takes 2sec average
 * batch duration is 10 sec
 * to process 100 records we take 10 sec
 ## Spark Job starts
 ## Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2sec) => 12 
sec processing time
 ## Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec 
processing time
 ## Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec 
processing time
 ## Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 
sec processing time

If we poll in async and always maintain 500 records for each partition, only 
Batch-1 will take 12 sec. After that all batches will complete in 10 sec 
(unless some rebalancing/failure happens, in that case buffer will be cleaned 
and next batch will take 12 sec).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to