eddie baggott created SPARK-44933:
-------------------------------------

             Summary: Spark structured streaming performance regression in 
latency times reading/writing to kafka since 3.0.2
                 Key: SPARK-44933
                 URL: https://issues.apache.org/jira/browse/SPARK-44933
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0, 3.0.2, 2.4.8
            Reporter: eddie baggott


During a migration from Spark 2.4.4 to Spark 3.4.0 I have noticed higher 
latency in Spark Structured Streaming when reading from and writing to Kafka. 
I have tested using both CONTINUOUS and MICROBATCH triggers.

In a simple read from and write to Kafka using CONTINUOUS mode in Spark 2.4.4 I 
usually see latency times of ~5ms in our application. When moving to Spark 
3.4.0 this increased to ~15ms.

I stripped it back to a very simple test where I send 2 data fields in CSV 
format to a Kafka topic using a simple producer. Then I have a simple consumer 
which reads from the input topic and writes to an output topic. The 2 fields 
are an ID and an amount value. I read from both topics and retrieve the Kafka 
timestamp value for all rows. I then subtract the input timestamp from the 
output timestamp to get the latency. To keep things as simple as possible I am 
using 1 Kafka partition and local[1] as the Spark master.
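
The latency calculation described above can be sketched as follows. This is a 
minimal illustration in plain Python, not the code used in the test: it assumes 
the rows from each topic have already been collected as (ID, Kafka timestamp in 
milliseconds) pairs. The names latencies_ms, input_rows and output_rows, and 
the sample values, are hypothetical.

```python
from statistics import mean


def latencies_ms(input_rows, output_rows):
    """Return {id: output_ts - input_ts} for IDs seen on both topics.

    Each row is assumed to be an (id, kafka_timestamp_ms) pair.
    """
    input_ts = dict(input_rows)
    return {
        row_id: out_ts - input_ts[row_id]
        for row_id, out_ts in output_rows
        if row_id in input_ts  # ignore output rows with no matching input
    }


# Hypothetical sample data, not measurements from this report.
input_rows = [("a", 1000), ("b", 1010), ("c", 1020)]
output_rows = [("a", 1004), ("b", 1013), ("c", 1025)]

per_id = latencies_ms(input_rows, output_rows)
print(per_id)
print(mean(list(per_id.values())))
```

Matching on the ID field is an assumption that each ID is sent only once; if 
IDs repeat, the rows would need another key.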

Version    Latency (ms)    Trigger
2.4.4      3.25            CONTINUOUS
3.4.0      7.23            CONTINUOUS
2.4.4      640             MICROBATCH
3.4.0      693             MICROBATCH

I have tried all versions of spark 3.x and I believe this issue was introduced 
in 3.0.2. I also tried different versions of spark 2.4.x and I see the same 
behaviour when going from 2.4.7 to 2.4.8.
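
To put the figures in the table above in proportion, the added latency and the 
relative slowdown per trigger can be worked out as below. This is only a small 
sketch using the four numbers reported in the table. The helper name slowdown 
is hypothetical.

```python
# Latencies in ms, copied from the table in this report.
measurements = {
    "CONTINUOUS": {"2.4.4": 3.25, "3.4.0": 7.23},
    "MICROBATCH": {"2.4.4": 640.0, "3.4.0": 693.0},
}


def slowdown(old_ms, new_ms):
    """Return (added latency in ms, new/old ratio)."""
    return new_ms - old_ms, new_ms / old_ms


for trigger, by_version in measurements.items():
    added, ratio = slowdown(by_version["2.4.4"], by_version["3.4.0"])
    print(f"{trigger}: +{added:.2f} ms ({ratio:.2f}x)")
```

Both numbers are printed because the ratio and the absolute difference give 
different pictures for the two triggers.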

In the simple test I only use a few jars. One of these is 
spark-sql-kafka-0-10_2.12. When running on Spark 3.0.2 using the 3.0.2 version 
of this jar I see the slower times. When I run again on Spark 3.0.2 but use the 
3.0.1 version of this jar I see the faster times.

The same thing happens between the 2.4.7 version and the 2.4.8 version. The 
2.4.8 version has the slower times.

Has anyone else observed a slowdown in latency in Structured Streaming when 
reading from Kafka?

Are there any settings I need to change when moving to these versions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)