eddie baggott created SPARK-44933:
-------------------------------------

             Summary: Spark structured streaming performance regression in latency times reading/writing to kafka since 3.0.2
                 Key: SPARK-44933
                 URL: https://issues.apache.org/jira/browse/SPARK-44933
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0, 3.0.2, 2.4.8
            Reporter: eddie baggott

During a migration from Spark 2.4.4 to Spark 3.4.0 I have noticed higher latency in Spark Structured Streaming when reading from and writing to Kafka. I have tested using both CONTINUOUS and MICROBATCH triggers.

With a simple read from and write to Kafka using CONTINUOUS mode in Spark 2.4.4, I usually see latency of ~5ms in our application. After moving to Spark 3.4.0 this increased to ~15ms.

I stripped it back to a very simple test in which a simple producer sends 2 data fields in CSV format to a Kafka topic. A simple consumer then reads from the input topic and writes to an output topic. The 2 fields are an ID and an amount value.

I read from both topics and retrieve the Kafka timestamp value for all rows. I then subtract the input timestamp from the output timestamp to get the latency.

To keep things as simple as possible I am using 1 Kafka partition and local[1] as the Spark master.

Version   Latency (ms)   Trigger
2.4.4     3.25           CONTINUOUS
3.4.0     7.23           CONTINUOUS
2.4.4     640            MICROBATCH
3.4.0     693            MICROBATCH

I have tried all versions of Spark 3.x and I believe this issue was introduced in 3.0.2. I also tried different versions of Spark 2.4.x and I see the same behaviour when going from 2.4.7 to 2.4.8.

In the simple test I only use a few jars. One of these is spark-sql-kafka-0-10_2.12.

When running on Spark 3.0.2 with the 3.0.2 version of this jar I see the slower times. When I run again on Spark 3.0.2 with the 3.0.1 version of this jar I see the faster times.

The same thing happens between the 2.4.7 version and the 2.4.8 version of the jar: the 2.4.8 version has the slower times.

Has anyone else observed a slowdown in latency in Structured Streaming when reading from Kafka? Are there any settings I need to change when moving to these versions?
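
For anyone who wants to reproduce the measurement, the latency calculation described above (output Kafka timestamp minus input Kafka timestamp, matched per row) could be sketched roughly as follows. This is only an illustration, not the exact test code: the function name latencies_ms and the sample rows are hypothetical, and it assumes each ID appears once per topic and that timestamps are epoch milliseconds.

```python
from statistics import mean


def latencies_ms(input_rows, output_rows):
    """Per-ID latency: output Kafka timestamp minus input Kafka timestamp (ms).

    Assumption: each ID is unique within a topic.
    """
    # Map each ID to the timestamp Kafka recorded on the input topic.
    input_ts = {row_id: ts for row_id, ts in input_rows}
    result = {}
    for row_id, out_ts in output_rows:
        if row_id in input_ts:  # ignore output rows with no matching input row
            result[row_id] = out_ts - input_ts[row_id]
    return result


# Hypothetical sample data: (id, Kafka timestamp in epoch milliseconds).
input_rows = [("a", 1_000), ("b", 1_010), ("c", 1_020)]
output_rows = [("a", 1_003), ("b", 1_014), ("c", 1_022)]

per_id = latencies_ms(input_rows, output_rows)
average_ms = mean(per_id.values())
print(per_id)
print(average_ms)
```

Averaging the per-ID differences gives a single figure per run that can be compared across Spark versions and trigger modes, as in the table above.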