Hi all, I'm running some tests on Spark Streaming (not Structured Streaming) for my PhD, and I'm seeing an extreme improvement when using Spark 3.3.1/Kafka 3.3.1 versus Spark 2.4.8/Kafka 2.7.0.
My (Scala) application code is as follows:

KafkaStream => foreachRDD => mapPartitions => repartition => groupBy => agg(expr("percentile(value, array(0.25, 0.5, 0.75))")) => take(2)

In short, a two-core executor could process 600,000 rows of key/value pairs in 60 seconds with Spark 2.x, while now, with Spark 3.3.1, the same processing (same code) completes in 5-10 seconds.

@apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a significant optimization that could explain this improvement?

BR,
Athanasios Kordelas
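For anyone reproducing the aggregation step outside Spark: a minimal pure-Scala sketch of what `percentile(value, array(0.25, 0.5, 0.75))` computes per group, assuming Spark SQL's exact `percentile` semantics (sort the values, then linearly interpolate at position p * (n - 1)). The object and method names here are hypothetical, not part of any Spark API.

```scala
// Hypothetical sketch: exact percentile via linear interpolation,
// mirroring (under the stated assumption) Spark SQL's `percentile`
// aggregate, without any Spark dependency.
object PercentileSketch {
  def percentile(values: Seq[Double], p: Double): Double = {
    val sorted = values.sorted
    val pos    = p * (sorted.length - 1)   // fractional rank in [0, n-1]
    val lo     = math.floor(pos).toInt
    val hi     = math.ceil(pos).toInt
    val frac   = pos - lo
    // Interpolate between the two neighbouring sorted values.
    sorted(lo) * (1 - frac) + sorted(hi) * frac
  }

  def main(args: Array[String]): Unit = {
    val vs = (1 to 5).map(_.toDouble)
    // Quartiles of 1..5:
    println(Seq(0.25, 0.5, 0.75).map(percentile(vs, _)))  // List(2.0, 3.0, 4.0)
  }
}
```

This only illustrates the per-group aggregation cost; it does not model the shuffle introduced by `repartition`/`groupBy`, which dominates the runtime in the streaming job above.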