Hi all,

I'm running some tests on Spark Streaming (not Structured Streaming) for my
PhD, and I'm seeing an extreme improvement when using Spark 3.3.1 (Kafka
client 3.3.1) versus Spark 2.4.8 (Kafka client 2.7.0).

My (Scala) application code is, in outline:

KafkaStream => foreachRDD => mapPartitions => repartition => groupBy
=> agg(expr("percentile(value, array(0.25, 0.5, 0.75))")) => take(2)
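
For reference, a minimal sketch of that pipeline is below. The broker
address, topic name, group id, batch interval, and column names ("key",
"value") are placeholders, not my actual application values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val spark = SparkSession.builder.appName("percentile-bench").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

// Placeholder Kafka consumer configuration.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "bench")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  import spark.implicits._
  // Parse each record into a (key, numeric value) pair per partition,
  // then aggregate the 25th/50th/75th percentiles per key.
  val parsed = rdd.mapPartitions(_.map(r => (r.key, r.value.toDouble)))
  parsed.toDF("key", "value")
    .repartition(2)
    .groupBy("key")
    .agg(expr("percentile(value, array(0.25, 0.5, 0.75))"))
    .take(2)
}

ssc.start()
ssc.awaitTermination()
```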

In short, a two-core executor could process 600,000 rows of key/value pairs
in 60 seconds with Spark 2.x, while now, with Spark 3.3.1, the same
processing (same code) completes in 5-10 seconds.

@apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
significant optimization that could explain this improvement?

BR,
Athanasios Kordelas
