Please clarify what you mean by *extreme improvements*.

What metrics are you using?
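
To compare like with like, one option is to time the groupBy/percentile step from the quoted message on its own, under both Spark versions. Below is a minimal sketch of that idea, not the original application: the object name `PercentileTiming`, the helper `timeQuartiles`, the column names `key` and `value`, the synthetic data and the `local[2]` master are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

object PercentileTiming {

  // Runs repartition => groupBy => percentile => take(2) on a synthetic batch
  // and returns the rows taken plus the wall-clock time in milliseconds.
  // The 100 distinct keys and the value range are hypothetical.
  def timeQuartiles(spark: SparkSession, numRows: Long): (Array[Row], Double) = {
    val batch = spark.range(numRows).select(
      (col("id") % 100).as("key"),
      (col("id") % 1000).cast("double").as("value")
    )

    val start = System.nanoTime()
    val rows = batch
      .repartition(2)
      .groupBy("key")
      .agg(expr("percentile(value, array(0.25, 0.5, 0.75))").as("quartiles"))
      .take(2)
    val elapsedMs = (System.nanoTime() - start) / 1e6

    (rows, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Two local threads, loosely mirroring a two-core executor (assumption).
    val spark = SparkSession.builder()
      .appName("percentile-timing")
      .master("local[2]")
      .getOrCreate()

    // 600,000 rows, matching the batch size mentioned in the quoted message.
    val (rows, elapsedMs) = timeQuartiles(spark, 600000L)
    rows.foreach(println)
    println(f"groupBy + percentile took $elapsedMs%.0f ms")

    spark.stop()
  }
}
```

The first run includes JVM warm-up and session start-up, so it is probably worth repeating the measurement a few times and comparing the later runs rather than the first one.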

HTH



View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 26 Jan 2023 at 13:06, Athanasios Kordelas <
athanasioskorde...@gmail.com> wrote:

> Hi all,
>
> I'm running some tests on Spark Streaming (DStreams, not Structured
> Streaming) for my PhD, and I'm seeing an extreme improvement when using
> Spark/Kafka 3.3.1 versus Spark 2.4.8/Kafka 2.7.0.
>
> My (scala) application code is as follows:
>
> *KafkaStream* => foreachRDD => mapPartitions => repartition => groupBy =>
> *agg(expr("percentile(value, array(0.25, 0.5, 0.75))"))* => take(2)
>
> In short, a two-core executor could process 600,000 rows of
> key/value pairs in 60 seconds with Spark 2.x, whereas with Spark 3.3.1
> the same processing (same code) takes 5-10 seconds.
>
> @apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
> significant optimization that could explain this improvement?
>
> BR,
> Athanasios Kordelas
>
>
