Hi Mich, I'll gather them and send them to you :)
Many thanks,
Thanasis

On Fri, 27 Jan 2023 at 11:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi Athanasios,
>
> Thanks for the details. Since I believe this is Spark Streaming, the
> all-important indicator is the Processing Time, defined in the Spark GUI
> as the time taken to process all jobs of a batch versus the batch
> interval. The Scheduling Delay and the Total Delay are additional
> indicators of health. Do you have these stats for both versions?
>
> Cheers
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
> On Fri, 27 Jan 2023 at 09:03, Athanasios Kordelas
> <athanasioskorde...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for your reply. For my benchmark test, I'm only using one
>> executor with two cores in both cases.
>> I had created a large image with multiple UI screenshots a few days
>> ago, so I'm attaching it (please zoom in).
>> You can see Spark 3 on the left side versus Spark 2 on the right.
>>
>> I can collect more info by triggering new runs if that would help, but
>> I'm not sure of the best way to provide you with all the metrics data,
>> maybe from the logs?
>>
>> --Thanasis
>>
>> On Thu, 26 Jan 2023 at 22:03, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> You have given some stats, 5-10 sec vs 60 sec, with the set-up and
>>> systematics being the same for both tests?
>>>
>>> So let us assume we see a ~10 sec average time with 3.3.1 versus 60
>>> sec with the older Spark 2.x.
>>>
>>> That gives us (60 - 10) / 60 * 100 ~ 83%, i.e. roughly an 80% gain.
>>>
>>> However, that would not tell us why 3.3.1 excels in detail. For that
>>> you need to look at the Spark GUI metrics.
>>>
>>> HTH
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> On Thu, 26 Jan 2023 at 16:51, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>>
>>>> Please qualify what you mean by *extreme improvements*.
>>>>
>>>> What metrics are you using?
>>>>
>>>> HTH
>>>>
>>>> On Thu, 26 Jan 2023 at 13:06, Athanasios Kordelas
>>>> <athanasioskorde...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm running some tests on Spark Streaming (not Structured Streaming)
>>>>> for my PhD, and I'm seeing an extreme improvement when using
>>>>> Spark/Kafka 3.3.1 versus Spark 2.4.8/Kafka 2.7.0.
>>>>>
>>>>> My (Scala) application code is as follows:
>>>>>
>>>>> KafkaStream => foreachRDD => mapPartitions => repartition =>
>>>>> groupBy => agg(expr("percentile(value, array(0.25, 0.5, 0.75))")) =>
>>>>> take(2)
>>>>>
>>>>> In short, a two-core executor could process 600,000 rows of
>>>>> key/value pairs in 60 seconds with Spark 2.x, while now, with
>>>>> Spark 3.3.1, the same processing (same code) completes in 5-10
>>>>> seconds.
>>>>>
>>>>> @apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
>>>>> significant optimization that could explain this improvement?
>>>>>
>>>>> BR,
>>>>> Athanasios Kordelas

-- 
Athanasios Kordelas
Staff SW Engineer
T: +30 6972053674 | Skype: athanasios.korde...@outlook.com.gr
athanasioskorde...@gmail.com
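[Editor's note: the aggregation discussed in the thread, `percentile(value, array(0.25, 0.5, 0.75))` applied per key after a groupBy, can be mirrored outside Spark with a short plain-Python sketch. This is an illustration only, not the thread's Scala code: it assumes Spark's exact `percentile` function uses linear interpolation over the sorted values, which `statistics.quantiles(..., method="inclusive")` also does for quartiles.]

```python
# Plain-Python sketch of a per-key quartile aggregation, mirroring
# percentile(value, array(0.25, 0.5, 0.75)) after groupBy(key).
# Assumption: linear interpolation over sorted values, as in Spark's
# exact `percentile`; here via statistics.quantiles(method="inclusive").
from collections import defaultdict
from statistics import quantiles

def quartiles_by_key(pairs):
    """Group (key, value) pairs by key; return [p25, p50, p75] per key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # n=4 cut points yield the 25th, 50th and 75th percentiles.
    return {k: quantiles(sorted(vs), n=4, method="inclusive")
            for k, vs in groups.items()}

pairs = [("a", v) for v in range(1, 101)] + [("b", 10), ("b", 20), ("b", 30)]
print(quartiles_by_key(pairs))
# "a" -> [25.75, 50.5, 75.25], "b" -> [15.0, 20.0, 25.0]
```

This only reproduces the arithmetic of the aggregation step, of course; the 2.x-vs-3.3.1 gap in the thread is about the streaming engine and shuffle machinery around it, not the percentile math itself.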