Hi Mich, I'll gather them and send them to you :)
Many thanks,
Thanasis

On Fri, 27 Jan 2023 at 11:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi Athanasios,
>
> Thanks for the details. Since I believe this is Spark Streaming, the
> all-important indicator is the Processing Time, defined in the Spark GUI
> as the time taken to process all jobs of a batch versus the batch
> interval. The Scheduling Delay and the Total Delay are additional
> indicators of health. Do you have these stats for both versions?
>
> Cheers
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
> On Fri, 27 Jan 2023 at 09:03, Athanasios Kordelas
> <athanasioskorde...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for your reply. For my benchmark test, I'm only using one
>> executor with two cores in both cases.
>> I had created a large image with multiple UI screenshots a few days
>> ago, so I'm attaching it (please zoom in).
>> You can see Spark 3 on the left side versus Spark 2 on the right.
>>
>> I can collect more info by triggering new runs if that would help, but
>> I'm not sure of the best way to provide you with all the metrics data,
>> maybe from the logs?
>>
>> --Thanasis
>>
>> On Thu, 26 Jan 2023 at 22:03, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> You have given some stats, 5-10 sec vs 60 sec, with the set-up and
>>> systematics being the same for both tests?
>>>
>>> So let us assume we see a ~10 sec average time with 3.3.1 versus 60
>>> sec with the older Spark 2.x.
>>>
>>> That gives us (60 - 10) / 60 * 100 ~ 83%, i.e. roughly an 80% gain.
>>>
>>> However, that would not tell us why 3.3.1 excels in detail. For that
>>> you need to look at the Spark GUI metrics.
>>>
>>> HTH
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> On Thu, 26 Jan 2023 at 16:51, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>>
>>>> Please qualify what you mean by *extreme improvements*.
>>>>
>>>> What metrics are you using?
>>>>
>>>> HTH
>>>>
>>>> On Thu, 26 Jan 2023 at 13:06, Athanasios Kordelas
>>>> <athanasioskorde...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm running some tests on Spark Streaming (not Structured Streaming)
>>>>> for my PhD, and I'm seeing an extreme improvement when using
>>>>> Spark/Kafka 3.3.1 versus Spark 2.4.8/Kafka 2.7.0.
>>>>>
>>>>> My (Scala) application code is as follows:
>>>>>
>>>>> KafkaStream => foreachRDD => mapPartitions => repartition =>
>>>>> groupBy => agg(expr("percentile(value, array(0.25, 0.5, 0.75))")) =>
>>>>> take(2)
>>>>>
>>>>> In short, a two-core executor could process 600,000 rows of
>>>>> key/value pairs in 60 seconds with Spark 2.x, while now, with
>>>>> Spark 3.3.1, the same processing (same code) completes in 5-10
>>>>> seconds.
>>>>>
>>>>> @apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
>>>>> significant optimization that could explain this improvement?
>>>>>
>>>>> BR,
>>>>> Athanasios Kordelas

-- 
Athanasios Kordelas
Staff SW Engineer
T: +30 6972053674 | Skype: athanasios.korde...@outlook.com.gr
athanasioskorde...@gmail.com
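[Editor's note: the aggregation discussed in the thread, `percentile(value, array(0.25, 0.5, 0.75))` applied per key after a groupBy, can be mirrored outside Spark with a short plain-Python sketch. This is an illustration only, not the thread's Scala code: it assumes Spark's exact `percentile` function uses linear interpolation over the sorted values, which `statistics.quantiles(..., method="inclusive")` also does for quartiles.]

```python
# Plain-Python sketch of a per-key quartile aggregation, mirroring
# percentile(value, array(0.25, 0.5, 0.75)) after groupBy(key).
# Assumption: linear interpolation over sorted values, as in Spark's
# exact `percentile`; here via statistics.quantiles(method="inclusive").
from collections import defaultdict
from statistics import quantiles

def quartiles_by_key(pairs):
    """Group (key, value) pairs by key; return [p25, p50, p75] per key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # n=4 cut points yield the 25th, 50th and 75th percentiles.
    return {k: quantiles(sorted(vs), n=4, method="inclusive")
            for k, vs in groups.items()}

pairs = [("a", v) for v in range(1, 101)] + [("b", 10), ("b", 20), ("b", 30)]
print(quartiles_by_key(pairs))
# "a" -> [25.75, 50.5, 75.25], "b" -> [15.0, 20.0, 25.0]
```

This only reproduces the arithmetic of the aggregation step, of course; the 2.x-vs-3.3.1 gap in the thread is about the streaming engine and shuffle machinery around it, not the percentile math itself.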