Am I correct that with

.. WHERE (SELECT COUNT(DISTINCT(Salary))..

you will have to shuffle because of DISTINCT, as each worker will have to
read its data separately and perform a reduce task to get the local
distinct values, followed by one final shuffle to get the actual distinct
values for all the data?
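
To make the question concrete, here is a toy sketch in plain Python (not
Spark code, and not a claim about what the Spark planner actually does) of
the two-phase pattern I have in mind: a local distinct on each worker, a
hash-based exchange so that equal values meet on the same reducer, and a
final count. The function name count_distinct_two_phase and the
num_reducers parameter are made up for illustration only.

```python
from collections import defaultdict


def count_distinct_two_phase(partitions, num_reducers=2):
    """Toy model of a distributed COUNT(DISTINCT ...); hypothetical helper."""
    # Phase 1 (map side): each worker deduplicates its own partition locally.
    local_distinct = [set(partition) for partition in partitions]

    # Exchange: route every locally distinct value to a reducer by hash, so
    # equal values coming from different workers land on the same reducer.
    buckets = defaultdict(set)
    for values in local_distinct:
        for value in values:
            buckets[hash(value) % num_reducers].add(value)

    # Phase 2 (reduce side): each reducer counts its own distinct values.
    # The buckets are disjoint, so the partial counts can simply be summed.
    return sum(len(bucket) for bucket in buckets.values())


# Three "workers", each holding a slice of the Salary column.
salaries = [[100, 200, 200], [200, 300], [100, 400, 400]]
result = count_distinct_two_phase(salaries)
print(result)
```

In this model only the locally distinct values cross the network, which is
why the amount of data moved depends on the number of distinct salaries
rather than on the number of rows.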



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh







On Sun, 27 Feb 2022 at 20:31, Sean Owen <sro...@gmail.com> wrote:

> 'count distinct' does not have that problem, whether in a group-by or not.
> I'm still not sure these are equivalent queries, but maybe I'm not seeing it.
> Windowing makes sense when you need the whole window, or when you need
> sliding windows to express the desired groups.
> It may be unnecessary when your query does not need the window, just a
> summary stat like 'max'. Depends.
>
> On Sun, Feb 27, 2022 at 2:14 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> You are using distinct, which collects everything to the driver. So use
>> the other one :)
>>
>> On Sun, 27 Feb 2022 at 21:00, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Basically, I am trying two different approaches for the same problem and
>>> my concern is how it will behave in the case of big data if you talk about
>>> millions of records. Which one would be faster? Is using windowing
>>> functions a better way since it will load the entire dataset into a single
>>> window and do the operations?
>>>
>>
>>
