You are using distinct, which collects everything to the driver. So use the
other one :)
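A quick way to sanity-check that the two formulations even answer the same question is to run both on a toy table. Below is a small sketch using an in-memory SQLite database (the schema and sample rows are assumed, modeled on the LeetCode-style Employee/Department tables in the thread; SQLite stands in for MySQL/Spark here only to show the queries agree, it says nothing about performance at scale):

```python
# Sketch: run both of the queries from the thread on assumed toy data
# and check they return the same rows. Requires SQLite >= 3.25 for
# window function support (shipped with any recent Python).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT,
                       Salary INTEGER, DepartmentId INTEGER);
INSERT INTO Department VALUES (1, 'IT'), (2, 'Sales');
INSERT INTO Employee VALUES
  (1, 'Joe',   85000, 1), (2, 'Henry', 80000, 2),
  (3, 'Sam',   60000, 2), (4, 'Max',   90000, 1),
  (5, 'Janet', 69000, 1), (6, 'Randy', 85000, 1),
  (7, 'Will',  70000, 1);
""")

# Correlated-subquery ("aggregation") version: keep a row if fewer than
# three distinct salaries in its department are strictly higher.
agg = conn.execute("""
SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER BY D.Name, E.Salary DESC, E.Name
""").fetchall()

# dense_rank() window version of the same top-3-salary-tiers question.
win = conn.execute("""
SELECT Department, Employee, Salary FROM (
  SELECT d.Name AS Department, e.Name AS Employee, e.Salary AS Salary,
         dense_rank() OVER (PARTITION BY d.Name
                            ORDER BY e.Salary DESC) AS rnk
  FROM Department d JOIN Employee e ON e.DepartmentId = d.Id
) WHERE rnk <= 3
ORDER BY Department, Salary DESC, Employee
""").fetchall()

# Both keep the top-3 distinct salary tiers per department (ties kept),
# so the result sets should match row for row.
print(agg == win)
```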

Sun, Feb 27, 2022 at 21:00 Sid <flinkbyhe...@gmail.com> wrote:

> Basically, I am trying two different approaches for the same problem, and
> my concern is how each will behave in the case of big data, say millions
> of records. Which one would be faster? Is using windowing functions a
> better way, since it will load the entire dataset into a single window and
> do the operations?
>
> On Mon, Feb 28, 2022 at 12:26 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Those queries look like they do fairly different things. One is selecting
>> top employees by salary, the other is ... selecting where there are less
>> than 3 distinct salaries or something.
>> Not sure what the intended comparison is, then; these are not equivalent
>> ways of doing the same thing, at least as far as I can see.
>>
>> On Sun, Feb 27, 2022 at 12:30 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> My bad.
>>>
>>> Aggregation Query:
>>>
>>> # Write your MySQL query statement below
>>>
>>> SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>>> WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
>>>        WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>>> ORDER BY E.DepartmentId, E.Salary DESC
>>>
>>> Time Taken: 1212 ms
>>>
>>> Windowing Query:
>>>
>>> select Department, Employee, Salary from (
>>>   select d.name as Department, e.name as Employee, e.salary as Salary,
>>>          dense_rank() over (partition by d.name order by e.salary desc) as rnk
>>>   from Department d join Employee e on e.departmentId = d.id
>>> ) a where rnk <= 3
>>>
>>> Time Taken: 790 ms
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Those two queries are identical?
>>>>
>>>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I am aware that if windowing functions are used, then at first it
>>>>> loads the entire dataset into one window, scans it, and then performs the
>>>>> other mentioned operations for that particular window, which could be
>>>>> slower when dealing with billions / trillions of records.
>>>>>
>>>>> I did a POC where I used an example to find the top 3 highest salaries
>>>>> per department for employees. So, I wrote the below queries and compared
>>>>> their timings:
>>>>>
>>>>> Windowing Query:
>>>>>
>>>>> select Department,Employee,Salary from (
>>>>> select d.name as Department, e.name as Employee,e.salary as
>>>>> Salary,dense_rank() over(partition by d.name order by e.salary desc)
>>>>> as rnk from Department d join Employee e on e.departmentId=d.id ) a
>>>>> where rnk<=3
>>>>>
>>>>> Time taken: 790 ms
>>>>>
>>>>> Aggregation Query:
>>>>>
>>>>> select Department,Employee,Salary from (
>>>>> select d.name as Department, e.name as Employee,e.salary as
>>>>> Salary,dense_rank() over(partition by d.name order by e.salary desc)
>>>>> as rnk from Department d join Employee e on e.departmentId=d.id ) a
>>>>> where rnk<=3
>>>>>
>>>>> Time taken: 1212 ms
>>>>>
>>>>> But as per my understanding, the aggregation should have run faster.
>>>>> So, my whole point is that if the dataset is huge, I should force some
>>>>> kind of map-reduce job, like the option we have called
>>>>> df.groupby().reduceByGroups()
>>>>>
>>>>> So I think the aggregation query is taking more time since the dataset
>>>>> here is small and, as we all know, map reduce works faster when there is
>>>>> a huge volume of data. I haven't tested it yet on big data but need some
>>>>> expert guidance here.
>>>>>
>>>>> Please correct me if I am wrong.
>>>>>
>>>>> TIA,
>>>>> Sid
>>>>>
>>>>>
>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
