My bad.

Aggregation Query:

# Write your MySQL query statement below

SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER BY E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

SELECT Department, Employee, Salary FROM (
    SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
           DENSE_RANK() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
    FROM Department d JOIN Employee e ON e.departmentId = d.id
) a WHERE rnk <= 3

Time Taken: 790 ms
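To sanity-check that the two queries really compute the same result (top 3 distinct salaries per department), here is a small standalone Python sketch of both approaches; the table data and helper names are made up for illustration, not taken from the benchmark:

```python
# Hypothetical stand-ins for the Department and Employee tables
# (names and salaries invented for illustration).
departments = {1: "IT", 2: "Sales"}
employees = [
    ("Joe", 85000, 1), ("Henry", 80000, 2), ("Sam", 60000, 2),
    ("Max", 90000, 1), ("Janet", 69000, 1), ("Randy", 85000, 1),
    ("Will", 70000, 1),
]

def top3_aggregation(rows):
    # Mirrors the correlated subquery: keep a row when fewer than 3
    # distinct salaries in the same department are strictly higher.
    out = []
    for name, salary, dept in rows:
        higher = {s for _, s, d in rows if d == dept and s > salary}
        if len(higher) < 3:
            out.append((departments[dept], name, salary))
    return out

def top3_window(rows):
    # Mirrors DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC):
    # equal salaries share a rank, ranks have no gaps; keep rnk <= 3.
    out = []
    for dept in departments:
        part = [r for r in rows if r[2] == dept]
        distinct = sorted({r[1] for r in part}, reverse=True)
        rank = {s: i + 1 for i, s in enumerate(distinct)}  # dense rank
        out.extend((departments[dept], n, s) for n, s, _ in part
                   if rank[s] <= 3)
    return out

# Both formulations select exactly the same rows.
assert sorted(top3_aggregation(employees)) == sorted(top3_window(employees))
```

The runtime difference you measured is therefore purely about execution strategy, not about what the queries return.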

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:

> Those two queries are identical?
>
> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I am aware that when windowing functions are used, the engine first loads
>> the entire dataset, scans it, and then performs the specified operations
>> within each window, which could be slower when dealing with
>> billions/trillions of records.
>>
>> I did a POC where I used an example to find the top 3 highest salaries
>> per department. So, I wrote the below queries and compared their
>> execution times:
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 790 ms
>>
>> Aggregation Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 1212 ms
>>
>> But as per my understanding, the aggregation should have run faster. So,
>> my whole point is: if the dataset is huge, should I force some kind of
>> map-reduce job, like the df.groupby().reduceByGroups() option we have?
>>
>> I think the aggregation query is taking more time here because the
>> dataset is small, and as we all know, map-reduce works faster when there
>> is a huge volume of data. I haven't tested it yet on big data, but I
>> need some expert guidance here.
>>
>> Please correct me if I am wrong.
>>
>> TIA,
>> Sid
>>
>>
>>
>>
