Re: Difference between windowing functions and aggregation functions on big data

Sid Sun, 27 Feb 2022 12:00:05 -0800

Basically, I am trying two different approaches for the same problem and my
concern is how it will behave in the case of big data if you talk about
millions of records. Which one would be faster? Is using windowing
functions a better way since it will load the entire dataset into a single
window and do the operations?


On Mon, Feb 28, 2022 at 12:26 AM Sean Owen <sro...@gmail.com> wrote:

> Those queries look like they do fairly different things. One is selecting
> top employees by salary, the other is ... selecting where there are less
> than 3 distinct salaries or something.
> Not sure what the intended comparison is then; these are not equivalent
> ways of doing the same thing, or does not seem so as far as I can see.
>
> On Sun, Feb 27, 2022 at 12:30 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> My bad.
>>
>> Aggregation Query:
>>
>> # Write your MySQL query statement below
>>
>>    SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>>        WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>> ORDER by E.DepartmentId, E.Salary DESC
>>
>> Time Taken: 1212 ms
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time Taken: 790 ms
>>
>> Thanks,
>> Sid
>>
>>
>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Those two queries are identical?
>>>
>>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> I am aware that if windowing functions are used, then at first it loads
>>>> the entire dataset into one window,scans and then performs the other
>>>> mentioned operations for that particular window which could be slower when
>>>> dealing with trillions / billions of records.
>>>>
>>>> I did a POC where I used an example to find the max 3 highest salary
>>>> for an employee per department. So, I wrote a below queries and compared
>>>> the time for it:
>>>>
>>>> Windowing Query:
>>>>
>>>> select Department,Employee,Salary from (
>>>> select d.name as Department, e.name as Employee,e.salary as
>>>> Salary,dense_rank() over(partition by d.name order by e.salary desc)
>>>> as rnk from Department d join Employee e on e.departmentId=d.id ) a
>>>> where rnk<=3
>>>>
>>>> Time taken: 790 ms
>>>>
>>>> Aggregation Query:
>>>>
>>>> select Department,Employee,Salary from (
>>>> select d.name as Department, e.name as Employee,e.salary as
>>>> Salary,dense_rank() over(partition by d.name order by e.salary desc)
>>>> as rnk from Department d join Employee e on e.departmentId=d.id ) a
>>>> where rnk<=3
>>>>
>>>> Time taken: 1212 ms
>>>>
>>>> But as per my understanding, the aggregation should have run faster.
>>>> So, my whole point is if the dataset is huge I should force some kind of
>>>> map reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>>
>>>> So I think the aggregation query is taking more time since the dataset
>>>> size here is smaller and as we all know that map reduce works faster when
>>>> there is a huge volume of data. Haven't tested it yet on big data but
>>>> needed some expert guidance over here.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> TIA,
>>>> Sid
>>>>
>>>>
>>>>
>>>>

Re: Difference between windowing functions and aggregation functions on big data

Reply via email to