Re: Difference between windowing functions and aggregation functions on big data

Enrico Minack Sun, 27 Feb 2022 11:29:49 -0800

Sid,

Your Aggregation Query selects all employees where less than threedistinct salaries exist that are larger. So, both queries seem to do thesame.

The Windowing Query is explicit in what it does: give me the rank forsalaries per department in the given order and pick the top 3 perdepartment.

The Aggregation Query is trying to get to this conclusion byconstructing some comparison. The former is the better approach, thesecond scales badly as this is done by counting distinct salaries thatare larger than each salary in E. This looks like a Cartesian product ofEmployees. You make this very hard to optimize or execute by the queryengine.

And as you say, your example is very small, so this will not give anyinsights into big data.


Enrico


Am 27.02.22 um 19:30 schrieb Sid:

My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (

select d.name <http://d.name/> as Department, e.name<http://e.name/> as Employee,e.salary as Salary,dense_rank()over(partition by d.name <http://d.name/> order by e.salary desc) asrnk from Department d join Employee e on e.departmentId=d.id<http://d.id/> ) a where rnk<=3


Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:

    Those two queries are identical?

    On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:

        Hi Team,

        I am aware that if windowing functions are used, then at first
        it loads the entire dataset into one window,scans and then
        performs the other mentioned operations for that particular
        window which could be slower when dealing with trillions /
        billions of records.

        I did a POC where I used an example to find the max 3 highest
        salary for an employee per department. So, I wrote a below
        queries and compared the time for it:

        Windowing Query:

        select Department,Employee,Salary from (
        select d.name <http://d.name> as Department, e.name
        <http://e.name> as Employee,e.salary as Salary,dense_rank()
        over(partition by d.name <http://d.name> order by e.salary
        desc) as rnk from Department d join Employee e on
        e.departmentId=d.id <http://d.id> ) a where rnk<=3

        Time taken: 790 ms

        Aggregation Query:

        select Department,Employee,Salary from (
        select d.name <http://d.name> as Department, e.name
        <http://e.name> as Employee,e.salary as Salary,dense_rank()
        over(partition by d.name <http://d.name> order by e.salary
        desc) as rnk from Department d join Employee e on
        e.departmentId=d.id <http://d.id> ) a where rnk<=3

        Time taken: 1212 ms

        But as per my understanding, the aggregation should have run
        faster. So, my whole point is if the dataset is huge I should
        force some kind of map reduce jobs like we have an option
        called df.groupby().reduceByGroups()

        So I think the aggregation query is taking more time since the
        dataset size here is smaller and as we all know that map
        reduce works faster when there is a huge volume of data.
        Haven't tested it yet on big data but needed some expert
        guidance over here.

        Please correct me if I am wrong.

        TIA,
        Sid

Re: Difference between windowing functions and aggregation functions on big data

Reply via email to