Sid,

Your Aggregation Query selects all employees where less than three distinct salaries exist that are larger. So, both queries seem to do the same.

The Windowing Query is explicit in what it does: give me the rank for salaries per department in the given order and pick the top 3 per department.

The Aggregation Query is trying to get to this conclusion by constructing some comparison. The former is the better approach, the second scales badly as this is done by counting distinct salaries that are larger than each salary in E. This looks like a Cartesian product of Employees. You make this very hard to optimize or execute by the query engine.

And as you say, your example is very small, so this will not give any insights into big data.

Enrico


Am 27.02.22 um 19:30 schrieb Sid:
My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (
select d.name <http://d.name/> as Department, e.name <http://e.name/> as Employee,e.salary as Salary,dense_rank() over(partition by d.name <http://d.name/> order by e.salary desc) as rnk from Department d join Employee e on e.departmentId=d.id <http://d.id/> ) a where rnk<=3

Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:

    Those two queries are identical?

    On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:

        Hi Team,

        I am aware that if windowing functions are used, then at first
        it loads the entire dataset into one window,scans and then
        performs the other mentioned operations for that particular
        window which could be slower when dealing with trillions /
        billions of records.

        I did a POC where I used an example to find the max 3 highest
        salary for an employee per department. So, I wrote a below
        queries and compared the time for it:

        Windowing Query:

        select Department,Employee,Salary from (
        select d.name <http://d.name> as Department, e.name
        <http://e.name> as Employee,e.salary as Salary,dense_rank()
        over(partition by d.name <http://d.name> order by e.salary
        desc) as rnk from Department d join Employee e on
        e.departmentId=d.id <http://d.id> ) a where rnk<=3

        Time taken: 790 ms

        Aggregation Query:

        select Department,Employee,Salary from (
        select d.name <http://d.name> as Department, e.name
        <http://e.name> as Employee,e.salary as Salary,dense_rank()
        over(partition by d.name <http://d.name> order by e.salary
        desc) as rnk from Department d join Employee e on
        e.departmentId=d.id <http://d.id> ) a where rnk<=3

        Time taken: 1212 ms

        But as per my understanding, the aggregation should have run
        faster. So, my whole point is if the dataset is huge I should
        force some kind of map reduce jobs like we have an option
        called df.groupby().reduceByGroups()

        So I think the aggregation query is taking more time since the
        dataset size here is smaller and as we all know that map
        reduce works faster when there is a huge volume of data.
        Haven't tested it yet on big data but needed some expert
        guidance over here.

        Please correct me if I am wrong.

        TIA,
        Sid


Reply via email to