Sid,
Your Aggregation Query selects all employees where less than three
distinct salaries exist that are larger. So, both queries seem to do the
same.
The Windowing Query is explicit in what it does: give me the rank for
salaries per department in the given order and pick the top 3 per
department.
The Aggregation Query is trying to get to this conclusion by
constructing some comparison. The former is the better approach, the
second scales badly as this is done by counting distinct salaries that
are larger than each salary in E. This looks like a Cartesian product of
Employees. You make this very hard to optimize or execute by the query
engine.
And as you say, your example is very small, so this will not give any
insights into big data.
Enrico
Am 27.02.22 um 19:30 schrieb Sid:
My bad.
Aggregation Query:
# Write your MySQL query statement below
SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC
Time Taken: 1212 ms
Windowing Query:
select Department,Employee,Salary from (
select d.name <http://d.name/> as Department, e.name
<http://e.name/> as Employee,e.salary as Salary,dense_rank()
over(partition by d.name <http://d.name/> order by e.salary desc) as
rnk from Department d join Employee e on e.departmentId=d.id
<http://d.id/> ) a where rnk<=3
Time Taken: 790 ms
Thanks,
Sid
On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
Those two queries are identical?
On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
Hi Team,
I am aware that if windowing functions are used, then at first
it loads the entire dataset into one window,scans and then
performs the other mentioned operations for that particular
window which could be slower when dealing with trillions /
billions of records.
I did a POC where I used an example to find the max 3 highest
salary for an employee per department. So, I wrote a below
queries and compared the time for it:
Windowing Query:
select Department,Employee,Salary from (
select d.name <http://d.name> as Department, e.name
<http://e.name> as Employee,e.salary as Salary,dense_rank()
over(partition by d.name <http://d.name> order by e.salary
desc) as rnk from Department d join Employee e on
e.departmentId=d.id <http://d.id> ) a where rnk<=3
Time taken: 790 ms
Aggregation Query:
select Department,Employee,Salary from (
select d.name <http://d.name> as Department, e.name
<http://e.name> as Employee,e.salary as Salary,dense_rank()
over(partition by d.name <http://d.name> order by e.salary
desc) as rnk from Department d join Employee e on
e.departmentId=d.id <http://d.id> ) a where rnk<=3
Time taken: 1212 ms
But as per my understanding, the aggregation should have run
faster. So, my whole point is if the dataset is huge I should
force some kind of map reduce jobs like we have an option
called df.groupby().reduceByGroups()
So I think the aggregation query is taking more time since the
dataset size here is smaller and as we all know that map
reduce works faster when there is a huge volume of data.
Haven't tested it yet on big data but needed some expert
guidance over here.
Please correct me if I am wrong.
TIA,
Sid