What is optimal depends on the context of the problem. Is the intent here to find the best solution for top n values with a group by?
Both the solutions look sub-optimal to me. The window function would be expensive because it needs an ORDER BY (which a top n solution shouldn't need). It would be best to just group by department and use an aggregate function that stores the top n values in a heap.

-- Raghavendra

On Mon, Feb 28, 2022 at 12:01 AM Sid <flinkbyhe...@gmail.com> wrote:

> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
> SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E
> INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
>        WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER BY E.DepartmentId, E.Salary DESC
>
> Time taken: 1212 ms
>
> Windowing Query:
>
> SELECT Department, Employee, Salary FROM (
>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>          DENSE_RANK() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>   FROM Department d JOIN Employee e ON e.departmentId = d.id
> ) a WHERE rnk <= 3
>
> Time taken: 790 ms
>
> Thanks,
> Sid
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that when windowing functions are used, the entire dataset is first loaded into one window, scanned, and then the other mentioned operations are performed for that particular window, which could be slower when dealing with billions or trillions of records.
>>>
>>> I did a POC where I used an example to find the top 3 highest salaries per department.
>>> So, I wrote the queries below and compared their times:
>>>
>>> Windowing Query:
>>>
>>> SELECT Department, Employee, Salary FROM (
>>>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>>>          DENSE_RANK() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>>>   FROM Department d JOIN Employee e ON e.departmentId = d.id
>>> ) a WHERE rnk <= 3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> SELECT Department, Employee, Salary FROM (
>>>   SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
>>>          DENSE_RANK() OVER (PARTITION BY d.name ORDER BY e.salary DESC) AS rnk
>>>   FROM Department d JOIN Employee e ON e.departmentId = d.id
>>> ) a WHERE rnk <= 3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So my whole point is: if the dataset is huge, should I force some kind of map-reduce job, like the df.groupby().reduceByGroups() option?
>>>
>>> So I think the aggregation query is taking more time because the dataset here is small, and, as we all know, map reduce works faster when there is a huge volume of data. I haven't tested it yet on big data, but I need some expert guidance here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
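[Editor's note] Raghavendra's suggestion at the top of the thread -- group by department and keep only the top n values in a bounded heap, instead of sorting each partition -- can be sketched in plain Python. The sample rows, salary figures, and function name below are illustrative only, not taken from the thread; in Spark this logic would live inside a custom aggregator, but the core idea is the same: one pass over the data at O(rows * log n) cost, with no global ORDER BY.

```python
import heapq
from collections import defaultdict

# Hypothetical sample data: (department, employee, salary) rows.
rows = [
    ("IT", "Max", 90000),
    ("IT", "Joe", 85000),
    ("IT", "Randy", 85000),
    ("IT", "Will", 70000),
    ("Sales", "Henry", 80000),
    ("Sales", "Sam", 60000),
]

def top_n_salaries_per_group(rows, n=3):
    """Keep a min-heap of at most n distinct salaries per department.

    Each new salary is pushed; once a heap holds more than n entries,
    the smallest is popped. A single pass costs O(rows * log n) and
    never needs to sort a whole partition.
    """
    heaps = defaultdict(list)  # department -> min-heap of top-n salaries
    for dept, _, salary in rows:
        h = heaps[dept]
        if salary not in h:       # distinct salaries, mimicking DENSE_RANK
            heapq.heappush(h, salary)
            if len(h) > n:
                heapq.heappop(h)  # drop the smallest, keep the top n
    # Report each department's retained salaries, highest first.
    return {dept: sorted(h, reverse=True) for dept, h in heaps.items()}

print(top_n_salaries_per_group(rows, n=3))
# A second join against the employee rows would then recover the names,
# as the WHERE rnk <= 3 filter does in the windowing query.
```

The membership check `salary not in h` is a linear scan, which is fine because the heap never holds more than n elements; for large n a companion set would be the idiomatic choice.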