You are using distinct, which collects everything to the driver. So use the other one :)
On Sun, Feb 27, 2022 at 9:00 PM Sid <flinkbyhe...@gmail.com> wrote:

> Basically, I am trying two different approaches to the same problem, and
> my concern is how each will behave on big data, say millions of records.
> Which one would be faster? Is using window functions a better way, since
> it will load the entire dataset into a single window and do the operations?
>
> On Mon, Feb 28, 2022 at 12:26 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Those queries look like they do fairly different things. One is selecting
>> top employees by salary; the other is ... selecting where there are fewer
>> than 3 distinct salaries, or something.
>> Not sure what the intended comparison is then; these are not equivalent
>> ways of doing the same thing, or so it seems as far as I can see.
>>
>> On Sun, Feb 27, 2022 at 12:30 PM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> My bad.
>>>
>>> Aggregation Query:
>>>
>>> # Write your MySQL query statement below
>>>
>>> SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>>> WHERE (SELECT COUNT(DISTINCT Salary) FROM Employee
>>>        WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>>> ORDER BY E.DepartmentId, E.Salary DESC
>>>
>>> Time taken: 1212 ms
>>>
>>> Windowing Query:
>>>
>>> select Department, Employee, Salary from (
>>>   select d.name as Department, e.name as Employee, e.salary as Salary,
>>>          dense_rank() over (partition by d.name order by e.salary desc) as rnk
>>>   from Department d join Employee e on e.departmentId = d.id
>>> ) a where rnk <= 3
>>>
>>> Time taken: 790 ms
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Those two queries are identical?
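As a sanity check on the comparison in the corrected pair of queries above, the two formulations can be run side by side on a toy dataset. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Spark SQL (it needs SQLite ≥ 3.25 for window functions; the table contents are made-up sample rows, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT,
                       salary INTEGER, departmentId INTEGER);
INSERT INTO Department VALUES (1, 'IT'), (2, 'Sales');
INSERT INTO Employee VALUES
  (1, 'Joe',   85000, 1), (2, 'Henry', 80000, 2),
  (3, 'Sam',   60000, 2), (4, 'Max',   90000, 1),
  (5, 'Janet', 69000, 1), (6, 'Randy', 85000, 1),
  (7, 'Will',  70000, 1);
""")

# Correlated-subquery ("aggregation") formulation: keep a row if fewer
# than 3 distinct higher salaries exist in its department.
agg_sql = """
SELECT D.name AS Department, E.name AS Employee, E.salary AS Salary
FROM Employee E JOIN Department D ON E.departmentId = D.id
WHERE (SELECT COUNT(DISTINCT salary) FROM Employee
       WHERE departmentId = E.departmentId AND salary > E.salary) < 3
ORDER BY Department, Salary DESC, Employee
"""

# Window-function formulation: dense_rank per department, keep rank <= 3.
win_sql = """
SELECT Department, Employee, Salary FROM (
  SELECT d.name AS Department, e.name AS Employee, e.salary AS Salary,
         DENSE_RANK() OVER (PARTITION BY d.name
                            ORDER BY e.salary DESC) AS rnk
  FROM Department d JOIN Employee e ON e.departmentId = d.id
) a WHERE rnk <= 3
ORDER BY Department, Salary DESC, Employee
"""

agg_rows = cur.execute(agg_sql).fetchall()
win_rows = cur.execute(win_sql).fetchall()
print(agg_rows == win_rows)  # the two formulations return the same rows
```

This only confirms the two queries are equivalent on small data; it says nothing about which plan Spark would pick or how each shuffles at scale.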
>>>>
>>>> On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I am aware that if window functions are used, the entire dataset is
>>>>> first loaded into one window, scanned, and then the other mentioned
>>>>> operations are performed for that particular window, which could be
>>>>> slower when dealing with trillions/billions of records.
>>>>>
>>>>> I did a POC using an example that finds the top 3 highest salaries
>>>>> per department per employee. I wrote the queries below and compared
>>>>> their timings:
>>>>>
>>>>> Windowing Query:
>>>>>
>>>>> select Department, Employee, Salary from (
>>>>>   select d.name as Department, e.name as Employee, e.salary as Salary,
>>>>>          dense_rank() over (partition by d.name order by e.salary desc) as rnk
>>>>>   from Department d join Employee e on e.departmentId = d.id
>>>>> ) a where rnk <= 3
>>>>>
>>>>> Time taken: 790 ms
>>>>>
>>>>> Aggregation Query:
>>>>>
>>>>> select Department, Employee, Salary from (
>>>>>   select d.name as Department, e.name as Employee, e.salary as Salary,
>>>>>          dense_rank() over (partition by d.name order by e.salary desc) as rnk
>>>>>   from Department d join Employee e on e.departmentId = d.id
>>>>> ) a where rnk <= 3
>>>>>
>>>>> Time taken: 1212 ms
>>>>>
>>>>> But as per my understanding, the aggregation should have run faster.
>>>>> So my whole point is: if the dataset is huge, should I force some kind
>>>>> of map-reduce-style job, like the df.groupby().reduceByGroups() option?
>>>>>
>>>>> I think the aggregation query is taking more time because the dataset
>>>>> here is small, and map-reduce works faster when there is a huge volume
>>>>> of data. I haven't tested it on big data yet, but I need some expert
>>>>> guidance here.
>>>>>
>>>>> Please correct me if I am wrong.
>>>>>
>>>>> TIA,
>>>>> Sid

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund, Norge
+47 480 94 297
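(Side note on the intuition in Sid's last message: `df.groupby().reduceByGroups()` is not a standard PySpark DataFrame method — the closest real API is `groupByKey(...).reduceGroups(...)` on Scala/Java Datasets. The underlying idea, though, is why aggregation can scale better than a window: a partial aggregate keeps only bounded per-group state and combines cheaply across partitions, whereas a window must materialize and sort every row of each partition. A plain-Python sketch of that contrast, with made-up rows and hypothetical helper names, not Spark's actual execution model:)

```python
from collections import defaultdict

employees = [
    # (department, employee, salary) -- made-up sample rows
    ("IT", "Max", 90000), ("IT", "Joe", 85000), ("IT", "Randy", 85000),
    ("IT", "Will", 70000), ("IT", "Janet", 69000),
    ("Sales", "Henry", 80000), ("Sales", "Sam", 60000),
]

def top_salaries_window_style(rows, k=3):
    """Window-style: fully sort every row within each department,
    then keep rows whose dense rank is <= k."""
    by_dept = defaultdict(list)
    for dept, name, salary in rows:
        by_dept[dept].append((name, salary))
    out = []
    for dept, people in by_dept.items():
        people.sort(key=lambda p: -p[1])          # full per-partition sort
        top = sorted({s for _, s in people}, reverse=True)[:k]
        out.extend((dept, n, s) for n, s in people if s in top)
    return sorted(out)

def top_salaries_agg_style(rows, k=3):
    """Aggregation-style: keep only a bounded accumulator per department
    (the k highest distinct salaries seen so far), the way a partial
    aggregate would before a final combine."""
    top_k = defaultdict(set)                       # dept -> distinct salaries
    for dept, _, salary in rows:
        top_k[dept].add(salary)
        if len(top_k[dept]) > k:
            top_k[dept].remove(min(top_k[dept]))   # state stays bounded at k
    return sorted((d, n, s) for d, n, s in rows if s in top_k[d])

# Both strategies select the same rows; they differ in how much state
# each must hold per group while scanning.
print(top_salaries_window_style(employees) == top_salaries_agg_style(employees))
```

Whether Spark actually rewrites a given window query into a partial aggregate depends on the optimizer and the query shape, so the only reliable answer to "which is faster on millions of rows" is to benchmark both at realistic scale.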