Those two queries are identical? On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:
> Hi Team, > > I am aware that if windowing functions are used, then at first it loads > the entire dataset into one window,scans and then performs the other > mentioned operations for that particular window which could be slower when > dealing with trillions / billions of records. > > I did a POC where I used an example to find the max 3 highest salary for > an employee per department. So, I wrote a below queries and compared the > time for it: > > Windowing Query: > > select Department,Employee,Salary from ( > select d.name as Department, e.name as Employee,e.salary as > Salary,dense_rank() over(partition by d.name order by e.salary desc) as > rnk from Department d join Employee e on e.departmentId=d.id ) a where > rnk<=3 > > Time taken: 790 ms > > Aggregation Query: > > select Department,Employee,Salary from ( > select d.name as Department, e.name as Employee,e.salary as > Salary,dense_rank() over(partition by d.name order by e.salary desc) as > rnk from Department d join Employee e on e.departmentId=d.id ) a where > rnk<=3 > > Time taken: 1212 ms > > But as per my understanding, the aggregation should have run faster. So, > my whole point is if the dataset is huge I should force some kind of map > reduce jobs like we have an option called df.groupby().reduceByGroups() > > So I think the aggregation query is taking more time since the dataset > size here is smaller and as we all know that map reduce works faster when > there is a huge volume of data. Haven't tested it yet on big data but > needed some expert guidance over here. > > Please correct me if I am wrong. > > TIA, > Sid > > > >