Hi Team,

I am aware that when a window function is used, the engine first brings all
rows for each window partition together (a shuffle), sorts within each
partition, and then applies the function over that partition, which could be
slow when dealing with billions or trillions of records.

I did a POC using an example that finds the top 3 highest salaries per
department. I wrote the two queries below and compared their run times:

Windowing Query:

select Department, Employee, Salary
from (
  select d.name as Department,
         e.name as Employee,
         e.salary as Salary,
         dense_rank() over (partition by d.name order by e.salary desc) as rnk
  from Department d
  join Employee e on e.departmentId = d.id
) a
where rnk <= 3

Time taken: 790 ms
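
For what it's worth, the plan for the window query can be inspected with
explain(). A minimal local sketch, assuming the two tables are registered as
Department and Employee in the session (names and setup are mine, not the
actual POC code):

import org.apache.spark.sql.SparkSession

object WindowPlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("window-plan")
      .master("local[*]")
      .getOrCreate()

    // explain() prints the physical plan. For this query I would expect an
    // Exchange (shuffle on d.name), a per-partition Sort, and a Window
    // operator - i.e. a shuffle plus a sort per partition key, rather than
    // the whole dataset sitting in one window.
    spark.sql(
      """select Department, Employee, Salary
        |from (
        |  select d.name as Department, e.name as Employee, e.salary as Salary,
        |         dense_rank() over (partition by d.name order by e.salary desc) as rnk
        |  from Department d join Employee e on e.departmentId = d.id
        |) a
        |where rnk <= 3""".stripMargin
    ).explain()

    spark.stop()
  }
}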

Aggregation Query:

-- keep rows that have fewer than 3 distinct higher salaries in the department
select d.name as Department,
       e.name as Employee,
       e.salary as Salary
from Employee e
join Department d on e.departmentId = d.id
where (select count(distinct e2.salary)
       from Employee e2
       where e2.departmentId = e.departmentId
         and e2.salary > e.salary) < 3

Time taken: 1212 ms

But as per my understanding, the aggregation should have run faster. So my
whole point is: if the dataset is huge, should I force some kind of
map-reduce-style job, e.g. the Dataset API's
df.groupByKey(...).reduceGroups(...)?
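
For illustration, here is a minimal sketch of that idea, assuming Datasets
backed by the same two tables (the case classes and variable names are mine,
not from the POC). Since reduceGroups collapses each group to a single row, a
top-3 needs flatMapGroups instead:

import org.apache.spark.sql.SparkSession

// Hypothetical case classes mirroring the two tables in the POC.
case class Employee(name: String, salary: Long, departmentId: Long)
case class Department(id: Long, name: String)

object Top3PerDept {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("top3-per-dept")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val employees = spark.table("Employee").as[Employee]
    val departments = spark.table("Department").as[Department]

    // Join, group by department name, and keep the 3 highest salaries per
    // group. Note: flatMapGroups buffers each group on the executor, and
    // take(3) drops ties beyond the third row, unlike dense_rank.
    val top3 = employees
      .joinWith(departments, employees("departmentId") === departments("id"))
      .groupByKey { case (_, d) => d.name }
      .flatMapGroups { (dept, rows) =>
        rows.toSeq
          .sortBy { case (e, _) => -e.salary }
          .take(3)
          .map { case (e, _) => (dept, e.name, e.salary) }
      }
      .toDF("Department", "Employee", "Salary")

    top3.show()
  }
}

As far as I can tell this still shuffles by the same key as the window query,
so I am not sure it would actually beat the window plan - which is exactly
what I would like guidance on.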

So I think the aggregation query is taking more time because the dataset here
is small, and map-reduce-style execution tends to pay off only at large data
volumes. I haven't tested this on big data yet, but I would appreciate some
expert guidance here.

Please correct me if I am wrong.

TIA,
Sid
