Re: Difference between windowing functions and aggregation functions on big data

Sean Owen Sun, 27 Feb 2022 10:05:37 -0800

Those two queries are identical?

On Sun, Feb 27, 2022 at 11:30 AM Sid <flinkbyhe...@gmail.com> wrote:


> Hi Team,
>
> I am aware that if windowing functions are used, then at first it loads
> the entire dataset into one window,scans and then performs the other
> mentioned operations for that particular window which could be slower when
> dealing with trillions / billions of records.
>
> I did a POC where I used an example to find the max 3 highest salary for
> an employee per department. So, I wrote a below queries and compared the
> time for it:
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 790 ms
>
> Aggregation Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 1212 ms
>
> But as per my understanding, the aggregation should have run faster. So,
> my whole point is if the dataset is huge I should force some kind of map
> reduce jobs like we have an option called df.groupby().reduceByGroups()
>
> So I think the aggregation query is taking more time since the dataset
> size here is smaller and as we all know that map reduce works faster when
> there is a huge volume of data. Haven't tested it yet on big data but
> needed some expert guidance over here.
>
> Please correct me if I am wrong.
>
> TIA,
> Sid
>
>
>
>

Re: Difference between windowing functions and aggregation functions on big data

Reply via email to