Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Raghavendra Ganesh
What is optimal depends on the context of the problem.
Is the intent here to find the best solution for top n values with a group
by ?

Both the solutions look sub-optimal to me. Window function would be
expensive as it needs an order by (which a top n solution shouldn't need).
It would be best to just group by department and use an aggregate function
which stores the top n values in a heap.
--
Raghavendra


On Mon, Feb 28, 2022 at 12:01 AM Sid  wrote:

> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window,scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote a below queries and compared the
>>> time for it:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Mich Talebzadeh
AM I correct that with

.. WHERE (SELECT COUNT(DISTINCT(Salary))..

You will have to shuffle because of DISTINCTas each worker will have to
read data separately and perform the reduce task to get the local
distinct value
and one final shuffle to get the actual distinct
for all the data?



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 27 Feb 2022 at 20:31, Sean Owen  wrote:

> "count distinct' does not have that problem, whether in a group-by or not.
> I'm still not sure these are equivalent queries but maybe not seeing it.
> Windowing makes sense when you need the whole window, or when you need
> sliding windows to express the desired groups.
> It may be unnecessary when your query does not need the window, just a
> summary stat like 'max'. Depends.
>
> On Sun, Feb 27, 2022 at 2:14 PM Bjørn Jørgensen 
> wrote:
>
>> You are using distinct which collects everything to the driver. Soo use
>> the other one :)
>>
>> søn. 27. feb. 2022 kl. 21:00 skrev Sid :
>>
>>> Basically, I am trying two different approaches for the same problem and
>>> my concern is how it will behave in the case of big data if you talk about
>>> millions of records. Which one would be faster? Is using windowing
>>> functions a better way since it will load the entire dataset into a single
>>> window and do the operations?
>>>
>>
>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
"count distinct' does not have that problem, whether in a group-by or not.
I'm still not sure these are equivalent queries but maybe not seeing it.
Windowing makes sense when you need the whole window, or when you need
sliding windows to express the desired groups.
It may be unnecessary when your query does not need the window, just a
summary stat like 'max'. Depends.

On Sun, Feb 27, 2022 at 2:14 PM Bjørn Jørgensen 
wrote:

> You are using distinct which collects everything to the driver. Soo use
> the other one :)
>
> søn. 27. feb. 2022 kl. 21:00 skrev Sid :
>
>> Basically, I am trying two different approaches for the same problem and
>> my concern is how it will behave in the case of big data if you talk about
>> millions of records. Which one would be faster? Is using windowing
>> functions a better way since it will load the entire dataset into a single
>> window and do the operations?
>>
>
>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Bjørn Jørgensen
You are using distinct which collects everything to the driver. Soo use the
other one :)

søn. 27. feb. 2022 kl. 21:00 skrev Sid :

> Basically, I am trying two different approaches for the same problem and
> my concern is how it will behave in the case of big data if you talk about
> millions of records. Which one would be faster? Is using windowing
> functions a better way since it will load the entire dataset into a single
> window and do the operations?
>
> On Mon, Feb 28, 2022 at 12:26 AM Sean Owen  wrote:
>
>> Those queries look like they do fairly different things. One is selecting
>> top employees by salary, the other is ... selecting where there are less
>> than 3 distinct salaries or something.
>> Not sure what the intended comparison is then; these are not equivalent
>> ways of doing the same thing, or does not seem so as far as I can see.
>>
>> On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:
>>
>>> My bad.
>>>
>>> Aggregation Query:
>>>
>>> # Write your MySQL query statement below
>>>
>>>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>>> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>>>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>>> ORDER by E.DepartmentId, E.Salary DESC
>>>
>>> Time Taken: 1212 ms
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time Taken: 790 ms
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>>>
 Those two queries are identical?

 On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

> Hi Team,
>
> I am aware that if windowing functions are used, then at first it
> loads the entire dataset into one window,scans and then performs the other
> mentioned operations for that particular window which could be slower when
> dealing with trillions / billions of records.
>
> I did a POC where I used an example to find the max 3 highest salary
> for an employee per department. So, I wrote a below queries and compared
> the time for it:
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc)
> as rnk from Department d join Employee e on e.departmentId=d.id ) a
> where rnk<=3
>
> Time taken: 790 ms
>
> Aggregation Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc)
> as rnk from Department d join Employee e on e.departmentId=d.id ) a
> where rnk<=3
>
> Time taken: 1212 ms
>
> But as per my understanding, the aggregation should have run faster.
> So, my whole point is if the dataset is huge I should force some kind of
> map reduce jobs like we have an option called 
> df.groupby().reduceByGroups()
>
> So I think the aggregation query is taking more time since the dataset
> size here is smaller and as we all know that map reduce works faster when
> there is a huge volume of data. Haven't tested it yet on big data but
> needed some expert guidance over here.
>
> Please correct me if I am wrong.
>
> TIA,
> Sid
>
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Basically, I am trying two different approaches for the same problem and my
concern is how it will behave in the case of big data if you talk about
millions of records. Which one would be faster? Is using windowing
functions a better way since it will load the entire dataset into a single
window and do the operations?

On Mon, Feb 28, 2022 at 12:26 AM Sean Owen  wrote:

> Those queries look like they do fairly different things. One is selecting
> top employees by salary, the other is ... selecting where there are less
> than 3 distinct salaries or something.
> Not sure what the intended comparison is then; these are not equivalent
> ways of doing the same thing, or does not seem so as far as I can see.
>
> On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:
>
>> My bad.
>>
>> Aggregation Query:
>>
>> # Write your MySQL query statement below
>>
>>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>> ORDER by E.DepartmentId, E.Salary DESC
>>
>> Time Taken: 1212 ms
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time Taken: 790 ms
>>
>> Thanks,
>> Sid
>>
>>
>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>>
>>> Those two queries are identical?
>>>
>>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>>
 Hi Team,

 I am aware that if windowing functions are used, then at first it loads
 the entire dataset into one window,scans and then performs the other
 mentioned operations for that particular window which could be slower when
 dealing with trillions / billions of records.

 I did a POC where I used an example to find the max 3 highest salary
 for an employee per department. So, I wrote a below queries and compared
 the time for it:

 Windowing Query:

 select Department,Employee,Salary from (
 select d.name as Department, e.name as Employee,e.salary as
 Salary,dense_rank() over(partition by d.name order by e.salary desc)
 as rnk from Department d join Employee e on e.departmentId=d.id ) a
 where rnk<=3

 Time taken: 790 ms

 Aggregation Query:

 select Department,Employee,Salary from (
 select d.name as Department, e.name as Employee,e.salary as
 Salary,dense_rank() over(partition by d.name order by e.salary desc)
 as rnk from Department d join Employee e on e.departmentId=d.id ) a
 where rnk<=3

 Time taken: 1212 ms

 But as per my understanding, the aggregation should have run faster.
 So, my whole point is if the dataset is huge I should force some kind of
 map reduce jobs like we have an option called df.groupby().reduceByGroups()

 So I think the aggregation query is taking more time since the dataset
 size here is smaller and as we all know that map reduce works faster when
 there is a huge volume of data. Haven't tested it yet on big data but
 needed some expert guidance over here.

 Please correct me if I am wrong.

 TIA,
 Sid






Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Enrico,

Thanks for your time :)

Consider a huge data volume scenario, If I don't use any keywords like
distinct, which one would be faster ? Window with partitionBy or normal SQL
aggregation methods? and how does df.groupBy().reduceByGroups() work
internally ?

Thanks,
Sid

On Mon, Feb 28, 2022 at 12:59 AM Enrico Minack 
wrote:

> Sid,
>
> Your Aggregation Query selects all employees where less than three
> distinct salaries exist that are larger. So, both queries seem to do the
> same.
>
> The Windowing Query is explicit in what it does: give me the rank for
> salaries per department in the given order and pick the top 3 per
> department.
>
> The Aggregation Query is trying to get to this conclusion by constructing
> some comparison. The former is the better approach, the second scales badly
> as this is done by counting distinct salaries that are larger than each
> salary in E. This looks like a Cartesian product of Employees. You make
> this very hard to optimize or execute by the query engine.
>
> And as you say, your example is very small, so this will not give any
> insights into big data.
>
> Enrico
>
>
> Am 27.02.22 um 19:30 schrieb Sid:
>
> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window,scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote a below queries and compared the
>>> time for it:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>
>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Enrico Minack

Sid,

Your Aggregation Query selects all employees where less than three 
distinct salaries exist that are larger. So, both queries seem to do the 
same.


The Windowing Query is explicit in what it does: give me the rank for 
salaries per department in the given order and pick the top 3 per 
department.


The Aggregation Query is trying to get to this conclusion by 
constructing some comparison. The former is the better approach, the 
second scales badly as this is done by counting distinct salaries that 
are larger than each salary in E. This looks like a Cartesian product of 
Employees. You make this very hard to optimize or execute by the query 
engine.


And as you say, your example is very small, so this will not give any 
insights into big data.


Enrico


Am 27.02.22 um 19:30 schrieb Sid:

My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (
select d.name  as Department, e.name 
 as Employee,e.salary as Salary,dense_rank() 
over(partition by d.name  order by e.salary desc) as 
rnk from Department d join Employee e on e.departmentId=d.id 
 ) a where rnk<=3


Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:

Those two queries are identical?

On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

Hi Team,

I am aware that if windowing functions are used, then at first
it loads the entire dataset into one window,scans and then
performs the other mentioned operations for that particular
window which could be slower when dealing with trillions /
billions of records.

I did a POC where I used an example to find the max 3 highest
salary for an employee per department. So, I wrote a below
queries and compared the time for it:

Windowing Query:

select Department,Employee,Salary from (
select d.name  as Department, e.name
 as Employee,e.salary as Salary,dense_rank()
over(partition by d.name  order by e.salary
desc) as rnk from Department d join Employee e on
e.departmentId=d.id  ) a where rnk<=3

Time taken: 790 ms

Aggregation Query:

select Department,Employee,Salary from (
select d.name  as Department, e.name
 as Employee,e.salary as Salary,dense_rank()
over(partition by d.name  order by e.salary
desc) as rnk from Department d join Employee e on
e.departmentId=d.id  ) a where rnk<=3

Time taken: 1212 ms

But as per my understanding, the aggregation should have run
faster. So, my whole point is if the dataset is huge I should
force some kind of map reduce jobs like we have an option
called df.groupby().reduceByGroups()

So I think the aggregation query is taking more time since the
dataset size here is smaller and as we all know that map
reduce works faster when there is a huge volume of data.
Haven't tested it yet on big data but needed some expert
guidance over here.

Please correct me if I am wrong.

TIA,
Sid




Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those queries look like they do fairly different things. One is selecting
top employees by salary, the other is ... selecting where there are less
than 3 distinct salaries or something.
Not sure what the intended comparison is then; these are not equivalent
ways of doing the same thing, or does not seem so as far as I can see.

On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:

> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window,scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote a below queries and compared the
>>> time for it:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
   WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:

> Those two queries are identical?
>
> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>
>> Hi Team,
>>
>> I am aware that if windowing functions are used, then at first it loads
>> the entire dataset into one window,scans and then performs the other
>> mentioned operations for that particular window which could be slower when
>> dealing with trillions / billions of records.
>>
>> I did a POC where I used an example to find the max 3 highest salary for
>> an employee per department. So, I wrote a below queries and compared the
>> time for it:
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 790 ms
>>
>> Aggregation Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 1212 ms
>>
>> But as per my understanding, the aggregation should have run faster. So,
>> my whole point is if the dataset is huge I should force some kind of map
>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>
>> So I think the aggregation query is taking more time since the dataset
>> size here is smaller and as we all know that map reduce works faster when
>> there is a huge volume of data. Haven't tested it yet on big data but
>> needed some expert guidance over here.
>>
>> Please correct me if I am wrong.
>>
>> TIA,
>> Sid
>>
>>
>>
>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those two queries are identical?

On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

> Hi Team,
>
> I am aware that if windowing functions are used, then at first it loads
> the entire dataset into one window,scans and then performs the other
> mentioned operations for that particular window which could be slower when
> dealing with trillions / billions of records.
>
> I did a POC where I used an example to find the max 3 highest salary for
> an employee per department. So, I wrote a below queries and compared the
> time for it:
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 790 ms
>
> Aggregation Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 1212 ms
>
> But as per my understanding, the aggregation should have run faster. So,
> my whole point is if the dataset is huge I should force some kind of map
> reduce jobs like we have an option called df.groupby().reduceByGroups()
>
> So I think the aggregation query is taking more time since the dataset
> size here is smaller and as we all know that map reduce works faster when
> there is a huge volume of data. Haven't tested it yet on big data but
> needed some expert guidance over here.
>
> Please correct me if I am wrong.
>
> TIA,
> Sid
>
>
>
>


Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Team,

I am aware that if windowing functions are used, then at first it loads the
entire dataset into one window,scans and then performs the other mentioned
operations for that particular window which could be slower when dealing
with trillions / billions of records.

I did a POC where I used an example to find the max 3 highest salary for an
employee per department. So, I wrote a below queries and compared the time
for it:

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 790 ms

Aggregation Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 1212 ms

But as per my understanding, the aggregation should have run faster. So, my
whole point is if the dataset is huge I should force some kind of map
reduce jobs like we have an option called df.groupby().reduceByGroups()

So I think the aggregation query is taking more time since the dataset size
here is smaller and as we all know that map reduce works faster when there
is a huge volume of data. Haven't tested it yet on big data but needed some
expert guidance over here.

Please correct me if I am wrong.

TIA,
Sid