Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
Yea codegen can be a good improvement, PRs are welcome!

On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang  wrote:

> That’s right. By default, Spark prefers sort merge join.
>
> While, in our product environment, there are many huge bucket tables. We
> can leverage the bucketing to avoid shuffle when join with other small
> tables (the small tables are not small enough to leverage broad cast join).
> Problem is that, although shuffle can be avoid, sort is still necessary to
> leverage sort merge join (we cannot pre-sort since there are different join
> patterns). For a huge table, sort may take even tens of seconds.
>
> That’s why I’m trying to enable shuffle hash join, and for such cases,
> there were almost 10% ~ 20% improvement when apply shuffle hash join
> instead of sort merge join. I wonder if there is still some space to
> improve shuffle hash join? Like code generation for ShuffledHashJoinExec or
> something….
>
>
>
> *From: *Wenchen Fan 
> *Date: *Sunday, November 10, 2019 at 5:57 PM
> *To: *"Wang, Gang" 
> *Cc: *"dev@spark.apache.org" 
> *Subject: *Re: Why not implement CodegenSupport in class
> ShuffledHashJoinExec?
>
>
>
> By default sort merge join is preferred over shuffle hash join, that's why
> we haven't spend resources to implement codegen for it.
>
>
>
> On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang 
> wrote:
>
> There are some cases, shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>
>


Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wang, Gang
That’s right. By default, Spark prefers sort merge join.
While, in our product environment, there are many huge bucket tables. We can 
leverage the bucketing to avoid shuffle when join with other small tables (the 
small tables are not small enough to leverage broad cast join). Problem is 
that, although shuffle can be avoid, sort is still necessary to leverage sort 
merge join (we cannot pre-sort since there are different join patterns). For a 
huge table, sort may take even tens of seconds.
That’s why I’m trying to enable shuffle hash join, and for such cases, there 
were almost 10% ~ 20% improvement when apply shuffle hash join instead of sort 
merge join. I wonder if there is still some space to improve shuffle hash join? 
Like code generation for ShuffledHashJoinExec or something….

From: Wenchen Fan 
Date: Sunday, November 10, 2019 at 5:57 PM
To: "Wang, Gang" 
Cc: "dev@spark.apache.org" 
Subject: Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

By default sort merge join is preferred over shuffle hash join, that's why we 
haven't spend resources to implement codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang  wrote:

There are some cases, shuffle hash join performs even better than sort merge 
join.

While, I noticed that ShuffledHashJoinExec does not implement CodegenSupport, 
is there any concern? And if there is any chance to improve the performance of 
ShuffledHashJoinExec?



Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join, that's why
we haven't spend resources to implement codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang  wrote:

> There are some cases, shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>


Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-09 Thread Wang, Gang
There are some cases, shuffle hash join performs even better than sort merge 
join.

While, I noticed that ShuffledHashJoinExec does not implement CodegenSupport, 
is there any concern? And if there is any chance to improve the performance of 
ShuffledHashJoinExec?