Yea codegen can be a good improvement, PRs are welcome! On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang <gwa...@ebay.com> wrote:
> That’s right. By default, Spark prefers sort merge join. > > While, in our product environment, there are many huge bucket tables. We > can leverage the bucketing to avoid shuffle when join with other small > tables (the small tables are not small enough to leverage broad cast join). > Problem is that, although shuffle can be avoid, sort is still necessary to > leverage sort merge join (we cannot pre-sort since there are different join > patterns). For a huge table, sort may take even tens of seconds. > > That’s why I’m trying to enable shuffle hash join, and for such cases, > there were almost 10% ~ 20% improvement when apply shuffle hash join > instead of sort merge join. I wonder if there is still some space to > improve shuffle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan <cloud0...@gmail.com> > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" <gwa...@ebay.com.invalid> > *Cc: *"dev@spark.apache.org" <dev@spark.apache.org> > *Subject: *Re: Why not implement CodegenSupport in class > ShuffledHashJoinExec? > > > > By default sort merge join is preferred over shuffle hash join, that's why > we haven't spend resources to implement codegen for it. > > > > On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gwa...@ebay.com.invalid> > wrote: > > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that ShuffledHashJoinExec does not implement > CodegenSupport, is there any concern? And if there is any chance to improve > the performance of ShuffledHashJoinExec? > > > >