Yea codegen can be a good improvement, PRs are welcome! On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang <[email protected]> wrote:
> That’s right. By default, Spark prefers sort merge join. > > While, in our product environment, there are many huge bucket tables. We > can leverage the bucketing to avoid shuffle when join with other small > tables (the small tables are not small enough to leverage broad cast join). > Problem is that, although shuffle can be avoid, sort is still necessary to > leverage sort merge join (we cannot pre-sort since there are different join > patterns). For a huge table, sort may take even tens of seconds. > > That’s why I’m trying to enable shuffle hash join, and for such cases, > there were almost 10% ~ 20% improvement when apply shuffle hash join > instead of sort merge join. I wonder if there is still some space to > improve shuffle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan <[email protected]> > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" <[email protected]> > *Cc: *"[email protected]" <[email protected]> > *Subject: *Re: Why not implement CodegenSupport in class > ShuffledHashJoinExec? > > > > By default sort merge join is preferred over shuffle hash join, that's why > we haven't spend resources to implement codegen for it. > > > > On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <[email protected]> > wrote: > > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that ShuffledHashJoinExec does not implement > CodegenSupport, is there any concern? And if there is any chance to improve > the performance of ShuffledHashJoinExec? > > > >
