Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
Yea codegen can be a good improvement, PRs are welcome!
On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang wrote:
> That’s right. By default, Spark prefers sort merge join.
>
> However, in our production environment, there are many huge bucketed tables. We
> can leverage the bucketing to avoid a shuffle when

Re: dev/merge_spark_pr.py broken on python 2

2019-11-10 Thread Hyukjin Kwon
Yeah, let's stick to Python 3 in general. I plan to drop Python 2 completely right after the Spark 3.0 release. As for the exception you hit, it seems run_cmd now produces unicode instead of bytes in Python 2 with the merge script. Later, it seems this unicode is attempted to be cast to bytes
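The bytes-versus-text mismatch described above can be sketched roughly as follows. This is only an illustration, not the code in dev/merge_spark_pr.py: the helper name `run_cmd_text` and its `encoding` parameter are hypothetical, and the sketch assumes the command's output is UTF-8. The idea is to decode the result of `subprocess.check_output` exactly once, so every caller receives text and no later step has to convert between the two types.

```python
import subprocess
import sys


def run_cmd_text(cmd, encoding="utf-8"):
    """Run a command and return its output as text (hypothetical helper)."""
    # Accept either a list of arguments or a space-separated string.
    args = cmd if isinstance(cmd, list) else cmd.split(" ")
    output = subprocess.check_output(args)
    # check_output returns bytes here; decode once so callers always
    # work with text. The UTF-8 default is an assumption of this sketch.
    if isinstance(output, bytes):
        return output.decode(encoding)
    return output


if __name__ == "__main__":
    # Use the current interpreter so the example needs no external tools.
    text = run_cmd_text([sys.executable, "-c", "print('merged')"])
    print(text.strip())
```

Normalizing at the boundary keeps the rest of a script free of per-call-site `encode`/`decode` calls, which is where mixed Python 2 and Python 3 behavior tends to surface.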

Re: Build customized resource manager

2019-11-10 Thread Klaus Ma
Hm, it would be better for me if we could build a customized resource manager outside of core; otherwise, we have to go through a long discussion in the community :) But if we support that, why are the Mesos/YARN/K8s resource managers still in the tree?
On Fri, Nov 8, 2019 at 10:18 PM Tom Graves wrote:

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wang, Gang
That’s right. By default, Spark prefers sort merge join. However, in our production environment, there are many huge bucketed tables. We can leverage the bucketing to avoid a shuffle when joining with other small tables (the small tables are not small enough to use a broadcast join). The problem is that,

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default, sort merge join is preferred over shuffle hash join; that's why we haven't spent resources on implementing codegen for it.
On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote:
> In some cases, shuffle hash join performs even better than sort
> merge join.
>
> However, I noticed that
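For readers unfamiliar with the two strategies compared in this thread, here is a toy in-memory sketch of each. It is not Spark's implementation and does not model shuffling, spilling, or code generation; the function names `hash_join` and `sort_merge_join` and the `(key, value)` row layout are assumptions made for illustration. It shows why a hash join can win when one side fits in memory: it skips the sort on both inputs.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner join: build a hash table on `right`, then probe it with `left`."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    # Probe phase: one lookup per left row, no sorting required.
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def sort_merge_join(left, right):
    """Inner join: sort both sides by key, then merge them in one pass."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, rv in right[j:j_end]:
                    out.append((left_key, left[i][1], rv))
                i += 1
            j = j_end
    return out


if __name__ == "__main__":
    big = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    small = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    # Both strategies produce the same set of joined rows.
    print(sorted(hash_join(big, small)) == sorted(sort_merge_join(big, small)))
```

The trade-off in this sketch is memory versus sorting cost: the hash join holds the whole build side in a dictionary, while the sort merge join only needs sorted inputs and a pair of cursors.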