Yeah, codegen would be a good improvement. PRs are welcome!
On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang wrote:
> That’s right. By default, Spark prefers sort merge join.
>
> However, in our production environment, there are many huge bucketed tables. We
> can leverage the bucketing to avoid shuffle when joi
Yeah.. let's stick to Python 3 in general ..
I plan to drop Python 2 completely right after Spark 3.0 release.
The exception you're facing .. it seems like run_cmd now produces unicode instead
of bytes in Python 2 with the merge script. Later, it seems there is an attempt
to cast this unicode to bytes implicit
Hm, it would be better in my view if we could build a customized resource manager
outside of core; otherwise, we have to go through a long discussion in the
community :)
But if we support that, why are the Mesos/YARN/K8s resource managers still there
in the tree?
On Fri, Nov 8, 2019 at 10:18 PM Tom Graves wrote:
That’s right. By default, Spark prefers sort merge join.
However, in our production environment, there are many huge bucketed tables. We can
leverage the bucketing to avoid a shuffle when joining with other small tables (the
small tables are not small enough to use a broadcast join). The problem is
that, al
By default, sort merge join is preferred over shuffle hash join; that's why
we haven't spent resources on implementing codegen for it.
On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote:
> There are some cases where shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that Sh