Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169416019
> When nondeterministic expressions are used with a table attribute, I think
pulling them out should depend on the attribute. What do you think?
Why? Multiplying by
Github user jeanlyn commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169524471
@marmbrus you are right. But I think @zhonghaihua's solution is trying to
reduce the possibility of a Cartesian product, right?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169525153
Oh, I see what you are trying to do. Hmm, this feels like a hack to me.
If we want to fix skewed joins I'm thinking we need a more principled solution.
Github user jeanlyn commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169563822
It's different from join selection; it just pulls nondeterministic
expressions out of the join condition into the left or right children, but it
seems it can reuse the code
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169524856
How does this make a difference in join selection? I think the logic in
`ExtractEquiJoinKeys` is already smart enough to do this correctly. Even if
that was not the
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169089482
This seems like a reasonable thing to do, but the implementation seems
unnecessarily complex. Why not just:
- `transform` the condition, matching on non
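The suggestion above (cut off mid-sentence) appears to be: `transform` the condition, match nondeterministic subexpressions, and replace each with a fresh attribute that is evaluated below the join. A toy, self-contained model of that rewrite, with hypothetical class and attribute names rather than real Catalyst code:

```scala
// Toy model of the suggested rewrite (hypothetical names, not Catalyst code):
// walk the condition, replace each nondeterministic subexpression with a fresh
// attribute, and collect (attribute, expression) pairs so they could be
// evaluated in a Project beneath the join.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Rand(seed: Long) extends Expr // nondeterministic
case class Add(left: Expr, right: Expr) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

def pullOut(e: Expr): (Expr, Seq[(Attr, Expr)]) = e match {
  case r: Rand => // nondeterministic leaf: swap in a named attribute
    val a = Attr(s"_nondeterministic_${r.seed}")
    (a, Seq(a -> r))
  case Add(l, r) =>
    val (l2, pl) = pullOut(l); val (r2, pr) = pullOut(r)
    (Add(l2, r2), pl ++ pr)
  case EqualTo(l, r) =>
    val (l2, pl) = pullOut(l); val (r2, pr) = pullOut(r)
    (EqualTo(l2, r2), pl ++ pr)
  case other => (other, Nil) // deterministic leaf: leave as-is
}

// a.key + rand(42) = b.key  becomes  a.key + _nondeterministic_42 = b.key
val (cond, projected) =
  pullOut(EqualTo(Add(Attr("a.key"), Rand(42L)), Attr("b.key")))
```

In real Catalyst terms this corresponds to pattern-matching on `e.deterministic` inside a `transform`, but the recursion above shows the shape of the rewrite.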
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169089505
ok to test
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169093496
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169093176
**[Test build #48778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)** for PR 10128 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169093483
**[Test build #48778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)** for PR 10128 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169093499
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user zhonghaihua commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169231327
@marmbrus Thanks for your suggestions. I think your idea can solve the
problem simply, but in some situations it seems not very suitable.
For example:
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169235179
**[Test build #48833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)** for PR 10128 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169257014
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169257017
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-169256396
**[Test build #48833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)** for PR 10128 at commit
Github user zhonghaihua commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-168892937
@rxin @yhuai Could you verify this patch?
Github user zhonghaihua commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-161969421
@cloud-fan Thanks for your advice. But, as @jeanlyn said, `Repartition` will
process all the data, while this PR only deals with the join keys that cause
data skew. Because in
Github user cloud-fan commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-162135747
This makes sense, but I still want to argue that Spark SQL has
`RepartitionByExpression`, which deals with several columns...
I think the main point is to be
Github user cloud-fan commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-161626933
I think it's not a good example to show that we need to allow
nondeterministic expressions in join conditions. We can use the `Repartition`
operator to fix data skew, like
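The comment is cut off, but the general idea behind repartitioning to relieve skew can be shown with plain Scala (no Spark dependency; partition counts and key names here are made up for illustration): hashing on the key alone sends every row with a hot key to one partition, while adding a random expression to the hash spreads them out.

```scala
// Plain-Scala sketch (assumed numbers, not Spark code) of why repartitioning
// on (key, rand) relieves skew.
val numPartitions = 8
val hotRows = Seq.fill(100)("hot_key")

// Hash on the key only: all 100 rows land in the same partition.
val byKey = hotRows.map(k => math.floorMod(k.hashCode, numPartitions)).distinct

// Hash on (key, rand): the same rows now land on several partitions.
val rnd = new scala.util.Random(0)
val byKeyAndRand = hotRows
  .map(k => math.floorMod((k, rnd.nextInt(numPartitions)).hashCode, numPartitions))
  .distinct
```

This is essentially what `RepartitionByExpression` over `(key, rand())` would do at the shuffle level, at the cost of reshuffling all rows rather than only the skewed ones.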
Github user chenghao-intel commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-161675474
I can understand the motivation of this change; we do have workarounds for
relieving data skew, but we probably don't want to change the existing SQL
queries
Github user jeanlyn commented on the pull request:
https://github.com/apache/spark/pull/10128#issuecomment-161851642
@cloud-fan I think your case is different from @zhonghaihua's. The SQL
only deals with some join keys ('' and null) before the shuffle, to handle
those pointless keys that cause
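A minimal sketch of the pre-shuffle trick described above, under the assumed semantics that rows whose join key is `''` or null can never match anything: rewriting those keys to a random salt before the shuffle spreads the skewed rows across partitions without changing the join result. Names here (`saltPointlessKeys`, the salt prefix) are hypothetical.

```scala
// Rows keyed by '' or null cannot join with anything, so salting just those
// keys spreads them across shuffle partitions; real keys are left untouched.
def saltPointlessKeys(rows: Seq[(String, Int)], buckets: Int): Seq[(String, Int)] = {
  val rnd = new scala.util.Random(0)
  rows.map {
    case (k, v) if k == null || k.isEmpty => (s"__skew_salt_${rnd.nextInt(buckets)}", v)
    case kv => kv
  }
}

val salted = saltPointlessKeys(Seq(("", 1), (null, 2), ("real_key", 3)), buckets = 16)
```

In SQL this is typically written as a `CASE WHEN key IS NULL OR key = '' THEN concat('salt_', rand()) ELSE key END` in the join input, which is exactly the kind of nondeterministic join-condition expression this PR is about.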