[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169416019 > When nondeterministic expressions is used with table attribute, I think pull out it should depend on the attribute. What do you think? Why? Multiplying by

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169524471 @marmbrus you are right. But i think @zhonghaihua 's solution is try to reduce cartesian product possibility, right? --- If your project is set up for it, you can

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169525153 Oh, I see what you are trying to do. Hmm, this feels like a hack to me. If we want to fix skewed joins I'm thinking we need a more principled solution. --- If your

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169563822 It's difference from join selection, it just pull out nondeterministic expressions of join condition to the left or right children, but it seems it can reuse the code

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169524856 How does this make a difference in join selection? I think the logic in `ExtractEquiJoinKeys` is already smart enough to do this correctly. Even if that was not the

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169089482 This seems like a reasonable thing to do, but the implementation seems unnecessarily complex. Why not just: - `transform` the condition, matching on non

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169089505 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169093496 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169093176 **[Test build #48778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)** for PR 10128 at commit

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169093483 **[Test build #48778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)** for PR 10128 at commit

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169093499 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread zhonghaihua
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169231327 @marmbrus Thanks for your suggestions. I think your idea can simply solve problem. But in some situations, this seems not very suitable. For example:

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169235179 **[Test build #48833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)** for PR 10128 at commit

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169257014 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169257017 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169256396 **[Test build #48833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)** for PR 10128 at commit

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-04 Thread zhonghaihua
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-168892937 @rxin @yhuai Could you verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-04 Thread zhonghaihua
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161969421 @cloud-fan Thanks for your advice. But, as @jeanlyn said,`Repartition` will deal with all data, and this PR will deal with join keys cause data skew. Because in

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-04 Thread cloud-fan
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-162135747 This makes sense, but I still wanna argue that spark SQL has `RepartitionByExpression` that deal with several columns... I think the main point is to be

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread cloud-fan
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161626933 I think it's not a good example to show that we need to allow nondeterministic expressions in join codition. We can use `Repartition` operator to fix data skew, like

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161675474 I can understand the motivation of this change, we do have workaround for relieving the data skew, but we probably don't want to change the existing SQL queries

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread jeanlyn
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161851642 @cloud-fan I think your case is different from @zhonghaihua 's. The sql only deal with some join keys ('' and null) before shuffle to handle those pointless key cause