[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169563822
  
It's difference from join selection, it just pull out nondeterministic 
expressions of join condition to the left or right children, but it seems it 
can reuse the code of `ExtractEquiJoinKeys`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169525153
  
Oh, I see what you are trying to do.  Hmm, this feels like a hack to me.  
If we want to fix skewed joins I'm thinking we need a more principled solution.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169524856
  
How does this make a difference in join selection?  I think the logic in 
`ExtractEquiJoinKeys` is already smart enough to do this correctly.  Even if 
that was not the case we should not mix concerns here.  Rules should be kept as 
simple as possible to avoid bugs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169524471
  
@marmbrus you are right. But i think @zhonghaihua 's solution is try to 
reduce cartesian product possibility, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169416019
  
> When nondeterministic expressions is used with table attribute, I think 
pull out it should depend on the attribute. What do you think?

Why? Multiplying by 33 returns the same result wherever you do it.  Its not 
worth complicating the code unless its going to change the answer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169257014
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169257017
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48833/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169256396
  
**[Test build #48833 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)**
 for PR 10128 at commit 
[`b2a4a6b`](https://github.com/apache/spark/commit/b2a4a6bd193033512b400bb9434a92864d4553a8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169235179
  
**[Test build #48833 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48833/consoleFull)**
 for PR 10128 at commit 
[`b2a4a6b`](https://github.com/apache/spark/commit/b2a4a6bd193033512b400bb9434a92864d4553a8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread zhonghaihua
Github user zhonghaihua commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169231327
  
@marmbrus Thanks for your suggestions. I think your idea can simply solve 
problem. But in some situations, this seems not very suitable. 
For example:
`Join(testRelation, testRelation2, Inner,Some(And(EqualTo(a, b), 
EqualTo(Rand(33) * c, d` If `c` is an attribute belong to `testRelation2`, 
I think `Rand(33)` is more appropriately pulled out to the right child tree of 
`Join`, otherwise, if belong to `testRelation`, it is appropriately pulled out 
to left child tree. 

When `nondeterministic expressions` is used with `table attribute`, I think 
pull out it should depend on the attribute. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169093499
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48778/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169093483
  
**[Test build #48778 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)**
 for PR 10128 at commit 
[`6e16657`](https://github.com/apache/spark/commit/6e166578a5c1a1faf260389509663ac8c71ec015).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169093496
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169093176
  
**[Test build #48778 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)**
 for PR 10128 at commit 
[`6e16657`](https://github.com/apache/spark/commit/6e166578a5c1a1faf260389509663ac8c71ec015).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169089505
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169089482
  
This seems like a reasonable thing to do, but the implementation seems 
unnecessarily complex.  Why not just:
 - `transform` the condition, matching on non deterministic subtrees with 
no references.
 - when you find one create an alias and add it to an `ArrayBuffer`, 
replacing the tree with `alias.toAttribute`
 - if the array buffer is empty, return the same tree.  otherwise, add the 
two projections as you do now and use the transformed condition for the join.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-04 Thread zhonghaihua
Github user zhonghaihua commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-168892937
  
@rxin @yhuai Could you verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-04 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-162135747
  
This makes sense, but I still wanna argue that spark SQL has 
`RepartitionByExpression` that deal with several columns...

I think the main point is to be compatiable with existing SQL queries, but 
I'm not sure if it worth.
cc @rxin @marmbrus 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-04 Thread zhonghaihua
Github user zhonghaihua commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-161969421
  
@cloud-fan Thanks for your advice. But, as @jeanlyn said,`Repartition` will 
deal with all data, and this PR will deal with join keys cause data skew.
Because in some situations, we will use this operator to avoid data skew in 
`SQL`, then I think maybe we should support this. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-161851642
  
@cloud-fan I think your case is different from @zhonghaihua 's. The sql 
only deal with some join keys ('' and null) before shuffle to handle those 
pointless key cause skew during join operator, while `repartition` deal with 
all data before some map operator. This is two type data skew, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-161675474
  
I can understand the motivation of this change, we do have workaround for 
relieving the data skew, but we probably don't want to change the existing SQL 
queries based on legacy system (like Hive).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-161626933
  
I think it's not a good example to show that we need to allow 
nondeterministic expressions in join codition. We can use `Repartition` 
operator to fix data skew, like 
`sqlContext.table(tblName).repartition(numPartitions).registerTempTable`, which 
looks better than your random join approach. Do you find other cases that need 
to use  nondeterministic expressions in join codition?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org