[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2017-01-31 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r98751522
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

If we do it in the optimizer, it does not cover all the cases, right? 
Cartesian Product might be chosen based on the statistics.

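For illustration, a hedged spark-shell sketch of this concern (column names are made
up): the logical plan below carries a join condition, so an optimizer-time rule would
not flag it, yet the planner can still fall back to a cartesian-product style physical
join because there are no equi-join keys, and which operator it picks can depend on
table sizes and statistics.

    val r = spark.range(10).toDF("r")
    val s = spark.range(10).toDF("s")
    // non-equi condition only: passes a logical-plan check, but the physical plan
    // may be a CartesianProduct or BroadcastNestedLoopJoin plus a filter
    r.join(s, r("r") < s("s")).explain()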




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2017-01-31 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r98751322
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

If we do it in the optimizer, it does not cover all the cases, right? 
Cartesian Product might be chosen based on the statistics.






[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2017-01-31 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r98750834
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

If we do it here, it does not cover all the cases, right? `Cartesian 
Product` might be chosen based on the statistics. 





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14866





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-02 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77383666
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2265,7 +2265,7 @@ setMethod("join",
   signature(x = "SparkDataFrame", y = "SparkDataFrame"),
   function(x, y, joinExpr = NULL, joinType = NULL) {
 if (is.null(joinExpr)) {
-  sdf <- callJMethod(x@sdf, "join", y@sdf)
+  sdf <- callJMethod(x@sdf, "crossJoin", y@sdf)
 } else {
   if (class(joinExpr) != "Column") stop("joinExpr must be a 
Column")
   if (is.null(joinType)) {
--- End diff --

could you add "cross" to the API doc - L2246





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-02 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77380304
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2265,7 +2265,7 @@ setMethod("join",
   signature(x = "SparkDataFrame", y = "SparkDataFrame"),
   function(x, y, joinExpr = NULL, joinType = NULL) {
 if (is.null(joinExpr)) {
-  sdf <- callJMethod(x@sdf, "join", y@sdf)
+  sdf <- callJMethod(x@sdf, "crossJoin", y@sdf)
 } else {
   if (class(joinExpr) != "Column") stop("joinExpr must be a 
Column")
   if (is.null(joinType)) {
--- End diff --

I've added cross to the allowable join types. With a joinExpr, I think inner is 
appropriate.

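A hedged usage sketch of the resulting API surface, shown with the Scala Dataset API
(the R change in this diff routes to the same JVM methods; names are made up):

    val left  = spark.range(3).toDF("a")
    val right = spark.range(3).toDF("b")
    left.crossJoin(right)                                // explicit cross join, no condition
    left.join(right, left("a") === right("b"))           // default inner join with a condition
    left.join(right, left("a") === right("b"), "cross")  // "cross" also accepted as a joinType string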




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77290718
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2265,7 +2265,7 @@ setMethod("join",
   signature(x = "SparkDataFrame", y = "SparkDataFrame"),
   function(x, y, joinExpr = NULL, joinType = NULL) {
 if (is.null(joinExpr)) {
-  sdf <- callJMethod(x@sdf, "join", y@sdf)
+  sdf <- callJMethod(x@sdf, "crossJoin", y@sdf)
 } else {
   if (class(joinExpr) != "Column") stop("joinExpr must be a 
Column")
   if (is.null(joinType)) {
--- End diff --

should the next line be "crossJoin" too?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-01 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77219760
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

As discussed, while it is probably best for this to be in analysis (it does throw an 
analysis exception, after all), I'm leaving it here until we want to reorganize the 
analysis/optimizer rules.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77155243
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

Yeah it is the earliest possible place. It is just not something I would 
expect from optimization.

If we are to add it in QueryExecution, then we should do this after optimization and 
before planning: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L76-L81

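A hedged sketch of that alternative, assuming the QueryExecution members as they
existed around Spark 2.0 (this is not what the PR ultimately does):

    // run the check on the optimized plan, just before physical planning
    lazy val sparkPlan: SparkPlan = {
      CheckCartesianProducts(sparkSession.sessionState.conf)(optimizedPlan)
      planner.plan(ReturnAnswer(optimizedPlan)).next()
    }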




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77099268
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -39,39 +38,46 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
*
* The joined plan are picked from left to right, prefer those has at 
least one join condition.
*
-   * @param input a list of LogicalPlans to join.
+   * @param input a list of LogicalPlans to inner join and the type of 
inner join.
* @param conditions a list of condition for join.
*/
   @tailrec
-  def createOrderedJoin(input: Seq[LogicalPlan], conditions: 
Seq[Expression]): LogicalPlan = {
+  def createOrderedJoin(input: Seq[(LogicalPlan, InnerLike)], conditions: 
Seq[Expression])
+: LogicalPlan = {
 assert(input.size >= 2)
 if (input.size == 2) {
--- End diff --

Sameer helped me interpret this comment. I *think* I've addressed this.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77098768
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala 
---
@@ -28,6 +28,7 @@ object JoinType {
 case "rightouter" | "right" => RightOuter
 case "leftsemi" => LeftSemi
 case "leftanti" => LeftAnti
+case "cross" => Cross
 case _ =>
   val supported = Seq(
--- End diff --

done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77098708
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

This is just the earliest possible place to do the check. (In fact, it ideally belongs 
in analysis.) Could you elaborate on triggering this in QueryExecution?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77097569
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join is a cartesian product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(join: Join)
+  : Boolean = {
--- End diff --

Done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77089271
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -39,39 +38,46 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
*
* The joined plan are picked from left to right, prefer those has at 
least one join condition.
*
-   * @param input a list of LogicalPlans to join.
+   * @param input a list of LogicalPlans to inner join and the type of 
inner join.
* @param conditions a list of condition for join.
*/
   @tailrec
-  def createOrderedJoin(input: Seq[LogicalPlan], conditions: 
Seq[Expression]): LogicalPlan = {
+  def createOrderedJoin(input: Seq[(LogicalPlan, InnerLike)], conditions: 
Seq[Expression])
+: LogicalPlan = {
 assert(input.size >= 2)
 if (input.size == 2) {
--- End diff --

using a pattern match might be simpler here.

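A hedged, simplified illustration of the suggestion, using plain strings in place of
LogicalPlans (not the Catalyst code itself): match on the shape of the sequence rather
than testing input.size == 2.

    // base case: exactly two plans left, emit the final join;
    // otherwise join the first two and recurse on the shortened list
    @scala.annotation.tailrec
    def combine(input: Seq[String]): String = input match {
      case Seq(left, right) => s"Join($left, $right)"
      case left +: rest     => combine(s"Join($left, ${rest.head})" +: rest.tail)
    }

    combine(Seq("R", "S", "T"))   // Join(Join(R, S), T)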




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77086020
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala 
---
@@ -28,6 +28,7 @@ object JoinType {
 case "rightouter" | "right" => RightOuter
 case "leftsemi" => LeftSemi
 case "leftanti" => LeftAnti
+case "cross" => Cross
 case _ =>
   val supported = Seq(
--- End diff --

also add cross to the list.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77085681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
--- End diff --

More general remark. Why is this part of optimization? We could for 
instance also trigger this check in `QueryExecution`.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77084851
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join is a cartesian product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(join: Join)
+  : Boolean = {
--- End diff --

nit: put the Boolean on the same line as the def.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77069424
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join specified by left, right and condition is a cartesian 
product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: 
Option[Expression])
+  : Boolean = {
+val conditions = 
condition.map(splitConjunctivePredicates).getOrElse(Nil)
+!conditions.map(_.references).exists(refs => 
refs.exists(left.outputSet.contains)
+&& refs.exists(right.outputSet.contains))
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | 
FullOuter, condition)
+if isCartesianProduct(left, right, condition) =>
+  throw new AnalysisException(
+s"""Detected cartesian product for ${j.joinType.sql} join 
between logical plans
+   |${left.treeString(false).trim}
+   |and
+   |${right.treeString(false).trim}
+   |Use a CROSS JOIN to allow cartesian products between these 
relations""".stripMargin)
--- End diff --

done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77067766
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with 
CatalystConf with Logging {
   override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL)
 
   override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL)
+
+  override def allowCartesianProduct: Boolean = 
getConf(CROSS_JOINS_ENABLED)
--- End diff --

As discussed, I only have the crossJoinEnabled flag, which has been exposed 
in CatalystConf. Also as discussed, deprecating crossJoinEnabled and switching 
the conf flag to requireCrossJoinSyntax is a good idea; I will address that in a 
follow-up PR.

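For reference, a hedged usage sketch; the SQL key behind CROSS_JOINS_ENABLED /
crossJoinEnabled is assumed here to be spark.sql.crossJoin.enabled:

    // opt out of the check for the whole session, i.e. allow implicit cartesian products again
    spark.conf.set("spark.sql.crossJoin.enabled", "true")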




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77067849
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
 ---
@@ -353,7 +353,7 @@ case class BroadcastNestedLoopJoinExec(
 val broadcastedRelation = 
broadcast.executeBroadcast[Array[InternalRow]]()
--- End diff --

The physical plan checks have been removed.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77049987
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -39,39 +38,44 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
*
* The joined plan are picked from left to right, prefer those has at 
least one join condition.
*
-   * @param input a list of LogicalPlans to join.
+   * @param input a list LogicalPlans to inner join, and a bool speciying 
if an explicit cross join
--- End diff --

fixed





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77049867
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
@@ -303,7 +303,7 @@ package object dsl {
 
   def join(
 otherPlan: LogicalPlan,
-joinType: JoinType = Inner,
+joinType: JoinType = Inner(false),
--- End diff --

Now renamed.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-31 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r77049426
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join specified by left, right and condition is a cartesian 
product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: 
Option[Expression])
--- End diff --

done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76885747
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala 
---
@@ -39,39 +38,44 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
*
* The joined plan are picked from left to right, prefer those has at 
least one join condition.
*
-   * @param input a list of LogicalPlans to join.
+   * @param input a list LogicalPlans to inner join, and a bool speciying 
if an explicit cross join
--- End diff --

nit: typo





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76885372
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
@@ -303,7 +303,7 @@ package object dsl {
 
   def join(
 otherPlan: LogicalPlan,
-joinType: JoinType = Inner,
+joinType: JoinType = Inner(false),
--- End diff --

nit: not a big deal either way but explicitly naming the boolean parameters 
(`Inner(explicitCartesian = false)`) would be great for readability.

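A one-line sketch of the nit as written (later comments in this thread note that the
Inner(Boolean) shape itself gets renamed, so this exact form may not survive):

    joinType: JoinType = Inner(explicitCartesian = false)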




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76884744
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join specified by left, right and condition is a cartesian 
product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: 
Option[Expression])
+  : Boolean = {
+val conditions = 
condition.map(splitConjunctivePredicates).getOrElse(Nil)
+!conditions.map(_.references).exists(refs => 
refs.exists(left.outputSet.contains)
+&& refs.exists(right.outputSet.contains))
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | 
FullOuter, condition)
+if isCartesianProduct(left, right, condition) =>
+  throw new AnalysisException(
+s"""Detected cartesian product for ${j.joinType.sql} join 
between logical plans
+   |${left.treeString(false).trim}
+   |and
+   |${right.treeString(false).trim}
+   |Use a CROSS JOIN to allow cartesian products between these 
relations""".stripMargin)
--- End diff --

Should we also talk about the conf flag here?

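A hedged sketch of one way to fold the flag into the message (the key
spark.sql.crossJoin.enabled is an assumption based on CROSS_JOINS_ENABLED, and this is
not necessarily what the PR ends up doing):

    throw new AnalysisException(
      s"""Detected cartesian product for ${j.joinType.sql} join between logical plans
         |${left.treeString(false).trim}
         |and
         |${right.treeString(false).trim}
         |Use a CROSS JOIN, or set spark.sql.crossJoin.enabled=true, to allow
         |cartesian products between these relations""".stripMargin)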




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76884680
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ *
+ * This rule must be run AFTER the ReorderJoin rule since the join 
conditions for each join must be
+ * collected before checking if it is a cartesian product. If you have
+ * SELECT * from R, S where R.r = S.s,
+ * the join between R and S is not a cartesian product and therefore 
should be allowed.
+ * The predicate R.r = S.s is not recognized as a join condition until the 
ReorderJoin rule.
+ */
+case class CheckCartesianProducts(conf: CatalystConf)
+extends Rule[LogicalPlan] with PredicateHelper {
+  /**
+   * Check if a join specified by left, right and condition is a cartesian 
product. Returns true if
+   * there are no join conditions involving references from both left and 
right.
+   */
+  def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: 
Option[Expression])
--- End diff --

nit: might be cleaner to just pass in the `Join` node directly as a 
parameter

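A hedged sketch of the reworked signature; it matches the shape that shows up in the
later revision of this diff quoted near the top of this archive:

    // take the Join node itself; the join is a cartesian product when no
    // conjunct of its condition references both sides
    def isCartesianProduct(join: Join): Boolean = {
      val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
      !conditions.map(_.references).exists(refs =>
        refs.exists(join.left.outputSet.contains) &&
        refs.exists(join.right.outputSet.contains))
    }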




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76878769
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
 ---
@@ -353,7 +353,7 @@ case class BroadcastNestedLoopJoinExec(
 val broadcastedRelation = 
broadcast.executeBroadcast[Array[InternalRow]]()
--- End diff --

Per offline discussion, depending on the semantics we decide on, we should revisit 
whether we'd like to keep the physical plan check here at L344.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76874085
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with 
CatalystConf with Logging {
   override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL)
 
   override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL)
+
+  override def allowCartesianProduct: Boolean = 
getConf(CROSS_JOINS_ENABLED)
--- End diff --

Also, on a related note, do you think it'd make sense to rename this (and 
the underlying conf) to `requireExplicitCrossJoinSyntax` to better convey the 
implied semantics?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76873765
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with 
CatalystConf with Logging {
   override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL)
 
   override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL)
+
+  override def allowCartesianProduct: Boolean = 
getConf(CROSS_JOINS_ENABLED)
--- End diff --

The `crossJoinEnabled` above also references the same underlying conf and 
is being used at multiple places in the code. Can we unify the two?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-30 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76857357
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -642,7 +642,7 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 checkColumnNames(
   """SELECT x.a, y.a, x.b, y.b
 |FROM (SELECT 1 AS a, 2 AS b) x
-|INNER JOIN (SELECT 1 AS a, 2 AS b) y
+|CROSS JOIN (SELECT 1 AS a, 2 AS b) y
--- End diff --

The join condition is eliminated because it is "trivial" according to the 
foldable propagation rule. This is annoying, but an expected by-product of 
detecting joins in the optimization phase. I can check to see if there is a way 
to split up the operator optimization batch so the cartesian check can be 
inserted before the constant propagation. 

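A hedged illustration of the by-product being described, assuming the test's original
condition was an equality on the two constant columns: both sides of x.a = y.a are the
foldable literal 1, so after foldable propagation the predicate becomes trivially true
and is dropped, leaving a join with no condition for the optimizer-time check to see.

    spark.sql("""
      SELECT x.a, y.a, x.b, y.b
      FROM (SELECT 1 AS a, 2 AS b) x
      CROSS JOIN (SELECT 1 AS a, 2 AS b) y
      WHERE x.a = y.a
    """).explain(true)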




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76720134
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 ---
@@ -642,7 +642,7 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 checkColumnNames(
   """SELECT x.a, y.a, x.b, y.b
 |FROM (SELECT 1 AS a, 2 AS b) x
-|INNER JOIN (SELECT 1 AS a, 2 AS b) y
+|CROSS JOIN (SELECT 1 AS a, 2 AS b) y
--- End diff --

this is not a cross join, is it?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76716719
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -159,23 +159,30 @@ object ExtractEquiJoinKeys extends Logging with 
PredicateHelper {
  */
 object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
-  // flatten all inner joins, which are next to each other
-  def flattenJoin(plan: LogicalPlan): (Seq[LogicalPlan], Seq[Expression]) 
= plan match {
-case Join(left, right, Inner, cond) =>
-  val (plans, conditions) = flattenJoin(left)
-  (plans ++ Seq(right), conditions ++ cond.toSeq)
+  /**
+   * Flatten all inner joins, which are next to each other.
+   * Return a list of logical plans to be joined with a boolean for each 
plan indicating if it
+   * was involved in an explicit cross join. Also returns the entire list 
of join conditions for
+   * the left-deep tree.
+   */
+  def flattenJoin(plan: LogicalPlan, parentJoinType: Inner = Inner(false))
--- End diff --

Done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76716740
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if ther eany cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
--- End diff --

Done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76716676
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if ther eany cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, joinType, conditions)
--- End diff --

Done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76716689
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if ther eany cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, joinType, conditions)
+  if Seq(Inner(false), LeftOuter, RightOuter, 
FullOuter).contains(joinType) =>
+val hasJoinCondition =
--- End diff --

Done.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
Github user srinathshankar commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76683461
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if ther eany cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
--- End diff --

As discussed, this must be done after ReorderJoin. I added a comment explaining 
why.

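A hedged sketch of how the ordering constraint can be expressed in the optimizer's
batch list (batch name and exact placement are assumptions; the requirement from this
thread is only that the rule runs after ReorderJoin):

    // inside Optimizer.batches: ReorderJoin is part of the operator-optimization
    // batch, and the cartesian-product check gets its own batch after it
    Batch("Check Cartesian Products", Once,
      CheckCartesianProducts(conf))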




[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76670132
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if ther eany cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
--- End diff --

As discussed offline, please document that we need to run this after join 
reordering.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76665743
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -159,23 +159,30 @@ object ExtractEquiJoinKeys extends Logging with 
PredicateHelper {
  */
 object ExtractFiltersAndInnerJoins extends PredicateHelper {
 
-  // flatten all inner joins, which are next to each other
-  def flattenJoin(plan: LogicalPlan): (Seq[LogicalPlan], Seq[Expression]) 
= plan match {
-case Join(left, right, Inner, cond) =>
-  val (plans, conditions) = flattenJoin(left)
-  (plans ++ Seq(right), conditions ++ cond.toSeq)
+  /**
+   * Flatten all inner joins, which are next to each other.
+   * Return a list of logical plans to be joined with a boolean for each 
plan indicating if it
+   * was involved in an explicit cross join. Also returns the entire list 
of join conditions for
+   * the left-deep tree.
+   */
+  def flattenJoin(plan: LogicalPlan, parentJoinType: Inner = Inner(false))
--- End diff --

just pass in `explicitCartesian`?
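
If it helps, a sketch of what that might look like with a plain Boolean
threaded through the recursion (the return type and names here are
illustrative, and the body assumes the surrounding
ExtractFiltersAndInnerJoins object with its PredicateHelper methods and
the Inner(crossJoin) join type from this PR):

    def flattenJoin(plan: LogicalPlan, explicitCartesian: Boolean = false)
        : (Seq[(LogicalPlan, Boolean)], Seq[Expression]) = plan match {
      // Adjacent inner joins are flattened; each flattened child remembers
      // whether it came from an explicit CROSS join.
      case Join(left, right, Inner(crossJoin), cond) =>
        val (plans, conditions) = flattenJoin(left, crossJoin)
        (plans ++ Seq((right, crossJoin)), conditions ++ cond.toSeq)
      // A filter sitting directly on top of an inner join contributes
      // additional join conditions for the flattened tree.
      case Filter(filterCondition, j @ Join(_, _, Inner(_), _)) =>
        val (plans, conditions) = flattenJoin(j)
        (plans, conditions ++ splitConjunctivePredicates(filterCondition))
      // Anything else terminates the flattening and is tagged with the
      // flag propagated from its parent join.
      case _ => (Seq((plan, explicitCartesian)), Seq())
    }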





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76665431
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there are any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, joinType, conditions)
+  if Seq(Inner(false), LeftOuter, RightOuter, 
FullOuter).contains(joinType) =>
+val hasJoinCondition =
--- End diff --

move this to a function and call that function in the case guard.





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76665332
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there are any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+if (conf.allowCartesianProduct) {
+  plan
+} else plan transform {
+  case j @ Join(left, right, joinType, conditions)
--- End diff --

`j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | FullOuter, 
condition) =>`?







[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14866#discussion_r76665008
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends 
Rule[LogicalPlan] {
 }
 
 /**
+ * Check if there are any cartesian products between joins of any type in the 
optimized plan tree.
+ * Throw an error if a cartesian product is found without an explicit 
cross join specified.
+ * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is 
true.
+ */
+case class CheckCartesianProducts(conf: CatalystConf) extends 
Rule[LogicalPlan] {
--- End diff --

Shouldn't this be a part of CheckAnalysis?





[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...

2016-08-29 Thread srinathshankar
GitHub user srinathshankar opened a pull request:

https://github.com/apache/spark/pull/14866

[SPARK-17298][SQL] Require explicit CROSS join for cartesian products

## What changes were proposed in this pull request?

Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.

The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce an INNER join
that will require a subsequent join condition.
That is, df1.join(df2) is equivalent to select * from df1, df2.
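
To make the intended behaviour concrete, a small illustrative example
against the DataFrame API (the SparkSession setup and the data are made
up for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cross-join-example").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b")).toDF("id", "x")
    val df2 = Seq((1, "c"), (3, "d")).toDF("id", "y")

    // A join with a condition referencing both sides is unaffected.
    df1.join(df2, df1("id") === df2("id")).show()

    // Explicit cartesian product through the new API: always allowed.
    df1.crossJoin(df2).show()

    // Implicit cartesian product: planned as an inner join with no join
    // keys, so by default it is rejected with an error when an action runs.
    // Uncommenting the line below restores the old permissive behaviour:
    // spark.conf.set("spark.sql.crossJoin.enabled", "true")
    df1.join(df2).show()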

## How was this patch tested?

Added cross-join.sql to the SQLQueryTestSuite to test the check for 
cartesian products. Added a couple of tests to the DataFrameJoinSuite to test 
the crossJoin API. Modified various other test suites to explicitly specify a 
cross join where an INNER join or a comma-separated list was previously used.






You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srinathshankar/spark crossjoin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14866.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14866


commit 250d96dcf43c09dbff821e5b20f3f0ea86f08887
Author: Srinath Shankar 
Date:   2016-08-25T16:43:08Z

[SPARK-17298][SQL] Require explicit CROSS join for cartesian products

Require the use of CROSS join syntax in SQL (and a new crossJoin
DataFrame API) to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where
there is no join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS
join, an error must be thrown. Turning on the
"spark.sql.crossJoin.enabled" configuration flag will disable this check
and allow cartesian products without an explicit CROSS join.

The new crossJoin DataFrame API must be used to specify explicit cross
joins. The existing join(DataFrame) method will produce an INNER join
that will require a subsequent join condition.
That is, df1.join(df2) == select * from df1, df2.



