[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17353


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107138307
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -119,25 +120,28 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * When building m-way joins, we only keep the best plan (with the lowest cost) for the same set
  * of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among
  * plans (A J B) J C, (A J C) J B and (B J C) J A.
- *
- * Thus the plans maintained for each level when reordering four items A, B, C, D are as follows:
+ * We also prune cartesian product candidates when building a new plan if there exists no join
+ * condition involving references from both left and right. This pruning strategy significantly
+ * reduces the search space.
+ * For example, given A J B J C J D, plans maintained for each level will be as follows:
--- End diff --

Do you mean join conditions are not specified? How about `A J B J C J D where A.k1 = B.k1 and B.k2 = C.k2 and C.k3 = D.k3`?
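The pruning rule under discussion keeps a join candidate only when some condition references columns from both sides. A minimal sketch of that connectivity check, written in Python for illustration (the actual Spark implementation is Scala, and all names here are hypothetical, not Spark's):

```python
# Hypothetical sketch of the cartesian-product pruning check discussed above:
# a join between two sub-plans is considered only if at least one condition
# references columns from both sides.

def references(condition):
    """Tables mentioned by a toy equi-join condition like 'A.k1 = B.k1'."""
    left, right = condition.split("=")
    return {left.strip().split(".")[0], right.strip().split(".")[0]}

def can_join(left_items, right_items, conditions):
    """True if some condition spans both sides (i.e. not a cartesian product)."""
    return any(
        references(c) & left_items and references(c) & right_items
        for c in conditions
    )

# The example query from the comment above:
conds = ["A.k1 = B.k1", "B.k2 = C.k2", "C.k3 = D.k3"]
print(can_join({"A"}, {"B"}, conds))       # A and B share a condition
print(can_join({"A"}, {"C"}, conds))       # no condition spans A and C: pruned
print(can_join({"A", "B"}, {"C"}, conds))  # B.k2 = C.k2 spans the two sides
```

With the chain of conditions above, {A, C} is pruned at level 1, which is exactly why p({A, C}) disappears from the level list in the quoted doc comment.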





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107137731
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -119,25 +120,28 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * When building m-way joins, we only keep the best plan (with the lowest cost) for the same set
  * of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among
  * plans (A J B) J C, (A J C) J B and (B J C) J A.
- *
- * Thus the plans maintained for each level when reordering four items A, B, C, D are as follows:
+ * We also prune cartesian product candidates when building a new plan if there exists no join
+ * condition involving references from both left and right. This pruning strategy significantly
+ * reduces the search space.
+ * For example, given A J B J C J D, plans maintained for each level will be as follows:
  * level 0: p({A}), p({B}), p({C}), p({D})
- * level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
- * level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
+ * level 1: p({A, B}), p({B, C}), p({C, D})
+ * level 2: p({A, B, C}), p({B, C, D})
  * level 3: p({A, B, C, D})
  * where p({A, B, C, D}) is the final output plan.
  *
  * For cost evaluation, since physical costs for operators are not available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper {
+object JoinReorderDP extends PredicateHelper with Logging {

   def search(
       conf: SQLConf,
       items: Seq[LogicalPlan],
       conditions: Set[Expression],
       topOutput: AttributeSet): LogicalPlan = {

+    val startTime = System.nanoTime()
--- End diff --

`nanoTime()` is more reliable than `currentTimeMillis()`: https://github.com/databricks/scala-style-guide#misc_currentTimeMillis_vs_nanoTime
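The point of the style-guide link is that a monotonic clock (Java's `System.nanoTime`) is unaffected by wall-clock adjustments, so it is the right tool for measuring elapsed time. A small sketch of the same timing pattern, translated to Python for illustration (the summed range is just a stand-in workload, not the actual join-reordering search):

```python
import time

# Measure elapsed time with a monotonic clock, as the quoted Scala code does
# with System.nanoTime(); time.monotonic_ns() is Python's equivalent.
start = time.monotonic_ns()
total = sum(range(1_000_000))  # stand-in for the join-reordering search
duration_ms = (time.monotonic_ns() - start) // 1_000_000
print(f"Join reordering finished. Duration: {duration_ms} ms")
```

Using `time.time()` here would be the analogue of `currentTimeMillis`: it can jump backwards or forwards if the system clock is adjusted mid-measurement.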





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107137438
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -119,25 +120,28 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * When building m-way joins, we only keep the best plan (with the lowest cost) for the same set
  * of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among
  * plans (A J B) J C, (A J C) J B and (B J C) J A.
- *
- * Thus the plans maintained for each level when reordering four items A, B, C, D are as follows:
+ * We also prune cartesian product candidates when building a new plan if there exists no join
+ * condition involving references from both left and right. This pruning strategy significantly
+ * reduces the search space.
+ * For example, given A J B J C J D, plans maintained for each level will be as follows:
--- End diff --

why?





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107133614
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -119,25 +120,28 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * When building m-way joins, we only keep the best plan (with the lowest cost) for the same set
  * of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among
  * plans (A J B) J C, (A J C) J B and (B J C) J A.
- *
- * Thus the plans maintained for each level when reordering four items A, B, C, D are as follows:
+ * We also prune cartesian product candidates when building a new plan if there exists no join
+ * condition involving references from both left and right. This pruning strategy significantly
+ * reduces the search space.
+ * For example, given A J B J C J D, plans maintained for each level will be as follows:
  * level 0: p({A}), p({B}), p({C}), p({D})
- * level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
- * level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
+ * level 1: p({A, B}), p({B, C}), p({C, D})
+ * level 2: p({A, B, C}), p({B, C, D})
  * level 3: p({A, B, C, D})
  * where p({A, B, C, D}) is the final output plan.
  *
  * For cost evaluation, since physical costs for operators are not available currently, we use
  * cardinalities and sizes to compute costs.
  */
-object JoinReorderDP extends PredicateHelper {
+object JoinReorderDP extends PredicateHelper with Logging {

   def search(
       conf: SQLConf,
       items: Seq[LogicalPlan],
       conditions: Set[Expression],
       topOutput: AttributeSet): LogicalPlan = {

+    val startTime = System.nanoTime()
--- End diff --

use `System.currentTimeMillis` if we only care about the ms level.
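The level-by-level enumeration in the doc comment quoted above can be sketched concretely. This is an illustrative Python model, not Spark's Scala code: join conditions are reduced to edges between items, and a set of m items survives to level m-1 only if those edges make it connected (otherwise it would require a cartesian product and is pruned):

```python
from itertools import combinations

# Join graph for A J B J C J D with conditions linking only adjacent items,
# mirroring the pruned level lists in the quoted comment.
edges = {frozenset("AB"), frozenset("BC"), frozenset("CD")}

def connected(items):
    """True if the items form one connected component under the join edges."""
    items = set(items)
    seen, frontier = set(), {next(iter(items))}
    while frontier:
        node = frontier.pop()
        seen.add(node)
        # Follow every edge incident to `node` that stays inside `items`.
        frontier |= {b for e in edges if node in e
                     for b in e if b in items and b not in seen}
    return seen == items

# Level m keeps only connected sets of m+1 items: cartesian products are pruned.
levels = [[set(c) for c in combinations("ABCD", m) if connected(c)]
          for m in range(1, 5)]
for m, plans in enumerate(levels):
    print(f"level {m}:", sorted("".join(sorted(p)) for p in plans))
```

Running this reproduces the pruned lists from the comment: level 1 keeps only {A,B}, {B,C}, {C,D}, and level 2 only {A,B,C}, {B,C,D}, because sets such as {A,C} or {A,B,D} have no connecting join condition.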





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107133345
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -119,25 +120,28 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr
  * When building m-way joins, we only keep the best plan (with the lowest cost) for the same set
  * of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among
  * plans (A J B) J C, (A J C) J B and (B J C) J A.
- *
- * Thus the plans maintained for each level when reordering four items A, B, C, D are as follows:
+ * We also prune cartesian product candidates when building a new plan if there exists no join
+ * condition involving references from both left and right. This pruning strategy significantly
+ * reduces the search space.
+ * For example, given A J B J C J D, plans maintained for each level will be as follows:
--- End diff --

`will be` -> `may be`?





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-21 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107080002
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -152,6 +156,10 @@ object JoinReorderDP extends PredicateHelper {
       foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
     }

+//val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+//logDebug(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
+//  s"${items.length}, number of plans in memo: ${foundPlans.map(_.size).sum}")
--- End diff --

oops, I forgot to recover them...





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r107077912
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -152,6 +156,10 @@ object JoinReorderDP extends PredicateHelper {
       foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
     }

+//val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+//logDebug(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
+//  s"${items.length}, number of plans in memo: ${foundPlans.map(_.size).sum}")
--- End diff --

?





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106843645
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
   }

   /**
-   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * Builds a new JoinPlan if there exists at least one join condition involving references from
+   * both left and right.
--- End diff --

Great! Thanks.





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106843609
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
   }

   /**
-   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * Builds a new JoinPlan if there exists at least one join condition involving references from
+   * both left and right.
   * @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
   * @param otherJoinPlan The other side JoinPlan for building a new join node.
--- End diff --

The left and right sides are decided inside this method; it tends to build a left-deep tree.
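The "left-deep tree" shape mentioned here can be illustrated in a few lines. This is a toy Python model (not Spark's code): the plan built so far always becomes the left child, and each new item joins on the right.

```python
from functools import reduce

# Model a join node as a (left, right) pair; folding the item list this way
# yields the left-deep shape ((A J B) J C) J D.
def join(left, right):
    return (left, right)

plan = reduce(join, ["A", "B", "C", "D"])
print(plan)  # every right child is a single item, never a sub-join
```

In a left-deep tree the right input of every join is a base relation, which keeps the DP search space small compared with enumerating bushy trees.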





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106841610
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
   }

   /**
-   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * Builds a new JoinPlan if there exists at least one join condition involving references from
+   * both left and right.
   * @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
   * @param otherJoinPlan The other side JoinPlan for building a new join node.
--- End diff --

Do you want to rename them to leftPlan and rightPlan?





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106841523
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
   }

   /**
-   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * Builds a new JoinPlan if there exists at least one join condition involving references from
+   * both left and right.
   * @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
   * @param otherJoinPlan The other side JoinPlan for building a new join node.
   * @param conf SQLConf for statistics computation.
   * @param conditions The overall set of join conditions.
   * @param topOutput The output attributes of the final plan.
-   * @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
-   * return None.
+   * @return Builds and returns a new JoinPlan if there exists at least one join condition
+   * involving references from both left and right. Otherwise, returns None.
--- End diff --

Now, we can simplify it to `Builds and returns a new JoinPlan if both conditions hold. Otherwise, returns None.`





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106841464
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
   }

   /**
-   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * Builds a new JoinPlan if there exists at least one join condition involving references from
+   * both left and right.
--- End diff --

```
Builds a new JoinPlan when both conditions hold:
- the sets of items contained in both left and right sides do not overlap
- there exists at least one join condition involving references from both sides
```





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106835910
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -201,7 +201,16 @@ object JoinReorderDP extends PredicateHelper {
     nextLevel.toMap
   }

-  /** Build a new join node. */
+  /**
+   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
--- End diff --

Let us reword it too.

> Builds a new JoinPlan if there is a join predicate connecting two given sides.





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17353#discussion_r106835632
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
@@ -201,7 +201,16 @@ object JoinReorderDP extends PredicateHelper {
     nextLevel.toMap
   }

-  /** Build a new join node. */
+  /**
+   * Builds a new JoinPlan if the two given sides can be joined with some conditions.
+   * @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
+   * @param otherJoinPlan The other side JoinPlan for building a new join node.
+   * @param conf SQLConf for statistics computation.
+   * @param conditions The overall set of join conditions.
+   * @param topOutput The output attributes of the final plan.
+   * @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
--- End diff --

How about?

> Returns a new JoinPlan if there exists at least one join condition involving references from both left and right. Otherwise, returns None.





[GitHub] spark pull request #17353: [SPARK-17080][SQL][FOLLOWUP] Improve documentatio...

2017-03-19 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/17353

[SPARK-17080][SQL][FOLLOWUP] Improve documentation and naming for methods/variables

## What changes were proposed in this pull request?

Improve documentation and naming for methods/variables.

## How was this patch tested?

Not related.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark reorderFollow

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17353.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17353


commit 7598eb81b782f71d9da3794a6a007c63fdf42b23
Author: wangzhenhua 
Date:   2017-03-20T03:18:31Z

fxi comments



