[GitHub] spark pull request #21668: [SPARK-24690][SQL] Add a new config to control pl...

2018-07-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21668#discussion_r199682495
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -375,16 +375,16 @@ case class CatalogStatistics(
    * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based
    * on column names.
    */
-  def toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics = {
-    if (cboEnabled && rowCount.isDefined) {
+  def toPlanStats(planOutput: Seq[Attribute], planStatsEnabled: Boolean): Statistics = {
+    if (planStatsEnabled && rowCount.isDefined) {
       val attrStats = AttributeMap(planOutput
         .flatMap(a => colStats.get(a.name).map(a -> _.toPlanStat(a.name, a.dataType))))
       // Estimate size as number of rows * row size.
       val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
       Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
     } else {
-      // When CBO is disabled or the table doesn't have other statistics, we apply the size-only
-      // estimation strategy and only propagate sizeInBytes in statistics.
+      // When plan statistics are disabled or the table doesn't have other statistics,
+      // we apply the size-only estimation strategy and only propagate sizeInBytes in statistics.
       Statistics(sizeInBytes = sizeInBytes)
--- End diff --

yea, I see. We might do so.


---




[GitHub] spark pull request #21668: [SPARK-24690][SQL] Add a new config to control pl...

2018-07-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21668#discussion_r199552405
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -375,16 +375,16 @@ case class CatalogStatistics(
    * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based
    * on column names.
    */
-  def toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics = {
-    if (cboEnabled && rowCount.isDefined) {
+  def toPlanStats(planOutput: Seq[Attribute], planStatsEnabled: Boolean): Statistics = {
+    if (planStatsEnabled && rowCount.isDefined) {
       val attrStats = AttributeMap(planOutput
         .flatMap(a => colStats.get(a.name).map(a -> _.toPlanStat(a.name, a.dataType))))
       // Estimate size as number of rows * row size.
       val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
       Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
     } else {
-      // When CBO is disabled or the table doesn't have other statistics, we apply the size-only
-      // estimation strategy and only propagate sizeInBytes in statistics.
+      // When plan statistics are disabled or the table doesn't have other statistics,
+      // we apply the size-only estimation strategy and only propagate sizeInBytes in statistics.
       Statistics(sizeInBytes = sizeInBytes)
--- End diff --

cc @wzhfy 


---




[GitHub] spark pull request #21668: [SPARK-24690][SQL] Add a new config to control pl...

2018-07-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21668#discussion_r199552346
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -375,16 +375,16 @@ case class CatalogStatistics(
    * Convert [[CatalogStatistics]] to [[Statistics]], and match column stats to attributes based
    * on column names.
    */
-  def toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics = {
-    if (cboEnabled && rowCount.isDefined) {
+  def toPlanStats(planOutput: Seq[Attribute], planStatsEnabled: Boolean): Statistics = {
+    if (planStatsEnabled && rowCount.isDefined) {
       val attrStats = AttributeMap(planOutput
         .flatMap(a => colStats.get(a.name).map(a -> _.toPlanStat(a.name, a.dataType))))
       // Estimate size as number of rows * row size.
       val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
       Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
     } else {
-      // When CBO is disabled or the table doesn't have other statistics, we apply the size-only
-      // estimation strategy and only propagate sizeInBytes in statistics.
+      // When plan statistics are disabled or the table doesn't have other statistics,
+      // we apply the size-only estimation strategy and only propagate sizeInBytes in statistics.
       Statistics(sizeInBytes = sizeInBytes)
--- End diff --

I think we should respect rowCount, if it is available, no matter whether 
CBO is on or off. 
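A minimal sketch of one reading of that suggestion, applied to the method quoted in the diff above (illustrative only, not the change that was merged): branch on `rowCount` alone, so the catalog row count is propagated regardless of the flag.

```scala
// Sketch: same body as in the diff, but the flag no longer gates rowCount.
def toPlanStats(planOutput: Seq[Attribute]): Statistics = {
  if (rowCount.isDefined) {
    val attrStats = AttributeMap(planOutput
      .flatMap(a => colStats.get(a.name).map(a -> _.toPlanStat(a.name, a.dataType))))
    // Estimate size as number of rows * row size.
    val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
    Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
  } else {
    // No row count in the catalog: fall back to the size-only estimate.
    Statistics(sizeInBytes = sizeInBytes)
  }
}
```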


---




[GitHub] spark pull request #21668: [SPARK-24690][SQL] Add a new config to control pl...

2018-06-29 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21668

[SPARK-24690][SQL] Add a new config to control plan stats computation in LogicalRelation

## What changes were proposed in this pull request?
This pr proposes a new, separate config so that `LogicalRelation` can use `rowCount` to compute data statistics in logical plans. In the current master, we effectively cannot enable `StarSchemaDetection.reorderStarJoins`: it only applies when CBO is turned off, yet `StarSchemaDetection` internally relies on the `rowCount` that `LogicalRelation` only propagates when CBO is enabled.
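
As a usage sketch, assuming placeholder table names (`fact`, `dim1`, `dim2`) and an assumed key for the proposed config (`spark.sql.cbo.planStats.enabled`; the final name was still under discussion in this pr):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("star-join-with-plan-stats").getOrCreate()

// Star-join reordering only applies when CBO itself is off...
spark.conf.set("spark.sql.cbo.enabled", "false")
spark.conf.set("spark.sql.cbo.starSchemaDetection", "true")
// ...but its heuristics need catalog row counts, which the proposed config would let
// LogicalRelation keep exposing even with CBO disabled (key name is an assumption).
spark.conf.set("spark.sql.cbo.planStats.enabled", "true")

// With analyzed tables, the optimized plan should reflect the star-schema join order.
spark.table("fact")
  .join(spark.table("dim1"), "k1")
  .join(spark.table("dim2"), "k2")
  .explain(true)
```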

## How was this patch tested?
Added tests in `DataFrameJoinSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark PlanStatsConf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21668.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21668


commit f0db73bcad7741ec06560a1b6bb50bb49483b954
Author: Takeshi Yamamuro 
Date:   2018-06-29T06:22:16Z

Fix




---
