[GitHub] [spark] bzhaoopenstack commented on pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests

2022-07-11 Thread GitBox


bzhaoopenstack commented on PR #37117:
URL: https://github.com/apache/spark/pull/37117#issuecomment-1181387164

   > BTW, mind keeping the PR description template please? 
https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
   
   Thanks for the correction; sorry, I'm not very familiar with the Spark 
protocol. ;-) Refreshed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bzhaoopenstack commented on a diff in pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests

2022-07-11 Thread GitBox


bzhaoopenstack commented on code in PR #37117:
URL: https://github.com/apache/spark/pull/37117#discussion_r918609239


##
python/pyspark/pandas/series.py:
##
@@ -4738,7 +4738,7 @@ def mode(self, dropna: bool = True) -> "Series":
 ser_count = self.value_counts(dropna=dropna, sort=False)
 sdf_count = ser_count._internal.spark_frame
 most_value = ser_count.max()
-sdf_most_value = sdf_count.filter("count == {}".format(most_value))
+sdf_most_value = sdf_count.filter("count == {}".format(most_value.__str__()))

Review Comment:
   Thanks, this is good. ;-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum commented on PR #37163:
URL: https://github.com/apache/spark/pull/37163#issuecomment-1181369549

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #37160: [SPARK-39749][SQL] Always use plain string representation on casting Decimal to String

2022-07-11 Thread GitBox


gengliangwang commented on PR #37160:
URL: https://github.com/apache/spark/pull/37160#issuecomment-1181368882

   Let's keep this open for one or two days since it is changing the behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


beliefer commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918590789


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, 
EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values 
in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary 
comparison.
+ *
+ * 1. Converts the binary comparison to true literal when the comparison value 
must be true.
+ * 2. Converts the binary comparison to false literal when the comparison 
value must be false.
+ */
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+  EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
+  TrueLiteral
+case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>

Review Comment:
   Yeah. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918589988


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/DynamicPartitionPruningHiveScanSuite.scala:
##
@@ -41,6 +42,16 @@ abstract class DynamicPartitionPruningHiveScanSuiteBase
   case _ => Nil
 }
   }
+
+  override protected def beforeAll(): Unit = {
+super.beforeAll()
+conf.setConf(SQLConf.CBO_ENABLED, false)

Review Comment:
   To fix the test error:
   (screenshot: https://user-images.githubusercontent.com/5399861/178422851-a4f60bc5-2593-4ac0-876c-1bcff332918c.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918589133


##
sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala:
##
@@ -209,11 +209,13 @@ class InjectRuntimeFilterSuite extends QueryTest with 
SQLTestUtils with SharedSp
 // `MergeScalarSubqueries` can duplicate subqueries in the optimized plan 
and would make testing
 // complicated.
 conf.setConfString(SQLConf.OPTIMIZER_EXCLUDED_RULES.key, 
MergeScalarSubqueries.ruleName)
+conf.setConf(SQLConf.CBO_ENABLED, false)

Review Comment:
   To fix the test error:
   (screenshot: https://user-images.githubusercontent.com/5399861/178422514-d5bfdac8-5518-40d3-b8ae-6c6571b351e4.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918588559


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala:
##
@@ -58,6 +58,7 @@ case class FilterEstimation(plan: Filter) extends Logging {
 rowsAfterFilter = filteredRowCount)
 }
 val filteredSizeInBytes: BigInt = getOutputSize(plan.output, 
filteredRowCount, newColStats)
+  .min(plan.child.stats.sizeInBytes)

Review Comment:
   To fix the test error:
   (screenshot: https://user-images.githubusercontent.com/5399861/178422367-6aa26b63-1135-4741-8fcc-c2a0551c0fd4.png)
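   
   Illustrative numbers for what the added `.min(...)` does (made-up values, not from the failing test):
   
   ```scala
   // The cap keeps the filtered size estimate from exceeding the child's size, encoding
   // the fact that a Filter cannot output more bytes than it reads.
   val childSizeInBytes  = BigInt(1) << 20                          // 1 MiB of input
   val naiveFilteredSize = BigInt(3) << 20                          // hypothetical over-estimate
   val cappedSize        = naiveFilteredSize.min(childSizeInBytes)  // 1 MiB
   ```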
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918586900


##
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala:
##
@@ -353,48 +353,65 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 
   test("invalidation of tableRelationCache after inserts") {
 val table = "invalidate_catalog_cache_table"
-Seq(false, true).foreach { autoUpdate =>
-  withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) 
{
-withTable(table) {
-  spark.range(100).write.saveAsTable(table)
-  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
-  spark.table(table)
-  val initialSizeInBytes = 
getTableFromCatalogCache(table).stats.sizeInBytes
-  spark.range(100).write.mode(SaveMode.Append).saveAsTable(table)
-  spark.table(table)
-  assert(getTableFromCatalogCache(table).stats.sizeInBytes == 2 * 
initialSizeInBytes)
+Seq(false, true).foreach { cboEnabled =>
+  Seq(false, true).foreach { autoUpdate =>
+withSQLConf(
+  SQLConf.CBO_ENABLED.key -> cboEnabled.toString,
+  SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
+  withTable(table) {
+spark.range(100).write.saveAsTable(table)
+sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
+spark.table(table)
+val initialSizeInBytes = 
getTableFromCatalogCache(table).stats.sizeInBytes
+spark.range(100).write.mode(SaveMode.Append).saveAsTable(table)
+spark.table(table)
+if (!cboEnabled) {
+  assert(getTableFromCatalogCache(table).stats.sizeInBytes === 2 * 
initialSizeInBytes)
+} else {
+  assert(getTableFromCatalogCache(table).stats.sizeInBytes > 
initialSizeInBytes)
+}
+  }
 }
   }
 }
   }
 
   test("invalidation of tableRelationCache after alter table add partition") {
 val table = "invalidate_catalog_cache_table"
-Seq(false, true).foreach { autoUpdate =>
-  withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) 
{
-withTempDir { dir =>
-  withTable(table) {
-val path = dir.getCanonicalPath
-sql(s"""
-  |CREATE TABLE $table (col1 int, col2 int)
-  |USING PARQUET
-  |PARTITIONED BY (col2)
-  |LOCATION '${dir.toURI}'""".stripMargin)
-sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
-spark.table(table)
-assert(getTableFromCatalogCache(table).stats.sizeInBytes == 0)
-spark.catalog.recoverPartitions(table)
-val df = Seq((1, 2), (1, 2)).toDF("col2", "col1")
-df.write.parquet(s"$path/col2=1")
-sql(s"ALTER TABLE $table ADD PARTITION (col2=1) LOCATION 
'${dir.toURI}'")
-spark.table(table)
-val cachedTable = getTableFromCatalogCache(table)
-val cachedTableSizeInBytes = cachedTable.stats.sizeInBytes
-val defaultSizeInBytes = conf.defaultSizeInBytes
-if (autoUpdate) {
-  assert(cachedTableSizeInBytes != defaultSizeInBytes && 
cachedTableSizeInBytes > 0)
-} else {
-  assert(cachedTableSizeInBytes == defaultSizeInBytes)
+Seq(false, true).foreach { cboEnabled =>
+  Seq(false, true).foreach { autoUpdate =>
+withSQLConf(
+  SQLConf.CBO_ENABLED.key -> cboEnabled.toString,
+  SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
+  withTempDir { dir =>
+withTable(table) {
+  val path = dir.getCanonicalPath
+  sql(
+s"""
+   |CREATE TABLE $table (col1 int, col2 int)
+   |USING PARQUET
+   |PARTITIONED BY (col2)
+   |LOCATION '${dir.toURI}'""".stripMargin)
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
+  spark.table(table)
+  if (!cboEnabled) {
+assert(getTableFromCatalogCache(table).stats.sizeInBytes === 0)
+  } else {
+assert(getTableFromCatalogCache(table).stats.sizeInBytes === 1)

Review Comment:
   The size is 1, not 0:
   
https://github.com/apache/spark/blob/ddc61e62b9af5deff1b93e22f466f2a13f281155/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala#L116-L118
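   
   A paraphrase of the cited behaviour (the helper name and signature below are made up, not the real `EstimationUtils` API):
   
   ```scala
   // When the row count is zero, the estimated output size is reported as 1 instead of 0,
   // so that sizeInBytes of a BinaryNode (a product of child sizes) never collapses to zero.
   def outputSizeSketch(rowCount: BigInt, sizePerRow: BigInt): BigInt =
     if (rowCount > 0) rowCount * sizePerRow else BigInt(1)

   outputSizeSketch(BigInt(0), BigInt(8))  // 1, which is what the updated assertion expects
   ```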



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.a

[GitHub] [spark] wangyum opened a new pull request, #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default

2022-07-11 Thread GitBox


wangyum opened a new pull request, #37163:
URL: https://github.com/apache/spark/pull/37163

   ### What changes were proposed in this pull request?
   
   This PR enables `spark.sql.cbo.enabled` by default.
   
   ### Why are the changes needed?
   
   1. Enable CBO to get better performance; we have been running with it enabled for over 3 years.
   2. Benchmark-related tests also enable it: https://github.com/apache/spark/blob/e83f8a872a16d4f049cefb1fc445f91cf84443ad/sql/core/src/test/scala/org/apache/spark/sql/TPCBase.scala#L32
   
   ### Does this PR introduce _any_ user-facing change?
   
   Maybe. But users can disable it with `set spark.sql.cbo.enabled=false`.
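   
   For reference, a minimal sketch of the opt-out (assuming a local SparkSession; only the config key comes from this PR):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[1]").appName("cbo-opt-out").getOrCreate()

   // Session-level opt-out of the new default:
   spark.conf.set("spark.sql.cbo.enabled", "false")
   // Or equivalently via SQL:
   spark.sql("SET spark.sql.cbo.enabled=false")
   ```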
   
   ### How was this patch tested?
   
   Fix existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pralabhkumar commented on pull request #37009: [SPARK-38292][PYTHON]Support na_filter for pyspark.pandas.read_csv

2022-07-11 Thread GitBox


pralabhkumar commented on PR #37009:
URL: https://github.com/apache/spark/pull/37009#issuecomment-1181349887

   @itholic @HyukjinKwon Gentle ping


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-11 Thread GitBox


viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918563482


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, SortOrder}
+import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, 
ExtractTopK}
+import org.apache.spark.sql.catalyst.plans.{JoinType, LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LocalLimit, 
LogicalPlan, Project, RebalancePartitions, Repartition, 
RepartitionByExpression, Sort, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LIMIT, OUTER_JOIN, 
SORT}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule supports push down local limit and local sort from TopK through 
other join:
+ *   - for a left outer join, the references of ordering of TopK come from the 
left side and
+ * the limits of TopK is smaller than left side max rows
+ *   - for a right outer join, the references of ordering of TopK come from 
the right side and
+ * the limits of TopK is smaller than right side max rows
+ */
+object PushLocalTopKThroughOuterJoin extends Rule[LogicalPlan] {
+  private def smallThan(limits: Int, maxRowsOpt: Option[Long]): Boolean = 
maxRowsOpt match {
+case Some(maxRows) => limits < maxRows
+case _ => true
+  }
+
+  private def canPushThroughOuterJoin(
+  joinType: JoinType,
+  order: Seq[SortOrder],
+  leftChild: LogicalPlan,
+  rightChild: LogicalPlan,
+  limits: Int): Boolean = joinType match {
+case LeftOuter =>
+  order.forall(_.references.subsetOf(leftChild.outputSet)) &&
+smallThan(limits, leftChild.maxRowsPerPartition)
+case RightOuter =>
+  order.forall(_.references.subsetOf(rightChild.outputSet)) &&
+smallThan(limits, rightChild.maxRowsPerPartition)
+case _ => false
+  }
+
+  private def findOuterJoin(limits: Int, order: Seq[SortOrder], child: 
LogicalPlan): Seq[Join] = {
+child match {
+  case j @ ExtractEquiJoinKeys(joinType, _, _, _, _, leftChild, 
rightChild, _)
+  if canPushThroughOuterJoin(joinType, order, leftChild, rightChild, 
limits) =>
+// we should find the bottom outer join which can push local topK
+val childOuterJoin = joinType match {
+  case LeftOuter => findOuterJoin(limits, order, leftChild)
+  case RightOuter => findOuterJoin(limits, order, rightChild)
+  case _ => Seq.empty
+}
+if (childOuterJoin.nonEmpty) {
+  childOuterJoin
+} else {
+  j :: Nil
+}
+  case u: Union => u.children.flatMap(child => findOuterJoin(limits, 
order, child))
+  case p: Project if p.projectList.forall(_.deterministic) =>
+findOuterJoin(limits, order, p.child)
+  case f: Filter if f.condition.deterministic =>
+findOuterJoin(limits, order, f.child)

Review Comment:
   Is it safe to push down through `Filter`? Won't it change the final result?
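   
   A toy illustration of the concern (plain DataFrame code, not the new rule; data and names are made up):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[1]").appName("topk-vs-filter").getOrCreate()
   import spark.implicits._

   val df = Seq((1, "a"), (2, "a"), (3, "b"), (4, "b")).toDF("x", "grp")

   // Reference result: filter first, then take the 2 smallest x -> rows (3, b) and (4, b).
   df.filter($"grp" === "b").orderBy($"x").limit(2).show()

   // If a local "top 2 by x" were taken below the filter instead, only (1, a) and (2, a)
   // would survive the local limit, and the filter would then discard both of them.
   ```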



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-11 Thread GitBox


viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918561846


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, SortOrder}
+import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, 
ExtractTopK}
+import org.apache.spark.sql.catalyst.plans.{JoinType, LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LocalLimit, 
LogicalPlan, Project, RebalancePartitions, Repartition, 
RepartitionByExpression, Sort, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LIMIT, OUTER_JOIN, 
SORT}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule supports push down local limit and local sort from TopK through 
other join:
+ *   - for a left outer join, the references of ordering of TopK come from the 
left side and
+ * the limits of TopK is smaller than left side max rows
+ *   - for a right outer join, the references of ordering of TopK come from 
the right side and
+ * the limits of TopK is smaller than right side max rows
+ */
+object PushLocalTopKThroughOuterJoin extends Rule[LogicalPlan] {
+  private def smallThan(limits: Int, maxRowsOpt: Option[Long]): Boolean = 
maxRowsOpt match {
+case Some(maxRows) => limits < maxRows
+case _ => true

Review Comment:
   If we don't know the max rows, should we be conservative and skip the pushdown?
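   
   A sketch of that conservative variant (same shape as the quoted helper, written standalone):
   
   ```scala
   // When the child's maxRowsPerPartition is unknown, refuse to push the local top-K
   // down instead of assuming it is safe.
   def smallThanConservative(limits: Int, maxRowsOpt: Option[Long]): Boolean =
     maxRowsOpt.exists(maxRows => limits < maxRows)

   smallThanConservative(10, Some(100L))  // true  -> eligible for pushdown
   smallThanConservative(10, None)        // false -> skipped, unlike the quoted `case _ => true`
   ```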



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-11 Thread GitBox


viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918560968


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala:
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, SortOrder}
+import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, 
ExtractTopK}
+import org.apache.spark.sql.catalyst.plans.{JoinType, LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LocalLimit, 
LogicalPlan, Project, RebalancePartitions, Repartition, 
RepartitionByExpression, Sort, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LIMIT, OUTER_JOIN, 
SORT}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule supports push down local limit and local sort from TopK through 
other join:

Review Comment:
   ```suggestion
* This rule supports push down local limit and local sort from TopK through 
outer join:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-11 Thread GitBox


viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918559156


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -412,6 +412,14 @@ object SQLConf {
   .longConf
   .createWithDefault(67108864L)
 
+  val PUSH_DOWN_LOCAL_TOPK_LIMIT_THRESHOLD =
+buildConf("spark.sql.optimizer.pushDownLocalTopKLimitThreshold")
+  .doc("If the limit number does not larger than this threshold, Spark 
will try to push " +

Review Comment:
   ```suggestion
 .doc("If the limit number is not larger than this threshold, Spark 
will try to push " +
   ```
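   
   For context, a hedged usage sketch of the proposed config (the key comes from the quoted diff and only takes effect if this PR lands):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[1]").appName("topk-threshold").getOrCreate()

   spark.conf.set("spark.sql.optimizer.pushDownLocalTopKLimitThreshold", "100")
   // A query whose limit (20) is not larger than the threshold would be eligible
   // for the local top-K pushdown described by this rule.
   spark.sql("SELECT * FROM VALUES (1), (2), (3) AS t(x) ORDER BY x LIMIT 20").show()
   ```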



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager

2022-07-11 Thread GitBox


mridulm commented on PR #37052:
URL: https://github.com/apache/spark/pull/37052#issuecomment-1181327891

   Merged to master and branch-3.3
   Thanks for fixing this @otterc !
   Thanks for reviewing @attilapiros, @zhouyejoe, @Ngone51 and @weixiuli :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm closed pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager

2022-07-11 Thread GitBox


mridulm closed pull request #37052: [SPARK-39647][CORE] Register the executor 
with ESS before registering the BlockManager
URL: https://github.com/apache/spark/pull/37052


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager

2022-07-11 Thread GitBox


mridulm commented on PR #37052:
URL: https://github.com/apache/spark/pull/37052#issuecomment-1181326481

   Merging to master and branch-3.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918550920


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, 
EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values 
in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary 
comparison.
+ *
+ * 1. Converts the binary comparison to true literal when the comparison value 
must be true.
+ * 2. Converts the binary comparison to false literal when the comparison 
value must be false.
+ */
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+  EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
+  TrueLiteral
+case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>

Review Comment:
   we should also handle the `rand < 0.0` case
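   
   A standalone sketch of the bound checks involved (this mirrors, but is not, the Catalyst rule; `Rand()` values lie in [0.0, 1.0)):
   
   ```scala
   // Folding decisions for comparisons between Rand() and a double literal v.
   sealed trait Folded
   case object AlwaysTrue  extends Folded
   case object AlwaysFalse extends Folded
   case object NotFoldable extends Folded

   // rand < v
   def foldRandLessThanLiteral(v: Double): Folded =
     if (v <= 0.0) AlwaysFalse      // rand >= 0.0, so rand < v is impossible
     else if (v >= 1.0) AlwaysTrue  // rand < 1.0 <= v, so always true
     else NotFoldable

   // v < rand
   def foldLiteralLessThanRand(v: Double): Folded =
     if (v < 0.0) AlwaysTrue        // v < 0.0 <= rand, so always true
     else if (v >= 1.0) AlwaysFalse // rand < 1.0 <= v, so impossible
     else NotFoldable               // v == 0.0 is NOT foldable: rand can be exactly 0.0
   ```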



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918550376


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,145 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation

Review Comment:
   I think we can remove `### General Aggregation` and start with `#### Syntax` 
directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918549516


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,145 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+These aggregate Functions use general syntax.
+
+ Syntax
+
+```sql
+aggregate_function FILTER (WHERE boolean_expression)

Review Comment:
   ```suggestion
   aggregate_function(input1 [, input2, ...]) FILTER (WHERE boolean_expression)
   ```
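   
   A quick illustration of that syntax (table and column names are made up):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[1]").appName("agg-filter").getOrCreate()

   // FILTER applies a per-aggregate predicate before the rows are aggregated.
   spark.sql(
     """SELECT grp,
       |       count(x)                      AS all_rows,
       |       count(x) FILTER (WHERE x > 2) AS filtered_rows
       |FROM VALUES (1, 'a'), (2, 'a'), (3, 'b'), (4, 'b') AS t(x, grp)
       |GROUP BY grp""".stripMargin).show()
   ```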



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918549516


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,145 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+These aggregate Functions use general syntax.
+
+ Syntax
+
+```sql
+aggregate_function FILTER (WHERE boolean_expression)

Review Comment:
   ```suggestion
   aggregate_function(input1, ...) FILTER (WHERE boolean_expression)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-11 Thread GitBox


singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918534436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala:
##
@@ -17,16 +17,40 @@
 
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
 
+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._
 
 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: 
size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
 
-  /** Falls back to the estimation computed by 
[[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = 
SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input 
row number is the
+   * same as the output row number, and compute sizes based on the column 
types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+// There should be some overhead in Row object, the size should not be 
zero when there is
+// no columns, this help to prevent divide-by-zero error.
+val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+// Assume there will be the same number of rows as child has.
+var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / 
childRowSize
+if (sizeInBytes == 0) {
+  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
+  // (product of children).
+  sizeInBytes = 1
+}
+
+// v2 sources can bubble-up rowCount, so always propagate.
+// Don't propagate attributeStats, since they are not estimated here.
+Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment:
   In this estimator (`visitUnaryNode`), we adjust the size by scaling it by (output row size / child row size), but since we don't have much info (min / max / ndv, etc.) to estimate the row count, we simply report the child's output row count, which is mostly accurate for operators like Project.
   
   Since we were just computing sizeInBytes and propagating rowCounts; apologies, I forgot to update the comment as per the proposed behaviour.
   
   Should I rephrase it to: 
   - `estimates size in bytes, row count for plan stats`
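   
   For clarity, the scaling described above with made-up numbers:
   
   ```scala
   // child sizeInBytes = 1000, child row width = 20 bytes, output row width = 10 bytes
   val childSizeInBytes = BigInt(1000)
   val childRowWidth    = BigInt(20)
   val outputRowWidth   = BigInt(10)
   val estimatedSize    = childSizeInBytes * outputRowWidth / childRowWidth  // 500
   // rowCount is propagated from the child unchanged (e.g. Some(50) stays Some(50)).
   ```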



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-11 Thread GitBox


singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918544779


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala:
##
@@ -17,16 +17,40 @@
 
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
 
+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._
 
 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: 
size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
 
-  /** Falls back to the estimation computed by 
[[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = 
SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input 
row number is the
+   * same as the output row number, and compute sizes based on the column 
types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+// There should be some overhead in Row object, the size should not be 
zero when there is
+// no columns, this help to prevent divide-by-zero error.
+val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+// Assume there will be the same number of rows as child has.
+var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / 
childRowSize
+if (sizeInBytes == 0) {
+  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
+  // (product of children).
+  sizeInBytes = 1
+}
+
+// v2 sources can bubble-up rowCount, so always propagate.
+// Don't propagate attributeStats, since they are not estimated here.
+Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment:
   For DSv2 sources, rowCount can be passed from the relation itself without running `ANALYZE`; hence BasicStatsPlanVisitor, which will be our default after this change, will take rowCount into consideration.
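   
   A sketch of how a DSv2 connector reports those stats without `ANALYZE` (assuming the standard connector API; numbers are made up):
   
   ```scala
   import java.util.OptionalLong
   import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
   import org.apache.spark.sql.types.StructType

   class ScanWithStats extends SupportsReportStatistics {
     // SupportsReportStatistics extends Scan, so readSchema() must be provided too.
     override def readSchema(): StructType = new StructType().add("id", "long")

     // Both sizeInBytes and numRows can be bubbled up from the source.
     override def estimateStatistics(): Statistics = new Statistics {
       override def sizeInBytes(): OptionalLong = OptionalLong.of(8L * 1000000L)
       override def numRows(): OptionalLong = OptionalLong.of(1000000L)
     }
   }
   ```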



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-11 Thread GitBox


singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918534436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala:
##
@@ -17,16 +17,40 @@
 
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
 
+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._
 
 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: 
size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
 
-  /** Falls back to the estimation computed by 
[[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = 
SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input 
row number is the
+   * same as the output row number, and compute sizes based on the column 
types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+// There should be some overhead in Row object, the size should not be 
zero when there is
+// no columns, this help to prevent divide-by-zero error.
+val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+// Assume there will be the same number of rows as child has.
+var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / 
childRowSize
+if (sizeInBytes == 0) {
+  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
+  // (product of children).
+  sizeInBytes = 1
+}
+
+// v2 sources can bubble-up rowCount, so always propagate.
+// Don't propagate attributeStats, since they are not estimated here.
+Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment:
   In this estimator (`visitUnaryNode`), we adjust the size by scaling it by (output row size / child row size), but since we don't have much info (min / max / ndv, etc.) to estimate the row count, we simply report the child's output row count, which is mostly accurate for operators like Project.
   
   Since we were just computing sizeInBytes and propagating rowCounts as-is, I left the comment as it is.
   
   Should I rephrase it to: 
   - `estimates size in bytes, row count for plan stats`



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala:
##
@@ -17,16 +17,40 @@
 
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
 
+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._
 
 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: 
size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
 
-  /** Falls back to the estimation computed by 
[[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = 
SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input 
row number is the
+   * same as the output row number, and compute sizes based on the column 
types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+// There should be some overhead in Row object, the size should not be 
zero when there is
+// no columns, this help to prevent divide-by-zero error.
+val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+// Assume there will be the same number of rows as child has.
+var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / 
childRowSize
+if (sizeInBytes == 0) {
+  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
+  // (product of children).
+  sizeInBytes = 1
+}
+
+// v2 sources can bubble-up rowCount, so always propagate.
+// Don't propagate attributeStats, since they are not estimated here.
+Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment:
   For DSv2 sources, rowCount can be passed from the relation itself without running `ANALYZE`; hence BasicStatsPlanVisitor, which will be our default, will take rowCount into consideration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


--

[GitHub] [spark] AmplabJenkins commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy

2022-07-11 Thread GitBox


AmplabJenkins commented on PR #37147:
URL: https://github.com/apache/spark/pull/37147#issuecomment-1181307870

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-11 Thread GitBox


singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918534436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala:
##
@@ -17,16 +17,40 @@
 
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
 
+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._
 
 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: 
size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
 
-  /** Falls back to the estimation computed by 
[[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = 
SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input 
row number is the
+   * same as the output row number, and compute sizes based on the column 
types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+// There should be some overhead in Row object, the size should not be 
zero when there is
+// no columns, this help to prevent divide-by-zero error.
+val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+// Assume there will be the same number of rows as child has.
+var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / 
childRowSize
+if (sizeInBytes == 0) {
+  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
+  // (product of children).
+  sizeInBytes = 1
+}
+
+// v2 sources can bubble-up rowCount, so always propagate.
+// Don't propagate attributeStats, since they are not estimated here.
+Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment:
   In this estimator, we adjust the size by scaling it by (output row size / input row size), but since we don't have much info (in terms of min / max / ndv etc.) to estimate the row count, we simply report the child's row count as the node's row count, which is mostly accurate for operators like Project.
   
   Since we were just computing sizeInBytes and propagating rowCount as-is, I left the comment as it is.
   
   Should I rephrase it to, let's say:
   - `estimates size in bytes, row count for plan stats`



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import org.apache.spark.sql.catalyst.plans.logical._
+
+/**
+ * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ */
+object AdvancedStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
+
+  /** Falls back to the estimation computed by [[BasicStatsPlanVisitor]]. */
+  private def fallback(p: LogicalPlan): Statistics = 
BasicStatsPlanVisitor.visit(p)
+
+  override def default(p: LogicalPlan): Statistics = fallback(p)
+
+  override def visitAggregate(p: Aggregate): Statistics = {
+AggregateEstimation.estimate(p).getOrElse(fallback(p))
+  }
+
+  override def visitDistinct(p: Distinct): Statistics = {
+val child = p.child
+visitAggregate(Aggregate(child.output, child.output, child))
+  }
+
+  override def visitExcept(p: Except): Statistics = fallback(p)
+
+  override def visitExpand(p: Expand): Statistics = fallback(p)
+
+  override def visitFilter(p: Filter): Statistics = {
+FilterEstimation(p).estimate.getOrElse(fallback(p))
+  }
+
+  override def visitGenerate(p: Generate): Statistics = default(p)
+
+  override def visitGlobalLimit(p: GlobalLimit): Statistics = fallback(p)
+
+  override def visitOffset(p: Offset): Statistics = fallback(p)
+
+  override def visitIntersect(p: Intersect): Statistics = fallback(p)
+
+  override def visitJoin(p: Join): Statistics = fallback(p)

Review Comment:
   fallback here would end up calling 

[GitHub] [spark] beliefer commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


beliefer commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918523599


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions

Review Comment:
   Let's add a syntax section to the `General Aggregation` section too.



##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions

Review Comment:
   Let's add a syntax section to the `General Aggregation` section too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


beliefer commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918520601


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions
+
+These aggregate Functions use different syntax than the other aggregate 
functions so that to specify an expression (typically a column name) by which 
to order the values.
+
+ Syntax
+
+```sql
+{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { 
order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) 
FILTER (WHERE boolean_expression)
+```
+
+ Parameters
+
+* **percentile**
+
+The percentile of the value that you want to find. The percentile must be 
a constant between 0.0 and 1.0.
+
+* **order_by_expression**
+
+The expression (typically a column name) by which to order the values.
+
+* **boolean_expression**
+
+Specifies any expression that evaluates to a result type boolean. Two or 
more expressions may be combined together using the logical operators ( AND, OR 
).
+
+ Examples
+
+```sql
+CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES
+('Diane Murphy','Accounting',8435),
+('Mary Patterson','Accounting',9998),
+('Jeff Firrelli','Accounting',8992),
+('William Patterson','Accounting',8870),
+('Gerard Bondur','Accounting',11472),
+('Anthony Bow','Accounting',6627),
+('Leslie Jennings','IT',8113),
+('Leslie Thompson','IT',5186),
+('Julie Firrelli','Sales',9181),
+('Steve Patterson','Sales',9441),
+('Foon Yue Tseng','Sales',6660),
+('George Vanauf','Sales',10563),
+('Loui Bondur','SCM',10449),
+('Gerard Hernandez','SCM',6949),
+('Pamela Castillo','SCM',11303),
+('Larry Bott','SCM',11798),
+('Barry Jones','SCM',10586)
+AS basic_pays(employee_name, department, salary);
+
+SELECT * FROM basic_pays;
++-+--+--+
+|employee_name|department|salary|
++-+--+--+
+|  Anthony Bow|Accounting| 6627|
+|  Barry Jones|   SCM| 10586|
+| Diane Murphy|Accounting| 8435|
+|   Foon Yue Tseng| Sales| 6660|
+|George Vanauf| Sales| 10563|
+|Gerard Bondur|Accounting| 11472|
+| Gerard Hernandez|   SCM| 6949|
+|Jeff Firrelli|Accounting| 8992|
+|   Julie Firrelli| Sales| 9181|
+|   Larry Bott|   SCM| 11798|
+|  Leslie Jennings|IT| 8113|
+|  Leslie Thompson|IT| 5186|
+|  Loui Bondur|   SCM| 10449|
+|   Mary Patterson|Accounting| 9998|
+|  Pamela Castillo|   SCM| 11303|
+|  Steve Patterson| Sales| 9441|
+|William Patterson|Accounting| 8870|
++-+--+--+
+
+SELECT
+department,

Review Comment:
   These examples are in the Ordered-Set Aggregate Functions section.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests

2022-07-11 Thread GitBox


Yikun commented on code in PR #37117:
URL: https://github.com/apache/spark/pull/37117#discussion_r918485337


##
python/pyspark/ml/util.py:
##
@@ -536,10 +536,8 @@ def __get_class(clazz: str) -> Type[RL]:
 """
 parts = clazz.split(".")
 module = ".".join(parts[:-1])
-m = __import__(module)
-for comp in parts[1:]:
-m = getattr(m, comp)
-return m
+m = __import__(module, fromlist=[parts[-1]])

Review Comment:
   I also took a deep look at [this](https://docs.python.org/3/library/functions.html#import__):
   
   > When the name variable is of the form package.module, normally, the 
top-level package (the name up till the first dot) is returned, not the module 
named by name. However, when a non-empty fromlist argument is given, the module 
named by name is returned.
   
   So for `pyspark.ml.classification.LinearSVC`:
   
   ```python
   >>> clazz = "pyspark.ml.classification.LinearSVC";parts = clazz.split(".");module = ".".join(parts[:-1])
   >>> __import__(module)
   <module 'pyspark' from '.../pyspark/__init__.py'>
   >>> __import__(module, fromlist=[parts[-1]])
   <module 'pyspark.ml.classification' from '.../pyspark/ml/classification.py'>
   ```
   
   and it also returns the same results for wrong names like `pyspark.ml.classification.unknow` or `pyspark.ml.unknow.LinearSVC`
   
   After this, the import with a non-empty fromlist works well; it also simplifies the implementation and cleans up the mypy error:
   ```
   python/pyspark/ml/util.py:542: error: Incompatible return value type (got 
Module, expected "Type[RL]")  [return-value]
   ```
   
   So, I'm fine with this change.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


beliefer commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918512433


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, 
EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values 
in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary 
comparison.
+ *
+ * 1. Converts the binary comparison to true literal when the comparison value 
must be true.
+ * 2. Converts the binary comparison to false literal when the comparison 
value must be false.
+ */
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+  EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>

Review Comment:
   I feel that swap introduces additional complexity and reduces readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #37162: [SPARK-38910][YARN][FOLLOWUP] Clean spark staging before unregister

2022-07-11 Thread GitBox


AngersZh commented on PR #37162:
URL: https://github.com/apache/spark/pull/37162#issuecomment-1181266134

   Waiting for @tgravescs to come back and review this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu opened a new pull request, #37162: [SPARK-38910][YARN][FOLLOWUP] Clean spark staging before unregister

2022-07-11 Thread GitBox


AngersZh opened a new pull request, #37162:
URL: https://github.com/apache/spark/pull/37162

   ### What changes were proposed in this pull request?
   After discussing https://github.com/apache/spark/pull/36207 and re-checking the whole logic, we should revert https://github.com/apache/spark/pull/36207 and make the following changes (see the sketch below):
   
   1. If it's the last attempt, YARN won't rerun the job anyway, so we can clean the staging dir first; that way no staging dir is left behind if unregister fails.
   2. If it's not the last attempt and the final status is SUCCESS, YARN can rerun the job when unregister fails, so we can't clean the staging dir before unregister has succeeded.
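   
   A rough sketch of the ordering being described (the names here are placeholders, not the actual ApplicationMaster code):
   
   ```scala
   object StagingCleanupSketch {
     var isLastAttempt: Boolean = false
   
     def unregister(finalStatus: String): Unit =
       println(s"unregister with status $finalStatus") // may fail in the real AM
   
     def cleanupStagingDir(): Unit =
       println("delete the staging dir")
   
     def finish(finalStatus: String): Unit = {
       if (isLastAttempt) {
         // YARN won't rerun the job anyway, so clean the staging dir first;
         // even if unregister fails afterwards, no staging dir is left behind.
         cleanupStagingDir()
         unregister(finalStatus)
       } else {
         // Not the last attempt: if unregister fails, YARN may rerun the job,
         // so the staging dir must survive until unregister has succeeded.
         unregister(finalStatus)
         cleanupStagingDir()
       }
     }
   }
   ```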
   
   
   ### Why are the changes needed?
   Revert the change and make the logic more accurate
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918509777


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions
+
+These aggregate Functions use different syntax than the other aggregate 
functions so that to specify an expression (typically a column name) by which 
to order the values.
+
+ Syntax
+
+```sql
+{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { 
order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) 
FILTER (WHERE boolean_expression)
+```
+
+ Parameters
+
+* **percentile**
+
+The percentile of the value that you want to find. The percentile must be 
a constant between 0.0 and 1.0.
+
+* **order_by_expression**
+
+The expression (typically a column name) by which to order the values.
+
+* **boolean_expression**
+
+Specifies any expression that evaluates to a result type boolean. Two or 
more expressions may be combined together using the logical operators ( AND, OR 
).
+
+ Examples
+
+```sql
+CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES
+('Diane Murphy','Accounting',8435),
+('Mary Patterson','Accounting',9998),
+('Jeff Firrelli','Accounting',8992),
+('William Patterson','Accounting',8870),
+('Gerard Bondur','Accounting',11472),
+('Anthony Bow','Accounting',6627),
+('Leslie Jennings','IT',8113),
+('Leslie Thompson','IT',5186),
+('Julie Firrelli','Sales',9181),
+('Steve Patterson','Sales',9441),
+('Foon Yue Tseng','Sales',6660),
+('George Vanauf','Sales',10563),
+('Loui Bondur','SCM',10449),
+('Gerard Hernandez','SCM',6949),
+('Pamela Castillo','SCM',11303),
+('Larry Bott','SCM',11798),
+('Barry Jones','SCM',10586)
+AS basic_pays(employee_name, department, salary);
+
+SELECT * FROM basic_pays;
++-+--+--+
+|employee_name|department|salary|
++-+--+--+
+|  Anthony Bow|Accounting| 6627|
+|  Barry Jones|   SCM| 10586|
+| Diane Murphy|Accounting| 8435|
+|   Foon Yue Tseng| Sales| 6660|
+|George Vanauf| Sales| 10563|
+|Gerard Bondur|Accounting| 11472|
+| Gerard Hernandez|   SCM| 6949|
+|Jeff Firrelli|Accounting| 8992|
+|   Julie Firrelli| Sales| 9181|
+|   Larry Bott|   SCM| 11798|
+|  Leslie Jennings|IT| 8113|
+|  Leslie Thompson|IT| 5186|
+|  Loui Bondur|   SCM| 10449|
+|   Mary Patterson|Accounting| 9998|
+|  Pamela Castillo|   SCM| 11303|
+|  Steve Patterson| Sales| 9441|
+|William Patterson|Accounting| 8870|
++-+--+--+
+
+SELECT
+department,

Review Comment:
   can we give some examples for simple agg functions such as count, sum?
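   
   For what it's worth, examples of the kind being requested could look like the following (shown via `spark.sql` here; the doc page itself would present plain SQL, and `basic_pays` is the view from the quoted snippet):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("agg-doc-examples").getOrCreate()
   
   // Simple aggregates over the whole table.
   spark.sql("SELECT count(*) AS cnt, sum(salary) AS total_salary FROM basic_pays").show()
   
   // The same kind of aggregates, computed per group.
   spark.sql(
     "SELECT department, count(*) AS cnt, avg(salary) AS avg_salary " +
     "FROM basic_pays GROUP BY department").show()
   ```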



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918508889


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions
+
+These aggregate Functions use different syntax than the other aggregate 
functions so that to specify an expression (typically a column name) by which 
to order the values.
+
+ Syntax
+
+```sql
+{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { 
order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) 
FILTER (WHERE boolean_expression)
+```
+
+ Parameters
+
+* **percentile**
+
+The percentile of the value that you want to find. The percentile must be 
a constant between 0.0 and 1.0.
+
+* **order_by_expression**
+
+The expression (typically a column name) by which to order the values.

Review Comment:
   ```suggestion
   The expression (typically a column name) by which to order the values 
before aggregating them.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918508166


##
docs/sql-ref-syntax-qry-select-aggregate.md:
##
@@ -0,0 +1,125 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical 
calculations such as sum, average, counting, minimum/maximum values, standard 
deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation
+
+Please refer to the [Built-in Aggregation 
Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a 
complete list of Spark aggregate functions.
+
+### Ordered-Set Aggregate Functions

Review Comment:
   shall we have a new section before this to talk about `FILTER (WHERE ...)`?
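   
   For example, such a section could show something along these lines (illustrative only, again via `spark.sql`, using the `basic_pays` view from the quoted snippet):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("agg-filter-example").getOrCreate()
   
   // FILTER (WHERE ...) restricts which rows feed each individual aggregate.
   spark.sql(
     "SELECT department, " +
     "       count(*) AS all_employees, " +
     "       count(*) FILTER (WHERE salary > 10000) AS well_paid " +
     "FROM basic_pays GROUP BY department").show()
   ```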



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread GitBox


zhengruifeng commented on PR #37135:
URL: https://github.com/apache/spark/pull/37135#issuecomment-1181262398

   Merged to master, thanks @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #37153: [SPARK-26052] Add type comments to exposed Prometheus metrics

2022-07-11 Thread GitBox


AmplabJenkins commented on PR #37153:
URL: https://github.com/apache/spark/pull/37153#issuecomment-1181262208

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread GitBox


zhengruifeng closed pull request #37135: [SPARK-39723][R] Implement 
functionExists/getFunc in SparkR for 3L namespace
URL: https://github.com/apache/spark/pull/37135


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918506608


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PruneFiltersSuite.scala:
##
@@ -129,9 +129,11 @@ class PruneFiltersSuite extends PlanTest {
   }
 
   test("Nondeterministic predicate is not pruned") {
-val originalQuery = testRelation.where(Rand(10) > 
5).select($"a").where(Rand(10) > 5).analyze

Review Comment:
   why do we need to change this file? The new rule is not invoked in this test 
suite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918506272


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, 
EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values 
in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary 
comparison.
+ *
+ * 1. Converts the binary comparison to true literal when the comparison value 
must be true.
+ * 2. Converts the binary comparison to false literal when the comparison 
value must be false.
+ */
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+  EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>

Review Comment:
   can we swap the comparison so that we don't need to handle each comparison 
twice?
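   
   For readers skimming the thread, a rough sketch of what such a swap could look like — illustrative only, not code from this PR; it normalizes the literal/Rand order once so each comparison is written a single time:
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions._
   import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
   
   // Put Rand on the left-hand side so the elimination cases below only need
   // to consider one orientation per comparison.
   def normalize(e: Expression): Expression = e match {
     case GreaterThan(lit @ DoubleLiteral(_), r: Rand)        => LessThan(r, lit)
     case GreaterThanOrEqual(lit @ DoubleLiteral(_), r: Rand) => LessThanOrEqual(r, lit)
     case LessThan(lit @ DoubleLiteral(_), r: Rand)           => GreaterThan(r, lit)
     case LessThanOrEqual(lit @ DoubleLiteral(_), r: Rand)    => GreaterThanOrEqual(r, lit)
     case other => other
   }
   
   // rand() is in [0.0, 1.0), so after normalization each case appears once, e.g.:
   def eliminate(e: Expression): Expression = normalize(e) match {
     case LessThan(_: Rand, DoubleLiteral(v)) if v >= 1.0           => TrueLiteral  // always true
     case GreaterThanOrEqual(_: Rand, DoubleLiteral(v)) if v >= 1.0 => FalseLiteral // always false
     case other => other
   }
   ```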



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918506022


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, 
GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, 
EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values 
in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary 
comparison.

Review Comment:
   1.0 or 0.0



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #37161: [SPARK-39748][SQL][SS] Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread GitBox


HeartSaVioR commented on PR #37161:
URL: https://github.com/apache/spark/pull/37161#issuecomment-1181260007

   cc. @cloud-fan @viirya Please take a look. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions

2022-07-11 Thread GitBox


beliefer commented on PR #37116:
URL: https://github.com/apache/spark/pull/37116#issuecomment-1181259930

   ping @cloud-fan 
   https://github.com/apache/spark/pull/37150 merged, please review this PR 
again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR opened a new pull request, #37161: [SPARK-39748][SQL][SS] Include the origin logical plan for LogicalRDD if it comes from DataFrame

2022-07-11 Thread GitBox


HeartSaVioR opened a new pull request, #37161:
URL: https://github.com/apache/spark/pull/37161

   ### What changes were proposed in this pull request?
   
   This PR proposes to include the origin logical plan for LogicalRDD if the LogicalRDD is built from a DataFrame's RDD. Once the origin logical plan is available, LogicalRDD produces the stats from the origin logical plan rather than the default one.
   
   Also, this PR applies the change to ForeachBatchSink, which seems to be the only such case in the current codebase.
   
   ### Why are the changes needed?
   
   The origin logical plan can be useful for several use cases, including:
   
   1. connecting the two split logical plans into one (consider the foreachBatch sink: the origin logical plan represents the streaming query, while the logical plan of the new Dataset represents the batch query inside the user function; see the sketch below)
   2. inheriting plan stats from the origin logical plan
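   
   For context, a minimal sketch of the foreachBatch pattern referred to in use case 1 (names such as `streamingDf` are placeholders, not from the PR):
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}
   
   val spark = SparkSession.builder().appName("foreachBatch-sketch").getOrCreate()
   
   // A toy streaming source; any streaming DataFrame behaves the same way.
   val streamingDf: DataFrame = spark.readStream.format("rate").load()
   
   // Inside the user function, batchDf is backed by a LogicalRDD; with this change
   // it would also carry the origin (streaming) logical plan and its stats.
   val processBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
     batchDf.groupBy("value").count()
       .write.format("noop").mode("append").save()
   }
   
   streamingDf.writeStream.foreachBatch(processBatch).start()
   ```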
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #36207: [SPARK-38910][YARN] Clean spark staging before `unregister`

2022-07-11 Thread GitBox


AngersZh commented on PR #36207:
URL: https://github.com/apache/spark/pull/36207#issuecomment-1181258817

   @tgravescs After re-checking the whole logic, I got your point: although it says here that we don't need to rerun, if unregister fails and it's not the last attempt, YARN will rerun this job.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter

2022-07-11 Thread GitBox


beliefer commented on PR #37150:
URL: https://github.com/apache/spark/pull/37150#issuecomment-1181258765

   @cloud-fan Thank you !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #37160: [SPARK-39749][SQL] Use plain string representation on casting Decimal to String

2022-07-11 Thread GitBox


gengliangwang commented on PR #37160:
URL: https://github.com/apache/spark/pull/37160#issuecomment-1181252406

   cc @timarmstrong @entong @cloud-fan @srielau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang opened a new pull request, #37160: [SPARK-39749][SQL] Use plain string representation on casting Decimal to String

2022-07-11 Thread GitBox


gengliangwang opened a new pull request, #37160:
URL: https://github.com/apache/spark/pull/37160

   
   
   ### What changes were proposed in this pull request?
   
   Currently, casting a decimal to string type results in strings in exponential notation if the adjusted exponent is less than -6. This is consistent with BigDecimal.toString: https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString
   
   After this PR, the cast always uses the plain string representation.
   
   ### Why are the changes needed?
   
   
   1. The current behavior is not compliant with the ANSI SQL standard.
   https://user-images.githubusercontent.com/1097932/178395756-baecbe90-7a5f-4b4c-b63c-9f1fdf656107.png
   https://user-images.githubusercontent.com/1097932/178395567-fa5b6877-ff08-48b5-b715-243c954d6bbc.png
   
   2. It is different from databases like PostgreSQL, Oracle, MS SQL Server, etc.
   3. The current behavior may surprise users since it only happens when the adjusted exponent is less than -6. The following query returns `false` by default (when ANSI SQL mode is off), since `0.0000000123` is converted to `1.23E-8`:
   ```sql
   select '0.0000000123' in (0.0000000123, '0.123');
   ```
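   
   For reference, the difference comes down to `java.math.BigDecimal.toString` versus `toPlainString` (a quick illustration, not part of the PR diff):
   
   ```scala
   import java.math.BigDecimal
   
   val d = new BigDecimal("0.0000000123")
   println(d.toString)      // 1.23E-8       (scientific once the adjusted exponent < -6)
   println(d.toPlainString) // 0.0000000123  (the plain form the cast is meant to produce)
   ```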
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. After the changes, Spark SQL always uses the plain string representation when casting Decimal to String. To restore the legacy behavior, which uses scientific notation if the adjusted exponent is less than -6, set `spark.sql.legacy.castDecimalToString.enabled` to `true`.
   
   ### How was this patch tested?
   
   Unit test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #37123: [SPARK-39711][TESTS] Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging

2022-07-11 Thread GitBox


huaxingao commented on PR #37123:
URL: https://github.com/apache/spark/pull/37123#issuecomment-1181244052

   Merged to master. Thanks @panbingkun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao closed pull request #37123: [SPARK-39711][TESTS] Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging

2022-07-11 Thread GitBox


huaxingao closed pull request #37123: [SPARK-39711][TESTS] Remove redundant 
trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
URL: https://github.com/apache/spark/pull/37123


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter

2022-07-11 Thread GitBox


cloud-fan closed pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and 
`PERCENTILE_DISC` should support aggregate filter
URL: https://github.com/apache/spark/pull/37150


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter

2022-07-11 Thread GitBox


cloud-fan commented on PR #37150:
URL: https://github.com/apache/spark/pull/37150#issuecomment-1181237260

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests

2022-07-11 Thread GitBox


Yikun commented on code in PR #37117:
URL: https://github.com/apache/spark/pull/37117#discussion_r918485337


##
python/pyspark/ml/util.py:
##
@@ -536,10 +536,8 @@ def __get_class(clazz: str) -> Type[RL]:
 """
 parts = clazz.split(".")
 module = ".".join(parts[:-1])
-m = __import__(module)
-for comp in parts[1:]:
-m = getattr(m, comp)
-return m
+m = __import__(module, fromlist=[parts[-1]])

Review Comment:
   I also took a deep look at [this](https://docs.python.org/3/library/functions.html#import__):
   
   > When the name variable is of the form package.module, normally, the 
top-level package (the name up till the first dot) is returned, not the module 
named by name. However, when a non-empty fromlist argument is given, the module 
named by name is returned.
   
   So for `pyspark.ml.classification.LinearSVC`:
   
   ```python
   >>> clazz = "pyspark.ml.classification.LinearSVC";parts = clazz.split(".");module = ".".join(parts[:-1])
   >>> __import__(module)
   <module 'pyspark' from '.../pyspark/__init__.py'>
   >>> __import__(module, fromlist=[parts[-1]])
   <module 'pyspark.ml.classification' from '.../pyspark/ml/classification.py'>
   ```
   
   and it also returns the same results for wrong names like `pyspark.ml.classification.unknow` or `pyspark.ml.unknow.LinearSVC`
   
   After this, the import with a non-empty fromlist works well; it also simplifies the implementation and cleans up the mypy error:
   ```
   python/pyspark/ml/util.py:542: error: Incompatible return value type (got 
Module, expected "Type[RL]")  [return-value]
   ```
   
   So, I'm fine with this change.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918484209


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   My point is, we can use a dedicated datetime parser when inferring date types, which does not need to respect `legacyTimeParserPolicy`, so there is no need to fail for an unsupported parser policy.
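   
   To illustrate the idea (only a sketch of the approach, not the actual CSV inference code; the pattern is an example):
   
   ```scala
   import java.time.LocalDate
   import java.time.format.{DateTimeFormatter, DateTimeParseException, ResolverStyle}
   
   // A small, self-contained date check used only during schema inference,
   // independent of legacyTimeParserPolicy.
   val inferenceFormatter =
     DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT)
   
   def looksLikeDate(field: String): Boolean =
     try { LocalDate.parse(field, inferenceFormatter); true }
     catch { case _: DateTimeParseException => false }
   ```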



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy

2022-07-11 Thread GitBox


cloud-fan commented on PR #37147:
URL: https://github.com/apache/spark/pull/37147#issuecomment-1181227634

   If the legacy behavior is unreasonable, I think we don't have to keep it. If 
datetime pattern is specified, we should not fall back to the legacy code path, 
even if it only supports 4 digits like Spark 2.x.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37001: [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

2022-07-11 Thread GitBox


beliefer commented on code in PR #37001:
URL: https://github.com/apache/spark/pull/37001#discussion_r918480531


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##
@@ -165,106 +169,63 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] 
with PredicateHelper wit
 val pushedAggregates = 
finalTranslatedAggregates.filter(r.pushAggregation)
 if (pushedAggregates.isEmpty) {
   aggNode // return original plan node
+} else if (r.supportCompletePushDown(pushedAggregates.get)) {
+  // If we can push down the aggregation completely, we need 
to consider
+  // pushing down the Limit or Offset operator, so keep the 
information about
+  // aggregation push down and push down aggregation later.
+  sHolder.pushedAggregation = Some(PushedAggregation(
+pushedAggregates,
+finalResultExpressions,
+finalAggregates,
+normalizedGroupingExpressions,
+aggExprToOutputOrdinal))
+  sHolder

Review Comment:
   Got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #37156: [SPARK-39742][core]Fix a problem with the result of adjusting resources is not exp…

2022-07-11 Thread GitBox


AmplabJenkins commented on PR #37156:
URL: https://github.com/apache/spark/pull/37156#issuecomment-1181212204

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter

2022-07-11 Thread GitBox


beliefer commented on code in PR #37150:
URL: https://github.com/apache/spark/pull/37150#discussion_r918466540


##
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -849,7 +849,8 @@ primaryExpression
 | OVERLAY LEFT_PAREN input=valueExpression PLACING replace=valueExpression
   FROM position=valueExpression (FOR length=valueExpression)? RIGHT_PAREN  
#overlay
 | name=(PERCENTILE_CONT | PERCENTILE_DISC) LEFT_PAREN 
percentage=valueExpression RIGHT_PAREN
-  WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN ( OVER 
windowSpec)?#percentile
+WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN
+(FILTER LEFT_PAREN WHERE where=booleanExpression RIGHT_PAREN)? ( OVER 
windowSpec)? #percentile

Review Comment:
   ```
   org.apache.spark.sql.AnalysisException
   window aggregate function with filter predicate is not supported yet.
   ```
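   
   For reference, a hedged example of the aggregate (non-window) form this PR 
enables; the syntax is assumed from the quoted grammar change, while combining 
FILTER with OVER still raises the AnalysisException quoted above:
   
   ```scala
   // FILTER applied to PERCENTILE_CONT used as a regular aggregate.
   spark.sql("""
     SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v)
            FILTER (WHERE v > 0) AS median_positive
     FROM VALUES (-1), (1), (2), (3) AS t(v)
   """).show()
   ```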



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy

2022-07-11 Thread GitBox


sadikovi commented on PR #37147:
URL: https://github.com/apache/spark/pull/37147#issuecomment-1181198109

   Thanks for the reviews. I will address the comments and failing tests and 
update the PR.
   My question was whether there are any concerns with this change and whether 
users might experience compatibility issues. I would appreciate some thoughts 
on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job

2022-07-11 Thread GitBox


HyukjinKwon closed pull request #37158: [SPARK-39736][INFRA] Enable base image 
build in SparkR job
URL: https://github.com/apache/spark/pull/37158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job

2022-07-11 Thread GitBox


HyukjinKwon commented on PR #37158:
URL: https://github.com/apache/spark/pull/37158#issuecomment-1181176377

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread GitBox


HyukjinKwon commented on code in PR #37135:
URL: https://github.com/apache/spark/pull/37135#discussion_r918448953


##
R/pkg/NAMESPACE:
##
@@ -479,7 +479,9 @@ export("as.DataFrame",
"databaseExists",
"dropTempTable",
"dropTempView",
+   "functionExists",
"getDatabase",
+   "getFunction",

Review Comment:
   ```suggestion
  "getFunc",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace

2022-07-11 Thread GitBox


HyukjinKwon commented on code in PR #37135:
URL: https://github.com/apache/spark/pull/37135#discussion_r918449024


##
R/pkg/pkgdown/_pkgdown_template.yml:
##
@@ -266,7 +266,9 @@ reference:
   - databaseExists
   - dropTempTable
   - dropTempView
+  - functionExists
   - getDatabase
+  - getFunction

Review Comment:
   ```suggestion
 - getFunc
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job

2022-07-11 Thread GitBox


Yikun commented on PR #37158:
URL: https://github.com/apache/spark/pull/37158#issuecomment-1181166080

   Ready to go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunction in SparkR for 3L namespace

2022-07-11 Thread GitBox


HyukjinKwon commented on code in PR #37135:
URL: https://github.com/apache/spark/pull/37135#discussion_r918440696


##
R/pkg/tests/fulltests/test_context.R:
##
@@ -21,10 +21,11 @@ test_that("Check masked functions", {
   # Check that we are not masking any new function from base, stats, testthat 
unexpectedly
   # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as 
masked functions make it
   # hard for users to use base R functions. Please check when in doubt.
-  namesOfMaskedCompletely <- c("cov", "filter", "sample", "not")
+  namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "getFunction")

Review Comment:
   cc @falaki @felixcheung @shivaram FYI



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


HyukjinKwon commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918440378


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   Yeah, I don't think legacy mode supports this. Let's just throw an exception 
for now.
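   
   A minimal sketch of that guard, reusing the names from the quoted diff (the 
actual error class and message are up to the PR):
   
   ```scala
   // Sketch of the CSVOptions field; getBool, SQLConf and LegacyBehaviorPolicy are
   // already available in CSVOptions.scala per the quoted diff.
   val inferDate = {
     val inferDateFlag = getBool("inferDate")
     if (inferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
       // Assumption: a plain IllegalArgumentException; the PR may use a dedicated error class.
       throw new IllegalArgumentException(
         "CSV option 'inferDate' cannot be used with spark.sql.legacy.timeParserPolicy=LEGACY")
     }
     inferDateFlag
   }
   ```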



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunction in SparkR for 3L namespace

2022-07-11 Thread GitBox


zhengruifeng commented on code in PR #37135:
URL: https://github.com/apache/spark/pull/37135#discussion_r918440246


##
R/pkg/tests/fulltests/test_context.R:
##
@@ -21,10 +21,11 @@ test_that("Check masked functions", {
   # Check that we are not masking any new function from base, stats, testthat 
unexpectedly
   # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as 
masked functions make it
   # hard for users to use base R functions. Please check when in doubt.
-  namesOfMaskedCompletely <- c("cov", "filter", "sample", "not")
+  namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "getFunction")

Review Comment:
   got it, thanks. let me update the pr.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37157: [MINOR][FOLLOWUP] Remove redundant return

2022-07-11 Thread GitBox


HyukjinKwon closed pull request #37157: [MINOR][FOLLOWUP] Remove redundant 
return
URL: https://github.com/apache/spark/pull/37157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc

2022-07-11 Thread GitBox


HyukjinKwon closed pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] 
Remove try_to_char reference in the doc
URL: https://github.com/apache/spark/pull/37159


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc

2022-07-11 Thread GitBox


HyukjinKwon commented on PR #37159:
URL: https://github.com/apache/spark/pull/37159#issuecomment-1181096060

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #37147:
URL: https://github.com/apache/spark/pull/37147#discussion_r918421341


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -222,7 +226,11 @@ class UnivocityParser(
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
-// compatibility.
+// compatibility only if no custom pattern has been set. If there 
is a custom pattern,
+// fail since it may be different from the default pattern.
+if (options.dateFormatInRead.isDefined) {
+  throw e
+}

Review Comment:
   This change makes sense to me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #37147:
URL: https://github.com/apache/spark/pull/37147#discussion_r918421341


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -222,7 +226,11 @@ class UnivocityParser(
 } catch {
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
-// compatibility.
+// compatibility only if no custom pattern has been set. If there 
is a custom pattern,
+// fail since it may be different from the default pattern.
+if (options.dateFormatInRead.isDefined) {
+  throw e
+}

Review Comment:
   This change makes sense to me. Thanks Ivan :))



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -2788,6 +2788,47 @@ abstract class CSVSuite
   }
 }
   }
+
+  test("SPARK-39731: Correctly parse dates with MMdd pattern") {
+withTempPath { path =>
+  Seq(
+"1,2020011,2020011",
+"2,20201203,20201203").toDF("data")
+.repartition(1)
+.write.text(path.getAbsolutePath)
+  val schema = new StructType()
+.add("id", IntegerType)
+.add("date", DateType)
+.add("ts", TimestampType)
+  val output = spark.read
+.schema(schema)
+.option("dateFormat", "MMdd")
+.option("timestampFormat", "MMdd")
+.csv(path.getAbsolutePath)
+
+  def check(mode: String, res: Seq[Row]): Unit = {
+withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> mode) {
+  checkAnswer(output, res)
+}
+  }
+
+  check(
+"legacy",
+Seq(
+  Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 
00:00:00")),
+  Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 
00:00:00"))
+)
+  )
+
+  check(
+"corrected",
+Seq(
+  Row(1, null, null),
+  Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 
00:00:00"))
+)

Review Comment:
   For completeness, would you consider adding a check for 
`LEGACY_TIME_PARSER_POLICY` = `EXCEPTION`? Similar to the following?
   
   
https://github.com/apache/spark/blob/1193ce78d3efcbe1395305b4b7deb0a195fa09d9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L2598-L2601
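   
   For illustration, a sketch of such a check modeled on the linked lines (the 
exact exception type and message are assumptions and may differ in practice):
   
   ```scala
   // Sketch only: under EXCEPTION mode, reading the malformed first row should
   // fail the query instead of producing null.
   withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> "exception") {
     val err = intercept[SparkException] {
       output.collect()
     }
     assert(err.getCause.getMessage.contains("Fail to parse"))
   }
   ```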



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #37157: [MINOR][FOLLOWUP] Remove redundant return

2022-07-11 Thread GitBox


panbingkun commented on PR #37157:
URL: https://github.com/apache/spark/pull/37157#issuecomment-1181059198

   Yes, I checked twice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-11 Thread GitBox


c21 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918415377


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import org.apache.spark.sql.catalyst.plans.logical._
+
+/**
+ * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ */
+object AdvancedStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
+
+  /** Falls back to the estimation computed by [[BasicStatsPlanVisitor]]. */
+  private def fallback(p: LogicalPlan): Statistics = 
BasicStatsPlanVisitor.visit(p)
+
+  override def default(p: LogicalPlan): Statistics = fallback(p)
+
+  override def visitAggregate(p: Aggregate): Statistics = {
+AggregateEstimation.estimate(p).getOrElse(fallback(p))
+  }
+
+  override def visitDistinct(p: Distinct): Statistics = {
+val child = p.child
+visitAggregate(Aggregate(child.output, child.output, child))
+  }
+
+  override def visitExcept(p: Except): Statistics = fallback(p)
+
+  override def visitExpand(p: Expand): Statistics = fallback(p)
+
+  override def visitFilter(p: Filter): Statistics = {
+FilterEstimation(p).estimate.getOrElse(fallback(p))
+  }
+
+  override def visitGenerate(p: Generate): Statistics = default(p)
+
+  override def visitGlobalLimit(p: GlobalLimit): Statistics = fallback(p)
+
+  override def visitOffset(p: Offset): Statistics = fallback(p)
+
+  override def visitIntersect(p: Intersect): Statistics = fallback(p)
+
+  override def visitJoin(p: Join): Statistics = fallback(p)

Review Comment:
   Why do we fall back here instead of using `JoinEstimation.estimate`?
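   
   For comparison, a sketch of the alternative being asked about (assuming 
`JoinEstimation` applies here the same way it does in the existing CBO path):
   
   ```scala
   override def visitJoin(p: Join): Statistics = {
     // Use CBO join estimation when possible, otherwise fall back to the
     // size-only estimate computed by BasicStatsPlanVisitor.
     JoinEstimation(p).estimate.getOrElse(fallback(p))
   }
   ```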



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import org.apache.spark.sql.catalyst.plans.logical._
+
+/**
+ * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based 
optimizer.
+ */
+object AdvancedStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
+
+  /** Falls back to the estimation computed by [[BasicStatsPlanVisitor]]. */
+  private def fallback(p: LogicalPlan): Statistics = 
BasicStatsPlanVisitor.visit(p)
+
+  override def default(p: LogicalPlan): Statistics = fallback(p)
+
+  override def visitAggregate(p: Aggregate): Statistics = {
+AggregateEstimation.estimate(p).getOrElse(fallback(p))
+  }
+
+  override def visitDistinct(p: Distinct): Statistics = {
+val child = p.child
+visitAggregate(Aggregate(child.output, child.output, child))
+  }
+
+  override def visitExcept(p: Except): Statistics = fallback(p)
+
+  override def visitExpand(p: Expand): Statistics = fallback(p)
+
+  override def visitFilter(p: Filter): Statistics = {
+FilterEstimation(p).estimate.getOrElse(fallback(p))
+  }
+
+  override def visitGenerate(p: Generate): Statistics = default(p)
+
+  override def visitGlobalLimit(p: GlobalLimit): Statistics = fallback(p)
+
+  override def visitOffset(p: Offset): Statistics = fallback(p)
+
+  override def visitIntersect(p: Intersect): Statistics = fallback(p)
+
+  override def visitJoin(p: Join): Statistics = fallback(p)
+
+  

[GitHub] [spark] mridulm commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-07-11 Thread GitBox


mridulm commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r918226879


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -343,15 +397,44 @@ void closeAndDeletePartitionFilesIfNeeded(
 if (cleanupLocalDirs) {
   deleteExecutorDirs(appShuffleInfo);
 }
+removeAppShuffleInfoFromDB(appShuffleInfo);
+  }
+
+  /**
+   * Remove the application attempt local paths information from the DB.
+   */
+  private void removeAppAttemptPathInfoFromDB(String appId, int attemptId) 
throws Exception{

Review Comment:
   Can you add a comment to both `removeAppAttemptPathInfoFromDB` and 
`writeNewAppAttemptPathInfoToDBAndRemoveOutdated` stating that they are expected to be 
invoked with the `appsShuffleInfo` lock held for `appId`?
   (Also, group them together in the source file.)



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -656,6 +768,232 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
 }
   }
 
+  /**
+   * Remove the former application attempt local paths information from the DB 
and insert the
+   * local paths information from the newer application attempt. If the 
deletion fails, the
+   * insertion will also be skipped. This ensures that there will always be a 
single application
+   * attempt local path information in the DB.
+   */
+  private void writeNewAppAttemptPathInfoToDBAndRemoveOutdated(
+  String appId,
+  int newAttemptId,
+  AppShuffleInfo appShuffleInfo,
+  AppPathsInfo appPathsInfo) {
+try{
+  if (appShuffleInfo != null) {
+removeAppAttemptPathInfoFromDB(appId, appShuffleInfo.attemptId);
+  }
+  writeAppPathsInfoToDb(appId, newAttemptId, appPathsInfo);

Review Comment:
   Thoughts on making the exception handling local to the DB invocation (since 
we don't need to handle cross-DB invocation failures for now)? We are already 
doing this for `removeAppShufflePartitionInfoFromDB`, 
`writeAppAttemptShuffleMergeInfoToDB`, etc.
   
   Given this, let us make the exception handling local to 
`removeAppAttemptPathInfoFromDB` (so it does not throw an `Exception`).



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -632,6 +736,12 @@ public void registerExecutor(String appId, 
ExecutorShuffleInfo executorInfo) {
   appsShuffleInfo.compute(appId, (id, appShuffleInfo) -> {
 if (appShuffleInfo == null || attemptId > 
appShuffleInfo.attemptId) {
   originalAppShuffleInfo.set(appShuffleInfo);
+  AppPathsInfo appPathsInfo = new AppPathsInfo(appId, 
executorInfo.localDirs,
+  mergeDir, executorInfo.subDirsPerLocalDir);
+  // Clean up the outdated App Attempt local path info in the DB 
and
+  // put the newly registered local path info from newer attempt 
into the DB.
+  writeNewAppAttemptPathInfoToDBAndRemoveOutdated(

Review Comment:
   nit: Do we need this method? (The comment is helpful; it's just the method that 
seems unnecessary.)
   We can simply do 
   ```
   if (null != appShuffleInfo) removeAppAttemptPathInfoFromDB()
   writeAppPathsInfoToDb()
   ```



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -343,15 +397,44 @@ void closeAndDeletePartitionFilesIfNeeded(
 if (cleanupLocalDirs) {
   deleteExecutorDirs(appShuffleInfo);
 }
+removeAppShuffleInfoFromDB(appShuffleInfo);
+  }
+
+  /**
+   * Remove the application attempt local paths information from the DB.
+   */
+  private void removeAppAttemptPathInfoFromDB(String appId, int attemptId) 
throws Exception{
+AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId);
+if (db != null) {
+  db.delete(getDbAppAttemptPathsKey(appAttemptId));

Review Comment:
   Given the possibility of stale entries from previous failed deletes hanging 
around, can we scan for all entries for an application id and delete them here ?
   
   Note that this will require changes to how we encode db key for 
`AppAttemptId` - we cannot use json for it, since we want to do a prefix scan.
   Something like:
   ```
 private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId) throws 
IOException {
   return (APP_ATTEMPT_PATH_KEY_PREFIX + DB_KEY_DELIMITER + 
appAttemptId.appId + DB_KEY_DELIMITER + 
appAttemptId.attemptId).getBytes(StandardCharsets.UTF_8);
 }
   
 private byte[] getDbAppAttemptPathsKeyPrefix(String appId) throws 
IOException {
   return (APP_ATTEMPT_PATH_KEY_PREFIX + DB_KEY_DELIMITER + 
appAttemptId.appId + DB_KEY_DELIMITER).getBytes(StandardCharsets.UTF_8);
 }
   
 private AppAttemptId parseDbAppAttemptPathsKey(String value) throws 
IOException {
   String[] parts = key.split(DB_KEY_DELIMITER);
   if (parts

[GitHub] [spark] HyukjinKwon commented on pull request #37157: [MINOR][FOLLOWUP] Remove redundant return

2022-07-11 Thread GitBox


HyukjinKwon commented on PR #37157:
URL: https://github.com/apache/spark/pull/37157#issuecomment-1181000672

   @panbingkun are these all the instances that need fixing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries

2022-07-11 Thread GitBox


huaxingao commented on code in PR #37080:
URL: https://github.com/apache/spark/pull/37080#discussion_r918358655


##
docs/sql-ref-syntax-qry-select-join.md:
##
@@ -26,7 +26,7 @@ A SQL join is used to combine rows from two relations based 
on join criteria. Th
 ### Syntax
 
 ```sql
-relation { [ join_type ] JOIN relation [ join_criteria ] | NATURAL join_type 
JOIN relation }
+relation { [ join_type ] JOIN [ LATERAL ] relation [ join_criteria ] | NATURAL 
join_type JOIN [ LATERAL ] relation }

Review Comment:
   Sounds good! Changed.
   
   before:
   
   https://user-images.githubusercontent.com/13592258/17835-86c7b9f8-03d6-471d-af38-d549af29d7bb.png
   
   
   after:
   
   https://user-images.githubusercontent.com/13592258/178359932-3af7c545-8f11-4e1e-bec0-4550e51f0b09.png
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager

2022-07-11 Thread GitBox


otterc commented on PR #37052:
URL: https://github.com/apache/spark/pull/37052#issuecomment-1180723189

   All the tests pass now @Ngone51 @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.

2022-07-11 Thread GitBox


mridulm commented on PR #36162:
URL: https://github.com/apache/spark/pull/36162#issuecomment-1180706514

   It helps in two cases @weixiuli - the example you gave (generated input 
like range(), etc., where there are no input metrics), and reading shuffle input 
where there is a sort - the entire shuffle input gets consumed at the beginning of 
the task, but the output rate would still be impacted by computation/skew/etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allisonwang-db commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries

2022-07-11 Thread GitBox


allisonwang-db commented on code in PR #37080:
URL: https://github.com/apache/spark/pull/37080#discussion_r918189602


##
docs/sql-ref-syntax-qry-select-join.md:
##
@@ -26,7 +26,7 @@ A SQL join is used to combine rows from two relations based 
on join criteria. Th
 ### Syntax
 
 ```sql
-relation { [ join_type ] JOIN relation [ join_criteria ] | NATURAL join_type 
JOIN relation }
+relation { [ join_type ] JOIN [ LATERAL ] relation [ join_criteria ] | NATURAL 
join_type JOIN [ LATERAL ] relation }

Review Comment:
   Should we also update the syntax reference in `sql-ref-syntax-qry-select.md` 
under the `from_item`:
   `[LATERAL] (subquery)`?



##
docs/sql-ref-syntax-qry-select-lateral-subquery.md:
##
@@ -0,0 +1,87 @@
+---
+layout: global
+title: LATERAL SUBQUERY
+displayTitle: LATERAL SUBQUERY
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It 
provides a way to cross-reference columns in the preceding FROM clause.

Review Comment:
   ```suggestion
   `LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. 
It provides a way to reference columns in the preceding `FROM` clause.
   ```
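   
   To make the description concrete, an illustrative lateral subquery (table and 
column names are invented, not from the doc):
   
   ```scala
   // The subquery on the right can reference d from the preceding FROM item.
   spark.sql("""
     SELECT d.name, s.max_salary
     FROM departments d,
          LATERAL (SELECT MAX(salary) AS max_salary
                   FROM employees e
                   WHERE e.dept_id = d.id) s
   """).show()
   ```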



##
docs/sql-ref-syntax-qry-select-lateral-subquery.md:
##
@@ -0,0 +1,87 @@
+---
+layout: global
+title: LATERAL SUBQUERY
+displayTitle: LATERAL SUBQUERY
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It 
provides a way to cross-reference columns in the preceding FROM clause.
+Without the `LATERAL` keyword, subqueries can only refer to columns in the 
outer query, but not in the `FROM` clause. `LATERAL SUBQUERY` makes the 
complicated
+queries simpler and more efficient.
+
+### Syntax
+
+```sql
+[ LATERAL ] primary_relation [ join_relation ]
+```
+
+### Parameters
+
+* **primary_relation**
+
+  Specifies the primary relation. It can be one of the following:
+  * Table relation
+  * aliased query
+
+Syntax: `( query ) [ [ AS ] alias ]`
+  * aliased relation

Review Comment:
   ```suggestion
 * Aliased relation
   ```



##
docs/sql-ref-syntax-qry-select-lateral-subquery.md:
##
@@ -0,0 +1,87 @@
+---
+layout: global
+title: LATERAL SUBQUERY
+displayTitle: LATERAL SUBQUERY
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It 
provides a way to cross-reference columns in the preceding FROM clause.
+Without the `LATERAL` keyword, subqueries can only refer to columns in the 
outer query, but not in the `FROM` clause.

[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-07-11 Thread GitBox


zhouyejoe commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r918214646


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean 
cleanupLocalDirs) {
 logger.info("Application {} removed, cleanupLocalDirs = {}", appId, 
cleanupLocalDirs);
 AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId);
 if (null != appShuffleInfo) {
-  mergedShuffleCleaner.execute(
-() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, 
cleanupLocalDirs));
+  submitCleanupTask(
+() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, 
true));
 }
+removeAppAttemptPathInfoFromDB(
+new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId));
   }
 
-
   /**
* Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo.
* If cleanupLocalDirs is true, the merged shuffle files will also be 
deleted.
* The cleanup will be executed in a separate thread.
*/
   @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter")
   @VisibleForTesting
-  void closeAndDeletePartitionFilesIfNeeded(
+  void closeAndDeletePartitions(
   AppShuffleInfo appShuffleInfo,
-  boolean cleanupLocalDirs) {
+  boolean cleanupLocalDirs,
+  boolean removeFromDb) {

Review Comment:
   Removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2022-07-11 Thread GitBox


mridulm commented on code in PR #35906:
URL: https://github.com/apache/spark/pull/35906#discussion_r918184780


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean 
cleanupLocalDirs) {
 logger.info("Application {} removed, cleanupLocalDirs = {}", appId, 
cleanupLocalDirs);
 AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId);
 if (null != appShuffleInfo) {
-  mergedShuffleCleaner.execute(
-() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, 
cleanupLocalDirs));
+  submitCleanupTask(
+() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, 
true));
 }
+removeAppAttemptPathInfoFromDB(
+new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId));
   }
 
-
   /**
* Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo.
* If cleanupLocalDirs is true, the merged shuffle files will also be 
deleted.
* The cleanup will be executed in a separate thread.
*/
   @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter")
   @VisibleForTesting
-  void closeAndDeletePartitionFilesIfNeeded(
+  void closeAndDeletePartitions(
   AppShuffleInfo appShuffleInfo,
-  boolean cleanupLocalDirs) {
+  boolean cleanupLocalDirs,
+  boolean removeFromDb) {

Review Comment:
   This is not exposed api, so let us remove the flag for now, and add it when 
required.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread GitBox


xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r918171922


##
python/pyspark/sql/session.py:
##
@@ -1023,6 +1023,20 @@ def prepare(obj: Any) -> Any:
 
 if isinstance(data, RDD):
 rdd, struct = self._createFromRDD(data.map(prepare), schema, 
samplingRatio)
+elif isinstance(data, list) and schema is None:
+# Wrap each element with a tuple if there is any scalar in the list
+
+has_scalar = any(
+isinstance(x, str)
+or (
+not isinstance(x, Sized)
+and (type(x).__module__ == object.__module__)  # built-in

Review Comment:
   Accepts native Python scalars only.
   
   Supporting numpy/pandas scalars will be implemented in 
https://issues.apache.org/jira/browse/SPARK-39745.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread GitBox


xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r918171922


##
python/pyspark/sql/session.py:
##
@@ -1023,6 +1023,20 @@ def prepare(obj: Any) -> Any:
 
 if isinstance(data, RDD):
 rdd, struct = self._createFromRDD(data.map(prepare), schema, 
samplingRatio)
+elif isinstance(data, list) and schema is None:
+# Wrap each element with a tuple if there is any scalar in the list
+
+has_scalar = any(
+isinstance(x, str)
+or (
+not isinstance(x, Sized)
+and (type(x).__module__ == object.__module__)  # built-in

Review Comment:
   Accepts native Python scalars only.
   
   Supporting numpy/pandas scalars should be a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided

2022-07-11 Thread GitBox


xinrong-databricks commented on code in PR #36893:
URL: https://github.com/apache/spark/pull/36893#discussion_r918165714


##
python/pyspark/sql/tests/test_types.py:
##
@@ -374,12 +373,6 @@ def test_negative_decimal(self):
 finally:
 self.spark.sql("set 
spark.sql.legacy.allowNegativeScaleOfDecimal=false")
 
-def test_create_dataframe_from_objects(self):

Review Comment:
   Moved the test to python/pyspark/sql/tests/test_session.py.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918153782


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
   val locale: Locale = 
parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise 
inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When 
enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible 
with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept 
extra trailing characters
+   */
+  val inferDate = {
+val inferDateFlag = getBool("inferDate")
+if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && 
inferDateFlag) {

Review Comment:
   Does the `Iso8601DateFormatter` properly parse legacy date types? I'm 
inclined to believe it doesn't, since `getFormatter` returns a special 
`LegacyDateFormatter` when `legacyTimeParserPolicy == LEGACY`:
   
   
https://github.com/apache/spark/blob/9c5c21ccc9c7f41a204a738cd540c14793ffc8cb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L174-L188
   
   cc @bersprockets @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikf commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-11 Thread GitBox


Yikf commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918140392


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -230,6 +230,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
   // non-nullable when an empty relation child of a Union is removed
   UpdateAttributeNullability) :+
 Batch("Optimize One Row Plan", fixedPoint, OptimizeOneRowPlan) :+
+Batch("Push Local TopK Through Outer Join", FixedPoint(1), 
PushLocalTopKThroughOuterJoin) :+

Review Comment:
   This batch should be idempotent; should we use `Once` so that idempotence is 
checked?
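   
   For reference, a sketch of what the suggestion would look like (names taken 
from the quoted diff; whether `Once` is the right strategy is up to the reviewers):
   
   ```scala
   // Inside the Optimizer batch list: Once runs the batch a single time and lets
   // RuleExecutor's idempotence check verify the rule.
   Batch("Push Local TopK Through Outer Join", Once, PushLocalTopKThroughOuterJoin) :+
   ```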



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-11 Thread GitBox


Jonathancui123 commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r918142012


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 case LongType => tryParseLong(field)
 case _: DecimalType => tryParseDecimal(field)
 case DoubleType => tryParseDouble(field)
+case DateType => tryParseDateTime(field)
+case TimestampNTZType if options.inferDate => tryParseDateTime(field)

Review Comment:
   Our expected behavior is that in a column with `TimestampType` entries and 
then `DateType` entries, the column will be inferred as `TimestampType`. 
   
   Here, `tryParseTimestampNTZ` and `tryParseTimestamp` would not be able to 
parse the `DateType` entries that show up later in the column, and the column 
would be promoted to string type. So we must use `tryParseDateTime`, which gives 
the date entries a chance to be parsed.
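   
   For illustration, a hedged end-to-end example of that promotion, assuming the 
`inferDate` option lands as proposed in this PR (file path and contents are made up):
   
   ```scala
   // A column whose first values are timestamps and later values are dates should
   // be inferred as a timestamp column; the date entries are then cast to timestamps.
   val df = spark.read
     .option("header", "true")
     .option("inferSchema", "true")
     .option("inferDate", "true")
     .csv("/tmp/mixed.csv")   // e.g. ts column values: "2020-01-01 12:34:56", "2020-01-02"
   df.printSchema()
   ```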



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc

2022-07-11 Thread GitBox


dtenedor commented on code in PR #37159:
URL: https://github.com/apache/spark/pull/37159#discussion_r918129229


##
docs/sql-ref-number-pattern.md:
##
@@ -176,10 +176,10 @@ Note that the format string used in most of these 
examples expects:
   "$#.##"
 
 -- 'S' can be at the end.
-> SELECT try_to_char(decimal(-12454.8), '99,999.9S');
+> SELECT to_char(decimal(-12454.8), '99,999.9S');
   "12,454.8-"
 
-> SELECT try_to_char(decimal(12454.8), 'L99,999.9');
+> SELECT to_char(decimal(12454.8), 'L99,999.9');
   Error: cannot resolve 'try_to_char(Decimal(12454.8), 'L99,999.9')' due to 
data type mismatch:

Review Comment:
   ```suggestion
 Error: cannot resolve 'to_char(Decimal(12454.8), 'L99,999.9')' due to data 
type mismatch:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikf commented on a diff in pull request #37113: [SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions

2022-07-11 Thread GitBox


Yikf commented on code in PR #37113:
URL: https://github.com/apache/spark/pull/37113#discussion_r918128553


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala:
##
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import java.net.{URI, URISyntaxException, URLDecoder, URLEncoder}
+import java.util.regex.Pattern
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
+import org.apache.spark.sql.catalyst.trees.UnaryLike
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+_FUNC_(str) - Translates a string into {@code 
application/x-www-form-urlencoded} format using a specific encoding scheme.

Review Comment:
   Looks like it works normally:
   
   https://user-images.githubusercontent.com/51110188/178311420-3982ab92-0b71-470a-88b0-cfa27d5862a1.png
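   
   For reference, a usage sketch assuming the functions are exposed as 
`url_encode`/`url_decode` as proposed in this PR:
   
   ```scala
   spark.sql("SELECT url_encode('https://spark.apache.org/path?q=a b')").show(false)
   spark.sql("SELECT url_decode('https%3A%2F%2Fspark.apache.org%2Fpath%3Fq%3Da+b')").show(false)
   ```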
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc

2022-07-11 Thread GitBox


cloud-fan commented on PR #37159:
URL: https://github.com/apache/spark/pull/37159#issuecomment-1180612285

   CC @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan opened a new pull request, #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc

2022-07-11 Thread GitBox


cloud-fan opened a new pull request, #37159:
URL: https://github.com/apache/spark/pull/37159

   
   
   ### What changes were proposed in this pull request?
   
   We have removed the `try_to_char` function and it shouldn't appear in the 
doc anymore. This PR also improves the doc a little bit.
   
   ### Why are the changes needed?
   
   fix doc
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   N/A


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery

2022-07-11 Thread GitBox


cloud-fan commented on PR #37074:
URL: https://github.com/apache/spark/pull/37074#issuecomment-1180582242

   OK, I think `DeduplicateRelations` needs a fix. Ideally the outer and 
inner plans should not have conflicting output attributes after analysis, but 
this local relation + project case is missed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery

2022-07-11 Thread GitBox


cloud-fan commented on PR #37074:
URL: https://github.com/apache/spark/pull/37074#issuecomment-1180579396

   > This check is not accurate when there's an `And` expression in the join 
condition, as in this case. Hence, this PR proposes to add a check of whether 
the intersected attributes exist in all the children of the `And` expression.
   
   Can you explain the rationale? The subquery filter will eventually be turned 
into a join, so it's not very clear to me how `a#266` should be resolved in the 
join condition.
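   
   For context, a rough sketch of the check the quoted description seems to 
propose (hedged — this is not the PR's actual code): split a conjunctive 
condition into its conjuncts and treat the duplicated attributes as conflicting 
only when every conjunct references at least one of them.
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{And, AttributeSet, Expression}
   
   object ConjunctConflictCheck {
     // Split a conjunctive condition into its conjuncts.
     private def conjuncts(e: Expression): Seq[Expression] = e match {
       case And(left, right) => conjuncts(left) ++ conjuncts(right)
       case other            => Seq(other)
     }
   
     // Report a conflict only when every conjunct references a duplicated attribute.
     def conflictsInAllConjuncts(condition: Expression, duplicates: AttributeSet): Boolean =
       conjuncts(condition).forall(_.references.intersect(duplicates).nonEmpty)
   }
   ```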





[GitHub] [spark] cloud-fan commented on a diff in pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37074:
URL: https://github.com/apache/spark/pull/37074#discussion_r918091135


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##
@@ -72,12 +72,22 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
 val outerRefs = outerPlan.outputSet ++ outerReferences
 val duplicates = outerRefs.intersect(subplan.outputSet)
 if (duplicates.nonEmpty) {
-  condition.foreach { e =>
+  def throwOnConflictingAttrs(attrs: AttributeSet): Unit = {

Review Comment:
   ```suggestion
 def failForConflictingAttrs(attrs: AttributeSet): Unit = {
   ```
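   
   For readers without the full diff, a hedged sketch of what the helper's body 
might look like (an assumed shape — the PR's real implementation is not shown 
here). It is written standalone for illustration, taking the `duplicates` set 
from the surrounding code as an explicit parameter.
   
   ```scala
   // Sketch only; placed in this package so the AnalysisException constructor
   // (protected[sql]) is accessible, mirroring where the rule itself lives.
   package org.apache.spark.sql.catalyst.optimizer
   
   import org.apache.spark.sql.AnalysisException
   import org.apache.spark.sql.catalyst.expressions.AttributeSet
   
   object FailForConflictingAttrsSketch {
     // Fail analysis when `attrs` overlaps with the duplicated outer/inner attributes.
     def failForConflictingAttrs(attrs: AttributeSet, duplicates: AttributeSet): Unit = {
       val conflicting = attrs.intersect(duplicates)
       if (conflicting.nonEmpty) {
         throw new AnalysisException(
           s"Found conflicting attributes ${conflicting.mkString(",")} in the condition")
       }
     }
   }
   ```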






[GitHub] [spark] cloud-fan commented on a diff in pull request #37113: [SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37113:
URL: https://github.com/apache/spark/pull/37113#discussion_r918083814


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala:
##
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import java.net.{URI, URISyntaxException, URLDecoder, URLEncoder}
+import java.util.regex.Pattern
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
+import org.apache.spark.sql.catalyst.trees.UnaryLike
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """
+_FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme.

Review Comment:
   Does `{@code application/x-www-form-urlencoded}` really work? Can you run 
`DESC FUNCTION url_encode` and give a screenshot?
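   
   For background on the format being documented, here is a small standalone 
sketch (not part of this PR) of the application/x-www-form-urlencoded behavior 
that `java.net.URLEncoder`/`URLDecoder` provide — which, judging from the 
imports in the new file, is presumably what `url_encode`/`url_decode` delegate 
to.
   
   ```scala
   import java.net.{URLDecoder, URLEncoder}
   import java.nio.charset.StandardCharsets
   
   object UrlCodecDemo {
     def main(args: Array[String]): Unit = {
       val charset = StandardCharsets.UTF_8.name()
       // Spaces become '+', reserved characters become %XX escapes.
       val encoded = URLEncoder.encode("https://spark.apache.org/a b?c=1", charset)
       println(encoded) // https%3A%2F%2Fspark.apache.org%2Fa+b%3Fc%3D1
       println(URLDecoder.decode(encoded, charset)) // round-trips to the original string
     }
   }
   ```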






[GitHub] [spark] cloud-fan commented on a diff in pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter

2022-07-11 Thread GitBox


cloud-fan commented on code in PR #37150:
URL: https://github.com/apache/spark/pull/37150#discussion_r918064239


##
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##
@@ -849,7 +849,8 @@ primaryExpression
 | OVERLAY LEFT_PAREN input=valueExpression PLACING replace=valueExpression
   FROM position=valueExpression (FOR length=valueExpression)? RIGHT_PAREN  #overlay
 | name=(PERCENTILE_CONT | PERCENTILE_DISC) LEFT_PAREN percentage=valueExpression RIGHT_PAREN
-  WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN ( OVER windowSpec)?#percentile
+WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN
+(FILTER LEFT_PAREN WHERE where=booleanExpression RIGHT_PAREN)? ( OVER windowSpec)? #percentile

Review Comment:
   what happens if `PERCENTILE_CONT` is used as a window function and `FILTER` 
is specified?
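   
   For illustration, a hedged usage sketch of the aggregate form this grammar 
change is meant to accept (it assumes a Spark build that already includes this 
PR's `FILTER` syntax). The window-plus-`FILTER` combination the question asks 
about is only shown as a comment, since its behavior is exactly what is under 
discussion.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   object PercentileFilterDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("percentile-filter")
         .getOrCreate()
   
       spark.sql(
         """CREATE OR REPLACE TEMP VIEW emp AS
           |SELECT * FROM VALUES (1, 10.0), (1, 20.0), (2, 30.0) AS emp(dept, salary)""".stripMargin)
   
       // Aggregate form: only rows passing the FILTER predicate contribute.
       spark.sql(
         """SELECT dept,
           |       percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
           |         FILTER (WHERE salary > 10.0) AS median_high_salary
           |FROM emp
           |GROUP BY dept""".stripMargin).show()
   
       // Window form under discussion (not executed here):
       //   percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
       //     FILTER (WHERE salary > 10.0) OVER (PARTITION BY dept)
   
       spark.stop()
     }
   }
   ```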





