[GitHub] [spark] bzhaoopenstack commented on pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests
bzhaoopenstack commented on PR #37117:
URL: https://github.com/apache/spark/pull/37117#issuecomment-1181387164

> BTW, mind keeping the PR description template please? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

Thanks for the correction; sorry about that, I'm not very familiar with the Spark conventions yet. ;-) Refreshed.
[GitHub] [spark] bzhaoopenstack commented on a diff in pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests
bzhaoopenstack commented on code in PR #37117:
URL: https://github.com/apache/spark/pull/37117#discussion_r918609239

## python/pyspark/pandas/series.py ##
@@ -4738,7 +4738,7 @@ def mode(self, dropna: bool = True) -> "Series":
         ser_count = self.value_counts(dropna=dropna, sort=False)
         sdf_count = ser_count._internal.spark_frame
         most_value = ser_count.max()
-        sdf_most_value = sdf_count.filter("count == {}".format(most_value))
+        sdf_most_value = sdf_count.filter("count == {}".format(most_value.__str__()))

Review Comment: Thanks, this is good. ;-)
[GitHub] [spark] wangyum commented on pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum commented on PR #37163:
URL: https://github.com/apache/spark/pull/37163#issuecomment-1181369549

cc @cloud-fan
[GitHub] [spark] gengliangwang commented on pull request #37160: [SPARK-39749][SQL] Always use plain string representation on casting Decimal to String
gengliangwang commented on PR #37160:
URL: https://github.com/apache/spark/pull/37160#issuecomment-1181368882

Let's keep this open for one or two days since it is changing the behavior.
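For context on what "plain string representation" means here, a minimal standalone illustration of the underlying JDK behavior (this is generic `java.math.BigDecimal` behavior, not the PR's casting code):

```scala
// BigDecimal.toString may fall back to scientific notation for small scales,
// while toPlainString never does; the PR title refers to the latter style.
object PlainDecimalDemo extends App {
  val small = new java.math.BigDecimal("1E-9")
  println(small.toString)      // 1E-9
  println(small.toPlainString) // 0.000000001
}
```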
[GitHub] [spark] beliefer commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
beliefer commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918590789

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala ##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, EXPRESSION_WITH_RANDOM_SEED, LITERAL}
+
+/**
+ * Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so
+ * compare double literal value with 1.0 could eliminate Rand() in binary comparison.
+ *
+ * 1. Converts the binary comparison to true literal when the comparison value must be true.
+ * 2. Converts the binary comparison to false literal when the comparison value must be false.
+ */
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+    plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+      EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+      case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
+        TrueLiteral
+      case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>

Review Comment: Yeah. Thanks.
[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918589988

## sql/hive/src/test/scala/org/apache/spark/sql/hive/DynamicPartitionPruningHiveScanSuite.scala ##
@@ -41,6 +42,16 @@ abstract class DynamicPartitionPruningHiveScanSuiteBase
       case _ => Nil
     }
   }
+
+  override protected def beforeAll(): Unit = {
+    super.beforeAll()
+    conf.setConf(SQLConf.CBO_ENABLED, false)

Review Comment: To fix the test error: https://user-images.githubusercontent.com/5399861/178422851-a4f60bc5-2593-4ac0-876c-1bcff332918c.png
[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918589133

## sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala ##
@@ -209,11 +209,13 @@ class InjectRuntimeFilterSuite extends QueryTest with SQLTestUtils with SharedSp
     // `MergeScalarSubqueries` can duplicate subqueries in the optimized plan and would make testing
     // complicated.
     conf.setConfString(SQLConf.OPTIMIZER_EXCLUDED_RULES.key, MergeScalarSubqueries.ruleName)
+    conf.setConf(SQLConf.CBO_ENABLED, false)

Review Comment: To fix the test error: https://user-images.githubusercontent.com/5399861/178422514-d5bfdac8-5518-40d3-b8ae-6c6571b351e4.png
[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918588559

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ##
@@ -58,6 +58,7 @@ case class FilterEstimation(plan: Filter) extends Logging {
         rowsAfterFilter = filteredRowCount)
     }
     val filteredSizeInBytes: BigInt = getOutputSize(plan.output, filteredRowCount, newColStats)
+      .min(plan.child.stats.sizeInBytes)

Review Comment: To fix the test error: https://user-images.githubusercontent.com/5399861/178422367-6aa26b63-1135-4741-8fcc-c2a0551c0fd4.png
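Why the `.min(...)` cap helps, with illustrative numbers (a sketch, not taken from the PR): the filtered size is recomputed from a per-row size estimate derived from column types, which can overshoot the child's reported size and make a Filter look larger than its input.

```scala
// The cap restores monotonicity: a Filter's size estimate can never exceed
// the size estimate of the plan it filters.
object FilterSizeCapDemo extends App {
  val childSizeInBytes = BigInt(1000) // what plan.child.stats reports
  val filteredRowCount = BigInt(50)
  val estimatedRowSize = BigInt(32)   // per-row estimate from column types
  val uncapped = filteredRowCount * estimatedRowSize // 1600: bigger than the input!
  val capped = uncapped.min(childSizeInBytes)        // 1000
  println(s"uncapped=$uncapped, capped=$capped")
}
```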
[GitHub] [spark] wangyum commented on a diff in pull request #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum commented on code in PR #37163:
URL: https://github.com/apache/spark/pull/37163#discussion_r918586900

## sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ##
@@ -353,48 +353,65 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
   test("invalidation of tableRelationCache after inserts") {
     val table = "invalidate_catalog_cache_table"
-    Seq(false, true).foreach { autoUpdate =>
-      withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
-        withTable(table) {
-          spark.range(100).write.saveAsTable(table)
-          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
-          spark.table(table)
-          val initialSizeInBytes = getTableFromCatalogCache(table).stats.sizeInBytes
-          spark.range(100).write.mode(SaveMode.Append).saveAsTable(table)
-          spark.table(table)
-          assert(getTableFromCatalogCache(table).stats.sizeInBytes == 2 * initialSizeInBytes)
+    Seq(false, true).foreach { cboEnabled =>
+      Seq(false, true).foreach { autoUpdate =>
+        withSQLConf(
+          SQLConf.CBO_ENABLED.key -> cboEnabled.toString,
+          SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
+          withTable(table) {
+            spark.range(100).write.saveAsTable(table)
+            sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
+            spark.table(table)
+            val initialSizeInBytes = getTableFromCatalogCache(table).stats.sizeInBytes
+            spark.range(100).write.mode(SaveMode.Append).saveAsTable(table)
+            spark.table(table)
+            if (!cboEnabled) {
+              assert(getTableFromCatalogCache(table).stats.sizeInBytes === 2 * initialSizeInBytes)
+            } else {
+              assert(getTableFromCatalogCache(table).stats.sizeInBytes > initialSizeInBytes)
+            }
+          }
         }
       }
     }
   }

   test("invalidation of tableRelationCache after alter table add partition") {
     val table = "invalidate_catalog_cache_table"
-    Seq(false, true).foreach { autoUpdate =>
-      withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
-        withTempDir { dir =>
-          withTable(table) {
-            val path = dir.getCanonicalPath
-            sql(s"""
-              |CREATE TABLE $table (col1 int, col2 int)
-              |USING PARQUET
-              |PARTITIONED BY (col2)
-              |LOCATION '${dir.toURI}'""".stripMargin)
-            sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
-            spark.table(table)
-            assert(getTableFromCatalogCache(table).stats.sizeInBytes == 0)
-            spark.catalog.recoverPartitions(table)
-            val df = Seq((1, 2), (1, 2)).toDF("col2", "col1")
-            df.write.parquet(s"$path/col2=1")
-            sql(s"ALTER TABLE $table ADD PARTITION (col2=1) LOCATION '${dir.toURI}'")
-            spark.table(table)
-            val cachedTable = getTableFromCatalogCache(table)
-            val cachedTableSizeInBytes = cachedTable.stats.sizeInBytes
-            val defaultSizeInBytes = conf.defaultSizeInBytes
-            if (autoUpdate) {
-              assert(cachedTableSizeInBytes != defaultSizeInBytes && cachedTableSizeInBytes > 0)
-            } else {
-              assert(cachedTableSizeInBytes == defaultSizeInBytes)
+    Seq(false, true).foreach { cboEnabled =>
+      Seq(false, true).foreach { autoUpdate =>
+        withSQLConf(
+          SQLConf.CBO_ENABLED.key -> cboEnabled.toString,
+          SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
+          withTempDir { dir =>
+            withTable(table) {
+              val path = dir.getCanonicalPath
+              sql(
+                s"""
+                   |CREATE TABLE $table (col1 int, col2 int)
+                   |USING PARQUET
+                   |PARTITIONED BY (col2)
+                   |LOCATION '${dir.toURI}'""".stripMargin)
+              sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
+              spark.table(table)
+              if (!cboEnabled) {
+                assert(getTableFromCatalogCache(table).stats.sizeInBytes === 0)
+              } else {
+                assert(getTableFromCatalogCache(table).stats.sizeInBytes === 1)

Review Comment: The size is 1, not 0:
https://github.com/apache/spark/blob/ddc61e62b9af5deff1b93e22f466f2a13f281155/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala#L116-L118
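The floor behind "the size is 1, not 0", paraphrased (a sketch of the idea at the linked EstimationUtils lines, not a verbatim copy):

```scala
// An output size estimate is never allowed to be zero, because a BinaryNode's
// size is computed as a product of its children's sizes and would otherwise
// collapse to zero for the whole subtree.
object SizeFloorDemo extends App {
  def flooredOutputSize(raw: BigInt): BigInt = if (raw == 0) BigInt(1) else raw
  println(flooredOutputSize(BigInt(0))) // 1, never 0
}
```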
[GitHub] [spark] wangyum opened a new pull request, #37163: [SPARK-39750][SQL] Enable `spark.sql.cbo.enabled` by default
wangyum opened a new pull request, #37163:
URL: https://github.com/apache/spark/pull/37163

### What changes were proposed in this pull request?
This PR enables `spark.sql.cbo.enabled` by default.

### Why are the changes needed?
1. Enable CBO to get better performance; we have had it enabled for over 3 years.
2. Benchmark-related tests also enable it: https://github.com/apache/spark/blob/e83f8a872a16d4f049cefb1fc445f91cf84443ad/sql/core/src/test/scala/org/apache/spark/sql/TPCBase.scala#L32

### Does this PR introduce _any_ user-facing change?
Maybe. But users can disable it with `set spark.sql.cbo.enabled=false`.

### How was this patch tested?
Fix existing tests.
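What the new default means in practice (a sketch; the table name is made up): CBO leans on column-level statistics, so collecting them matters more, and the opt-out stays a one-line session setting.

```scala
import org.apache.spark.sql.SparkSession

object CboDefaultDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("cbo-demo").getOrCreate()
  // Collect the statistics the cost-based optimizer consumes (assumes a table
  // named `sales` exists in the current catalog).
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")
  // Per-session opt-out, as the PR description notes.
  spark.sql("SET spark.sql.cbo.enabled=false")
  spark.stop()
}
```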
[GitHub] [spark] pralabhkumar commented on pull request #37009: [SPARK-38292][PYTHON]Support na_filter for pyspark.pandas.read_csv
pralabhkumar commented on PR #37009:
URL: https://github.com/apache/spark/pull/37009#issuecomment-1181349887

@itholic @HyukjinKwon Gentle ping
[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918563482

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala ##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, SortOrder}
+import org.apache.spark.sql.catalyst.planning.{ExtractEquiJoinKeys, ExtractTopK}
+import org.apache.spark.sql.catalyst.plans.{JoinType, LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, Join, LocalLimit, LogicalPlan, Project, RebalancePartitions, Repartition, RepartitionByExpression, Sort, Union}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LIMIT, OUTER_JOIN, SORT}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule supports push down local limit and local sort from TopK through other join:
+ *   - for a left outer join, the references of ordering of TopK come from the left side and
+ *     the limits of TopK is smaller than left side max rows
+ *   - for a right outer join, the references of ordering of TopK come from the right side and
+ *     the limits of TopK is smaller than right side max rows
+ */
+object PushLocalTopKThroughOuterJoin extends Rule[LogicalPlan] {
+  private def smallThan(limits: Int, maxRowsOpt: Option[Long]): Boolean = maxRowsOpt match {
+    case Some(maxRows) => limits < maxRows
+    case _ => true
+  }
+
+  private def canPushThroughOuterJoin(
+      joinType: JoinType,
+      order: Seq[SortOrder],
+      leftChild: LogicalPlan,
+      rightChild: LogicalPlan,
+      limits: Int): Boolean = joinType match {
+    case LeftOuter =>
+      order.forall(_.references.subsetOf(leftChild.outputSet)) &&
+        smallThan(limits, leftChild.maxRowsPerPartition)
+    case RightOuter =>
+      order.forall(_.references.subsetOf(rightChild.outputSet)) &&
+        smallThan(limits, rightChild.maxRowsPerPartition)
+    case _ => false
+  }
+
+  private def findOuterJoin(limits: Int, order: Seq[SortOrder], child: LogicalPlan): Seq[Join] = {
+    child match {
+      case j @ ExtractEquiJoinKeys(joinType, _, _, _, _, leftChild, rightChild, _)
+          if canPushThroughOuterJoin(joinType, order, leftChild, rightChild, limits) =>
+        // we should find the bottom outer join which can push local topK
+        val childOuterJoin = joinType match {
+          case LeftOuter => findOuterJoin(limits, order, leftChild)
+          case RightOuter => findOuterJoin(limits, order, rightChild)
+          case _ => Seq.empty
+        }
+        if (childOuterJoin.nonEmpty) {
+          childOuterJoin
+        } else {
+          j :: Nil
+        }
+      case u: Union => u.children.flatMap(child => findOuterJoin(limits, order, child))
+      case p: Project if p.projectList.forall(_.deterministic) =>
+        findOuterJoin(limits, order, p.child)
+      case f: Filter if f.condition.deterministic =>
+        findOuterJoin(limits, order, f.child)

Review Comment: Is it safe to push down through `Filter`? Won't it change the final result?
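The query shape this rule targets, run on throwaway data (the schema is illustrative, not from the PR): a top-K, i.e. ORDER BY plus LIMIT, above a left outer join whose ordering references only left-side columns, so a local top-K can be taken on the left input before the join.

```scala
import org.apache.spark.sql.SparkSession

object TopKOverOuterJoinDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("topk-demo").getOrCreate()
  import spark.implicits._
  Seq((1, 100.0), (2, 50.0), (3, 75.0)).toDF("cust_id", "amount")
    .createOrReplaceTempView("orders")
  Seq((1, "alice")).toDF("id", "name").createOrReplaceTempView("customers")
  // Ordering only touches orders (the left side), so the top 2 orders by
  // amount can be selected locally before joining against customers.
  spark.sql(
    """SELECT o.cust_id, o.amount, c.name
      |FROM orders o LEFT OUTER JOIN customers c ON o.cust_id = c.id
      |ORDER BY o.amount DESC
      |LIMIT 2""".stripMargin).show()
  spark.stop()
}
```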
[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918561846

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala ##
@@ -0,0 +1,118 @@ (same new file as above)
+object PushLocalTopKThroughOuterJoin extends Rule[LogicalPlan] {
+  private def smallThan(limits: Int, maxRowsOpt: Option[Long]): Boolean = maxRowsOpt match {
+    case Some(maxRows) => limits < maxRows
+    case _ => true

Review Comment: If we don't know the max rows, should we be conservative and skip the pushdown?
[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918560968

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushLocalTopKThroughOuterJoin.scala ##
@@ -0,0 +1,118 @@ (same new file as above)
+/**
+ * This rule supports push down local limit and local sort from TopK through other join:

Review Comment:
```suggestion
 * This rule supports push down local limit and local sort from TopK through outer join:
```
[GitHub] [spark] viirya commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
viirya commented on code in PR #37129:
URL: https://github.com/apache/spark/pull/37129#discussion_r918559156

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -412,6 +412,14 @@ object SQLConf {
     .longConf
     .createWithDefault(67108864L)

+  val PUSH_DOWN_LOCAL_TOPK_LIMIT_THRESHOLD =
+    buildConf("spark.sql.optimizer.pushDownLocalTopKLimitThreshold")
+      .doc("If the limit number does not larger than this threshold, Spark will try to push " +

Review Comment:
```suggestion
      .doc("If the limit number is not larger than this threshold, Spark will try to push " +
```
[GitHub] [spark] mridulm commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager
mridulm commented on PR #37052:
URL: https://github.com/apache/spark/pull/37052#issuecomment-1181327891

Merged to master and branch-3.3. Thanks for fixing this, @otterc! Thanks for reviewing, @attilapiros, @zhouyejoe, @Ngone51 and @weixiuli :-)
[GitHub] [spark] mridulm closed pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager
mridulm closed pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager
URL: https://github.com/apache/spark/pull/37052
[GitHub] [spark] mridulm commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager
mridulm commented on PR #37052:
URL: https://github.com/apache/spark/pull/37052#issuecomment-1181326481

Merging to master and branch-3.3
[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
cloud-fan commented on code in PR #37040:
URL: https://github.com/apache/spark/pull/37040#discussion_r918550920

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala ##
@@ -0,0 +1,55 @@ (same new file as quoted above)
+object OptimizeRand extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan =
+    plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
+      EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
+      case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
+        TrueLiteral
+      case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>

Review Comment: we should also handle the `rand < 0.0` case
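A toy standalone model (plain Scala, not the catalyst classes or the PR's final code) of the extra folding being requested: rand() is uniform on [0.0, 1.0), so a comparison against a non-positive literal is decidable at optimization time, just like the >= 1.0 cases already in the rule.

```scala
object RandBoundsDemo extends App {
  // Some(constant) when `rand < lit` always evaluates the same way.
  def foldRandLessThanLit(lit: Double): Option[Boolean] =
    if (lit <= 0.0) Some(false)     // rand >= 0.0, so rand < lit is impossible
    else if (lit >= 1.0) Some(true) // rand < 1.0 <= lit always holds
    else None                       // genuinely random: keep the expression

  println(foldRandLessThanLit(-0.5)) // Some(false)
  println(foldRandLessThanLit(2.0))  // Some(true)
  println(foldRandLessThanLit(0.3))  // None
}
```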
[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918550376

## docs/sql-ref-syntax-qry-select-aggregate.md ##
@@ -0,0 +1,145 @@
+---
+layout: global
+title: Aggregate Functions
+displayTitle: Aggregate Functions
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations.
+
+### General Aggregation

Review Comment: I think we can remove `### General Aggregation` and start with `#### Syntax` directly.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
cloud-fan commented on code in PR #37116:
URL: https://github.com/apache/spark/pull/37116#discussion_r918549516

## docs/sql-ref-syntax-qry-select-aggregate.md ##
@@ -0,0 +1,145 @@ (same new file as quoted above)
+### General Aggregation
+
+These aggregate Functions use general syntax.
+
+#### Syntax
+
+```sql
+aggregate_function FILTER (WHERE boolean_expression)

Review Comment:
```suggestion
aggregate_function(input1 [, input2, ...]) FILTER (WHERE boolean_expression)
```
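The FILTER clause the suggested syntax line documents, run against throwaway data (the schema is illustrative): the aggregate only sees rows matching the predicate.

```scala
import org.apache.spark.sql.SparkSession

object AggregateFilterDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("agg-filter").getOrCreate()
  import spark.implicits._
  Seq(("F", 1), ("M", 2), ("F", 3)).toDF("gender", "id").createOrReplaceTempView("persons")
  // count(*) only counts rows where the FILTER predicate holds.
  spark.sql(
    "SELECT count(*) FILTER (WHERE gender = 'F') AS num_female FROM persons").show()
  // num_female = 2
  spark.stop()
}
```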
[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables
singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918534436

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala ##
@@ -17,16 +17,40 @@
 package org.apache.spark.sql.catalyst.plans.logical.statsEstimation

+import org.apache.spark.sql.catalyst.expressions.AttributeMap
 import org.apache.spark.sql.catalyst.plans.logical._

 /**
- * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based optimizer.
+ * An [[LogicalPlanVisitor]] that computes a single dimension for plan stats: size in bytes.
  */
 object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {

-  /** Falls back to the estimation computed by [[SizeInBytesOnlyStatsPlanVisitor]]. */
-  private def fallback(p: LogicalPlan): Statistics = SizeInBytesOnlyStatsPlanVisitor.visit(p)
+  /**
+   * A default, commonly used estimation for unary nodes. We assume the input row number is the
+   * same as the output row number, and compute sizes based on the column types.
+   */
+  private def visitUnaryNode(p: UnaryNode): Statistics = {
+    // There should be some overhead in Row object, the size should not be zero when there is
+    // no columns, this help to prevent divide-by-zero error.
+    val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+    val outputRowSize = EstimationUtils.getSizePerRow(p.output)
+    // Assume there will be the same number of rows as child has.
+    var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize
+    if (sizeInBytes == 0) {
+      // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
+      // (product of children).
+      sizeInBytes = 1
+    }
+
+    // v2 sources can bubble-up rowCount, so always propagate.
+    // Don't propagate attributeStats, since they are not estimated here.
+    Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)

Review Comment: In this estimator, i.e. visitUnaryNode, we adjust the size by scaling it by (output row size / input row size), but since we don't have much info (in terms of min/max/ndv etc.) to estimate the row count, we just propagate the child's output row count, which is mostly true for operators like Project. Apologies, I forgot to update the comment as per the proposed behaviour. Should I rephrase it to: `estimates size in bytes, row count for plan stats`?
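A worked example of the scaling described above, with illustrative numbers (not from the PR): a Project that halves the per-row width halves the size estimate, while the row count is propagated from the child unchanged.

```scala
object UnaryNodeStatsDemo extends App {
  val childSizeInBytes = BigInt(8000)
  val childRowSize     = BigInt(80)   // EstimationUtils.getSizePerRow(child.output)
  val outputRowSize    = BigInt(40)   // EstimationUtils.getSizePerRow(p.output)
  // Size scales with the ratio of output to input row width.
  val sizeInBytes = (childSizeInBytes * outputRowSize) / childRowSize // 4000
  val rowCount    = Some(BigInt(100)) // propagated as-is, e.g. from a v2 source
  println(s"sizeInBytes=$sizeInBytes, rowCount=$rowCount")
}
```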
[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables
singhpk234 commented on code in PR #37083:
URL: https://github.com/apache/spark/pull/37083#discussion_r918544779

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala ##
(same hunk as quoted in the previous comment, anchored at the `Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)` line)

Review Comment: For DSv2 sources, the rowCount can be passed from the relation itself without running `analyze`; hence BasicStatsPlanVisitor, which will be our default after this change, will take the row count into consideration.
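How a v2 source can "bubble up" a row count without ANALYZE: a sketch of the relevant DSv2 hook (the scan class and the numbers are made up; the interface is `org.apache.spark.sql.connector.read.SupportsReportStatistics`).

```scala
import java.util.OptionalLong
import org.apache.spark.sql.connector.read.{Scan, Statistics, SupportsReportStatistics}

// Left abstract so the sketch only shows the statistics hook; a real scan
// would also implement readSchema() and a batch/stream factory.
abstract class StatsReportingScan extends Scan with SupportsReportStatistics {
  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = OptionalLong.of(1024L * 1024L)
    override def numRows(): OptionalLong = OptionalLong.of(10000L) // row count CBO can use
  }
}
```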
[GitHub] [spark] AmplabJenkins commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy
AmplabJenkins commented on PR #37147:
URL: https://github.com/apache/spark/pull/37147#issuecomment-1181307870

Can one of the admins verify this patch?
[GitHub] [spark] singhpk234 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables
singhpk234 commented on code in PR #37083: URL: https://github.com/apache/spark/pull/37083#discussion_r918534436 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala: ## @@ -17,16 +17,40 @@ package org.apache.spark.sql.catalyst.plans.logical.statsEstimation +import org.apache.spark.sql.catalyst.expressions.AttributeMap import org.apache.spark.sql.catalyst.plans.logical._ /** - * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based optimizer. + * A [[LogicalPlanVisitor]] that computes a single dimension for plan stats: size in bytes. */ object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] { - /** Falls back to the estimation computed by [[SizeInBytesOnlyStatsPlanVisitor]]. */ - private def fallback(p: LogicalPlan): Statistics = SizeInBytesOnlyStatsPlanVisitor.visit(p) + /** + * A default, commonly used estimation for unary nodes. We assume the input row number is the + * same as the output row number, and compute sizes based on the column types. + */ + private def visitUnaryNode(p: UnaryNode): Statistics = { +// There should be some overhead in the Row object; the size should not be zero when there are +// no columns, which helps to prevent a divide-by-zero error. +val childRowSize = EstimationUtils.getSizePerRow(p.child.output) +val outputRowSize = EstimationUtils.getSizePerRow(p.output) +// Assume there will be the same number of rows as the child has. +var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize +if (sizeInBytes == 0) { + // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero + // (product of children). + sizeInBytes = 1 +} + +// v2 sources can bubble-up rowCount, so always propagate. +// Don't propagate attributeStats, since they are not estimated here. +Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount) Review Comment: In this estimator, we adjust the size by scaling it by (output row size / input row size), but since we don't have much info (in terms of min / max / ndv, etc.) to estimate the row count, we just say the node outputs its child's row count, which is mostly true for operators like Project. Since we were just computing sizeInBytes and propagating rowCounts as-is, I left the comment as it is. Should I rephrase it to, say: - `estimates size in bytes, row count for plan stats` ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.plans.logical.statsEstimation + +import org.apache.spark.sql.catalyst.plans.logical._ + +/** + * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based optimizer. + */ +object AdvancedStatsPlanVisitor extends LogicalPlanVisitor[Statistics] { + + /** Falls back to the estimation computed by [[BasicStatsPlanVisitor]]. */ + private def fallback(p: LogicalPlan): Statistics = BasicStatsPlanVisitor.visit(p) + + override def default(p: LogicalPlan): Statistics = fallback(p) + + override def visitAggregate(p: Aggregate): Statistics = { +AggregateEstimation.estimate(p).getOrElse(fallback(p)) + } + + override def visitDistinct(p: Distinct): Statistics = { +val child = p.child +visitAggregate(Aggregate(child.output, child.output, child)) + } + + override def visitExcept(p: Except): Statistics = fallback(p) + + override def visitExpand(p: Expand): Statistics = fallback(p) + + override def visitFilter(p: Filter): Statistics = { +FilterEstimation(p).estimate.getOrElse(fallback(p)) + } + + override def visitGenerate(p: Generate): Statistics = default(p) + + override def visitGlobalLimit(p: GlobalLimit): Statistics = fallback(p) + + override def visitOffset(p: Offset): Statistics = fallback(p) + + override def visitIntersect(p: Intersect): Statistics = fallback(p) + + override def visitJoin(p: Join): Statistics = fallback(p) Review Comment: fallback here would end up calling
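For readers following the thread, a minimal, self-contained sketch of the estimation arithmetic being discussed; the signature and plain `BigInt` values are illustrative placeholders, not Spark's `Statistics`/`LogicalPlan` types:

```scala
// Toy version of the unary-node estimation described above (assumes
// childRowSize is never zero, which the real code guarantees via a
// per-row overhead in getSizePerRow).
def estimateUnary(
    childSizeInBytes: BigInt,
    childRowCount: Option[BigInt],
    childRowSize: BigInt,  // size per row of the child's output schema
    outputRowSize: BigInt  // size per row of this node's output schema
): (BigInt, Option[BigInt]) = {
  // Scale the child's size by (output row size / input row size);
  // the row count is assumed unchanged, as for operators like Project.
  val scaled = childSizeInBytes * outputRowSize / childRowSize
  // Clamp to 1 so a binary parent, which multiplies child sizes,
  // never ends up with a zero estimate.
  val sizeInBytes = if (scaled == 0) BigInt(1) else scaled
  (sizeInBytes, childRowCount) // rowCount is propagated as-is
}
```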
[GitHub] [spark] beliefer commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
beliefer commented on code in PR #37116: URL: https://github.com/apache/spark/pull/37116#discussion_r918523599 ## docs/sql-ref-syntax-qry-select-aggregate.md: ## @@ -0,0 +1,125 @@ +--- +layout: global +title: Aggregate Functions +displayTitle: Aggregate Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations. + +### General Aggregation + +Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. + +### Ordered-Set Aggregate Functions Review Comment: Let's add a syntax section to the `General Aggregation` section too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
beliefer commented on code in PR #37116: URL: https://github.com/apache/spark/pull/37116#discussion_r918520601 ## docs/sql-ref-syntax-qry-select-aggregate.md: ## @@ -0,0 +1,125 @@ +--- +layout: global +title: Aggregate Functions +displayTitle: Aggregate Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations. + +### General Aggregation + +Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. + +### Ordered-Set Aggregate Functions + +These aggregate Functions use different syntax than the other aggregate functions so that to specify an expression (typically a column name) by which to order the values. + + Syntax + +```sql +{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) FILTER (WHERE boolean_expression) +``` + + Parameters + +* **percentile** + +The percentile of the value that you want to find. The percentile must be a constant between 0.0 and 1.0. + +* **order_by_expression** + +The expression (typically a column name) by which to order the values. + +* **boolean_expression** + +Specifies any expression that evaluates to a result type boolean. Two or more expressions may be combined together using the logical operators ( AND, OR ). 
+ + Examples + +```sql +CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES +('Diane Murphy','Accounting',8435), +('Mary Patterson','Accounting',9998), +('Jeff Firrelli','Accounting',8992), +('William Patterson','Accounting',8870), +('Gerard Bondur','Accounting',11472), +('Anthony Bow','Accounting',6627), +('Leslie Jennings','IT',8113), +('Leslie Thompson','IT',5186), +('Julie Firrelli','Sales',9181), +('Steve Patterson','Sales',9441), +('Foon Yue Tseng','Sales',6660), +('George Vanauf','Sales',10563), +('Loui Bondur','SCM',10449), +('Gerard Hernandez','SCM',6949), +('Pamela Castillo','SCM',11303), +('Larry Bott','SCM',11798), +('Barry Jones','SCM',10586) +AS basic_pays(employee_name, department, salary); + +SELECT * FROM basic_pays; ++-+--+--+ +|employee_name|department|salary| ++-+--+--+ +| Anthony Bow|Accounting| 6627| +| Barry Jones| SCM| 10586| +| Diane Murphy|Accounting| 8435| +| Foon Yue Tseng| Sales| 6660| +|George Vanauf| Sales| 10563| +|Gerard Bondur|Accounting| 11472| +| Gerard Hernandez| SCM| 6949| +|Jeff Firrelli|Accounting| 8992| +| Julie Firrelli| Sales| 9181| +| Larry Bott| SCM| 11798| +| Leslie Jennings|IT| 8113| +| Leslie Thompson|IT| 5186| +| Loui Bondur| SCM| 10449| +| Mary Patterson|Accounting| 9998| +| Pamela Castillo| SCM| 11303| +| Steve Patterson| Sales| 9441| +|William Patterson|Accounting| 8870| ++-+--+--+ + +SELECT +department, Review Comment: These examples are in the Ordered-Set Aggregate Functions section. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
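For readers trying the documented syntax, a hedged sketch of how the ordered-set aggregate (including the optional `FILTER` clause) could be exercised from Scala against the `basic_pays` view above; the session setup is an assumption and the output is omitted:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: assumes the basic_pays temp view from the docs
// has already been created in this session.
val spark = SparkSession.builder().master("local[*]").appName("percentile-demo").getOrCreate()
spark.sql("""
  SELECT
    department,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY salary) AS pc,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY salary)
      FILTER (WHERE employee_name LIKE '%Bo%') AS pc_filtered
  FROM basic_pays
  GROUP BY department
""").show()
```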
[GitHub] [spark] Yikun commented on a diff in pull request #37117: [WIP][SPARK-39714][PYTHON] Try to fix the mypy annotation tests
Yikun commented on code in PR #37117: URL: https://github.com/apache/spark/pull/37117#discussion_r918485337 ## python/pyspark/ml/util.py: ## @@ -536,10 +536,8 @@ def __get_class(clazz: str) -> Type[RL]: """ parts = clazz.split(".") module = ".".join(parts[:-1]) -m = __import__(module) -for comp in parts[1:]: -m = getattr(m, comp) -return m +m = __import__(module, fromlist=[parts[-1]]) Review Comment: I also took a deep look at [this](https://docs.python.org/3/library/functions.html#import__): > When the name variable is of the form package.module, normally, the top-level package (the name up till the first dot) is returned, not the module named by name. However, when a non-empty fromlist argument is given, the module named by name is returned. For example, for `pyspark.ml.classification.LinearSVC`: ```python >>> clazz = "pyspark.ml.classification.LinearSVC";parts = clazz.split(".");module = ".".join(parts[:-1]) >>> __import__(module) <module 'pyspark' from '...'> >>> __import__(module, fromlist=[parts[-1]]) <module 'pyspark.ml.classification' from '...'> ``` It also returns the same results for wrong names like `pyspark.ml.classification.unknow` and `pyspark.ml.unknow.LinearSVC`. After this, importing with a non-empty fromlist works well; it also simplifies the implementation and cleans up the mypy error: ``` python/pyspark/ml/util.py:542: error: Incompatible return value type (got Module, expected "Type[RL]") [return-value] ``` So, I'm fine with this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
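As a standalone illustration of the fromlist idiom discussed above (the helper name is hypothetical, and the trailing `getattr` is an assumption about the intended behavior, since `__import__` with a non-empty fromlist returns the leaf module, not the class itself):

```python
def get_class(clazz: str):
    # Hypothetical standalone version of the helper under discussion,
    # resolving a dotted path like "pyspark.ml.classification.LinearSVC".
    parts = clazz.split(".")
    module = ".".join(parts[:-1])
    # With a non-empty fromlist, __import__ returns the leaf module
    # ("pyspark.ml.classification"), not the top-level package ("pyspark").
    m = __import__(module, fromlist=[parts[-1]])
    # A final getattr is still needed to reach the class; whether the PR
    # keeps this step is up to its callers (assumption).
    return getattr(m, parts[-1])
```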
[GitHub] [spark] beliefer commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
beliefer commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r918512433 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand} +import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, EXPRESSION_WITH_RANDOM_SEED, LITERAL} + +/** + * Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so + * compare double literal value with 1.0 could eliminate Rand() in binary comparison. + * + * 1. Converts the binary comparison to true literal when the comparison value must be true. + * 2. Converts the binary comparison to false literal when the comparison value must be false. + */ +object OptimizeRand extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +plan.transformAllExpressionsWithPruning(_.containsAllPatterns( + EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) { +case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 => Review Comment: I feel that swap introduces additional complexity and reduces readability. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
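To see the folding concretely, here is a toy model of the rewrite the rule performs, over a simplified expression ADT (illustrative names, not Catalyst's actual classes):

```scala
// Toy model: since rand() is uniform in [0, 1), comparisons against
// literals at or beyond the upper bound fold to constants.
sealed trait Expr
case class DoubleLit(v: Double) extends Expr
case object Rand extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case object TrueLit extends Expr
case object FalseLit extends Expr

def optimize(e: Expr): Expr = e match {
  // v > rand() always holds when v >= 1.0, because rand() < 1.
  case GreaterThan(DoubleLit(v), Rand) if v >= 1.0 => TrueLit
  // rand() > v never holds when v >= 1.0.
  case GreaterThan(Rand, DoubleLit(v)) if v >= 1.0 => FalseLit
  case other => other
}
```

The second case is the "swapped" form discussed in the review: handling it explicitly doubles the patterns, which is the complexity trade-off being debated.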
[GitHub] [spark] AngersZhuuuu commented on pull request #37162: [SPARK-38910][YARN][FOLLOWUP] Clean spark staging before unregister
AngersZh commented on PR #37162: URL: https://github.com/apache/spark/pull/37162#issuecomment-1181266134 Waiting for @tgravescs to come back and review this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu opened a new pull request, #37162: [SPARK-38910][YARN][FOLLOWUP] Clean spark staging before unregister
AngersZh opened a new pull request, #37162: URL: https://github.com/apache/spark/pull/37162 ### What changes were proposed in this pull request? After discussing https://github.com/apache/spark/pull/36207 and re-checking the whole logic, we should revert https://github.com/apache/spark/pull/36207 and make the following changes: 1. If it's the last attempt, YARN won't rerun the job anyway, so we can clean the staging dir first; this avoids leaving the staging dir behind if unregister fails. 2. If it's not the last attempt and the final status is SUCCESS, YARN can rerun the job again if unregister fails, so we can't clean the staging dir before unregistering succeeds. ### Why are the changes needed? Revert the change and make the logic more accurate. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
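A minimal sketch of the ordering the description proposes; the function parameters are placeholders, not the actual `ApplicationMaster` API:

```scala
// Hedged sketch of the shutdown ordering described in the PR text.
def shutdown(
    isLastAttempt: Boolean,
    unregister: () => Unit,
    cleanupStagingDir: () => Unit): Unit = {
  if (isLastAttempt) {
    // YARN will not rerun the job after the last attempt, so cleaning
    // first avoids a leaked staging dir if unregister() then fails.
    cleanupStagingDir()
    unregister()
  } else {
    // If unregister() fails here, YARN may rerun the job, so the staging
    // dir must survive until unregistration has succeeded.
    unregister()
    cleanupStagingDir()
  }
}
```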
[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
cloud-fan commented on code in PR #37116: URL: https://github.com/apache/spark/pull/37116#discussion_r918509777 ## docs/sql-ref-syntax-qry-select-aggregate.md: ## @@ -0,0 +1,125 @@ +--- +layout: global +title: Aggregate Functions +displayTitle: Aggregate Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations. + +### General Aggregation + +Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. + +### Ordered-Set Aggregate Functions + +These aggregate Functions use different syntax than the other aggregate functions so that to specify an expression (typically a column name) by which to order the values. + + Syntax + +```sql +{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) FILTER (WHERE boolean_expression) +``` + + Parameters + +* **percentile** + +The percentile of the value that you want to find. The percentile must be a constant between 0.0 and 1.0. + +* **order_by_expression** + +The expression (typically a column name) by which to order the values. + +* **boolean_expression** + +Specifies any expression that evaluates to a result type boolean. Two or more expressions may be combined together using the logical operators ( AND, OR ). 
+ + Examples + +```sql +CREATE OR REPLACE TEMPORARY VIEW basic_pays AS SELECT * FROM VALUES +('Diane Murphy','Accounting',8435), +('Mary Patterson','Accounting',9998), +('Jeff Firrelli','Accounting',8992), +('William Patterson','Accounting',8870), +('Gerard Bondur','Accounting',11472), +('Anthony Bow','Accounting',6627), +('Leslie Jennings','IT',8113), +('Leslie Thompson','IT',5186), +('Julie Firrelli','Sales',9181), +('Steve Patterson','Sales',9441), +('Foon Yue Tseng','Sales',6660), +('George Vanauf','Sales',10563), +('Loui Bondur','SCM',10449), +('Gerard Hernandez','SCM',6949), +('Pamela Castillo','SCM',11303), +('Larry Bott','SCM',11798), +('Barry Jones','SCM',10586) +AS basic_pays(employee_name, department, salary); + +SELECT * FROM basic_pays; ++-+--+--+ +|employee_name|department|salary| ++-+--+--+ +| Anthony Bow|Accounting| 6627| +| Barry Jones| SCM| 10586| +| Diane Murphy|Accounting| 8435| +| Foon Yue Tseng| Sales| 6660| +|George Vanauf| Sales| 10563| +|Gerard Bondur|Accounting| 11472| +| Gerard Hernandez| SCM| 6949| +|Jeff Firrelli|Accounting| 8992| +| Julie Firrelli| Sales| 9181| +| Larry Bott| SCM| 11798| +| Leslie Jennings|IT| 8113| +| Leslie Thompson|IT| 5186| +| Loui Bondur| SCM| 10449| +| Mary Patterson|Accounting| 9998| +| Pamela Castillo| SCM| 11303| +| Steve Patterson| Sales| 9441| +|William Patterson|Accounting| 8870| ++-+--+--+ + +SELECT +department, Review Comment: can we give some examples for simple agg functions such as count, sum? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
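For reference, one way the requested simple-aggregate examples could look (illustrative only; assumes a `SparkSession` named `spark` and the `basic_pays` view from the docs above; output omitted):

```scala
// Simple aggregates of the kind the reviewer asks for (count, sum, avg).
spark.sql("""
  SELECT department,
         COUNT(*)    AS num_employees,
         SUM(salary) AS total_salary,
         AVG(salary) AS avg_salary
  FROM basic_pays
  GROUP BY department
""").show()
```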
[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
cloud-fan commented on code in PR #37116: URL: https://github.com/apache/spark/pull/37116#discussion_r918508889 ## docs/sql-ref-syntax-qry-select-aggregate.md: ## @@ -0,0 +1,125 @@ +--- +layout: global +title: Aggregate Functions +displayTitle: Aggregate Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations. + +### General Aggregation + +Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. + +### Ordered-Set Aggregate Functions + +These aggregate Functions use different syntax than the other aggregate functions so that to specify an expression (typically a column name) by which to order the values. + + Syntax + +```sql +{ PERCENTILE_CONT | PERCENTILE_DISC }(percentile) WITHIN GROUP (ORDER BY { order_by_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] }) FILTER (WHERE boolean_expression) +``` + + Parameters + +* **percentile** + +The percentile of the value that you want to find. The percentile must be a constant between 0.0 and 1.0. + +* **order_by_expression** + +The expression (typically a column name) by which to order the values. Review Comment: ```suggestion The expression (typically a column name) by which to order the values before aggregating them. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
cloud-fan commented on code in PR #37116: URL: https://github.com/apache/spark/pull/37116#discussion_r918508166 ## docs/sql-ref-syntax-qry-select-aggregate.md: ## @@ -0,0 +1,125 @@ +--- +layout: global +title: Aggregate Functions +displayTitle: Aggregate Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations. + +### General Aggregation + +Please refer to the [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) document for a complete list of Spark aggregate functions. + +### Ordered-Set Aggregate Functions Review Comment: shall we have a new section before this to talk about `FILTER (WHERE ...)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
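For context, the clause being referred to looks like this when attached to an ordinary aggregate (illustrative sketch; assumes a `SparkSession` named `spark` and the `basic_pays` view from the docs):

```scala
// FILTER (WHERE ...) restricts which input rows feed a single aggregate,
// independently of the query's WHERE clause.
spark.sql("""
  SELECT department,
         SUM(salary) FILTER (WHERE salary > 8000) AS sum_high_salaries
  FROM basic_pays
  GROUP BY department
""").show()
```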
[GitHub] [spark] zhengruifeng commented on pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace
zhengruifeng commented on PR #37135: URL: https://github.com/apache/spark/pull/37135#issuecomment-1181262398 Merged to master, thanks @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37153: [SPARK-26052] Add type comments to exposed Prometheus metrics
AmplabJenkins commented on PR #37153: URL: https://github.com/apache/spark/pull/37153#issuecomment-1181262208 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace
zhengruifeng closed pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace URL: https://github.com/apache/spark/pull/37135 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
cloud-fan commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r918506608 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PruneFiltersSuite.scala: ## @@ -129,9 +129,11 @@ class PruneFiltersSuite extends PlanTest { } test("Nondeterministic predicate is not pruned") { -val originalQuery = testRelation.where(Rand(10) > 5).select($"a").where(Rand(10) > 5).analyze Review Comment: why do we need to change this file? The new rule is not invoked in this test suite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
cloud-fan commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r918506272 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand} +import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, EXPRESSION_WITH_RANDOM_SEED, LITERAL} + +/** + * Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so + * compare double literal value with 1.0 could eliminate Rand() in binary comparison. + * + * 1. Converts the binary comparison to true literal when the comparison value must be true. + * 2. Converts the binary comparison to false literal when the comparison value must be false. + */ +object OptimizeRand extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +plan.transformAllExpressionsWithPruning(_.containsAllPatterns( + EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) { +case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 => Review Comment: can we swap the comparison so that we don't need to handle each comparison twice? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic
cloud-fan commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r918506022 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeRand.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{DoubleLiteral, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, Rand} +import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_COMPARISON, EXPRESSION_WITH_RANDOM_SEED, LITERAL} + +/** + * Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so + * compare double literal value with 1.0 could eliminate Rand() in binary comparison. Review Comment: 1.0 or 0.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37161: [SPARK-39748][SQL][SS] Include the origin logical plan for LogicalRDD if it comes from DataFrame
HeartSaVioR commented on PR #37161: URL: https://github.com/apache/spark/pull/37161#issuecomment-1181260007 cc. @cloud-fan @viirya Please take a look. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #37116: [SPARK-39707][SQL][DOCS] Add SQL reference for aggregate functions
beliefer commented on PR #37116: URL: https://github.com/apache/spark/pull/37116#issuecomment-1181259930 ping @cloud-fan https://github.com/apache/spark/pull/37150 has been merged, please review this PR again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request, #37161: [SPARK-39748][SQL][SS] Include the origin logical plan for LogicalRDD if it comes from DataFrame
HeartSaVioR opened a new pull request, #37161: URL: https://github.com/apache/spark/pull/37161 ### What changes were proposed in this pull request? This PR proposes to include the origin logical plan for LogicalRDD, if the LogicalRDD is built from a DataFrame's RDD. Once the origin logical plan is available, LogicalRDD produces the stats from the origin logical plan rather than the default one. Also, this PR applies the change to ForeachBatchSink, which seems to be the only such case in the current codebase. ### Why are the changes needed? The origin logical plan can be useful for several use cases, including: 1. connecting the two split logical plans into one (consider the case of the foreachBatch sink: the origin logical plan represents the plan for the streaming query, and the logical plan for the new Dataset represents the plan for the batch query in the user function) 2. inheriting plan stats from the origin logical plan ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
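A minimal sketch of the stats-inheritance idea from the description, with placeholder types rather than Spark's `LogicalPlan`/`Statistics`:

```scala
// Simplified model: a leaf node wrapping an RDD reports stats from the
// logical plan it originated from when that plan is available, otherwise
// a conservative default. Names here are illustrative.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

trait Plan { def stats: Stats }

case class RddLeaf(defaultStats: Stats, originPlan: Option[Plan]) extends Plan {
  override def stats: Stats = originPlan.map(_.stats).getOrElse(defaultStats)
}
```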
[GitHub] [spark] AngersZhuuuu commented on pull request #36207: [SPARK-38910][YARN] Clean spark staging before `unregister`
AngersZh commented on PR #36207: URL: https://github.com/apache/spark/pull/36207#issuecomment-1181258817 @tgravescs After re-checking the whole logic, I got your point: although it says here that we don't need to rerun, if unregister fails and it's not the last attempt, YARN will rerun this job. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter
beliefer commented on PR #37150: URL: https://github.com/apache/spark/pull/37150#issuecomment-1181258765 @cloud-fan Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #37160: [SPARK-39749][SQL] Use plain string representation on casting Decimal to String
gengliangwang commented on PR #37160: URL: https://github.com/apache/spark/pull/37160#issuecomment-1181252406 cc @timarmstrong @entong @cloud-fan @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang opened a new pull request, #37160: [SPARK-39749][SQL] Use plain string representation on casting Decimal to String
gengliangwang opened a new pull request, #37160: URL: https://github.com/apache/spark/pull/37160 ### What changes were proposed in this pull request? Currently, casting a decimal as string type will result in strings with exponential notation if the adjusted exponent is less than -6. This is consistent with BigDecimal.toString https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#toString After this PR, the casting always uses the plain string representation. ### Why are the changes needed? 1. The current behavior isn't compliant with the ANSI SQL standard (screenshots): https://user-images.githubusercontent.com/1097932/178395756-baecbe90-7a5f-4b4c-b63c-9f1fdf656107.png https://user-images.githubusercontent.com/1097932/178395567-fa5b6877-ff08-48b5-b715-243c954d6bbc.png 2. It is different from databases like PostgreSQL/Oracle/MS SQL Server/etc. 3. The current behavior may surprise users since it only happens when the adjusted exponent is less than -6. The following query will return `false` by default (when ANSI SQL mode is off) since `0.000123` is converted to `1.23E-8`: ```sql select '0.000123' in (0.000123, '0.123'); ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, Spark SQL always uses the plain string representation when casting Decimal to String. To restore the legacy behavior, which uses scientific notation if the adjusted exponent is less than -6, set `spark.sql.legacy.castDecimalToString.enabled` to `true`. ### How was this patch tested? Unit test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
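The `java.math.BigDecimal` behavior the description references can be seen directly in the JDK; this snippet only demonstrates those semantics, not the PR's code:

```scala
// BigDecimal.toString switches to scientific notation once the adjusted
// exponent (precision - scale - 1) is less than -6, while toPlainString
// never does. Here precision = 3, scale = 10, adjusted exponent = -8.
val d = new java.math.BigDecimal("0.0000000123")
println(d.toString)       // prints 1.23E-8
println(d.toPlainString)  // prints 0.0000000123
```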
[GitHub] [spark] huaxingao commented on pull request #37123: [SPARK-39711][TESTS] Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
huaxingao commented on PR #37123: URL: https://github.com/apache/spark/pull/37123#issuecomment-1181244052 Merged to master. Thanks @panbingkun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao closed pull request #37123: [SPARK-39711][TESTS] Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging
huaxingao closed pull request #37123: [SPARK-39711][TESTS] Remove redundant trait: BeforeAndAfterAll & BeforeAndAfterEach & Logging URL: https://github.com/apache/spark/pull/37123 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter
cloud-fan closed pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter URL: https://github.com/apache/spark/pull/37150 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter
cloud-fan commented on PR #37150: URL: https://github.com/apache/spark/pull/37150#issuecomment-1181237260 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
cloud-fan commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918484209 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: My point is, we can use a dedicated datetime parser when inferring date types, which does not need to respect `legacyTimeParserPolicy`, so there is no need to fail for an unsupported parser policy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
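A hedged sketch of what such a dedicated, policy-independent date check could look like (the function name and default pattern are assumptions; `java.time` strict parsing rejects the trailing characters the legacy parser tolerates):

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException, ResolverStyle}

// Returns true only if the whole string parses as a date under the given
// pattern; no fallback to any legacy parser, so legacyTimeParserPolicy
// is irrelevant here.
def looksLikeDate(s: String, pattern: String = "uuuu-MM-dd"): Boolean = {
  val fmt = DateTimeFormatter.ofPattern(pattern).withResolverStyle(ResolverStyle.STRICT)
  try { LocalDate.parse(s, fmt); true }
  catch { case _: DateTimeParseException => false }
}
```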
[GitHub] [spark] cloud-fan commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy
cloud-fan commented on PR #37147: URL: https://github.com/apache/spark/pull/37147#issuecomment-1181227634 If the legacy behavior is unreasonable, I think we don't have to keep it. If a datetime pattern is specified, we should not fall back to the legacy code path, even if it only supports 4 digits like Spark 2.x. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37001: [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT
beliefer commented on code in PR #37001: URL: https://github.com/apache/spark/pull/37001#discussion_r918480531 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -165,106 +169,63 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit val pushedAggregates = finalTranslatedAggregates.filter(r.pushAggregation) if (pushedAggregates.isEmpty) { aggNode // return original plan node +} else if (r.supportCompletePushDown(pushedAggregates.get)) { + // If we can push down the aggregation completely, we need to consider + // pushing down the Limit or Offset operator, so keep the information about + // aggregation push down and push down aggregation later. + sHolder.pushedAggregation = Some(PushedAggregation( +pushedAggregates, +finalResultExpressions, +finalAggregates, +normalizedGroupingExpressions, +aggExprToOutputOrdinal)) + sHolder Review Comment: Got it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37156: [SPARK-39742][core]Fix a problem with the result of adjusting resources is not exp…
AmplabJenkins commented on PR #37156: URL: https://github.com/apache/spark/pull/37156#issuecomment-1181212204 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter
beliefer commented on code in PR #37150: URL: https://github.com/apache/spark/pull/37150#discussion_r918466540 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -849,7 +849,8 @@ primaryExpression | OVERLAY LEFT_PAREN input=valueExpression PLACING replace=valueExpression FROM position=valueExpression (FOR length=valueExpression)? RIGHT_PAREN #overlay | name=(PERCENTILE_CONT | PERCENTILE_DISC) LEFT_PAREN percentage=valueExpression RIGHT_PAREN - WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN ( OVER windowSpec)?#percentile +WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN +(FILTER LEFT_PAREN WHERE where=booleanExpression RIGHT_PAREN)? ( OVER windowSpec)? #percentile Review Comment: ``` org.apache.spark.sql.AnalysisException window aggregate function with filter predicate is not supported yet. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy
sadikovi commented on PR #37147: URL: https://github.com/apache/spark/pull/37147#issuecomment-1181198109 Thanks for the reviews. I will address the comments and failing tests and update the PR. My question was whether there are any concerns with this change and whether users might experience compatibility issues. I would appreciate some thoughts on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job
HyukjinKwon closed pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job URL: https://github.com/apache/spark/pull/37158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job
HyukjinKwon commented on PR #37158: URL: https://github.com/apache/spark/pull/37158#issuecomment-1181176377 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace
HyukjinKwon commented on code in PR #37135: URL: https://github.com/apache/spark/pull/37135#discussion_r918448953 ## R/pkg/NAMESPACE: ## @@ -479,7 +479,9 @@ export("as.DataFrame", "databaseExists", "dropTempTable", "dropTempView", + "functionExists", "getDatabase", + "getFunction", Review Comment: ```suggestion "getFunc", ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunc in SparkR for 3L namespace
HyukjinKwon commented on code in PR #37135: URL: https://github.com/apache/spark/pull/37135#discussion_r918449024 ## R/pkg/pkgdown/_pkgdown_template.yml: ## @@ -266,7 +266,9 @@ reference: - databaseExists - dropTempTable - dropTempView + - functionExists - getDatabase + - getFunction Review Comment: ```suggestion - getFunc ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #37158: [SPARK-39736][INFRA] Enable base image build in SparkR job
Yikun commented on PR #37158: URL: https://github.com/apache/spark/pull/37158#issuecomment-1181166080 Ready to go
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunction in SparkR for 3L namespace
HyukjinKwon commented on code in PR #37135: URL: https://github.com/apache/spark/pull/37135#discussion_r918440696 ## R/pkg/tests/fulltests/test_context.R: ## @@ -21,10 +21,11 @@ test_that("Check masked functions", { # Check that we are not masking any new function from base, stats, testthat unexpectedly # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as masked functions make it # hard for users to use base R functions. Please check when in doubt. - namesOfMaskedCompletely <- c("cov", "filter", "sample", "not") + namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "getFunction") Review Comment: cc @falaki @felixcheung @shivaram FYI
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
HyukjinKwon commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918440378 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: Yeah, I don't think legacy mode supports this. Let's just throw an exception for now.
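For a concrete picture of the option under discussion, here is a minimal PySpark sketch, assuming the flag keeps the `inferDate` name from this diff and the hypothetical file path exists; under LEGACY the reader is expected to raise rather than mis-infer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under the default CORRECTED policy, date inference can simply be enabled.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
df = (spark.read
      .option("inferSchema", "true")
      .option("inferDate", "true")
      .csv("/tmp/dates.csv"))  # hypothetical path

# Under the LEGACY policy the combination is rejected up front, since the
# legacy date parser accepts trailing characters and would infer bad dates.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
try:
    spark.read.option("inferSchema", "true").option("inferDate", "true").csv("/tmp/dates.csv")
except Exception as e:
    print(e)
```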
[GitHub] [spark] zhengruifeng commented on a diff in pull request #37135: [SPARK-39723][R] Implement functionExists/getFunction in SparkR for 3L namespace
zhengruifeng commented on code in PR #37135: URL: https://github.com/apache/spark/pull/37135#discussion_r918440246 ## R/pkg/tests/fulltests/test_context.R: ## @@ -21,10 +21,11 @@ test_that("Check masked functions", { # Check that we are not masking any new function from base, stats, testthat unexpectedly # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as masked functions make it # hard for users to use base R functions. Please check when in doubt. - namesOfMaskedCompletely <- c("cov", "filter", "sample", "not") + namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "getFunction") Review Comment: Got it, thanks. Let me update the PR.
[GitHub] [spark] HyukjinKwon closed pull request #37157: [MINOR][FOLLOWUP] Remove redundant return
HyukjinKwon closed pull request #37157: [MINOR][FOLLOWUP] Remove redundant return URL: https://github.com/apache/spark/pull/37157
[GitHub] [spark] HyukjinKwon closed pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc
HyukjinKwon closed pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc URL: https://github.com/apache/spark/pull/37159
[GitHub] [spark] HyukjinKwon commented on pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc
HyukjinKwon commented on PR #37159: URL: https://github.com/apache/spark/pull/37159#issuecomment-1181096060 Merged to master.
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #37147: [SPARK-39731][SQL] Fix issue in CSV data source when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy
Jonathancui123 commented on code in PR #37147: URL: https://github.com/apache/spark/pull/37147#discussion_r918421341

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
@@ -222,7 +226,11 @@ class UnivocityParser(
       } catch {
         case NonFatal(e) =>
           // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
-          // compatibility.
+          // compatibility only if no custom pattern has been set. If there is a custom pattern,
+          // fail since it may be different from the default pattern.
+          if (options.dateFormatInRead.isDefined) {
+            throw e
+          }

Review Comment: This change makes sense to me. Thanks Ivan :))

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
@@ -2788,6 +2788,47 @@ abstract class CSVSuite
+
+  test("SPARK-39731: Correctly parse dates with yyyyMMdd pattern") {
+    withTempPath { path =>
+      Seq(
+        "1,2020011,2020011",
+        "2,20201203,20201203").toDF("data")
+        .repartition(1)
+        .write.text(path.getAbsolutePath)
+      val schema = new StructType()
+        .add("id", IntegerType)
+        .add("date", DateType)
+        .add("ts", TimestampType)
+      val output = spark.read
+        .schema(schema)
+        .option("dateFormat", "yyyyMMdd")
+        .option("timestampFormat", "yyyyMMdd")
+        .csv(path.getAbsolutePath)
+
+      def check(mode: String, res: Seq[Row]): Unit = {
+        withSQLConf(SQLConf.LEGACY_TIME_PARSER_POLICY.key -> mode) {
+          checkAnswer(output, res)
+        }
+      }
+
+      check(
+        "legacy",
+        Seq(
+          Row(1, Date.valueOf("2020-01-01"), Timestamp.valueOf("2020-01-01 00:00:00")),
+          Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
+        )
+      )
+
+      check(
+        "corrected",
+        Seq(
+          Row(1, null, null),
+          Row(2, Date.valueOf("2020-12-03"), Timestamp.valueOf("2020-12-03 00:00:00"))
+        )
+      )

Review Comment: For completeness, would you consider adding a check for `LEGACY_TIME_PARSER_POLICY` = `EXCEPTION`? Similar to the following? https://github.com/apache/spark/blob/1193ce78d3efcbe1395305b4b7deb0a195fa09d9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L2598-L2601
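For illustration, a PySpark rendering of what the test above asserts; the path and file contents are hypothetical and the output is the expected result, not a captured run:

```python
# /tmp/yyyymmdd.csv (hypothetical):
#   1,2020011,2020011
#   2,20201203,20201203
df = (spark.read
      .schema("id INT, date DATE, ts TIMESTAMP")
      .option("dateFormat", "yyyyMMdd")
      .option("timestampFormat", "yyyyMMdd")
      .csv("/tmp/yyyymmdd.csv"))

# With spark.sql.legacy.timeParserPolicy=corrected, the 7-digit value in row 1
# no longer falls back to the lenient legacy parser and parses to null:
#   Row(id=1, date=None, ts=None)
#   Row(id=2, date=datetime.date(2020, 12, 3), ts=datetime.datetime(2020, 12, 3, 0, 0))
df.show()
```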
[GitHub] [spark] panbingkun commented on pull request #37157: [MINOR][FOLLOWUP] Remove redundant return
panbingkun commented on PR #37157: URL: https://github.com/apache/spark/pull/37157#issuecomment-1181059198 Yes, I checked twice.
[GitHub] [spark] c21 commented on a diff in pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables
c21 commented on code in PR #37083: URL: https://github.com/apache/spark/pull/37083#discussion_r918415377

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala:
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import org.apache.spark.sql.catalyst.plans.logical._
+
+/**
+ * A [[LogicalPlanVisitor]] that computes the statistics for the cost-based optimizer.
+ */
+object AdvancedStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
+
+  /** Falls back to the estimation computed by [[BasicStatsPlanVisitor]]. */
+  private def fallback(p: LogicalPlan): Statistics = BasicStatsPlanVisitor.visit(p)
+
+  override def default(p: LogicalPlan): Statistics = fallback(p)
+
+  override def visitAggregate(p: Aggregate): Statistics = {
+    AggregateEstimation.estimate(p).getOrElse(fallback(p))
+  }
+
+  override def visitDistinct(p: Distinct): Statistics = {
+    val child = p.child
+    visitAggregate(Aggregate(child.output, child.output, child))
+  }
+
+  override def visitExcept(p: Except): Statistics = fallback(p)
+
+  override def visitExpand(p: Expand): Statistics = fallback(p)
+
+  override def visitFilter(p: Filter): Statistics = {
+    FilterEstimation(p).estimate.getOrElse(fallback(p))
+  }
+
+  override def visitGenerate(p: Generate): Statistics = default(p)
+
+  override def visitGlobalLimit(p: GlobalLimit): Statistics = fallback(p)
+
+  override def visitOffset(p: Offset): Statistics = fallback(p)
+
+  override def visitIntersect(p: Intersect): Statistics = fallback(p)
+
+  override def visitJoin(p: Join): Statistics = fallback(p)

Review Comment: Why do we fall back here instead of using `JoinEstimation.estimate`?
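Not part of the diff, but for context, a small sketch of how these per-operator estimates surface once CBO is enabled; the table name `t` is hypothetical and assumed to exist:

```python
spark.conf.set("spark.sql.cbo.enabled", "true")

# Column-level statistics feed FilterEstimation / AggregateEstimation above.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR ALL COLUMNS")

# EXPLAIN COST prints the Statistics computed for each operator in the plan.
spark.sql("EXPLAIN COST SELECT * FROM t WHERE id > 10").show(truncate=False)
```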
[GitHub] [spark] mridulm commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
mridulm commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r918226879 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -343,15 +397,44 @@ void closeAndDeletePartitionFilesIfNeeded( if (cleanupLocalDirs) { deleteExecutorDirs(appShuffleInfo); } +removeAppShuffleInfoFromDB(appShuffleInfo); + } + + /** + * Remove the application attempt local paths information from the DB. + */ + private void removeAppAttemptPathInfoFromDB(String appId, int attemptId) throws Exception{ Review Comment: Can you add a comment to both `removeAppAttemptPathInfoFromDB` and `writeNewAppAttemptPathInfoToDBAndRemoveOutdated` that they are expected to be invoked with the `appsShuffleInfo` lock held for `appId`? (Also, group them together in the source file) ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -656,6 +768,232 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + /** + * Remove the former application attempt local paths information from the DB and insert the + * local paths information from the newer application attempt. If the deletion fails, the + * insertion will also be skipped. This ensures that there will always be a single application + * attempt local path information in the DB. + */ + private void writeNewAppAttemptPathInfoToDBAndRemoveOutdated( + String appId, + int newAttemptId, + AppShuffleInfo appShuffleInfo, + AppPathsInfo appPathsInfo) { +try{ + if (appShuffleInfo != null) { +removeAppAttemptPathInfoFromDB(appId, appShuffleInfo.attemptId); + } + writeAppPathsInfoToDb(appId, newAttemptId, appPathsInfo); Review Comment: Thoughts on making the exception handling local to the DB invocation (since we don't need to handle cross-DB invocation failures for now)? We are already doing this for `removeAppShufflePartitionInfoFromDB`, `writeAppAttemptShuffleMergeInfoToDB`, etc. Given this, let us make the exception handling local to `removeAppAttemptPathInfoFromDB` (so it does not throw an `Exception`) ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -632,6 +736,12 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { appsShuffleInfo.compute(appId, (id, appShuffleInfo) -> { if (appShuffleInfo == null || attemptId > appShuffleInfo.attemptId) { originalAppShuffleInfo.set(appShuffleInfo); + AppPathsInfo appPathsInfo = new AppPathsInfo(appId, executorInfo.localDirs, + mergeDir, executorInfo.subDirsPerLocalDir); + // Clean up the outdated App Attempt local path info in the DB and + // put the newly registered local path info from newer attempt into the DB. + writeNewAppAttemptPathInfoToDBAndRemoveOutdated( Review Comment: nit: Do we need this method? (the comment is helpful, just the method) We can simply do ``` if (null != appShuffleInfo) removeAppAttemptPathInfoFromDB() writeAppPathsInfoToDb() ``` ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -343,15 +397,44 @@ void closeAndDeletePartitionFilesIfNeeded( if (cleanupLocalDirs) { deleteExecutorDirs(appShuffleInfo); } +removeAppShuffleInfoFromDB(appShuffleInfo); + } + + /** + * Remove the application attempt local paths information from the DB. + */ + private void removeAppAttemptPathInfoFromDB(String appId, int attemptId) throws Exception{ +AppAttemptId appAttemptId = new AppAttemptId(appId, attemptId); +if (db != null) { + db.delete(getDbAppAttemptPathsKey(appAttemptId)); Review Comment: Given the possibility of stale entries from previous failed deletes hanging around, can we scan for all entries for an application id and delete them here? Note that this will require changes to how we encode the db key for `AppAttemptId` - we cannot use JSON for it, since we want to do a prefix scan. Something like:
```
private byte[] getDbAppAttemptPathsKey(AppAttemptId appAttemptId) throws IOException {
  return (APP_ATTEMPT_PATH_KEY_PREFIX + DB_KEY_DELIMITER + appAttemptId.appId
    + DB_KEY_DELIMITER + appAttemptId.attemptId).getBytes(StandardCharsets.UTF_8);
}

private byte[] getDbAppAttemptPathsKeyPrefix(String appId) throws IOException {
  return (APP_ATTEMPT_PATH_KEY_PREFIX + DB_KEY_DELIMITER + appId
    + DB_KEY_DELIMITER).getBytes(StandardCharsets.UTF_8);
}

private AppAttemptId parseDbAppAttemptPathsKey(String key) throws IOException {
  String[] parts = key.split(DB_KEY_DELIMITER);
  if (parts
```
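A toy sketch of the key-encoding idea in that suggestion, not the shuffle service's actual code: plain delimiter-joined keys, unlike JSON keys, give all attempts of an app a common prefix that a LevelDB-style iterator can scan; the `db` handle is hypothetical:

```python
DB_KEY_DELIMITER = ";"
APP_ATTEMPT_PATH_KEY_PREFIX = "AppAttemptPathInfo"

def db_app_attempt_paths_key(app_id: str, attempt_id: int) -> bytes:
    # One key per (app, attempt), delimiter-joined so it stays prefix-scannable.
    return DB_KEY_DELIMITER.join(
        [APP_ATTEMPT_PATH_KEY_PREFIX, app_id, str(attempt_id)]).encode("utf-8")

def db_app_attempt_paths_key_prefix(app_id: str) -> bytes:
    # Trailing delimiter so the prefix for "app1" does not match keys for "app10".
    return (DB_KEY_DELIMITER.join([APP_ATTEMPT_PATH_KEY_PREFIX, app_id])
            + DB_KEY_DELIMITER).encode("utf-8")

# Deleting every (possibly stale) attempt entry for an application with a
# hypothetical leveldb-style iterator:
# for key, _ in db.iterator(prefix=db_app_attempt_paths_key_prefix(app_id)):
#     db.delete(key)
```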
[GitHub] [spark] HyukjinKwon commented on pull request #37157: [MINOR][FOLLOWUP] Remove redundant return
HyukjinKwon commented on PR #37157: URL: https://github.com/apache/spark/pull/37157#issuecomment-1181000672 @panbingkun are these all the ones that need fixing?
[GitHub] [spark] huaxingao commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries
huaxingao commented on code in PR #37080: URL: https://github.com/apache/spark/pull/37080#discussion_r918358655 ## docs/sql-ref-syntax-qry-select-join.md: ## @@ -26,7 +26,7 @@ A SQL join is used to combine rows from two relations based on join criteria. Th ### Syntax ```sql -relation { [ join_type ] JOIN relation [ join_criteria ] | NATURAL join_type JOIN relation } +relation { [ join_type ] JOIN [ LATERAL ] relation [ join_criteria ] | NATURAL join_type JOIN [ LATERAL ] relation } Review Comment: Sounds good! Changed. before: https://user-images.githubusercontent.com/13592258/17835-86c7b9f8-03d6-471d-af38-d549af29d7bb.png after: https://user-images.githubusercontent.com/13592258/178359932-3af7c545-8f11-4e1e-bec0-4550e51f0b09.png
[GitHub] [spark] otterc commented on pull request #37052: [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager
otterc commented on PR #37052: URL: https://github.com/apache/spark/pull/37052#issuecomment-1180723189 All the tests pass now @Ngone51 @mridulm
[GitHub] [spark] mridulm commented on pull request #36162: [SPARK-32170][CORE] Improve the speculation through the stage task metrics.
mridulm commented on PR #36162: URL: https://github.com/apache/spark/pull/36162#issuecomment-1180706514 It helps in two cases, @weixiuli: the example you gave (generated input, like `range()`, where there are no input metrics), and reading shuffle input where there is a sort - the entire shuffle input is consumed at the beginning of the task, but the output rate is impacted by computation/skew/etc.
[GitHub] [spark] allisonwang-db commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries
allisonwang-db commented on code in PR #37080: URL: https://github.com/apache/spark/pull/37080#discussion_r918189602 ## docs/sql-ref-syntax-qry-select-join.md: ## @@ -26,7 +26,7 @@ A SQL join is used to combine rows from two relations based on join criteria. Th ### Syntax ```sql -relation { [ join_type ] JOIN relation [ join_criteria ] | NATURAL join_type JOIN relation } +relation { [ join_type ] JOIN [ LATERAL ] relation [ join_criteria ] | NATURAL join_type JOIN [ LATERAL ] relation } Review Comment: Should we also update the syntax reference in `sql-ref-syntax-qry-select.md` under the `from_item`: `[LATERAL] (subquery)`? ## docs/sql-ref-syntax-qry-select-lateral-subquery.md: ## @@ -0,0 +1,87 @@ +--- +layout: global +title: LATERAL SUBQUERY +displayTitle: LATERAL SUBQUERY +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It provides a way to cross-reference columns in the preceding FROM clause. Review Comment: ```suggestion `LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It provides a way to reference columns in the preceding `FROM` clause. ``` ## docs/sql-ref-syntax-qry-select-lateral-subquery.md: ## @@ -0,0 +1,87 @@ +--- +layout: global +title: LATERAL SUBQUERY +displayTitle: LATERAL SUBQUERY +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It provides a way to cross-reference columns in the preceding FROM clause. +Without the `LATERAL` keyword, subqueries can only refer to columns in the outer query, but not in the `FROM` clause. `LATERAL SUBQUERY` makes the complicated +queries simpler and more efficient. + +### Syntax + +```sql +[ LATERAL ] primary_relation [ join_relation ] +``` + +### Parameters + +* **primary_relation** + + Specifies the primary relation. 
It can be one of the following: + * Table relation + * aliased query + +Syntax: `( query ) [ [ AS ] alias ]` + * aliased relation Review Comment: ```suggestion * Aliased relation ``` ## docs/sql-ref-syntax-qry-select-lateral-subquery.md: ## @@ -0,0 +1,87 @@ +--- +layout: global +title: LATERAL SUBQUERY +displayTitle: LATERAL SUBQUERY +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +### Description + +`LATERAL SUBQUERY` is a subquery that is preceded by the keyword `LATERAL`. It provides a way to cross-reference columns in the preceding FROM clause. +Without the `LATERAL` keyword, subqueries can only refer to columns in the outer query, but not in the `FROM` clause.
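To make the documented behavior concrete, a small runnable example of the syntax under review (view and column names are made up):

```python
spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales AS
    SELECT * FROM VALUES (1, 100), (1, 200), (2, 50) AS t(store, amount)
""")

# The lateral subquery may reference s.store from the preceding FROM item,
# which a plain subquery in the FROM clause cannot do.
spark.sql("""
    SELECT s.store, m.max_amount
    FROM (SELECT DISTINCT store FROM sales) AS s,
         LATERAL (SELECT MAX(amount) AS max_amount
                  FROM sales WHERE sales.store = s.store) AS m
""").show()
```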
[GitHub] [spark] zhouyejoe commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
zhouyejoe commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r918214646 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) { logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs); AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId); if (null != appShuffleInfo) { - mergedShuffleCleaner.execute( -() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs)); + submitCleanupTask( +() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, true)); } +removeAppAttemptPathInfoFromDB( +new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId)); } - /** * Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo. * If cleanupLocalDirs is true, the merged shuffle files will also be deleted. * The cleanup will be executed in a separate thread. */ @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter") @VisibleForTesting - void closeAndDeletePartitionFilesIfNeeded( + void closeAndDeletePartitions( AppShuffleInfo appShuffleInfo, - boolean cleanupLocalDirs) { + boolean cleanupLocalDirs, + boolean removeFromDb) { Review Comment: Removed.
[GitHub] [spark] mridulm commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
mridulm commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r918184780 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -317,22 +353,24 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) { logger.info("Application {} removed, cleanupLocalDirs = {}", appId, cleanupLocalDirs); AppShuffleInfo appShuffleInfo = appsShuffleInfo.remove(appId); if (null != appShuffleInfo) { - mergedShuffleCleaner.execute( -() -> closeAndDeletePartitionFilesIfNeeded(appShuffleInfo, cleanupLocalDirs)); + submitCleanupTask( +() -> closeAndDeletePartitions(appShuffleInfo, cleanupLocalDirs, true)); } +removeAppAttemptPathInfoFromDB( +new AppAttemptId(appShuffleInfo.appId, appShuffleInfo.attemptId)); } - /** * Clean up the AppShufflePartitionInfo for a specific AppShuffleInfo. * If cleanupLocalDirs is true, the merged shuffle files will also be deleted. * The cleanup will be executed in a separate thread. */ @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter") @VisibleForTesting - void closeAndDeletePartitionFilesIfNeeded( + void closeAndDeletePartitions( AppShuffleInfo appShuffleInfo, - boolean cleanupLocalDirs) { + boolean cleanupLocalDirs, + boolean removeFromDb) { Review Comment: This is not an exposed API, so let us remove the flag for now and add it when required.
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided
xinrong-databricks commented on code in PR #36893: URL: https://github.com/apache/spark/pull/36893#discussion_r918171922 ## python/pyspark/sql/session.py: ## @@ -1023,6 +1023,20 @@ def prepare(obj: Any) -> Any: if isinstance(data, RDD): rdd, struct = self._createFromRDD(data.map(prepare), schema, samplingRatio) +elif isinstance(data, list) and schema is None: +# Wrap each element with a tuple if there is any scalar in the list + +has_scalar = any( +isinstance(x, str) +or ( +not isinstance(x, Sized) +and (type(x).__module__ == object.__module__) # built-in Review Comment: Accepts native Python scalars only. Supporting numpy/pandas scalars will be implemented in https://issues.apache.org/jira/browse/SPARK-39745.
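As a usage-level illustration of the change under review (per this diff, only native Python scalars qualify for the implicit wrapping):

```python
# Each scalar is wrapped into a single-column row, so neither a schema nor
# manual tuple-wrapping is needed:
df = spark.createDataFrame([1, 2, 3])
df.show()

# Previously the equivalent explicit form was required:
df_old = spark.createDataFrame([(1,), (2,), (3,)])
```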
[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36893: [SPARK-39494][PYTHON] Support `createDataFrame` from a list of scalars when schema is not provided
xinrong-databricks commented on code in PR #36893: URL: https://github.com/apache/spark/pull/36893#discussion_r918165714 ## python/pyspark/sql/tests/test_types.py: ## @@ -374,12 +373,6 @@ def test_negative_decimal(self): finally: self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal=false") -def test_create_dataframe_from_objects(self): Review Comment: Moved the test to python/pyspark/sql/tests/test_session.py.
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918153782 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US) - val dateFormatInRead: Option[String] = parameters.get("dateFormat") + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { Review Comment: Does the `Iso8601DateFormatter` properly parse legacy date types? I'm inclined to believe it doesn't since `getFormatter` returns a special `LegacyDateFormatter` when `legacyTimeParserPolicy == LEGACY`: https://github.com/apache/spark/blob/9c5c21ccc9c7f41a204a738cd540c14793ffc8cb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L174-L188 cc @bersprockets @HyukjinKwon
[GitHub] [spark] Yikf commented on a diff in pull request #37129: [SPARK-39710][SQL] Support push local topK through outer join
Yikf commented on code in PR #37129: URL: https://github.com/apache/spark/pull/37129#discussion_r918140392 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -230,6 +230,7 @@ abstract class Optimizer(catalogManager: CatalogManager) // non-nullable when an empty relation child of a Union is removed UpdateAttributeNullability) :+ Batch("Optimize One Row Plan", fixedPoint, OptimizeOneRowPlan) :+ +Batch("Push Local TopK Through Outer Join", FixedPoint(1), PushLocalTopKThroughOuterJoin) :+ Review Comment: This batch should be idempotent; should we use `Once` so that idempotence is enforced?
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r918142012 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType if options.inferDate => tryParseDateTime(field) Review Comment: Our expected behavior is that in a column with `TimestampType` entries and then `DateType` entries, the column will be inferred as `TimestampType`. Here, `tryParseTimestampNTZ` and `tryParseTimestamp` will not be able to parse the `DateType` entries that show up later in the column, and the column would be promoted to string type. So we must use `tryParseDateTime`, which gives the dates a chance to be parsed.
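A sketch of that promotion rule with a hypothetical single-column file, assuming the `inferDate` option name from this PR:

```python
# /tmp/mixed.csv (hypothetical):
#   2020-12-03T10:15:30
#   2020-12-03
df = (spark.read
      .option("inferSchema", "true")
      .option("inferDate", "true")
      .csv("/tmp/mixed.csv"))

# The first row fixes the column at timestamp; tryParseDateTime then lets the
# date-only second row parse rather than demoting the column to string.
df.printSchema()  # expected: _c0 of timestamp type
```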
[GitHub] [spark] dtenedor commented on a diff in pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc
dtenedor commented on code in PR #37159: URL: https://github.com/apache/spark/pull/37159#discussion_r918129229 ## docs/sql-ref-number-pattern.md: ## @@ -176,10 +176,10 @@ Note that the format string used in most of these examples expects: "$#.##" -- 'S' can be at the end. -> SELECT try_to_char(decimal(-12454.8), '99,999.9S'); +> SELECT to_char(decimal(-12454.8), '99,999.9S'); "12,454.8-" -> SELECT try_to_char(decimal(12454.8), 'L99,999.9'); +> SELECT to_char(decimal(12454.8), 'L99,999.9'); Error: cannot resolve 'try_to_char(Decimal(12454.8), 'L99,999.9')' due to data type mismatch: Review Comment: ```suggestion Error: cannot resolve 'to_char(Decimal(12454.8), 'L99,999.9')' due to data type mismatch: ```
[GitHub] [spark] Yikf commented on a diff in pull request #37113: [SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions
Yikf commented on code in PR #37113: URL: https://github.com/apache/spark/pull/37113#discussion_r918128553 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala: ## @@ -0,0 +1,290 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.net.{URI, URISyntaxException, URLDecoder, URLEncoder} +import java.util.regex.Pattern + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke +import org.apache.spark.sql.catalyst.trees.UnaryLike +import org.apache.spark.sql.errors.QueryExecutionErrors +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType} +import org.apache.spark.unsafe.types.UTF8String + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = """ +_FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme. Review Comment: Looks like it works normally: https://user-images.githubusercontent.com/51110188/178311420-3982ab92-0b71-470a-88b0-cfa27d5862a1.png
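Assuming the function lands under the `url_encode` name this PR proposes, usage would look like the following; the expected output follows `java.net.URLEncoder` semantics, where a space becomes `+`:

```python
spark.sql(
    "SELECT url_encode('https://spark.apache.org/path?q=Hello World')"
).show(truncate=False)
# expected: https%3A%2F%2Fspark.apache.org%2Fpath%3Fq%3DHello+World
```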
[GitHub] [spark] cloud-fan commented on pull request #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc
cloud-fan commented on PR #37159: URL: https://github.com/apache/spark/pull/37159#issuecomment-1180612285 CC @HyukjinKwon
[GitHub] [spark] cloud-fan opened a new pull request, #37159: [SPARK-38796][SQL][DOC][FOLLOWUP] Remove try_to_char reference in the doc
cloud-fan opened a new pull request, #37159: URL: https://github.com/apache/spark/pull/37159

### What changes were proposed in this pull request?
We have removed the `try_to_char` function and it shouldn't appear in the doc anymore. This PR also improves the doc a little bit.

### Why are the changes needed?
Fix the doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A
[GitHub] [spark] cloud-fan commented on pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery
cloud-fan commented on PR #37074: URL: https://github.com/apache/spark/pull/37074#issuecomment-1180582242 OK, I think `DeduplicateRelations` needs a fix. Ideally the outer and inner plans should not have conflicting output attributes after analysis, but this local relation + project case is missed.
[GitHub] [spark] cloud-fan commented on pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery
cloud-fan commented on PR #37074: URL: https://github.com/apache/spark/pull/37074#issuecomment-1180579396 > This check is not accurate when there's an `And` expression in the Join condition, as in this case. Hence, this PR proposes to add a check whether the intersected attributes exist in all the children of the `And` expression. Can you explain the rationale? The subquery filter will be turned into a join eventually, so it's not very clear to me how to resolve `a#266` in the join condition.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37074: [SPARK-39672][SQL][3.1] Fix de-duplicating conflicting attributes when rewriting subquery
cloud-fan commented on code in PR #37074: URL: https://github.com/apache/spark/pull/37074#discussion_r918091135 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ## @@ -72,12 +72,22 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper { val outerRefs = outerPlan.outputSet ++ outerReferences val duplicates = outerRefs.intersect(subplan.outputSet) if (duplicates.nonEmpty) { - condition.foreach { e => + def throwOnConflictingAttrs(attrs: AttributeSet): Unit = { Review Comment: ```suggestion def failForConflictingAttrs(attrs: AttributeSet): Unit = { ```
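For context, a minimal shape of query that exercises this de-duplication: both sides of the rewritten join originate from the same relation, so the outer plan and the subquery plan can carry identical attribute ids (the exact ids, like `a#266` above, vary per run):

```python
df = spark.range(5).withColumnRenamed("id", "a")
df.createOrReplaceTempView("t")

# The IN subquery is rewritten into a left semi join during optimization.
spark.sql("SELECT * FROM t WHERE a IN (SELECT a FROM t WHERE a > 1)").explain(True)
```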
[GitHub] [spark] cloud-fan commented on a diff in pull request #37113: [SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions
cloud-fan commented on code in PR #37113: URL: https://github.com/apache/spark/pull/37113#discussion_r918083814 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala: ## @@ -0,0 +1,290 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.net.{URI, URISyntaxException, URLDecoder, URLEncoder} +import java.util.regex.Pattern + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke +import org.apache.spark.sql.catalyst.trees.UnaryLike +import org.apache.spark.sql.errors.QueryExecutionErrors +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType} +import org.apache.spark.unsafe.types.UTF8String + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = """ +_FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme. Review Comment: Does `{@code application/x-www-form-urlencoded}` really work? Can you run `DESC FUNCTION url_encode` and give a screenshot?
[GitHub] [spark] cloud-fan commented on a diff in pull request #37150: [SPARK-39737][SQL] `PERCENTILE_CONT` and `PERCENTILE_DISC` should support aggregate filter
cloud-fan commented on code in PR #37150: URL: https://github.com/apache/spark/pull/37150#discussion_r918064239 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -849,7 +849,8 @@ primaryExpression | OVERLAY LEFT_PAREN input=valueExpression PLACING replace=valueExpression FROM position=valueExpression (FOR length=valueExpression)? RIGHT_PAREN #overlay | name=(PERCENTILE_CONT | PERCENTILE_DISC) LEFT_PAREN percentage=valueExpression RIGHT_PAREN - WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN ( OVER windowSpec)?#percentile +WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN +(FILTER LEFT_PAREN WHERE where=booleanExpression RIGHT_PAREN)? ( OVER windowSpec)? #percentile Review Comment: what happens if `PERCENTILE_CONT` is used as a window function and `FILTER` is specified?
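With the grammar change above, the aggregate-filter form would read as follows; the syntax is per this PR, so it may not parse on released versions:

```python
spark.sql("""
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v)
             FILTER (WHERE v > 0) AS median_positive
    FROM VALUES (-1), (1), (2), (3) AS t(v)
""").show()
# expected: 2.0, the median of the filtered values 1, 2, 3
```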