[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1010092402 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging { .asInstanceOf[Descriptor] } + // TODO: Revisit to ensure that messageName is searched through all imports def buildDescriptor(descFilePath: String, messageName: String): Descriptor = { -val descriptor = parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc => - desc.getName == messageName || desc.getFullName == messageName +val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor => { Review Comment: @rangadi makes sense to use find and return, fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1010090254 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -217,3 +218,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; Review Comment: Yes let me follow up. I guess I was looking at python side API somehow thus confused myself on the types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`
amaliujia commented on PR #38423: URL: https://github.com/apache/spark/pull/38423#issuecomment-1298056172 Late LGTM thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
amaliujia commented on code in PR #38460: URL: https://github.com/apache/spark/pull/38460#discussion_r1010087722 ## python/pyspark/sql/tests/connect/test_connect_plan_only.py: ## @@ -118,6 +118,15 @@ def test_relation_alias(self): plan = df.alias("table_alias")._plan.to_proto(self.connect) self.assertEqual(plan.root.common.alias, "table_alias") +def test_range(self): +plan = self.connect.range(start=10, end=20, step=3, num_partitions=4)._plan.to_proto( +self.connect +) +self.assertEqual(plan.root.range.start, 10) +self.assertEqual(plan.root.range.end, 20) +self.assertEqual(plan.root.range.step.step, 3) +self.assertEqual(plan.root.range.num_partitions.num_partitions, 4) + Review Comment: I added this test case but only test `step` and `num_partitions` is not set. Right now there is a division between client and server such that: 1. Client take care of required fields, meaning that clients need to make sure the required fields are set. 2. Server side take care of default values for optional fields. This is to reduce load for both sides of implementation: 1. clients do not need to worry about default values for optional fields unless the default value is exposed on the DataFrame API already. 2. Server side do not care for whether required field is set (clients enforce it) but server side tracks the default value for optional fields. This can also avoid that clients side to set different default value. The default values are documented in proto: https://github.com/apache/spark/blob/fb64041cea3e094b0807cb580deacc721b302408/connector/connect/src/main/protobuf/spark/connect/relations.proto#L232 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
zhengruifeng commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1010084761 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -217,3 +218,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; Review Comment: `start`, `end`, `step` should use `int64` @amaliujia -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
amaliujia commented on code in PR #38460: URL: https://github.com/apache/spark/pull/38460#discussion_r1010084257 ## python/pyspark/sql/connect/client.py: ## @@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") -> typing.List[PlanMet def sql(self, sql_string: str) -> "DataFrame": return DataFrame.withPlan(SQL(sql_string), self) +def range( +self, +start: int, +end: int, +step: Optional[int] = None, Review Comment: hmmm we are not marking `step` as required because Scala side implementation does not treat it as required and thus it has also a default value. ``` private def transformRange(rel: proto.Range): LogicalPlan = { val start = rel.getStart val end = rel.getEnd val step = if (rel.hasStep) { rel.getStep.getStep } else { 1 } val numPartitions = if (rel.hasNumPartitions) { rel.getNumPartitions.getNumPartitions } else { session.leafNodeDefaultParallelism } logical.Range(start, end, step, numPartitions) } ``` Same for `numPartitions`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
amaliujia commented on code in PR #38460: URL: https://github.com/apache/spark/pull/38460#discussion_r1010084257 ## python/pyspark/sql/connect/client.py: ## @@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") -> typing.List[PlanMet def sql(self, sql_string: str) -> "DataFrame": return DataFrame.withPlan(SQL(sql_string), self) +def range( +self, +start: int, +end: int, +step: Optional[int] = None, Review Comment: hmmm we are not marking `step` as required because Scala side implementation does not treat it as required and it has a default value. ``` private def transformRange(rel: proto.Range): LogicalPlan = { val start = rel.getStart val end = rel.getEnd val step = if (rel.hasStep) { rel.getStep.getStep } else { 1 } val numPartitions = if (rel.hasNumPartitions) { rel.getNumPartitions.getNumPartitions } else { session.leafNodeDefaultParallelism } logical.Range(start, end, step, numPartitions) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
zhengruifeng commented on code in PR #38460: URL: https://github.com/apache/spark/pull/38460#discussion_r1010078740 ## python/pyspark/sql/connect/client.py: ## @@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") -> typing.List[PlanMet def sql(self, sql_string: str) -> "DataFrame": return DataFrame.withPlan(SQL(sql_string), self) +def range( +self, +start: int, +end: int, +step: Optional[int] = None, Review Comment: i think we can use `step: int = 1` ## python/pyspark/sql/tests/connect/test_connect_plan_only.py: ## @@ -118,6 +118,15 @@ def test_relation_alias(self): plan = df.alias("table_alias")._plan.to_proto(self.connect) self.assertEqual(plan.root.common.alias, "table_alias") +def test_range(self): +plan = self.connect.range(start=10, end=20, step=3, num_partitions=4)._plan.to_proto( +self.connect +) +self.assertEqual(plan.root.range.start, 10) +self.assertEqual(plan.root.range.end, 20) +self.assertEqual(plan.root.range.step.step, 3) +self.assertEqual(plan.root.range.num_partitions.num_partitions, 4) + Review Comment: what about adding a new case like `range(start=10, end=20)` and check: 1, step is set 1 (if we make the default value 1); 2, num_partitions not set; ## python/pyspark/sql/connect/client.py: ## @@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") -> typing.List[PlanMet def sql(self, sql_string: str) -> "DataFrame": return DataFrame.withPlan(SQL(sql_string), self) +def range( +self, +start: int, +end: int, +step: Optional[int] = None, Review Comment: furthermore, i think we can make `step` a required field in proto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: for (batchId <- batchIds if batchId > thresholdBatchId) { val path = batchIdToPath(batchId) fileManager.delete(path) + if (metadataCacheEnabled) batchCache.remove(batchId) logTrace(s"Removed metadata log file: $path") } } + + /** + * List the available batches on file system. As a workaround for S3 inconsistent list, it also + * tries to take `batchCache` into consideration to infer a better answer. + */ + protected def listBatches: Array[Long] = { +val batchIds = fileManager.list(metadataPath, batchFilesFilter) + .map(f => pathToBatchId(f.getPath)) ++ + // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to + // elimiate the race condition. + batchCache.synchronized { +batchCache.keySet.asScala.toArray Review Comment: ~How about `batchCache.keySet().toArray`? ~ ~And `SynchronizedCollection.toArray` seem threadsafe~ ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: for (batchId <- batchIds if batchId > thresholdBatchId) { val path = batchIdToPath(batchId) fileManager.delete(path) + if (metadataCacheEnabled) batchCache.remove(batchId) logTrace(s"Removed metadata log file: $path") } } + + /** + * List the available batches on file system. As a workaround for S3 inconsistent list, it also + * tries to take `batchCache` into consideration to infer a better answer. + */ + protected def listBatches: Array[Long] = { +val batchIds = fileManager.list(metadataPath, batchFilesFilter) + .map(f => pathToBatchId(f.getPath)) ++ + // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to + // elimiate the race condition. + batchCache.synchronized { +batchCache.keySet.asScala.toArray Review Comment: ~How about `batchCache.keySet().toArray`?~ ~And `SynchronizedCollection.toArray` seem threadsafe~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: for (batchId <- batchIds if batchId > thresholdBatchId) { val path = batchIdToPath(batchId) fileManager.delete(path) + if (metadataCacheEnabled) batchCache.remove(batchId) logTrace(s"Removed metadata log file: $path") } } + + /** + * List the available batches on file system. As a workaround for S3 inconsistent list, it also + * tries to take `batchCache` into consideration to infer a better answer. + */ + protected def listBatches: Array[Long] = { +val batchIds = fileManager.list(metadataPath, batchFilesFilter) + .map(f => pathToBatchId(f.getPath)) ++ + // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to + // elimiate the race condition. + batchCache.synchronized { +batchCache.keySet.asScala.toArray Review Comment: How about `batchCache.keySet().toArray`? ~And `SynchronizedCollection.toArray` seem threadsafe~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010073193 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) +val percentiles = statistics.filter(p => p.endsWith("%")).map { p => + try { +p.stripSuffix("%").toDouble / 100.0 + } catch { +case e: NumberFormatException => + throw QueryExecutionErrors.cannotParseStatisticAsPercentileError(p, e) + } +} +require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]") + +statistics.foreach { + case s if supported.contains(s) => + case p if p.endsWith("%") => + case s => throw QueryExecutionErrors.statisticNotRecognizedError(s) +} + } + + override protected def stringArgs: Iterator[Any] = super.stringArgs.take(5) + + override lazy val resolved = false // Summary will be replaced after being resolved. + + final override val nodePatterns: Seq[TreePattern] = Seq(SUMMARY) + + override def output: Seq[Attribute] = { +AttributeReference("summary", StringType)() +: + child.output.flatMap { attr => +if (attr.dataType.isInstanceOf[NumericType] || attr.dataType.isInstanceOf[StringType]) { + Some(AttributeReference(attr.name, StringType)()) +} else None Review Comment: this is a `flatMap`, and `None` will be ignored currently datatypes other than `NumericType ` and `StringType ` will be discarded -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010072313 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) Review Comment: since there are several UT checking the thrown exception, so I let it alone to keep the behavior. I think we can define new error classes in separate PRs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
amaliujia commented on PR #38460: URL: https://github.com/apache/spark/pull/38460#issuecomment-1298026802 R: @HyukjinKwon @zhengruifeng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`
zhengruifeng commented on PR #38423: URL: https://github.com/apache/spark/pull/38423#issuecomment-1298024181 merged into master, thanks @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`
zhengruifeng closed pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions` URL: https://github.com/apache/spark/pull/38423 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`
zhengruifeng commented on PR #38423: URL: https://github.com/apache/spark/pull/38423#issuecomment-1298023324 the 3 force pushes only resolved conflicts in generated files, let me merge it now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
cloud-fan closed pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation URL: https://github.com/apache/spark/pull/38415 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
cloud-fan commented on PR #38415: URL: https://github.com/apache/spark/pull/38415#issuecomment-1298019712 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ben-zhang commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
ben-zhang commented on code in PR #38433: URL: https://github.com/apache/spark/pull/38433#discussion_r1010063647 ## sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala: ## @@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends DataSourceV2SQLSuite with AlterTableT } } + test("REPAIR TABLE omit MSCK") { Review Comment: Sounds good, I can remove this one. I agree that e2e tests are not necessary for a parser change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010063341 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -19,7 +19,9 @@ package org.apache.spark.sql.execution.streaming import java.io._ import java.nio.charset.StandardCharsets +import java.util.{Collections, LinkedHashMap} Review Comment: nit: can rename `LinkedHashMap` to `JLinkedHashMap` or others to make it clearer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010062994 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -64,6 +67,17 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: fileManager.mkdirs(metadataPath) } + protected val metadataCacheEnabled: Boolean + = sparkSession.sessionState.conf.getConf(SQLConf.STREAMING_METADATA_CACHE_ENABLED) + + /** + * Cache the latest two batches. [[StreamExecution]] usually just accesses the latest two batches + * when committing offsets, this cache will save some file system operations. + */ + protected[sql] val batchCache = Collections.synchronizedMap(new LinkedHashMap[Long, T](2) { Review Comment: Could we create `Map` instance only when `metadataCacheEnabled` is true? Maybe we can make it as `Option` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010061925 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: for (batchId <- batchIds if batchId > thresholdBatchId) { val path = batchIdToPath(batchId) fileManager.delete(path) + if (metadataCacheEnabled) batchCache.remove(batchId) logTrace(s"Removed metadata log file: $path") } } + + /** + * List the available batches on file system. As a workaround for S3 inconsistent list, it also + * tries to take `batchCache` into consideration to infer a better answer. + */ + protected def listBatches: Array[Long] = { +val batchIds = fileManager.list(metadataPath, batchFilesFilter) + .map(f => pathToBatchId(f.getPath)) ++ + // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to + // elimiate the race condition. + batchCache.synchronized { +batchCache.keySet.asScala.toArray + } +logInfo("BatchIds found from listing: " + batchIds.sorted.mkString(", ")) + +if (batchIds.isEmpty) { + return Array.empty Review Comment: nit: redundant `return` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
LuciferYang commented on code in PR #38430: URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala: ## @@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: for (batchId <- batchIds if batchId > thresholdBatchId) { val path = batchIdToPath(batchId) fileManager.delete(path) + if (metadataCacheEnabled) batchCache.remove(batchId) logTrace(s"Removed metadata log file: $path") } } + + /** + * List the available batches on file system. As a workaround for S3 inconsistent list, it also + * tries to take `batchCache` into consideration to infer a better answer. + */ + protected def listBatches: Array[Long] = { +val batchIds = fileManager.list(metadataPath, batchFilesFilter) + .map(f => pathToBatchId(f.getPath)) ++ + // Iterate over keySet is not thread safe. We call `toArray` to make a copy in the lock to + // elimiate the race condition. + batchCache.synchronized { +batchCache.keySet.asScala.toArray Review Comment: How about `batchCache.keySet().toArray`? And `SynchronizedCollection.toArray` seem threadsafe -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rangadi commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
rangadi commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1006324273 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -175,19 +175,26 @@ private[sql] object ProtobufUtils extends Logging { .asInstanceOf[Descriptor] } + // TODO: Revisit to ensure that messageName is searched through all imports Review Comment: What is missing? Looks fairly complete to me. Better to state the problem here. ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging { .asInstanceOf[Descriptor] } + // TODO: Revisit to ensure that messageName is searched through all imports def buildDescriptor(descFilePath: String, messageName: String): Descriptor = { -val descriptor = parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc => - desc.getName == messageName || desc.getFullName == messageName +val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor => { + fileDescriptor.getMessageTypes.asScala.find { desc => +desc.getName == messageName || desc.getFullName == messageName + } +}).filter(f => !f.isEmpty) + +if (descriptorList.isEmpty) { + throw QueryCompilationErrors.noProtobufMessageTypeReturnError(messageName) } -descriptor match { +descriptorList.last match { case Some(d) => d case None => -throw new RuntimeException(s"Unable to locate Message '$messageName' in Descriptor") +throw QueryCompilationErrors.unableToLocateProtobufMessageError(messageName) } } - private def parseFileDescriptor(descFilePath: String): Descriptors.FileDescriptor = { + private def parseFileDescriptor(descFilePath: String): List[Descriptors.FileDescriptor] = { Review Comment: Rename to `parseFileDescriptorSet` (otherwise it sounds like it is parsing just one file descriptor). ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -175,19 +175,26 @@ private[sql] object ProtobufUtils extends Logging { .asInstanceOf[Descriptor] } + // TODO: Revisit to ensure that messageName is searched through all imports def buildDescriptor(descFilePath: String, messageName: String): Descriptor = { -val descriptor = parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc => - desc.getName == messageName || desc.getFullName == messageName +val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor => { + fileDescriptor.getMessageTypes.asScala.find { desc => +desc.getName == messageName || desc.getFullName == messageName + } +}).filter(f => !f.isEmpty) + +if (descriptorList.isEmpty) { + throw QueryCompilationErrors.noProtobufMessageTypeReturnError(messageName) } -descriptor match { +descriptorList.last match { Review Comment: Could you add a comment on why we are picking the last one? Will be useful for future readers as well. ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala: ## @@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging { .asInstanceOf[Descriptor] } + // TODO: Revisit to ensure that messageName is searched through all imports def buildDescriptor(descFilePath: String, messageName: String): Descriptor = { -val descriptor = parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc => - desc.getName == messageName || desc.getFullName == messageName +val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor => { Review Comment: Style: use `find()` rather than map().filter(). (you can use `findLast()` if there is a reason to use the last match). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client
amaliujia opened a new pull request, #38460: URL: https://github.com/apache/spark/pull/38460 ### What changes were proposed in this pull request? This PR adds `range` API to Python client's `RemoteSparkSession` with tests. ### Why are the changes needed? Improve API coverage. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #38459: [SPARK-40980][CONNECT] Support session.sql in Connect DSL
amaliujia opened a new pull request, #38459: URL: https://github.com/apache/spark/pull/38459 ### What changes were proposed in this pull request? This PR adds `def sql(sqlText: String)` to DSL and test the SQL proto with SparkSession/DataFrame. ### Why are the changes needed? Improve testing coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 opened a new pull request, #38458: [SPARK-40983][DOC] Remove Hadoop requirements for zstd mentioned in Parquet compression codec
pan3793 opened a new pull request, #38458: URL: https://github.com/apache/spark/pull/38458 ### What changes were proposed in this pull request? Change the doc to remove Hadoop requirements for zstd mentioned in Parquet compression codec. ### Why are the changes needed? This requirement is removed after https://issues.apache.org/jira/browse/PARQUET-1866, and Spark uses Parquet 1.12.3 now. ### Does this PR introduce _any_ user-facing change? Yes, doc updated. ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] clairezhuang commented on pull request #38446: [Spark-40974]When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified
clairezhuang commented on PR #38446: URL: https://github.com/apache/spark/pull/38446#issuecomment-1297983951 Sorry for using the batch-3.3 as the mistake. The issue exists in the master branch, and I have no experience to make file change. I have closed the pull request. You could delete the pull request if it is possible. I would go to jira: https://issues.apache.org/jira/browse/SPARK-40982?jql=project%20%3D%20SPARK . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] clairezhuang closed pull request #38446: [Spark-40974]When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified
clairezhuang closed pull request #38446: [Spark-40974]When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified URL: https://github.com/apache/spark/pull/38446 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] dcoliversun commented on pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url
dcoliversun commented on PR #22: URL: https://github.com/apache/spark-docker/pull/22#issuecomment-1297973951 Thanks for your review @HyukjinKwon @Yikun @zhengruifeng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569] Add smoke test in standalone cluster for spark-docker
Yikun commented on PR #21: URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1297967368 Mind to rebase now? Thanks @dcoliversun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun commented on pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url
Yikun commented on PR #22: URL: https://github.com/apache/spark-docker/pull/22#issuecomment-1297966991 @HyukjinKwon @zhengruifeng @dcoliversun Thanks! Merged . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun closed pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url
Yikun closed pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url URL: https://github.com/apache/spark-docker/pull/22 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JiexingLi commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite
JiexingLi commented on code in PR #38371: URL: https://github.com/apache/spark/pull/38371#discussion_r1010030101 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti submit(finalRdd, Array(0, 1), properties = new Properties()) // Finish the first 2 shuffle map stages. -completeShuffleMapStageSuccessfully(0, 0, 2) +completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB")) assert(mapOutputTracker.findMissingPartitions(shuffleId1) === Some(Seq.empty)) completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostB", "hostD")) assert(mapOutputTracker.findMissingPartitions(shuffleId2) === Some(Seq.empty)) -// Executor lost on hostB, both of stage 0 and 1 should be reran. +// FetchFailed on stage 2, both of stage 1 and 2 should be reran. Besides, executor lost on +// hostB, both of stage 0 and 1 should be reran. Review Comment: Thanks, Mridul. I updated the comment. Your rewrite is very neat and clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JiexingLi commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite
JiexingLi commented on code in PR #38371: URL: https://github.com/apache/spark/pull/38371#discussion_r1010029712 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti submit(finalRdd, Array(0, 1), properties = new Properties()) // Finish the first 2 shuffle map stages. -completeShuffleMapStageSuccessfully(0, 0, 2) +completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB")) Review Comment: Sounds good. Deleted the hostNames param. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #38457: [SPARK-40371][SQL] Migrate type check failures of NthValue and NTile onto error classes
LuciferYang opened a new pull request, #38457: URL: https://github.com/apache/spark/pull/38457 ### What changes were proposed in this pull request? This pr aims to replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions, includes `NthValue` and `NTile` ### Why are the changes needed? Migration onto error classes unifies Spark SQL error messages. ### Does this PR introduce _any_ user-facing change? Yes. The PR changes user-facing error messages. ### How was this patch tested? Pass GitHub Actions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang closed pull request #38456: Spark 40950
LuciferYang closed pull request #38456: Spark 40950 URL: https://github.com/apache/spark/pull/38456 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #38456: Spark 40950
LuciferYang opened a new pull request, #38456: URL: https://github.com/apache/spark/pull/38456 ### What changes were proposed in this pull request? This pr aims to replace TypeCheckFailure by DataTypeMismatch in type checks in window expressions, includes `NthValue` and `NTile` ### Why are the changes needed? Migration onto error classes unifies Spark SQL error messages. ### Does this PR introduce _any_ user-facing change? Yes. The PR changes user-facing error messages. ### How was this patch tested? Pass GitHub Actions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010024429 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY +import org.apache.spark.sql.types._ + +/** + * Resolve StatsFunctions. + */ +object ResolveStatsFunctions extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( +_.containsPattern(SUMMARY), ruleId) { + +case s @ UnresolvedSummary(child, statistics) if s.childrenResolved => + val percentiles = statistics.filter(p => p.endsWith("%")) +.map(p => p.stripSuffix("%").toDouble / 100.0) + + var mapExprs = Seq.empty[NamedExpression] + child.output.foreach { attr => +if (attr.dataType.isInstanceOf[NumericType] || attr.dataType.isInstanceOf[StringType]) { + val name = attr.name + val casted: Expression = attr.dataType match { +case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY) +case _ => attr + } + + val approxPercentile = if (percentiles.nonEmpty) { +Alias( + new ApproximatePercentile( +casted, +Literal(percentiles.toArray), +Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY) + ).toAggregateExpression(), + s"__${name}_approx_percentile__")() + } else null Review Comment: Yeah looks this `val` is passed to `new Get` and there is no defensive check over `null` there. If there is ever a `null` case that appears, it will hit NPE. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010010780 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY +import org.apache.spark.sql.types._ + +/** + * Resolve StatsFunctions. + */ +object ResolveStatsFunctions extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( +_.containsPattern(SUMMARY), ruleId) { + +case s @ UnresolvedSummary(child, statistics) if s.childrenResolved => + val percentiles = statistics.filter(p => p.endsWith("%")) +.map(p => p.stripSuffix("%").toDouble / 100.0) + + var mapExprs = Seq.empty[NamedExpression] + child.output.foreach { attr => +if (attr.dataType.isInstanceOf[NumericType] || attr.dataType.isInstanceOf[StringType]) { + val name = attr.name + val casted: Expression = attr.dataType match { +case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY) +case _ => attr + } + + val approxPercentile = if (percentiles.nonEmpty) { +Alias( + new ApproximatePercentile( +casted, +Literal(percentiles.toArray), +Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY) + ).toAggregateExpression(), + s"__${name}_approx_percentile__")() + } else null Review Comment: I am seeing a potential issue of NPE. Is there a way that we get rid of `null` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #38449: [SPARK-40798][SQL][FOLLOW-UP] Fix ansi mode in v2 ALTER TABLE PARTITION
ulysses-you commented on code in PR #38449: URL: https://github.com/apache/spark/pull/38449#discussion_r1010021095 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala: ## @@ -124,18 +123,4 @@ class AlterTableAddPartitionSuite checkPartitions(t, Map("id" -> "1"), Map("id" -> "2")) } } - - test("SPARK-40798: Alter partition should verify partition value - legacy") { -withNamespaceAndTable("ns", "tbl") { t => - sql(s"CREATE TABLE $t (c int) $defaultUsing PARTITIONED BY (p int)") - - withSQLConf( - SQLConf.SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION.key -> "true", - SQLConf.ANSI_ENABLED.key -> "false") { Review Comment: Before I thought the `storeAssignmentPolicy` is enough so this test is corvered by `AlterTableAddPartitionSuiteBase`. Just change to legacy to restroe previous behavior. Now I add one more `SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION` check and keep this test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #38455: [SPARK-40827][PS][TESTS] Re-enable the DataFrame.corrwith test after fixing in future pandas.
itholic commented on PR #38455: URL: https://github.com/apache/spark/pull/38455#issuecomment-1297940463 Thanks for the review :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
amaliujia commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1010018988 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -213,3 +213,13 @@ message Sample { int64 seed = 1; } } + +// Relation alias. +message SubqueryAlias { + // Required. The input relation. + Relation input = 1; + // Required. The alias. Review Comment: Given our discussion on the protocol, this is a required field so we ask clients to always set it. Server side only fetch the value in this field not matter what it is (either `""`, or not) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client
HyukjinKwon closed pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client URL: https://github.com/apache/spark/pull/38393 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client
HyukjinKwon commented on PR #38393: URL: https://github.com/apache/spark/pull/38393#issuecomment-1297935788 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010014282 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) Review Comment: One example of the compilation error is ``` throw QueryCompilationErrors.windowSpecificationNotDefinedError(windowName)) ``` Maybe there is a need to define a new error class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on a diff in pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes
panbingkun commented on code in PR #38438: URL: https://github.com/apache/spark/pull/38438#discussion_r1010013950 ## sql/core/src/test/java/test/org/apache/spark/sql/JavaColumnExpressionSuite.java: ## @@ -81,10 +81,13 @@ public void isInCollectionCheckExceptionMessage() { Dataset df = spark.createDataFrame(rows, schema); Exception e = Assert.assertThrows(Exception.class, () -> df.filter(df.col("a").isInCollection(Arrays.asList(new Column("b"); -Arrays.asList("cannot resolve", - "due to data type mismatch: Arguments must be same type but were") -.forEach(s -> - Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT) -.contains(s.toLowerCase(Locale.ROOT; +System.out.println(e.getMessage().toLowerCase(Locale.ROOT)); +Arrays.asList( + "datatype_mismatch.data_diff_types", + "cannot resolve \"(a in (b))\"", + "due to data type mismatch: input to `in` should all be the same type, " + +"but it's [\"int\", \"array\"].").forEach(s -> + Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT) +.contains(s.toLowerCase(Locale.ROOT; Review Comment: I will check: 1.Exception Type -> AnalysisException 2.Error classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010010032 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) Review Comment: I am thinking this is easy to handle in this PR: either we return the original plan in `ResolveStatsFunctions` when we find `UnresolvedSummary` has empty `statistics` thus the plan will remain `unresolved` and will fail at last, or we through the compilation exception in `ResolveStatsFunctions` when seeing empty `statistics`. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) Review Comment: I am thinking this is easy to handle in this PR: either we return the original plan in `ResolveStatsFunctions` when we find `UnresolvedSummary` has empty `statistics` thus the plan will remain `unresolved` and will fail at last, or we throw the compilation exception in `ResolveStatsFunctions` when seeing empty `statistics`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] swamirishi commented on pull request #38377: [SPARK-40901][CORE] Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path
swamirishi commented on PR #38377: URL: https://github.com/apache/spark/pull/38377#issuecomment-1297930510 > QQ: Any details for the query [above](https://github.com/apache/spark/pull/38377#pullrequestreview-1153485176) ? Currently all of the logs can be only written to the default filesytem defined by fs.Defaultfs job config. One wouldn't be able to write the logs to other filesystem directly without changing the property as absolute urls are not supported. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on a diff in pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes
panbingkun commented on code in PR #38438: URL: https://github.com/apache/spark/pull/38438#discussion_r1010013499 ## sql/core/src/test/java/test/org/apache/spark/sql/JavaColumnExpressionSuite.java: ## @@ -81,10 +81,13 @@ public void isInCollectionCheckExceptionMessage() { Dataset df = spark.createDataFrame(rows, schema); Exception e = Assert.assertThrows(Exception.class, () -> df.filter(df.col("a").isInCollection(Arrays.asList(new Column("b"); -Arrays.asList("cannot resolve", - "due to data type mismatch: Arguments must be same type but were") -.forEach(s -> - Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT) -.contains(s.toLowerCase(Locale.ROOT; +System.out.println(e.getMessage().toLowerCase(Locale.ROOT)); Review Comment: Sorry, I left the debugging code by mistake. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38451: [SPARK-40976][BUILD] Upgrade sbt to 1.7.3
LuciferYang commented on PR #38451: URL: https://github.com/apache/spark/pull/38451#issuecomment-1297928916 thanks @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
cloud-fan commented on PR #38347: URL: https://github.com/apache/spark/pull/38347#issuecomment-1297927909 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto URL: https://github.com/apache/spark/pull/38347 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1010010032 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) Review Comment: I am thinking this is easy to handle in this PR: either we return the plan in `ResolveStatsFunctions` when we find `UnresolvedSummary` has empty `statistics`, or we through the exception in `ResolveStatsFunctions` when seeing empty `statistics`. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY +import org.apache.spark.sql.types._ + +/** + * Resolve StatsFunctions. + */ +object ResolveStatsFunctions extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( +_.containsPattern(SUMMARY), ruleId) { + +case s @ UnresolvedSummary(child, statistics) if s.childrenResolved => + val percentiles = statistics.filter(p => p.endsWith("%")) +.map(p => p.stripSuffix("%").toDouble / 100.0) + + var mapExprs = Seq.empty[NamedExpression] + child.output.foreach { attr => +if (attr.dataType.isInstanceOf[NumericType] || attr.dataType.isInstanceOf[StringType]) { + val name = attr.name + val casted: Expression = attr.dataType match { +case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY) +case _ => attr + } + + val approxPercentile = if (percentiles.nonEmpty) { +Alias( + new ApproximatePercentile( +casted, +Literal(percentiles.toArray), +Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY) + ).toAggregateExpression(), + s"__${name}_approx_percentile__")() + } else null Review Comment: I am see a potential issue of NPE. Is there a way that we get rid of `null` here? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2101,3 +2101,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class UnresolvedSummary( +child: LogicalPlan, +statistics: Seq[String]) extends UnaryNode { + + private lazy val supported = +Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", "min", "max") + + { +// TODO: throw AnalysisException instead +require(statistics.nonEmpty) +val percentiles = statistics.filter(p => p.endsWith("%")).map { p => + try { +p.stripSuffix("%").toDouble / 100.0 + } catch { +case e: NumberFormatException => + throw QueryExecutionErrors.cannotParseStatisticAsPercentileError(p, e) + } +} +require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]") + +statistics.foreach { + case s if supported.contains(s) => + case p if p.endsWith("%") => + case s => throw QueryExecutionErrors.statisticNotRecognizedError(s) +} + } + + override protected def stringArgs: Iterator[Any] = super.stringArgs.take(5) + + override lazy val resolved = false // Summary will be replaced after being resolved. + + final override val nodePatterns: Seq[TreePattern] = Seq(SUMMARY) + + override def output: Seq[Attribute] = { +
[GitHub] [spark] panbingkun commented on pull request #38439: [SPARK-40890][SQL][TESTS] Check error classes in DataSourceV2SQLSuite
panbingkun commented on PR #38439: URL: https://github.com/apache/spark/pull/38439#issuecomment-1297915406 > It seems this is related to your changes: > > ``` > DataSourceV2SQLSuiteV1Filter.MERGE INTO TABLE > org.scalatest.exceptions.TestFailedException: "_LEGACY_ERROR_TEMP_2309" did not equal null > ``` > > Most likely after [3c967f0](https://github.com/apache/spark/commit/3c967f06e6c37360e53f9c7a5ecff95ae818e713), need to merge/rebase the master. Done, waiting for CI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite
mridulm commented on code in PR #38371: URL: https://github.com/apache/spark/pull/38371#discussion_r1010007062 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti submit(finalRdd, Array(0, 1), properties = new Properties()) // Finish the first 2 shuffle map stages. -completeShuffleMapStageSuccessfully(0, 0, 2) +completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB")) Review Comment: Enhancing documentation of `completeShuffleMapStageSuccessfully` is sufficient - we dont want to explicitly specify `hostNames` for all invocations of the method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38420: [SPARK-40947][SPARK-40966][PS][INFRA][TEST] Upgrade pandas to 1.5.1
zhengruifeng commented on PR #38420: URL: https://github.com/apache/spark/pull/38420#issuecomment-1297914857 late lgtm , thanks all -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic opened a new pull request, #38455: [SPARK-40827][PS][TEST] Re-enable the DataFrame.corrwith test after fixing in future pandas.
itholic opened a new pull request, #38455: URL: https://github.com/apache/spark/pull/38455 ### What changes were proposed in this pull request? This PR proposes to make the manual tests for `DataFrame.corrwith` back into formal approach, if the pandas version is not 1.5.0. ### Why are the changes needed? There was a regression introduced by pandas 1.5.0 (https://github.com/pandas-dev/pandas/issues/48826), and seems it's resolved now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The fixed test should pass the CI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
HyukjinKwon commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1010003793 ## python/pyspark/worker.py: ## @@ -146,7 +146,74 @@ def verify_result_type(result): ) -def wrap_cogrouped_map_pandas_udf(f, return_type, argspec): +def verify_pandas_result(result, return_type, assign_cols_by_name): +import pandas as pd + +if not isinstance(result, pd.DataFrame): +raise TypeError( +"Return type of the user-defined function should be " +"pandas.DataFrame, but is {}".format(type(result)) +) + +# check the schema of the result only if it is not empty or has columns +if not result.empty or len(result.columns) != 0: +# if any column name of the result is a string +# the column names of the result have to match the return type +# see create_array in pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer +field_names = set([field.name for field in return_type.fields]) +column_names = set(result.columns) +if ( +assign_cols_by_name +and any(isinstance(name, str) for name in result.columns) +and column_names != field_names +): +limit = 5 Review Comment: e.g., would have to consider having a config to control this too .. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38377: [SPARK-40901][CORE] Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path
mridulm commented on PR #38377: URL: https://github.com/apache/spark/pull/38377#issuecomment-1297907890 QQ: Any details for the query [above](https://github.com/apache/spark/pull/38377#pullrequestreview-1153485176) ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
HyukjinKwon commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1010003153 ## python/pyspark/worker.py: ## @@ -146,7 +146,74 @@ def verify_result_type(result): ) -def wrap_cogrouped_map_pandas_udf(f, return_type, argspec): +def verify_pandas_result(result, return_type, assign_cols_by_name): +import pandas as pd + +if not isinstance(result, pd.DataFrame): +raise TypeError( +"Return type of the user-defined function should be " +"pandas.DataFrame, but is {}".format(type(result)) +) + +# check the schema of the result only if it is not empty or has columns +if not result.empty or len(result.columns) != 0: +# if any column name of the result is a string +# the column names of the result have to match the return type +# see create_array in pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer +field_names = set([field.name for field in return_type.fields]) +column_names = set(result.columns) +if ( +assign_cols_by_name +and any(isinstance(name, str) for name in result.columns) +and column_names != field_names +): +limit = 5 Review Comment: Hm, I wonder if this abbreviation is too much (and the character limit) ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
HyukjinKwon commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1010002613 ## python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py: ## @@ -165,100 +148,191 @@ def merge_pandas(lft, _): ) def test_apply_in_pandas_not_returning_pandas_dataframe(self): -left = self.data1 -right = self.data2 +self._test_merge_error( +fn=lambda lft, rgt: lft.size + rgt.size, +error_class=PythonException, +error_message_regex="Return type of the user-defined function " +"should be pandas.DataFrame, but is ", +) + +def test_apply_in_pandas_returning_column_names(self): +self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", "k"])) +def test_apply_in_pandas_returning_no_column_names(self): def merge_pandas(lft, rgt): -return lft.size + rgt.size +res = pd.merge(lft, rgt, on=["id", "k"]) +res.columns = range(res.columns.size) +return res -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Return type of the user-defined function should be pandas.DataFrame, " -"but is ", -): -( -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge(fn=merge_pandas) -def test_apply_in_pandas_returning_wrong_number_of_columns(self): -left = self.data1 -right = self.data2 +def test_apply_in_pandas_returning_column_names_sometimes(self): +def merge_pandas(lft, rgt): +res = pd.merge(lft, rgt, on=["id", "k"]) +if 0 in lft["id"] and lft["id"][0] % 2 == 0: +return res +res.columns = range(res.columns.size) +return res + +self._test_merge(fn=merge_pandas) +def test_apply_in_pandas_returning_wrong_column_names(self): def merge_pandas(lft, rgt): if 0 in lft["id"] and lft["id"][0] % 2 == 0: lft["add"] = 0 if 0 in rgt["id"] and rgt["id"][0] % 3 == 0: rgt["more"] = 1 return pd.merge(lft, rgt, on=["id", "k"]) -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Number of columns of the returned pandas.DataFrame " -"doesn't match specified schema. Expected: 4 Actual: 6", -): -( -# merge_pandas returns two columns for even keys while we set schema to four -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge_error( +fn=merge_pandas, +error_class=PythonException, +error_message_regex="Column names of the returned pandas.DataFrame " +"do not match specified schema. Unexpected: add, more Schema: id, k, v, v2\n", +) -def test_apply_in_pandas_returning_empty_dataframe(self): -left = self.data1 -right = self.data2 +# with very large schema, missing and unexpected is limited to 5 +# and the schema is abbreviated in the error message +schema = "id long, k long, mean double, " + ", ".join( +f"column_with_long_column_name_{no} integer" for no in range(35) +) +self._test_merge_error( +fn=lambda lft, rgt: pd.DataFrame( +[ +( +lft.id, +lft.k, +lft.v.mean(), +) ++ tuple(lft.v.mean() for _ in range(7)) +], +columns=["id", "k", "mean"] + [f"extra_column_{no} integer" for no in range(7)], +), +output_schema=schema, +error_class=PythonException, +error_message_regex="Column names of the returned pandas\\.DataFrame " +"do not match specified schema\\. " +"Missing \\(first 5 of 35\\): column_with_long_column_name_0," +" column_with_long_column_name_1, column_with_long_column_name_10," +" column_with_long_column_name_11, column_with_long_column_name_12 " +"Unexpected \\(first 5 of 7\\): extra_column_0 integer, extra_column_1 integer," +" extra_column_2 integer, extra_column_3 integer, extra_column_4 integer " +"Schema: id, k, mean, column_with_long_column_name_0, column_with_long_column_name_1," +"
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
HyukjinKwon commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1010002388 ## python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py: ## @@ -165,100 +148,191 @@ def merge_pandas(lft, _): ) def test_apply_in_pandas_not_returning_pandas_dataframe(self): -left = self.data1 -right = self.data2 +self._test_merge_error( +fn=lambda lft, rgt: lft.size + rgt.size, +error_class=PythonException, +error_message_regex="Return type of the user-defined function " +"should be pandas.DataFrame, but is ", +) + +def test_apply_in_pandas_returning_column_names(self): +self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", "k"])) +def test_apply_in_pandas_returning_no_column_names(self): def merge_pandas(lft, rgt): -return lft.size + rgt.size +res = pd.merge(lft, rgt, on=["id", "k"]) +res.columns = range(res.columns.size) +return res -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Return type of the user-defined function should be pandas.DataFrame, " -"but is ", -): -( -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge(fn=merge_pandas) -def test_apply_in_pandas_returning_wrong_number_of_columns(self): -left = self.data1 -right = self.data2 +def test_apply_in_pandas_returning_column_names_sometimes(self): +def merge_pandas(lft, rgt): +res = pd.merge(lft, rgt, on=["id", "k"]) +if 0 in lft["id"] and lft["id"][0] % 2 == 0: +return res +res.columns = range(res.columns.size) +return res + +self._test_merge(fn=merge_pandas) +def test_apply_in_pandas_returning_wrong_column_names(self): def merge_pandas(lft, rgt): if 0 in lft["id"] and lft["id"][0] % 2 == 0: lft["add"] = 0 if 0 in rgt["id"] and rgt["id"][0] % 3 == 0: rgt["more"] = 1 return pd.merge(lft, rgt, on=["id", "k"]) -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Number of columns of the returned pandas.DataFrame " -"doesn't match specified schema. Expected: 4 Actual: 6", -): -( -# merge_pandas returns two columns for even keys while we set schema to four -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge_error( +fn=merge_pandas, +error_class=PythonException, +error_message_regex="Column names of the returned pandas.DataFrame " +"do not match specified schema. Unexpected: add, more Schema: id, k, v, v2\n", +) -def test_apply_in_pandas_returning_empty_dataframe(self): -left = self.data1 -right = self.data2 +# with very large schema, missing and unexpected is limited to 5 +# and the schema is abbreviated in the error message +schema = "id long, k long, mean double, " + ", ".join( +f"column_with_long_column_name_{no} integer" for no in range(35) +) +self._test_merge_error( +fn=lambda lft, rgt: pd.DataFrame( +[ +( +lft.id, +lft.k, +lft.v.mean(), +) ++ tuple(lft.v.mean() for _ in range(7)) +], +columns=["id", "k", "mean"] + [f"extra_column_{no} integer" for no in range(7)], +), +output_schema=schema, +error_class=PythonException, +error_message_regex="Column names of the returned pandas\\.DataFrame " +"do not match specified schema\\. " +"Missing \\(first 5 of 35\\): column_with_long_column_name_0," +" column_with_long_column_name_1, column_with_long_column_name_10," +" column_with_long_column_name_11, column_with_long_column_name_12 " +"Unexpected \\(first 5 of 7\\): extra_column_0 integer, extra_column_1 integer," +" extra_column_2 integer, extra_column_3 integer, extra_column_4 integer " +"Schema: id, k, mean, column_with_long_column_name_0, column_with_long_column_name_1," +"
[GitHub] [spark] zhengruifeng commented on pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on PR #38395: URL: https://github.com/apache/spark/pull/38395#issuecomment-1297904765 @cloud-fan @HyukjinKwon would you mind taking another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
HyukjinKwon commented on PR #38223: URL: https://github.com/apache/spark/pull/38223#issuecomment-1297904278 Since this touches the core Python worker side, let's make sure to get reviewed from other people like @BryanCutler @viirya @ueshin too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38442: [SPARK-40953][CONNECT][PYTHON][TESTS][FOLLOW-UP] Add a test case for head(n)
zhengruifeng commented on PR #38442: URL: https://github.com/apache/spark/pull/38442#issuecomment-1297902530 late lgtm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
cloud-fan commented on code in PR #38433: URL: https://github.com/apache/spark/pull/38433#discussion_r1009998588 ## sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala: ## @@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends DataSourceV2SQLSuite with AlterTableT } } + test("REPAIR TABLE omit MSCK") { Review Comment: oh this is not parser test. But I think a parser change only needs parser test, no end-to-end test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
cloud-fan commented on code in PR #38433: URL: https://github.com/apache/spark/pull/38433#discussion_r1009998364 ## sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala: ## @@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends DataSourceV2SQLSuite with AlterTableT } } + test("REPAIR TABLE omit MSCK") { Review Comment: can we move the test to `MsckRepairTableParserSuite`? And probably rename the test suite. And also `MsckRepairTableSuiteBase` and sub classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
cloud-fan commented on PR #38433: URL: https://github.com/apache/spark/pull/38433#issuecomment-1297895255 Making `MSCK` optional makes our SQL syntax more flexible (and simpler), and compatible with MySQL. And there seems no down side. @dongjoon-hyun Can you share your concerns? We can discuss more here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #38422: [SPARK-40948][SQL] Introduce new error class: PATH_NOT_FOUND
itholic commented on PR #38422: URL: https://github.com/apache/spark/pull/38422#issuecomment-1297892959 The latest CI failure is not reproduced on my test env. Let me just try to rebase with master branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38449: [SPARK-40798][SQL][FOLLOW-UP] Fix ansi mode in v2 ALTER TABLE PARTITION
HyukjinKwon commented on code in PR #38449: URL: https://github.com/apache/spark/pull/38449#discussion_r1009993046 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala: ## @@ -124,18 +123,4 @@ class AlterTableAddPartitionSuite checkPartitions(t, Map("id" -> "1"), Map("id" -> "2")) } } - - test("SPARK-40798: Alter partition should verify partition value - legacy") { -withNamespaceAndTable("ns", "tbl") { t => - sql(s"CREATE TABLE $t (c int) $defaultUsing PARTITIONED BY (p int)") - - withSQLConf( - SQLConf.SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION.key -> "true", - SQLConf.ANSI_ENABLED.key -> "false") { Review Comment: Shouldn't we maybe keep the test after removing `SQLConf.ANSI_ENABLED.key -> "false"`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
vinodkc commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1009985691 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -58,6 +58,14 @@ SELECT substring('Spark SQL' from 5); SELECT substring('Spark SQL' from -3); SELECT substring('Spark SQL' from 5 for 1); +-- mask function +select mask_hash('TestString-123'), Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #37251: [SPARK-39838][SQL] Preserve explicit empty column metadata
github-actions[bot] closed pull request #37251: [SPARK-39838][SQL] Preserve explicit empty column metadata URL: https://github.com/apache/spark/pull/37251 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
gengliangwang closed pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options URL: https://github.com/apache/spark/pull/38418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
gengliangwang commented on PR #38418: URL: https://github.com/apache/spark/pull/38418#issuecomment-1297829168 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
sadikovi commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-1297826231 Thank you @dongjoon-hyun for merging . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
srielau commented on PR #38433: URL: https://github.com/apache/spark/pull/38433#issuecomment-1297817456 > May I ask if dropping `MSCK` is the official Databricks's direction in order to do the lead, @srielau ? > > > As perhaps the most popular platform, we have the chance to lead here and make things easier. Yes, Databricks will make MSCK a noise word. I will probably remove it from our documented syntax and add a footnote such as "for compatibility with Hive you may also use MSCK REPAIR TABLE". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
dongjoon-hyun commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-1297815756 Thank you so much, @sadikovi , @sunchao , @mridulm . Merged to master for Apache Spark 3.4.0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
dongjoon-hyun closed pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug URL: https://github.com/apache/spark/pull/38277 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
sadikovi commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-1297806098 Thanks for updating the PR title . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
dongjoon-hyun commented on PR #38433: URL: https://github.com/apache/spark/pull/38433#issuecomment-1297804976 May I ask if dropping `MSCK` is the official Databricks's direction in order to do the lead, @srielau ? > As perhaps the most popular platform, we have the chance to lead here and make things easier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
srielau commented on PR #38433: URL: https://github.com/apache/spark/pull/38433#issuecomment-1297801930 > If MySQL is the only DBMS for this syntax, I'm negative about this change because this PR only increases the chance of SQL incompatibility. Is there any other support cases? IMHO mySQL was onto something to not include MSCK (who is the chicken and who the egg here anyway?). It seems like everyone (e.g. IBM) just copied Hive to not rock the boat. Interestingly our own docs omit MSCK from the title (and the file name): https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html It seems the only purpose of this keyword is to trip people up. As perhaps the most popular platform, we have the chance to lead here and make things easier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
dongjoon-hyun commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-1297752636 I revise the PR title to simplify, @sadikovi . You can change it back if you want. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is ena
sadikovi commented on code in PR #38277: URL: https://github.com/apache/spark/pull/38277#discussion_r1009895816 ## sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java: ## @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; + +import org.apache.hadoop.fs.ContentSummary; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.Reporter; + +/** + * Delegate for SymlinkTextInputFormat, created to address SPARK-40815. + * Fixes an issue where SymlinkTextInputFormat returns empty splits which could result in + * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled. + * Review Comment: Updated. ## sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java: ## @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; + +import org.apache.hadoop.fs.ContentSummary; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.LongWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.Reporter; + +/** + * Delegate for SymlinkTextInputFormat, created to address SPARK-40815. + * Fixes an issue where SymlinkTextInputFormat returns empty splits which could result in + * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled. + * + * In this class, we update the split start and length to match the target file input thus fixing + * the issue. + */ +@SuppressWarnings("deprecation") Review Comment: Removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state
AmplabJenkins commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297692337 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk opened a new pull request, #38454: [WIP][SPARK-40978][SQL] Migrate `failAnalysis()` onto error classes
MaxGekk opened a new pull request, #38454: URL: https://github.com/apache/spark/pull/38454 ### What changes were proposed in this pull request? In the PR, I propose to migrate `failAnalysis()` errors without a context onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_24xx`. The error message will not include the error classes, so, in this way we will preserve the existing behaviour. ### Why are the changes needed? The migration on temporary error classes allows to gather statistics about errors and detect most popular error classes. After that we could prioritise the work on migration. The new error class name prefix `_LEGACY_ERROR_TEMP_` proposed here kind of marks the error as developer-facing, not user-facing. Developers can still get the error class programmatically via the `SparkThrowable` interface, so that they can build error infra with it. End users won't see the error class in the message. This allows us to do the error migration very quickly, and we can refine the error classes and mark them as user-facing later (naming them properly, adding tests, etc.). ### Does this PR introduce _any_ user-facing change? No. The error messages should be almost the same by default. ### How was this patch tested? By running the affected test suites: ``` $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state
warrenzhu25 commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297682478 > Could you explain how it can prevent your problem completely? That sounds still greedy and best-effort approach to mitigate your issue. > > > What I mean is something like below, the memory added is constant regardless of cluster size. @dongjoon-hyun > > ``` > > // limit the set size to 100 or 1000 > > val executorsRemovedDueToDecom = new HashSet[String] > > > > executorsPendingDecommission.remove(executorId) > > executorsRemovedDueToDecom.add(executorId) > > ``` FetchFailed caused by decom executor could be divided into 2 categories: 1. When FetchFailed reached DAGScheduler, the executor is still alive or is lost but the lost info hasn't reached TaskSchedulerImpl. This is already handled in [SPARK-40979](https://issues.apache.org/jira/browse/SPARK-40979) 2. FetchFailed is caused by decom executor loss, so the decom info is already removed in TaskSchedulerImpl. If we keep such info in a short period, it is good enough. Even we limit the size of removed executors to 10K, it could be only at most 10MB memory usage. In real case, it's rare to have cluster size of over 10K and the chance that all these executors decomed and lost at the same time would be small. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state
dongjoon-hyun commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297671091 As I mentioned before, this cannot be a followup. I created SPARK-40979 for you because you want to discuss this more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state
dongjoon-hyun commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-129747 Could you explain how it can prevent your problem completely? That sounds still greedy and best-effort approach to mitigate your issue. > What I mean is something like below, the memory added is constant regardless of cluster size. @dongjoon-hyun > > ``` > // limit the set size to 100 or 1000 > val executorsRemovedDueToDecom = new HashSet[String] > > executorsPendingDecommission.remove(executorId) > executorsRemovedDueToDecom.add(executorId) > ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
dtenedor commented on PR #38418: URL: https://github.com/apache/spark/pull/38418#issuecomment-1297666241 @cloud-fan now a different test flaked: ![image](https://user-images.githubusercontent.com/99207096/199107322-edb4aea8-40b3-42f6-b302-c577ae2083c6.png) I am going to re-try it agin... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state
warrenzhu25 commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297661121 > No, I don't think that is a correct way because there exists a large Spark cluster. Please don't try to remove that line. What I mean is something like below, the memory added is constant regardless of cluster size. ``` // limit the set size to 100 or 1000 val executorsRemovedDueToDecom = new HashSet[String] executorsPendingDecommission.remove(executorId) executorsRemovedDueToDecom.add(executorId) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state
dongjoon-hyun commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297654097 No, I don't think that is a correct way because there exists a large Spark cluster. Please don't try to remove this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state
warrenzhu25 commented on PR #38441: URL: https://github.com/apache/spark/pull/38441#issuecomment-1297651992 > If the original contribution [SPARK-40481](https://issues.apache.org/jira/browse/SPARK-40481) is insufficient for you still, we can revert it first before pursuing this kind of big change. We need to revisit from the bottom, @warrenzhu25 . The original PR works for most cases, but some lost executors due to decommission cannot be ignored as we lost such info so I want to improve this. I understand the concern of keeping removed executor. What do you think of keeping only most recent 100 lost executors due to decommission in a separate set? @dongjoon-hyun @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
grundprinzip commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009839486 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -207,3 +208,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; + // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if + // it is set, or 2) spark default parallelism. + NumPartitions num_partitions = 4; Review Comment: So in fewer words :) when `num_partitions` is an integer the default value is `0` even if it's not and for scalar types we can't differentiate between present or not. Understanding if `0` is a valid or invalid value defeats the purpose. Thanks for the additional color! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -207,3 +208,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; + // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if + // it is set, or 2) spark default parallelism. + NumPartitions num_partitions = 4; Review Comment: There are two dimensions of things in this area: 1. Required versus Optional. A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not. 2. Field has default value or not. A field can have a default value if not set. The second point is an addition for the first point. If there is a field which is not set, there could be a default value to be used. There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionality. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -207,3 +208,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; + // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if + // it is set, or 2) spark default parallelism. + NumPartitions num_partitions = 4; Review Comment: There are two dimensions of things in this area: 1. Required versus Optional. A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not. 2. Field has default value or not. A field can have a default value if not set. The second point is an addition for the first point. If there is a field which is not set, there could be a default value to be used. There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionally. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009836536 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -207,3 +208,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; + // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if + // it is set, or 2) spark default parallelism. + NumPartitions num_partitions = 4; Review Comment: To really answer your question: if we plan to respect default values for Spark for those optionally fields whose default proto values are different from Spark default values, this is the only way to respect default values for Spark. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -207,3 +208,23 @@ message Sample { int64 seed = 1; } } + +// Relation of type [[Range]] that generates a sequence of integers. +message Range { + // Optional. Default value = 0 + int32 start = 1; + int32 end = 2; + // Optional. Default value = 1 + Step step = 3; + // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if + // it is set, or 2) spark default parallelism. + NumPartitions num_partitions = 4; Review Comment: There are two dimensions of things in this area: 1. Required versus Optional. A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not. 2. Field has default value or not. A field can have a default value if not set. 2. is an addition for 1. If there is a field which is not set, there could be a default value to be used. There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionally. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
grundprinzip commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009832423 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -175,6 +178,28 @@ package object dsl { } object plans { // scalastyle:ignore +implicit class DslMockRemoteSession(val session: MockRemoteSession) { + def range( + start: Option[Int], + end: Int, + step: Option[Int], + numPartitions: Option[Int]): Relation = { +val range = proto.Range.newBuilder() Review Comment: As long as no catalyst is in this package this is good with me. Thanks for clarifying. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
amaliujia commented on code in PR #38347: URL: https://github.com/apache/spark/pull/38347#discussion_r1009828791 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala: ## @@ -175,6 +178,28 @@ package object dsl { } object plans { // scalastyle:ignore +implicit class DslMockRemoteSession(val session: MockRemoteSession) { + def range( + start: Option[Int], + end: Int, + step: Option[Int], + numPartitions: Option[Int]): Relation = { +val range = proto.Range.newBuilder() Review Comment: It makes sense for `SparkConnectPlanner` where Catalyst and Proto are both mixed together, and we are keeping the approach you are asking there. However this is the Connect DSL that only deal with protos. No Catalyst included in this package: https://github.com/apache/spark/blob/9fc3aa0b1c092ab1f13b26582e3ece7440fbfc3b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala#L17 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org