[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904609109 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala: ## @@ -80,13 +91,26 @@ case class DescribeTableExec( } private def addPartitioning(rows: ArrayBuffer[InternalRow]): Unit = { -rows += emptyRow() -rows += toCatalystRow("# Partitioning", "", "") -if (table.partitioning.isEmpty) { - rows += toCatalystRow("Not partitioned", "", "") -} else { - rows ++= table.partitioning.zipWithIndex.map { -case (transform, index) => toCatalystRow(s"Part $index", transform.describe(), "") +if (table.partitioning.nonEmpty) { + val partitionColumnsOnly = table.partitioning.forall(t => t.isInstanceOf[IdentityTransform]) + if (partitionColumnsOnly) { +rows += toCatalystRow("# Partition Information", "", "") +rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) +rows ++= table.partitioning + .map(_.asInstanceOf[IdentityTransform].ref.fieldNames()) + .flatMap(table.schema.findNestedField(_)) Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36942: [SPARK-39545][SQL] Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
LuciferYang commented on PR #36942: URL: https://github.com/apache/spark/pull/36942#issuecomment-1163977592 thanks all ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
cloud-fan commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904603138 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala: ## @@ -80,13 +91,26 @@ case class DescribeTableExec( } private def addPartitioning(rows: ArrayBuffer[InternalRow]): Unit = { -rows += emptyRow() -rows += toCatalystRow("# Partitioning", "", "") -if (table.partitioning.isEmpty) { - rows += toCatalystRow("Not partitioned", "", "") -} else { - rows ++= table.partitioning.zipWithIndex.map { -case (transform, index) => toCatalystRow(s"Part $index", transform.describe(), "") +if (table.partitioning.nonEmpty) { + val partitionColumnsOnly = table.partitioning.forall(t => t.isInstanceOf[IdentityTransform]) + if (partitionColumnsOnly) { +rows += toCatalystRow("# Partition Information", "", "") +rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) +rows ++= table.partitioning + .map(_.asInstanceOf[IdentityTransform].ref.fieldNames()) + .flatMap(table.schema.findNestedField(_)) Review Comment: Can we do the same here? assert that `table.schema.findNestedField` does not return None. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down
dongjoon-hyun commented on PR #36696: URL: https://github.com/apache/spark/pull/36696#issuecomment-1163974482 No problem~ Thank you for informing that, @huaxingao . Take your time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904601619 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala: ## @@ -80,13 +91,26 @@ case class DescribeTableExec( } private def addPartitioning(rows: ArrayBuffer[InternalRow]): Unit = { -rows += emptyRow() -rows += toCatalystRow("# Partitioning", "", "") -if (table.partitioning.isEmpty) { - rows += toCatalystRow("Not partitioned", "", "") -} else { - rows ++= table.partitioning.zipWithIndex.map { -case (transform, index) => toCatalystRow(s"Part $index", transform.describe(), "") +if (table.partitioning.nonEmpty) { + val partitionColumnsOnly = table.partitioning.forall(t => t.isInstanceOf[IdentityTransform]) + if (partitionColumnsOnly) { +rows += toCatalystRow("# Partition Information", "", "") +rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) +rows ++= table.partitioning + .map(_.asInstanceOf[IdentityTransform].ref.fieldNames()) + .flatMap(table.schema.findNestedField(_)) Review Comment: I have checked that v1 should trigger an assert in the case: 1. describePartitionInfo() https://github.com/apache/spark/blob/0b4739eb2c66ce69ffc16ad05ee0f12fe51d150b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L657 2. table.partitionSchema 3. assert(partitionFields.map(_.name) == partitionColumnNames) https://github.com/apache/spark/blob/8bbbdb5a4667d90b27b202870cd73c9a19b74781/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L263 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down
huaxingao commented on PR #36696: URL: https://github.com/apache/spark/pull/36696#issuecomment-1163962503 @dongjoon-hyun Thanks for the ping. Give me a couple of more days. I want to check one more time before mark this ready for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #36963: [SPARK-39564][SS] Expose the information of catalog table to the logical plan in streaming query
HeartSaVioR commented on code in PR #36963: URL: https://github.com/apache/spark/pull/36963#discussion_r904526435 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala: ## @@ -77,7 +78,11 @@ class IncrementalExecution( */ override lazy val optimizedPlan: LogicalPlan = executePhase(QueryPlanningTracker.OPTIMIZATION) { -sparkSession.sessionState.optimizer.executeAndTrack(withCachedData, +// Performing pre-optimization for streaming specific +val preOptimized = withCachedData.transform { Review Comment: Self-comment: additional comment to explain what it does. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala: ## @@ -877,13 +879,18 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Tre t.copy(properties = Utils.redact(t.properties).toMap, options = Utils.redact(t.options).toMap) :: Nil case table: CatalogTable => - table.storage.serde match { -case Some(serde) => table.identifier :: serde :: Nil -case _ => table.identifier :: Nil - } + stringArgsForCatalogTable(table) + case other => other :: Nil }.mkString(", ") + private def stringArgsForCatalogTable(table: CatalogTable): Seq[Any] = { +table.storage.serde match { + case Some(serde) => table.identifier :: serde :: Nil Review Comment: I see inconsistence on representing table identifier on the plan - while this produces "quoted" string, I also see existing codebase to produce "unquoted" string. e.g. LogicalRelation. I'd like to hear the voice which is our preference. For now I used "unquoted" string for other places and leave this as it is (hence "quoted" string). ## sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala: ## @@ -167,6 +173,17 @@ case class StreamingDataSourceV2Relation( case _ => Statistics(sizeInBytes = conf.defaultSizeInBytes) } + + private val stringArgsVal: Seq[Any] = { +val tableQualifier = (catalog, identifier) match { Review Comment: Printing out `catalog` and `identifier` separately does not seem to give a good output, hence I changed to produce a single output for composite fields `catalog` and `identifier` whenever they are provided altogether in the logical node (only on streaming code path). ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala: ## @@ -868,6 +868,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Tre case null => Nil case None => Nil case Some(null) => Nil +case Some(table: CatalogTable) => Review Comment: This is needed to represent the CatalogTable properly if it is presented. (Before the fix it calls CatalogTable.toString which produces a bunch of information with multiple lines.) I feel it could be something be generalized, but I don't know we have other existing cases to handle. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -576,14 +589,22 @@ class MicroBatchExecution( // Replace sources in the logical plan with data that has arrived since the last batch. val newBatchesPlan = logicalPlan transform { // For v1 sources. - case StreamingExecutionRelation(source, output) => + case StreamingExecutionRelation(source, output, catalogTable) => newData.get(source).map { dataPlan => val hasFileMetadata = output.exists { case FileSourceMetadataAttribute(_) => true case _ => false } val finalDataPlan = dataPlan transformUp { -case l: LogicalRelation if hasFileMetadata => l.withMetadataColumns() +case l: LogicalRelation => Review Comment: Many DSv1 data sources produce LogicalRelation as leaf node, which contains Relation having a field of catalogTable. We fill out the information of the catalogTable in case when the source is not able to provide the information. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala: ## @@ -42,7 +48,11 @@ object StreamingRelation { * It should be used to create [[Source]] and converted to [[StreamingExecutionRelation]] when * passing to [[StreamExecution]] to run a query. */ -case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute]) +case class StreamingRelation( +dataSource: DataSource, +sourceName: String, +output: Seq[Attribute], +catalogTable: Option[CatalogTable]) Review Comment: Self-comment: DataSource has a field `catalogTable`. Leverage it instead of adding a new column. ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSourceV1.scala: ## @@
[GitHub] [spark] cloud-fan closed pull request #36947: [SPARK-39548][SQL] CreateView Command with a window clause query hit a wrong window definition not found issue
cloud-fan closed pull request #36947: [SPARK-39548][SQL] CreateView Command with a window clause query hit a wrong window definition not found issue URL: https://github.com/apache/spark/pull/36947 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36948: [SPARK-39547][SQL] V2SessionCatalog should not throw NoSuchDatabaseException in loadNamspaceMetadata
cloud-fan commented on PR #36948: URL: https://github.com/apache/spark/pull/36948#issuecomment-1163947633 thanks, merging to master/3.3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #36778: [SPARK-39383][SQL] Support DEFAULT columns in ALTER TABLE ALTER COLUMNS to V2 data sources
gengliangwang commented on PR #36778: URL: https://github.com/apache/spark/pull/36778#issuecomment-1163946455 LGTM except for one comment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36948: [SPARK-39547][SQL] V2SessionCatalog should not throw NoSuchDatabaseException in loadNamspaceMetadata
cloud-fan closed pull request #36948: [SPARK-39547][SQL] V2SessionCatalog should not throw NoSuchDatabaseException in loadNamspaceMetadata URL: https://github.com/apache/spark/pull/36948 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36778: [SPARK-39383][SQL] Support DEFAULT columns in ALTER TABLE ALTER COLUMNS to V2 data sources
gengliangwang commented on code in PR #36778: URL: https://github.com/apache/spark/pull/36778#discussion_r904530629 ## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala: ## @@ -203,6 +203,14 @@ private[sql] object CatalogV2Util { }) } +case update: UpdateColumnDefaultValue => + replace(schema, update.fieldNames, field => +if (update.newDefaultValue().nonEmpty) { Review Comment: Oh wait, it seems that will be "\"\"" Then when will the default value get cleared? ## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala: ## @@ -203,6 +203,14 @@ private[sql] object CatalogV2Util { }) } +case update: UpdateColumnDefaultValue => + replace(schema, update.fieldNames, field => +if (update.newDefaultValue().nonEmpty) { Review Comment: Oh wait, it seems that will be `"\"\""` Then when will the default value get cleared? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36778: [SPARK-39383][SQL] Support DEFAULT columns in ALTER TABLE ALTER COLUMNS to V2 data sources
gengliangwang commented on code in PR #36778: URL: https://github.com/apache/spark/pull/36778#discussion_r904530233 ## sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala: ## @@ -203,6 +203,14 @@ private[sql] object CatalogV2Util { }) } +case update: UpdateColumnDefaultValue => + replace(schema, update.fieldNames, field => +if (update.newDefaultValue().nonEmpty) { Review Comment: What if the new default value is an empty string? Do we just clear the default value? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #36885: [WIP][SPARK-39489][CORE] Improve event logging JsonProtocol performance by using Jackson instead of Json4s
mridulm commented on PR #36885: URL: https://github.com/apache/spark/pull/36885#issuecomment-1163944293 +CC @shardulm94, @thejdeep -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #35683: [SPARK-30835][SPARK-39018][CORE][YARN] Add support for YARN decommissioning when ESS is disabled
mridulm commented on PR #35683: URL: https://github.com/apache/spark/pull/35683#issuecomment-1163943816 Catching up on PR's ... this should have been caught by GA hooks, right ? Should have double checked, thanks for flagging this @tgravescs ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36957: [SPARK-39555][PYTHON] Make createTable and listTables in the python side support 3-layer-namespace
cloud-fan commented on code in PR #36957: URL: https://github.com/apache/spark/pull/36957#discussion_r904529034 ## python/pyspark/sql/catalog.py: ## @@ -341,6 +357,9 @@ def createTable( .. versionchanged:: 3.1 Added the ``description`` parameter. + +.. versionchanged:: 3.4 + Made ``tableName`` support 3-layer namespace. Review Comment: to be more general: ``` Allowed ``tableName`` to be qualified with catalog name. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36696: [SPARK-39312][SQL] Use parquet native In predicate for in filter push down
dongjoon-hyun commented on PR #36696: URL: https://github.com/apache/spark/pull/36696#issuecomment-1163943147 What is the next step for us, @huaxingao ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #36941: [SPARK-39543] The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
cloud-fan closed pull request #36941: [SPARK-39543] The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1 URL: https://github.com/apache/spark/pull/36941 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36941: [SPARK-39543] The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
cloud-fan commented on PR #36941: URL: https://github.com/apache/spark/pull/36941#issuecomment-1163932293 thanks, merging to master/3.3/3.2! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR opened a new pull request, #36963: [SPARK-39564][SS] Expose the information of catalog table to the logical plan in streaming query
HeartSaVioR opened a new pull request, #36963: URL: https://github.com/apache/spark/pull/36963 ### What changes were proposed in this pull request? This PR proposes to expose the information of catalog table (V1/V2) to the logical plan in streaming query, specifically, parsed plan and analyzed plan. (We may discard some information in optimized plan.) The major change is to propagate the information of catalog table from the place we resolve the table to the place we execute the query. In MicroBatch execution, we have several transformations on the logical plan which replace the node with another node, hence this PR touches multiple logical nodes which the code path passes through. Specifically for DSv1 sink, we don't have a specific write logical node, hence it's not feasible to expose the information for the destination. This PR introduces `WriteToMicroBatchDataSourceV1` which is DSv1 version of `WriteToMicroBatchDataSource` as a logical node for DSv1 sink. Worth noting that `WriteToMicroBatchDataSourceV1` plays as a marker - we eliminate this node in streaming specific optimization phase. ### Why are the changes needed? This PR give a better UX to end users who use table API for streaming query. Previously it's not easy or even not feasible to check which tables are being read and written from the streaming query. Most likely it requires end users to look into their code/query. ### Does this PR introduce _any_ user-facing change? Yes, in parsed/analyzed plan, we now expose the table information into the read/write logical node. Specifically for DSv1, we introduce a marker write node to expose the information for destination without majorly changing existing logic. > DSv1 read and write >> Before the patch https://user-images.githubusercontent.com/1317309/175210731-dcc4cc4d-a70b-467d-b577-79c20600db32.png;> >> After the patch https://user-images.githubusercontent.com/1317309/175210753-8f3c0b81-6ec5-45df-a2f8-37589960bba2.png;> > DSv2 read and write >> Before the patch https://user-images.githubusercontent.com/1317309/175210780-4a99c670-8a42-4511-959c-cafe0c24bc00.png;> >> After the patch https://user-images.githubusercontent.com/1317309/175210807-2b0b6935-e696-4051-b1a2-725d784d9d5f.png;> ### How was this patch tested? New test cases. Also manually tested via running following query and checked the UI page: > DSv1 read and write ``` /* ./bin/spark-shell --conf "spark.sql.ui.explainMode=extended" */ spark.sql("drop table if exists stream_source") spark.sql("drop table if exists stream_target") spark.sql("create table stream_source (col1 string, col2 int) using parquet") spark.sql("create table stream_target (col1 string, col2 int) using parquet") val checkpointDir = java.nio.file.Files.createTempDirectory("checkpoint-") val q = spark.readStream.table("stream_source").writeStream.format("parquet").option("checkpointLocation", checkpointDir.toString).toTable("stream_target") Thread.sleep(1) spark.sql("insert into stream_source values ('a', 1)") spark.sql("insert into stream_source values ('a', 2)") spark.sql("insert into stream_source values ('a', 3)") q.processAllAvailable() spark.sql("insert into stream_source values ('b', 1)") spark.sql("insert into stream_source values ('b', 2)") spark.sql("insert into stream_source values ('b', 3)") q.processAllAvailable() spark.sql("insert into stream_source values ('c', 1)") spark.sql("insert into stream_source values ('c', 2)") spark.sql("insert into stream_source values ('c', 3)") q.processAllAvailable() q.stop() ``` > DSv2 read and write ``` /* ./bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1\ --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ --conf spark.sql.catalog.local.type=hadoop \ --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \ --conf spark.sql.ui.explainMode=extended */ spark.sql("drop table if exists local.db.stream_target") spark.sql("create table local.db.stream_source (col1 string, col2 int) using iceberg") spark.sql("create table local.db.stream_target (col1 string, col2 int) using iceberg") val checkpointDir = java.nio.file.Files.createTempDirectory("checkpoint-") val q = spark.readStream.table("local.db.stream_source").writeStream.format("iceberg").option("checkpointLocation", checkpointDir.toString).toTable("local.db.stream_target") Thread.sleep(1) spark.sql("insert into local.db.stream_source values ('a', 1)") spark.sql("insert into local.db.stream_source values ('a', 2)") spark.sql("insert into local.db.stream_source values ('a', 3)")
[GitHub] [spark] zhengruifeng commented on a diff in pull request #36870: [SPARK-39483][PYTHON] Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array
zhengruifeng commented on code in PR #36870: URL: https://github.com/apache/spark/pull/36870#discussion_r904503764 ## python/pyspark/sql/session.py: ## @@ -975,6 +976,17 @@ def createDataFrame( # type: ignore[misc] if data.ndim not in [1, 2]: raise ValueError("NumPy array input should be of 1 or 2 dimensions.") column_names = ["value"] if data.ndim == 1 else ["_1", "_2"] +if schema is None and not self._jconf.arrowPySparkEnabled(): +# Construct `schema` from `np.dtype` of the input NumPy array +# TODO: Apply the logic below when self._jconf.arrowPySparkEnabled() is True +spark_type = _from_numpy_type(data.dtype) +if spark_type is not None: + Review Comment: It is nice to follow existing logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #36663: [SPARK-38899][SQL]DS V2 supports push down datetime functions
beliefer commented on code in PR #36663: URL: https://github.com/apache/spark/pull/36663#discussion_r904495438 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.expressions; + +import org.apache.spark.annotation.Evolving; + +import java.io.Serializable; + +/** + * Represent an extract expression, which contains a field to be extracted + * and a source expression where the field should be extracted. + * @since 3.4.0 Review Comment: ```suggestion * @since 3.4.0 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #36960: [SPARK-39557][SQL] Support ARRAY, STRUCT, MAP types as DEFAULT values
AmplabJenkins commented on PR #36960: URL: https://github.com/apache/spark/pull/36960#issuecomment-1163897293 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904487769 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String Review Comment: anyway, let's update the build file to make mima happy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904487223 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,67 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case Project(projectList, child) if projectList.forall(_.deterministic) => + pushDownOffset(child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.limit(m).offset(n)`, try to push down `limit(m).offset(n)`. +// For example, `df.limit(5).offset(3)`, we can push down `limit(5).offset(3)`. +val isPushed = pushDownOffset(newChild, offsetValue) +if (isPushed) { + newChild +} else { + // For `df.limit(m).offset(n)`, only push down `limit(m)`. + // Keep the OFFSET operator if we failed to push down OFFSET to the data source. + offset.withNewChildren(Seq(newChild)) +} + } else { +// Keep the OFFSET operator if we can't remove LIMIT operator. +offset + } +case globalLimit @ OffsetAndLimit(offset, limit, child) => + // For `df.offset(n).limit(m)`, we can push down `limit(m + n)` first. + val (newChild, canRemoveLimit) = pushDownLimit(child, limit + offset) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.offset(n).limit(m)`, try to push down `limit(m + n).offset(n)`. +// For example, `df.offset(3).limit(5)`, we can push down `limit(8).offset(3)`. +val isPushed = pushDownOffset(newChild, offset) +if (isPushed) { + newChild +} else { + // Still keep the OFFSET operator if we can't push it down. + // For example, `df.offset(3).limit(5)`, `limit(8)` has been pushed + // and can be removed, Spark still do `offset(3)`. Review Comment: ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904486692 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,67 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case Project(projectList, child) if projectList.forall(_.deterministic) => + pushDownOffset(child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.limit(m).offset(n)`, try to push down `limit(m).offset(n)`. +// For example, `df.limit(5).offset(3)`, we can push down `limit(5).offset(3)`. +val isPushed = pushDownOffset(newChild, offsetValue) +if (isPushed) { + newChild +} else { + // For `df.limit(m).offset(n)`, only push down `limit(m)`. + // Keep the OFFSET operator if we failed to push down OFFSET to the data source. + offset.withNewChildren(Seq(newChild)) +} + } else { +// Keep the OFFSET operator if we can't remove LIMIT operator. +offset + } +case globalLimit @ OffsetAndLimit(offset, limit, child) => + // For `df.offset(n).limit(m)`, we can push down `limit(m + n)` first. + val (newChild, canRemoveLimit) = pushDownLimit(child, limit + offset) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.offset(n).limit(m)`, try to push down `limit(m + n).offset(n)`. +// For example, `df.offset(3).limit(5)`, we can push down `limit(8).offset(3)`. Review Comment: ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904486458 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,67 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case Project(projectList, child) if projectList.forall(_.deterministic) => + pushDownOffset(child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.limit(m).offset(n)`, try to push down `limit(m).offset(n)`. +// For example, `df.limit(5).offset(3)`, we can push down `limit(5).offset(3)`. +val isPushed = pushDownOffset(newChild, offsetValue) +if (isPushed) { + newChild +} else { + // For `df.limit(m).offset(n)`, only push down `limit(m)`. Review Comment: ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
cloud-fan commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904486283 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,67 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case Project(projectList, child) if projectList.forall(_.deterministic) => + pushDownOffset(child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// Try to push down OFFSET only if the LIMIT operator has been pushed and can be removed. +// For `df.limit(m).offset(n)`, try to push down `limit(m).offset(n)`. +// For example, `df.limit(5).offset(3)`, we can push down `limit(5).offset(3)`. Review Comment: ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compat
HyukjinKwon commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904485439 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String Review Comment: Yeah, acutlaly not sure why it complains. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
cloud-fan commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904482933 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala: ## @@ -80,13 +91,26 @@ case class DescribeTableExec( } private def addPartitioning(rows: ArrayBuffer[InternalRow]): Unit = { -rows += emptyRow() -rows += toCatalystRow("# Partitioning", "", "") -if (table.partitioning.isEmpty) { - rows += toCatalystRow("Not partitioned", "", "") -} else { - rows ++= table.partitioning.zipWithIndex.map { -case (transform, index) => toCatalystRow(s"Part $index", transform.describe(), "") +if (table.partitioning.nonEmpty) { + val partitionColumnsOnly = table.partitioning.forall(t => t.isInstanceOf[IdentityTransform]) + if (partitionColumnsOnly) { +rows += toCatalystRow("# Partition Information", "", "") +rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) +rows ++= table.partitioning + .map(_.asInstanceOf[IdentityTransform].ref.fieldNames()) + .flatMap(table.schema.findNestedField(_)) Review Comment: what's the v1 DESC TABLE behavior for malformed tables? e.g. partition column does not exist in the table schema? do we fail or silently ignore it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
beliefer commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904481912 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,66 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case p: Project => + pushDownOffset(p.child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// If we can remove limit, it indicates data source only have one partition. +// For `dataset.limit(m).offset(n)`, try to push down `limit(m).offset(n)`. +// For example, `dataset.limit(5).offset(3)`, we can push down `limit(5).offset(3)`. +val isPushed = pushDownOffset(newChild, offsetValue) +if (isPushed) { + newChild +} else { + // For `dataset.limit(m).offset(n)`, only push down `limit(m)`. + // Keep the OFFSET operator if we failed to push down OFFSET to the data source. + offset.withNewChildren(Seq(newChild)) +} + } else { +// Keep the OFFSET operator if we can't remove LIMIT operator. +offset + } +case globalLimit @ OffsetAndLimit(offset, limit, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit + offset) Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904481141 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String + + /** + * Sets the current default catalog in this session. + * + * @since 3.4.0 + */ + def setCurrentCatalog(catalogName: String): Unit + + /** + * Returns a list of catalogs in this session. Review Comment: ```suggestion * Returns a list of catalogs available in this session. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904480952 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String + + /** + * Sets the current default catalog in this session. Review Comment: ```suggestion * Sets the current catalog in this session. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904480863 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String Review Comment: Do we need backward compatibility here? This is an interface that Spark implements and users call. Adding a new method shouldn't break anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikf commented on a diff in pull request #36941: [SPARK-39543] The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
Yikf commented on code in PR #36941: URL: https://github.com/apache/spark/pull/36941#discussion_r904480670 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala: ## @@ -531,6 +534,23 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo assert(table.properties === (Map("provider" -> "foo") ++ defaultOwnership).asJava) } + test("SPARK-39543 writeOption should be passed to storage properties when fallback to v1") { +val provider = classOf[InMemoryV1Provider].getName + +withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, provider)) { Review Comment: Yea, Other tests trigger v1 fallback without set `USE_V1_SOURCE_LIST `, AFAIK, - Other tests aim to test the read/write process, and the `InMemoryV1Provider` is actually a v2 format, and we trigger v1 fallback at the `newScanBuilder` & `newWriteBuilder` layer. - This test in PR needs to be fallback to V1 when the table is created, so we need to set `USE_V1_SOURCE_LIST`(see: [isV2Provider](https://github.com/apache/spark/blob/3d68ad8003a16229bd79f86cb31f618167814a7f/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala#L604)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904480555 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. Review Comment: ```suggestion * Returns the current catalog in this session. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
beliefer commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904473795 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala: ## @@ -139,6 +142,25 @@ case class JDBCScanBuilder( false } + override def pushOffset(offset: Int): Boolean = { +if (jdbcOptions.pushDownOffset && !isPartiallyPushed) { + // Spark pushes down LIMIT first, then OFFSET. In SQL statements, OFFSET is applied before + // LIMIT. Here we need to adjust the LIMIT value to match SQL statements. + // 1. For `dataset.limit(m).offset(n)`, try to push down `LIMIT (m - n) OFFSET n`. + //For example, `dataset.limit(5).offset(3)`, we can push down `LIMIT 2 OFFSET 3`. + // 2. For `dataset.offset(n).limit(m)`, try to push down `LIMIT m OFFSET n`. + //For example, `dataset.offset(3).limit(5)`, we can push down `LIMIT 5 OFFSET 3`. + // 3. For `dataset.offset(n)`, try to push down `OFFSET n`. Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #36953: [SPARK-39551][SQL] Add AQE invalid plan check
ulysses-you commented on code in PR #36953: URL: https://github.com/apache/spark/pull/36953#discussion_r904468078 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ValidateSparkPlan.scala: ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.adaptive + +import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BroadcastNestedLoopJoinExec} + +/** + * Detects invalid physical plans generated by AQE replanning and throws `InvalidAQEPlanException` + * if such plans are detected. This rule should be called after EnsureRequirements where all + * necessary Exchange nodes are added. + */ +object ValidateSparkPlan extends Rule[SparkPlan] { + + def apply(plan: SparkPlan): SparkPlan = { +validate(plan) +plan + } + + /** + * Validate that the plan satisfies the following condition: + * - BroadcastQueryStage only appears as the immediate child and the build side of a broadcast + * hash join or broadcast nested loop join. + */ + private def validate(plan: SparkPlan): Unit = plan match { +case b: BroadcastHashJoinExec => + val (buildPlan, probePlan) = b.buildSide match { +case BuildLeft => (b.left, b.right) +case BuildRight => (b.right, b.left) + } + if (!buildPlan.isInstanceOf[BroadcastQueryStageExec]) { Review Comment: oh, got it. It checks the `BroadcastQueryStageExec` can only appear at the direct child of broadcast join but not check the direct child must be the `BroadcastQueryStageExec`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #36954: [WIP][SPARK-39519][BUILD] Add `-XX:NewRatio=4` to test javaOptions to improve the test stability of `sql/core` module.
LuciferYang commented on PR #36954: URL: https://github.com/apache/spark/pull/36954#issuecomment-1163859874 > Thank you for investigation, @LuciferYang . Could you add several empty commits to see the flakiness? ok -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
cloud-fan commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904460638 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala: ## @@ -26,6 +26,22 @@ import org.apache.spark.sql.catalyst.DefinedByConstructorParams // Note: all classes here are expected to be wrapped in Datasets and so must extend // DefinedByConstructorParams for the catalog to be able to create encoders for them. +/** + * A catalog in Spark, as returned by the `listCatalogs` method defined in [[Catalog]]. + * + * @param name name of the catalog + * @since 3.2.0 + */ +class CatalogMetadata( +val name: String) Review Comment: ah I see! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #36295: [SPARK-38978][SQL] DS V2 supports push down OFFSET operator
beliefer commented on code in PR #36295: URL: https://github.com/apache/spark/pull/36295#discussion_r904460364 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: ## @@ -407,7 +407,66 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit case other => (other, false) } - def pushDownLimits(plan: LogicalPlan): LogicalPlan = plan.transform { + private def pushDownOffset( + plan: LogicalPlan, + offset: Int): Boolean = plan match { +case sHolder: ScanBuilderHolder => + val isPushed = PushDownUtils.pushOffset(sHolder.builder, offset) + if (isPushed) { +sHolder.pushedOffset = Some(offset) + } + isPushed +case p: Project => + pushDownOffset(p.child, offset) +case _ => false + } + + def pushDownLimitAndOffset(plan: LogicalPlan): LogicalPlan = plan.transform { +case offset @ LimitAndOffset(limit, offsetValue, child) => + val (newChild, canRemoveLimit) = pushDownLimit(child, limit) + if (canRemoveLimit) { +// If we can remove limit, it indicates data source only have one partition. Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36957: [SPARK-39555][PYTHON] Make createTable and listTables in the python side support 3-layer-namespace
HyukjinKwon commented on PR #36957: URL: https://github.com/apache/spark/pull/36957#issuecomment-1163848082 That's sort of annoying problem yeah. We should build Spark w/ `sbt test:package`, and then run the tests. Should probably have a logic like https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/test_dataframe.py#L1204-L1238. Another option is to move the test codes into somewhere within the main code with `private[sql]`. the classes or methods with `private[sql]` can be accessed via Py4J (because that syntax is specific to Scala, and Java class files don't understand them). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #36957: [SPARK-39555][PYTHON] Make createTable and listTables in the python side support 3-layer-namespace
zhengruifeng commented on PR #36957: URL: https://github.com/apache/spark/pull/36957#issuecomment-1163845192 @HyukjinKwon @cloud-fan @amaliujia @huaxingao Please take a look when you are free. I encountered a problem that the python side also need `InMemoryCatalog` if we want to add UT for 3-layer-namespace, but `InMemoryCatalog` is only for `test` and not accessable from the python side. Is there some approach to work around? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'
HyukjinKwon commented on PR #36784: URL: https://github.com/apache/spark/pull/36784#issuecomment-1163842938 @xiuzhu9527 mind rebasing your fork to sync with the latest master branch in Apache Spark? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compat
HyukjinKwon commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904453265 ## sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala: ## @@ -743,4 +722,30 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf val catalogName2 = "catalog_not_exists" assert(!spark.catalog.databaseExists(Array(catalogName2, dbName).mkString("."))) } + + test("three layer namespace compatibility - cache table, isCached and uncacheTable") { Review Comment: I would add a JIRA prefix here in the test title. e.g., `"SPARK-39506: three ...` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compat
HyukjinKwon commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904453265 ## sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala: ## @@ -743,4 +722,30 @@ class CatalogSuite extends SharedSparkSession with AnalysisTest with BeforeAndAf val catalogName2 = "catalog_not_exists" assert(!spark.catalog.databaseExists(Array(catalogName2, dbName).mkString("."))) } + + test("three layer namespace compatibility - cache table, isCached and uncacheTable") { Review Comment: I would add a JIRA prefix here in the test title. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compat
HyukjinKwon commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904452971 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala: ## @@ -26,6 +26,24 @@ import org.apache.spark.sql.catalyst.DefinedByConstructorParams // Note: all classes here are expected to be wrapped in Datasets and so must extend // DefinedByConstructorParams for the catalog to be able to create encoders for them. +/** + * A catalog in Spark, as returned by the `listCatalogs` method defined in [[Catalog]]. + * + * @param name name of the catalog + * @since 3.4.0 + */ +class CatalogMetadata( Review Comment: Would have to be a case class. ## sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala: ## @@ -26,6 +26,24 @@ import org.apache.spark.sql.catalyst.DefinedByConstructorParams // Note: all classes here are expected to be wrapped in Datasets and so must extend // DefinedByConstructorParams for the catalog to be able to create encoders for them. +/** + * A catalog in Spark, as returned by the `listCatalogs` method defined in [[Catalog]]. + * + * @param name name of the catalog Review Comment: `description` is missing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compat
HyukjinKwon commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904452603 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala: ## @@ -589,4 +589,25 @@ abstract class Catalog { * @since 2.0.0 */ def refreshByPath(path: String): Unit + + /** + * Returns the current default catalog in this session. + * + * @since 3.4.0 + */ + def currentCatalog(): String Review Comment: Would need to 1. have a default implementation for backward compatibility, or 2. exclude this from binary compatibility check. The real error is: ``` [error] spark-sql: Failed binary compatibility check against org.apache.spark:spark-sql_2.12:3.2.0! Found 3 potential problems (filtered 602) [error] * abstract method currentCatalog()java.lang.String in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.currentCatalog") [error] * abstract method setCurrentCatalog(java.lang.String)Unit in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.setCurrentCatalog") [error] * abstract method listCatalogs()org.apache.spark.sql.Dataset in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.listCatalogs") ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatible
HyukjinKwon commented on PR #36904: URL: https://github.com/apache/spark/pull/36904#issuecomment-1163839514 Adding @zhengruifeng FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
HyukjinKwon commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r904450412 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -529,6 +529,15 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { """.stripMargin) } + def inferDateWithLegacyTimeParserError(): Throwable = { +new IllegalArgumentException( Review Comment: For user-facing errors, it should inherit `SparkThrowable`. cc @gengliangwang @MaxGekk @cloud-fan fyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
HyukjinKwon commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r904450171 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -195,6 +195,19 @@ class CSVOptions( */ val enforceSchema = getBool("enforceSchema", default = true) + /** + * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type). + * Disabled by default for backwards compatibility and performance. When enabled, date entries in + * timestamp columns will be cast to timestamp upon parsing. Not compatible with + * legacyTimeParserPolicy == LEGACY + */ + val inferDate = { +val inferDateFlag = getBool("inferDate") +if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) { + throw QueryExecutionErrors.inferDateWithLegacyTimeParserError() Review Comment: Yeah, this is nice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
HyukjinKwon commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r904449724 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -117,8 +123,19 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) -case TimestampNTZType => tryParseTimestampNTZ(field) -case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType => Review Comment: Maybe we can fix it as below: ```suggestion case TimestampNTZType options.inferDate => tryParseDateTime(field) case TimestampNTZType => tryParseTimestampNTZ(field) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #36942: [SPARK-39545][SQL] Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
cloud-fan commented on PR #36942: URL: https://github.com/apache/spark/pull/36942#issuecomment-1163820993 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #36848: [SPARK-39449][SQL] Propagate empty relation through Window
wangyum commented on PR #36848: URL: https://github.com/apache/spark/pull/36848#issuecomment-1163816151 cc @sigmod @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36962: [SPARK-39563][CORE][TESTS] Use `localHostNameForURI` in `UISuite`
dongjoon-hyun closed pull request #36962: [SPARK-39563][CORE][TESTS] Use `localHostNameForURI` in `UISuite` URL: https://github.com/apache/spark/pull/36962 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sigmod commented on a diff in pull request #36909: [SPARK-39511][SQL] Push down limit 1 for right side of left semi/anti join if join condition is empty
sigmod commented on code in PR #36909: URL: https://github.com/apache/spark/pull/36909#discussion_r904429256 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -753,6 +753,9 @@ object LimitPushDown extends Rule[LogicalPlan] { // Merge offset value and limit value into LocalLimit and pushes down LocalLimit through Offset. case LocalLimit(le, Offset(oe, grandChild)) => Offset(oe, LocalLimit(Add(le, oe), grandChild)) +// Push down local limit 1 if join type is LeftSemiOrAnti and join condition is empty. +case j @ Join(_, right, LeftSemiOrAnti(_), None, _) if !right.maxRows.exists(_ <= 1) => Review Comment: Gotcha, make sense. Thank you, @wangyum ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36961: [SPARK-39562][SQL][TESTS] Make `hive-thriftserver` module unit tests pass in IPv6 env
dongjoon-hyun closed pull request #36961: [SPARK-39562][SQL][TESTS] Make `hive-thriftserver` module unit tests pass in IPv6 env URL: https://github.com/apache/spark/pull/36961 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36961: [SPARK-39562][SQL][TESTS] Make `hive-thriftserver` module unit tests pass in IPv6 env
dongjoon-hyun commented on PR #36961: URL: https://github.com/apache/spark/pull/36961#issuecomment-1163807315 Thank you, @HyukjinKwon and @wangyum . Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] williamhyun opened a new pull request, #36962: [SPARK-39563][CORE][TESTS] Use localHostNameForURI in UISuite
williamhyun opened a new pull request, #36962: URL: https://github.com/apache/spark/pull/36962 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #36951: [SPARK-38614][SQL] Don't push down limit through window that's using percent_rank
wangyum commented on PR #36951: URL: https://github.com/apache/spark/pull/36951#issuecomment-1163806447 Merged to master, branch-3.3 and branch-3.2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #36926: [SPARK-39533][ML] Deprecate scoreLabelsWeight in BinaryClassificationMetrics
zhengruifeng commented on PR #36926: URL: https://github.com/apache/spark/pull/36926#issuecomment-1163802602 Thank you @srowen @huaxingao ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36961: [SPARK-39562][SQL][TESTS] Make `hive-thriftserver` module unit tests pass in IPv6 env
dongjoon-hyun commented on PR #36961: URL: https://github.com/apache/spark/pull/36961#issuecomment-1163800429 Thank you, @wangyum ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum closed pull request #36951: [SPARK-38614][SQL] Don't push down limit through window that's using percent_rank
wangyum closed pull request #36951: [SPARK-38614][SQL] Don't push down limit through window that's using percent_rank URL: https://github.com/apache/spark/pull/36951 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #36909: [SPARK-39511][SQL] Push down limit 1 for right side of left semi/anti join if join condition is empty
wangyum commented on code in PR #36909: URL: https://github.com/apache/spark/pull/36909#discussion_r904407888 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -753,6 +753,9 @@ object LimitPushDown extends Rule[LogicalPlan] { // Merge offset value and limit value into LocalLimit and pushes down LocalLimit through Offset. case LocalLimit(le, Offset(oe, grandChild)) => Offset(oe, LocalLimit(Add(le, oe), grandChild)) +// Push down local limit 1 if join type is LeftSemiOrAnti and join condition is empty. +case j @ Join(_, right, LeftSemiOrAnti(_), None, _) if !right.maxRows.exists(_ <= 1) => Review Comment: ```scala import org.apache.spark.sql.catalyst.plans.logical.Join val df = spark.sql("select * from range(5) t1 left semi join (select * from range(10) where id < 0) t2") println(df.queryExecution.optimizedPlan.asInstanceOf[Join].right.maxRows) ``` The `right.maxRows` is `Some(10)` but actually rows is 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jonathancui123 commented on pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on PR #36871: URL: https://github.com/apache/spark/pull/36871#issuecomment-1163777137 I added a new `QueryExecutionError` called `inferDateWithLegacyTimeParserError` that is thrown when `inferDate=true` and SQL Configuration `LegacyTimeParserPolicy` is `LEGACY`. This prevents the legacy parser from being used with schema inference for date and fixes the failing test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Jonathancui123 commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference
Jonathancui123 commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r904402104 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -110,15 +116,43 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { def inferField(typeSoFar: DataType, field: String): DataType = { if (field == null || field.isEmpty || field == options.nullValue) { typeSoFar +} else +if (options.inferDate) { + val typeElemInfer = typeSoFar match { +case NullType => tryParseInteger(field) +case IntegerType => tryParseInteger(field) +case LongType => tryParseLong(field) +case _: DecimalType => tryParseDecimal(field) +case DoubleType => tryParseDouble(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType => tryParseDateTime(field) +case TimestampType => tryParseDateTime(field) +case BooleanType => tryParseBoolean(field) +case StringType => StringType +case other: DataType => + throw QueryExecutionErrors.dataTypeUnexpectedError(other) + } + compatibleType(typeSoFar, typeElemInfer).getOrElse(StringType) } else { val typeElemInfer = typeSoFar match { case NullType => tryParseInteger(field) case IntegerType => tryParseInteger(field) case LongType => tryParseLong(field) case _: DecimalType => tryParseDecimal(field) case DoubleType => tryParseDouble(field) -case TimestampNTZType => tryParseTimestampNTZ(field) -case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDateTime(field) +case TimestampNTZType => + if (options.inferDate) { Review Comment: My bad, I meant to remove the first match expression but missed it during code cleanup -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36951: [SPARK-38614][SQL] Don't push down limit through window that's using percent_rank
HyukjinKwon commented on PR #36951: URL: https://github.com/apache/spark/pull/36951#issuecomment-1163774647 cc @hvanhovell FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #36930: [SPARK-39538][SQL] Avoid creating unnecessary SLF4J Logger
srowen commented on PR #36930: URL: https://github.com/apache/spark/pull/36930#issuecomment-1163770578 Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #36930: [SPARK-39538][SQL] Avoid creating unnecessary SLF4J Logger
srowen closed pull request #36930: [SPARK-39538][SQL] Avoid creating unnecessary SLF4J Logger URL: https://github.com/apache/spark/pull/36930 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36961: [SPARK-39562][SQL][TESTS] Make hive-thrift server module passes in IPv6 env
dongjoon-hyun opened a new pull request, #36961: URL: https://github.com/apache/spark/pull/36961 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen closed pull request #36942: [SPARK-39545][SQL] Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
srowen closed pull request #36942: [SPARK-39545][SQL] Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance URL: https://github.com/apache/spark/pull/36942 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #36942: [SPARK-39545][SQL] Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
srowen commented on PR #36942: URL: https://github.com/apache/spark/pull/36942#issuecomment-1163770025 Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a diff in pull request #35906: [SPARK-33236][shuffle] Enable Push-based shuffle service to store state in NM level DB for work preserving restart
otterc commented on code in PR #35906: URL: https://github.com/apache/spark/pull/35906#discussion_r903254647 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -993,6 +1311,42 @@ AppShufflePartitionInfo getPartitionInfo() { } } + /** + * Simply encodes an application attempt ID. Review Comment: Nit: Remove `Simply` ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -1021,12 +1375,55 @@ public boolean isFinalized() { } } + /** + * Simply encodes an application attempt shuffle merge ID. Review Comment: Nit: Remove `Simply` ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -632,6 +737,14 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { appsShuffleInfo.compute(appId, (id, appShuffleInfo) -> { if (appShuffleInfo == null || attemptId > appShuffleInfo.attemptId) { originalAppShuffleInfo.set(appShuffleInfo); + AppPathsInfo appPathsInfo = new AppPathsInfo(appId, executorInfo.localDirs, + mergeDir, executorInfo.subDirsPerLocalDir); + // Clean up the outdated App Attempt local path info in the DB and + // put the newly registered local path info from newer attempt into the DB. + if (appShuffleInfo != null) { +removeAppAttemptPathInfoFromDB(new AppAttemptId(appId, appShuffleInfo.attemptId)); + } + writeAppPathsInfoToDb(appId, attemptId, appPathsInfo); Review Comment: > If we have to guarantee the success of DB removal, but it fails, should we fail the new app attempt Executor registration here? I feel the later one is a little bit overkill. WDYT? I don't think it is an overkill. We can have one method that does something like this: ``` writeAppsPathInfoAndDeleteOlder(olderAttempt, newAttempt) { try{ deleteOldAttempt writeNewAttempt } catch (IOException) { } ``` If delete fails, write will never be executed. ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -656,6 +771,206 @@ public void registerExecutor(String appId, ExecutorShuffleInfo executorInfo) { } } + /** + * Close the DB during shutdown + */ + @Override + public void close() { +if (db != null) { + try { +db.close(); + } catch (IOException e) { +logger.error("Exception closing leveldb with registered app paths info and " ++ "shuffle partition info", e); + } +} + } + + /** + * Write the application attempt's local path information to the DB + */ + private void writeAppPathsInfoToDb(String appId, int attemptId, AppPathsInfo appPathsInfo) { +if (db != null) { + try { +byte[] key = getDbAppAttemptPathsKey(new AppAttemptId(appId, attemptId)); +String valueStr = mapper.writeValueAsString(appPathsInfo); +byte[] value = valueStr.getBytes(StandardCharsets.UTF_8); +db.put(key, value); + } catch (Exception e) { +logger.error("Error saving registered app paths info", e); + } +} + } + + /** + * Write the finalized shuffle merge partition information into the DB + */ + private void writeAppAttemptShuffleMergeInfoToDB( + AppAttemptShuffleMergeId appAttemptShuffleMergeId) { +if (db != null) { + // Write AppAttemptShuffleMergeId into LevelDB for finalized shuffles + try{ +byte[] dbKey = getDbAppAttemptShufflePartitionKey(appAttemptShuffleMergeId); +db.put(dbKey, new byte[0]); + } catch (Exception e) { +logger.error("Error saving active app shuffle partition", e); + } +} + + } + + /** + * Parse the DB key with the prefix and the expected return value type + */ + private T parseDbKey(String key, String prefix, Class valueType) throws IOException { +String json = key.substring(prefix.length() + 1); +return mapper.readValue(json, valueType); + } + + /** + * Generate AppAttemptId from the DB key + */ + private AppAttemptId parseDbAppAttemptPathsKey(String key) throws IOException { +return parseDbKey(key, APP_ATTEMPT_PATH_KEY_PREFIX, AppAttemptId.class); + } + + /** + * Generate AppAttemptShuffleMergeId from the DB key + */ + private AppAttemptShuffleMergeId parseDbAppAttemptShufflePartitionKey( + String key) throws IOException { +return parseDbKey( +key, APP_ATTEMPT_SHUFFLE_FINALIZE_STATUS_KEY_PREFIX, AppAttemptShuffleMergeId.class); + } + + /** + * Generate the DB key with the key object and the specified string prefix + */ + private byte[] getDbKey(Object key, String prefix) { +// We add a common prefix on
[GitHub] [spark] dtenedor commented on pull request #36960: [SPARK-39557][SQL] Support ARRAY, STRUCT, MAP types as DEFAULT values
dtenedor commented on PR #36960: URL: https://github.com/apache/spark/pull/36960#issuecomment-1163679026 Hi @gengliangwang this PR adds support for DEFAULT values with array/map/struct type. It should make it possible to use this feature for more varieties of columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor opened a new pull request, #36960: [SPARK-39557][SQL] Support ARRAY, STRUCT, MAP types as DEFAULT values
dtenedor opened a new pull request, #36960: URL: https://github.com/apache/spark/pull/36960 ### What changes were proposed in this pull request? Support ARRAY, STRUCT, MAP types as DEFAULT values. Previously these types were not supported and DDL commands that attempted to use them returned error messages. Now they work wherever DEFAULT values are supported. ### Why are the changes needed? This new functionality expands the usefulness of DEFAULT column values and Spark SQL in general. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR adds test both positive and negative unit test coverage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI
dongjoon-hyun closed pull request #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI URL: https://github.com/apache/spark/pull/36958 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI
dongjoon-hyun commented on PR #36958: URL: https://github.com/apache/spark/pull/36958#issuecomment-1163658132 Thank you so much, @viirya . Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI
dongjoon-hyun commented on PR #36958: URL: https://github.com/apache/spark/pull/36958#issuecomment-1163657925 All tests passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
dongjoon-hyun closed pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses` URL: https://github.com/apache/spark/pull/36959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
dongjoon-hyun commented on PR #36959: URL: https://github.com/apache/spark/pull/36959#issuecomment-1163656859 Thank you so much, @huaxingao ! Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
huaxingao commented on PR #36959: URL: https://github.com/apache/spark/pull/36959#issuecomment-1163641151 LGTM. Thanks for pinging me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI
dongjoon-hyun commented on PR #36958: URL: https://github.com/apache/spark/pull/36958#issuecomment-1163628594 Could you review this when you have some time, @viirya ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #36940: [SPARK-39529][INFRA] Refactor and merge all related job selection logic into precondition
HyukjinKwon commented on PR #36940: URL: https://github.com/apache/spark/pull/36940#issuecomment-1163598207 Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36947: [SPARK-39548][SQL] CreateView Command with a window clause query hit a wrong window definition not found issue
amaliujia commented on code in PR #36947: URL: https://github.com/apache/spark/pull/36947#discussion_r904198563 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -4449,6 +4449,28 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark """.stripMargin), Seq(Row(2), Row(1))) } + + test("CreateView will make query go into inline CTE code path") { Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
dongjoon-hyun commented on PR #36959: URL: https://github.com/apache/spark/pull/36959#issuecomment-1163557969 Could you review this please, @huaxingao ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
dongjoon-hyun commented on code in PR #36959: URL: https://github.com/apache/spark/pull/36959#discussion_r904186486 ## core/src/main/scala/org/apache/spark/SparkContext.scala: ## @@ -407,6 +407,7 @@ class SparkContext(config: SparkConf) extends Logging { SparkContext.fillMissingMagicCommitterConfsIfNeeded(_conf) SparkContext.supplementJavaModuleOptions(_conf) +SparkContext.supplementJavaIPv6Options(_conf) Review Comment: This works in the same way with `SparkContext.supplementJavaModuleOptions` in the above line. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36959: [SPARK-39561][CORE] Improve `SparkContext` to propagate`java.net.preferIPv6Addresses`
dongjoon-hyun opened a new pull request, #36959: URL: https://github.com/apache/spark/pull/36959 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #36958: [SPARK-39559][CORE][WEBUI] Support IPv6 in WebUI
dongjoon-hyun opened a new pull request, #36958: URL: https://github.com/apache/spark/pull/36958 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] singhpk234 commented on a diff in pull request #36948: [SPARK-39547][SQL] V2SessionCatalog should not throw NoSuchDatabaseException in loadNamspaceMetadata
singhpk234 commented on code in PR #36948: URL: https://github.com/apache/spark/pull/36948#discussion_r904129674 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -244,7 +244,7 @@ class V2SessionCatalog(catalog: SessionCatalog) override def loadNamespaceMetadata(namespace: Array[String]): util.Map[String, String] = { namespace match { - case Array(db) => + case Array(db) if catalog.databaseExists(db) => catalog.getDatabaseMetadata(db).toMetadata Review Comment: makes sense, we can certainly save a call in case db exists, made the changes, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #36943: [SPARK-39195][SQL][FOLLOWUP] Remove flaky test of OutputCommitCoordinator
dongjoon-hyun commented on PR #36943: URL: https://github.com/apache/spark/pull/36943#issuecomment-1163490433 Thank you, @AngersZh , @HyukjinKwon , @cloud-fan . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatible
amaliujia commented on PR #36904: URL: https://github.com/apache/spark/pull/36904#issuecomment-1163477190 Also I am seeing this ``` [info] spark-streaming-kinesis-asl: mimaPreviousArtifacts not set, not analyzing binary compatibility [error] spark-sql: Failed binary compatibility check against org.apache.spark:spark-sql_2.12:3.2.0! Found 3 potential problems (filtered 602) [error] * abstract method currentCatalog()java.lang.String in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.currentCatalog") [error] * abstract method setCurrentCatalog(java.lang.String)Unit in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.setCurrentCatalog") [error] * abstract method listCatalogs()org.apache.spark.sql.Dataset in class org.apache.spark.sql.catalog.Catalog is present only in current version [error]filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.listCatalogs") ``` Any idea how to make the new API pass such API compatibility check? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #36904: [SPARK-39506][SQL] Make CacheTable, isCached, UncacheTable, setCurrentCatalog, currentCatalog, listCatalogs 3l namespace compatib
amaliujia commented on code in PR #36904: URL: https://github.com/apache/spark/pull/36904#discussion_r904112358 ## sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala: ## @@ -26,6 +26,22 @@ import org.apache.spark.sql.catalyst.DefinedByConstructorParams // Note: all classes here are expected to be wrapped in Datasets and so must extend // DefinedByConstructorParams for the catalog to be able to create encoders for them. +/** + * A catalog in Spark, as returned by the `listCatalogs` method defined in [[Catalog]]. + * + * @param name name of the catalog + * @since 3.2.0 + */ +class CatalogMetadata( +val name: String) Review Comment: Sounds good for keeping the class and additionally adding a description field. Regrading renaming to `Catalog`, it will have a naming conflict with https://github.com/apache/spark/blob/59eee98024dac42309f2e7196c7e68832317f284/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L33 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904103207 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala: ## @@ -74,11 +54,11 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuit QueryTest.checkAnswer( descriptionDf, Seq( - Row("id", "bigint", ""), - Row("data", "string", ""), - Row("", "", ""), - Row("# Partitioning", "", ""), - Row("Part 0", "id", ""), + Row("id", "bigint", null), + Row("data", "string", null), + Row("# Partition Information", "", ""), + Row("# col_name", "data_type", "comment"), + Row("id", "bigint", null), Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904072394 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala: ## @@ -80,13 +81,25 @@ case class DescribeTableExec( } private def addPartitioning(rows: ArrayBuffer[InternalRow]): Unit = { -rows += emptyRow() -rows += toCatalystRow("# Partitioning", "", "") -if (table.partitioning.isEmpty) { - rows += toCatalystRow("Not partitioned", "", "") -} else { - rows ++= table.partitioning.zipWithIndex.map { -case (transform, index) => toCatalystRow(s"Part $index", transform.describe(), "") +if (table.partitioning.nonEmpty) { + val partitionColumnsOnly = table.partitioning.forall(t => t.isInstanceOf[IdentityTransform]) + if (partitionColumnsOnly) { +rows += toCatalystRow("# Partition Information", "", "") +rows += toCatalystRow(s"# ${output(0).name}", output(1).name, output(2).name) +val nameToField = table.schema.map(f => (f.name, f)).toMap +rows ++= table.partitioning + .map(_.asInstanceOf[IdentityTransform]) + .flatMap(_.ref.fieldNames()) Review Comment: I added a test for v2 implementation: partitioning by nested columns. Just in case, v1 doesn't support partitioning by nested columns. Also I fixed v2 impl to pass the new test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #36784: [SPARK-39396][SQL] Fix LDAP login exception 'error code 49 - invalid credentials'
srowen commented on PR #36784: URL: https://github.com/apache/spark/pull/36784#issuecomment-1163429380 Sorry, see "Testing with GitHub actions workflow" under https://spark.apache.org/developer-tools.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
cloud-fan commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904039714 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala: ## @@ -74,11 +54,11 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuit QueryTest.checkAnswer( descriptionDf, Seq( - Row("id", "bigint", ""), - Row("data", "string", ""), - Row("", "", ""), - Row("# Partitioning", "", ""), - Row("Part 0", "id", ""), + Row("id", "bigint", null), + Row("data", "string", null), + Row("# Partition Information", "", ""), + Row("# col_name", "data_type", "comment"), + Row("id", "bigint", null), Review Comment: I see, can we at least include `Table Type` in v2 command? It's simply checking if the table has a reserved `EXTERNAL` table property. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904016227 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala: ## @@ -74,11 +54,11 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuit QueryTest.checkAnswer( descriptionDf, Seq( - Row("id", "bigint", ""), - Row("data", "string", ""), - Row("", "", ""), - Row("# Partitioning", "", ""), - Row("Part 0", "id", ""), + Row("id", "bigint", null), + Row("data", "string", null), + Row("# Partition Information", "", ""), + Row("# col_name", "data_type", "comment"), + Row("id", "bigint", null), Review Comment: v1 (hive): ``` ++--+---+ |col_name|data_type |comment| ++--+---+ |data|string |null | |id |bigint |null | |# Partition Information | | | |# col_name |data_type |comment| |id |bigint |null | || | | |# Detailed Table Information| | | |Database|ns | | |Table |table | | |Owner |maximgekk | | |Created Time|Wed Jun 22 09:39:42 PDT 2022 | | |Last Access |UNKNOWN | | |Created By |Spark 3.4.0-SNAPSHOT | | |Type|EXTERNAL | | |Provider|hive | | |Comment |this is a test table | | |Table Properties|[transient_lastDdlTime=1655915982] | | |Location|file:/tmp/testcat/table_name | | |Serde Library |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe| | |InputFormat |org.apache.hadoop.mapred.TextInputFormat | | |OutputFormat |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat| | |Storage Properties |[serialization.format=1] | | |Partition Provider |Catalog | | ++--+---+ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sigmod commented on a diff in pull request #36909: [SPARK-39511][SQL] Push down limit 1 for right side of left semi/anti join if join condition is empty
sigmod commented on code in PR #36909: URL: https://github.com/apache/spark/pull/36909#discussion_r904002034 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -753,6 +753,9 @@ object LimitPushDown extends Rule[LogicalPlan] { // Merge offset value and limit value into LocalLimit and pushes down LocalLimit through Offset. case LocalLimit(le, Offset(oe, grandChild)) => Offset(oe, LocalLimit(Add(le, oe), grandChild)) +// Push down local limit 1 if join type is LeftSemiOrAnti and join condition is empty. +case j @ Join(_, right, LeftSemiOrAnti(_), None, _) if !right.maxRows.exists(_ <= 1) => Review Comment: > when right.maxRows > 0. It can not guarantee the result must not be empty. - For left semi with empty condition, it doesn't matter? Non empty `right` means all `left` rows are join output rows? - For left anti with empty condition, it doesn't matter neither? Non empty `right` means all `left` rows are killed and join doesn't output any row? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #36946: [SPARK-39552][SQL] Unify v1 and v2 `DESCRIBE TABLE`
MaxGekk commented on code in PR #36946: URL: https://github.com/apache/spark/pull/36946#discussion_r904008641 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala: ## @@ -74,11 +54,11 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase with CommandSuit QueryTest.checkAnswer( descriptionDf, Seq( - Row("id", "bigint", ""), - Row("data", "string", ""), - Row("", "", ""), - Row("# Partitioning", "", ""), - Row("Part 0", "id", ""), + Row("id", "bigint", null), + Row("data", "string", null), + Row("# Partition Information", "", ""), + Row("# col_name", "data_type", "comment"), + Row("id", "bigint", null), Review Comment: v2 (after the PR): ``` +++---+ |col_name|data_type |comment | +++---+ |id |bigint |null | |data|string |null | |# Partition Information || | |# col_name |data_type |comment | |id |bigint |null | ||| | |# Metadata Columns || | |index |int |Metadata column used to conflict with a data column| |_partition |string |Partition key used to store the row| ||| | |# Detailed Table Information|| | |Name|test_catalog.ns.table | | |Comment |this is a test table| | |Location|file:/tmp/testcat/table_name| | |Provider|_ | | |Owner |maximgekk | | |Table Properties|[bar=baz] | | +++---+ ``` v1 in memory: ``` +++---+ |col_name|data_type |comment| +++---+ |data|string |null | |id |bigint |null | |# Partition Information || | |# col_name |data_type |comment| |id |bigint |null | ||| | |# Detailed Table Information|| | |Database|ns | | |Table |table | | |Created Time|Wed Jun 22 09:37:48 PDT 2022| | |Last Access |UNKNOWN | | |Created By |Spark 3.4.0-SNAPSHOT| | |Type|EXTERNAL| | |Provider|parquet | | |Comment |this is a test table| | |Table Properties|[bar=baz] | | |Location|file:/tmp/testcat/table_name| | |Partition Provider |Catalog | | +++---+ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to
[GitHub] [spark] sigmod commented on a diff in pull request #36909: [SPARK-39511][SQL] Push down limit 1 for right side of left semi/anti join if join condition is empty
sigmod commented on code in PR #36909: URL: https://github.com/apache/spark/pull/36909#discussion_r904002034 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -753,6 +753,9 @@ object LimitPushDown extends Rule[LogicalPlan] { // Merge offset value and limit value into LocalLimit and pushes down LocalLimit through Offset. case LocalLimit(le, Offset(oe, grandChild)) => Offset(oe, LocalLimit(Add(le, oe), grandChild)) +// Push down local limit 1 if join type is LeftSemiOrAnti and join condition is empty. +case j @ Join(_, right, LeftSemiOrAnti(_), None, _) if !right.maxRows.exists(_ <= 1) => Review Comment: > when right.maxRows > 0. It can not guarantee the result must not be empty. - For left semi with empty condition, it doesn't matter? Non empty right means all left rows are join output rows? - For left anti with empty condition, it doesn't matter neither? Non empty right means all left rows are killed and join doesn't output any row? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org