[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r891452022 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2881,6 +2881,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val DEFAULT_COLUMN_ALLOWED_PROVIDERS = +buildConf("spark.sql.defaultColumn.allowedProviders") + .internal() + .doc("List of table providers wherein SQL commands are permitted to assign DEFAULT column " + +"values. Comma-separated list, whitespace ignored, case-insensitive.") + .version("3.4.0") + .stringConf + .createWithDefault("csv,json,orc,parquet") Review Comment: @cloud-fan @gengliangwang this is a good point. I was thinking the default value of this conf contains the four data sources that we actually have support for scanning the default values. The primary purpose is to serve as a mechanism for banning default values with the other data sources, users are not expected to have to change this. As Gengliang mentions, it could be a possible escape hatch if we discover a bug in one data source later, and it's also useful for testing. Maybe we can extend the description of this conf to mention that in the normal/expected case, users don't have to change anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r890579762 ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala: ## @@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType} * * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`. */ -class ResolveSessionCatalog(val catalogManager: CatalogManager) +class ResolveSessionCatalog(val analyzer: Analyzer) extends Rule[LogicalPlan] with LookupCatalog { import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._ import org.apache.spark.sql.connector.catalog.CatalogV2Util._ import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._ + override val catalogManager = analyzer.catalogManager + override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp { -case AddColumns(ResolvedV1TableIdentifier(ident), cols) => +case AddColumns( + ResolvedTable(catalog, ident, v1Table: V1Table, _), cols) +if isSessionCatalog(catalog) => cols.foreach { c => assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand") if (!c.nullable) { throw QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError } } - AlterTableAddColumnsCommand(ident.asTableIdentifier, cols.map(convertToStructField)) + val prevSchema = StructType(cols.map(convertToStructField)) + val newSchema: StructType = +DefaultCols.constantFoldCurrentDefaultsToExistDefaults( Review Comment: Thanks for pointing this out, we don't need to duplicate it again. Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r890579364 ## sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala: ## @@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder( new ResolveSQLOnFile(session) +: new FallBackFileSourceV2(session) +: ResolveEncodersInScalaAgg +: -new ResolveSessionCatalog(catalogManager) +: +new ResolveSessionCatalog(this) +: Review Comment: Good point, this is not needed, we can revert this part. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r889272134 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala: ## @@ -122,7 +122,7 @@ abstract class SessionCatalogSuite extends AnalysisTest with Eventually { } test("create table with default columns") { -withBasicCatalog { catalog => +if (!isHiveExternalCatalog) withBasicCatalog { catalog => Review Comment: Good question. This test is part of the base class `SessionCatalogSuite`. It fails when running with the `HiveExternalSessionCatalogSuite` subclass, since Hive tables are not included in the allowlist of providers that allow DEFAULT columns. To simplify this, I edited the config to add Hive tables to the list of allowed providers and this test passes now in all cases. ## sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala: ## @@ -41,21 +41,28 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType} * * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`. */ -class ResolveSessionCatalog(val catalogManager: CatalogManager) +class ResolveSessionCatalog(val analyzer: Analyzer) extends Rule[LogicalPlan] with LookupCatalog { + val catalogManager = analyzer.catalogManager Review Comment: SG, done. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala: ## @@ -83,16 +83,26 @@ object ResolveDefaultColumns { * * @param analyzer used for analyzing the result of parsing the expression stored as text. * @param tableSchema represents the names and types of the columns of the statement to process. + * @param tableProvider provider of the target table to store default values for, if any. * @param statementType name of the statement being processed, such as INSERT; useful for errors. * @return a copy of `tableSchema` with field metadata updated with the constant-folded values. */ def constantFoldCurrentDefaultsToExistDefaults( analyzer: Analyzer, tableSchema: StructType, + tableProvider: Option[String], statementType: String): StructType = { if (SQLConf.get.enableDefaultColumns) { val newFields: Seq[StructField] = tableSchema.fields.map { field => if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) { + // Make sure that the target table has a provider that supports default column values. + val allowedProviders: Array[String] = +SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS) Review Comment: Good idea, done! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r888505739 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -427,6 +428,7 @@ class SessionCatalog( tableDefinition.copy(identifier = tableIdentifier) } + ResolveDefaultColumns.checkDataSourceSupportsDefaultColumns(tableDefinition) Review Comment: This is not supported currently, the feature is SQL-only as of now. References to columns named "default" will simply return "column not found" errors until then. We can certainly consider adding this feature to the DataFrame API and PySpark later, I would be interested in helping with this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
dtenedor commented on code in PR #36745: URL: https://github.com/apache/spark/pull/36745#discussion_r888505435 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala: ## @@ -231,4 +232,18 @@ object ResolveDefaultColumns { } } } + + def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = { +if (table.schema.fields.map(_.metadata).exists { m => + m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) || +m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY) +}) { + table.provider.getOrElse("").toLowerCase() match { +case "csv" | "json" | "parquet" | "orc" => Review Comment: Good ideas! I made this a configuration option. It does support DataSource V2, I added a couple more tests to show that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org