[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-07 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r891452022


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2881,6 +2881,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val DEFAULT_COLUMN_ALLOWED_PROVIDERS =
+buildConf("spark.sql.defaultColumn.allowedProviders")
+  .internal()
+  .doc("List of table providers wherein SQL commands are permitted to 
assign DEFAULT column " +
+"values. Comma-separated list, whitespace ignored, case-insensitive.")
+  .version("3.4.0")
+  .stringConf
+  .createWithDefault("csv,json,orc,parquet")

Review Comment:
   @cloud-fan @gengliangwang this is a good point. The default value of this conf lists the four data sources whose scans actually support DEFAULT values. Its primary purpose is to serve as a mechanism for blocking DEFAULT values with other data sources; users are not expected to have to change it.
   
   As Gengliang mentions, it could also serve as an escape hatch if we later discover a bug in one of these data sources, and it is useful for testing. Maybe we can extend the description of this conf to mention that in the normal/expected case, users don't have to change anything.
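   To illustrate the intended semantics of the conf value (a sketch only, not the exact code in the PR; the helper name is made up):
   
   // Hypothetical helper showing how a comma-separated, whitespace-ignoring,
   // case-insensitive allowlist like "csv,json,orc,parquet" could be consulted.
   def providerSupportsDefaults(provider: String, allowedProviders: String): Boolean = {
     val allowed: Set[String] =
       allowedProviders.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet
     allowed.contains(provider.trim.toLowerCase)
   }
   
   providerSupportsDefaults("PARQUET", "csv, json, orc, parquet")  // true
   providerSupportsDefaults("hive", "csv,json,orc,parquet")        // false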






[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579762


##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##
@@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
 
+  override val catalogManager = analyzer.catalogManager
+
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-case AddColumns(ResolvedV1TableIdentifier(ident), cols) =>
+case AddColumns(
+  ResolvedTable(catalog, ident, v1Table: V1Table, _), cols)
+if isSessionCatalog(catalog) =>
   cols.foreach { c =>
 assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand")
 if (!c.nullable) {
   throw QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError
 }
   }
-  AlterTableAddColumnsCommand(ident.asTableIdentifier, cols.map(convertToStructField))
+  val prevSchema = StructType(cols.map(convertToStructField))
+  val newSchema: StructType =
+DefaultCols.constantFoldCurrentDefaultsToExistDefaults(

Review Comment:
   Thanks for pointing this out; we don't need to duplicate it here. Updated.






[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-06 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579364


##
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##
@@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder(
 new ResolveSQLOnFile(session) +:
 new FallBackFileSourceV2(session) +:
 ResolveEncodersInScalaAgg +:
-new ResolveSessionCatalog(catalogManager) +:
+new ResolveSessionCatalog(this) +:

Review Comment:
   Good point; this is not needed, so we can revert this part.






[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-03 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r889272134


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala:
##
@@ -122,7 +122,7 @@ abstract class SessionCatalogSuite extends AnalysisTest with Eventually {
   }
 
   test("create table with default columns") {
-withBasicCatalog { catalog =>
+if (!isHiveExternalCatalog) withBasicCatalog { catalog =>

Review Comment:
   Good question. This test is part of the base class `SessionCatalogSuite`. It fails when run through the `HiveExternalSessionCatalogSuite` subclass, because Hive tables are not included in the allowlist of providers that permit DEFAULT columns. To simplify this, I edited the config to add Hive tables to the list of allowed providers, so the test now passes in all cases.
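   One way to express that in the shared test itself (a sketch only, assuming `withSQLConf` from the common test helpers is in scope; the actual fix may instead adjust the conf's default value):
   
   test("create table with default columns") {
     // Temporarily add "hive" to the allowlist so the Hive external catalog
     // subclass also permits DEFAULT columns while this shared test runs.
     withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key ->
         "csv,json,orc,parquet,hive") {
       withBasicCatalog { catalog =>
         // ... existing assertions on the created table's default values ...
       }
     }
   }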



##
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##
@@ -41,21 +41,28 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
+  val catalogManager = analyzer.catalogManager

Review Comment:
   Sounds good, done.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##
@@ -83,16 +83,26 @@ object ResolveDefaultColumns {
*
* @param analyzer  used for analyzing the result of parsing the expression stored as text.
* @param tableSchema   represents the names and types of the columns of the statement to process.
+   * @param tableProvider provider of the target table to store default values for, if any.
* @param statementType name of the statement being processed, such as INSERT; useful for errors.
* @return a copy of `tableSchema` with field metadata updated with the constant-folded values.
*/
   def constantFoldCurrentDefaultsToExistDefaults(
   analyzer: Analyzer,
   tableSchema: StructType,
+  tableProvider: Option[String],
   statementType: String): StructType = {
 if (SQLConf.get.enableDefaultColumns) {
   val newFields: Seq[StructField] = tableSchema.fields.map { field =>
 if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) {
+  // Make sure that the target table has a provider that supports default column values.
+  val allowedProviders: Array[String] =
+SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)

Review Comment:
   Good idea, done!






[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-02 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888505739


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##
@@ -427,6 +428,7 @@ class SessionCatalog(
   tableDefinition.copy(identifier = tableIdentifier)
 }
 
+ResolveDefaultColumns.checkDataSourceSupportsDefaultColumns(tableDefinition)

Review Comment:
   This is not supported currently; the feature is SQL-only for now. Until then, references to columns named "default" will simply return "column not found" errors. We can certainly consider adding this feature to the DataFrame API and PySpark later; I would be interested in helping with that.
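   To illustrate what "SQL-only" means here (a sketch, assuming an active `spark` session and a table provider on the allowlist):
   
   // DEFAULT column values are assigned and consumed through SQL statements;
   // there is no DataFrame/PySpark writer API for them yet.
   spark.sql("CREATE TABLE t (id INT, status STRING DEFAULT 'active') USING parquet")
   spark.sql("INSERT INTO t VALUES (1, DEFAULT)")  // 'active' is substituted for DEFAULT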






[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

2022-06-02 Thread GitBox


dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888505435


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##
@@ -231,4 +232,18 @@ object ResolveDefaultColumns {
   }
 }
   }
+
+  def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = {
+if (table.schema.fields.map(_.metadata).exists { m =>
+  m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) ||
+m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+}) {
+  table.provider.getOrElse("").toLowerCase() match {
+case "csv" | "json" | "parquet" | "orc" =>

Review Comment:
   Good ideas! I made this a configuration option. It does support DataSource V2; I added a couple more tests to show that.
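   Roughly, the check now reads the allowlist from the new conf instead of hard-coding the providers. A sketch of the shape of that change (error construction simplified, details may differ from the final diff; it reuses the imports already in ResolveDefaultColumnsUtil.scala plus org.apache.spark.sql.AnalysisException):
   
   def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = {
     val hasDefaults = table.schema.fields.map(_.metadata).exists { m =>
       m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) ||
         m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
     }
     if (hasDefaults) {
       // Comma-separated, whitespace ignored, case-insensitive (see the new conf above).
       val allowedProviders: Set[String] = SQLConf.get
         .getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)
         .split(",").map(_.trim.toLowerCase).toSet
       val provider = table.provider.getOrElse("").toLowerCase()
       if (!allowedProviders.contains(provider)) {
         throw new AnalysisException(
           s"Table provider '$provider' does not support DEFAULT column values")
       }
     }
   }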


