[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366189514 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema + + // Mapping of field indexes of `parsedSchema` to indexes of `requiredSchema`. + // It returns -1 if `requiredSchema` doesn't contain a field from `parsedSchema`. Review comment: If `requiredSchema` always contains filter references, it is significant assumption, and it can simplify this implementation slightly. Is it just specific of current implementation or `requiredSchema` could contain **really required** column in the future? because filter columns are not actually required if the filter is applied only once in a datasource. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574045415 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
AmplabJenkins removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-574045429 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116675/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
AmplabJenkins commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-574045429 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116675/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
AmplabJenkins commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-574045420 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574045422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21468/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574045422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21468/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574045415 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
AmplabJenkins removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-574045420 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
SparkQA commented on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-574044885 **[Test build #116675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116675/testReport)** for PR 26805 at commit [`3de6491`](https://github.com/apache/spark/commit/3de6491597f2d1b8a011df707d643fba5b14fce8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
SparkQA commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574044887 **[Test build #116689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116689/testReport)** for PR 27198 at commit [`cc658c9`](https://github.com/apache/spark/commit/cc658c99b4633fe49b14990b80be672b11f62ced). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
SparkQA removed a comment on issue #26805: [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions URL: https://github.com/apache/spark/pull/26805#issuecomment-573980935 **[Test build #116675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116675/testReport)** for PR 26805 at commit [`3de6491`](https://github.com/apache/spark/commit/3de6491597f2d1b8a011df707d643fba5b14fce8). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on issue #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
MaxGekk commented on issue #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource URL: https://github.com/apache/spark/pull/26102#issuecomment-574044689 @cloud-fan Should I close this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
AmplabJenkins removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-574044182 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
AmplabJenkins commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-574044191 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116674/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
AmplabJenkins removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-574044191 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116674/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
AmplabJenkins commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-574044182 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
SparkQA commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-574043654 **[Test build #116674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116674/testReport)** for PR 26868 at commit [`ffb0b29`](https://github.com/apache/spark/commit/ffb0b2984a74f173c42723c3ae567668667e3fd7). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` implicit class PartitionTypeHelper(colNames: Seq[String]) ` * `trait SimpleTableProvider extends TableProvider ` * `class NoopDataSource extends SimpleTableProvider with DataSourceRegister ` * `class RateStreamProvider extends SimpleTableProvider with DataSourceRegister ` * `class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging ` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
SparkQA removed a comment on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface URL: https://github.com/apache/spark/pull/26868#issuecomment-573980923 **[Test build #116674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116674/testReport)** for PR 26868 at commit [`ffb0b29`](https://github.com/apache/spark/commit/ffb0b2984a74f173c42723c3ae567668667e3fd7). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574040718 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
SparkQA removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037384 **[Test build #116687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116687/testReport)** for PR 27194 at commit [`bfbaf2b`](https://github.com/apache/spark/commit/bfbaf2b4eb17cfc0457a8c710003fc59138bace5). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574040733 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116687/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
SparkQA commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574040611 **[Test build #116687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116687/testReport)** for PR 27194 at commit [`bfbaf2b`](https://github.com/apache/spark/commit/bfbaf2b4eb17cfc0457a8c710003fc59138bace5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574040733 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116687/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574040718 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
beliefer commented on a change in pull request #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#discussion_r366183077 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ## @@ -149,10 +149,24 @@ case class AggregateExpression( case PartialMerge => "merge_" case Final | Complete => "" } -prefix + aggregateFunction.toAggString(isDistinct) +val aggFuncStr = prefix + aggregateFunction.toAggString(isDistinct) +mode match { + case Partial | Complete if filter.isDefined => Review comment: I mean the mode is `PartialMerge`. If we only need to show filter in physical plans after rewrite, this is OK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mickjermsurawong-stripe commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
mickjermsurawong-stripe commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-574038632 Thanks @mt40! Also added more tests on filtering on dataframe c23705d to illustrate the expected schemas more explicitly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mickjermsurawong-stripe commented on a change in pull request #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
mickjermsurawong-stripe commented on a change in pull request #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#discussion_r366181241 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ## @@ -679,6 +679,33 @@ class DataFrameSuite extends QueryTest with SharedSparkSession { assert(df.schema.map(_.name) === Seq("key", "valueRenamed", "newCol")) } + test("Value class filter") { +val df = spark.sparkContext + .parallelize(Seq(StringWrapper("a"), StringWrapper("b"), StringWrapper("c"))) + .toDF() +val filtered = df.where("s = \"a\"") +checkAnswer(filtered, spark.sparkContext.parallelize(Seq(StringWrapper("a"))).toDF) + } + + test("Array value class filter") { +val ab = ArrayStringWrapper(Seq(StringWrapper("a"), StringWrapper("b"))) +val cd = ArrayStringWrapper(Seq(StringWrapper("c"), StringWrapper("d"))) + +val df = spark.sparkContext.parallelize(Seq(ab, cd)).toDF +val filtered = df.where(array_contains(col("wrappers.s"), "b")) +checkAnswer(filtered, spark.sparkContext.parallelize(Seq(ab)).toDF) + } + + test("Nested value class filter") { +val a = ContainerStringWrapper(StringWrapper("a")) +val b = ContainerStringWrapper(StringWrapper("b")) + +val df = spark.sparkContext.parallelize(Seq(a, b)).toDF +// flat value class, `s` field is not in schema +val filtered = df.where("wrapper = \"a\"") +checkAnswer(filtered, spark.sparkContext.parallelize(Seq(a)).toDF) Review comment: Before this change we never support nested value class: - Filter with `wrapper` would break with ``` org.apache.spark.sql.AnalysisException: cannot resolve '(`wrapper` = 'a')' due to data type mismatch: differing types in '(`wrapper` = 'a')' (struct and string).; line 1 pos 0; ``` - Filter with `wrapper.s` would break with: ``` java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.test.SQLTestData$StringWrapper ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
AmplabJenkins removed a comment on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-574037786 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21467/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037807 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21466/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037801 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
AmplabJenkins commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-57403 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
AmplabJenkins removed a comment on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-57403 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
AmplabJenkins commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-574037786 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21467/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037807 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21466/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
AmplabJenkins removed a comment on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037801 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection)
SparkQA commented on issue #27153: [SPARK-20384][SQL] Support value class in schema of Dataset (without breaking existing current projection) URL: https://github.com/apache/spark/pull/27153#issuecomment-574037400 **[Test build #116688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116688/testReport)** for PR 27153 at commit [`c23705d`](https://github.com/apache/spark/commit/c23705d9715efa16d73491191143c9f802bdff51). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
SparkQA commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574037384 **[Test build #116687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116687/testReport)** for PR 27194 at commit [`bfbaf2b`](https://github.com/apache/spark/commit/bfbaf2b4eb17cfc0457a8c710003fc59138bace5). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366180228 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema + + // Mapping of field indexes of `parsedSchema` to indexes of `requiredSchema`. + // It returns -1 if `requiredSchema` doesn't contain a field from `parsedSchema`. Review comment: It should read the column referenced by filters from the source because we filter it one more time within Spark side - that referenced column should be read from the source. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366180369 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema Review comment: You can just make it public and remove `private[sql]` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366180228 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema + + // Mapping of field indexes of `parsedSchema` to indexes of `requiredSchema`. + // It returns -1 if `requiredSchema` doesn't contain a field from `parsedSchema`. Review comment: It should read the column referenced by filters from the source because we filter it one more time within Spark side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs
HeartSaVioR edited a comment on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs URL: https://github.com/apache/spark/pull/23859#issuecomment-574034418 Btw, does the concern based on the real world workload? Because I cannot imagine "complete mode" works with decent amount of traffic, especially you're running the query for long time. "complete mode" means you cannot evict any state regardless of watermark, which won't make sense except you have finite set of group key (if then the cardinality of group keys will define the overall size of state). > If my assumption is right aren't we going back to Dstream behaviour of applying window transformation over the batch interval? That's why "state" comes into play in structured streaming. The state retains the values across micro-batches, "windows" in case of window transformations. In fact, as previous comments in this PR stated already, the only mode works without any tweak in production is append mode. In update mode you can tweak with custom sink to make it correctly upsert with the output, but there's no API to define "group keys" in existing sinks. (Even we have defined the group keys for sink side, it would be odd if we cannot guarantee the keys are same being used to group the data and do the stateful operation. It's semantically incorrect.) Btw, the streaming output mode is all about how to emit output for the stateful operation. If you don't do any stateful operation, output mode is no-op. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574034248 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116684/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs
HeartSaVioR commented on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs URL: https://github.com/apache/spark/pull/23859#issuecomment-574034418 Btw, does the concern based on the real world workload? Because I cannot imagine "complete mode" works with decent amount of traffic, especially you're running the query for long time. "complete mode" means you cannot evict any state regardless of watermark, which won't make sense except you have finite set of group key (if then the cardinality of group keys will define the overall size of state). > If my assumption is right aren't we going back to Dstream behaviour of applying window transformation over the batch interval? That's why "state" comes into play in structured streaming. The state retains the values across micro-batches, "windows" in case of window transformations. In fact, as previous comments in this PR stated already, the only mode works without any tweak in production is append mode. In update mode you can tweak with custom sink to make it correctly upsert with the output, but there's no API to define "group keys" in existing sinks. Btw, the streaming output mode is all about how to emit output for the stateful operation. If you don't do any stateful operation, output mode is no-op. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
SparkQA removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574010302 **[Test build #116684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116684/testReport)** for PR 27198 at commit [`e54839b`](https://github.com/apache/spark/commit/e54839bab74b1b98048b82e1f3122883f390). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins removed a comment on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574034245 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
gengliangwang commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574034462 @MaxGekk I see. I didn't know that the option `pathGlobFilter` is documented in DataFrameReader.scala. I think reverting the description of the option pathGlobFilter in `sql-data-sources-avro.md` makes sense. Still, we need to document the options in SPARK-30506. I will find someone else or do it myself. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574034248 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116684/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
AmplabJenkins commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574034245 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression
SparkQA commented on issue #27198: [SPARK-27986][SQL][FOLLOWUP] Respect filter in sql/toString of AggregateExpression URL: https://github.com/apache/spark/pull/27198#issuecomment-574034177 **[Test build #116684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116684/testReport)** for PR 27198 at commit [`e54839b`](https://github.com/apache/spark/commit/e54839bab74b1b98048b82e1f3122883f390). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366177956 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema Review comment: Just for the context, originally I made it private but had to make it more open because it is used in other places in `sql`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests
maropu commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests URL: https://github.com/apache/spark/pull/27166#issuecomment-574033880 Merged to master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu closed pull request #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests
maropu closed pull request #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests URL: https://github.com/apache/spark/pull/27166 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests
maropu commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests URL: https://github.com/apache/spark/pull/27166#issuecomment-574033453 yea, I will, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366177325 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema Review comment: Do you propose just `private`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574032554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116673/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated
AmplabJenkins removed a comment on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated URL: https://github.com/apache/spark/pull/26961#issuecomment-574032924 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21465/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574032548 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated
AmplabJenkins removed a comment on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated URL: https://github.com/apache/spark/pull/26961#issuecomment-574032919 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366177325 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema Review comment: Do you propose just `private`? or public? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated
AmplabJenkins commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated URL: https://github.com/apache/spark/pull/26961#issuecomment-574032924 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21465/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated
AmplabJenkins commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated URL: https://github.com/apache/spark/pull/26961#issuecomment-574032919 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated
SparkQA commented on issue #26961: [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated URL: https://github.com/apache/spark/pull/26961#issuecomment-574032515 **[Test build #116686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116686/testReport)** for PR 26961 at commit [`cdcc4d0`](https://github.com/apache/spark/commit/cdcc4d017645df5e8cfd9775b60444b443e3bf82). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574032554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116673/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574032548 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366176665 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -204,15 +234,15 @@ class UnivocityParser( * Parses a single CSV string and turns it into either one resulting row or no row (if the * the record is malformed). */ - def parse(input: String): InternalRow = doParse(input) + def parse(input: String): Seq[InternalRow] = doParse(input) private val getToken = if (options.columnPruning) { (tokens: Array[String], index: Int) => tokens(index) } else { (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index)) } - private def convert(tokens: Array[String]): InternalRow = { + private def convert(tokens: Array[String]): Seq[InternalRow] = { Review comment: Yup, It should be fine to do it separately This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
SparkQA removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-573975813 **[Test build #116673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116673/testReport)** for PR 26957 at commit [`42bf872`](https://github.com/apache/spark/commit/42bf8722f3634c0975db89af8dfcd3373a12e5f7). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366176019 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -204,15 +234,15 @@ class UnivocityParser( * Parses a single CSV string and turns it into either one resulting row or no row (if the * the record is malformed). */ - def parse(input: String): InternalRow = doParse(input) + def parse(input: String): Seq[InternalRow] = doParse(input) private val getToken = if (options.columnPruning) { (tokens: Array[String], index: Int) => tokens(index) } else { (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index)) } - private def convert(tokens: Array[String]): InternalRow = { + private def convert(tokens: Array[String]): Seq[InternalRow] = { Review comment: The required change in `FailureSafeParser` is one liner. I think we should use `Option` here instead of `Seq` ... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
SparkQA commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574032046 **[Test build #116673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116673/testReport)** for PR 26957 at commit [`42bf872`](https://github.com/apache/spark/commit/42bf8722f3634c0975db89af8dfcd3373a12e5f7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366176306 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVFilters.scala ## @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.csv + +import scala.util.Try + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.sources +import org.apache.spark.sql.types.{BooleanType, StructType} + +/** + * An instance of the class compiles filters to predicates and allows to + * apply the predicates to an internal row with partially initialized values + * converted from parsed CSV fields. + * + * @param filters The filters pushed down to CSV datasource. + * @param dataSchema The full schema with all fields in CSV files. + * @param requiredSchema The schema with only fields requested by the upper layer. + * @param columnPruning true if CSV parser can read sub-set of columns otherwise false. + */ +class CSVFilters( +filters: Seq[sources.Filter], +dataSchema: StructType, +requiredSchema: StructType, +columnPruning: Boolean) { + require(checkFilters(), "All filters must be applicable to the data schema.") Review comment: Initially the function was slightly complex, see https://github.com/apache/spark/pull/26973/commits/f0aa0a88bfa0c87007f8781ba7fac8f9cd3057ba#diff-44a98c4a53980cb04e57f0489b257a37L126 . That's why I extracted it to a separate method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-574031548 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
HyukjinKwon commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366176019 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -204,15 +234,15 @@ class UnivocityParser( * Parses a single CSV string and turns it into either one resulting row or no row (if the * the record is malformed). */ - def parse(input: String): InternalRow = doParse(input) + def parse(input: String): Seq[InternalRow] = doParse(input) private val getToken = if (options.columnPruning) { (tokens: Array[String], index: Int) => tokens(index) } else { (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index)) } - private def convert(tokens: Array[String]): InternalRow = { + private def convert(tokens: Array[String]): Seq[InternalRow] = { Review comment: The required change in `FailureSafeParser` is one liner. I think we should use `Option` here instead of `Seq` ... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
AmplabJenkins removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-574031555 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116679/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-574031548 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
AmplabJenkins commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-574031555 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116679/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
SparkQA removed a comment on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-573986167 **[Test build #116679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116679/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
SparkQA commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent URL: https://github.com/apache/spark/pull/26975#issuecomment-574030980 **[Test build #116679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116679/testReport)** for PR 26975 at commit [`92e30c3`](https://github.com/apache/spark/commit/92e30c3d4ae52a8ce78bc10328446c5149a38e55). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366175302 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -204,15 +234,15 @@ class UnivocityParser( * Parses a single CSV string and turns it into either one resulting row or no row (if the * the record is malformed). */ - def parse(input: String): InternalRow = doParse(input) + def parse(input: String): Seq[InternalRow] = doParse(input) private val getToken = if (options.columnPruning) { (tokens: Array[String], index: Int) => tokens(index) } else { (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index)) } - private def convert(tokens: Array[String]): InternalRow = { + private def convert(tokens: Array[String]): Seq[InternalRow] = { Review comment: We can do that but modification of `FailureSafeParser` is slightly orthogonal to the purpose of the PR, and it is not necessary for this changes. Can we do that separately? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574030333 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574030343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116672/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574030333 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
AmplabJenkins commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574030343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116672/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
MaxGekk commented on a change in pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource URL: https://github.com/apache/spark/pull/26973#discussion_r366174594 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ## @@ -39,27 +40,49 @@ import org.apache.spark.unsafe.types.UTF8String * @param requiredSchema The schema of the data that should be output for each row. This should be a * subset of the columns in dataSchema. * @param options Configuration options for a CSV parser. + * @param filters The pushdown filters that should be applied to converted values. */ class UnivocityParser( dataSchema: StructType, requiredSchema: StructType, -val options: CSVOptions) extends Logging { +val options: CSVOptions, +filters: Seq[Filter]) extends Logging { require(requiredSchema.toSet.subsetOf(dataSchema.toSet), s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " + s"dataSchema (${dataSchema.catalogString}).") + def this(dataSchema: StructType, requiredSchema: StructType, options: CSVOptions) = { +this(dataSchema, requiredSchema, options, Seq.empty) + } def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val csvFilters = new CSVFilters( +filters, +dataSchema, +requiredSchema, +options.columnPruning) + + private[sql] val parsedSchema = csvFilters.readSchema + + // Mapping of field indexes of `parsedSchema` to indexes of `requiredSchema`. + // It returns -1 if `requiredSchema` doesn't contain a field from `parsedSchema`. Review comment: What should the upper layer do with the column if a datasource already applied filters? As far as I know filters are applied only once in DSv2, @cloud-fan or I am wrong? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
SparkQA removed a comment on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-573970634 **[Test build #116672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116672/testReport)** for PR 26957 at commit [`c80a155`](https://github.com/apache/spark/commit/c80a155610472029eb1e2adf988aba3d0cf54206). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
SparkQA commented on issue #26957: [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation URL: https://github.com/apache/spark/pull/26957#issuecomment-574029847 **[Test build #116672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116672/testReport)** for PR 26957 at commit [`c80a155`](https://github.com/apache/spark/commit/c80a155610472029eb1e2adf988aba3d0cf54206). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
AmplabJenkins removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-574028851 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116680/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
AmplabJenkins removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-574028846 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
AmplabJenkins commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-574028851 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116680/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
AmplabJenkins commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-574028846 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
SparkQA removed a comment on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-573987598 **[Test build #116680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116680/testReport)** for PR 27164 at commit [`01ece29`](https://github.com/apache/spark/commit/01ece298b6be6cd6cb40c89b62a391c7fb47a3e4). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests
MaxGekk commented on issue #27166: [SPARK-30482][SQL][CORE][TESTS] Add sub-class of `AppenderSkeleton` reusable in tests URL: https://github.com/apache/spark/pull/27166#issuecomment-574028201 @HyukjinKwon 's fix helped, @maropu this can be merged. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events
SparkQA commented on issue #27164: [SPARK-30479][SQL] Apply compaction of event log to SQL events URL: https://github.com/apache/spark/pull/27164#issuecomment-574028326 **[Test build #116680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116680/testReport)** for PR 27164 at commit [`01ece29`](https://github.com/apache/spark/commit/01ece298b6be6cd6cb40c89b62a391c7fb47a3e4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression al
maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause URL: https://github.com/apache/spark/pull/27058#discussion_r366171505 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala ## @@ -118,6 +118,52 @@ import org.apache.spark.sql.types.IntegerType * LocalTableScan [...] * }}} * + * Third example: single distinct aggregate function with filter clauses (in sql): + * {{{ + * SELECT + * COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1, + * COUNT(DISTINCT cat1) as cat1_cnt2, + * SUM(value) AS total + * FROM + *data + * GROUP BY + *key + * }}} + * + * This translates to the following (pseudo) logical plan: + * {{{ + * Aggregate( + *key = ['key] + *functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1), + * COUNT(DISTINCT 'cat1), + * sum('value)] + *output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total]) + * LocalTableScan [...] + * }}} + * + * This rule rewrites this logical plan to the following (pseudo) logical plan: + * {{{ + * Aggregate( + * key = ['key] + * functions = [count(if (('gid = 1)) '_gen_distinct_1 else null), + * count(if (('gid = 2)) '_gen_distinct_2 else null), + * first(if (('gid = 0)) 'total else null) ignore nulls] + * output = ['key, 'cat1_cnt, 'cat1_cnt2, 'total]) + * Aggregate( + *key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid] + *functions = [sum('value)] + *output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total]) + * Expand( + * projections = [('key, null, null, 0, 'value), + * ('key, '_gen_distinct_1, null, 1, null), + * ('key, null, '_gen_distinct_2, 2, null)] + * output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value]) + * Expand( Review comment: Probably, this is related to [the comment](https://github.com/apache/spark/pull/27058#discussion_r366139379). If we avoid the recursive call, I think we can have a chance to merge them. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression al
maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause URL: https://github.com/apache/spark/pull/27058#discussion_r366171011 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala ## @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { }.asInstanceOf[NamedExpression] } Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate) +} else if (distinctAggGroups.size == 1) { + val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct) + if (distinctAggExpressions.exists(_.filter.isDefined)) { +val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable)) +val regularFunChildren = regularAggExprs + .flatMap(_.aggregateFunction.children.filter(!_.foldable)) +val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes) +val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct +val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair) +val regularAggChildAttrLookup = regularAggChildAttrMap.toMap +val regularOperatorMap = regularAggExprs.map { + case ae @ AggregateExpression(af, _, _, filter, _) => +val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c)) +val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction] +val filterOpt = filter.map(_.transform { + case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a) +}) +val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt) +(ae, aggExpr) +} +val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable)) +val rewriteDistinctOperatorMap = distinctAggExprs.map { + case ae @ AggregateExpression(af, _, _, filter, _) => +// Why do we need to construct the phantom id ? +// First, In order to reduce costs, it is better to handle the filter clause locally. +// e.g. COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate expression +// If(id > 1) 'a else null first, and use the result as output. +// Second, If more than one DISTINCT aggregate expression uses the same column, +// We need to construct the phantom attributes so as the output not lost. +// e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output +// attribute '_gen_distinct-1 and attribute '_gen_distinct-2 instead of two 'a. +// Note: We just need to illusion the expression with filter clause. +// The illusionary mechanism may result in multiple distinct aggregations uses +// different column, so we still need to call `rewrite`. +val phantomId = NamedExpression.newExprId.id +val unfoldableChildren = af.children.filter(!_.foldable) +val exprAttrs = unfoldableChildren.map { e => + (e, AttributeReference(s"_gen_distinct_$phantomId", e.dataType, nullable = true)()) +} +val exprAttrLookup = exprAttrs.toMap +val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c)) +val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction] +val aggExpr = ae.copy(aggregateFunction = raf, filter = None) +// Expand projection +val projection = unfoldableChildren.map { + case e if filter.isDefined => If(filter.get, e, nullify(e)) + case e => e +} +(projection, exprAttrs, (ae, aggExpr)) +} +val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.flatMap(_._2) +val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2) +val allAggAttrs = regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs +// Construct the aggregate input projection. +val rewriteDistinctProjections = rewriteDistinctOperatorMap.flatMap(_._1) +val rewriteAggProjections = + Seq((a.groupingExpressions ++ regularAggChildren ++ rewriteDistinctProjections)) +val groupByMap = a.groupingExpressions.collect { + case ne: NamedExpression => ne -> ne.toAttribute + case e => e -> AttributeReference(e.sql, e.dataType, e.nullable)() +} +val groupByAttrs = groupByMap.map(_._2) +// Construct the expand operator. +val expand = Expand(rewriteAggProjections, groupByAttrs ++ allAggAttrs, a.child) +val rewriteAggExprLookup = + (rewriteDistinctOperatorMap.map(_._3) ++
[GitHub] [spark] viirya commented on a change in pull request #27187: [SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework
viirya commented on a change in pull request #27187: [SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework URL: https://github.com/apache/spark/pull/27187#discussion_r366167726 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -900,6 +902,18 @@ class Analyzer( } case _ => u } + + case u @ UnresolvedTableOrView(identifier) => +expandRelationName(identifier) match { + case SessionCatalogAndIdentifier(catalog, ident) => +CatalogV2Util.loadTable(catalog, ident) match { + // We don't support view in v2 catalog yet. Here we treat v1 view as table and use + // `ResolvedTable` to carry it. + case Some(table) => ResolvedTable(catalog.asTableCatalog, ident, table) Review comment: Do you mean to add a case? ``` CatalogV2Util.loadTable(catalog, ident) match { case Some(v1Table: V1Table) if v1Table.v1Table.tableType == CatalogTableType.VIEW => ... } ``` Otherwise I'm not sure why `case Some(table) =>` is for v1 view. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #27187: [SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework
viirya commented on a change in pull request #27187: [SPARK-30497][SQL] migrate DESCRIBE TABLE to the new framework URL: https://github.com/apache/spark/pull/27187#discussion_r366169592 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -759,6 +759,8 @@ class Analyzer( u.failAnalysis(s"${ident.quoted} is a temp view not table.") } u + case u @ UnresolvedTableOrView(ident) => +lookupTempView(ident).getOrElse(u) Review comment: Looks like it is very similar to `UnresolvedRelation`? `UnresolvedRelation` can also be resolve to view by `ResolveTempViews`, can be resolved to v2 relation by `ResolveTables`, can be resolved to v1 relation by `ResolveRelations`. Seems we have a few similar ways to do resolving? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized
cloud-fan commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized URL: https://github.com/apache/spark/pull/26696#issuecomment-574024661 sounds good, can you open a new PR for your new idea? then we can review and leave comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md
MaxGekk commented on issue #27194: [SPARK-30505][DOCS] Deprecate Avro option `ignoreExtension` in sql-data-sources-avro.md URL: https://github.com/apache/spark/pull/27194#issuecomment-574024291 I described the global option because it has been already described in each load method - json, text, parquet ... in DataFrameReader, for example https://github.com/apache/spark/blob/f8d59572b014e5254b0c574b26e101c2e4157bdd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L410 > Are you interested in writing the new doc? No, I am not. If you think, description of the option `pathGlobFilter ` is useless in `sql-data-sources-avro.md`. Let me know, I will revert it from the PR but I would leave it as in this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs
cloud-fan commented on issue #23859: [SPARK-26956][SS] remove streaming output mode from data source v2 APIs URL: https://github.com/apache/spark/pull/23859#issuecomment-574023315 the "new data" mentioned in this PR means the new result of running the query with the entire input, not a single micro-batch. This is the semantic of complete mode AFAIK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #26201: [SPARK-29543][SS][UI] Init structured streaming ui
AmplabJenkins removed a comment on issue #26201: [SPARK-29543][SS][UI] Init structured streaming ui URL: https://github.com/apache/spark/pull/26201#issuecomment-574020999 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116681/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org