[jira] [Created] (SPARK-48526) Allow passing custom sink to StreamTest::testStream
Johan Lasperas created SPARK-48526: -- Summary: Allow passing custom sink to StreamTest::testStream Key: SPARK-48526 URL: https://issues.apache.org/jira/browse/SPARK-48526 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Johan Lasperas The testing helpers for streaming don't allow providing a custom sink, which is limiting in (at least) two ways: * A sink can't be reused across multiple calls to `testStream`, e.g. when canceling and resuming a streaming query. * A sink implementation other than `MemorySink` can't be provided. One use case is testing the Delta streaming sink by wrapping it in a `MemorySink` interface and passing it to the test framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
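A minimal sketch of what the first limitation asks for, assuming `testStream` gains a `sink` parameter (the parameter name is this ticket's proposal, not an existing API); `MemoryStream`, `AddData`, `CheckAnswer` and `StopStream` are existing `StreamTest` helpers:
{code:java}
val inputData = MemoryStream[Int]
val streamingDf = inputData.toDF()

// Hypothetical: the caller owns the sink instead of testStream creating one.
val sharedSink = new MemorySink

testStream(streamingDf, OutputMode.Append, sink = sharedSink)(
  AddData(inputData, 1, 2),
  CheckAnswer(1, 2),
  StopStream
)

// Resuming against the same sink keeps the batches committed by the first run,
// which the current helpers cannot express since they always create a fresh MemorySink.
testStream(streamingDf, OutputMode.Append, sink = sharedSink)(
  AddData(inputData, 3),
  CheckAnswer(1, 2, 3)
)
{code}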
[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-48308: --- Description: In [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] the schema of the data excluding partition columns is computed twice in slightly different ways: {code:java} val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains) {code} vs {code:java} val readDataColumns = dataColumns .filterNot(partitionColumns.contains) {code} was: In [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] the schema of the data excluding partition columns is computed twice in slightly different ways: {code:java} val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains) {code} vs {code:java} val readDataColumns = dataColumns .filterNot(partitionColumns.contains) {code} > Unify getting data schema without partition columns in FileSourceStrategy > - > > Key: SPARK-48308 > URL: https://issues.apache.org/jira/browse/SPARK-48308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Johan Lasperas >Assignee: Johan Lasperas >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0 > > > In > [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] > the schema of the data excluding partition columns is computed twice in > slightly different ways: > > {code:java} > val dataColumnsWithoutPartitionCols = > dataColumns.filterNot(partitionSet.contains) {code} > vs > {code:java} > val readDataColumns = dataColumns > .filterNot(partitionColumns.contains) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-48308: --- Description: In [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] the schema of the data excluding partition columns is computed twice in slightly different ways: {code:java} val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains) {code} vs {code:java} val readDataColumns = dataColumns .filterNot(partitionColumns.contains) {code} was: In [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] the schema of the data excluding partition columns is computed twice in slightly different ways: ``` val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains) ``` vs ``` val readDataColumns = dataColumns .filterNot(partitionColumns.contains) ``` This should be unified > Unify getting data schema without partition columns in FileSourceStrategy > - > > Key: SPARK-48308 > URL: https://issues.apache.org/jira/browse/SPARK-48308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Johan Lasperas >Priority: Trivial > > In > [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] > the schema of the data excluding partition columns is computed twice in > slightly different ways: > > {code:java} > val dataColumnsWithoutPartitionCols = > dataColumns.filterNot(partitionSet.contains) {code} > > vs > {code:java} > val readDataColumns = dataColumns > .filterNot(partitionColumns.contains) {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy
Johan Lasperas created SPARK-48308: -- Summary: Unify getting data schema without partition columns in FileSourceStrategy Key: SPARK-48308 URL: https://issues.apache.org/jira/browse/SPARK-48308 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.1 Reporter: Johan Lasperas In [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191] the schema of the data excluding partition columns is computed twice in slightly different ways: {code:java} val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionSet.contains) {code} vs {code:java} val readDataColumns = dataColumns .filterNot(partitionColumns.contains) {code} This should be unified. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
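For context, the two variants are not strictly interchangeable: `partitionSet` is an `AttributeSet`, which matches attributes by `exprId`, while `contains` on the `partitionColumns` sequence uses full attribute equality. A minimal sketch of the unified form, reusing the names from `FileSourceStrategy`:
{code:java}
// Compute the non-partition data columns once, with AttributeSet (exprId-based)
// matching, and reuse the result everywhere the read schema is needed.
val readDataColumns: Seq[Attribute] = dataColumns.filterNot(partitionSet.contains)
{code}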
[jira] [Updated] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results
[ https://issues.apache.org/jira/browse/SPARK-46092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-46092: --- Description: While the Parquet readers don't support reading Parquet values into larger Spark types, it's possible to trigger an overflow when creating a Parquet row group filter that will then incorrectly skip row groups and bypass the exception in the reader. Repro: {code:java} Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect(){code} This succeeds and returns no results. This should either fail if the Parquet reader doesn't support the upcast from int to long or produce the result `[0]` if it does. was: While the Parquet readers don't support reading Parquet values into larger Spark types, it's possible to trigger an overflow when creating a Parquet row group filter that will then incorrectly skip row groups and bypass the exception in the reader. Repro: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` This succeeds and returns no results. This should either fail if the Parquet reader doesn't support the upcast from int to long or produce the result `[0]` if it does. > Overflow in Parquet row group filter creation causes incorrect results > -- > > Key: SPARK-46092 > URL: https://issues.apache.org/jira/browse/SPARK-46092 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Johan Lasperas >Priority: Major > > While the Parquet readers don't support reading Parquet values into larger > Spark types, it's possible to trigger an overflow when creating a Parquet row > group filter that will then incorrectly skip row groups and bypass the > exception in the reader. > Repro: > {code:java} > Seq(0).toDF("a").write.parquet(path) > spark.read.schema("a LONG").parquet(path).where(s"a < > ${Long.MaxValue}").collect(){code} > This succeeds and returns no results. This should either fail if the Parquet > reader doesn't support the upcast from int to long or produce the result `[0]` if > it does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results
Johan Lasperas created SPARK-46092: -- Summary: Overflow in Parquet row group filter creation causes incorrect results Key: SPARK-46092 URL: https://issues.apache.org/jira/browse/SPARK-46092 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Johan Lasperas While the Parquet readers don't support reading Parquet values into larger Spark types, it's possible to trigger an overflow when creating a Parquet row group filter that will then incorrectly skip row groups and bypass the exception in the reader. Repro: {code:java} Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() {code} This succeeds and returns no results. This should either fail if the Parquet reader doesn't support the upcast from int to long or produce the result `[0]` if it does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
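The suspected mechanism can be reproduced with plain Scala arithmetic; this is an illustrative sketch, not the actual filter-creation code:
{code:java}
// If the Long filter value is narrowed to the file's physical Int type while
// building the row group filter, Long.MaxValue silently wraps around to -1.
val threshold: Long = Long.MaxValue
val narrowed: Int = threshold.toInt
assert(narrowed == -1)  // two's-complement wrap-around, no error raised

// A row group containing a = 0 has statistics min = max = 0, so the narrowed
// predicate `a < -1` can never match: the row group is skipped and the query
// returns no rows instead of [0] (or an unsupported-upcast error).
{code}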
[jira] [Created] (SPARK-44026) Update helper methods to create SQLMetric with initial value
Johan Lasperas created SPARK-44026: -- Summary: Update helper methods to create SQLMetric with initial value Key: SPARK-44026 URL: https://issues.apache.org/jira/browse/SPARK-44026 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Johan Lasperas The helper methods in [SQLMetrics.scala|https://github.com/apache/spark/blob/7107742a381cde2e6de9425e3e436282a8c0d27c/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala#L38] all use a fixed initial value of `-1`. Callers may want the metric to start with a different initial value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
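The `SQLMetric` constructor already accepts an initial value, so the change amounts to plumbing it through the helpers. A sketch of the proposed shape, where the `initValue` parameter name is an assumption of this ticket:
{code:java}
// Simplified current helper shape: the initial value is pinned by the helper.
//   def createMetric(sc: SparkContext, name: String): SQLMetric
// Proposed shape (hypothetical): expose it, keeping -1 as the default.
//   def createMetric(sc: SparkContext, name: String, initValue: Long = -1): SQLMetric

// The underlying constructor supports this already:
val metric = new SQLMetric(metricType = "sum", initValue = 0L)
{code}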
[jira] [Created] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`
Johan Lasperas created SPARK-43487: -- Summary: Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError` Key: SPARK-43487 URL: https://issues.apache.org/jira/browse/SPARK-43487 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Johan Lasperas The batch of errors migrated to error classes as part of SPARK-40540 contains an error that got mixed up with the wrong error message: [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] uses the same error message as the following commandUnsupportedInV2TableError: {code:java} WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2; AnalysisException: t is not supported for v2 tables {code} The error should be: {code:java} AnalysisException: Name t is ambiguous in nested CTE. Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name defined in inner CTE takes precedence. If set it to LEGACY, outer CTE definitions will take precedence. See more details in SPARK-28228. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`
[ https://issues.apache.org/jira/browse/SPARK-43487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-43487: --- Description: The batch of errors migrated to error classes as part of SPARK-40540 contains an error that got mixed up with the wrong error message: [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] uses the same error message as the following commandUnsupportedInV2TableError: {code:java} WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2; AnalysisException: t is not supported for v2 tables {code} The error should be: {code:java} AnalysisException: Name t is ambiguous in nested CTE. Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name defined in inner CTE takes precedence. If set it to LEGACY, outer CTE definitions will take precedence. See more details in SPARK-28228.{code} was: The batch of errors migrated to error classes as part of SPARK-40540 contains an error that got mixed up with the wrong error message: [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] uses the same error message as the following commandUnsupportedInV2TableError: ``` WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2; AnalysisException: t is not supported for v2 tables ``` The error should be: ``` AnalysisException: Name t is ambiguous in nested CTE. Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name defined in inner CTE takes precedence. If set it to LEGACY, outer CTE definitions will take precedence. See more details in SPARK-28228. ``` > Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError` > - > > Key: SPARK-43487 > URL: https://issues.apache.org/jira/browse/SPARK-43487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Priority: Minor > > The batch of errors migrated to error classes as part of SPARK-40540 contains > an error that got mixed up with the wrong error message: > [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] > uses the same error message as the following > commandUnsupportedInV2TableError: > > {code:java} > WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t) SELECT * > FROM t2; > AnalysisException: t is not supported for v2 tables > {code} > The error should be: > {code:java} > AnalysisException: Name t is ambiguous in nested CTE. > Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name > defined in inner CTE takes precedence. If set it to LEGACY, outer CTE > definitions will take precedence. See more details in SPARK-28228.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField
[ https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-43217: --- Description: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws '{{{}Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.'{}}}, even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} was: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws '{{{}Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.'{}}}, even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} > Correctly recurse into maps of maps and arrays of arrays in > StructType.findNestedField > -- > > Key: SPARK-43217 > URL: https://issues.apache.org/jira/browse/SPARK-43217 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Priority: Minor > > [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] > is unable to reach nested fields below two directly nested maps or arrays. > Whenever it reaches a map or an array, it'll throw an `invalidFieldName` > exception if the child is not a struct. > The following throws '{{{}Field name `a`.`element`.`element`.`i` is invalid: > `a`.`element`.`element` is not a struct.'{}}}, even though the access path is > valid: > {code:java} > val schema = new StructType() > .add("a", ArrayType(ArrayType( > new StructType().add("i", "int")))) > findNestedField(Seq("a", "element", "element", "i"), schema) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField
[ https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-43217: --- Description: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws '{{{}Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.'{}}}, even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} was: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} > Correctly recurse into maps of maps and arrays of arrays in > StructType.findNestedField > -- > > Key: SPARK-43217 > URL: https://issues.apache.org/jira/browse/SPARK-43217 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Priority: Minor > > [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] > is unable to reach nested field below two directly nested maps or arrays. > Whenever it reaches a map or an array, it'll throw an `invalidFieldName` > exception if the child is not a struct. > The following throws '{{{}Field name `a`.`element`.`element`.`i` is invalid: > `a`.`element`.`element` is not a struct.'{}}}, even though the access path is > valid: > {code:java} > val schema = new StructType() > .add("a", ArrayType(ArrayType( > new StructType().add("i", "int")))) > findNestedField(Seq("a", "element", "element", "i"), schema) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField
[ https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-43217: --- Description: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} was: [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid: ``` val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) ``` > Correctly recurse into maps of maps and arrays of arrays in > StructType.findNestedField > -- > > Key: SPARK-43217 > URL: https://issues.apache.org/jira/browse/SPARK-43217 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Priority: Minor > > [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] > is unable to reach nested field below two directly nested maps or arrays. > Whenever it reaches a map or an array, it'll throw an `invalidFieldName` > exception if the child is not a struct. > The following throws 'Field name `a`.`element`.`element`.`i` is invalid: > `a`.`element`.`element` is not a struct.', even though the access path is > valid: > {code:java} > val schema = new StructType() > .add("a", ArrayType(ArrayType( > new StructType().add("i", "int")))) > findNestedField(Seq("a", "element", "element", "i"), schema) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField
Johan Lasperas created SPARK-43217: -- Summary: Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField Key: SPARK-43217 URL: https://issues.apache.org/jira/browse/SPARK-43217 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Johan Lasperas [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested field below two directly nested maps or arrays. Whenever it reaches a map or an array, it'll throw an `invalidFieldName` exception if the child is not a struct. The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid: {code:java} val schema = new StructType() .add("a", ArrayType(ArrayType( new StructType().add("i", "int")))) findNestedField(Seq("a", "element", "element", "i"), schema) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
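A sketch of the intended behavior; this is an illustrative standalone resolver, not the actual Spark patch:
{code:java}
import org.apache.spark.sql.types._

// Illustrative resolver that keeps recursing through arrays ("element") and maps
// ("key"/"value") instead of requiring every non-leaf step to be a struct.
def resolvePath(dt: DataType, path: Seq[String]): Option[DataType] = (dt, path) match {
  case (_, Seq())                              => Some(dt)
  case (ArrayType(elem, _), "element" +: rest) => resolvePath(elem, rest)
  case (MapType(key, _, _), "key" +: rest)     => resolvePath(key, rest)
  case (MapType(_, value, _), "value" +: rest) => resolvePath(value, rest)
  case (s: StructType, name +: rest) =>
    s.fields.find(_.name == name).flatMap(f => resolvePath(f.dataType, rest))
  case _ => None
}

// The repro path from the description resolves once arrays of arrays are handled:
val schema = new StructType()
  .add("a", ArrayType(ArrayType(new StructType().add("i", "int"))))
assert(resolvePath(schema, Seq("a", "element", "element", "i")).contains(IntegerType))
{code}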
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: A first step towards allowing file format implementations to inject custom metadata fields into plans is to make the handling of metadata attributes in `FileSourceStrategy` more generic. Today in `FileSourceStrategy`, the lists of constant and generated metadata fields are created manually, checking for known generated fields on the one hand and treating the remaining fields as constant metadata fields on the other. We should instead introduce a way of declaring metadata fields as generated or constant directly in `FileFormat` and propagate that information to `FileSourceStrategy`. was: A first step towards allowing file format implementations to inject custom metadata columns into plans is to make the handling of metadata attributes in `FileSourceStrategy` more generic. Today in `FileSourceStrategy`, the lists of constant and generated metadata columns are created manually, checking for known generated columns on the one hand and treating the remaining columns as constant metadata columns on the other. We should instead introduce a way of declaring metadata columns as generated or constant directly in `FileFormat` and propagate that information to `FileSourceStrategy`. > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > A first step towards allowing file format implementations to inject custom > metadata fields into plans is to make the handling of metadata attributes in > `FileSourceStrategy` more generic. > Today in `FileSourceStrategy`, the lists of constant and generated metadata > fields are created manually, checking for known generated fields on the one hand > and treating the remaining fields as constant metadata fields on the other. We > should instead introduce a way of declaring metadata fields as generated or > constant directly in `FileFormat` and propagate that information to > `FileSourceStrategy`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
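A hedged sketch of the generic classification this calls for, assuming the `FileSourceGeneratedMetadataAttribute` extractor introduced by SPARK-41791; how `FileFormat` would declare each field's kind is left out:
{code:java}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, FileSourceGeneratedMetadataAttribute}

// Illustrative split: classify metadata attributes by the extractor that matches
// them rather than by a hard-coded list of known generated column names; anything
// not declared as generated is treated as constant metadata.
def splitMetadataAttributes(
    metadataColumns: Seq[AttributeReference]): (Seq[AttributeReference], Seq[AttributeReference]) =
  metadataColumns.partition {
    case FileSourceGeneratedMetadataAttribute(_) => true
    case _ => false
  }
{code}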
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: A first step towards allowing file format implementations to inject custom metadata columns into plans is to make the handling of metadata attributes in `FileSourceStrategy` more generic. Today in `FileSourceStrategy`, the lists of constant and generated metadata columns are created manually, checking for known generated columns on the one hand and treating the remaining columns as constant metadata columns on the other. We should instead introduce a way of declaring metadata columns as generated or constant directly in `FileFormat` and propagate that information to `FileSourceStrategy`. was:A first step towards allowing file format implementations to inject custom metadata columns into plans is to make handling of metadata attributes in `FileSourceStrategy` more generic. > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > A first step towards allowing file format implementations to inject custom > metadata columns into plans is to make the handling of metadata attributes in > `FileSourceStrategy` more generic. > Today in `FileSourceStrategy`, the lists of constant and generated metadata > columns are created manually, checking for known generated columns on the one > hand and treating the remaining columns as constant metadata columns on the > other. We should instead introduce a way of declaring metadata columns as > generated or constant directly in `FileFormat` and propagate that information to > `FileSourceStrategy`. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: A first step towards allowing file format implementations to inject custom metadata columns into plans is to make handling of metadata attributes in `FileSourceStrategy` more generic. (was: A first step towards allowing file format implementations to inject custom metadata columns into plans is to make ) > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > A first step towards allowing file format implementations to inject custom > metadata columns into plans is to make handling of metadata attributes in > `FileSourceStrategy` more generic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: A first step towards allowing file format implementations to inject custom metadata columns into plans is to (was: As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that introduced `FileSourceConstantMetadataAttribute` and `FileSourceGeneratedMetadataAttribute` to handle constant and generated metadata columns, we may want to introduce corresponding abstractions for struct fields to allow creating metadata fields more easily.) > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > A first step towards allowing file format implementations to inject custom > metadata columns into plans is to -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: A first step towards allowing file format implementations to inject custom metadata columns into plans is to make (was: A first step towards allowing file format implementations to inject custom metadata columns into plans is to ) > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > A first step towards allowing file format implementations to inject custom > metadata columns into plans is to make -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Summary: Generalize handling of metadata attributes in FileSourceStrategy (was: Introduce abstractions to create constant and generated metadata fields) > Generalize handling of metadata attributes in FileSourceStrategy > > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that > introduced `FileSourceConstantMetadataAttribute` and > `FileSourceGeneratedMetadataAttribute` to handle constant and generated > metadata columns, we may want to introduce corresponding abstractions for > struct fields to allow creating metadata fields more easily. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42918) Introduce abstractions to create constant and generated metadata fields
[ https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-42918: --- Description: As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that introduced `FileSourceConstantMetadataAttribute` and `FileSourceGeneratedMetadataAttribute` to handle constant and generated metadata columns, we may want to introduce corresponding abstractions for struct fields to allow creating metadata fields more easily. (was: As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that introduced `FileSourceConstantMetadataAttribute` and `FileSourceGeneratedMetadataStructField` to handle constant and generated metadata columns, we may want to introduce corresponding abstractions for struct fields to allow creating metadata fields more easily.) > Introduce abstractions to create constant and generated metadata fields > --- > > Key: SPARK-42918 > URL: https://issues.apache.org/jira/browse/SPARK-42918 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: Johan Lasperas >Priority: Minor > > As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that > introduced `FileSourceConstantMetadataAttribute` and > `FileSourceGeneratedMetadataAttribute` to handle constant and generated > metadata columns, we may want to introduce corresponding abstractions for > struct fields to allow creating metadata fields more easily. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42918) Introduce abstractions to create constant and generated metadata fields
Johan Lasperas created SPARK-42918: -- Summary: Introduce abstractions to create constant and generated metadata fields Key: SPARK-42918 URL: https://issues.apache.org/jira/browse/SPARK-42918 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.4.1 Reporter: Johan Lasperas As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that introduced `FileSourceConstantMetadataAttribute` and `FileSourceGeneratedMetadataStructField` to handle constant and generated metadata columns, we may want to introduce corresponding abstractions for struct fields to allow creating metadata fields more easily. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40921) Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command
[ https://issues.apache.org/jira/browse/SPARK-40921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johan Lasperas updated SPARK-40921: --- Target Version/s: (was: 3.4.0) > Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command > --- > > Key: SPARK-40921 > URL: https://issues.apache.org/jira/browse/SPARK-40921 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Johan Lasperas >Priority: Major > > The MERGE INTO syntax in Spark allows two types of WHEN clause: > * WHEN MATCHED: Specify optional condition and actions (delete or update) to > apply to rows that satisfy the merge condition. > * WHEN NOT MATCHED: Specify optional condition and actions (insert) to apply > to rows from the source table that don't satisfy the match condition. > Other products also offer a third type of WHEN clause: > * WHEN NOT MATCHED BY SOURCE: Specify optional condition and actions (delete > or update) to apply to rows from the target table that don't satisfy the > merge condition. > See for example [T-SQL Merge > Documentation|https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16] > Example: > {code:java} > MERGE INTO target > USING source > ON target.key = source.key > WHEN MATCHED THEN UPDATE SET * > WHEN NOT MATCHED THEN INSERT * > WHEN NOT MATCHED BY SOURCE THEN DELETE {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40921) Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command
Johan Lasperas created SPARK-40921: -- Summary: Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command Key: SPARK-40921 URL: https://issues.apache.org/jira/browse/SPARK-40921 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Johan Lasperas The MERGE INTO syntax in Spark allows two types of WHEN clause: * WHEN MATCHED: Specify optional condition and actions (delete or update) to apply to rows that satisfy the merge condition. * WHEN NOT MATCHED: Specify optional condition and actions (insert) to apply to rows from the source table that don't satisfy the match condition. Other products also offer a third type of WHEN clause: * WHEN NOT MATCHED BY SOURCE: Specify optional condition and actions (delete or update) to apply to rows from the target table that don't satisfy the merge condition. See for example [T-SQL Merge Documentation|https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16] Example: {code:java} MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * WHEN NOT MATCHED BY SOURCE THEN DELETE {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org