[jira] [Created] (SPARK-48526) Allow passing custom sink to StreamTest::testStream

2024-06-04 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-48526:
--

 Summary: Allow passing custom sink to StreamTest::testStream
 Key: SPARK-48526
 URL: https://issues.apache.org/jira/browse/SPARK-48526
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Johan Lasperas


The testing helpers for streaming don't allow providing a custom sink, which is 
limiting in (at least) two ways:
 * A sink can't be reused across multiple calls to `testStream`, e.g. when 
canceling and then resuming a stream.
 * A sink implementation other than `MemorySink` can't be provided. One use 
case is testing the Delta streaming sink by wrapping it in a `MemorySink` 
interface and passing it to the test framework, as sketched below.
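
A minimal sketch of what this could look like, assuming the current `StreamTest` trait in the Spark test sources; the `sink` parameter and its position in the signature are illustrative, not an actual API:
{code:java}
// Hypothetical extension: testStream accepts the sink to write to instead
// of always creating a fresh MemorySink internally.
def testStream(
    stream: Dataset[_],
    outputMode: OutputMode = OutputMode.Append,
    sink: MemorySink = new MemorySink())(actions: StreamAction*): Unit

// Usage: reuse one sink across two runs to model cancel/resume.
val input = MemoryStream[Int]
val sink = new MemorySink()
testStream(input.toDF(), OutputMode.Append, sink)(AddData(input, 1, 2), StopStream)
testStream(input.toDF(), OutputMode.Append, sink)(CheckAnswer(1, 2))
{code}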






[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-48308:
---
Description: 
In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema of the data excluding partition columns is computed twice, in slightly different ways:

 
{code:java}
val dataColumnsWithoutPartitionCols =
  dataColumns.filterNot(partitionSet.contains) {code}
vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}
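
A possible unification, as a sketch only (not the actual patch): compute the non-partition data columns once against a single attribute set and reuse the result in both places.
{code:java}
// Sketch: one definition serves both the read schema and the projection.
val partitionSet = AttributeSet(partitionColumns)
val readDataColumns = dataColumns.filterNot(partitionSet.contains)
{code}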

  was:
In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema of the data excluding partition columns is computed twice, in slightly different ways:

 
{code:java}
val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains) {code}
 

vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}
 

 


> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema 
> of the data excluding partition columns is computed twice, in slightly 
> different ways:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}






[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-48308:
---
Description: 
In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema of the data excluding partition columns is computed twice, in slightly different ways:

 
{code:java}
val dataColumnsWithoutPartitionCols =
  dataColumns.filterNot(partitionSet.contains) {code}
 

vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}
 

 

  was:
In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema of the data excluding partition columns is computed twice, in slightly different ways:

```
val dataColumnsWithoutPartitionCols =
  dataColumns.filterNot(partitionSet.contains)
```

vs

```
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains)
```

This should be unified.

 


> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Priority: Trivial
>
> In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema 
> of the data excluding partition columns is computed twice, in slightly 
> different ways:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  






[jira] [Created] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-48308:
--

 Summary: Unify getting data schema without partition columns in 
FileSourceStrategy
 Key: SPARK-48308
 URL: https://issues.apache.org/jira/browse/SPARK-48308
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Johan Lasperas


In [FileSourceStrategy|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191], the schema of the data excluding partition columns is computed twice, in slightly different ways:

```
val dataColumnsWithoutPartitionCols =
  dataColumns.filterNot(partitionSet.contains)
```

vs

```
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains)
```

This should be unified.

 






[jira] [Updated] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-46092:
---
Description: 
While the Parquet readers don't support reading Parquet values into larger 
Spark types, it's possible to trigger an overflow when creating a Parquet row 
group filter, which will then incorrectly skip row groups and bypass the 
exception in the reader.

Repro:
{code:java}
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect(){code}
This succeeds and returns no results. The query should either fail, if the 
Parquet reader doesn't support the upcast from int to long, or return `[0]` 
if it does.
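
A sketch of the suspected mechanism, assuming the row group filter is built against the file's INT32 column: narrowing the long literal to an int wraps it around, so a row group containing only 0 fails the pushed-down `a < -1` check and is skipped.
{code:java}
// Illustration only: Long.MaxValue truncated to Int wraps to -1, so the
// comparison flips even though 0 < Long.MaxValue holds.
val threshold: Long = Long.MaxValue
val narrowed: Int = threshold.toInt
assert(narrowed == -1)
assert(0L < threshold && !(0 < narrowed))
{code}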

  was:
While the Parquet readers don't support reading Parquet values into larger Spark types, it's possible to trigger an overflow when creating a Parquet row group filter, which will then incorrectly skip row groups and bypass the exception in the reader.

Repro:

```
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```

This succeeds and returns no results. The query should either fail, if the Parquet reader doesn't support the upcast from int to long, or return `[0]` if it does.


> Overflow in Parquet row group filter creation causes incorrect results
> --
>
> Key: SPARK-46092
> URL: https://issues.apache.org/jira/browse/SPARK-46092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Johan Lasperas
>Priority: Major
>
> While the Parquet readers don't support reading Parquet values into larger 
> Spark types, it's possible to trigger an overflow when creating a Parquet row 
> group filter, which will then incorrectly skip row groups and bypass the 
> exception in the reader.
> Repro:
> {code:java}
> Seq(0).toDF("a").write.parquet(path)
> spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect(){code}
> This succeeds and returns no results. The query should either fail, if the 
> Parquet reader doesn't support the upcast from int to long, or return `[0]` 
> if it does.






[jira] [Created] (SPARK-46092) Overflow in Parquet row group filter creation causes incorrect results

2023-11-24 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-46092:
--

 Summary: Overflow in Parquet row group filter creation causes 
incorrect results
 Key: SPARK-46092
 URL: https://issues.apache.org/jira/browse/SPARK-46092
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Johan Lasperas


While the Parquet readers don't support reading Parquet values into larger Spark types, it's possible to trigger an overflow when creating a Parquet row group filter, which will then incorrectly skip row groups and bypass the exception in the reader.

Repro:

```
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```

This succeeds and returns no results. The query should either fail, if the Parquet reader doesn't support the upcast from int to long, or return `[0]` if it does.






[jira] [Created] (SPARK-44026) Update helper methods to create SQLMetric with initial value

2023-06-12 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-44026:
--

 Summary: Update helper methods to create SQLMetric with initial 
value
 Key: SPARK-44026
 URL: https://issues.apache.org/jira/browse/SPARK-44026
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Johan Lasperas


The helper methods in [SQLMetrics.scala|https://github.com/apache/spark/blob/7107742a381cde2e6de9425e3e436282a8c0d27c/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala#L38] all create metrics with a fixed initial value of `-1`. Callers may want the metric to start with a different initial value, as sketched below.
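
A minimal sketch of the change, assuming the existing `createMetric` helper in SQLMetrics.scala; the `initValue` parameter name is illustrative:
{code:java}
// Hypothetical variant: expose the initial value as a parameter, keeping
// the current default of -1 so existing callers are unaffected.
def createMetric(sc: SparkContext, name: String, initValue: Long = -1): SQLMetric = {
  val acc = new SQLMetric(SUM_METRIC, initValue)
  acc.register(sc, name = Some(name), countFailedValues = false)
  acc
}
{code}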






[jira] [Created] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`

2023-05-12 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-43487:
--

 Summary: Wrong error message used for 
`ambiguousRelationAliasNameInNestedCTEError`
 Key: SPARK-43487
 URL: https://issues.apache.org/jira/browse/SPARK-43487
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Johan Lasperas


The batch of errors migrated to error classes as part of SPARK-40540 contains 
an error that got mixed up with the wrong error message:

[ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] uses the same error message as the following commandUnsupportedInV2TableError:

```
WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2;

AnalysisException: t is not supported for v2 tables
```

The error should be:

```
AnalysisException: Name t is ambiguous in nested CTE.
Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name
defined in inner CTE takes precedence. If set it to LEGACY, outer CTE
definitions will take precedence. See more details in SPARK-28228.
```






[jira] [Updated] (SPARK-43487) Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`

2023-05-12 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-43487:
---
Description: 
The batch of errors migrated to error classes as part of SPARK-40540 contains 
an error that got mixed up with the wrong error message:

[ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983]
 uses the same error message as the following commandUnsupportedInV2TableError:

 
{code:java}
WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2;
AnalysisException: t is not supported for v2 tables
{code}
The error should be:
{code:java}
AnalysisException: Name t is ambiguous in nested CTE.
Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name 
defined in inner CTE takes precedence. If set it to LEGACY, outer CTE 
definitions will take precedence. See more details in SPARK-28228.{code}
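
A sketch of the intended fix, reusing the method name from the linked commit; the error-class plumbing is elided and the message is built inline purely for illustration:
{code:java}
// The ambiguous-CTE error should carry its own message rather than the
// v2-table message used by commandUnsupportedInV2TableError.
def ambiguousRelationAliasNameInNestedCTEError(name: String): Throwable = {
  new AnalysisException(
    s"Name $name is ambiguous in nested CTE. " +
      "Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name " +
      "defined in inner CTE takes precedence. If set it to LEGACY, outer CTE " +
      "definitions will take precedence. See more details in SPARK-28228.")
}
{code}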

  was:
The batch of errors migrated to error classes as part of SPARK-40540 contains an error that got mixed up with the wrong error message:

[ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983] uses the same error message as the following commandUnsupportedInV2TableError:

```
WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2;

AnalysisException: t is not supported for v2 tables
```

The error should be:

```
AnalysisException: Name t is ambiguous in nested CTE.
Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name
defined in inner CTE takes precedence. If set it to LEGACY, outer CTE
definitions will take precedence. See more details in SPARK-28228.
```


> Wrong error message used for `ambiguousRelationAliasNameInNestedCTEError`
> -
>
> Key: SPARK-43487
> URL: https://issues.apache.org/jira/browse/SPARK-43487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Johan Lasperas
>Priority: Minor
>
> The batch of errors migrated to error classes as part of SPARK-40540 contains 
> an error that got mixed up with the wrong error message:
> [ambiguousRelationAliasNameInNestedCTEError|https://github.com/apache/spark/commit/43a6b932759865c45ccf36f3e9cf6898c1b762da#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56R1983]
>  uses the same error message as the following 
> commandUnsupportedInV2TableError:
>  
> {code:java}
> WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2;
> AnalysisException: t is not supported for v2 tables
> {code}
> The error should be:
> {code:java}
> AnalysisException: Name t is ambiguous in nested CTE.
> Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name 
> defined in inner CTE takes precedence. If set it to LEGACY, outer CTE 
> definitions will take precedence. See more details in SPARK-28228.{code}






[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField

2023-04-20 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-43217:
---
Description: 
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:
{code:java}
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
{code}
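
A sketch of the intended recursion, illustrative only (the real method also threads resolvers and error context): follow the synthetic `element`, `key` and `value` steps through nested arrays and maps instead of requiring a struct child at every level.
{code:java}
// Minimal standalone illustration of recursing through nested collections.
def find(dt: DataType, path: Seq[String]): Option[DataType] = (dt, path) match {
  case (t, Seq()) => Some(t)
  case (ArrayType(et, _), "element" +: rest) => find(et, rest)
  case (MapType(kt, _, _), "key" +: rest) => find(kt, rest)
  case (MapType(_, vt, _), "value" +: rest) => find(vt, rest)
  case (s: StructType, name +: rest) =>
    s.fields.find(_.name == name).flatMap(f => find(f.dataType, rest))
  case _ => None
}
{code}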
 

  was:
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:
{code:java}
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
{code}
 


> Correctly recurse into maps of maps and arrays of arrays in 
> StructType.findNestedField
> --
>
> Key: SPARK-43217
> URL: https://issues.apache.org/jira/browse/SPARK-43217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Johan Lasperas
>Priority: Minor
>
> [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] 
> is unable to reach nested fields below two directly nested maps or arrays. 
> Whenever it reaches a map or an array, it throws an `invalidFieldName` 
> exception if the child is not a struct.
> The following throws 'Field name `a`.`element`.`element`.`i` is invalid: 
> `a`.`element`.`element` is not a struct.', even though the access path is 
> valid:
> {code:java}
> val schema = new StructType()
>   .add("a", ArrayType(ArrayType(
>     new StructType().add("i", "int"))))
> schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
> {code}
>  






[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField

2023-04-20 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-43217:
---
Description: 
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:
{code:java}
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
{code}
 

  was:
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:
{code:java}
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
{code}
 


> Correctly recurse into maps of maps and arrays of arrays in 
> StructType.findNestedField
> --
>
> Key: SPARK-43217
> URL: https://issues.apache.org/jira/browse/SPARK-43217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Johan Lasperas
>Priority: Minor
>
> [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] 
> is unable to reach nested fields below two directly nested maps or arrays. 
> Whenever it reaches a map or an array, it throws an `invalidFieldName` 
> exception if the child is not a struct.
> The following throws 'Field name `a`.`element`.`element`.`i` is invalid: 
> `a`.`element`.`element` is not a struct.', even though the access path is 
> valid:
> {code:java}
> val schema = new StructType()
>   .add("a", ArrayType(ArrayType(
>     new StructType().add("i", "int"))))
> schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
> {code}
>  






[jira] [Updated] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField

2023-04-20 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-43217:
---
Description: 
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:
{code:java}
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
{code}
 

  was:
[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:

```
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
```


> Correctly recurse into maps of maps and arrays of arrays in 
> StructType.findNestedField
> --
>
> Key: SPARK-43217
> URL: https://issues.apache.org/jira/browse/SPARK-43217
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Johan Lasperas
>Priority: Minor
>
> [StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] 
> is unable to reach nested fields below two directly nested maps or arrays. 
> Whenever it reaches a map or an array, it throws an `invalidFieldName` 
> exception if the child is not a struct.
> The following throws 'Field name `a`.`element`.`element`.`i` is invalid: 
> `a`.`element`.`element` is not a struct.', even though the access path is 
> valid:
> {code:java}
> val schema = new StructType()
>   .add("a", ArrayType(ArrayType(
>     new StructType().add("i", "int"))))
> schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
> {code}
>  






[jira] [Created] (SPARK-43217) Correctly recurse into maps of maps and arrays of arrays in StructType.findNestedField

2023-04-20 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-43217:
--

 Summary: Correctly recurse into maps of maps and arrays of arrays 
in StructType.findNestedField
 Key: SPARK-43217
 URL: https://issues.apache.org/jira/browse/SPARK-43217
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Johan Lasperas


[StructType.findNestedField|https://github.com/apache/spark/blob/db2625c70a8c3aff64e6a9466981c8dd49a4ca51/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala#L325] is unable to reach nested fields below two directly nested maps or arrays. Whenever it reaches a map or an array, it throws an `invalidFieldName` exception if the child is not a struct.

The following throws 'Field name `a`.`element`.`element`.`i` is invalid: `a`.`element`.`element` is not a struct.', even though the access path is valid:

```
val schema = new StructType()
  .add("a", ArrayType(ArrayType(
    new StructType().add("i", "int"))))
schema.findNestedField(Seq("a", "element", "element", "i"), includeCollections = true)
```






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: 
A first step towards allowing file format implementations to inject custom 
metadata fields into plans is to make the handling of metadata attributes in 
`FileSourceStrategy` more generic.

Today in `FileSourceStrategy`, the lists of constant and generated metadata 
fields are built manually: known generated fields are checked for by name, and 
all remaining fields are treated as constant metadata fields. Instead, we need 
a way of declaring metadata fields as generated or constant directly in 
`FileFormat` and of propagating that information to `FileSourceStrategy`, as 
sketched below.
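
A hedged sketch of one possible shape for this declaration; the helper names and the metadata key are hypothetical, not an existing Spark API:
{code:java}
// Hypothetical: FileFormat tags each metadata StructField as "constant"
// or "generated", and FileSourceStrategy partitions attributes on that
// tag instead of hard-coding the known generated field names.
val METADATA_KIND_KEY = "__file_source_metadata_kind"

def constantMetadataField(name: String, dataType: DataType): StructField =
  StructField(name, dataType, nullable = false,
    new MetadataBuilder().putString(METADATA_KIND_KEY, "constant").build())

def isGeneratedMetadataField(field: StructField): Boolean =
  field.metadata.contains(METADATA_KIND_KEY) &&
    field.metadata.getString(METADATA_KIND_KEY) == "generated"
{code}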

 

  was:
A first step towards allowing file format implementations to inject custom 
metadata columns into plans is to make the handling of metadata attributes in 
`FileSourceStrategy` more generic.

Today in `FileSourceStrategy`, the lists of constant and generated metadata 
columns are built manually: known generated columns are checked for by name, 
and all remaining columns are treated as constant metadata columns. Instead, 
we need a way of declaring metadata columns as generated or constant directly 
in `FileFormat` and of propagating that information to `FileSourceStrategy`.

 


> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> A first step towards allowing file format implementations to inject custom 
> metadata fields into plans is to make the handling of metadata attributes in 
> `FileSourceStrategy` more generic.
> Today in `FileSourceStrategy`, the lists of constant and generated metadata 
> fields are built manually: known generated fields are checked for by name, and 
> all remaining fields are treated as constant metadata fields. Instead, we need 
> a way of declaring metadata fields as generated or constant directly in 
> `FileFormat` and of propagating that information to `FileSourceStrategy`.
>  






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: 
A first step towards allowing file format implementations to inject custom 
metadata columns into plans is to make the handling of metadata attributes in 
`FileSourceStrategy` more generic.

Today in `FileSourceStrategy`, the lists of constant and generated metadata 
columns are built manually: known generated columns are checked for by name, 
and all remaining columns are treated as constant metadata columns. Instead, 
we need a way of declaring metadata columns as generated or constant directly 
in `FileFormat` and of propagating that information to `FileSourceStrategy`.

 

  was:A first step towards allowing file format implementations to inject 
custom metadata columns into plans is to make handling of metadata attributes 
in `FileSourceStrategy` more generic.


> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> A first step towards allowing file format implementations to inject custom 
> metadata columns into plans is to make the handling of metadata attributes in 
> `FileSourceStrategy` more generic.
> Today in `FileSourceStrategy`, the lists of constant and generated metadata 
> columns are built manually: known generated columns are checked for by name, 
> and all remaining columns are treated as constant metadata columns. Instead, 
> we need a way of declaring metadata columns as generated or constant directly 
> in `FileFormat` and of propagating that information to `FileSourceStrategy`.
>  






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: A first step towards allowing file format implementations to 
inject custom metadata columns into plans is to make handling of metadata 
attributes in `FileSourceStrategy` more generic.  (was: A first step towards 
allowing file format implementations to inject custom metadata columns into 
plans is to make )

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> A first step towards allowing file format implementations to inject custom 
> metadata columns into plans is to make handling of metadata attributes in 
> `FileSourceStrategy` more generic.






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: A first step towards allowing file format implementations to 
inject custom metadata columns into plans is to   (was: As a follow-up on 
https://issues.apache.org/jira/browse/SPARK-41791 that introduced 
`FileSourceConstantMetadataAttribute` and 
`FileSourceGeneratedMetadataAttribute` to handle constant and generated 
metadata columns, we may want to introduce corresponding abstractions for 
struct fields to allow creating metadata fields more easily.)

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> A first step towards allowing file format implementations to inject custom 
> metadata columns into plans is to 






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: A first step towards allowing file format implementations to 
inject custom metadata columns into plans is to make   (was: A first step 
towards allowing file format implementations to inject custom metadata columns 
into plans is to )

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> A first step towards allowing file format implementations to inject custom 
> metadata columns into plans is to make 






[jira] [Updated] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Summary: Generalize handling of metadata attributes in FileSourceStrategy  
(was: Introduce abstractions to create constant and generated metadata fields)

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that 
> introduced `FileSourceConstantMetadataAttribute` and 
> `FileSourceGeneratedMetadataAttribute` to handle constant and generated 
> metadata columns, we may want to introduce corresponding abstractions for 
> struct fields to allow creating metadata fields more easily.






[jira] [Updated] (SPARK-42918) Introduce abstractions to create constant and generated metadata fields

2023-03-24 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-42918:
---
Description: As a follow-up on 
https://issues.apache.org/jira/browse/SPARK-41791 that introduced 
`FileSourceConstantMetadataAttribute` and 
`FileSourceGeneratedMetadataAttribute` to handle constant and generated 
metadata columns, we may want to introduce corresponding abstractions for 
struct fields to allow creating metadata fields more easily.  (was: As a 
follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that introduced 
`FileSourceConstantMetadataAttribute` and 
`FileSourceGeneratedMetadataStructField` to handle constant and generated 
metadata columns, we may want to introduce corresponding abstractions for 
struct fields to allow creating metadata fields more easily.)

> Introduce abstractions to create constant and generated metadata fields
> ---
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
>
> As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that 
> introduced `FileSourceConstantMetadataAttribute` and 
> `FileSourceGeneratedMetadataAttribute` to handle constant and generated 
> metadata columns, we may want to introduce corresponding abstractions for 
> struct fields to allow creating metadata fields more easily.






[jira] [Created] (SPARK-42918) Introduce abstractions to create constant and generated metadata fields

2023-03-24 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-42918:
--

 Summary: Introduce abstractions to create constant and generated 
metadata fields
 Key: SPARK-42918
 URL: https://issues.apache.org/jira/browse/SPARK-42918
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.4.1
Reporter: Johan Lasperas


As a follow-up on https://issues.apache.org/jira/browse/SPARK-41791 that 
introduced `FileSourceConstantMetadataAttribute` and 
`FileSourceGeneratedMetadataStructField` to handle constant and generated 
metadata columns, we may want to introduce corresponding abstractions for 
struct fields to allow creating metadata fields more easily.






[jira] [Updated] (SPARK-40921) Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command

2022-10-26 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-40921:
---
Target Version/s:   (was: 3.4.0)

> Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command
> ---
>
> Key: SPARK-40921
> URL: https://issues.apache.org/jira/browse/SPARK-40921
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Johan Lasperas
>Priority: Major
>
> The MERGE INTO syntax in Spark allows two types of WHEN clause:
>  * WHEN MATCHED: Specify optional condition and actions (delete or update) to 
> apply to rows that satisfy the merge condition.
>  * WHEN NOT MATCHED: Specify optional condition and actions (insert) to apply 
> to rows from the source table that don't satisfy the match condition.
> Other products also offer a third type of WHEN clause:
>  * WHEN NOT MATCHED BY SOURCE: Specify optional condition and actions (delete 
> or update) to apply to rows from the target table that don't satisfy the 
> merge condition.
> See for example [T-SQL Merge 
> Documentation|https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16]
> Example:
> {code:java}
> MERGE INTO target
> USING source
> ON target.key = source.key
> WHEN MATCHED THEN UPDATE SET *
> WHEN NOT MATCHED THEN INSERT *
> WHEN NOT MATCHED BY SOURCE THEN DELETE {code}
>  






[jira] [Created] (SPARK-40921) Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO command

2022-10-26 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-40921:
--

 Summary: Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO 
command
 Key: SPARK-40921
 URL: https://issues.apache.org/jira/browse/SPARK-40921
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Johan Lasperas


The MERGE INTO syntax in Spark allows two types of WHEN clause:
 * WHEN MATCHED: Specify optional condition and actions (delete or update) to 
apply to rows that satisfy the merge condition.
 * WHEN NOT MATCHED: Specify optional condition and actions (insert) to apply 
to rows from the source table that don't satisfy the match condition.

Other products also offer a third type of WHEN clause:
 * WHEN NOT MATCHED BY SOURCE: Specify optional condition and actions (delete 
or update) to apply to rows from the target table that don't satisfy the merge 
condition.

See for example [T-SQL Merge 
Documentation|https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16]

Example:
{code:java}
MERGE INTO target
USING source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE {code}
 


