
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-09-20 Thread Joost Farla (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607270#comment-17607270
 ] 

Joost Farla commented on SPARK-34805:
-

[~cloud_fan] I ran into the exact same issue using Spark v3.3.0. It looks like 
the fix was merged into the 3.3 branch (on March 21st) but was not included in 
the v3.3.0 release, and it is not mentioned in the release notes. Is that 
possible? Thanks in advance!

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  
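
The lookup that the description above relies on can be sketched without a 
Spark cluster. This is a minimal sketch, assuming the schema is available in 
the JSON form that PySpark's StructType.jsonValue() produces (nested dicts 
with "fields", "name", "type" and "metadata" keys). The helper name 
nested_field_metadata and the sample schema are hypothetical and are not taken 
from the attachments. It reads the metadata from the original schema by dotted 
path, which gives a reference value to compare against the schema of the 
selected DataFrame.

```python
def nested_field_metadata(schema_json, dotted_path):
    """Return the metadata dict of the field at dotted_path.

    schema_json is assumed to be the dict form of a struct schema,
    as returned by StructType.jsonValue() in PySpark.
    """
    fields = schema_json["fields"]
    field = None
    for name in dotted_path.split("."):
        if fields is None:
            # The previous path element was not a struct, so it has no children.
            raise KeyError(f"{name!r} has no parent struct in {dotted_path!r}")
        field = next((f for f in fields if f["name"] == name), None)
        if field is None:
            raise KeyError(f"no field named {name!r} in {dotted_path!r}")
        dtype = field["type"]
        # Struct types are nested dicts; primitive types are plain strings.
        is_struct = isinstance(dtype, dict) and dtype.get("type") == "struct"
        fields = dtype["fields"] if is_struct else None
    return field.get("metadata", {})


# Hypothetical schema mirroring the names used in the issue description.
SCHEMA = {
    "type": "struct",
    "fields": [
        {
            "name": "Field0",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "name": "SubField0",
                        "type": "string",
                        "nullable": True,
                        "metadata": {"comment": "set on the nested field"},
                    }
                ],
            },
        }
    ],
}

print(nested_field_metadata(SCHEMA, "Field0.SubField0"))
```

On an affected version, comparing this value with 
df.select('Field0.SubField0').schema.fields[0].metadata would be expected to 
show the mismatch reported in this issue.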



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-01-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479989#comment-17479989
 ] 

Apache Spark commented on SPARK-34805:
--

User 'kevinwallimann' has created a pull request for this issue:
https://github.com/apache/spark/pull/35270


[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2022-01-19 Thread Kevin Wallimann (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478610#comment-17478610
 ] 

Kevin Wallimann commented on SPARK-34805:
-

The problem happens in Scala as well. I attached a Scala file, 
[^nested_columns_metadata.scala], to demonstrate the issue. I tried it in the 
spark-shell of versions 2.4.7, 3.1.2 and 3.2.0, always with the same result. 
This behavior is a bug, because the documentation for {{StructField}} clearly 
says that the "metadata should be preserved during transformation if the 
content of the column is not modified, e.g, in selection".


[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2021-04-30 Thread Michel Trottier-McDonald (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337491#comment-17337491
 ] 

Michel Trottier-McDonald commented on SPARK-34805:
--

I believe this is not a PySpark-specific issue. We have a unit test in 
[transmogrif.ai|https://transmogrif.ai/] where we specify [column 
metadata 
manually|https://github.com/salesforce/TransmogrifAI/blob/90a0f298f14506a27c84a71de414d53a30cf687f/core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala#L137]
 and check whether the metadata is properly passed on to a model that consumes 
this column. The metadata is correctly attached to the column using 
{{.as(columnName, metadata)}}, but is lost as soon as the select is 
executed. I've traced the issue to the changes in {{ExpressionEncoder}}:
 * In Spark 2.4, [it takes a schema 
argument|https://github.com/apache/spark/blob/e89526d2401b3a04719721c923a6f630e555e286/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L222]
 through which the column metadata is passed along
 * In Spark 3.0, [it no longer 
takes|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L232]
 this schema parameter, and it seems that the column metadata is lost as a result

I can't tell whether this was intentional, but it renders the metadata 
argument of the {{.as}} 
[method|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1133]
 in {{Column}} mostly useless.

