[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607270#comment-17607270 ]

Joost Farla commented on SPARK-34805:

[~cloud_fan] I was running into the exact same issue using Spark v3.3.0. It looks like the fix was merged into the 3.3 branch (on March 21st), but was not yet released as part of v3.3. It is also not mentioned in the release notes. Is that possible? Thanks in advance!

> PySpark loses metadata in DataFrame fields when selecting nested columns
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1, 3.1.1
> Reporter: Mark Ressler
> Priority: Major
> Fix For: 3.3.0
> Attachments: jsonMetadataTest.py, nested_columns_metadata.scala
>
> For a DataFrame schema with nested StructTypes, where metadata is set for
> fields in the schema, that metadata is lost when a DataFrame selects nested
> fields. For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in
> the DataFrame and "SubField0" is the name of the first nested field under
> "Field0".

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
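To check whether a given Spark build is affected, the lookup from the issue description, {{df.schema.fields[0].dataType.fields[0].metadata}}, can be done by field name instead of by position. The following is a minimal sketch and is not taken from the issue or its attachments. It assumes the schema is available in the dict form returned by PySpark's {{StructType.jsonValue()}}. The helper name {{nested_field_metadata}} and the sample metadata are hypothetical, and field names that themselves contain dots are not handled.

```python
from typing import Any, Dict


def nested_field_metadata(schema_json: Dict[str, Any], path: str) -> Dict[str, Any]:
    """Return the metadata of the field at a dotted path such as
    'Field0.SubField0', walking a schema in StructType.jsonValue() form."""
    node: Any = schema_json
    field: Dict[str, Any] = {}
    for name in path.split("."):
        # Only struct nodes have named child fields to descend into.
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise KeyError(f"{name!r} is not inside a struct")
        matches = [f for f in node["fields"] if f["name"] == name]
        if not matches:
            raise KeyError(f"no field named {name!r}")
        field = matches[0]
        node = field["type"]
    return field.get("metadata", {})


# Hypothetical sample schema mirroring the names used in the issue description.
schema_json = {
    "type": "struct",
    "fields": [
        {
            "name": "Field0",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "name": "SubField0",
                        "type": "string",
                        "nullable": True,
                        "metadata": {"units": "m"},  # made-up metadata
                    }
                ],
            },
        }
    ],
}

print(nested_field_metadata(schema_json, "Field0.SubField0"))  # prints {'units': 'm'}
```

With a real DataFrame, the value of {{nested_field_metadata(df.schema.jsonValue(), 'Field0.SubField0')}} can be compared against {{df.select('Field0.SubField0').schema.fields[0].metadata}}. According to the report, the second expression returns an empty dictionary on affected versions.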
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479989#comment-17479989 ]

Apache Spark commented on SPARK-34805:

User 'kevinwallimann' has created a pull request for this issue:
https://github.com/apache/spark/pull/35270
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478610#comment-17478610 ]

Kevin Wallimann commented on SPARK-34805:

The problem happens in Scala as well. I attached a scala file [^nested_columns_metadata.scala] to demonstrate the issue. I tried it in the spark-shell of versions 2.4.7, 3.1.2 and 3.2.0, always with the same result. This behavior is a bug, because the documentation for {{StructField}} clearly says that the "metadata should be preserved during transformation if the content of the column is not modified, e.g, in selection"
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337491#comment-17337491 ]

Michel Trottier-McDonald commented on SPARK-34805:

I believe this is not a PySpark-specific issue. We have a unit test in [transmogrif.ai|https://transmogrif.ai/] where we specify [column metadata manually|https://github.com/salesforce/TransmogrifAI/blob/90a0f298f14506a27c84a71de414d53a30cf687f/core/src/test/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerTest.scala#L137] and check whether the metadata is properly passed on to a model that consumes this column. The column metadata is properly attached to the column using {{.as(columnName, metadata)}}, but is immediately lost once the select is executed. I've traced the issue to the changes in {{ExpressionEncoder}}:

* In Spark 2.4, [it takes a schema argument|https://github.com/apache/spark/blob/e89526d2401b3a04719721c923a6f630e555e286/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L222] through which the column metadata is passed along
* In Spark 3.0, [it no longer takes|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L232] this schema parameter and it seems like the column metadata is lost as a result

I can't tell if this was intentional or not, but it renders the metadata argument of the {{.as}} [method|https://github.com/apache/spark/blob/39889df32a7a916d826e255fda6fc62e2a3d7971/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1133] in {{Column}} mostly useless.