[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-16483:
-------------------------------------
    Target Version/s: 2.1.0

> Unifying struct fields and columns
> ----------------------------------
>
>         Key: SPARK-16483
>         URL: https://issues.apache.org/jira/browse/SPARK-16483
>     Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Simeon Simeonov
>      Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of
> the usual JIRA/dev list channels.
>
> DataFrame provides a full set of manipulation operations for top-level
> columns. They can be added, removed, modified and renamed. The same is not
> true of fields inside structs. Yet, from a logical standpoint, Spark users
> may very well want to perform the same operations on struct fields,
> especially since automatic schema discovery from JSON input tends to create
> deeply nested structs.
>
> Common use-cases include:
> - Remove and/or rename struct field(s) to adjust the schema
> - Fix a data quality issue with a struct field (update/rewrite)
>
> Doing this by hand with the existing API requires manually calling
> {{named_struct}} and listing all fields, including ones we don't want to
> manipulate. This leads to complex, fragile code that cannot survive schema
> evolution.
>
> It would be far better if the various APIs that can now manipulate top-level
> columns were extended to handle struct fields at arbitrary locations or,
> alternatively, if we introduced new APIs for modifying any field in a
> dataframe, whether it is a top-level one or one nested inside a struct.
>
> Purely for discussion purposes, here is the skeleton implementation of an
> update() implicit that we've used to modify any existing field in a dataframe.
> (Note that it depends on various other utilities and implicits that are not
> included.)
https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66
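
To make the request concrete, here is a minimal sketch of the kind of path-based update the issue asks for. It is plain Python over nested dicts standing in for rows and structs. It is not Spark code and not the implementation in the gist; the name `update_field` and the dotted-path convention are assumptions chosen for illustration. The point is that the caller names only the field to change, and every sibling field is carried over without being listed.

```python
from typing import Any, Callable


def update_field(record: dict, path: str, fn: Callable[[Any], Any]) -> dict:
    """Return a copy of `record` with the field at dotted `path` set to fn(old value).

    Hypothetical helper for illustration only; nested dicts model structs.
    """
    head, _, rest = path.partition(".")
    if head not in record:
        raise KeyError(f"no such field: {head}")
    # Shallow copy: sibling fields are carried over without being named.
    updated = dict(record)
    if rest:
        child = record[head]
        if not isinstance(child, dict):
            raise TypeError(f"{head} is not a struct")
        # Recurse into the nested struct, rebuilding only the path we touch.
        updated[head] = update_field(child, rest, fn)
    else:
        updated[head] = fn(record[head])
    return updated


# A row with a struct nested two levels deep, as JSON schema discovery might produce.
row = {"id": 1, "user": {"name": "Ada", "address": {"city": "london", "zip": "N1"}}}

# Fix a data quality issue in one nested field; no other field is mentioned.
fixed = update_field(row, "user.address.city", str.title)
```

Because the helper never enumerates sibling fields, adding a new field to `address` later would not require changing the call, which is the schema-evolution property the description says the manual {{named_struct}} approach lacks.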