[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023920#comment-17023920 ]
fqaiser94 commented on SPARK-16483:
-----------------------------------

This is very similar to [SPARK-22231|https://issues.apache.org/jira/browse/SPARK-22231], where there has been more discussion.

> Unifying struct fields and columns
> ----------------------------------
>
>                 Key: SPARK-16483
>                 URL: https://issues.apache.org/jira/browse/SPARK-16483
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels.
>
> DataFrame provides a full set of manipulation operations for top-level columns. They can be added, removed, modified, and renamed. The same is not yet true for fields inside structs, although from a logical standpoint Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs.
>
> Common use cases include:
> - Removing and/or renaming struct field(s) to adjust the schema
> - Fixing a data quality issue with a struct field (update/rewrite)
>
> Doing this by hand with the existing API requires manually calling {{named_struct}} and listing all fields, including the ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution (the first sketch below shows the problem concretely).
>
> It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a DataFrame, whether it is a top-level one or one nested inside a struct.
>
> Purely for discussion purposes (overloaded methods are not shown):
>
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>
>   // matches Dataset.schema semantics
>   def schema: StructType
>
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily,
>   // saving cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
>
> class Dataset[T] ... {
>   // ...
>
>   // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
>
> The benefit of the above API is that it unifies manipulating top-level and nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That is also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining). Another way to think of {{Dataset.cast}} is as the Spark schema equivalent of {{Dataset.as}}: {{as}} is to {{cast}} as a Scala-encodable type is to a {{StructType}} instance.
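To make the pain point concrete, here is a minimal, self-contained sketch of the status quo. The DataFrame, the {{person}} column, and the email-lowercasing fix are all hypothetical, chosen only to show that every sibling field must be re-listed by hand:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lower, struct}

object NestedFieldStatusQuo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-field-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical input: a struct column like those produced by JSON schema discovery.
    val df = Seq(("Ann", 34, "ANN@EXAMPLE.COM"))
      .toDF("name", "age", "email")
      .select(struct($"name", $"age", $"email").as("person"))

    // Rewriting one nested field today means rebuilding the whole struct.
    // Any sibling field we forget to re-list is silently dropped, so this
    // code breaks as soon as the schema evolves.
    val fixed = df.withColumn(
      "person",
      struct(
        $"person.name".as("name"),
        $"person.age".as("age"),
        lower($"person.email").as("email"))) // the only field we wanted to change

    fixed.printSchema()
    fixed.show(truncate = false)
  }
}
{code}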
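Until something like the proposed API lands, a schema-driven helper can at least make such rewrites survive schema evolution. This helper is hypothetical (it is not part of Spark) and only handles one level of nesting:

{code:java}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Hypothetical workaround: rebuild the struct from its own schema, swapping
// in the replacement expression, so sibling fields never need to be listed.
def withNestedField(df: DataFrame, structCol: String, fieldName: String, value: Column): DataFrame = {
  val structType = df.schema(structCol).dataType.asInstanceOf[StructType]
  val rebuilt = structType.fieldNames.map { name =>
    if (name == fieldName) value.as(name)
    else col(s"$structCol.$name").as(name)
  }
  df.withColumn(structCol, struct(rebuilt: _*))
}

// Usage, continuing the example above:
// withNestedField(df, "person", "email", lower(col("person.email")))
{code}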
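For comparison, here is how the same fix might read under the {{Column.withColumn}} and {{Column.drop}} signatures proposed in the issue description. Purely illustrative; none of these methods exist on {{Column}} today:

{code:java}
// Purely illustrative -- assumes the proposed (non-existent) Column API.
val fixed = df.withColumn("person",
  $"person".withColumn("email", lower($"person.email")))

// Dropping a nested field would be equally direct:
val trimmed = df.withColumn("person", $"person".drop("email"))
{code}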