[ https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003718#comment-17003718 ]
Farooq Qaiser commented on SPARK-30319:
---------------------------------------

I have written similar variants of this feature (using Scala's implicit-conversion technique to monkey-patch the Dataset class) across multiple organizations/codebases now and wanted to share my thoughts in case it's helpful to the discussion.

I can affirm that this would be a valuable feature to have in Spark. Without this feature, our developers would nearly always have to pair an {{as}} operation with a {{select}} operation. As such, my preference would be to change the existing Dataset {{as[T]}} method to add this strictness by default when {{T}} is a class. This would be a breaking change, but since the next version of Spark is a major release (3.0.0), this should be okay.

Also, I saw that in your PR you included eager casting of Column types. I'm not sure if this is a good idea, although I can't think of any concrete objections. In my own implementations of this feature, I've always just raised an exception if the column types don't match what's specified in {{T}}, leaving it to the developer to explicitly cast Columns to the correct types prior to using this feature.

> Adds a stricter version of as[T]
> --------------------------------
>
>                 Key: SPARK-30319
>                 URL: https://issues.apache.org/jira/browse/SPARK-30319
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Enrico Minack
>            Priority: Major
>             Fix For: 3.0.0
>
>
> The behaviour of as[T] is not intuitive when you read code like
> df.as[T].write.csv("data.csv"). The result depends on the actual schema of
> df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The
> expected behaviour is not provided elsewhere:
> * Extra columns that are not part of the type {{T}} are not dropped.
> * Order of columns is not aligned with schema of {{T}}.
> * Columns are not cast to the types of {{T}}'s fields. They have to be cast
> explicitly.
> A method that enforces the schema of T on a given Dataset would be very
> convenient and allows you to articulate and guarantee the above assumptions
> about your data with the native Spark Dataset API. This method plays a more
> explicit and enforcing role than as[T] with respect to columns, column order
> and column type.
> Possible naming of a stricter version of {{as[T]}}:
> * {{as[T](strict = true)}}
> * {{toDS[T]}} (as in {{toDF}})
> * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
> The naming {{toDS[T]}} is chosen here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
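
For readers following the discussion, here is a minimal sketch of the implicit-class approach mentioned in the comment above: select the columns of {{T}} in schema order, and raise an exception (rather than cast) when a column type does not match. The names {{StrictAs}}, {{asStrict}} and {{Person}} are hypothetical and chosen only for illustration; this is not the code from the PR and not an existing Spark API. It assumes top-level columns only, matches column names case-sensitively, and compares data types exactly, which may be stricter than what a real implementation would want.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical target type used only for this illustration.
case class Person(name: String, age: Int)

object StrictAs {
  // Hypothetical extension method; `asStrict` is not part of Spark.
  implicit class StrictDatasetOps[A](private val ds: Dataset[A]) extends AnyVal {
    def asStrict[T](implicit enc: Encoder[T]): Dataset[T] = {
      val target = enc.schema
      // Assumption: plain, case-sensitive lookup by top-level column name.
      val actual = ds.schema.fields.map(f => f.name -> f.dataType).toMap

      target.fields.foreach { f =>
        actual.get(f.name) match {
          case None =>
            throw new IllegalArgumentException(s"Missing column '${f.name}'")
          case Some(dt) if dt != f.dataType =>
            // No eager casting: the caller is expected to cast explicitly first.
            throw new IllegalArgumentException(
              s"Column '${f.name}' has type $dt, expected ${f.dataType}")
          case _ => // name and type match
        }
      }

      // Drop extra columns and align the order with T's schema, then apply as[T].
      ds.select(target.fieldNames.map(col): _*).as[T]
    }
  }
}

object StrictAsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("strict-as-sketch")
      .getOrCreate()
    import spark.implicits._
    import StrictAs._

    val df = Seq((30, "Ann", "x")).toDF("age", "name", "extra")

    // Plain as[T] keeps the schema of df: age, name, extra
    println(df.as[Person].columns.mkString(", "))

    // The sketch keeps only T's columns, in T's order: name, age
    println(df.asStrict[Person].columns.mkString(", "))

    spark.stop()
  }
}
```

With this sketch, a DataFrame whose {{age}} column is a string would fail in {{asStrict[Person]}} with an IllegalArgumentException, so the developer has to cast the column explicitly before the call.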