EnricoMi opened a new pull request #26969: Adds a stricter version of `as[T]`
URL: https://github.com/apache/spark/pull/26969

### What changes were proposed in this pull request?

Some aspects of `as[T]` are not intuitive, and the expected behaviour is not provided elsewhere:

* Extra columns that are not part of the type `T` are not dropped.
* The order of columns is not aligned with the schema of `T`.
* Columns are not cast to the types of `T`'s fields; they have to be cast explicitly.

**This PR adds a stricter version of `as[T]` to `Dataset`.**

### Why are the changes needed?

The behaviour of `as[T]` is not intuitive when you read code like `df.as[T].write.csv("data.csv")`. The result depends on the actual schema of `df`, whereas `def as[T](): Dataset[T]` should be agnostic to the schema of `df`. A method that enforces the schema of `T` on a given Dataset would be very convenient and would allow you to articulate and guarantee the above assumptions about your data with the native Spark Dataset API. This method plays a more explicit and enforcing role than `as[T]` with respect to columns, column order and column types.

### Does this PR introduce any user-facing change?

Yes, it adds a new method to `Dataset`. It does not touch the existing `as[T]`.

Possible names for a stricter version of `as[T]`:

* `as[T](strict = true)`
* `toDS[T]` (as in `toDF`)
* `selectAs[T]` (as this merely selects the columns of schema `T`)

This PR chooses the `toDS[T]` naming.

### How was this patch tested?

Existing tests for `as[T]` have been extended to assert the actual schema and emphasize the differences between `as[T]` and `toDS[T]`. Tests for `toDS[T]` are based on the `as[T]` tests.
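The strict behaviour described above can be approximated today with the existing public `Dataset` API: select exactly the columns of `T`'s encoder schema, in schema order, cast to `T`'s field types, and then call `as[T]`. The sketch below is illustrative only (the extension-method name `toDSStrict` is hypothetical, not the PR's actual implementation):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}
import org.apache.spark.sql.functions.col

object StrictAs {
  implicit class StrictDatasetOps(df: DataFrame) {
    // Hypothetical helper approximating the stricter `as[T]` proposed in the PR:
    // drops extra columns, reorders to T's schema, and casts each column
    // to the corresponding field type before converting to Dataset[T].
    def toDSStrict[T: Encoder]: Dataset[T] = {
      val schema = implicitly[Encoder[T]].schema
      val columns = schema.fields.map(f => col(f.name).cast(f.dataType))
      df.select(columns: _*).as[T]
    }
  }
}
```

With `case class Person(name: String, age: Int)` and `import StrictAs._`, a call like `df.toDSStrict[Person]` would then yield a `Dataset[Person]` whose schema matches `Person` exactly, regardless of extra columns or column order in `df`.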