EnricoMi opened a new pull request #26969: Adds a stricter version of `as[T]`
URL: https://github.com/apache/spark/pull/26969
 
 
   ### What changes were proposed in this pull request?
   Some aspects of `as[T]` are not intuitive, and the behaviour one might 
expect is not provided elsewhere:
   * Extra columns that are not part of the type `T` are not dropped.
   * The order of columns is not aligned with the schema of `T`.
   * Columns are not cast to the types of `T`'s fields. They have to be cast 
explicitly.
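
   The first two points can be seen in a small local session. This is only a 
sketch: the `Person` case class, the `AsBehaviour` object and the sample row 
are made up for illustration and are not part of this PR.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical target type, used only for illustration.
case class Person(id: Long, name: String)

object AsBehaviour {
  // Builds a DataFrame whose column order differs from Person and which
  // carries an extra column, then converts it with the existing as[T].
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    val df = Seq(("Alice", 1L, 30)).toDF("name", "id", "age")
    df.as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("as-behaviour")
      .getOrCreate()

    val ds = build(spark)

    // as[T] only attaches an encoder; the schema is still that of df,
    // so the extra column and the original column order are kept.
    println(ds.columns.mkString(", "))

    spark.stop()
  }
}
```

   Writing `ds` to CSV here would therefore emit the columns of `df`, not 
those of `Person`.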
   
   **This PR adds a stricter version of `as[T]` to `Dataset`.** 
   
   ### Why are the changes needed?
   The behaviour of `as[T]` is not intuitive when you read code like 
`df.as[T].write.csv("data.csv")`. The result depends on the actual schema of 
`df`, whereas the signature `def as[T](): Dataset[T]` suggests it should be 
agnostic to the schema of `df`.
   
   A method that enforces the schema of `T` on a given Dataset would be very 
convenient and would allow you to articulate and guarantee the above 
assumptions about your data with the native Spark Dataset API. This method 
plays a more explicit and enforcing role than `as[T]` with respect to columns, 
column order and column types.
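
   For comparison, similar enforcement can be approximated today with the 
existing API by selecting and casting the columns named in the encoder's 
schema before calling `as[T]`. The sketch below is not the implementation in 
this PR; `StrictAs`, `selectAs` and `Account` are hypothetical names, and it 
assumes top-level column names without dots.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical target type, used only for illustration.
case class Account(id: Long, owner: String)

object StrictAs {
  // Hypothetical helper, not the method added by this PR.
  // Selects exactly the fields of T, in T's order, cast to T's field types.
  def selectAs[T](df: DataFrame)(implicit enc: Encoder[T]): Dataset[T] = {
    val columns = enc.schema.fields.map { f =>
      col(f.name).cast(f.dataType).as(f.name)
    }
    // select fails with an AnalysisException if a field of T is missing in df.
    df.select(columns: _*).as[T]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("strict-as")
      .getOrCreate()
    import spark.implicits._

    // id is an Int here, the order differs from Account, and age is extra.
    val df = Seq(("Alice", 1, 30)).toDF("owner", "id", "age")

    val ds = selectAs[Account](df)
    ds.printSchema()

    spark.stop()
  }
}
```

   The explicit `select` is what drops extra columns and fixes the order; the 
`cast` aligns the column types with the fields of `T`.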
   
   ### Does this PR introduce any user-facing change?
   Yes, it adds a new method to `Dataset`. It does not touch the existing 
`as[T]`.
   
   Possible naming of a stricter version of `as[T]`:
   * `as[T](strict = true)`
   * `toDS[T]` (as in `toDF`)
   * `selectAs[T]` (as this is merely selecting the columns of schema `T`)
   
   This PR chooses the `toDS[T]` naming.
   
   ### How was this patch tested?
   Existing tests for `as[T]` have been extended to assert the actual schema 
and emphasize the differences between `as[T]` and `toDS[T]`. Tests for 
`toDS[T]` are based on the `as[T]` tests.
