Github user mahmoudmahdi24 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21944#discussion_r207156078
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1367,6 +1367,22 @@ class Dataset[T] private[sql](
         }: _*)
       }
     
    +  /**
    +   * Casts all the values of the current Dataset to the types of a given StructType.
    +   * This method also works with nested StructTypes.
    +   *
    +   *  @group typedrel
    +   *  @since 2.4.0
    +   */
    +  def castBySchema(schema: StructType): DataFrame = {
    +    assert(schema.fields.map(_.name).toList.sameElements(this.schema.fields.map(_.name).toList),
    +      "schema should have the same fields as the original schema")
    +
    +    selectExpr(schema.map(
    --- End diff --
    
    Hello @HyukjinKwon, thanks for your feedback.
    Some methods in the current API are one-liners, but they still help users discover what the API lets them do (printSchema, dtypes, and columns are good examples of that).
    Even as a one-liner, this method is helpful because it tells Spark users that they can cast a DataFrame by passing a schema.
    It addresses a very common need in the big data world: we typically parse files whose values are all strings and then have to cast all of those values according to a schema. In my case, my client provides Avro schemas that define the column names and types (see the sketch below).
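    To make this concrete, here is a minimal, self-contained sketch of the idea (the standalone `castBySchema` helper and the sample column names below are illustrative only, not the exact implementation in this PR): build one `CAST(column AS type)` expression per field of the target StructType and apply them all in a single `selectExpr` call.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types._

    object CastBySchemaSketch {
      // Build one "CAST(`col` AS type) AS `col`" expression per field of the
      // target schema and apply them all in a single selectExpr call.
      def castBySchema(df: DataFrame, schema: StructType): DataFrame =
        df.selectExpr(schema.fields.map { f =>
          s"CAST(`${f.name}` AS ${f.dataType.sql}) AS `${f.name}`"
        }: _*)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("castBySchema-sketch")
          .getOrCreate()
        import spark.implicits._

        // Typical "everything is a string" input, e.g. freshly parsed text/CSV data.
        val raw = Seq(("1", "3.14", "true")).toDF("id", "ratio", "flag")

        // Target schema, e.g. derived from an Avro schema provided by the client.
        val target = StructType(Seq(
          StructField("id", IntegerType),
          StructField("ratio", DoubleType),
          StructField("flag", BooleanType)))

        castBySchema(raw, target).printSchema()
        // root
        //  |-- id: integer (nullable = true)
        //  |-- ratio: double (nullable = true)
        //  |-- flag: boolean (nullable = true)

        spark.stop()
      }
    }
    ```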
    I searched for a method that applies a schema to a DataFrame but couldn't find one (https://stackoverflow.com/questions/51561715/cast-values-of-a-spark-dataframe-using-a-defined-structtype/51562763#51562763).
    Even when searching on GitHub, I only see examples where people cast each column separately.
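    For reference, that per-column pattern looks roughly like this (reusing the all-string `raw` DataFrame from the sketch above); it gets verbose quickly and has to be kept in sync with the schema by hand:

    ```scala
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{BooleanType, DoubleType, IntegerType}

    // Cast every column one by one instead of driving the casts from a schema.
    val casted = raw
      .withColumn("id", col("id").cast(IntegerType))
      .withColumn("ratio", col("ratio").cast(DoubleType))
      .withColumn("flag", col("flag").cast(BooleanType))
    ```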


---
