GitHub user mahmoudmahdi24 opened a pull request:

    https://github.com/apache/spark/pull/21944

    [SPARK-24988][SQL]Add a castBySchema method which casts all the values of a 
DataFrame based on the DataTypes of a StructType

    ## What changes were proposed in this pull request?
    
    The main goal of this pull request is to extend the DataFrame API with a 
method that casts all the values of a DataFrame based on the DataTypes of a 
StructType.
    
    This feature is useful when we have a large DataFrame and need to apply 
many casts. Instead of casting each column individually, we only have to pass 
a StructType with the desired types to castBySchema (in real-world use, this 
schema is often provided by the client, which was my case).
    
    Here's an example of how to apply this method (say we have a DataFrame of 
strings and want to cast the values of the second column to Int):
    ```
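    // Needed for StructType, StructField, StringType and IntegerType
    import org.apache.spark.sql.types._
    // Needed for toDF; assumes a SparkSession named `spark` is in scope (as in spark-shell)
    import spark.implicits._
    
    // We start by creating the DataFrame
    val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")
    
    // We prepare the StructType describing the DataFrame we want to obtain:
    val schema = StructType(Seq(
      StructField("name", StringType, true),
      StructField("id", IntegerType, true)))
    
    // Then we simply call castBySchema (the method added by this pull request):
    val castedDf = df.castBySchema(schema)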
    ```
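    
    For readers who do not have the patch applied, a similar result can be 
approximated with the existing public API. The sketch below is only an 
illustration and not necessarily how this pull request implements castBySchema: 
the object name CastBySchemaSketch and the helper castBySchemaSketch are 
hypothetical, and it assumes every field name in the schema matches a top-level 
column of the DataFrame.
    
    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._
    
    object CastBySchemaSketch {
      // Hypothetical helper (not the PR's code): select each field named in the
      // schema and cast it to that field's DataType. Columns absent from the
      // schema are dropped, and the schema's nullable flags are not enforced.
      def castBySchemaSketch(df: DataFrame, schema: StructType): DataFrame =
        df.select(schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
    
      def main(args: Array[String]): Unit = {
        // Local session for demonstration purposes only
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("cast-by-schema-sketch")
          .getOrCreate()
        import spark.implicits._
    
        val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")
        val schema = StructType(Seq(
          StructField("name", StringType, nullable = true),
          StructField("id", IntegerType, nullable = true)))
    
        // Print the resulting schema so the new type of "id" can be inspected
        castBySchemaSketch(df, schema).printSchema()
    
        spark.stop()
      }
    }
    ```
    
    Note that col(f.name) treats dots in a name as nested-field access, so 
column names containing dots would need backtick quoting in this sketch.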
    
    
    ## How was this patch tested?
    
    I modified DataFrameSuite to test the newly added method (castBySchema).
    I first tested the method on a simple DataFrame with a simple schema, then 
on DataFrames with complex schemas (nested StructTypes, for example).
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mahmoudmahdi24/spark SPARK-24988

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21944
    
----
commit b48819e3894e4d2f246fc2dba7db73ad5714757d
Author: mahmoud_mahdi <mahmoudmahdi24@...>
Date:   2018-08-01T14:00:22Z

    Add a castBySchema method which casts all the values of a DataFrame based 
on the DataTypes of a StructType

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
