[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

dongjoon-hyun Tue, 09 Jan 2018 10:36:07 -0800

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/20208


    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

    ## What changes were proposed in this pull request?
    
    A schema can evolve in several ways, and the following are already supported in file-based data sources.
    
       1. Add a column
       2. Remove a column
       3. Change a column position
       4. Change a column type
    
    This issue aims to guarantee backward-compatible schema evolution coverage for users of file-based data sources and to prevent future regressions by *adding schema evolution test suites explicitly*.
    
    Here, we consider only safe evolution without data loss. For example, data type evolution should go from smaller types to larger types, like `int`-to-`long`, not vice versa.
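
The "small to larger" rule can be illustrated with a minimal sketch. The type names and the widening order below are assumptions chosen for illustration, not Spark's actual rule set, and `is_safe_widening` is a hypothetical helper rather than a Spark API.

```python
# Hypothetical widening orders, for illustration only (not Spark's real rules).
INTEGRAL_ORDER = ["byte", "short", "int", "long"]
FRACTIONAL_ORDER = ["float", "double"]


def is_safe_widening(old_type, new_type):
    """Return True if old_type can be read as new_type without data loss.

    Only conversions within one family (integral or fractional) are
    considered; anything else is treated as unsafe in this sketch.
    """
    for order in (INTEGRAL_ORDER, FRACTIONAL_ORDER):
        if old_type in order and new_type in order:
            return order.index(old_type) <= order.index(new_type)
    return False


print(is_safe_widening("int", "long"))  # True: small to larger
print(is_safe_widening("long", "int"))  # False: the reverse may lose data
```

A real test suite would exercise such conversions by writing files with the old type and reading them back with the new one.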
    
    As of today, in the master branch, file-based data sources have the following schema evolution coverage.
    
    File Format | Coverage   | Note
    ----------- | ---------- | ------------------------------------------------
    TEXT        | N/A        | Schema consists of a single string column.
    CSV         | 1, 2, 4    |
    JSON        | 1, 2, 3, 4 |
    ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage.
    PARQUET     | 1, 2, 3    |
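
As a rough sketch of how the coverage table could drive test selection, the snippet below encodes the table as data and looks up which formats claim a given kind of evolution. The names `EVOLUTIONS`, `COVERAGE`, and `formats_supporting` are hypothetical and do not come from the patch.

```python
# The four kinds of evolution, numbered as in the list above.
EVOLUTIONS = {
    1: "add a column",
    2: "remove a column",
    3: "change a column position",
    4: "change a column type",
}

# Coverage per file format, transcribed from the table above.
# TEXT is N/A because its schema is a single string column.
COVERAGE = {
    "TEXT": set(),
    "CSV": {1, 2, 4},
    "JSON": {1, 2, 3, 4},
    "ORC": {1, 2, 3, 4},
    "PARQUET": {1, 2, 3},
}


def formats_supporting(evolution):
    """Return the sorted list of formats whose coverage includes the evolution."""
    return sorted(fmt for fmt, kinds in COVERAGE.items() if evolution in kinds)


for number, description in EVOLUTIONS.items():
    print(number, description, formats_supporting(number))
```

For example, `formats_supporting(4)` lists the formats for which a type-change test would be expected to pass under the table above.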
    
    
    ## How was this patch tested?
    
    Pass Jenkins with the newly added test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-SCHEMA-EVOLUTION

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20208
    
----
commit 499801e7fdd545ac5918dd5f7a9294db2d5373be
Author: Dongjoon Hyun <dongjoon@...>
Date:   2018-01-07T00:02:09Z

    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

----


---

