GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20894

    [SPARK-23786][SQL] Checking column names of csv headers

    ## What changes were proposed in this pull request?
    
    Currently, column names in the headers of CSV files are not checked against 
the provided schema of the CSV data. This can cause errors like the one shown in 
[SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786). I introduced 
a new CSV option, `checkHeader` (`true` by default), which enables checking of 
column names against the schema's field names. The check is performed while 
processing the first partition of the CSV files. If the names do not match, the 
following exception is thrown:
    
    ```
    java.lang.IllegalArgumentException: Fields in the header of csv file are not matched to field names of the schema:
     Header: depth, temperature
     Schema: temperature, depth
    ``` 
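
The comparison described above can be sketched in plain Python. This is only an illustration, not the Scala code from this patch: the function name `check_header` and its `check` parameter are hypothetical stand-ins for the `checkHeader` option, and the sketch assumes an exact, case-sensitive, position-by-position match, which the description does not spell out.

```python
from typing import Sequence


def check_header(
    header: Sequence[str],
    schema_fields: Sequence[str],
    check: bool = True,
) -> None:
    """Raise ValueError if the header names differ from the schema's field names.

    Hypothetical helper: `check` plays the role of the `checkHeader` option.
    Assumes an exact, positional comparison of names.
    """
    if not check:
        return
    if list(header) != list(schema_fields):
        raise ValueError(
            "Fields in the header of csv file are not matched to "
            "field names of the schema:\n"
            f" Header: {', '.join(header)}\n"
            f" Schema: {', '.join(schema_fields)}"
        )


# Same column names as the schema, but in a different order: rejected.
try:
    check_header(["depth", "temperature"], ["temperature", "depth"])
except ValueError as err:
    print(err)

# With the check disabled, the same mismatch is not reported.
check_header(["depth", "temperature"], ["temperature", "depth"], check=False)
```

In this sketch a reordered header is treated as a mismatch, which is the case illustrated by the exception message above: without the check, `depth` values would silently be read into the `temperature` column.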
    
    ## How was this patch tested?
    
    The changes were tested by existing tests of CSVSuite and by 2 new tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 check-column-names

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20894
    
----
commit 112ce2d34d0d039711777351c1ab8e74629fc8e6
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T15:30:44Z

    Checks column names are compatible to provided schema

commit a85ccce23c3c5ee69ff321303ad830c71dd05931
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T20:51:03Z

    Checking header is matched to schema in per-line mode

commit 75e15345b6a5a9e807375fdf465dccfce4ea62c7
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T21:36:56Z

    Extract header and check that it is matched to schema

commit 8eb45b8b634ba2c9b641de12e09f17c63240ccc4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T10:57:30Z

    Checking column names in header in multiLine mode

commit 9b1a9862531b8d3fb3cffce75126413ca9a844b9
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:13:17Z

    Adding the checkHeader option with true by default

commit 64426332b2ab42a1cd9c5a05a77e90332572bbec
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:25:31Z

    Fix csv test by changing headers or disabling header checking

commit 9440d8a5c097a1d8e111b397fbda9e54751b7a84
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:36:21Z

    Adding comment for the checkHeader option

commit 9f91ce73c5c313a9c51067a81e395e9385016ec5
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:42:48Z

    Added comments

commit 0878f7aad3c074e63ac3ab1d6e471ce8b988f278
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T12:09:20Z

    Adding a space between column names

commit a341dd79c976df59fc8bffb272449973a09b86fe
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T15:15:14Z

    Fix a test: checking name duplication in schemas

commit 98c27eaa80cf3fae11092d78f22122688e4041a4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-23T21:04:57Z

    Fixing the test and adding ticket number to test's title

commit 811df6fa7b17ff12bdd70318cf330a0f54815397
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-23T21:10:20Z

    Refactoring - removing unneeded parameter

----

