GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/20894
[SPARK-23786][SQL] Checking column names of csv headers

## What changes were proposed in this pull request?

Currently, column names in the headers of CSV files are not checked against the provided schema of the CSV data. This can cause errors like those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786). I introduced a new CSV option, `checkHeader` (`true` by default), which enables checking of column names against the schema's fields. The check is performed while processing the first partition of the CSV files. If the names do not match, the following exception is thrown:

```
java.lang.IllegalArgumentException: Fields in the header of csv file are not matched to field names of the schema:
 Header: depth, temperature
 Schema: temperature, depth
```

## How was this patch tested?

The changes were tested by the existing tests in CSVSuite and by 2 new tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 check-column-names

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20894

----
commit 112ce2d34d0d039711777351c1ab8e74629fc8e6
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T15:30:44Z

    Checks column names are compatible to provided schema

commit a85ccce23c3c5ee69ff321303ad830c71dd05931
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T20:51:03Z

    Checking header is matched to schema in per-line mode

commit 75e15345b6a5a9e807375fdf465dccfce4ea62c7
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-20T21:36:56Z

    Extract header and check that it is matched to schema

commit 8eb45b8b634ba2c9b641de12e09f17c63240ccc4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T10:57:30Z

    Checking column names in header in multiLine mode

commit 9b1a9862531b8d3fb3cffce75126413ca9a844b9
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:13:17Z

    Adding the checkHeader option with true by default

commit 64426332b2ab42a1cd9c5a05a77e90332572bbec
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:25:31Z

    Fix csv test by changing headers or disabling header checking

commit 9440d8a5c097a1d8e111b397fbda9e54751b7a84
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:36:21Z

    Adding comment for the checkHeader option

commit 9f91ce73c5c313a9c51067a81e395e9385016ec5
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T11:42:48Z

    Added comments

commit 0878f7aad3c074e63ac3ab1d6e471ce8b988f278
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T12:09:20Z

    Adding a space between column names

commit a341dd79c976df59fc8bffb272449973a09b86fe
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T15:15:14Z

    Fix a test: checking name duplication in schemas

commit 98c27eaa80cf3fae11092d78f22122688e4041a4
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-23T21:04:57Z

    Fixing the test and adding ticket number to test's title

commit 811df6fa7b17ff12bdd70318cf330a0f54815397
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-23T21:10:20Z

    Refactoring - removing unneeded parameter

----
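
As a rough illustration of the header check described in this pull request, here is a minimal Python sketch. It is not the Spark implementation: the function names `check_header` and `read_csv_with_schema` and the `check_header_option` parameter are hypothetical, and exact, position-by-position string comparison is an assumption. Only the wording of the error message is taken from the description above.

```python
import csv
import io


def check_header(header, schema_fields):
    """Raise if header column names differ from schema field names, position by position."""
    # Assumption: names must match exactly and in the same order.
    if list(header) != list(schema_fields):
        raise ValueError(
            "Fields in the header of csv file are not matched to field names of the schema:\n"
            f" Header: {', '.join(header)}\n"
            f" Schema: {', '.join(schema_fields)}"
        )


def read_csv_with_schema(text, schema_fields, check_header_option=True):
    """Parse CSV text whose first line is a header, labelling values by schema position."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check or parse
    if check_header_option:
        check_header(header, schema_fields)
    # Values are assigned to schema fields by position, not by header name.
    return [dict(zip(schema_fields, row)) for row in reader]
```

With the check disabled, a file whose header reads `depth,temperature` is still parsed against a schema of `temperature, depth`, so values end up under the wrong field names without any error. With the check enabled, the mismatch is reported before any row is returned.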