[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22123
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211309642

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
--- End diff --

It seems enforceSchema always kind of "works" because it simply means the headers are ignored. What do we expect the test to verify?
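[Editor's note] A minimal sketch of the behaviour being discussed, assuming a directory `dir` containing the CSV written in the tests quoted below; the schema field names `colA`/`colB` are hypothetical and chosen so that they do not match the file header:

```
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical user-supplied schema whose field names differ from the
// file header ("columnA", "columnB").
val schema = new StructType().add("colA", StringType).add("colB", StringType)

// With enforceSchema = true (the default) the header names are simply ignored,
// so this read succeeds even though the header does not conform to the schema.
val df = spark.read
  .format("csv")
  .schema(schema)
  .option("header", true)
  .option("enforceSchema", true)
  .load(dir)
  .select("colA")
```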
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211287978

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
--- End diff --

We also need a test case checking that enforceSchema works well when column pruning is on.
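[Editor's note] A possible shape for such a test, modeled directly on the tests quoted elsewhere in this thread; the test name and the exact expectation (that the read succeeds and returns the projected value) are assumptions:

```
  test("SPARK-25134: enforceSchema with projection and column pruning (sketch)") {
    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
      withTempPath { path =>
        val dir = path.getAbsolutePath
        Seq(("a", "b")).toDF("columnA", "columnB").write
          .format("csv")
          .option("header", true)
          .save(dir)
        // With enforceSchema = true the header is not validated, so pruning down to a
        // single column should still return the data without raising an error.
        checkAnswer(spark.read
          .format("csv")
          .option("header", true)
          .option("enforceSchema", true)
          .load(dir)
          .select("columnA"),
          Row("a"))
      }
    }
  }
```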
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211283824

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -227,10 +210,9 @@ object TextInputCSVDataSource extends CSVDataSource {
       // Note: if there are only comments in the first block, the header would probably
       // be not extracted.
       CSVUtils.extractHeader(lines, parser.options).foreach { header =>
-        CSVDataSource.checkHeader(
-          header,
-          parser.tokenizer,
-          dataSchema,
+        CSVDataSource.checkHeaderColumnNames(
+          if (columnPruning) requiredSchema else dataSchema,
+          parser.tokenizer.parseLine(header),
--- End diff --

Nit: the following code style is preferred.

```
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
...
```
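[Editor's note] A self-contained illustration of the style nit, not code from the patch: `check` and the toy values below are hypothetical stand-ins. The point is to bind the argument expressions to named vals before the call instead of computing them inside the argument list:

```
// Hypothetical stand-in for the real header check, used only to show the style.
def check(schema: Seq[String], columnNames: Seq[String]): Unit =
  require(schema == columnNames, s"header $columnNames does not conform to schema $schema")

val columnPruning = true
val dataSchema = Seq("columnA", "columnB")
val requiredSchema = Seq("columnA")
val headerTokens = Seq("columnA")

// Preferred: name the intermediate values first, so the call itself reads plainly.
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = headerTokens
check(schema, columnNames)
```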
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211132019

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -230,7 +232,7 @@ object TextInputCSVDataSource extends CSVDataSource {
         CSVDataSource.checkHeader(
--- End diff --

Can we remove `CSVDataSource.checkHeader` and switch to `CSVDataSource.checkHeaderColumnNames`? It looks like `CSVDataSource.checkHeader` is overkill and makes the code hard to read.
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211081732

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,25 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
--- End diff --

Could you check a corner case when requiredSchema is empty? For example, `.option("enforceSchema", false)` + `count()`.
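[Editor's note] A possible shape for that corner-case test, modeled on the test quoted above: `count()` prunes away every column, so the required schema is empty while `enforceSchema = false` still asks for a header check. The test name and the expectation that the count succeeds without an error are assumptions:

```
  test("SPARK-25134: empty required schema with enforceSchema disabled (sketch)") {
    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
      withTempPath { path =>
        val dir = path.getAbsolutePath
        Seq(("a", "b")).toDF("columnA", "columnB").write
          .format("csv")
          .option("header", true)
          .save(dir)
        // count() reads no columns, leaving the required schema empty.
        assert(spark.read
          .format("csv")
          .option("header", true)
          .option("enforceSchema", false)
          .load(dir)
          .count() == 1)
      }
    }
  }
```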
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210801081

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
+          Row("a"))
+      }
+    }
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --

ok, will remove
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210788916

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
+          Row("a"))
+      }
+    }
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --

I think the `false` case test can be removed.
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22123

[SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error

## What changes were proposed in this pull request?

When column pruning is turned on, the CSV header check should only cover the fields in the requiredSchema, not the dataSchema, because column pruning means only the requiredSchema is read.

## How was this patch tested?

Added 2 unit tests where column pruning is turned on/off and the CSV header is checked against the schema.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata-opensource/spark feat-csv-column-pruning-and-check-header

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22123.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22123

commit dcd9ac45673af31e59dcfb633a2b87f76f2bee03
Author: Koert Kuipers
Date: 2018-08-16T15:35:16Z

    if csv column-pruning is turned on header should be checked with requiredSchema not dataSchema

commit c4179a9f0a85b412178323e6cb881385fa644051
Author: Koert Kuipers
Date: 2018-08-16T15:52:02Z

    update jira reference in unit test
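[Editor's note] To make the description concrete, an illustrative-only sketch of the idea, not the patch itself: the real check is `CSVDataSource.checkHeaderColumnNames`, which also handles case sensitivity and detailed error reporting. The point is simply that the header must be compared against whichever schema is actually read.

```
import org.apache.spark.sql.types.StructType

// Illustrative helper only: pick the schema that is actually read and compare its
// field names against the parsed header tokens.
def headerMatches(
    headerTokens: Seq[String],
    dataSchema: StructType,
    requiredSchema: StructType,
    columnPruning: Boolean): Boolean = {
  // With pruning on, only the required columns are read (per the description above),
  // so the header must be checked against requiredSchema rather than dataSchema.
  val schema = if (columnPruning) requiredSchema else dataSchema
  schema.fieldNames.length == headerTokens.length &&
    schema.fieldNames.zip(headerTokens).forall { case (f, h) => f.equalsIgnoreCase(h.trim) }
}
```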