[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22123


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211309642
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not 
conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection 
and column pruning") {
--- End diff --

it seems enforceSchema always kind of "works" because it simply means it 
ignores the headers.
what do we expect to verify in the test?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211287978
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not 
conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection 
and column pruning") {
--- End diff --

Also need a test case for checking enforceSchema works well when column 
pruning is on. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211283824
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -227,10 +210,9 @@ object TextInputCSVDataSource extends CSVDataSource {
   // Note: if there are only comments in the first block, the header 
would probably
   // be not extracted.
   CSVUtils.extractHeader(lines, parser.options).foreach { header =>
-CSVDataSource.checkHeader(
-  header,
-  parser.tokenizer,
-  dataSchema,
+CSVDataSource.checkHeaderColumnNames(
+  if (columnPruning) requiredSchema else dataSchema,
+  parser.tokenizer.parseLine(header),
--- End diff --

Nit: the following code style is preferred.
```
val schema = if (columnPruning) requiredSchema else dataSchema
val columnNames = parser.tokenizer.parseLine(header)
...
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211132019
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -230,7 +232,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 CSVDataSource.checkHeader(
--- End diff --

Can we remove `CSVDataSource.checkHeader` and switch to 
`CSVDataSource.checkHeaderColumnNames`? Looks `CSVDataSource.checkHeader` is an 
overkill and marks hard to read the code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-18 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211081732
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,25 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not 
conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection 
and column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+  withTempPath { path =>
+val dir = path.getAbsolutePath
+Seq(("a", "b")).toDF("columnA", "columnB").write
+  .format("csv")
+  .option("header", true)
+  .save(dir)
+checkAnswer(spark.read
+  .format("csv")
+  .option("header", true)
+  .option("enforceSchema", false)
+  .load(dir)
+  .select("columnA"),
--- End diff --

Could you check a corner case when required Schema is empty. For example, 
`.option("enforceSchema", false)` + `count()`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-16 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r210801081
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not 
conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection 
and column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+  withTempPath { path =>
+val dir = path.getAbsolutePath
+Seq(("a", "b")).toDF("columnA", "columnB").write
+  .format("csv")
+  .option("header", true)
+  .save(dir)
+checkAnswer(spark.read
+  .format("csv")
+  .option("header", true)
+  .option("enforceSchema", false)
+  .load(dir)
+  .select("columnA"),
+  Row("a"))
+  }
+}
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection 
and no column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --

ok will remove


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r210788916
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not 
conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection 
and column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+  withTempPath { path =>
+val dir = path.getAbsolutePath
+Seq(("a", "b")).toDF("columnA", "columnB").write
+  .format("csv")
+  .option("header", true)
+  .save(dir)
+checkAnswer(spark.read
+  .format("csv")
+  .option("header", true)
+  .option("enforceSchema", false)
+  .load(dir)
+  .select("columnA"),
+  Row("a"))
+  }
+}
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection 
and no column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --

I think `false` case test can be removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-16 Thread koertkuipers
GitHub user koertkuipers opened a pull request:

https://github.com/apache/spark/pull/22123

[SPARK-25134][SQL] Csv column pruning with checking of headers throws 
incorrect error

## What changes were proposed in this pull request?

When column pruning is turned on the checking of headers in the csv should 
only be for the fields in the requiredSchema, not the dataSchema, because 
column pruning means only requiredSchema is read.

## How was this patch tested?

Added 2 unit tests where column pruning is turned on/off and csv headers 
are checked againt schema 

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tresata-opensource/spark 
feat-csv-column-pruning-and-check-header

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22123


commit dcd9ac45673af31e59dcfb633a2b87f76f2bee03
Author: Koert Kuipers 
Date:   2018-08-16T15:35:16Z

if csv column-pruning is turned on header should be checked with 
requiredSchema not dataSchema

commit c4179a9f0a85b412178323e6cb881385fa644051
Author: Koert Kuipers 
Date:   2018-08-16T15:52:02Z

update jira reference in unit test




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org