Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21296#discussion_r187426203

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
```
@@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
       .options(Map("header" -> "true", "mode" -> "dropmalformed"))
       .load(testFile(carsFile))
-    assert(cars.select("year").collect().size === 2)
+    assert(cars.collect().size === 2)
```
--- End diff --

The `cars.csv` file has a header with 5 columns:
```
year,make,model,comment,blank
```
two rows with 4 valid columns (the last column is blank):
```
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
```
and one more row with only 3 columns:
```
2015,Chevy,Volt
```

The previous (current) implementation drops the last row in `dropmalformed` mode because it parses whole rows, and the last row is malformed. However, if only the `year` column is selected, the uniVocity parser returns values for the first column only (index `0`) and never checks the correctness of the rest of the row, so `cars.select("year").collect().size` returns `3` instead of `2`.
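The behavior described above can be sketched as follows. This is a minimal standalone sketch, not the actual test from `CSVSuite`: it assumes a local `SparkSession` and a file `cars.csv` on the local filesystem with exactly the contents shown above (the object and file names are hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Illustrates why the assertion was changed from
// `cars.select("year").collect().size` to `cars.collect().size`.
object DropMalformedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dropmalformed-demo")
      .getOrCreate()

    // Assumes "cars.csv" holds the five-column file described above.
    val cars = spark.read
      .options(Map("header" -> "true", "mode" -> "dropmalformed"))
      .csv("cars.csv")

    // Full-row scan: the 3-column row is detected as malformed
    // and dropped, per the counts stated in the comment above.
    println(cars.collect().length)                 // 2

    // Column-pruned scan: only the first column is parsed, so the
    // short row is never flagged as malformed and survives.
    println(cars.select("year").collect().length)  // 3

    spark.stop()
  }
}
```

The `select("year")` form made the original assertion insensitive to the very malformed-row dropping it was meant to verify, which is why asserting on the full `collect()` is the more robust test.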