Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21296#discussion_r187604963

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
           .options(Map("header" -> "true", "mode" -> "dropmalformed"))
           .load(testFile(carsFile))

    -    assert(cars.select("year").collect().size === 2)
    +    assert(cars.collect().size === 2)
    --- End diff --

> it's intendedly parsed to keep the backword compatibility.

Right, by selecting all columns I force *UnivocityParser* to fall into the case handled here:
https://github.com/MaxGekk/spark-1/blob/a4a0a549156a15011c33c7877a35f244d75b7a4f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L193-L213
where the number of returned tokens is less than required. In the case of `cars.select("year")`, the uniVocity parser returns only one token, as expected.

> There was an issue about the different number of counts.

The PR changes behavior for some malformed inputs, but I believe we could provide better performance for users who have correct inputs.

> I think you are basically saying cars.select("year").collect().size and cars.collect().size are different and they are correct, right?

Yes, you can say that. You are right; it seems the PR proposes another interpretation of malformed rows. `cars.select("year")` is:

```
+----+
|year|
+----+
|2012|
|1997|
|2015|
+----+
```

and we should not reject `2015` only because there are problems in columns that were not requested. In this particular case, the last row consists of only one value, at position `0`, and that value is correct.
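A minimal, self-contained sketch of the behavior being discussed (not part of the original comment): it assumes a hypothetical local file `cars.csv` whose last data row has a value only in the first (`year`) column, similar to the test fixture referenced in `CSVSuite`. With `DROPMALFORMED` mode, the count can differ depending on whether only `year` or all columns are required.

```scala
import org.apache.spark.sql.SparkSession

object DropMalformedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DropMalformedSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path; the file is assumed to contain a header row
    // and one row that is "short" (only the year value is present).
    val cars = spark.read
      .options(Map("header" -> "true", "mode" -> "dropmalformed"))
      .csv("cars.csv")

    // Only the "year" column is required here, so the short row can still
    // produce a token for that column and is kept.
    println(cars.select("year").count())

    // All columns are required here, so the short row yields fewer tokens
    // than required and is dropped under DROPMALFORMED.
    println(cars.count())

    spark.stop()
  }
}
```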