[ https://issues.apache.org/jira/browse/SPARK-29101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933822#comment-16933822 ]
Dongjoon Hyun commented on SPARK-29101:
---------------------------------------

This is backported to branch-2.4 for Apache Spark 2.4.5 via https://github.com/apache/spark/pull/25843

> CSV datasource returns incorrect .count() from file with malformed records
> --------------------------------------------------------------------------
>
>                 Key: SPARK-29101
>                 URL: https://issues.apache.org/jira/browse/SPARK-29101
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>            Reporter: Stuart White
>            Assignee: Sandeep Katta
>            Priority: Minor
>              Labels: correctness
>             Fix For: 2.4.5, 3.0.0
>
> Spark 2.4 introduced a change to the way CSV files are read. See [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] for more details.
> In that document, it states: _To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false._
> I am configuring Spark 2.4.4 as such, yet I'm still getting results inconsistent with pre-2.4. For example:
> Consider this file (fruit.csv). Notice it contains a header record, 3 valid records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> With Spark 2.1.1, if I call .count() on a DataFrame created from this file (using option DROPMALFORMED), "3" is returned.
> {noformat}
> (using Spark 2.1.1)
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
> 19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
> res1: Long = 3
> {noformat}
> With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" option to false to restore the pre-2.4 behavior for handling malformed records, then call .count() and "4" is returned.
> {noformat}
> (using Spark 2.4.4)
> scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
> res1: Long = 4
> {noformat}
> So, using the *spark.sql.csv.parser.columnPruning.enabled* option did not actually restore the previous behavior.
> How can I, using Spark 2.4+, get a count of the records in a .csv file which excludes malformed records?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
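[Editorial sketch, not from the original thread:] One workaround for the closing question, pending the 2.4.5 fix linked above, is to skip DROPMALFORMED and instead read in the default PERMISSIVE mode with an explicit schema that includes Spark's corrupt-record column (by default named _corrupt_record), then count the rows where that column is null. The snippet below is an untested spark-shell sketch of that approach; note the cache() call, which is needed because Spark 2.3+ rejects queries on an uncached DataFrame that reference only the corrupt-record column.

{noformat}
scala> import org.apache.spark.sql.types._

scala> // Explicit schema: the four data columns plus the default corrupt-record column.
scala> val schema = new StructType()
     |   .add("fruit", StringType)
     |   .add("color", StringType)
     |   .add("price", IntegerType)
     |   .add("quantity", IntegerType)
     |   .add("_corrupt_record", StringType)

scala> // cache() so the corrupt-record column can be filtered on directly.
scala> val df = spark.read.option("header", "true").schema(schema).csv("fruit.csv").cache()

scala> // Count only well-formed rows (corrupt-record column is null for them).
scala> df.filter($"_corrupt_record".isNull).count
{noformat}

For the fruit.csv example above, the malformed "xxx" line should land in _corrupt_record and be excluded, so this should count only the three well-formed records.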