[jira] [Commented] (SPARK-29101) CSV datasource returns incorrect .count() from file with malformed records

Dongjoon Hyun (Jira) Thu, 19 Sep 2019 15:26:28 -0700


    [ https://issues.apache.org/jira/browse/SPARK-29101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933822#comment-16933822 ]


Dongjoon Hyun commented on SPARK-29101:
---------------------------------------

This has been backported to branch-2.4 for Apache Spark 2.4.5 via 
https://github.com/apache/spark/pull/25843

> CSV datasource returns incorrect .count() from file with malformed records
> --------------------------------------------------------------------------
>
>                 Key: SPARK-29101
>                 URL: https://issues.apache.org/jira/browse/SPARK-29101
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>            Reporter: Stuart White
>            Assignee: Sandeep Katta
>            Priority: Minor
>              Labels: correctness
>             Fix For: 2.4.5, 3.0.0
>
>
> Spark 2.4 introduced a change to the way csv files are read.  See 
> [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
> for more details.
> In that document, it states: _To restore the previous behavior, set 
> spark.sql.csv.parser.columnPruning.enabled to false._
> I am configuring Spark 2.4.4 as such, yet I'm still getting results 
> inconsistent with pre-2.4.  For example:
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
>  
> With Spark 2.1.1, if I call .count() on a DataFrame created from this file 
> (using option DROPMALFORMED), "3" is returned.
> {noformat}
> (using Spark 2.1.1)
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
> 19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
> res1: Long = 3
> {noformat}
> With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" 
> option to false to restore the pre-2.4 behavior for handling malformed 
> records, then call .count() and "4" is returned.
> {noformat}
> (using Spark 2.4.4)
> scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
> res1: Long = 4
> {noformat}
> So, using the *spark.sql.csv.parser.columnPruning.enabled* option did not 
> actually restore previous behavior.
> How can I, using Spark 2.4+, get a count of the records in a .csv which 
> excludes malformed records?
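
For reference, the count the reporter expects can be reproduced without Spark. The following is a minimal plain-Scala sketch, not Spark's CSV parser: it assumes a record is malformed when its number of comma-separated fields differs from the header's, and the names FruitCount, fieldCount, and countWellFormed are hypothetical, introduced only for illustration.

```scala
// Plain-Scala sketch (no Spark dependency) of the count the reporter expects.
// Assumption: a record is "malformed" when its number of comma-separated
// fields differs from the header's. This is not Spark's parser.
object FruitCount {
  // Contents of fruit.csv, copied from the report.
  val lines: Seq[String] = Seq(
    "fruit,color,price,quantity",
    "apple,red,1,3",
    "banana,yellow,2,4",
    "orange,orange,3,5",
    "xxx"
  )

  // Split with limit -1 so trailing empty fields are kept.
  def fieldCount(line: String): Int = line.split(",", -1).length

  // Hypothetical helper: counts data records whose width matches the header.
  def countWellFormed(lines: Seq[String]): Long = lines match {
    case header +: records =>
      val width = fieldCount(header)
      records.count(r => fieldCount(r) == width).toLong
    case _ => 0L // no header, so no records to count
  }

  def main(args: Array[String]): Unit = {
    println(countWellFormed(lines)) // prints 3: well-formed records only
    println(lines.tail.size)        // prints 4: all data records
  }
}
```

On the fruit.csv contents above, the sketch yields 3, the value Spark 2.1.1 returned, whereas Spark 2.4.4 returned 4, the number of data records including the malformed one.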



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

