[jira] [Resolved] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

Hyukjin Kwon (JIRA) Wed, 19 Jun 2019 20:54:18 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-28079.
----------------------------------
    Resolution: Duplicate

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is 
> manually added to the schema
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28079
>                 URL: https://issues.apache.org/jira/browse/SPARK-28079
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.3
>            Reporter: F Jimenez
>            Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are read in 
> without being flagged as corrupt. The only way to get them flagged is to set 
> "columnNameOfCorruptRecord" AND manually specify a schema that includes this 
> column. Example:
> {code:java}
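> import java.io.File
> import java.nio.charset.StandardCharsets
> import org.apache.commons.io.FileUtils
>
> // Second data row has a 4th value that is not declared in the header/schema
> val csvText = """
>                  | FieldA, FieldB, FieldC
>                  | a1,b1,c1
>                  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> // Pass the charset explicitly; the two-argument overload is deprecated
> FileUtils.write(csvFile, csvText, StandardCharsets.UTF_8)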
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> +------------+------+------+------+
> |corrupt     |fieldA|fieldB|fieldC|
> +------------+------+------+------+
> |null        | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> +------------+------+------+------+
> {code}
> However, removing the schema(...) call and running:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +-------+-------+-------+
> | FieldA| FieldB| FieldC|
> +-------+-------+-------+
> | a1    |b1     |c1     |
> | a2    |b2     |c2     |
> +-------+-------+-------+
> {code}
> The fourth value "d*" in the second row has been dropped, and the row is not 
> marked as corrupt.
>  
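
The report shows that the row is only flagged when the schema passed to the reader already contains the corrupt-record column. One possible workaround, sketched here under that assumption, is to read the file twice: once to pick up the column names from the header, and once with that schema plus the extra column. The object `CorruptCsv` and the method `readCsvFlaggingCorrupt` are hypothetical names used for illustration; they are not part of Spark, and this is not necessarily how the duplicate issue was resolved.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not a Spark API): a two-pass read that appends the
// corrupt-record column to the header-derived schema.
object CorruptCsv {
  def readCsvFlaggingCorrupt(
      spark: SparkSession,
      path: String,
      corruptCol: String = "corrupt"): DataFrame = {
    // Pass 1: let Spark derive the column names from the header row.
    // Without inferSchema, every column is read as a string.
    val headerSchema = spark.read
      .option("header", "true")
      .csv(path)
      .schema

    // Append the corrupt-record column as a nullable string field.
    val withCorrupt = headerSchema.add(corruptCol, StringType, nullable = true)

    // Pass 2: re-read with the explicit schema, as in the working example
    // from the report, so malformed rows can be captured in corruptCol.
    spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", corruptCol)
      .schema(withCorrupt)
      .csv(path)
  }
}
```

Calling `CorruptCsv.readCsvFlaggingCorrupt(spark, csvFile.getAbsolutePath).show(truncate = false)` would then be the counterpart of the first example in the report. The cost is a second scan of the file header, and the corrupt column ends up last rather than first.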



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
