[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-28079.
----------------------------------
    Resolution: Duplicate

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28079
>                 URL: https://issues.apache.org/jira/browse/SPARK-28079
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.3
>            Reporter: F Jimenez
>            Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are read in
> without being flagged as corrupt. The only way to get them flagged is to set
> "columnNameOfCorruptRecord" AND to set the schema manually, including this
> column. Example:
> {code:java}
> // Imports the snippet relies on (FileUtils is assumed to be Apache Commons IO)
> import java.io.File
> import org.apache.commons.io.FileUtils
>
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
>                  | FieldA, FieldB, FieldC
>                  | a1,b1,c1
>                  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> +------------+------+------+------+
> |corrupt     |fieldA|fieldB|fieldC|
> +------------+------+------+------+
> |null        | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> +------------+------+------+------+
> {code}
> However, removing the "schema" call and running:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +-------+-------+-------+
> | FieldA| FieldB| FieldC|
> +-------+-------+-------+
> | a1    |b1     |c1     |
> | a2    |b2     |c2     |
> +-------+-------+-------+
> {code}
> The fourth value "d*" in the second row has been dropped, and the row is not
> marked as corrupt.
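
The report notes that corrupt rows are only flagged when the schema explicitly contains the "columnNameOfCorruptRecord" column. One way to avoid writing that schema by hand might be to infer it in a first pass and append the corrupt column before reading again. The following is a minimal sketch under that assumption, not something taken from the issue: the object `CorruptRecordWorkaround` and its two helpers are hypothetical names, and the behaviour may vary between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

object CorruptRecordWorkaround {
  // Hypothetical helper: append a nullable string column that will hold
  // the raw text of any row Spark considers malformed.
  def withCorruptColumn(inferred: StructType, corruptName: String): StructType =
    inferred.add(corruptName, StringType, nullable = true)

  // Hypothetical helper: read the file once to obtain the header-derived
  // schema, then read it again with the corrupt column added explicitly.
  def readFlaggingCorrupt(
      spark: SparkSession,
      path: String,
      corruptName: String = "corrupt"): DataFrame = {
    // First pass: let Spark derive column names from the header
    // (all columns are strings because inferSchema is not enabled).
    val inferred = spark.read
      .option("header", "true")
      .csv(path)
      .schema

    // Second pass: same file, explicit schema including the corrupt column.
    spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", corruptName)
      .schema(withCorruptColumn(inferred, corruptName))
      .csv(path)
  }
}
```

With a `SparkSession` named `spark` in scope, `CorruptRecordWorkaround.readFlaggingCorrupt(spark, "/tmp/file.csv").show(truncate = false)` would run both passes against the file from the report. The cost is that the file header is read twice; the benefit is that the column list does not have to be maintained by hand.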