[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851948#comment-16851948 ]
Liang-Chi Hsieh commented on SPARK-27873:
-----------------------------------------

I guess what Marcin meant is:

{code}
val schema = StructType.fromDDL("a int, b date")
val columnNameOfCorruptRecord = "_unparsed"
val schemaWithCorrField1 = schema.add(columnNameOfCorruptRecord, StringType)
val df = spark
  .read
  .option("mode", "Permissive")
  .option("header", "true")
  .option("enforceSchema", false)
  .option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
  .schema(schemaWithCorrField1)
  .csv(testFile(valueMalformedWithHeaderFile))
{code}

If we want to keep corrupt records, we add a new column to the schema. But this new column isn't in the CSV header. So if enforceSchema is disabled at the same time, CSVHeaderChecker throws an exception like:

{code}
[info] Cause: java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
[info] Header length: 2, schema size: 3
{code}

This is because CSVHeaderChecker doesn't take columnNameOfCorruptRecord into account for now.

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-27873
>                 URL: https://issues.apache.org/jira/browse/SPARK-27873
>             Project: Spark
>          Issue Type: Bug
>      Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Marcin Mejran
>            Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for storing corrupt records, then you need to add a new schema column corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs. header validation fails because there is an extra column corresponding to columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages on which rows failed to parse, there is no way to track down broken rows other than setting a corrupt record column.
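To illustrate the mismatch without a Spark cluster, here is a minimal, self-contained sketch of the length comparison that CSVHeaderChecker effectively performs. This is an assumption-laden simplification, not Spark's actual implementation: the object name `HeaderCheckSketch` and both method names are hypothetical, and the real checker also compares column names, handles case sensitivity, etc. The second method sketches the proposed direction of a fix, namely excluding the corrupt-record column before comparing lengths.

{code}
object HeaderCheckSketch {
  // Simplified stand-in for the header-vs-schema length check:
  // returns an error message when the sizes differ, None otherwise.
  def checkHeaderColumnNames(header: Seq[String], schemaFields: Seq[String]): Option[String] =
    if (header.length != schemaFields.length)
      Some(s"Number of column in CSV header is not equal to number of fields in the schema:\n" +
           s"  Header length: ${header.length}, schema size: ${schemaFields.length}")
    else None

  // Hypothetical fixed variant: drop columnNameOfCorruptRecord from the
  // schema before comparing, since that column never appears in the header.
  def checkIgnoringCorruptColumn(header: Seq[String], schemaFields: Seq[String],
                                 corruptCol: String): Option[String] =
    checkHeaderColumnNames(header, schemaFields.filterNot(_ == corruptCol))

  def main(args: Array[String]): Unit = {
    val header = Seq("a", "b")                  // two columns in the CSV file
    val schema = Seq("a", "b", "_unparsed")     // schema plus corrupt-record column
    // Current behavior: the extra "_unparsed" field triggers a mismatch.
    println(checkHeaderColumnNames(header, schema).isDefined)
    // With the corrupt column excluded, the check passes.
    println(checkIgnoringCorruptColumn(header, schema, "_unparsed").isEmpty)
  }
}
{code}

With the example header of 2 columns and a 3-field schema, the first check reproduces the "Header length: 2, schema size: 3" situation from the stack trace above, while the corrupt-column-aware variant accepts it.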
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org