[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851952#comment-16851952 ]

Liang-Chi Hsieh commented on SPARK-27873:
-----------------------------------------

I can prepare a PR if Marcin or Hyukjin Kwon don't plan to work on one.

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-27873
>                 URL: https://issues.apache.org/jira/browse/SPARK-27873
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Marcin Mejran
>            Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for
> storing corrupt records, you need to add an extra schema column
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs.
> header validation fails because of that extra column corresponding to
> columnNameOfCorruptRecord.
> Since FAILFAST mode doesn't print informative error messages about which
> rows failed to parse, there is no way to track down broken rows other than
> by setting a corrupt record column.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851948#comment-16851948 ]

Liang-Chi Hsieh commented on SPARK-27873:
-----------------------------------------

I guess what Marcin meant is:

{code}
val schema = StructType.fromDDL("a int, b date")
val columnNameOfCorruptRecord = "_unparsed"
val schemaWithCorrField1 = schema.add(columnNameOfCorruptRecord, StringType)
val df = spark
  .read
  .option("mode", "Permissive")
  .option("header", "true")
  .option("enforceSchema", false)
  .option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
  .schema(schemaWithCorrField1)
  .csv(testFile(valueMalformedWithHeaderFile))
{code}

If we want to keep corrupt records, we provide a new column in the schema. But this new column isn't in the CSV header, so if enforceSchema is disabled at the same time, CSVHeaderChecker throws an exception like:

{code}
[info] Cause: java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
[info]  Header length: 2, schema size: 3
{code}

This is because CSVHeaderChecker doesn't take columnNameOfCorruptRecord into account for now.
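The behavior Liang-Chi describes can be sketched in plain Scala. This is a hypothetical simplification, not Spark's actual CSVHeaderChecker code: it shows how the header-length check could skip the corrupt-record column, which by design never appears in the CSV file itself (the object and method names here are invented for illustration).

```scala
// Hypothetical sketch of the header-vs-schema length check
// (assumed names; not the real CSVHeaderChecker implementation).
object HeaderCheckSketch {
  def checkLength(header: Seq[String],
                  schemaFields: Seq[String],
                  columnNameOfCorruptRecord: String): Unit = {
    // Proposed behavior: exclude the corrupt-record column before
    // comparing, since it is synthesized by Spark, not read from the file.
    val dataFields = schemaFields.filterNot(_ == columnNameOfCorruptRecord)
    require(header.length == dataFields.length,
      s"Number of columns in CSV header is not equal to number of fields " +
      s"in the schema: Header length: ${header.length}, " +
      s"schema size: ${dataFields.length}")
  }
}
```

With this filtering, the reporter's scenario (header `a,b`, schema `a,b,_unparsed`) passes the check; without it, the lengths 2 and 3 mismatch and `require` throws the `IllegalArgumentException` quoted above.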
[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851534#comment-16851534 ]

Hyukjin Kwon commented on SPARK-27873:
--------------------------------------

Can you show a reproducer with its actual output and the expected output?