[ https://issues.apache.org/jira/browse/SPARK-34422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen updated SPARK-34422:
---------------------------------
    Priority: Major  (was: Minor)

> CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-34422
>                 URL: https://issues.apache.org/jira/browse/SPARK-34422
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7, 3.0.1, 3.1.1
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Major
>
> (This was actually found and fixed in spark-xml, which copied some Spark code
> for handling bad records. See
> https://github.com/databricks/spark-xml/issues/517 )
>
> When CSV parsing (or, I think JSON?) encounters a bad record, in Permissive
> mode, it can return a partial result of values that were successfully parsed,
> along with the problem input in a new 'corrupt record' column.
>
> However the logic in FailureSafeParser that copies the partial results to the
> resulting Row has an off-by-one error that arises when the catalyst
> projection puts the 'corrupt record' column anywhere but the last column,
> which can readily happen. This could mean the resulting partial results are
> wrong, or, that processing the bad record in permissive mode fails entirely,
> if the resulting elements don't happen to match the schema of the result.
>
> The partial results are usually not that useful, so being wrong isn't a huge
> deal, but, failing entirely in permissive mode is a problem.
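
For illustration only, here is a minimal sketch of the kind of index mix-up described above. It is plain Python, not the actual FailureSafeParser code, and it is a guess at the shape of the bug rather than a copy of it. The function names, the column name `_corrupt_record` as used here, and the sample values are all assumptions made for the example. The partial row holds only the data columns, so copying it position-by-position into the result row is only right when the corrupt-record column comes last.

```python
# Hypothetical illustration; not Spark's implementation.

def merge_buggy(schema, corrupt_col, partial, raw):
    """Copy partial[i] straight into result[i], then set the corrupt column.

    This ignores the fact that the corrupt column occupies a slot in the
    result row but not in the partial row.
    """
    result = [None] * len(schema)
    for i, value in enumerate(partial):
        result[i] = value
    result[schema.index(corrupt_col)] = raw
    return result


def merge_fixed(schema, corrupt_col, partial, raw):
    """Map each data column to its position in the full schema by name."""
    data_cols = [c for c in schema if c != corrupt_col]
    result = [None] * len(schema)
    for i, name in enumerate(data_cols):
        result[schema.index(name)] = partial[i]
    result[schema.index(corrupt_col)] = raw
    return result


raw_line = "1,x,extra"   # assumed bad input line
partial_row = [1, "x"]   # assumed partially parsed values for columns a, b

# Corrupt column last: both versions agree.
last = ["a", "b", "_corrupt_record"]
same = merge_buggy(last, "_corrupt_record", partial_row, raw_line)

# Corrupt column first: the buggy copy shifts values into the wrong columns.
first = ["_corrupt_record", "a", "b"]
shifted = merge_buggy(first, "_corrupt_record", partial_row, raw_line)
correct = merge_fixed(first, "_corrupt_record", partial_row, raw_line)
```

In the shifted case, column `a` ends up holding the string meant for `b`. If `a` were declared as an integer column, that mismatch is the sort of thing that could make permissive-mode processing fail outright instead of just returning a wrong partial row.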