[ 
https://issues.apache.org/jira/browse/SPARK-34422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34422:
---------------------------------
    Priority: Major  (was: Minor)

> CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial 
> result row
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-34422
>                 URL: https://issues.apache.org/jira/browse/SPARK-34422
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7, 3.0.1, 3.1.1
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Major
>
> (This was actually found and fixed in spark-xml, which copied some Spark code 
> for handling bad records. See 
> https://github.com/databricks/spark-xml/issues/517 )
> When CSV parsing (and, I think, JSON parsing?) encounters a bad record in 
> Permissive mode, it can return a partial result containing the values that 
> were successfully parsed, along with the problem input in a new 'corrupt 
> record' column.
> However, the logic in FailureSafeParser that copies the partial results to 
> the resulting Row has an off-by-one error that arises when the Catalyst 
> projection puts the 'corrupt record' column anywhere but the last column, 
> which can readily happen. This could mean that the resulting partial results 
> are wrong, or that processing the bad record in Permissive mode fails 
> entirely if the resulting elements don't happen to match the schema of the 
> result.
> The partial results are usually not that useful, so being wrong isn't a huge 
> deal, but failing entirely in Permissive mode is a problem.
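
To make the failure mode concrete, here is a minimal pure-Python model of how a positional copy can go wrong when the 'corrupt record' column is not last. This is only an illustrative sketch, not the actual FailureSafeParser code: the function names, the `_corrupt_record` column name, and the list-based rows are all assumptions made for the example.

```python
# Illustrative model only; names and data layout are hypothetical,
# not taken from Spark's FailureSafeParser.
CORRUPT = "_corrupt_record"


def copy_by_position(schema, partial, bad_record):
    """Copy partial values using the same index for source and target.

    This is only correct when the corrupt-record column is the last one.
    """
    result = [None] * len(schema)
    for i in range(len(partial)):
        result[i] = partial[i]
    result[schema.index(CORRUPT)] = bad_record
    return result


def copy_by_name(schema, partial, bad_record):
    """Copy partial values by looking up each field's index in the result schema."""
    actual = [name for name in schema if name != CORRUPT]
    result = [None] * len(schema)
    for i, name in enumerate(actual):
        result[schema.index(name)] = partial[i]
    result[schema.index(CORRUPT)] = bad_record
    return result


# Corrupt-record column projected into the middle of the schema.
schema = ["a", CORRUPT, "b"]
partial = [1, 2]        # values parsed for "a" and "b"
bad_record = "1,x"      # raw input that failed to parse cleanly

shifted = copy_by_position(schema, partial, bad_record)
aligned = copy_by_name(schema, partial, bad_record)
print(shifted)
print(aligned)
```

In this model the positional copy writes the value for "b" into the slot of the 'corrupt record' column, where it is then overwritten, so "b" ends up empty. With typed columns, a shifted value can also land in a slot of a different type, which is one way the whole record could fail rather than just come out wrong.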



