Sean R. Owen created SPARK-34422:
------------------------------------

             Summary: CSV(/JSON?) files with corrupt row + Permissive mode can 
yield wrong partial result row
                 Key: SPARK-34422
                 URL: https://issues.apache.org/jira/browse/SPARK-34422
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1, 2.4.7, 3.1.1
            Reporter: Sean R. Owen
            Assignee: Sean R. Owen


(This was actually found and fixed in spark-xml, which copied some Spark code 
for handling bad records. See 
https://github.com/databricks/spark-xml/issues/517 )

When CSV parsing (and, I think, JSON parsing) encounters a bad record in 
Permissive mode, it can return a partial result of the values that were 
successfully parsed, along with the problem input in a new 'corrupt record' 
column.
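
To make the behavior described above concrete, here is a toy model in plain 
Python. It is not Spark code: the function name parse_permissive, the 
(name, converter) schema shape, and the '_corrupt_record' column name used here 
are assumptions made purely for illustration.

```python
def parse_permissive(line, schema, corrupt_column="_corrupt_record"):
    """Toy permissive parser (hypothetical, not Spark's implementation).

    schema is a list of (field_name, converter) pairs.
    """
    tokens = line.split(",")
    row = {name: None for name, _ in schema}
    row[corrupt_column] = None

    # A wrong token count already marks the record as bad.
    bad = len(tokens) != len(schema)

    for (name, convert), token in zip(schema, tokens):
        try:
            row[name] = convert(token)
        except ValueError:
            # Keep going so that later fields can still be parsed.
            bad = True

    if bad:
        # Preserve the raw input next to whatever was parsed successfully.
        row[corrupt_column] = line
    return row


schema = [("id", int), ("name", str), ("age", int)]
good = parse_permissive("1,alice,30", schema)
bad = parse_permissive("2,bob,unknown", schema)
```

In this sketch the bad record keeps its parsed id and name, gets None for age, 
and carries the raw line in the extra column.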

However, the logic in FailureSafeParser that copies the partial results to the 
resulting Row has an off-by-one error. It arises when the catalyst projection 
puts the 'corrupt record' column anywhere but the last column, which can 
readily happen. As a result, the partial results can be wrong, or processing 
the bad record in permissive mode can fail entirely if the resulting elements 
don't happen to match the schema of the result.
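
The kind of off-by-one described above can be sketched in plain Python. This is 
a simplified model of the failure mode, not the actual FailureSafeParser code: 
the function names, the list-based rows, and the sample column names are all 
hypothetical.

```python
def copy_by_position(result_fields, corrupt_field, partial_row, bad_record):
    """Buggy sketch: silently assumes the corrupt column comes last."""
    data_fields = [f for f in result_fields if f != corrupt_field]
    result = [None] * len(result_fields)
    for i in range(len(data_fields)):
        # Wrong once the corrupt column sits before position i:
        # every later value lands one slot too early.
        result[i] = partial_row[i]
    result[result_fields.index(corrupt_field)] = bad_record
    return result


def copy_by_name(result_fields, corrupt_field, partial_row, bad_record):
    """Sketch of a fix: look up each field's position in the result schema."""
    data_fields = [f for f in result_fields if f != corrupt_field]
    result = [None] * len(result_fields)
    for i, name in enumerate(data_fields):
        result[result_fields.index(name)] = partial_row[i]
    result[result_fields.index(corrupt_field)] = bad_record
    return result


# The corrupt column is projected into the middle of the result schema.
fields = ["id", "_corrupt_record", "name", "age"]
partial = [1, "alice", 30]      # values for id, name, age
raw = "1,alice,30,extra"        # the problem input

buggy = copy_by_position(fields, "_corrupt_record", partial, raw)
fixed = copy_by_name(fields, "_corrupt_record", partial, raw)
```

In the buggy variant the integer 30 ends up in the 'name' slot and 'age' stays 
empty, which models both outcomes: a wrong partial row, or a value whose type 
does not match the result schema. When the corrupt column is last, the two 
variants agree, which is why the problem is easy to miss.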

The partial results are usually not that useful, so their being wrong isn't a 
huge deal, but failing entirely in permissive mode is a problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
