[ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788177#comment-16788177
 ] 

Tim Cerexhe commented on SPARK-27093:
-------------------------------------

Thanks for that idea [~Gengliang.Wang]. It catches some of the failure modes we 
need to protect against.

However, we also have files that do not conform to the requisite schema (or 
that have malformed schemas, e.g. with invalid byte sequences in keys, since 
they come from user uploads over faulty networks), and these exceptions aren't 
currently being squashed.

If these failure modes were treated as "corrupt" files then this would 
completely satisfy our needs (though this may be a stretch of the definition).

I've uploaded our internal patch for your reference: 
https://github.com/apache/spark/pull/24027

> Honor ParseMode in AvroFileFormat
> ---------------------------------
>
>                 Key: SPARK-27093
>                 URL: https://issues.apache.org/jira/browse/SPARK-27093
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tim Cerexhe
>            Priority: Major
>
> The Avro reader lacks the JSON reader's ability to handle malformed or 
> truncated files. Currently it throws an exception when it encounters any bad 
> or truncated record in an Avro file, so a single dodgy file causes the entire 
> Spark job to fail. 
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed 
> ParseMode like Spark's JSON format. This would enable the Avro reader to 
> drop bad records and continue processing the good records rather than abort 
> the entire job. 
> Obviously the default could remain as FailFastMode, which is the current 
> effective behavior, so this wouldn’t break any existing users.
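
To make the three modes in the description concrete, here is a minimal sketch 
of the semantics in plain Python. It is not Spark's implementation: 
read_records is a hypothetical helper, json.loads merely stands in for an Avro 
record decoder, and yielding None for a bad record is a simplification of what 
a permissive reader would return. The mode names are the ones accepted by the 
JSON reader's "mode" option, and the default is FAILFAST to mirror the 
proposal.

```python
import json
from typing import Iterable, Iterator, Optional

# Mode names as accepted by the "mode" option of Spark's JSON reader.
PERMISSIVE = "PERMISSIVE"
DROPMALFORMED = "DROPMALFORMED"
FAILFAST = "FAILFAST"


def read_records(raw_records: Iterable[bytes],
                 mode: str = FAILFAST) -> Iterator[Optional[dict]]:
    """Hypothetical helper: decode records, handling failures per parse mode."""
    if mode not in (PERMISSIVE, DROPMALFORMED, FAILFAST):
        raise ValueError(f"unknown parse mode: {mode}")
    for raw in raw_records:
        try:
            # Assumption: json.loads stands in for an Avro record decoder.
            yield json.loads(raw)
        except ValueError:
            # Covers both malformed JSON and invalid byte sequences.
            if mode == FAILFAST:
                raise  # one bad record aborts the whole read
            if mode == PERMISSIVE:
                yield None  # keep a placeholder row for the bad record
            # DROPMALFORMED: skip the bad record and carry on


# The second record is truncated, as if an upload had been cut short.
records = [b'{"id": 1}', b'{"id": 2', b'{"id": 3}']
dropped = list(read_records(records, DROPMALFORMED))
permissive = list(read_records(records, PERMISSIVE))
```

With the sample input, DROPMALFORMED keeps only the two good records, 
PERMISSIVE keeps all three positions with a placeholder for the bad one, and 
FAILFAST raises on the truncated record.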



