[ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787618#comment-16787618
 ] 

Gengliang Wang commented on SPARK-27093:
----------------------------------------

Let me post the email here as well. I think setting the SQL Configuration is 
enough:
{quote}Hi Tim,

I think you can try setting the option spark.sql.files.ignoreCorruptFiles to
true. With the option enabled, Spark jobs will continue to run when
encountering corrupted files, and the contents that have already been read
will still be returned.
The CSV/JSON data sources support PERMISSIVE mode when reading files because
users may still want partial row results.
When reading corrupted Avro files, I think skipping the rest of each corrupted
file is enough if users want to ignore them.
For processing data with the `from_avro` function, I have created a PR to
support PERMISSIVE/FAILFAST mode: https://github.com/apache/spark/pull/22814{quote}
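
For reference, a minimal PySpark sketch of the configuration suggested above. It assumes an existing SparkSession with the external spark-avro module available; the helper name `read_avro_skipping_corrupt_files` and the input path are hypothetical and only serve the illustration.

```python
def read_avro_skipping_corrupt_files(spark, path):
    """Read Avro files from `path`, skipping files Spark reports as corrupted.

    `spark` is assumed to be an existing SparkSession that has the external
    spark-avro module on its classpath. This helper is illustrative only.
    """
    # Runtime SQL option named in the comment above: when "true", the job
    # keeps running on a corrupted file and returns what was already read.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    return spark.read.format("avro").load(path)


# Usage (hypothetical path), given a session created elsewhere:
#   df = read_avro_skipping_corrupt_files(spark, "/data/events/avro")
#   df.count()
```

The option applies per session, so it also affects other file-based sources read through the same session. Set it back to "false" if only the Avro read should tolerate corrupted files.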

> Honor ParseMode in AvroFileFormat
> ---------------------------------
>
>                 Key: SPARK-27093
>                 URL: https://issues.apache.org/jira/browse/SPARK-27093
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tim Cerexhe
>            Priority: Major
>
> The Avro reader is missing the ability to handle malformed or truncated files 
> the way the JSON reader does. Currently it throws an exception when it 
> encounters any bad or truncated record in an Avro file, causing the entire 
> Spark job to fail because of a single dodgy file. 
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed 
> ParseMode like Spark's JSON format. This would enable the Avro reader to 
> drop bad records and continue processing the good records rather than abort 
> the entire job. 
> Obviously the default could remain as FailFastMode, which is the current 
> effective behavior, so this wouldn’t break any existing users.



