[ https://issues.apache.org/jira/browse/FLINK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260178#comment-17260178 ]
Jark Wu edited comment on FLINK-20795 at 1/7/21, 2:38 AM:
----------------------------------------------------------

If we want to refactor this configuration, I would suggest first investigating how other projects handle it, e.g. Spark, Hive, Presto, Kafka.

For example, Spark provides a ParseMode for dealing with corrupt records during parsing. It allows the following modes:
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
- DROPMALFORMED: ignores (drops) whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.

See https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html and https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala

> add a parameter to decide whether to print dirty records when `ignore-parse-errors` is true
> --------------------------------------------------------------------------------------------
>
>                  Key: FLINK-20795
>                  URL: https://issues.apache.org/jira/browse/FLINK-20795
>              Project: Flink
>           Issue Type: Improvement
>           Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table SQL / Ecosystem
>     Affects Versions: 1.13.0
>             Reporter: zoucao
>             Priority: Major
>
> Add a parameter to decide whether to print dirty records when `ignore-parse-errors` is true. Some users want to keep their task stable and still see the dirty records, so they can fix the upstream.
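
For reference, a minimal sketch of how the Spark parse modes described in the comment above are selected when reading JSON via DataFrameReader; the local session setup and the input path are illustrative assumptions:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseModeExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parse-mode-example")
                .master("local[*]")          // illustrative local setup
                .getOrCreate();

        String path = "/tmp/input.json";     // assumed sample input path

        // PERMISSIVE: corrupt records yield null fields, and the raw malformed
        // line is kept in the column named by columnNameOfCorruptRecord.
        Dataset<Row> permissive = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .json(path);

        // DROPMALFORMED: corrupt records are silently dropped from the result.
        Dataset<Row> dropMalformed = spark.read()
                .option("mode", "DROPMALFORMED")
                .json(path);

        // FAILFAST: the read fails with an exception on the first corrupt record.
        Dataset<Row> failFast = spark.read()
                .option("mode", "FAILFAST")
                .json(path);

        permissive.show(false);
        dropMalformed.show(false);
        failFast.show(false);

        spark.stop();
    }
}
{code}

A mode-style option along these lines is one possible shape for a refactored `ignore-parse-errors`, since a "keep and surface the dirty record" mode would cover what this issue asks for.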