[ https://issues.apache.org/jira/browse/FLINK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260178#comment-17260178 ]
Jark Wu edited comment on FLINK-20795 at 1/7/21, 2:38 AM:
----------------------------------------------------------

If we want to refactor this configuration, I would suggest first investigating how other projects handle it, e.g. Spark, Hive, Presto, Kafka.

For example, Spark provides a ParseMode for dealing with corrupt records during parsing. It allows the following modes:
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
- DROPMALFORMED: ignores (drops) whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.

See https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html and https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala

> add a parameter to decide whether to print dirty records when `ignore-parse-errors` is true
> --------------------------------------------------------------------------------------------
>
>                  Key: FLINK-20795
>                  URL: https://issues.apache.org/jira/browse/FLINK-20795
>              Project: Flink
>           Issue Type: Improvement
>           Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table SQL / Ecosystem
>     Affects Versions: 1.13.0
>             Reporter: zoucao
>             Priority: Major
>
> Add a parameter to decide whether to print dirty records when `ignore-parse-errors` is true. Some users want to keep their task stable and still see the dirty records, so they can fix the upstream.
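
For reference, a minimal sketch of how the Spark parse modes described in the comment above are selected when reading JSON via DataFrameReader; the local session setup and the input path are illustrative assumptions:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseModeExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parse-mode-example")
                .master("local[*]")          // illustrative local setup
                .getOrCreate();

        String path = "/tmp/input.json";     // assumed sample input path

        // PERMISSIVE: corrupt records yield null fields, and the raw malformed
        // line is kept in the column named by columnNameOfCorruptRecord.
        Dataset<Row> permissive = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .json(path);

        // DROPMALFORMED: corrupt records are silently dropped from the result.
        Dataset<Row> dropMalformed = spark.read()
                .option("mode", "DROPMALFORMED")
                .json(path);

        // FAILFAST: the read fails with an exception on the first corrupt record.
        Dataset<Row> failFast = spark.read()
                .option("mode", "FAILFAST")
                .json(path);

        permissive.show(false);
        dropMalformed.show(false);
        failFast.show(false);

        spark.stop();
    }
}
{code}

A mode-style option along these lines is one possible shape for a refactored `ignore-parse-errors`, since a "keep and surface the dirty record" mode would cover what this issue asks for.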