[ https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk reassigned SPARK-45035:
--------------------------------

    Assignee: Jia Fan

> Support ignoreCorruptFiles for multiline CSV
> --------------------------------------------
>
>                 Key: SPARK-45035
>                 URL: https://issues.apache.org/jira/browse/SPARK-45035
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yaohua Zhao
>            Assignee: Jia Fan
>            Priority: Major
>              Labels: pull-request-available
>
> Today, `ignoreCorruptFiles` does not work well for multiline CSV mode.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
> val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true").option("multiline", "true").csv("/tmp/sourcepath/").show()
> {code}
> It throws an exception instead of silently ignoring the corrupt files:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
> Parser Configuration: CsvParserSettings:
> 	Auto configuration enabled=true
> 	Auto-closing enabled=true
> 	Autodetect column delimiter=false
> 	Autodetect quotes=false
> 	Column reordering enabled=true
> 	Delimiters for detection=null
> 	Empty value=
> 	Escape unquoted values=false
> 	Header extraction enabled=null
> 	Headers=null
> 	Ignore leading whitespaces=false
> 	Ignore leading whitespaces in quotes=false
> 	Ignore trailing whitespaces=false
> 	Ignore trailing whitespaces in quotes=false
> 	Input buffer size=1048576
> 	Input reading on separate thread=false
> 	Keep escape sequences=false
> 	Keep quotes=false
> 	Length of content displayed on error=1000
> 	Line separator detection enabled=true
> 	Maximum number of characters per column=-1
> 	Maximum number of columns=20480
> 	Normalize escaped line separators=true
> 	Null value=
> 	Number of records to read=all
> 	Processor=none
> 	Restricting data in exceptions=false
> 	RowProcessor error handler=null
> 	Selected fields=none
> 	Skip bits as whitespace=true
> 	Skip empty lines=true
> 	Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration:
> 	CsvFormat:
> 		Comment character=#
> 		Field delimiter=,
> 		Line separator (normalized)=\n
> 		Line separator sequence=\n
> 		Quote character="
> 		Quote escape character=\
> 		Quote escape escape character=null
> Internal state when error was thrown: line=0, column=0, record=0
> 	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
> 	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
> 	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
> {code}
> This is because multiline parsing uses a different RDD (`BinaryFileRDD`) which does not go through `FileScanRDD`. We could potentially add this support to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline parsing mode.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
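For context on the behavior the issue asks for: with non-multiline sources, `spark.sql.files.ignoreCorruptFiles` takes effect because the per-file read path catches read failures and skips the offending file when the flag is set. A minimal plain-Java sketch of that skip-on-error pattern (independent of Spark; `parseFile` and the file list below are hypothetical stand-ins for illustration, not Spark APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class IgnoreCorruptFilesSketch {

    // Hypothetical per-file parser: returns rows, or throws if the file is corrupt.
    static List<String> parseFile(String path) {
        if (path.contains("corrupt")) {
            throw new IllegalStateException("Error reading from input: " + path);
        }
        return List.of(path + ":row1", path + ":row2");
    }

    // Skip-on-error pattern: when ignoreCorruptFiles is enabled, a file that
    // fails to parse is silently dropped; otherwise the error propagates.
    static List<String> readAll(List<String> paths, boolean ignoreCorruptFiles) {
        List<String> rows = new ArrayList<>();
        for (String path : paths) {
            try {
                rows.addAll(parseFile(path));
            } catch (RuntimeException e) {
                if (!ignoreCorruptFiles) {
                    throw e;
                }
                // Flag is set: skip this file and continue with the rest.
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("a.csv", "corrupt.csv", "b.csv");
        // With the flag on, only rows from a.csv and b.csv survive.
        System.out.println(readAll(paths, true));
    }
}
```

The point of the report is that the multiline CSV path never reaches an equivalent catch: the `BinaryFileRDD`-based read lets the univocity exception escape, so the job fails instead of skipping the file.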