Yaohua Zhao created SPARK-45035:
-----------------------------------
Summary: Support ignoreCorruptFiles for multiline CSV
Key: SPARK-45035
URL: https://issues.apache.org/jira/browse/SPARK-45035
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.0
Reporter: Yaohua Zhao
Today, `ignoreCorruptFiles` is not honored when reading CSV in multiline mode.
{code:java}
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true")
  .option("multiline", "true").csv("/tmp/sourcepath/")
testCorruptDF0.show()
{code}
It throws an exception instead of silently skipping the corrupt file:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0
(TID 4031) (10.68.177.106 executor 0):
com.univocity.parsers.common.TextParsingException:
java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=true
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
{code}
This is because multiline parsing uses a different RDD (`BinaryFileRDD`),
which does not go through `FileScanRDD`. We could potentially add this support
to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline parsing mode.
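
To illustrate the behavior being requested, here is a minimal, Spark-free sketch of per-file reading that skips a file when reading it fails. It is not Spark's implementation: `SkipCorruptFiles`, `readAll`, and `readFile` are hypothetical names, and treating `IOException` and `RuntimeException` as the signs of a corrupt file is an assumption made for the example.

```scala
import java.io.IOException

// Hypothetical helper, not part of Spark: reads each path with `readFile`
// and, when `ignoreCorruptFiles` is true, skips files whose read fails.
object SkipCorruptFiles {
  def readAll[T](
      paths: Seq[String],
      ignoreCorruptFiles: Boolean)(readFile: String => Iterator[T]): Seq[T] = {
    paths.flatMap { path =>
      try {
        // Materialize eagerly so that read errors surface inside the try block.
        readFile(path).toVector
      } catch {
        // Assumption: these two exception types indicate a corrupt file.
        case e @ (_: IOException | _: RuntimeException) if ignoreCorruptFiles =>
          System.err.println(s"Skipping corrupt file: $path ($e)")
          Vector.empty[T]
      }
    }
  }
}
```

With `ignoreCorruptFiles = false` the guard does not match, so the exception propagates to the caller, which corresponds to the failure reported above. Because each file is materialized before its rows are returned, this sketch drops every row of a failing file, including rows that were read before the error.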