[jira] [Created] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV

Yaohua Zhao (Jira) Thu, 31 Aug 2023 09:43:59 -0700

Yaohua Zhao created SPARK-45035:
-----------------------------------

             Summary: Support ignoreCorruptFiles for multiline CSV
                 Key: SPARK-45035
                 URL: https://issues.apache.org/jira/browse/SPARK-45035
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Yaohua Zhao



Currently, `ignoreCorruptFiles` is not honored when reading CSV files in multiline mode.
{code:java}
// Enable the option both session-wide and on the reader.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val testCorruptDF0 = spark.read
  .option("ignoreCorruptFiles", "true")
  .option("multiline", "true")
  .csv("/tmp/sourcepath/")
testCorruptDF0.show()
{code}
The read throws an exception instead of silently skipping the corrupt file:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
        Auto configuration enabled=true
        Auto-closing enabled=true
        Autodetect column delimiter=false
        Autodetect quotes=false
        Column reordering enabled=true
        Delimiters for detection=null
        Empty value=
        Escape unquoted values=false
        Header extraction enabled=null
        Headers=null
        Ignore leading whitespaces=false
        Ignore leading whitespaces in quotes=false
        Ignore trailing whitespaces=false
        Ignore trailing whitespaces in quotes=false
        Input buffer size=1048576
        Input reading on separate thread=false
        Keep escape sequences=false
        Keep quotes=false
        Length of content displayed on error=1000
        Line separator detection enabled=true
        Maximum number of characters per column=-1
        Maximum number of columns=20480
        Normalize escaped line separators=true
        Null value=
        Number of records to read=all
        Processor=none
        Restricting data in exceptions=false
        RowProcessor error handler=null
        Selected fields=none
        Skip bits as whitespace=true
        Skip empty lines=true
        Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
        CsvFormat:
                Comment character=#
                Field delimiter=,
                Line separator (normalized)=\n
                Line separator sequence=\n
                Quote character="
                Quote escape character=\
                Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
        at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
        at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
        at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
 {code}
This happens because multiline parsing uses a different RDD (`BinaryFileRDD`),
which does not go through `FileScanRDD`, where `ignoreCorruptFiles` is handled.
We could add this support to `BinaryFileRDD`, or reuse `FileScanRDD` for
multiline parsing mode.
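
As a rough illustration of the first option, the skip-on-failure behavior could look something like the sketch below. It is plain Scala with no Spark dependency and is not Spark's actual implementation: `IgnoreCorruptFilesSketch`, `FileParser`, and `parseAll` are hypothetical names, and the choice to catch `RuntimeException` and `IOException` is an assumption made for this example.

```scala
import java.io.IOException

object IgnoreCorruptFilesSketch {
  // Hypothetical stand-in for a per-file multiline parser; not a Spark API.
  type FileParser = String => Iterator[String]

  /** Parses each path in turn. When ignoreCorruptFiles is true, a file whose
    * parsing fails with a RuntimeException or IOException is skipped;
    * otherwise the exception propagates to the caller. */
  def parseAll(
      paths: Seq[String],
      parse: FileParser,
      ignoreCorruptFiles: Boolean): Seq[String] =
    paths.flatMap { path =>
      try {
        // Materialize the rows so that lazy read errors surface inside the try.
        // Simplification: a file that fails midway contributes no rows at all.
        parse(path).toList
      } catch {
        case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
          System.err.println(s"Skipping corrupt file $path: $e")
          Nil
      }
    }
}
```

With the flag set to false the guard does not match, so the original exception reaches the caller unchanged, which mirrors the failure shown in the stack trace above.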


