[ https://issues.apache.org/jira/browse/SPARK-45035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Gekk reassigned SPARK-45035:
--------------------------------

    Assignee: Jia Fan

> Support ignoreCorruptFiles for multiline CSV
> --------------------------------------------
>
>                 Key: SPARK-45035
>                 URL: https://issues.apache.org/jira/browse/SPARK-45035
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yaohua Zhao
>            Assignee: Jia Fan
>            Priority: Major
>              Labels: pull-request-available
>
> Today, `ignoreCorruptFiles` does not work well for multiline CSV mode.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
> val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true").option("multiline", "true").csv("/tmp/sourcepath/").show()
> {code}
> It throws an exception instead of silently ignoring the corrupt files:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
> Parser Configuration: CsvParserSettings:
> 	Auto configuration enabled=true
> 	Auto-closing enabled=true
> 	Autodetect column delimiter=false
> 	Autodetect quotes=false
> 	Column reordering enabled=true
> 	Delimiters for detection=null
> 	Empty value=
> 	Escape unquoted values=false
> 	Header extraction enabled=null
> 	Headers=null
> 	Ignore leading whitespaces=false
> 	Ignore leading whitespaces in quotes=false
> 	Ignore trailing whitespaces=false
> 	Ignore trailing whitespaces in quotes=false
> 	Input buffer size=1048576
> 	Input reading on separate thread=false
> 	Keep escape sequences=false
> 	Keep quotes=false
> 	Length of content displayed on error=1000
> 	Line separator detection enabled=true
> 	Maximum number of characters per column=-1
> 	Maximum number of columns=20480
> 	Normalize escaped line separators=true
> 	Null value=
> 	Number of records to read=all
> 	Processor=none
> 	Restricting data in exceptions=false
> 	RowProcessor error handler=null
> 	Selected fields=none
> 	Skip bits as whitespace=true
> 	Skip empty lines=true
> 	Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration:
> 	CsvFormat:
> 		Comment character=#
> 		Field delimiter=,
> 		Line separator (normalized)=\n
> 		Line separator sequence=\n
> 		Quote character="
> 		Quote escape character=\
> 		Quote escape escape character=null
> Internal state when error was thrown: line=0, column=0, record=0
> 	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
> 	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
> 	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
> {code}
> This is because multiline parsing uses a different RDD (`BinaryFileRDD`) which does not go through `FileScanRDD`. We could potentially add this support to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline parsing mode.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
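For context on the behavior the issue asks for: with non-multiline sources, `spark.sql.files.ignoreCorruptFiles` takes effect because the per-file read path catches read failures and skips the offending file when the flag is set. A minimal plain-Java sketch of that skip-on-error pattern (independent of Spark; `parseFile` and the file list below are hypothetical stand-ins for illustration, not Spark APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class IgnoreCorruptFilesSketch {

    // Hypothetical per-file parser: returns rows, or throws if the file is corrupt.
    static List<String> parseFile(String path) {
        if (path.contains("corrupt")) {
            throw new IllegalStateException("Error reading from input: " + path);
        }
        return List.of(path + ":row1", path + ":row2");
    }

    // Skip-on-error pattern: when ignoreCorruptFiles is enabled, a file that
    // fails to parse is silently dropped; otherwise the error propagates.
    static List<String> readAll(List<String> paths, boolean ignoreCorruptFiles) {
        List<String> rows = new ArrayList<>();
        for (String path : paths) {
            try {
                rows.addAll(parseFile(path));
            } catch (RuntimeException e) {
                if (!ignoreCorruptFiles) {
                    throw e;
                }
                // Flag is set: skip this file and continue with the rest.
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("a.csv", "corrupt.csv", "b.csv");
        // With the flag on, only rows from a.csv and b.csv survive.
        System.out.println(readAll(paths, true));
    }
}
```

The point of the report is that the multiline CSV path never reaches an equivalent catch: the `BinaryFileRDD`-based read lets the univocity exception escape, so the job fails instead of skipping the file.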