Yash Botadra created SPARK-57195:
------------------------------------
Summary: CSV multiLine schema inference throws
ArrayIndexOutOfBoundsException for a row exceeding maxColumns
Key: SPARK-57195
URL: https://issues.apache.org/jira/browse/SPARK-57195
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.1.2, 3.5.8, 4.0.2
Reporter: Yash Botadra
When inferring the schema of a CSV file with multiLine=true and
inferSchema=true, a data row with more columns than maxColumns causes a raw
java.lang.ArrayIndexOutOfBoundsException to propagate and fail the query
instead of producing a user-facing error.
SPARK-49444 made the per-line parse path (UnivocityParser.parseLine) translate
this into a MALFORMED_CSV_RECORD error. The streaming path used by multiLine
reads and schema inference (UnivocityParser.tokenizeStream / convertStream, via
tokenizer.parseNext()) was not covered, so the same input still throws a raw
ArrayIndexOutOfBoundsException.
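The gap between the two paths can be illustrated with a toy model. This is not
Spark's code: the tokenizer, the exception class, and every function name below
are hypothetical stand-ins, and Python's IndexError plays the role of
ArrayIndexOutOfBoundsException. It only sketches the pattern described above,
where one path translates the raw error and the other lets it escape.

```python
class MalformedCsvRecord(Exception):
    """Hypothetical stand-in for a user-facing MALFORMED_CSV_RECORD error."""


def tokenize(record, max_columns):
    """Toy tokenizer that writes fields into a fixed-size buffer."""
    buf = [None] * max_columns
    for i, field in enumerate(record.split(",")):
        buf[i] = field  # raises IndexError once i >= max_columns
    return buf


def parse_line(record, max_columns):
    """Per-line path: the raw error is translated (the SPARK-49444 behavior)."""
    try:
        return tokenize(record, max_columns)
    except IndexError as e:
        raise MalformedCsvRecord(record) from e


def tokenize_stream_unguarded(records, max_columns):
    """Streaming path as reported: the raw error propagates to the caller."""
    for record in records:
        yield tokenize(record, max_columns)


def tokenize_stream_guarded(records, max_columns):
    """Streaming path with the same translation applied (one possible fix)."""
    for record in records:
        try:
            yield tokenize(record, max_columns)
        except IndexError as e:
            raise MalformedCsvRecord(record) from e
```

With the rows from the repro and a limit of 2, consuming
tokenize_stream_unguarded raises IndexError, while tokenize_stream_guarded
raises MalformedCsvRecord, matching what parse_line does for a single record.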
Repro:
```
# file contents:
a,b
c,d
1,2,3

# read:
spark.read.option("header","false").option("inferSchema","true")
  .option("multiLine","true").option("maxColumns","2").csv(path)
```
Expected: MALFORMED_CSV_RECORD (SQLSTATE KD000).
Actual: java.lang.ArrayIndexOutOfBoundsException.
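For convenience, the repro can be packaged as a small PySpark script. This is a
sketch and has not been run against the affected versions: the helper names
(write_repro_file, read_with_inference, main), the file name repro.csv, and the
local-mode session settings are assumptions added here. Only the file contents
and the reader options come from this report.

```python
import os
import tempfile

MAX_COLUMNS = 2
REPRO_ROWS = ["a,b", "c,d", "1,2,3"]  # the last row has 3 fields, above the limit


def write_repro_file(directory):
    """Write the three-line CSV from this report and return its path."""
    path = os.path.join(directory, "repro.csv")  # file name is an assumption
    with open(path, "w", newline="") as f:
        f.write("\n".join(REPRO_ROWS) + "\n")
    return path


def read_with_inference(spark, path):
    """Issue the read from this report against an existing SparkSession."""
    return (
        spark.read.option("header", "false")
        .option("inferSchema", "true")
        .option("multiLine", "true")
        .option("maxColumns", str(MAX_COLUMNS))
        .csv(path)
    )


def main():
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("SPARK-57195-repro")
        .getOrCreate()
    )
    try:
        with tempfile.TemporaryDirectory() as directory:
            read_with_inference(spark, write_repro_file(directory)).show()
    except Exception as e:
        # Print whatever surfaces so the error class can be inspected.
        print(type(e).__name__, e)
    finally:
        spark.stop()
```

Calling main() in an environment with PySpark installed runs the read and
prints the exception that reaches the driver, if any.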
How introduced: the streaming parseNext() path predates SPARK-49444 and was
missed when that fix was applied to parseLine.