Incorrect csv parsing when delimiter used within the data

Saurabh Gulati Tue, 03 Jan 2023 08:59:24 -0800

Hello,
We are seeing a case where Spark parses CSV data incorrectly.
The issue can be reproduced with the CSV data below:


"a","b","c"
"1","",","
"2","","abc"
and the following Spark CSV read command:
df = spark.read.format("csv") \
    .option("multiLine", True) \
    .option("escape", '"') \
    .option("enforceSchema", False) \
    .option("header", True) \
    .load("/tmp/test.csv")

df.show(100, False) # prints both rows
|a  |b       |c  |
+---+--------+---+
|1  |null    |,  |
|2  |null    |abc|

df.select("c").show() # merges the last column of the first row and
                      # the first column of the second row
+------+
|     c|
+------+
|"\n"2"|

print(df.count()) # prints 1, should be 2
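
For comparison, here is a minimal sketch of the parse I would expect, using Python's standard csv module rather than Spark. The names CSV_TEXT and parse_rows are made up for this illustration, and it assumes the doubled-quote escaping convention, which is what I understand escape='"' to select. Note that the csv module returns the empty field as '' rather than null.

```python
import csv
import io

# Same three lines as /tmp/test.csv in the report above.
CSV_TEXT = '"a","b","c"\n"1","",","\n"2","","abc"\n'


def parse_rows(text):
    """Parse CSV text with the stdlib reader: '"' as quote char, doubled-quote escaping."""
    reader = csv.reader(
        io.StringIO(text), delimiter=",", quotechar='"', doublequote=True
    )
    header = next(reader)  # first record is the header row
    return header, list(reader)  # remaining records are the data rows


header, rows = parse_rows(CSV_TEXT)
print(header)     # ['a', 'b', 'c']
print(rows)       # [['1', '', ','], ['2', '', 'abc']]
print(len(rows))  # 2
```

Here the quoted "," in the first data row stays a single field and both rows survive, which is the result I would expect from df.select("c") and df.count() as well.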

This looks like a bug, so I wanted to ask the community before filing an
issue on JIRA.

Mvg/Regards
Saurabh
