Hey all, thanks for the quick responses.

To answer your questions,
@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We get data from multiple 
sources, and we don't have any control over what they put in. In this case the 
column is supposed to contain some feedback and it can also contain quoted 
strings.

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quoted 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"
Here is what we get if we don't set " as the escape character:

df = spark.read.option("multiLine", True) \
    .option("enforceSchema", False) \
    .option("header", True) \
    .csv("/tmp/test.csv")

df.show(100, False)

+---+----+--------------------+
|a  |b   |c                   |
+---+----+--------------------+
|1  |null|",see what ""I did""|
+---+----+--------------------+

df.count()

1

So we set " as the escape character, and then it's parsed fine, but the count 
is wrong.


df = spark.read.option("escape", '"') \
    .option("multiLine", True) \
    .option("enforceSchema", False) \
    .option("header", True) \
    .csv("/tmp/test.csv")

df.show(100, False)

+---+----+------------------+
|a  |b   |c                 |
+---+----+------------------+
|1  |null|,see what "I did",|
|2  |null|abc               |
+---+----+------------------+

df.count()
1

I understand it's a complex case, or maybe an edge case, which makes it 
difficult for Spark to figure out where a column ends, even though we have 
enabled multiLine=True.

See another example below, which even has a multiline value for column c:

"a","b","c"
"1","",",see what ""I did"",
i am still writing"
"2","","abc"

# with escape

df = spark.read.option("escape", '"') \
    .option("multiLine", True) \
    .option("enforceSchema", False) \
    .option("header", True) \
    .csv("/tmp/test.csv")

df.show(10, False)

+---+----+--------------------------------------+
|a  |b   |c                                     |
+---+----+--------------------------------------+
|1  |null|,see what "I did",\ni am still writing|
|2  |null|abc                                   |
+---+----+--------------------------------------+

df.count()
1

df.select("c").show(10, False)
+------------------+
|c                 |
+------------------+
|see what ""I did""|
|null              |
|abc               |
+------------------+

# without the escape option


df.show(10, False)

+-------------------+----+--------------------+
|a                  |b   |c                   |
+-------------------+----+--------------------+
|1                  |null|",see what ""I did""|
|i am still writing"|null|null                |
|2                  |null|abc                 |
+-------------------+----+--------------------+

df.select("c").show(10, False)

+--------------------+
|c                   |
+--------------------+
|",see what ""I did""|
|null                |
|abc                 |
+--------------------+
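As a cross-check outside Spark (a sketch, not the Spark parser itself), Python's stdlib csv module follows the same doubled-quote convention (RFC 4180) and parses the multiline sample into two records; the data is inlined here for illustration:

```python
import csv
import io

# The multiline sample from above: quotes inside quoted fields are doubled,
# and a quoted field may span multiple lines.
data = (
    '"a","b","c"\n'
    '"1","",",see what ""I did"",\n'
    'i am still writing"\n'
    '"2","","abc"\n'
)

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]
print(len(records))   # 2
print(records[0][2])  # value of c, including the embedded newline
```

If Spark's count() disagrees with a result like this, that points at the reader configuration (the escape option) rather than at the data itself.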


The issue is that Spark prints the complete DataFrame correctly with the 
escape option enabled, but when you select a column or ask for a count, it 
gives the wrong output.


Regards
Saurabh
________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 04 January 2023 10:14
To: Sean Owen <sro...@gmail.com>
Cc: Saurabh Gulati <saurabh.gul...@fedex.com>; User <user@spark.apache.org>
Subject: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the 
data


What is the point of having  , as a column value? From a business point of view 
it does not signify anything IMO




 
view my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 3 Jan 2023 at 20:39, Sean Owen 
<sro...@gmail.com<mailto:sro...@gmail.com>> wrote:
Why does the data even need cleaning? That's all perfectly correct. The error 
was setting quote to be an escape char.
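For what it's worth, the doubled-quote form in the sample is exactly what a standards-compliant CSV writer emits; a minimal sketch with Python's stdlib csv module, using values taken from the sample above:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field, as in the sample
writer.writerow(["a", "b", "c"])
writer.writerow(["1", "", ',see what "I did",'])  # value contains quotes and commas
writer.writerow(["2", "", "abc"])

# Embedded quotes come out doubled, matching the original file.
print(buf.getvalue())
```

So the data is ordinary well-formed CSV and needs no cleansing; only the reader options were off.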

On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:
if you take your source CSV as below


"a","b","c"
"1","",","
"2","","abc"


and define your code as below


    csv_file = "hdfs://rhes75:9000/data/stg/test/testcsv.csv"
    # read the CSV file from HDFS into a DataFrame
    listing_df = spark.read.format("com.databricks.spark.csv") \
        .option("inferSchema", "true") \
        .option("header", "true") \
        .load(csv_file)
    listing_df.printSchema()
    print(f"""\n Reading from Hive table {csv_file}\n""")
    listing_df.show(100, False)
    listing_df.select("c").show()


results in


 Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv

+---+----+---+
|a  |b   |c  |
+---+----+---+
|1  |null|,  |
|2  |null|abc|
+---+----+---+

+---+
|  c|
+---+
|  ,|
|abc|
+---+


which assumes that "," is a value for column c in row 1


This interpretation is correct. You ought to do data cleansing first.


HTH



 







On Tue, 3 Jan 2023 at 17:03, Sean Owen 
<sro...@gmail.com<mailto:sro...@gmail.com>> wrote:
No, you've set the escape character to double-quote, when it looks like you 
mean for it to be the quote character (which it already is). Remove this 
setting, as it's incorrect.
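As an illustration (outside Spark), a parser using the default convention — " as the quote character and a doubled "" as its escape — already reads the sample correctly; a sketch with Python's stdlib csv module:

```python
import csv
import io

# The original sample: column c of row 1 holds a literal comma.
data = '"a","b","c"\n"1","",","\n"2","","abc"\n'

# Default dialect: doublequote=True, no separate escape character needed.
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['1', '', ',']
print(rows[2])  # ['2', '', 'abc']
```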

On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati 
<saurabh.gul...@fedex.com.invalid> wrote:
Hello,
We are seeing a case where Spark parses CSV data incorrectly.
The issue can be replicated using the below csv data

"a","b","c"
"1","",","
"2","","abc"
and using the Spark CSV read command:

df = spark.read.format("csv") \
    .option("multiLine", True) \
    .option("escape", '"') \
    .option("enforceSchema", False) \
    .option("header", True) \
    .load("/tmp/test.csv")

df.show(100, False) # prints both rows
|a  |b       |c  |
+---+--------+---+
|1  |null    |,  |
|2  |null    |abc|

df.select("c").show() # merges last column of first row and first column of 
second row
+------+
|     c|
+------+
|"\n"2"|

print(df.count()) # prints 1, should be 2

It feels like a bug, and I thought I would ask the community before creating a 
ticket on Jira.

Mvg/Regards
Saurabh
