Re: Incorrect csv parsing when delimiter used within the data

Mich Talebzadeh Wed, 04 Jan 2023 01:15:37 -0800

What is the point of having *,* as a column value? From a business point
of view, it does not signify anything, IMO.




   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jan 2023 at 20:39, Sean Owen <sro...@gmail.com> wrote:

> Why does the data even need cleaning? That's all perfectly correct. The
> error was setting quote to be an escape char.
>
> On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> if you take your source CSV as below
>>
>> "a","b","c"
>> "1","",","
>> "2","","abc"
>>
>>
>> and define your code as below
>>
>>
>>     csv_file = "hdfs://rhes75:9000/data/stg/test/testcsv.csv"
>>     # read the CSV file into a Spark DataFrame
>>     listing_df = (
>>         spark.read.format("com.databricks.spark.csv")
>>         .option("inferSchema", "true")
>>         .option("header", "true")
>>         .load(csv_file)
>>     )
>>     listing_df.printSchema()
>>     print(f"""\n Reading from Hive table {csv_file}\n""")
>>     listing_df.show(100, False)
>>     listing_df.select("c").show()
>>
>>
>> results in
>>
>>
>>  Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>>
>> +---+----+---+
>> |a  |b   |c  |
>> +---+----+---+
>> |1  |null|,  |
>> |2  |null|abc|
>> +---+----+---+
>>
>> +---+
>> |  c|
>> +---+
>> |  ,|
>> |abc|
>> +---+
>>
>>
>> which treats "," as the value of column c in row 1.
>>
>>
>> This interpretation is correct. You ought to cleanse the data before loading it.
>>
>>
>> HTH
>>
>>
>>
>> On Tue, 3 Jan 2023 at 17:03, Sean Owen <sro...@gmail.com> wrote:
>>
>>> No, you've set the escape character to double-quote, when it looks like
>>> you mean for it to be the quote character (which it already is). Remove
>>> this setting, as it's incorrect.
>>>
>>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati
>>> <saurabh.gul...@fedex.com.invalid> wrote:
>>>
>>>> Hello,
>>>> We are seeing a case where Spark parses CSV data incorrectly.
>>>> The issue can be replicated using the CSV data below
>>>>
>>>> "a","b","c"
>>>> "1","",","
>>>> "2","","abc"
>>>>
>>>> and using the spark csv read command.
>>>>
>>>> df = spark.read.format("csv")\
>>>> .option("multiLine", True)\
>>>> .option("escape", '"')\
>>>> .option("enforceSchema", False) \
>>>> .option("header", True)\
>>>> .load(f"/tmp/test.csv")
>>>>
>>>> df.show(100, False) # prints both rows
>>>> |a  |b       |c  |
>>>> +---+--------+---+
>>>> |1  |null    |,  |
>>>> |2  |null    |abc|
>>>>
>>>> # merges last column of first row and first column of second row
>>>> df.select("c").show()
>>>> +------+
>>>> |     c|
>>>> +------+
>>>> |"\n"2"|
>>>>
>>>> print(df.count()) # prints 1, should be 2
>>>>
>>>>
>>>> It feels like a bug, and I thought of asking the community before
>>>> filing a bug in JIRA.
>>>>
>>>> Mvg/Regards
>>>> Saurabh
>>>>
>>>>
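
As a Spark-independent illustration of the point made in the thread (that the
sample data is valid CSV and that a quoted comma is an ordinary field value),
here is a minimal sketch using Python's standard csv module with its default
delimiter and quote character. The names SAMPLE and parse_rows are
hypothetical and exist only for this example; the sketch does not reproduce
Spark's parser or its "escape" option.

```python
import csv
import io

# Sample data copied from the thread; the last field of the first data row
# is a quoted comma.
SAMPLE = '"a","b","c"\n"1","",","\n"2","","abc"\n'


def parse_rows(text):
    """Parse CSV text using the default delimiter (,) and quote character (")."""
    return list(csv.reader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

With the default settings this yields a header and two data rows, with "," as
the value of column c in the first data row. In Spark, the fix suggested in
the thread is to remove the .option("escape", '"') setting from the read.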
