Sidhant Bansal <sidhban...@gmail.com> added the comment:

Hi Remi,

Currently, code like this:
```
import csv

with open("abc.csv", "w", encoding='utf-8', newline='') as f:
    data = [b'\x41']
    w = csv.writer(f)
    w.writerow(data)
with open("abc.csv", "r", newline='') as f:
    rows = csv.reader(f)
    for row in rows:
        print(row[0]) # prints b'A'
```
is able to write the string "b'A'" into a CSV file. You are correct that the
ideal approach would indeed be to decode the bytes first.

However, if a user does not decode the bytes, the csv module calls str() on
the bytes object, as you said. In practice, that b-prefixed string is not
easily readable by another program (it would first have to strip the b-prefix
and the surrounding quotes), and this has turned out to be a pain point in one
of the pandas issues I referred to in my first message.
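To make the pain point concrete, here is a small demonstration (not from the PR, just an illustration) of what a downstream consumer sees today and the extra work needed to recover the original bytes:

```python
# csv.writer calls str() on a bytes cell, so the written field literally
# contains the five characters  b'A'  -- prefix and quotes included.
import ast

cell = str(b'\x41')          # what csv.writer emits for the bytes b'\x41'
assert cell == "b'A'"

# A consumer has to undo the repr, e.g. with ast.literal_eval:
recovered = ast.literal_eval(cell)
assert recovered == b'A'
```
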

Also, I am not sure whether you have taken a look at my PR, but my approach to
fixing this problem does NOT involve guessing the encoding scheme. Instead, we
simply use the encoding that the user provided when opening the file object.
So if you open the file with `open("abc.csv", "w", encoding="latin1")`, it
will try to decode the bytes using "latin1". In case decoding fails, a
UnicodeDecodeError is raised, so there is no silent file corruption. You can
refer to the tests + NEWS.d in the PR to confirm this.
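The behaviour described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the PR's actual code; `coerce_cell` is a hypothetical helper, and the encoding argument stands in for the one the user passed to open():

```python
# Sketch: decode bytes cells with the encoding of the already-open file,
# raising UnicodeDecodeError instead of silently writing the repr.
def coerce_cell(value, encoding):
    if isinstance(value, bytes):
        # May raise UnicodeDecodeError -- no silent corruption.
        return value.decode(encoding)
    return value

assert coerce_cell(b'\x41', 'latin1') == 'A'

try:
    coerce_cell(b'\xff', 'utf-8')   # 0xff is never valid in UTF-8
except UnicodeDecodeError:
    print("raised UnicodeDecodeError as expected")
```
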

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40762>
_______________________________________