Rémi Lapeyre <[email protected]> added the comment:
> in real-life that b-prefixed string is just not readable by another program
> in an easy way
If another program opens this CSV file, it will read the string "b'A'" which is
what this field actually contains. Everything that is not a number or a string
gets converted to a string:
In [1]: import collections, dataclasses, random, secrets, io, csv
...:
...: Point = collections.namedtuple('Point', 'x y')
...:
...: @dataclasses.dataclass
...: class Valar:
...:     name: str
...:     age: int
...:
...: a = Point(1, 2)
...: b = Valar('Melkor', 2900)
...: c = secrets.token_bytes(4)
...:
...: out = io.StringIO()
...: f = csv.writer(out)
...: f.writerow((a, b, c))
...:
...: out.seek(0)
...: print(out.read())
...:
"Point(x=1, y=2)","Valar(name='Melkor', age=2900)",b'\x95g6\xa2'
Here another program would find three fields, all strings: "Point(x=1, y=2)",
"Valar(name='Melkor', age=2900)" and "b'\x95g6\xa2'". Would you expect to get
actual objects instead of strings when reading the first two fields?
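To make that concrete, here is a minimal sketch of what a reader gets back.
It uses fixed sample values (an int, a str and b'A') instead of
secrets.token_bytes so the result is reproducible; those values are my own
choice for illustration:

```python
import csv
import io

# Write one row containing an int, a str and a bytes object.
out = io.StringIO()
csv.writer(out).writerow((1, 'A', b'A'))

# Read the same row back with the default reader settings.
out.seek(0)
row = next(csv.reader(out))

print(row)
print([type(value).__name__ for value in row])
```

With the default dialect every field comes back as a str, including the
integer, so the bytes field is not treated differently from the others.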
> Incase it fails to decode using that, then it will throw a UnicodeDecodeError
I read your PR, but decoding without an error does not mean the result is correct:
In [4]: b'r\xc3\xa9sum\xc3\xa9'.decode('latin')
Out[4]: 'rÃ©sumÃ©'
It worked, but is it the appropriate encoding? Probably not:
In [5]: b'r\xc3\xa9sum\xc3\xa9'.decode('utf8')
Out[5]: 'résumé'
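A small sketch of why a successful decode proves little, using the same
bytes as above: latin-1 assigns a character to every possible byte value, so
decoding with it never raises, whatever encoding the data was really written
in. The variable names are only for illustration:

```python
data = b'r\xc3\xa9sum\xc3\xa9'

# Both calls succeed, but they produce different text.
as_latin = data.decode('latin-1')
as_utf8 = data.decode('utf8')

print(repr(as_latin))
print(repr(as_utf8))
print(as_latin == as_utf8)

# Every byte value from 0 to 255 decodes under latin-1 without an error,
# so the absence of UnicodeDecodeError cannot confirm the guess was right.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('latin-1')
print(len(decoded))
```

So falling back on UnicodeDecodeError only catches some wrong guesses; for
encodings like latin-1 it catches none.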
If you want to be able to save bytes, the best way is to use a format that can
round-trip bytes, such as Parquet:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame.from_dict({'a': [b'a']})
In [19]: df.to_parquet('foo.parquet')
In [20]: type(pd.read_parquet('foo.parquet')['a'][0])
Out[20]: bytes
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40762>
_______________________________________