[issue40762] Writing bytes using CSV module results in b prefixed strings

Rémi Lapeyre Mon, 25 May 2020 07:06:31 -0700

Rémi Lapeyre <remi.lape...@henki.fr> added the comment:

> in real-life that b-prefixed string is just not readable by another program 
> in an easy way


If another program opens this CSV file, it will read the string "b'A'" which is 
what this field actually contains. Everything that is not a number or a string 
gets converted to a string:

In [1]: import collections, dataclasses, random, secrets, io, csv
   ...:
   ...: Point = collections.namedtuple('Point', 'x y')
   ...:
   ...: @dataclasses.dataclass
   ...: class Valar:
   ...:     name: str
   ...:     age: int
   ...:
   ...: a = Point(1, 2)
   ...: b = Valar('Melkor', 2900)
   ...: c = secrets.token_bytes(4)
   ...:
   ...: out = io.StringIO()
   ...: f = csv.writer(out)
   ...: f.writerow((a, b, c))
   ...:
   ...: out.seek(0)
   ...: print(out.read())

"Point(x=1, y=2)","Valar(name='Melkor', age=2900)",b'\x95g6\xa2'

Here another program would find three fields, all strings: "Point(x=1, y=2)", 
"Valar(name='Melkor', age=2900)" and "b'\x95g6\xa2'". Would you expect to get 
actual objects instead of strings when reading the first two fields?
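
To make the reader's side concrete, here is a minimal sketch that writes a row 
containing a bytes object and reads it back with csv.reader. The sample values 
are made up for illustration; the point is only that every field comes back as 
a plain string, including the b-prefixed one.

```python
import csv
import io

# Write a row containing a bytes object, an int and a str.
# csv.writer turns non-string values into text with str().
out = io.StringIO()
csv.writer(out).writerow([b"A", 1, "text"])

# Read the same data back, as another program would.
out.seek(0)
row = next(csv.reader(out))

# Every field is a str; the bytes object is now the four characters b'A'.
print(row)
print([type(field).__name__ for field in row])
```

The reader has no way to know that the first field used to be a bytes object, 
just as it cannot rebuild the namedtuple or the dataclass above.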


> Incase it fails to decode using that, then it will throw a UnicodeDecodeError

I read your PR, but decoding successfully does not mean the result is correct:

   In [4]: b'r\xc3\xa9sum\xc3\xa9'.decode('latin')
   Out[4]: 'rÃ©sumÃ©'

It worked, but is it the appropriate encoding? Probably not:

   In [5]: b'r\xc3\xa9sum\xc3\xa9'.decode('utf8')
   Out[5]: 'résumé'
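
A short sketch of why the absence of a UnicodeDecodeError proves little: 
Latin-1 maps each of the 256 possible byte values to a character, so decoding 
with it never fails, whatever the bytes really are. The variable names below 
are only for illustration.

```python
# Latin-1 assigns a character to every byte value, so this cannot raise.
every_byte = bytes(range(256))
decoded = every_byte.decode("latin-1")

# The same UTF-8 bytes decode "successfully" with both codecs,
# but only one result is the text that was originally encoded.
raw = "résumé".encode("utf-8")
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")

print(len(decoded), repr(wrong), repr(right))
```

A decoder that cannot fail cannot tell you whether the guess was right.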



If you want to be able to save bytes, the best way is to use a format that can 
roundtrip bytes, such as Parquet (here via pandas, imported as pd):

    In [18]: df = pd.DataFrame.from_dict({'a': [b'a']})

    In [19]: df.to_parquet('foo.parquet')

    In [20]: type(pd.read_parquet('foo.parquet')['a'][0])
    Out[20]: bytes
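
If the data has to stay in CSV, one possible workaround (not something 
proposed in the PR or earlier in this comment, just a sketch) is to make the 
bytes-to-text step explicit, for example with base64, and have the reader 
undo it. The column layout and the sample payload here are assumptions made 
for illustration.

```python
import base64
import csv
import io

payload = b"\x95g6\xa2"  # arbitrary sample bytes

# Writer side: encode the bytes as ASCII text before handing them to csv.
out = io.StringIO()
csv.writer(out).writerow(["blob", base64.b64encode(payload).decode("ascii")])

# Reader side: the field is a str, so decode it back explicitly.
out.seek(0)
name, encoded = next(csv.reader(out))
restored = base64.b64decode(encoded)

print(name, restored == payload)
```

Both sides have to agree on the convention, because the file itself does not 
record which columns are base64.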

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40762>
_______________________________________