On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.

A few more cases:
bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
bytearray(b'petr urban\xe4\x8d\xe3\xadk')

0x9d is the most common; that occurs in English text. The others seem to be in some Eastern European character set.

Understand, there's no metadata available to disambiguate this. What I have is a big CSV file in which different character sets are mixed. Each field has a uniform character set, so I need character set detection on a per-field basis.

John Nagle
--
https://mail.python.org/mailman/listinfo/python-list
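Since each field is internally consistent, one possible approach is trial decoding per field: walk a list of candidate codecs and accept the first that decodes without error. This is only a sketch, not a proper detector; the candidate list and the sample fields below are illustrative assumptions (a library such as chardet would give probabilistic guesses instead).

```python
# Per-field encoding detection by trial decoding -- a minimal sketch.
# The candidate list is an assumption; order matters, since the first
# codec that decodes cleanly wins.

CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso-8859-2"]

def guess_decode(raw: bytes) -> tuple[str, str]:
    """Return (decoded_text, codec_name) for the first clean decode."""
    for codec in CANDIDATES:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Latin-1 never fails, but the result may be wrong; it is a last resort.
    return raw.decode("latin-1"), "latin-1"

# Sample fields taken from the cases above.
for raw in [b"petr urban\xe4\x8d\xe3\xadk", b"M\x81\x81\xfcnster"]:
    text, codec = guess_decode(raw)
    print(codec, repr(text))
```

The weakness is that many single-byte codecs accept almost any byte sequence, so "decodes cleanly" is a weak signal; a real solution would also score the plausibility of the resulting characters.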