On 08/17/2017 05:14 PM, John Nagle wrote:
>      I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:

bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
bytearray(b'petr urban\xe4\x8d\xe3\xadk')
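None of these fields decodes cleanly as UTF-8, which is a quick way to confirm they must be in some single-byte encoding. A minimal stdlib check, using two of the sample byte strings above (the helper name `is_utf8` is mine, not from the post):

```python
def is_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = [
    b"miguel \xe3\x81ngel santos",
    b"M\x81\x81\xfcnster",
]

for s in samples:
    print(s, is_utf8(s))
```

Both samples print False: in the first, 0xe3 opens a three-byte UTF-8 sequence that is cut off by an ASCII letter; in the second, 0x81 is not a valid UTF-8 start byte.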

0x9d is the most common offending byte; it occurs in fields that are otherwise English text. The others appear to come from some Eastern European character set.

Understand, there's no metadata available to disambiguate this. What I
have is a big CSV file in which different character sets are mixed.
Each field has a uniform character set, so I need character set
detection on a per-field basis.
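One way to do per-field detection with only the stdlib is to try a short list of candidate encodings in order, strictest first, and fall back to latin-1 (which never fails). This is a sketch, not a proven solution: the candidate list below is an assumption, and a single-byte encoding like iso-8859-2 will accept almost any byte string, so order matters and the result is only a guess.

```python
# Candidate encodings, strictest first. This particular list is an
# assumption for illustration, not something given in the post.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-2"]

def guess_decode(raw: bytes) -> tuple[str, str]:
    """Try candidate encodings in order; return (text, encoding_used).

    Falls back to latin-1, which maps every byte, so callers always
    get some text back even if no candidate matches.
    """
    for enc in CANDIDATES:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"
```

For example, `guess_decode(b"M\x81\x81\xfcnster")` skips utf-8 (0x81 is an invalid start byte) and cp1252 (0x81 is unassigned there) and lands on iso-8859-2. A smarter version would score the decoded text, e.g. penalize C1 control characters, instead of taking the first encoding that merely succeeds.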

                                John Nagle

--
https://mail.python.org/mailman/listinfo/python-list
