On Tue, Jun 28, 2016 at 6:30 PM, Peter Otten <__pete...@web.de> wrote:
> Does chardet ever return an encoding that fails to decode
> the line? Only in that case the "ignore" error handler would make sense.

Assuming the module the OP is using is functionally identical to the one I use from the command line (which is implemented in Python), yes it can. Usually what happens is that it detects something as an ISO-8859-* when it's actually the corresponding Windows codepage; if you try to decode it that way, you end up with a handful of byte values that don't correctly decode.

I have a "cdless" command that does a chardet, decodes the file, re-encodes as UTF-8, and pipes the result into less(1); great way to figure out what encoding something is (if it gets it wrong, it's usually really obvious to a human). It has a magic second parameter "win" to switch from ISO-8859 to Windows encoding - ISO-8859-1 becomes Windows-1252, -2 becomes 1250, etc.

Additionally, chardet often returns "MacCyrillic" for files that are actually encoded Windows-1256 (Arabic). So, yes, it's definitely possible for chardet to pick something that you can't actually decode with.

For the OP's situation, frankly, I doubt there'll be anything other than UTF-8, Latin-1, and CP-1252. The chances that someone casually mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the simple decode of "UTF-8, or failing that, 1252" is probably going to give correct results for most of the content. The trick is figuring out a correct boundary for the check; line-by-line may be sufficient, or it may not.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
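
For illustration, here is a minimal sketch of the "UTF-8, or failing that, 1252" decode and of the ISO-8859 to Windows switch described above. The function names and the mapping table are hypothetical (they are not taken from cdless or from chardet), and the table only lists the two pairs mentioned in this message. It uses nothing beyond the standard library and assumes the check is applied to whatever chunk of bytes you settle on as the boundary (a line, a paragraph, the whole file).

```python
# Hypothetical helper names; only the two ISO/Windows pairs named above are listed.
ISO_TO_WINDOWS = {
    "iso-8859-1": "cp1252",
    "iso-8859-2": "cp1250",
}


def windows_equivalent(encoding: str) -> str:
    """Return the Windows codepage for a detected ISO-8859-* name, if known.

    Names not in the table are returned unchanged.
    """
    return ISO_TO_WINDOWS.get(encoding.lower(), encoding)


def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8, or failing that, as Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few byte values undefined, so a
        # strict decode can still fail; substitute U+FFFD for those bytes
        # rather than raising a second time.
        return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    for chunk in ("café".encode("utf-8"), b"caf\xe9", b"\x93quoted\x94"):
        print(repr(decode_with_fallback(chunk)))
    print(windows_equivalent("ISO-8859-1"))
```

Trying UTF-8 first is reasonably safe because non-ASCII text in a single-byte codepage rarely happens to form valid UTF-8 sequences, so a successful strict UTF-8 decode is strong evidence that the chunk really is UTF-8.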