Content-Type: text/html; charset=utf-8lias

For Python to parse this, I had to consult Python's list of known encodings to determine whether I could even parse the site (i.e. whether the alleged charset name could be passed to a string's .decode() method).

You haven't said why you think you need a list of known encodings!

I would have thought that just trying it on some dummy data would let you determine very quickly whether the alleged encoding is supported by the Python version, etc., that you are using.

E.g.

| >>> alleged_encoding = "utf-8lias"
| >>> "any old ascii".decode(alleged_encoding)
| Traceback (most recent call last):
|  File "<stdin>", line 1, in <module>
| LookupError: unknown encoding: utf-8lias
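
(In current Python 3, where plain strings no longer have a .decode() method, the same quick probe could be written roughly like the sketch below; the helper name is my own, not anything from the thread.)

import codecs

def encoding_is_known(alleged_encoding):
    """Return True if Python has a codec registered for the alleged encoding."""
    try:
        codecs.lookup(alleged_encoding)   # raises LookupError for bogus names
        return True
    except LookupError:
        return False

print(encoding_is_known("utf-8lias"))  # False
print(encoding_is_known("utf-8"))      # True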

I then try to remap the bogus encoding to the one it seems most like (in this case, utf-8) and retry. Having a list of encodings allows me to either eyeball it or define a heuristic that says "this is the closest match... try this one instead". That mapping then goes into a mapping file so I don't have to think about it the next time I encounter the same bogus encoding.
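
A minimal sketch of that remap-and-retry idea, under my own assumptions: the mapping-file name, the use of difflib.get_close_matches as the "closest match" heuristic, and the function names are all mine, not tkc's actual code.

import codecs
import difflib
import encodings.aliases
import json
import os

MAPPING_FILE = "encoding_remap.json"   # hypothetical persistent mapping file

def load_remap():
    """Load previously discovered bogus-name -> real-encoding mappings."""
    if os.path.exists(MAPPING_FILE):
        with open(MAPPING_FILE) as f:
            return json.load(f)
    return {}

def closest_known_encoding(bogus):
    """Guess the known encoding a bogus name most resembles (e.g. utf-8lias -> utf_8)."""
    known = set(encodings.aliases.aliases) | set(encodings.aliases.aliases.values())
    matches = difflib.get_close_matches(bogus.lower().replace("-", "_"), known, n=1)
    return matches[0] if matches else None

def decode_with_fallback(raw_bytes, alleged_encoding, remap):
    """Decode with the alleged encoding, remapping a bogus name to the closest real one."""
    try:
        codecs.lookup(alleged_encoding)           # LookupError if the codec is unknown
        return raw_bytes.decode(alleged_encoding)
    except LookupError:
        replacement = remap.get(alleged_encoding) or closest_known_encoding(alleged_encoding)
        if replacement is None:
            raise                                 # no plausible match; give up
        remap[alleged_encoding] = replacement     # remember it for next time
        with open(MAPPING_FILE, "w") as f:
            json.dump(remap, f, indent=2)
        return raw_bytes.decode(replacement)

So decode_with_fallback(page_bytes, "utf-8lias", load_remap()) would record the "utf-8lias" -> "utf_8" guess in the mapping file and decode the page with utf-8.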

-tkc


