On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:
> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
>
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.muel...@gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>>
>> And it sucks, because it uses magic, and not reading the HTML tags. The
>> RIGHT thing to do for websites is detect the meta charset definition,
>> which is
>>
>> <meta http-equiv="content-type" content="text/html; charset=utf-8">
>>
>> or
>>
>> <meta charset="utf-8">
>>
>> The second one for HTML5 websites, and both may require case conversion
>> and the useless ` /` at the end. But if somebody is using HTML5, you
>> are pretty much guaranteed to get UTF-8.
>>
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
>> Because nobody in the right mind would use something else today.
>
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15 year old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.
Indeed, given the poor quality of most websites, it is not possible to be
100% accurate for all sites. Personally, I would start by checking the
doctype and then the meta charset declaration, since those checks are quick
and, when present, authoritative; I would fall back to chardetect only if
they fail to produce any result.
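Something along these lines, say (an untested sketch, assuming the chardet
package is installed; the helper name detect_encoding, the 4096-byte cutoff
and the regex are mine, purely for illustration):

import re

import chardet  # third-party: pip install chardet

# Matches both the HTML4 http-equiv form and the HTML5 <meta charset=...> form.
_META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)',
                           re.IGNORECASE)

def detect_encoding(raw):
    """Guess the encoding of a fetched HTML page (bytes), declarations first."""
    head = raw[:4096]  # the charset declaration has to appear early anyway
    m = _META_CHARSET.search(head)
    if m:
        name = m.group(1).decode('ascii').lower()
        try:
            b''.decode(name)   # weed out bogus encoding names
            return name
        except LookupError:
            pass
    # Heuristic from the thread: an HTML5 doctype almost always means UTF-8.
    if head.lstrip().lower().startswith(b'<!doctype html>'):
        return 'utf-8'
    # Last resort: statistical guessing.
    result = chardet.detect(raw)
    return result['encoding'] or 'utf-8'

Checking the HTTP Content-Type header before any of this would be cheaper
still, when the server actually bothers to send a charset with it.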