On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you can specify it via the fromEncoding parameter to the BeautifulSoup constructor, passing the raw bytes read from the file, e.g.:

    fileObj = open(filePath, "rb")
    fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if one is present, or a Unicode BOM, or using the chardet library if available, or using built-in heuristics, before finally falling back to Windows-1252 (which seems to be the preferred encoding of people who don't understand what an encoding is or why it needs to be specified).
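
To make that detection order concrete, here is a minimal sketch in Python 3 using only the standard library. It is not Beautiful Soup's actual code: guess_encoding and decode_html are hypothetical names, the meta-tag regex is a deliberate simplification, and the chardet and heuristic steps are left out entirely.

```python
import codecs
import re

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern (an assumption, not a full HTML parser). It matches both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def guess_encoding(data, declared=None):
    """Return (encoding_name, bom_length) for raw HTML bytes.

    Order: explicitly declared encoding, meta tag, BOM, then Windows-1252.
    """
    if declared:
        return declared, 0
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), 0
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "windows-1252", 0


def decode_html(data, declared=None):
    """Decode raw HTML bytes to text using the guessed encoding."""
    encoding, skip = guess_encoding(data, declared)
    try:
        return data[skip:].decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes invalid for it: use the last resort.
        return data[skip:].decode("windows-1252", errors="replace")
```

The point of the sketch is that the caller hands over bytes, not an already-decoded string; passing declared="iso-8859-2" plays the same role as fromEncoding does in the constructor call.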