In article <rn%Bs.693798$nB6.605938@fx21.am4>, Alister <alister.w...@ntlworld.com> wrote:
> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> personally I would start by checking the doc type & then the meta data as
> these should be quick & correct, I then use chardetect only if these
> fail to provide any result.

I agree that checking the metadata is the right thing to do. But I wouldn't go so far as to assume it will always be correct. There's a lot of crap out there with perfectly formed metadata which just happens to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of wisdom, I have to admit he got it right with "Trust, but verify". It's the only way to survive in the Unicode world. Write defensive code. Wrap try blocks around calls that might raise exceptions if the external data is borked w/r/t what the metadata claims it should be.
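Something along these lines (a rough, untested sketch in Python 3; the URL and the decode_page() helper are just placeholders, and it assumes the third-party chardet package is installed):

    import urllib.request
    import chardet  # third-party: pip install chardet

    def decode_page(raw_bytes, declared_encoding):
        # Trust whatever the headers / meta data claim first...
        if declared_encoding:
            try:
                return raw_bytes.decode(declared_encoding)
            except (UnicodeDecodeError, LookupError):
                # ...but verify: the metadata lied, or named a codec
                # Python doesn't know about.  Fall through to sniffing.
                pass
        # Verify by sniffing the actual bytes with chardet.
        guess = chardet.detect(raw_bytes)
        return raw_bytes.decode(guess['encoding'] or 'utf-8',
                                errors='replace')

    resp = urllib.request.urlopen('http://example.com/')
    raw = resp.read()
    declared = resp.headers.get_content_charset()  # charset from Content-Type
    print(decode_page(raw, declared)[:200])

The point isn't this particular fallback chain; it's that the decode call sits inside a try block, so a page whose bytes don't match its declared charset degrades gracefully instead of blowing up.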