On 2007-02-06, Robert Kern <[EMAIL PROTECTED]> wrote:
> John Nagle wrote:
>>   File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
>>     sitetext = sitetext.encode('ascii','replace')  # force to clean ASCII
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in
>> position 29151: ordinal not in range(128)
>>
>> Why is that exception being raised when the codec was told 'replace'?
>
> The .encode('ascii') takes unicode strings to str strings.
> Since you gave it a str string, it first tried to convert it to
> a unicode string using the default codec ('ascii'), just as if
> you were to have done unicode(sitetext).encode('ascii',
> 'replace').
>
> I think you want something like this:
>
> sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
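For reference, the two-step fix quoted above can be sketched as below. This is only a minimal illustration, not code from the original program: the helper name force_ascii and the sample bytes are made up, and it uses a bytes literal so that it also runs on newer interpreters.

```python
def force_ascii(raw):
    """Return the byte string raw with every non-ASCII byte replaced by '?'.

    force_ascii is a hypothetical helper name, used here for illustration.
    """
    # Step 1: decode explicitly with 'replace', so no implicit strict
    # 'ascii' decode can happen behind our back. Bytes that are not
    # ASCII become U+FFFD, the Unicode replacement character.
    text = raw.decode('ascii', 'replace')
    # Step 2: encode back to ASCII. U+FFFD has no ASCII form, so the
    # 'replace' handler turns it into '?'.
    return text.encode('ascii', 'replace')


# Made-up sample containing 0x92, the byte named in the traceback.
sample = b"it\x92s"
cleaned = force_ascii(sample)
```

The point is that the 'replace' handler has to be given to the decode step as well; passing it only to encode never gets a chance to apply, because the implicit decode fails first.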
This is the cue for the translate method, which will be much
faster and simpler for cases like this. You can build the
translation table yourself, or use maketrans.

>>> import string
>>> asciitable = string.maketrans(''.join(chr(a) for a in xrange(128, 256)),
...                               '?'*128)

You'd only want to do that once. Then, to replace the non-ASCII
bytes (128 through 255) with '?':

    sitetext.translate(asciitable)

I used a similar solution in an application I'm working on that
must use a Latin-1 byte encoding internally, but displays on
stdout in ASCII.

-- 
Neil Cerutti
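If you would rather build the translation table yourself than call maketrans, something along these lines should do. Treat it as a sketch: the names ASCII_TABLE and to_ascii and the sample bytes are invented for the example, and the table is built through bytearray so that the same code runs on Python 2.6 and later as well as Python 3.

```python
# Build the 256-entry table by hand, once, at import time.
# Bytes 0..127 map to themselves; bytes 128..255 all map to '?'.
ASCII_TABLE = bytes(bytearray(b if b < 128 else ord('?') for b in range(256)))


def to_ascii(raw):
    """Replace every non-ASCII byte in the byte string raw with '?'.

    to_ascii is a hypothetical helper name, used here for illustration.
    """
    # translate() does a single table lookup per byte; no codec is involved.
    return raw.translate(ASCII_TABLE)


# Made-up sample: a Latin-1 e-acute (0xe9) and two 0x92 bytes.
sample = b"caf\xe9 \x92quoted\x92"
cleaned = to_ascii(sample)
```

Since the table works byte by byte, the output is always the same length as the input, which is not guaranteed by the decode/encode route if the input is ever decoded with a multi-byte codec.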