In an HTML page that I'm scraping using urllib2, a \xc2\xa0 bytestring appears.
The page's charset = utf-8, and the Chrome browser I'm using displays the characters as a space. The page requires authentication: https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python chokes because it refuses to coerce the bytestring into ascii.

    wfile.write('|'.join(valueList))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 163: ordinal not in range(128)

In searching for help with this issue, I've learned that the bytestring *might* represent a non-breaking space. When I scrape the page using urllib2, however, the characters print as   in a Windows command prompt (though I wouldn't be surprised if this is some erroneous attempt by the antiquated command window to handle something it doesn't understand).

If I use IDLE to attempt to decode the single byte referenced in the error message, and convert it into UTF-8, another error message is generated:

    >>> weird = unicode('\xc2', 'utf-8')
    Traceback (most recent call last):
      File "<pyshell#72>", line 1, in <module>
        weird = unicode('\xc2', 'utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data

If I attempt to decode the full bytestring, I don't obtain a human-readable string (expecting, perhaps, a non-breaking space):

    >>> weird = unicode('\xc2\xa0', 'utf-8')
    >>> par = ' - '.join(['This is', weird])
    >>> par
    u'This is - \xa0'

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

    >>> weirder = unicode('\xc2\xa0', 'latin1')
    >>> weirder
    u'\xc2\xa0'
    >>> 'This just gets ' + weirder
    u'This just gets \xc2\xa0'

Or is it a Microsoft bytestring?

    >>> weirder = unicode('\xc2\xa0', 'mbcs')
    >>> 'This just gets ' + weirder
    u'This just gets \xc2\xa0'

None of these codecs seem to work.

Back to the original purpose, as I'm scraping the page, I'm storing the field/value pair in a dictionary with each iteration through table elements on the page.
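To make the codec experiments easier to compare, here is a minimal sketch that decodes the same two bytes three ways and asks `unicodedata` what came out. It assumes the bytes are exactly `\xc2\xa0` as in the error message, and it is written so that it should run under both Python 2 and Python 3. The `cp437` line is only my guess at what the command prompt is doing; I have not confirmed which code page the console actually uses.

```python
import unicodedata

raw = b'\xc2\xa0'  # the two bytes that show up in the scraped page

# Decoded as UTF-8, the two bytes collapse into a single code point.
as_utf8 = raw.decode('utf-8')
print('utf-8:   %d char(s): %s' % (len(as_utf8), unicodedata.name(as_utf8)))

# Decoded as Latin-1, each byte becomes its own character, so the
# result is two characters instead of one.
as_latin1 = raw.decode('latin-1')
print('latin-1: %d char(s): %s'
      % (len(as_latin1), [unicodedata.name(ch) for ch in as_latin1]))

# Assumption: the console uses code page 437. Under that code page the
# same bytes map to a box-drawing character and an accented letter.
as_cp437 = raw.decode('cp437')
print('cp437:   %d char(s): %s'
      % (len(as_cp437), [unicodedata.name(ch) for ch in as_cp437]))
```

If I'm reading this right, the UTF-8 decode yields one character named NO-BREAK SPACE, which would mean the `u'\xa0'` in the IDLE output above is the non-breaking space after all and the interactive prompt is just showing its escaped repr.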
This is all fine, until a value is found that contains the offending bytestring. I have attempted to coerce all value strings into an encoding, but Python doesn't seem to like that when the string is already Unicode:

    valuesDict[fieldString] = unicode(value, 'UTF-8')
    TypeError: decoding Unicode is not supported

The solution I've arrived at is to specify the encoding for value strings both when reading and writing value strings.

    for k, v in valuesDict.iteritems():
        valuePair = ':'.join([k, v.encode('UTF-8')])
        [snip] ...
    wfile.write('|'.join(valueList))

I'm not sure I have a question, but does this sound familiar to any Unicode experts out there? How should I handle these odd bytestring values? Am I doing it correctly, or what could I improve?

Thanks!
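P.S. In case it helps anyone answering, here is the read/write approach reduced to a self-contained sketch. The helper `to_text`, the sample dictionary contents, and the `io.BytesIO` buffer standing in for `wfile` are all made up for illustration; they are not from my real script. The idea is to decode only values that are still bytes, keep everything as Unicode while joining, and encode once at the point of writing. It should run under both Python 2 and Python 3.

```python
import io


def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already Unicode, so there is nothing to decode


# Hypothetical scraped pairs: one value is still raw bytes containing
# the \xc2\xa0 pair, the other is already a Unicode string.
values_dict = {
    u'status': b'sent\xc2\xa0ok',
    u'source': u'already text',
}

# Build every pair as Unicode; no encoding happens at this stage.
value_list = [u':'.join([to_text(k), to_text(v)])
              for k, v in sorted(values_dict.items())]
line = u'|'.join(value_list)

# Encode exactly once, at the output boundary.
buf = io.BytesIO()  # stand-in for the real output file
buf.write(line.encode('utf-8'))
```

The `isinstance` check is there to avoid the "decoding Unicode is not supported" TypeError, since a value that is already Unicode is passed through untouched.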