In an HTML page that I'm scraping with urllib2, the bytestring
\xc2\xa0 appears.

The page declares charset=utf-8, and the Chrome browser I'm using
displays the characters as a space.

The page requires authentication:
https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python
chokes because it cannot decode the bytestring as ASCII.

wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
163: ordinal not in range(128)
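
As far as I can tell, the join triggers an implicit ASCII decode of the
byte string once a unicode value is mixed in. A minimal sketch that
makes the same decode explicit (the sample bytes are made up for
illustration, not taken from the page):

```python
# Hypothetical sample value; the real one comes from the scraped page.
raw = b'Status:\xc2\xa0OK'

error_position = None
error_reason = None
try:
    # Python 2 attempts this implicitly when mixing str and unicode.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first undecodable byte.
    error_position = exc.start
    error_reason = exc.reason

print(error_position)
print(error_reason)
```

The reported position should be the offset of the 0xc2 byte within the
string being decoded.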

In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.

When I scrape the page using urllib2, however, the characters print
as     in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).

If I use IDLE to decode just the single byte referenced in the error
message as UTF-8, another error message is generated:

>>> weird = unicode('\xc2', 'utf-8')

Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0:
unexpected end of data
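
My understanding of why the single byte fails (a sketch based on the
UTF-8 bit layout; the variable names are mine): a byte of the form
110xxxxx opens a two-byte sequence and 10xxxxxx is a continuation
byte, so 0xc2 on its own is only half of a character.

```python
lead, trail = 0xC2, 0xA0

# 110xxxxx marks the first byte of a two-byte UTF-8 sequence.
is_two_byte_lead = (lead & 0b11100000) == 0b11000000
# 10xxxxxx marks a continuation byte.
is_continuation = (trail & 0b11000000) == 0b10000000

# Combine the payload bits: 5 from the lead byte, 6 from the trail byte.
code_point = ((lead & 0b00011111) << 6) | (trail & 0b00111111)

print(is_two_byte_lead)
print(is_continuation)
print(hex(code_point))
```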

If I attempt to decode the full bytestring, I don't obtain a human-
readable string (expecting, perhaps, a non-breaking space):

>>> weird = unicode('\xc2\xa0', 'utf-8')
>>> par = ' - '.join(['This is', weird])
>>> par
u'This is - \xa0'
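
To check what the decoded character actually is, rather than trusting
its repr, something like this sketch may help (unicodedata is in the
standard library; the variable names are mine):

```python
import unicodedata

decoded = b'\xc2\xa0'.decode('utf-8')

# Count the code points and look up the official Unicode name.
char_count = len(decoded)
char_name = unicodedata.name(decoded)

print(char_count)
print(char_name)
```

If the name comes back as NO-BREAK SPACE, the bytes were valid UTF-8
and the \xa0 shown at the prompt is just how the repr escapes that
character.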

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?

>>> weirder = unicode('\xc2\xa0', 'latin1')
>>> weirder
u'\xc2\xa0'
>>> 'This just gets ' + weirder
u'This just gets \xc2\xa0'
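
For comparison, a sketch of how the same two bytes come out under each
codec. Latin-1 maps every byte to exactly one character, so it never
raises an error, but here it yields two characters instead of one:

```python
raw = b'\xc2\xa0'

as_utf8 = raw.decode('utf-8')      # one code point: U+00A0
as_latin1 = raw.decode('latin-1')  # two code points: U+00C2, U+00A0

print(len(as_utf8))
print(len(as_latin1))
print([hex(ord(ch)) for ch in as_latin1])
```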

Or is it a Microsoft bytestring?

>>> weirder = unicode('\xc2\xa0', 'mbcs')
>>> 'This just gets ' + weirder
u'This just gets \xc2\xa0'

None of these codecs seem to work.

Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to decode every
value string as UTF-8, but Python doesn't seem to like that
when the string is already Unicode:

valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported
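
One way around the TypeError might be to decode only values that are
still byte strings. A minimal sketch, where to_text is a hypothetical
helper of my own and not part of urllib2 or any library:

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only if it is still a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode in Python 2, str in Python 3): leave as-is.
    return value


already_text = to_text(u'caf\xe9')
from_bytes = to_text(b'a\xc2\xa0b')

print(repr(already_text))
print(repr(from_bytes))
```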

The solution I've arrived at is to specify the encoding explicitly
both when reading value strings and when writing them out.

for k, v in valuesDict.iteritems():
    valuePair = ':'.join([k, v.encode('UTF-8')])
    [snip] ...
    wfile.write('|'.join(valueList))
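
An alternative I'm considering is to keep everything as text
internally and encode exactly once at the point of writing. A sketch
under assumptions: the dictionary contents are made up, to_text is a
hypothetical helper, and io.BytesIO stands in for the real output
file.

```python
import io

# Hypothetical scraped values: one still bytes, one already text.
values = {u'status': b'open\xc2\xa0now', u'id': u'42'}


def to_text(value, encoding='utf-8'):
    """Decode byte strings; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Build every pair as text, iterating keys in sorted order.
pairs = [u':'.join([to_text(key), to_text(values[key])])
         for key in sorted(values)]
line = u'|'.join(pairs)

# Encode a single time, at the output boundary.
wfile = io.BytesIO()
wfile.write(line.encode('utf-8'))

print(repr(wfile.getvalue()))
```

Joining only text values avoids the implicit ASCII decode, since no
byte string is ever mixed with a unicode one.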

I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?

How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?

Thanks!

