Forgot a part... You need the encoding list:

encodings = [
    'utf-8',
    'ascii',
    'cp1252',   # must come before latin-1, or it is never tried
    'latin-1',  # accepts any byte sequence, so it goes last
    ]
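
A note on ordering, since the loop stops at the first codec that does not raise: latin-1 maps every possible byte, so it can never fail, while cp1252 and ascii can. The following is a small sketch (written for Python 3; the helper name rejected_bytes is made up for illustration) that checks which single bytes each codec refuses.

```python
# Sketch: find the single byte values a codec refuses to decode.
# rejected_bytes is a hypothetical helper, not part of the original post.
def rejected_bytes(codec):
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

latin1_bad = rejected_bytes('latin-1')  # latin-1 defines all 256 bytes
cp1252_bad = rejected_bytes('cp1252')   # a few bytes are undefined in cp1252
ascii_bad = rejected_bytes('ascii')     # every byte from 0x80 upward
```

Because latin-1 rejects nothing, any codec listed after it in a first-fit loop is dead weight, which is why a catch-all like that is usually kept as the final fallback.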

Christian Ergh wrote:
Dylan wrote:

Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running into:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
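
To make the symptom concrete, here is a minimal Python 3 sketch of what those Word curly quotes look like as raw bytes. The sample string is invented; only the byte values 0x93 and 0x94 (the cp1252 left and right double quotes) are the point.

```python
# Invented sample: cp1252 curly quotes around a word, as raw bytes.
raw = b'\x93quoted\x94'

# As UTF-8 these bytes are invalid, so decoding raises.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# As cp1252 they become the intended curly quotes (U+201C, U+201D).
as_cp1252 = raw.decode('cp1252')

# As latin-1 they decode without error, but to C1 control characters.
as_latin1 = raw.decode('latin-1')
```

The latin-1 result is the tricky case: nothing fails, yet the text is silently wrong.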

Finally: For me this works. It all lives inside my own class, and the module has a logger, so for reuse you would need to adapt those parts. I am updating a PostgreSQL database, in case someone wonders about the __setattr__, and my class inherits from SQLObject.

    def doDecode(self, st):
        "Returns st decoded with the first encoding that doesn't fail, or None"
        for encoding in encodings:
            try:
                return st.decode(encoding)
            except UnicodeError:
                pass
        # No encoding in the list fit
        return None

    def setAttribute(self, name, data):
        import HTMLFilter
        data = self.doDecode(data)
        # Turn non-ASCII characters into numeric character references
        try:
            data = data.encode('ascii', "xmlcharrefreplace")
        except Exception:
            log.warning('new method did not fit')

        # Convert the character references back into characters
        try:
            if '&#' in data:
                data = HTMLFilter.HTMLDecode(data)
        except UnicodeDecodeError:
            log.debug('HTML decoding failed!!!')

        try:
            data = data.encode('utf-8')
        except Exception:
            log.warning('new utf 8 method did not fit')

        try:
            self.__setattr__(name, data)
        except Exception:
            log.debug('1. try failed: ')
            log.warning(type(data))
            log.debug(data)
            log.warning('Some unicode error while updating')
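
For anyone reading this on Python 3, where bytes and str are separate types, the same idea can be sketched roughly as below. This is only a sketch under assumptions: the function names and the CANDIDATES list are made up, and html.unescape from the standard library stands in for HTMLFilter.HTMLDecode, whose exact behaviour is not shown here.

```python
import html

# Assumed candidate order: strict codecs first, catch-all latin-1 last.
CANDIDATES = ['utf-8', 'cp1252', 'latin-1']

def decode_first_fit(raw, candidates=CANDIDATES):
    """Return raw decoded with the first codec that accepts it, else None."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None

def to_ascii_entities(text):
    """Replace non-ASCII characters with numeric character references."""
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

def normalise(raw):
    """Decode scraped bytes, resolve character references, return UTF-8 bytes."""
    text = decode_first_fit(raw)
    if text is None:
        raise ValueError('no candidate encoding fits')
    # Stand-in for HTMLFilter.HTMLDecode; note it also resolves named
    # entities such as &amp;, which may or may not be what you want.
    return html.unescape(text).encode('utf-8')
```

With these definitions, normalise(b'\x93hi\x94') goes through the cp1252 branch, and to_ascii_entities shows the intermediate form that the xmlcharrefreplace step produces.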
--
http://mail.python.org/mailman/listinfo/python-list
