Forgot a part... You need the encoding list:
encodings = [ 'utf-8', 'latin-1', 'ascii', 'cp1252', ]
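One thing worth knowing about a list like this: the order matters, because latin-1 maps every possible byte to a character and so never raises. Anything listed after it is never tried. Below is a small sketch of the same try-each-encoding idea, written for Python 3 (bytes in, str out), which is my assumption and not what the original post ran on. The names ENCODINGS and decode_first_match are made up for illustration, and putting cp1252 ahead of latin-1 is my suggestion, not the poster's list.

```python
# Python 3 sketch; ENCODINGS and decode_first_match are illustrative names.
# latin-1 goes last because it accepts any byte sequence and never fails.
ENCODINGS = ['utf-8', 'cp1252', 'latin-1']


def decode_first_match(raw, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            continue
    raise ValueError('no encoding in the list could decode the input')


# Curly quotes as cp1252 bytes, the kind of thing pasted from Word.
curly = b'\x93quoted\x94'
text, used = decode_first_match(curly)

# With latin-1 ahead of cp1252 the same bytes decode "successfully"
# into C1 control characters instead of curly quotes.
wrong_text, wrong_used = decode_first_match(
    curly, ['utf-8', 'latin-1', 'ascii', 'cp1252'])
```

Here text comes back as the two curly quotes around "quoted" via cp1252, while the second call stops at latin-1. Keeping latin-1 as the last resort is still useful, since Python's cp1252 codec leaves a few bytes (0x81, for example) undefined and raises on them.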
Christian Ergh wrote:
Dylan wrote:
Finally: For me this works. It is all inside my own class, and the module has a logger, so for reuse you would need to fix this stuff... I am updating a PostgreSQL database, in case someone wonders about the __setattr__, and my class inherits from SQLObject.

Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running into:
- some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word
def doDecode(self, st):
    "Returns st decoded with the first encoding that doesn't fail"
    for encoding in encodings:
        try:
            stEncoded = st.decode(encoding)
            return stEncoded
        except UnicodeError:
            pass

def setAttribute(self, name, data):
    import HTMLFilter
    # byte string -> unicode, using the encodings list
    data = self.doDecode(data)
    # non-ASCII characters become numeric references like &#8220;
    try:
        data = data.encode('ascii', "xmlcharrefreplace")
    except:
        log.warn('new method did not fit')

    # turn the numeric references back into characters
    try:
        if '&#' in data:
            data = HTMLFilter.HTMLDecode(data)
    except UnicodeDecodeError:
        log.debug('HTML decoding failed!!!')

    # the database gets UTF-8
    try:
        data = data.encode('utf-8')
    except:
        log.warn('new utf 8 method did not fit')

    try:
        self.__setattr__(name, data)
    except:
        log.debug('1. try failed: ')
        log.warning(type(data))
        log.debug(data)
        log.warning('Some unicode error while updating')
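To make the middle steps of setAttribute easier to follow, here is a rough Python 3 sketch of the same round trip: characters to numeric references, references back to characters, then UTF-8 bytes. HTMLFilter is the poster's own module and its HTMLDecode is not shown in the thread, so html.unescape from the standard library stands in for it here. That substitution is an assumption on my part, and html.unescape also decodes named entities such as &amp;, which the original may or may not do.

```python
import html

# Start from already-decoded text containing curly quotes.
text = '\u201cquoted\u201d'

# Characters outside ASCII are written as decimal character references.
ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')

# Stand-in for HTMLFilter.HTMLDecode (assumption): references -> characters.
restored = html.unescape(ascii_bytes.decode('ascii'))

# Final form handed to the database layer.
utf8_bytes = restored.encode('utf-8')
```

For text that is already proper Unicode the first two steps cancel out. As far as I can tell, the practical effect is that references which were already sitting in the scraped HTML get decoded along the way, so everything ends up as real characters before the UTF-8 step.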
-- http://mail.python.org/mailman/listinfo/python-list