Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Max M wrote: A simple way to try out different encodings in a given order: The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is somewhat redundant. The 'ASCII' case is never considered, since Latin-1 effectively works as a catch-all encoding (as all byte sequences can be considered Latin-1
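
A small sketch of the point about Latin-1 being a catch-all, written for Python 3 (an assumption; the thread itself uses Python 2 `str.decode`). The helper name `first_working_encoding` and the sample bytes are hypothetical, chosen only for illustration.

```python
# Python 3 sketch: `bytes` plays the role the thread's Python 2 `str` played.

# Latin-1 maps every byte value 0..255 to the code point with the same number,
# so decoding arbitrary bytes as Latin-1 cannot fail.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')
assert [ord(c) for c in text] == list(range(256))


def first_working_encoding(data, encodings):
    """Return the first name in `encodings` that decodes `data` without error.

    Hypothetical helper for illustration; returns None if nothing works.
    """
    for encoding in encodings:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# cp1252 curly quotes: not valid UTF-8, so the loop falls through to Latin-1.
sample = b'\x93quoted\x94'
print(first_working_encoding(sample, ('utf-8', 'latin-1', 'ascii')))

# Pure ASCII bytes are already valid UTF-8, so 'ascii' is not reached here either.
print(first_working_encoding(b'plain', ('utf-8', 'latin-1', 'ascii')))
```

With this ordering, any input that is not valid UTF-8 is accepted by Latin-1, so the 'ascii' entry is never reached.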

Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Christian Ergh wrote: Once more, indentation should be correct now, and the 128 is gone too. So, something like this? Yes, something like this. The tricky part is, of course, the fragments which you didn't implement. Also, it might be possible to do this in a for loop, e.g. for encoding in (pag

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Forgot a part... You need the encoding list: encodings = [ 'utf-8', 'latin-1', 'ascii', 'cp1252', ] Christian Ergh wrote: Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly e

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word Finally:

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
- snip -
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass
- snip -
This works fine, but after this
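
For readers on Python 3, here is a hedged port of the same try-each-encoding idea. The name `get_decoded`, the `ValueError` raised on total failure, and the candidate ordering are assumptions made for this sketch, not part of the original post. Note that Python's cp1252 codec leaves a few byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so it can fail where Latin-1 cannot; listing it before Latin-1 gives it a chance to be selected.

```python
def get_decoded(data, encodings):
    """Decode `data` (bytes) with the first encoding that does not fail.

    Returns a (text, encoding_name) pair. Hypothetical Python 3 port of the
    thread's get_encoded(); raises ValueError instead of silently returning
    None when every candidate fails.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


# Candidate order matters: strict encodings first, the catch-all last.
candidates = ['utf-8', 'cp1252', 'latin-1']

print(get_decoded('café'.encode('utf-8'), candidates))  # valid UTF-8
print(get_decoded(b'\x93Hi\x94', candidates))           # cp1252 curly quotes
print(get_decoded(b'\x81', candidates))                 # undefined in cp1252
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, which is why the order of the candidate list carries most of the decision.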

Re: character encoding conversion

2004-12-13 Thread Max M
Christian Ergh wrote: A simple way to try out different encodings in a given order: # -*- coding: latin-1 -*- def get_encoded(st, encodings): "Returns an encoding that doesn't fail" for encoding in encodings: try: st_encoded = st.decode(encoding) return st_en

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Once more, indentation should be correct now, and the 128 is gone too. So, something like this? Chris import urllib2 url = 'www.someurl.com' f = urllib2.urlopen(url) data = f.read() # if it is not in the pagecode, how do I get the encoding of the page? pageencoding = '???' xmlencoding = 'whatever

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Peter Otten wrote: Steven Bethard wrote: Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop i

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-13 Thread Peter Otten
Steven Bethard wrote: > Christian Ergh wrote: >> flag = true >> for char in data: >> if 127 < ord(char) < 128: >> flag = false >> if flag: >> try: >> data = data.encode('latin-1') >> except: >> pass > > A little OT, but (assuming I got your indentation right[1]

Re: character encoding conversion

2004-12-13 Thread Steven Bethard
Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop is exactly what the else clause of a
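
The snippet above is cut off, so the following assumes the remark refers to the `else` clause of a `for` loop, which runs only when the loop finishes without hitting `break`. It is a Python 3 sketch with a hypothetical function name, not the code from the post. Two side notes on the quoted loop: `127 < ord(char) < 128` can never be true for an integer, and Python spells the constants `True` and `False`.

```python
def describe_bytes(data):
    """Report whether `data` (bytes) contains any byte above 127.

    Hypothetical example: the for/else replaces the separate flag variable.
    """
    for byte in data:  # iterating over bytes yields integers in Python 3
        if byte > 127:
            result = 'contains non-ASCII bytes'
            break
    else:
        # Reached only if the loop ran to completion without `break`.
        result = 'ASCII only'
    return result


print(describe_bytes(b'plain text'))
print(describe_bytes(b'\x93curly\x94'))
```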

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Christian Ergh wrote: - it works with the characters I mentioned It does. - what encoding do you have in the end US-ASCII - and how exactly are you doing all this? All with somestring.decode() or... Can you please give an example for these 7 steps? I could, but I don't have the time - just try to

Re: character encoding conversion

2004-12-12 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII characters, and character refer
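
A short illustration of the decode-then-`xmlcharrefreplace` step described above, written for Python 3 (an assumption; the thread's `htmlstring.decode(...)` is Python 2). The sample bytes are made up for the example and the source encoding is taken as already guessed to be cp1252.

```python
# Sample input: cp1252 curly quotes, an en dash, and an e-acute (made-up data).
raw = b'\x93Hello\x94 \x96 caf\xe9'

# Step 1: bytes -> text, using the guessed source encoding.
text = raw.decode('cp1252')

# Step 2: text -> ASCII bytes; anything outside ASCII becomes a numeric
# character reference such as &#8220; instead of raising an error.
ascii_html = text.encode('us-ascii', 'xmlcharrefreplace')

print(ascii_html)  # b'&#8220;Hello&#8221; &#8211; caf&#233;'
```

The result contains only ASCII bytes, so it can be served or stored without declaring any particular encoding for the non-ASCII characters.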

character encoding conversion

2004-12-11 Thread Dylan
Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word I've searched an