Re: Print encoding problems in console
On 2011.07.15 07:02 PM, Pedro Abranches wrote:
> Now, if you're using your Python script in some shell script you
> might have to store the output in some variable, like this:
>
> $ var=`python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'`
>
> And what you get is:
>
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
> in position 0: ordinal not in range(128)
>
> So, Python is not able to detect the encoding of the output in
> a situation like that, in which the Python script is called not
> directly but inside backquotes.

FWIW, it works for me with Python 3:

$ x=$(/c/Python32/python -c print\(\'\\xe9\'\))
$ echo $x
é

I don't know how to get it to work with more than one command to Python;
bash always thinks the subsequent commands are for it:

$ x=$(/c/Python32/python -c import sys; print\(sys.stdout.encoding\); print\(\'\\xe9\'\))
  File "<string>", line 1
    import
         ^
SyntaxError: invalid syntax
bash: print(sys.stdout.encoding): command not found
bash: print('\xe9'): No such file or directory

This is using a very old MinGW bash, though.

-- 
CPython 3.2.1 | Windows NT 6.1.7601.17592 | Thunderbird 5.0
PGP/GPG Public Key ID: 0xF88E034060A78FCB
-- 
http://mail.python.org/mailman/listinfo/python-list
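The multi-statement case works if the whole program reaches `-c` as a single argument; a single-quoted string can contain literal newlines, so bash never sees the inner statements. A sketch (assuming a `python3` on PATH; Python 3 string literals make `'\xe9'` an é directly):

```shell
# The quoted argument spans several lines; bash passes it to -c whole.
x=$(python3 -c 'import sys
print(sys.stdout.encoding)
print("\xe9")')
echo "$x"
```

Semicolons inside the quotes would work just as well (`python3 -c 'import sys; print("\xe9")'`); the original attempt failed only because the quoting was escaped away before bash split the words.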
Re: Print encoding problems in console
I've used the code below successfully to deal with such a problem when
outputting filenames.

Python2x3 is at http://stromberg.dnsalias.org/svn/python2x3/ , but here
it's just being used to convert Python 3.x's byte strings to strings (to
eliminate the b'' stuff), while on 2.x it's an identity function - if
you're targeting 3.x alone, there's no need to take a dependency on
python2x3.

If you really do need to output such characters, rather than replacing
them with ?'s, you could use os.write() to file descriptor 1 - that
works in both 2.x and 3.x.

import sys
import python2x3

def ascii_ize(binary):
    '''Replace non-ASCII characters with question marks; otherwise
    writing to sys.stdout tracebacks'''
    list_ = []
    question_mark_ordinal = ord('?')
    for ordinal in python2x3.binary_to_intlist(binary):
        if 0 <= ordinal <= 127:
            list_.append(ordinal)
        else:
            list_.append(question_mark_ordinal)
    return python2x3.intlist_to_binary(list_)

def output_filename(filename, add_eol=True):
    '''Output a filename to the tty (stdout), taking into account that
    some ttys do not allow non-ASCII characters'''
    if sys.stdout.encoding == 'US-ASCII':
        converted = python2x3.binary_to_string(ascii_ize(filename))
    else:
        converted = python2x3.binary_to_string(filename)
    replaced = converted.replace('\n', '?').replace('\r', '?').replace('\t', '?')
    sys.stdout.write(replaced)
    if add_eol:
        sys.stdout.write('\n')

On Fri, Jul 15, 2011 at 5:02 PM, Pedro Abranches wrote:
> Hello everyone.
>
> I'm having a problem when outputting UTF-8 strings to a console.
> Let me show a simple example that explains it:
>
> $ python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'
> UTF-8
> é
>
> Everything is OK.
>
> Now, if you're using your Python script in some shell script you might
> have to store the output in some variable, like this:
>
> $ var=`python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'`
>
> And what you get is:
>
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 0: ordinal not in range(128)
>
> So, Python is not able to detect the encoding of the output in a
> situation like that, in which the Python script is called not directly
> but inside backquotes.
>
> Why does this happen? Is there a way to solve it, either in Python or
> in shell code?
>
> Thanks,
> Pedro Abranches
Print encoding problems in console
Hello everyone.

I'm having a problem when outputting UTF-8 strings to a console.
Let me show a simple example that explains it:

$ python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'
UTF-8
é

Everything is OK.

Now, if you're using your Python script in some shell script you might
have to store the output in some variable, like this:

$ var=`python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'`

And what you get is:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)

So, Python is not able to detect the encoding of the output in a
situation like that, in which the Python script is called not directly
but inside backquotes.

Why does this happen? Is there a way to solve it, either in Python or in
shell code?

Thanks,
Pedro Abranches
Re: How to force SAX parser to ignore encoding problems
Łukasz wrote:
> I have a problem with my XML parser (created with libraries from the
> xml.sax package). When the parser finds an invalid character (in a
> CDATA section), for example �, it throws a SAXParseException.
>
> Is there any way to just ignore this kind of problem? Maybe there is a
> way to set up the parser in a less strict mode?
>
> I know that I can catch this exception, determine whether it is this
> kind of problem and then ignore it, but I am asking about a global
> setting.

The parser from libxml2 that lxml provides has a recovery option, i.e.
it can keep parsing regardless of errors and will drop the broken
content.

However, it is *always* better to fix the input, if you can get your
hands on it. Broken XML is *not* XML at all. If you can't fix the
source, you can never be sure that the data you received is in any way
complete or even usable.

Stefan
Re: How to force SAX parser to ignore encoding problems
On 31 Jul, 09:28, Łukasz wrote:
> Hi,
> I have a problem with my XML parser (created with libraries from the
> xml.sax package). When the parser finds an invalid character (in a
> CDATA section), for example ,

After sending this message I noticed that the example invalid
characters are not displayed on some platforms :)
How to force SAX parser to ignore encoding problems
Hi,

I have a problem with my XML parser (created with libraries from the
xml.sax package). When the parser finds an invalid character (in a
CDATA section), for example �, it throws a SAXParseException.

Is there any way to just ignore this kind of problem? Maybe there is a
way to set up the parser in a less strict mode?

I know that I can catch this exception, determine whether it is this
kind of problem and then ignore it, but I am asking about a global
setting.
Re: encoding problems
> is there a way to sort this string properly (sorted()?)
> I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters
> at the end). Or should I have to provide a comparison function to
> sorted?

After setting the locale... locale.strcoll()

-- 
damjan
Re: encoding problems
Ricardo Aráoz wrote:
> Lawrence D'Oliveiro wrote:
>> In message <[EMAIL PROTECTED]>, tool69 wrote:
>>
>>> p2.content = """Ce poste possède des accents : é à ê è"""
>>
>> My guess is this is being encoded as a Latin-1 string, but when you
>> try to output it it goes through the ASCII encoder, which doesn't
>> understand the accents. Try this:
>>
>> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
>
> is there a way to sort this string properly (sorted()?)
> I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters
> at the end). Or should I have to provide a comparison function to
> sorted?

First of all: please don't hijack threads. Start a new one with your
specific question.

Second: this might be what you are looking for:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Didn't try it myself, though.

Diez
Re: encoding problems
Lawrence D'Oliveiro wrote:
> In message <[EMAIL PROTECTED]>, tool69 wrote:
>
>> p2.content = """Ce poste possède des accents : é à ê è"""
>
> My guess is this is being encoded as a Latin-1 string, but when you
> try to output it it goes through the ASCII encoder, which doesn't
> understand the accents. Try this:
>
> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters
at the end). Or should I have to provide a comparison function to
sorted?
Re: encoding problems
Diez B. Roggisch wrote:
> tool69 wrote:
>
>> Hi,
>>
>> I would like to transform reST contents to HTML, but got problems
>> with accented chars.
>>
>> Here's a rather simplified version using SVN Docutils 0.5:
>>
>> %-
>>
>> #!/usr/bin/env python
>> # -*- coding: utf-8 -*-
>
> This declaration only affects unicode-literals.
>
>> from docutils.core import publish_parts
>>
>> class Post(object):
>>     def __init__(self, title='', content=''):
>>         self.title = title
>>         self.content = content
>>
>>     def _get_html_content(self):
>>         return publish_parts(self.content,
>>                              writer_name="html")["html_body"]
>>     html_content = property(_get_html_content)
>
> Did you know that you can do this like this:
>
>     @property
>     def html_content(self):
>         ...
>
> ?

I only took some part of the code from someone else (an old TurboGears
tutorial, if I remember). But you're right: decorators are better.

>> # Instantiate 2 Post objects
>> p1 = Post()
>> p1.title = "First post without accented chars"
>> p1.content = """This is the first.
>> ...blabla
>> ... end of post..."""
>>
>> p2 = Post()
>> p2.title = "Second post with accented chars"
>> p2.content = """Ce poste possède des accents : é à ê è"""
>
> This needs to be a unicode-literal:
>
>     p2.content = u"""Ce poste possède des accents : é à ê è"""
>
> Note the u in front.
>
> You need to encode a unicode-string into the encoding you want it in.
> Otherwise, the default (ascii) is taken.
>
> So
>
>     print post.html_content.encode("utf-8")
>
> should work.

That solved it: thank you so much.

> Diez
Re: encoding problems
Lawrence D'Oliveiro wrote:
> In message <[EMAIL PROTECTED]>, tool69 wrote:
>
>> p2.content = """Ce poste possède des accents : é à ê è"""
>
> My guess is this is being encoded as a Latin-1 string, but when you
> try to output it it goes through the ASCII encoder, which doesn't
> understand the accents. Try this:
>
> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")

Thanks for your answer Lawrence, but I still get the error.

Any other idea?
Re: encoding problems
tool69 wrote:
> Hi,
>
> I would like to transform reST contents to HTML, but got problems
> with accented chars.
>
> Here's a rather simplified version using SVN Docutils 0.5:
>
> %-
>
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-

This declaration only affects unicode-literals.

> from docutils.core import publish_parts
>
> class Post(object):
>     def __init__(self, title='', content=''):
>         self.title = title
>         self.content = content
>
>     def _get_html_content(self):
>         return publish_parts(self.content,
>                              writer_name="html")["html_body"]
>     html_content = property(_get_html_content)

Did you know that you can do this like this:

    @property
    def html_content(self):
        ...

?

> # Instantiate 2 Post objects
> p1 = Post()
> p1.title = "First post without accented chars"
> p1.content = """This is the first.
> ...blabla
> ... end of post..."""
>
> p2 = Post()
> p2.title = "Second post with accented chars"
> p2.content = """Ce poste possède des accents : é à ê è"""

This needs to be a unicode-literal:

    p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.

> for post in [p1, p2]:
>     print post.title, "\n" + "-"*30
>     print post.html_content
>
> %-
>
> The output gives me:
>
> First post without accented chars
> ------------------------------
>
> This is the first.
> ...blabla
> ... end of post...
>
> Second post with accented chars
> ------------------------------
> Traceback (most recent call last):
>   File "C:\Documents and
> Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in
> <module>
>     print post.html_content
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in
> position 39: ordinal not in range(128)

You need to encode a unicode-string into the encoding you want it in.
Otherwise, the default (ascii) is taken.

So

    print post.html_content.encode("utf-8")

should work.

Diez
Re: encoding problems
In message <[EMAIL PROTECTED]>, tool69 wrote:

> p2.content = """Ce poste possède des accents : é à ê è"""

My guess is this is being encoded as a Latin-1 string, but when you try
to output it it goes through the ASCII encoder, which doesn't understand
the accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
encoding problems
Hi,

I would like to transform reST contents to HTML, but got problems with
accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

# Instantiate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
...blabla
... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""

for post in [p1, p2]:
    print post.title, "\n" + "-"*30
    print post.html_content

%-

The output gives me:

First post without accented chars
------------------------------

This is the first.
...blabla
... end of post...

Second post with accented chars
------------------------------
Traceback (most recent call last):
  File "C:\Documents and
Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in
<module>
    print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in
position 39: ordinal not in range(128)

Any idea of what I've missed?

Thanks.
Re: encoding problems (é and è)
Serge Orlov wrote:
> The problem is that U+0587 is a ligature in the Western Armenian
> dialect (hy locale) and a character in the Eastern Armenian dialect
> (hy_AM locale). It is strange that the code point is marked as a
> compatibility char. It is either a mistake or a political decision.
> It used to be a ligature before the orthographic reform in the 1930s
> by the communist government in Armenia, then it became a character,
> but after the end of the Soviet Union (1991) they started to think
> about going back to the old orthography. Though it hasn't happened,
> and it's not clear if it will ever happen. So U+0587 is a character.

Thanks for the explanation. Without any knowledge, I would suspect a
combination of mistake and political decision. The Unicode consortium
(and ISO) always uses native language experts to come up with character
definitions, although the process is today likely more elaborate and
precise than in the early days. Likely, the Unicode consortium found
somebody speaking the Western Armenian dialect (given that many of
these speakers live in North America today); the decision might have
been a mixture of lack of knowledge, ignorance, and perhaps even
political bias.

Regards,
Martin
Re: encoding problems (é and è)
Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> wrote:
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>> John Machin wrote:
>>>
>>>> You can replace ALL of this upshifting and accent removal in one
>>>> blow by using the string translate() method with a suitable table.
>>>
>>> Only if you convert to unicode first or if your data maintains
>>> 1 byte == 1 character, in particular it is not UTF-8.
>>
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings
>> are for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
>
> Unfortunately, unicode only appears to solve this problem in a sane
> manner.

What problem do you mean? Loose matching is solved by unicode in a sane
manner; it is described in the Unicode collation algorithm.

Serge.
Re: encoding problems (é and è)
Martin v. Löwis wrote:
> John Machin wrote:
>>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
>>> YIWN), it does not even work.
>>
>> Sorry, I don't understand.
>> 0565 is stand-alone ECH
>> 0582 is stand-alone YIWN
>> 0587 is the ligature.
>> What doesn't work? At first guess, in the absence of an Armenian
>> informant, for pre-matching normalisation, I'd replace 0587 by the
>> two constituents -- just like 00DF would be expanded to "ss" (before
>> upshifting and before not caring too much about differences caused
>> by doubled letters).
>
> Looking at the UnicodeData helps here:
>
> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
> 0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;
>
> So U+0587 is a compatibility character for U+0565,U+0582. Not sure
> what the rationale for *this* compatibility character is, but in many
> cases, they are in Unicode only for compatibility with some existing
> encoding - if they had gone through the proper Unification, they
> should not have been introduced as separate characters.

The problem is that U+0587 is a ligature in the Western Armenian
dialect (hy locale) and a character in the Eastern Armenian dialect
(hy_AM locale). It is strange that the code point is marked as a
compatibility char. It is either a mistake or a political decision. It
used to be a ligature before the orthographic reform in the 1930s by
the communist government in Armenia, then it became a character, but
after the end of the Soviet Union (1991) they started to think about
going back to the old orthography. Though it hasn't happened, and it's
not clear if it will ever happen. So U+0587 is a character.

By the way, this char/ligature is present on both Western and Eastern
Armenian keyboard layouts:
http://www.datacal.com/products/armenian-western-layout.htm
It is between 9 and (. In Eastern Armenian this character is used in
the words և (the word "and" in English), արև ("sun" in English) and
hundreds of others. Needless to say how many documents exist with this
character.

> In many cases, ligature characters exist for typographical reasons;
> other examples are
>
> FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
> FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
> FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
> FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
> FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;
>
> In these cases, it is the font designers who want to have code points
> for these characters: the glyphs of the ligature cannot be
> automatically derived from the glyphs of the individual characters. I
> can only guess that the issue with that Armenian ligature is similar.
>
> Notice that the issue of U+00DF is entirely different: it is a
> character on its own, not a ligature. That a common transliteration
> for this character exists is again a different story.
>
> Now, as to what might not work: While compatibility decomposition
> (NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
> supported. This is intentional, of course: there is no "canonical"
> compatibility character for every decomposed code point.

Seems like NFKD will damage Eastern Armenian text (there are millions
of such documents). The result will be readable, but the text will look
strange to the person who wrote it.

Serge.
Re: encoding problems (é and è)
John Machin wrote:
>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
>> YIWN), it does not even work.
>
> Sorry, I don't understand.
> 0565 is stand-alone ECH
> 0582 is stand-alone YIWN
> 0587 is the ligature.
> What doesn't work? At first guess, in the absence of an Armenian
> informant, for pre-matching normalisation, I'd replace 0587 by the
> two constituents -- just like 00DF would be expanded to "ss" (before
> upshifting and before not caring too much about differences caused by
> doubled letters).

Looking at the UnicodeData helps here:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So U+0587 is a compatibility character for U+0565,U+0582. Not sure
what the rationale for *this* compatibility character is, but in many
cases, they are in Unicode only for compatibility with some existing
encoding - if they had gone through the proper Unification, they should
not have been introduced as separate characters.

In many cases, ligature characters exist for typographical reasons;
other examples are

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

In these cases, it is the font designers who want to have code points
for these characters: the glyphs of the ligature cannot be
automatically derived from the glyphs of the individual characters. I
can only guess that the issue with that Armenian ligature is similar.

Notice that the issue of U+00DF is entirely different: it is a
character on its own, not a ligature. That a common transliteration for
this character exists is again a different story.

Now, as to what might not work: While compatibility decomposition
(NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
supported. This is intentional, of course: there is no "canonical"
compatibility character for every decomposed code point.

Regards,
Martin
Re: encoding problems (é and è)
John Machin wrote:
> Some of the transformations are a little unfortunate :-(

here's a slightly silly way to map a unicode string to its "unaccented"
version:

###

import unicodedata, sys

CHAR_REPLACEMENT = {
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
}

class unaccented_map(dict):

    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        ch = unichr(key)
        try:
            ch = unichr(int(unicodedata.decomposition(ch).split()[0], 16))
        except (IndexError, ValueError):
            ch = CHAR_REPLACEMENT.get(key, ch)
        # uncomment the following line if you want to remove remaining
        # non-ascii characters
        # if ch >= u"\x80": return None
        self[key] = ch
        return ch

    if sys.version >= "2.5":
        __missing__ = mapchar
    else:
        __getitem__ = mapchar

assert isinstance(mystring, unicode)

print mystring.translate(unaccented_map())

###

if the source string is not unicode, you can use something like

    s = mystring.decode("iso-8859-1")
    s = s.translate(unaccented_map())
    s = s.encode("ascii", "ignore")

(this works well for characters in the latin-1 range, at least. no
guarantees for other character ranges)
Re: encoding problems (é and è)
On 24/03/2006 11:44 PM, Peter Otten wrote:
> John Machin wrote:
>
>> 0x00d0: ord('D'), # Ð
>> 0x00f0: ord('o'), # ð
>> Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
>
> I see information flow from Iceland is a bit better than from Armenia :-)

No information flow needed. Capital letter BLAH -> D and small letter
BLAH -> o should trigger one's palpable nonsense detector for *any*
BLAH.

>> Some of the transformations are a little unfortunate :-(
>
> The OP, as you pointed out in your first post in this thread, has more
> pressing problems with his normalization approach.
>
> Lastly, even if all went well, turning a list of French addresses into
> an ascii-uppercase graveyard would be a sad thing to do...

Oh indeed. Not only sad, but incredibly stupid. I fervently hope and
trust that such a normalisation is intended only for fuzzy matching
purposes. I can't imagine that anyone would contemplate writing the
output to storage for any reason other than logging or for regression
testing. Update it back to the database? Do you know anyone who would
do that??
Re: encoding problems (é and è)
Duncan Booth wrote:
> [...]
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on
> consecutive accented characters.
>
> :(

Replace the error handler with this (untested) and it should work with
consecutive accented characters:

    def latscii_error(uerr):
        v = []
        for c in uerr.object[uerr.start:uerr.end]:
            key = ord(c)
            try:
                v.append(unichr(decoding_map[key]))
            except KeyError:
                v.append(u"?")
        return (u"".join(v), uerr.end)

    codecs.register_error('replacelatscii', latscii_error)

Bye,
Walter Dörwald
Re: encoding problems (é and è)
John Machin wrote:
> 0x00d0: ord('D'), # Ð
> 0x00f0: ord('o'), # ð
> Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

I see information flow from Iceland is a bit better than from Armenia :-)

> Some of the transformations are a little unfortunate :-(

The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach.

Lastly, even if all went well, turning a list of French addresses into
an ascii-uppercase graveyard would be a sad thing to do...

Peter
Re: encoding problems (é and è)
On 24/03/2006 8:11 PM, Duncan Booth wrote:
> Peter Otten wrote:
>
>>> You can replace ALL of this upshifting and accent removal in one
>>> blow by using the string translate() method with a suitable table.
>>
>> Only if you convert to unicode first or if your data maintains
>> 1 byte == 1 character, in particular it is not UTF-8.
>
> There's a nice little codec from Skip Montanaro for removing accents from

For the benefit of those who may read only this far, it is NOT nice.

> latin-1 encoded strings. It also has an error handler so you can
> convert from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
> >>>
>
> So Bussiere could replace a large chunk of his code with:

Could, but definitely shouldn't.

> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his
> problem) his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on
> consecutive accented characters.
>
> :(

Some of the transformations are a little unfortunate :-(

    0x00d0: ord('D'), # Ð
    0x00f0: ord('o'), # ð

Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical
appearance), when they should become Th and th. The German letter
Eszett (00DF) becomes B (appearance) when it should be ss.

Creating alphabetics out of punctuation is scarcely something that
bussiere should be interested in:

    0x00a2: ord('c'), # ¢
    0x00a4: ord('o'), # ¤
    0x00a5: ord('Y'), # ¥
    0x00a7: ord('S'), # §
    0x00a9: ord('c'), # ©
    0x00ae: ord('R'), # ®
    0x00b6: ord('P'), # ¶
Re: encoding problems (é and è)
Duncan Booth wrote:
> There's a nice little codec from Skip Montanaro for removing accents
> from latin-1 encoded strings. It also has an error handler so you can
> convert from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
>
> So Bussiere could replace a large chunk of his code with:
>
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his
> problem) his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on
> consecutive accented characters.
>
> :(

You made me look into it -- and I found that reusing the decoding map
as the encoding map lets you write

>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the
handler, too:

>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py      2006-03-24 11:45:22.580588520 +0100
+++ latscii.py  2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@
 
 ### Encoding Map
 
-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map
 
 ### From Martin Blais
@@ -166,9 +166,9 @@
 ##     ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
Re: encoding problems (é and è)
Peter Otten wrote:
>> You can replace ALL of this upshifting and accent removal in one
>> blow by using the string translate() method with a suitable table.
>
> Only if you convert to unicode first or if your data maintains 1 byte
> == 1 character, in particular it is not UTF-8.

There's a nice little codec from Skip Montanaro for removing accents
from latin-1 encoded strings. It also has an error handler so you can
convert from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E
>>>

So Bussiere could replace a large chunk of his code with:

    ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
    ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his
problem) his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the
latscii module isn't as robust as I thought, it blows up on consecutive
accented characters.

:(
Re: encoding problems (é and è)
On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote: > On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> > wrote: > >> On 24/03/2006 8:36 AM, Peter Otten wrote: >> >>> John Machin wrote: >>> You can replace ALL of this upshifting and accent removal in one blow by using the string translate() method with a suitable table. >>> >>> >>> Only if you convert to unicode first or if your data maintains 1 byte >>> == 1 >>> character, in particular it is not UTF-8. >>> >> >> I'm sorry, I forgot that there were people who are unaware that >> variable-length gizmos like UTF-8 and various legacy CJK encodings are >> for storage & transmission, and are better changed to a >> one-character-per-storage-unit representation before *ANY* data >> processing is attempted. > > > Unfortunately, unicode only appears to solve this problem in a sane > manner. Most people conveniently forget (or never learn in the first > place) about combining sequences and denormalized forms. Consider > u'e\u0301', u'U\u0301', or u'C\u0327'. Yes, and many people don't even bother to look at their data. If they did, and found combining forms, then they would treat them as I said as "variable-length gizmos" which are "better changed to a one-character-per-storage-unit representation before *ANY* data processing is attempted." In any case, as the OP is upshifting and stripping accents [presumably as elementary preparation for some sort of fuzzy matching], all that is needed is to throw away the combining accents (0301, 0327, etc). > These difficulties can be > mitigated to some degree via normalization (see unicodedata.normalize), > but this step is often forgotten It's not a matter of forget or not. People should bother to examine their data and see what characters are in use; then they would know whether they had a problem or not. > and, for things like u'\u0565\u0582' > (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work. Sorry, I don't understand. 
0565 is stand-alone ECH 0582 is stand-alone YIWN 0587 is the ligature. What doesn't work? At first guess, in the absence of an Armenian informant, for pre-matching normalisation, I'd replace 0587 by the two constituents -- just like 00DF would be expanded to "ss" (before upshifting and before not caring too much about differences caused by doubled letters). -- http://mail.python.org/mailman/listinfo/python-list
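The ligature question can be settled directly with `unicodedata` in a modern Python. The compatibility form (NFKD) does expand U+0587 into stand-alone ECH + YIWN, which is exactly the pre-matching expansion suggested above; what "does not work" is the reverse direction, since compatibility mappings are one-way and NFC/NFKC will never recompose the pair into the ligature. A quick check:

```python
import unicodedata

lig = '\u0587'             # ARMENIAN SMALL LIGATURE ECH YIWN
ech_yiwn = '\u0565\u0582'  # stand-alone ECH followed by stand-alone YIWN

# Canonical forms leave the ligature alone; its decomposition is
# a *compatibility* mapping, so only NFKD/NFKC expand it.
print(unicodedata.normalize('NFD', lig) == lig)        # True
print(unicodedata.normalize('NFKD', lig) == ech_yiwn)  # True

# Combining sequences, by contrast, are handled by the canonical forms:
print(unicodedata.normalize('NFC', 'e\u0301'))         # 'é' (single code point U+00E9)
```

So for fuzzy matching, normalising both sides with NFKD (then discarding combining marks) covers both the accent case and the ligature case.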
Re: encoding problems (é and è)
On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> wrote: >On 24/03/2006 8:36 AM, Peter Otten wrote: >> John Machin wrote: >> >>>You can replace ALL of this upshifting and accent removal in one blow by >>>using the string translate() method with a suitable table. >> >> Only if you convert to unicode first or if your data maintains 1 byte == 1 >> character, in particular it is not UTF-8. >> > >I'm sorry, I forgot that there were people who are unaware that >variable-length gizmos like UTF-8 and various legacy CJK encodings are >for storage & transmission, and are better changed to a >one-character-per-storage-unit representation before *ANY* data >processing is attempted. Unfortunately, unicode only appears to solve this problem in a sane manner. Most people conveniently forget (or never learn in the first place) about combining sequences and denormalized forms. Consider u'e\u0301', u'U\u0301', or u'C\u0327'. These difficulties can be mitigated to some degree via normalization (see unicodedata.normalize), but this step is often forgotten and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work. > >:-) >Unicode? I'm just a benighted Anglo from the a**-end of the globe; who >am I to be preaching Unicode to a European? >(-: Heh ;P Same here. And I don't really claim to understand all this stuff, I just know enough to know it's really hard to do anything correctly. ;) Jean-Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problems (é and è)
On 24/03/2006 8:36 AM, Peter Otten wrote: > John Machin wrote: > >>You can replace ALL of this upshifting and accent removal in one blow by >>using the string translate() method with a suitable table. > > Only if you convert to unicode first or if your data maintains 1 byte == 1 > character, in particular it is not UTF-8. > I'm sorry, I forgot that there were people who are unaware that variable-length gizmos like UTF-8 and various legacy CJK encodings are for storage & transmission, and are better changed to a one-character-per-storage-unit representation before *ANY* data processing is attempted. :-) Unicode? I'm just a benighted Anglo from the a**-end of the globe; who am I to be preaching Unicode to a European? (-: -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problems (é and è)
John Machin wrote: > You can replace ALL of this upshifting and accent removal in one blow by > using the string translate() method with a suitable table. Only if you convert to unicode first or if your data maintains 1 byte == 1 character, in particular it is not UTF-8. Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problems (é and è)
On 23/03/2006 10:07 PM, bussiere bussiere wrote:
> hi i'am making a program for formatting string,
> or
> i've added :
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> in the begining of my script but
>
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
>
> doesn't work it put me " and , instead of remplacing é by E
>
> if someone have an idea it could be great

Hi, I've added some comments below ... I hope they help.

Cheers,
John

>
> regards
> Bussiere
> ps : i've added the whole script under :
> __
[snip]
>
> if ligneA != "":
>     str = ligneA
>     str = str.replace('a', 'A')
[snip]
>     str = str.replace('z', 'Z')
>
>     str = str.replace('ç', 'C')
>     str = str.replace('Ç', 'C')
>     str = str.replace('é', 'E')
>     str = str.replace('É', 'E')
>     str = str.replace('è', 'E')
[snip]
>     str = str.replace('Ú','U')

You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.

>     str = str.replace('  ', ' ')
>     str = str.replace('  ', ' ')
>     str = str.replace('', ' ')

The standard Python idiom for normalising whitespace is

strg = ' '.join(strg.split())

>>> strg = '  ALLO  BUSSIERE\tCA  VA?  '
>>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
>>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
>>>

[snip]
> if normalisation2 == "O":
>     str = str.replace('MONSIEUR', 'M')
>     str = str.replace('MR', 'M')

You need to be very careful with this approach. You are changing EVERY
occurrence of "MR" in the string, not just where it is a whole "word"
meaning "Monsieur". Constructed example of what can go wrong:

>>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
>>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'
>>>

A real, non-constructed history lesson: A certain database indicated
duplicate records by having the annotation "DUP" in the surname field
e.g. "SMITH DUP".
Fortunately it was detected in testing that the so-called clean-up was
causing DUPLESSIS to become PLESSIS and DUPRAT to become RAT!

Two points here:

(1) Split up your strings into "words" or "tokens". Using strg.split()
is a start but you may need something more sophisticated e.g. "-" as an
additional token separator.

(2) Instead of writing out all those lines of code, consider putting
those substitutions in a dictionary:

title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    # etc
}

Next level of improvement is to read that stuff from a file.

[snip]
>
> if normalisation4 == "O":
>     str = str.replace(';\"', ' ')
>     str = str.replace('\"', ' ')
>     str = str.replace('\'', ' ')
>     str = str.replace('-', ' ')
>     str = str.replace(',', ' ')
>     str = str.replace('\\', ' ')
>     str = str.replace('\/', ' ')
>     str = str.replace('&', ' ')
[snip]

Again, consider the string translate() method. Also, consider that some
of those characters may have some meaning that you perhaps shouldn't
blow away e.g. compare 'SMITH & WESSON' with 'SMITH ET WESSON' :-)
-- 
http://mail.python.org/mailman/listinfo/python-list
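Points (1) and (2) above combine into a short sketch. The substitution entries mirror the script's; the helper name and the token-splitting pattern are illustrative choices of mine, not from the original post:

```python
import re

# Substitution table in the spirit of John's suggestion.
TITLE_SUBSTITUTION = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    'DOCTEUR': 'DR',
}

def normalise_titles(line):
    # Split on whitespace and '-' but keep the separators (the
    # capturing group makes re.split return them), so the line can be
    # reassembled exactly. Only whole tokens are replaced, so the 'MR'
    # inside 'IMRE' or 'PRIMROSE' is left untouched.
    tokens = re.split(r'([ \t-]+)', line)
    return ''.join(TITLE_SUBSTITUTION.get(t, t) for t in tokens)

print(normalise_titles('MR IMRE NAGY'))  # 'M IMRE NAGY'
```

Reading the table from a file, as suggested, only changes how `TITLE_SUBSTITUTION` is populated.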
Re: encoding problems (é and è)
Seems to work fine for me. >>> x="éÇ" >>> x=x.replace('é','E') 'E\xc7' >>> x=x.replace('Ç','C') >>> x 'E\xc7' >>> x=x.replace('Ç','C') >>> x 'EC' You should also be able to use .upper() method to uppercase everything in the string in a single statement: tstr=ligneA.upper() Note: you should never use 'str' as a variable as it will mask the built-in str function. -Larry Bates bussiere bussiere wrote: > hi i'am making a program for formatting string, > or > i've added : > #!/usr/bin/python > # -*- coding: utf-8 -*- > > in the begining of my script but > > str = str.replace('Ç', 'C') > str = str.replace('é', 'E') > str = str.replace('É', 'E') > str = str.replace('è', 'E') > str = str.replace('È', 'E') > str = str.replace('ê', 'E') > > > doesn't work it put me " and , instead of remplacing é by E > > > if someone have an idea it could be great > > regards > Bussiere > ps : i've added the whole script under : > > > > > > > __ > > > > > #!/usr/bin/python > # -*- coding: utf-8 -*- > import fileinput, glob, string, sys, os, re > > fichA=raw_input("Entrez le nom du fichier d'entree : ") > print ("\n") > fichC=raw_input("Entrez le nom du fichier de sortie : ") > print ("\n") > normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue-> > AV) (O/N) ou A pour tout normaliser \n") > normalisation1 = normalisation1.upper() > > if normalisation1 != "A": > print ("\n") > normalisation2 = raw_input("Normaliser les civilités (ex : > Docteur-> DR) (O/N) \n") > normalisation2 = normalisation2.upper() > print ("\n") > normalisation3 = raw_input("Normaliser les Adresses 2 (ex : > Place-> PL) (O/N) \n") > normalisation3 = normalisation3.upper() > > > normalisation4 = raw_input("Normaliser les caracteres / et - (ex : > / -> ) (O/N) \n" ) > normalisation4 = normalisation4.upper() > > if normalisation1 == "A": > normalisation1 = "O" > normalisation2 = "O" > normalisation3 = "O" > normalisation4 = "O" > > > fiA=open(fichA,"r") > fiC=open(fichC,"w") > > > compteur = 0 > > while 1: > > 
ligneA=fiA.readline() > > > > if ligneA == "": > > break > > if ligneA != "": > str = ligneA > str = str.replace('a', 'A') > str = str.replace('b', 'B') > str = str.replace('c', 'C') > str = str.replace('d', 'D') > str = str.replace('e', 'E') > str = str.replace('f', 'F') > str = str.replace('g', 'G') > str = str.replace('h', 'H') > str = str.replace('i', 'I') > str = str.replace('j', 'J') > str = str.replace('k', 'K') > str = str.replace('l', 'L') > str = str.replace('m', 'M') > str = str.replace('n', 'N') > str = str.replace('o', 'O') > str = str.replace('p', 'P') > str = str.replace('q', 'Q') > str = str.replace('r', 'R') > str = str.replace('s', 'S') > str = str.replace('t', 'T') > str = str.replace('u', 'U') > str = str.replace('v', 'V') > str = str.replace('w', 'W') > str = str.replace('x', 'X') > str = str.replace('y', 'Y') > str = str.replace('z', 'Z') > > str = str.replace('ç', 'C') > str = str.replace('Ç', 'C') > str = str.replace('é', 'E') > str = str.replace('É', 'E') > str = str.replace('è', 'E') > str = str.replace('È', 'E') > str = str.replace('ê', 'E') > str = str.replace('Ê', 'E') > str = str.replace('ë', 'E') > str = str.replace('Ë', 'E') > str = str.replace('ä', 'A') > str = str.replace('Ä', 'A') > str = str.replace('à', 'A') > str = str.replace('À', 'A') > str = str.replace('Á', 'A') > str = str.replace('Â', 'A') > str = str.replace('Ä', 'A') > str = str.replace('Ã', 'A') > str = str.replace('â', 'A') > str = str.replace('Ä', 'A') > str = str.replace('ï', 'I') > str = str.replace('Ï', 'I') > str = str.replace('î', 'I') > str = str.replace('Î', 'I') > str = str.replace('ô', 'O') > str = str.replace('Ô', 'O') > str = str.replace('ö', 'O') > str = str.replace('Ö', 'O') > str = str.replace('Ú','U') > str = str.replace(' ', ' ') > str = str.replace(' ', ' ') > str = str.replace('', ' ') > > > > if normalisation1 == "O": > str = str.replace('AVENUE', 'AV') > str = str.replace('BOULEVARD', 'BD') > str = str.replace('FAUBOURG', 'FBG') > str = 
str.replace('GENERAL', 'GAL') > str = str.replace('COMMANDANT', 'CMDT') > str = str.replace('MARECHAL', 'MAL') > str = str.replace('PRESIDENT', 'PRDT') > str = str.replace('SAINT', 'ST') >
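For reference, here is what the translate()-table approach that both replies point at looks like in Python 3, where str is already unicode. The table covers only the accented letters the quoted script handles, and the whitespace step is the standard `' '.join(s.split())` idiom; this is a sketch, not a drop-in rewrite of the script:

```python
# Uppercase-accent -> plain-letter table; calling upper() first maps the
# lowercase accented letters onto these same entries.
TABLE = str.maketrans('ÇÉÈÊËÄÀÁÂÃÏÎÔÖÚ', 'CEEEEAAAAAIIOOU')

def normalise(line):
    up = line.upper().translate(TABLE)  # upshift + strip accents in one pass
    return ' '.join(up.split())         # collapse runs of whitespace

print(normalise('  allô  bussière\tça  va ?  '))  # 'ALLO BUSSIERE CA VA ?'
```

This replaces the several dozen `str.replace` calls (all 26 letter upshifts plus every accent pair plus the double-space loop) with three lines.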
Re: encoding problems (é and è)
bussiere bussiere wrote: > hi i'am making a program for formatting string, > i've added : > #!/usr/bin/python > # -*- coding: utf-8 -*- > > in the begining of my script but > > str = str.replace('Ç', 'C') > ... > doesn't work it put me " and , instead of remplacing é by E Are your sure your script and your input file *is* actually encoded with utf-8? If it does not work as expected, it is probably latin-1, just like your posting. Try changing the coding to latin-1. Does it work now? -- Christoph -- http://mail.python.org/mailman/listinfo/python-list
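Christoph's hypothesis is easy to test mechanically: latin-1 bytes for accented letters are almost never valid UTF-8, so a strict decode distinguishes the two encodings. A minimal check along those lines:

```python
# An 'é' saved by a latin-1 editor is the single byte 0xE9, which is an
# invalid UTF-8 sequence -- so decoding strictly as UTF-8 raises.
data = 'é'.encode('latin-1')   # b'\xe9', as it would appear in a latin-1 file
try:
    data.decode('utf-8')
    print('looks like UTF-8')
except UnicodeDecodeError:
    print('not valid UTF-8 -- probably latin-1:', data.decode('latin-1'))
```

Running the same check over the script file itself would confirm (or refute) that its coding declaration matches its actual bytes.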
encoding problems (é and è)
hi i'am making a program for formatting string, or i've added : #!/usr/bin/python # -*- coding: utf-8 -*- in the begining of my script but str = str.replace('Ç', 'C') str = str.replace('é', 'E') str = str.replace('É', 'E') str = str.replace('è', 'E') str = str.replace('È', 'E') str = str.replace('ê', 'E') doesn't work it put me " and , instead of remplacing é by E if someone have an idea it could be great regards Bussiere ps : i've added the whole script under : __ #!/usr/bin/python # -*- coding: utf-8 -*- import fileinput, glob, string, sys, os, re fichA=raw_input("Entrez le nom du fichier d'entree : ") print ("\n") fichC=raw_input("Entrez le nom du fichier de sortie : ") print ("\n") normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue-> AV) (O/N) ou A pour tout normaliser \n") normalisation1 = normalisation1.upper() if normalisation1 != "A": print ("\n") normalisation2 = raw_input("Normaliser les civilités (ex : Docteur-> DR) (O/N) \n") normalisation2 = normalisation2.upper() print ("\n") normalisation3 = raw_input("Normaliser les Adresses 2 (ex : Place-> PL) (O/N) \n") normalisation3 = normalisation3.upper() normalisation4 = raw_input("Normaliser les caracteres / et - (ex : / -> ) (O/N) \n" ) normalisation4 = normalisation4.upper() if normalisation1 == "A": normalisation1 = "O" normalisation2 = "O" normalisation3 = "O" normalisation4 = "O" fiA=open(fichA,"r") fiC=open(fichC,"w") compteur = 0 while 1: ligneA=fiA.readline() if ligneA == "": break if ligneA != "": str = ligneA str = str.replace('a', 'A') str = str.replace('b', 'B') str = str.replace('c', 'C') str = str.replace('d', 'D') str = str.replace('e', 'E') str = str.replace('f', 'F') str = str.replace('g', 'G') str = str.replace('h', 'H') str = str.replace('i', 'I') str = str.replace('j', 'J') str = str.replace('k', 'K') str = str.replace('l', 'L') str = str.replace('m', 'M') str = str.replace('n', 'N') str = str.replace('o', 'O') str = str.replace('p', 'P') str = str.replace('q', 'Q') str = 
str.replace('r', 'R') str = str.replace('s', 'S') str = str.replace('t', 'T') str = str.replace('u', 'U') str = str.replace('v', 'V') str = str.replace('w', 'W') str = str.replace('x', 'X') str = str.replace('y', 'Y') str = str.replace('z', 'Z') str = str.replace('ç', 'C') str = str.replace('Ç', 'C') str = str.replace('é', 'E') str = str.replace('É', 'E') str = str.replace('è', 'E') str = str.replace('È', 'E') str = str.replace('ê', 'E') str = str.replace('Ê', 'E') str = str.replace('ë', 'E') str = str.replace('Ë', 'E') str = str.replace('ä', 'A') str = str.replace('Ä', 'A') str = str.replace('à', 'A') str = str.replace('À', 'A') str = str.replace('Á', 'A') str = str.replace('Â', 'A') str = str.replace('Ä', 'A') str = str.replace('Ã', 'A') str = str.replace('â', 'A') str = str.replace('Ä', 'A') str = str.replace('ï', 'I') str = str.replace('Ï', 'I') str = str.replace('î', 'I') str = str.replace('Î', 'I') str = str.replace('ô', 'O') str = str.replace('Ô', 'O') str = str.replace('ö', 'O') str = str.replace('Ö', 'O') str = str.replace('Ú','U') str = str.replace(' ', ' ') str = str.replace(' ', ' ') str = str.replace('', ' ') if normalisation1 == "O": str = str.replace('AVENUE', 'AV') str = str.replace('BOULEVARD', 'BD') str = str.replace('FAUBOURG', 'FBG') str = str.replace('GENERAL', 'GAL') str = str.replace('COMMANDANT', 'CMDT') str = str.replace('MARECHAL', 'MAL') str = str.replace('PRESIDENT', 'PRDT') str = str.replace('SAINT', 'ST') str = str.replace('SAINTE', 'STE') str = str.replace('LOTISSEMENT', 'LOT') str = str.replace('RESIDENCE', 'RES') str = str.replace('IMMEUBLE', 'IMM') str = str.replace('IMEUBLE', 'IMM') str = str.replace('BATIMENT', 'BAT') if normalisation2 == "O": str = str.replace('MONSIEUR', 'M') str = str.replace('MR', 'M') str = str.replace('MADAME', 'MME') str = str.replace('MADEMOISELLE', 'MLLE') str = str.replace('DOCTEUR', 'DR') str = str.replace('PROFESSEUR', 'PR') str = str.replace('MONSEIGNEUR', 'MGR') str = str.replace('M ME','MME') if 
normalisation
Encoding problems with gettext and wxPython: how to do things in "good style"
I'm trying to change an app so that it uses gettext for translations
rather than the idiosyncratic way I am using. I've tried the example on
the wxPython wiki

http://wiki.wxpython.org/index.cgi/RecipesI18n

but found that the accented letters would not display properly. I have
found a workaround that works from Python in a Nutshell; however it is
said in that book that "...this is not good style". I would like to do
things "in good style" :-)

Here are some further details:

1. all the .po files are encoded in utf-8
2. my local sitecustomization uses iso-8859-1 (yes, I could easily
   change it on *my* computer, but I want the solution to work for
   anyone else, without asking them to change their local default
   encoding).
3. I am programming under Windows XP.

The workaround I use is to write the following at the beginning of the
script:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
del sys.setdefaultencoding

I tried various other ways to change the encoding in the example given,
but nothing else worked. I can live with the "bad style" workaround if
nothing else...

André
-- 
http://mail.python.org/mailman/listinfo/python-list
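The usual "good style" alternative in that era was to ask gettext for unicode directly instead of patching the interpreter's default encoding. A sketch (the domain and localedir names are placeholders):

```python
import gettext

# Python-2-era fix, avoiding reload(sys)/setdefaultencoding entirely:
#     t = gettext.translation('myapp', 'locale', languages=['fr'],
#                             codeset='utf-8')
#     t.install(unicode=True)   # _() then returns unicode objects,
#                               # which wxPython widgets display correctly
#
# In Python 3 the question is moot: translated strings are str
# (i.e. unicode) by default. NullTranslations stands in here for a
# real catalog so the snippet runs without .mo files:
t = gettext.NullTranslations()
t.install()                      # installs _() into builtins
print(_('déjà vu'))              # passes through unchanged, as str
```

The key point is that decoding happens inside the translation machinery, at the boundary, rather than by changing the process-wide default encoding.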
Re: encoding problems with pymssql / win
> (to email use "boris at batiment71 dot ch") oops, that's "boris at batiment71 dot net" -- http://mail.python.org/mailman/listinfo/python-list
encoding problems with pymssql / win
I have a strange problem : some code that fetches queries from an mssql database works fine under Idle but the very same code run from a shell window obtains its strings garbled as if the encoding codepage was modified. This occurs only when using pymssql to connect; if I connect through odbc (using the odbc module included in pywin) this problem vanishes. I briefly thought sys.getdefaultencoding() would pin it down but the value is "ascii" in all cases. If anyone has a lead as to what may be happening here or how to solve it (apart from using odbc, that is), advice is welcome. I am using python 2.4.2 and pymssql 0.7.3 under win2k sp4 on the client machine, while the server machine has windows server 2003 + sqlserver2k, locale is fr_CH. For an example of garbling, I get '\x8a' where I should get '\xe8' TIA, mc -- (to email use "boris at batiment71 dot ch") -- http://mail.python.org/mailman/listinfo/python-list
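The reported garbling is consistent with an OEM-codepage mix-up: in cp850 (the DOS/OEM codepage used by Western-European Windows consoles, fr_CH included) the byte 0x8A is 'è', while latin-1/cp1252 put 'è' at 0xE8. A sketch of the check, plus a hypothetical repair helper for values that came through the wrong codepage (the helper and its defaults are mine, not part of pymssql):

```python
# 'è' lives at different byte values in the two codepages:
print('è'.encode('latin-1'))   # b'\xe8'
print('è'.encode('cp850'))     # b'\x8a'

def fix_codepage(s, got='latin-1', want='cp850'):
    # Undo a mis-decode: assumes the driver returned cp850 bytes that
    # were read as if they were latin-1, so we re-encode with the wrong
    # codec and decode with the right one.
    return s.encode(got).decode(want)

print(fix_codepage('\x8a'))    # 'è'
```

If `fix_codepage` makes the garbled strings readable, the problem is the console/driver codepage rather than anything in the database itself.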