Re: Print encoding problems in console

2011-07-15 Thread Andrew Berg

On 2011.07.15 07:02 PM, Pedro Abranches wrote:
> Now, if you're using your python script in some shell script you
> might have to store the output in some variable, like this:
> 
> $ var=`python -c 'import sys; print sys.stdout.encoding; print
> u"\xe9"'`
> 
> And what you get is:
> 
> Traceback (most recent call last): File "<string>", line 1, in
> <module> UnicodeEncodeError: 'ascii' codec can't encode character
> u'\xe9' in position 0: ordinal not in range(128)
> 
> So Python is not able to detect the encoding of the output in
> a situation like that, in which the Python script is called not
> directly but inside backticks (``).
FWIW, it works for me with Python 3:
$ x=$(/c/Python32/python -c print\(\'\\xe9\'\))

$ echo $x
é

I don't know how to get it to work with more than one command to Python;
bash always thinks the next commands are for it:
$ x=$(/c/Python32/python -c import sys; print\(sys.output.encoding\);
print\(\'\\xe9\'\))
  File "<string>", line 1
import
 ^
SyntaxError: invalid syntax
bash: print(sys.output.encoding): command not found
bash: print('\xe9'): No such file or directory

This is using a very old MinGW bash, though.
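For what it's worth, the quoting fight disappears if the whole program is
wrapped in single quotes, so that bash hands the semicolons through to Python
untouched. A sketch (assuming a python3 on PATH rather than the /c/Python32
path above):

```shell
# Single quotes stop bash from splitting the program at the semicolons;
# double quotes inside the program avoid clashing with the outer quotes.
x=$(python3 -c 'import sys; print(sys.stdout.encoding); print("\xe9")')
echo "$x"
```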
- -- 
CPython 3.2.1 | Windows NT 6.1.7601.17592 | Thunderbird 5.0
PGP/GPG Public Key ID: 0xF88E034060A78FCB
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Print encoding problems in console

2011-07-15 Thread Dan Stromberg
I've used the code below successfully to deal with such a problem when
outputting filenames.  Python2x3 is at
http://stromberg.dnsalias.org/svn/python2x3/ , but here it's just being used
to convert Python 3.x's byte strings to strings (to eliminate the b''
stuff), while on 2.x it's an identity function - if you're targeting 3.x
alone, there's no need to take a dependency on python2x3.

If you really do need to output such characters, rather than replacing them
with ?'s, you could use os.write() to filedescriptor 1 - that works in both
2.x and 3.x.
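Dan's os.write() suggestion looks like this in practice (Python 3 syntax
shown; the bytes must already be encoded, since os.write bypasses
sys.stdout's codec entirely):

```python
import os
import sys

# Encode explicitly, then write the raw bytes to file descriptor 1.
# This sidesteps sys.stdout's encoding machinery, so it cannot raise
# UnicodeEncodeError no matter what the terminal or pipe encoding is.
data = "\xe9".encode("utf-8")  # é as UTF-8 bytes
os.write(sys.stdout.fileno(), data + b"\n")
```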

import sys

import python2x3  # http://stromberg.dnsalias.org/svn/python2x3/

def ascii_ize(binary):
    '''Replace non-ASCII characters with question marks; otherwise writing to
    sys.stdout raises UnicodeEncodeError'''
    list_ = []
    question_mark_ordinal = ord('?')
    for ordinal in python2x3.binary_to_intlist(binary):
        if 0 <= ordinal <= 127:
            list_.append(ordinal)
        else:
            list_.append(question_mark_ordinal)
    return python2x3.intlist_to_binary(list_)


def output_filename(filename, add_eol=True):
    '''Output a filename to the tty (stdout), taking into account that some
    ttys do not allow non-ASCII characters'''

    if sys.stdout.encoding == 'US-ASCII':
        converted = python2x3.binary_to_string(ascii_ize(filename))
    else:
        converted = python2x3.binary_to_string(filename)

    replaced = converted.replace('\n', '?').replace('\r', '?').replace('\t', '?')

    sys.stdout.write(replaced)

    if add_eol:
        sys.stdout.write('\n')
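
If you are targeting 3.x alone, the same ASCII-only fallback can be written
without python2x3, using the codec's own error handler. A sketch (not part of
Dan's library):

```python
def ascii_ize(text):
    """Replace non-ASCII characters with '?' so that writing to a
    US-ASCII terminal cannot raise UnicodeEncodeError."""
    # errors="replace" substitutes '?' for anything outside ASCII
    return text.encode("ascii", errors="replace").decode("ascii")
```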


On Fri, Jul 15, 2011 at 5:02 PM, Pedro Abranches  wrote:

> Hello everyone.
>
> I'm having a problem when outputting UTF-8 strings to a console.
> Let me show a simple example that explains it:
>
> $ python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'
> UTF-8
> é
>
> Everything is OK.
> Now, if you're using your python script in some shell script you might have
> to store the output in some variable, like this:
>
> $ var=`python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'`
>
> And what you get is:
>
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 0: ordinal not in range(128)
>
> So Python is not able to detect the encoding of the output in a
> situation like that, in which the Python script is called not directly but
> inside backticks (``).
>
> Why does this happen? Is there a way to solve it either in Python or in
> shell code?
>
> Thanks,
> Pedro Abranches
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>


Print encoding problems in console

2011-07-15 Thread Pedro Abranches
Hello everyone.

I'm having a problem when outputting UTF-8 strings to a console.
Let me show a simple example that explains it:

$ python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'
UTF-8
é

Everything is OK.
Now, if you're using your python script in some shell script you might have
to store the output in some variable, like this:

$ var=`python -c 'import sys; print sys.stdout.encoding; print u"\xe9"'`

And what you get is:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position
0: ordinal not in range(128)

So Python is not able to detect the encoding of the output in a
situation like that, in which the Python script is called not directly but
inside backticks (``).
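The tty-versus-pipe difference is easy to demonstrate (Python 3 shown; on
Python 2 the piped case reports an encoding of None, which is why the ASCII
codec ends up being used):

```python
import subprocess
import sys

# Run a child interpreter with stdout connected to a pipe, not a tty.
code = "import sys; print(sys.stdout.isatty(), sys.stdout.encoding)"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout)  # e.g. "False utf-8": piped stdout is not a tty
```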

Why does this happen? Is there a way to solve it either in Python or in shell
code?

Thanks,
Pedro Abranches


Re: How to force SAX parser to ignore encoding problems

2009-08-06 Thread Stefan Behnel
Łukasz wrote:
> I have a problem with my XML parser (created with libraries from the
> xml.sax package). When the parser finds an invalid character (in a CDATA
> section), for example �, it throws a SAXParseException.
> 
> Is there any way to just ignore this kind of problem? Maybe there is a
> way to set up the parser in a less strict mode?
> 
> I know that I can catch this exception and determine whether it is this
> kind of problem and then ignore it, but I am asking about a global
> setting.

The parser from libxml2 that lxml provides has a recovery option, i.e. it
can keep parsing regardless of errors and will drop the broken content.

However, it is *always* better to fix the input, if you get any hand on it.
Broken XML is *not* XML at all. If you can't fix the source, you can never
be sure that the data you received is in any way complete or even usable.
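Following the advice to fix the input: one pre-cleaning sketch (an assumption
of mine, not something Stefan posted) strips characters that are not legal in
XML 1.0 before handing the document to the parser, assuming the data has
already been decoded to a string:

```python
import re

# Characters allowed by XML 1.0 (BMP portion): tab, LF, CR, and the
# ranges 0x20-0xD7FF and 0xE000-0xFFFD.
_XML_ILLEGAL = re.compile(u'[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd]')

def strip_illegal_xml_chars(text):
    """Remove characters a strict XML parser would reject."""
    return _XML_ILLEGAL.sub(u'', text)
```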

Stefan


Re: How to force SAX parser to ignore encoding problems

2009-07-31 Thread Łukasz
On 31 Jul, 09:28, Łukasz  wrote:
> Hi,
> I have a problem with my XML parser (created with libraries from the
> xml.sax package). When the parser finds an invalid character (in a CDATA
> section) for example ,

After sending this message I noticed that the example invalid characters
are not displayed on some platforms :)



How to force SAX parser to ignore encoding problems

2009-07-31 Thread Łukasz
Hi,
I have a problem with my XML parser (created with libraries from the
xml.sax package). When the parser finds an invalid character (in a CDATA
section), for example �, it throws a SAXParseException.

Is there any way to just ignore this kind of problem? Maybe there is a
way to set up the parser in a less strict mode?

I know that I can catch this exception and determine whether it is this
kind of problem and then ignore it, but I am asking about a global
setting.



Re: encoding problems

2007-08-29 Thread Damjan
> 
> is there a way to sort this string properly (sorted()?)
> I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
> the end). Or do I have to provide a comparison function to sorted()?

After setting the locale...

locale.strcoll()
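
A sketch of Damjan's suggestion (the locale name is an assumption; whichever
UTF-8 locale is installed on your system will do, and the fallback keeps the
example runnable where fr_FR is absent):

```python
import locale

# Use a French locale for collation if available, else the user default.
for loc in ("fr_FR.UTF-8", ""):
    try:
        locale.setlocale(locale.LC_COLLATE, loc)
        break
    except locale.Error:
        continue

# strxfrm turns each string into a sortable key under the active locale,
# so accented letters collate next to their base letters rather than
# after 'z'.  (strcoll is the pairwise-comparison equivalent.)
words = sorted([u"e", u"\xe0", u"a", u"\xe9"], key=locale.strxfrm)
```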


-- 
damjan

Re: encoding problems

2007-08-29 Thread Diez B. Roggisch
Ricardo Aráoz wrote:

> Lawrence D'Oliveiro wrote:
>> In message <[EMAIL PROTECTED]>, tool69 wrote:
>> 
>>> p2.content = """Ce poste possède des accents : é à ê è"""
>> 
>> My guess is this is being encoded as a Latin-1 string, but when you try
>> to output it it goes through the ASCII encoder, which doesn't understand
>> the accents. Try this:
>> 
>> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
>> 
> 
> is there a way to sort this string properly (sorted()?)
> I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
> the end). Or do I have to provide a comparison function to sorted()?

First of all: please don't hijack threads. Start a new one with your
specific question.

Second: this might be what you are looking for:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Didn't try it myself though.
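
A lighter-weight alternative to a full collation table, if all you need is
the 'a, à, e, é' grouping, is to sort on an accent-stripped key. This sketch
uses only the stdlib, not the linked PyUCA module:

```python
import unicodedata

def accent_insensitive_key(s):
    """Decompose accented letters (NFD) and drop the combining marks,
    so 'à' sorts with 'a' rather than after 'z'."""
    decomposed = unicodedata.normalize("NFD", s)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

words = sorted([u"e", u"\xe0", u"a", u"\xe9"], key=accent_insensitive_key)
```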

Diez

Re: encoding problems

2007-08-29 Thread Ricardo Aráoz
Lawrence D'Oliveiro wrote:
> In message <[EMAIL PROTECTED]>, tool69 wrote:
> 
>> p2.content = """Ce poste possède des accents : é à ê è"""
> 
> My guess is this is being encoded as a Latin-1 string, but when you try to
> output it it goes through the ASCII encoder, which doesn't understand the
> accents. Try this:
> 
> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
> 

is there a way to sort this string properly (sorted()?)
I mean first 'a' then 'à' then 'e' etc. (sorted puts accented letters at
the end). Or do I have to provide a comparison function to sorted()?




Re: encoding problems

2007-08-29 Thread tool69
Diez B. Roggisch a écrit :
> tool69 wrote:
> 
>> Hi,
>>
>> I would like to transform reST contents to HTML, but got problems
>> with accented chars.
>>
>> Here's a rather simplified version using SVN Docutils 0.5:
>>
>> %-
>>
>> #!/usr/bin/env python
>> # -*- coding: utf-8 -*-
> 
> 
> This declaration only affects unicode-literals.
> 
>> from docutils.core import publish_parts
>>
>> class Post(object):
>>     def __init__(self, title='', content=''):
>>         self.title = title
>>         self.content = content
>>
>>     def _get_html_content(self):
>>         return publish_parts(self.content,
>>                              writer_name="html")["html_body"]
>>     html_content = property(_get_html_content)
> 
> Did you know that you can do this like this:
> 
> @property
> def html_content(self):
> ...
> 
> ?
> 

I only took that part of the code from someone else
(an old TurboGears tutorial, if I remember correctly).

But you're right: decorators are better.

>> # Instantiate 2 Post objects
>> p1 = Post()
>> p1.title = "First post without accented chars"
>> p1.content = """This is the first.
>> ...blabla
>> ... end of post..."""
>>
>> p2 = Post()
>> p2.title = "Second post with accented chars"
>> p2.content = """Ce poste possède des accents : é à ê è"""
> 
> 
> This needs to be a unicode-literal:
> 
> p2.content = u"""Ce poste possède des accents : é à ê è"""
> 
> Note the u in front.
>  


> 
> You need to encode a unicode-string into the encoding you want it.
> Otherwise, the default (ascii) is taken.
> 
> So 
> 
> print post.html_content.encode("utf-8")
> 
> should work.
> 

That solved it : thank you so much.

> Diez

Re: encoding problems

2007-08-29 Thread tool69
Lawrence D'Oliveiro a écrit :
> In message <[EMAIL PROTECTED]>, tool69 wrote:
> 
>> p2.content = """Ce poste possède des accents : é à ê è"""
> 
> My guess is this is being encoded as a Latin-1 string, but when you try to
> output it it goes through the ASCII encoder, which doesn't understand the
> accents. Try this:
> 
> p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")
> 

Thanks for your answer Lawrence, but I always got the error.
Any other idea ?


Re: encoding problems

2007-08-29 Thread Diez B. Roggisch
tool69 wrote:

> Hi,
> 
> I would like to transform reST contents to HTML, but got problems
> with accented chars.
> 
> Here's a rather simplified version using SVN Docutils 0.5:
> 
> %-
> 
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-


This declaration only affects unicode-literals.

> from docutils.core import publish_parts
> 
> class Post(object):
>     def __init__(self, title='', content=''):
>         self.title = title
>         self.content = content
> 
>     def _get_html_content(self):
>         return publish_parts(self.content,
>                              writer_name="html")["html_body"]
>     html_content = property(_get_html_content)

Did you know that you can do this like this:

@property
def html_content(self):
...

?

> # Instantiate 2 Post objects
> p1 = Post()
> p1.title = "First post without accented chars"
> p1.content = """This is the first.
> ...blabla
> ... end of post..."""
> 
> p2 = Post()
> p2.title = "Second post with accented chars"
> p2.content = """Ce poste possède des accents : é à ê è"""


This needs to be a unicode-literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.
 
> for post in [p1,p2]:
>     print post.title, "\n" + "-"*30
>     print post.html_content
> 
> %-
> 
> The output gives me :
> 
> First post without accented chars
> --
> 
> This is the first.
> ...blabla
> ... end of post...
> 
> 
> Second post with accented chars
> --
> Traceback (most recent call last):
> File "C:\Documents and
> Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
> print post.html_content
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in
> position 39:
> ordinal not in range(128)

You need to encode a unicode-string into the encoding you want it.
Otherwise, the default (ascii) is taken.

So 

print post.html_content.encode("utf-8")

should work.

Diez

Re: encoding problems

2007-08-29 Thread Lawrence D'Oliveiro
In message <[EMAIL PROTECTED]>, tool69 wrote:

> p2.content = """Ce poste possède des accents : é à ê è"""

My guess is this is being encoded as a Latin-1 string, but when you try to
output it it goes through the ASCII encoder, which doesn't understand the
accents. Try this:

p2.content = u"""Ce poste possède des accents : é à ê è""".encode("utf8")


encoding problems

2007-08-29 Thread tool69
Hi,

I would like to transform reST contents to HTML, but got problems
with accented chars.

Here's a rather simplified version using SVN Docutils 0.5:

%-

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docutils.core import publish_parts

class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    def _get_html_content(self):
        return publish_parts(self.content,
                             writer_name="html")["html_body"]
    html_content = property(_get_html_content)

# Instantiate 2 Post objects
p1 = Post()
p1.title = "First post without accented chars"
p1.content = """This is the first.
...blabla
... end of post..."""

p2 = Post()
p2.title = "Second post with accented chars"
p2.content = """Ce poste possède des accents : é à ê è"""

for post in [p1,p2]:
    print post.title, "\n" + "-"*30
    print post.html_content

%-

The output gives me :

First post without accented chars
--

This is the first.
...blabla
... end of post...


Second post with accented chars
--
Traceback (most recent call last):
File "C:\Documents and
Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in <module>
print post.html_content
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in 
position 39:
ordinal not in range(128)

Any idea of what I've missed ?

Thanks.


Re: encoding problems (é and è)

2006-03-25 Thread Martin v. Löwis
Serge Orlov wrote:
> The problem is that U+0587 is a ligature in Western Armenian dialect
> (hy locale) and a character in Eastern Armenian dialect (hy_AM locale).
> It is strange that the code point is marked as a compatibility char. It is
> either a mistake or a political decision. It used to be a ligature before
> the orthographic reform in the 1930s by the communist government in Armenia,
> then it became a character, but after the end of the Soviet Union (1991)
> they started to think about going back to the old orthography. Though that
> hasn't happened and it's not clear if it will ever happen. So U+0587 is a
> character.

Thanks for the explanation. Without any knowledge, I would suspect
a combination of mistake and political decision. The Unicode consortium
(and ISO) always uses native language experts to come up with character
definitions, although the process is today likely more elaborate and
precise than in the early days. Likely, the Unicode consortium found
somebody speaking the Western Armenian dialect (given that many of these
speakers live in North America today); the decision might have been
a mixture of lack of knowledge, ignorance, and perhaps even political
bias.

Regards,
Martin


Re: encoding problems (é and è)

2006-03-24 Thread Serge Orlov
Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> wrote:
> >On 24/03/2006 8:36 AM, Peter Otten wrote:
> >> John Machin wrote:
> >>
> >>>You can replace ALL of this upshifting and accent removal in one blow by
> >>>using the string translate() method with a suitable table.
> >>
> >> Only if you convert to unicode first or if your data maintains 1 byte == 1
> >> character, in particular it is not UTF-8.
> >>
> >
> >I'm sorry, I forgot that there were people who are unaware that
> >variable-length gizmos like UTF-8 and various legacy CJK encodings are
> >for storage & transmission, and are better changed to a
> >one-character-per-storage-unit representation before *ANY* data
> >processing is attempted.
>
> Unfortunately, unicode only appears to solve this problem in a sane manner.

What problem do you mean? Loose matching is solved by Unicode in a sane
manner; it is described in the Unicode collation algorithm.

  Serge.



Re: encoding problems (é and è)

2006-03-24 Thread Serge Orlov
Martin v. Löwis wrote:
> John Machin wrote:
> >> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
> >> YIWN), it does not even work.
> >
> > Sorry, I don't understand.
> > 0565 is stand-alone ECH
> > 0582 is stand-alone YIWN
> > 0587 is the ligature.
> > What doesn't work? At first guess, in the absence of an Armenian
> > informant, for pre-matching normalisation, I'd replace 0587 by the two
> > constituents -- just like 00DF would be expanded to "ss" (before
> > upshifting and before not caring too much about differences caused by
> > doubled letters).
>
> Looking at the UnicodeData helps here:
>
> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
> 0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;
>
> So U+0587 is a compatibility character for U+0565,U+0582. Not sure
> what the rationale for *this* compatibility character is, but in many
> cases, they are in Unicode only for compatibility with some existing
> encoding - if they had gone through the proper Unification, they should
> not have been introduced as separate characters.

The problem is that U+0587 is a ligature in Western Armenian dialect
(hy locale) and a character in Eastern Armenian dialect (hy_AM locale).
It is strange that the code point is marked as a compatibility char. It is
either a mistake or a political decision. It used to be a ligature before the
orthographic reform in the 1930s by the communist government in Armenia, then
it became a character, but after the end of the Soviet Union (1991) they
started to think about going back to the old orthography. Though that hasn't
happened and it's not clear if it will ever happen. So U+0587 is a
character. By the way, this char/ligature is present on both Western
and Eastern Armenian keyboard layouts:
http://www.datacal.com/products/armenian-western-layout.htm
It is between 9 and (. In Eastern Armenian this character is used in
words և ( the word "and" in English) , արև ( "sun" in English) and
hundreds others. Needless to say how many documents exist with this
character.

>
> In many cases, ligature characters exist for typographical reasons;
> other examples are
>
> FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
> FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
> FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
> FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
> FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;
>
> In these cases, it is the font designers which want to have code points
> for these characters: the glyphs of the ligature cannot be automatically
> derived from the glyphs of the individual characters. I can only guess
> that the issue with that Armenian ligature is similar.
>
> Notice that the issue of U+00DF is entirely different: it is a character
> on its own, not a ligature. That a common transliteration for this
> character exists is again a different story.
>
> Now, as to what might not work: While compatibility decomposition
> (NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
> supported. This is intentional, of course: there is no "canonical"
> compatibility character for every decomposed code point.

Seems like NFKD will damage Eastern Armenian text (there are millions
of such documents). The result will be readable but the text will look
strange to the person who wrote the text.

  Serge.


Re: encoding problems (é and è)

2006-03-24 Thread Martin v. Löwis
John Machin wrote:
>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH 
>> YIWN), it does not even work.
> 
> Sorry, I don't understand.
> 0565 is stand-alone ECH
> 0582 is stand-alone YIWN
> 0587 is the ligature.
> What doesn't work? At first guess, in the absence of an Armenian 
> informant, for pre-matching normalisation, I'd replace 0587 by the two 
> constituents -- just like 00DF would be expanded to "ss" (before 
> upshifting and before not caring too much about differences caused by 
> doubled letters).

Looking at the UnicodeData helps here:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So U+0587 is a compatibility character for U+0565,U+0582. Not sure
what the rationale for *this* compatibility character is, but in many
cases, they are in Unicode only for compatibility with some existing
encoding - if they had gone through the proper Unification, they should
not have been introduced as separate characters.

In many cases, ligature characters exist for typographical reasons; 
other examples are

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

In these cases, it is the font designers which want to have code points
for these characters: the glyphs of the ligature cannot be automatically
derived from the glyphs of the individual characters. I can only guess
that the issue with that Armenian ligature is similar.

Notice that the issue of U+00DF is entirely different: it is a character
on its own, not a ligature. That a common transliteration for this
character exists is again a different story.

Now, as to what might not work: While compatibility decomposition
(NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
supported. This is intentional, of course: there is no "canonical"
compatibility character for every decomposed code point.
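Martin's point about the one-way nature of compatibility decomposition can be
checked directly with unicodedata (Python 3 literals shown):

```python
import unicodedata

# NFKD expands the ligature to its two constituent letters...
assert unicodedata.normalize("NFKD", "\u0587") == "\u0565\u0582"

# ...but no normalization form recomposes them: NFC leaves the pair
# alone, because U+0587 has a compatibility decomposition, not a
# canonical one, so it never takes part in canonical composition.
assert unicodedata.normalize("NFC", "\u0565\u0582") == "\u0565\u0582"
```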

Regards,
Martin


Re: encoding problems (é and è)

2006-03-24 Thread Fredrik Lundh
John Machin wrote:

> Some of the transformations are a little unfortunate :-(

here's a slightly silly way to map a unicode string to its "unaccented"
version:

###

import unicodedata, sys

CHAR_REPLACEMENT = {
0xc6: u"AE", # LATIN CAPITAL LETTER AE
0xd0: u"D",  # LATIN CAPITAL LETTER ETH
0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
0xde: u"Th", # LATIN CAPITAL LETTER THORN
0xdf: u"ss", # LATIN SMALL LETTER SHARP S
0xe6: u"ae", # LATIN SMALL LETTER AE
0xf0: u"d",  # LATIN SMALL LETTER ETH
0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
0xfe: u"th", # LATIN SMALL LETTER THORN
}

class unaccented_map(dict):

def mapchar(self, key):
ch = self.get(key)
if ch is not None:
return ch
ch = unichr(key)
try:
ch = unichr(int(unicodedata.decomposition(ch).split()[0], 16))
except (IndexError, ValueError):
ch = CHAR_REPLACEMENT.get(key, ch)
# uncomment the following line if you want to remove remaining
# non-ascii characters
# if ch >= u"\x80": return None
self[key] = ch
return ch

if sys.version >= "2.5":
__missing__ = mapchar
else:
__getitem__ = mapchar

assert isinstance(mystring, unicode)

print mystring.translate(unaccented_map())

###

if the source string is not unicode, you can use something like

s = mystring.decode("iso-8859-1")
s = s.translate(unaccented_map())
s = s.encode("ascii", "ignore")

(this works well for characters in the latin-1 range, at least.  no
guarantees for other character ranges)







Re: encoding problems (é and è)

2006-03-24 Thread John Machin
On 24/03/2006 11:44 PM, Peter Otten wrote:
> John Machin wrote:
> 
> 
>>0x00d0: ord('D'), # Ð
>>0x00f0: ord('o'), # ð
>>Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
> 
> 
> I see information flow from Iceland is a bit better than from Armenia :-)

No information flow needed. Capital letter BLAH -> D and small letter 
BLAH -> o should trigger one's palpable nonsense detector for *any* BLAH.

> 
> 
>>Some of the transformations are a little unfortunate :-(
> 
> 
> The OP, as you pointed out in your first post in this thread, has more
> pressing problems with his normalization approach. 
> 
> Lastly, even if all went well, turning a list of French addresses into an
> ascii-uppercase graveyard would be a sad thing to do...

Oh indeed. Not only sad, but incredibly stupid. I fervently hope and 
trust that such a normalisation is intended only for fuzzy matching 
purposes. I can't imagine that anyone would contemplate writing the 
output to storage for any reason other than logging or for regression 
testing. Update it back to the database? Do you know anyone who would do 
that??



Re: encoding problems (é and è)

2006-03-24 Thread Walter Dörwald
Duncan Booth wrote:

> [...]
> Unfortunately, just as I finished writing this I discovered that the 
> latscii module isn't as robust as I thought, it blows up on consecutive 
> accented characters. 
> 
>  :(

Replace the error handler with this (untested) and it should work with
consecutive accented characters:

def latscii_error(uerr):
    v = []
    for c in uerr.object[uerr.start:uerr.end]:
        key = ord(c)
        try:
            v.append(unichr(decoding_map[key]))
        except KeyError:
            v.append(u"?")
    return (u"".join(v), uerr.end)
codecs.register_error('replacelatscii', latscii_error)

Bye,
   Walter Dörwald


Re: encoding problems (é and è)

2006-03-24 Thread Peter Otten
John Machin wrote:

> 0x00d0: ord('D'), # Ð
> 0x00f0: ord('o'), # ð
> Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

I see information flow from Iceland is a bit better than from Armenia :-)

> Some of the transformations are a little unfortunate :-(

The OP, as you pointed out in your first post in this thread, has more
pressing problems with his normalization approach. 

Lastly, even if all went well, turning a list of French addresses into an
ascii-uppercase graveyard would be a sad thing to do...

Peter


Re: encoding problems (é and è)

2006-03-24 Thread John Machin
On 24/03/2006 8:11 PM, Duncan Booth wrote:
> Peter Otten wrote:
> 
> 
>>>You can replace ALL of this upshifting and accent removal in one blow
>>>by using the string translate() method with a suitable table.
>>
>>Only if you convert to unicode first or if your data maintains 1 byte
>>== 1 character, in particular it is not UTF-8. 
>>
> 
> 
> There's a nice little codec from Skip Montaro for removing accents from 

For the benefit of those who may read only this far, it is NOT nice.

> latin-1 encoded strings. It also has an error handler so you can convert 
> from unicode to ascii and strip all the accents as you do so:
> 
> http://orca.mojam.com/~skip/python/latscii.py
> 
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
> 
> 
> So Bussiere could replace a large chunk of his code with:

Could, but definitely shouldn't.

> 
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
> 
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem) 
> his files are actually in some different encoding.
> 
> Unfortunately, just as I finished writing this I discovered that the 
> latscii module isn't as robust as I thought, it blows up on consecutive 
> accented characters. 
> 
>  :(
> 
Some of the transformations are a little unfortunate :-(
0x00d0: ord('D'), # Ð
0x00f0: ord('o'), # ð
Icelandic capital eth becomes D, OK; but the small letter becomes o!!!
The Icelandic thorn letters become P & p (based on physical appearance), 
when they should become Th and th.
The German letter Eszett (00DF) becomes B (appearance) when it should be ss.
Creating alphabetics out of punctuation is scarcely something that 
bussiere should be interested in:
 0x00a2: ord('c'), # ¢
 0x00a4: ord('o'), # ¤
 0x00a5: ord('Y'), # ¥
 0x00a7: ord('S'), # §
 0x00a9: ord('c'), # ©
 0x00ae: ord('R'), # ®
 0x00b6: ord('P'), # ¶


Re: encoding problems (é and è)

2006-03-24 Thread Peter Otten
Duncan Booth wrote:

> There's a nice little codec from Skip Montaro for removing accents from
> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
> 
> http://orca.mojam.com/~skip/python/latscii.py
> 
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
 
> 
> So Bussiere could replace a large chunk of his code with:
> 
> ligneA = ligneA.decode(INPUTENCODING).encode('ascii',
> 'replacelatscii') ligneA = ligneA.upper()
> 
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
> 
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
> 
>  :(

You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write

>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler,
too:

>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.

Peter

--- latscii_old.py  2006-03-24 11:45:22.580588520 +0100
+++ latscii.py  2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@

 ### Encoding Map

-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map


 ### From Martin Blais
@@ -166,9 +166,9 @@
 ##   ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-key = ord(uerr.object[uerr.start:uerr.end])
+key = ord(uerr.object[uerr.start])
 try:
-return unichr(decoding_map[key]), uerr.end
+return unichr(decoding_map[key]), uerr.start + 1
 except KeyError:
 handler = codecs.lookup_error('replace')
 return handler(uerr)
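For the curious, the handler idea can be sketched standalone in today's Python 3 (hypothetical handler name 'deaccent'; this is not the latscii module itself): decompose each offending character with NFKD, drop the combining marks, and fall back to '?'. Looping over the failing slice also avoids the consecutive-accents bug mentioned above:

```python
import codecs
import unicodedata

def deaccent_error(err):
    # Handle every character in the failing slice, so runs of
    # consecutive accented characters don't collapse into one '?'.
    out = []
    for ch in err.object[err.start:err.end]:
        base = ''.join(c for c in unicodedata.normalize('NFKD', ch)
                       if not unicodedata.combining(c))
        out.append(base if base and base.isascii() else '?')
    return ''.join(out), err.end

codecs.register_error('deaccent', deaccent_error)

print('Élève ééé'.encode('ascii', 'deaccent'))  # b'Eleve eee'
```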


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-24 Thread Duncan Booth
Peter Otten wrote:

>> You can replace ALL of this upshifting and accent removal in one blow
>> by using the string translate() method with a suitable table.
> 
> Only if you convert to unicode first or if your data maintains 1 byte
> == 1 character, in particular it is not UTF-8. 
> 

There's a nice little codec from Skip Montaro for removing accents from 
latin-1 encoded strings. It also has an error handler so you can convert 
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E
>>> 

So Bussiere could replace a large chunk of his code with:

ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem) 
his files are actually in some different encoding.

Unfortunately, just as I finished writing this I discovered that the 
latscii module isn't as robust as I thought, it blows up on consecutive 
accented characters. 

 :(

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread John Machin
On 24/03/2006 2:19 PM, Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> 
> wrote:
> 
>> On 24/03/2006 8:36 AM, Peter Otten wrote:
>>
>>> John Machin wrote:
>>>
 You can replace ALL of this upshifting and accent removal in one 
 blow by
 using the string translate() method with a suitable table.
>>>
>>>
>>> Only if you convert to unicode first or if your data maintains 1 byte 
>>> == 1
>>> character, in particular it is not UTF-8.
>>>
>>
>> I'm sorry, I forgot that there were people who are unaware that
>> variable-length gizmos like UTF-8 and various legacy CJK encodings are
>> for storage & transmission, and are better changed to a
>> one-character-per-storage-unit representation before *ANY* data
>> processing is attempted.
> 
> 
> Unfortunately, unicode only appears to solve this problem in a sane 
> manner.  Most people conveniently forget (or never learn in the first 
> place) about combining sequences and denormalized forms.  Consider 
> u'e\u0301', u'U\u0301', or u'C\u0327'.

Yes, and many people don't even bother to look at their data. If they 
did, and found combining forms, then they would treat them as I said as 
"variable-length gizmos" which are "better changed to a
one-character-per-storage-unit representation before *ANY* data
processing is attempted."

In any case, as the OP is upshifting and stripping accents [presumably 
as elementary preparation for some sort of fuzzy matching], all that is 
needed is to throw away the combining accents (0301, 0327, etc).

> These difficulties can be
> mitigated to some degree via normalization (see unicodedata.normalize), 
> but this step is often forgotten

It's not a matter of forgetting or not. People should bother to examine 
their data and see what characters are in use; then they would know 
whether they had a problem or not.

> and, for things like u'\u0565\u0582' 
> (ARMENIAN SMALL LIGATURE ECH YIWN), it does not even work.

Sorry, I don't understand.
0565 is stand-alone ECH
0582 is stand-alone YIWN
0587 is the ligature.
What doesn't work? At first guess, in the absence of an Armenian 
informant, for pre-matching normalisation, I'd replace 0587 by the two 
constituents -- just like 00DF would be expanded to "ss" (before 
upshifting and before not caring too much about differences caused by 
doubled letters).
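Throwing away the combining accents after decomposition is only a few lines with unicodedata; a sketch (Python 3 spelling):

```python
import unicodedata

def strip_accents(s):
    # Decompose to base characters + combining marks, then drop the marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

assert strip_accents('e\u0301') == 'e'   # e + COMBINING ACUTE
assert strip_accents('\u00e9') == 'e'    # precomposed é
assert strip_accents('C\u0327') == 'C'   # C + COMBINING CEDILLA
```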
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread Jean-Paul Calderone
On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <[EMAIL PROTECTED]> wrote:
>On 24/03/2006 8:36 AM, Peter Otten wrote:
>> John Machin wrote:
>>
>>>You can replace ALL of this upshifting and accent removal in one blow by
>>>using the string translate() method with a suitable table.
>>
>> Only if you convert to unicode first or if your data maintains 1 byte == 1
>> character, in particular it is not UTF-8.
>>
>
>I'm sorry, I forgot that there were people who are unaware that
>variable-length gizmos like UTF-8 and various legacy CJK encodings are
>for storage & transmission, and are better changed to a
>one-character-per-storage-unit representation before *ANY* data
>processing is attempted.

Unfortunately, unicode only appears to solve this problem in a sane manner.  
Most people conveniently forget (or never learn in the first place) about 
combining sequences and denormalized forms.  Consider u'e\u0301', u'U\u0301', 
or u'C\u0327'.  These difficulties can be mitigated to some degree via 
normalization (see unicodedata.normalize), but this step is often forgotten 
and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH YIWN), it 
does not even work.
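For reference, the point about denormalized forms and the Armenian ligature can be checked directly (Python 3 spelling); the canonical forms NFC/NFD leave the ligature alone, while the compatibility forms do split it:

```python
import unicodedata

# The two spellings of é compare unequal until normalized:
assert 'e\u0301' != '\u00e9'
assert unicodedata.normalize('NFC', 'e\u0301') == '\u00e9'
assert unicodedata.normalize('NFD', '\u00e9') == 'e\u0301'

# U+0587 ARMENIAN SMALL LIGATURE ECH YIWN has no canonical
# decomposition, only a compatibility one:
assert unicodedata.normalize('NFD', '\u0587') == '\u0587'
assert unicodedata.normalize('NFKD', '\u0587') == '\u0565\u0582'
```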

>
>:-)
>Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
>am I to be preaching Unicode to a European?
>(-:

Heh ;P  Same here.  And I don't really claim to understand all this stuff, I 
just know enough to know it's really hard to do anything correctly. ;)

Jean-Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread John Machin
On 24/03/2006 8:36 AM, Peter Otten wrote:
> John Machin wrote:
> 
>>You can replace ALL of this upshifting and accent removal in one blow by
>>using the string translate() method with a suitable table.
> 
> Only if you convert to unicode first or if your data maintains 1 byte == 1
> character, in particular it is not UTF-8. 
> 

I'm sorry, I forgot that there were people who are unaware that 
variable-length gizmos like UTF-8 and various legacy CJK encodings are 
for storage & transmission, and are better changed to a 
one-character-per-storage-unit representation before *ANY* data 
processing is attempted.

:-)
Unicode? I'm just a benighted Anglo from the a**-end of the globe; who 
am I to be preaching Unicode to a European?
(-:
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread Peter Otten
John Machin wrote:

> You can replace ALL of this upshifting and accent removal in one blow by
> using the string translate() method with a suitable table.

Only if you convert to unicode first or if your data maintains 1 byte == 1
character, in particular it is not UTF-8. 

Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread John Machin
On 23/03/2006 10:07 PM, bussiere bussiere wrote:
> hi, I'm making a program for formatting strings.
> Now, I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> 
> at the beginning of my script, but
> 
>  str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
> 
> 
> it doesn't work: it gives me " and , instead of replacing é by E
> 
> 
> if someone has an idea, that would be great

Hi, I've added some comments below ... I hope they help.
Cheers,
John

> 
> regards
> Bussiere
> PS: I've added the whole script below:
> __
[snip]
> 
> if ligneA != "":
> str = ligneA
> str = str.replace('a', 'A')
[snip]
> str = str.replace('z', 'Z')
>
> str = str.replace('ç', 'C')
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
[snip]
> str = str.replace('Ú','U')

You can replace ALL of this upshifting and accent removal in one blow by 
using the string translate() method with a suitable table.
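A sketch of what that looks like (Python 3 spelling with str.maketrans; in the OP's Python 2, the byte-string equivalent lives in string.maketrans). The table below covers only the lowercase accents shown; a real one would also map the uppercase forms:

```python
# One translation table does the upshifting and the accent removal
# in a single pass over the string.
table = str.maketrans(
    'abcdefghijklmnopqrstuvwxyzçéèêëäàâïîôöú',
    'ABCDEFGHIJKLMNOPQRSTUVWXYZCEEEEAAAIIOOU')

print('résidence générale'.translate(table))  # RESIDENCE GENERALE
```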

> str = str.replace('  ', ' ')
> str = str.replace('   ', ' ')
> str = str.replace('', ' ')

The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())

 >>> strg = '  ALLO  BUSSIERE\tCA  VA? '
 >>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
 >>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
 >>>

[snip]
> if normalisation2 == "O":
> str = str.replace('MONSIEUR', 'M')
> str = str.replace('MR', 'M')

You need to be very careful with this approach. You are changing EVERY 
occurrence of "MR" in the string, not just where it is a whole "word" 
meaning "Monsieur".
Constructed example of what can go wrong:
 >>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
 >>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'
 >>>

A real, non-constructed history lesson: A certain database indicated 
duplicate records by having the annotation "DUP" in the surname field 
e.g. "SMITH DUP". Fortunately it was detected in testing that the 
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to 
become RAT!

Two points here: (1) Split up your strings into "words" or "tokens". 
Using strg.split() is a start but you may need something more 
sophisticated e.g. "-" as an additional token separator. (2) Instead of 
writing out all those lines of code, consider putting those 
substitutions in a dictionary:

title_substitution = {
 'MONSIEUR': 'M',
 'MR': 'M',
 'MADAME': 'MME',
 # etc
 }
Next level of improvement is to read that stuff from a file.
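Putting the two points together, a sketch of token-wise substitution with such a dictionary, so that 'MR' is only replaced when it is a whole word:

```python
title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    'MADEMOISELLE': 'MLLE',
}

def normalise(line):
    # Substitute whole tokens only; extend the tokenizer for '-' etc.
    return ' '.join(title_substitution.get(w, w) for w in line.split())

assert normalise('MR IMRE NAGY, 123 PRIMROSE STREET') == \
       'M IMRE NAGY, 123 PRIMROSE STREET'
```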
[snip]
> 
> if normalisation4 == "O":
> str = str.replace(';\"', ' ')
> str = str.replace('\"', ' ')
> str = str.replace('\'', ' ')
> str = str.replace('-', ' ')
> str = str.replace(',', ' ')
> str = str.replace('\\', ' ')
> str = str.replace('\/', ' ')
> str = str.replace('&', ' ')
[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that 
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with 
'SMITH ET WESSON' :-)
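For the punctuation pass, a dict-based translate table (Python 3 spelling) maps each character to a space in one call, and simply leaves '&' out of the table:

```python
# Everything in the string below becomes a space; '&' is deliberately
# not included, so 'SMITH & WESSON' survives untouched.
punct_table = str.maketrans({c: ' ' for c in ';"\'\\/,-'})

assert 'SMITH-WESSON, 12/3'.translate(punct_table) == 'SMITH WESSON  12 3'
assert 'SMITH & WESSON'.translate(punct_table) == 'SMITH & WESSON'
```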
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (é and è)

2006-03-23 Thread Larry Bates
Seems to work fine for me.

>>> x="éÇ"
>>> x=x.replace('é','E')
>>> x
'E\xc7'
>>> x=x.replace('Ç','C')
>>> x
'E\xc7'
>>> x=x.replace('Ç','C')
>>> x
'EC'

You should also be able to use .upper() method to
uppercase everything in the string in a single statement:

tstr=ligneA.upper()

Note: you should never use 'str' as a variable as
it will mask the built-in str function.

-Larry Bates

bussiere bussiere wrote:
> hi, I'm making a program for formatting strings.
> Now, I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> 
> at the beginning of my script, but
> 
>  str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
> 
> 
> it doesn't work: it gives me " and , instead of replacing é by E
> 
> 
> if someone has an idea, that would be great
> 
> regards
> Bussiere
> PS: I've added the whole script below:
> 
> 
> 
> 
> 
> 
> __
> 
> 
> 
> 
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> import fileinput, glob, string, sys, os, re
> 
> fichA=raw_input("Entrez le nom du fichier d'entree : ")
> print ("\n")
> fichC=raw_input("Entrez le nom du fichier de sortie : ")
> print ("\n")
> normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue->
> AV) (O/N) ou A pour tout normaliser \n")
> normalisation1 = normalisation1.upper()
> 
> if normalisation1 != "A":
> print ("\n")
> normalisation2 = raw_input("Normaliser les civilités (ex :
> Docteur-> DR) (O/N) \n")
> normalisation2 = normalisation2.upper()
> print ("\n")
> normalisation3 = raw_input("Normaliser les Adresses 2 (ex :
> Place-> PL) (O/N) \n")
> normalisation3 = normalisation3.upper()
> 
> 
> normalisation4 = raw_input("Normaliser les caracteres / et - (ex :
> / ->   ) (O/N) \n" )
> normalisation4 = normalisation4.upper()
> 
> if normalisation1 == "A":
> normalisation1 = "O"
> normalisation2 = "O"
> normalisation3 = "O"
> normalisation4 = "O"
> 
> 
> fiA=open(fichA,"r")
> fiC=open(fichC,"w")
> 
> 
> compteur = 0
> 
> while 1:
> 
> ligneA=fiA.readline()
> 
> 
> 
> if ligneA == "":
> 
> break
> 
> if ligneA != "":
> str = ligneA
> str = str.replace('a', 'A')
> str = str.replace('b', 'B')
> str = str.replace('c', 'C')
> str = str.replace('d', 'D')
> str = str.replace('e', 'E')
> str = str.replace('f', 'F')
> str = str.replace('g', 'G')
> str = str.replace('h', 'H')
> str = str.replace('i', 'I')
> str = str.replace('j', 'J')
> str = str.replace('k', 'K')
> str = str.replace('l', 'L')
> str = str.replace('m', 'M')
> str = str.replace('n', 'N')
> str = str.replace('o', 'O')
> str = str.replace('p', 'P')
> str = str.replace('q', 'Q')
> str = str.replace('r', 'R')
> str = str.replace('s', 'S')
> str = str.replace('t', 'T')
> str = str.replace('u', 'U')
> str = str.replace('v', 'V')
> str = str.replace('w', 'W')
> str = str.replace('x', 'X')
> str = str.replace('y', 'Y')
> str = str.replace('z', 'Z')
> 
> str = str.replace('ç', 'C')
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
> str = str.replace('Ê', 'E')
> str = str.replace('ë', 'E')
> str = str.replace('Ë', 'E')
> str = str.replace('ä', 'A')
> str = str.replace('Ä', 'A')
> str = str.replace('à', 'A')
> str = str.replace('À', 'A')
> str = str.replace('Á', 'A')
> str = str.replace('Â', 'A')
> str = str.replace('Ä', 'A')
> str = str.replace('Ã', 'A')
> str = str.replace('â', 'A')
> str = str.replace('Ä', 'A')
> str = str.replace('ï', 'I')
> str = str.replace('Ï', 'I')
> str = str.replace('î', 'I')
> str = str.replace('Î', 'I')
> str = str.replace('ô', 'O')
> str = str.replace('Ô', 'O')
> str = str.replace('ö', 'O')
> str = str.replace('Ö', 'O')
> str = str.replace('Ú','U')
> str = str.replace('  ', ' ')
> str = str.replace('   ', ' ')
> str = str.replace('', ' ')
> 
> 
> 
> if normalisation1 == "O":
> str = str.replace('AVENUE', 'AV')
> str = str.replace('BOULEVARD', 'BD')
> str = str.replace('FAUBOURG', 'FBG')
> str = str.replace('GENERAL', 'GAL')
> str = str.replace('COMMANDANT', 'CMDT')
> str = str.replace('MARECHAL', 'MAL')
> str = str.replace('PRESIDENT', 'PRDT')
> str = str.replace('SAINT', 'ST')
> 

Re: encoding problems (é and è)

2006-03-23 Thread Christoph Zwerschke
bussiere bussiere wrote:
> hi, I'm making a program for formatting strings,
> I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> 
> at the beginning of my script, but
> 
>  str = str.replace('Ç', 'C')
> ...
> it doesn't work: it gives me " and , instead of replacing é by E

Are your sure your script and your input file *is* actually encoded with 
utf-8? If it does not work as expected, it is probably latin-1, just 
like your posting. Try changing the coding to latin-1. Does it work now?
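The mismatch is easy to see at the byte level (Python 3 spelling): the é literal in a utf-8 script and the é in latin-1 data are different byte sequences, so replace() never matches:

```python
utf8_e = '\u00e9'.encode('utf-8')      # b'\xc3\xa9'
latin1_e = '\u00e9'.encode('latin-1')  # b'\xe9'

assert utf8_e != latin1_e
# A latin-1 line never contains the utf-8 pattern, so this is a no-op:
assert latin1_e.replace(utf8_e, b'E') == latin1_e
```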

-- Christoph
-- 
http://mail.python.org/mailman/listinfo/python-list


encoding problems (é and è)

2006-03-23 Thread bussiere bussiere
hi, I'm making a program for formatting strings.
Now, I've added:
#!/usr/bin/python
# -*- coding: utf-8 -*-

at the beginning of my script, but

 str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
str = str.replace('È', 'E')
str = str.replace('ê', 'E')


it doesn't work: it gives me " and , instead of replacing é by E


if someone has an idea, that would be great

regards
Bussiere
PS: I've added the whole script below:






__




#!/usr/bin/python
# -*- coding: utf-8 -*-
import fileinput, glob, string, sys, os, re

fichA=raw_input("Entrez le nom du fichier d'entree : ")
print ("\n")
fichC=raw_input("Entrez le nom du fichier de sortie : ")
print ("\n")
normalisation1 = raw_input("Normaliser les adresses 1 (ex : Avenue->
AV) (O/N) ou A pour tout normaliser \n")
normalisation1 = normalisation1.upper()

if normalisation1 != "A":
print ("\n")
normalisation2 = raw_input("Normaliser les civilités (ex :
Docteur-> DR) (O/N) \n")
normalisation2 = normalisation2.upper()
print ("\n")
normalisation3 = raw_input("Normaliser les Adresses 2 (ex :
Place-> PL) (O/N) \n")
normalisation3 = normalisation3.upper()


normalisation4 = raw_input("Normaliser les caracteres / et - (ex :
/ ->   ) (O/N) \n" )
normalisation4 = normalisation4.upper()

if normalisation1 == "A":
normalisation1 = "O"
normalisation2 = "O"
normalisation3 = "O"
normalisation4 = "O"


fiA=open(fichA,"r")
fiC=open(fichC,"w")


compteur = 0

while 1:

ligneA=fiA.readline()



if ligneA == "":

break

if ligneA != "":
str = ligneA
str = str.replace('a', 'A')
str = str.replace('b', 'B')
str = str.replace('c', 'C')
str = str.replace('d', 'D')
str = str.replace('e', 'E')
str = str.replace('f', 'F')
str = str.replace('g', 'G')
str = str.replace('h', 'H')
str = str.replace('i', 'I')
str = str.replace('j', 'J')
str = str.replace('k', 'K')
str = str.replace('l', 'L')
str = str.replace('m', 'M')
str = str.replace('n', 'N')
str = str.replace('o', 'O')
str = str.replace('p', 'P')
str = str.replace('q', 'Q')
str = str.replace('r', 'R')
str = str.replace('s', 'S')
str = str.replace('t', 'T')
str = str.replace('u', 'U')
str = str.replace('v', 'V')
str = str.replace('w', 'W')
str = str.replace('x', 'X')
str = str.replace('y', 'Y')
str = str.replace('z', 'Z')

str = str.replace('ç', 'C')
str = str.replace('Ç', 'C')
str = str.replace('é', 'E')
str = str.replace('É', 'E')
str = str.replace('è', 'E')
str = str.replace('È', 'E')
str = str.replace('ê', 'E')
str = str.replace('Ê', 'E')
str = str.replace('ë', 'E')
str = str.replace('Ë', 'E')
str = str.replace('ä', 'A')
str = str.replace('Ä', 'A')
str = str.replace('à', 'A')
str = str.replace('À', 'A')
str = str.replace('Á', 'A')
str = str.replace('Â', 'A')
str = str.replace('Ä', 'A')
str = str.replace('Ã', 'A')
str = str.replace('â', 'A')
str = str.replace('Ä', 'A')
str = str.replace('ï', 'I')
str = str.replace('Ï', 'I')
str = str.replace('î', 'I')
str = str.replace('Î', 'I')
str = str.replace('ô', 'O')
str = str.replace('Ô', 'O')
str = str.replace('ö', 'O')
str = str.replace('Ö', 'O')
str = str.replace('Ú','U')
str = str.replace('  ', ' ')
str = str.replace('   ', ' ')
str = str.replace('', ' ')



if normalisation1 == "O":
str = str.replace('AVENUE', 'AV')
str = str.replace('BOULEVARD', 'BD')
str = str.replace('FAUBOURG', 'FBG')
str = str.replace('GENERAL', 'GAL')
str = str.replace('COMMANDANT', 'CMDT')
str = str.replace('MARECHAL', 'MAL')
str = str.replace('PRESIDENT', 'PRDT')
str = str.replace('SAINT', 'ST')
str = str.replace('SAINTE', 'STE')
str = str.replace('LOTISSEMENT', 'LOT')
str = str.replace('RESIDENCE', 'RES')
str = str.replace('IMMEUBLE', 'IMM')
str = str.replace('IMEUBLE', 'IMM')
str = str.replace('BATIMENT', 'BAT')

if normalisation2 == "O":
str = str.replace('MONSIEUR', 'M')
str = str.replace('MR', 'M')
str = str.replace('MADAME', 'MME')
str = str.replace('MADEMOISELLE', 'MLLE')
str = str.replace('DOCTEUR', 'DR')
str = str.replace('PROFESSEUR', 'PR')
str = str.replace('MONSEIGNEUR', 'MGR')
str = str.replace('M ME','MME')


if normalisation

Encoding problems with gettext and wxPython: how to do things in "good style"

2006-03-01 Thread André
I'm trying to change an app so that it uses gettext for translations
rather than the idiosyncratic way I am using.  I've tried the example
on the wxPython wiki
http://wiki.wxpython.org/index.cgi/RecipesI18n
but found that the accented letters would not display properly.  I have
found a workaround that works from Python in a Nutshell; however it is
said in that book that
   "...this is not good style".

I would like to do things "in good style" :-)

Here are some further details:
1. all the .po files are encoded in utf-8
2. my local sitecustomization uses iso-8859-1   (yes, I could easily
change it on *my* computer, but I want the solution to work for anyone
else, without asking them to change their local default encoding).
3. I am programming under Windows XP.

The workaround I use is to write the following at the beginning of the
script:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
del sys.setdefaultencoding

I tried various other ways to change the encoding in the example given,
but nothing else worked.

I can live with the "bad style" workaround if nothing else...

André



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems with pymssql / win

2006-02-11 Thread morris carre
> (to email use "boris at batiment71 dot ch")

oops, that's "boris at batiment71 dot net"
-- 
http://mail.python.org/mailman/listinfo/python-list


encoding problems with pymssql / win

2006-02-11 Thread morris carre

I have a strange problem: some code that fetches queries from an MSSQL 
database works fine under IDLE, but the very same code run from a shell 
window gets its strings garbled, as if the encoding codepage had been 
changed. This occurs only when connecting via pymssql; if I connect 
through ODBC (using the odbc module included in pywin) the problem 
vanishes.

I briefly thought sys.getdefaultencoding() would pin it down but the 
value is "ascii" in all cases.

If anyone has a lead as to what may be happening here or how to solve it 
(apart from using odbc, that is), advice is welcome.

I am using python 2.4.2 and pymssql 0.7.3 under win2k sp4 on the client 
machine, while the server machine has windows server 2003 + sqlserver2k, 
locale is fr_CH. For an example of garbling, I get '\x8a' where I should 
get '\xe8'
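For what it's worth, that particular pair of bytes matches a cp850 (Windows OEM console) vs latin-1 mix-up, which would fit a codepage change between IDLE and the shell window; a quick check (Python 3 spelling):

```python
# è is 0xe8 in latin-1 but 0x8a in the cp850 OEM codepage.
assert '\u00e8'.encode('latin-1') == b'\xe8'
assert '\u00e8'.encode('cp850') == b'\x8a'
```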

TIA, mc
--
(to email use "boris at batiment71 dot ch")
-- 
http://mail.python.org/mailman/listinfo/python-list