Re: hard_decoding

2005-02-15 Thread Skip Montanaro

Andrew> for another variation see that "Unicode Hammer" at
Andrew>   http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

Andrew> It doesn't do the registry hooks that Skip does, and I see I
Andrew> need to learn more about the functions in the codecs module.

Note that latscii.py didn't implement the registry hooks until a user
pointed them out to me a couple weeks ago.  Other than that I really haven't
learned anything about them either.  ;-)
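
For anyone else who hasn't looked at the registry hooks: below is a minimal sketch of registering a codec search function with the codecs module, written for present-day Python 3. The codec name "asciifold" and the helper names are made up for illustration, and this is not the code in latscii.py.

```python
import codecs
import unicodedata

def _fold_encode(text, errors="strict"):
    # Decompose, drop combining marks, then hand the rest to the ascii codec.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.encode("ascii", errors), len(text)

def _fold_decode(data, errors="strict"):
    # Decoding is plain ASCII; the accents cannot be recovered.
    raw = bytes(data)
    return raw.decode("ascii", errors), len(raw)

def _search(name):
    # The registry calls this with the normalized (lower-cased) codec name.
    if name == "asciifold":
        return codecs.CodecInfo(_fold_encode, _fold_decode, name="asciifold")
    return None

codecs.register(_search)

print("László".encode("asciifold"))
```

Once the search function is registered, the codec is reachable through the ordinary encode()/decode() calls, which is presumably the convenience the hooks are there for.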

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hard_decoding

2005-02-15 Thread Andrew Dalke
Coming in a few days late to this one ...

Skip
> See if my latscii codec works for you:
> 
> http://www.musi-cal.com/~skip/python/latscii.py

for another variation see that "Unicode Hammer" at
  http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

It doesn't do the registry hooks that Skip does, and I see
I need to learn more about the functions in the codecs module.

Andrew
[EMAIL PROTECTED]



Re: hard_decoding

2005-02-10 Thread Skip Montanaro

Tamas> Do you have a convenient, easy way to remove special characters
Tamas> from u'strings'?

Tamas> Replacing:
Tamas> ÀÁÂÃÄÅ   => A
... etc ...

See if my latscii codec works for you:

http://www.musi-cal.com/~skip/python/latscii.py

Skip


Re: hard_decoding

2005-02-10 Thread John Lenton
On Wed, Feb 09, 2005 at 05:22:12PM -0700, Tamas Hegedus wrote:
> Hi!
> 
> Do you have a convenient, easy way to remove special characters from 
> u'strings'?
> 
> Replacing:
> ÀÁÂÃÄÅ=> A
> èéêë  => e
> etc.
> 'L0xe1szl0xf3' => Laszlo
> or something like that:
> 'L\xc3\xa1szl\xc3\xb3' => Laszlo

for the examples you have given, this works:

from unicodedata import normalize

def strip_composition(unichar):
    """
    Return the first character from the canonical decomposition of
    a unicode character. This will typically be the unaccented
    version of the character passed in (in Latin regions, at
    least).
    """
    return normalize('NFD', unichar)[0]

def remove_special_chars(anystr):
    """
    strip_composition of the whole string
    """
    # unicode() decodes byte strings with the default encoding, so the
    # plain 'str' samples below only work if that encoding is UTF-8
    return ''.join(map(strip_composition, unicode(anystr)))

for i in ('ÀÁÂÃÄÅ', 'èéêë',
          u'L\xe1szl\xf3',
          'L\xc3\xa1szl\xc3\xb3'):
    print i, '->', remove_special_chars(i)

produces:

ÀÁÂÃÄÅ -> AAAAAA
èéêë -> eeee
László -> Laszlo
László -> Laszlo

although building a translation mapping is, in general, faster. You
could use the above to build that map automatically, like this:

def build_translation(sample, table=None):
    """
    Return a translation table that strips composition characters
    out of a sample unicode string. If a table is supplied, it
    will be updated.
    """
    assert isinstance(sample, unicode), 'sample must be unicode'
    if table is None:
        table = {}
    # the table is keyed by code point, so compare ordinals with ordinals;
    # only characters not yet in the table are processed
    for i in set(map(ord, sample)) - set(table):
        table[i] = ord(strip_composition(unichr(i)))
    return table
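
The speed claim is easy to check for yourself. The following is a small Python 3 timing sketch, not a benchmark: the sample text and repeat counts are arbitrary, the helper names are illustrative, and the numbers will vary by machine, so none are quoted here.

```python
import timeit
import unicodedata

def strip_composition(ch):
    return unicodedata.normalize("NFD", ch)[0]

def per_char(text):
    # Normalize every character on every call.
    return "".join(map(strip_composition, text))

text = "László Àéó " * 10000

# One-off table covering the alphabet of the sample, keyed by code point.
table = {ord(ch): ord(strip_composition(ch)) for ch in set(text)}

# Both approaches must agree before the timings mean anything.
assert per_char(text) == text.translate(table)

print("per character:", timeit.timeit(lambda: per_char(text), number=5))
print("translate:    ", timeit.timeit(lambda: text.translate(table), number=5))
```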

this is much faster on larger strings, or if you have many strings,
but know the alphabet (so you compute the table once). You might also
try to build the table incrementally,

table = {}
stripped = []
for i in strings:
    i = i.translate(table)
    try:
        i.encode('ascii')
    except UnicodeEncodeError:
        # something non-ascii got through: extend the table and retry
        table = build_translation(i, table)
        i = i.translate(table)
    stripped.append(i)

of course this won't work if you have other, non-ascii but
non-composite, chars in your strings.
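
To make that caveat concrete: characters such as ø, ß, æ and ł have no canonical decomposition, so NFD returns them unchanged. One possible workaround, sketched here in Python 3, is a small hand-written fallback table applied first. The FALLBACK mapping and the to_ascii name are hypothetical, and the replacements chosen are a matter of taste.

```python
import unicodedata

# Characters with no canonical decomposition come back unchanged from NFD.
for ch in "øßæł":
    assert unicodedata.normalize("NFD", ch) == ch

# Hypothetical hand-written fallbacks; the choices are illustrative only.
FALLBACK = {
    ord("ø"): "o", ord("Ø"): "O",
    ord("ł"): "l", ord("Ł"): "L",
    ord("ß"): "ss", ord("æ"): "ae",
}

def to_ascii(text):
    # Apply the explicit fallbacks first, then strip combining marks.
    decomposed = unicodedata.normalize("NFD", text.translate(FALLBACK))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(to_ascii("Łódź, Øresund, Straße"))
```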

-- 
John Lenton ([EMAIL PROTECTED]) -- Random fortune:
El que está en la aceña, muele; que el otro va y viene. 



Re: hard_decoding

2005-02-10 Thread Peter Maas
Tamas Hegedus wrote:
> Do you have a convenient, easy way to remove special characters from 
> u'strings'?
> 
> Replacing:
> ÀÁÂÃÄÅ => A
> èéêë   => e
> etc.
> 'L0xe1szl0xf3' => Laszlo
> or something like that:
> 'L\xc3\xa1szl\xc3\xb3' => Laszlo
>>> ord(u'ë')
235
>>> ord(u'e')
101
>>> cmap = {235:101}
>>> u'hello'.translate(cmap)
u'hello'
>>> u'hëllo'.translate(cmap)
u'hello'
The inconvenient part is generating cmap. I suggest you write a
helper class genmap for this:
>>> g = genmap()
>>> g.add(u'ÀÁÂÃÄÅ', u'A')
>>> g.add(u'àáâãäå', u'a')
>>> g.add(u'èéêë', u'e')
>>> g.add(u'òóôõö', u'o')
>>> u'László'.translate(g.cmap())
u'Laszlo'
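
One way such a helper might look, sketched in Python 3. The class name GenMap and its add() and cmap() methods simply follow the session above; the internals are an assumption, not an existing library.

```python
class GenMap:
    """Hypothetical helper that collects a table for str.translate()."""

    def __init__(self):
        self._table = {}

    def add(self, chars, replacement):
        # Map every character in chars to the same replacement string.
        for ch in chars:
            self._table[ord(ch)] = replacement

    def cmap(self):
        # Return a copy so callers cannot modify the internal table.
        return dict(self._table)

g = GenMap()
g.add("ÀÁÂÃÄÅ", "A")
g.add("àáâãäå", "a")
g.add("èéêë", "e")
g.add("òóôõö", "o")
print("László".translate(g.cmap()))
```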
--
---
Peter Maas,  M+R Infosysteme,  D-52070 Aachen,  Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
---


hard_decoding

2005-02-10 Thread Tamas Hegedus
Hi!
Do you have a convenient, easy way to remove special characters from 
u'strings'?

Replacing:
ÀÁÂÃÄÅ => A
èéêë   => e
etc.
'L0xe1szl0xf3' => Laszlo
or something like that:
'L\xc3\xa1szl\xc3\xb3' => Laszlo
Thanks,
Tamas
--
Tamas Hegedus, Research Fellow | phone: (1) 480-301-6041
Mayo Clinic Scottsdale | fax:   (1) 480-301-7017
13000 E. Shea Blvd | mailto:[EMAIL PROTECTED]
Scottsdale, AZ, 85259  | http://hegedus.brumart.org