Re: How to find number of characters in a unicode string?

2006-09-29 Thread Lawrence D'Oliveiro
In message <[EMAIL PROTECTED]>, Marc 'BlackJack'
Rintsch wrote:

> In <[EMAIL PROTECTED]>,
> Preben Randhol wrote:
> 
>> Is there a way to calculate in characters
>> and not in bytes to represent the characters.
> 
> Decode the byte string and use `len()` on the unicode string.

Hmmm, for some reason

len(u"C\u0327")

returns 2.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to find number of characters in a unicode string?

2006-09-29 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, Lawrence D'Oliveiro wrote:

> In message <[EMAIL PROTECTED]>, Marc 'BlackJack'
> Rintsch wrote:
> 
>> In <[EMAIL PROTECTED]>,
>> Preben Randhol wrote:
>> 
>>> Is there a way to calculate in characters
>>> and not in bytes to represent the characters.
>> 
>> Decode the byte string and use `len()` on the unicode string.
> 
> Hmmm, for some reason
> 
> len(u"C\u0327")
> 
> returns 2.

Okay, decode and normalize and then use `len()` on the unicode string.

Ciao,
Marc 'BlackJack' Rintsch



Re: How to find number of characters in a unicode string?

2006-09-29 Thread Gabriel Genellina
At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:

> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
>Hmmm, for some reason
>
> len(u"C\u0327")
>
>returns 2.

That's correct: these are two Unicode code points, 
C and combining cedilla, which display as Ç. From 
:

"Unicode takes the role of providing a unique 
code point — a number, not a glyph — for each 
character. In other words, Unicode represents a 
character in an abstract way, and leaves the 
visual rendering (size, shape, font or style) to 
other software [...] This simple aim becomes 
complicated, however, by concessions made by 
Unicode's designers, in the hope of encouraging a 
more rapid adoption of Unicode. [...] A lot of 
essentially identical characters were encoded 
multiple times at different code points to 
preserve distinctions used by legacy encodings 
and therefore allow conversion from those 
encodings to Unicode (and back) without losing 
any information. [...] Also, while Unicode allows 
for combining characters, it also contains 
precomposed versions of most letter/diacritic 
combinations in normal use. These make conversion 
to and from legacy encodings simpler and allow 
applications to use Unicode as an internal text 
format without having to implement combining 
characters. For example é can be represented in 
Unicode as U+0065 (Latin small letter e) followed 
by U+0301 (combining acute) but it can also be 
represented as the precomposed character U+00E9 
(Latin small letter e with acute)."
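
The two representations of é described in the quoted passage can be checked with a short sketch. It assumes Python 3, where every `str` is Unicode (the thread itself uses Python 2 `u"..."` literals), and uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00e9"   # U+00E9 (Latin small letter e with acute)

# The two strings render the same but are different code point sequences.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# List the official name of every code point in each form.
for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, [unicodedata.name(ch) for ch in text])

# NFC composes, NFD decomposes; either one makes the two forms comparable.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
print(same_after_nfc, same_after_nfd)
```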

Gabriel Genellina
Softlab SRL 







Re: How to find number of characters in a unicode string?

2006-09-29 Thread Leif K-Brooks
Lawrence D'Oliveiro wrote:
> Hmmm, for some reason
> 
> len(u"C\u0327")
> 
> returns 2.

Is len(unicodedata.normalize('NFC', u"C\u0327")) what you want?
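
A minimal sketch of that suggestion, assuming Python 3 string literals instead of the Python 2 `u"..."` form used in the thread. The helper name `nfc_len` is made up for illustration.

```python
import unicodedata

def nfc_len(text):
    """Length in code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))

s = "C\u0327"      # C followed by a combining cedilla
print(len(s))      # counts code points before normalization
print(nfc_len(s))  # counts code points after composing to U+00C7
```

This only shortens the string when Unicode defines a precomposed code point for the combination; sequences without one keep their combining marks after NFC.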


Re: How to find number of characters in a unicode string?

2006-10-10 Thread Leo Kislov

Lawrence D'Oliveiro wrote:
> In message <[EMAIL PROTECTED]>, Marc 'BlackJack'
> Rintsch wrote:
>
> > In <[EMAIL PROTECTED]>,
> > Preben Randhol wrote:
> >
> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
> Hmmm, for some reason
>
> len(u"C\u0327")
>
> returns 2.

If Python ever provides this functionality, I guess it would be
u"C\u0327".width() == 1. But it's not clear when unicode.org will
provide recommended fixed-width font character width information for *all*
characters. I recently stumbled upon the Tamil language, where for example
u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
look like they have widths of 1, 2, 3 and 4 columns. To add insult to injury,
these 4 symbols are all considered *single* letter symbols :) If your
email reader is able to show them, here they are in all their glory:
க், கா, கொ, கௌ.
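
A rough sketch of why these four strings are awkward to count, assuming Python 3. The helper `count_base_characters` is hypothetical: it skips code points whose general category is a combining mark (Mn, Mc, Me), which approximates "symbols a reader sees" but is not full grapheme cluster segmentation and says nothing about column width.

```python
import unicodedata

def count_base_characters(text):
    """Count code points that are not combining marks (hypothetical helper)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

samples = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc"]
for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    # Raw length, length after NFC, and the mark-skipping count.
    print(len(s), len(nfc), count_base_characters(s))
```

Normalization does not help here because these consonant-plus-sign combinations have no single precomposed code point, while the mark-skipping count treats each one as a single unit.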


Re: How to find number of characters in a unicode string?

2006-10-10 Thread Theerasak Photha
On 10 Oct 2006 22:50:21 -0700, Leo Kislov <[EMAIL PROTECTED]> wrote:

> If Python ever provides this functionality, I guess it would be
> u"C\u0327".width() == 1. But it's not clear when unicode.org will
> provide recommended fixed-width font character width information for *all*
> characters. I recently stumbled upon the Tamil language, where for example
> u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
> look like they have widths of 1, 2, 3 and 4 columns. To add insult to injury,
> these 4 symbols are all considered *single* letter symbols :) If your
> email reader is able to show them, here they are in all their glory:
> க், கா, கொ, கௌ.

Letters? Not as such. They are, however, single syllabic units; Tamil,
like other Indic scripts, is an alphasyllabary.

I believe the syllables or sounds thus encoded are k (with nothing
after), kaa, ko, and kau.

Seamonkey is being a jerk and not rendering the glyphs properly... :?

-- Theerasak

Re: How to find number of characters in a unicode string?

2006-09-18 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>,
Preben Randhol wrote:

> If I use len() on a string containing unicode letters I get the number
> of bytes the string uses. This means that len() can report size 6 when
> the unicode string only contains 3 characters (that one would write by
> hand or see on the screen). Is there a way to calculate in characters
> and not in bytes to represent the characters.

Yes and you already seem to know the answer:  Decode the byte string and
use `len()` on the unicode string.

Ciao,
Marc 'BlackJack' Rintsch


Re: How to find number of characters in a unicode string?

2006-09-18 Thread faulkner
are you sure you're using unicode objects?
len(u'\uffff') == 1
the encodings module should help you turn '\xff\xff' into u'\uffff'.

Preben Randhol wrote:
> Hi
>
> If I use len() on a string containing unicode letters I get the number
> of bytes the string uses. This means that len() can report size 6 when
> the unicode string only contains 3 characters (that one would write by
> hand or see on the screen). Is there a way to calculate in characters
> and not in bytes to represent the characters.
>
> The reason for asking is that PyGTK needs number of characters to set
> the width of Entry widgets to a certain length, and it expects viewable
> characters and not number of bytes to represent them.
> 
> 
> Thanks in advance
> 
> 
> Preben



Re: How to find number of characters in a unicode string?

2006-09-18 Thread Preben Randhol
On Mon, 18 Sep 2006 22:29:20 +0200
Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:

> Yes and you already seem to know the answer:  Decode the byte string
> and use `len()` on the unicode string.

.decode("utf-8") did the trick. Thanks!
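
For readers following along, the difference looks roughly like this. The sketch assumes Python 3, where byte strings and text are separate types, and the three-letter sample string is made-up data, not taken from the thread.

```python
# Three non-ASCII letters (ae, o-slash, a-ring), each two bytes in UTF-8.
raw = "\u00e6\u00f8\u00e5".encode("utf-8")

# Decoding turns the byte string into a Unicode string.
text = raw.decode("utf-8")

print(len(raw))   # number of bytes
print(len(text))  # number of code points
```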

Preben