[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2021-07-24 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

Which comes out 'Tr̥Tīyā'.  The ring below, '̥', is U+0325 (COMBINING RING BELOW).
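
For anyone who wants to check which code points are involved, a small sketch follows. The word is spelled out with escapes, and it assumes the macrons are combining marks (U+0304); the original report may well have used precomposed letters instead.

```python
import unicodedata

# Assumed spelling of the reported word: r + COMBINING RING BELOW, plus
# combining macrons. The original report may have used precomposed letters.
word = "Tr\u0325ti\u0304ya\u0304"

# List every code point with its general category and name.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# What str.title() makes of it on the running interpreter.
print(word.title())
```

The ring below has category Mn and is not a cased character, which is why the letter after it is treated as the start of a new word.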

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2021-07-24 Thread Vishvas Vasuki

Vishvas Vasuki  added the comment:

This case still fails with 3.9 - 

'Tr̥tīyā'.title()

--
nosy: +vishvas.vasuki




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2020-10-26 Thread STINNER Victor


Change by STINNER Victor :


--
nosy:  -vstinner




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2020-10-26 Thread Irit Katriel


Irit Katriel  added the comment:

You're right, I see that too when I don't tamper with the test.

--
components: +Library (Lib)




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2020-10-25 Thread Guido van Rossum

Guido van Rossum  added the comment:

Are you sure? Running Ezio's titletest.py, I get this output (note that the UCD 
major version is in the double digits so the test for that misfires :-).

titletest.py: Please set your PYTHONIOENCODING envariable to utf8
WARNING: Your old UCD is out of date, expected 6.0.0 but got 13.0.0
titlecase of  'déme un café'  should be  'Déme Un Café'  not  'DéMe Un Café'
titlecase of  'i̇stanbul'  should be  'İstanbul'  not  'İStanbul'
titlecase of  'ᾲ στο διάολο'  should be  'Ὰͅ Στο Διάολο'  not  'ᾺΙ Στο ΔιάΟλο'
failed 3 out of 6 tests

Note that the test program specifically uses combining marks, which are 
alternate ways to spell some characters. It seems what's failing is the second 
deme un cafe, the first istanbul, and the (only) greek phrase.
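
To make the "alternate ways to spell some characters" point concrete, here is a minimal sketch using the standard unicodedata module. The variable names are illustrative only and are not taken from titletest.py.

```python
import unicodedata

composed = "d\u00e9me un caf\u00e9"                  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # e followed by U+0301

# Same text to a reader, different code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Compare what str.title() does with each spelling.
print(ascii(composed.title()))
print(ascii(decomposed.title()))
```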

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2020-10-25 Thread Irit Katriel

Irit Katriel  added the comment:

Of the examples given two seem ok now, but the Istanbul one is still wrong:

>>> "déme un café".title()
'Déme Un Café'
>>> "ᾲ στο διάολο".title()
'Ὰͅ Στο Διάολο'
>>>

>>> "i̇stanbul".title()
'İStanbul'

--
nosy: +iritkatriel
versions: +Python 3.10, Python 3.9 -Python 3.2, Python 3.3




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-18 Thread Florent Xicluna

Changes by Florent Xicluna :


--
nosy: +flox




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

>> As for terminology: I think the documentation should continue to
>> speak about "words" and "letters", and then define what is meant
>> in this context. It's not that the Unicode consortium invented
>> the term "letter", so we should use it more liberally than just
>> referring to the L* categories.
> 
> I really don't think it wise to have private definitions of these.
> 
> If Letter doesn't mean L?, things get too weird.  That's why 
> there are separate definitions of alphabetic, word, etc.

But I won't be using the word "Letter", but "letter" (lower case).
Nobody will assume that this refers to the Unicode standard;
people would rather expect that this is [A-Za-z] (i.e. not expect
non-ASCII characters to be considered at all). So elaboration is
necessary, anyway. I take the risk of confusing the 10 people that
ever read UTS#18 :-)

--
title: str.title() is overzealous by upcasing combining marks inappropriately 
-> str.title() is overzealous by upcasing combining marks inappropriately




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Tom Christiansen

Tom Christiansen  added the comment:

Martin v. Löwis  wrote
   on Sat, 01 Oct 2011 10:59:48 -: 

>>  * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

> Where did you get that definition from? UTS#18 defines
> "", which is Alphabetic + U+200C + U+200D
> (i.e. not including marks, but including those

From UTS#18 RL1.2A in Annex C, where a \p{word} or \w character 
is defined to be 

 \p{alpha}
 \p{gc=Mark}
 \p{digit}
 \p{gc=Connector_Punctuation}

>> I think you are looking for here are Word characters without 
>> Nd + Pc, so just Alphabetic + Mn+Mc+Me.  
>> 
>> Is that right?
> 
> With your definition of "Word character" above, yes, that's right.

It's not mine.  It's tr18's.

> Marks won't start a word, though.

That's the smarter boundary thing they talk about.  

I'm not myself familiar with \pM

> As for terminology: I think the documentation should continue to
> speak about "words" and "letters", and then define what is meant
> in this context. It's not that the Unicode consortium invented
> the term "letter", so we should use it more liberally than just
> referring to the L* categories.

I really don't think it wise to have private definitions of these.

If Letter doesn't mean L?, things get too weird.  That's why 
there are separate definitions of alphabetic, word, etc.

--tom

--
title: str.title() is overzealous by upcasing combining marks   inappropriately 
-> str.title() is overzealous by upcasing combining marks inappropriately




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

>  * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

Where did you get that definition from? UTS#18 defines
"", which is Alphabetic + U+200C + U+200D
(i.e. not including marks, but including those

> I think you are looking for here are Word characters without 
> Nd + Pc, so just Alphabetic + Mn+Mc+Me.  
> 
> Is that right?

With your definition of "Word character" above, yes, that's right.
Marks won't start a word, though.

As for terminology: I think the documentation should continue to
speak about "words" and "letters", and then define what is meant
in this context. It's not that the Unicode consortium invented
the term "letter", so we should use it more liberally than just
referring to the L* categories.

--
title: str.title() is overzealous by upcasing combining marks inappropriately 
-> str.title() is overzealous by upcasing combining marks inappropriately




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Guido van Rossum

Guido van Rossum  added the comment:

I like how we're actually converging on an implementable and
maximally-useful algorithm.

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Tom Christiansen

Tom Christiansen  added the comment:

> Martin v. Löwis  added the comment:

> "Split S into words. Change the first letter in a word to upper-case,

Except that I think you actually mean that the first "letter" is 
changed into titlecase not uppercase.  

One might also say *try* to change for all these, in that not
all cased code points in Unicode have casemaps that are different
from themselves.  For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:

% (echo xyz; echo ab AB | unisupers) | uc
XYZ
ᵃᵇ ᴬᴮ
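
The same observation can be reproduced in Python without the external tools. This is only a sketch: U+1D43 and U+1D47 are the modifier letters small a and small b, and neither has an uppercase mapping.

```python
import unicodedata

supers = "\u1d43\u1d47"  # MODIFIER LETTER SMALL A, MODIFIER LETTER SMALL B

for ch in supers:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Ordinary letters change; the superscript letters map to themselves.
print("xyz".upper())
print(supers.upper() == supers)
```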

> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."

I don't like the way you have defined letters and letter-related
characters.  The first already has a definition, which is not the
one you are using.  Word characters also has a definition in Unicode,
and it is not the one you are using.  I strongly advise against
redefining standard Unicode properties.  Choose other, unused terms 
if you must.  It is very confusing otherwise.

> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property.  It is a mistake to equate
Letter=Alphabetic, and very confusing too.

I agree that this is probably what you want, though.  I just don't think you
should use "letter-related characters" when there is an existing formal
definition that works, or that you should redefine Letter.

> "letter-related" characters are letters + marks (Mn, Mc, Me).

That isn't quite right.  

 * Letters are Lu+Ll+Lt+Lm+Lo.

 * Alphabetic is Letters + Other_Alphabetic.

 * Other_Alphabetic is certain marks (like the iota subscript) and the
   letter numbers (Nl), as well as a few symbols.

 * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

I think what you are looking for here are Word characters without 
Nd + Pc, so just Alphabetic + Mn+Mc+Me.  

Is that right?
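
A rough sketch of how these groups look from Python follows. Note the assumption: the standard unicodedata module exposes the general category but not the Alphabetic or Other_Alphabetic properties, so this only sorts characters by general category and does not reproduce the derived properties listed above. The function name describe is made up for illustration.

```python
import unicodedata

LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo"}   # Unicode "Letter" (L*)
MARK = {"Mn", "Mc", "Me"}                 # Unicode "Mark" (M*)

def describe(ch):
    """Sort one character into the general-category groups named above."""
    cat = unicodedata.category(ch)
    if cat in LETTER:
        return "letter"
    if cat in MARK:
        return "mark"
    if cat == "Nl":
        return "letter number"
    if cat == "Nd":
        return "decimal digit"
    if cat == "Pc":
        return "connector punctuation"
    return "other"

for ch in ["a", "\u0345", "\u0301", "\u2160", "3", "_", " "]:
    print(f"U+{ord(ch):04X} {describe(ch)}")
```

U+0345 comes out as a plain mark here even though it is also Alphabetic; telling those apart needs the derived property data, which this sketch does not consult.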

--tom

PS: You can do union/intersection stuff with properties to see what
the resulting sets look like using the unichars command-line tool.

This is everything that is both alphabetic and also a mark:

% unichars -gs '\p{Alphabetic}' '\pM'
 ○ͅ  U+0345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI
 ○ְ  U+05B0 GC=Mn SC=Hebrew       HEBREW POINT SHEVA
 ○ֱ  U+05B1 GC=Mn SC=Hebrew       HEBREW POINT HATAF SEGOL
 ○ֲ  U+05B2 GC=Mn SC=Hebrew       HEBREW POINT HATAF PATAH
 ○ֳ  U+05B3 GC=Mn SC=Hebrew       HEBREW POINT HATAF QAMATS
...
 ○ं  U+0902 GC=Mn SC=Devanagari   DEVANAGARI SIGN ANUSVARA
 ः  U+0903 GC=Mc SC=Devanagari   DEVANAGARI SIGN VISARGA
 ा  U+093E GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN AA
 ि  U+093F GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN I
 ी  U+0940 GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN II
 ○ु  U+0941 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN U
 ○ू  U+0942 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN UU
 ○ृ  U+0943 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC R
 ○ॄ  U+0944 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC RR
...

While these are the NON-alphabetic marks, which are still Word
characters though of course:

% unichars -gs '\P{Alphabetic}' '\pM'
 ○̀  U+0300 GC=Mn SC=Inherited    COMBINING GRAVE ACCENT
 ○́  U+0301 GC=Mn SC=Inherited    COMBINING ACUTE ACCENT
 ○̂  U+0302 GC=Mn SC=Inherited    COMBINING CIRCUMFLEX ACCENT
 ○̃  U+0303 GC=Mn SC=Inherited    COMBINING TILDE
 ○̄  U+0304 GC=Mn SC=Inherited    COMBINING MACRON
 ○̅  U+0305 GC=Mn SC=Inherited    COMBINING OVERLINE
 ○̆  U+0306 GC=Mn SC=Inherited    COMBINING BREVE
 ○̇  U+0307 GC=Mn SC=Inherited    COMBINING DOT ABOVE
 ○̈  U+0308 GC=Mn SC=Inherited    COMBINING DIAERESIS
 ○̉  U+0309 GC=Mn SC=Inherited    COMBINING HOOK ABOVE
 ○̊  U+030A GC=Mn SC=Inherited    COMBINING RING ABOVE
 ○̋  U+030B GC=Mn SC=Inherited    COMBINING DOUBLE ACUTE ACCENT
 ○̌  U+030C GC=Mn SC=Inherited    COMBINING CARON
...

And here are the Cased code points that do not change when 
upper-, title-, or lowercased:

% unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
 ª  U+00AA GC=Ll SC=Latin   FEMININE ORDINAL INDICATOR
 º  U+00BA GC=Ll SC=Latin   MASCULINE ORDINAL INDICATOR
 ĸ  U+0138 GC=Ll SC=Latin   LATIN SMALL LETTER KRA
 ƍ  U+018D GC=Ll SC=Latin   LATIN SMALL LETTER TURNED DELTA
 ƛ  U+019B GC=Ll SC=Latin   LATIN SMALL LETTER LAMBDA WITH STROKE
 ƪ  U+01AA GC=Ll SC=Latin   LATIN LETTER REVERSED ESH LOOP
 ƫ  U+01AB GC=Ll SC=Latin   LATIN SMALL LETTER T WITH PALATAL HOOK
 ƺ  U+01BA GC=Ll SC=Latin   LATIN SMALL LETTER EZH WITH TAIL
 ƾ  U+01BE GC=Ll SC=Latin   LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
‭ ȡ  U+0221 GC=Ll SC=

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

> Martin, do you think that str.title() should follow the Unicode standard?

I don't think that "follow the Unicode standard" has any meaning in this
context: the Unicode standard doesn't specify (AFAIK) what a .title()
method in a programming language should do.

> Should string methods work with all the normalizations or just with NFC?

When we know what .title() should do, it should do so correctly for all
strings. I try to propose a definition for .title()

"Split S into words. Change the first letter in a word to upper-case,
and all subsequent letters to lower case. A word is a sequence that
starts with a letter, followed by letter-related characters."

Letters are all characters from the "Alphabetic" category, i.e.
Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

"letter-related" characters are letters + marks (Mn, Mc, Me).
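
The proposed definition can be sketched in a few lines of Python. This is a hedged approximation, not the C implementation: the names proposed_title, is_letter and is_letter_related are made up here, and because unicodedata does not expose Other_Alphabetic, "letter" is approximated as str.isalpha() (the L* categories) plus Nl.

```python
import unicodedata

MARKS = {"Mn", "Mc", "Me"}

def is_letter(ch):
    # Approximation of "Alphabetic": L* plus Nl. Other_Alphabetic is NOT
    # covered, because the stdlib does not expose that property.
    return ch.isalpha() or unicodedata.category(ch) == "Nl"

def is_letter_related(ch):
    # Letters plus marks (Mn, Mc, Me), as in the proposal.
    return is_letter(ch) or unicodedata.category(ch) in MARKS

def proposed_title(s):
    out = []
    in_word = False
    for ch in s:
        if in_word and is_letter_related(ch):
            out.append(ch.lower())   # later letters and marks in the word
        elif is_letter(ch):
            out.append(ch.upper())   # first letter of a new word
            in_word = True
        else:
            out.append(ch)           # anything else ends the word
            in_word = False
    return "".join(out)

print(ascii(proposed_title("de\u0301me un cafe\u0301")))
print(proposed_title("foo3bar"))
```

Under these assumptions a combining mark no longer ends the word, while a digit still does, so "foo3bar" keeps its existing result.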

--
title: str.title() is overzealous by upcasing combining marks inappropriately 
-> str.title() is overzealous by upcasing combining marks inappropriately




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-29 Thread Ezio Melotti

Ezio Melotti  added the comment:

After PEP 393 the result is still the same (I attached a slightly improved 
version of the script):

titlecase of  'deme un cafe'  should be  'Deme Un Cafe'  not  'DeMe Un Cafe'
titlecase of  'istanbul'  should be  'Istanbul'  not  'IStanbul'
titlecase of  'α στο διαολο'  should be  'Α Στο Διαολο'  not  'ΑΙ Στο ΔιαΟλο'
failed 3 out of 6 tests

Martin, do you think that str.title() should follow the Unicode standard?
Should string methods work with all the normalizations or just with NFC?

--
Added file: http://bugs.python.org/file23271/titletest.py




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-18 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

Tom: it's intentional that .title() doesn't use traditional word break 
algorithms. In 2.x, "foo3bar".title() is "Foo3Bar", i.e. the 3 counts as a word 
end. So neither UTS#18 \w nor UAX#29 apply. So in UTS#18 terminology, .title() 
matches more closely \alpha+, despite UTS#18 saying that this shouldn't be used 
for word-breaking.

It's not clear to me how UTS#18 defines \alpha. On the one hand, they say that 
marks should be included, OTOH they refer to the Alphabetic derived category 
which doesn't include marks, except for the few that have been included in 
Other_Alphabetic.
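
A quick way to see the word-boundary behaviour described above is to run it. The first string is the one from this message; the second is an extra illustration of the same rule, using an apostrophe instead of a digit.

```python
# A digit is not a cased character, so the letter after it starts a new word.
print("foo3bar".title())

# The same rule applies to an apostrophe inside a contraction.
print("they're here".title())
```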

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-17 Thread Ezio Melotti

Ezio Melotti  added the comment:

I think string methods (and other parts of the stdlib) assume NFC and leave 
normalization to NFC up to the user.  Before fixing str.title() we should take 
a more general decision about handling strings that use other normalization 
forms.

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-26 Thread Tom Christiansen

Tom Christiansen  added the comment:

Guido van Rossum  wrote
   on Fri, 26 Aug 2011 21:16:57 -: 

> Yeah, this should be fixed in 3.3 and probably backported to 3.2
> and 2.7.  (There is already no guarantee that len(s) ==
> len(s.title()), right?)

Well, *I* don't know of any such guarantee, 
but I don't know Python very well.

In general, Unicode makes very few guarantees about casing.  Under full
casemapping, which is the only way to do the silly Turkish stuff amongst
quite a bit else, any of the three casemappings can change the length of
the string.

Other things you can't rely on are round tripping and "single paths".  By
roundtripping, just look at the two lowercase sigmas and think about how
you can't get back to one of them if you uppercase them both.  By single
paths, I mean that code that does some sort of conversion where it first
lowercases everything and then titlecases the first letter can produce
something different from titlecasing just the original first letter and
then lowercasing the rest of them.  That's because tc(x) and tc(lc(x)) can
be different.
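
These three points can be checked from Python 3, whose str methods use the full case mappings. The particular characters below are chosen for illustration and do not come from the message above.

```python
# 1. Length is not preserved: LATIN SMALL LETTER SHARP S uppercases to "SS".
sharp_s = "\u00df"
print(len(sharp_s), len(sharp_s.upper()))

# 2. No round trip: final sigma uppercases to the same capital as the
#    ordinary small sigma, so lowercasing cannot restore it.
final_sigma = "\u03c2"
print(final_sigma.upper().lower() == final_sigma)

# 3. tc(x) versus tc(lc(x)) for LATIN CAPITAL LETTER I WITH DOT ABOVE.
dotted_i = "\u0130"
print(ascii(dotted_i.title()), ascii(dotted_i.lower().title()))
```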

--tom

--
title: str.title()  is overzealous by upcasing combining marks inappropriately 
-> str.title() is overzealous by upcasing combining marks inappropriately




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-26 Thread Guido van Rossum

Guido van Rossum  added the comment:

Yeah, this should be fixed in 3.3 and probably backported to 3.2 and 2.7.  
(There is already no guarantee that len(s) == len(s.title()), right?)

--
nosy: +gvanrossum




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

So the issue here is that while using combining chars, str.title() fails to 
titlecase the string properly.

The algorithm implemented by str.title() [0] is quite simple: it loops through 
the code units, and uppercases all the chars that follow a char that is not 
lower/upper/titlecased.
This means that if Déme doesn't use combining accents, the char before the 'm' 
is 'é', 'é' is a lowercase char, so 'm' is not capitalized.
If the 'é' is represented as 'e' + '´', the char before the 'm' is '´', '´' is 
not a lower/upper/titlecase char, so the 'm' is capitalized.
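
The loop described above can be modelled in pure Python. This is a sketch of the description only, not a transcription of the C code in unicodedataobject or unicodeobject.c, and the name naive_title is made up for illustration.

```python
def naive_title(s):
    """Uppercase every char that follows a char that is not cased."""
    out = []
    prev_cased = False
    for ch in s:
        out.append(ch.lower() if prev_cased else ch.upper())
        # "Cased" here means lower-, upper- or titlecase, as in the
        # description above. A combining accent is none of these.
        prev_cased = ch.islower() or ch.isupper() or ch.istitle()
    return "".join(out)

print(ascii(naive_title("d\u00e9me un caf\u00e9")))        # precomposed é
print(ascii(naive_title("de\u0301me un cafe\u0301")))      # e + U+0301
```

With the precomposed spelling the 'm' follows a lowercase letter and stays lowercase; with the decomposed spelling it follows U+0301 and gets capitalized.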

I guess we could normalize the string before doing the title casing, and then 
normalize it back.
Also the str methods don't claim to follow Unicode afaik, so unless we decide 
that they should, we could implement whatever algorithm we want.
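
The normalize-then-restore idea might look like the sketch below. The helper name title_via_nfc is hypothetical, and the approach only helps where a precomposed character exists: a sequence such as r + U+0325 has no precomposed form, so NFC leaves the combining mark in place.

```python
import unicodedata

def title_via_nfc(s, restore_form="NFD"):
    """Compose, titlecase, then convert back to the caller's form."""
    composed = unicodedata.normalize("NFC", s)
    return unicodedata.normalize(restore_form, composed.title())

decomposed = "de\u0301me un cafe\u0301"
print(ascii(title_via_nfc(decomposed)))

# No precomposed form exists for r + COMBINING RING BELOW, so this
# input reaches str.title() unchanged.
print(ascii(title_via_nfc("tr\u0325ti\u0304ya\u0304")))
```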

[0]: Objects/unicodeobject.c:6752

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-15 Thread STINNER Victor

STINNER Victor  added the comment:

See also #12746.

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-13 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +haypo, loewis
stage:  -> needs patch
versions: +Python 3.3




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-12 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

I changed the title because 'string' is a module that once contained the 
functions that are now attached to the str class as methods. So 'string.title' 
is an obsolete attribute reference.

--
nosy: +terry.reedy
title: string.title()  is overzealous by upcasing combining marks 
inappropriately -> str.title()  is overzealous by upcasing combining marks 
inappropriately
