[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2016-03-10 Thread Benjamin Peterson

Benjamin Peterson added the comment:

The full case mappings do not preserve normalization form.

>>> import unicodedata
>>> for c in 'ΰ'.upper().lower(): print(unicodedata.name(c))
... 
GREEK SMALL LETTER UPSILON
COMBINING DIAERESIS
COMBINING ACUTE ACCENT
>>> unicodedata.normalize('NFC', 'ΰ'.upper().lower()) == 'ΰ'
True

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2016-03-10 Thread Guido van Rossum

Changes by Guido van Rossum :


--
nosy:  -gvanrossum




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2016-03-10 Thread SilentGhost

Changes by SilentGhost :


--
versions: +Python 3.4, Python 3.5, Python 3.6 -Python 2.7




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2016-03-10 Thread Андрей Баксаляр

Андрей Баксаляр added the comment:

Interestingly, the bug is still reproducible in version 3.5.1, but fixed in 
2.7.6.

--
versions: +Python 2.7 -Python 3.4
Added file: http://bugs.python.org/file42121/pythonbug.png




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2016-03-10 Thread Андрей Баксаляр

Андрей Баксаляр added the comment:

The same problem with the Unicode case mapping is still present in Python 
3.4.3. You can reproduce the bug with this code, for instance:

'ΰ'.upper().lower() == 'ΰ'

Strangely, the case round trip replaces the character:

b'\xce\xb0' → b'\xcf\x85\xcc\x88\xcc\x81'
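
A minimal sketch for reproducing this, assuming Python 3.3 or later (where
the full case mappings apply). The variable names are illustrative only and
are not taken from the report:

```python
import unicodedata

# U+03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
original = "\u03b0"

# Full uppercasing expands to three code points; lowercasing each of
# them gives a decomposed sequence rather than the precomposed U+03B0.
round_tripped = original.upper().lower()

print(original.encode("utf-8"))
print(round_tripped.encode("utf-8"))
print([unicodedata.name(c) for c in round_tripped])

# The two strings differ code point by code point but are canonically
# equivalent, so normalizing to NFC makes them compare equal again.
same_raw = round_tripped == original
same_nfc = unicodedata.normalize("NFC", round_tripped) == original
print(same_raw, same_nfc)
```

Code that needs a stable round trip can normalize both sides before
comparing, instead of relying on the case methods to preserve the
original form.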

--
nosy: +Андрей Баксаляр
versions: +Python 3.4 -Python 3.3




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-15 Thread Jim Jewett

Jim Jewett jimjjew...@gmail.com added the comment:

Why was the delta-processing removed from the casing functions?

As best I can tell, the whole point of going through multiple levels of 
indirection (courtesy splitbins) is to maximize compression and minimize the 
amount of cache that unicode might occupy.

By using deltas, only one record is needed for each combination of (upper - 
lower, upper - title), which is generally only one or two combinations per 
script.  

Without deltas, nearly every cased letter needs its own record, and the index 
tables also get bigger. (It seems to be about 2.6 times as large, but cache 
effects may be worse, since letters from the same script will no longer be in 
the same record or the same index chain.)

If it is a concern about not enough room for flags, then the decimal/digit 
chars could be combined.  They are always the same, unless the number isn't 
decimal (in which case the flag is enough).
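
The space argument can be illustrated with a rough sketch. This is not
CPython's actual table layout; it only counts how many distinct
(upper, lower) records the one-to-one mappings visible through str.upper()
and str.lower() would need when stored as absolute code points versus as
deltas from the original code point. The function name count_case_records
is made up for this illustration:

```python
import sys

def count_case_records():
    """Count distinct case records with absolute values vs. deltas."""
    absolute, deltas = set(), set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        upper, lower = ch.upper(), ch.lower()
        # Only look at one-to-one mappings; multi-character (full)
        # mappings need separate handling in any scheme.
        if len(upper) != 1 or len(lower) != 1:
            continue
        if upper == ch and lower == ch:
            continue  # uncased: no record needed for this comparison
        absolute.add((ord(upper), ord(lower)))
        deltas.add((ord(upper) - cp, ord(lower) - cp))
    return len(absolute), len(deltas)

n_absolute, n_deltas = count_case_records()
print("absolute records:", n_absolute)
print("delta records:   ", n_deltas)
```

The exact numbers depend on the Unicode version bundled with the
interpreter, but the delta form should need far fewer distinct records,
since every letter in a script with a fixed upper/lower offset shares one.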

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12736
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-15 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 03ea95e3b497 by Benjamin Peterson in branch 'default':
delta encoding of upper/lower/title makes a glorious return (#12736)
http://hg.python.org/cpython/rev/03ea95e3b497

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-12 Thread Jim Jewett

Jim Jewett jimjjew...@gmail.com added the comment:

The currently applied patch ( http://hg.python.org/cpython/rev/f7e05d205a52 ) 
left some dead code in unicodeobject.c

function fixup ( 
http://hg.python.org/cpython/file/f7e05d205a52/Objects/unicodeobject.c#l9386 ) 
has a shortcut for when the fixer doesn't make any actual changes.  The removed 
fixers (like fixupper ) returned 0 rather than maxchar to indicate that.  The 
only remaining fixer, fix_decimal_and_space_to_ascii (line 8839), does not.  (I 
think fix_decimal_and_space_to_ascii *should* add a touched flag, but until it 
does, the shortcut dedup code is dead.)

Also, around line 10502, there is an #if 0 section with code that relied on one 
of the removed fixers; is it time to remove that section?

--
nosy: +Jim.Jewett




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-11 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

New patch with title casing mappings added.

--
Added file: http://bugs.python.org/file24204/full-casemapping.patch




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-11 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset f7e05d205a52 by Benjamin Peterson in branch 'default':
use full unicode mappings for upper/lower/title case (#12736)
http://hg.python.org/cpython/rev/f7e05d205a52

--
nosy: +python-dev




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-11 Thread Benjamin Peterson

Changes by Benjamin Peterson benja...@python.org:


--
resolution:  -> fixed
status: open -> closed




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-10 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

__ap__'s implementation method is about 2x faster than mine.

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-10 Thread Benjamin Peterson

Changes by Benjamin Peterson benja...@python.org:


Added file: http://bugs.python.org/file24199/full-casemapping.patch




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-09 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

New patch. I implemented it the way Antoine desired. It seems rather 
inefficient to be copying around so much data...

--
Added file: http://bugs.python.org/file24190/full-casemapping.patch




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2012-01-07 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

Here is a patch. I only dealt with case mappings and not titlecase. Doing 
titlecase properly requires word segmentation, which I think should be another 
patch/issue. This patch fixes swapcase(), capitalize(), upper(), and lower(). 
It does not include the changes to Objects/unicodetype_db.h because those are 
huge. Regenerate the database if you want to test it. Please review.
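
As a rough illustration of what full case mappings change for these
methods, the sketch below prints a few one-to-many uppercase mappings. It
assumes an interpreter that already includes this change (Python 3.3 or
later); the sample list and the helper name show_upper are only for
demonstration and do not come from the patch:

```python
def show_upper(samples):
    """Return (text, uppercased text, length before, length after)."""
    return [(s, s.upper(), len(s), len(s.upper())) for s in samples]

samples = [
    "stra\u00dfe",   # contains LATIN SMALL LETTER SHARP S
    "\ufb01sh",      # starts with LATIN SMALL LIGATURE FI
    "\u0149",        # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
]

rows = show_upper(samples)
for text, upper, before, after in rows:
    # With full mappings the result can be longer than the input.
    print(ascii(text), "->", ascii(upper), before, after)
```

The practical consequence is that callers can no longer assume
len(s.upper()) == len(s).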

--
keywords: +patch
nosy: +benjamin.peterson
Added file: http://bugs.python.org/file24171/full-casemapping.patch




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Jean-Michel Fauth

Jean-Michel Fauth wxjmfa...@gmail.com added the comment:

Œ, œ or even  are historically ligatures or ligatured forms.
In the French typography, they are single plain letters and
they belong to the group of the 42 letters used in the French
typography.
Typographically speaking, using oe instead of œ is considered
as a mistake, while not using the ligatured forms for the groups
of letters like ff, ffi, ffl, fj, et, st is acceptable.

Microsoft with cp1252, Apple with mac-roman, Adobe and all
foundries and now Unicode are working correctly.

It should be noted, when TeX moved from the ascii to iso-8859-1
(more precisely CorkEncoding) as default encoding, they saw
the problem and introduced the \oe or \OE commands.

From my understanding and my point of view on the subject, ISO has
somehow recognized its mistake by introducing iso-8859-15.
Unfortunately, it was too late.

To the subject: Œdipe: correct, Oedipe, OEdipe: incorrect.

Without being an expert in that field, all the information
one can find on Wikipedia (French) regarding questions about
typography is generally correct.

--
nosy: +Jean-Michel.Fauth




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> Œ, œ or even  are historically ligatures or ligatured forms.
> In the French typography, they are single plain letters and
> they belong the group of the 42 letters used in the French
> typography.
> Typographically speaking, using oe instead of œ is considered
> as a mistake,

It's not only typographically speaking, it's really a spelling error,
even in hand-written text :-)

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote
   on Mon, 29 Aug 2011 13:21:06 -: 

> It's not only typographically speaking, it's really a spelling error,
> even in hand-written text :-)

Sure, and so too is omitting an accent mark or diaeresis.  But—alas!—you’ll
never convince most monoglot anglophones of that, the ones who keep wanting to
strip them from résumé, façade, châteaux, crème brûlée, fête, tête-à-tête, 
à la française, or naïveté, not to mention José, jalapeño, the erstwhile
American Secretary of State Federico Peña, or nearby Cañon City, Colorado, 
where I have family.  I think œnology has survived solely on its rarity, 
and the Encyclopædia Britannica is that way because the ligat(ur)ed letter
is in their actual trademark.

Cell phone users sending text messages have long suffered the grievous
injuries to their language(s) that naked ASCII imparts, but this is
nothing like the crossdressing nightmare called Greeklish, also variously
known as Grenglish, Latinoellinika/Λατινοελληνικά, or ASCII Greek.

http://en.wikipedia.org/wiki/Greeklish

[...] The reason for this is the fact that text written in Greeklish
is considerably less aesthetically pleasing, and also much harder to
read, compared to text written in the Greek alphabet. A non-Greek
speaker/reader can guess this by this example: "δις ιζ χαρντ του
ριντ" would be the way to write "this is hard to read" in English
but utilizing the Greek alphabet.

I especially enjoy George Baloglou’s Byzantine Grenglish, wherein:

    Ὀδυσσεύς    = Oducceus    instead of Odysseus
    Ἀχιλλεύς    = Axilleus    instead of Achilleus
    Σίσυφος     = Sicuphos    instead of Sisyphus
    Περικλῆς    = 5epiklhs    instead of Pericles
    Χθονός      = X8onos      instead of Chthonos
    Οι Ατρείδες = Oi Atpeides instead of the Atreïdes

Terrible though the depredations upon the French language that may
have been committed by ASCII, surely these go even further. :)

--tom

Η Ιλιάδα                                     H Iliada

Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλῆος          Mhnin aeide, 8ea, 5hlhiadeo Axilhos
οὐλομένην, ἣ μυρί’ Ἀχαιοῖς ἄλγε’ ἔθηκε,      oulomenhn, 'h mupi’ Axaiois alge’ e8hke,
πολλὰς δ’ ἰφθίμους ψυχὰς Ἄϊδι προῒαψεν       nollas d’ iph8imous yuxas Aidi npoiayen
ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν       'hpoon, autous de elopia teuxe kuneccin
οἰωνοῖσί τε πᾶσι· Διὸς δ’ ἐτελείετο βουλή·   oionoici te naci· Dios d’ eteleieto boulh·
ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε         eks o'u dh ta npota diacththn epicante
Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.   Atpeidhs te anaks andpon kai dios Axilleus.

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Guido van Rossum

Guido van Rossum gu...@python.org added the comment:

Thanks Tom for such a clear explanation! I hope someone will implement
this. (Matthew, does this affect regex? I am guessing it does, for
case-insensitive matching?)

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

The regex module currently uses simple case-folding, although I'm working 
towards full case-folding, as listed in 
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote on Sat, 27 Aug 2011 20:04:56 
-: 

> > Neither am I.  Even in old-style English with ae and oe, one wrote
> > ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
> > *Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.
>
> Trying to disprove you a bit:
> http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg
>
> but classical typographies seem to write either the uppercase Œ or the
> lowercase œ.

That's what I meant: one only ever sees œufs or ŒUFS, never OEUFS.
French doesn't fit into ISO 8859-1.  That's one of the changes to
ISO-8859-15 compared with ISO-8859-1 (and Unicode):

iso-8859-1   A4  ⇔  U+00A4  < ¤ >  \N{CURRENCY SIGN}
iso-8859-15  A4  ⇒  U+20AC  < € >  \N{EURO SIGN}

iso-8859-1   A6  ⇔  U+00A6  < ¦ >  \N{BROKEN BAR}
iso-8859-15  A6  ⇒  U+0160  < Š >  \N{LATIN CAPITAL LETTER S WITH CARON}

iso-8859-1   A8  ⇔  U+00A8  < ¨ >  \N{DIAERESIS}
iso-8859-15  A8  ⇒  U+0161  < š >  \N{LATIN SMALL LETTER S WITH CARON}

iso-8859-1   B4  ⇔  U+00B4  < ´ >  \N{ACUTE ACCENT}
iso-8859-15  B4  ⇒  U+017D  < Ž >  \N{LATIN CAPITAL LETTER Z WITH CARON}

iso-8859-1   B8  ⇔  U+00B8  < ¸ >  \N{CEDILLA}
iso-8859-15  B8  ⇒  U+017E  < ž >  \N{LATIN SMALL LETTER Z WITH CARON}

iso-8859-1   BC  ⇔  U+00BC  < ¼ >  \N{VULGAR FRACTION ONE QUARTER}
iso-8859-15  BC  ⇒  U+0152  < Œ >  \N{LATIN CAPITAL LIGATURE OE}

iso-8859-1   BD  ⇔  U+00BD  < ½ >  \N{VULGAR FRACTION ONE HALF}
iso-8859-15  BD  ⇒  U+0153  < œ >  \N{LATIN SMALL LIGATURE OE}

iso-8859-1   BE  ⇔  U+00BE  < ¾ >  \N{VULGAR FRACTION THREE QUARTERS}
iso-8859-15  BE  ⇒  U+0178  < Ÿ >  \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}
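
The table above can be checked against the codecs that ship with Python.
This is a small sketch using the standard "iso-8859-1" and "iso-8859-15"
codec names; the helper decode_both and the constant DIFFERING_BYTES are
names invented for the example:

```python
# The eight byte values listed in the table above.
DIFFERING_BYTES = [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]

def decode_both(byte_value):
    """Decode one byte under ISO 8859-1 and under ISO 8859-15."""
    raw = bytes([byte_value])
    return raw.decode("iso-8859-1"), raw.decode("iso-8859-15")

table = {b: decode_both(b) for b in DIFFERING_BYTES}
for b, (latin1, latin9) in table.items():
    # ascii() keeps the output printable on any terminal encoding.
    print(f"{b:02X}  iso-8859-1={ascii(latin1)}  iso-8859-15={ascii(latin9)}")
```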

> That said, I wonder why Unicode even includes ligatures like ff. Sounds
> like mission creep to me (and horrible annoyances for people like us).

I'm pretty sure that typographic ligatures are there for roundtripping
with legacy encodings.  I believe that œ/Œ is the only code point
with ligature in its name that you're supposed to still use, and
that all others should be figured out by modern fonting software.

--tom

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:11:24 -: 

> Would this also affect .islower() and friends?

SHORT VERSION:  (7 lines)

I don't believe so, but the relationship between lower() and islower()
is not as clear to me as I would have thought, and more importantly,
the code and the documentation for Python's islower() etc currently seem
to disagree.  For future releases, I recommend fixing the code, but if
compatibility is an issue, then perhaps for previous releases still in
maintenance mode fixing only the documentation would possibly be good
enough--your call.

===

MEDIUM VERSION: (87 lines)

I was initially confused with Python's islower() family because of the way
they are defined to operate on full strings.  They don't check that
everything is lowercase even though they say they do.

   
http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range

str.lower()

Return a copy of the string with all the cased characters [4]
converted to lowercase.

str.islower()

Return true if all cased characters [4] in the string are lowercase 
and there is at least one cased character, false otherwise.

[4] (1, 2, 3, 4) Cased characters are those with general category
property being one of “Lu” (Letter, uppercase), “Ll” (Letter,
lowercase), or “Lt” (Letter, titlecase).

This is strange in several ways.  Of lesser importance is that
strings can be considered lowercase even if they don't match

^\p{lowercase}+$

Another is that the result of calling str.lower() may not be .islower().
I'm not sure what these are particularly for, since I myself would just use
a regex to get finer-grained control.  (I suppose that's because re doesn't
give access to the Unicode properties needed that this approach never
gained any traction in the Python community.)

However, the worst of this is that the documentation defines both cased
characters and lowercase characters *differently* from how Unicode
defines those very same terms.  This was quite confusing.

Unicode distinguishes Cased code points from Cased_*Letter* code points.
Python is using the Cased_Letter property but calling it Cased.  Cased is 
a proper superset of Cased_Letter.  From the DerivedCoreProperties file in
the Unicode Character Database:

# Derived Property:   Cased (Cased)
#  As defined by Unicode Standard Definition D120
#  C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter.

In the same way, the Lowercase and Uppercase properties are not the same as
the Lowercase_*Letter* and Uppercase_*Letter* properties.  Rather, the former
are respectively proper supersets of the latter.  

# Derived Property: Lowercase
#  Generated from: Ll + Other_Lowercase

[...]

# Derived Property: Uppercase
#  Generated from: Lu + Other_Uppercase

In all these, you almost always want the superset versions not the
restricted subset versions you are using.  If it were in the regex engine,
the user could select either.
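
To make the subset/superset distinction concrete, here is a small sketch
that looks at three code points which the Unicode Character Database lists
under Other_Lowercase, so they have the Lowercase property without being in
general category Ll. The choice of samples and the helper name describe are
assumptions for illustration. An interpreter whose islower() follows the
derived Lowercase property would be expected to report True for each; one
that consults only the general category would report False:

```python
import unicodedata

# Lowercase by derived property (Other_Lowercase), but not category Ll.
samples = [
    "\u02b0",  # MODIFIER LETTER SMALL H
    "\u2170",  # SMALL ROMAN NUMERAL ONE
    "\u24d0",  # CIRCLED LATIN SMALL LETTER A
]

def describe(ch):
    """Return (name, general category, result of str.islower())."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.islower())

report = [describe(ch) for ch in samples]
for name, category, lower in report:
    print(f"{name}: category={category} islower()={lower}")
```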

Java used to miss all these, too.  But in 1.7, they updated their character
methods to use the properties that they'd all along said they were using:

   
http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)

public static boolean isLowerCase(char ch)
Determines if the specified character is a lowercase character. 

 A character is lowercase if its general category type, provided by
 Character.getType(ch), is LOWERCASE_LETTER, or it has contributory
-   property Other_Lowercase as defined by the Unicode Standard.

Note: This method cannot handle supplementary characters.  To
  support all Unicode characters, including supplementary
  characters, use the isLowerCase(int) method.

(And yes, that's where Java uses "character" to mean code unit 
 not code point, alas.  No wonder people get confused)

I'm pretty sure that Python needs to either update its documentation to
match its code, update its code to match its documentation, or both.  Java
chose to update the code to match the documentation, and this is the course
I would recommend if at all possible.  If you say you are checking for
cased code points, then you should use the Unicode definition of cased code
points not your own, and if you say you are checking for lowercase code
points, then you should use the Unicode definition not your own.  Both of
these require access to contributory properties from the UCD and not 
just general categories alone.

--tom

===

LONG VERSION: (222 lines)

Essential tools I use for inspecting Unicode code points and their 
properties include


[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Guido van Rossum

Guido van Rossum gu...@python.org added the comment:

Thank you very much. We should fix the behavior in 3.3 for sure. I'm
thinking that we may be able to backport the behavior fix to 2.7 and
3.2 as well, since it just makes the behavior generally better (and
for most folks it won't matter anyway).

I'm not sure where the somewhat odd rules for .islower() come from, I
think in part from the desire to have .islower() be False but
"a b".islower() to be True. Intuitively, this means that .islower() means
both "there is at least one lower case character" and "there are no
upper case characters", but not "all characters are lowercase". I
not change. Although personally I don't have much of an intuition for
what titlecase means (and why it's important), perhaps because I'm not
familiar with any language where there is a third case for some
letters.

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Sat, 27 Aug 2011 16:15:33 -: 

> Although personally I don't have much of an intuition for what
> titlecase means (and why it's important), perhaps because I'm not
> familiar with any language where there is a third case for some
> letters.

Neither am I.  Even in old-style English with ae and oe, one wrote
ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
*Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

(BTW, in French you really shouldn't split up the œ into oe, 
  nor in Old English, Old Norse, or Icelandic the æ in ae;
  although in contemporary English, it's usually ok to do so.)

I believe that almost but not quite all the sticky situations with
Unicode casing involve compatibility characters for clean round-trips
with legacy encodings.  Exceptions include the German sharp s (both of 
them now) and the two Greek lowercase sigmas.  Thank goodness we don't
use the long s in English anymore.  What is it with s's, anyway? :)

Most of the titlecase letters are in Greek, with a few in Armenian.
I know no Armenian (their letters all look the same to me :), and the
folks I talked to about the Greek are skeptical.  The German sharp s is
a red herring, because you can never have it as the first letter
(although it needn't be the last, as in Rußland).  That's no more
possible than having the old legacy ff ligature appear at the beginning
of an English word.

In any event, there are only 129 total code points that are
problematic in terms of their case, where by problematic 
I mean one or more of:

   --- titlecase differs from uppercase
   --- foldcase  differs from lowercase
   --- any of fold/lower/title/uppercase yields more than one code point
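
The three criteria above can be sketched directly with the str methods of a
Python that has full case mappings and str.casefold() (3.3 or later). The
function name is_case_problematic is invented here, and the resulting count
depends on the Unicode version bundled with the interpreter, so it need not
match the figure of 129 quoted in this message:

```python
import sys

def is_case_problematic(ch):
    """Apply the three criteria from the list above to one code point."""
    upper, lower = ch.upper(), ch.lower()
    title, fold = ch.title(), ch.casefold()
    return (
        title != upper                      # titlecase differs from uppercase
        or fold != lower                    # foldcase differs from lowercase
        or any(len(m) > 1 for m in (upper, lower, title, fold))
    )

problematic = [
    cp for cp in range(sys.maxunicode + 1) if is_case_problematic(chr(cp))
]
print(len(problematic), "code points meet at least one criterion")
print(" ".join(f"U+{cp:04X}" for cp in problematic[:10]))
```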

Of all these, it's the (now two!) sharp s's and the Turkic i that are the
most annoying.  It's really quite a lot of trouble to go through for so few
code points of so little (perceived) use.  But I suppose you never know
what new ones they'll uncover, either.  Here are those 129
case-problematicals arranged in UCA order.  Some of these have
normalization forms that decompose into graphemes with four code points
(not shown).  There are a few other oddities, like the Kelvin sign and
other singletons, but these are most of the trouble.  They're all in the
BMP; I guess we learned our lesson. :)

--tom

  1: U+0345 ○ͅ  COMBINING  GREEK YPOGEGRAMMENI
   fc=ι  U+3B9 lc=○ͅ  U+345 tc=Ι  U+399 uc=Ι  U+399 
  2: U+1E9A ẚ  LATIN SMALL LETTER A WITH RIGHT HALF RING
   fc=aʾ  U+61.2BE lc=ẚ  U+1E9A tc=Aʾ  U+41.2BE uc=Aʾ  U+41.2BE 
  3: U+01F3 dz  LATIN SMALL LETTER DZ
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  4: U+01F2 Dz  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  5: U+01F1 DZ  LATIN CAPITAL LETTER DZ
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  6: U+01C6 dž  LATIN SMALL LETTER DZ WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  7: U+01C5 Dž  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  8: U+01C4 DŽ  LATIN CAPITAL LETTER DZ WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  9: U+FB00 ff  LATIN SMALL LIGATURE FF
   fc=ff  U+66.66 lc=ff  U+FB00 tc=Ff  U+46.66 uc=FF  U+46.46 
 10: U+FB03 ffi  LATIN SMALL LIGATURE FFI
   fc=ffi  U+66.66.69 lc=ffi  U+FB03 tc=Ffi  U+46.66.69 uc=FFI  
U+46.46.49 
 11: U+FB04 ffl  LATIN SMALL LIGATURE FFL
   fc=ffl  U+66.66.6C lc=ffl  U+FB04 tc=Ffl  U+46.66.6C uc=FFL  
U+46.46.4C 
 12: U+FB01 fi  LATIN SMALL LIGATURE FI
   fc=fi  U+66.69 lc=fi  U+FB01 tc=Fi  U+46.69 uc=FI  U+46.49 
 13: U+FB02 fl  LATIN SMALL LIGATURE FL
   fc=fl  U+66.6C lc=fl  U+FB02 tc=Fl  U+46.6C uc=FL  U+46.4C 
 14: U+1E96 ẖ  LATIN SMALL LETTER H WITH LINE BELOW
   fc=ẖ  U+68.331 lc=ẖ  U+1E96 tc=H̱  U+48.331 uc=H̱  U+48.331 
 15: U+0130 İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
   fc=i̇  U+69.307 lc=i̇  U+69.307 tc=İ  U+130 uc=İ  U+130 
 16: U+01F0 ǰ  LATIN SMALL LETTER J WITH CARON
   fc=ǰ  U+6A.30C lc=ǰ  U+1F0 tc=J̌  U+4A.30C uc=J̌  U+4A.30C 
 17: U+01C9 lj  LATIN SMALL LETTER LJ
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 18: U+01C8 Lj  LATIN CAPITAL LETTER L WITH SMALL LETTER J
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 19: U+01C7 LJ  LATIN CAPITAL LETTER LJ
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 20: U+01CC nj  LATIN SMALL LETTER NJ
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 21: U+01CB Nj  LATIN CAPITAL LETTER N WITH SMALL LETTER J
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 22: U+01CA NJ  LATIN CAPITAL LETTER NJ
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

There are some oddities in Unicode case-folding.

Under full case-folding, both \N{LATIN CAPITAL LETTER SHARP S} and \N{LATIN 
SMALL LETTER SHARP S} fold to ss, which means that those codepoints match 
each other.

However, under simple case-folding, they fold to themselves, which means that 
those codepoints _don't_ match each other.
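
For the full case-folding half of this observation, a short sketch using
str.casefold(), which was added in Python 3.3 and performs full case
folding. The helper name caseless_equal is an assumption made for the
example; the standard library has no built-in simple case-folding
counterpart to compare against here:

```python
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S

def caseless_equal(a, b):
    """Compare two strings using full case folding."""
    return a.casefold() == b.casefold()

print(ascii(capital_sharp_s.casefold()), ascii(small_sharp_s.casefold()))
print(caseless_equal(capital_sharp_s, small_sharp_s))
print(caseless_equal("STRASSE", "stra\u00dfe"))
```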

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> Neither am I.  Even in old-style English with ae and oe, one wrote
> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
> *Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

Trying to disprove you a bit:
http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg

but classical typographies seem to write either the uppercase Œ or the 
lowercase œ.

That said, I wonder why Unicode even includes ligatures like ff. Sounds like 
mission creep to me (and horrible annoyances for people like us).

--
nosy: +pitrou




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

FTR, with the latest Python 3.2/3.3 (narrow) I get:
   Total failures:   58 / 500 ( 12%)
   Total successes: 442 / 500 ( 88%)
and with the latest Python 3.2/3.3 (wide) I get:
   Total failures:   52 / 500 ( 10%)
   Total successes: 448 / 500 ( 90%)

--
Added file: http://bugs.python.org/file23055/casing-results.txt




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Guido van Rossum

Guido van Rossum gu...@python.org added the comment:

I presume this applies to builtin str methods like .lower(), right?  I think it 
is a good thing to do for Python 3.3.

We'd need to define what should happen in edge cases, e.g. when (against all 
odds) a string happens to contain a lone surrogate or some other code point or 
sequence of code points that the Unicode standard considers illegal.  I think 
it should not fail but just leave those code points alone.

Does this require us to import more data files from the Unicode standard?  By 
itself that doesn't scare me.

Would this also affect .islower() and friends?

--
nosy: +gvanrossum




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:11:24 -: 

> Guido van Rossum gu...@python.org added the comment:
>
> I presume this applies to builtin str methods like .lower(), right?  I
> think it is a good thing to do for Python 3.3.

Yes, the full casemaps are for upper, title, and lowercase.  There is 
also a full casefold and turkic case fold (which is full), but you
don't have a casefold function so I guess that doesn't matter.

> We'd need to define what should happen in edge cases, e.g. when
> (against all odds) a string happens to contain a lone surrogate or
> some other code point or sequence of code points that the Unicode
> standard considers illegal.  I think it should not fail but just leave
> those code points alone.

Well, it's a funny thing.  There are properties given for all
Unicode code points, even noncharacter code points.  This
includes the casing properties, oddly enough.

From UnicodeData.txt, which has a few surrogate entries; notice
no casing is given:

D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;

And in SpecialCasing.txt, which does not have surrogates but does have
a default clause:

# This file is a supplement to the UnicodeData file.
# It contains additional information about the casing of Unicode characters.
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and independent of context and language.
# For more information, see the discussion of Case Mappings in the Unicode Standard.
#
# All code points not listed in this file that do not have a simple case mappings
# in UnicodeData.txt map to themselves.

And in CaseFolding.txt, which also does not have surrogates but again does 
have a default clause:

# The data supports both implementations that require simple case foldings
# (where string lengths don't change), and implementations that allow full case folding
# (where string lengths may grow). Note that where they can be supported, the
# full case foldings are superior: for example, they allow MASSE and Maße to match.
#
# All code points not listed in this file map to themselves.

Taken all together, it follows that the surrogates have case{map,fold}s
back to themselves, since they have no case{map,fold}s listed.

It's ok to have arbitrary code points in memory, including surrogates and
the 66 noncharacters.  It just isn't legal to have them in a UTF stream
for open interchange, whatever that means.  
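
As a rough illustration of the "leave those code points alone" outcome, the sketch below checks that a lone surrogate and a noncharacter come back unchanged from the str case methods. It assumes Python 3.3 or later (for casefold()); the helper name maps_to_itself is hypothetical.

```python
# Assumes Python 3.3+; maps_to_itself is a hypothetical helper for illustration.
lone_surrogate = "\ud800"   # first high surrogate, GC=Cs
noncharacter = "\ufdd0"     # one of the 66 noncharacters


def maps_to_itself(ch):
    """Return True if every case operation leaves ch unchanged."""
    return all(op(ch) == ch for op in (str.upper, str.lower, str.title, str.casefold))


for ch in (lone_surrogate, noncharacter):
    # Print the code point number rather than the character itself, since a
    # lone surrogate cannot be encoded to a UTF-8 output stream.
    print(f"U+{ord(ch):04X} maps to itself: {maps_to_itself(ch)}")

# Surrounding cased letters still change; the surrogate is passed through.
mixed = "a" + lone_surrogate + "b"
print(mixed.upper() == "A" + lone_surrogate + "B")
```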

> Does this require us to import more data files from the Unicode
> standard?  By itself that doesn't scare me.

One way or the other, yes, notably the SpecialCasing file for
casemapping and the CaseFolding file for casefolding (which you
should do anyway to fix re.I).  But you can and should process the
new files into some tighter format optimized for your own lookups.
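
One possible way to do that preprocessing is sketched below. It is not how CPython does it. It assumes the SpecialCasing.txt field layout (code; lower; title; upper; optional condition list; comment), works on three sample lines embedded in the script rather than the real file, and simply skips conditional entries. The names SAMPLE, parse_special_casing and hex_to_str are hypothetical.

```python
# Sketch only: parse SpecialCasing-style lines into a dict keyed by character.
# SAMPLE stands in for the real file; all names here are hypothetical.
SAMPLE = """\
# code; lower; title; upper; (condition_list;)? # comment
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def hex_to_str(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Return {char: (lower, title, upper)} for unconditional entries."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if fields[-1] == "":                   # trailing ';' leaves an empty field
            fields.pop()
        if len(fields) != 4:                   # conditional mapping: skipped here
            continue
        code, lower, title, upper = fields
        table[hex_to_str(code)] = (
            hex_to_str(lower), hex_to_str(title), hex_to_str(upper))
    return table


table = parse_special_casing(SAMPLE)
for ch, (lower, title, upper) in table.items():
    print(f"U+{ord(ch):04X}: lower={lower!r} title={title!r} upper={upper!r}")
```

A real implementation would also need to handle the conditional entries (Final_Sigma, language-specific rules) instead of dropping them.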

Oddly, Java doesn't provide String methods that do full casing on
titlecase, even though they do so on lowercase and uppercase.  On
titlecase they only expose the simple casemaps via the Character class,
which are the ones from UnicodeData.  They recognize that this is a flaw, 
but it was too late to fix it for Java 7.

> Would this also affect .islower() and friends?

Well, it shouldn't, but .islower() and friends are already mistaken.
They seem to be checking for GC=Ll and such, but they need to be
checking the Unicode binary property Lowercase and such.  Watch:

test 37 for string Ⅷ
wanted ⅷ to be lowercase of Ⅷ but python disagrees
wanted Ⅷ to be titlecase of Ⅷ but python disagrees
wanted Ⅷ to be uppercase of Ⅷ but python disagrees
test 37 failed 3 subtests

test 39 for string Ⓚ
wanted ⓚ to be lowercase of Ⓚ but python disagrees
wanted Ⓚ to be titlecase of Ⓚ but python disagrees
wanted Ⓚ to be uppercase of Ⓚ but python disagrees
test 39 failed 3 subtests

That's because the Roman numerals are GC=Nl but still have
case and change case.  Similarly for the circled letters which
are GC=So but have case and change case.  Plus there's U+0345,
the iota subscript, which is GC=Mn but has case and changes case.
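
The mismatch between general category and casing can be checked with the standard unicodedata module. The sketch below only reports each character's general category and whether upper() or lower() changes it; it deliberately makes no claim about what islower() or isupper() return in any particular Python version. The helper name describe is hypothetical.

```python
import unicodedata

# The three kinds of cased non-letters mentioned above.
samples = [
    "\N{ROMAN NUMERAL EIGHT}",             # GC=Nl
    "\N{CIRCLED LATIN CAPITAL LETTER K}",  # GC=So
    "\N{COMBINING GREEK YPOGEGRAMMENI}",   # U+0345, GC=Mn
]


def describe(ch):
    """Return (general category, whether any case mapping changes ch)."""
    changes_case = ch.lower() != ch or ch.upper() != ch
    return unicodedata.category(ch), changes_case


for ch in samples:
    category, changes_case = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"GC={category}, changes case: {changes_case}")
```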

I don't remember whether I've sent in my full test suite or not.  
If I haven't yet, I should attach it to the bug report.

--tom

--




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Here’s my casing test suite; I thought I sent it in but the mux file here isn’t 
the full thing.

It does several things, including letting you run it with regex vs re.  It 
also checks for the islower, etc. functions. It has both simple and full (and 
turkic) maps and folds in it, but is configured to only check the simple 
versions for now.  The islower and isupper etc functions seem to be checking 
the wrong Unicode property.

Yes, it has my quaint Unixisms in it, because it needs to run with UTF-8 
output, or you can't read what's going on.

--
Added file: http://bugs.python.org/file23051/casing-tests.py




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-12 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
components: +Interpreter Core, Unicode -Library (Lib)
versions: +Python 3.3 -Python 3.2




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-12 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-12 Thread Matthew Barnett

Changes by Matthew Barnett pyt...@mrabarnett.plus.com:


--
nosy: +mrabarnett




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python's casemapping functions only use what Unicode calls simple casemaps. 
These are only appropriate for functions that operate on single characters 
alone, not for those that operate on strings. The reason for this is that you 
get much better results with full casemapping. Java, Ruby, and Perl all do full 
casemapping for their equivalent functions that do string mapping, and Python 
should, too.
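
To make the difference concrete, here is a small sketch of what a full casemap produces. It assumes Python 3.3 or later, where the str methods apply full case mappings; under simple casemaps a string can never change length, so these results are not reachable.

```python
# Assumes Python 3.3+, where str.upper() uses the full case mappings.
sharp_s = "\N{LATIN SMALL LETTER SHARP S}"
fi_ligature = "\N{LATIN SMALL LIGATURE FI}"

# A full casemap may map one code point to several.
print(sharp_s.upper())        # SS
print(fi_ligature.upper())    # FI

# So the result can be longer than the input.
print(len(sharp_s), len(sharp_s.upper()))

# Applied to a whole word:
print("straße".upper())       # STRASSE
```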

I include a program that has a bunch of mappings and foldings, both simple and 
full.  Yes, it was machine-generated.

--
components: Library (Lib)
files: mux.python
messages: 141928
nosy: tchrist
priority: normal
severity: normal
status: open
title: Request for python casemapping functions to use full not simple casemaps 
per Unicode's recommendation
type: feature request
versions: Python 3.2
Added file: http://bugs.python.org/file22883/mux.python




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-11 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +belopolsky, ezio.melotti
