[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Benjamin Peterson


Benjamin Peterson  added the comment:

Greg has read my mind. An optional parameter to upper/lower/casefold was 
exactly the API I was thinking of. No C locales or the locale module involved.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price


Greg Price  added the comment:

(I should add that it was only after doing the reading that produced the OP 
that I had a clear idea what I thought the priority of the issue was -- before 
doing that work I didn't have a clear sense of the scope of what it affects. 
Based on that SpecialCasing.txt file as of Unicode 12.0.0, I believe the 
functionality we don't currently support is entirely about the handling of 
certain versions of the Latin letter I, as treated in Lithuanian, Turkish, and 
Azerbaijani. Though one function of this issue thread is that it would be a 
great place to point out if there's another component to it!)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price


Greg Price  added the comment:

> Maintaining Python is already expensive [...] There are already enough bugs 
> waiting for you to be fixed ;-)

BTW I basically agree with this. I think this is not a high-priority issue, and 
I have my eye on some of those bugs. :-)

I think the fact that it's per-*language* (despite my inaccurate phrasing in 
the OP), not per-locale, simplifies it some -- for example the whole `.UTF-8` 
vs `.utf8` thing doesn't appear. And in particular I think if/when someone 
decides to sit down and make an implementation of this, then if they take the 
time to carefully read and absorb the relevant pages of the standard... this is 
a feature where it's pretty feasible for the implementation to be a 
self-contained and relatively stable and low-bugs piece of code.

And in general I think even if nobody implements it soon, it's useful to have 
an issue that can be pointed to for this feature, and especially so if the 
discussion clearly lays out what the feature involves and what different 
people's views are on the API. For example #18236 has been open for 6 years, 
but the discussion there was extremely helpful for me to understand it and work 
up a fix, after just being pointed to it by someone who'd searched the tracker 
on seeing me send in the doc fix GH-15019.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Greg Price

Greg Price  added the comment:

> I believe that all locale specific things should be in the locale module, not 
> in the str class.

The locale module is all about doing things with the current process-global 
Unix locale. I don't think that'd be an appropriate interface for this -- if 
it's worth doing, it's worth doing in such a way that the same web server 
process can handle requests for Turkish-, Lithuanian-, and Spanish-speaking 
users without having to reset a global variable for each one.

> If a locale specific mapping is requested, this should be done
> explicitly by e.g. providing a parameter to str.lower() / upper() /
> title().

I like this design.

I said "locale" above, but that wasn't quite right, I think -- the file says 
e.g. `tr`, not `tr_TR` and `tr_CY`, and it describes the identifiers as 
"language IDs".  So perhaps

str.lower(*, lang=None)

?  And then

"I".lower(lang="tr") == "ı" == "\N{Latin small letter dotless I}"

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread STINNER Victor


STINNER Victor  added the comment:

Handling locales correctly is a pain. Each platform uses different locale names 
(different Linux distributions, Windows, FreeBSD, macOS, etc.). For example, 
en_US.UTF-8 vs en_US.utf8. For example, there are tons of bugs related to 
locale.getdefaultlocale() which tries to be smart on parsing locales. I'm 
fixing locale encoding bugs for 10 years and I'm not done yet: there are still 
many open bugs.

I suggest you to first create a module on PyPI to experiment getting the locale 
and attempt to implement Unicode algorithms which depend on the locale.

Maintaining Python is already expensive, I would prefer to not complicate the 
codebase too much with locales. There are already enough bugs waiting for you 
to be fixed ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I believe that all locale specific things should be in the locale module, not 
in the str class.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-14 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

The Unicode implementation is deliberately not locale specific and
this should not change.

If a locale specific mapping is requested, this should be done
explicitly by e.g. providing a parameter to str.lower() / upper() /
title().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-13 Thread Greg Price


Greg Price  added the comment:

Another previous discussion is #4610.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37848] More fully implement Unicode's case mappings

2019-08-13 Thread Greg Price

New submission from Greg Price :

Splitting this out from #32771 for more specific discussion. Benjamin writes 
there that it would be good to:

> implement the locale-specific case mappings of 
> https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of 
> the Unicode 12 standard in str.lower/upper/casefold.

and adds that an implementation would require having available in the core the 
data on canonical combining classes, which is currently only in the unicodedata 
module.

---

First, I'd like to better understand what functionality we have now and what 
else the standard describes.  Reading 
https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt , I see
* a bunch of rules that aren't language-specific
* some other rules that are.

I also see in makeunicodedata.py that we don't even parse the language-specific 
rules.

Here's, IIUC, a demo of us correctly implementing the language-independent 
rules.  One line in the data file reads:

FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings 
respectively:

$ unicode --brief "$(./python -c \
   's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper(')"
ff U+FB00 LATIN SMALL LIGATURE FF
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
f U+0066 LATIN SMALL LETTER F
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
F U+0046 LATIN CAPITAL LETTER F

OK, great.

---


Then here's something we don't implement. Another line in the file reads:

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.:

i U+0069 LATIN SMALL LETTER I
 ̇ U+0307 COMBINING DOT ABOVE
 ́ U+0301 COMBINING ACUTE ACCENT

... but only in a Lithuanian (`lt`) locale.

One question is: what would the right API for this be? I'm not sure I'd want 
`str.lower`'s results to depend on the process's current Unix locale... and I 
definitely wouldn't want to get that without some way of instead telling it 
what locale to use. (Either to use a single constant locale, or to use a 
per-user locale in e.g. a web application.)  Perhaps `str.lower` and friends 
would take a keyword argument `locale`?


Oh, one more link for reference: the said section of the standard is in this 
PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf , near the end.


And a related previous issue: #12736.

--
components: Unicode
messages: 349646
nosy: Greg Price, benjamin.peterson, ezio.melotti, lemburg, vstinner
priority: normal
severity: normal
status: open
title: More fully implement Unicode's case mappings
versions: Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com