[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price
Greg Price added the comment: > About the RSS memory, I'm not sure how Linux accounts the Unicode databases > before they are accessed. Is it like read-only memory loaded on demand when > accessed? It stands for "resident set size", as in "resident in memory"; and it only counts pages of

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Benjamin Peterson
Benjamin Peterson added the comment: It's also possible we're missing some logical compression opportunities by artificially partitioning the Unicode databases. Encoded optimally, the combined databases could very well take up less space than their raw sum suggests.

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread STINNER Victor
STINNER Victor added the comment: Note: On Debian and Ubuntu, unicodedata is a built-in module. It's not built as a dynamic library. About the RSS memory, I'm not sure how Linux accounts the Unicode databases before they are accessed. Is it like read-only memory loaded on demand when

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price
Greg Price added the comment: OK, I forked off the discussion of case-mapping as #37848. I think it's probably good to first sort out what we want, before returning to how to implement it (if it's agreed that changes are desired.) Are there other areas of functionality that would be good to

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Benjamin Peterson
Benjamin Peterson added the comment: The goal is to implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold. To do this, you need access to certain character properties

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price
Greg Price added the comment: Speaking of improving functionality: > Having unicodedata readily accessible to the str type would also permit > a higher-fidelity Unicode implementation. For example, implementing > language-tailored str.lower() requires having canonical combining class of a
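
A minimal sketch of the gap discussed here, not taken from the thread: str.lower() applies only language-independent mappings, while the canonical combining class that the quoted text refers to is available through unicodedata.combining(). The variable names are illustrative only.

```python
import unicodedata

# Language-independent lowercasing: Turkish or Azeri text would want a
# dotless "ı" here, but str.lower() takes no locale argument.
plain_i = "I".lower()

# Canonical combining class of U+0307 COMBINING DOT ABOVE, the kind of
# per-character property that tailored casing rules need to inspect.
dot_above_ccc = unicodedata.combining("\u0307")

# One context-sensitive rule that str.lower() does apply: Greek final sigma.
final_sigma = "\u039f\u03a3".lower()

print(plain_i, dot_above_ccc, final_sigma)
```

The first two values come from different data stores today (the str type's tables and the unicodedata module), which is the separation this issue is about.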

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price
Greg Price added the comment: > Loading it dynamically reduces the memory footprint. Ah, this is a good question to ask! First, FWIW on my Debian buster desktop I get a smaller figure for `import unicodedata`: only 64 kiB. $ python Python 3.7.3 (default, Apr 3 2019, 05:39:12) [GCC 8.3.0]
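
A rough way to reproduce this kind of measurement, assuming a Linux-style /proc/self/statm file; this is a sketch rather than the method used in the thread, and the helper name rss_kib is made up for illustration.

```python
import os
import sys


def rss_kib():
    """Return this process's resident set size in KiB, or None if unavailable.

    Assumes Linux's /proc/self/statm, whose second field is the number of
    resident pages. On other platforms the file is missing and None is returned.
    """
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE") // 1024


# If the module is already loaded, the delta below says nothing useful.
already_loaded = "unicodedata" in sys.modules
before = rss_kib()
import unicodedata  # noqa: E402  (imported late on purpose)
after = rss_kib()

if before is not None and after is not None:
    print(f"already loaded: {already_loaded}, RSS delta: {after - before} KiB")
```

The result depends on whether the module is built in or a shared object, and on which pages have actually been touched, so figures from different systems are not directly comparable.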

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread STINNER Victor
STINNER Victor added the comment: Hum, I forgot to mention that the module is compiled as a dynamic library, at least on Fedora: $ python3 Python 3.7.4 (default, Jul 9 2019, 16:32:37) [GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux Type "help", "copyright", "credits" or "license" for
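
One way to check how a given interpreter ships the module, as a sketch using only sys.builtin_module_names and the module's __file__ attribute (the variable names are illustrative):

```python
import sys
import unicodedata

# True when the module is compiled into the interpreter binary itself.
is_builtin = "unicodedata" in sys.builtin_module_names

# Built-in modules have no __file__; extension modules point at a shared object.
location = getattr(unicodedata, "__file__", None)

print("built-in" if is_builtin else f"loaded from {location}")
```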

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread STINNER Victor
STINNER Victor added the comment: > This will remove awkward maneuvers like ast.c importing unicodedata in order > to perform normalization. unicodedata is not needed by default. ast.c only imports unicodedata at the first non-ASCII identifier. If your application (and all dependencies) only
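
The normalization in question is the NFKC normalization of identifiers. A small sketch of its visible effect, using U+00B5 MICRO SIGN as an example non-ASCII identifier (the example character and variable names are chosen here, not taken from the thread):

```python
import unicodedata

namespace = {}
# The source text below contains U+00B5 MICRO SIGN as an identifier.
# Compiling it is what triggers identifier normalization.
exec("\u00b5 = 1", namespace)

# exec() also adds "__builtins__", so filter out dunder names.
stored_names = [name for name in namespace if not name.startswith("__")]

# The name actually stored is the NFKC form, U+03BC GREEK SMALL LETTER MU.
normalized = unicodedata.normalize("NFKC", "\u00b5")
print(stored_names == [normalized])
```

Code that uses only ASCII identifiers never reaches this path, which is the point being made about the module not being needed by default.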

[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price
Change by Greg Price: nosy: +Greg Price

[issue32771] merge the underlying data stores of unicodedata and the str type

2018-04-13 Thread Matej Cepl
Change by Matej Cepl: nosy: +mcepl

[issue32771] merge the underlying data stores of unicodedata and the str type

2018-02-05 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: +1. And perhaps a new C API for direct access to the Unicode DB should be provided. -- components: +Interpreter Core nosy: +serhiy.storchaka

[issue32771] merge the underlying data stores of unicodedata and the str type

2018-02-04 Thread Benjamin Peterson
New submission from Benjamin Peterson : Both Objects/unicodeobject.c and Modules/unicodedatamodule.c rely on large generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, Modules/unicodedata_db.h). This separation made sense in Python 2 where Unicode was