New submission from Benjamin Peterson <benja...@python.org>:

Both Objects/unicodeobject.c and Modules/unicodedata.c rely on large 
generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, 
Modules/unicodedata_db.h). This separation made sense in Python 2, where Unicode 
was a less important part of the language than in Python 3 (recall that Python 
2's configure script has --without-unicode!). However, in Python 3, Unicode is a 
core language concept and literally baked into the syntax of the language. I 
therefore propose moving all of unicodedata's tables and algorithms into the 
interpreter core proper and converting Modules/unicodedata.c into a facade. 
This will remove awkward maneuvers like ast.c importing unicodedata in order to 
perform normalization. Having unicodedata readily accessible to the str type 
would also permit a higher-fidelity Unicode implementation. For example, 
implementing language-tailored str.lower() requires having the canonical 
combining class of a character available. This data currently lives only in 
unicodedata.
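
To illustrate the current split from the Python side, here is a minimal sketch 
(not part of any patch). The helper name combining_classes is made up for this 
example. It shows that the parser already NFKC-normalizes identifiers, while 
the same normalization and the canonical combining class are reachable from 
Python code only through the unicodedata module, not through str itself.

```python
import unicodedata

# An identifier spelled with U+FB01 LATIN SMALL LIGATURE FI.
name = "\ufb01le"

# The parser NFKC-normalizes identifiers, so the assignment below binds
# the plain ASCII name "file" in the namespace.
namespace = {}
exec(name + " = 1", namespace)

# The same normalization is only exposed to Python code via unicodedata.
normalized = unicodedata.normalize("NFKC", name)


def combining_classes(text):
    """Return the canonical combining class of each code point in text.

    Hypothetical helper for illustration; str has no method for this.
    """
    return [unicodedata.combining(ch) for ch in text]


# "e" + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
classes = combining_classes("e\u0301\u0323")
```

Here normalized equals "file" and matches the name that exec() bound, and 
classes holds one integer per code point (0 for the base letter).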

----------
components: Unicode
messages: 311634
nosy: benjamin.peterson, ezio.melotti, vstinner
priority: normal
severity: normal
stage: needs patch
status: open
title: merge the underlying data stores of unicodedata and the str type
type: enhancement
versions: Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32771>
_______________________________________