Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment: Here is a refreshed version of the patch, without the generated files. The patch combines several changes which are fairly independent from each other:
- Using the unicode database to generate the functions adds 143 new codepoints to PyUnicode_ToNumeric, and one codepoint to PyUnicode_IsWhitespace. - In addition, PyUnicode_ToNumeric now contains code for all numerics; previously those which are also digits fell in the 'default:' case and were converted with PyUnicode_ToDigit(). This adds 468 new codepoints, but removes the need to call PyUnicode_ToDigit() - The Unihan.txt files (two files to download, 25Mb each) are now parsed, and this adds 73 more codepoints to PyUnicode_ToNumeric. (There are now 1009 entries in this function.) The 3.2.0 version of this file contains two huge numbers: 1e16 and 1e20, I had to widen the type of 'change_record.numeric_changed' from 'int' to 'double'. It is possible that these were removed from the Unicode database between versions 4.1 and 5.1. - the database has a new flag, NUMERIC_MASK, used by PyUnicode_IsNumeric. This adds ~350 lines in the arrays of numbers in unicodetype_db.h If this patch is accepted, the md5 checksum in test_unicodedata.py will need to change. ---------- nosy: +amaury.forgeotdarc Added file: http://bugs.python.org/file14413/unicodedata-2.7.patch _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue1571184> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com