I guess this is pretty general for the described need:

>>> import unicodedata
>>> %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]
CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms
>>> unicode_whitespace
[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004',
 '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f',
 '\u205f', '\u3000']

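One caveat about that snippet: it selects only general category Zs, whereas the string in the proposal quoted further down also contains control characters such as \t and \n, which str.isspace() accepts. Here is a minimal sketch comparing the two. It assumes str.isspace() is an acceptable stand-in for the C-level check, the helper names zs_characters and isspace_characters are made up for illustration, and it scans up to sys.maxunicode rather than 0xFFFF:

```python
import sys
import unicodedata


def zs_characters():
    """Characters whose Unicode general category is Zs (space separator)."""
    return [chr(c) for c in range(sys.maxunicode + 1)
            if unicodedata.category(chr(c)) == "Zs"]


def isspace_characters():
    """Characters for which str.isspace() returns True."""
    return [chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()]


zs = zs_characters()
ws = isspace_characters()

# The exact counts depend on the Unicode version bundled with the interpreter.
print(f"Unicode {unicodedata.unidata_version}: {len(zs)} Zs, {len(ws)} isspace")

# Characters that str.isspace() accepts but that are not in category Zs.
print(sorted(set(ws) - set(zs)))
```

Scanning the full range takes noticeably longer than the BMP-only version, so this is still something to compute once at startup, not per call.
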
It's milliseconds, not nanoseconds, but presumably this is something you do
once at the start of an application. Can anyone think of a more efficient
and/or more concise way of doing this?

This definitely feels better than hard-coding a static sequence of
characters, since the Unicode Consortium may change (and has changed) the
definition. In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was removed
from the whitespace category to which it previously belonged. I'm not sure
why U+FEFF isn't included, but that seems to match the current standards,
so all good.

On Thu, Jun 1, 2023 at 1:29 PM Marc-Andre Lemburg <m...@egenix.com> wrote:
>
> On 01.06.2023 18:18, Paul Moore wrote:
> > On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio
> > <antonio...@gmail.com> wrote:
> >
> >     I suggest including a simple str variable in unicodedata module to
> >     mirror string.whitespace, so it would contain all characters defined
> >     in CPython function
> >     [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
> >     so that:
> >
> >     # existent
> >     string.whitespace = ' \t\n\r\x0b\x0c'
> >
> >     # proposed
> >     unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
> >
> >
> > What's the use case? I can't think of a single occasion when I would
> > have found this useful.
>
> Same here.
>
> For those few cases, where it might be useful, you can easily put the
> string into your application code.
>
> Putting this into the stdlib would just mean that we'd have to recheck
> whether new Unicode whitespace chars were added, every time the standard
> upgrades.
> With ASCII, this won't happen in the foreseeable future ;-)
>
> --
> Marc-Andre Lemburg
> eGenix.com
>
> _______________________________________________
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/REMDZ2SVFVOIDEJYX3VSB2WUZTQPTTLM/
> Code of Conduct: http://python.org/psf/codeofconduct/

--
The dead increasingly dominate and strangle both the living and the
not-yet born. Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/3CH6FHG4BCXNBTF4LBZOYLRNHEKXCMYY/