On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
> I guess this is pretty general for the described need:
> %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]
Use sys.maxunicode instead of 0xFFFF, so that the scan covers all code
points rather than only the Basic Multilingual Plane.
> CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
> Wall time: 18.7 ms
> unicode_whitespace
> [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
> '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
> '\u202f', '\u205f', '\u3000']
> It's milliseconds not nanoseconds, but presumably something you do
> once at the start of an application. Can anyone think of a more
> efficient and/or more concise way of doing this?
There isn't. You essentially have to scan the entire database for
whitespacy chars.
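
A minimal sketch of such a full scan, assuming only the standard library.
It uses sys.maxunicode + 1 as the upper bound so that the range covers
every code point. The helper name chars_in_category is made up for
illustration and is not an existing API.

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return every character whose general category equals `category`.

    Hypothetical helper for illustration, not a stdlib function.
    """
    return [
        chr(c)
        for c in range(sys.maxunicode + 1)  # +1 so U+10FFFF itself is scanned
        if unicodedata.category(chr(c)) == category
    ]


unicode_whitespace = chars_in_category("Zs")
```

The result depends on the Unicode version bundled with the interpreter
(see unicodedata.unidata_version), so the exact list can differ between
Python releases.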
> This definitely feels better than making a static sequence of
> characters since the Unicode Consortium may (and has) changed the
> definition.
Which was my point: including the above in a stdlib module wouldn't make
sense, since it increases module load time (and possibly startup time).
It's better to generate the string once and put it verbatim into the
application.
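
One way this generate-and-paste step could look, as a rough sketch: a
one-off script prints a line of Python source that can be copied into the
application. The names category_constant and UNICODE_SPACES are
hypothetical, chosen only for this example.

```python
import sys
import unicodedata


def category_constant(name, category):
    """Build one line of Python source defining `name` as a string constant.

    Hypothetical helper for a one-off code generation step.
    """
    chars = "".join(
        chr(c)
        for c in range(sys.maxunicode + 1)
        if unicodedata.category(chr(c)) == category
    )
    # ascii() gives a valid, pure-ASCII string literal with escapes;
    # the trailing comment records which Unicode version produced it.
    return f"{name} = {ascii(chars)}  # Unicode {unicodedata.unidata_version}"


print(category_constant("UNICODE_SPACES", "Zs"))
```

Recording the Unicode version next to the constant makes it easier to see
when the pasted string needs regenerating.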
However, this would have to be part of the Unicode database update dance,
and whitespace is only one possible category of chars which would be
interesting. Digits or numbers are another; letters, linebreaks, symbols,
etc. are others:
https://www.unicode.org/reports/tr44/#GC_Values_Table
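
To illustrate why whitespace is just one case among many, here is a
sketch that groups all code points by general category in a single pass.
The function name chars_by_category is an assumption for this example;
the category codes ("Zs", "Nd", "Lu", ...) are the ones from the table
linked above.

```python
import sys
import unicodedata
from collections import defaultdict


def chars_by_category():
    """Map each general category code to a string of its characters.

    Hypothetical helper: a single pass over all code points.
    """
    groups = defaultdict(list)
    for c in range(sys.maxunicode + 1):
        ch = chr(c)
        groups[unicodedata.category(ch)].append(ch)
    return {cat: "".join(chars) for cat, chars in groups.items()}


by_category = chars_by_category()
spaces = by_category["Zs"]          # space separators
decimal_digits = by_category["Nd"]  # decimal digits
```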
It's better to put this into the application in question or to have
someone maintain such collections outside the stdlib in a package on PyPI.
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Experts (#1, Jun 02 2023)