On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
I guess this is pretty general for the described need:

%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]

Use sys.maxunicode instead of 0xFFFF, so that the scan covers all Unicode planes and not just the BMP.

CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms
unicode_whitespace
[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
'\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
'\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do
once at the start of an application.  Can anyone think of a more
efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for whitespacy chars.
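
A minimal sketch of such a full scan, for illustration. The helper name chars_in_category is made up for this example and is not an existing stdlib function; the only stdlib calls used are unicodedata.category() and sys.maxunicode:

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return all characters whose Unicode general category equals `category`.

    Hypothetical helper for illustration; it walks every code point once.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)  # + 1 so the last code point is included
        if unicodedata.category(chr(cp)) == category
    ]


# "Zs" is the Space Separator category used in the snippet quoted above.
unicode_whitespace = chars_in_category("Zs")
```

Note that "Zs" only covers space separators. Tab and newline, for example, are in category "Cc", so this list is not the same set that str.isspace() accepts.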

This definitely feels better than making a static sequence of
characters, since the Unicode Consortium may change (and has changed)
the definition.

Which was my point: including the above in a stdlib module wouldn't make sense, since it would increase module load time (and possibly startup time). It's better to generate the string once and put it verbatim into the application.
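
One possible way to do that generation step, as a rough sketch: run something like the following once (not at application startup) and paste the printed line into the application source. The constant name UNICODE_ZS is only an assumption for this example:

```python
import sys
import unicodedata

# One-off scan; this is the expensive part that should not run at import time.
zs_chars = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
)

# ascii() escapes all non-ASCII characters, so the result is a plain-ASCII
# string literal that survives copy and paste into a source file.
literal = ascii(zs_chars)

# Record the Unicode database version the table was generated from.
print(f"UNICODE_ZS = {literal}  # Unicode {unicodedata.unidata_version}")
```

The version comment makes it easier to notice when the pasted table is older than the Unicode database of the Python version in use.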

However, this would have to be part of the Unicode database update dance, and whitespace is only one possible category of chars which would be interesting. Digits or numbers are another; letters, linebreaks, symbols, etc. are others:

https://www.unicode.org/reports/tr44/#GC_Values_Table
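
To illustrate the point about other categories, here is a sketch that collects several general categories in a single pass over the database. The function name collect_categories and the chosen set of categories are assumptions for this example; the category codes themselves come from the table linked above:

```python
import sys
import unicodedata
from collections import defaultdict


def collect_categories(wanted):
    """Map each requested general category to a string of its characters.

    Hypothetical helper; it scans all code points once, regardless of how
    many categories are requested.
    """
    wanted = frozenset(wanted)
    found = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in wanted:
            found[cat].append(ch)
    return {cat: "".join(chars) for cat, chars in found.items()}


# Zs: space separators, Zl: line separator, Zp: paragraph separator,
# Nd: decimal digits.
tables = collect_categories({"Zs", "Zl", "Zp", "Nd"})
```

Since the cost is dominated by the scan itself, gathering several categories at once costs about the same as gathering one.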

It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jun 02 2023)

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/NPO3RLDFXP7IWHP6X54GXTF6CYKOY75U/
Code of Conduct: http://python.org/psf/codeofconduct/