[Python-ideas] Re: Add a .whitespace property to module unicodedata

Marc-Andre Lemburg Fri, 02 Jun 2023 13:35:41 -0700

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:

I guess this is pretty general for the described need:

%time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]


Use sys.maxunicode instead of 0xFFFF, so that the scan also covers code points beyond the BMP.

CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms

unicode_whitespace

[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
'\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
'\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do
once at the start of an application.  Can anyone think of a more
efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for whitespacy chars.
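A minimal sketch of such a full scan, assuming the "Zs" (space separator) general category is the definition of whitespace wanted here. The helper name chars_in_category is hypothetical and not part of unicodedata:

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return every character whose general category equals `category`.

    Hypothetical helper: it walks all code points up to and including
    sys.maxunicode, so characters outside the BMP are covered as well.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


unicode_whitespace = chars_in_category("Zs")
```

Note that "Zs" only covers space separators. Tab and newline are control characters (category "Cc"), so str.isspace() accepts more characters than this list contains.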

This definitely feels better than making a static sequence of
characters since the Unicode Consortium may change (and has changed)
the definition.

Which was my point: including the above in a stdlib module wouldn't make sense, since it increases module load time (and possibly startup time), so it's better to generate a string and put this verbatim into the application.
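One way to generate such a string is sketched below, under the assumption that an ASCII-only literal is wanted so it can be pasted into any source file. The constant name UNICODE_WHITESPACE is only an illustration:

```python
import ast
import sys
import unicodedata

# Collect all space separators ("Zs") with a one-off scan.
chars = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
)

# ascii() returns a quoted literal with all non-ASCII characters escaped.
literal = ascii(chars)

# Record the Unicode version the literal was generated from.
line = f"UNICODE_WHITESPACE = {literal}  # Unicode {unicodedata.unidata_version}"
print(line)

# Sanity check: the literal evaluates back to the original string.
assert ast.literal_eval(literal) == chars
```

The printed line can be copied into the application and regenerated whenever the Python version, and with it the bundled Unicode database, changes.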

However, this would have to be part of the Unicode database update dance, and whitespace is only one possible category of chars which would be interesting. Digits or numbers are another; letters, linebreaks, symbols, etc. are others:


https://www.unicode.org/reports/tr44/#GC_Values_Table
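To illustrate how many such collections there could be, here is a rough sketch that groups all code points by general category in a single pass. The variable names are assumptions made for this example; the two-letter category codes are the ones listed in the table linked above:

```python
import sys
import unicodedata
from collections import defaultdict

# Map each general category code (e.g. "Zs", "Nd", "Lu") to its characters.
by_category = defaultdict(list)
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    by_category[unicodedata.category(ch)].append(ch)

space_separators = by_category["Zs"]
decimal_digits = by_category["Nd"]
line_separators = by_category["Zl"]
```

This holds every code point in memory at once, which is fine for a one-off generation script but is another reason not to do it at module import time.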

It's better to put this into the application in question or to have someone maintain such collections outside the stdlib in a package on PyPI.


--
Marc-Andre Lemburg
eGenix.com

