If we're talking PyPI, it would be nice to have:

unicode_categories = {"Zs": [...], "Ll": [...], ...}

For all the various categories.  It would just take one pass through
all the characters to generate it, but then every category would be
fast to access later.  On the other hand, it's a few lines of code
with a lazy import.  Probably not enough code to put on PyPI.
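
A minimal sketch of that idea, not a proposed API. The function name
`unicode_categories` and the use of `functools.lru_cache` to defer the scan
until first use are my own assumptions for illustration; only
`unicodedata.category` and `sys.maxunicode` come from the thread:

```python
import sys
import unicodedata
from collections import defaultdict
from functools import lru_cache


@lru_cache(maxsize=None)
def unicode_categories():
    """Map each general category ("Zs", "Ll", ...) to a list of its characters.

    Hypothetical helper: the single pass over every code point runs on the
    first call only; later calls return the same cached dict.
    """
    categories = defaultdict(list)
    # sys.maxunicode is the highest code point, so +1 to include it.
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        categories[unicodedata.category(char)].append(char)
    return dict(categories)


# Same result as the "Zs" list comprehension quoted below, but the other
# categories are now available without another scan.
unicode_whitespace = unicode_categories()["Zs"]
```

The exact contents depend on the Unicode database shipped with the running
Python, which is the point of generating it rather than hard-coding it.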

On Fri, Jun 2, 2023 at 4:32 PM Marc-Andre Lemburg <m...@egenix.com> wrote:
>
> On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
> > I guess this is pretty general for the described need:
> >
> > >>> %time unicode_whitespace = [chr(c) for c in range(0xFFFF) if unicodedata.category(chr(c)) == "Zs"]
>
> Use sys.maxunicode instead of 0xFFFF
>
> > CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
> > Wall time: 18.7 ms
> > >>> unicode_whitespace
> > [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
> > '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
> > '\u202f', '\u205f', '\u3000']
> >
> > It's milliseconds not nanoseconds, but presumably something you do
> > once at the start of an application.  Can anyone think of a more
> > efficient and/or more concise way of doing this?
>
> There isn't. You essentially have to scan the entire database for
> whitespacy chars.
>
> > This definitely feels better than making a static sequence of
> > characters since the Unicode Consortium may (and has) changed the
> > definition.
>
> Which was my point: including the above in a stdlib module wouldn't make
> sense, since it increases module load time (and possibly startup time),
> so it's better to generate a string and put this verbatim into the
> application.
>
> However, this would have to be part of the Unicode database update dance,
> and whitespace is only one possible category of chars which would be
> interesting. Digits or numbers are another; letters, linebreaks, symbols,
> etc. are others:
>
> https://www.unicode.org/reports/tr44/#GC_Values_Table
>
> It's better to put this into the application in question or to have
> someone maintain such collections outside the stdlib in a package on PyPI.
>
> --
> Marc-Andre Lemburg
> eGenix.com


-- 
The dead increasingly dominate and strangle both the living and the
not-yet born.  Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/PXQN4HVSM4ZQEHSQQCQDED3ABKFZX5ES/
Code of Conduct: http://python.org/psf/codeofconduct/
