[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D.  wrote:
>
> Let's call the styles a tie.  Using the SOWPODS scrabble wordlist (no
> currency symbols, so False answer):
>
> >>> unicode_currency = {chr(c) for c in range(0x) if
> ...                     unicodedata.category(chr(c)) == "Sc"}
> >>> wordlist = open('/usr/local/share/sowpods').read()
> >>> len(wordlist)
> 2707021
> >>> %timeit any(unicodedata.category(ch) == "Sc" for ch in wordlist)
> 176 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> %timeit any(unicodedata.category(ch) == "Sc" for ch in set(wordlist))
> 17.8 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>> bool(set(wordlist) & unicode_currency)
> False
> >>> %timeit bool(set(wordlist) & unicode_currency)
> 18 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> Of course, this is a small character set of 26 lowercase letters (and
> newline as I did it).  A more diverse alphabet might tip the timing
> slightly, but it's going to be a small matter either way.
>

Remember though, the original request was not for a set, but for a
string. Try your timing again when working with a string.

The any() form is almost certainly the most effective, although I
suppose it could be implemented in C for better performance (avoiding
calling back into Python repeatedly). Not sure it's necessary though.

ChrisA
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/TOAR5FT3MDIEZFBVT7YGR6CTZ2JKCZCQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Let's call the styles a tie.  Using the SOWPODS scrabble wordlist (no
currency symbols, so False answer):

>>> unicode_currency = {chr(c) for c in range(0x) if
...                     unicodedata.category(chr(c)) == "Sc"}
>>> wordlist = open('/usr/local/share/sowpods').read()
>>> len(wordlist)
2707021
>>> %timeit any(unicodedata.category(ch) == "Sc" for ch in wordlist)
176 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit any(unicodedata.category(ch) == "Sc" for ch in set(wordlist))
17.8 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> bool(set(wordlist) & unicode_currency)
False
>>> %timeit bool(set(wordlist) & unicode_currency)
18 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Of course, this is a small character set of 26 lowercase letters (and
newline as I did it).  A more diverse alphabet might tip the timing
slightly, but it's going to be a small matter either way.
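
The comparison above can be reproduced without IPython using the standard `timeit` module. This is only a sketch under stated assumptions: `text` is a synthetic stand-in for the word list (the SOWPODS file is not assumed to exist), the upper bound `sys.maxunicode + 1` is a choice made here, and the function names are made up for illustration. Absolute timings will differ from the figures quoted above.

```python
import sys
import timeit
import unicodedata

# Every character in general category "Sc" (currency symbols), built once.
# The range bound sys.maxunicode + 1 is an assumption of this sketch.
unicode_currency = {
    chr(c) for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Sc"
}

# Hypothetical stand-in for the word list: lowercase letters and newlines.
text = "abcdefghijklmnopqrstuvwxyz\n" * 10_000


def scan_string(s):
    # One category() lookup per character, stopping at the first hit.
    return any(unicodedata.category(ch) == "Sc" for ch in s)


def scan_distinct(s):
    # Reduce to distinct characters first, then look up each one.
    return any(unicodedata.category(ch) == "Sc" for ch in set(s))


def intersect(s):
    # Intersect the distinct characters with the precomputed category set.
    return bool(set(s) & unicode_currency)


if __name__ == "__main__":
    for fn in (scan_string, scan_distinct, intersect):
        seconds = timeit.timeit(lambda: fn(text), number=10)
        print(f"{fn.__name__:14s} {seconds / 10 * 1e3:8.3f} ms per call")
```

All three functions answer the same question; only the amount of work per character differs, which is what the timings are meant to expose.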

On Fri, Jun 2, 2023 at 7:49 PM Chris Angelico  wrote:
>
> On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D.  wrote:
> >
> > Yeah... oops. Obviously I typed the version in email. Should have done it 
> > in the shell. But you got the intention of set-ifying the characters in the 
> > large string.
>
> Yep. I thought of that as I was originally writing, but absent
> benchmarking data, I prefer the simplest way of writing something.
>
> ChrisA



-- 
The dead increasingly dominate and strangle both the living and the
not-yet born.  Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/Q2N4ZJHEJN4XP4S43K5V3RPMHXDMOUOH/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 09:42, David Mertz, Ph.D.  wrote:
>
> Yeah... oops. Obviously I typed the version in email. Should have done it in 
> the shell. But you got the intention of set-ifying the characters in the 
> large string.

Yep. I thought of that as I was originally writing, but absent
benchmarking data, I prefer the simplest way of writing something.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/BVPDSXXOCOWZ5G2THPB3ZVG6VPXDBE24/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Yeah... oops. Obviously I typed the version in email. Should have done it
in the shell. But you got the intention of set-ifying the characters in the
large string.

Yes on lies, damn lies, and benchmarks.

On Fri, Jun 2, 2023, 7:29 PM Chris Angelico  wrote:

> On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D. 
> wrote:
> >
> > This is just bar talk at this point.  I think we've shown that this is
> > easy enough to do that programmers can roll their own.
> >
> > But as idle chat goes, note that in your code:
> >
> >    set(unicodedata.category(ch) for ch in s)
> >
> > If `s` is a billion characters long, then we make a billion calls to
> > the `.category()` method.  Python calls are comparatively expensive,
> > even on well optimized data structures like strings.
> >
> > In my version:
> >
> > bool(set(s) & set(unicode_categories['Sc']))
> >
> > The billion characters are first reduced to a smallish set of hundreds
> > or thousands of distinct characters without needing method calls. Then
> > that is intersected with a smallish set of characters in the category.
> >
> > You could optimize your version, however, simply by using:
> >
> >    set(unicodedata.category(set(ch)) for ch in s)
>
> Or perhaps:
>
> set(unicodedata.category(ch) for ch in set(s))
>
> But measure before considering this worthwhile.
>
> > Yours provides more information, since it lists all the categories.
> > But if you REALLY only care about one category, then you still have to
> > ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`.  Which
> > is fine, that's not a hard question to ask.
>
> If you REALLY want to just check whether any category is there, you
> probably want something like:
>
> any(unicodedata.category(ch) == "Sc" for ch in s)
>
> which is completely different from what you were suggesting, and still
> doesn't require the string of all codepoints in the category.
>
> Point is, querying the string is almost always going to be more
> efficient than intersecting with the full gamut of that category.
>
> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/FC64VVAITJTQLIHQYT2BUHSU64VXJXSC/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 08:28, David Mertz, Ph.D.  wrote:
>
> This is just bar talk at this point.  I think we've shown that this is
> easy enough to do that programmers can roll their own.
>
> But as idle chat goes, note that in your code:
>
>    set(unicodedata.category(ch) for ch in s)
>
> If `s` is a billion characters long, then we make a billion calls to
> the `.category()` method.  Python calls are comparatively expensive,
> even on well optimized data structures like strings.
>
> In my version:
>
> bool(set(s) & set(unicode_categories['Sc']))
>
> The billion characters are first reduced to a smallish set of hundreds
> or thousands of distinct characters without needing method calls. Then
> that is intersected with a smallish set of characters in the category.
>
> You could optimize your version, however, simply by using:
>
>    set(unicodedata.category(set(ch)) for ch in s)

Or perhaps:

set(unicodedata.category(ch) for ch in set(s))

But measure before considering this worthwhile.

> Yours provides more information, since it lists all the categories.
> But if you REALLY only care about one category, then you still have to
> ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`.  Which
> is fine, that's not a hard question to ask.

If you REALLY want to just check whether any category is there, you
probably want something like:

any(unicodedata.category(ch) == "Sc" for ch in s)

which is completely different from what you were suggesting, and still
doesn't require the string of all codepoints in the category.

Point is, querying the string is almost always going to be more
efficient than intersecting with the full gamut of that category.
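
The reason the any() form does little work on a hit is that it stops at the first matching character. A minimal sketch of that behaviour follows; the helper names `has_category` and `count_examined` are made up for illustration, and only `any()` and `unicodedata.category()` come from the discussion.

```python
import unicodedata


def has_category(s, category):
    """Return True if any character of s has the given general category."""
    return any(unicodedata.category(ch) == category for ch in s)


def count_examined(s, category):
    """Like has_category(), but also report how many characters were looked at."""
    examined = 0

    def categories():
        nonlocal examined
        for ch in s:
            examined += 1
            yield unicodedata.category(ch)

    # any() stops pulling from the generator as soon as one comparison is true.
    found = any(cat == category for cat in categories())
    return found, examined
```

For a string that starts with a currency symbol, `count_examined` reports a single character examined regardless of the string's length; for a string with no match, every character is examined.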

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/KMHZOQJQPILZD6Z3AKKRQXGHXVYFQPER/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
This is just bar talk at this point.  I think we've shown that this is
easy enough to do that programmers can roll their own.

But as idle chat goes, note that in your code:

   set(unicodedata.category(ch) for ch in s)

If `s` is a billion characters long, then we make a billion calls to
the `.category()` method.  Python calls are comparatively expensive,
even on well optimized data structures like strings.

In my version:

bool(set(s) & set(unicode_categories['Sc']))

The billion characters are first reduced to a smallish set of hundreds
or thousands of distinct characters without needing method calls. Then
that is intersected with a smallish set of characters in the category.

You could optimize your version, however, simply by using:

   set(unicodedata.category(set(ch)) for ch in s)

Yours provides more information, since it lists all the categories.
But if you REALLY only care about one category, then you still have to
ask `'Sc' in set(unicodedata.category(set(ch)) for ch in s)`.  Which
is fine, that's not a hard question to ask.
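
The point about reducing the string to its distinct characters before calling `.category()` can be sketched as follows. The function names and the sample `text` are assumptions made for illustration; whether the reduction pays off in practice depends on the input and should be measured.

```python
import unicodedata


def categories_per_char(s):
    # One category() call for every character of the input.
    return {unicodedata.category(ch) for ch in s}


def categories_per_distinct_char(s):
    # One category() call for every distinct character of the input.
    return {unicodedata.category(ch) for ch in set(s)}


# Hypothetical sample: long, but built from few distinct characters.
text = "spam and eggs cost $5\n" * 1000

lookups_per_char = len(text)
lookups_per_distinct_char = len(set(text))
```

Both functions return the same set of categories; the second simply performs `len(set(s))` lookups instead of `len(s)`.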

On Fri, Jun 2, 2023 at 5:36 PM Chris Angelico  wrote:
>
> On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D.  wrote:
> >
> > Sure. That's fine. With sufficiently long strings my code is faster, but
> > for "typical" strings yours will be.
>
> Really? How? Your code has to build a set of every character in the
> string; mine builds a set of every category in the string. Set
> intersection won't be slower for a smaller set.
>
> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/5XXPVXLWZQXEQW7B35QIPXHJK7G4N6X7/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 07:28, David Mertz, Ph.D.  wrote:
>
> Sure. That's fine. With sufficiently long strings my code is faster, but
> for "typical" strings yours will be.

Really? How? Your code has to build a set of every character in the
string; mine builds a set of every category in the string. Set
intersection won't be slower for a smaller set.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/5C7WSPFDJ4A6LRHL67N7UFPXGU4KI56O/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
Sure. That's fine. With sufficiently long strings my code is faster, but
for "typical" strings yours will be.

On Fri, Jun 2, 2023, 5:20 PM Chris Angelico  wrote:

> On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D. 
> wrote:
> >
> > def does_string_have_currency_mark(s):
> >     return bool(set(s) & set(unicode_categories['Sc']))
> >
> > def does_string_have_numeric_digit(s): ...
> >
> > ... and so on.  Those seem like questions one asks often enough. Not
> > every day, but more than never.
> >
>
> These questions are much better answered with the
> unicodedata.category() function. First figure out what categories your
> string has:
>
> cats = set(unicodedata.category(ch) for ch in s)
>
> And then check whether Sc is in that set, or whatever others you care
> about.
>
> This way, the set contains only the categories, not the characters;
> there's no reason to do set intersection with all of the characters.
>
> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/DJXKMQNXP4O23LHN43YAVL4XSWUSWMUT/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 07:08, David Mertz, Ph.D.  wrote:
>
> def does_string_have_currency_mark(s):
>     return bool(set(s) & set(unicode_categories['Sc']))
>
> def does_string_have_numeric_digit(s): ...
>
> ... and so on.  Those seem like questions one asks often enough. Not
> every day, but more than never.
>

These questions are much better answered with the
unicodedata.category() function. First figure out what categories your
string has:

cats = set(unicodedata.category(ch) for ch in s)

And then check whether Sc is in that set, or whatever others you care about.

This way, the set contains only the categories, not the characters;
there's no reason to do set intersection with all of the characters.
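
A small sketch of this approach: compute the set of categories once, then answer several membership questions from it. The name `categories_in` and the sample string are made up for illustration.

```python
import unicodedata


def categories_in(s):
    """Return the set of Unicode general categories present in s."""
    return {unicodedata.category(ch) for ch in s}


# Hypothetical sample input.
cats = categories_in("Total: 42 \u20ac")

has_currency = "Sc" in cats   # currency symbol
has_digit = "Nd" in cats      # decimal digit
has_space = "Zs" in cats      # space separator
```

The resulting set holds at most a few dozen two-letter category codes, however long the string is.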

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/3EK66S27AO2IFBWPOIJ6ABUEJ6C6W2YB/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
def does_string_have_currency_mark(s):
    return bool(set(s) & set(unicode_categories['Sc']))

def does_string_have_numeric_digit(s): ...

... and so on.  Those seem like questions one asks often enough. Not
every day, but more than never.

On Fri, Jun 2, 2023 at 4:59 PM Chris Angelico  wrote:
>
> On Sat, 3 Jun 2023 at 06:54, David Mertz, Ph.D.  wrote:
> >
> > If we're talking PyPI, it would be nice to have:
> >
> > unicode_categories = {"Zs": [...], "Ll": [...], ...}
> >
> > For all the various categories.  It would just take one pass through
> > all the characters to generate it, but then every category would be
> > fast to access later.  On the other hand, it's a few lines of code
> > with a lazy import.  Probably not enough code to put on PyPI.
> >
>
> Question: What is the advantage of having this? What are the use-cases?
>
> ChrisA

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/ALTCCL6LRXS75PDVSZBGS5RGOHXJLPFC/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Chris Angelico
On Sat, 3 Jun 2023 at 06:54, David Mertz, Ph.D.  wrote:
>
> If we're talking PyPI, it would be nice to have:
>
> unicode_categories = {"Zs": [...], "Ll": [...], ...}
>
> For all the various categories.  It would just take one pass through
> all the characters to generate it, but then every category would be
> fast to access later.  On the other hand, it's a few lines of code
> with a lazy import.  Probably not enough code to put on PyPI.
>

Question: What is the advantage of having this? What are the use-cases?

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/4ZFJWXPYS6TWU7XBA5G63RY5H4KGOSW2/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread David Mertz, Ph.D.
If we're talking PyPI, it would be nice to have:

unicode_categories = {"Zs": [...], "Ll": [...], ...}

For all the various categories.  It would just take one pass through
all the characters to generate it, but then every category would be
fast to access later.  On the other hand, it's a few lines of code
with a lazy import.  Probably not enough code to put on PyPI.
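
One way the "one pass, then fast access" idea might look is sketched below. This is an illustration rather than a proposed implementation: the function form with `lru_cache` is one possible way to defer the scan until first use, and skipping unassigned, private-use and surrogate code points (Cn, Co, Cs) is an assumption made here to keep the result small.

```python
import sys
import unicodedata
from collections import defaultdict
from functools import lru_cache

# Assumption of this sketch: these bulk categories are not worth storing.
_SKIPPED = {"Cn", "Co", "Cs"}


@lru_cache(maxsize=None)
def unicode_categories():
    """Map each general category code to the list of characters in it.

    The scan over all code points runs on the first call only; later
    calls return the cached dictionary.
    """
    by_category = defaultdict(list)
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        category = unicodedata.category(ch)
        if category not in _SKIPPED:
            by_category[category].append(ch)
    return dict(by_category)
```

A caller would then write, for example, `unicode_categories()["Zs"]` and pay the scanning cost only once per process.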

On Fri, Jun 2, 2023 at 4:32 PM Marc-Andre Lemburg  wrote:
>
> On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
> > I guess this is pretty general for the described need:
> >
> > >>> %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]
>
> Use sys.maxunicode instead of 0x
>
> > CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
> > Wall time: 18.7 ms
> > >>> unicode_whitespace
> > [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
> > '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
> > '\u202f', '\u205f', '\u3000']
> >
> > It's milliseconds not nanoseconds, but presumably something you do
> > once at the start of an application.  Can anyone think of a more
> > efficient and/or more concise way of doing this?
>
> There isn't. You essentially have to scan the entire database for
> whitespacy chars.
>
> > This definitely feels better than making a static sequence of
> > characters since the Unicode Consortium may (and has) changed the
> > definition.
>
> Which was my point: including the above in a stdlib module wouldn't make
> sense, since it increases module load time (and possibly startup time),
> so it's better to generate a string and put this verbatim into the
> application.
>
> However, this would have to be part of the Unicode database update dance,
> and whitespace is only one possible category of chars which would be
> interesting. Digits or numbers are another; letters, linebreaks, symbols,
> etc. are others:
>
> https://www.unicode.org/reports/tr44/#GC_Values_Table
>
> It's better to put this into the application in question or to have
> someone maintain such collections outside the stdlib in a package on PyPI.
>
> --
> Marc-Andre Lemburg
> eGenix.com


Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/PXQN4HVSM4ZQEHSQQCQDED3ABKFZX5ES/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Marc-Andre Lemburg

On 01.06.2023 20:06, David Mertz, Ph.D. wrote:
> I guess this is pretty general for the described need:
>
> >>> %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]

Use sys.maxunicode instead of 0x

> CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
> Wall time: 18.7 ms
> >>> unicode_whitespace
> [' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
> '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
> '\u202f', '\u205f', '\u3000']
>
> It's milliseconds not nanoseconds, but presumably something you do
> once at the start of an application.  Can anyone think of a more
> efficient and/or more concise way of doing this?

There isn't. You essentially have to scan the entire database for
whitespacy chars.

> This definitely feels better than making a static sequence of
> characters since the Unicode Consortium may (and has) changed the
> definition.

Which was my point: including the above in a stdlib module wouldn't make
sense, since it increases module load time (and possibly startup time),
so it's better to generate a string and put this verbatim into the
application.

However, this would have to be part of the Unicode database update dance,
and whitespace is only one possible category of chars which would be
interesting. Digits or numbers are another; letters, linebreaks, symbols,
etc. are others:

https://www.unicode.org/reports/tr44/#GC_Values_Table

It's better to put this into the application in question or to have
someone maintain such collections outside the stdlib in a package on PyPI.
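
Generating a string and pasting it verbatim into the application could look roughly like this. It is a sketch only: the constant name `UNICODE_WHITESPACE` and the helper `whitespace_literal` are hypothetical, and `str.isspace()` is used here as the definition of whitespace, which is broader than category "Zs" alone.

```python
import sys


def whitespace_literal(name="UNICODE_WHITESPACE"):
    """Return one line of Python source defining the whitespace string."""
    chars = "".join(
        chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()
    )
    # ascii() escapes every non-ASCII character, so the generated line is
    # safe to paste into a source file whatever its encoding.
    return f"{name} = {ascii(chars)}"


line = whitespace_literal()
```

The generated line reflects the Unicode version of the interpreter that produced it, so it would need regenerating when that version changes.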


--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jun 02 2023)
>>> Python Projects, Coaching and Support ...https://www.egenix.com/
>>> Python Product Development ...https://consulting.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   https://www.egenix.com/company/contact/
 https://www.malemburg.com/

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/NPO3RLDFXP7IWHP6X54GXTF6CYKOY75U/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-02 Thread Barry



> On 1 Jun 2023, at 19:10, David Mertz, Ph.D.  wrote:
> 
> %time unicode_whitespace = [chr(c) for c in range(0x) if 
> unicodedata.category(chr(c)) == "Zs"]

Try 0x10 to get all of unicode. 

Barry
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/HMB57XHRGWRUJE4ZMULNBOGKD3ILH2DY/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Ethan Furman

On 6/1/23 11:06, David Mertz, Ph.D. wrote:

> I guess this is pretty general for the described need:
>
> >>> unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]

Using the module-level `__getattr__` that could be a lazy attribute.
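
A sketch of what a lazy attribute via module-level `__getattr__` (PEP 562) might look like. To keep the example self-contained it builds a throwaway module object with `types.ModuleType`; the module name `unicode_sets` and the attribute name `whitespace` are hypothetical, and in a real module the function would simply be defined at top level as `__getattr__`.

```python
import sys
import types

# Hypothetical stand-in for a real module file.
lazy_mod = types.ModuleType("unicode_sets")


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "whitespace":
        value = "".join(
            chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()
        )
        # Cache on the module so the scan runs at most once.
        setattr(lazy_mod, name, value)
        return value
    raise AttributeError(
        f"module {lazy_mod.__name__!r} has no attribute {name!r}"
    )


# Storing the hook in the module namespace is what PEP 562 looks for.
lazy_mod.__getattr__ = _module_getattr
```

The first access to `lazy_mod.whitespace` pays for the scan; later accesses find the cached attribute directly, so import time is unaffected.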

--
~Ethan~
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/TMHAJMGZH4JHZOIKKWE55ZCF4N4CHGNI/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. 
wrote:

> OK, fair enough. What about "has whitespace (including Unicode beyond
> ASCII)"?
>

>>> import re
>>> r = re.compile(r'\s', re.U)
>>> r.search('ab\u2002cd')


❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd')"
100 loops, best of 5: 262 nsec per loop

Paul
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/Z7CASFLDWL7N2IPB2QPOWDGALNRBCMF4/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Richard Damon

On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote:
> I'm not sure why U+FEFF isn't included, but that seems to match the
> current standards, so all good.

I think it's because ZERO WIDTH NO-BREAK SPACE (aka the BOM) doesn't
act like a "space" character.

If used as the BOM, it is intended to be stripped out when read, and the
UTF-16/UTF-32 data that follows it is typically just read with its byte
order corrected as the mark indicates.

If used elsewhere as the ZWNBSP (a use which has been deprecated in
favor of U+2060), then its use is intentionally "no-break", so it is
not a space to separate on.
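
The behaviour described here can be checked from Python. A short sketch follows; the variable names are illustrative, and the results reflect the Unicode data bundled with a current CPython.

```python
import codecs
import unicodedata

BOM = "\ufeff"

# U+FEFF is a format character (Cf), not a space separator (Zs).
bom_category = unicodedata.category(BOM)
bom_is_space = BOM.isspace()

# str.split() with no arguments therefore does not split on it.
pieces = "ab\ufeffcd".split()

# BOM-aware codecs strip the mark on decoding; plain UTF-8 keeps it.
utf16_text = "hi".encode("utf-16").decode("utf-16")
utf8_sig_text = (codecs.BOM_UTF8 + b"hi").decode("utf-8-sig")
utf8_plain_text = (codecs.BOM_UTF8 + b"hi").decode("utf-8")
```

U+2060 WORD JOINER, the suggested replacement for the no-break use, is likewise in category Cf rather than Zs.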


--
Richard Damon

Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/7D2NZMF445F4XNKJFVXLDKDLI3NGDK65/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
I guess this is pretty general for the described need:

>>> %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]
CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms
>>> unicode_whitespace
[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
'\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
'\u202f', '\u205f', '\u3000']

It's milliseconds not nanoseconds, but presumably something you do
once at the start of an application.  Can anyone think of a more
efficient and/or more concise way of doing this?

This definitely feels better than making a static sequence of
characters since the Unicode Consortium may (and has) changed the
definition.  In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was
removed from the whitespace category to which it previously belonged.
I'm not sure why U+FEFF isn't included, but that seems to match the
current standards, so all good.
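
For comparison, here is a sketch that collects both the "Zs" characters and everything `str.isspace()` accepts in a single pass. The upper bound `sys.maxunicode + 1` and the variable names are choices made for this example, and the exact contents depend on the Unicode version of the running interpreter.

```python
import sys
import unicodedata

space_separators = []   # general category "Zs" only
str_whitespace = []     # everything str.isspace() accepts

for codepoint in range(sys.maxunicode + 1):
    ch = chr(codepoint)
    if unicodedata.category(ch) == "Zs":
        space_separators.append(ch)
    if ch.isspace():
        str_whitespace.append(ch)

# Characters that count as whitespace for str methods but are not "Zs",
# such as tab, newline and the line/paragraph separators.
only_isspace = sorted(set(str_whitespace) - set(space_separators))
```

The "Zs" list is narrower than the string in the original proposal, which follows `str.isspace()` and therefore also includes control characters such as `\t` and `\n`.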

On Thu, Jun 1, 2023 at 1:29 PM Marc-Andre Lemburg  wrote:
>
> On 01.06.2023 18:18, Paul Moore wrote:
> > On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio
> > <antonio...@gmail.com> wrote:
> >
> > > I suggest including a simple str variable in unicodedata module to
> > > mirror string.whitespace, so it would contain all characters defined
> > > in CPython function
> > > [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
> > > so that:
> > >
> > > # existent
> > > string.whitespace = ' \t\n\r\x0b\x0c'
> > >
> > > # proposed
> > > unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
> >
> > What's the use case? I can't think of a single occasion when I would
> > have found this useful.
>
> Same here.
>
> For those few cases, where it might be useful, you can easily put the
> string into your application code.
>
> Putting this into the stdlib would just mean that we'd have to recheck
> whether new Unicode whitespace chars were added, every time the standard
> upgrades. With ASCII, this won't happen in the foreseeable future ;-)
>
> --
> Marc-Andre Lemburg
> eGenix.com



-- 
The dead increasingly dominate and strangle both the living and the
not-yet born.  Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/3CH6FHG4BCXNBTF4LBZOYLRNHEKXCMYY/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Marc-Andre Lemburg

On 01.06.2023 18:18, Paul Moore wrote:
> On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio
> <antonio...@gmail.com> wrote:
>
>> I suggest including a simple str variable in unicodedata module to
>> mirror string.whitespace, so it would contain all characters defined
>> in CPython function
>> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
>> so that:
>>
>> # existent
>> string.whitespace = ' \t\n\r\x0b\x0c'
>>
>> # proposed
>> unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
>
> What's the use case? I can't think of a single occasion when I would
> have found this useful.


Same here.

For those few cases where it might be useful, you can easily put the
string into your application code.


Putting this into the stdlib would just mean that we'd have to recheck 
whether new Unicode whitespace chars were added, every time the standard 
upgrades. With ASCII, this won't happen in the foreseeable future ;-)
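
For anyone who does want such a string in application code, here is a
minimal sketch that derives it at runtime instead of hard-coding it. It
assumes str.isspace() reflects the same CPython whitespace table the
proposal refers to; the names unicode_whitespace and UNICODE_WHITESPACE
are made up for illustration and are not part of any module.

```python
import string
import sys


def unicode_whitespace() -> str:
    """Return every code point that str.isspace() reports as whitespace.

    Hypothetical helper: it scans the whole code point range once, so the
    result follows whatever Unicode data the running interpreter ships.
    """
    return "".join(
        chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
    )


# Compute once at import time and reuse; the scan covers ~1.1M code points.
UNICODE_WHITESPACE = unicode_whitespace()

# The ASCII whitespace characters are a subset of the Unicode ones.
assert set(string.whitespace) <= set(UNICODE_WHITESPACE)
```

Because the string is built from the interpreter's own tables, there is
nothing to recheck by hand when the Unicode standard is upgraded.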


--
Marc-Andre Lemburg
eGenix.com


Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/REMDZ2SVFVOIDEJYX3VSB2WUZTQPTTLM/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?
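
A possible answer to the "has whitespace" question that needs no
precomputed string is the any() form, sketched here. It assumes
str.isspace() is an acceptable definition of Unicode whitespace; the
function name has_whitespace is only illustrative.

```python
def has_whitespace(text: str) -> bool:
    """Return True if any character in text is whitespace per str.isspace()."""
    # any() stops at the first whitespace character it finds.
    return any(ch.isspace() for ch in text)


print(has_whitespace("spam\u2002ham"))  # True (U+2002 EN SPACE)
print(has_whitespace("spamham"))        # False
```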

On Thu, Jun 1, 2023 at 1:08 PM Chris Angelico  wrote:
>
> On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D.  wrote:
> >
> > It feels to me like "split on whitespace" or "remove whitespace" are
> > quite common operations.  I've been frustrated a number of times by
> > settling for the ASCII whitespace class when I really wanted the
> > Unicode whitespace class.
> >
>
> They are indeed quite common. It's a good thing Python makes those easy.
>
> >>> len("\u2000spam\u2001".strip())
> 4
> >>> "spam\u2002ham".split()
> ['spam', 'ham']
>
> ChrisA



Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/OCEQ5W4QYO3AGNVNGNKXB2E3QFPZW3AO/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Chris Angelico
On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D.  wrote:
>
> It feels to me like "split on whitespace" or "remove whitespace" are
> quite common operations.  I've been frustrated a number of times by
> settling for the ASCII whitespace class when I really wanted the
> Unicode whitespace class.
>

They are indeed quite common. It's a good thing Python makes those easy.

>>> len("\u2000spam\u2001".strip())
4
>>> "spam\u2002ham".split()
['spam', 'ham']
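
The same no-argument split() can also cover "remove whitespace"
everywhere in a string, not only at the ends. A small sketch, with the
helper name remove_whitespace invented for the example:

```python
def remove_whitespace(text: str) -> str:
    """Drop every run of whitespace, Unicode included, from text."""
    # split() with no arguments splits on runs of whitespace and discards
    # empty pieces, so joining the parts removes all of it.
    return "".join(text.split())


print(remove_whitespace("\u2000spam\u2002 ham\u3000"))  # spamham
```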

ChrisA
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/CJB356TCUPJ7DITRHQE6NPJ2ILWGYXZY/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
It feels to me like "split on whitespace" or "remove whitespace" are
quite common operations.  I've been frustrated a number of times by
settling for the ASCII whitespace class when I really wanted the
Unicode whitespace class.

On Thu, Jun 1, 2023 at 12:20 PM Paul Moore  wrote:
>
> On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio 
>  wrote:
>>
>> I suggest including a simple str variable in unicodedata module to mirror 
>> string.whitespace, so it would contain all characters defined in CPython 
>> function 
>> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
>>  so that:
>>
>>  # existent
>> string.whitespace = ' \t\n\r\x0b\x0c'
>>
>> # proposed
>> unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
>
>
> What's the use case? I can't think of a single occasion when I would have 
> found this useful.
> Paul



Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/OT7NREOQC4OHNXMFJCWCDOXBQ3Z34VXH/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio <
antonio...@gmail.com> wrote:

> I suggest including a simple str variable in unicodedata module to mirror
> string.whitespace, so it would contain all characters defined in CPython
> function
> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
> so that:
>
> # existent
> string.whitespace = ' \t\n\r\x0b\x0c'
>
> # proposed
> unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'


What's the use case? I can't think of a single occasion when I would have
found this useful.
Paul
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/374AYVMYWOLB2Q3NH3NM6UMEBK6KIFSP/