[Python-ideas] Add a .whitespace property to module unicodedata

2023-06-01 Thread Antonio Carlos Jorge Patricio
I suggest adding a simple str constant to the unicodedata module to mirror 
string.whitespace. It would contain all the characters that the CPython 
function 
[_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
treats as whitespace, so that:

# existing
string.whitespace = ' \t\n\r\x0b\x0c'

# proposed
unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
___
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CCA23FXFOVICMX7IHDGT4O7RRO3Y5A2X/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio <
antonio...@gmail.com> wrote:

> I suggest including a simple str variable in unicodedata module to mirror
> string.whitespace, so it would contain all characters defined in CPython
> function [_PyUnicode_IsWhitespace()](
> https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
> so that:
>
>  # existent
> string.whitespace = ' \t\n\r\x0b\x0c'
>
> # proposed
> unicodedata.whitespace = '
> \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'


What's the use case? I can't think of a single occasion when I would have
found this useful.
Paul
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/374AYVMYWOLB2Q3NH3NM6UMEBK6KIFSP/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
It feels to me like "split on whitespace" or "remove whitespace" are
quite common operations.  I've been frustrated a number of times by
settling for the ASCII whitespace class when I really wanted the
Unicode whitespace class.

On Thu, Jun 1, 2023 at 12:20 PM Paul Moore  wrote:
>
> On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio 
>  wrote:
>>
>> I suggest including a simple str variable in unicodedata module to mirror 
>> string.whitespace, so it would contain all characters defined in CPython 
>> function 
>> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
>>  so that:
>>
>>  # existent
>> string.whitespace = ' \t\n\r\x0b\x0c'
>>
>> # proposed
>> unicodedata.whitespace = ' 
>> \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
>
>
> What's the use case? I can't think of a single occasion when I would have 
> found this useful.
> Paul



-- 
The dead increasingly dominate and strangle both the living and the
not-yet born.  Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/OT7NREOQC4OHNXMFJCWCDOXBQ3Z34VXH/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Chris Angelico
On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D.  wrote:
>
> It feels to me like "split on whitespace" or "remove whitespace" are
> quite common operations.  I've been frustrated a number of times by
> settling for the ASCII whitespace class when I really wanted the
> Unicode whitespace class.
>

They are indeed quite common. It's a good thing Python makes those easy.

>>> len("\u2000spam\u2001".strip())
4
>>> "spam\u2002ham".split()
['spam', 'ham']

ChrisA
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CJB356TCUPJ7DITRHQE6NPJ2ILWGYXZY/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
OK, fair enough. What about "has whitespace (including Unicode beyond ASCII)"?

On Thu, Jun 1, 2023 at 1:08 PM Chris Angelico  wrote:
>
> On Fri, 2 Jun 2023 at 02:27, David Mertz, Ph.D.  wrote:
> >
> > It feels to me like "split on whitespace" or "remove whitespace" are
> > quite common operations.  I've been frustrated a number of times by
> > settling for the ASCII whitespace class when I really wanted the
> > Unicode whitespace class.
> >
>
> They are indeed, quite common. It's a good thing Python makes those easy.
>
> >>> len("\u2000spam\u2001".strip())
> 4
> >>> "spam\u2002ham".split()
> ['spam', 'ham']
>
> ChrisA



___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/OCEQ5W4QYO3AGNVNGNKXB2E3QFPZW3AO/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Marc-Andre Lemburg

On 01.06.2023 18:18, Paul Moore wrote:
> On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio
> <antonio...@gmail.com> wrote:
>
>> I suggest including a simple str variable in unicodedata module to
>> mirror string.whitespace, so it would contain all characters defined
>> in CPython function
>> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
>> so that:
>>
>> # existent
>> string.whitespace = ' \t\n\r\x0b\x0c'
>>
>> # proposed
>> unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
>
> What's the use case? I can't think of a single occasion when I would
> have found this useful.


Same here.

For those few cases, where it might be useful, you can easily put the 
string into your application code.


Putting this into the stdlib would just mean that we'd have to recheck 
whether new Unicode whitespace chars were added, every time the standard 
upgrades. With ASCII, this won't happen in the foreseeable future ;-)
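
A minimal sketch of what keeping the string in application code could look like, together with a check for the kind of drift described above. The names `UNICODE_WHITESPACE` and `whitespace_drift` are hypothetical; the constant is copied from the proposal at the top of the thread, and using `str.isspace()` as the reference is an assumption.

```python
import sys

# Copied from the proposal; a hard-coded snapshot, not an official constant.
UNICODE_WHITESPACE = (
    " \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)


def whitespace_drift(constant=UNICODE_WHITESPACE):
    """Compare a hard-coded string with what str.isspace() reports.

    Returns (missing, stale): characters the running interpreter treats
    as whitespace but the constant lacks, and characters the constant
    lists that the interpreter no longer treats as whitespace.
    """
    current = {chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()}
    listed = set(constant)
    return sorted(current - listed), sorted(listed - current)
```

Calling `whitespace_drift()` after an interpreter upgrade would show whether the snapshot still agrees with the bundled Unicode database.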


--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jun 01 2023)
>>> Python Projects, Coaching and Support ...https://www.egenix.com/
>>> Python Product Development ...https://consulting.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   https://www.egenix.com/company/contact/
 https://www.malemburg.com/

___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/REMDZ2SVFVOIDEJYX3VSB2WUZTQPTTLM/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread David Mertz, Ph.D.
I guess this is pretty general for the described need:

>>> %time unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]
CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 18.7 ms
>>> unicode_whitespace
[' ', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003',
'\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a',
'\u202f', '\u205f', '\u3000']

It takes milliseconds, not nanoseconds, but presumably this is something
you do once at the start of an application.  Can anyone think of a more
efficient and/or more concise way of doing this?

This definitely feels better than hard-coding a static sequence of
characters, since the Unicode Consortium may change the definition (and
has done so).  In particular, MONGOLIAN VOWEL SEPARATOR (U+180E) was
removed from the whitespace category to which it previously belonged.
I'm not sure why U+FEFF isn't included, but that seems to match the
current standards, so all good.
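
A runnable variant of the comprehension above, offered as a sketch. The hex upper bound in the snippet is truncated in the archive, so `sys.maxunicode + 1` is an assumption standing in for it. For comparison it also builds the list with `str.isspace()`, because category "Zs" covers only space separators and leaves out the control characters and the line and paragraph separators that appear in the proposed string.

```python
import sys
import unicodedata

# Assumption: scan every code point; the original bound is not recoverable.
ALL_CODE_POINTS = range(sys.maxunicode + 1)

# General category "Zs" (space separators) only, as in the snippet above.
zs_only = [
    chr(c) for c in ALL_CODE_POINTS if unicodedata.category(chr(c)) == "Zs"
]

# Everything str.isspace() accepts, which is what strip() and split() use.
isspace_chars = [chr(c) for c in ALL_CODE_POINTS if chr(c).isspace()]

# Characters that are whitespace for str.isspace() but not in category Zs.
extra = sorted(set(isspace_chars) - set(zs_only))
```

On a given interpreter, `extra` lists characters such as tab and newline that the "Zs" filter alone would miss.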

On Thu, Jun 1, 2023 at 1:29 PM Marc-Andre Lemburg  wrote:
>
> On 01.06.2023 18:18, Paul Moore wrote:
> > On Thu, 1 Jun 2023 at 15:09, Antonio Carlos Jorge Patricio
> > <antonio...@gmail.com> wrote:
> >
> >> I suggest including a simple str variable in unicodedata module to
> >> mirror string.whitespace, so it would contain all characters defined
> >> in CPython function
> >> [_PyUnicode_IsWhitespace()](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h#L6314)
> >> so that:
> >>
> >> # existent
> >> string.whitespace = ' \t\n\r\x0b\x0c'
> >>
> >> # proposed
> >> unicodedata.whitespace = ' \t\n\x0b\x0c\r\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
> >
> >
> > What's the use case? I can't think of a single occasion when I would
> > have found this useful.
>
> Same here.
>
> For those few cases, where it might be useful, you can easily put the
> string into your application code.
>
> Putting this into the stdlib would just mean that we'd have to recheck
> whether new Unicode whitespace chars were added, every time the standard
> upgrades. With ASCII, this won't happen in the foreseeable future ;-)
>



___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/3CH6FHG4BCXNBTF4LBZOYLRNHEKXCMYY/


[Python-ideas] Re: extend method of the list class could return a reference to the list so that we can chain method calls

2023-06-01 Thread Tal Einat
On Mon, May 29, 2023 at 2:52 AM Richard Damon  wrote:
>
> On 5/28/23 7:32 PM, Samuel Muldoon wrote:
> > Currently, list.extend does not allow method chaining.
> >
> > parameters = [
> >     "zero or more",
> >     "zero or more".upper(),
> >     "zero or more".lower(),
> >     "Zero or More"
> > ].extend(["(0, +inf)", "[0; +inf]"])
> >
> > parameters.extend(["{0, \u221E}"]).append("(0 inf)")
> >
> > Samuel Muldoon
> >
> My understanding is that it is a deliberate choice that mutating methods
> return None rather than the object (which would allow chaining),
> because otherwise it is too easy to think the call returns a new copy of
> the object (like operations that aren't mutations).
>
> Thinking you got a new copy when you are working with the original
> object leads to hard-to-find problems.
>
> Since the operations return None, mistakenly trying to chain hits an
> obvious and normally easy-to-fix error.
>
> It just means that "chaining" has become less Pythonic.

Indeed, this was a deliberate choice, and I believe this will not be changed.

If you really want to use method-chaining style, there are libraries
on PyPI to facilitate that. Take a look at PyDash[1], or at my own
take on this, funcy-chain[2].

[1] pydash.readthedocs.io
[2] github.com/taleinat/funcy-chain
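
For illustration, a small sketch of the behaviour under discussion: `list.extend` returns `None`, so the chained call fails immediately, and a non-mutating expression gives the combined list instead. The variable names are made up for the example.

```python
parameters = ["zero or more", "ZERO OR MORE"]

# extend() mutates the list in place and returns None.
result = parameters.extend(["(0, +inf)"])

# Chaining therefore fails right away: the first extend() succeeds,
# then .append is looked up on None and raises AttributeError.
try:
    parameters.extend(["{0, \u221E}"]).append("(0 inf)")
except AttributeError as exc:
    chain_error = exc

# Non-mutating alternative: build a new list in a single expression.
combined = [*parameters, "(0 inf)"]
```

The error is loud and points at the chained call, which is the "easy to fix" failure mode described in the quoted message.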

- Tal
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/F3LALA33SCESQJ2FGXMTOBQ5PJB6DGXV/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Richard Damon

On 6/1/23 2:06 PM, David Mertz, Ph.D. wrote:

> I'm not sure why U+FEFF isn't included, but that seems to match the
> current standards, so all good.

I think it's because ZERO WIDTH NO-BREAK SPACE (aka the BOM) doesn't 
act like a "space" character.

If used as the BOM, it is intended to be stripped out on reading, and 
the UTF-16/UTF-32 data that follows it is then simply read with its 
byte order corrected as the mark indicates.

If used elsewhere as ZWNBSP (a use that has been deprecated in favor 
of U+2060), then it is intentionally "no-break", so not a space to 
separate on.
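
A short sketch that checks these points against the interpreter's own Unicode data. The variable names are illustrative only.

```python
import codecs
import unicodedata

BOM = "\ufeff"

name = unicodedata.name(BOM)          # official character name
category = unicodedata.category(BOM)  # general category (a format character)
is_space = BOM.isspace()              # not treated as whitespace

# Because it is not whitespace, split() does not break on it.
parts = f"spam{BOM}ham".split()

# As a byte order mark: the generic "utf-16" codec consumes it, while an
# explicit-endianness codec leaves it in the decoded text.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
with_bom_consumed = raw.decode("utf-16")
with_bom_kept = raw.decode("utf-16-le")

# The recommended replacement for the "no-break" use.
replacement_name = unicodedata.name("\u2060")
```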


--
Richard Damon

___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7D2NZMF445F4XNKJFVXLDKDLI3NGDK65/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Paul Moore
On Thu, 1 Jun 2023 at 18:16, David Mertz, Ph.D. 
wrote:

> OK, fair enough. What about "has whitespace (including Unicode beyond
> ASCII)"?
>

>>> import re
>>> r = re.compile(r'\s', re.U)
>>> r.search('ab\u2002cd')


❯ py -m timeit -s "import re; r = re.compile(r'\s', re.U)" "r.search('ab\u2002cd')"
100 loops, best of 5: 262 nsec per loop
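
The same check wrapped as a helper, as a rough sketch. The function names are hypothetical. For str patterns in Python 3, `\s` already matches Unicode whitespace, so the `re.U` flag is redundant and is left out here; a pure `str.isspace()` variant is shown alongside for comparison.

```python
import re

# Compiled once; \s on a str pattern matches Unicode whitespace by default.
_WHITESPACE = re.compile(r"\s")


def has_whitespace(text):
    """Return True if text contains any Unicode whitespace character."""
    return _WHITESPACE.search(text) is not None


def has_whitespace_isspace(text):
    """Same question, answered without the re module."""
    return any(ch.isspace() for ch in text)
```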

Paul
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/Z7CASFLDWL7N2IPB2QPOWDGALNRBCMF4/


[Python-ideas] Re: Add a .whitespace property to module unicodedata

2023-06-01 Thread Ethan Furman

On 6/1/23 11:06, David Mertz, Ph.D. wrote:

> I guess this is pretty general for the described need:
>
> >>> unicode_whitespace = [chr(c) for c in range(0x) if unicodedata.category(chr(c)) == "Zs"]

Using the module-level `__getattr__` that could be a lazy attribute.
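
A sketch of how that could look, assuming a hypothetical module named `uniws`. It is built here with `types.ModuleType` so the example is self-contained; in a real module the hook would simply be a function named `__getattr__` defined at module level. The helper `_build_whitespace` and the choice of `str.isspace()` as the definition are assumptions, not something settled in this thread.

```python
import sys
import types


def _build_whitespace():
    # Assumption: "whitespace" means whatever str.isspace() accepts.
    return "".join(
        chr(c) for c in range(sys.maxunicode + 1) if chr(c).isspace()
    )


# Hypothetical stand-in for a real module file.
uniws = types.ModuleType("uniws")


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "whitespace":
        value = _build_whitespace()
        # Cache in the module namespace so the scan runs at most once.
        setattr(uniws, name, value)
        return value
    raise AttributeError(f"module {uniws.__name__!r} has no attribute {name!r}")


uniws.__getattr__ = _module_getattr
```

The first access to `uniws.whitespace` pays for the scan; later accesses find the cached string in the module's namespace and never reach the hook.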

--
~Ethan~
___
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/TMHAJMGZH4JHZOIKKWE55ZCF4N4CHGNI/