[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-16 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

Just to sum up the current state the way I see it, as well as the history of 
the discussion, I think there were 2 initial requests based on experience and 
one additional, more theoretical "nice to have":

A. ''.split() => ['']
B. ''.split(sep) => []  # where sep!=None

C. a way to get the current semantics of sep=None, but with specific whitespace 
separators like just spaces or just tabs. 'a  b'.split(' ') => ['a','b']

The idea was to cover all 3 enhancements with the current patch.

As I pointed out in the comments above, the current patch does not "cleanly" 
cover case B, potentially leading to confusion and/or bugs.

My suggestion was to cover cases A and B, and leave out C, potentially for some 
future patch.

If we go with the current patch, there will be no practical way to fix the 
issue with B later, other than adding a new `str.split2`. Conversely, it would 
be possible to add a new flag to handle C in the future.

This leads to a few questions:
- will the issue I brought up not really be a problem in practice?
- what's more important, B or C?
- if both B and C are important, can we leave C for a future patch?
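
For reference, here is a minimal sketch of where A and B stand today and how B could be emulated in user code. The helper name `split_b` is hypothetical and is not part of the patch under discussion.

```python
# Current behaviour, no patch applied.
assert ''.split() == []        # request A wants a way to get [''] here
assert ''.split(',') == ['']   # request B wants a way to get [] here


def split_b(s, sep):
    """Hypothetical helper: like s.split(sep), except that '' yields []."""
    return s.split(sep) if s else []


assert split_b('', ',') == []
assert split_b('a,,', ',') == ['a', '', '']   # interior empties are kept
```

Only the empty input is special-cased here; every other input goes through str.split() unchanged.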

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

> Of course, but the main thing is that you spotted this before the PR was 
> merged :)

I know, better late than never, but also better sooner than later :-)

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Andrei Kulakov

Andrei Kulakov  added the comment:

> I imagine that the discussion focussed on this since this is precisely what 
> happens when sep=None. For example, 'a b   c '.split() == ['a', 'b', 
> 'c']. I guess that the point was to provide users with explicit, manual 
> control over whether the behaviour of split should drop all empty strings or 
> retain all empty strings instead of this decision just being made on whether 
> sep is None or not.

That's true on some level but it seems to me that it's somewhat more nuanced 
than that.

The intent of sep=None is not to remove empties but to collapse invisible 
whitespace of mixed types into a single separator. ' \t  ' probably means a 
single separator because it looks like one visually. Yes, the effect is the 
same as removing empties but it's a relevant distinction when designing (and 
naming) a flag to make split() consistent with this behaviour when sep is ',', 
';', etc.

Because when you have 'a,,,', the most likely intent is to have 3 empty 
values, NOT to collapse 3 commas into a single sep; you might then have 
additional processing that gets rid of the empties as part of the split() 
operation. So it's quite a different operation, even though the end 
effect is the same. So is this change really making the behaviour consistent? 
To me, consistency implies that intent is roughly the same, and outcome is also 
roughly the same. 
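
To make the distinction concrete, here is a small illustration that uses only current behaviour. Nothing in it depends on the patch, and the list comprehension is just one common way to drop empties after the fact.

```python
# sep=None: a run of mixed whitespace acts as ONE separator.
assert 'a \t  b'.split() == ['a', 'b']

# sep=',': every comma is a separator, so 'a,,,' carries three empty fields.
fields = 'a,,,'.split(',')
assert fields == ['a', '', '', '']

# Getting rid of the empties is a separate, explicit step.
non_empty = [f for f in fields if f]
assert non_empty == ['a']
```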

You might say, but: practicality beats purity?

However, there are some real issues here:

- harder to explain, remember, document.
- naming issue
- not completely solving the initial issue (and it would most likely leave no 
practical way to patch up that corner case if this PR is accepted)

Re: naming, for example, using keep_empty=False for sep=None is confusing: it 
would seem that most (or even all) users would think of the operation as 
collapsing contiguous mixed whitespace into a single separator rather than 
splitting everything up and then purging empties. So this name could cause a 
fair bit of confusion for this case.

What if we call it `collapse_contiguous_separators`? I can live with an awkward 
name, but even then it doesn't work for the case like 'a,,,,' -- it doesn't 
make sense (mostly) to collapse 4 commas into one separator. Here you are 
actually purging empty values.

So the consistency seems labored in that any name you pick would be confusing 
for some cases.

And is the consistency for this case really needed? Is it common to have 
something like 'a,,,,' and say "I wish to get rid of those empty values but I 
don't want to use filter(None, values)"?

In regard to the workaround you suggested, that seems fine. If this PR is 
accepted, any of the workarounds that people now use for ''.split(',') or 
similar would still work just as before.

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Mark Bell

Mark Bell  added the comment:

> Instead, the discussion was focused on removing *all* empty strings from the 
> result.

I imagine that the discussion focussed on this since this is precisely what 
happens when sep=None. For example, 'a b   c '.split() == ['a', 'b', 
'c']. I guess that the point was to provide users with explicit, manual control 
over whether the behaviour of split should drop all empty strings or retain all 
empty strings instead of this decision just being made on whether sep is None 
or not.

So I wonder whether the "expected" solution for parsing CSV like strings is for 
you to actually filter out the empty strings yourself and never pass them to 
split at all. For example by doing something like:

[line.split(sep=',') for line in content.splitlines() if line]

but if this is the case then this is the kind of thing that would require 
careful thought about what is the right name for this parameter / right way to 
express this in the documentation to make sure that users don't fall into the 
trap that you mentioned.
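
A runnable version of that workaround, as a sketch. The sample `content` string is made up for illustration.

```python
# Sample data, invented for this example: one blank line and one row
# with an empty middle cell.
content = "col1,col2,col3\n1,2,3\n\na,,c\n"

# Blank lines are skipped before split() ever sees them, so interior
# empty cells survive untouched.
rows = [line.split(sep=',') for line in content.splitlines() if line]

assert rows == [
    ['col1', 'col2', 'col3'],
    ['1', '2', '3'],
    ['a', '', 'c'],
]
```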

> Sorry that I bring this up only now when the discussion was finished and the 
> work on PR completed; I wish I had seen the issue sooner.

Of course, but the main thing is that you spotted this before the PR was merged 
:)

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

To clarify with pseudocode, this is how it could work:

'' => []              # sep=None, keep_single_empty=False
'' => ['']            # sep=None, keep_single_empty=True
'' => []              # sep=',', keep_single_empty=False
'a,,' => ['a','','']  # sep=',', keep_single_empty=False

I guess `keepempty=False` could be too easily confused for filtering out all 
empties.
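
The same idea as a pure-Python sketch. The function name `split_single` is hypothetical, and the default of `keep_single_empty=False` is an assumption made only for this example; neither is settled in the discussion.

```python
def split_single(s, sep=None, keep_single_empty=False):
    """Hypothetical emulation: only the empty input is treated specially."""
    if s == '':
        return [''] if keep_single_empty else []
    return s.split(sep)   # all other inputs behave exactly as today


assert split_single('', None, keep_single_empty=False) == []
assert split_single('', None, keep_single_empty=True) == ['']
assert split_single('', ',', keep_single_empty=False) == []
assert split_single('a,,', ',', keep_single_empty=False) == ['a', '', '']
```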

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

Mark:

With sep=None, I don't think there is an issue. My only concern is when sep is 
set to some other value.

The original issue was that the single empty str result is removed when using 
sep=None and that it's kept when sep is some other value. So the most direct 
solution would seem to be to have a flag that controls the removal/retention of 
a single empty str in results.

Instead, the discussion was focused on removing *all* empty strings from the 
result.

My concern is that this doesn't solve the original issue in some cases, i.e. if 
I want to use a sep other than None, and I want an empty line to mean there are 
no values (result=[]), but I do want to keep empty values (a,, => [a,'','']) -- 
all of these seem like fairly normal, not unusual requirements.

The second concern, as I noted in my previous message, is the potential for bugs 
if this flag is interpreted narrowly as a solution for the original issue only.

[Note I don't think it would be a very widespread bug but I can see it 
happening occasionally.]

I think to avoid both of these issues we could change the flag to narrowly 
target the original issue, i.e. one empty str only. The name of the flag can 
remain the same or possibly something like `keep_single_empty` would be more 
explicit (though a bit awkward).

The downside is that we'd lose the convenience of splitting and filtering out 
all empties in one operation.

Sorry that I bring this up only now when the discussion was finished and the 
work on PR completed; I wish I had seen the issue sooner.

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Mark Bell


Mark Bell  added the comment:

Andrei: That is a very interesting observation, thank you for pointing it out. 
I guess your example / argument also currently applies to whitespace separation 
too. For example, if we have a whitespace separated string with contents:

col1 col2 col3
a b c

x y z

then using [row.split() for row in contents.splitlines()] results in
[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z']]

However if later a user appends the row:

p  q

aiming to have p, an empty cell and then q then they will actually get

[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z'], ['p', 'q']]

So at least this patch results in behaviour that is consistent with how split 
currently works. 
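
The whitespace example above can be reproduced with current Python as follows. The variable names are only for illustration.

```python
contents = "col1 col2 col3\na b c\n\nx y z\np  q\n"

rows = [row.split() for row in contents.splitlines()]

# The blank line becomes [], and the intended empty cell between
# 'p' and 'q' is silently lost.
assert rows == [
    ['col1', 'col2', 'col3'],
    ['a', 'b', 'c'],
    [],
    ['x', 'y', 'z'],
    ['p', 'q'],
]
```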

Are you suggesting that this is something that could be addressed by clearer 
documentation or using a different flag name?

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-02 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

I'm not sure I understand why the discussion was focused on removing *all* 
empty values.

Consider this in a context of a csv-like string:

1. 'a,b,c' => [a,b,c]     # of course
2. ',,'    => ['','','']  # follows naturally from above
3. ''      => []          # arguably most intuitive
4. ''      => ['']        # less intuitive but can be correct

From the point of view of intent of the initial string, the first two
are clear - 3 values are provided, in 2) they just happen to be empty.
It's up to the later logic to skip empty values if needed.

The empty string is ambiguous because the intent may be no values or a single 
empty value.

So ideally the new API would let me choose explicitly between 3) and 4). But I 
don't see why it would affect 2) !!

The processing of 2) is already not ambiguous. That's what I would want any 
version of split() to do, and later filter or skip empty values.

Current patch either forces me to choose 4) or to explicitly choose but
also break normal, "correct" handling of 2). 

It can lead to bugs as follows:

Let's say I have a csv-like string:

col1,col2,col3
1,2,3

a,b,c

I note that row 2 creates an empty col1 value, which is probably not what I 
want. I look at split() args and think that keepempty=False is designed for 
this use case. I use it in my code. Next time the code will break when someone 
adds a row:

a,,c
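
A sketch of that failure mode. Since keepempty does not exist in any released Python, the helper `split_drop_empties` below is a hypothetical stand-in that assumes keepempty=False (without maxsplit) simply drops every empty string from the result.

```python
def split_drop_empties(s, sep):
    """Hypothetical stand-in for the proposed s.split(sep, keepempty=False)."""
    return [part for part in s.split(sep) if part]


data = "col1,col2,col3\n1,2,3\n\na,b,c\na,,c\n"
rows = [split_drop_empties(line, ',') for line in data.splitlines()]

assert rows[2] == []            # the blank line now gives no values, as hoped
assert rows[4] == ['a', 'c']    # but 'a,,c' has lost its empty middle column
```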

--
nosy: +andrei.avk




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-21 Thread Matthew Barnett


Matthew Barnett  added the comment:

I've only just realised that the test cases don't cover all eventualities: none 
of them test what happens with multiple spaces _between_ the letters, such as:

'  a  b c '.split(maxsplit=1) == ['a', 'b c ']

Comparing that with:

'  a  b c '.split(' ', maxsplit=1)

you see that passing None as the split character does not mean "any whitespace 
character". There's clearly a little more to it than that.
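
For comparison, this is what the two calls return in current Python:

```python
s = '  a  b c '

# sep=None: leading whitespace is skipped and the run between 'a' and 'b'
# is consumed as a single separator.
assert s.split(maxsplit=1) == ['a', 'b c ']

# sep=' ': the very first space is the one and only split point.
assert s.split(' ', maxsplit=1) == ['', ' a  b c ']
```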

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-20 Thread Mark Bell


Mark Bell  added the comment:

Thank you very much for confirming these test cases. Using these I believe that 
I have now been able to complete a patch that would implement this feature. The 
PR is available at https://github.com/python/cpython/pull/26222. As I am a 
first-time contributor, please could a maintainer approve running the CI 
workflows so that I can confirm that all the (new) tests pass.

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Change by Mark Bell :


--
pull_requests: +24839
pull_request: https://github.com/python/cpython/pull/26222




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

We have that already, although it's spelled:

'   x y z'.split(maxsplit=1) == ['x', 'y z']

because the keepempty option doesn't exist yet.

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Mark Bell  added the comment:

So I think I agree with you about the difference between .split() and 
.split(' '). However wouldn't that mean that
'   x y z'.split(maxsplit=1, keepempty=False) == ['x', 'y z']

since it should do one split.

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

The best way to think of it is that .split() is like .split(' '), except that 
it's splitting on any whitespace character instead of just ' ', and keepempty 
is defaulting to False instead of True.

Therefore:

'   x y z'.split(maxsplit=1, keepempty=True) == ['', '  x y z']

because:

'   x y z'.split(' ', maxsplit=1) == ['', '  x y z']

but:

'   x y z'.split(maxsplit=1, keepempty=False) == ['x y z']

At least, I think that's the case!

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Mark Bell  added the comment:

> suggests that empty strings don't count towards maxsplit

Thank you for the confirmation. Although just to clarify I guess you really 
mean "empty strings *that are dropped from the output* don't count towards 
maxsplit". Just to double check this, what do we expect the output of

'   x y z'.split(maxsplit=1, keepempty=True)

to be?

I think it should be ['', '  x y z'] since in this case we are retaining empty 
strings and they should count towards the maxsplit.

(In the current patch this crashes with a core dump since it tries to write to 
unallocated memory)

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

The case:

'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

suggests that empty strings don't count towards maxsplit, otherwise it would 
return [' a b c  '] (i.e. the split would give ['', ' a b c  '] and dropping 
the empty strings would give [' a b c  ']).
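
The two readings can be compared directly in current Python. The "counted" variant below is only a simulation of the alternative (split on ' ' once, then drop empties), not anything the patch does.

```python
s = '  a b c  '

# What str.split() actually does today.
assert s.split(maxsplit=1) == ['a', 'b c  ']

# Simulated alternative: let the leading empty string use up the one
# allowed split, then drop empties afterwards.
counted = [part for part in s.split(' ', 1) if part]
assert counted == [' a b c  ']
```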

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell

Mark Bell  added the comment:

So I have taken a look at the original patch that was provided and I have been 
able to update it so that it is compatible with the current release. I have 
also flipped the logic in the wrapping functions so that they take a 
`keepempty` flag (which is the opposite of the `prune` flag). 

I had to make a few extra changes since there are now some extra catches in 
things like PyUnicode_Split which spot that if len(self) < len(sep) then they 
can just return [self]. However that now needs an extra test since that 
shortcut can only be used if len(self) > 0. You can find the code here: 
https://github.com/markcbell/cpython/tree/split-keepempty

However in exploring this, I'm not sure that this patch interacts correctly 
with maxsplit. For example, 
'   x y z'.split(maxsplit=1, keepempty=True)
results in
['', '', 'x', 'y z']
since the first two empty string items are "free" and don't count towards the 
maxsplit. I think the length of the result returned must be <= maxsplit + 1, is 
this right?
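
As a quick sanity check, that bound does hold for str.split() as it behaves today. The sample strings below are arbitrary, and passing the check says nothing about what the patched version should do.

```python
import itertools

samples = ['', '  ', '   x y z', '  a b c  ']

for s, m in itertools.product(samples, range(7)):
    # Explicit separator.
    assert len(s.split(' ', m)) <= m + 1
    # sep=None (whitespace splitting).
    assert len(s.split(None, m)) <= m + 1
```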

I'm about to rework the logic to avoid this, but before I go too far could 
someone double check my test cases to make sure that I have the correct idea 
about how this is supposed to work please. Only the 8 lines marked "New case" 
show new behaviour, all the others come from how str.split works currently. 
Of course the same patterns should apply to bytestrings and bytearrays.

''.split() == []
''.split(' ') == ['']
''.split(' ', keepempty=False) == []  # New case

'  '.split(' ') == ['', '', '']
'  '.split(' ', maxsplit=1) == ['', ' ']
'  '.split(' ', maxsplit=1, keepempty=False) == [' ']  # New case

'  a b c  '.split() == ['a', 'b', 'c']
'  a b c  '.split(maxsplit=0) == ['a b c  ']
'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

'  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
'  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
'  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
'  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
'  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
'  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
'  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

'  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']  # New case
'  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']  # New case
'  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']  # New case
'  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']  # New case
'  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']  # New case
'  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']  # New case
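
The cases that are not marked "New case" can be checked against current Python with a small table-driven script. This is only a sketch; the keepempty cases are left out because that parameter does not exist in any released version.

```python
s = '  a b c  '

# Expected results of s.split(' ', maxsplit=m) for each m.
expected = {
    0: ['  a b c  '],
    1: ['', ' a b c  '],
    2: ['', '', 'a b c  '],
    3: ['', '', 'a', 'b c  '],
    4: ['', '', 'a', 'b', 'c  '],
    5: ['', '', 'a', 'b', 'c', ' '],
    6: ['', '', 'a', 'b', 'c', '', ''],
}

for m, want in expected.items():
    assert s.split(' ', maxsplit=m) == want

# Whitespace splitting (sep=None) and the empty-string cases.
assert s.split() == ['a', 'b', 'c']
assert s.split(maxsplit=0) == ['a b c  ']
assert ''.split() == []
assert ''.split(' ') == ['']
```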

--
nosy: +Mark.Bell




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-17 Thread Catherine Devlin


Catherine Devlin  added the comment:

@ZackerySpytz - I made https://github.com/python/cpython/pull/26196 with a test 
for the desired behavior; hopefully it helps.  I could try to adapt Barry's old 
patch myself, but it's probably better if somebody C-competent does so...

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-17 Thread Catherine Devlin


Change by Catherine Devlin :


--
nosy: +Catherine.Devlin
nosy_count: 11.0 -> 12.0
pull_requests: +24813
stage: test needed -> patch review
pull_request: https://github.com/python/cpython/pull/26196




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-12 Thread Dong-hee Na


Change by Dong-hee Na :


--
nosy: +corona10




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-04 Thread Guido van Rossum


Guido van Rossum  added the comment:

Excellent!

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-04 Thread Zackery Spytz


Zackery Spytz  added the comment:

I am working on this issue.

--
assignee:  -> ZackerySpytz
nosy: +ZackerySpytz
versions: +Python 3.10 -Python 3.8




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-03 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
nosy:  -rhettinger




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-03 Thread Guido van Rossum

Guido van Rossum  added the comment:

This issue probably needs a new champion. There is broad agreement but some
bike shedding, so a PEP isn’t needed.
--Guido (mobile)

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-01-03 Thread karl


Change by karl :


--
nosy: +karlcow




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2019-10-18 Thread Philippe Cloutier


Philippe Cloutier  added the comment:

I assume the "workaround" suggested by Raymond in msg282966 is supposed to 
read...
filter(None, str.split(sep))
... rather than filter(None, sep.split(input)).
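
Spelled out as runnable code, the workaround might look like the sketch below. The wrapper name `split_nonempty` is made up for illustration. Note that filter() returns an iterator in Python 3, so list() is needed to get a list back.

```python
def split_nonempty(text, sep):
    """Split text on sep and drop every empty string from the result."""
    return list(filter(None, text.split(sep)))


assert split_nonempty('a,,b,', ',') == ['a', 'b']
assert split_nonempty('', ',') == []
```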

--




[issue28937] str.split(): allow removing empty strings (when sep is not None)

2019-10-18 Thread Philippe Cloutier


Philippe Cloutier  added the comment:

I understood the current (only) behavior, but coming from a PHP background, I 
really didn't expect it. Thank you for this request, I would definitely like 
the ability to get behavior matching PHP's explode().

--
nosy: +Philippe Cloutier
title: str.split(): remove empty strings when sep is not None -> str.split(): 
allow removing empty strings (when sep is not None)
