Re: Behavior of re.split on empty strings is unexpected
On Aug 2, 7:34 pm, John Nagle wrote:
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.

>>> re.findall(r"\w+", " a b c ")
['a', 'b', 'c']

-- http://mail.python.org/mailman/listinfo/python-list
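For anyone following along, here is a minimal sketch of why matching the words avoids the empty strings that splitting on the separators produces. The sample string is made up for illustration; only re.split, re.findall and str.split are used.

```python
import re

# Hypothetical sample input, chosen to have leading/trailing whitespace
# and trailing punctuation.
text = "   VERISIGN INC.   "

# Splitting on runs of non-word characters: the separators at the very
# start and end of the string leave an empty string on each side.
split_result = re.split(r"\W+", text)    # ['', 'VERISIGN', 'INC', '']

# Matching the words themselves: a match is never empty, so no empty
# strings can appear in the result.
word_result = re.findall(r"\w+", text)   # ['VERISIGN', 'INC']

# \S+ matches runs of non-whitespace, so punctuation stays attached to
# the word, which is what str.split() with no arguments gives.
token_result = re.findall(r"\S+", text)  # ['VERISIGN', 'INC.']
```

Note that \w+ and \S+ are not interchangeable: the first drops punctuation, the second keeps it, so which one is "right" depends on what counts as a word.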
Re: Behavior of re.split on empty strings is unexpected
On 8/2/2010 5:53 PM, samwyse wrote:
> On Aug 2, 12:34 pm, John Nagle wrote:
>> The regular expression "split" behaves slightly differently than string
>> split:
>
> I'm going to argue that it's the string split that's behaving oddly.

I tend to agree.

It doesn't seem to be possible to get the same semantics with any regular expression split. The default "split" has a special case for head and tail whitespace, and there's no way to express that with a regular expression split. Applying "strip" first will work, of course.

The documentation should reflect that.

John Nagle
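The "strip first" idea can be sketched roughly as follows. The helper name split_words is hypothetical, not anything from the standard library, and the empty-input branch is an assumption about what the caller wants (matching str.split() with no arguments).

```python
import re

WHITESPACE = re.compile(r"\s+")

def split_words(text):
    """Hypothetical helper: strip first, then split on whitespace runs."""
    stripped = text.strip()
    # re.split on an empty string returns [''], whereas ''.split()
    # returns [], so the empty case needs its own branch.
    if not stripped:
        return []
    return WHITESPACE.split(stripped)
```

Without the empty-string branch, an all-whitespace input would come back as [''] rather than [], which is the one place where strip plus re.split still differs from the plain string split.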
Re: Behavior of re.split on empty strings is unexpected
On Aug 2, 7:53 pm, samwyse wrote:
> It's the same results; however many people don't like these results
> because they feel that whitespace occupies a privileged role. People
> generally agree that a string of consecutive commas means missing
> values, but a string of consecutive spaces just means someone held the
> space-bar down too long. To accommodate this viewpoint, the string
> split is special-cased to behave differently when None is passed as a
> separator. First, it splits on any number of whitespace characters,
> like this:

Well, we could have created another method like "splitstrip()". However, then folks would complain that they must remember two methods that are almost identical. Uggh, you just can't win. There are always naysayers, no matter what you do!

PS: Great post by the way. Highly informative for the pynoobs.
Re: Behavior of re.split on empty strings is unexpected
On Aug 2, 12:34 pm, John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:

I'm going to argue that it's the string split that's behaving oddly. To see why, let's first look at some simple CSV values:

cat,dog
,missing,,values,

How many fields are on each line and what are they? Here's what re.split(',') says:

>>> re.split(',', 'cat,dog')
['cat', 'dog']
>>> re.split(',', ',missing,,values,')
['', 'missing', '', 'values', '']

Note that the presence of missing values is clearly flagged via the presence of empty strings in the results. Now let's look at string split:

>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> ',missing,,values,'.split(',')
['', 'missing', '', 'values', '']

It's the same results. Let's try it again, but replacing the commas with spaces.

>>> re.split(' ', 'cat dog')
['cat', 'dog']
>>> re.split(' ', ' missing  values ')
['', 'missing', '', 'values', '']
>>> 'cat dog'.split(' ')
['cat', 'dog']
>>> ' missing  values '.split(' ')
['', 'missing', '', 'values', '']

It's the same results; however, many people don't like these results because they feel that whitespace occupies a privileged role. People generally agree that a string of consecutive commas means missing values, but a string of consecutive spaces just means someone held the space-bar down too long. To accommodate this viewpoint, the string split is special-cased to behave differently when None is passed as a separator. First, it splits on any number of whitespace characters, like this:

>>> re.split(r'\s+', ' missing  values ')
['', 'missing', 'values', '']
>>> re.split(r'\s+', 'cat dog')
['cat', 'dog']

But it also eliminates any empty strings from the head and tail of the list, because that's what people generally expect when splitting on whitespace:

>>> 'cat dog'.split(None)
['cat', 'dog']
>>> ' missing  values '.split(None)
['missing', 'values']
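The two modes described above can be put side by side in one sketch. The function name re_split_like_str is hypothetical, and treating \s+ as equivalent to the whitespace that str.split() recognizes is an assumption that holds for the ASCII samples here; I have not checked every Unicode corner.

```python
import re

def re_split_like_str(text, sep=None):
    """Hypothetical helper: mimic str.split(sep) using the re module."""
    if sep is None:
        # Whitespace mode: runs of whitespace count as one separator,
        # and the empty strings at the head and tail are discarded.
        return [piece for piece in re.split(r"\s+", text) if piece]
    # Explicit separator: every occurrence delimits a field, so empty
    # fields are kept. re.escape makes the separator a literal.
    return re.split(re.escape(sep), text)

samples = ["cat,dog", ",missing,,values,", " missing  values ", ""]
for s in samples:
    assert re_split_like_str(s, ",") == s.split(",")
    assert re_split_like_str(s, " ") == s.split(" ")
    assert re_split_like_str(s) == s.split()
```

The filter in the whitespace branch can only ever remove the first and last entries, since \s+ swallows a whole run at once and so never produces an empty string between two words.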
Re: Behavior of re.split on empty strings is unexpected
On 08/02/2010 11:22 PM, John Nagle wrote:
>> [ s in rexp.split(long_s) if s ]
>
> Of course I can discard the blank strings afterward, but
> is there some way to do it in the "split" operation? If
> not, then the default case for "split()" is too non-standard.
>
> (Also, "if s" won't work; if s != '' might)

Of course it will work. Empty sequences are considered false in Python.

Python 3.1.2 (release31-maint, Jul 8 2010, 09:18:08)
[GCC 4.4.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sprexp = re.compile(r'\s+')
>>> [s for s in sprexp.split(' spaces every where ! ') if s]
['spaces', 'every', 'where', '!']
>>> list(filter(bool, sprexp.split(' more spaces \r\n\t\t ')))
['more', 'spaces']
>>>

(Of course, the list comprehension I posted earlier was missing a couple of words, which was very careless of me.)
Re: Behavior of re.split on empty strings is unexpected
On Mon, Aug 2, 2010 at 2:22 PM, John Nagle wrote:
> On 8/2/2010 12:52 PM, Thomas Jollans wrote:
>> On 08/02/2010 09:41 PM, John Nagle wrote:
>>> On 8/2/2010 11:02 AM, MRAB wrote:
>>>> John Nagle wrote:
>>>>> The regular expression "split" behaves slightly differently than
>>>>> string split:
>>>>> occurrences of pattern", which is not too helpful.
>>>>
>>>> It's the plain str.split() which is unusual in that:
>>>>
>>>> 1. it splits on sequences of whitespace instead of one per occurrence;
>>>
>>> That can be emulated with the obvious regular expression:
>>>
>>>     re.compile(r'\W+')
>>>
>>>> 2. it discards leading and trailing sequences of whitespace.
>>>
>>> But that can't, or at least I can't figure out how to do it.
>>
>> [ s in rexp.split(long_s) if s ]
>
> Of course I can discard the blank strings afterward, but
> is there some way to do it in the "split" operation? If
> not, then the default case for "split()" is too non-standard.
>
> (Also, "if s" won't work; if s != '' might)
>
> John Nagle

What makes it non-standard? The fact that it's not a 1-line regex? The default case for str.split is designed to handle the most common case: you want to break a string into words, where a word is defined as a sequence of non-whitespace characters.
Re: Behavior of re.split on empty strings is unexpected
On 8/2/2010 12:52 PM, Thomas Jollans wrote:
> On 08/02/2010 09:41 PM, John Nagle wrote:
>> On 8/2/2010 11:02 AM, MRAB wrote:
>>> John Nagle wrote:
>>>> The regular expression "split" behaves slightly differently than
>>>> string split:
>>>> occurrences of pattern", which is not too helpful.
>>>
>>> It's the plain str.split() which is unusual in that:
>>>
>>> 1. it splits on sequences of whitespace instead of one per occurrence;
>>
>> That can be emulated with the obvious regular expression:
>>
>>     re.compile(r'\W+')
>>
>>> 2. it discards leading and trailing sequences of whitespace.
>>
>> But that can't, or at least I can't figure out how to do it.
>
> [ s in rexp.split(long_s) if s ]

Of course I can discard the blank strings afterward, but is there some way to do it in the "split" operation? If not, then the default case for "split()" is too non-standard.

(Also, "if s" won't work; if s != '' might)

John Nagle
Re: Behavior of re.split on empty strings is unexpected
On 08/02/2010 09:41 PM, John Nagle wrote:
> On 8/2/2010 11:02 AM, MRAB wrote:
>> John Nagle wrote:
>>> The regular expression "split" behaves slightly differently than
>>> string split:
>>> occurrences of pattern", which is not too helpful.
>>
>> It's the plain str.split() which is unusual in that:
>>
>> 1. it splits on sequences of whitespace instead of one per occurrence;
>
> That can be emulated with the obvious regular expression:
>
>     re.compile(r'\W+')
>
>> 2. it discards leading and trailing sequences of whitespace.
>
> But that can't, or at least I can't figure out how to do it.

[ s in rexp.split(long_s) if s ]

>> It just happens that the unusual one is the most commonly used one, if
>> you see what I mean! :-)
>
> The no-argument form of "split" shouldn't be that much of a special
> case.
>
> John Nagle
Re: Behavior of re.split on empty strings is unexpected
On 8/2/2010 11:02 AM, MRAB wrote:
> John Nagle wrote:
>> The regular expression "split" behaves slightly differently than
>> string split:
>> occurrences of pattern", which is not too helpful.
>
> It's the plain str.split() which is unusual in that:
>
> 1. it splits on sequences of whitespace instead of one per occurrence;

That can be emulated with the obvious regular expression:

    re.compile(r'\W+')

> 2. it discards leading and trailing sequences of whitespace.

But that can't, or at least I can't figure out how to do it.

> It just happens that the unusual one is the most commonly used one, if
> you see what I mean! :-)

The no-argument form of "split" shouldn't be that much of a special case.

John Nagle
Re: Behavior of re.split on empty strings is unexpected
John Nagle wrote:
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']

Note that invocation without separator argument (or None as the separator) is special in that respect:

>>> " hello there ".split(" ")
['', 'hello', 'there', '']

Peter
Re: Behavior of re.split on empty strings is unexpected
John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:
>
> >>> import re
> >>> kresplit = re.compile(r'[^\w\&]+',re.UNICODE)
> >>> kresplit.split("   HELLO    THERE   ")
> ['', 'HELLO', 'THERE', '']
> >>> kresplit.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
>
> I'd thought that "split" would never produce an empty string, but it will.
>
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']

Yes it does.

>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

> If I try to get the functionality of string split with re:
>
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.
>
> The documentation just describes re.split as "Split string by the
> occurrences of pattern", which is not too helpful.

It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:

>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:

>>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if you see what I mean! :-)
Behavior of re.split on empty strings is unexpected
The regular expression "split" behaves slightly differently than string split:

>>> import re
>>> kresplit = re.compile(r'[^\w\&]+',re.UNICODE)
>>> kresplit.split("   HELLO    THERE   ")
['', 'HELLO', 'THERE', '']
>>> kresplit.split("VERISIGN INC.")
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but it will.

The regular string split operation doesn't yield empty strings:

>>> " HELLO THERE ".split()
['HELLO', 'THERE']

If I try to get the functionality of string split with re:

>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the occurrences of pattern", which is not too helpful.

John Nagle