Re: Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
On Thu, Mar 2, 2023 at 9:56 PM Alan Bawden  wrote:
>
> jose isaias cabrera writes:
>
>    On Thu, Mar 2, 2023 at 2:38 PM Mats Wichmann wrote:
>
>    This re is a bit different than the one I am used to. So, I am trying to match
>    everything after 'pn=':
>
>    import re
>    s = "pm=jose pn=2017"
>    m0 = r"pn=(.+)"
>    r0 = re.compile(m0)
>    s0 = r0.match(s)
>    >>> print(s0)
>    None
>
> Assuming that you were expecting to match "pn=2017", then you probably
> don't want the 'match' method.  Read its documentation.  Then read the
> documentation for the _other_ methods that a Pattern supports.  Then you
> will be enlightened.

Yes. I need search. Thanks.

-- 

What if eternity is real?  Where will you spend it?  H...
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
On Thu, Mar 2, 2023 at 8:35 PM  wrote:
>
> It is a well-known fact, Jose, that GIGO.
>
> The letters "n" and "m" are not interchangeable. Your pattern fails because 
> you have "pn" in one place and "pm" in the other.

It is not GIGO. pm=project manager. pn=project name. I needed search()
rather than match().

>
> >>> s = "pn=jose pn=2017"
> ...
> >>> s0 = r0.match(s)
> >>> s0
> <re.Match object; span=(0, 15), match='pn=jose pn=2017'>
>
> -----Original Message-----
> From: Python-list On Behalf Of jose isaias cabrera
> Sent: Thursday, March 2, 2023 8:07 PM
> To: Mats Wichmann 
> Cc: python-list@python.org
> Subject: Re: Regular Expression bug?
>
> On Thu, Mar 2, 2023 at 2:38 PM Mats Wichmann  wrote:
> >
> > On 3/2/23 12:28, Chris Angelico wrote:
> > > On Fri, 3 Mar 2023 at 06:24, jose isaias cabrera wrote:
> > >>
> > >> Greetings.
> > >>
> > >> For the RegExp Gurus, consider the following python3 code:
> > >>
> > >> import re
> > >> s = "pn=align upgrade sd=2023-02-"
> > >> ro = re.compile(r"pn=(.+) ")
> > >> r0=ro.match(s)
> > >> >>> print(r0.group(1))
> > >> align upgrade
> > >>
> > >> This is wrong. It should be 'align' because the group only goes up-to
> > >> the space. Thoughts? Thanks.
> > >>
> > >
> > > Not a bug. Find the longest possible match that fits this; as long as
> > > you can find a space immediately after it, everything in between goes
> > > into the .+ part.
> > >
> > > If you want to exclude spaces, either use [^ ]+ or .+?.
> >
> > https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
>
> This re is a bit different than the one I am used to. So, I am trying to match
> everything after 'pn=':
>
> import re
> s = "pm=jose pn=2017"
> m0 = r"pn=(.+)"
> r0 = re.compile(m0)
> s0 = r0.match(s)
> >>> print(s0)
> None
>
> Any help is appreciated.


Re: Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
On Thu, Mar 2, 2023 at 8:30 PM Cameron Simpson  wrote:
>
> On 02Mar2023 20:06, jose isaias cabrera  wrote:
> >This re is a bit different than the one I am used to. So, I am trying to match
> >everything after 'pn=':
> >
> >import re
> >s = "pm=jose pn=2017"
> >m0 = r"pn=(.+)"
> >r0 = re.compile(m0)
> >s0 = r0.match(s)
>
> `match()` matches at the start of the string. You want r0.search(s).
> - Cameron Simpson 

Thanks. Darn it! I knew it was something simple.




Re: Regular Expression bug?

2023-03-02 Thread Alan Bawden
jose isaias cabrera  writes:

   On Thu, Mar 2, 2023 at 2:38 PM Mats Wichmann  wrote:

   This re is a bit different than the one I am used to. So, I am trying to match
   everything after 'pn=':

   import re
   s = "pm=jose pn=2017"
   m0 = r"pn=(.+)"
   r0 = re.compile(m0)
   s0 = r0.match(s)
   >>> print(s0)
   None

Assuming that you were expecting to match "pn=2017", then you probably
don't want the 'match' method.  Read its documentation.  Then read the
documentation for the _other_ methods that a Pattern supports.  Then you
will be enlightened.

- Alan
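
For anyone skimming the archive, here is a minimal sketch (assuming Python 3) of how the main Pattern methods treat the string from the question. The key=value pattern `field` is a hypothetical stand-in chosen for illustration; it is not from the original post.

```python
import re

s = "pm=jose pn=2017"
field = re.compile(r"(\w+)=(\S+)")   # hypothetical key=value pattern

# match(): only succeeds if the pattern matches at position 0.
at_start = field.match(s).groups()    # ('pm', 'jose')

# search(): scans forward and returns the first match anywhere.
first_hit = field.search(s).groups()  # ('pm', 'jose')

# fullmatch(): the pattern must cover the entire string.
whole = field.fullmatch(s)            # None, one pair cannot span the space

# findall(): every non-overlapping match, as tuples of groups.
pairs = field.findall(s)              # [('pm', 'jose'), ('pn', '2017')]

print(at_start, first_hit, whole, pairs)
print(dict(pairs)["pn"])              # 2017
```

With both fields collected this way, the order in which `pm=` and `pn=` appear in the string no longer matters.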


Re: Regular Expression bug?

2023-03-02 Thread Cameron Simpson

On 02Mar2023 20:06, jose isaias cabrera wrote:
>This re is a bit different than the one I am used to. So, I am trying to match
>everything after 'pn=':
>
>import re
>s = "pm=jose pn=2017"
>m0 = r"pn=(.+)"
>r0 = re.compile(m0)
>s0 = r0.match(s)


`match()` matches at the start of the string. You want r0.search(s).
- Cameron Simpson 
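
A short sketch of that difference, using the string and pattern from the question (assuming Python 3; the variable names are only for illustration):

```python
import re

s = "pm=jose pn=2017"
r0 = re.compile(r"pn=(.+)")

# match() is anchored at the start of the string; s begins with "pm=",
# so there is nothing to match there.
anchored = r0.match(s)
print(anchored)                # None

# search() tries each position in turn and stops at the first hit.
found = r0.search(s)
print(found.group(1))          # 2017
```

Note that `.+` is still greedy here, so if more fields followed `pn=2017` on the same line they would end up in the group as well.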


RE: Regular Expression bug?

2023-03-02 Thread avi.e.gross
It is a well-known fact, Jose, that GIGO.

The letters "n" and "m" are not interchangeable. Your pattern fails because you 
have "pn" in one place and "pm" in the other.


>>> s = "pn=jose pn=2017"
...
>>> s0 = r0.match(s)
>>> s0
<re.Match object; span=(0, 15), match='pn=jose pn=2017'>

-----Original Message-----
From: Python-list On Behalf Of jose isaias cabrera
Sent: Thursday, March 2, 2023 8:07 PM
To: Mats Wichmann 
Cc: python-list@python.org
Subject: Re: Regular Expression bug?

On Thu, Mar 2, 2023 at 2:38 PM Mats Wichmann  wrote:
>
> On 3/2/23 12:28, Chris Angelico wrote:
> > On Fri, 3 Mar 2023 at 06:24, jose isaias cabrera wrote:
> >>
> >> Greetings.
> >>
> >> For the RegExp Gurus, consider the following python3 code:
> >>
> >> import re
> >> s = "pn=align upgrade sd=2023-02-"
> >> ro = re.compile(r"pn=(.+) ")
> >> r0=ro.match(s)
> >> >>> print(r0.group(1))
> >> align upgrade
> >>
> >> This is wrong. It should be 'align' because the group only goes up-to
> >> the space. Thoughts? Thanks.
> >>
> >
> > Not a bug. Find the longest possible match that fits this; as long as
> > you can find a space immediately after it, everything in between goes
> > into the .+ part.
> >
> > If you want to exclude spaces, either use [^ ]+ or .+?.
>
> https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

This re is a bit different than the one I am used to. So, I am trying to match
everything after 'pn=':

import re
s = "pm=jose pn=2017"
m0 = r"pn=(.+)"
r0 = re.compile(m0)
s0 = r0.match(s)
>>> print(s0)
None

Any help is appreciated.


Re: Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
On Thu, Mar 2, 2023 at 2:38 PM Mats Wichmann  wrote:
>
> On 3/2/23 12:28, Chris Angelico wrote:
> > On Fri, 3 Mar 2023 at 06:24, jose isaias cabrera wrote:
> >>
> >> Greetings.
> >>
> >> For the RegExp Gurus, consider the following python3 code:
> >>
> >> import re
> >> s = "pn=align upgrade sd=2023-02-"
> >> ro = re.compile(r"pn=(.+) ")
> >> r0=ro.match(s)
> >> >>> print(r0.group(1))
> >> align upgrade
> >>
> >> This is wrong. It should be 'align' because the group only goes up-to
> >> the space. Thoughts? Thanks.
> >>
> >
> > Not a bug. Find the longest possible match that fits this; as long as
> > you can find a space immediately after it, everything in between goes
> > into the .+ part.
> >
> > If you want to exclude spaces, either use [^ ]+ or .+?.
>
> https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

This re is a bit different than the one I am used to. So, I am trying to match
everything after 'pn=':

import re
s = "pm=jose pn=2017"
m0 = r"pn=(.+)"
r0 = re.compile(m0)
s0 = r0.match(s)
>>> print(s0)
None

Any help is appreciated.


RE: Regular Expression bug?

2023-03-02 Thread avi.e.gross
José,

Matching can be greedy. Did it match to the last space?

What you want is a pattern that matches anything except a space (or whitespace),
followed by a space or something similar.

Or use a construct that makes matching non-greedy.

Avi
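
One way to write that pattern, as a sketch only: `\S+` stands for "anything except whitespace", and the `(?:\s|$)` tail is an assumption added here so that a value at the very end of the line still matches.

```python
import re

value_after_pn = re.compile(r"pn=(\S+)(?:\s|$)")

# \S+ cannot cross the space, so the group stops at the first word.
first = value_after_pn.match("pn=align upgrade sd=2023-02-")
print(first.group(1))    # align

# With nothing after the value, the $ alternative lets the match succeed.
last = value_after_pn.match("pn=2017")
print(last.group(1))     # 2017
```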

-----Original Message-----
From: Python-list On Behalf Of jose isaias cabrera
Sent: Thursday, March 2, 2023 2:23 PM
To: python-list@python.org
Subject: Regular Expression bug?

Greetings.

For the RegExp Gurus, consider the following python3 code:

import re
s = "pn=align upgrade sd=2023-02-"
ro = re.compile(r"pn=(.+) ")
r0=ro.match(s)
>>> print(r0.group(1))
align upgrade


This is wrong. It should be 'align' because the group only goes up-to the 
space. Thoughts? Thanks.

josé



Re: Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
On Thu, Mar 2, 2023 at 2:32 PM <2qdxy4rzwzuui...@potatochowder.com> wrote:
>
> On 2023-03-02 at 14:22:41 -0500,
> jose isaias cabrera  wrote:
>
> > For the RegExp Gurus, consider the following python3 code:
> > 
> > import re
> > s = "pn=align upgrade sd=2023-02-"
> > ro = re.compile(r"pn=(.+) ")
> > r0=ro.match(s)
> > >>> print(r0.group(1))
> > align upgrade
> > 
> >
> > This is wrong. It should be 'align' because the group only goes up-to
> > the space. Thoughts? Thanks.
>
> The bug is in your regular expression; the plus modifier is greedy.
>
> If you want to match up to the first space, then you'll need something
> like [^ ] (i.e., everything that isn't a space) instead of that dot.

Thanks. I appreciate your wisdom.

josé


Re: Regular Expression bug?

2023-03-02 Thread Mats Wichmann

On 3/2/23 12:28, Chris Angelico wrote:
> On Fri, 3 Mar 2023 at 06:24, jose isaias cabrera wrote:
>>
>> Greetings.
>>
>> For the RegExp Gurus, consider the following python3 code:
>>
>> import re
>> s = "pn=align upgrade sd=2023-02-"
>> ro = re.compile(r"pn=(.+) ")
>> r0=ro.match(s)
>> >>> print(r0.group(1))
>> align upgrade
>>
>> This is wrong. It should be 'align' because the group only goes up-to
>> the space. Thoughts? Thanks.
>>
>
> Not a bug. Find the longest possible match that fits this; as long as
> you can find a space immediately after it, everything in between goes
> into the .+ part.
>
> If you want to exclude spaces, either use [^ ]+ or .+?.



https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy



Re: Regular Expression bug?

2023-03-02 Thread 2QdxY4RzWzUUiLuE
On 2023-03-02 at 14:22:41 -0500,
jose isaias cabrera  wrote:

> For the RegExp Gurus, consider the following python3 code:
> 
> import re
> s = "pn=align upgrade sd=2023-02-"
> ro = re.compile(r"pn=(.+) ")
> r0=ro.match(s)
> >>> print(r0.group(1))
> align upgrade
> 
> 
> This is wrong. It should be 'align' because the group only goes up-to
> the space. Thoughts? Thanks.

The bug is in your regular expression; the plus modifier is greedy.

If you want to match up to the first space, then you'll need something
like [^ ] (i.e., everything that isn't a space) instead of that dot.


Re: Regular Expression bug?

2023-03-02 Thread Chris Angelico
On Fri, 3 Mar 2023 at 06:24, jose isaias cabrera  wrote:
>
> Greetings.
>
> For the RegExp Gurus, consider the following python3 code:
> 
> import re
> s = "pn=align upgrade sd=2023-02-"
> ro = re.compile(r"pn=(.+) ")
> r0=ro.match(s)
> >>> print(r0.group(1))
> align upgrade
> 
>
> This is wrong. It should be 'align' because the group only goes up-to
> the space. Thoughts? Thanks.
>

Not a bug. Find the longest possible match that fits this; as long as
you can find a space immediately after it, everything in between goes
into the .+ part.

If you want to exclude spaces, either use [^ ]+ or .+?.

ChrisA
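
A side-by-side sketch of the three variants discussed above, run against the string from the original post (assuming Python 3; the variable names are only for illustration):

```python
import re

s = "pn=align upgrade sd=2023-02-"

# Greedy: .+ first swallows the rest of the line, then backs off until
# the trailing space in the pattern can match, i.e. at the LAST space.
greedy = re.match(r"pn=(.+) ", s).group(1)
print(greedy)     # align upgrade

# Non-greedy: .+? takes one character at a time and stops as soon as the
# trailing space can match, i.e. at the FIRST space.
lazy = re.match(r"pn=(.+?) ", s).group(1)
print(lazy)       # align

# Negated class: [^ ]+ can never cross a space, so no trailing space is
# needed in the pattern at all.
negated = re.match(r"pn=([^ ]+)", s).group(1)
print(negated)    # align
```

The negated class also works when the value is the last thing on the line, whereas the first two patterns require a space after it.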


Regular Expression bug?

2023-03-02 Thread jose isaias cabrera
Greetings.

For the RegExp Gurus, consider the following python3 code:

import re
s = "pn=align upgrade sd=2023-02-"
ro = re.compile(r"pn=(.+) ")
r0=ro.match(s)
>>> print(r0.group(1))
align upgrade


This is wrong. It should be 'align' because the group only goes up-to
the space. Thoughts? Thanks.

josé



Re: Regular expression bug?

2009-02-20 Thread umarpy

A more elegant way:

>>> a = 'fooBarBaz'
>>> [x for x in re.split('([A-Z]+[a-z]+)', a) if x]
['foo', 'Bar', 'Baz']

R.

On Feb 20, 2:03 pm, Lie Ryan  wrote:
> On Thu, 19 Feb 2009 13:03:59 -0800, Ron Garret wrote:
> > In article ,
> >  Peter Otten <__pete...@web.de> wrote:
>
> >> Ron Garret wrote:
>
> >> > I'm trying to split a CamelCase string into its constituent
> >> > components.
>
> >> How about
>
> >> >>> re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
> >> ['foo', 'Bar', 'Baz']
>
> > That's very clever.  Thanks!
>
> >> > (BTW, I tried looking at the source code for the re module, but I
> >> > could not find the relevant code.  re.split calls
> >> > sre_compile.compile().split, but the string 'split' does not appear
> >> > in sre_compile.py.  So where does this method come from?)
>
> >> It's coded in C. The source is Modules/sremodule.c.
>
> > Ah.  Thanks!
>
> > rg
>
> This re.split() doesn't consume characters:
>
> >>> re.split('([A-Z][a-z]*)', 'fooBarBaz')
>
> ['foo', 'Bar', '', 'Baz', '']
>
> it does what the OP wants, albeit with extra blank strings.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression bug?

2009-02-20 Thread Lie Ryan
On Thu, 19 Feb 2009 13:03:59 -0800, Ron Garret wrote:

> In article ,
>  Peter Otten <__pete...@web.de> wrote:
> 
>> Ron Garret wrote:
>> 
>> > I'm trying to split a CamelCase string into its constituent
>> > components.
>> 
>> How about
>> 
>> >>> re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
>> ['foo', 'Bar', 'Baz']
> 
> That's very clever.  Thanks!
> 
>> > (BTW, I tried looking at the source code for the re module, but I
>> > could not find the relevant code.  re.split calls
>> > sre_compile.compile().split, but the string 'split' does not appear
>> > in sre_compile.py.  So where does this method come from?)
>> 
>> It's coded in C. The source is Modules/sremodule.c.
> 
> Ah.  Thanks!
> 
> rg

This re.split() doesn't consume characters:

>>> re.split('([A-Z][a-z]*)', 'fooBarBaz')
['foo', 'Bar', '', 'Baz', '']

it does what the OP wants, albeit with extra blank strings. 



Re: Regular expression bug?

2009-02-19 Thread Steven D'Aprano
andrew cooke wrote:

> 
> i wonder what fraction of people posting with "bug?" in their titles here
> actually find bugs?

About 99.99%.

Unfortunately, 99.98% have found bugs in their code, not in Python.


-- 
Steven



Re: Regular expression bug?

2009-02-19 Thread Ron Garret
In article ,
 Albert Hopkins  wrote:

> On Thu, 2009-02-19 at 10:55 -0800, Ron Garret wrote:
> > I'm trying to split a CamelCase string into its constituent components.  
> > This kind of works:
> > 
> > >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> > ['fo', 'a', 'az']
> > 
> > but it consumes the boundary characters.  To fix this I tried using 
> > lookahead and lookbehind patterns instead, but it doesn't work:
> 
> That's how re.split works, same as str.split...

I think one could make the argument that 'foo'.split('') ought to return 
['f','o','o']

> 
> > >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> > ['fooBarBaz']
> > 
> > However, it does seem to work with findall:
> > 
> > >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> > ['', '']
> 
> 
> Wow!
> 
> To tell you the truth, I can't even read that...

It's a regexp.  Of course you can't read it.  ;-)

rg


Re: Regular expression bug?

2009-02-19 Thread Ron Garret
In article ,
 "andrew cooke"  wrote:

> i wonder what fraction of people posting with "bug?" in their titles here
> actually find bugs?

IMHO it ought to be an invariant that len(r.split(s)) should always be 
one more than len(r.findall(s)).

> anyway, how about:
> 
> re.findall('[A-Z]?[a-z]*', 'fooBarBaz')
> 
> or
> 
> re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

That will do it.  Thanks!

rg
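
A quick way to check the split/findall invariant proposed above, as a sketch. It assumes Python 3.7 or later, where re.split accepts patterns that can match the empty string, and it assumes the pattern has no capturing groups (groups add extra items to the split result). The helper name is made up for this example.

```python
import re

def split_and_findall_counts(pattern, text):
    """Return (len(split), len(findall)) for a pattern without capturing groups."""
    r = re.compile(pattern)
    return len(r.split(text)), len(r.findall(text))

# A separator that consumes a character.
comma = split_and_findall_counts(r",", "a,b,c")
print(comma)       # (3, 2)

# The zero-width CamelCase boundary from this thread.
boundary = split_and_findall_counts(r"(?<=[a-z])(?=[A-Z])", "fooBarBaz")
print(boundary)    # (3, 2) on Python 3.7+
```

On the Python 2.5 used in this thread, the second call would report (1, 2), since re.split returned ['fooBarBaz'] there.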


Re: Regular expression bug?

2009-02-19 Thread Ron Garret
In article ,
 Peter Otten <__pete...@web.de> wrote:

> Ron Garret wrote:
> 
> > I'm trying to split a CamelCase string into its constituent components.
> 
> How about
> 
> >>> re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
> ['foo', 'Bar', 'Baz']

That's very clever.  Thanks!

> > (BTW, I tried looking at the source code for the re module, but I could
> > not find the relevant code.  re.split calls sre_compile.compile().split,
> > but the string 'split' does not appear in sre_compile.py.  So where does
> > this method come from?)
> 
> It's coded in C. The source is Modules/sremodule.c.

Ah.  Thanks!

rg


Re: Regular expression bug?

2009-02-19 Thread Ron Garret
In article ,
 MRAB  wrote:

> Ron Garret wrote:
> > I'm trying to split a CamelCase string into its constituent components.  
> > This kind of works:
> > 
> > >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> > ['fo', 'a', 'az']
> > 
> > but it consumes the boundary characters.  To fix this I tried using 
> > lookahead and lookbehind patterns instead, but it doesn't work:
> > 
> > >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> > ['fooBarBaz']
> > 
> > However, it does seem to work with findall:
> > 
> > >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> > ['', '']
> > 
> > So the regular expression seems to be doing the Right Thing.  Is this a 
> > bug in re.split, or am I missing something?
> > 
> > (BTW, I tried looking at the source code for the re module, but I could 
> > not find the relevant code.  re.split calls sre_compile.compile().split, 
> > but the string 'split' does not appear in sre_compile.py.  So where does 
> > this method come from?)
> > 
> > I'm using Python2.5.
> > 
> I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
> might be intentional, but changing it could break some existing code.

That seems unlikely.  It would only break where people had code invoking 
re.split on empty matches, which at the moment is essentially a no-op.  
It's hard to imagine there's a lot of code like that around.  What would 
be the point?

> You could do this instead:
> 
>  >>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
> ['foo', 'Bar', 'Baz']

Blech!  ;-)  But thanks for the suggestion.

rg


Re: Regular expression bug?

2009-02-19 Thread MRAB

Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?
>
> (BTW, I tried looking at the source code for the re module, but I could
> not find the relevant code.  re.split calls sre_compile.compile().split,
> but the string 'split' does not appear in sre_compile.py.  So where does
> this method come from?)
>
> I'm using Python2.5.
>
I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
might be intentional, but changing it could break some existing code.
You could do this instead:

>>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
['foo', 'Bar', 'Baz']


Re: Regular expression bug?

2009-02-19 Thread Peter Otten
Ron Garret wrote:

> I'm trying to split a CamelCase string into its constituent components.

How about

>>> re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
['foo', 'Bar', 'Baz']

> This kind of works:
> 
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
> 
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
> 
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
> 
> However, it does seem to work with findall:
> 
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
> 
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?

IIRC the split pattern must consume at least one character, but I can't find
the reference.
 
> (BTW, I tried looking at the source code for the re module, but I could
> not find the relevant code.  re.split calls sre_compile.compile().split,
> but the string 'split' does not appear in sre_compile.py.  So where does
> this method come from?)

It's coded in C. The source is Modules/sremodule.c.

Peter


Re: Regular expression bug?

2009-02-19 Thread andrew cooke

i wonder what fraction of people posting with "bug?" in their titles here
actually find bugs?

anyway, how about:

re.findall('[A-Z]?[a-z]*', 'fooBarBaz')

or

re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

(you have to specify what you're matching and lookahead/back doesn't do
that).

andrew
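
For reference, a sketch of what the two findall patterns above return (assuming Python 3; the variable names are only for illustration). The first pattern can match the empty string, which shows up as a trailing '' in the result:

```python
import re

s = "fooBarBaz"

# Both parts are optional, so the pattern also matches the empty string
# left over at the very end of the input.
loose = re.findall(r"[A-Z]?[a-z]*", s)
print(loose)                       # ['foo', 'Bar', 'Baz', '']
print([w for w in loose if w])     # ['foo', 'Bar', 'Baz']

# Each alternative needs at least one character, so no empty matches.
strict = re.findall(r"([A-Z][a-z]*|[a-z]+)", s)
print(strict)                      # ['foo', 'Bar', 'Baz']
```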


Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?
>
> (BTW, I tried looking at the source code for the re module, but I could
> not find the relevant code.  re.split calls sre_compile.compile().split,
> but the string 'split' does not appear in sre_compile.py.  So where does
> this method come from?)
>
> I'm using Python2.5.
>
> Thanks,
> rg


Re: Regular expression bug?

2009-02-19 Thread Kurt Smith
On Thu, Feb 19, 2009 at 12:55 PM, Ron Garret  wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length boundaries.
It needs something to split on, like str.split.  Is this a bug?
Possibly.  The docs for re.split say:

Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.

I can work around the problem thusly:

re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly.  I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.

Kurt
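
For later readers: since Python 3.7, re.split does accept patterns that match the empty string, so the lookaround pattern can be used directly. A sketch of both routes follows; the function names and the NUL marker are illustrative choices, not from the thread, and the marker variant assumes the marker never occurs in the input.

```python
import re

BOUNDARY = r"(?<=[a-z])(?=[A-Z])"

def split_on_boundary(text):
    """Python 3.7+: split directly on the zero-width boundary."""
    return re.split(BOUNDARY, text)

def split_with_marker(text, marker="\x00"):
    """Workaround from this thread: insert a marker at each boundary,
    then split on it. Assumes marker does not appear in text."""
    return re.sub(BOUNDARY, marker, text).split(marker)

direct = split_on_boundary("fooBarBaz")
marked = split_with_marker("fooBarBaz")
print(direct)    # ['foo', 'Bar', 'Baz']
print(marked)    # ['foo', 'Bar', 'Baz']
```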


Re: Regular expression bug?

2009-02-19 Thread Albert Hopkins
On Thu, 2009-02-19 at 10:55 -0800, Ron Garret wrote:
> I'm trying to split a CamelCase string into its constituent components.  
> This kind of works:
> 
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
> 
> but it consumes the boundary characters.  To fix this I tried using 
> lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split...

> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
> 
> However, it does seem to work with findall:
> 
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']


Wow!

To tell you the truth, I can't even read that... but one wonders why
you don't just do

import string

def ccsplit(s):
    cclist = []
    current_word = ''
    for char in s:
        # An uppercase letter starts a new word.
        if char in string.ascii_uppercase:
            if current_word:
                cclist.append(current_word)
            current_word = char
        else:
            current_word += char
    # Flush the last word.
    if current_word:
        cclist.append(current_word)
    return cclist

>>> ccsplit('fooBarBaz')
--> ['foo', 'Bar', 'Baz']

This is arguably *much* easier to read than the re example and doesn't
require one to look ahead in the string.

-a




Regular expression bug?

2009-02-19 Thread Ron Garret
I'm trying to split a CamelCase string into its constituent components.  
This kind of works:

>>> re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters.  To fix this I tried using 
lookahead and lookbehind patterns instead, but it doesn't work:

>>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:

>>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing.  Is this a 
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could 
not find the relevant code.  re.split calls sre_compile.compile().split, 
but the string 'split' does not appear in sre_compile.py.  So where does 
this method come from?)

I'm using Python2.5.

Thanks,
rg