Trouble with regular expressions

2009-02-07 Thread LaundroMat
Hi,

I'm quite new to regular expressions, and I wonder if anyone here
could help me out.

I'm looking to split strings that ideally look like this: "Update: New
item (Household)" into a group.
This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
("Update", "New item", "(Household)")

Some strings will look like this however: "Update: New item (item)
(Household)". The expression above still does its job, as it returns
("Update", "New item (item)", "(Household)").

It does not work however when there is no text in parentheses (eg
"Update: new item"). How can I get the expression to return a tuple
such as ("Update:", "new item", None)?

Thanks in advance,

Mathieu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble with regular expressions

2009-02-07 Thread John Machin
On Feb 7, 11:18 pm, LaundroMat <laun...@gmail.com> wrote:
> Hi,
>
> I'm quite new to regular expressions, and I wonder if anyone here
> could help me out.
>
> I'm looking to split strings that ideally look like this: "Update: New
> item (Household)" into a group.
> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> ("Update", "New item", "(Household)")
>
> Some strings will look like this however: "Update: New item (item)
> (Household)". The expression above still does its job, as it returns
> ("Update", "New item (item)", "(Household)").
>
> It does not work however when there is no text in parentheses (eg
> "Update: new item"). How can I get the expression to return a tuple
> such as ("Update:", "new item", None)?

I don't see how it can be done without some post-matching adjustment.
Try this:

C:\junk>type mathieu.py
import re

tests = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
    "minimal",
    "parenthesis (plague) (has) (struck)",
    ]

regex = re.compile(r"""
    (Update:)?      # optional prefix
    \s*             # ignore whitespace
    ([^()]*)        # any non-parentheses stuff
    (\([^()]*\))?   # optional (blahblah)
    \s*             # ignore whitespace
    (\([^()]*\))?   # another optional (blahblah)
    $
    """, re.VERBOSE)

for i, test in enumerate(tests):
    print("Test #%d: %r" % (i, test))
    m = regex.match(test)
    if not m:
        print("No match")
    else:
        g = m.groups()
        print(g)
        if g[3] is not None:
            # two parenthesised parts: fold the first one into the text group
            x = (g[0], g[1] + g[2], g[3])
        else:
            x = g[:3]
        print(x)
    print("")

C:\junk>mathieu.py
Test #0: 'Update: New item (Household)'
('Update:', 'New item ', '(Household)', None)
('Update:', 'New item ', '(Household)')

Test #1: 'Update: New item (item) (Household)'
('Update:', 'New item ', '(item)', '(Household)')
('Update:', 'New item (item)', '(Household)')

Test #2: 'Update: new item'
('Update:', 'new item', None, None)
('Update:', 'new item', None)

Test #3: 'minimal'
(None, 'minimal', None, None)
(None, 'minimal', None)

Test #4: 'parenthesis (plague) (has) (struck)'
No match
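
For comparison, here is a rough sketch of a single pattern that seems to give the requested tuples with no post-matching step. It is not the script above: the lazy middle group and the [^()] restriction on the last group are my assumptions (the final parenthesised part must not contain nested parentheses), and the names ONE_PASS and split_title are made up for illustration.

```python
import re

# Hypothetical one-pass variant: lazy text group, then an optional
# final "(...)" group that may not itself contain parentheses.
ONE_PASS = re.compile(r"(Update:)?\s*(.*?)\s*(\([^()]*\))?$")

def split_title(text):
    """Return (prefix, text, last_parenthesised_part), or None if no match."""
    m = ONE_PASS.match(text)
    return m.groups() if m else None

for s in ("Update: New item (Household)",
          "Update: New item (item) (Household)",
          "Update: new item",
          "minimal"):
    print(split_title(s))
```

Unlike the script above, this variant also accepts three or more parenthesised parts (everything but the last one ends up in the text group), which may or may not be what you want.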

HTH,
John


Re: Trouble with regular expressions

2009-02-07 Thread MRAB

LaundroMat wrote:
> Hi,
>
> I'm quite new to regular expressions, and I wonder if anyone here
> could help me out.
>
> I'm looking to split strings that ideally look like this: "Update: New
> item (Household)" into a group.
> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> ("Update", "New item", "(Household)")
>
> Some strings will look like this however: "Update: New item (item)
> (Household)". The expression above still does its job, as it returns
> ("Update", "New item (item)", "(Household)").
>
> It does not work however when there is no text in parentheses (eg
> "Update: new item"). How can I get the expression to return a tuple
> such as ("Update:", "new item", None)?

You need to make the last group optional and also make the middle group 
lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.


(?:...) is the non-capturing version of (...).

If you don't make the middle group lazy then it'll capture the rest of 
the string and the last group would never match anything!
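
A small sketch of the greedy/lazy difference, in case it helps. GREEDY is the suggested pattern with the lazy marker removed; the helper name groups is just for illustration.

```python
import re

GREEDY = r'^(Update:)?(.*)(?:(\(.*\)))?$'    # middle group greedy
LAZY = r'^(Update:)?(.*?)(?:(\(.*\)))?$'     # the pattern suggested above

def groups(pattern, text):
    """Return the captured groups, or None if the pattern does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

print(groups(GREEDY, "Update: New item (Household)"))
# ('Update:', ' New item (Household)', None)
print(groups(LAZY, "Update: New item (Household)"))
# ('Update:', ' New item ', '(Household)')
print(groups(LAZY, "Update: new item"))
# ('Update:', ' new item', None)
```

The surrounding whitespace stays in the middle group and would still need a strip().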



Re: Trouble with regular expressions

2009-02-07 Thread John Machin
On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> LaundroMat wrote:
> > Hi,
> >
> > I'm quite new to regular expressions, and I wonder if anyone here
> > could help me out.
> >
> > I'm looking to split strings that ideally look like this: "Update: New
> > item (Household)" into a group.
> > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> > ("Update", "New item", "(Household)")
> >
> > Some strings will look like this however: "Update: New item (item)
> > (Household)". The expression above still does its job, as it returns
> > ("Update", "New item (item)", "(Household)").

Not quite true; it actually returns
('Update:', ' New item (item) ', '(Household)')
However ignoring the difference in whitespace, the OP's intention is
clear. Yours returns
('Update:', ' New item ', '(item) (Household)')
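
The two results can be checked with a few lines like these (a quick sketch; the variable names are only for illustration):

```python
import re

OP_PATTERN = r'^(Update:)?(.*)(\(.*\))$'           # from the original post
SUGGESTED = r'^(Update:)?(.*?)(?:(\(.*\)))?$'      # from the reply

text = "Update: New item (item) (Household)"

op_groups = re.match(OP_PATTERN, text).groups()
suggested_groups = re.match(SUGGESTED, text).groups()

print(op_groups)         # ('Update:', ' New item (item) ', '(Household)')
print(suggested_groups)  # ('Update:', ' New item ', '(item) (Household)')

# The original pattern requires a parenthesised part, so this is None:
print(re.match(OP_PATTERN, "Update: new item"))
```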


> > It does not work however when there is no text in parentheses (eg
> > "Update: new item"). How can I get the expression to return a tuple
> > such as ("Update:", "new item", None)?
>
> You need to make the last group optional and also make the middle group
> lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

Why do you perpetuate the redundant ^ anchor?

> (?:...) is the non-capturing version of (...).

Why do you use
(?:(subpattern))?
instead of just plain
(subpattern)?
?
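
A short sketch showing that the two spellings capture the same thing; the sample strings are taken from earlier in the thread, and the names WRAPPED and PLAIN are only illustrative.

```python
import re

WRAPPED = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')   # (?:(subpattern))?
PLAIN = re.compile(r'^(Update:)?(.*?)(\(.*\))?$')         # (subpattern)?

samples = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
]

# Both patterns define three capturing groups...
print(WRAPPED.groups, PLAIN.groups)

# ...and return identical tuples for each sample string.
for s in samples:
    print(WRAPPED.match(s).groups() == PLAIN.match(s).groups())
```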



Re: Trouble with regular expressions

2009-02-07 Thread MRAB

John Machin wrote:
> On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > LaundroMat wrote:
> > > Hi,
> > > I'm quite new to regular expressions, and I wonder if anyone here
> > > could help me out.
> > > I'm looking to split strings that ideally look like this: "Update: New
> > > item (Household)" into a group.
> > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> > > ("Update", "New item", "(Household)")
> > > Some strings will look like this however: "Update: New item (item)
> > > (Household)". The expression above still does its job, as it returns
> > > ("Update", "New item (item)", "(Household)").
>
> Not quite true; it actually returns
> ('Update:', ' New item (item) ', '(Household)')
> However ignoring the difference in whitespace, the OP's intention is
> clear. Yours returns
> ('Update:', ' New item ', '(item) (Household)')


The OP said it works OK, which I took to mean that the OP was OK with
the extra whitespace, which can be easily stripped off. Close enough!



> > > It does not work however when there is no text in parentheses (eg
> > > "Update: new item"). How can I get the expression to return a tuple
> > > such as ("Update:", "new item", None)?
> >
> > You need to make the last group optional and also make the middle group
> > lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
>
> Why do you perpetuate the redundant ^ anchor?


The OP didn't say whether search() or match() was being used. With the ^
it doesn't matter.
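
A minimal illustration of that point, assuming re.MULTILINE is not set (the sample text is made up):

```python
import re

text = "xx frobozz"

anchored = re.compile(r'^frobozz')
unanchored = re.compile(r'frobozz')

# With the ^ anchor, match() and search() agree: neither finds anything,
# because the text does not start with "frobozz".
print(anchored.match(text), anchored.search(text))

# Without the anchor they differ: match() only tries position 0,
# while search() scans forward and finds the word at index 3.
print(unanchored.match(text))
print(unanchored.search(text).span())
```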


> > (?:...) is the non-capturing version of (...).
>
> Why do you use
> (?:(subpattern))?
> instead of just plain
> (subpattern)?
> ?


Oops, you're right. I was distracted by the \( and \)! :-)


Re: Trouble with regular expressions

2009-02-07 Thread John Machin
On Feb 8, 10:15 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> John Machin wrote:
> > On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > > LaundroMat wrote:
> > > > Hi,
> > > > I'm quite new to regular expressions, and I wonder if anyone here
> > > > could help me out.
> > > > I'm looking to split strings that ideally look like this: "Update: New
> > > > item (Household)" into a group.
> > > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> > > > ("Update", "New item", "(Household)")
> > > > Some strings will look like this however: "Update: New item (item)
> > > > (Household)". The expression above still does its job, as it returns
> > > > ("Update", "New item (item)", "(Household)").
> >
> > Not quite true; it actually returns
> >     ('Update:', ' New item (item) ', '(Household)')
> > However ignoring the difference in whitespace, the OP's intention is
> > clear. Yours returns
> >     ('Update:', ' New item ', '(item) (Household)')
>
> The OP said it works OK, which I took to mean that the OP was OK with
> the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.


> > > > It does not work however when there is no text in parentheses (eg
> > > > "Update: new item"). How can I get the expression to return a tuple
> > > > such as ("Update:", "new item", None)?
> > > You need to make the last group optional and also make the middle group
> > > lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
> >
> > Why do you perpetuate the redundant ^ anchor?
>
> The OP didn't say whether search() or match() was being used. With the ^
> it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the
first position, it's not smart enough to give up if the pattern has a
front anchor.

[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
10 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9
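
The same comparison can be scripted with the timeit module instead of the command line. This is only a sketch: the helper name time_call and the loop count are arbitrary choices, the lambda adds some call overhead, and the figures will depend on the Python version and machine.

```python
import re
import timeit

rx = re.compile('^frobozz')

def time_call(method, length, number=10000):
    """Return the average time in usec of rx.<method>() on length*'x'."""
    txt = length * 'x'
    func = getattr(rx, method)      # rx.match or rx.search
    assert func(txt) is None        # the pattern must fail, as in the runs above
    total = timeit.timeit(lambda: func(txt), number=number)
    return total / number * 1e6

for method in ('match', 'search'):
    for length in (100, 1000):
        print("%-6s len=%4d: %.2f usec per call"
              % (method, length, time_call(method, length)))
```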



Re: Trouble with regular expressions

2009-02-07 Thread MRAB

John Machin wrote:
> On Feb 8, 10:15 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > John Machin wrote:
> > > On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > > > LaundroMat wrote:
> > > > > Hi,
> > > > > I'm quite new to regular expressions, and I wonder if anyone here
> > > > > could help me out.
> > > > > I'm looking to split strings that ideally look like this: "Update: New
> > > > > item (Household)" into a group.
> > > > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> > > > > ("Update", "New item", "(Household)")
> > > > > Some strings will look like this however: "Update: New item (item)
> > > > > (Household)". The expression above still does its job, as it returns
> > > > > ("Update", "New item (item)", "(Household)").
> > >
> > > Not quite true; it actually returns
> > > ('Update:', ' New item (item) ', '(Household)')
> > > However ignoring the difference in whitespace, the OP's intention is
> > > clear. Yours returns
> > > ('Update:', ' New item ', '(item) (Household)')
> >
> > The OP said it works OK, which I took to mean that the OP was OK with
> > the extra whitespace, which can be easily stripped off. Close enough!
>
> As I said, the whitespace difference [between what the OP said his
> regex did and what it actually does] is not the problem. The problem
> is that the OP's "works OK" included (item) in the 2nd group, whereas
> yours includes (item) in the 3rd group.


Ugh, right again!

That just shows what happens when I try to post while debugging! :-)


> > > > > It does not work however when there is no text in parentheses (eg
> > > > > "Update: new item"). How can I get the expression to return a tuple
> > > > > such as ("Update:", "new item", None)?
> > > >
> > > > You need to make the last group optional and also make the middle group
> > > > lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
> > >
> > > Why do you perpetuate the redundant ^ anchor?
> >
> > The OP didn't say whether search() or match() was being used. With the ^
> > it doesn't matter.
>
> It *does* matter. re.search() is suboptimal; after failing at the
> first position, it's not smart enough to give up if the pattern has a
> front anchor.
>
> [win32, 2.6.1]
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
> 100 loops, best of 3: 1.17 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
> 100 loops, best of 3: 1.17 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
> 10 loops, best of 3: 4.37 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
> 1 loops, best of 3: 34.1 usec per loop
>
> Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9


On my PC the numbers for Python 2.6 are:

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.02 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.04 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
10 loops, best of 3: 3.69 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1 loops, best of 3: 28.6 usec per loop

I'm currently working on the re module and I've fixed that problem:

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.28 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.23 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100 loops, best of 3: 1.21 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
100 loops, best of 3: 1.21 usec per loop

Hmm. Needs more tweaking...