On Feb 8, 10:15 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> John Machin wrote:
> > On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> >> LaundroMat wrote:
> >>> Hi,
> >>> I'm quite new to regular expressions, and I wonder if anyone here
> >>> could help me out.
> >>> I'm looking to split strings that ideally look like this: "Update: New
> >>> item (Household)" into a group.
> >>> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
> >>> ("Update", "New item", "(Household)")
> >>> Some strings will look like this however: "Update: New item (item)
> >>> (Household)". The expression above still does its job, as it returns
> >>> ("Update", "New item (item)", "(Household)").
>
> > Not quite true; it actually returns
> > ('Update:', ' New item (item) ', '(Household)')
> > However ignoring the difference in whitespace, the OP's intention is
> > clear. Yours returns
> > ('Update:', ' New item ', '(item) (Household)')
>
> The OP said it works OK, which I took to mean that the OP was OK with
> the extra whitespace, which can be easily stripped off. Close enough!
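For anyone following along, the results quoted above can be reproduced with a short script. This is only a sketch: the two patterns are copied from the thread (MRAB's is the one quoted further down), while the helper name split_groups and the variable names are made up for illustration.

```python
import re

# Both patterns are copied verbatim from the thread.
op_pattern = re.compile(r'^(Update:)?(.*)(\(.*\))$')
mrab_pattern = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')


def split_groups(pattern, text):
    """Return the match groups as a tuple, or None if there is no match.

    Hypothetical helper, not part of the original posts.
    """
    m = pattern.match(text)
    return m.groups() if m else None


two_parens = "Update: New item (item) (Household)"
no_parens = "Update: new item"

# Greedy middle group: (item) stays in the 2nd group.
print(split_groups(op_pattern, two_parens))
# ('Update:', ' New item (item) ', '(Household)')

# Lazy middle group: (item) moves into the 3rd group.
print(split_groups(mrab_pattern, two_parens))
# ('Update:', ' New item ', '(item) (Household)')

# The OP's pattern requires a parenthesised part, so this fails.
print(split_groups(op_pattern, no_parens))
# None

# The optional last group lets this one match.
print(split_groups(mrab_pattern, no_parens))
# ('Update:', ' new item', None)
```

Calling .strip() on the second group removes the surrounding whitespace in each case.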
As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.

>
> >>> It does not work however when there is no text in parentheses (eg
> >>> "Update: new item"). How can I get the expression to return a tuple
> >>> such as ("Update:", "new item", None)?
> >> You need to make the last group optional and also make the middle group
> >> lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
>
> > Why do you perpetuate the redundant ^ anchor?
>
> The OP didn't say whether search() or match() was being used. With the ^
> it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the
first position, it's not smart enough to give up if the pattern has a
front anchor.

[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9 usec.
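The same comparison can be run from a script instead of the command line. This is a rough sketch, not a rigorous benchmark: the helper usec_per_call and its default loop counts are my own choices, and the lambda wrapper adds a small constant overhead to every figure. The numbers above are for 2.6.1 and 3.0; other interpreter versions may handle a front-anchored search() differently, so measure rather than assume.

```python
import re
import timeit

# Front-anchored pattern, as in the command-line timings.
rx = re.compile('^frobozz')


def usec_per_call(method, txt, number=20000, repeat=3):
    """Best-of-`repeat` time for method(txt), in microseconds per call.

    Hypothetical helper; the lambda adds a little overhead to each call.
    """
    best = min(timeit.repeat(lambda: method(txt), number=number, repeat=repeat))
    return best / number * 1e6


for length in (100, 1000):
    txt = length * 'x'
    # Neither call can succeed: the text never starts with 'frobozz'.
    assert rx.match(txt) is None and rx.search(txt) is None
    print('%5d chars  match  %.2f usec' % (length, usec_per_call(rx.match, txt)))
    print('%5d chars  search %.2f usec' % (length, usec_per_call(rx.search, txt)))
```

If search() time grows with the length of the text while match() time stays flat, the interpreter is retrying the anchored pattern at every position.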