Trouble with regular expressions
Hi,

I'm quite new to regular expressions, and I wonder if anyone here could help me out.

I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.

This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))

Some strings will look like this however: Update: New item (item) (Household). The expression above still does its job, as it returns (Update, New item (item), (Household)).

It does not work however when there is no text in parentheses (eg Update: new item). How can I get the expression to return a tuple such as (Update:, new item, None)?

Thanks in advance,

Mathieu
--
http://mail.python.org/mailman/listinfo/python-list
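The behaviour described in the question can be reproduced with a short script. This is only a sketch: the helper name split_item is made up for illustration, and the pattern is the poster's own, unchanged.

```python
import re

# The original poster's pattern, copied unchanged from the message above.
pattern = re.compile(r'^(Update:)?(.*)(\(.*\))$')

def split_item(text):
    """Hypothetical helper: return the three groups, or None if there is no match."""
    m = pattern.match(text)
    return m.groups() if m else None

one_paren = split_item('Update: New item (Household)')
two_parens = split_item('Update: New item (item) (Household)')
no_paren = split_item('Update: new item')

print(one_paren)
print(two_parens)
print(no_paren)   # None: the last group is mandatory, so the whole match fails
```

The third call shows the actual problem: because the parenthesised group is not optional, a string without parentheses does not match at all.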
Re: Trouble with regular expressions
On Feb 7, 11:18 pm, LaundroMat <laun...@gmail.com> wrote:
> Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> Some strings will look like this however: Update: New item (item) (Household).
> The expression above still does its job, as it returns (Update, New item (item), (Household)).
> It does not work however when there is no text in parentheses (eg Update: new item).
> How can I get the expression to return a tuple such as (Update:, new item, None)?

I don't see how it can be done without some post-matching adjustment. Try this:

C:\junk>type mathieu.py
import re

tests = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
    "minimal",
    "parenthesis (plague) (has) (struck)",
    ]

regex = re.compile(r"""
    (Update:)?      # optional prefix
    \s*             # ignore whitespace
    ([^()]*)        # any non-parentheses stuff
    (\([^()]*\))?   # optional (blahblah)
    \s*             # ignore whitespace
    (\([^()]*\))?   # another optional (blahblah)
    $
    """, re.VERBOSE)

for i, test in enumerate(tests):
    print "Test #%d: %r" % (i, test)
    m = regex.match(test)
    if not m:
        print "No match"
    else:
        g = m.groups()
        print g
        if g[3] is not None:
            x = (g[0], g[1] + g[2], g[3])
        else:
            x = g[:3]
        print x
    print

C:\junk>mathieu.py
Test #0: 'Update: New item (Household)'
('Update:', 'New item ', '(Household)', None)
('Update:', 'New item ', '(Household)')

Test #1: 'Update: New item (item) (Household)'
('Update:', 'New item ', '(item)', '(Household)')
('Update:', 'New item (item)', '(Household)')

Test #2: 'Update: new item'
('Update:', 'new item', None, None)
('Update:', 'new item', None)

Test #3: 'minimal'
(None, 'minimal', None, None)
(None, 'minimal', None)

Test #4: 'parenthesis (plague) (has) (struck)'
No match

HTH,
John
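For readers on Python 3, the same post-matching adjustment might be wrapped in a function along these lines. This is a sketch, not John's script: the function name split_item and the unpacked variable names are assumptions added here, while the regex itself is the one from the message above.

```python
import re

# Same verbose pattern as in the message above.
regex = re.compile(r"""
    (Update:)?      # optional prefix
    \s*             # ignore whitespace
    ([^()]*)        # any non-parenthesis text
    (\([^()]*\))?   # optional (blahblah)
    \s*             # ignore whitespace
    (\([^()]*\))?   # another optional (blahblah)
    $
    """, re.VERBOSE)

def split_item(text):
    """Hypothetical wrapper: return a 3-tuple, or None when nothing matches."""
    m = regex.match(text)
    if m is None:
        return None
    prefix, body, first, second = m.groups()
    if second is not None:
        # Two parenthesised parts: fold the first one back into the body.
        return (prefix, body + first, second)
    return (prefix, body, first)

for sample in ("Update: New item (item) (Household)", "Update: new item"):
    print(split_item(sample))
```

As in the original session, a string with three parenthesised parts is rejected rather than split.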
Re: Trouble with regular expressions
LaundroMat wrote:
> Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> Some strings will look like this however: Update: New item (item) (Household).
> The expression above still does its job, as it returns (Update, New item (item), (Household)).
> It does not work however when there is no text in parentheses (eg Update: new item).
> How can I get the expression to return a tuple such as (Update:, new item, None)?

You need to make the last group optional and also make the middle group lazy:

    r'^(Update:)?(.*?)(?:(\(.*\)))?$'

(?:...) is the non-capturing version of (...).

If you don't make the middle group lazy then it'll capture the rest of the string and the last group would never match anything!
Re: Trouble with regular expressions
On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> LaundroMat wrote:
> > Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> > I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> > Some strings will look like this however: Update: New item (item) (Household).
> > The expression above still does its job, as it returns (Update, New item (item), (Household)).

Not quite true; it actually returns

    ('Update:', ' New item (item) ', '(Household)')

However, ignoring the difference in whitespace, the OP's intention is clear. Yours returns

    ('Update:', ' New item ', '(item) (Household)')

> > It does not work however when there is no text in parentheses (eg Update: new item).
> > How can I get the expression to return a tuple such as (Update:, new item, None)?
>
> You need to make the last group optional and also make the middle group lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

Why do you perpetuate the redundant ^ anchor?

> (?:...) is the non-capturing version of (...).

Why do you use

    (?:(subpattern))?

instead of just plain

    (subpattern)?

?
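The two points raised in this reply can be checked side by side. The sketch below uses the patterns quoted in the thread; the variable names, and the third pattern with the non-capturing wrapper removed, are assumptions added here for comparison.

```python
import re

# Pattern from the original question.
op_pattern = re.compile(r'^(Update:)?(.*)(\(.*\))$')
# Lazy pattern suggested in the reply being answered.
lazy_pattern = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')
# Assumed simplification: the same pattern without the (?:...) wrapper.
plain_pattern = re.compile(r'^(Update:)?(.*?)(\(.*\))?$')

text = 'Update: New item (item) (Household)'

op_groups = op_pattern.match(text).groups()
lazy_groups = lazy_pattern.match(text).groups()
plain_groups = plain_pattern.match(text).groups()
no_paren_groups = lazy_pattern.match('Update: new item').groups()

print(op_groups)        # (item) stays in the second group
print(lazy_groups)      # (item) moves into the third group
print(plain_groups)     # identical to lazy_groups
print(no_paren_groups)  # third group is None
```

The lazy pattern does handle the no-parentheses case, but it changes where (item) ends up, and the extra (?:...) wrapper makes no difference to the captured groups.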
Re: Trouble with regular expressions
John Machin wrote:
> On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > LaundroMat wrote:
> > > Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> > > I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> > > Some strings will look like this however: Update: New item (item) (Household).
> > > The expression above still does its job, as it returns (Update, New item (item), (Household)).
>
> Not quite true; it actually returns
> ('Update:', ' New item (item) ', '(Household)')
> However ignoring the difference in whitespace, the OP's intention is clear. Yours returns
> ('Update:', ' New item ', '(item) (Household)')

The OP said it works OK, which I took to mean that the OP was OK with the extra whitespace, which can be easily stripped off. Close enough!

> > > It does not work however when there is no text in parentheses (eg Update: new item).
> > > How can I get the expression to return a tuple such as (Update:, new item, None)?
> >
> > You need to make the last group optional and also make the middle group lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
>
> Why do you perpetuate the redundant ^ anchor?

The OP didn't say whether search() or match() was being used. With the ^ it doesn't matter.

> > (?:...) is the non-capturing version of (...).
>
> Why do you use
> (?:(subpattern))?
> instead of just plain
> (subpattern)?
> ?

Oops, you're right. I was distracted by the \( and \)! :-)
Re: Trouble with regular expressions
On Feb 8, 10:15 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> John Machin wrote:
> > On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > > LaundroMat wrote:
> > > > Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> > > > I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> > > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> > > > Some strings will look like this however: Update: New item (item) (Household).
> > > > The expression above still does its job, as it returns (Update, New item (item), (Household)).
> >
> > Not quite true; it actually returns
> > ('Update:', ' New item (item) ', '(Household)')
> > However ignoring the difference in whitespace, the OP's intention is clear. Yours returns
> > ('Update:', ' New item ', '(item) (Household)')
>
> The OP said it works OK, which I took to mean that the OP was OK with the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his regex did and what it actually does] is not the problem. The problem is that the OP's "works OK" included (item) in the 2nd group, whereas yours includes (item) in the 3rd group.

> > > > It does not work however when there is no text in parentheses (eg Update: new item).
> > > > How can I get the expression to return a tuple such as (Update:, new item, None)?
> > >
> > > You need to make the last group optional and also make the middle group lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
> >
> > Why do you perpetuate the redundant ^ anchor?
>
> The OP didn't say whether search() or match() was being used. With the ^ it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the first position, it's not smart enough to give up if the pattern has a front anchor.
[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
10 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9
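The same comparison can be run from a script instead of the command line. This is a rough sketch using the standard timeit module; the helper time_call and the repetition count are assumptions chosen here, and whether search() still shows a length-dependent cost will depend on the Python version in use.

```python
import re
import timeit

rx = re.compile('^frobozz')

def time_call(method, text, number=2000):
    """Hypothetical helper: seconds taken by `number` calls of method(text)."""
    return timeit.timeit(lambda: method(text), number=number)

short_text = 100 * 'x'
long_text = 1000 * 'x'

# Both calls agree on the result: the anchored pattern cannot match.
match_result = rx.match(long_text)
search_result = rx.search(long_text)

timings = {
    ('match', 100): time_call(rx.match, short_text),
    ('match', 1000): time_call(rx.match, long_text),
    ('search', 100): time_call(rx.search, short_text),
    ('search', 1000): time_call(rx.search, long_text),
}

for (name, size), seconds in sorted(timings.items()):
    print('%-6s len=%4d  %.6f s' % (name, size, seconds))
```

The results of match() and search() are identical for a front-anchored pattern (without re.MULTILINE); the point made above concerns only how long search() takes to report the failure.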
Re: Trouble with regular expressions
John Machin wrote:
> On Feb 8, 10:15 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > John Machin wrote:
> > > On Feb 8, 1:37 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> > > > LaundroMat wrote:
> > > > > Hi, I'm quite new to regular expressions, and I wonder if anyone here could help me out.
> > > > > I'm looking to split strings that ideally look like this: Update: New item (Household) into a group.
> > > > > This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns (Update, New item, (Household))
> > > > > Some strings will look like this however: Update: New item (item) (Household).
> > > > > The expression above still does its job, as it returns (Update, New item (item), (Household)).
> > >
> > > Not quite true; it actually returns
> > > ('Update:', ' New item (item) ', '(Household)')
> > > However ignoring the difference in whitespace, the OP's intention is clear. Yours returns
> > > ('Update:', ' New item ', '(item) (Household)')
> >
> > The OP said it works OK, which I took to mean that the OP was OK with the extra whitespace, which can be easily stripped off. Close enough!
>
> As I said, the whitespace difference [between what the OP said his regex did and what it actually does] is not the problem. The problem is that the OP's "works OK" included (item) in the 2nd group, whereas yours includes (item) in the 3rd group.

Ugh, right again! That just shows what happens when I try to post while debugging! :-)

> > > > > It does not work however when there is no text in parentheses (eg Update: new item).
> > > > > How can I get the expression to return a tuple such as (Update:, new item, None)?
> > > >
> > > > You need to make the last group optional and also make the middle group lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.
> > >
> > > Why do you perpetuate the redundant ^ anchor?
> >
> > The OP didn't say whether search() or match() was being used. With the ^ it doesn't matter.
>
> It *does* matter. re.search() is suboptimal; after failing at the first position, it's not smart enough to give up if the pattern has a front anchor.
> [win32, 2.6.1]
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
> 100 loops, best of 3: 1.17 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
> 100 loops, best of 3: 1.17 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
> 10 loops, best of 3: 4.37 usec per loop
>
> C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
> 1 loops, best of 3: 34.1 usec per loop
>
> Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9

On my PC the numbers for Python 2.6 are:

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.02 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.04 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
10 loops, best of 3: 3.69 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1 loops, best of 3: 28.6 usec per loop

I'm currently working on the re module and I've fixed that problem:

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.28 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
100 loops, best of 3: 1.23 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100 loops, best of 3: 1.21 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
100 loops, best of 3: 1.21 usec per loop

Hmm. Needs more tweaking...