Re: a regex question
Maggie Q Roth writes: > There are two primary types of lines in the log: > > 60.191.38.xx/ > 42.120.161.xx /archives/1005 > > I know how to write regex to match each line, but don't get the good result > with one regex to match both lines. > > Can you help? When I look at these lines, I see two fields separated by whitespace (note that two example lines are far too few to infer the proper pattern). I would not use a regular expression in this case, but the `split` string method. A regular expression for this pattern could be `(\S+)\s+(.*)`, which reads: a non-empty sequence of non-whitespace characters (assigned to group 1), whitespace, then any sequence (assigned to group 2). Note that the regular expression above is given at the regex level; the string literal in your Python code may look slightly different. -- https://mail.python.org/mailman/listinfo/python-list
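For comparison, here is a minimal sketch of both suggestions side by side. It assumes the two fields really are separated by whitespace; the helper names are made up for illustration, and the first sample line as posted (with no space before the slash) is deliberately included to show what happens when that assumption fails.

```python
import re

# Pattern from the reply above, written as a raw string at the Python level.
FIELD_RE = re.compile(r"(\S+)\s+(.*)")


def parse_with_split(line):
    # maxsplit=1 keeps any later whitespace inside the second field.
    return tuple(line.split(None, 1))


def parse_with_regex(line):
    m = FIELD_RE.match(line)
    return m.groups() if m else None


for sample in ("42.120.161.xx /archives/1005", "60.191.38.xx/"):
    print(repr(sample), parse_with_split(sample), parse_with_regex(sample))
```

With no whitespace in the line, `split` returns a single field and the regex does not match at all, so either way the caller has to decide how to treat such lines.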
Re: a regex question
On 25/10/19 12:22, Maggie Q Roth wrote: > Hello > > There are two primary types of lines in the log: > > 60.191.38.xx/ > 42.120.161.xx /archives/1005 > > I know how to write regex to match each line, but don't get the good result > with one regex to match both lines. Could you provide the regexes that you have for each line? -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
Re: a regex question
On October 25, 2019 12:22:44 PM GMT+02:00, Maggie Q Roth wrote: >Hello > >There are two primary types of lines in the log: > >60.191.38.xx/ >42.120.161.xx /archives/1005 > >I know how to write regex to match each line, but don't get the good >result >with one regex to match both lines. What is a good result? There is an re.MULTILINE flag. Did you try that? What does that do? -- https://mail.python.org/mailman/listinfo/python-list
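To make the question about re.MULTILINE concrete, here is a small sketch. The log text is a stand-in built from the two example lines (with a space assumed before the lone slash), not the poster's real data.

```python
import re

# Two log lines held in one string, as they would be after reading a whole file.
log = "60.191.38.xx /\n42.120.161.xx /archives/1005\n"

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r"^\S+", log)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r"^\S+", log, re.MULTILINE)

print(without_flag)
print(with_flag)
```

The flag only matters when several lines are searched as one string; when looping over the file line by line it changes nothing.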
a regex question
Hello There are two primary types of lines in the log: 60.191.38.xx/ 42.120.161.xx /archives/1005 I know how to write regex to match each line, but don't get the good result with one regex to match both lines. Can you help? Thanks, Maggie -- https://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Aug 18, 12:22 pm, Jussi Piitulainen wrote: > Frank Koshti writes: > > not always placed in HTML, and even in HTML, they may appear in > > strange places, such as Hello. My specific issue > > is I need to match, process and replace $foo(x=3), knowing that > > (x=3) is optional, and the token might appear simply as $foo. > > > To do this, I decided to use: > > > re.compile('\$\w*\(?.*?\)').findall(mystring) > > > the issue with this is it doesn't match $foo by itself, and requires > > there to be () at the end. > > Adding a ? after the meant-to-be-optional expression would let the > regex engine know what you want. You can also separate the mandatory > and the optional part in the regex to receive pairs as matches. The > test program below prints this: > > >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc > ('$foo', '') > ('$foo', '(bar=3)') > ('$foo', '($)') > ('$foo', '') > ('$bar', '(v=0)') > > Here is the program: > > import re > > def grab(text): > p = re.compile(r'([$]\w+)([(][^()]+[)])?') > return re.findall(p, text) > > def test(html): > print(html) > for hit in grab(html): > print(hit) > > if __name__ == '__main__': > test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etchttp://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Steven, Well done!!! Regards, Malcolm -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Frank Koshti writes:

> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the regex engine know what you want. You can also separate the mandatory and the optional part in the regex to receive pairs as matches. The test program below prints this:

    >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
    ('$foo', '')
    ('$foo', '(bar=3)')
    ('$foo', '($)')
    ('$foo', '')
    ('$bar', '(v=0)')

Here is the program:

    import re

    def grab(text):
        p = re.compile(r'([$]\w+)([(][^()]+[)])?')
        return re.findall(p, text)

    def test(html):
        print(html)
        for hit in grab(html):
            print(hit)

    if __name__ == '__main__':
        test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
Re: Regex Question
On Aug 18, 11:48 am, Peter Otten <__pete...@web.de> wrote: > Frank Koshti wrote: > > I need to match, process and replace $foo(x=3), knowing that (x=3) is > > optional, and the token might appear simply as $foo. > > > To do this, I decided to use: > > > re.compile('\$\w*\(?.*?\)').findall(mystring) > > > the issue with this is it doesn't match $foo by itself, and requires > > there to be () at the end. > >>> s = """ > > ... $foo1 > ... $foo2() > ... $foo3(anything could go here) > ... """>>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s) > > ['$foo1', '$foo2()', '$foo3(anything could go here)'] PERFECT- -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
2012/8/18 Frank Koshti : > Hey Steven, > > Thank you for the detailed (and well-written) tutorial on this very > issue. I actually learned a few things! Though, I still have > unresolved questions. > > The reason I don't want to use an XML parser is because the tokens are > not always placed in HTML, and even in HTML, they may appear in > strange places, such as Hello. My specific issue is > I need to match, process and replace $foo(x=3), knowing that (x=3) is > optional, and the token might appear simply as $foo. > > To do this, I decided to use: > > re.compile('\$\w*\(?.*?\)').findall(mystring) > > the issue with this is it doesn't match $foo by itself, and requires > there to be () at the end. > > Thanks, > Frank > -- > http://mail.python.org/mailman/listinfo/python-list Hi, Although I don't quite get the pattern you are using (with respect to the specified task), you most likely need raw string syntax for the pattern, e.g.: r"...", instead of "...", or you have to double all backslashes (which should be escaped), i.e. \\w etc. I am likely misunderstanding the specification, as the following: >>> re.sub(r"\$foo\(x=3\)", "bar", "Hello") 'Hello' >>> is probably not the desired output. For some kind of "processing" the matched text, you can use the replace function instead of the replace pattern in re.sub too. see http://docs.python.org/library/re.html#re.sub hth, vbr -- http://mail.python.org/mailman/listinfo/python-list
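As a rough illustration of passing a replacement function to re.sub, here is a sketch that matches $foo with an optional parenthesised part and rewrites each token. The pattern is a variant of the ones posted in this thread with two capture groups added; the render function and its output format are invented for the example, since the thread does not say what the processing should produce.

```python
import re

# $name, optionally followed by (...); group 2 is None when the parentheses are absent.
TOKEN_RE = re.compile(r"\$(\w+)(?:\((.*?)\))?")


def render(match):
    # Hypothetical processing step: just make the parsed pieces visible.
    name, args = match.group(1), match.group(2)
    if args is None:
        return "[%s]" % name
    return "[%s: %s]" % (name, args)


def expand(text):
    # re.sub calls render() once per match and inserts its return value.
    return TOKEN_RE.sub(render, text)


print(expand("a $foo b $foo(x=3) c"))
```

Like the other patterns in the thread, this one does not handle nested parentheses.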
Re: Regex Question
Frank Koshti wrote: > I need to match, process and replace $foo(x=3), knowing that (x=3) is > optional, and the token might appear simply as $foo. > > To do this, I decided to use: > > re.compile('\$\w*\(?.*?\)').findall(mystring) > > the issue with this is it doesn't match $foo by itself, and requires > there to be () at the end. >>> s = """ ... $foo1 ... $foo2() ... $foo3(anything could go here) ... """ >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s) ['$foo1', '$foo2()', '$foo3(anything could go here)'] -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Hey Steven, Thank you for the detailed (and well-written) tutorial on this very issue. I actually learned a few things! Though, I still have unresolved questions. The reason I don't want to use an XML parser is because the tokens are not always placed in HTML, and even in HTML, they may appear in strange places, such as Hello. My specific issue is I need to match, process and replace $foo(x=3), knowing that (x=3) is optional, and the token might appear simply as $foo. To do this, I decided to use: re.compile('\$\w*\(?.*?\)').findall(mystring) the issue with this is it doesn't match $foo by itself, and requires there to be () at the end. Thanks, Frank -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, I'm going to ignore that advice and show you how to abuse regexes to search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by any alphanumeric characters, followed by "</h1>".

We start by compiling a regex:

    import re
    pattern = r"<h1>@\w+</h1>"
    regex = re.compile(pattern, re.I)

First we import the re module. Then we define a pattern string. Note that I use a "raw string" instead of a regular string -- this is not compulsory, but it is very common.

The difference between a raw string and a regular string is how they handle backslashes. In Python, some (but not all!) backslashes are special. For example, the regular string "\n" is not two characters, backslash-n, but a single character, Newline. The Python string parser converts backslash combinations to special characters, e.g.:

    \n => newline
    \t => tab
    \0 => ASCII Null character
    \\ => a single backslash
    etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to disable the interpretation of backslash escapes when writing regex patterns. We do that with a "raw string" -- if you prefix the string with the letter r, the string is raw and backslash escapes are ignored:

    # ordinary "cooked" string:
    "abc\n" => a b c newline

    # raw string
    r"abc\n" => a b c backslash n

Here is our pattern again:

    pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

    less-than h 1 greater-than at-sign backslash w plus-sign less-than slash h 1 greater-than

Most of the characters shown just match themselves. For example, the @ sign will only match another @ sign. But some have special meaning to the regex:

    \w doesn't match "backslash w", but any alphanumeric character;

    + doesn't match a plus sign, but tells the regex to match the previous symbol one or more times. Since it immediately follows \w, this means "match at least one alphanumeric character".

Now we feed that string into re.compile, to create a pre-compiled regex. (This step is optional: any function which takes a compiled regex will also accept a string pattern. But pre-compiling regexes which you are going to use repeatedly is a good idea.)

    regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I, which is a special value that tells the regular expression to ignore case, so "h" will match both "h" and "H".

Now on to use the regex. Here's a bunch of text to search:

    text = """Now is the time for all good men blah blah blah spam
    and more text here blah blah blah
    and some more <h1>@victory</h1> blah blah blah"""

And we search it this way:

    mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular expression finds something that matches your pattern. If nothing matches, then None is returned instead.

    if mo is not None:
        print(mo.group(0))

    => prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care about the <h1> tags, we only care about the "victory" part. Here's how to use grouping to extract substrings from the regex:

    pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
    regex = re.compile(pattern, re.I)
    mo = re.search(regex, text)
    if mo is not None:
        print(mo.group(0))
        print(mo.group(1))

This prints:

    <h1>@victory</h1>
    victory

Hope this helps.

--
Steven
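The raw-versus-cooked distinction from the tutorial above can be checked directly at the string level, before any regex is involved. This is only a small sketch; the variable names are arbitrary.

```python
# An ordinary ("cooked") string: the parser turns \n into one newline character.
cooked = "abc\n"

# A raw string: the backslash and the n stay as two separate characters.
raw = r"abc\n"

print(len(cooked))  # a, b, c, newline
print(len(raw))     # a, b, c, backslash, n
print(repr(raw))

# A raw pattern and a doubled-backslash pattern are the same string object-wise.
same = (r"\w+" == "\\w+")
print(same)
```

So the r prefix never changes what the regex engine sees compared with doubling every backslash; it only saves the doubling.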
Re: Regex Question
I think the point was missed. I don't want to use an XML parser. The point is to pick up those tokens, and yes I've done my share of RTFM. This is what I've come up with: '\$\w*\(?.*?\)' Which doesn't work well on the above example, which is partly why I reached out to the group. Can anyone help me with the regex? Thanks, Frank -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
In article <385e732e-1c02-4dd0-ab12-b92890bbe...@o3g2000yqp.googlegroups.com>, Frank Koshti wrote: > I'm new to regular expressions. I want to be able to match for tokens > with all their properties in the following examples. I would > appreciate some direction on how to proceed. > > > @foo1 > @foo2() > @foo3(anything could go here) Don't try to parse HTML with regexes. Use a real HTML parser, such as lxml (http://lxml.de/). -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On 18/08/2012 06:42, Chris Angelico wrote: On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti wrote: Hi, I'm new to regular expressions. I want to be able to match for tokens with all their properties in the following examples. I would appreciate some direction on how to proceed. @foo1 @foo2() @foo3(anything could go here) You can find regular expression primers all over the internet - fire up your favorite search engine and type those three words in. But it may be that what you want here is a more flexible parser; have you looked at BeautifulSoup (so rich and green)? ChrisA Totally agree with the sentiment. There's a comparison of python parsers here http://nedbatchelder.com/text/python-parsers.html -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti wrote: > Hi, > > I'm new to regular expressions. I want to be able to match for tokens > with all their properties in the following examples. I would > appreciate some direction on how to proceed. > > > @foo1 > @foo2() > @foo3(anything could go here) You can find regular expression primers all over the internet - fire up your favorite search engine and type those three words in. But it may be that what you want here is a more flexible parser; have you looked at BeautifulSoup (so rich and green)? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Regex Question
Hi, I'm new to regular expressions. I want to be able to match for tokens with all their properties in the following examples. I would appreciate some direction on how to proceed. @foo1 @foo2() @foo3(anything could go here) Thanks- Frank -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On 29/07/11 19:52, Rustom Mody wrote: > MRAB wrote: > > findall returns a list of tuples (what the groups captured) if there > is more than 1 group, > > or a list of strings (what the group captured) if there is 1 group, > or a list of > > strings (what the regex matched) if there are no groups. > > Thanks. > It would be good to put this in the manual dont you think? It is in the manual. > > Also, the manual says in the 'match' section > > "Note If you want to locate a match anywhere in /string/, use search() > instead." > > to guard against users using match when they should be using search. > > Likewise it would be helpful if the manual also said (in the > match,search sections) > "If more than one match/search is required use findall" > > -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
MRAB wrote: > findall returns a list of tuples (what the groups captured) if there is more than 1 group, > or a list of strings (what the group captured) if there is 1 group, or a list of > strings (what the regex matched) if there are no groups. Thanks. It would be good to put this in the manual dont you think? Also, the manual says in the 'match' section "Note If you want to locate a match anywhere in *string*, use search()instead." to guard against users using match when they should be using search. Likewise it would be helpful if the manual also said (in the match,search sections) "If more than one match/search is required use findall" -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On 29/07/2011 16:45, Thomas Jollans wrote: On 29/07/11 16:53, rusi wrote: Can someone throw some light on this anomalous behavior? import re r = re.search('a(b+)', 'ababbaaab') r.group(1) 'b' r.group(0) 'ab' r.group(2) Traceback (most recent call last): File "", line 1, in IndexError: no such group re.findall('a(b+)', 'ababbaaab') ['b', 'bb', 'b'] So evidently group counts by number of '()'s and not by number of matches (and this is the case whether one uses match or search). So then whats the point of search-ing vs match-ing? Or equivalently how to move to the groups of the next match in? [Side note: The docstrings for this really suck: help(r.group) Help on built-in function group: group(...) Pretty standard regex behaviour: Group 1 is the first pair of brackets. Group 2 is the second, etc. pp. Group 0 is the whole match. The difference between matching and searching is that match assumes that the start of the regex coincides with the start of the string (and this is documented in the library docs IIRC). re.match(exp, s) is equivalent to re.search('^'+exp, s). (if not exp.startswith('^')) Apparently, findall() returns the content of the first group if there is one. I didn't check this, but I assume it is documented. findall returns a list of tuples (what the groups captured) if there is more than 1 group, or a list of strings (what the group captured) if there is 1 group, or a list of strings (what the regex matched) if there are no groups. -- http://mail.python.org/mailman/listinfo/python-list
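The three findall cases described above, plus finditer for anyone who wants the groups of each successive match as match objects, can be sketched on the same sample string. The variable names are only for illustration.

```python
import re

s = 'ababbaaab'

# No groups: a list of whole matches.
no_groups = re.findall(r'ab+', s)

# One group: a list of what that group captured.
one_group = re.findall(r'a(b+)', s)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(a)(b+)', s)

# finditer yields one match object per match, so group() and start() stay available.
per_match = [(m.group(0), m.group(1), m.start()) for m in re.finditer(r'a(b+)', s)]

print(no_groups)
print(one_group)
print(two_groups)
print(per_match)
```

finditer is probably the closest answer to "how to move to the groups of the next match".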
Re: regex question
On 29/07/11 16:53, rusi wrote: > Can someone throw some light on this anomalous behavior? > import re r = re.search('a(b+)', 'ababbaaab') r.group(1) > 'b' r.group(0) > 'ab' r.group(2) > Traceback (most recent call last): > File "", line 1, in > IndexError: no such group > re.findall('a(b+)', 'ababbaaab') > ['b', 'bb', 'b'] > > So evidently group counts by number of '()'s and not by number of > matches (and this is the case whether one uses match or search). So > then whats the point of search-ing vs match-ing? > > Or equivalently how to move to the groups of the next match in? > > [Side note: The docstrings for this really suck: > help(r.group) > Help on built-in function group: > > group(...) > Pretty standard regex behaviour: Group 1 is the first pair of brackets. Group 2 is the second, etc. pp. Group 0 is the whole match. The difference between matching and searching is that match assumes that the start of the regex coincides with the start of the string (and this is documented in the library docs IIRC). re.match(exp, s) is equivalent to re.search('^'+exp, s). (if not exp.startswith('^')) Apparently, findall() returns the content of the first group if there is one. I didn't check this, but I assume it is documented. - Thomas -- http://mail.python.org/mailman/listinfo/python-list
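A short sketch of the match/search difference. One caveat on the '^'+exp equivalence: plain concatenation changes the meaning when exp contains a top-level alternation, so the helper below (a made-up name, not part of the re module) anchors with \A and a non-capturing group instead. It is still only an approximation and ignores things like inline flags.

```python
import re


def match_via_search(pattern, string):
    # Hypothetical helper: anchor the whole pattern at the start of the string.
    return re.search(r'\A(?:%s)' % pattern, string)


# match only succeeds at position 0; search scans forward.
print(re.match(r'b+', 'ab'))
print(re.search(r'b+', 'ab').group())

# '^a|b' parses as (^a)|(b), so it still finds the b in the middle.
print(re.match(r'a|b', 'cb'))
print(re.search('^' + r'a|b', 'cb').group())
print(match_via_search(r'a|b', 'cb'))
```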
regex question
Can someone throw some light on this anomalous behavior?

>>> import re
>>> r = re.search('a(b+)', 'ababbaaab')
>>> r.group(1)
'b'
>>> r.group(0)
'ab'
>>> r.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
>>> re.findall('a(b+)', 'ababbaaab')
['b', 'bb', 'b']

So evidently group counts by number of '()'s and not by number of matches (and this is the case whether one uses match or search). So then what's the point of search-ing vs match-ing?

Or equivalently, how do I move to the groups of the next match?

[Side note: The docstrings for this really suck:

>>> help(r.group)
Help on built-in function group:

group(...)

>>>
]
Re: regex question on .findall and \b
Many thanks to all who replied! And, yes, I will *definitely* use raw strings from now on. :) ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question on .findall and \b
Ethan Furman wrote: Greetings! My closest to successfull attempt: Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] Type "copyright", "credits" or "license" for more information. IPython 0.9.1 -- An enhanced Interactive Python. In [161]: re.findall('\d+','this is test a3 attempt 79') Out[161]: ['3', '79'] What I really want in just the 79, as a3 is not a decimal number, but when I add the \b word boundaries I get: In [162]: re.findall('\b\d+\b','this is test a3 attempt 79') Out[162]: [] What am I missing? ~Ethan~ ARGH!! Okay, I need two \\ so I'm not trying to match a backspace. I knew (okay, hoped ;) I would figure it out once I posted the question and moved on. *sheepish grin* -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question on .findall and \b
On Thu, 02 Jul 2009 09:38:56 -0700, Ethan Furman wrote: > Greetings! > > My closest to successfull attempt: > > Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] > Type "copyright", "credits" or "license" for more information. > > IPython 0.9.1 -- An enhanced Interactive Python. > >In [161]: re.findall('\d+','this is test a3 attempt 79') >Out[161]: ['3', '79'] > > What I really want in just the 79, as a3 is not a decimal number, but > when I add the \b word boundaries I get: > >In [162]: re.findall('\b\d+\b','this is test a3 attempt 79') >Out[162]: [] > > What am I missing? You need to use a raw string (r'...') to prevent \b from being interpreted as a backspace: re.findall(r'\b\d+\b','this is test a3 attempt 79') \d isn't a recognised escape sequence, so it doesn't get interpreted: > print '\b' ^H > print '\d' \d > print r'\b' \b Try to get into the habit of using raw strings for regexps. -- http://mail.python.org/mailman/listinfo/python-list
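The effect can be reproduced in a few lines. The sketch below is written for Python 3, so the non-raw pattern spells the digit class as \\d to avoid an invalid-escape warning while keeping \b as the backspace character, which is what the original non-raw pattern contained.

```python
import re

text = 'this is test a3 attempt 79'

# Non-raw: \b is converted to a backspace character (0x08) before re sees it.
cooked = '\b\\d+\b'

# Raw: re receives backslash-b and treats it as a word boundary.
raw = r'\b\d+\b'

print(repr(cooked))
print(re.findall(cooked, text))
print(re.findall(raw, text))

# Doubling the backslashes by hand gives the same pattern as the raw string.
print(re.findall('\\b\\d+\\b', text))
```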
Re: regex question on .findall and \b
On 2009-07-02 18:38, Ethan Furman wrote: Greetings! My closest to successfull attempt: Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] Type "copyright", "credits" or "license" for more information. IPython 0.9.1 -- An enhanced Interactive Python. In [161]: re.findall('\d+','this is test a3 attempt 79') Out[161]: ['3', '79'] What I really want in just the 79, as a3 is not a decimal number, but when I add the \b word boundaries I get: In [162]: re.findall('\b\d+\b','this is test a3 attempt 79') Out[162]: [] What am I missing? ~Ethan~ Try this: >>> re.findall(r'\b\d+\b','this is test a3 attempt 79') ['79'] The \b is a backspace, by using raw strings you get an actual backslash and b. -- Sjoerd Mullender -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question on .findall and \b
Ethan Furman wrote: Greetings! My closest to successfull attempt: Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] Type "copyright", "credits" or "license" for more information. IPython 0.9.1 -- An enhanced Interactive Python. In [161]: re.findall('\d+','this is test a3 attempt 79') Out[161]: ['3', '79'] What I really want in just the 79, as a3 is not a decimal number, but when I add the \b word boundaries I get: In [162]: re.findall('\b\d+\b','this is test a3 attempt 79') Out[162]: [] What am I missing? The sneaky detail that the regexp should be in a raw string (always a good practice), not a cooked string: r'\b\d+\b' The "\d" isn't a valid character-expansion, so python leaves it alone. However, I believe the "\b" is a control character, so your actual string ends up something like: >>> print repr('\b\d+\b') '\x08\\d+\x08' >>> print repr(r'\b\d+\b') '\\b\\d+\\b' the first of which doesn't match your target string, as you might imagine. -tkc -- http://mail.python.org/mailman/listinfo/python-list
regex question on .findall and \b
Greetings! My closest-to-successful attempt: Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] Type "copyright", "credits" or "license" for more information. IPython 0.9.1 -- An enhanced Interactive Python. In [161]: re.findall('\d+','this is test a3 attempt 79') Out[161]: ['3', '79'] What I really want is just the 79, as a3 is not a decimal number, but when I add the \b word boundaries I get: In [162]: re.findall('\b\d+\b','this is test a3 attempt 79') Out[162]: [] What am I missing? ~Ethan~ -- http://mail.python.org/mailman/listinfo/python-list
Re: Python Regex Question
MalteseUnderdog wrote:

Hi there I just started python (but this question isn't that trivial since I couldn't find it in google :) ) I have the following text file entries (simplified)

    start #frag 1 start
    x=Dog # frag 1 end
    stop
    start# frag 2 start
    x=Cat # frag 2 end
    stop
    start #frag 3 start
    x=Dog #frag 3 end
    stop

I need a regex expression which returns the start to the x=ANIMAL for only the x=Dog fragments so all my entries should be start ... (something here) ... x=Dog . So I am really interested in fragments 1 and 3 only.

As I understand the above, I would first write a generator that separates the file into fragments and yields them one at a time. Perhaps something like

    def fragments(ifile):
        # Assumption: a fragment ends at a line that starts with 'stop'.
        frag = []
        for line in ifile:
            frag.append(line)
            if line.startswith('stop'):
                yield ''.join(frag)
                frag = []

Then I would iterate through the fragments, testing for the ones I want:

    for frag in fragments(somefile):
        if 'x=Dog' in frag:
            pass  # process the fragment here

Terry Jan Reedy
Re: Python Regex Question
On Oct 29, 7:01 pm, Tim Chase <[EMAIL PROTECTED]> wrote: > > I need a regex expression which returns the start to the x=ANIMAL for > > only the x=Dog fragments so all my entries should be start ... > > (something here) ... x=Dog . So I am really interested in fragments 1 > > and 3 only. > > > My idea (primitive) ^start.*?x=Dog doesn't work because clearly it > > would return results > > > start > > x=Dog # (good) > > > and > > > start > > x=Cat > > stop > > start > > x=Dog # bad since I only want start ... x=Dog portion > > Looks like the following does the trick: > > >>> s = """start #frag 1 start > ... x=Dog # frag 1 end > ... stop > ... start # frag 2 start > ... x=Cat # frag 2 end > ... stop > ... start #frag 3 start > ... x=Dog #frag 3 end > ... stop""" > >>> import re > >>> r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE) > >>> for i, result in enumerate(r.findall(s)): > ... print i, repr(result) > ... > 0 'start #frag 1 start\nx=Dog # frag 1 end\nstop' > 1 'start #frag 3 start\nx=Dog #frag 3 end\nstop' > > -tkc This will only work if 'x=Dog' directly follows 'start' (which happens in the given example). If that's not necessarily the case, I would do it in two steps (in fact I wouldn't use regexps probably but...): >>> for chunk in re.split(r'\nstop', data): ... m = re.search('^start.*^x=Dog', chunk, re.DOTALL | re.MULTILINE) ... if m: print repr(m.group()) ... 'start #frag 1 start \nx=Dog' 'start #frag 3 start \nx=Dog' -- Arnaud -- http://mail.python.org/mailman/listinfo/python-list
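Here is the two-step idea from the end of this message as a Python 3 sketch. The sample data is the thread's example with one extra line (y=1) added to fragment 1, so that x=Dog does not directly follow start; that extra line and the function name are assumptions for the demonstration.

```python
import re

DATA = """start #frag 1 start
y=1
x=Dog # frag 1 end
stop
start # frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop"""


def dog_prefixes(data):
    hits = []
    # Step 1: cut the text into chunks at each stop line.
    for chunk in re.split(r'\nstop', data):
        # Step 2: within one chunk, take everything from start up to x=Dog.
        m = re.search(r'^start.*^x=Dog', chunk, re.DOTALL | re.MULTILINE)
        if m:
            hits.append(m.group())
    return hits


for hit in dog_prefixes(DATA):
    print(repr(hit))
```

Because each chunk holds exactly one fragment, the greedy .* cannot run across a stop into the next fragment.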
Re: Python Regex Question
I need a regex expression which returns the start to the x=ANIMAL for only the x=Dog fragments so all my entries should be start ... (something here) ... x=Dog . So I am really interested in fragments 1 and 3 only. My idea (primitive) ^start.*?x=Dog doesn't work because clearly it would return results start x=Dog # (good) and start x=Cat stop start x=Dog # bad since I only want start ... x=Dog portion Looks like the following does the trick: >>> s = """start #frag 1 start ... x=Dog # frag 1 end ... stop ... start# frag 2 start ... x=Cat # frag 2 end ... stop ... start #frag 3 start ... x=Dog #frag 3 end ... stop""" >>> import re >>> r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE) >>> for i, result in enumerate(r.findall(s)): ... print i, repr(result) ... 0 'start #frag 1 start\nx=Dog # frag 1 end\nstop' 1 'start #frag 3 start\nx=Dog #frag 3 end\nstop' -tkc -- http://mail.python.org/mailman/listinfo/python-list
Python Regex Question
Hi there I just started python (but this question isn't that trivial since I couldn't find it in google :) ) I have the following text file entries (simplified) start #frag 1 start x=Dog # frag 1 end stop start# frag 2 start x=Cat # frag 2 end stop start #frag 3 start x=Dog #frag 3 end stop I need a regex expression which returns the start to the x=ANIMAL for only the x=Dog fragments so all my entries should be start ... (something here) ... x=Dog . So I am really interested in fragments 1 and 3 only. My idea (primitive) ^start.*?x=Dog doesn't work because clearly it would return results start x=Dog # (good) and start x=Cat stop start x=Dog # bad since I only want start ... x=Dog portion Can you help me ? Thanks JP, Malta. -- http://mail.python.org/mailman/listinfo/python-list
Re: Python regex question
Hey Gerhard, Gerhard Häring wrote: > > Tim van der Leeuw wrote: >> Hi, >> >> I'm trying to create a regular expression for matching some particular >> XML strings. I want to extract the contents of a particular XML tag, >> only if it follows one tag, but not follows another tag. Complicating >> this, is that there can be any number of other tags in between. [...] > > Sounds like this would be easier to implement using Python's SAX API. > > Here's a short example that does something similar to what you want to > achieve: > > [...] > I so far forgot to say a "thank you" for the suggestion :-) The sample code as you sent it doesn't do what I need to do, but I did look at it for creating SAX handler code that does what I want. It took me a while to implement, as it didn't fit in the parser-engine I had and I was close to making a release. But still: thanks! --Tim -- View this message in context: http://www.nabble.com/Python-regex-question-tp17773487p18997385.html Sent from the Python - python-list mailing list archive at Nabble.com. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Tue, 05 Aug 2008 15:55:46 +0100, Fred Mangusta wrote: > Chris wrote: > >> Doesn't work for his use case as he wants to keep periods marking the >> end of a sentence. Doesn't it? The period has to be surrounded by digits in the example solution, so wouldn't periods followed by a space (end of sentence) always make it through? ** Posted from http://www.teranews.com ** -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Aug 5, 11:39 am, Fred Mangusta <[EMAIL PROTECTED]> wrote: > Hi, > > I would like to delete all the instances of a '.' into a number. > > In other words I'd like to replace all the instances of a '.' character > with something (say nothing at all) when the '.' is representing a > decimal separator. E.g. > > 500.675 > 500675 > > but also > > 1.000.456.344 > 1000456344 > > I don't care about the fact the the resulting number is difficult to > read: as long as it remains a series of digits it's ok: the important > thing is to get rid of the period, because I want to keep it only where > it marks the end of a sentence. > > I was trying to do like this > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > but I don't know much about regular expressions, and don't know how to > get the two groups of numbers and join them in the sub. Moreover doing > like this I only match things like "345.000" and not "1.000.000". > > What's the correct approach? > I would use look-behind (is it preceded by a digit?) and look-ahead (is it followed by a digit?): s = re.sub(r'(?<=\d)\.(?=\d)', '', s) -- http://mail.python.org/mailman/listinfo/python-list
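A quick sketch of the lookaround version in use, on made-up sentences. Since the lookbehind and lookahead do not consume the digits, consecutive dots such as those in 1.2.3 are all handled, and a period followed by a space or the end of the string is left alone.

```python
import re


def strip_digit_dots(s):
    # Remove a dot only when a digit stands on both sides of it.
    return re.sub(r'(?<=\d)\.(?=\d)', '', s)


print(strip_digit_dots('500.675'))
print(strip_digit_dots('It costs 1.000.456.344 euros. Next sentence.'))
print(strip_digit_dots('1.2.3'))
```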
Re: regex question
Chris wrote: Doesn't work for his use case as he wants to keep periods marking the end of a sentence. Exactly. Thanks to all of you anyway, now I have a better understanding on how to go on :) F. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Aug 5, 2:23 pm, Jeff <[EMAIL PROTECTED]> wrote: > On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote: > > > > > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote: > > > In other words I'd like to replace all the instances of a '.' character > > > with something (say nothing at all) when the '.' is representing a > > > decimal separator. E.g. > > > > 500.675 > 500675 > > > > but also > > > > 1.000.456.344 > 1000456344 > > > > I don't care about the fact the the resulting number is difficult to > > > read: as long as it remains a series of digits it's ok: the important > > > thing is to get rid of the period, because I want to keep it only where > > > it marks the end of a sentence. > > > > I was trying to do like this > > > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > > > but I don't know much about regular expressions, and don't know how to > > > get the two groups of numbers and join them in the sub. Moreover doing > > > like this I only match things like "345.000" and not "1.000.000". > > > > What's the correct approach? > > > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344') > > Out[13]: '1000456344' > > > Ciao, > > Marc 'BlackJack' Rintsch > > Even faster: > > '1.000.456.344'.replace('.', '') => '1000456344' Doesn't work for his use case as he wants to keep periods marking the end of a sentence. -- http://mail.python.org/mailman/listinfo/python-list
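To see the difference between the two quoted suggestions, here is a small comparison on made-up input; the function names are invented. It also shows one limit of the group-based pattern: because each match consumes the digit after the dot, a single digit between two dots leaves the second dot in place.

```python
import re


def drop_all_dots(s):
    # The str.replace suggestion: removes every period, sentence ends included.
    return s.replace('.', '')


def drop_dots_groups(s):
    # The group-based suggestion: digit, dot, digit -> digit, digit.
    return re.sub(r'(\d)\.(\d)', r'\1\2', s)


sample = 'Pay 1.000. Done.'
print(drop_all_dots(sample))
print(drop_dots_groups(sample))
print(drop_dots_groups('1.2.3'))
```

For the thousands-separated numbers in the original question this limit does not show up, since there are always three digits between dots.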
Re: regex question
=) Indeed. But it will replace all dots including ordinary strings instead of numbers only. On Tue, Aug 5, 2008 at 3:23 PM, Jeff <[EMAIL PROTECTED]> wrote: > On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote: > > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote: > > > In other words I'd like to replace all the instances of a '.' character > > > with something (say nothing at all) when the '.' is representing a > > > decimal separator. E.g. > > > > > 500.675 > 500675 > > > > > but also > > > > > 1.000.456.344 > 1000456344 > > > > > I don't care about the fact the the resulting number is difficult to > > > read: as long as it remains a series of digits it's ok: the important > > > thing is to get rid of the period, because I want to keep it only where > > > it marks the end of a sentence. > > > > > I was trying to do like this > > > > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > > > > but I don't know much about regular expressions, and don't know how to > > > get the two groups of numbers and join them in the sub. Moreover doing > > > like this I only match things like "345.000" and not "1.000.000". > > > > > What's the correct approach? > > > > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344') > > Out[13]: '1000456344' > > > > Ciao, > > Marc 'BlackJack' Rintsch > > Even faster: > > '1.000.456.344'.replace('.', '') => '1000456344' > -- > http://mail.python.org/mailman/listinfo/python-list > -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote: > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote: > > In other words I'd like to replace all the instances of a '.' character > > with something (say nothing at all) when the '.' is representing a > > decimal separator. E.g. > > > 500.675 > 500675 > > > but also > > > 1.000.456.344 > 1000456344 > > > I don't care about the fact the the resulting number is difficult to > > read: as long as it remains a series of digits it's ok: the important > > thing is to get rid of the period, because I want to keep it only where > > it marks the end of a sentence. > > > I was trying to do like this > > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > > but I don't know much about regular expressions, and don't know how to > > get the two groups of numbers and join them in the sub. Moreover doing > > like this I only match things like "345.000" and not "1.000.000". > > > What's the correct approach? > > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344') > Out[13]: '1000456344' > > Ciao, > Marc 'BlackJack' Rintsch Even faster: '1.000.456.344'.replace('.', '') => '1000456344' -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
No, that is not a good way, because the example does not handle an arbitrary number of dot-separated blocks (a single digit between two dots, as in 1.2.3, is skipped). But the Python regexp engine supports lookahead (?=pattern) and lookbehind (?<=pattern). In those cases the patterns are not included in the replaced sequence of characters: >>> re.sub(r'(?<=\d)\.(?=\d)', '', '1234.324 abc.100.abc abc.abc') '1234324 abc.100.abc abc.abc' Alexey On Tue, Aug 5, 2008 at 2:10 PM, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]>wrote: > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote: > > > In other words I'd like to replace all the instances of a '.' character > > with something (say nothing at all) when the '.' is representing a > > decimal separator. E.g. > > > > 500.675 > 500675 > > > > but also > > > > 1.000.456.344 > 1000456344 > > > > I don't care about the fact the the resulting number is difficult to > > read: as long as it remains a series of digits it's ok: the important > > thing is to get rid of the period, because I want to keep it only where > > it marks the end of a sentence. > > > > I was trying to do like this > > > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > > > but I don't know much about regular expressions, and don't know how to > > get the two groups of numbers and join them in the sub. Moreover doing > > like this I only match things like "345.000" and not "1.000.000". > > > > What's the correct approach? > > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344') > Out[13]: '1000456344' > > > Ciao, > Marc 'BlackJack' Rintsch > -- > http://mail.python.org/mailman/listinfo/python-list > -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote: > In other words I'd like to replace all the instances of a '.' character > with something (say nothing at all) when the '.' is representing a > decimal separator. E.g. > > 500.675 > 500675 > > but also > > 1.000.456.344 > 1000456344 > > I don't care about the fact the the resulting number is difficult to > read: as long as it remains a series of digits it's ok: the important > thing is to get rid of the period, because I want to keep it only where > it marks the end of a sentence. > > I was trying to do like this > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s) > > but I don't know much about regular expressions, and don't know how to > get the two groups of numbers and join them in the sub. Moreover doing > like this I only match things like "345.000" and not "1.000.000". > > What's the correct approach? In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344') Out[13]: '1000456344' Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
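One caveat with the capturing-group version, sketched below under the assumption of Python 3: each match consumes the digit on both sides of the dot, so a digit cannot serve as the right neighbour of one dot and the left neighbour of the next. The function names are made up for this comparison.

```python
import re

def with_groups(text):
    # Consumes one digit on each side of the dot and puts both back.
    return re.sub(r'(\d)\.(\d)', r'\1\2', text)

def with_lookaround(text):
    # Only the dot is consumed; the neighbouring digits are just checked.
    return re.sub(r'(?<=\d)\.(?=\d)', '', text)

# Three-digit blocks work either way.
print(with_groups('1.000.456.344'))

# A single digit between two dots is where the two versions differ.
print(with_groups('1.2.3.4'))
print(with_lookaround('1.2.3.4'))
```

For thousands separators with three-digit blocks the two give the same answer; the difference only shows up when a block is a single digit.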
regex question
Hi, I would like to delete all the instances of a '.' into a number. In other words I'd like to replace all the instances of a '.' character with something (say nothing at all) when the '.' is representing a decimal separator. E.g. 500.675 > 500675 but also 1.000.456.344 > 1000456344 I don't care about the fact the the resulting number is difficult to read: as long as it remains a series of digits it's ok: the important thing is to get rid of the period, because I want to keep it only where it marks the end of a sentence. I was trying to do like this s=re.sub("[(\d+)(\.)(\d+)]","... ",s) but I don't know much about regular expressions, and don't know how to get the two groups of numbers and join them in the sub. Moreover doing like this I only match things like "345.000" and not "1.000.000". What's the correct approach? Thanks F. -- http://mail.python.org/mailman/listinfo/python-list
Re: Python regex question
Tim van der Leeuw wrote: Hi, I'm trying to create a regular expression for matching some particular XML strings. I want to extract the contents of a particular XML tag, only if it follows one tag, but not follows another tag. Complicating this, is that there can be any number of other tags in between. [...] Sounds like this would be easier to implement using Python's SAX API. Here's a short example that does something similar to what you want to achieve: import xml.sax test_str = """ """ class MyHandler(xml.sax.handler.ContentHandler): def __init__(self): xml.sax.handler.ContentHandler.__init__(self) self.ignore_next = False def startElement(self, name, attrs): if name == "ignore": self.ignore_next = True return elif name == "foo": if not self.ignore_next: # handle the element you're interested in here print "MY ELEMENT", name, "with", dict(attrs) self.ignore_next = False xml.sax.parseString(test_str, MyHandler()) In this case, this looks much clearer and easier to understand to me than regular expressions. -- Gerhard -- http://mail.python.org/mailman/listinfo/python-list
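The test string in the message above did not survive archiving, so here is a Python 3 sketch of the same SAX idea with a made-up document. The element names `ignore` and `foo` follow the handler in the message; the sample XML, the class name `FooCollector`, and the `found` list are assumptions for illustration only.

```python
import xml.sax
import xml.sax.handler

# Hypothetical input; the original test string was lost from the archive.
TEST_XML = """<root>
  <foo id="1"/>
  <ignore/>
  <bar/>
  <foo id="2"/>
  <foo id="3"/>
</root>"""

class FooCollector(xml.sax.handler.ContentHandler):
    """Collect attributes of <foo> elements not preceded by <ignore>."""

    def __init__(self):
        super().__init__()
        self.ignore_next = False
        self.found = []

    def startElement(self, name, attrs):
        if name == "ignore":
            # Remember that the next <foo> must be skipped.
            self.ignore_next = True
        elif name == "foo":
            if not self.ignore_next:
                self.found.append(dict(attrs))
            # The flag applies to one <foo> only.
            self.ignore_next = False

handler = FooCollector()
xml.sax.parseString(TEST_XML.encode("utf-8"), handler)
print(handler.found)
```

Other elements between `<ignore>` and the next `<foo>` do not reset the flag, which is the "any number of other tags in between" case from the question.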
Python regex question
Hi, I'm trying to create a regular expression for matching some particular XML strings. I want to extract the contents of a particular XML tag, only if it follows one tag, but not follows another tag. Complicating this, is that there can be any number of other tags in between. So basically, my regular expression should have 3 parts: - first match - any random text, that should not contain string '.*?(?P\d+)' (hopefully without typos) Here '' is my first match, and '(?P\d+)' is my second match. In this expression, I want to change the generic '.*?', which matches everything, with something that matches every string that does not include the substring '-- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Feb 13, 6:53 am, mathieu <[EMAIL PROTECTED]> wrote: > I do not understand what is wrong with the following regex expression. > I clearly mark that the separator in between group 3 and group 4 > should contain at least 2 white space, but group 3 is actually reading > 3 +4 > > Thanks > -Mathieu > > import re > > line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings > Auto Window Width SL 1 " > patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ > -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s* > $") I love the smell of regex'es in the morning! For more legible posting (and general maintainability), try breaking up your quoted strings like this: line = \ " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings " \ "Auto Window Width SL 1 " patt = re.compile( "^\s*" "\(" "([0-9A-Z]+)," "([0-9A-Zx]+)" "\)\s+" "([A-Za-z0-9./:_ -]+)\s\s+" "([A-Za-z0-9 ()._,/#>-]+)\s+" "([A-Z][A-Z]_?O?W?)\s+" "([0-9n-]+)\s*$") Of course, the problem is that you have a greedy match in the part of the regex that is supposed to stop between "Settings" and "Auto". Change patt to: patt = re.compile( "^\s*" "\(" "([0-9A-Z]+)," "([0-9A-Zx]+)" "\)\s+" "([A-Za-z0-9./:_ -]+?)\s\s+" "([A-Za-z0-9 ()._,/#>-]+)\s+" "([A-Z][A-Z]_?O?W?)\s+" "([0-9n-]+)\s*$") or if you prefer: patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s* $") It looks like you wrote this regex to process this specific input string - it has a fragile feel to it, as if you will have to go back and tweak it to handle other data that might come along, such as (xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin) 80 SL 1 Just out of curiosity, I wondered what a pyparsing version of this would look like. 
See below: from pyparsing import Word,hexnums,delimitedList,printables,\ White,Regex,nums line = \ " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings " \ "Auto Window Width SL 1 " # define fields hexint = Word(hexnums+"x") text = delimitedList(Word(printables), delim=White(" ",exact=1), combine=True) type_label = Regex("[A-Z][A-Z]_?O?W?") int_label = Word(nums+"n-") # define line structure - give each field a name line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \ text("desc") + text("window") + type_label("type") + \ int_label("int") line_parts = line_defn.parseString(line) print line_parts.dump() print line_parts.desc Prints: ['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1'] - desc: Siemens: Thorax/Multix FD Lab Settings - int: 1 - type: SL - window: Auto Window Width - x: 0021 - y: xx0A Siemens: Thorax/Multix FD Lab Settings I was just guessing on the field names, but you can see where they are defined and change them to the appropriate values. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Feb 13, 1:53 pm, mathieu <[EMAIL PROTECTED]> wrote: > I do not understand what is wrong with the following regex expression. > I clearly mark that the separator in between group 3 and group 4 > should contain at least 2 white space, but group 3 is actually reading > 3 +4 > > Thanks > -Mathieu > > import re > > line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings > Auto Window Width SL 1 " > patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ > -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s* > $") > m = patt.match(line) > if m: > print m.group(3) > print m.group(4) I don't know if it solves your problem, but if you want to match a dash (-), then it must either be escaped or be the first or last element in a character class. Gerard -- http://mail.python.org/mailman/listinfo/python-list
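A quick check of how Python's `re` treats a dash inside a character class, as a small Python 3 sketch. A dash between two characters forms a range; a dash placed first, placed last, or escaped is a literal. The variable names are only for illustration.

```python
import re

text = "a-b-c"

# Dash between two characters: the range a..c, no literal dash.
as_range = re.findall(r"[a-c]", text)

# Dash in last position, first position, or escaped: a literal dash.
trailing = re.findall(r"[ac-]", text)
leading = re.findall(r"[-ac]", text)
escaped = re.findall(r"[a\-c]", text)

print(as_range, trailing, leading, escaped)
```

The classes in the pattern under discussion all end with the dash, so they match a literal dash as written.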
Re: regex question
mathieu, stop writing complex REs like obfuscated toys, use the re.VERBOSE flag and split that RE into several commented and *indented* lines (indented just like Python code), the indentation level has to be used to denote nesting. With that you may be able to solve the problem by yourself. If not, you can offer us a much more readable thing to fix. Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
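A sketch of what the `re.VERBOSE` layout suggested above could look like for the pattern in this thread, assuming Python 3. Two things here are assumptions: the sample line is given two spaces between the description and the window label (the archive collapsed the original spacing), and the third and fourth groups are made non-greedy, as suggested elsewhere in the thread. The comments naming each field are guesses.

```python
import re

# Assumption: two spaces separate "Settings" from "Auto" and "Width" from "SL".
line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings  Auto Window Width  SL 1 "

# In VERBOSE mode, whitespace outside character classes is ignored and
# '#' starts a comment, so each piece can sit on its own line.
patt = re.compile(r"""
    ^\s*
    \( ([0-9A-Z]+) , ([0-9A-Zx]+) \)   # tag pair in parentheses
    \s+
    ([A-Za-z0-9./:_ -]+?)              # description, non-greedy
    \s\s+                              # separator: two or more whitespace
    ([A-Za-z0-9 ()._,/#>-]+?)          # window label, non-greedy
    \s+
    ([A-Z][A-Z]_?O?W?)                 # two-letter type code
    \s+
    ([0-9n-]+)                         # count
    \s*$
""", re.VERBOSE)

m = patt.match(line)
print(m.groups())
```

Spaces and `#` inside the square brackets are still literal in verbose mode, so the character classes can be copied over unchanged.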
Re: regex question
Hey Mathieu, due to word wrap I'm not sure what you want to do. What result do you expect? I get: >>> print m.groups() ('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window Width ', ' ', 'SL', '1') But only when I insert a space in the 3rd char group (I'm not sure if your original pattern has a space there or not). So the third group is: ([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not match the line. I also can't tell what the format of your line is. If it is like this: line = "...Siemens: Thorax/Multix FD Lab Settings Auto Window Width..." where "Auto Window Width" should be the 4th group, you have to mark the + in the 3rd group as non-greedy (it's done with a "?"): http://docs.python.org/lib/re-syntax.html ([A-Za-z0-9./:_ -]+?) With that I get: >>> patt.match(line).groups() ('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width ', 'SL', '1') Which is probably what you want. You can also add the non-greedy marker in the fourth group, to get rid of the trailing spaces. HTH Wanja mathieu wrote: > I clearly mark that the separator in between group 3 and group 4 > should contain at least 2 white space, but group 3 is actually reading > 3 +4 -- http://mail.python.org/mailman/listinfo/python-list
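The greedy versus non-greedy difference is easier to see on a toy string. The following Python 3 sketch uses a made-up input with two double-space separators; it is not the pattern from the thread, just the same mechanism in miniature.

```python
import re

# Hypothetical input: fields separated by two spaces, words by one.
text = "alpha beta  gamma  delta"

# Greedy: the first group grabs as much as it can and only gives back
# enough for the rest to match, so it swallows the first separator.
greedy = re.match(r"([a-z ]+)\s\s+(.*)$", text)

# Non-greedy: the first group stops at the first place where the
# two-space separator can match.
lazy = re.match(r"([a-z ]+?)\s\s+(.*)$", text)

print(greedy.groups())
print(lazy.groups())
```

This is the "group 3 is reading 3 + 4" effect: because the character class allows spaces, a greedy group is free to run past the first separator as long as a later one exists.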
regex question
I do not understand what is wrong with the following regex expression. I clearly mark that the separator in between group 3 and group 4 should contain at least 2 white space, but group 3 is actually reading 3 +4 Thanks -Mathieu import re line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings Auto Window Width SL 1 " patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s* $") m = patt.match(line) if m: print m.group(3) print m.group(4) -- http://mail.python.org/mailman/listinfo/python-list
Re: a newbie regex question
"Dotan Cohen" <[EMAIL PROTECTED]> wrote: > Maybe you mean: > for match in re.finditer(r'\([A-Z].+[a-z])\', contents): > Note the last backslash was in the wrong place. The location of the backslash in the orignal reply is correct, it is there to escape the closing paren, which is a special character: >>> import re >>> s='Abcd\nabc (Ab), (ab)' >>> re.findall(r'\([A-Z].+[a-z]\)', s) ['(Ab), (ab)'] Putting the backslash at the end of the string like you indicated results in a syntax error, as it escapes the closing single quote of the raw string literal: >>> re.findall(r'\([A-Z].+[a-z])\', s) SyntaxError: EOL while scanning single-quoted string >>> max -- http://mail.python.org/mailman/listinfo/python-list
Re: a newbie regex question
On 24/01/2008, Jonathan Gardner <[EMAIL PROTECTED]> wrote: > On Jan 24, 12:14 pm, Shoryuken <[EMAIL PROTECTED]> wrote: > > Given a regular expression pattern, for example, \([A-Z].+[a-z]\), > > > > print out all strings that match the pattern in a file > > > > Anyone tell me a way to do it? I know it's easy, but i'm completely > > new to python > > > > thanks alot > > You may want to read the pages on regular expressions in the online > documentation: http://www.python.org/doc/2.5/lib/module-re.html > > The simple approach works: > > import re > > # Open the file > f = file('/your/filename.txt') > > # Read the file into a single string. > contents = f.read() > > # Find all matches in the string of the regular expression and > iterate through them. > for match in re.finditer(r'\([A-Z].+[a-z]\)', contents): > # Print what was matched > print match.group() Maybe you mean: for match in re.finditer(r'\([A-Z].+[a-z])\', contents): Note the last backslash was in the wrong place. Dotan Cohen http://what-is-what.com http://gibberish.co.il א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? -- http://mail.python.org/mailman/listinfo/python-list
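A small Python 3 sketch of what happens with each placement of the backslash. Since a file containing the second form would not even parse, the sketch feeds it to `compile()` as text; the variable names are made up for illustration.

```python
import re

s = 'Abcd\nabc (Ab), (ab)'

# Backslash before the closing paren: "\)" matches a literal ")".
matches = re.findall(r'\([A-Z].+[a-z]\)', s)
print(matches)

# Backslash moved to the very end: inside a raw string, a backslash
# followed by the quote character does not close the string, so the
# literal never ends. The source text is compiled here to show that.
source = r"r'\([A-Z].+[a-z])\'"
try:
    compile(source, "<example>", "eval")
    raised = False
except SyntaxError:
    raised = True
print(raised)
```

So the original placement is the working one: the backslash belongs to the parenthesis, not to the quote.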
Re: a newbie regex question
On Jan 24, 12:14 pm, Shoryuken <[EMAIL PROTECTED]> wrote:
> Given a regular expression pattern, for example, \([A-Z].+[a-z]\),
>
> print out all strings that match the pattern in a file
>
> Anyone tell me a way to do it? I know it's easy, but i'm completely
> new to python
>
> thanks alot

You may want to read the pages on regular expressions in the online documentation: http://www.python.org/doc/2.5/lib/module-re.html

The simple approach works:

    import re

    # Open the file
    f = file('/your/filename.txt')

    # Read the file into a single string.
    contents = f.read()

    # Find all matches in the string of the regular expression and
    # iterate through them.
    for match in re.finditer(r'\([A-Z].+[a-z]\)', contents):
        # Print what was matched
        print match.group()

-- http://mail.python.org/mailman/listinfo/python-list
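The code above is Python 2 (`file()`, the `print` statement). A Python 3 sketch of the same approach follows; the function name `find_matches` and the throwaway sample file are made up so that the example is self-contained.

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r'\([A-Z].+[a-z]\)')

def find_matches(path):
    """Return every substring of the file at `path` that matches PATTERN."""
    contents = Path(path).read_text(encoding="utf-8")
    return [m.group() for m in PATTERN.finditer(contents)]

# Demo on a temporary file with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "first (Hello world) line\nsecond (nope) line\n(Xyz)\n",
        encoding="utf-8",
    )
    matches = find_matches(sample)

print(matches)
```

Without `re.DOTALL`, `.` does not match a newline, so each match stays within one line even though the whole file is read into a single string.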
a newbie regex question
Given a regular expression pattern, for example, \([A-Z].+[a-z]\), print out all strings that match the pattern in a file Anyone tell me a way to do it? I know it's easy, but i'm completely new to python thanks alot -- http://mail.python.org/mailman/listinfo/python-list
Re: python/regex question... hope someone can help
En Sun, 09 Dec 2007 16:45:53 -0300, charonzen <[EMAIL PROTECTED]> escribió: >> [John Machin] Another suggestion is to ensure that the job >> specification is not >> overly simplified. How did you parse the text into "words" in the >> prior exercise that produced the list of bigrams? Won't you need to >> use the same parsing method in the current exercise of tagging the >> bigrams with an underscore? > > Thank you John, that definitely puts things in perspective! I'm very > new to both Python and text parsing, and I often feel that I can't see > the forest for the trees. If you're asking, I'm working on a project > that utilizes Church's mutual information score. I tokenize my text, > split it into a list, derive some unigram and bigram dictionaries, and > then calculate a pmi dictionary based on x,y from the bigrams and > unigrams. The bigrams that pass my threshold then get put into my > list of x_y strings, and you know the rest. By modifying the original > text file, I can view 'x_y', z pairs as x,y and iterate it until I > have some collocations that are worth playing with. So I think that > covers the question the same parsing method. I'm sure there are more > pythonic ways to do it, but I'm on deadline :) Looks like you should work with the list of tokens, collapsing consecutive elements, not with the original text. Should be easier, and faster because you don't regenerate the text and tokenize it again and again. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
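A sketch of the token-list idea suggested above, assuming Python 3 and assuming the text has already been tokenized into a list of words. The function name `merge_bigrams` and the left-to-right pairing rule are choices made for this example, not something stated in the thread.

```python
def merge_bigrams(tokens, bigrams):
    """Collapse adjacent tokens x, y into 'x_y' when 'x_y' is in bigrams."""
    wanted = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            candidate = tokens[i] + "_" + tokens[i + 1]
            if candidate in wanted:
                # Consume both tokens and emit the joined form.
                merged.append(candidate)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged

tokens = "the quick brown fox jumped over the lazy dogs".split()
result = merge_bigrams(tokens, ["quick_brown", "brown_fox", "lazy_dogs"])
print(result)
```

Because the scan goes left to right and consumes both tokens, "brown" is used up by `quick_brown` and `brown_fox` is not formed in the same pass; running the function again on the result is one way to build longer units, in the spirit of the iteration described in the quoted message.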
Re: python/regex question... hope someone can help
> Another suggestion is to ensure that the job specification is not > overly simplified. How did you parse the text into "words" in the > prior exercise that produced the list of bigrams? Won't you need to > use the same parsing method in the current exercise of tagging the > bigrams with an underscore? > > Cheers, > John Thank you John, that definitely puts things in perspective! I'm very new to both Python and text parsing, and I often feel that I can't see the forest for the trees. If you're asking, I'm working on a project that utilizes Church's mutual information score. I tokenize my text, split it into a list, derive some unigram and bigram dictionaries, and then calculate a pmi dictionary based on x,y from the bigrams and unigrams. The bigrams that pass my threshold then get put into my list of x_y strings, and you know the rest. By modifying the original text file, I can view 'x_y', z pairs as x,y and iterate it until I have some collocations that are worth playing with. So I think that covers the question the same parsing method. I'm sure there are more pythonic ways to do it, but I'm on deadline :) Thanks again! Brandon -- http://mail.python.org/mailman/listinfo/python-list
Re: python/regex question... hope someone can help
On Dec 9, 6:13 pm, charonzen <[EMAIL PROTECTED]> wrote:

The following *may* come close to doing what your revised spec requires:

    import re

    def ch_replace2(alist, text):
        for bigram in alist:
            pattern = r'\b' + bigram.replace('_', ' ') + r'\b'
            text = re.sub(pattern, bigram, text)
        return text

Cheers,
John
-- http://mail.python.org/mailman/listinfo/python-list
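A possible variation on the word-boundary approach above, as a Python 3 sketch. It adds two things that are not in the message: `re.escape`, in case a token contains regex metacharacters, and `\s+` between the words, so that runs of whitespace or a line break also match. The name `tag_bigrams` is made up.

```python
import re

def tag_bigrams(bigrams, text):
    """Replace 'x y' with 'x_y' for each bigram, matching whole words only."""
    for bigram in bigrams:
        words = bigram.split('_')
        pattern = r'\b' + r'\s+'.join(re.escape(w) for w in words) + r'\b'
        # A function replacement avoids any backslash processing of bigram.
        text = re.sub(pattern, lambda m, b=bigram: b, text)
    return text

print(tag_bigrams(['red_herring'], 'He prepared herring fillets.'))
print(tag_bigrams(['red_herring'], 'A red herring.'))
print(tag_bigrams(
    ['quick_brown', 'lazy_dogs', 'brown_fox'],
    'The quick brown fox jumped over the lazy dogs.',
))
```

The underscore counts as a word character, so once "quick brown" has become "quick_brown" there is no word boundary in front of "brown" and `brown_fox` no longer matches. Whether that is the desired behaviour depends on how the bigrams were produced in the first place.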
Re: python/regex question... hope someone can help
On Dec 9, 6:13 pm, charonzen <[EMAIL PROTECTED]> wrote:
> I have a list of strings. These strings are previously selected
> bigrams with underscores between them ('and_the', 'nothing_given', and
> so on). I need to write a regex that will read another text string
> that this list was derived from and replace selections in this text
> string with those from my list. So in my text string, '... and the...
> ' becomes ' ... and_the...'. I can't figure out how to manipulate
>
> re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring)
>
> Any suggestions?

The usual suggestion is: Don't bother with regexes when simple string methods will do the job.

>>> def ch_replace(alist, text):
...     for bigram in alist:
...         original = bigram.replace('_', ' ')
...         text = text.replace(original, bigram)
...     return text
...
>>> print ch_replace(
...     ['quick_brown', 'lazy_dogs', 'brown_fox'],
...     'The quick brown fox jumped over the lazy dogs.'
... )
The quick_brown_fox jumped over the lazy_dogs.
>>> print ch_replace(['red_herring'], 'He prepared herring fillets.')
He prepared_herring fillets.
>>>

Another suggestion is to ensure that the job specification is not overly simplified. How did you parse the text into "words" in the prior exercise that produced the list of bigrams? Won't you need to use the same parsing method in the current exercise of tagging the bigrams with an underscore?

Cheers,
John
-- http://mail.python.org/mailman/listinfo/python-list
python/regex question... hope someone can help
I have a list of strings. These strings are previously selected bigrams with underscores between them ('and_the', 'nothing_given', and so on). I need to write a regex that will read another text string that this list was derived from and replace selections in this text string with those from my list. So in my text string, '... and the... ' becomes ' ... and_the...'. I can't figure out how to manipulate re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring) Any suggestions? Thank you if you can help! -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
On 15:25 Thu 04 Oct , Robert Dailey wrote: > I am not a regex expert, I simply assumed regex was standardized to follow > specific guidelines. There are as many different regex flavours as there are Linux distros. Each follows the basic rules but implements them slightly differently and adds their own 'extensions'. > I also made the assumption that this was a good place > to pose the question since regular expressions are a feature of Python. The best place to pose a regex question is in the sphere of usage, i.e. Perl regexes differ hugely in implementation from OO langs like Python or Java, while shells like bash or zsh use regexes slightly differently, as do shell scripting languages like awk or sed. > The question concerned regular expressions in general, not really the > application. However, now that I know that regex can be different, I'll try > to contact the author directly to find out the dialect and then find the > appropriate location for my question from there. I do appreciate everyone's > help. I've tried the various suggestions offered here, however none of them > work. I can only assume at this point that this regex is drastically > different or the application reading the regex is just broken. If you care to PM me with details of the language/context I will try to help but I am no expert. Regards, John -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
I am not a regex expert, I simply assumed regex was standardized to follow specific guidelines. I also made the assumption that this was a good place to pose the question since regular expressions are a feature of Python. The question concerned regular expressions in general, not really the application. However, now that I know that regex can be different, I'll try to contact the author directly to find out the dialect and then find the appropriate location for my question from there. I do appreciate everyone's help. I've tried the various suggestions offered here, however none of them work. I can only assume at this point that this regex is drastically different or the application reading the regex is just broken. Thanks again for everyones help! -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
[sigh...replying to my own post] > However, things to try: > > - sometimes the grouping parens need to be escaped with "\" > > - sometimes "\w" isn't a valid character class, so use the > long-hand variant of something like "[a-zA-Z0-9_]" > > - sometimes the "+" is escaped with a "\" > > - if you don't use raw strings, you'll need to escape your "\" > characters, making each instance "\\" Just to be clear...these are some variants you may find in non-python regexps (or in python regexps if you're not using raw strings) -tkc -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
>>> try @param\[(in|out)\] \w+ >>> >> This didn't work either :( >> >> The tool using this regular expression (Comment Reflower for VS2005) May be >> broken... > > How about @param\[[i|o][n|u]t*\]\w+ ? ...if you want to accept patterns like @param[iutt]xxx ... The regexp at the top (Adam's original reply) would be the valid regexp in python and matches all the tests thrown at it, assuming it's placed in a raw string: r = re.compile(r"@param\[(in|out)\] \w+") If it's not a python regexp, this isn't really the list for the question, is it? ;) However, things to try: - sometimes the grouping parens need to be escaped with "\" - sometimes "\w" isn't a valid character class, so use the long-hand variant of something like "[a-zA-Z0-9_]] - sometimes the "+" is escaped with a "\" - if you don't use raw strings, you'll need to escape your "\" characters, making each instance "\\" HTH, -tkc -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote: > On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote: > > > > try @param\[(in|out)\] \w+ > > > > This didn't work either :( > > The tool using this regular expression (Comment Reflower for VS2005) May be > broken... > > -- > http://mail.python.org/mailman/listinfo/python-list > How about @param\[[i|o][n|u]t*\]\w+ ? -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
> As far as the dialect, I can't be sure. I am unable to find documentation > for Comment Reflower and thus cannot figure out what type of regex it is > using. What exactly do you mean by your question, "are you using raw > strings?". Thanks for your response and I apologize for the lack of detail. Comment Reflower appears to be a plugin for Visual Studio written in C#. As far as I can tell, it has nothing to do with Python at all. A quick look at their sourceforge page (http://sourceforge.net/projects/commentreflower/) doesn't show any mailing lists or discussion groups. Maybe try emailing the author directly, or asking a C# language group about whatever the standard C# regular expression library is. -- Jerry -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
On 10/4/07, J. Clifford Dyer <[EMAIL PROTECTED]> wrote: > > You *are* talking about python regular expressions, right? There are a > number of different dialects. Also, there could be issues with the quoting > method (are you using raw strings?) > > The more specific you can get, the more we can help you. As far as the dialect, I can't be sure. I am unable to find documentation for Comment Reflower and thus cannot figure out what type of regex it is using. What exactly do you mean by your question, "are you using raw strings?". Thanks for your response and I apologize for the lack of detail. -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
You *are* talking about python regular expressions, right? There are a number of different dialects. Also, there could be issues with the quoting method (are you using raw strings?) The more specific you can get, the more we can help you. Cheers, Cliff On Thu, Oct 04, 2007 at 11:54:32AM -0500, Robert Dailey wrote regarding Re: RegEx question: > >On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote: > > try @param\[(in|out)\] \w+ > >This didn't work either :( >The tool using this regular expression (Comment Reflower for VS2005) >May be broken... > > References > >1. mailto:[EMAIL PROTECTED] > -- > http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote: > > > try @param\[(in|out)\] \w+ > This didn't work either :( The tool using this regular expression (Comment Reflower for VS2005) May be broken... -- http://mail.python.org/mailman/listinfo/python-list
Re: RegEx question
On Thu, 2007-10-04 at 10:58 -0500, Robert Dailey wrote: > It should also match: > > @param[out] state Some description of this variable > > > On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote: > Hi, > > The following regex (Not including the end quotes): > > "@param\[in|out\] \w+ " > > Should match any of the following: > > @param[in] variable > @param[out] state > @param[in] foo > @param[out] bar > > > Correct? (Note the trailing whitespace in the regex as well as > in the examples) > > -- > http://mail.python.org/mailman/listinfo/python-list try @param\[(in|out)\] \w+ -- http://mail.python.org/mailman/listinfo/python-list
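Why the parentheses matter, as a Python 3 sketch (this says nothing about the dialect Comment Reflower uses, which is unknown in this thread). Alternation has the lowest precedence, so without a group the `|` splits the whole pattern in two. The names `broken` and `fixed` are just labels for this comparison.

```python
import re

# Without a group this reads as:  "@param\[in"  OR  "out\] \w+"
broken = re.compile(r"@param\[in|out\] \w+")

# With a group the alternation is confined to "in" or "out".
fixed = re.compile(r"@param\[(in|out)\] \w+")

samples = ["@param[in] variable", "@param[out] state"]

broken_hits = [broken.search(s).group() for s in samples]
fixed_hits = [fixed.search(s).group() for s in samples]

print(broken_hits)
print(fixed_hits)
```

The ungrouped pattern still finds something in each sample, which can make it look as if it half works; it just never matches the whole `@param[...] name` text.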
Re: RegEx question
It should also match: @param[out] state Some description of this variable On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote: > > Hi, > > The following regex (Not including the end quotes): > > "@param\[in|out\] \w+ " > > Should match any of the following: > > @param[in] variable > @param[out] state > @param[in] foo > @param[out] bar > > > Correct? (Note the trailing whitespace in the regex as well as in the > examples) > -- http://mail.python.org/mailman/listinfo/python-list
RegEx question
Hi, The following regex (Not including the end quotes): "@param\[in|out\] \w+ " Should match any of the following: @param[in] variable @param[out] state @param[in] foo @param[out] bar Correct? (Note the trailing whitespace in the regex as well as in the examples) -- http://mail.python.org/mailman/listinfo/python-list
Re: Python Regex Question
> re.search(expr, string) compiles and searches every time. This can > potentially be more expensive in calculating power. especially if you > have to use the expression a lot of times. The re module-level helper functions cache expressions and their compiled form in a dict. They are only compiled once. The main overhead would be for repeated dict lookups. See sre.py (included from re.py) for more details. /usr/lib/python2.4/sre.py -- http://mail.python.org/mailman/listinfo/python-list
Re: Python Regex Question
crybaby wrote: > On Sep 20, 4:12 pm, Tobiah <[EMAIL PROTECTED]> wrote: >> [EMAIL PROTECTED] wrote: >>> I need to extract the number on each >> i.e 49.950 from the following: >>> 49.950 >>> The actual number between: 49.950 can be any number of >>> digits before decimal and after decimal. >>> ##. >>> How can I just extract the real/integer number using regex? >> '[0-9]*\.[0-9]*' >> >> -- >> Posted via a free Usenet account fromhttp://www.teranews.com > > I am trying to use BeautifulSoup: > > soup = BeautifulSoup(page) > > td_tags = soup.findAll('td') > i=0 > for td in td_tags: > i = i+1 > print "td: ", td > # re.search('[0-9]*\.[0-9]*', td) > price = re.compile('[0-9]*\.[0-9]*').search(td) > > I am getting an error: > >price= re.compile('[0-9]*\.[0-9]*').search(td) > TypeError: expected string or buffer > > Does beautiful soup returns array of objects? If so, how do I pass > "td" instance as string to re.search? What is the different between > re.search vs re.compile().search? > I don't know anything about BeautifulSoup, but to the other questions: var=re.compile(regexpr) compiles the expression and after that you can use var as the reference to that compiled expression (costs less) re.search(expr, string) compiles and searches every time. This can potentially be more expensive in calculating power. especially if you have to use the expression a lot of times. The way you use it it doesn't matter. do: pattern = re.compile('[0-9]*\.[0-9]*') result = pattern.findall(your tekst here) Now you can reuse pattern. Cheers, Ivo. -- http://mail.python.org/mailman/listinfo/python-list
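For comparison, a Python 3 sketch of the two calling styles discussed in this thread. Both return the same matches; the only difference is whether the pattern object is looked up on each call or kept in a variable. The cell strings and function names are made up, and the pattern here requires digits on both sides of the dot, which is slightly stricter than the one quoted above.

```python
import re

NUMBER = re.compile(r'[0-9]+\.[0-9]+')

def prices_precompiled(cells):
    """Search each cell with a pattern object compiled once."""
    hits = (NUMBER.search(cell) for cell in cells)
    return [m.group() for m in hits if m]

def prices_module_level(cells):
    """Search each cell through the module-level helper."""
    hits = (re.search(r'[0-9]+\.[0-9]+', cell) for cell in cells)
    return [m.group() for m in hits if m]

# Hypothetical cell contents, already converted to plain strings.
cells = ["<td>49.950</td>", "<td>n/a</td>", "<td>12.5</td>"]
print(prices_precompiled(cells))
print(prices_module_level(cells))
```

Either way the functions need strings as input, which is what the `TypeError: expected string or buffer` in the quoted message is complaining about.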
Re: Python Regex Question
On Sep 20, 4:12 pm, Tobiah <[EMAIL PROTECTED]> wrote: > [EMAIL PROTECTED] wrote: > > I need to extract the number on each > > i.e 49.950 from the following: > > > 49.950 > > > The actual number between: 49.950 can be any number of > > digits before decimal and after decimal. > > > ##. > > > How can I just extract the real/integer number using regex? > > '[0-9]*\.[0-9]*' > > -- > Posted via a free Usenet account fromhttp://www.teranews.com I am trying to use BeautifulSoup: soup = BeautifulSoup(page) td_tags = soup.findAll('td') i=0 for td in td_tags: i = i+1 print "td: ", td # re.search('[0-9]*\.[0-9]*', td) price = re.compile('[0-9]*\.[0-9]*').search(td) I am getting an error: price= re.compile('[0-9]*\.[0-9]*').search(td) TypeError: expected string or buffer Does beautiful soup returns array of objects? If so, how do I pass "td" instance as string to re.search? What is the different between re.search vs re.compile().search? -- http://mail.python.org/mailman/listinfo/python-list
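The TypeError above comes from handing the regex something that is not a string. A hedged sketch of the usual fix, assuming Python 3: convert the object to text with str() before searching. FakeTag below is a hypothetical stand-in for a parsed td element, not BeautifulSoup's own class, so that the example runs without the library installed.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed <td> element (not a BeautifulSoup class)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


PRICE = re.compile(r"[0-9]+\.[0-9]+")


def extract_price(td):
    """Return the first decimal number found in the tag's text form, or None."""
    match = PRICE.search(str(td))  # str() turns the object into text first
    return match.group(0) if match else None


print(extract_price(FakeTag("<td>49.950</td>")))
print(extract_price(FakeTag("<td>n/a</td>")))
```

Passing the FakeTag object straight to PRICE.search() raises TypeError, which mirrors the error reported in the post.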
Re: Python Regex Question
[EMAIL PROTECTED] wrote: >I need to extract the number on each >i.e 49.950 from the following: > > 49.950 > >The actual number between: 49.950 can be any number of >digits before decimal and after decimal. > > ##. > >How can I just extract the real/integer number using regex? > > > If all the td's content has the [value_to_extract] pattern, things goes simplest [untested] /http://mail.python.org/mailman/listinfo/python-list
Re: Python Regex Question
[EMAIL PROTECTED] wrote: > I need to extract the number on each > i.e 49.950 from the following: > > 49.950 > > The actual number between: 49.950 can be any number of > digits before decimal and after decimal. > > ##. > > How can I just extract the real/integer number using regex? > '[0-9]*\.[0-9]*' -- Posted via a free Usenet account from http://www.teranews.com -- http://mail.python.org/mailman/listinfo/python-list
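One caveat with '[0-9]*\.[0-9]*' is that both digit runs are optional, so a lone '.' matches while an integer without a decimal point does not. A small sketch, assuming Python 3; STRICT is an assumed alternative pattern, not something proposed in the thread, and the sample row is invented.

```python
import re

LOOSE = re.compile(r"[0-9]*\.[0-9]*")        # pattern suggested in the thread
STRICT = re.compile(r"[0-9]+(?:\.[0-9]+)?")  # assumed alternative: digits, optional fraction

row = "<td>49.950</td><td>7</td><td>.</td>"

loose_hits = LOOSE.findall(row)
strict_hits = STRICT.findall(row)

print(loose_hits)   # includes the bare "." and misses the integer 7
print(strict_hits)  # the decimal and the integer, nothing else
```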
Python Regex Question
I need to extract the number on each 49.950 The actual number between: 49.950 can be any number of digits before decimal and after decimal. ##. How can I just extract the real/integer number using regex? -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Python REGEX Question
johnny <[EMAIL PROTECTED]> wrote:
> I need to get the content inside the bracket.
> eg. some characters before bracket (3.12345).
> I need to get whatever inside the (), in this case 3.12345.
> How do you do this with python regular expression?

I'm going to presume that you mean something like: I want to extract floating point numerics from parentheses embedded in other, arbitrary, text. Something like:

    >>> given='adfasdfafd(3.14159265)asdfasdfadsfasf'
    >>> import re
    >>> mymatch = re.search(r'\(([0-9.]+)\)', given).groups()[0]
    >>> mymatch
    '3.14159265'
    >>>

Of course, as with any time you're contemplating the use of regular expressions, there are lots of questions to consider about the exact requirements here. What if there is more than one such pattern? Do you only want the first match per line (or other string)? (That's all my example will give you). What if there are no matches? My example will raise an AttributeError (since re.search will return None rather than a match object, and naturally None has no .groups() method).

The following might work better, since re.findall returns a (possibly empty) list of the captured strings:

    >>> mymatches = re.findall(r'\(([0-9.]+)\)', given)
    >>> if mymatches:
    ...     pass # do whatever with the matched strings

... and, of course, you might be better with a compiled regexp if you're going to repeat the search on many strings:

    num_extractor = re.compile(r'\(([0-9.]+)\)')
    for line in myfile:
        for num in num_extractor.findall(line):
            pass # do whatever with all these numbers

-- Jim Dennis, Starshine: Signed, Sealed, Delivered
-- http://mail.python.org/mailman/listinfo/python-list
Re: Simple Python REGEX Question
On Fri, 11 May 2007 08:54:31 -0700, johnny wrote:
> I need to get the content inside the bracket.
>
> eg. some characters before bracket (3.12345).
>
> I need to get whatever inside the (), in this case 3.12345.
>
> How do you do this with python regular expression?

Why would you bother? If you know your string is a bracketed expression, all you need is:

    s = "(3.12345)"
    contents = s[1:-1] # ignore the first and last characters

If your string is more complex:

    s = "lots of things here (3.12345) and some more things here"

then the task is harder. In general, you can't use regular expressions for that, you need a proper parser, because brackets can be nested. But if you don't care about nested brackets, then something like this is easy:

    def get_bracket(s):
        p, q = s.find('('), s.find(')')
        if p == -1 or q == -1:
            raise ValueError("Missing bracket")
        if p > q:
            raise ValueError("Close bracket before open bracket")
        return s[p+1:q] # the slice end is exclusive, so q itself is left out

Or as a one liner with no error checking:

    s[s.find('(')+1:s.find(')')]

-- Steven.
-- http://mail.python.org/mailman/listinfo/python-list
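For the nested case mentioned above, a full parser is not always needed if all you want is the outermost bracketed text. A rough sketch, assuming Python 3 and a hypothetical helper name outer_bracket, that tracks nesting depth with a counter:

```python
def outer_bracket(s):
    """Return the text inside the first balanced (...) pair, or None if there is no '('."""
    start = s.find("(")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                # depth is back to zero: this ')' closes the first '('
                return s[start + 1:i]
    raise ValueError("Missing close bracket")


print(outer_bracket("lots of things here (3.12345) and more"))
print(outer_bracket("f(a(b)c)d"))
```

This ignores quoting and other bracket types, so it is only a starting point for text where round brackets are the sole nesting construct.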
Re: Simple Python REGEX Question
On May 12, 2:21 am, Gary Herron <[EMAIL PROTECTED]> wrote: > johnny wrote: > > I need to get the content inside the bracket. > > > eg. some characters before bracket (3.12345). > > > I need to get whatever inside the (), in this case 3.12345. > > > How do you do this with python regular expression? > > >>> import re > >>> x = re.search("[0-9.]+", "(3.12345)") > >>> print x.group(0) > > 3.12345 > > There's a lot more to the re module, of course. I'd suggest reading the > manual, but this should get you started. > >>> s = "some chars like 987 before the bracket (3.12345) etc" >>> x = re.search("[0-9.]+", s) >>> x.group(0) '987' OP sez: "I need to get the content inside the bracket" OP sez: "I need to get whatever inside the ()" My interpretation: >>> for s in ['foo(123)bar', 'foo(123))bar', 'foo()bar', 'foobar']: ... x = re.search(r"\([^)]*\)", s) ... print repr(x and x.group(0)[1:-1]) ... '123' '123' '' None -- http://mail.python.org/mailman/listinfo/python-list
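The same interpretation can be written with a capturing group instead of slicing the whole match, which avoids the [1:-1] step. A sketch assuming Python 3; inside_brackets and BRACKETED are illustrative names.

```python
import re

BRACKETED = re.compile(r"\(([^)]*)\)")  # group 1 is whatever sits between ( and )


def inside_brackets(s):
    """Return the text inside the first (...) in s, or None if there is none."""
    m = BRACKETED.search(s)
    return m.group(1) if m else None


for s in ["foo(123)bar", "foo(123))bar", "foo()bar", "foobar"]:
    print(repr(inside_brackets(s)))
```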
Re: Simple Python REGEX Question
johnny wrote: > I need to get the content inside the bracket. > > eg. some characters before bracket (3.12345). > > I need to get whatever inside the (), in this case 3.12345. > > How do you do this with python regular expression? > >>> import re >>> x = re.search("[0-9.]+", "(3.12345)") >>> print x.group(0) 3.12345 There's a lot more to the re module, of course. I'd suggest reading the manual, but this should get you started. Gary Herron -- http://mail.python.org/mailman/listinfo/python-list
Simple Python REGEX Question
I need to get the content inside the bracket. eg. some characters before bracket (3.12345). I need to get whatever inside the (), in this case 3.12345. How do you do this with python regular expression? -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Apr 27, 8:26 am, Michael Hoffman <[EMAIL PROTECTED]> wrote: > proctorwrote: > > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > >> On Apr 27, 1:33 am,proctor<[EMAIL PROTECTED]> wrote: > >>> rx_test = re.compile('/x([^x])*x/') > >>> s = '/xabcx/' > >>> if rx_test.findall(s): > >>> print rx_test.findall(s) > >>> > >>> i expect the output to be ['abc'] however it gives me only the last > >>> single character in the group: ['c'] > > >> As Josiah already pointed out, the * needs to be inside the grouping > >> parens. > > so my question remains, why doesn't the star quantifier seem to grab > > all the data. > > Because you didn't use it *inside* the group, as has been said twice. > Let's take a simpler example: > > >>> import re > >>> text = "xabc" > >>> re_test1 = re.compile("x([^x])*") > >>> re_test2 = re.compile("x([^x]*)") > >>> re_test1.match(text).groups() > ('c',) > >>> re_test2.match(text).groups() > ('abc',) > > There are three places that match ([^x]) in text. But each time you find > one you overwrite the previous example. > > > isn't findall() intended to return all matches? > > It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used > a grouping parenthesis in there, it only returns one group from each > pattern. > > Back to my example: > > >>> re_test1.findall("xabcxaaaxabc") > ['c', 'a', 'c'] > > Here it finds multiple matches, but only because the x occurs multiple > times as well. In your example there is only one match. > > > i would expect either 'abc' or 'a', 'b', 'c' or at least just > > 'a' (because that would be the first match). > > You are essentially doing this: > > group1 = "a" > group1 = "b" > group1 = "c" > > After those three statements, you wouldn't expect group1 to be "abc" or > "a". You'd expect it to be "c". > -- > Michael Hoffman thank you all again for helping to clarify this for me. of course you were exactly right, and the problem lay not with python or the text, but with me. 
i mistakenly understood the text to be attempting to capture the C style comment, when in fact it was merely matching it. apologies. sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Apr 27, 8:50 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > On Apr 27, 9:10 am, proctor <[EMAIL PROTECTED]> wrote: > > > > > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > > > > On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: > > > > > hello, > > > > > i have a regex: rx_test = re.compile('/x([^x])*x/') > > > > > which is part of this test program: > > > > > > > > > > import re > > > > > rx_test = re.compile('/x([^x])*x/') > > > > > s = '/xabcx/' > > > > > if rx_test.findall(s): > > > > print rx_test.findall(s) > > > > > > > > > > i expect the output to be ['abc'] however it gives me only the last > > > > single character in the group: ['c'] > > > > > C:\test>python retest.py > > > > ['c'] > > > > > can anyone point out why this is occurring? i can capture the entire > > > > group by doing this: > > > > > rx_test = re.compile('/x([^x]+)*x/') > > > > but why isn't the 'star' grabbing the whole group? and why isn't each > > > > letter 'a', 'b', and 'c' present, either individually, or as a group > > > > (group is expected)? > > > > > any clarification is appreciated! > > > > > sincerely, > > > > proctor > > > > As Josiah already pointed out, the * needs to be inside the grouping > > > parens. > > > > Since re's do lookahead/backtracking, you can also write: > > > > rx_test = re.compile('/x(.*?)x/') > > > > The '?' is there to make sure the .* repetition stops at the first > > > occurrence of x/. > > > > -- Paul > > > i am working through an example from the oreilly book mastering > > regular expressions (2nd edition) by jeffrey friedl. my post was a > > snippet from a regex to match C comments. every 'x' in the regex > > represents a 'star' in actual usage, so that backslash escaping is not > > needed in the example (on page 275). 
it looks like this: > > > === > > > /x([^x]|x+[^/x])*x+/ > > > it is supposed to match '/x', the opening delimiter, then > > > ( > > either anything that is 'not x', > > > or, > > > 'x' one or more times, 'not followed by a slash or an x' > > ) any number of times (the 'star') > > > followed finally by the closing delimiter. > > > === > > > this does not seem to work in python the way i understand it should > > from the book, and i simplified the example in my first post to > > concentrate on just one part of the alternation that i felt was not > > acting as expected. > > > so my question remains, why doesn't the star quantifier seem to grab > > all the data. isn't findall() intended to return all matches? i > > would expect either 'abc' or 'a', 'b', 'c' or at least just > > 'a' (because that would be the first match). why does it give only > > one letter, and at that, the /last/ letter in the sequence?? > > > thanks again for replying! > > > sincerely, > > proctor > > Again, I'll repeat some earlier advice: you need to move the '*' > inside the parens - you are still leaving it outside. Also, get in > the habit of using raw literal notation (that is r"slkjdfljf" instead > of "lsjdlfkjs") when defining re strings - you don't have backslash > issues yet, but you will as soon as you start putting real '*' > characters in your expression. > > However, when I test this, > > restr = r'/x(([^x]|x+[^/])*)x+/' > re_ = re.compile(restr) > print re_.findall("/xabxxcx/ /x123xxx/") > > findall now starts to give a tuple for each "comment", > > [('abxxc', 'xxc'), ('123xx', 'xx')] > > so you have gone beyond my limited re skill, and will need help from > someone else. > > But I suggest you add some tests with multiple consecutive 'x' > characters in the middle of your comment, and multiple consecutive 'x' > characters before the trailing comment. 
In fact, from my > recollections of trying to implement this type of comment recognizer > by hand a long time ago in a job far, far away, test with both even > and odd numbers of 'x' characters. > > -- Paul thanks paul, the reason the regex now give tuples is that there are now 2 groups, the inner and outer parens. so group 1 matches with the star, and group 2 matches without the star. sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
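If only the outer group is wanted, the inner parentheses can be made non-capturing with (?:...), so that findall goes back to returning plain strings. A sketch assuming Python 3, using the book's form of the alternation with [^/x]; COMMENT and comment_bodies are illustrative names.

```python
import re

# Outer group captures the body; the inner alternation is grouped with (?:...)
# so it does not create a second capture.
COMMENT = re.compile(r"/x((?:[^x]|x+[^/x])*)x+/")


def comment_bodies(text):
    """Return the captured body of every /x ... x/ pseudo-comment in text."""
    return COMMENT.findall(text)


print(comment_bodies("/xabxxcx/ /x123xxx/"))
```

Note that with [^/x] any run of 'x' directly before the closing slash is consumed by the final x+ and so is not part of the captured body.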
Re: regex question
proctor <[EMAIL PROTECTED]> wrote: >> >>> re.findall('(.)*', 'abc') >> ['c', ''] > thank you this is interesting. in the second example, where does the > 'nothingness' match, at the end? why does the regex 'run again' when > it has already matched everything? and if it reports an empty match > along with a non-empty match, why only the two? > There are 4 possible starting points for a regular expression to match in a three character string. The regular expression would match at any starting point so in theory you could find 4 possible matches in the string. In this case they would be 'abc', 'bc', 'c', ''. However findall won't get any overlapping matches, so there are only two possible matches and it returns both of them: 'abc' and '' (or rather it returns the matching group within the match, so you only see the 'c' although it matched 'abc'). If you use a regex which doesn't match an empty string (e.g. '/x(.*?)x/') then you won't get the empty match. -- http://mail.python.org/mailman/listinfo/python-list
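To see where each match sits, re.finditer can report the span and the full matched text rather than just the group. A small sketch assuming Python 3.7 or later; spans is an illustrative helper name.

```python
import re


def spans(pattern, text):
    """Return (span, whole-match text) for every non-overlapping match."""
    return [(m.span(), m.group(0)) for m in re.finditer(pattern, text)]


print(spans("(.)*", "abc"))           # the full match, then an empty match at the end
print(spans("/x(.*?)x/", "/xabcx/"))  # this pattern cannot match the empty string
```

The first call shows the whole-string match followed by the zero-width match at position 3, which is the 'nothingness' asked about above.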
Re: regex question
On Apr 27, 8:37 am, Duncan Booth <[EMAIL PROTECTED]> wrote: > proctor <[EMAIL PROTECTED]> wrote: > > so my question remains, why doesn't the star quantifier seem to grab > > all the data. isn't findall() intended to return all matches? i > > would expect either 'abc' or 'a', 'b', 'c' or at least just > > 'a' (because that would be the first match). why does it give only > > one letter, and at that, the /last/ letter in the sequence?? > > findall returns the matched groups. You get one group for each > parenthesised sub-expression, and (the important bit) if a single > parenthesised expression matches more than once the group only contains > the last string which matched it. > > Putting a star after a subexpression means that subexpression can match > zero or more times, but each time it only matches a single character > which is why your findall only returned the last character it matched. > > You need to move the * inside the parentheses used to define the group, > then the group will match only once but will include everything that it > matched. > > Consider: > > >>> re.findall('(.)', 'abc') > ['a', 'b', 'c'] > >>> re.findall('(.)*', 'abc') > ['c', ''] > >>> re.findall('(.*)', 'abc') > > ['abc', ''] > > The first pattern finds a single character which findall manages to > match 3 times. > > The second pattern finds a group with a single character zero or more > times in the pattern, so the first time it matches each of a,b,c in turn > and returns the c, and then next time around we get an empty string when > group matched zero times. > > In the third pattern we are looking for a group with any number of > characters in it. First time we get all of the string, then we get > another empty match. thank you this is interesting. in the second example, where does the 'nothingness' match, at the end? why does the regex 'run again' when it has already matched everything? and if it reports an empty match along with a non-empty match, why only the two? 
sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Apr 27, 8:26 am, Michael Hoffman <[EMAIL PROTECTED]> wrote: > proctor wrote: > > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > >> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: > >>> rx_test = re.compile('/x([^x])*x/') > >>> s = '/xabcx/' > >>> if rx_test.findall(s): > >>> print rx_test.findall(s) > >>> > >>> i expect the output to be ['abc'] however it gives me only the last > >>> single character in the group: ['c'] > > >> As Josiah already pointed out, the * needs to be inside the grouping > >> parens. > > so my question remains, why doesn't the star quantifier seem to grab > > all the data. > > Because you didn't use it *inside* the group, as has been said twice. > Let's take a simpler example: > > >>> import re > >>> text = "xabc" > >>> re_test1 = re.compile("x([^x])*") > >>> re_test2 = re.compile("x([^x]*)") > >>> re_test1.match(text).groups() > ('c',) > >>> re_test2.match(text).groups() > ('abc',) > > There are three places that match ([^x]) in text. But each time you find > one you overwrite the previous example. > > > isn't findall() intended to return all matches? > > It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used > a grouping parenthesis in there, it only returns one group from each > pattern. > > Back to my example: > > >>> re_test1.findall("xabcxaaaxabc") > ['c', 'a', 'c'] > > Here it finds multiple matches, but only because the x occurs multiple > times as well. In your example there is only one match. > > > i would expect either 'abc' or 'a', 'b', 'c' or at least just > > 'a' (because that would be the first match). > > You are essentially doing this: > > group1 = "a" > group1 = "b" > group1 = "c" > > After those three statements, you wouldn't expect group1 to be "abc" or > "a". You'd expect it to be "c". > -- > Michael Hoffman ok, thanks michael. 
so i am now assuming that either the book's example assumes perl, and perl is different from python in this regard, or, that the book's example is faulty. i understand all the examples given since my question, and i know what i need to do to make it work. i am raising the question because the book says one thing, but the example is not working for me. i am searching for the source of the discrepancy. i will try to research the differences between perl's and python's regex engines. thanks again, sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Apr 27, 9:10 am, proctor <[EMAIL PROTECTED]> wrote: > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > > > > > > > On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: > > > > hello, > > > > i have a regex: rx_test = re.compile('/x([^x])*x/') > > > > which is part of this test program: > > > > > > > > import re > > > > rx_test = re.compile('/x([^x])*x/') > > > > s = '/xabcx/' > > > > if rx_test.findall(s): > > > print rx_test.findall(s) > > > > > > > > i expect the output to be ['abc'] however it gives me only the last > > > single character in the group: ['c'] > > > > C:\test>python retest.py > > > ['c'] > > > > can anyone point out why this is occurring? i can capture the entire > > > group by doing this: > > > > rx_test = re.compile('/x([^x]+)*x/') > > > but why isn't the 'star' grabbing the whole group? and why isn't each > > > letter 'a', 'b', and 'c' present, either individually, or as a group > > > (group is expected)? > > > > any clarification is appreciated! > > > > sincerely, > > > proctor > > > As Josiah already pointed out, the * needs to be inside the grouping > > parens. > > > Since re's do lookahead/backtracking, you can also write: > > > rx_test = re.compile('/x(.*?)x/') > > > The '?' is there to make sure the .* repetition stops at the first > > occurrence of x/. > > > -- Paul > > i am working through an example from the oreilly book mastering > regular expressions (2nd edition) by jeffrey friedl. my post was a > snippet from a regex to match C comments. every 'x' in the regex > represents a 'star' in actual usage, so that backslash escaping is not > needed in the example (on page 275). it looks like this: > > === > > /x([^x]|x+[^/x])*x+/ > > it is supposed to match '/x', the opening delimiter, then > > ( > either anything that is 'not x', > > or, > > 'x' one or more times, 'not followed by a slash or an x' > ) any number of times (the 'star') > > followed finally by the closing delimiter. 
> > === > > this does not seem to work in python the way i understand it should > from the book, and i simplified the example in my first post to > concentrate on just one part of the alternation that i felt was not > acting as expected. > > so my question remains, why doesn't the star quantifier seem to grab > all the data. isn't findall() intended to return all matches? i > would expect either 'abc' or 'a', 'b', 'c' or at least just > 'a' (because that would be the first match). why does it give only > one letter, and at that, the /last/ letter in the sequence?? > > thanks again for replying! > > sincerely, > proctor Again, I'll repeat some earlier advice: you need to move the '*' inside the parens - you are still leaving it outside. Also, get in the habit of using raw literal notation (that is r"slkjdfljf" instead of "lsjdlfkjs") when defining re strings - you don't have backslash issues yet, but you will as soon as you start putting real '*' characters in your expression. However, when I test this, restr = r'/x(([^x]|x+[^/])*)x+/' re_ = re.compile(restr) print re_.findall("/xabxxcx/ /x123xxx/") findall now starts to give a tuple for each "comment", [('abxxc', 'xxc'), ('123xx', 'xx')] so you have gone beyond my limited re skill, and will need help from someone else. But I suggest you add some tests with multiple consecutive 'x' characters in the middle of your comment, and multiple consecutive 'x' characters before the trailing comment. In fact, from my recollections of trying to implement this type of comment recognizer by hand a long time ago in a job far, far away, test with both even and odd numbers of 'x' characters. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
proctor <[EMAIL PROTECTED]> wrote: > so my question remains, why doesn't the star quantifier seem to grab > all the data. isn't findall() intended to return all matches? i > would expect either 'abc' or 'a', 'b', 'c' or at least just > 'a' (because that would be the first match). why does it give only > one letter, and at that, the /last/ letter in the sequence?? > findall returns the matched groups. You get one group for each parenthesised sub-expression, and (the important bit) if a single parenthesised expression matches more than once the group only contains the last string which matched it. Putting a star after a subexpression means that subexpression can match zero or more times, but each time it only matches a single character which is why your findall only returned the last character it matched. You need to move the * inside the parentheses used to define the group, then the group will match only once but will include everything that it matched. Consider: >>> re.findall('(.)', 'abc') ['a', 'b', 'c'] >>> re.findall('(.)*', 'abc') ['c', ''] >>> re.findall('(.*)', 'abc') ['abc', ''] The first pattern finds a single character which findall manages to match 3 times. The second pattern finds a group with a single character zero or more times in the pattern, so the first time it matches each of a,b,c in turn and returns the c, and then next time around we get an empty string when group matched zero times. In the third pattern we are looking for a group with any number of characters in it. First time we get all of the string, then we get another empty match. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
proctor wrote: > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: >> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: >>> rx_test = re.compile('/x([^x])*x/') >>> s = '/xabcx/' >>> if rx_test.findall(s): >>> print rx_test.findall(s) >>> >>> i expect the output to be ['abc'] however it gives me only the last >>> single character in the group: ['c'] > >> As Josiah already pointed out, the * needs to be inside the grouping >> parens. > so my question remains, why doesn't the star quantifier seem to grab > all the data. Because you didn't use it *inside* the group, as has been said twice. Let's take a simpler example: >>> import re >>> text = "xabc" >>> re_test1 = re.compile("x([^x])*") >>> re_test2 = re.compile("x([^x]*)") >>> re_test1.match(text).groups() ('c',) >>> re_test2.match(text).groups() ('abc',) There are three places that match ([^x]) in text. But each time you find one you overwrite the previous example. > isn't findall() intended to return all matches? It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used a grouping parenthesis in there, it only returns one group from each pattern. Back to my example: >>> re_test1.findall("xabcxaaaxabc") ['c', 'a', 'c'] Here it finds multiple matches, but only because the x occurs multiple times as well. In your example there is only one match. > i would expect either 'abc' or 'a', 'b', 'c' or at least just > 'a' (because that would be the first match). You are essentially doing this: group1 = "a" group1 = "b" group1 = "c" After those three statements, you wouldn't expect group1 to be "abc" or "a". You'd expect it to be "c". -- Michael Hoffman -- http://mail.python.org/mailman/listinfo/python-list
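The overwrite behaviour can be seen side by side by wrapping the repeated group in an outer group. A sketch assuming Python 3; repeated_pieces is an illustrative name. Group 1 holds the whole run, group 2 holds only the last repetition, and the individual characters can be recovered from group 1 afterwards.

```python
import re


def repeated_pieces(text):
    """Return (whole run, last repetition, list of pieces), or None if no match."""
    m = re.match(r"x(([^x])*)", text)
    if m is None:
        return None
    whole, last = m.group(1), m.group(2)
    return whole, last, list(whole)  # each piece here is a single character


print(repeated_pieces("xabc"))
```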
Re: regex question
On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote: > On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: > > > > > hello, > > > i have a regex: rx_test = re.compile('/x([^x])*x/') > > > which is part of this test program: > > > > > > import re > > > rx_test = re.compile('/x([^x])*x/') > > > s = '/xabcx/' > > > if rx_test.findall(s): > > print rx_test.findall(s) > > > > > > i expect the output to be ['abc'] however it gives me only the last > > single character in the group: ['c'] > > > C:\test>python retest.py > > ['c'] > > > can anyone point out why this is occurring? i can capture the entire > > group by doing this: > > > rx_test = re.compile('/x([^x]+)*x/') > > but why isn't the 'star' grabbing the whole group? and why isn't each > > letter 'a', 'b', and 'c' present, either individually, or as a group > > (group is expected)? > > > any clarification is appreciated! > > > sincerely, > > proctor > > As Josiah already pointed out, the * needs to be inside the grouping > parens. > > Since re's do lookahead/backtracking, you can also write: > > rx_test = re.compile('/x(.*?)x/') > > The '?' is there to make sure the .* repetition stops at the first > occurrence of x/. > > -- Paul i am working through an example from the oreilly book mastering regular expressions (2nd edition) by jeffrey friedl. my post was a snippet from a regex to match C comments. every 'x' in the regex represents a 'star' in actual usage, so that backslash escaping is not needed in the example (on page 275). it looks like this: === /x([^x]|x+[^/x])*x+/ it is supposed to match '/x', the opening delimiter, then ( either anything that is 'not x', or, 'x' one or more times, 'not followed by a slash or an x' ) any number of times (the 'star') followed finally by the closing delimiter. 
=== this does not seem to work in python the way i understand it should from the book, and i simplified the example in my first post to concentrate on just one part of the alternation that i felt was not acting as expected. so my question remains, why doesn't the star quantifier seem to grab all the data. isn't findall() intended to return all matches? i would expect either 'abc' or 'a', 'b', 'c' or at least just 'a' (because that would be the first match). why does it give only one letter, and at that, the /last/ letter in the sequence?? thanks again for replying! sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
Re: regex question
On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote: > hello, > > i have a regex: rx_test = re.compile('/x([^x])*x/') > > which is part of this test program: > > > > import re > > rx_test = re.compile('/x([^x])*x/') > > s = '/xabcx/' > > if rx_test.findall(s): > print rx_test.findall(s) > > > > i expect the output to be ['abc'] however it gives me only the last > single character in the group: ['c'] > > C:\test>python retest.py > ['c'] > > can anyone point out why this is occurring? i can capture the entire > group by doing this: > > rx_test = re.compile('/x([^x]+)*x/') > but why isn't the 'star' grabbing the whole group? and why isn't each > letter 'a', 'b', and 'c' present, either individually, or as a group > (group is expected)? > > any clarification is appreciated! > > sincerely, > proctor As Josiah already pointed out, the * needs to be inside the grouping parens. Since re's do lookahead/backtracking, you can also write: rx_test = re.compile('/x(.*?)x/') The '?' is there to make sure the .* repetition stops at the first occurrence of x/. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
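A quick illustration of why the '?' after '.*' matters, assuming Python 3 and an invented input containing two delimited items. GREEDY and LAZY are illustrative names.

```python
import re

GREEDY = re.compile(r"/x(.*)x/")   # .* runs to the last possible 'x/'
LAZY = re.compile(r"/x(.*?)x/")    # .*? stops at the first 'x/'

text = "/xabcx/ /xdefx/"

greedy_hits = GREEDY.findall(text)
lazy_hits = LAZY.findall(text)

print(greedy_hits)
print(lazy_hits)
```

With a single delimited item in the string the two patterns give the same answer; the difference only shows up once a second closing delimiter is available further along.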
Re: regex question
proctor wrote: > i have a regex: rx_test = re.compile('/x([^x])*x/') You probably want... rx_test = re.compile('/x([^x]*)x/') - Josiah -- http://mail.python.org/mailman/listinfo/python-list
regex question
hello, i have a regex: rx_test = re.compile('/x([^x])*x/') which is part of this test program: import re rx_test = re.compile('/x([^x])*x/') s = '/xabcx/' if rx_test.findall(s): print rx_test.findall(s) i expect the output to be ['abc'] however it gives me only the last single character in the group: ['c'] C:\test>python retest.py ['c'] can anyone point out why this is occurring? i can capture the entire group by doing this: rx_test = re.compile('/x([^x]+)*x/') but why isn't the 'star' grabbing the whole group? and why isn't each letter 'a', 'b', and 'c' present, either individually, or as a group (group is expected)? any clarification is appreciated! sincerely, proctor -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
Gabriel Genellina wrote: > At Tuesday 16/1/2007 16:36, Bill Mill wrote: > > > > py> import re > > > py> rgx = re.compile('1?') > > > py> rgx.search('a1').groups() > > > (None,) > > > py> rgx = re.compile('(1)+') > > > py> rgx.search('a1').groups() > > > >But shouldn't the ? be greedy, and thus prefer the one match to the > >zero? This is my sticking point - I've seen that plus works, and this > >just confuses me more. > > Perhaps you have misunderstood what search does. > search( pattern, string[, flags]) > Scan through string looking for a location where the regular > expression pattern produces a match > > '1?' means 0 or 1 times '1', i.e., nothing or a single '1'. > At the start of the target string, 'a1', we have nothing, so the re > matches, and returns that occurrence. It doesnt matter that a few > characters later there is *another* match, even if it is longer; once > a match is found, the scan is done. > If you want "the longest match of all possible matches along the > string", you should use findall() instead of search(). > That is exactly what I misunderstood. Thank you very much. -Bill Mill bill.mill at gmail.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Question
At Tuesday 16/1/2007 16:36, Bill Mill wrote:

> py> import re
> py> rgx = re.compile('1?')
> py> rgx.search('a1').groups()
> (None,)
> py> rgx = re.compile('(1)+')
> py> rgx.search('a1').groups()
>
> But shouldn't the ? be greedy, and thus prefer the one match to the
> zero? This is my sticking point - I've seen that plus works, and this
> just confuses me more.

Perhaps you have misunderstood what search does.

    search( pattern, string[, flags])
    Scan through string looking for a location where the regular expression pattern produces a match

'1?' means 0 or 1 times '1', i.e., nothing or a single '1'. At the start of the target string, 'a1', we have nothing, so the re matches, and returns that occurrence. It doesn't matter that a few characters later there is *another* match, even if it is longer; once a match is found, the scan is done. If you want "the longest match of all possible matches along the string", you should use findall() instead of search().

-- Gabriel Genellina Softlab SRL
-- http://mail.python.org/mailman/listinfo/python-list
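A sketch of the difference, assuming Python 3. search() stops at the leftmost position where the pattern can match, even if that match is empty. findall() and finditer() return every non-overlapping match, so picking the longest one is a separate step. first_match and longest_match are illustrative names.

```python
import re


def first_match(pattern, text):
    """Span and text of the leftmost match, as re.search sees it."""
    m = re.search(pattern, text)
    return (m.span(), m.group(0)) if m else None


def longest_match(pattern, text):
    """Longest of all non-overlapping matches, or None if there are none."""
    return max((m.group(0) for m in re.finditer(pattern, text)), key=len, default=None)


print(first_match("1?", "a1"))    # empty match at position 0
print(longest_match("1?", "a1"))  # the '1' found further along
print(first_match("1+", "a1"))    # '+' cannot match empty, so the scan moves on
```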
Re: Regex Question
James Stroud wrote:
> Bill Mill wrote:
> > Hello all,
> >
> > I've got a test script:
> >
> > === start python code ===
> >
> > import re
> >
> > tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
> >           "item1: alpha; item3 - gamma--"]
> >
> > def test_re(regex):
> >     r = re.compile(regex, re.MULTILINE)
> >     for test in tests2:
> >         res = r.search(test)
> >         if res:
> >             print(res.groups())
> >         else:
> >             print("Failed")
> >
> > === end python code ===
> >
> > And a simple question:
> >
> > Why does the first regex that follows successfully grab "beta", while
> > the second one doesn't?
> >
> > In [131]: test_re(r"(?:item2: (.*?)\.)")
> > ('beta',)
> > Failed
> >
> > In [132]: test_re(r"(?:item2: (.*?)\.)?")
> > (None,)
> > (None,)
> >
> > Shouldn't the '?' greedily grab the group match?
> >
> > Thanks
> > Bill Mill
> > bill.mill at gmail.com
>
> The question-mark matches zero or one times. The first match will be a
> group with nothing in it, which satisfies the zero condition. Perhaps
> you mean "+"?
>
> e.g.
>
> py> import re
> py> rgx = re.compile('(1)?')
> py> rgx.search('a1').groups()
> (None,)
> py> rgx = re.compile('(1)+')
> py> rgx.search('a1').groups()

But shouldn't the ? be greedy, and thus prefer the one match to the
zero? This is my sticking point - I've seen that plus works, and this
just confuses me more.

-Bill Mill
bill.mill at gmail.com
Re: Regex Question
Bill Mill wrote:
> Hello all,
>
> I've got a test script:
>
> === start python code ===
>
> import re
>
> tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
>           "item1: alpha; item3 - gamma--"]
>
> def test_re(regex):
>     r = re.compile(regex, re.MULTILINE)
>     for test in tests2:
>         res = r.search(test)
>         if res:
>             print(res.groups())
>         else:
>             print("Failed")
>
> === end python code ===
>
> And a simple question:
>
> Why does the first regex that follows successfully grab "beta", while
> the second one doesn't?
>
> In [131]: test_re(r"(?:item2: (.*?)\.)")
> ('beta',)
> Failed
>
> In [132]: test_re(r"(?:item2: (.*?)\.)?")
> (None,)
> (None,)
>
> Shouldn't the '?' greedily grab the group match?
>
> Thanks
> Bill Mill
> bill.mill at gmail.com

The question-mark matches zero or one times. The first match will be a
group with nothing in it, which satisfies the zero condition. Perhaps
you mean "+"?

e.g.

py> import re
py> rgx = re.compile('(1)?')
py> rgx.search('a1').groups()
(None,)
py> rgx = re.compile('(1)+')
py> rgx.search('a1').groups()

James
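The comparison can be tried directly. This is a small sketch with an explicit capturing group in both patterns, so that groups() has something to report; the variable names are made up for illustration.

```python
import re

# '(1)?' may match zero times, so it succeeds at index 0 with an empty match
# and the group never participates.
optional = re.search("(1)?", "a1")
print(optional.span(), optional.groups())

# '(1)+' needs at least one '1', so the scan has to move on to index 1.
required = re.search("(1)+", "a1")
print(required.span(), required.groups())
```

Printing span() alongside groups() makes it visible that the two searches succeed at different positions in the string.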
Regex Question
Hello all,

I've got a test script:

=== start python code ===

import re

tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
          "item1: alpha; item3 - gamma--"]

def test_re(regex):
    r = re.compile(regex, re.MULTILINE)
    for test in tests2:
        res = r.search(test)
        if res:
            print(res.groups())
        else:
            print("Failed")

=== end python code ===

And a simple question:

Why does the first regex that follows successfully grab "beta", while
the second one doesn't?

In [131]: test_re(r"(?:item2: (.*?)\.)")
('beta',)
Failed

In [132]: test_re(r"(?:item2: (.*?)\.)?")
(None,)
(None,)

Shouldn't the '?' greedily grab the group match?

Thanks
Bill Mill
bill.mill at gmail.com
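One possible way to keep the item2 part optional and still capture "beta" is sketched below. It assumes that item1 and item3 are always present around the optional part; the two sample strings suggest this but the post does not state it. The pattern named anchored is hypothetical and not taken from the thread.

```python
import re

tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
          "item1: alpha; item3 - gamma--"]

# The fully optional pattern succeeds immediately with an empty match at index 0.
optional = re.compile(r"(?:item2: (.*?)\.)?")
spans = [optional.search(t).span() for t in tests2]
print(spans)

# Hypothetical alternative: fixed text on both sides of the optional part
# forces the engine to try the group at the position where item2 would be.
anchored = re.compile(r"item1: \w+; (?:item2: (.*?)\. )?item3")
results = [anchored.search(t).groups() for t in tests2]
print(results)
```

Another option is to keep the first, non-optional regex and treat a failed search as "item2 not present".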
Re: regex question
> yes, i suppose you are right. i can't think of a reason i would NEED a
> raw string in this situation.

It looks from your code that you are trying to remove all occurrences of
one string's characters from the other. A simple regex way would be to
use re.sub():

>>> import re
>>> a = "abc"
>>> b = "debcabbde"
>>> re.sub("[" + a + "]", "", b)
'dede'
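The character class above works as long as the first string contains no characters that are special inside brackets. A slightly more defensive sketch follows. The helper name remove_chars is made up for illustration, and it assumes chars is not empty, since an empty class is not a valid pattern.

```python
import re

def remove_chars(chars, text):
    """Remove every character found in `chars` from `text` (hypothetical helper)."""
    # re.escape keeps characters such as ']', '^' and '-' literal
    # inside the character class.
    return re.sub("[" + re.escape(chars) + "]", "", text)

print(remove_chars("abc", "debcabbde"))
print(remove_chars("a-c]", "debc-ab]de"))

# Removing the whole substring, rather than its characters, needs no regex.
print("xxabcyy".replace("abc", ""))
```

The last line is there for contrast: the character class removes individual characters, while str.replace() removes the substring as a unit.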
Re: regex question
Paul McGuire wrote:
> "proctor" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> >
> > it does work now...however, one more question: when i type:
> >
> > rx_a = re.compile(r'a|b|c')
> > it works correctly!
> >
>
> Do you see the difference between:
>
> rx_a = re.compile(r'a|b|c')
>
> and
>
> rx_a = re.compile("r'a|b|c'")
>
> There is no difference in the variable datatype between "string" and "raw
> string". Raw strings are just a notational helper when creating string
> literals that have lots of backslashes in them (as happens a lot with
> regexps).
>
> r'a|b|c' is the same as 'a|b|c'
> r'\d' is the same as '\\d'
>
> There is no reason to "add raw strings" to your makeRE method, since you
> don't have a single backslash anywhere. And even if there were a backslash
> in the 'w' argument, it is just a string - no need to treat it differently.
>
> -- Paul

thanks paul. this helps.

proctor.
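Paul's two points can be checked in a few lines. This is only a sketch; the names good and bad are illustrative.

```python
import re

# The r prefix changes how the literal is read, not the type of the result.
assert r"a|b|c" == "a|b|c"
assert r"\d" == "\\d"
print(type(r"\d") is str)

# Putting the prefix inside the quotes makes it part of the pattern text:
# the alternatives become "r'a", "b" and "c'".
good = re.compile(r"a|b|c")
bad = re.compile("r'a|b|c'")
print(good.pattern, bad.pattern)
print(good.fullmatch("a") is not None, bad.fullmatch("a") is not None)
print(bad.fullmatch("r'a") is not None)
```

The second pattern still compiles, which is why the mistake is easy to miss: it only fails to match the inputs that were intended.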