regular expression help
Hi,

say:

>>> import re
>>> m = "cccvlvlvlvnnnflfllffccclfnnnooo"
>>> p = re.compile(r'ccc.*nnn')
>>> rtt = p.sub("||", m)
>>> rtt
'||ooo'

The regex is eating up too much. What I want is every non-overlapping
occurrence, I think.

So rtt would be:

'||flfllff||ooo'

just like findall acts, but in this case I want sub to act like that.

Thanks
--
http://mail.python.org/mailman/listinfo/python-list
Re: regular expression help
--- On Tue, 11/30/10, goldtech goldt...@worldpost.com wrote:

From: goldtech goldt...@worldpost.com
Subject: regular expression help
To: python-list@python.org
Date: Tuesday, November 30, 2010, 9:17 AM

> The regex is eating up too much. What I want is every non-overlapping
> occurrence I think.
>
> so rtt would be:
>
> '||flfllff||ooo'

Hi,

I'll just let Python do most of the talk here.

>>> import re
>>> m = "cccvlvlvlvnnnflfllffccclfnnnooo"
>>> p = re.compile(r'ccc.*?nnn')
>>> p.sub("||", m)
'||flfllff||ooo'

Cheers,
Yingjie
Re: regular expression help
On 2010-11-30, goldtech goldt...@worldpost.com wrote:

> Hi,
>
> say:
>
> >>> import re
> >>> m = "cccvlvlvlvnnnflfllffccclfnnnooo"
> >>> p = re.compile(r'ccc.*nnn')
> >>> rtt = p.sub("||", m)
> >>> rtt
> '||ooo'
>
> The regex is eating up too much. What I want is every non-overlapping
> occurrence I think.
>
> so rtt would be:
>
> '||flfllff||ooo'

Python 3.1.2 (r312:79147, Oct 9 2010, 00:16:06)
[GCC 4.4.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> m = "cccvlvlvlvnnnflfllffccclfnnnooo"
>>> pattern = re.compile(r'ccc[^n]*nnn')
>>> pattern.sub("||", m)
'||flfllff||ooo'
Re: regular expression help
> Python 3.1.2 (r312:79147, Oct 9 2010, 00:16:06)
> [GCC 4.4.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> m = "cccvlvlvlvnnnflfllffccclfnnnooo"
> >>> pattern = re.compile(r'ccc[^n]*nnn')
> >>> pattern.sub("||", m)
> '||flfllff||ooo'

# or, assuming that the middle sequence might contain singular or
# double 'n's

>>> pattern = re.compile(r'ccc.*?nnn')
>>> pattern.sub("||", m)
'||flfllff||ooo'
Re: regular expression help
.*? fixed it. Every occurrence of the pattern is now affected, which is
what I want.

Thank you very much.
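The three patterns discussed in this thread can be compared side by side. This is only a small sketch in Python 3 using the sample string from the original post; the variable names are made up for illustration.

```python
import re

text = "cccvlvlvlvnnnflfllffccclfnnnooo"

# Greedy: '.*' runs to the end of the string and backtracks to the LAST 'nnn'.
greedy = re.compile(r'ccc.*nnn')
# Non-greedy: '.*?' stops at the FIRST 'nnn' that completes a match.
lazy = re.compile(r'ccc.*?nnn')
# Negated class: never crosses an 'n', so it cannot overshoot either.
negated = re.compile(r'ccc[^n]*nnn')

greedy_result = greedy.sub("||", text)
lazy_result = lazy.sub("||", text)
negated_result = negated.sub("||", text)

print(greedy_result)
print(lazy_result)
print(negated_result)
```

The negated-class version only works while the text between 'ccc' and 'nnn' contains no 'n' at all, which is why the non-greedy form is the more general fix.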
Python's regular expression help
Hi,

Trying to start out with simple things but apparently there's some
basics I need help with. This works OK:

>>> import re
>>> p = re.compile('(ab*)(sss)')
>>> m = p.match( 'absss' )
>>> m.group(0)
'absss'
>>> m.group(1)
'ab'
>>> m.group(2)
'sss'
...

But two questions:

How can I operate a regex on a string variable? I'm doing something
wrong here:

>>> f=r'abss'
>>> f
'abss'
>>> m = p.match( f )
>>> m.group(0)
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'

How do I implement a regex on a multiline string? I thought this
might work but there's a problem:

>>> p = re.compile('(ab*)(sss)', re.S)
>>> m = p.match( 'ab\nsss' )
>>> m.group(0)
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'

Thanks for the newbie regex help,
Lee
Re: Python's regular expression help
On 29/04/2010 20:00, goldtech wrote:

> [...]
> How do I implement a regex on a multiline string? I thought this
> might work but there's problem:
>
> >>> p = re.compile('(ab*)(sss)', re.S)
> >>> m = p.match( 'ab\nsss' )
> >>> m.group(0)
> Traceback (most recent call last):
>   File "<pyshell#26>", line 1, in <module>
>     m.group(0)
> AttributeError: 'NoneType' object has no attribute 'group'
>
> Thanks for the newbie regex help,
> Lee

For multiline, I use re.DOTALL.

I do not know match(); findall is pretty efficient:

>>> my = "<a href=\"hello world.com\">LINK</a>"
>>> res = re.findall(">(.*?)<", my)
>>> res
['LINK']

Dorian
Re: Python's regular expression help
goldtech wrote:

> Hi,
>
> Trying to start out with simple things but apparently there's some
> basics I need help with. This works OK:
>
> >>> import re
> >>> p = re.compile('(ab*)(sss)')
> >>> m = p.match( 'absss' )
> >>> m.group(0)
> 'absss'
> >>> m.group(1)
> 'ab'
> >>> m.group(2)
> 'sss'
> ...
>
> But two questions:
>
> How can I operate a regex on a string variable? I'm doing something
> wrong here:
>
> >>> f=r'abss'
> >>> f
> 'abss'
> >>> m = p.match( f )
> >>> m.group(0)
> Traceback (most recent call last):
>   File "<pyshell#15>", line 1, in <module>
>     m.group(0)
> AttributeError: 'NoneType' object has no attribute 'group'

Look closely: the regex contains 3 letter 's', but the string referred
to by f has only 2.

> How do I implement a regex on a multiline string? I thought this
> might work but there's problem:
>
> >>> p = re.compile('(ab*)(sss)', re.S)
> >>> m = p.match( 'ab\nsss' )
> >>> m.group(0)
> Traceback (most recent call last):
>   File "<pyshell#26>", line 1, in <module>
>     m.group(0)
> AttributeError: 'NoneType' object has no attribute 'group'
>
> Thanks for the newbie regex help,
> Lee

The string contains a newline between the 'b' and the 's', but the
regex isn't expecting any newline (or any other character) between the
'b' and the 's', hence no match.
Re: Python's regular expression help
On 04/29/2010 01:00 PM, goldtech wrote:

> Trying to start out with simple things but apparently there's some
> basics I need help with. This works OK:
>
> >>> import re
> >>> p = re.compile('(ab*)(sss)')
> >>> m = p.match( 'absss' )
>
> >>> f=r'abss'
> >>> f
> 'abss'
> >>> m = p.match( f )
> >>> m.group(0)
> Traceback (most recent call last):
>   File "<pyshell#15>", line 1, in <module>
>     m.group(0)
> AttributeError: 'NoneType' object has no attribute 'group'

'absss' != 'abss'

Your regexp looks for 3 "s", your "f" contains only 2. So the regexp
object doesn't, well, match. Try

  f = 'absss'

and it will work. As an aside, using raw-strings for this text doesn't
change anything, but if you want, you _can_ write it as

  f = r'absss'

if it will make you feel better :)

> How do I implement a regex on a multiline string? I thought this
> might work but there's problem:
>
> >>> p = re.compile('(ab*)(sss)', re.S)
> >>> m = p.match( 'ab\nsss' )
> >>> m.group(0)
> Traceback (most recent call last):
>   File "<pyshell#26>", line 1, in <module>
>     m.group(0)
> AttributeError: 'NoneType' object has no attribute 'group'

Well, it depends on what you want to do -- regexps are fairly precise,
so if you want to allow whitespace between the two, you can use

  r = re.compile(r'(ab*)\s*(sss)')

If you want to allow whitespace anywhere, it gets uglier, and your
capture/group results will contain that whitespace:

  r'(a\s*b*)\s*(s\s*s\s*s)'

Alternatively, if you don't want to allow arbitrary whitespace but only
newlines, you can use "\n*" instead of "\s*"

-tkc
Re: Python's regular expression help
On Apr 29, 11:49 am, Tim Chase python.l...@tim.thechases.com wrote:

> [...]
> Well, it depends on what you want to do -- regexps are fairly precise,
> so if you want to allow whitespace between the two, you can use
>
>   r = re.compile(r'(ab*)\s*(sss)')
>
> If you want to allow whitespace anywhere, it gets uglier, and your
> capture/group results will contain that whitespace:
>
>   r'(a\s*b*)\s*(s\s*s\s*s)'
>
> Alternatively, if you don't want to allow arbitrary whitespace but only
> newlines, you can use "\n*" instead of "\s*"
>
> -tkc

Yes, most of my problem is w/my patterns not w/any python re syntax.

I thought re.S will take a multiline string with any spaces or newlines
and make it appear as one line to the regex. Make "\n" be ignored in a
way...still playing w/it.

Thanks for the help!
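The points made in this thread can be checked in a few lines. This is a sketch in Python 3, with variable names chosen only for illustration: match() returns None when nothing matches, and re.S (also spelled re.DOTALL) changes only what '.' matches; it does not make newlines in the subject string disappear.

```python
import re

pattern = re.compile(r'(ab*)(sss)')

# match() returns None on failure, so test the result before calling
# .group() to avoid the AttributeError shown in the thread.
m = pattern.match('abss')
too_short = m.group(0) if m else None      # None: 'abss' has only two 's'

# The pattern contains no '.', so re.S makes no difference here.
with_flag = re.compile(r'(ab*)(sss)', re.S).match('ab\nsss')

# Explicitly allowing whitespace between the groups does make it match.
spaced = re.compile(r'(ab*)\s*(sss)').match('ab\nsss')
groups = spaced.groups()

# What re.S actually does: it lets '.' match a newline.
dot_plain = re.match(r'ab.sss', 'ab\nsss')
dot_flag = re.match(r'ab.sss', 'ab\nsss', re.S)

print(too_short, with_flag, groups, dot_plain, dot_flag)
```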
Re: Regular Expression Help
Jean-Claude Neveu wrote:

> Hello,
>
> I was wondering if someone could tell me where I'm going wrong with my
> regular expression. I'm trying to write a regexp that identifies
> whether a string contains a correctly-formatted currency amount. I
> want to support dollars, UK pounds and Euros, but the example below
> deliberately omits Euros in case the Euro symbol gets mangled anywhere
> in email or listserver processing. I also want people to be able to
> omit the currency symbol if they wish.

If Euro symbols can get mangled, so can Pound signs. They're both
outside ASCII.

> My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"
>
> Here's how I think it should work (but clearly I'm wrong, because it
> does not actually work):
>
> ^\$\£?
> Require zero or one instance of $ or £ at the start of the string.

^[$£]? is correct. And, as you're using re.match, the ^ is superfluous.
(A previous message suggested ^[\$£]? which will also work. You
generally need to escape a Dollar sign but not here.)

You should also think about the encoding. In my terminal, "£" is
identical to '\xc2\xa3'. That is, two bytes for a UTF-8 code point. If
you assume this encoding, it's best to make it explicit. And if you
don't assume a specific encoding it's best to convert to unicode to do
the comparisons, so for 2.x (or portability) your string should start
u"

> d{0,10}
> Next, require between zero and ten alpha characters.

There's a backslash missing, but not from your original expression.
Digits are not alpha characters.

> (\.\d{2})?
> Optionally, two characters can follow. They must be preceded by a
> decimal point.

That works. Of course, \d{2} is longer than the simpler \d\d

Note that you can comment the original expression like this:

rex = u"""(?x)
    ^[$£]?        # Zero or one instance of $ or £
                  # at the start of the string.
    \d{0,10}      # Between zero and ten digits
    (\.\d{2})?    # Optionally, two digits.
                  # They must be preceded by a decimal point.
    $             # End of line
"""

Then anybody (including you) who comes to read this in the future will
have some idea what you were trying to do.

> Examples of acceptable input should be:
>
> $12.42
> $12
> £12.42
> $12,482.96 (now I think about it, I have not catered for this in my
> regexp)

Yes, you need to think about that.

Graham
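To see why the original expression rejects amounts without a dollar sign, it helps to run both versions over a few of the test strings. The following is a sketch in Python 3, where str is already Unicode so the u prefix and the encoding concerns above do not arise; the list names are invented for the example.

```python
import re

# Original: '\$' is mandatory and only '\£' is optional, i.e. "$ then maybe £".
original = re.compile(r'^\$\£?\d{0,10}(\.\d{2})?$')

# Character class: zero or one of either symbol, written in verbose mode.
fixed = re.compile(r'''(?x)
    ^[$£]?        # zero or one instance of $ or £
    \d{0,10}      # between zero and ten digits
    (\.\d{2})?    # optionally, a decimal point and two digits
    $             # end of string
''')

samples = ['$12.47', '12.47', '£12.47', '$12b.56']
original_hits = [s for s in samples if original.match(s)]
fixed_hits = [s for s in samples if fixed.match(s)]

print(original_hits)
print(fixed_hits)
```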
Regular Expression Help
Hello,

I was wondering if someone could tell me where I'm going wrong with my
regular expression. I'm trying to write a regexp that identifies
whether a string contains a correctly-formatted currency amount. I want
to support dollars, UK pounds and Euros, but the example below
deliberately omits Euros in case the Euro symbol gets mangled anywhere
in email or listserver processing. I also want people to be able to
omit the currency symbol if they wish.

My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"

Here's how I think it should work (but clearly I'm wrong, because it
does not actually work):

^\$\£?
Require zero or one instance of $ or £ at the start of the string.

d{0,10}
Next, require between zero and ten alpha characters.

(\.\d{2})?
Optionally, two characters can follow. They must be preceded by a
decimal point.

Examples of acceptable input should be:

$12.42
$12
£12.42
$12,482.96 (now I think about it, I have not catered for this in my
regexp)

And unacceptable input would be:

$12b.42
blah
$blah
etc

Here is my Python script:

#
import re

def is_currency(str):
    rex = "^\$\£?\d{0,10}(\.\d{2})?$"
    if re.match(rex, str):
        return 1
    else:
        return 0

def test_match(str):
    if is_currency (str):
        print str + " is a match"
    else:
        print str + " is not a match"

# All should match except the last two
test_match("$12.47")
test_match("12.47")
test_match("£12.47")
test_match("£12")
test_match("$12")
test_match("$12588.47")
test_match("$12,588.47")
test_match("£12588.47")
test_match("12588.47")
test_match("£12588")
test_match("$12588")
test_match("blah")
test_match("$12b.56")

AND HERE IS THE OUTPUT FROM THE ABOVE SCRIPT:

$12.47 is a match
12.47 is not a match
£12.47 is not a match
£12 is not a match
$12 is a match
$12588.47 is a match
$12,588.47 is not a match
£12588.47 is not a match
12588.47 is not a match
£12588 is not a match
$12588 is a match
blah is not a match
$12b.56 is not a match

Many thanks in advance. Regular expressions are not my strong suit :)

J-C
Re: Regular Expression Help
On Apr 11, 9:42 pm, Jean-Claude Neveu jcn-france1...@pobox.com wrote:

> My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"
>
> Here's how I think it should work (but clearly I'm wrong, because it
> does not actually work):
>
> ^\$\£?
> Require zero or one instance of $ or £ at the start of the string.

The "or" in "$ or £" above is a vertical bar. You want ^(\$|£)? here.

> d{0,10}
> Next, require between zero and ten alpha characters.
>
> (\.\d{2})?
> Optionally, two characters can follow. They must be preceded by a
> decimal point.
Re: Regular Expression Help
On Apr 12, 2:19 pm, ru...@yahoo.com wrote:

> On Apr 11, 9:42 pm, Jean-Claude Neveu jcn-france1...@pobox.com wrote:
>
> > My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"
> >
> > Here's how I think it should work (but clearly I'm wrong, because it
> > does not actually work):
> >
> > ^\$\£?
> > Require zero or one instance of $ or £ at the start of the string.
>
> The "or" in "$ or £" above is a vertical bar. You want ^(\$|£)? here.

Best not to use a capturing group (blah) when you don't need to capture
... use (?:blah) instead.

When the alternatives are all single characters, for greater typing
efficiency and computing efficiency use a character class: ^[\$£]?
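Putting the advice from this thread together, one possible validator is sketched below in Python 3. The handling of thousands separators ("$12,588.47") is not something any poster supplied; it is an assumption about what the original author wanted, as is the decision to require at least one digit. The names CURRENCY and is_currency follow the original script but the body is illustrative only.

```python
import re

CURRENCY = re.compile(r'''
    [$£]?                    # optional currency symbol (character class)
    (?:                      # non-capturing group, as suggested above
        \d{1,3}(?:,\d{3})+   # digits grouped by thousands, e.g. 12,588
      | \d{1,10}             # or up to ten digits with no separator
    )
    (?:\.\d{2})?             # optional decimal point plus two digits
''', re.VERBOSE)

def is_currency(text):
    """Return True if the whole string looks like a currency amount."""
    return CURRENCY.fullmatch(text) is not None

samples = ['$12.47', '12.47', '£12.47', '£12', '$12588.47',
           '$12,588.47', '12588.47', 'blah', '$12b.56', '$1234,567']
results = {s: is_currency(s) for s in samples}

for amount, ok in results.items():
    print(amount, 'is a match' if ok else 'is not a match')
```

Using fullmatch() removes the need for the ^ and $ anchors, which keeps the verbose pattern a little shorter.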
regular expression, help
I think there are two parts to this question and I am sure lots I am
missing. I am hoping an example will help me.

I have an html doc that I am trying to use regular expressions to get a
value out of. Here is an example of the line:

<td colspan='2'>Parcel ID: 39-034-15-009 </td>

I want to get the number "39-034-15-009" after "Parcel ID:". The number
will be different each time but always the same format.

I think I can match "Parcel ID:" but not sure how to get the number
after. "Parcel ID:" only occurs once in the document.

Is this how I need to start?

pid = re.compile('Parcel ID: ')

Basically I am completely lost and am not finding examples I find
helpful.

I am getting the html using myurl=urllib.urlopen(). Can I use RE like
this?

thenum=pid.match(myurl)

I think the two key things I need to know are:
1, how do I get the text after a match?
2, when I use myurl=urllib.urlopen("http://..."), can I use myurl as
the string in a RE, thenum=pid.match(myurl)

Thanks
Vincent
Re: regular expression, help
Is BeautifulSoup really better? Since I don't know either, I would
prefer to learn only one for now.

Thanks
Vincent Davis

On Tue, Jan 27, 2009 at 10:39 AM, MRAB goo...@mrabarnett.plus.com wrote:

> [...]
> Something like:
>
> pid = re.compile(r'Parcel ID: (\d+(?:-\d+)*)')
> myurl = urllib.urlopen(url)
> text = myurl.read()
> myurl.close()
> thenum = pid.search(text).group(1)
>
> Although BeautifulSoup is the preferred solution.
Re: regular expression, help
Vincent Davis wrote:

> [...]
> I think the two key things I need to know are
> 1, how do I get the text after a match?
> 2, when I use myurl=urllib.urlopen("http://..."). can I use the myurl
> as the string in a RE, thenum=pid.match(myurl)

Something like:

pid = re.compile(r'Parcel ID: (\d+(?:-\d+)*)')
myurl = urllib.urlopen(url)
text = myurl.read()
myurl.close()
thenum = pid.search(text).group(1)

Although BeautifulSoup is the preferred solution.
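The same idea can be tried without a network connection by searching a literal string. This sketch uses Python 3 and the pattern from the reply above; the html variable stands in for the downloaded page and is an assumption made so the example is self-contained. In Python 3 the download itself would go through urllib.request.urlopen(), whose read() returns bytes that need decoding before a str pattern can be applied.

```python
import re

# Stand-in for the text of the downloaded page.
html = "<td colspan='2'>Parcel ID: 39-034-15-009 </td>"

# The parentheses capture the number that follows the fixed label.
pid = re.compile(r'Parcel ID: (\d+(?:-\d+)*)')

# search() scans the whole text; match() would only try position 0.
found = pid.search(html)
parcel_id = found.group(1) if found else None

print(parcel_id)
```

This answers both questions in the thread: group(1) gives the text after the label, and the pattern has to be applied to the page text, not to the object returned by urlopen().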
Regular expression help: unable to search ' # ' character in the file
Hi,

Can someone help me with a regular expression? I'm looking to search
for the # character in my file.

My file has contents:

### Hello World ###
length = 10
breadth = 20
height = 30
###
### Hello World ###
length = 20
breadth = 30
height = 40
###

I used the following search:

import re

fd = open(file, 'r')
line = fd.readline
pat1 = re.compile("\#*")
while(line):
    mat1 = pat1.search(line)
    if mat1:
        print line
    line = fd.readline()

But the above prints the whole file instead of the hash lines only.

Please help

Regards,
Rajat
Re: Regular expression help: unable to search ' # ' character in the file
[EMAIL PROTECTED] wrote:

> import re
>
> fd = open(file, 'r')
> line = fd.readline
> pat1 = re.compile("\#*")
> while(line):
>     mat1 = pat1.search(line)
>     if mat1:
>         print line
>     line = fd.readline()

I strongly doubt that this is the code you used.

> But the above prints the whole file instead of the hash lines only.

"*" means zero or more matches. all lines in a file contain zero or
more # characters.

but using a RE is overkill in this case, of course. to check for a
character or substring, use the "in" operator:

    for line in open(file):
        if "#" in line:
            print line

/F
Re: Regular expression help: unable to search ' # ' character in the file
On Sat, Sep 27, 2008 at 1:58 PM, Fredrik Lundh [EMAIL PROTECTED] wrote:

> [...]
> but using a RE is overkill in this case, of course. to check for a
> character or substring, use the "in" operator:
>
>     for line in open(file):
>         if "#" in line:
>             print line
>
> /F

Thanks Fredrik, this works. Indeed it is a much better and cleaner
approach.

--
Regards,
Rajat
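The difference between "zero or more" and "one or more" is easy to see on a few sample lines. This is a sketch in Python 3; the list of lines is a stand-in for the file and the variable names are made up for the example.

```python
import re

lines = ['### Hello World ###', 'length = 10', 'breadth = 20', '###']

# '#*' can match the empty string, so search() succeeds on every line.
star = [line for line in lines if re.search(r'#*', line)]

# '#+' needs at least one '#', which is what the question wanted.
plus = [line for line in lines if re.search(r'#+', line)]

# The plain substring test does the same job without a regex.
plain = [line for line in lines if '#' in line]

print(star)
print(plus)
print(plain)
```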
Re: Regular expression help
[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> I am new to Python, with a background in scientific computing. I'm
> trying to write a script that will take a file with lines like
>
> c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0
>
> extract the values of afrac and etot and plot them.
> ...
> What is being stored in energy is '_sre.SRE_Match object at
> 0x2a955e4ed0', not '-11.020107'. Why?

because the re.match() method returns a match object, as documented at
http://www.python.org/doc/current/lib/match-objects.html

But this looks like a problem where regular expressions are overkill.
Assuming all your lines are formatted as in the example above (every
value you are interested in contains an equals sign and is surrounded
by spaces), you could do this:

values = {}
for expression in line.split(" "):
    if "=" in expression:
        name, val = expression.split("=")
        values[name] = val

I'd wager that this will run a fair bit faster than any regex-based
solution. Then you just use values['afrac'] and values['etot'] when you
need them.

And when you get to be a really hard-core Pythonista, you could write
the whole routine above in one line, but this seems clearer. ;-)

Russ
Re: Regular expression help
[EMAIL PROTECTED] wrote:

> Hello,
>
> I am new to Python, with a background in scientific computing. I'm
> trying to write a script that will take a file with lines like
>
> c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0
>
> extract the values of afrac and etot...

Why not just split them out instead of using REs?

fp = open("test.txt")
lines = fp.readlines()
fp.close()

for line in lines:
    split = line.split()
    for pair in split:
        pair_split = pair.split("=")
        if len(pair_split) == 2:
            try:
                print pair_split[0], "is", pair_split[1]
            except:
                pass

Results:

IDLE 1.2.2 No Subprocess
afrac is .7
mmom is 0
sev is -9.56646
erep is 0
etot is -11.020107
emad is -3.597647
3pv is 0
Re: Regular expression help
[EMAIL PROTECTED] wrote:

> Hello,
>
> I am new to Python, with a background in scientific computing. I'm
> trying to write a script that will take a file with lines like
>
> c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0
>
> extract the values of afrac and etot and plot them. I'm really
> struggling with getting the values of afrac and etot. So far I have
> come up with (small snippet of script just to get the energy, etot):
>
> def get_data_points(filename):
>     file = open(filename,'r')
>     data_points = []
>     while 1:
>         line = file.readline()
>         if not line: break
>         energy = get_total_energy(line)
>         data_points.append(energy)
>     return data_points
>
> def get_total_energy(line):
>     rawstr = r"(?P<key>.*?)=(?P<value>.*?)\s"
>     p = re.compile(rawstr)
>     return p.match(line,5)
>
> What is being stored in energy is '_sre.SRE_Match object at
> 0x2a955e4ed0', not '-11.020107'. Why?

1. Consider using the 'split' method on each line rather than regexes.
2. In your code you are compiling the regex for every line in the file;
   you should lift it out of the 'get_total_energy' function so that
   the compilation is only done once.
3. A Match object has a 'groups' function, which is what you need to
   retrieve the data.
4. Also look at the findall method:

data = 'c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0 '
import re
rx = re.compile(r'(\w+)=(\S+)')
data = dict(rx.findall(data))
print data

hth
G.
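Points 2 and 3 above can be combined into a corrected version of the original function. This is a sketch in Python 3; the pattern and the name ETOT are illustrative choices, not the original poster's code. The compiled pattern lives at module level so it is built once, and the function returns the captured text converted to a number instead of the Match object itself.

```python
import re

# Compiled once, outside the function that is called for every line.
ETOT = re.compile(r'\betot=(\S+)')

def get_total_energy(line):
    """Return etot as a float, or None if the line has no etot field."""
    m = ETOT.search(line)          # a Match object, or None
    if m is None:
        return None
    return float(m.group(1))       # group(1) is the captured text

line = 'c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0'
energy = get_total_energy(line)
missing = get_total_energy('c afrac=.7')

print(energy, missing)
```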
Re: Regular expression help
I think you're over-complicating this. I'm assuming that you're going
to do a line graph of some sort, and each new line of the file contains
a new set of data.

The problem you mentioned with your regex returning a match object
rather than a string is because you're simply using a re function that
doesn't return strings. re.findall() is what you want. That being said,
here is working code to mine data from your file.

[code]
import re

line = 'c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0'

energypat = r'\betot=(-?\d*?[.]\d*)'

# Note: To change the data grabbed from the line, you can change the
# 'etot' to 'afrac' or 'emad' or anything that doesn't contain a regex
# special character.

energypat = re.compile(energypat)

re.findall(energypat, line)  # returns a list of strings: ['-11.020107']
[/code]

This returns a list of strings, each of which is easy enough to convert
to a float. After that, you can datapoints.append() to your heart's
content.

Good luck with your work.

[EMAIL PROTECTED] wrote:

> Hello,
>
> I am new to Python, with a background in scientific computing. I'm
> trying to write a script that will take a file with lines like
>
> c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0
>
> extract the values of afrac and etot and plot them. I'm really
> struggling with getting the values of afrac and etot. So far I have
> come up with (small snippet of script just to get the energy, etot):
>
> def get_data_points(filename):
>     file = open(filename,'r')
>     data_points = []
>     while 1:
>         line = file.readline()
>         if not line: break
>         energy = get_total_energy(line)
>         data_points.append(energy)
>     return data_points
>
> def get_total_energy(line):
>     rawstr = r"(?P<key>.*?)=(?P<value>.*?)\s"
>     p = re.compile(rawstr)
>     return p.match(line,5)
>
> What is being stored in energy is '_sre.SRE_Match object at
> 0x2a955e4ed0', not '-11.020107'. Why? I've been struggling with
> regular expressions for two days now, with no luck. Could someone
> please put me out of my misery and give me a clue as to what's going
> on? Apologies if it's blindingly obvious or if this question has been
> asked and answered before.
>
> Thanks,
>
> Nicole
Re: Regular expression help
On Jul 18, 3:35 pm, Nick Dumas [EMAIL PROTECTED] wrote:

> [...]

Thanks guys :-)
Re: Regular expression help
On Fri, 18 Jul 2008 10:04:29 -0400, Russell Blau wrote:

> values = {}
> for expression in line.split(" "):
>     if "=" in expression:
>         name, val = expression.split("=")
>         values[name] = val
> […]
>
> And when you get to be a really hard-core Pythonista, you could write
> the whole routine above in one line, but this seems clearer. ;-)

I know it's a matter of taste but I think the one liner is still clear
(enough)::

  values = dict(s.split('=') for s in line.split() if '=' in s)

Ciao,
Marc 'BlackJack' Rintsch
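For plotting, the string values still have to be converted to numbers. Here is a sketch in Python 3 that extends the one-liner above; passing 1 as the second argument to split() is an addition of mine so that a token containing more than one '=' does not break the dict() call.

```python
line = 'c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0'

# Split on whitespace, keep the name=value tokens, split each one once.
values = dict(s.split('=', 1) for s in line.split() if '=' in s)

# The dict holds strings; convert only the fields that will be plotted.
afrac = float(values['afrac'])
etot = float(values['etot'])

print(afrac, etot)
```

The leading "c" token has no '=' and is skipped by the filter, so no special case is needed for it.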
Regular Expression Help
Hi all,

I have text like:

STRINGTABLE
BEGIN
    ID_NEXT_PANE "Cambiar a la siguiente sección de la ventana\nSiguiente sección"
    ID_PREV_PANE "Regresar a la sección anterior de la ventana\nSección anterior"
END
STRINGTABLE
BEGIN
    ID_VIEW_TOOLBAR "Mostrar u ocultar la barra de herramientas\nMostrar/Ocultar la barra de herramientas"
    ID_VIEW_STATUS_BAR "Mostrar u ocultar la barra de estado\nMostrar/Ocultar la barra de estado"
END
..

and I need to parse from STRINGTABLE to END as a list object. What kind
of regular expression should I write?

--
Regards,
Santhoshkumar.S
Re: Regular Expression Help
santhosh kumar [EMAIL PROTECTED] wrote:

> I have text like:
>
> STRINGTABLE
> BEGIN
>     ID_NEXT_PANE "Cambiar a la siguiente sección de la ventana\nSiguiente sección"
>     ID_PREV_PANE "Regresar a la sección anterior de la ventana\nSección anterior"
> END
> STRINGTABLE
> BEGIN
>     ID_VIEW_TOOLBAR "Mostrar u ocultar la barra de herramientas\nMostrar/Ocultar la barra de herramientas"
>     ID_VIEW_STATUS_BAR "Mostrar u ocultar la barra de estado\nMostrar/Ocultar la barra de estado"
> END
> ..
>
> and i need to parse from STRINGTABLE to END as a list object. whatkind
> of regular expression should i write.

I doubt very much whether you want any regular expressions at all. I'd
do something along these lines:

find a line=="STRINGTABLE"
assert the next line=="BEGIN"
then until we find a line=="END":
    idvalue = line.strip().split(None,1)
    assert len(idvalue)==2
    result.append(idvalue)
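The outline above can be turned into a small line-based parser. This is a sketch in Python 3 under two assumptions that the thread does not confirm: each entry sits on a single line, and the value is wrapped in double quotes. The function name parse_stringtables and the sample text are invented for the example.

```python
def parse_stringtables(text):
    """Return one list of (id, value) pairs per STRINGTABLE block."""
    tables = []
    current = None          # list being filled, or None outside a block
    saw_header = False      # True right after a STRINGTABLE line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if current is None:
            if line == 'STRINGTABLE':
                saw_header = True
            elif line == 'BEGIN' and saw_header:
                current = []
                saw_header = False
        elif line == 'END':
            tables.append(current)
            current = None
        else:
            # Split the identifier from the rest of the line.
            parts = line.split(None, 1)
            if len(parts) == 2:
                current.append((parts[0], parts[1].strip('"')))
    return tables

sample = '''STRINGTABLE
BEGIN
    ID_VIEW_TOOLBAR "Mostrar u ocultar la barra de herramientas"
    ID_VIEW_STATUS_BAR "Mostrar u ocultar la barra de estado"
END
'''
tables = parse_stringtables(sample)
print(tables)
```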
Regular Expression Help
Hi All, I have a Python utility which helps to generate an Excel file for language translation. For any new language, we generate an Excel file which has the English text and a column for the target translation language. The translator provides the translated strings, and a second Python utility reads the target-language strings from the Excel file and updates/generates the resource file database records. Our application is a VC++ application, and we use an MS Access db. We have string tables like this:

STRINGTABLE BEGIN IDS_CONTEXT_API_ API Totalizer Control Dialog IDS_CONTEXT Gas Analyzer END STRINGTABLE BEGIN ID_APITOTALIZER_CONTROL Start, stop, and reset API volume flow \nTotalizer Control END

and this repeats. I read the file line by line and pick out the contents inside each STRINGTABLE. I want to use a regular expression which should give me all the entries within

STRINGTABLE BEGIN Get what ever put in this END

I tried a little bit, but no luck. Note that there are multi-line string entries which we cannot turn into single lines. Regards, Krish
Re: Regular Expression Help
On Feb 27, 6:28 am, [EMAIL PROTECTED] wrote: Hi All, I have a python utility which helps to generate an excel file for language translation. For any new language, we will generate the excel file which will have the English text and column for interested translation language. The translator will provide the language string and again I will have python utility to read the excel file target language string and update/generate the resource file database records. Our application is VC++ application, we use MS Access db. We have string table like this. STRINGTABLE BEGIN IDS_CONTEXT_API_ API Totalizer Control Dialog IDS_CONTEXT Gas Analyzer END STRINGTABLE BEGIN ID_APITOTALIZER_CONTROL Start, stop, and reset API volume flow \nTotalizer Control END this repeats. I read the file line by line and pick the contents inside the STRINGTABLE. I want to use the regular expression while should give me all the entries with in STRINGTABLE BEGIN Get what ever put in this END I tried little bit, but no luck. Note that it is multi-line string entries which we cannot make as single line

Looks to me like you have a very simple grammar:

    entry ::= id quoted_string

id is matched by r'[A-Z]+[A-Z_]+'
quoted_string is matched by r'"[^"]*"'

So a pattern which will pick out one entry would be something like r'([A-Z]+[A-Z_]+)\s+("[^"]*")'

Note that using \s+ (whitespace) allows for having \n etc. between id and quoted_string. You need to build a string containing all the lines between BEGIN and END, and then use re.findall.

If you still can't get it to work, ask again, but do show the code from your best attempt, and reduce ambiguity by showing your test input as a Python expression, e.g.

    test1_in = """\
    ID_F "fough"
    ID_B_ "barre"
    ID__Z "zotte start
    zotte end"
    """
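A small sketch of the re.findall step, assuming (as the reply does) that every entry is an identifier followed by a double-quoted string. The names ENTRY and entries are made up for this example, and the capture group is moved inside the quotes so that only the string contents come back, which is a small change from the pattern given above.

```python
import re

# Assumption: entry ::= id quoted_string, with the string in double quotes.
# [^"]* also matches newlines, so a quoted value may span several lines.
ENTRY = re.compile(r'([A-Z]+[A-Z_]+)\s+"([^"]*)"')


def entries(block):
    """Return (id, text) pairs found between one BEGIN and its END."""
    return ENTRY.findall(block)


# Test input in the spirit of the reply; the exact line breaks are a guess.
test1_in = '''\
ID_F "fough"
ID_B_ "barre"
ID__Z "zotte start
zotte end"
'''

if __name__ == "__main__":
    for ident, text in entries(test1_in):
        print(ident, repr(text))
```

The id pattern as written needs at least two characters and does not allow digits, so it would have to be widened if real identifiers contain numbers.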
Re: python regular expression help
On Apr 11, 11:15 pm, [EMAIL PROTECTED] wrote: On Apr 11, 9:50 pm, Gabriel Genellina [EMAIL PROTECTED] lhs = re.compile(r'\s*(\b\w+\s*=)') for s in [ a = 4 b =3.4 5.4 c = 4.5, a = 4.5 b = 'h' 'd' c = 4.5 3.5]: tokens = lhs.split(s) results = [tokens[_] + tokens[_+1] for _ in range(1,len(tokens), The only thing I can think when I look at that is: what a syntactic abomination. -- http://mail.python.org/mailman/listinfo/python-list
Re: python regular expression help
Hi, Yeah, a little bit tricky. Actually it is part of some Fortran input file. Thanks for the suggestion! It helps a lot! Thanks, Qilong
python regular expression help
Hi, everyone, I am extracting some information from a given string using Python RE. The string is, for example,

    s = 'a = 4 b =3.4 5.4 c = 4.5'

What I want is:

    a = 4
    b = 3.4 5.4
    c = 4.5

Right now I use:

    pattern = re.compile(r'\w+\s*=\s*.*?\s+')
    lists = pattern.findall(s)

It works for a string like 'a = 4 b = 3.4 c = 4.5', but does not work with strings like 'a=4 b=3.4 5.4 c = 4.5'. Any suggestions? Thanks, Qilong
Re: python regular expression help
pattern = re.compile(r'\w+\s*=\s*[0-9]*.[0-9]*\s*') lists = pattern.findall(s) print lists ['a=4 ', 'b=3.4 ', 'c=4.5']

On Wed, Apr 11, 2007 at 06:10:07PM -0700, Qilong Ren wrote: Hi, everyone, I am extracting some information from a given string using python RE. The string is ,for example, s = 'a = 4 b =3.4 5.4 c = 4.5' What I want is : a = 4 b = 3.4 5.4 c = 4.5 Right now I use : pattern = re.compile(r'\w+\s*=\s*.*?\s+') lists = pattern.findall(s) It works for the string like 'a = 4 b = 3.4 c = 4.5', but does not work with strings like 'a=4 b=3.4 5.4 c = 4.5' Any suggestion? Thanks,Qilong
Re: python regular expression help
Hi, Thanks for the reply. That actually is not what I want. Strings I am dealing with may look like this:

    s = 'a = 4.5 b = 'h' 'd' c = 4.5 3.5'

What I want is:

    a = 4.5
    b = 'h' 'd'
    c = 4.5 3.5
Re: python regular expression help
On Apr 11, 7:41 pm, liupeng [EMAIL PROTECTED] wrote: pattern = re.compile(r'\w+\s*=\s*[0-9]*.[0-9]*\s*') lists = pattern.findall(s) print lists ['a=4 ', 'b=3.4 ', 'c=4.5'] On Wed, Apr 11, 2007 at 06:10:07PM -0700, Qilong Ren wrote: Hi, everyone, I am extracting some information from a given string using python RE. The string is ,for example, s = 'a = 4 b =3.4 5.4 c = 4.5' What I want is : a = 4 b = 3.4 5.4 c = 4.5 Right now I use : pattern = re.compile(r'\w+\s*=\s*.*?\s+') lists = pattern.findall(s) It works for the string like 'a = 4 b = 3.4 c = 4.5', but does not work with strings like 'a=4 b=3.4 5.4 c = 4.5' Any suggestion? Thanks,Qilong

Try this:

    import re
    s = 'a = 4 b =3.4 5.4 c = 4.5'
    r = re.compile("[a-z]+.*?(?=[a-z]|$)")
    l = r.findall(s)
    print l
Re: python regular expression help
On Wed, 11 Apr 2007 23:14:01 -0300, Qilong Ren [EMAIL PROTECTED] wrote: Thanks for reply. That actually is not what I want. Strings I am dealing with may look like this: s = 'a = 4.5 b = 'h' 'd' c = 4.5 3.5' What I want is a = 4.5 b = 'h' 'd' c = 4.5 3.5

That's a bit tricky. You have LHS = RHS where RHS includes all the following text *except* the very next word before the following = (which is the LHS of the next expression). Or something like that :)

    py> import re
    py> s = "a = 4.5 b = 'h' 'd' c = 4.5 3.5"
    py> r = re.compile(r"\w+\s*=\s*.*?(?=\w+\s*=|$)")
    py> for item in r.findall(s):
    ...     print item
    ...
    a = 4.5
    b = 'h' 'd'
    c = 4.5 3.5

-- Gabriel Genellina
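For readers on Python 3, the same lookahead idea can be wrapped in a small function. This is a sketch rather than the poster's exact code: the names ASSIGNMENT and split_assignments are made up here, and stripping the trailing whitespace from each match is an addition.

```python
import re

# LHS '=' RHS, where the lazy RHS stops just before the next "name ="
# (checked by the lookahead, which consumes nothing) or at end of string.
ASSIGNMENT = re.compile(r"\w+\s*=\s*.*?(?=\w+\s*=|$)")


def split_assignments(s):
    """Split 'a = 1 b = 2 3' into ['a = 1', 'b = 2 3']."""
    return [item.strip() for item in ASSIGNMENT.findall(s)]


if __name__ == "__main__":
    for item in split_assignments("a = 4.5 b = 'h' 'd' c = 4.5 3.5"):
        print(item)
```

Because the lookahead only inspects the text ahead, a value that itself contains something shaped like "word =" would be split at that point, and "." does not cross newlines unless re.DOTALL is added.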
Re: python regular expression help
Hi, I don't quite understand the regular expression:

    re.compile("[a-z]+.*?(?=[a-z]|$)")

and I tried it. In some cases it works. But if the string looks like:

    s = 'a = 3.4 b = 4.5 5.6 c = h,d'

it fails. What I came up with is:

    names = re.compile(r'(\w+)\s*=').findall(s)

and the corresponding values:

    values = re.split(r'\w+\s*=', s)[1:]

It does not look good, but it works. What do you think? Thanks, Qilong
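The names/values idea can be tied together with zip. A minimal sketch; the function name parse_pairs and the whitespace stripping are additions for illustration, not part of the post.

```python
import re


def parse_pairs(s):
    """Map each name to the raw text of its right-hand side."""
    names = re.findall(r"(\w+)\s*=", s)
    # Splitting on "name =" leaves the text before the first name at
    # index 0, followed by one value chunk per name.
    values = re.split(r"\w+\s*=", s)[1:]
    return dict(zip(names, (v.strip() for v in values)))


if __name__ == "__main__":
    print(parse_pairs("a = 3.4 b = 4.5 5.6 c = h,d"))
```

The two patterns have to stay in step: both treat any word followed by "=" as a name, so names and values line up one to one.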
Re: python regular expression help
On Apr 11, 10:50 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: En Wed, 11 Apr 2007 23:14:01 -0300, Qilong Ren [EMAIL PROTECTED] escribió: Thanks for reply. That actually is not what I want. Strings I am dealing with may look like this: s = 'a = 4.5 b = 'h' 'd' c = 4.5 3.5' What I want is a = 4.5 b = 'h' 'd' c = 4.5 3.5 I suppose next you'll post your strings can also look like this: [EMAIL PROTECTED]@[EMAIL PROTECTED]@%12341234qeerasdfdae and you want A = 3 -- http://mail.python.org/mailman/listinfo/python-list
Re: python regular expression help
On Apr 11, 9:50 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: On Wed, 11 Apr 2007 23:14:01 -0300, Qilong Ren [EMAIL PROTECTED] wrote: Thanks for reply. That actually is not what I want. Strings I am dealing with may look like this: s = 'a = 4.5 b = 'h' 'd' c = 4.5 3.5' What I want is a = 4.5 b = 'h' 'd' c = 4.5 3.5 That's a bit tricky. You have LHS = RHS where RHS includes all the following text *except* the very next word before the following = (which is the LHS of the next expression). Or something like that :) py> import re py> s = "a = 4.5 b = 'h' 'd' c = 4.5 3.5" py> r = re.compile(r"\w+\s*=\s*.*?(?=\w+\s*=|$)") py> for item in r.findall(s): ... print item ... a = 4.5 b = 'h' 'd' c = 4.5 3.5

Another way is to use split:

    import re
    lhs = re.compile(r'\s*(\b\w+\s*=)')
    for s in ["a = 4 b =3.4 5.4 c = 4.5",
              "a = 4.5 b = 'h' 'd' c = 4.5 3.5"]:
        tokens = lhs.split(s)
        results = [tokens[_] + tokens[_+1] for _ in range(1, len(tokens), 2)]
        print s
        print results

-- Regards, Steven
Re: python regular expression help
On Apr 11, 11:50 pm, Gabriel Genellina [EMAIL PROTECTED] wrote: On Wed, 11 Apr 2007 23:14:01 -0300, Qilong Ren [EMAIL PROTECTED] wrote: Thanks for reply. That actually is not what I want. Strings I am dealing with may look like this: s = 'a = 4.5 b = 'h' 'd' c = 4.5 3.5' What I want is a = 4.5 b = 'h' 'd' c = 4.5 3.5 That's a bit tricky. You have LHS = RHS where RHS includes all the following text *except* the very next word before the following = (which is the LHS of the next expression). Or something like that :) py> import re py> s = "a = 4.5 b = 'h' 'd' c = 4.5 3.5" py> r = re.compile(r"\w+\s*=\s*.*?(?=\w+\s*=|$)") py> for item in r.findall(s): ... print item ... a = 4.5 b = 'h' 'd' c = 4.5 3.5 -- Gabriel Genellina

The pyparsing version is a bit more readable, and probably simpler to come back to later to expand the definition of varName, for example.

    from pyparsing import Word, alphas, nums, FollowedBy, sglQuotedString, OneOrMore

    realNum = Word(nums, nums + ".").setParseAction(lambda t: float(t[0]))
    varName = Word(alphas)
    LHS = varName + FollowedBy("=")
    RHSval = sglQuotedString | realNum | varName
    RHS = OneOrMore(~LHS + RHSval)
    assignment = LHS.setResultsName("LHS") + '=' + RHS.setResultsName("RHS")

    s = "a = 4.5 b = 'h' 'd' c = 4.5 3.5"
    for a in assignment.searchString(s):
        print a.LHS, '=', a.RHS

prints:

    ['a'] = [4.5]
    ['b'] = ['h', 'd']
    ['c'] = [4.5, 3.5]
Re: Regular Expression help for parsing html tables
[EMAIL PROTECTED] wrote: Hello, I am having some difficulty creating a regular expression for the following string situation in html. I want to find a table that has specific text in it and then extract the html just for that immediate table. the string would look something like this: ...stuff here... <table> ...stuff here... <table> ...stuff here... <table> ... text i'm searching for ... </table> ...stuff here... </table> ...stuff here... </table> ...stuff here... My question: is there a way in RE to say: when I find this text I'm looking for, search backwards and find the immediate instance of the string <table> and then search forwards and find the immediate instance of the string </table>. ? any help is appreciated. Steve.

It would have been easier if you'd said what the text you are looking for is, but I think:

    regex = re.compile( r'<table>(.*?text you are looking for.*?)</table>', re.DOTALL )
    match = regex.search( html_string )
    found_table = match.group( 1 )

would work. /Odalrick
Re: Regular Expression help for parsing html tables
[EMAIL PROTECTED] wrote: Hello, I am having some difficulty creating a regular expression for the following string situation in html. I want to find a table that has specific text in it and then extract the html just for that immediate table. the string would look something like this: ...stuff here... <table> ...stuff here... <table> ...stuff here... <table> ... text i'm searching for ... </table> ...stuff here... </table> ...stuff here... </table> ...stuff here... My question: is there a way in RE to say: when I find this text I'm looking for, search backwards and find the immediate instance of the string <table> and then search forwards and find the immediate instance of the string </table>. ? any help is appreciated. Steve.

Might searching the output of BeautifulSoup(html).prettify() make things easier? http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20HTML - Paddy
Regular Expression help for parsing html tables
Hello, I am having some difficulty creating a regular expression for the following string situation in HTML. I want to find a table that has specific text in it and then extract the HTML just for that immediate table. The string would look something like this:

    ...stuff here... <table> ...stuff here... <table> ...stuff here... <table> ... text i'm searching for ... </table> ...stuff here... </table> ...stuff here... </table> ...stuff here...

My question: is there a way in RE to say: when I find the text I'm looking for, search backwards and find the immediate instance of the string <table>, and then search forwards and find the immediate instance of the string </table>? Any help is appreciated. Steve.
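The "search backwards, then forwards" idea described in the question can be done with plain string methods instead of a regex. A minimal sketch, assuming lowercase tags and that the text sits directly inside the innermost table with no complete nested table between it and the surrounding tags; the function name innermost_table is made up for this example.

```python
def innermost_table(html, needle):
    """Return the closest <table ...>...</table> around needle, or None."""
    pos = html.find(needle)
    if pos == -1:
        return None
    start = html.rfind("<table", 0, pos)  # nearest opening tag before the text
    end = html.find("</table>", pos)      # nearest closing tag after the text
    if start == -1 or end == -1:
        return None
    return html[start:end + len("</table>")]


if __name__ == "__main__":
    doc = "<table>a<table>b<table>needle here</table>c</table>d</table>"
    print(innermost_table(doc, "needle"))
```

For real-world HTML, a parser such as those suggested elsewhere in the thread is the more robust choice; this only mirrors the literal request.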
Re: Regular Expression help for parsing html tables
Hi Steve, [EMAIL PROTECTED] wrote: I am having some difficulty creating a regular expression for the following string situation in html. I want to find a table that has specific text in it and then extract the html just for that immediate table. Any reason why you can't use a real HTML parser and API (e.g. the one provided by lxml)? That can really make things easier here. http://codespeak.net/lxml/ http://codespeak.net/lxml/api.html#parsers http://codespeak.net/lxml/api.html#trees-and-documents http://effbot.org/zone/element-index.htm Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
hanumizzle wrote: On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch [EMAIL PROTECTED] wrote: Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? This is not possible with regular expressions - they can't remember how many parens they already encountered. Remember that regular expressions are used to represent regular grammars. Most regex engines actually aren't regular in that they support fancy things like look-behind/ahead and capture groups...IIRC, these cannot be part of a true regular expression library. Certainly true, and it always gives me a hard time because I don't know to which extend a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times With that said, the quote-unquote regexes in Lua have a special feature that supports balanced expressions. I believe Python has a PCRE lib somewhere; you may be able to use the experimental ??{ } construct in that case. Even if it has - I'm not sure if it really does you good, for several reasons: - regexes - even enhanced ones - don't build trees. But that is what you ultimately want from an expression like sin(log(x)) - even if they are more powerful these days, the theory of context free grammars still applies. so if what you need isn't LL(k) but LR(k), how do you specify that to the regex engine? - the regexes are useful because of their compact notations, parsers allow for better structured outcome Diez -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
On 8 Oct 2006 01:49:50 -0700, Diez B. Roggisch [EMAIL PROTECTED] wrote: Even if it has - I'm not sure if it really does you good, for several reasons: - regexes - even enhanced ones - don't build trees. But that is what you ultimately want from an expression like sin(log(x)) - even if they are more powerful these days, the theory of context free grammars still applies. so if what you need isn't LL(k) but LR(k), how do you specify that to the regex engine? - the regexes are useful because of their compact notations, parsers allow for better structured outcome Just wait for Perl 6 :D -- Theerasak -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Tim Chase: It still doesn't solve the aforementioned problem of things like ')))(((' which is balanced, but psychotic. :)

This may solve the problem:

    def balanced(txt):
        d = {'(': 1, ')': -1}
        tot = 0
        for c in txt:
            tot += d.get(c, 0)
            if tot < 0:
                return False
        return tot == 0

    print balanced("42^((2x+2)sin(x)) + (log(2)/log(5))")   # True
    print balanced("42^((2x+2)sin(x) + (log(2)/log(5))")    # False
    print balanced("42^((2x+2)sin(x))) + (log(2)/log(5))")  # False
    print balanced(")))(((")                                # False

A possible alternative for Py 2.5. The dict solution looks better, but this may be faster:

    def balanced2(txt):
        tot = 0
        for c in txt:
            tot += 1 if c == "(" else (-1 if c == ")" else 0)
            if tot < 0:
                return False
        return tot == 0

Bye, bearophile
Re: need some regular expression help
[EMAIL PROTECTED] wrote: The dict solution looks better, but this may be faster:

it's slightly faster, but both your alternatives are about 10x slower than a straightforward:

    def balanced(txt):
        return txt.count("(") == txt.count(")")

/F
Re: need some regular expression help
Thus spoke Diez B. Roggisch (on 2006-10-08 10:49): Certainly true, and it always gives me a hard time because I don't know to what extent a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times.

Right, in Perl this would be a no-brainer; it's documented all over the place, like:

    my $re;
    $re = qr{
        (?:
            (?> [^\\()]+ | \\. )
          |
            \( (??{ $re }) \)
        )*
    }xs;

where you have a 'delayed execution' of the (??{ $re }) which in the end makes the whole thing a recursive one: it gets expanded and executed if the match finds its way to it. The above regex will match balanced parens, as in:

    my $good = 'a + (b / (c - 2)) * (d ^ (e+f)) ';
    my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) ';
    my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )';

if you do:

    print "ok \n" if $good =~ /^$re$/;
    print "ok \n" if $bad1 =~ /^$re$/;
    print "ok \n" if $bad2 =~ /^$re$/;

This is documented in some depth, e.g. in http://japhy.perlmonk.org/articles/tpj/2004-summer.html (topic: Recursive Regexes). Regards M.
Re: need some regular expression help
Mirco Wahab schrieb: Thus spoke Diez B. Roggisch (on 2006-10-08 10:49): Certainly true, and it always gives me a hard time because I don't know to which extend a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times Right, in perl, this would be a no-brainer, its documented all over the place, like: my $re; $re = qr{ (?: (? [^\\()]+ | \\. ) | \( (??{ $re }) \) )* }xs; where you have a 'delayed execution' of the (??{ $re }) which in the end makes the whole a thing recursive one, it gets expanded and executed if the match finds its way to it. Above regex will match balanced parens, as in: my $good = 'a + (b / (c - 2)) * (d ^ (e+f)) '; my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) '; my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )'; if you do: print ok \n if $good =~ /^$re$/; print ok \n if $bad1 =~ /^$re$/; print ok \n if $bad2 =~ /^$re$/; This in some depth documented e.g. in http://japhy.perlmonk.org/articles/tpj/2004-summer.html (topic: Recursive Regexes) That clearly is a recursive grammar rule, and thus it can't be regular anymore :) But first of all, I find it ugly - the clean separation of lexical and syntactical analysis is better here, IMHO - and secondly, what are the properties of that parsing? Is it LL(k), LR(k), backtracking? Diez -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Fredrik Lundh wrote: it's slightly faster, but both your alternatives are about 10x slower than a straightforward: def balanced(txt): return txt.count(() == txt.count()) I know, but if you read my post again you see that I have shown those solutions to mark )))((( as bad expressions. Just counting the parens isn't enough. Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Diez B. Roggisch [EMAIL PROTECTED] wrote: Certainly true, and it always gives me a hard time because I don't know to which extend a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times What old times? I've been working with regex for mumble years and there's always been the problem that every implementation supports a slightly different syntax. Even back in the good old days, grep, awk, sed, and ed all had slightly different flavors. -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
On 10/8/06, Roy Smith [EMAIL PROTECTED] wrote: Diez B. Roggisch [EMAIL PROTECTED] wrote: Certainly true, and it always gives me a hard time because I don't know to which extend a regular expression nowadays might do the job because of these extensions. It was so much easier back in the old times What old times? I've been working with regex for mumble years and there's always been the problem that every implementation supports a slightly different syntax. Even back in the good old days, grep, awk, sed, and ed all had slightly different flavors. Which grep? Which awk? :) -- Theerasak -- http://mail.python.org/mailman/listinfo/python-list
need some regular expression help
I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? Thanks for any help! -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? This is not possible with regular expressions - they can't remember how many parens they already encountered. You will need a real parser for this - pyparsing seems to be the most popular choice today, I personally like spark. I'm sure you find an example-grammar that will parse simple arithmetical expressions like the one above. Diez -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out?

No, there is no such pattern. You will have to code up a function.

Consider what your spec really is: '42^((2x+2)sin(x)) + (log(2)/log(5))' has the same number of left and right parentheses; so does the zero-length string; so does ') + ('. Perhaps you need to add 'and starts with a ('.

Consider what you are going to do with input like this: print '(' + some_text + ')'

Maybe you need to do some lexical analysis and work at the level of tokens rather than individual characters. Which then raises the usual question: you have a perception that regular expressions are the solution -- to what problem?? HTH, John
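One way to "code up a function" that produces the list from the question is a depth counter that records where each outermost group starts and ends. This is only a sketch of that idea; the name top_level_groups and the choice to raise ValueError on unmatched parentheses are assumptions made for this example.

```python
def top_level_groups(text):
    """Return each outermost parenthesised substring of text, in order."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new outermost group opens here
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unmatched ')' at index %d" % i)
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    if depth != 0:
        raise ValueError("unmatched '(' at index %d" % start)
    return groups


if __name__ == "__main__":
    print(top_level_groups("42^((2x+2)sin(x)) + (log(2)/log(5))"))
```

It works at the level of single characters, so parentheses inside quoted strings, as in the print example above, would still be counted; handling those needs the token-level analysis the reply mentions.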
Re: need some regular expression help
On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch [EMAIL PROTECTED] wrote: Chris wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? This is not possible with regular expressions - they can't remember how many parens they already encountered. Remember that regular expressions are used to represent regular grammars. Most regex engines actually aren't regular in that they support fancy things like look-behind/ahead and capture groups...IIRC, these cannot be part of a true regular expression library. With that said, the quote-unquote regexes in Lua have a special feature that supports balanced expressions. I believe Python has a PCRE lib somewhere; you may be able to use the experimental ??{ } construct in that case. -- Theerasak -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
In article [EMAIL PROTECTED], Chris [EMAIL PROTECTED] wrote: I need a pattern that matches a string that has the same number of '(' as ')': findall( compile('...'), '42^((2x+2)sin(x)) + (log(2)/log(5))' ) = [ '((2x+2)sin(x))', '(log(2)/log(5))' ] Can anybody help me out? Thanks for any help! Why does it need to be a regex? There is a very simple and well-known algorithm which does what you want. Start with i=0. Walk the string one character at a time, incrementing i each time you see a '(', and decrementing it each time you see a ')'. At the end of the string, the count should be back to 0. If at any time during the process, the count goes negative, you've got mis-matched parentheses. The algorithm runs in O(n), same as a regex. Regex is a wonderful tool, but it's not the answer to all problems. -- http://mail.python.org/mailman/listinfo/python-list
Re: need some regular expression help
Why does it need to be a regex? There is a very simple and well-known algorithm which does what you want. Start with i=0. Walk the string one character at a time, incrementing i each time you see a '(', and decrementing it each time you see a ')'. At the end of the string, the count should be back to 0. If at any time during the process, the count goes negative, you've got mis-matched parentheses. The algorithm runs in O(n), same as a regex. Regex is a wonderful tool, but it's not the answer to all problems.

Following Roy's suggestion, one could use something like:

    >>> s = '42^((2x+2)sin(x)) + (log(2)/log(5))'
    >>> d = {'(': 1, ')': -1}
    >>> sum(d.get(c, 0) for c in s)
    0

If you get a sum() > 0, then you have too many "(" characters, and if you have sum() < 0, you have too many ")" characters. A sum() of 0 means there's the same number of parens. It still doesn't solve the aforementioned problem of things like ')))(((' which is balanced, but psychotic. :) -tkc
Re: Regular Expression help
Edward Elliott wrote: [EMAIL PROTECTED] wrote: If you are parsing HTML, it may make more sense to use a package designed especially for that purpose, like Beautiful Soup. I don't know Beautiful Soup, but one advantage regexes have over some parsers is handling malformed html. Beautiful Soup is intended to handle malformed HTML and seems to do pretty well. Kent -- http://mail.python.org/mailman/listinfo/python-list
Regular Expression help
I have some data and I need to put it in a list in a particular way. I have that figured out, but there is stuff in the data that I don't want. Example:

    10:00am - 11:00am:</b> <a href=/tvpdb?d=tvpid=167540528cf=0lineup=us_KS57836dchannels=us_KCTVchspid=166030466chname=CBSprogutn=114615.intl=us>The Price Is Right</a><em>

All I want is "Price Is Right". Here is the re:

    findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

I have used a for loop to remove the extra data, but then it ruins the list that I am building. Basically I want the list to be something like this:

    [["Government Access"], ["Price Is Right", "Guiding Light", "Another show"]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list structure that I need. I hope I have explained this well enough. Any help or ideas would be appreciated. TIA
Re: Regular Expression help
RunLevelZero wrote: 10:00am - 11:00am:</b> <a href=/tvpdb?d=tvpid=167540528[snip]>The Price Is Right</a><em> All I want is Price Is Right Here is the re. findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A regex remembers everything it matches -- no need to wrap the entire thing in parens. Just call group() on the returned MatchObject.
2. If all you want is the link text, you don't need to do so much matching. If you don't need the time, don't match it in the first place. If you're using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not as exact but a bit simpler. Or just r'[\d:apm]{6,7}'
3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'
4. If the link text itself contains html tags, you'll have to strip those off separately. Extracting the text from arbitrarily nested html tags in one shot requires a parser, not a regex.
5. If you're just going to run this regex repeatedly on an html doc and make a list of the results, it's easier to read the whole doc into a string and then use re.findall.

I have used a for loop to remove the extra data but then it ruins the list that I am building. Basically I want the list to be something like this. [[Government Access], [Price Is Right, Guiding Light, Another show]] the for loop just comma deliminates all of them so I lose the list in a list that I need. I hope I have explained this well enough. Any help or ideas would be appreciated.

No one can help with that unless you show us how you're building your list.
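Points 3 and 5 combined might look like the sketch below. The sample markup and the name link_texts are made up for illustration, and the pattern adds \b after "a" so that tags such as <abbr> are not picked up, which is a small tightening of the pattern in point 3.

```python
import re

# Lazily capture whatever sits between an opening <a ...> tag and the
# next </a>. DOTALL lets link text span lines.
LINK_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.DOTALL | re.IGNORECASE)


def link_texts(html):
    """Return the text of every link in html, in document order."""
    return [text.strip() for text in LINK_TEXT.findall(html)]


# Hypothetical listing fragment, loosely modelled on the example in the thread.
page = (
    '<b>10:00am - 11:00am:</b> <a href="/show?id=1">The Price Is Right</a><em>\n'
    '<b>11:00am - 12:00pm:</b> <a href="/show?id=2">Guiding Light</a><em>\n'
)

if __name__ == "__main__":
    print(link_texts(page))
```

As point 4 says, link text that itself contains tags comes back with those tags still in it.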
Re: Regular Expression help
Great, I will test this out once I have the time. Thanks for the quick response.
Re: Regular Expression help
If you are parsing HTML, it may make more sense to use a package designed especially for that purpose, like Beautiful Soup.
Re: Regular Expression help
I considered that, but what I need is simple and I don't want to pull in another library for something so simple. Thank you, though. Plus I don't understand them all that well :)
Re: Regular Expression help
If what you need is simple, regular expressions are almost never the answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:

    >>> from BeautifulSoup import BeautifulSoup
    >>> html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvpid=167540528[snip]">The Price Is Right</a><em>'
    >>> soup = BeautifulSoup(html)
    >>> soup('a')
    [<a href="/tvpdb?d=tvpid=167540528">The Price Is Right</a>]
    >>> for show in soup('a'):
    ...     print show.contents[0]
    ...
    The Price Is Right

RunLevelZero wrote:
> I considered that but what I need is simple and I don't want to use
> another library for something so simple but thank you. Plus I don't
> understand them all that well :)
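If adding a third-party package is the sticking point, roughly the same extraction can be sketched with the standard library's html.parser module (Python 3). This is not the BeautifulSoup approach shown above, just a dependency-free alternative. The class name LinkTextCollector, the helper link_texts, and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False   # True while we are between <a> and </a>
        self.parts = []        # text chunks of the link being read
        self.links = []        # finished link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.parts = []

    def handle_data(self, data):
        # Tags nested inside the link (e.g. <i>) are skipped; only their
        # text reaches this method.
        if self.in_link:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append("".join(self.parts).strip())
            self.in_link = False


def link_texts(html):
    """Return the text of every link in an HTML string."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample with a nested tag inside the link text.
sample = '10:00am - 11:00am:</b> <a href="/tvpdb?pid=1">The <i>Price</i> Is Right</a><em>'
print(link_texts(sample))
```

Since only text is collected, markup nested inside the link is dropped without a second pass.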
Re: Regular Expression help
> r'<a[^>]*>(.*?)</a>'

With a slight modification, that did exactly what I wanted. And yes, findall was the only way to get everything I needed, since I had buffered the whole read. Thanks a bunch.
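For the list-in-a-list shape asked for in the original post, one possible approach is to run findall once per channel instead of once over everything. The sketch below assumes the page has already been split into one HTML fragment per channel. The thread never shows how that split is done, so channel_blocks and its contents are hypothetical.

```python
import re

link_text = re.compile(r'<a[^>]*>(.*?)</a>')

# Hypothetical input: one HTML fragment per channel. The thread does not
# show how the page is divided up, so these strings are invented.
channel_blocks = [
    '<a href="/a">Government Access</a>',
    '<a href="/b">Price Is Right</a> <a href="/c">Guiding Light</a> '
    '<a href="/d">Another show</a>',
]

# One findall per fragment keeps each channel's shows in its own inner
# list, instead of flattening everything into a single list.
listing = [link_text.findall(block) for block in channel_blocks]
print(listing)
```

Appending each findall result as a whole (rather than extending one big list with it) is what preserves the nesting.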
Re: Regular Expression help
Interesting... thank you.
Re: Regular Expression help
[EMAIL PROTECTED] wrote:
> If you are parsing HTML, it may make more sense to use a package
> designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some parsers is handling malformed HTML. Omitted closing tags can wreak havoc. Regexes can also help if you only want elements preceded/followed by a certain sibling or cousin in the parse tree. It all depends on what you're trying to accomplish. In general though, yes, parsers are better suited to extracting from markup.
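As a small illustration of the "preceded by a certain sibling" case, a regex can require the time marker right before the link, so that unrelated links are skipped. This is a sketch only: the sample markup, the "Help" link, and the name after_time are invented, and the pattern assumes the time is followed directly by ":</b>" as in the fragment quoted earlier in the thread.

```python
import re

# Made-up sample: a navigation link followed by one listing entry.
html = (
    '<a href="/help">Help</a> '
    '<b>10:00am - 11:00am:</b> <a href="/p?id=1">The Price Is Right</a><em>'
)

# Capture link text only when the link directly follows an end time and
# its closing </b>; links elsewhere on the page do not match.
after_time = re.compile(r'\d{1,2}:\d\d[ap]m:</b>\s*<a[^>]*>(.*?)</a>')
print(after_time.findall(html))
```

The "Help" link is ignored because nothing resembling a time precedes it.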
Re: Regular Expression help
Edward Elliott [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> If you are parsing HTML, it may make more sense to use a package
>> designed especially for that purpose, like Beautiful Soup.
>
> I don't know Beautiful Soup, but one advantage regexes have over some
> parsers is handling malformed html. Omitted closing tags can wreak
> havoc. Regexes can also help if you only want elements
> preceded/followed by a certain sibling or cousin in the parse tree. It
> all depends on what you're trying to accomplish. In general though, yes
> parsers are better suited to extracting from markup.

A parser can be written in such a way that it doesn't give up on malformed HTML. That is probably less hard than coming up with regexes that handle HTML that's well-formed.

(and that coming from a Perl programmer ;-) )

--
John
MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
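To make the point about forgiving parsers concrete, here is a minimal sketch using Python 3's standard html.parser, which reports tags as it sees them and does not insist on matched pairs. The recovery policy (a new <a> or the end of input closes any link still open) is an assumption chosen for this example, and the names LenientLinkCollector and lenient_link_texts are made up.

```python
from html.parser import HTMLParser


class LenientLinkCollector(HTMLParser):
    """Collect link texts even when closing </a> tags are missing."""

    def __init__(self):
        super().__init__()
        self.parts = None   # None means "not currently inside a link"
        self.links = []

    def _finish_link(self):
        # Store the link being read, if there is one.
        if self.parts is not None:
            self.links.append("".join(self.parts).strip())
            self.parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._finish_link()   # assumed policy: new <a> closes the old one
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish_link()

    def handle_data(self, data):
        if self.parts is not None:
            self.parts.append(data)

    def close(self):
        super().close()           # flushes any buffered trailing text
        self._finish_link()       # assumed policy: end of input closes a link


def lenient_link_texts(html):
    """Return link texts from possibly malformed HTML."""
    parser = LenientLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up input: the first and last links are never closed.
broken = '<a href="/1">Guiding Light <a href="/2">Another show</a> <a href="/3">Late Movie'
print(lenient_link_texts(broken))
```

A regex that requires a closing </a> would merge or miss the unclosed entries here, while this collector still returns one text per link.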