Re: regex help
Larry Martell wrote: I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number. For example, if I have this: 14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I want to end up with: 14S,5.0,4.56862745,3.7272727272727271,3.394736842105263,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I have a regex to remove the zeros: '0+[,$]', '' But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this. First of all, I find it unlikely that you really want to solve your problem with regular expressions. Google “X-Y problem”. Second, if you must use regular expressions, the most simple approach is to use backreferences. Third, you need to show the relevant (Python) code. http://www.catb.org/~esr/faqs/smart-questions.html -- PointedEars Twitter: @PointedEars2 Please do not cc me. / Bitte keine Kopien per E-Mail. -- https://mail.python.org/mailman/listinfo/python-list
Re: regex help
On 2015-03-13 12:05, Larry Martell wrote:
> I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number.
> But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this.

You can do it with string-ops, or you can resort to regexp. Personally, I like the clarity of the string-ops version, but use what suits you.

-tkc

    import re

    input = [
        '14S', '5.', '4.5686274500', '3.7272727272727271',
        '3.3947368421052630', '5.7307692307692308',
        '5.7547169811320753', '4.9423076923076925',
        '5.7884615384615383', '5.13725490196',
    ]
    output = [
        '14S', '5.0', '4.56862745', '3.7272727272727271',
        '3.394736842105263', '5.7307692307692308',
        '5.7547169811320753', '4.9423076923076925',
        '5.7884615384615383', '5.13725490196',
    ]

    def fn1(s):
        # string-ops version: only touch values containing a point
        if '.' in s:
            s = s.rstrip('0')
            if s.endswith('.'):
                s += '0'
        return s

    def fn2(s):
        # regexp version; note that it leaves a bare '5.' unchanged
        return re.sub(r'(\.\d+?)0+$', r'\1', s)

    for fn in (fn1, fn2):
        for i, o in zip(input, output):
            v = fn(i)
            print("%s: %s - %s [%s]" % (v == o, i, v, o))
Re: regex help
On 2015-03-13 16:05, Larry Martell wrote: I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number. For example, if I have this: 14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I want to end up with: 14S,5.0,4.56862745,3.7272727272727271,3.394736842105263,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I have a regex to remove the zeros: '0+[,$]', '' But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this. Search: (\.\d+?)0+\b Replace: \1 which is: re.sub(r'(\.\d+?)0+\b', r'\1', string) -- https://mail.python.org/mailman/listinfo/python-list
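A minimal sketch of the suggested substitution applied to the whole comma-separated line. The first step is the pattern from the message above; the second step is an assumption added here, not something from the thread, because the pattern on its own needs at least one digit after the point and so leaves a bare `5.` untouched. The function name `strip_trailing_zeros` is hypothetical.

```python
import re

def strip_trailing_zeros(line):
    # Step 1: the pattern suggested above. The lazy \d+? keeps the
    # significant digits; 0+ followed by \b eats the trailing zeros.
    stripped = re.sub(r'(\.\d+?)0+\b', r'\1', line)
    # Step 2 (assumption, not from the thread): a point that is not
    # followed by a digit gets a single zero appended.
    return re.sub(r'\.(?!\d)', '.0', stripped)

line = ('14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,'
        '5.7307692307692308,5.7547169811320753,4.9423076923076925,'
        '5.7884615384615383,5.13725490196')
print(strip_trailing_zeros(line))
```

Keeping the two substitutions separate makes each one easy to test on its own.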
Re: regex help
On Fri, Mar 13, 2015 at 1:29 PM, MRAB pyt...@mrabarnett.plus.com wrote: On 2015-03-13 16:05, Larry Martell wrote: I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number. For example, if I have this: 14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I want to end up with: 14S,5.0,4.56862745,3.7272727272727271,3.394736842105263,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I have a regex to remove the zeros: '0+[,$]', '' But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this. Search: (\.\d+?)0+\b Replace: \1 which is: re.sub(r'(\.\d+?)0+\b', r'\1', string) Thanks! That works perfectly. -- https://mail.python.org/mailman/listinfo/python-list
Re: regex help
On 13Mar2015 12:05, Larry Martell larry.mart...@gmail.com wrote: I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number. For example, if I have this: 14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I want to end up with: 14S,5.0,4.56862745,3.7272727272727271,3.394736842105263,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I have a regex to remove the zeros: '0+[,$]', '' But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this. Leaving aside the suggested non-greedy match, you can rephrase this: strip trailing zeroes _after_ the first decimal digit. Then you can consider a number to be: digits point any digit other digits to be right-zero stripped so: (\d+\.\d)(\d*[1-9])?0*\b and keep .group(1) and .group(2) from the match. Another way of considering the problem. Or you could two step it. Strip all trailing zeroes. If the result ends in a dot, add a single zero. Cheers, Cameron Simpson c...@zip.com.au C'mon. Take the plunge. By the time you go through rehab the first time, you'll be surrounded by the most interesting people, and if it takes years off of your life, don't sweat it. They'll be the last ones anyway. - Vinnie Jordan, alt.peeves -- https://mail.python.org/mailman/listinfo/python-list
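The "keep .group(1) and .group(2)" idea could be sketched roughly as below. The pattern is the one from the message; the names `NUMBER` and `strip_zeros` are made up for illustration. Since the pattern requires a digit after the point, a bare `5.` is not matched and would still need the two-step approach mentioned at the end of the message.

```python
import re

# Pattern from the message above: group 1 is "digits point first digit",
# group 2 (optional) is the rest up to the last non-zero digit.
NUMBER = re.compile(r'(\d+\.\d)(\d*[1-9])?0*\b')

def strip_zeros(text):
    # Rebuild each number from groups 1 and 2; group 2 is None when
    # nothing but zeros follows the first decimal digit.
    return NUMBER.sub(lambda m: m.group(1) + (m.group(2) or ''), text)

print(strip_zeros('4.5686274500,3.3947368421052630,5.00,14S'))
```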
Re: regex help
Larry Martell wrote:
> I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number.

    def strip_zero(s):
        if '.' not in s:
            return s
        s = s.rstrip('0')
        if s.endswith('.'):
            s += '0'
        return s

And in use:

    py> strip_zero('-10.2500')
    '-10.25'
    py> strip_zero('123000')
    '123000'
    py> strip_zero('123000.')
    '123000.0'

It doesn't support exponential format:

    py> strip_zero('1.230e3')
    '1.230e3'

because it isn't clear what you intend to do under those circumstances.

-- 
Steven
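For the comma-separated line in the original question, the function above could be applied field by field. This is only a sketch: `strip_zero` is repeated so the block is self-contained, and `strip_line` is a hypothetical helper name.

```python
def strip_zero(s):
    # Same logic as the function above.
    if '.' not in s:
        return s
    s = s.rstrip('0')
    if s.endswith('.'):
        s += '0'
    return s

def strip_line(line, sep=','):
    # Split on the separator, fix each field, and join again.
    return sep.join(strip_zero(field) for field in line.split(sep))

print(strip_line('14S,5.,4.5686274500,3.3947368421052630,5.13725490196'))
```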
regex help
I need to remove all trailing zeros to the right of the decimal point, but leave one zero if it's whole number. For example, if I have this: 14S,5.,4.5686274500,3.7272727272727271,3.3947368421052630,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I want to end up with: 14S,5.0,4.56862745,3.7272727272727271,3.394736842105263,5.7307692307692308,5.7547169811320753,4.9423076923076925,5.7884615384615383,5.13725490196 I have a regex to remove the zeros: '0+[,$]', '' But I can't figure out how to get the 5. to be 5.0. I've been messing with the negative lookbehind, but I haven't found one that works for this. -- https://mail.python.org/mailman/listinfo/python-list
Newbie needs regex help
I'm getting bogged down with backslash escaping.

I have some text files containing characters with the 8th bit set. These characters are encoded one of two ways: either =hh or \xhh, where h represents a hex digit, and \x is a literal backslash followed by a lower-case x.

Catching the first case with a regex is simple. But when I try to write a regex to catch the second case, I mess up the escaping.

I took a look at http://docs.python.org/howto/regex.html, especially the section titled "The Backslash Plague". I started out trying:

    d...@dan:~/personal/usenet$ python
    Python 2.7 (r27:82500, Nov 15 2010, 12:10:23)
    [GCC 4.3.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> r = re.compile('x([0-9a-fA-F]{2})')
    >>> a = "This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 characters \xefn \xeft."
    >>> m = r.search(a)
    >>> m

No match.

I then followed the advice of the above-mentioned document, and expressed the regex as a raw string:

    >>> r = re.compile(r'\\x([0-9a-fA-F]{2})')
    >>> r.search(a)

Still no match.

I'm obviously missing something. I spent a fair bit of time playing with this over the weekend, and I got nowhere. Now it's time to ask for help. What am I doing wrong here?
Re: Newbie needs regex help
Dan M wrote: I'm getting bogged down with backslash escaping. I have some text files containing characters with the 8th bit set. These characters are encoded one of two ways: either =hh or \xhh, where h represents a hex digit, and \x is a literal backslash followed by a lower-case x. Catching the first case with a regex is simple. But when I try to write a regex to catch the second case, I mess up the escaping. I took at look at http://docs.python.org/howto/regex.html, especially the section titled The Backslash Plague. I started out trying : d...@dan:~/personal/usenet$ python Python 2.7 (r27:82500, Nov 15 2010, 12:10:23) [GCC 4.3.2] on linux2 Type help, copyright, credits or license for more information. import re r = re.compile('x([0-9a-fA-F]{2})') a = This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 characters \xefn \xeft. m = r.search(a) m No match. I then followed the advice of the above-mentioned document, and expressed the regex as a raw string: r = re.compile(r'\\x([0-9a-fA-F]{2})') r.search(a) Still no match. I'm obviously missing something. I spent a fair bit of time playing with this over the weekend, and I got nowhere. Now it's time to ask for help. What am I doing wrong here? What you're missing is that string `a` doesn't actually contain four- character sequences like '\', 'x', 'a', 'a' . It contains single characters that you encode in string literals as '\xaa' and so on. You might do better with p1 = r'([\x80-\xff])' r1 = re.compile (p1) m = r1.search (a) I get at least an _sre.SRE_Match object at 0xb749a6e0 when I try this. Mel. -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie needs regex help
Dan M d...@catfolks.net writes: I took at look at http://docs.python.org/howto/regex.html, especially the section titled The Backslash Plague. I started out trying : import re r = re.compile('x([0-9a-fA-F]{2})') a = This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 The backslash trickery applies to string literals also, not only regexps. Your string does not have the value you think it has. Double each backslash (or make your string raw) and you'll get what you expect. -- Alain. -- http://mail.python.org/mailman/listinfo/python-list
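A small sketch of the point being made, written for Python 3: the same characters typed as a normal literal and as a raw literal give different strings, and only the raw one contains the four-character sequences the regex looks for. The names `cooked` and `raw` are illustrative.

```python
import re

pattern = re.compile(r'\\x([0-9a-fA-F]{2})')

cooked = "This \xef file"   # \xef is ONE character here (U+00EF)
raw = r"This \xef file"     # backslash, 'x', 'e', 'f': four characters

print(len(cooked), len(raw))
print(pattern.search(cooked))          # no literal backslash to find
print(pattern.search(raw).group(1))    # the two hex digits
```

Text read from a file behaves like the raw literal, since escape processing only applies to string literals in source code.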
Re: Newbie needs regex help
On Mon, 06 Dec 2010 10:29:41 -0500, Mel wrote: What you're missing is that string `a` doesn't actually contain four- character sequences like '\', 'x', 'a', 'a' . It contains single characters that you encode in string literals as '\xaa' and so on. You might do better with p1 = r'([\x80-\xff])' r1 = re.compile (p1) m = r1.search (a) I get at least an _sre.SRE_Match object at 0xb749a6e0 when I try this. Mel. That's what I had initially assumed was the case, but looking at the data files with a hex editor showed me that I do indeed have four-character sequences. That's what makes this such as interesting task! -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie needs regex help
On Mon, 06 Dec 2010 16:34:56 +0100, Alain Ketterlin wrote: Dan M d...@catfolks.net writes: I took at look at http://docs.python.org/howto/regex.html, especially the section titled The Backslash Plague. I started out trying : import re r = re.compile('x([0-9a-fA-F]{2})') a = This \xef file \xef has \x20 a bunch \xa0 of \xb0 crap \xc0 The backslash trickery applies to string literals also, not only regexps. Your string does not have the value you think it has. Double each backslash (or make your string raw) and you'll get what you expect. -- Alain. D'oh! I hadn't thought of that. If I read my data file in from disk, use the raw string version of the regex, and do the search that way I do indeed get the results I'm looking for. Thanks for pointing that out. I guess I need to think a little deeper into what I'm doing when I escape stuff. -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie needs regex help
On Mon, 06 Dec 2010 09:44:39 -0600, Dan M wrote: That's what I had initially assumed was the case, but looking at the data files with a hex editor showed me that I do indeed have four-character sequences. That's what makes this such as interesting task! Sorry, I misunderstood the first time I read your reply. You're right, the string I showed did indeed contain single-byte characters, not four-character sequences. The data file I work with, though, does contain four-character sequences. -- http://mail.python.org/mailman/listinfo/python-list
Re: Newbie needs regex help
Dan M wrote:
> I'm getting bogged down with backslash escaping.
>
> I have some text files containing characters with the 8th bit set. These characters are encoded one of two ways: either =hh or \xhh, where h represents a hex digit, and \x is a literal backslash followed by a lower-case x.

By the way:

    >>> print quopri.decodestring("=E4=F6=FC").decode("iso-8859-1")
    äöü
    >>> print r"\xe4\xf6\xfc".decode("string-escape").decode("iso-8859-1")
    äöü
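The two calls above are Python 2 (the `string-escape` codec does not exist in Python 3). A rough Python 3 equivalent, offered as a sketch rather than a drop-in replacement, might look like this. The helper names are hypothetical, and `unicode_escape` is a swapped-in codec: it also interprets other escapes such as `\n` and `\uXXXX`, and reads non-escaped bytes as Latin-1.

```python
import quopri

def decode_quoted_printable(data):
    # data is bytes such as b'=E4=F6=FC'
    return quopri.decodestring(data).decode('iso-8859-1')

def decode_backslash_x(data):
    # data is bytes containing literal backslash-x sequences
    return data.decode('unicode_escape')

print(decode_quoted_printable(b'=E4=F6=FC'))
print(decode_backslash_x(rb'\xe4\xf6\xfc'))
```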
Re: Newbie needs regex help
On Mon, 06 Dec 2010 18:12:33 +0100, Peter Otten wrote: By the way: print quopri.decodestring(=E4=F6=FC).decode(iso-8859-1) äöü print r\xe4\xf6\xfc.decode(string-escape).decode(iso-8859-1) äöü Ah - better than a regex. Thanks! -- http://mail.python.org/mailman/listinfo/python-list
regex help: splitting string gets weird groups
[ python3.1.1, re.__version__='2.2.1' ] I'm trying to use re to split a string into (any number of) pieces of these kinds: 1) contiguous runs of letters 2) contiguous runs of digits 3) single other characters e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234] I tried: re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups() ('1234', 'in', '1234', '=') Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine? I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help: splitting string gets weird groups
gry wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions.

If the regex was illegal then it would raise an exception. It's doing exactly what you're asking it to do!

First of all, there are 4 groups, with group 1 containing groups 2..4 as alternatives, so group 1 will match whatever groups 2..4 match:

    Group 1: (([A-Za-z]+)|([0-9]+)|([-.#=]))
    Group 2: ([A-Za-z]+)
    Group 3: ([0-9]+)
    Group 4: ([-.#=])

It matches like this:

    Group 1 and group 3 match '555'.
    Group 1 and group 2 match 'tHe'.
    Group 1 and group 4 match '-'.
    Group 1 and group 2 match 'rain'.
    Group 1 and group 4 match '.'.
    Group 1 and group 2 match 'in'.
    Group 1 and group 4 match '#'.
    Group 1 and group 4 match '='.
    Group 1 and group 3 match '1234'.

If a group matches then any earlier match of that group is discarded, so:

    Group 1 finishes with '1234'.
    Group 2 finishes with 'in'.
    Group 3 finishes with '1234'.
    Group 4 finishes with '='.

A solution is:

    >>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', '555tHe-rain.in#=1234')
    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

Note: re.findall() returns a list of matches, so if the regex doesn't contain any groups then it returns the matched substrings. Compare:

    >>> re.findall("a(.)", "ax ay")
    ['x', 'y']
    >>> re.findall("a.", "ax ay")
    ['ax', 'ay']
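The "earlier match is discarded" behaviour can be seen on a much smaller input. The pattern and string below are made up for illustration; they are not from the thread.

```python
import re

# Two capture groups inside a repeated non-capturing group.
m = re.match(r'(?:(\d)|([a-z]))+', '1a2b')

# Each group only remembers the LAST thing it matched.
print(m.groups())

# findall() with no groups returns every piece instead.
print(re.findall(r'\d|[a-z]', '1a2b'))
```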
Re: regex help: splitting string gets weird groups
On 8 Apr, 19:49, gry georgeryo...@gmail.com wrote: [ python3.1.1, re.__version__='2.2.1' ] I'm trying to use re to split a string into (any number of) pieces of these kinds: 1) contiguous runs of letters 2) contiguous runs of digits 3) single other characters e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234] I tried: re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups() ('1234', 'in', '1234', '=') Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine? I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions. I would avoid .match and use .findall (if you walk through them both together, it'll make sense what's happening with your match string). s = 555tHe-rain.in#=1234 re.findall('[A-Za-z]+|[0-9]+|[-.#=]', s) ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234'] hth, Jon. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help: splitting string gets weird groups
On Apr 8, 1:49 pm, gry georgeryo...@gmail.com wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions.

IMO, for most purposes, for people who don't want to become re experts, the easiest, fastest, best, most predictable way to use re is re.split. You can either call re.split directly, or, if you are going to be splitting on the same pattern over and over, compile the pattern and grab its split method. Use a *single* capture group in the pattern, that covers the *whole* pattern. In the case of your example data:

    >>> import re
    >>> splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
    >>> s = '555tHe-rain.in#=1234'
    >>> [x for x in splitter(s) if x]
    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The reason for the list comprehension is that re.split will always return a non-matching string between matches. Sometimes this is useful even when it is a null string (see recent discussion in the group about splitting digits out of a string), but if you don't care to see null (empty) strings, this comprehension will remove them.

The reason for a single capture group that covers the whole pattern is that it is much easier to reason about the output. The split will give you all your data, in order, e.g.

    >>> ''.join(splitter(s)) == s
    True

HTH,
Pat
Re: regex help: splitting string gets weird groups
gry wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine?

Well, I'm not sure what it thinks it's finding, but nested capture-groups always produce somewhat weird results for me (I suspect that's what's triggering the duplication). Additionally, you're only searching for one match (.match() returns a single match-object or None; not all possible matches within the repeated super-group).

> I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions.

Tweaking your original, I used

    >>> s = '555tHe-rain.in#=1234'
    >>> import re
    >>> r = re.compile(r'([a-zA-Z]+|\d+|.)')
    >>> r.findall(s)
    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The only difference between my results and your results is that the 555 and 1234 come back as strings, not ints.

-tkc
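If the digit runs should come back as ints, as in the requested output, one possible sketch is to name the alternatives and check which one matched. The group names, `TOKEN`, and `tokenize` are all hypothetical; the character classes follow the pattern above. Note that `.` does not match a newline unless `re.DOTALL` is given.

```python
import re

TOKEN = re.compile(r'(?P<letters>[A-Za-z]+)|(?P<digits>[0-9]+)|(?P<other>.)')

def tokenize(s):
    tokens = []
    for m in TOKEN.finditer(s):
        text = m.group()
        # lastgroup is the name of the alternative that matched
        tokens.append(int(text) if m.lastgroup == 'digits' else text)
    return tokens

print(tokenize('555tHe-rain.in#=1234'))
```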
Re: regex help: splitting string gets weird groups
On Apr 8, 3:40 pm, MRAB pyt...@mrabarnett.plus.com wrote: ... Group 1 and group 4 match '='. Group 1 and group 3 match '1234'. If a group matches then any earlier match of that group is discarded, Wow, that makes this much clearer! I wonder if this behaviour shouldn't be mentioned in some form in the python docs? Thanks much! so: Group 1 finishes with '1234'. Group 2 finishes with 'in'. Group 3 finishes with '1234'. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help: splitting string gets weird groups
On 8 Apr, 19:49, gry georgeryo...@gmail.com wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't tHe appear as a group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if someone has a neat other way to do the above task, I'm also interested in suggestions.

Avoiding re's (for a bit of fun): (no good for unicode obviously)

    import string
    from itertools import groupby, chain, repeat, count, izip

    s = "555tHe-rain.in#=1234"

    unique_group = count()
    lookup = dict(
        chain(
            izip(string.ascii_letters, repeat('L')),
            izip(string.digits, repeat('D')),
            izip(string.punctuation, unique_group)
        )
    )
    parse = dict(D=int, L=str.capitalize)

    print [parse.get(key, lambda L: L)(''.join(items))
           for key, items in groupby(s, lambda L: lookup[L])]

    [555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]

Jon.
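The code above uses `izip` and the print statement, so it is Python 2. Since the original question mentions python3.1.1, here is a hedged Python 3 port of the same idea; wrapping it in a function called `split_runs` is an addition made for this sketch. As in the original, any character outside ASCII letters, digits, and punctuation (a space, for instance) raises `KeyError`.

```python
import string
from itertools import groupby, chain, repeat, count

def split_runs(s):
    unique_group = count()
    # Letters share key 'L', digits share key 'D', and every
    # punctuation character gets its own integer key, so two
    # adjacent punctuation characters never merge into one run.
    lookup = dict(chain(
        zip(string.ascii_letters, repeat('L')),
        zip(string.digits, repeat('D')),
        zip(string.punctuation, unique_group),
    ))
    parse = {'D': int, 'L': str.capitalize}
    return [parse.get(key, lambda run: run)(''.join(items))
            for key, items in groupby(s, lambda ch: lookup[ch])]

print(split_runs('555tHe-rain.in#=1234'))
```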
Re: regex help: splitting string gets weird groups
s='555tHe-rain.in#=1234' import re r=re.compile(r'([a-zA-Z]+|\d+|.)') r.findall(s) ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234'] This is nice and simple and has the invertible property that Patrick mentioned above. Thanks much! -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help: splitting string gets weird groups
On Apr 8, 3:40 pm, gry georgeryo...@gmail.com wrote:
> >>> s = '555tHe-rain.in#=1234'
> >>> import re
> >>> r = re.compile(r'([a-zA-Z]+|\d+|.)')
> >>> r.findall(s)
> ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
>
> This is nice and simple and has the invertible property that Patrick mentioned above. Thanks much!

Yes, like using split(), this is invertible. But you will see a difference (and for a given task, you might prefer one way or the other) if, for example, you put a few consecutive spaces in the middle of your string, where this pattern and findall() will return each space individually, and split() will return them all together.

You *can* fix up the pattern for findall() where it will have the same properties as the split(), but it will almost always be a more complicated pattern than for the equivalent split().

Another thing you can do with split(): if you *think* you have a pattern that fully covers every string you expect to throw at it, but would like to verify this, you can make use of the fact that split() returns a string between each match (and before the first match and after the last match). So if you expect that every character in your entire string should be a part of a match, you can do something like:

    strings = splitter(s)
    tokens = strings[1::2]
    assert not ''.join(strings[::2])

Regards,
Pat
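The verification trick can be wrapped up as a small function. This is a sketch only: `splitter` uses the pattern from the earlier re.split message, and `tokenize_strict` is a hypothetical name. It raises instead of asserting, so the check still runs under `python -O`.

```python
import re

splitter = re.compile(r'([A-Za-z]+|[0-9]+|[-.#=])').split

def tokenize_strict(s):
    parts = splitter(s)
    gaps = parts[::2]      # text between matches (even positions)
    tokens = parts[1::2]   # the matches themselves (odd positions)
    if any(gaps):
        raise ValueError('unmatched text: %r' % [g for g in gaps if g])
    return tokens

print(tokenize_strict('555tHe-rain.in#=1234'))
```

A string containing a character the pattern does not cover, such as a space, ends up in one of the gaps and triggers the error.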
Re: Regex help needed!
In article 19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com, Oltmans rolf.oltm...@gmail.com wrote: I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.' --Jamie Zawinski Take the advice other people gave you and use BeautifulSoup. -- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/ If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. --Red Adair -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
# http://gist.github.com/271661 import lxml.html import re src = lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534 regex = re.compile('amazon_(\d+)') doc = lxml.html.document_fromstring(src) for div in doc.xpath('//div[starts-with(@id, amazon_)]'): match = regex.match(div.get('id')) if match: print match.groups()[0] On Thu, Jan 7, 2010 at 4:42 PM, Aahz a...@pythoncraft.com wrote: In article 19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com, Oltmans rolf.oltm...@gmail.com wrote: I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.' --Jamie Zawinski Take the advice other people gave you and use BeautifulSoup. -- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/ If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. --Red Adair -- http://mail.python.org/mailman/listinfo/python-list -- Rolando Espinoza La fuente www.rolandoespinoza.info -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On 21.12.2009 12:38, Oltmans wrote:
> Hello, everyone. I've a string that looks something like
>
> lksjdfls<div id ='amazon_345343'> kdjff lsdfs</div> sdjfls<div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
>
> From above string I need the digits within the ID attribute. For example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance.

If you filter in two or even more sequential steps the problem becomes a lot simpler, not least because you can test each step separately:

    >>> r1 = re.compile ('<div id\D*\d+[^>]*')  # Add ignore case and variable white space
    >>> r2 = re.compile ('\d+')
    >>> [r2.search (item).group () for item in r1.findall (s) if item]  # s is your sample
    ['345343', '35343433', '8898']  # Supposing all ids have digits

Frederic
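A rough Python 3 sketch of the two-step filtering described above. The angle brackets and double quotes in `sample` are assumed, since the archive appears to have stripped them from the posted string, and the names `DIV_OPEN`, `AMAZON_ID`, and `amazon_ids` are made up for illustration. The patterns are not the exact ones from the message: they add the case-insensitivity and variable white space that the comment there asks for.

```python
import re

# Sample with the markup reconstructed (an assumption, see above).
sample = ("lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls "
          "<div id = \"amazon_35343433\">sdfsd</div>"
          "<div id='amazon_8898'>welcome</div> "
          "hello, my age is 86 years old and I was born in 1945.")

# Step 1: find opening div tags that carry an id attribute.
DIV_OPEN = re.compile(r'<div\s+id\s*=[^>]*>', re.IGNORECASE)
# Step 2: pull the digits out of each tag found in step 1.
AMAZON_ID = re.compile(r'amazon_(\d+)')

def amazon_ids(text):
    ids = []
    for tag in DIV_OPEN.findall(text):
        m = AMAZON_ID.search(tag)
        if m:
            ids.append(m.group(1))
    return ids

print(amazon_ids(sample))
```

Because step 2 only ever sees the tags found in step 1, the 86 and 1945 in the running text are never considered.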
Re: Regex help needed!
how about re.findall(r'\w+.=\W\D+(\d+)?',str) ? this will work for any string within id ! ~Ukanth On Dec 21, 6:06 pm, Oltmans rolf.oltm...@gmail.com wrote: On Dec 21, 5:05 pm, Umakanth cum...@gmail.com wrote: How about re.findall(r'\d+(?:\.\d+)?',str) extracts only numbers from any string Thank you. However, I only need the digits within the ID attribute of the DIV. Regex that you suggested fails on the following string lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534 ~uk On Dec 21, 4:38 pm, Oltmans rolf.oltm...@gmail.com wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On Dec 21, 5:38 am, Oltmans rolf.oltm...@gmail.com wrote:
> Hello, everyone. I've a string that looks something like
>
> lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
>
> From above string I need the digits within the ID attribute. For example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)

The issue with using regexen for parsing HTML is that you often get surprised by attributes that you never expected, or out of order, or with weird or missing quotation marks, or tags or attributes that are in upper/lower case. BeautifulSoup is one tool to use for HTML scraping; here is a pyparsing example, with hopefully descriptive comments:

    from pyparsing import makeHTMLTags, ParseException

    src = """
    lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
    hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534
    """

    # use makeHTMLTags to return an expression that will match
    # HTML <div> tags, including attributes, upper/lower case,
    # etc. (makeHTMLTags will return expressions for both
    # opening and closing tags, but we only care about the
    # opening one, so just use the [0]th returned item)
    div = makeHTMLTags("div")[0]

    # define a parse action to filter only for <div> tags
    # with the proper id form
    def filterByIdStartingWithAmazon(tokens):
        if not tokens.id.startswith("amazon_"):
            raise ParseException(
                "must have id attribute starting with 'amazon_'")

    # define a parse action that will add a pseudo-
    # attribute 'amazon_id', to make it easier to get the
    # numeric portion of the id after the leading 'amazon_'
    def makeAmazonIdAttribute(tokens):
        tokens["amazon_id"] = tokens.id[len("amazon_"):]

    # attach parse action callbacks to the div expression -
    # these will be called during parse time
    div.setParseAction(filterByIdStartingWithAmazon,
                       makeAmazonIdAttribute)

    # search through the input string for matching <div>s,
    # and print out their amazon_id's
    for divtag in div.searchString(src):
        print(divtag.amazon_id)

Prints:

    345343
    35343433
    8898
Regex help needed!
Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
How about re.findall(r'\d+(?:\.\d+)?',str) extracts only numbers from any string ~uk On Dec 21, 4:38 pm, Oltmans rolf.oltm...@gmail.com wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On Dec 21, 7:38 pm, Oltmans rolf.oltm...@gmail.com wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. don't need regular expression. just do a split on amazon s=lksjdfls div id =\'amazon_345343\' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id=\'amazon_8898\'welcome/div for item in s.split(amazon_)[1:]: ... print item ... 345343' kdjff lsdfs /div sdjfls div id = 35343433sdfsd/divdiv id=' 8898'welcome/div then find ' or indices and do index slicing. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
Oltmans wrote: I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. from BeautifulSoup import BeautifulSoup bs = BeautifulSoup(lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id ... = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div) [node[id][7:] for node in bs(id=lambda id: id.startswith(amazon_))] [u'345343', u'35343433', u'8898'] I think BeautifulSoup is a better tool for the task since it actually understands HTML. Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On Dec 21, 5:05 pm, Umakanth cum...@gmail.com wrote: How about re.findall(r'\d+(?:\.\d+)?',str) extracts only numbers from any string Thank you. However, I only need the digits within the ID attribute of the DIV. Regex that you suggested fails on the following string lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534 ~uk On Dec 21, 4:38 pm, Oltmans rolf.oltm...@gmail.com wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
Ok. how about re.findall(r'\w+_(\d+)',str) ? returns ['345343', '35343433', '8898', '8898'] ! On Dec 21, 6:06 pm, Oltmans rolf.oltm...@gmail.com wrote: On Dec 21, 5:05 pm, Umakanth cum...@gmail.com wrote: How about re.findall(r'\d+(?:\.\d+)?',str) extracts only numbers from any string Thank you. However, I only need the digits within the ID attribute of the DIV. Regex that you suggested fails on the following string lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534 ~uk On Dec 21, 4:38 pm, Oltmans rolf.oltm...@gmail.com wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
Oltmans wrote: Hello,. everyone. I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 I've written this regex that's kind of working re.findall(\w+\s*\W+amazon_(\d+),str) but I was just wondering that there might be a better RegEx to do that same thing. Can you kindly suggest a better/improved Regex. Thank you in advance. Try: re.findall(rdiv\s*id\s*=\s*[']amazon_(\d+)['], str) You shouldn't be using 'str' as a variable name because it hides the builtin string class 'str'. -- http://mail.python.org/mailman/listinfo/python-list
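A runnable Python 3 sketch of the kind of pattern suggested above. It assumes the sample string originally contained real div tags and quotes (the archive appears to have stripped them), and the names `html` and `pattern` are illustrative; `html` is used instead of `str` so the builtin is not shadowed.

```python
import re

# Sample input with the markup restored (an assumption about the original post).
html = (
    "lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls "
    "<div id = \"amazon_35343433\">sdfsd</div>"
    "<div id='amazon_8898'>welcome</div> "
    "hello, my age is 86 years old and I was born in 1945."
)

# Only look at ids on div tags; accept single, double, or no quotes.
pattern = re.compile(r"<div\s+id\s*=\s*['\"]?amazon_(\d+)", re.IGNORECASE)

ids = pattern.findall(html)
print(ids)
```

Anchoring on the tag name keeps stray numbers elsewhere in the text (the age and the year) out of the result, but it still breaks if another attribute comes before id, which is the usual argument for a real parser.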
Re: Regex help needed!
Oltmans wrote: I've a string that looks something like lksjdfls div id ='amazon_345343' kdjff lsdfs /div sdjfls div id = amazon_35343433sdfsd/divdiv id='amazon_8898'welcome/div From above string I need the digits within the ID attribute. For example, required output from above string is - 35343433 - 345343 - 8898 Your string is in /tmp/y in this example: $ grep -o [0-9]+ /tmp/y 345343 35343433 8898 Much simpler, isn't it? But that is not python. Regards Johann -- Johann Spies Telefoon: 021-808 4599 Informasietegnologie, Universiteit van Stellenbosch And there were in the same country shepherds abiding in the field, keeping watch over their flock by night. And, lo, the angel of the Lord came upon them, and the glory of the Lord shone round about them: and they were sore afraid. And the angel said unto them, Fear not: for behold I bring you good tidings of great joy, which shall be to all people. For unto you is born this day in the city of David a Saviour, which is Christ the Lord.Luke 2:8-11 -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On Wed, Dec 16, 2009 at 10:46 PM, Gabriel Rossetti gabriel.rosse...@arimaz.com wrote: Hello everyone, I'm going nuts with some regex, could someone please show me what I'm doing wrong? I have an XMPP msg : message xmlns='jabber:client' to='n...@host.com' mynode xmlns='myprotocol:core' version='1.0' type='mytype' parameters param1123/param1 param2456/param2 /parameters payload type='plain'.../payload /mynode x xmlns='jabber:x:expire' seconds='15'/ /message the parameter node may be absent or empty (parameter/), the x node may be absent. I'd like to grab everything exept the payload nod and create something new using regex, with the XMPP message example above I'd get this : message xmlns='jabber:client' to='n...@host.com' mynode xmlns='myprotocol:core' version='1.0' type='mytype' parameters param1123/param1 param2456/param2 /parameters /mynode x xmlns='jabber:x:expire' seconds='15'/ /message for some reason my regex doesn't work correctly : r(message .*?).*?(mynode .*?).*?(?:(parameters.*?/parameters)|parameters/)?.*?(x .*/)? If all you need is to remove payload node ,this could be useful, s1=message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'parametersparam1123/param1param2456/param2/parameterspayload type='plain'.../payload/mynodex xmlns='jabber:x:expire' seconds='15'//message pat=re.compile(rpayload.*\/payload) s1=pat.sub(,s1) -- Regards, S.Selvam -- http://mail.python.org/mailman/listinfo/python-list
regex help
Hello everyone, I'm going nuts with some regex, could someone please show me what I'm doing wrong? I have an XMPP msg : message xmlns='jabber:client' to='n...@host.com' mynode xmlns='myprotocol:core' version='1.0' type='mytype' parameters param1123/param1 param2456/param2 /parameters payload type='plain'.../payload /mynode x xmlns='jabber:x:expire' seconds='15'/ /message the parameter node may be absent or empty (parameter/), the x node may be absent. I'd like to grab everything except the payload node and create something new using regex, with the XMPP message example above I'd get this : message xmlns='jabber:client' to='n...@host.com' mynode xmlns='myprotocol:core' version='1.0' type='mytype' parameters param1123/param1 param2456/param2 /parameters /mynode x xmlns='jabber:x:expire' seconds='15'/ /message for some reason my regex doesn't work correctly : r(message .*?).*?(mynode .*?).*?(?:(parameters.*?/parameters)|parameters/)?.*?(x .*/)? I group the opening message node, the opening mynode node and if the parameters node is present and not empty I group it and if the x node is present I group it. 
For some reason this doesn't work correctly : import re s1 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'parametersparam1123/param1param2456/param2/parameterspayload type='plain'.../payload/mynodex xmlns='jabber:x:expire' seconds='15'//message s2 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'parameters/payload type='plain'.../payload/mynodex xmlns='jabber:x:expire' seconds='15'//message s3 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'payload type='plain'.../payload/mynodex xmlns='jabber:x:expire' seconds='15'//message s4 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'parametersparam1123/param1param2456/param2/parameterspayload type='plain'.../payload/mynode/message s5 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'parameters/payload type='plain'.../payload/mynode/message s6 = message xmlns='jabber:client' to='n...@host.com'mynode xmlns='myprotocol:core' version='1.0' type='mytype'payload type='plain'.../payload/mynode/message exp = r(message .*?).*?(mynode .*?).*?(?:(parameters.*?/parameters)|parameters/)?.*?(x .*/)? 
re.match(exp, s1).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', 'parametersparam1123/param1param2456/param2/parameters', None) re.match(exp, s2).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', None, None) re.match(exp, s3).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', None, None) re.match(exp, s4).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', 'parametersparam1123/param1param2456/param2/parameters', None) re.match(exp, s5).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', None, None) re.match(exp, s6).groups() (message xmlns='jabber:client' to='n...@host.com', mynode xmlns='myprotocol:core' version='1.0' type='mytype', None, None) Does someone know what is wrong with my expression? Thank you, Gabriel -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
Gabriel Rossetti wrote: Hello everyone, I'm going nuts with some regex, could someone please show me what I'm doing wrong? I have an XMPP msg : <snip> Does someone know what is wrong with my expression? Thank you, Gabriel

Gabriel, trying to debug a long regex in situ can be a nightmare; however, the following technique always works for me... Use the interactive interpreter and see if half the regex works. If it does, your problem is in the second half; if not, it's in the first, so try the first half of that, and so on and so forth. You'll find the point at which it goes wrong in a snip. Non-trivial regexes are always best built up and tested a bit at a time; the interactive interpreter is great for this. Roger. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On Dec 16, 10:22 am, r0g aioe@technicalbloke.com wrote: Gabriel Rossetti wrote: Hello everyone, I'm going nuts with some regex, could someone please show me what I'm doing wrong? I have an XMPP msg : snip Does someone know what is wrong with my expression? Thank you, Gabriel Gabriel, trying to debug a long regex in situ can be a nightmare however the following technique always works for me... Use the interactive interpreter and see if half the regex works, if it does your problem is in the second half, if not it's in the first so try the first half of that and so on an so forth. You'll find the point at which it goes wrong in a snip. Non-trivial regexes are always best built up and tested a bit at a time, the interactive interpreter is great for this. Roger. I'll just add that the now you have two problems quip applies here, especially when there are very good XML parsing libraries for Python that will keep you from having to reinvent the wheel for every little change. See sections 20.5 through 20.13 of the Python Documentation for several built-in options, and I'm sure there are many community projects that may fit the bill if none of those happen to. Personally, I consider regular expressions of any substantial length and complexity to be bad practice as it inhibits readability and maintainability. They are also decidedly non-Zen on at least Readability counts and Sparse is better than dense. Intchanter Daniel Fackrell P.S. I'm not sure how any of these libraries are implemented yet, but I'd hope they're using a finite state machine tailored to the parsing task rather than using regexes, but even if they do the latter, having that abstracted out in a mature library with a clean interface is still a huge win. -- http://mail.python.org/mailman/listinfo/python-list
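As a rough Python 3 illustration of the parser route mentioned above, here is a sketch that drops the payload node with the standard library's xml.etree.ElementTree. The message text is reconstructed from the original post with its angle brackets restored, and the address `user@host.example` is a placeholder of mine, not the poster's.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample message (tags restored; the address is hypothetical).
msg = (
    "<message xmlns='jabber:client' to='user@host.example'>"
    "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>"
    "<parameters><param1>123</param1><param2>456</param2></parameters>"
    "<payload type='plain'>...</payload>"
    "</mynode>"
    "<x xmlns='jabber:x:expire' seconds='15'/>"
    "</message>"
)

root = ET.fromstring(msg)
ns = {"core": "myprotocol:core"}  # prefix chosen here for the lookups only

# Remove every payload child of mynode; absent nodes are simply not found.
for mynode in root.findall("core:mynode", ns):
    for payload in mynode.findall("core:payload", ns):
        mynode.remove(payload)

remaining = [child.tag for child in root.iter()]
print(remaining)
```

Because the lookups return empty lists when a node is missing, the same code handles messages without parameters or without the x node, which is where the optional groups in the regex were causing trouble.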
Re: regex help
David wrote: tdnbsp;/td td width=1% class=keyOpen: /td td width=1% class=val5.50 /td tdnbsp;/td td width=1% class=keyMkt Cap: /td td width=1% class=val6.92M /td tdnbsp;/td td width=1% class=keyP/E: /td td width=1% class=val21.99 /td I want to extract the open, mkt cap and P/E values - but apart from doing loads of indivdual REs which I think would look messy, I can't think of a better and neater looking way. Any ideas? from BeautifulSoup import BeautifulSoup bs = BeautifulSoup(tdnbsp;/td ... ... td width=1% class=keyOpen: ... /td ... td width=1% class=val5.50 ... /td ... tdnbsp;/td ... td width=1% class=keyMkt Cap: ... /td ... td width=1% class=val6.92M ... /td ... tdnbsp;/td ... td width=1% class=keyP/E: ... /td ... td width=1% class=val21.99 ... /td ... ) for key in bs.findAll(attrs={class: key}): ... value = key.findNext(attrs={class: val}) ... print key.string.strip(), --, value.string.strip() ... Open: -- 5.50 Mkt Cap: -- 6.92M P/E: -- 21.99 -- http://mail.python.org/mailman/listinfo/python-list
regex help
Hi I have a few regexs I need to do, but im struggling to come up with a nice way of doing them, and more than anything am here to learn some tricks and some neat code rather than getting an answer - although thats obviously what i would like to get to. Problem 1 - span class=chg id=ref_678774_cp(25.47%)/spanbr I want to extract 25.47 from here - so far I've tried - xPer = re.search('span class=chg id=ref_'+str(xID.group(1))+'_cp \(.*?)%', content) and xPer = re.search('span class=\chg\ id=\ref_+str(xID.group(1))+_cp \\((\d*)%\)/spanbr', content) neither of these seem to do what I want - am I not doing this correctly? (obviously!) Problem 2 - tdnbsp;/td td width=1% class=keyOpen: /td td width=1% class=val5.50 /td tdnbsp;/td td width=1% class=keyMkt Cap: /td td width=1% class=val6.92M /td tdnbsp;/td td width=1% class=keyP/E: /td td width=1% class=val21.99 /td I want to extract the open, mkt cap and P/E values - but apart from doing loads of indivdual REs which I think would look messy, I can't think of a better and neater looking way. Any ideas? Cheers David -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On Wed, Jul 8, 2009 at 3:06 PM, Daviddavid.bra...@googlemail.com wrote: Hi I have a few regexs I need to do, but im struggling to come up with a nice way of doing them, and more than anything am here to learn some tricks and some neat code rather than getting an answer - although thats obviously what i would like to get to. Problem 1 - span class=chg id=ref_678774_cp(25.47%)/spanbr I want to extract 25.47 from here - so far I've tried - xPer = re.search('span class=chg id=ref_'+str(xID.group(1))+'_cp \(.*?)%', content) and xPer = re.search('span class=\chg\ id=\ref_+str(xID.group(1))+_cp \\((\d*)%\)/spanbr', content) neither of these seem to do what I want - am I not doing this correctly? (obviously!) Problem 2 - tdnbsp;/td td width=1% class=keyOpen: /td td width=1% class=val5.50 /td tdnbsp;/td td width=1% class=keyMkt Cap: /td td width=1% class=val6.92M /td tdnbsp;/td td width=1% class=keyP/E: /td td width=1% class=val21.99 /td I want to extract the open, mkt cap and P/E values - but apart from doing loads of indivdual REs which I think would look messy, I can't think of a better and neater looking way. Any ideas? Use an actual HTML parser? Like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), for instance. I will never understand why so many people try to parse/scrape HTML/XML with regexes... Cheers, Chris -- http://blog.rebertia.com -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On 2009-07-08, Chris Rebert c...@rebertia.com wrote: On Wed, Jul 8, 2009 at 3:06 PM, Daviddavid.bra...@googlemail.com wrote: I want to extract the open, mkt cap and P/E values - but apart from doing loads of indivdual REs which I think would look messy, I can't think of a better and neater looking way. Any ideas? You are downloading market data? Yahoo offers its stats in CSV format that is easier to parse without a dedicated parser. Use an actual HTML parser? Like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), for instance. I agree with your sentiment exactly. If the regex he is trying to get is difficult enough that he has to ask; then, yes, he should be using a parser. I will never understand why so many people try to parse/scrape HTML/XML with regexes... Why? Because some times it is good enough to get the job done easily. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On Wed, 08 Jul 2009 23:06:22 +0100, David david.bra...@googlemail.com wrote: Hi I have a few regexs I need to do, but im struggling to come up with a nice way of doing them, and more than anything am here to learn some tricks and some neat code rather than getting an answer - although thats obviously what i would like to get to. Problem 1 - span class=chg id=ref_678774_cp(25.47%)/spanbr I want to extract 25.47 from here - so far I've tried - xPer = re.search('span class=chg id=ref_'+str(xID.group(1))+'_cp \(.*?)%', content) Supposing that str(xID.group(1)) == 678774, let's see how that string concatenation turns out: span class=chg id=ref_678774_cp(.*?)% The obvious problems here are the spurious double quotes, the spurious (but harmless) escaping of a double quote, and the lack of (escaped) backslash and (escaped) open parenthesis. The latter you can always strip off later, but the first sink the match rather thoroughly. and xPer = re.search('span class=\chg\ id=\ref_+str(xID.group(1))+_cp \\((\d*)%\)/spanbr', content) With only two single quotes present, the biggest problem should be obvious. Unfortunately if you just fix the obvious in either of the two regular expressions, you're setting yourself up for a fall later on. As The Fine Manual says right at the top of the page on the re module (http://docs.python.org/library/re.html), you want to be using raw string literals when you're dealing with regular expressions, because you want the backslashes getting through without being interpreted specially by Python's own parser. As it happens you get away with it in this case, since neither '\d' nor '\(' have a special meaning to Python, so aren't changed, and '\' is interpreted as '', which happens to be the right thing anyway. 
Problem 2 - tdnbsp;/td td width=1% class=keyOpen: /td td width=1% class=val5.50 /td tdnbsp;/td td width=1% class=keyMkt Cap: /td td width=1% class=val6.92M /td tdnbsp;/td td width=1% class=keyP/E: /td td width=1% class=val21.99 /td I want to extract the open, mkt cap and P/E values - but apart from doing loads of indivdual REs which I think would look messy, I can't think of a better and neater looking way. Any ideas? What you're trying to do is inherently messy. You might want to use something like BeautifulSoup to hide the mess, but never having had cause to use it myself I couldn't say for sure. -- Rhodri James *-* Wildebeest Herder to the Masses -- http://mail.python.org/mailman/listinfo/python-list
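For Problem 1, a small Python 3 sketch of the raw-string advice. It assumes the page fragment originally looked like the `content` string below (the archive stripped the tags and quotes), and `x_id` stands in for `xID.group(1)` from the original post.

```python
import re

# Assumed shape of the original fragment, with markup restored.
content = '<span class="chg" id="ref_678774_cp">(25.47%)</span><br>'
x_id = "678774"  # stands in for xID.group(1)

# Raw strings keep the backslashes intact; \( and \) match literal parentheses.
pattern = r'<span class="chg" id="ref_' + re.escape(x_id) + r'_cp">\((.*?)%\)'

match = re.search(pattern, content)
percent = match.group(1) if match else None
print(percent)
```

Using single-quoted raw strings means the double quotes in the HTML need no escaping at all, which removes most of the quoting problems described above.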
RE: Regex Help
In message [EMAIL PROTECTED], Support Desk wrote: Thanks for the reply ... A: The vulture doesn't get Frequent Poster miles. Q: What's the difference between a top-poster and a vulture? -- http://mail.python.org/mailman/listinfo/python-list
RE: Regex Help
Thanks for the reply, I found out the problem was occurring later on in the script. The regexp works well. -Original Message- From: Lawrence D'Oliveiro [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 23, 2008 6:51 PM To: python-list@python.org Subject: Re: Regex Help In message [EMAIL PROTECTED], Support Desk wrote: Anybody know of a good regex to parse html links from html code? The one I am currently using seems to be cutting off the last letter of some links, and returning links like http://somesite.co or http://somesite.ph the code I am using is regex = r'a href=[|\']([^|\']+)[|\']' Can you post some example HTML sequences that this regexp is not handling correctly? -- http://mail.python.org/mailman/listinfo/python-list
More regex help
I am working on a Python web crawler that will extract all links from an HTML page and add them to a queue. The problem I am having is building absolute links from relative links, as there are so many different types of relative links. If I just append the relative links to the current URL, some websites will send it into a never-ending loop. What I am looking for is a regexp that will extract the root URL from any URL string I pass to it, such as:

'http://example.com/stuff/stuff/morestuff/index.html' Regexp = 'http://example.com'
'http://anotherexample.com/stuff/index.php' Regexp = 'http://anotherexample.com/'
'http://example.com/stuff/stuff/' Regexp = 'http://example.com'

-- http://mail.python.org/mailman/listinfo/python-list
Re: More regex help
At 2008-09-24T16:25:02Z, Support Desk [EMAIL PROTECTED] writes: I am working on a python webcrawler, that will extract all links from an html page, and add them to a queue, The problem I am having is building absolute links from relative links, as there are so many different types of relative links. If I just append the relative links to the current url, some websites will send it into a never-ending loop. import urllib urllib.basejoin('http://www.example.com/path/to/deep/page', '/foo') 'http://www.example.com/foo' urllib.basejoin('http://www.example.com/path/to/deep/page', 'http://slashdot.org/foo') 'http://slashdot.org/foo' -- Kirk Strauser The Day Companies -- http://mail.python.org/mailman/listinfo/python-list
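In Python 3 the same joining is done by `urllib.parse.urljoin`. The sketch below repeats the two calls above and adds a small helper, `root_url`, which is a name of my own for the "root URL" the original question asked about.

```python
from urllib.parse import urljoin, urlsplit

base = "http://www.example.com/path/to/deep/page"

rooted = urljoin(base, "/foo")                       # root-relative link
sibling = urljoin(base, "bar.html")                  # relative to the current directory
absolute = urljoin(base, "http://slashdot.org/foo")  # already absolute, returned as-is


def root_url(url):
    """Return scheme and host only (hypothetical helper, not part of urllib)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


print(rooted, sibling, absolute)
print(root_url("http://example.com/stuff/stuff/morestuff/index.html"))
```

Joining against the page URL rather than appending strings is what avoids the ever-growing paths that can send a crawler into a loop.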
RE: More regex help
Kirk, That's exactly what I needed. Thx! -Original Message- From: Kirk Strauser [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 24, 2008 11:42 AM To: python-list@python.org Subject: Re: More regex help At 2008-09-24T16:25:02Z, Support Desk [EMAIL PROTECTED] writes: I am working on a python webcrawler, that will extract all links from an html page, and add them to a queue, The problem I am having is building absolute links from relative links, as there are so many different types of relative links. If I just append the relative links to the current url, some websites will send it into a never-ending loop. import urllib urllib.basejoin('http://www.example.com/path/to/deep/page', '/foo') 'http://www.example.com/foo' urllib.basejoin('http://www.example.com/path/to/deep/page', 'http://slashdot.org/foo') 'http://slashdot.org/foo' -- Kirk Strauser The Day Companies -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Help
Hello, Anybody know of a good regex to parse html links from html code? BeautifulSoup is *the* library to handle HTML:

    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen

    soup = BeautifulSoup(urlopen("http://python.org/"))
    for a in soup("a"):
        print a["href"]

HTH, -- Miki [EMAIL PROTECTED] http://pythonwise.blogspot.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Help
In message [EMAIL PROTECTED], Support Desk wrote: Anybody know of a good regex to parse html links from html code? The one I am currently using seems to be cutting off the last letter of some links, and returning links like http://somesite.co or http://somesite.ph the code I am using is regex = r'a href=[|\']([^|\']+)[|\']' Can you post some example HTML sequences that this regexp is not handling correctly? -- http://mail.python.org/mailman/listinfo/python-list
Regex Help
Anybody know of a good regex to parse html links from html code? The one I am currently using seems to be cutting off the last letter of some links, and returning links like http://somesite.co or http://somesite.ph the code I am using is regex = r'a href=[|\']([^|\']+)[|\']' page_text = urllib.urlopen('http://somesite.com') page_text = page_text.read() links = re.findall(regex, text, re.IGNORECASE) -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex Help
Support Desk wrote: the code I am using is regex = r'a href=[|\']([^|\']+)[|\']' that's way too fragile to work with real-life HTML (what if the link has a TITLE attribute, for example? or contains whitespace after the HREF?) you might want to consider using a real HTML parser for this task. page_text = urllib.urlopen('http://somesite.com') page_text = page_text.read() links = re.findall(regex, text, re.IGNORECASE) the RE looks fine for the subset of all valid A elements that it can handle, though. got any examples of pages where you see that behaviour? /F -- http://mail.python.org/mailman/listinfo/python-list
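If installing a third-party parser is not an option, the standard library can do the job too. This is a minimal Python 3 sketch using `html.parser.HTMLParser`; the class name `LinkCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up sample: upper-case tag, extra TITLE attribute, mixed quote styles.
sample = (
    '<p><A TITLE="home" HREF="http://somesite.com">home</A> '
    "<a href='http://somesite.ph/page'>page</a></p>"
)

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Extra attributes, attribute order, and letter case no longer matter here, which are the cases the hand-written pattern misses.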
regex help
Hello, I am working on a web app that queries long-distance numbers from a database of call logs. I am trying to put together a regex that matches any number that does not start with the following. Basically, any number that doesn't start with: 281 713 832 or 1281 1713 1832 is long distance. Any help would be appreciated. -- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
On Monday 30 June 2008 16:53:54, Support Desk wrote: Hello, I am working on a web app that queries long-distance numbers from a database of call logs. I am trying to put together a regex that matches any number that does not start with the following. Basically, any number that doesn't start with: 281 713 832 or 1281 1713 1832 is long distance. Any help would be appreciated.

It sounds like str.startswith() is enough for your needs: if not number.startswith(('281', '713', '832', ...)) : ... -- Cédric Lucantis -- http://mail.python.org/mailman/listinfo/python-list
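Spelled out as a Python 3 sketch, the startswith() suggestion could look like this. The names `LOCAL_PREFIXES` and `is_long_distance` are mine, and the sample numbers are the prefixes used elsewhere in this thread rather than real call-log entries.

```python
# Prefixes that count as local, with and without the leading 1.
LOCAL_PREFIXES = ("281", "713", "832", "1281", "1713", "1832")


def is_long_distance(number):
    """True when the number does not begin with any local prefix."""
    return not str(number).startswith(LOCAL_PREFIXES)


numbers = [281, 713, 832, 1281, 1713, 1832, 2281, 2713, 2832]
long_distance = [n for n in numbers if is_long_distance(n)]
print(long_distance)
```

str.startswith() accepts a tuple of prefixes, so no regex or loop over prefixes is needed.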
RE: regex help
    import re

    if __name__ == "__main__":
        lst = [281, 713, 832, 1281, 1713, 1832, 2281, 2713, 2832]
        for item in lst:
            # An optional leading 1 followed by one of the local prefixes.
            if re.match(r"^1?(?=281)|^1?(?=713)|^1?(?=832)", str(item)):
                print "%d invalid" % item
            else:
                print "%d valid" % item

Output:

    281 invalid
    713 invalid
    832 invalid
    1281 invalid
    1713 invalid
    1832 invalid
    2281 valid
    2713 valid
    2832 valid

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Support Desk Sent: Monday, June 30, 2008 10:54 PM To: python-list@python.org Subject: regex help Hello, I am working on a web-app, that querys long distance numbers from a database of call logs. I am trying to put together a regex that matches any number that does not start with the following. Basically any number that does'nt start with: 281 713 832 or 1281 1713 1832 is long distance any, help would be appreciated. -- http://mail.python.org/mailman/listinfo/python-list
regex help
I am trying to put together a regular expression that will rename users address books on our server due to a recent change we made. Users with address books user.abook need to be changed to [EMAIL PROTECTED] I'm having trouble with the regex. Any help would be appreciated. -Mike -- http://mail.python.org/mailman/listinfo/python-list
RE: regex help
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Support Desk Sent: Tuesday, June 03, 2008 9:32 AM To: python-list@python.org Subject: regex help I am trying to put together a regular expression that will rename users address books on our server due to a recent change we made. Users with address books user.abook need to be changed to [EMAIL PROTECTED] I'm having trouble with the regex. Any help would be appreciated. import re emails = ('foo.abook', 'abook.foo', 'bob.abook.com', 'john.doe.abook') for email in emails: print email, '--', print re.sub(r'\.abook$', '@domain.com.abook', email) * The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential, proprietary, and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from all computers. GA623 -- http://mail.python.org/mailman/listinfo/python-list
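The substitution above can be wrapped in a small function, sketched here in Python 3. The function name `new_abook_name` and the default `domain.com` are placeholders (the real domain is redacted in the archive), and skipping names that already contain an '@' is an assumption about what "already renamed" means.

```python
import re


def new_abook_name(filename, domain="domain.com"):
    """Return the renamed address-book filename (hypothetical helper)."""
    if "@" in filename:
        # Assumed to be in the new format already; leave it alone.
        return filename
    # Only a trailing ".abook" is rewritten, thanks to the $ anchor.
    return re.sub(r"\.abook$", "@" + domain + ".abook", filename)


names = ["foo.abook", "abook.foo", "bob.abook.com", "john.doe.abook"]
renamed = [new_abook_name(name) for name in names]
print(renamed)
```

Printing the old and new names side by side before calling os.rename is a cheap way to confirm nothing unexpected will be touched.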
RE: regex help
That's it exactly, thx.

-Original Message-
From: Reedick, Andrew [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 03, 2008 9:26 AM
To: Support Desk
Subject: RE: regex help

The regex will now skip anything with an '@' in the filename on the assumption it's already in the correct format. Uncomment the os.rename line once you're satisfied you won't mangle anything.

    import glob
    import os
    import re

    for filename in glob.glob('*.abook'):
        newname = filename
        newname = re.sub(r'[EMAIL PROTECTED]', '@domain.com.abook', filename)
        if filename != newname:
            print "rename", filename, "to", newname
            #os.rename(filename, newname)

-Original Message-
From: Support Desk [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 03, 2008 10:07 AM
To: Reedick, Andrew
Subject: RE: regex help

Thx for the reply. I would first have to list all files matching user.abook, then rename them to [EMAIL PROTECTED], something like the code below. I'm still new to python and haven't had much experience with the re module.

    import os
    import re

    emails = os.popen('ls').readlines()
    for email in emails:
        print email, '--',
        print re.findall(r'\.abook$', email)

-Original Message-
From: Reedick, Andrew [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 03, 2008 8:52 AM
To: Support Desk; python-list@python.org
Subject: RE: regex help

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Support Desk
Sent: Tuesday, June 03, 2008 9:32 AM
To: python-list@python.org
Subject: regex help

I am trying to put together a regular expression that will rename users' address books on our server due to a recent change we made. Users with address books user.abook need to be changed to [EMAIL PROTECTED] I'm having trouble with the regex. Any help would be appreciated.

    import re

    emails = ('foo.abook', 'abook.foo', 'bob.abook.com', 'john.doe.abook')
    for email in emails:
        print email, '--',
        print re.sub(r'\.abook$', '@domain.com.abook', email)

-- http://mail.python.org/mailman/listinfo/python-list
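For anyone reading this thread later: the pattern in the glob-based reply was scrubbed by the archive, so the following is only a sketch of the idea described there (skip names that already contain an '@', append the domain to the rest). The pattern, the ABOOK name and the new_name helper are assumptions for illustration, not the code that was actually posted, and the sketch is written for Python 3.

```python
import re

# Assumed pattern: a name with no '@' in it that ends in '.abook'.
ABOOK = re.compile(r'^([^@]+)\.abook$')


def new_name(filename, domain='domain.com'):
    """Return the renamed address book, or the name unchanged if it does not qualify."""
    return ABOOK.sub(r'\1@' + domain + '.abook', filename)


for name in ('foo.abook', 'abook.foo', 'bob.abook.com',
             'john.doe.abook', 'foo@domain.com.abook'):
    print(name, '--', new_name(name))
```

Feeding each name from glob.glob('*.abook') through new_name and calling os.rename only when the result differs would mirror the dry-run approach suggested in the thread.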
Re: pexpect regex help
On Feb 21, 11:15 pm, [EMAIL PROTECTED] wrote:
On Feb 21, 6:13 pm, [EMAIL PROTECTED] wrote:

I have a pexpect script to walk through a cisco terminal server and I was hoping to get some help with this regex because I really suck at it.

[snip]

    m = re.search('((#.+\r\n){20,25})(\s.*)', s.before) #-- MY PROBLEM

[snip]

The second one works and it will print out pa11-chi1, but when there is a space or "console" is in the output it won't print anything or it won't match anything... I want to be able to match just the hostname and print it out. Any ideas?

Thanks, Jonathan

It is also posted here more clearly and formatted as it would appear on the terminal: http://www.pastebin.ca/366822

What about using s.before.split("\r\n")[-1]?

A

-- http://mail.python.org/mailman/listinfo/python-list
Re: pexpect regex help
On Feb 23, 8:46 am, amadain [EMAIL PROTECTED] wrote:
On Feb 21, 11:15 pm, [EMAIL PROTECTED] wrote:
On Feb 21, 6:13 pm, [EMAIL PROTECTED] wrote:

I have a pexpect script to walk through a cisco terminal server and I was hoping to get some help with this regex because I really suck at it.

[snip]

The second one works and it will print out pa11-chi1, but when there is a space or "console" is in the output it won't print anything or it won't match anything... I want to be able to match just the hostname and print it out. Any ideas?

What about using s.before.split("\r\n")[-1]?

A

    result = [x for x in s.before.split("\r\n") if x != ""]
    print result[-1]

should cover the blank line problem.

A

-- http://mail.python.org/mailman/listinfo/python-list
Re: pexpect regex help
On Feb 23, 8:53 am, amadain [EMAIL PROTECTED] wrote:
On Feb 23, 8:46 am, amadain [EMAIL PROTECTED] wrote:
On Feb 21, 11:15 pm, [EMAIL PROTECTED] wrote:
On Feb 21, 6:13 pm, [EMAIL PROTECTED] wrote:

I have a pexpect script to walk through a cisco terminal server and I was hoping to get some help with this regex because I really suck at it.

[snip]

The second one works and it will print out pa11-chi1, but when there is a space or "console" is in the output it won't print anything or it won't match anything... I want to be able to match just the hostname and print it out. Any ideas?

What about using s.before.split("\r\n")[-1]?

A

    result = [x for x in s.before.split("\r\n") if x != ""]
    print result[-1]

should cover the blank line problem.

A

Sorry, I just read that you are not matching sometimes. Try expecting for "ogin:" (without the first letter and trailing space). There could be no space after "login:" or there could be \t (tab).

A

-- http://mail.python.org/mailman/listinfo/python-list
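The split suggestions above can be combined into one small helper. This is only a sketch for Python 3, assuming the text captured before the login prompt ends with a line whose first word is the hostname, as in both banner variants quoted in the thread; hostname_from_banner is a made-up name and the sample strings are shortened stand-ins for s.before.

```python
def hostname_from_banner(before):
    """Return the first word of the last non-blank line, or None if there is none."""
    lines = [line.strip() for line in before.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return None
    return lines[-1].split()[0]


# Shortened stand-ins for s.before in the two cases from the thread.
with_console = '# evidence of such monitoring to law enforcement officials. #\r\n#\r\npa-chi1 console '
without_console = '# evidence of such monitoring to law enforcement officials. #\r\n#\r\npa11-chi1 '

print(hostname_from_banner(with_console))
print(hostname_from_banner(without_console))
```

Because only the first word of the last line is used, a trailing "console", a tab, or a stray letter left over from expecting "ogin:" does not change the result.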
pexpect regex help
I have a pexpect script to walk through a cisco terminal server and I was hoping to get some help with this regex because I really suck at it.

This is the code:

    index = s.expect(['login: ', pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
        m = re.search('((#.+\r\n){20,25})(\s.*)', s.before) #-- MY PROBLEM
        print m.group(3),
        print ' %s %s' % (ip[0], port)
        s.send(chr(30))
        s.sendline('x')
        s.sendline('disco')
        s.sendline('\n')
    elif index == 1:
        print s.before
    elif index == 2:
        print
        print '%s %s FAILED' % (ip[0], port)
        print 'This host may be down or locked on the TS'
        s.send(chr(30))
        s.sendline('x')
        s.sendline('disco')
        s.sendline('\n')

This is attempting to match the hostname of the connected host using the output of a motd file which unfortunately is not the same everywhere... It looks like this:

    #
    # This system is the property of: #
    # #
    #DefNet #
    # #
    # Use of this system is for authorized users only.#
    # Individuals using this computer system without authority, or in #
    # excess of their authority, are subject to having all of their #
    # activities on this system monitored and recorded by system #
    # personnel. #
    # #
    # In the course of monitoring individuals improperly using this #
    # system, or in the course of system maintenance, the activities #
    # of authorized users may also be monitored. #
    # #
    # Anyone using this system expressly consents to such monitoring #
    # and is advised that if such monitoring reveals possible #
    # evidence of criminal activity, system personnel may provide the #
    # evidence of such monitoring to law enforcement officials. #
    #
    pa-chi1 console login:

And sometimes it looks like this:

    #
    # This system is the property of: #
    # #
    #DefNet #
    # #
    # Use of this system is for authorized users only.#
    # Individuals using this computer system without authority, or in #
    # excess of their authority, are subject to having all of their #
    # activities on this system monitored and recorded by system #
    # personnel. #
    # #
    # In the course of monitoring individuals improperly using this #
    # system, or in the course of system maintenance, the activities #
    # of authorized users may also be monitored. #
    # #
    # Anyone using this system expressly consents to such monitoring #
    # and is advised that if such monitoring reveals possible #
    # evidence of criminal activity, system personnel may provide the #
    # evidence of such monitoring to law enforcement officials. #
    #
    pa11-chi1 login:

The second one works and it will print out pa11-chi1, but when there is a space or "console" is in the output it won't print anything or it won't match anything... I want to be able to match just the hostname and print it out. Any ideas?

Thanks, Jonathan

-- http://mail.python.org/mailman/listinfo/python-list
Re: pexpect regex help
On Feb 21, 6:13 pm, [EMAIL PROTECTED] wrote:

I have a pexpect script to walk through a cisco terminal server and I was hoping to get some help with this regex because I really suck at it.

[snip]

The second one works and it will print out pa11-chi1, but when there is a space or "console" is in the output it won't print anything or it won't match anything... I want to be able to match just the hostname and print it out. Any ideas?

Thanks, Jonathan

It is also posted here more clearly and formatted as it would appear on the terminal: http://www.pastebin.ca/366822

-- http://mail.python.org/mailman/listinfo/python-list
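If a regex is still wanted, one option is to stop counting banner lines and anchor at the end of the captured text instead. The sketch below (Python 3) assumes the hostname is the last whitespace-free token before an optional "console" word and trailing whitespace; HOST and hostname are illustrative names, and the pattern is a suggestion rather than something taken from the thread.

```python
import re

# Last non-space token, optionally followed by " console", then trailing whitespace.
HOST = re.compile(r'(\S+)(?:[ \t]+console)?\s*$')


def hostname(before):
    """Return the hostname found at the end of the captured text, or None."""
    match = HOST.search(before)
    return match.group(1) if match else None


print(hostname('# officials. #\r\n#\r\npa-chi1 console '))
print(hostname('# officials. #\r\n#\r\npa11-chi1 '))
```

Since nothing in the pattern depends on how many '#' lines come first, the differing motd files should not matter; a host that is itself named "console" would need special handling.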
Regex help...pretty please?
I'm trying to develop a little script that does some string manipulation. I have a few hundred strings that currently look like this:

    cond(a,b,c)

and I want them to look like this:

    cond(c,a,b)

but it gets a little more complicated because the conds themselves may have conds within, like the following:

    cond(0,cond(c,cond(e,cond(g,h,(af)),(ad)),(ab)),(a1))

What I want to do in this case is move the last parameter to the front and then work backwards all the way out (if you're thinking recursion too, I'm vindicated) so that it ends up looking like this:

    cond((a1), 0, cond((ab),c,cond((ad), e, cond((af), g, h))))

Furthermore, the conds may be multiplied by an expression, such as the following:

    cond(-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))

Here, all I want to do is switch the parameters of the conds without touching the expression, like so:

    cond(f,-1,1)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))

So that's the gist of my problem statement. I immediately thought that regular expressions would provide an elegant solution. I would go through the string by conds, stripping the () off them, until I got to the lowest level, then move the parameters and work backwards. That thought process became this:

-CODE

    import re

    def swap(left, middle, right):
        left = left.replace("(", "")
        right = right.replace(")", "")
        temp = left
        left = right
        right = temp
        temp = middle
        middle = right
        right = temp
        whole = 'cond(' + left + ',' + middle + ',' + right + ')'
        return whole

    def condReplacer(string):
        #regex = re.compile(r'cond\(.*,.*,.+\)')
        regex = re.compile(r'cond\(.*,.*,.+?\)')
        if not regex.search(string):
            print "whole string is: " + string
            [left, middle, right] = string.split(',')
            right = right.replace('\'', ' ')
            string = swap(left.strip(), middle.strip(), right.strip())
            print "the new string is:" + string
            return string
        else:
            more_conds = regex.search(string)
            temp_string = more_conds.group()
            firstParen = temp_string.find('(')
            temp_string = temp_string[firstParen:]
            print "there are more conditionals! " + temp_string
            condReplacer(temp_string)

    def lineReader(file):
        for line in file:
            regex = r'cond\(.*,.*,.+\)?'
            if re.search(regex,line,re.DOTALL):
                condReplacer(line)

    if __name__ == "__main__":
        input_file = open("only_conds2.txt", 'r')
        lineReader(input_file)

-CODE

I think my problem lies in my regular expression... If I use the one commented out I do a greedy search, and in my test case where I have a conditional * an expression, I grab the expression too, like so:

INPUT:

    cond(-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))

OUTPUT:

    whole string is: (-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))
    the new string is:cond(f*((float(e*(2**4+(float(d*8+(float(c*4+(float(b*2+float(a,-1,1)

when all I really want to do is grab the part associated with the cond. But if I do a non-greedy search I avoid that problem but stop too early when I have an expression like this:

INPUT:

    cond(a,b,(abs(c) = d))

OUTPUT:

    whole string is: (a,b,(abs(c)
    the new string is:cond((abs(c,a,b)

Can anyone help me with the regular expression? Is this even the best approach to take? Anyone have any thoughts? Thanks for your time!
-- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help...pretty please?
cond(a,b,c) and I want them to look like this: cond(c,a,b) but it gets a little more complicated because the conds themselves may have conds within, like the following: cond(0,cond(c,cond(e,cond(g,h,(af)),(ad)),(ab)),(a1))

Regexps are *really* *REALLY* *bad* at arbitrarily nested structures. Really. Sounds more like you want something like a lex/yacc sort of solution. IIUC, pyparsing may do the trick for you.

I'm not a pyparsing wonk, but I can hold my own when it comes to crazy regexps, and can tell you from experience that regexps are *not* a good path to try and go down for this problem. Many times, a regexp can be hammered into solving problems that have superior solutions not employing regexps. This case is not even one of those.

If you know the maximum depth of nesting you'll encounter, you can do some hackish stunts to shoehorn regexps to solve the problem. But if they are truly of arbitrary nesting-depth, *good* *luck*! :)

-tkc

-- http://mail.python.org/mailman/listinfo/python-list
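To make the nesting point concrete, here is a rough sketch of a hand-written scanner that needs neither regexps nor a parser library. It is not the lex/yacc or pyparsing route recommended above, just a paren-counting alternative in Python 3. It assumes balanced parentheses and that every "cond(" in the input really is a cond call (an identifier such as "second(" would be picked up too); all three function names are made up for this example.

```python
def split_top_level(args):
    """Split a string on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(args):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ',' and depth == 0:
            parts.append(args[start:i])
            start = i + 1
    parts.append(args[start:])
    return parts


def find_close(text, open_idx):
    """Return the index of the ')' that matches the '(' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == '(':
            depth += 1
        elif text[i] == ')':
            depth -= 1
            if depth == 0:
                return i
    raise ValueError('unbalanced parentheses')


def rotate_conds(text):
    """Rewrite every cond(a,b,c) as cond(c,a,b), nested conds included."""
    out, pos = [], 0
    while True:
        start = text.find('cond(', pos)
        if start == -1:
            out.append(text[pos:])
            return ''.join(out)
        open_idx = start + len('cond')
        close_idx = find_close(text, open_idx)
        inner = text[open_idx + 1:close_idx]
        # Recurse into each argument first so inner conds are rotated too.
        args = [rotate_conds(arg.strip()) for arg in split_top_level(inner)]
        if len(args) == 3:
            args = [args[2], args[0], args[1]]
        out.append(text[pos:start] + 'cond(' + ','.join(args) + ')')
        pos = close_idx + 1


print(rotate_conds('cond(0,cond(c,cond(e,cond(g,h,(af)),(ad)),(ab)),(a1))'))
```

Text outside a cond call, such as a trailing multiplication by an arithmetic expression, is copied through untouched because only the span between a "cond(" and its matching ")" is rewritten.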
Re: Regex help...pretty please?
MooMaster wrote:

I'm trying to develop a little script that does some string manipulation. I have a few hundred strings that currently look like this: cond(a,b,c) and I want them to look like this: cond(c,a,b)

[snip]

Can anyone help me with the regular expression? Is this even the best approach to take? Anyone have any thoughts? Thanks for your time!

You're gonna want a parser for this. pyparsing or spark would suffice. However, since it looks like your source strings are valid python you could get some traction out of the tokenize standard library module:

    from tokenize import generate_tokens
    from StringIO import StringIO

    s = 'cond(-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))'

    for t in generate_tokens(StringIO(s).readline):
        print t[1],

Prints:

    cond ( - 1 , 1 , f ) * ( ( float ( e ) * ( 2 ** 4 ) ) + ( float ( d ) * 8 ) + ( float ( c ) * 4 ) + ( float ( b ) * 2 ) + float ( a ) )

Once you've got that far the rest should be easy. :)

Peace,
~Simon

http://pyparsing.wikispaces.com/
http://pages.cpsc.ucalgary.ca/~aycock/spark/
http://docs.python.org/lib/module-tokenize.html

-- http://mail.python.org/mailman/listinfo/python-list
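The tokenize snippet above is Python 2. A rough Python 3 equivalent might look like the following; the tokens helper is a made-up name, and filtering on token type is a choice made here to drop the NEWLINE and ENDMARKER tokens that the tokenizer emits, not part of the original reply.

```python
from io import StringIO
from tokenize import NAME, NUMBER, OP, generate_tokens


def tokens(expr):
    """Return the name, number and operator tokens of expr as plain strings."""
    return [tok.string
            for tok in generate_tokens(StringIO(expr).readline)
            if tok.type in (NAME, NUMBER, OP)]


s = 'cond(-1,1,f)*((float(e)*(2**4))+float(a))'
print(' '.join(tokens(s)))
```

With the input split into tokens, finding the arguments of each cond comes down to counting '(' and ')' tokens and splitting on ',' tokens at depth one.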
Re: Regex help...pretty please?
MooMaster [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

I'm trying to develop a little script that does some string manipulation. I have a few hundred strings that currently look like this: cond(a,b,c) and I want them to look like this: cond(c,a,b)

[snip]

Pyparsing makes this a fairly tractable problem. The hardest part is defining the valid contents of a relational and arithmetic expression, which may be found within the arguments of your cond(a,b,c) constructs. Not guaranteeing this 100%, but it did convert your pathologically nested example on the first try.

-- Paul

--

    from pyparsing import *

    ident = ~Literal("cond") + Word(alphas)
    number = Combine(Optional("-") + Word(nums) + Optional("." + Word(nums)))

    arithExpr = Forward()
    funcCall = ident + "(" + delimitedList(arithExpr) + ")"
    operand = number | funcCall | ident
    binop = oneOf("+ - * /")
    arithExpr << ( ( operand + ZeroOrMore( binop + operand ) ) | ("(" + arithExpr + ")" ) )
    relop = oneOf("< > == <= >= !=")

    condDef = Forward()
    simpleCondExpr = arithExpr + ZeroOrMore( relop + arithExpr ) | condDef
    multCondExpr = simpleCondExpr + "*" + arithExpr
    condExpr = Forward()
    condExpr << ( simpleCondExpr | multCondExpr | "(" + condExpr + ")" )

    def reorderArgs(t):
        return "cond(" + ",".join(["".join(t.arg3), "".join(t.arg1), "".join(t.arg2)]) + ")"

    condDef << ( Literal("cond") + "(" + Group(condExpr).setResultsName("arg1") + "," +
                 Group(condExpr).setResultsName("arg2") + "," +
                 Group(condExpr).setResultsName("arg3") + ")" ).setParseAction( reorderArgs )

    tests = [
        "cond(a,b,c)",
        "cond(12,b,c)",
        "cond(-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))",
        "cond(a,b,(abs(c) = d))",
        "cond(0,cond(c,cond(e,cond(g,h,(af)),(ad)),(ab)),(a1))",
        ]

    for t in tests:
        print t,"->",condExpr.transformString(t)

--

Prints:

    cond(a,b,c) -> cond(c,a,b)
    cond(12,b,c) -> cond(c,12,b)
    cond(-1,1,f)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a)) -> cond(f,-1,1)*((float(e)*(2**4))+(float(d)*8)+(float(c)*4)+(float(b)*2)+float(a))
    cond(a,b,(abs(c) = d)) -> cond((abs(c)=d),a,b)
    cond(0,cond(c,cond(e,cond(g,h,(af)),(ad)),(ab)),(a1)) -> cond((a1),0,cond((ab),c,cond((ad),e,cond((af),g,h))))

-- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help...pretty please?
MooMaster wrote:

I'm trying to develop a little script that does some string manipulation. I have a few hundred strings that currently look like this: cond(a,b,c) and I want them to look like this: cond(c,a,b)

I zoned out on your question and created a very simple flipper. Although it will not solve your problem, maybe someone looking for a simpler version may find it useful as a starting point. I hope it proves useful. I'll post my simple flipper here:

    s = 'cond(1,savv(grave(3,2,1),y,x),maxx(c,b,a),0)'

    def argFlipper(s):
        ''' take a string of arguments and reverse'em
        e.g.
        cond(1,savv(grave(3,2,1),y,x),maxx(c,b,a),0)
        -> cond(0,maxx(a,b,c),savv(x,y,grave(1,2,3)),1)
        '''
        count = 0
        keyholder = {}
        while 1:
            if s.find('(') > 0:
                # replace the innermost (...) group with a placeholder
                count += 1
                value = '%sph' + '%d' % count
                tempstring = [x for x in s]
                startindex = s.rfind('(')
                limitindex = s.find(')', startindex)
                argtarget = s[startindex + 1:limitindex].split(',')
                argreversed = ','.join(reversed(argtarget))
                keyholder[value] = '(' + argreversed + ')'
                tempstring[startindex:limitindex + 1] = value
                s = ''.join(tempstring)
            else:
                # no parentheses left: substitute the placeholders back
                while count and keyholder:
                    s = s.replace(value, keyholder[value])
                    count -= 1
                    value = '%sph' + '%d' % count
                return s

    print argFlipper(s)

-- http://mail.python.org/mailman/listinfo/python-list
regex help
I have the following table and I am trying to match the percentage in the 2nd column on the 2nd TIGER line (9.0). I have tried both of the following. I expected both to match but neither did. Is there a modifier I am missing? What changes do I need to make these match? I need to keep the structure of the regex the same.

    TIGER.append(re.search("TIGER\s{10}.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))
    TIGER.append(re.search("^TIGER.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))

    BASE - TOTAL TIGER 268 268173 95 101 - 10157 - 5778 276 268 19276230 21
    DOG 7979 44 3531 -3117 - 1725 124795524 75 1
    29.5 29.5 25.4 36.8 30.7 - 30.7 29.8 - 29.8 32.1 50.0 31.6 29.5 28.6 31.6 32.64.8
    CAT 4646 28 1820 -20 7 - 714 -14463214 39 4
    17.2 17.2 16.2 18.9 19.8 - 19.8 12.3 - 12.3 17.9 - 18.4 17.2 16.7 18.4 17.0 19.0
    LAMB3232 23 910 -10 8 - 812 -12322012 28 1
    11.9 11.9 13.39.5 9.9 - 9.9 14.0 - 14.0 15.4 - 15.8 11.9 10.4 15.8 12.24.8
    TRIPOD 3232 23 9 9 - 9 9 - 911 110322210 28 3
    11.9 11.9 13.39.5 8.9 - 8.9 15.8 - 15.8 14.1 50.0 13.2 11.9 11.5 13.2 12.2 14.3
    TIGER 2424 16 8 5 - 510 - 10 7 - 72417 7 18 2
    9.0 9.09.28.4 5.0 - 5.0 17.5 - 17.5 9.0 - 9.2 9.0 8.9 9.27.89.5

-- http://mail.python.org/mailman/listinfo/python-list
Re: regex help
Lance Hoffmeyer wrote:

I have the following table and I am trying to match the percentage in the 2nd column on the 2nd TIGER line (9.0). I have tried both of the following. I expected both to match but neither did. Is there a modifier I am missing? What changes do I need to make these match? I need to keep the structure of the regex the same.

    TIGER.append(re.search("TIGER\s{10}.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))
    TIGER.append(re.search("^TIGER.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))

You can try the re.DOTALL flag (prepend the regex string with "(?s)"), but I'd go with something really simple:

    instream = iter(target_table.splitlines()) # or: instream = open(datafile)
    for line in instream:
        if line.startswith("TIGER"):
            value = instream.next().split()[1] # or ...[0]? they are both '9.0'
            TIGER.append(value)
            break

Peter

-- http://mail.python.org/mailman/listinfo/python-list
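For the flag-based route, here is a small Python 3 sketch of what DOTALL changes. It also adds MULTILINE so that "^" can match at the start of the TIGER line rather than only at the start of the whole table, which goes slightly beyond the suggestion above. The sample table is a cut-down stand-in with an assumed layout (label row followed by a percentage row), and second_percentage is an illustrative name.

```python
import re

# Cut-down stand-in for target_table; layout assumed from the question.
table = (
    "BASE - TOTAL TIGER  268  268  173\n"
    "DOG      79   79   44\n"
    "       29.5 29.5 25.4\n"
    "TIGER    24   24   16\n"
    "        9.0  9.0  9.2\n"
)

PATTERN = re.compile(r'^TIGER.*?(?:(\d{1,3}\.\d)\s+){2}',
                     re.DOTALL | re.MULTILINE)


def second_percentage(text):
    """Return the second decimal number after a line starting with TIGER, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(second_percentage(table))
print(re.search(r'^TIGER.*?(?:(\d{1,3}\.\d)\s+){2}', table))
```

A capturing group inside a repeated group keeps only its last repetition, which is why group(1) ends up holding the second number. The final line shows the unflagged pattern finding nothing, because "." stops at the line break and "^" only matches at the very start.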
Re: regex help
Why not use split instead of regular expressions?

    >>> ln = "32 32 23 9 9 - 9 9 - 9 11 1 10"
    >>> ln.split()
    ['32', '32', '23', '9', '9', '-', '9', '9', '-', '9', '11', '1', '10']

Much simpler, yes? Just find the line that comes after a line that begins with TIGER, split it, and pick the number you want out of the resulting list.

Lance Hoffmeyer wrote:

I have the following table and I am trying to match the percentage in the 2nd column on the 2nd TIGER line (9.0). I have tried both of the following. I expected both to match but neither did. Is there a modifier I am missing? What changes do I need to make these match? I need to keep the structure of the regex the same.

    TIGER.append(re.search("TIGER\s{10}.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))
    TIGER.append(re.search("^TIGER.*?(?:(\d{1,3}\.\d)\s+){2}", target_table).group(1))

[snip]

-- http://mail.python.org/mailman/listinfo/python-list
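The "find the line after TIGER, split it, pick a column" recipe might be sketched like this in Python 3. The helper name value_after and the cut-down sample table are assumptions for illustration; the real table has many more columns.

```python
def value_after(label, table, column=1):
    """Return field `column` of the line following the first line that starts with label."""
    lines = iter(table.splitlines())
    for line in lines:
        if line.startswith(label):
            fields = next(lines, '').split()
            return fields[column] if column < len(fields) else None
    return None


# Cut-down stand-in for the table in the question.
sample = (
    "BASE - TOTAL TIGER  268  268  173\n"
    "DOG      79   79   44\n"
    "       29.5 29.5 25.4\n"
    "TIGER    24   24   16\n"
    "        9.0  9.0  9.2\n"
)

print(value_after('TIGER', sample))
```

The header row mentions TIGER but does not start with it, so startswith skips it and lands on the data row.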
Regex help needed
Hi all,

I am using python to drive another tool using pexpect. The values which I get back I would like to automatically put into a list if there is more than one return value. They provide me a way to see that the data is in a set by parenthesising it.

This is all generated, as I said, using pexpect. Here is how I use it:

    child = pexpect.spawn( _buildCadenceExe(), timeout=timeout)
    child.sendline("somefunction()")
    child.expect( )
    data=child.before

Given this, data can take on several shapes.

Single return value -- THIS IS THE ONE I CAN'T GET TO WORK..

    data = 'somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'

Multiple return values

    data = 'somefunction()\r\n("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile")\r\n'

It may take up several lines...

    data = 'somefunction()\r\n("." "~" \r\n"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"\r\n"foo")\r\n'

So if you're still reading this, I want to parse out data. Here are the rules...

- Line 1 ALWAYS is the calling function; whatever is there (except \r\n) should be kept as original
- Anything may occur inside the quotations - I don't care what's in there per se, but it must be maintained.
- Parenthesised items I want to be pushed into a list. I haven't run into a case where you have nested parens, but that's not to say it won't happen...

So here is my code.. Pardon my hack job..

    import os,re

    def main(data=None):

        # Get rid of the annoying \r's
        dat=data.split("\r")
        data="".join(dat)

        # Remove the first line - that is the original call
        dat = data.split("\n")
        original=dat[0]
        del dat[0]
        print "Original", original

        # Now join all of the remaining lines
        retl="".join(dat)

        # self.logger.debug("Original = \'%s\'" % original)

        try:
            # Get rid of the parenthesis
            parmatcher = re.compile( r'\(([^()]*)\)' )
            parmatch = parmatcher.search(retl)

            # Get rid of the first and last quotes
            qrmatcher = re.compile( r'\"([^()]*)\"' )
            qrmatch = qrmatcher.search(parmatch.group(1))

            # Split the items
            qmatch=re.compile(r'\"\s+\"')
            results = qmatch.split(qrmatch.group(1))
        except:
            qrmatcher = re.compile( r'\"([^()]*)\"' )
            qrmatch = qrmatcher.search(retl)

            # Split the items
            qmatch=re.compile(r'\"\s+\"')
            results = qmatch.split(qrmatch.group(1))

        print "Orig", original, "Results", results
        return original,results

    # General run..
    if __name__ == '__main__':
        # data = 'someFunction\r\n "test" "foo"\r\n'
        # data = 'someFunction\r\n "test foo"\r\n'
        data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
        # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'
        main(data)

CAN SOMEONE PLEASE CLEAN THIS UP?
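For the non-nested case, the rules in the post above can be sketched much more compactly in Python 3. This is only one possible cleanup, not the poster's code: the name `parse_reply` is hypothetical, and the sketch assumes every value is wrapped in double quotes and contains no embedded double quote.

```python
import re

# One double-quoted value; the capture group keeps the text between the quotes.
QUOTED = re.compile(r'"([^"]*)"')

def parse_reply(data):
    """Split pexpect output into (original call, list of quoted values)."""
    lines = data.splitlines()              # handles both \r\n and \n
    original, rest = lines[0], "\n".join(lines[1:])
    return original, QUOTED.findall(rest)

print(parse_reply('getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 '
                  '05/22/2005 23:36 (cicln01) $"\r\n'))
print(parse_reply('someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n'
                  ' "newline" "test2")\r\n'))
```

Because the quoted values are matched as whole units, parentheses inside a value (as in the version string) are left alone. The sketch flattens everything into one list, so it does not tell a single return value from a parenthesised set and does not handle nesting.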
Re: Regex help needed
rh0dium wrote:
> Hi all,
>
> I am using python to drive another tool using pexpect. The values which I get back I would like to automatically put into a list if there is more than one return value. They provide me a way to see that the data is in a set by parenthesising it.
> [snip]

Well, you asked for regex help, but a pyparsing rendition may be easier to read and maintain.

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)

    # test data strings
    test1 = """somefunction()
    "@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"
    """

    test2 = """somefunction()
    ("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile" "foo")
    """

    test3 = """somefunctionWithNestedlist()
    ("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile" ("Hey!" "this is a nested" "list") "foo")
    """

    from pyparsing import Literal, Word, alphas, alphanums, \
        dblQuotedString, OneOrMore, Group, Forward

    LPAR = Literal("(")
    RPAR = Literal(")")

    # assume function identifiers must start with alphas, followed by zero or more
    # alphas, numbers, or '_' - expand this defn as needed
    ident = Word(alphas,alphanums+"_")

    # define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
    # in a minute
    quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) + RPAR.suppress() )

    # define format of a line of data - don't bother with \n's or \r's,
    # pyparsing just skips 'em
    dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

    def test(t):
        print dataFormat.parseString(t)

    print "Parse flat lists"
    test(test1)
    test(test2)

    # modifications for nested lists
    quoteList = Forward()
    quoteList << Group( LPAR.suppress() + OneOrMore(dblQuotedString | quoteList) + RPAR.suppress() )
    dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

    print
    print "Parse using nested lists"
    test(test1)
    test(test2)
    test(test3)

Parsing results:

    Parse flat lists
    ['somefunction', '(', ')', '@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $']
    ['somefunction', '(', ')', ['.', '~', '/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile', 'foo']]

    Parse using nested lists
    ['somefunction', '(', ')', '@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $']
    ['somefunction', '(', ')', ['.', '~', '/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile', 'foo']]
    ['somefunctionWithNestedlist', '(', ')', ['.', '~', '/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile', ['Hey!', 'this is a nested', 'list'], 'foo']]
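The recursive definition that Forward() makes possible can also be illustrated with only the standard library. The sketch below is not pyparsing and not Paul's code: it swaps in a regex tokenizer plus a small recursive function, written for Python 3, and the names `TOKEN`, `parse_items` and `parse_nested` are hypothetical.

```python
import re

# A token is either a double-quoted string (group 1) or one parenthesis (group 2).
TOKEN = re.compile(r'"([^"]*)"|([()])')

def parse_items(tokens):
    """Consume tokens from a shared iterator and return a possibly nested list."""
    items = []
    for text, paren in tokens:
        if paren == "(":
            items.append(parse_items(tokens))   # recurse into the sub-list
        elif paren == ")":
            return items                        # close the current list
        else:
            items.append(text)
    return items

def parse_nested(data):
    """Return (first line, nested list built from the remaining text)."""
    original, _, rest = data.partition("\n")
    return original.rstrip("\r"), parse_items(iter(TOKEN.findall(rest)))

sample = ('somefunctionWithNestedlist()\r\n'
          '("." "~" ("Hey!" "this is a nested" "list") "foo")\r\n')
print(parse_nested(sample))
```

The recursion mirrors the self-referencing grammar rule: a list contains strings or further lists. Unlike the pyparsing grammar, this sketch silently skips anything that is neither a quoted string nor a parenthesis, so it gives no error on malformed input.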
Re: Regex help needed
Paul McGuire wrote:
> -- Paul
> (Download pyparsing at http://pyparsing.sourceforge.net.)

Done. Hey, this is pretty cool! I have one small problem that I don't know how to resolve. I want the entire contents (whatever it is) of line 1 to be the ident. Digging into the code turned up line, lineno, and LineStart/LineEnd. I tried to use all three, but it didn't work, for a few reasons:

- line: type issues
- lineno: I needed the data and couldn't get it to work
- LineStart/LineEnd: I think it matches every line, and I need the scope limited to line 1

So here is my rendition of the code. But this is REALLY slick.. I think the problem is the parens on line one.

    def main(data=None):

        LPAR = Literal("(")
        RPAR = Literal(")")

        # assume function identifiers must start with alphas, followed by zero or more
        # alphas, numbers, or '_' - expand this defn as needed
        ident = LineStart + LineEnd

        # define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
        # in a minute
        quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) + RPAR.suppress())

        # define format of a line of data - don't bother with \n's or \r's,
        # pyparsing just skips 'em
        dataFormat = ident + ( dblQuotedString | quoteList )

        return dataFormat.parseString(data)

    # General run..
    if __name__ == '__main__':
        # data = 'someFunction\r\n "test" "foo"\r\n'
        # data = 'someFunction\r\n "test foo"\r\n'
        data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
        # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'
        foo = main(data)
        print foo
Re: Regex help needed
rh0dium wrote:
> I have one small problem that I don't know how to resolve. I want the entire contents (whatever it is) of line 1 to be the ident.
> [snip]
>     ident = LineStart + LineEnd
> [snip]

LineStart() + LineEnd() will only match an empty line.

If you describe in words what you want ident to be, it may be more natural to translate to pyparsing. "A word starting with an alpha, followed by zero or more alphas, numbers, or '_'s, with a trailing pair of parens" becomes:

    ident = Word(alphas,alphanums+"_") + LPAR + RPAR

If you want the ident all combined into a single token, use:

    ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )

LineStart and LineEnd are geared more for line-oriented or whitespace-sensitive grammars. Your example doesn't really need them, I don't think.

If you *really* want everything on the first line to be the ident, try this:

    ident = Word(alphas,alphanums+"_") + restOfLine

or

    ident = Combine( Word(alphas,alphanums+"_") + restOfLine )

Now the next step is to assign field names to the results:

    dataFormat = ident.setResultsName("ident") + ( dblQuotedString | quoteList ).setResultsName("contents")

    test = "blah blah test string"

    results = dataFormat.parseString(test)
    print results.ident, results.contents

I'm glad pyparsing is working out for you! There should be a number of examples that ship with pyparsing that may give you some more ideas on how to proceed from here.

-- Paul
Re: Regex help needed
rh0dium wrote:
> Hi all,
>
> I am using python to drive another tool using pexpect. The values which I get back I would like to automatically put into a list if there is more than one return value. They provide me a way to see that the data is in a set by parenthesising it.
> ...
> CAN SOMEONE PLEASE CLEAN THIS UP?

How about using the Python tokenizer rather than re:

    >>> import cStringIO, tokenize
    ...
    >>> def get_tokens(source):
    ...     allowed_tokens = (tokenize.STRING, tokenize.OP)
    ...     src = cStringIO.StringIO(source).readline
    ...     src = tokenize.generate_tokens(src)
    ...     return (token[1] for token in src if token[0] in allowed_tokens)
    ...
    >>> def rest_eval(tokens):
    ...     output = []
    ...     for token in tokens:
    ...         if token == "(":
    ...             output.append(rest_eval(tokens))
    ...         elif token == ")":
    ...             return output
    ...         else:
    ...             output.append(token[1:-1])
    ...     return output
    ...
    >>> def parse(source):
    ...     source = source.splitlines()
    ...     original, rest = source[0], "\n".join(source[1:])
    ...     return original, rest_eval(get_tokens(rest))
    ...
    >>> sources = [
    ...     'someFunction\r\n "test" "foo"\r\n',
    ...     'someFunction\r\n "test foo"\r\n',
    ...     'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n',
    ...     'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n']
    >>>
    >>> for data in sources: parse(data)
    ...
    ('someFunction', ['test', 'foo'])
    ('someFunction', ['test foo'])
    ('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
    ('someFunction', [['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']])
    >>>

Cheers
Michael
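For readers on Python 3, the tokenizer approach above might be ported roughly as follows. This is a sketch, not Michael's code: `io.StringIO` stands in for `cStringIO`, tokens are read through their named fields, and stripping the indentation of each payload line is an added assumption so the tokenizer never has to deal with inconsistent indentation.

```python
import io
import tokenize

def get_tokens(source):
    """Yield the text of every string and operator token in `source`."""
    allowed = (tokenize.STRING, tokenize.OP)
    readline = io.StringIO(source).readline
    return (tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in allowed)

def rest_eval(tokens):
    """Build nested lists from '(' and ')' tokens; unquote string tokens."""
    output = []
    for token in tokens:
        if token == "(":
            output.append(rest_eval(tokens))
        elif token == ")":
            return output
        else:
            output.append(token[1:-1])      # drop the surrounding quotes
    return output

def parse(source):
    lines = source.splitlines()
    # Strip each payload line so no line starts with indentation.
    rest = "\n".join(line.strip() for line in lines[1:])
    return lines[0], rest_eval(get_tokens(rest))

print(parse('someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n'
            ' "newline" "test2")\r\n'))
```

The approach relies on the tool's output happening to look like Python string literals and parentheses. Output containing an unterminated quote, for example, would make the tokenizer raise an error instead of returning a result.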
Re: Regex help needed
Paul McGuire wrote:
>     ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )

This will only work for a word followed by parentheses (i.e. somefunction()).

> If you *really* want everything on the first line to be the ident, try this:
>
>     ident = Word(alphas,alphanums+"_") + restOfLine
> or
>     ident = Combine( Word(alphas,alphanums+"_") + restOfLine )

This nicely grabs the \r.. How can I get around it?

> Now the next step is to assign field names to the results:
>
>     dataFormat = ident.setResultsName("ident") + ( dblQuotedString | quoteList ).setResultsName("contents")

This is super cool!! So let's take this for example:

    test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

Now I want the ident to pull out 'fprintf( outFile "leSetInstSelectable( t )\n" )', so I tried to do this:

    ident = Forward()
    ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore( dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)

borrowing from the example listed previously. But it bombs out because it wants a ) even though it has one.. Forward() ROCKS!!

Also, how does it know to do this for just the first line? It would seem that this will work for every line, no?
Re: Regex help needed
Michael Spencer wrote:
>     def parse(source):
>         source = source.splitlines()
>         original, rest = source[0], "\n".join(source[1:])
>         return original, rest_eval(get_tokens(rest))

This is a very clean and elegant way to separate them. Very nice!! I like this a lot, and I will definitely use this in the future!!
Re: Regex help needed
rh0dium wrote:
> This nicely grabs the \r.. How can I get around it?
> [snip]
>     test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'
>
> Now I want the ident to pull out 'fprintf( outFile "leSetInstSelectable( t )\n" )', so I tried to do this:
>
>     ident = Forward()
>     ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore( dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
>
> But it bombs out because it wants a ) even though it has one.. Forward() ROCKS!!
>
> Also, how does it know to do this for just the first line? It would seem that this will work for every line, no?

This works for me:

    test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" )
    ("test" "test1" "foo aasdfasdf"
    "newline" "test2")
    """

    ident = Forward()
    ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore( dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
    dataFormat = ident + ( dblQuotedString | quoteList )
    print dataFormat.parseString(test4)

Prints:

    [['fprintf', '(', 'outFile', 'leSetInstSelectable( t )\\n', ')'], ['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']]

1. Is there supposed to be a real line break in the string "leSetInstSelectable( t )\n", or just a slash-n at the end? pyparsing quoted strings do not accept multiline quotes, but they do accept escaped characters such as "\t", "\n", etc. That is, to pyparsing:

    "\n this is a valid \t \n string"

    "this is not
    a valid string"

Part of the confusion is that your examples include explicit \r\n characters. I'm assuming this is to reflect what you see when listing out the Python variable containing the string. (Are you opening a text file with "rb" to read in binary? Try opening with just "r", and this may resolve your \r\n problems.)

2. If restOfLine is still giving you \r's at the end, you can redefine restOfLine to not include them, or to include and suppress them. Or (this is easier) define a parse action for restOfLine that strips trailing \r's:

    def stripTrailingCRs(st,loc,toks):
        try:
            if toks[0][-1] == '\r':
                return toks[0][:-1]
        except:
            pass

    restOfLine.setParseAction( stripTrailingCRs )

3. How does it know to only do it for the first line? Presumably you told it to do so. pyparsing's parseString method starts at the beginning of the input string and matches expressions until it finds a mismatch or runs out of expressions to match. Even if there is more input string to process, pyparsing does not continue. To search through the whole file looking for idents, try using scanString, which returns a generator; for each match, the generator gives a tuple containing:

- tokens - the matched tokens
- start - the start location of the match
- end - the end location of the match

If your input file consists *only* of these constructs, you can also just expand dataFormat.parseString to OneOrMore(dataFormat).parseString.

-- Paul
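The point in item 1 about "rb" versus "r" can be illustrated with a short Python 3 sketch. It assumes Python 3's default text mode, which translates "\r\n" to "\n" on reading; the temporary file and its contents are made up for the illustration.

```python
import os
import tempfile

# Write a file with Windows-style line endings (hypothetical content).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b'getVersion()\r\n"5.1.0"\r\n')

with open(path, "rb") as f:      # binary mode: bytes come back untouched
    raw = f.read()
with open(path, "r") as f:       # text mode: "\r\n" is translated to "\n"
    text = f.read()
os.remove(path)

print(repr(raw))
print(repr(text))

# Data that does not come from a file (pexpect output, for instance) keeps
# its carriage returns; splitlines() removes them along with the newline.
first_line = 'getVersion()\r\n'.splitlines()[0]
print(repr(first_line))
```

Since the data in this thread comes from pexpect and not from a file, stripping the "\r" explicitly, as the parse action in item 2 does, is probably the relevant half of the advice.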
Re: Regex help needed
rh0dium wrote:
> Michael Spencer wrote:
>>     def parse(source):
>>         source = source.splitlines()
>>         original, rest = source[0], "\n".join(source[1:])
>>         return original, rest_eval(get_tokens(rest))
>
> This is a very clean and elegant way to separate them. Very nice!! I like this a lot, and I will definitely use this in the future!!

On reflection, this simplifies further (to 9 lines), at least for the test cases you provide, which don't involve any nested parens:

    >>> import cStringIO, tokenize
    ...
    >>> def get_tokens2(source):
    ...     src = cStringIO.StringIO(source).readline
    ...     src = tokenize.generate_tokens(src)
    ...     return [token[1][1:-1] for token in src if token[0] == tokenize.STRING]
    ...
    >>> def parse2(source):
    ...     source = source.splitlines()
    ...     original, rest = source[0], "\n".join(source[1:])
    ...     return original, get_tokens2(rest)
    ...

This matches your main function for the three tests where main works...

    >>> for source in sources[:3]: # matches your main function where it works
    ...     assert parse2(source) == main(source)
    ...
    Original someFunction
    Orig someFunction Results ['test', 'foo']
    Original someFunction
    Orig someFunction Results ['test foo']
    Original someFunction
    Orig someFunction Results ['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']

...and handles the case where main fails (I think correctly, although I'm not entirely sure what your desired output is in this case):

    >>> parse2(sources[3])
    ('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])

If you really do need nested parens, then you'd need the slightly longer version I posted earlier.

Cheers
Michael
Re: regex help
jeff sacksteder wrote:
> Regex questions seem to be rather resistant to googling.
>
> My regex currently looks like
>
>     'FOO:.*\n\n'
>
> The chunk of text I am attempting to locate is a line beginning with "FOO:", followed by an unknown number of lines, terminating with a blank line. Clearly the ".*" phrase does not match the single newlines occurring inside the block.
>
> Suggestions are warmly welcomed.

I suggest you read the manual first:

    . (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
Re: regex help
When *I* google, I find:

http://www.awaretek.com/tutorials.html#regular
http://en.wikibooks.org/wiki/Programming:Python_Strings
http://www.regexlib.com/Default.aspx
http://docs.python.org/lib/module-re.html
http://diveintopython.org/regular_expressions/index.html#re.intro
http://www.amk.ca/python/howto/regex/
http://gnosis.cx/publish/programming/regular_expressions.html

Also look into the ActiveState Komodo regex debugger (I think Wing IDE has one too).
Re: regex help
John Machin wrote:
> jeff sacksteder wrote:
>> My regex currently looks like
>>
>>     'FOO:.*\n\n'
>>
>> The chunk of text I am attempting to locate is a line beginning with "FOO:", followed by an unknown number of lines, terminating with a blank line. Clearly the ".*" phrase does not match the single newlines occurring inside the block.
>
> I suggest you read the manual first:
>
>     . (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

I think you need to write your own function. Something like:

    flag = 0
    for x in open('_file_name'):
        if x == 'FOO:\n':
            flag = 1
        if x == '\n':
            flag = 0
        if flag == 1:
            print x

If the line is 'FOO: _some_more_data_', you may try

    if x.startswith('FOO:'):

instead of

    if x == 'FOO:\n':

Hope this helps.
Shantanoo
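The flag-based loop above can be sketched as a small generator in Python 3 that returns each block as a list of lines. The function name `blocks` and the sample text are hypothetical; the sketch assumes a block runs from a line beginning with the marker up to, but not including, the next blank line.

```python
def blocks(lines, start="FOO:"):
    """Yield each block as a list of lines: from a line beginning with
    `start` up to (not including) the next blank line."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if current is None:
            if line.startswith(start):
                current = [line]        # a new block begins here
        elif line == "":
            yield current               # the blank line closes the block
            current = None
        else:
            current.append(line)
    if current is not None:             # input ended without a blank line
        yield current

sample = "junk\nFOO: first\nmore\n\nother\nFOO: second\n".splitlines(True)
print(list(blocks(sample)))
```

Any iterable of lines works as input, including an open file object, so large files never have to be read into memory at once.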
regex help
Regex questions seem to be rather resistant to googling.

My regex currently looks like

    'FOO:.*\n\n'

The chunk of text I am attempting to locate is a line beginning with "FOO:", followed by an unknown number of lines, terminating with a blank line. Clearly the ".*" phrase does not match the single newlines occurring inside the block.

Suggestions are warmly welcomed.
Re: regex help
jeff sacksteder wrote:
> Regex questions seem to be rather resistant to googling.
>
> My regex currently looks like
>
>     'FOO:.*\n\n'
>
> The chunk of text I am attempting to locate is a line beginning with "FOO:", followed by an unknown number of lines, terminating with a blank line. Clearly the ".*" phrase does not match the single newlines occurring inside the block.

Include the re.DOTALL flag when you compile the regular expression.
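A minimal Python 3 sketch of the DOTALL suggestion follows. The sample text is made up, and two details go beyond what the reply says: re.MULTILINE is added so that "^" anchors the match to the start of a line, and ".*?" is made non-greedy so that each match stops at the first blank line.

```python
import re

text = ("header\n"
        "FOO: one\nline a\nline b\n\n"
        "noise\n"
        "FOO: two\nline c\n\n"
        "tail\n")

# DOTALL lets "." match newlines; MULTILINE anchors "^" at each line start;
# the non-greedy ".*?" stops at the first blank line instead of the last one.
pattern = re.compile(r"^FOO:.*?\n\n", re.DOTALL | re.MULTILINE)
matches = pattern.findall(text)
for block in matches:
    print(repr(block))

# With a greedy ".*" the single match runs on to the last blank line.
greedy = re.compile(r"^FOO:.*\n\n", re.DOTALL | re.MULTILINE).findall(text)
print(len(greedy))
```

If only one block per input is expected, the greedy form is harmless, but with repeated blocks it merges everything between the first "FOO:" and the last blank line into one match.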
Multiline regex help
Hey Folks,

I've got some info in a bunch of files that kind of looks like so:

    Gibberish
    53
    MoreGarbage
    12
    RelevantInfo1
    10/10/04
    NothingImportant
    ThisDoesNotMatter
    44
    RelevantInfo2
    22
    BlahBlah
    343
    RelevantInfo3
    23
    Hubris
    Crap
    34

and so on...

Anyhow, these fields repeat several times in a given file (the number of repetitions varies from file to file). The number on the line following each "RelevantInfo" line is really what I'm after. Ideally, I would like to have something like so:

    RelevantInfo1 = 10/10/04   # The variable name isn't actually important
    RelevantInfo3 = 23         # it's just there to illustrate what info I'm
                               # trying to snag.

    Score[RelevantInfo1][RelevantInfo3] = 22   # The value from RelevantInfo2

collected from all of the files.

So, there would be several of these "scores" per file, and there are a bunch of files. Ultimately, I am interested in printing them out as a csv file, but that should be relatively easy once they are trapped in my array of doom <cue evil laughter>.

I've got a fairly ugly "solution" (I am using this term *very* loosely) using awk and his faithful companion sed, but I would prefer something in python.

Thanks for your time.
Re: Multiline regex help
Yatima wrote:
> Hey Folks,
>
> I've got some info in a bunch of files that kind of looks like so:
>
>     Gibberish
>     53
>     MoreGarbage
>     12
>     RelevantInfo1
>     10/10/04
>     [snip]
>
> and so on...
>
> Anyhow, these fields repeat several times in a given file (the number of repetitions varies from file to file). The number on the line following each "RelevantInfo" line is really what I'm after. Ideally, I would like to have something like so:
>
>     RelevantInfo1 = 10/10/04   # The variable name isn't actually important
>     RelevantInfo3 = 23         # it's just there to illustrate what info I'm
>                                # trying to snag.

Here is a way to create a list of [RelevantInfo, value] pairs:

    import cStringIO

    raw_data = '''Gibberish
    53
    MoreGarbage
    12
    RelevantInfo1
    10/10/04
    NothingImportant
    ThisDoesNotMatter
    44
    RelevantInfo2
    22
    BlahBlah
    343
    RelevantInfo3
    23
    Hubris
    Crap
    34'''
    raw_data = cStringIO.StringIO(raw_data)

    data = []
    for line in raw_data:
        if line.startswith('RelevantInfo'):
            key = line.strip()
            value = raw_data.next().strip()
            data.append([key, value])

    print data

>     Score[RelevantInfo1][RelevantInfo3] = 22   # The value from RelevantInfo2

I'm not sure what you mean by this. Do you want to build a Score dictionary as well?

Kent
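Under Python 3 the pair-collecting loop above might look like the following sketch. The changes from the posted code are assumptions about a modern interpreter, not Kent's code: `io.StringIO` replaces `cStringIO`, and the built-in `next()` with a default replaces the `.next()` method so that a marker on the very last line does not raise StopIteration.

```python
import io

raw_data = io.StringIO("""Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34""")

data = []
for line in raw_data:
    if line.startswith("RelevantInfo"):
        key = line.strip()
        # The value is the line right after the marker line.
        value = next(raw_data, "").strip()
        data.append([key, value])

print(data)
```

Pulling the next line from the same iterator inside the loop is what keeps the code short: the value line is consumed there and is never tested as a possible marker.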
Re: Multiline regex help
Yatima wrote:
> Hey Folks,
>
> I've got some info in a bunch of files that kind of looks like so:
>
>     Gibberish
>     53
>     MoreGarbage
>     12
>     RelevantInfo1
>     10/10/04
>     [snip]
>
> and so on...
>
> Anyhow, these fields repeat several times in a given file (the number of repetitions varies from file to file). The number on the line following each "RelevantInfo" line is really what I'm after. Ideally, I would like to have something like so:
>
>     RelevantInfo1 = 10/10/04   # The variable name isn't actually important
>     RelevantInfo3 = 23         # it's just there to illustrate what info I'm
>                                # trying to snag.
>
>     Score[RelevantInfo1][RelevantInfo3] = 22   # The value from RelevantInfo2

A possible solution, using the re module:

    py> s = """\
    ... Gibberish
    ... 53
    ... MoreGarbage
    ... 12
    ... RelevantInfo1
    ... 10/10/04
    ... NothingImportant
    ... ThisDoesNotMatter
    ... 44
    ... RelevantInfo2
    ... 22
    ... BlahBlah
    ... 343
    ... RelevantInfo3
    ... 23
    ... Hubris
    ... Crap
    ... 34
    ... """
    py> import re
    py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
    ...                    .*
    ...                    ^RelevantInfo2\n([^\n]*)
    ...                    .*
    ...                    ^RelevantInfo3\n([^\n]*)""",
    ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
    py> score = {}
    py> for info1, info2, info3 in m.findall(s):
    ...     score.setdefault(info1, {})[info3] = info2
    ...
    py> score
    {'10/10/04': {'23': '22'}}

Note that I use DOTALL to allow ".*" to cross line boundaries, MULTILINE to have "^" apply at the start of each line, and VERBOSE to allow me to write the re in a more readable form.

If I didn't get your dict update quite right, hopefully you can see how to fix it!

HTH,

STeVe
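As a script for Python 3, the session above might be written like this. It is a sketch that keeps the posted pattern and flags; the inline comments inside the pattern are an addition that re.VERBOSE makes possible, and the variable name `pattern` is arbitrary.

```python
import re

s = """\
Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34
"""

pattern = re.compile(r"""
    ^RelevantInfo1\n([^\n]*)   # value on the line after RelevantInfo1
    .*                         # anything, newlines included (DOTALL)
    ^RelevantInfo2\n([^\n]*)   # value on the line after RelevantInfo2
    .*
    ^RelevantInfo3\n([^\n]*)   # value on the line after RelevantInfo3
    """, re.DOTALL | re.MULTILINE | re.VERBOSE)

score = {}
for info1, info2, info3 in pattern.findall(s):
    # Nested dict: score[info1][info3] = info2
    score.setdefault(info1, {})[info3] = info2

print(score)
```

In VERBOSE mode, whitespace and "#" comments in the pattern are ignored, while the escaped "\n" sequences still match real newlines, which is why the pattern can be laid out one field per line.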
Re: Multiline regex help
On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard wrote:
> A possible solution, using the re module:
> [snip]
>     py> score
>     {'10/10/04': {'23': '22'}}
>
> Note that I use DOTALL to allow ".*" to cross line boundaries, MULTILINE to have "^" apply at the start of each line, and VERBOSE to allow me to write the re in a more readable form.
>
> If I didn't get your dict update quite right, hopefully you can see how to fix it!

Thanks! That was very helpful. Unfortunately, I wasn't completely clear when describing the problem. Is there any way to extract multiple scores from the same file and from multiple files? (I will probably use the fileinput module to deal with multiple files.) So, if I've got, say:

    Gibberish
    53
    MoreGarbage
    12
    RelevantInfo1
    10/10/04
    NothingImportant
    ThisDoesNotMatter
    44
    RelevantInfo2
    22
    BlahBlah
    343
    RelevantInfo3
    23
    Hubris
    Crap
    34

    SecondSetofGarbage
    2423
    YouGetThePicture
    342342
    RelevantInfo1
    10/10/04
    HoHum
    343
    MoreStuffNotNeeded
    232
    RelevantInfo2
    33
    RelevantInfo3
    44
    sdfsdf
    RelevantInfo1
    10/11/04
    InsertBoringFillerHere
    43234
    Stuff
    MoreStuff
    RelevantInfo2
    45
    ExcitingIsntIt
    324234
    RelevantInfo3
    60
    Lalala

Sorry for the long and painful example input. Notice that the first two "RelevantInfo1" fields have the same info, but that the RelevantInfo2 and RelevantInfo3 fields have different info. Also, there will be cases where RelevantInfo3 might be the same with a different RelevantInfo2. What I'm hoping for is something along the lines of being able to organize it like so (don't worry about the format of the output -- I'll deal with that later; "RelevantInfo" shortened to "Info" for readability):

              Info1[0],                  Info1[1],                  Info1[2] ...
    Info3[0]  Info2[Info1[0],Info3[0]]   Info2[Info1[1],Info3[0]]   ...
    Info3[1]  Info2[Info1[0],Info3[1]]   ...
    Info3[2]  Info2[Info1[0],Info3[2]]   ...
    ...

I don't really care if it's a list, dictionary, array etc.

Thanks again for your help. The multiline option in the re module is very useful.

Take care.
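The layout requested above (one column per Info1 value, one row per Info3 value, Info2 in the cells) can be sketched with the standard csv module in Python 3. The `scores` dictionary, the function name `write_pivot` and the "Info3" corner label are hypothetical; the sketch assumes the values have already been collected into a dict keyed by (info1, info3).

```python
import csv
import io

# Hypothetical scores keyed by (info1, info3), e.g. (date, subject id).
scores = {
    ("10/10/04", "23"): "22",
    ("10/10/04", "44"): "33",
    ("10/11/04", "60"): "45",
}

def write_pivot(scores, out):
    """Write one column per Info1 value and one row per Info3 value."""
    cols = sorted({info1 for info1, _ in scores})
    rows = sorted({info3 for _, info3 in scores})
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Info3"] + cols)
    for info3 in rows:
        # Combinations that never occurred become empty cells.
        writer.writerow([info3] + [scores.get((info1, info3), "")
                                   for info1 in cols])

buf = io.StringIO()
write_pivot(scores, buf)
print(buf.getvalue())
```

Keying the dict by a tuple keeps missing combinations cheap to handle with `dict.get`. Note that the sort is a plain string sort, so dates in this format only come out in calendar order within the same year.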
Re: Multiline regex help
Have a look at martel, part of biopython. The world of bioinformatics is filled with files with structure like this.

http://www.biopython.org/docs/api/public/Martel-module.html

James

On Thursday 03 March 2005 12:03 pm, Yatima wrote:
> [snip]

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
Re: Multiline regex help
On Thu, 03 Mar 2005 07:14:50 -0500, Kent Johnson [EMAIL PROTECTED] wrote: Here is a way to create a list of [RelevantInfo, value] pairs: import cStringIO raw_data = '''Gibberish 53 MoreGarbage 12 RelevantInfo1 10/10/04 NothingImportant ThisDoesNotMatter 44 RelevantInfo2 22 BlahBlah 343 RelevantInfo3 23 Hubris Crap 34''' raw_data = cStringIO.StringIO(raw_data) data = [] for line in raw_data: if line.startswith('RelevantInfo'): key = line.strip() value = raw_data.next().strip() data.append([key, value]) print data Thank you. This isn't exactly what I'm looking for (I wasn't clear in describing the problem -- please see my reply to Steve for a, hopefully, better explanation) but it does give me a few ideas. Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2 I'm not sure what you mean by this. Do you want to build a Score dictionary as well? Sure... Uhhh.. I think. Okay, what I want is some kind of awk-like associative array because the raw data files will have repeats for certain field vaues such that there would be, for example, multiple RelevantInfo2's and RelevantInfo3's for the same RelevantInfo1 (i.e. on the same date). To make matters more exciting, there will be multiple RelevantInfo1's (dates) for the same RelevantInfo3 (e.g. a subject ID). RelevantInfo2 will be the value for all unique combinations of RelevantInfo1 and RelevantInfo3. There will be multiple occurrences of these fields in the same file (original data sample was not very good for this reason) and multiple files as well. The interesting three fields will always be repeated in the same order although the amount of irrelevant data in between may vary. 
So:

RelevantInfo1
10/10/04
snipped crap
RelevantInfo2
12
more snippage
RelevantInfo3
43
more snippage
RelevantInfo1
10/10/04  <- The same as the first occurrence of RelevantInfo1
snipped
RelevantInfo2
22
snipped
RelevantInfo3
25
snipped
RelevantInfo1
10/11/04
snipped
RelevantInfo2
34
snipped
RelevantInfo3
28
snipped
RelevantInfo1
10/12/04
snipped
RelevantInfo2
98
snipped
RelevantInfo3
25  <- The same as the second occurrence of RelevantInfo3
...

Sorry for the long and tedious data example. There will be missing values for some combinations of RelevantInfo1 and RelevantInfo3 so hopefully that won't be an issue.

Thanks again for your reply.

Take care.
-- I figured there was this holocaust, right, and the only ones left alive were Donna Reed, Ozzie and Harriet, and the Cleavers. -- Wil Wheaton explains why everyone in Star Trek: The Next Generation is so nice
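The awk-like associative array asked for above maps fairly directly onto a Python dict keyed by a (RelevantInfo1, RelevantInfo3) tuple. The following is a minimal Python 3 sketch under that assumption; the triples are copied from the data example in this message, while the names `triples`, `score` and `by_subject` are made up for illustration.

```python
from collections import defaultdict

# (RelevantInfo1, RelevantInfo2, RelevantInfo3) triples in file order,
# copied from the data example in the message above.
triples = [
    ("10/10/04", "12", "43"),
    ("10/10/04", "22", "25"),
    ("10/11/04", "34", "28"),
    ("10/12/04", "98", "25"),
]

# awk-style associative array: the key is the (date, subject) pair.
score = {}
for info1, info2, info3 in triples:
    score[(info1, info3)] = info2

# A combination that never occurred simply has no key;
# dict.get() returns None for it instead of raising KeyError.
print(score[("10/10/04", "25")])
print(score.get(("10/11/04", "43")))

# The same data regrouped by subject ID (RelevantInfo3).
by_subject = defaultdict(dict)
for (info1, info3), info2 in score.items():
    by_subject[info3][info1] = info2
print(dict(by_subject))
```

A tuple key keeps the structure flat, which makes missing combinations easy to handle; a nested dict is the better fit when the data is mostly looked up one date or one subject at a time.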
Re: Multiline regex help
I found the original paper for Martel: http://www.dalkescientific.com/Martel/ipc9/

On Thursday 03 March 2005 12:26 pm, James Stroud wrote:
Have a look at martel, part of biopython. The world of bioinformatics is filled with files with structure like this. http://www.biopython.org/docs/api/public/Martel-module.html
James

On Thursday 03 March 2005 12:03 pm, Yatima wrote:

-- James Stroud, Ph.D. UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095
Re: Multiline regex help
Yatima wrote:
On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard [EMAIL PROTECTED] wrote:

A possible solution, using the re module:

py> s = """\
... Gibberish
... 53
... MoreGarbage
... 12
... RelevantInfo1
... 10/10/04
... NothingImportant
... ThisDoesNotMatter
... 44
... RelevantInfo2
... 22
... BlahBlah
... 343
... RelevantInfo3
... 23
... Hubris
... Crap
... 34
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*
...                    ^RelevantInfo2\n([^\n]*)
...                    .*
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'23': '22'}}

Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE to have ^ apply at the start of each line, and VERBOSE to allow me to write the re in a more readable form. If I didn't get your dict update quite right, hopefully you can see how to fix it!

Thanks! That was very helpful. Unfortunately, I wasn't completely clear when describing the problem. Is there any way to extract multiple scores from the same file and from multiple files?

I think if you use the non-greedy .*? instead of the greedy .*, you'll get this behavior. For example:

py> s = """\
... Gibberish
... 53
... MoreGarbage
[snip a whole bunch of stuff]
... RelevantInfo3
... 60
... Lalala
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*?
...                    ^RelevantInfo2\n([^\n]*)
...                    .*?
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}

If you might have multiple info2 values for the same (info1, info3) pair, you can try something like:

py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
...
py> score
{'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}

HTH,
STeVe
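To see why the non-greedy .*? matters once a file holds more than one record, here is a small self-contained Python 3 sketch. The input is a shortened, made-up sample rather than the thread's full data, and the helper `build` plus the names `greedy` and `lazy` are illustrative only; the pattern and flags follow the ones in the message above.

```python
import re

# Shortened, hypothetical sample containing two records.
s = """\
Junk
RelevantInfo1
10/10/04
Filler
RelevantInfo2
22
RelevantInfo3
23
MoreJunk
RelevantInfo1
10/11/04
RelevantInfo2
45
Filler
RelevantInfo3
60
"""

FLAGS = re.DOTALL | re.MULTILINE | re.VERBOSE


def build(gap):
    """Compile the three-field pattern with `gap` between the fields."""
    return re.compile(
        r"^RelevantInfo1\n([^\n]*)" + gap +
        r"^RelevantInfo2\n([^\n]*)" + gap +
        r"^RelevantInfo3\n([^\n]*)",
        FLAGS,
    )


greedy = build(r".*")   # each gap swallows as much text as it can
lazy = build(r".*?")    # each gap stops at the first place the rest can match

print(greedy.findall(s))
print(lazy.findall(s))

# Collect every value seen for each (info1, info3) combination.
score = {}
for info1, info2, info3 in lazy.findall(s):
    score.setdefault(info1, {}).setdefault(info3, []).append(info2)
print(score)
```

On this sample the greedy pattern yields a single tuple that pairs the first date with the last RelevantInfo2 and RelevantInfo3 values, because each .* runs to the end of the text and backtracks only as far as the last occurrence. The non-greedy pattern yields one tuple per record.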
Re: Multiline regex help
Here is another attempt. I'm still not sure I understand what form you want the data in. I made a dict -> dict -> list structure so if you lookup e.g. scores['10/11/04']['60'] you get a list of all the RelevantInfo2 values for Relevant1='10/11/04' and Relevant3='60'.

The parser is a simple-minded state machine that will misbehave if the input does not have entries in the order Relevant1, Relevant2, Relevant3 (with as many intervening lines as you like).

All three values are available when Relevant3 is detected so you could do something else with them if you want.

HTH
Kent

import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34
Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34
SecondSetofGarbage
2423
YouGetThePicture
342342
RelevantInfo1
10/10/04
HoHum
343
MoreStuffNotNeeded
232
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
InsertBoringFillerHere
43234
Stuff
MoreStuff
RelevantInfo2
45
ExcitingIsntIt
324234
RelevantInfo3
60
Lalala'''
raw_data = cStringIO.StringIO(raw_data)

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print scores
print scores['10/11/04']['60']
print scores['10/10/04']['23']

## prints:
{'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
['45']
['22', '22']
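For readers on Python 3, the same state machine can be sketched as a function that is called once per input, which also covers the "multiple files" part of the question. This is an adaptation under stated assumptions, not Kent's code: cStringIO becomes io.StringIO, the .next() method becomes the built-in next(), and the function name `collect_scores` and the two in-memory sample "files" are invented for the example.

```python
import io


def collect_scores(lines, scores=None):
    """Fold RelevantInfo1/2/3 triples from an iterable of lines into `scores`.

    Like the original, this assumes the three fields always appear in the
    order RelevantInfo1, RelevantInfo2, RelevantInfo3.
    """
    if scores is None:
        scores = {}
    lines = iter(lines)  # one shared iterator for the loop and for next()
    info1 = info2 = None
    for line in lines:
        if line.startswith("RelevantInfo1"):
            info1 = next(lines).strip()
        elif line.startswith("RelevantInfo2"):
            info2 = next(lines).strip()
        elif line.startswith("RelevantInfo3"):
            info3 = next(lines).strip()
            scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
            info1 = info2 = None
    return scores


# Two hypothetical inputs standing in for separate data files.
file_a = io.StringIO(
    "Gibberish\n53\nRelevantInfo1\n10/10/04\nNothingImportant\n"
    "RelevantInfo2\n22\nBlahBlah\n343\nRelevantInfo3\n23\nHubris\n"
)
file_b = io.StringIO(
    "RelevantInfo1\n10/10/04\nRelevantInfo2\n33\nRelevantInfo3\n44\nsdfsdf\n"
    "RelevantInfo1\n10/11/04\nStuff\nRelevantInfo2\n45\nRelevantInfo3\n60\nLalala\n"
)

scores = {}
for source in (file_a, file_b):
    collect_scores(source, scores)  # the same dict accumulates across inputs
print(scores)
```

With real files, each `source` would be an open file object, since file objects iterate line by line just as io.StringIO does.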
Re: Multiline regex help
On Thu, 03 Mar 2005 16:25:39 -0500, Kent Johnson [EMAIL PROTECTED] wrote:

Here is another attempt. I'm still not sure I understand what form you want the data in. I made a dict -> dict -> list structure so if you lookup e.g. scores['10/11/04']['60'] you get a list of all the RelevantInfo2 values for Relevant1='10/11/04' and Relevant3='60'.

The parser is a simple-minded state machine that will misbehave if the input does not have entries in the order Relevant1, Relevant2, Relevant3 (with as many intervening lines as you like).

All three values are available when Relevant3 is detected so you could do something else with them if you want.

HTH
Kent

import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
[mass snippage]
60
Lalala'''
raw_data = cStringIO.StringIO(raw_data)

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print scores
print scores['10/11/04']['60']
print scores['10/10/04']['23']

## prints:
{'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
['45']
['22', '22']

Thank you so much. Your solution and Steve's both give me what I'm looking for. I appreciate both of your incredibly quick replies!

Take care.
-- You worry too much about your job. Stop it. You are not paid enough to worry.
Re: Multiline regex help
On Thu, 3 Mar 2005 12:26:37 -0800, James Stroud [EMAIL PROTECTED] wrote:

Have a look at martel, part of biopython. The world of bioinformatics is filled with files with structure like this. http://www.biopython.org/docs/api/public/Martel-module.html
James

Thanks for the link. Steve and Kent have provided me with nice solutions but I will check this out anyway for future reference.

Take care.
-- You may easily play a joke on a man who likes to argue -- agree with him. -- Ed Howe
Re: Multiline regex help
Kent Johnson wrote:

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

Very pretty. =) I have to say, I hadn't ever used iterators this way before, that is, calling their next method from within a for-loop. I like it. =)

Thanks for opening my mind. ;)

STeVe
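The idiom praised above, advancing the loop's own iterator from inside the loop body, can be sketched in Python 3 as follows. In Python 3 the method call raw_data.next() is spelled next(raw_data). The generator name `key_value_pairs`, its `prefix` parameter, and the sample input are assumptions made for this illustration, and the default passed to next() is an addition to guard against a key on the very last line, something the thread's code does not handle.

```python
import io


def key_value_pairs(lines, prefix="RelevantInfo"):
    """Yield (key, value) pairs; the value is the line right after a key line."""
    it = iter(lines)
    for line in it:
        if line.startswith(prefix):
            # next() advances the same iterator the for-loop is consuming,
            # so the loop itself never sees the value line.
            value = next(it, None)  # None instead of StopIteration at end of input
            if value is None:
                return  # a key with no following line: stop quietly
            yield line.strip(), value.strip()


# Hypothetical input whose last key has no value line after it.
sample = io.StringIO(
    "Junk\nRelevantInfo1\n10/10/04\nNoise\nRelevantInfo2\n22\nRelevantInfo3\n"
)
pairs = list(key_value_pairs(sample))
print(pairs)
```

The explicit iter() call matters when `lines` is a list rather than a file: a for-loop over a list creates its own hidden iterator, so calling next() on the list itself would not work.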