Re: local greediness ???
On 19/04/2006 3:09 PM, [EMAIL PROTECTED] wrote:
> hi, all. I need to process a file with the following format:
>
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> ...
>
> my goal here is for each line, extract every '(.*)' (including the round
> brackets), put them in a list, and extract every float on the same line
> and put them in a list. here is my code:
>
>     p = re.compile(r'\[.*\]$')
>     num = re.compile(r'[-\d]+[.\d]*')
>     brac = re.compile(r'\(.*?\)')
>
>     for line in ifp:
>         if p.match(line):
>             x = num.findall(line)
>             y = brac.findall(line)
>             print x, y, len(x), len(y)
>
> Now, this works for most of the lines. however, I'm having problems with
> lines such as line 3 above (in the sample file). here, (bbb\)) contains an
> escaped ')' and the re I use will match it (because of the non-greedy '?').
> But I want this to be ignored since it's escaped. is there such a thing as
> local greediness? Can anyone suggest a way to deal with this here?
>
> thanks.

For a start, your brac pattern is better rewritten to avoid the non-greedy ? tag:

    r'\([^)]*\)'

-- this says the middle part is zero or more occurrences of a single character that is not a ')'.

To handle the pesky backslash-as-escape, we need to extend that to: zero or more occurrences of either (a) a single character that is not a ')' or (b) the two-character string r"\)". This gives us something like this:

    >>> brac = re.compile(r'\((?:\\\)|[^)])*\)')
    >>> tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
    >>> brac.findall(tests)
    ['(xxx)', '(bbb\\))', '(end\\Zhere)', '()', '(\\))', '(ab\\)cd)']
    >>>

Pretty, isn't it? Maybe better done with a hand-coded state machine.

--
http://mail.python.org/mailman/listinfo/python-list
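The hand-coded state machine mentioned at the end of the reply could look roughly like the sketch below. It is only an illustration, not code from the thread: the function name `paren_groups` is made up, and it assumes that a backslash escapes whatever single character follows it (not just `)`), and that an unterminated group is silently dropped.

```python
def paren_groups(text):
    """Return every (...) group in text, treating a backslash as an escape.

    Hypothetical helper: assumes a backslash protects the next character,
    whatever it is, and that groups do not nest.
    """
    groups = []
    current = None    # characters of the group being built; None when outside a group
    escaped = False   # True when the previous character inside a group was a backslash
    for ch in text:
        if current is None:
            if ch == "(":
                current = ["("]          # state change: outside -> inside
        elif escaped:
            current.append(ch)           # escaped character is taken literally
            escaped = False
        elif ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == ")":
            current.append(ch)
            groups.append("".join(current))
            current = None               # state change: inside -> outside
        else:
            current.append(ch)
    return groups


tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
print(paren_groups(tests))
# ['(xxx)', '(bbb\\))', '(end\\Zhere)', '()', '(\\))', '(ab\\)cd)']
```

On the test string from the reply this produces the same list as the regular expression. The loop is longer than the pattern, but each rule (open, escape, close) is a separate branch that can be changed on its own.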
Re: local greediness ???
How about using the numbers as delimiters:

    >>> pat = re.compile(r"[\d\.\-]+")
    >>> pat.split("[(some text)2.3(more text)4.5(more text here)]")
    ['[(some text)', '(more text)', '(more text here)]']
    >>> pat.findall("[(some text)2.3(more text)4.5(more text here)]")
    ['2.3', '4.5']
    >>> pat.split(r"[(xxx)11.0(bbb\))8.9(end here)] ")
    ['[(xxx)', '(bbb\\))', '(end here)] ']
    >>> pat.findall(r"[(xxx)11.0(bbb\))8.9(end here)] ")
    ['11.0', '8.9']

[EMAIL PROTECTED] wrote:
> hi, all. I need to process a file with the following format:
>
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> ...
>
> my goal here is for each line, extract every '(.*)' (including the round
> brackets), put them in a list, and extract every float on the same line
> and put them in a list. here is my code:
>
>     p = re.compile(r'\[.*\]$')
>     num = re.compile(r'[-\d]+[.\d]*')
>     brac = re.compile(r'\(.*?\)')
>
>     for line in ifp:
>         if p.match(line):
>             x = num.findall(line)
>             y = brac.findall(line)
>             print x, y, len(x), len(y)
>
> Now, this works for most of the lines. however, I'm having problems with
> lines such as line 3 above (in the sample file). here, (bbb\)) contains an
> escaped ')' and the re I use will match it (because of the non-greedy '?').
> But I want this to be ignored since it's escaped. is there such a thing as
> local greediness? Can anyone suggest a way to deal with this here?
>
> thanks.
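The numbers-as-delimiters idea can be wrapped into one function that returns both lists for a line. This is a sketch under a few assumptions that are not in the thread: the name `split_line` is invented, the number pattern is tightened to `-?\d+\.\d+` (so a lone `-` or `.` is not treated as a number), and the enclosing square brackets are stripped first.

```python
import re

# Assumed number format: optional minus sign, digits, a dot, digits.
NUMBER = re.compile(r"-?\d+\.\d+")


def split_line(line):
    """Return (texts, floats) for one '[(...)n.n(...)n.n(...)]' line."""
    body = line.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]                       # drop the enclosing brackets
    floats = [float(tok) for tok in NUMBER.findall(body)]
    texts = [part for part in NUMBER.split(body) if part]
    return texts, floats


print(split_line(r"[(xxx)11.0(bbb\))8.9(end here)]"))
# (['(xxx)', '(bbb\\))', '(end here)'], [11.0, 8.9])
```

The escaped `)` causes no trouble here because the parentheses are never inspected. The price is that any number appearing inside the parenthesised text is also treated as a delimiter: `split_line("[(v1.5 beta)2.0(end)]")` returns the texts `['(v', ' beta)', '(end)']`, so this approach only fits data where the text parts contain no floats.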
Re: local greediness ???
[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
> hi, all. I need to process a file with the following format:
>
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> ...
>
> my goal here is for each line, extract every '(.*)' (including the round
> brackets), put them in a list, and extract every float on the same line
> and put them in a list.

Are you wedded to re's? Here's a pyparsing approach for your perusal. It uses the new QuotedString class, treating your ()-enclosed elements as custom quoted strings (including backslash escape support).

Some other things the parser does for you during parsing:
- converts the numeric strings to floats
- processes the \) escaped paren, returning just the )

Why not? While parsing, the parser knows it has just parsed a floating point number (or an escaped character), so it can go ahead and do the conversion too.

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)

test = r"""
[(some text)2.3(more text)4.5(more text here)]
[(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
[(xxx)11.0(bbb\))8.9(end here)]
"""

from pyparsing import oneOf,Combine,Optional,Word,nums,QuotedString,Suppress

# define a floating point number
sign = oneOf("+ -")
floatNum = Combine( Optional(sign) + Word(nums) + "." + Word(nums) )

# have parser convert to actual floats while parsing
floatNum.setParseAction(lambda s,l,t: float(t[0]))

# define a quoted string where ()'s are the opening and closing quotes
parenString = QuotedString("(",endQuoteChar=")",escChar="\\")

# define the overall entry structure
entry = Suppress("[") + parenString + floatNum + parenString + floatNum + parenString + Suppress("]")

# scan for floats
for toks,start,end in floatNum.scanString(test):
    print toks[0]
print

# scan for paren strings
for toks,start,end in parenString.scanString(test):
    print toks[0]
print

# scan for entries
for toks,start,end in entry.scanString(test):
    print toks
print

Gives:
2.3
4.5
-1.2
12.0
11.0
8.9

some text
more text
more text here
aa bb ccc
kdk
xxxyyy
xxx
bbb)
end here

['some text', 2.2999999999999998, 'more text', 4.5, 'more text here']
['aa bb ccc', -1.2, 'kdk', 12.0, 'xxxyyy']
['xxx', 11.0, 'bbb)', 8.9000000000000004, 'end here']
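For comparison, the same "convert while parsing" idea can be approximated with only the standard library. The sketch below is not pyparsing and not from the thread: the names `TOKEN`, `UNESCAPE` and `parse_entry` are invented, and it assumes a backslash escapes any single following character and that numbers always contain a decimal point.

```python
import re

# One pattern with two alternatives; whichever named group matched tells us
# what kind of token we are looking at.
TOKEN = re.compile(r"""
    \( (?P<text> (?: \\. | [^\\)] )* ) \)    # (...) with backslash escapes
  | (?P<num> [+-]? \d+ \. \d+ )              # signed decimal number
""", re.VERBOSE)

UNESCAPE = re.compile(r"\\(.)")              # backslash + char -> char


def parse_entry(line):
    """Return the texts (unescaped, without parens) and floats of one line, in order."""
    items = []
    for m in TOKEN.finditer(line):
        if m.group("num") is not None:
            items.append(float(m.group("num")))          # convert while scanning
        else:
            items.append(UNESCAPE.sub(r"\1", m.group("text")))
    return items


sample = r"""
[(some text)2.3(more text)4.5(more text here)]
[(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
[(xxx)11.0(bbb\))8.9(end here)]
"""
for line in sample.strip().splitlines():
    print(parse_entry(line))
# ['some text', 2.3, 'more text', 4.5, 'more text here']
# ['aa bb ccc', -1.2, 'kdk', 12.0, 'xxxyyy']
# ['xxx', 11.0, 'bbb)', 8.9, 'end here']
```

Because the parenthesised alternative is tried first and consumes the whole group, a number that happens to sit inside the text stays part of the text. Unlike the pyparsing grammar, this does not check the overall `[ text num text num text ]` shape of an entry.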
local greediness ???
hi, all. I need to process a file with the following format:

$ cat sample
[(some text)2.3(more text)4.5(more text here)]
[(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
[(xxx)11.0(bbb\))8.9(end here)]
...

my goal here is for each line, extract every '(.*)' (including the round brackets), put them in a list, and extract every float on the same line and put them in a list. here is my code:

    p = re.compile(r'\[.*\]$')
    num = re.compile(r'[-\d]+[.\d]*')
    brac = re.compile(r'\(.*?\)')

    for line in ifp:
        if p.match(line):
            x = num.findall(line)
            y = brac.findall(line)
            print x, y, len(x), len(y)

Now, this works for most of the lines. however, I'm having problems with lines such as line 3 above (in the sample file). here, (bbb\)) contains an escaped ')' and the re I use will match it (because of the non-greedy '?'). But I want this to be ignored since it's escaped. is there such a thing as local greediness? Can anyone suggest a way to deal with this here?

thanks.
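A small experiment on line 3 of the sample shows why neither greedy nor non-greedy matching helps here. The third pattern is an assumption of this sketch, not the poster's code: it describes what may appear between the parentheses (an escaped character, or anything that is neither a backslash nor `)`), which makes greediness irrelevant.

```python
import re

line = r"[(xxx)11.0(bbb\))8.9(end here)]"

# Non-greedy: stops at the first ')', even though it is escaped.
lazy = re.findall(r"\(.*?\)", line)
print(lazy)    # ['(xxx)', '(bbb\\)', '(end here)']

# Greedy: runs to the last ')' on the line and swallows everything between.
greedy = re.findall(r"\(.*\)", line)
print(greedy)  # ['(xxx)11.0(bbb\\))8.9(end here)']

# Escape-aware (assumed rule: a backslash protects the next character).
aware = re.findall(r"\((?:\\.|[^\\)])*\)", line)
print(aware)   # ['(xxx)', '(bbb\\))', '(end here)']
```

The non-greedy result still has three items, so comparing `len(x)` and `len(y)` would not reveal the problem; only the content of the second item is wrong.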