On 19/04/2006 3:09 PM, [EMAIL PROTECTED] wrote:
> hi, all. I need to process a file with the following format:
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> .......
>
> my goal here is for each line, extract every '(.*)' (including the round
> brackets, put them in a list, and extract every float on the same line
> and put them in a list.. here is my code:
>
>     p = re.compile(r'\[.*\]$')
>     num = re.compile(r'[-\d]+[.\d]*')
>     brac = re.compile(r'\(.*?\)')
>
>     for line in ifp:
>         if p.match(line):
>             x = num.findall(line)
>             y = brac.findall(line)
>             print x, y len(x), len(y)
>
> Now, this works for most of the lines. however, I'm having problems with
> lines such as line 3 above (in the sample file). here, (bbb\)) contains
> an escaped ')' and the re I use will match it (because of the non-greedy
> '?'). But I want this to be ignored since it's escaped. is there a such
> thing as local greediness??
> Can anyone suggest a way to deal with this here..
> thanks.
>
For a start, your brac pattern is better rewritten to avoid the
non-greedy ? modifier:

    r'\([^)]*\)'

This says the middle part is zero or more occurrences of a single
character that is not a ')'.

To handle the pesky backslash-as-escape, we need to extend that to:
zero or more occurrences of either (a) the two-character string r"\)"
or (b) a single character that is not a ')'. The escaped alternative
has to be tried first; otherwise the backslash would be consumed as an
ordinary character and the ')' after it would end the match too early.
This gives us something like this:

#>>> brac = re.compile(r'\((?:\\\)|[^)])*\)')
#>>> tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
#>>> brac.findall(tests)
['(xxx)', '(bbb\\))', '(end\\Zhere)', '()', '(\\))', '(ab\\)cd)']
#>>>

Pretty, isn't it? Maybe better done with a hand-coded state machine.
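
For comparison, here is a minimal sketch of what such a hand-coded state machine might look like. The function name extract_groups is made up for illustration. Note one assumption that differs from the regex above: this version treats a backslash as escaping whatever character follows it, not only a ')'. On the test string used above the two approaches give the same list.

```python
def extract_groups(line):
    """Return every '(...)' group in line, treating a backslash inside a
    group as an escape for the next character (an assumption of this sketch)."""
    groups = []
    current = None    # None: outside a group; list of chars: inside one
    escaped = False   # True when the previous char inside a group was a backslash
    for ch in line:
        if current is None:
            # Outside a group: only an opening bracket changes state.
            if ch == '(':
                current = [ch]
            continue
        current.append(ch)
        if escaped:
            # This character was escaped, so it cannot close the group.
            escaped = False
        elif ch == '\\':
            escaped = True
        elif ch == ')':
            # Unescaped closing bracket: the group is complete.
            groups.append(''.join(current))
            current = None
    return groups
```

Calling extract_groups(line) on each line of the sample file should give the same kind of list as brac.findall(line). An unterminated group at the end of a line is silently dropped in this sketch; whether that is the right behaviour depends on the data.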