Re: local greediness ???

2006-04-19 Thread John Machin
On 19/04/2006 3:09 PM, [EMAIL PROTECTED] wrote:
> Hi, all. I need to process a file with the following format:
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> ...
>
> My goal is, for each line, to extract every '(.*)' (including the
> round brackets) and put them in a list, and to extract every float on
> the same line and put them in a list. Here is my code:
>
>     import re
>
>     p = re.compile(r'\[.*\]$')
>     num = re.compile(r'[-\d]+[.\d]*')
>     brac = re.compile(r'\(.*?\)')
>
>     for line in ifp:
>         if p.match(line):
>             x = num.findall(line)
>             y = brac.findall(line)
>             print(x, y, len(x), len(y))
>
> This works for most lines. However, I'm having problems with lines
> such as line 3 of the sample file. There, (bbb\)) contains an escaped
> ')' and the re I use matches it as the closing bracket (because of the
> non-greedy '?'). But I want it to be ignored since it's escaped. Is
> there such a thing as local greediness? Can anyone suggest a way to
> deal with this?
> Thanks.
 

For a start, your brac pattern is better rewritten to avoid the 
non-greedy ? tag: r'\([^)]*\)' -- this says the middle part is zero or 
more occurrences of a single character that is not a ')'
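
A minimal sketch of that comparison, using two of the sample lines from the question (the names lazy, negated, plain and escaped are illustrative, not from the original post). On a line without escapes both patterns return the same groups; on the third sample line both still stop at the escaped ')', returning '(bbb\)' rather than '(bbb\))':

```python
import re

lazy = re.compile(r'\(.*?\)')        # pattern from the question
negated = re.compile(r'\([^)]*\)')   # negated character class, no lazy quantifier

plain = r'[(some text)2.3(more text)4.5(more text here)]'
escaped = r'[(xxx)11.0(bbb\))8.9(end here)]'

# Same groups on a line with no escaped brackets.
print(lazy.findall(plain))
print(negated.findall(plain))

# Both cut the middle group short at the escaped ')'.
print(lazy.findall(escaped))
print(negated.findall(escaped))
```

So the negated class removes the need for the lazy quantifier, but by itself it does not solve the escaping problem.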

To handle the pesky backslash-as-escape, we need to extend that to: zero 
or more occurrences of either (a) a single character that is not a ')' 
or (b) the two-character string r'\)'. In the pattern, alternative (b) 
is tried first, so that a backslash followed by ')' is consumed as a 
pair before (a) can match the backslash on its own. This gives us 
something like this:

>>> import re
>>> brac = re.compile(r'\((?:\\\)|[^)])*\)')
>>> tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
>>> brac.findall(tests)
['(xxx)', '(bbb\\))', '(end\\Zhere)', '()', '(\\))', '(ab\\)cd)']

Pretty, isn't it? Maybe better done with a hand-coded state machine.
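
For comparison, here is a rough sketch of what such a hand-coded state machine might look like. The function name split_groups and its esc parameter are made up for illustration. Note one assumption that differs from the regex above: here a backslash escapes whatever character follows it, not only ')'.

```python
def split_groups(line, esc='\\'):
    """Return the parenthesised groups in line, honouring backslash escapes.

    Assumption: the escape character protects any following character.
    """
    groups = []
    current = None     # None while we are outside a group
    escaped = False    # True when the previous character was the escape
    for ch in line:
        if current is None:
            if ch == '(':
                current = [ch]      # open a new group
            continue
        current.append(ch)
        if escaped:
            escaped = False         # this character was protected
        elif ch == esc:
            escaped = True          # protect the next character
        elif ch == ')':
            groups.append(''.join(current))
            current = None          # unescaped ')' closes the group
    return groups


tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
print(split_groups(tests))
```

On the test string above this returns the same six groups as the regex. An unterminated group at the end of the line is silently dropped, which may or may not be what you want.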
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: local greediness ???

2006-04-19 Thread johnzenger
How about using the numbers as delimiters:

>>> import re
>>> pat = re.compile(r"[\d\.\-]+")
>>> pat.split("[(some text)2.3(more text)4.5(more text here)]")
['[(some text)', '(more text)', '(more text here)]']
>>> pat.findall("[(some text)2.3(more text)4.5(more text here)]")
['2.3', '4.5']
>>> pat.split(r"[(xxx)11.0(bbb\))8.9(end here)] ")
['[(xxx)', '(bbb\\))', '(end here)] ']
>>> pat.findall(r"[(xxx)11.0(bbb\))8.9(end here)] ")
['11.0', '8.9']
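
Wrapped up as a function, the idea might look like the sketch below. The helper name split_line is hypothetical, and stripping the outer square brackets is an extra step not in the session above. The main caveat is that any digit, dot or minus sign inside the round brackets is also treated as a delimiter, so this only suits data where the bracketed text never contains those characters.

```python
import re

pat = re.compile(r'[\d.\-]+')   # runs of digits, dots and minus signs


def split_line(line):
    """Return (texts, numbers) using the numbers as delimiters."""
    body = line.strip()
    # Drop the enclosing [ ] so they do not end up in the text pieces.
    if body.startswith('[') and body.endswith(']'):
        body = body[1:-1]
    texts = [t for t in pat.split(body) if t]
    numbers = pat.findall(body)          # left as strings on purpose
    return texts, numbers


print(split_line(r'[(xxx)11.0(bbb\))8.9(end here)]'))

# Caveat: the 2 inside the first group is mistaken for a delimiter.
print(split_line('[(item 2)1.5(end)]'))
```

The numbers are left as strings here because a stray '-' or '.' in the text would make float() raise ValueError.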



Re: local greediness ???

2006-04-19 Thread Paul McGuire
[EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
> Hi, all. I need to process a file with the following format:
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> ...
>
> My goal is, for each line, to extract every '(.*)' (including the
> round brackets) and put them in a list, and to extract every float on
> the same line and put them in a list.

Are you wedded to re's?  Here's a pyparsing approach for your perusal.  It
uses the new QuotedString class, treating your ()-enclosed elements as
custom quoted strings (including backslash escape support).

Some other things the parser does for you during parsing:
- converts the numeric strings to floats
- processes the \) escaped paren, returning just the )

Why not? While parsing, the parser already knows it has just parsed a
floating point number (or an escaped character), so it may as well do the
conversion too.

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)


test = r"""
[(some text)2.3(more text)4.5(more text here)]
[(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
[(xxx)11.0(bbb\))8.9(end here)]
"""

from pyparsing import (oneOf, Combine, Optional, Word, nums,
                       QuotedString, Suppress)

# define a floating point number
sign = oneOf("+ -")
floatNum = Combine(Optional(sign) + Word(nums) + "." + Word(nums))

# have parser convert to actual floats while parsing
floatNum.setParseAction(lambda s, l, t: float(t[0]))

# define a quoted string where ()'s are the opening and closing quotes
parenString = QuotedString("(", endQuoteChar=")", escChar="\\")

# define the overall entry structure
entry = (Suppress("[") + parenString + floatNum + parenString + floatNum +
         parenString + Suppress("]"))

# scan for floats
for toks, start, end in floatNum.scanString(test):
    print(toks[0])
print()

# scan for paren strings
for toks, start, end in parenString.scanString(test):
    print(toks[0])
print()

# scan for entries
for toks, start, end in entry.scanString(test):
    print(toks)
print()

Gives:
2.3
4.5
-1.2
12.0
11.0
8.9

some text
more text
more text here
aa bb ccc
kdk
xxxyyy
xxx
bbb)
end here

['some text', 2.3, 'more text', 4.5, 'more text here']
['aa bb ccc', -1.2, 'kdk', 12.0, 'xxxyyy']
['xxx', 11.0, 'bbb)', 8.9, 'end here']
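
For readers who would rather stay with the standard library, the same two conveniences (float conversion and unescaping) can be approximated with a single tokenizing regex. This is only a sketch, not the pyparsing implementation: the names token and parse_entry are invented here, and it assumes a backslash escapes any single following character and that floats always have digits on both sides of the dot.

```python
import re

# One alternative per token type; finditer walks the line left to right,
# so digits inside a bracketed group are consumed as text, not as numbers.
token = re.compile(r'''
    \( (?P<text> (?: \\. | [^)\\] )* ) \)    # bracketed text with escapes
  | (?P<num> [+-]? \d+ \. \d+ )              # signed float
''', re.VERBOSE)


def parse_entry(line):
    """Return the unescaped texts and floats of one line, in order."""
    items = []
    for m in token.finditer(line):
        if m.group('num') is not None:
            items.append(float(m.group('num')))
        else:
            # Drop each escaping backslash, keep the escaped character.
            items.append(re.sub(r'\\(.)', r'\1', m.group('text')))
    return items


print(parse_entry(r'[(xxx)11.0(bbb\))8.9(end here)]'))
```

Unlike the pyparsing entry expression, this does no validation of the overall line structure; it simply collects tokens wherever it finds them.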





local greediness ???

2006-04-18 Thread [EMAIL PROTECTED]
Hi, all. I need to process a file with the following format:
$ cat sample
[(some text)2.3(more text)4.5(more text here)]
[(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
[(xxx)11.0(bbb\))8.9(end here)]
...

My goal is, for each line, to extract every '(.*)' (including the
round brackets) and put them in a list, and to extract every float on
the same line and put them in a list. Here is my code:

    import re

    p = re.compile(r'\[.*\]$')
    num = re.compile(r'[-\d]+[.\d]*')
    brac = re.compile(r'\(.*?\)')

    for line in ifp:
        if p.match(line):
            x = num.findall(line)
            y = brac.findall(line)
            print(x, y, len(x), len(y))

This works for most lines. However, I'm having problems with lines
such as line 3 of the sample file. There, (bbb\)) contains an escaped
')' and the re I use matches it as the closing bracket (because of the
non-greedy '?'). But I want it to be ignored since it's escaped. Is
there such a thing as local greediness? Can anyone suggest a way to
deal with this?
Thanks.
