Re: OT: novice regular expression question

2004-12-30 Thread Steve Holden
It's me wrote:
> I am never very good with regular expressions.  My head always hurts
> whenever I need to use them.

Well, they are a pain to more than just you, and the conventional advice
is "even when you are convinced you need to use REs, try and find
another way".

> I need to read a data file and parse each data record.  Each item on the
> data record begins with either a string, or a list of strings.  I searched
> around and didn't see any existing Python package that does that.
> scanf.py, for instance, can do standard items but doesn't know about lists.
> So, I figure I might have to write a lex engine for it and of course I have
> to deal with REs again.

Well, you haven't yet convinced me that you *have* to. Personally, I
think you just like trouble :-)

> But I run into a problem right from the start.  To recognize a list, I need
> an RE for the string:
> 1) begin with [" (left bracket followed by a double quote with zero or more
> spaces in between)
> 2) followed by any characters until ] but only if that left bracket is not
> preceded by the escape character \.

So the pattern is
1. If the line begins with a [ it should end with a ]
2. Otherwise, it shouldn't?
I'm trying to gently point out that the syntax you want to accept isn't 
actually very clear. If the format is Python strings and lists of 
strings then you might want to use the Python lexer to parse them, but 
that's quite an advanced topic. [too advanced for me :-]
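
As a rough illustration of the lexer idea, and assuming the records really
do use Python literal syntax (the thread does not confirm this), the
standard tokenize module can split a line into brackets, commas and string
literals. The helper name python_list_tokens is made up for this sketch.

```python
import io
import tokenize

def python_list_tokens(line):
    """Return the brackets, commas and string literals found in one line."""
    # generate_tokens() wants a readline callable that yields str lines.
    tokens = tokenize.generate_tokens(io.StringIO(line).readline)
    return [tok.string for tok in tokens
            if tok.type == tokenize.STRING or tok.string in ("[", "]", ",")]

print(python_list_tokens('["s1", "s2"]\n'))
```

The string tokens keep their surrounding quotes, so a caller would still
have to strip or evaluate them.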

The problem is matching up to a right bracket not preceded by a 
backslash. This seems to require what's technically referred to as a 
negative lookbehind assertion - in other words, a pattern that doesn't 
match anything, but checks that a specific condition is false or fails.
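
A minimal sketch of such an assertion, tried on the two test lines quoted
in this thread. The names UNESCAPED_CLOSE and first_unescaped_bracket are
illustrative only, and a one-character lookbehind like this does not cope
with an escaped backslash sitting directly before the bracket.

```python
import re

# (?<!\\) is a negative lookbehind: it consumes no characters and succeeds
# only when the character just before the current position is not a backslash.
UNESCAPED_CLOSE = re.compile(r'(?<!\\)\]')

def first_unescaped_bracket(text):
    """Return the index of the first ] not preceded by a backslash, or -1."""
    m = UNESCAPED_CLOSE.search(text)
    return m.start() if m else -1

for line in (r'[This line\] works]', '[This line fails]'):
    # In both lines the only unescaped ] is the final character.
    print(first_unescaped_bracket(line) == len(line) - 1)
```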

> So, I tried:
> ^\[[ ]*[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \]*]
> and tested with:
> [This line\] works]
> but it fails with:
> [This line fails]
> I would have thought that:
>    (\\\])*
> should work because it's zero or more occurrences of the pattern \]
> Any help is greatly appreciated.
> Sorry for being OT.  I posted this question at the lex group and didn't get
> any response.  I figured maybe somebody around here would know.

I'd start with baby steps. First of all, make sure that you can match 
the individual strings. Then use that pattern, parenthesized to turn it 
into a group, as a component in a more complex pattern.

Do you want to treat "this is also \" a string" as an allowable string?
In that case you need a pattern that matches "up to the first quotation
mark not preceded by a backslash" as well!

Let's try matching a single string first:
  >>> import re
  >>> s = re.compile(r'"(.*?(?<!\\))"')
  >>> s.match('"s1", "s2"').groups()
('s1',)
Note that I followed the * with a ? to stop it being greedy, and
matching as many characters as it could. OK, does that work when we have
escaped quotation marks?

  >>> s.match(r'"s1\"\"", "s2"').groups()
('s1\\"\\"',)
Apparently so. The negative lookbehind assertion stops a quote from 
matching when it's preceded by a backslash. Can we match a 
comma-separated list of such strings?

  >>> slpat = r'"(.*?(?<!\\))"(?:, "(.*?(?<!\\))")*'
  >>> s = re.compile(slpat)
This is a bit trickier: here the second grouping beginning with (?: is
intended to ensure that only the strings that get matched are included
in the groups, not the separators, even though they must be grouped
together. The strings *must* be separated by ", " (a comma and one
space), but you could alter the pattern to allow zero or more whitespace
characters.

  >>> s.match(r'"s1\"\"", "s2"').groups()
('s1\\"\\"', 's2')
Well, that seems to work. Note that these patterns all ignore bracket 
characters, so all you need to do now is to surround them with patterns 
to match the opening and closing brackets, and you're done (I hope).
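
One way the bracket-wrapping step might look, as a sketch rather than a
finished parser. It assumes double-quoted strings separated by commas with
optional whitespace; STRING, LIST, ITEM and parse_string_list are names
invented here. Because a repeated group only remembers its last
repetition, the strings are collected with findall instead of groups().

```python
import re

# One double-quoted string; the closing quote must not follow a backslash.
STRING = r'"(.*?(?<!\\))"'
# A whole list: brackets around one or more comma-separated strings.
LIST = re.compile(r'\[\s*%s(?:\s*,\s*%s)*\s*\]' % (STRING, STRING))
ITEM = re.compile(STRING)

def parse_string_list(text):
    """Return the strings inside a bracketed list, or None if no match."""
    m = LIST.match(text)
    if m is None:
        return None
    # Re-scan the matched text so every string is returned, not just
    # the first one and the last repetition of the repeated group.
    return ITEM.findall(m.group(0))

print(parse_string_list('["s1", "s2", "s3"]'))
```

Escaped quotes stay in the returned strings exactly as written, backslash
included; unescaping them would be a separate step.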

Anyway, it'll give you a few ideas to work with.
regards
 Steve
--
Steve Holden   http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC  +1 703 861 4237  +1 800 494 3119
--
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread RyanMorillo
Check jgsoft dot com; they have two things which may help: EditPad Pro
(the test version has a good tutorial) or PowerGREP (if you do a lot
of regexes). There is also the Mastering Regular Expressions book from
O'Reilly (if you do a lot of regex work).

Also the Perl group would be good for regexes (Python's are Perl 5
compatible).



Re: OT: novice regular expression question

2004-12-30 Thread It's me
I'll chew on this.  Thanks, got to go.





Re: OT: novice regular expression question

2004-12-30 Thread M.E.Farmer
Hello me,
Have you tried shlex.py? It is a tokenizer for writing lexical
parsers.
It should be a breeze to whip something up with it.
An example of tokenizing:
py> import shlex
py> # fake an open record
py> import cStringIO
py> myfakeRecord = cStringIO.StringIO()
py> myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")
py> myfakeRecord.seek(0)
py> lexer = shlex.shlex(myfakeRecord)

py> lexer.get_token()
'['
py> lexer.get_token()
"'1'"
py> lexer.get_token()
','
py> lexer.get_token()
"'2'"
py> lexer.get_token()
']'
py> lexer.get_token()
"'fdfdfdfd'"

You can do a lot with it; that is just a teaser.
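
For readers on Python 3, where cStringIO no longer exists, roughly the same
session can be sketched with io.StringIO. The helper name tokenize_record
is made up here; shlex is left in its default non-POSIX mode, which keeps
the quote characters as part of each string token.

```python
import io
import shlex

def tokenize_record(text):
    """Split one record into shlex tokens (default non-POSIX mode)."""
    lexer = shlex.shlex(io.StringIO(text))
    # A shlex object is iterable and stops at end of input.
    return list(lexer)

print(tokenize_record("['1','2'] \n 'fdfdfdfd'\n"))
```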
M.E.Farmer



Re: OT: novice regular expression question

2004-12-30 Thread M.E.Farmer

It's me wrote:
> The shlex.py needs quite a number of .py files.  I tried to hunt down a few
> of them and got really tired.
>
> Is there one batch of .py files that I can download from somewhere?
>
> Thanks,

Not sure what you mean by this.
shlex is a standard library module.
It imports only os and sys, which are standard library modules too.
If you have Python you have them already.
If you mean cStringIO, it is in the standard library (at least on my
system).
You don't have to use it; just feed shlex an open file.
py> lexer = shlex.shlex(open('myrecord.txt', 'r'))
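
A small sketch of the same idea that also closes the file when done. The
function name tokens_from_file is hypothetical, and the path is whatever
record file you have.

```python
import shlex

def tokens_from_file(path):
    """Tokenize a record file with shlex, closing the file afterwards."""
    with open(path, "r") as handle:
        # Consume every token before the with-block closes the file.
        return list(shlex.shlex(handle))
```

Calling tokens_from_file('myrecord.txt') would then return the whole file
as a flat list of tokens.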

Hth,
M.E.Farmer
