Lie Ryan wrote:
>> I just found a simple, but nice, trick to make regexes less illegible.
>> Using substrings to represent sub-patterns. E.g. instead of:
>>
>> p = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')
>>
>> write first:
>>
>> month = r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)'
>> day = r'(?P<day>\d{1,2})'
>> year = r'(?P<year>\d{4})'
>>
>> then:
>> p = re.compile( r"%s\s%s,\s%s" % (month,day,year) )
>> or even:
>> p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" %
>>                 {'month':month,'day':day,'year':year} )
>>
>
> You might find that this doesn't help much and might even hide bugs
> (or make them harder to see). Regex IS hairy in the first place.
>
> I'd vote for namespaces in regex, to allow a pattern inside a special pair
> of tags to be completely ignorant of anything outside it. And I'd also vote
> for letting a compiled pattern be used as a subpattern. But then, with all
> those limbs, I might as well use a full-blown syntax parser instead.
Just tried to implement something like that for fun. The trick I chose is to
reuse the {} in pattern strings: as far as I know, curly braces are used only
for repetition, hence there should be no conflict. A format would then look
like this:
format = r"...{subpatname}...{subpatname}..."
For instance:
format = r"record: {num}{n} -- name:{id}"
(Look at the code below.)
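To check the "no conflict" claim, here is a small sketch. It uses the same shape of placeholder regex as the code that follows; the names placeholder and fmt are only for illustration. A repetition such as {4} or {1,2} starts with a digit, so it cannot be mistaken for an {identifier}:

```python
import re

# {identifier} placeholder, same shape as in the Grammar class
placeholder = re.compile(r"\{[a-zA-Z_][a-zA-Z_0-9]*\}")

# one real placeholder mixed with ordinary repetition braces
fmt = r"{day}\s\d{4}|x{1,2}"

# only the placeholder is picked up; {4} and {1,2} are left alone
found = placeholder.findall(fmt)
print(found)
```

The one case I can see where it could bite: the re module treats a brace pair that is not a valid repetition, like x{abc}, as literal text, so such a literal would be taken for a placeholder here.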
Found 2 issues:
-1- Compiled regex patterns are instances of a built-in type, so we cannot set
arbitrary attributes on them. (The string they have been built from is
available as the read-only 'pattern' attribute, though.) So we need to play
with strings directly. (Which means that, once a sub-pattern has been tested,
we must copy its string.)
-2- There must be a scope in which to look up sub-pattern strings by name. The
only practical solution I could think of is to make super-patterns, or full
grammars, instances of a class that copes with the details. Sub-pattern
strings have to be attributes of this object.
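A quick check of issue -1-, as a sketch only (the variable names are mine): a compiled pattern does give back its source string through its pattern attribute, but it refuses any new attribute, which is why the sub-pattern strings end up on a separate object.

```python
import re

p = re.compile(r"[0-9]+")

# the string the pattern was compiled from
source = p.pattern

# compiled patterns have no instance __dict__, so this fails
try:
    p.name = "n"
    can_set = True
except AttributeError:
    can_set = False

print(source, can_set)
```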
_______________________________________________
from re import compile as pattern

class Grammar(object):
    # a sub-pattern reference is written {name} in the format string
    identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"
    sub_pattern = pattern(r"\{%s\}" % identifier)

    def subString(self, result):
        # replacement callback: look the name up among the instance attributes
        name = result.group()[1:-1]
        try:
            return self.__dict__[name]
        except KeyError:
            raise AttributeError("Cannot find sub-pattern string '%s'." % name)

    def makePattern(self, format=None):
        # use the 'format' attribute unless a format is passed explicitly
        format = self.format if format is None else format
        self.pattern_string = Grammar.sub_pattern.sub(self.subString, format)
        print("%s -->\n%s" % (format, self.pattern_string))
        self.pattern = pattern(self.pattern_string)
        return self.pattern

if __name__ == "__main__":
    record = Grammar()
    record.num = r"n[°\.oO]\ ?"
    record.n = r"[0-9]+"
    record.id = r"[a-zA-Z_][a-zA-Z_0-9]*"
    record.format = r"record: {num}{n} -- name:{id}"
    record.makePattern()
    text = """
record: no 123 -- name:blah
record: n.456 -- name:foo
record: n°789 -- name:foo_bar
"""
    result = record.pattern.findall(text)
    print("%s\n" % (result,))
    # with format argument and invalid format: {bloo_bar} is not defined,
    # so makePattern() raises AttributeError here
    bloo_format = r"record: {num}{n} -- name:{id}{bloo_bar}"
    record.makePattern(bloo_format)
    result = record.pattern.findall(text)
    print(result)
_______________________________________________
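For comparison, a function-based variant of the same idea, as a rough sketch rather than a finished tool. It assumes the sub-pattern strings live in a plain dict instead of on an object, and it wraps each one in a named group, as in the month/day/year example quoted at the top. The names PLACEHOLDER, expand and subs are hypothetical.

```python
import re

# {identifier} placeholder; the group captures the bare name
PLACEHOLDER = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def expand(fmt, subs):
    """Replace each {name} in fmt by a named group wrapping subs[name]."""
    def repl(match):
        name = match.group(1)
        # a missing name raises KeyError
        return "(?P<%s>%s)" % (name, subs[name])
    return PLACEHOLDER.sub(repl, fmt)

subs = {
    "month": (r"January|February|March|April|May|June|July|August"
              r"|September|October|November|December"),
    "day": r"\d{1,2}",
    "year": r"\d{4}",
}

date = re.compile(expand(r"{month}\s{day},\s{year}", subs))
m = date.search("released on March 3, 2009")
fields = m.groupdict()
print(fields)
```

Since every placeholder becomes a named group, a name used twice in one format would make re.compile complain about a redefined group, so this variant only suits formats where each name appears once.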