Lie Ryan wrote:

>> I just found a simple, but nice, trick to make regexes less illegible.
>> Using substrings to represent sub-patterns. E.g. instead of:
>>
>> p = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')
>>
>> write first:
>>
>> month = r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)'
>> day = r'(?P<day>\d{1,2})'
>> year = r'(?P<year>\d{4})'
>>
>> then:
>> p = re.compile( r"%s\s%s,\s%s" % (month,day,year) )
>> or even:
>> p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" %
>> {'month':month,'day':day,'year':year} )
>>
>
> You might realize that that wouldn't help much and might even hide bugs
> (or make them harder to see). regex IS hairy in the first place.
>
> I'd vote for a namespace in regex, to allow a pattern inside a special pair
> of tags to be completely ignorant of anything outside it. And also I'd vote
> for a compiled-pattern to be used as a subpattern. But then, with all
> those limbs, I might as well use a full-blown syntax parser instead.

Just tried to implement something like that for fun. The trick I chose is to reuse {} in pattern strings: as far as I know, curly braces are otherwise used only for repetition, and a repetition count never looks like an identifier, so there should be no conflict. A format would then look like this:

        format = r"...{subpatname}...{subpatname}..."

For instance:

        format = r"record: {num}{n} -- name:{id}"
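To make the idea concrete before the class version, here is a minimal sketch of the substitution step alone, applied to the date example from the quoted message. The names `PLACEHOLDER` and `expand` are made up for this illustration; they are not part of the re module. Repetition braces such as {1,2} are left alone because a placeholder name must start with a letter or an underscore.

```python
import re

# Hypothetical helper names, for illustration only.
# A placeholder is an identifier between braces, e.g. {month}.
PLACEHOLDER = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def expand(format, subpatterns):
    # A function is used as the replacement so that backslashes in the
    # sub-pattern strings are inserted verbatim.
    return PLACEHOLDER.sub(lambda m: subpatterns[m.group(1)], format)

subpatterns = {
    "month": r"(?P<month>January|February|March|April|May|June|July"
             r"|August|September|October|November|December)",
    "day": r"(?P<day>\d{1,2})",
    "year": r"(?P<year>\d{4})",
}

date_string = expand(r"{month}\s{day},\s{year}", subpatterns)
date = re.compile(date_string)
match = date.search("Posted on March 7, 2009")
print(match.group("month", "day", "year"))  # ('March', '7', '2009')
```

The substituted text is not scanned again, so the {1,2} and {4} inside the sub-patterns reach the regex engine untouched.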

(Look at the code below.)

Found 2 issues:
-1- Compiled regex patterns have no 'string' attribute (the string they were built from is kept in their 'pattern' attribute instead), and as they are built-in objects we cannot set new attributes on them. So it is simpler to work with strings directly. (Which means that, once a sub-pattern has been tested, we must copy its string.)
-2- There must be a scope in which to look up sub-pattern strings by name. The only practical solution I could think of is to make super-patterns, or full grammars, instances of a class that copes with the details. Sub-pattern strings have to be attributes of this object.
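A small check of the first issue, as a sketch: in the standard re module a compiled pattern keeps its source string in the `pattern` attribute, but refuses any attribute we try to add. The variable names below are only for illustration.

```python
import re

day = re.compile(r"(?P<day>\d{1,2})")
year = re.compile(r"(?P<year>\d{4})")

# The string a pattern was compiled from is available as .pattern
print(day.pattern)

# Compiled patterns do not accept new attributes.
try:
    day.string = day.pattern
except AttributeError as error:
    print("cannot set attribute: %s" % error)

# So already-tested sub-patterns can still be recombined through their strings.
combined = re.compile(day.pattern + "/" + year.pattern)
print(combined.match("7/2009").groupdict())
```

This does not remove the need for a place to store sub-patterns by name, which is the second issue.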

_______________________________________________
from re import compile as pattern

class Grammar(object):
        # a sub-pattern name is a plain identifier written between braces
        identifier = "[a-zA-Z_][a-zA-Z_0-9]*"
        sub_pattern = pattern(r"\{%s\}" % identifier)
        def subString(self, result):
                # strip the surrounding braces to get the sub-pattern name
                name = result.group()[1:-1]
                try:
                        return self.__dict__[name]
                except KeyError:
                        raise AttributeError("Cannot find sub-pattern string '%s'." % name)
        def makePattern(self, format=None):
                format = self.format if format is None else format
                # replace each {name} with the sub-pattern string of that name
                self.pattern_string = Grammar.sub_pattern.sub(self.subString, format)
                print("%s -->\n%s" % (format, self.pattern_string))
                self.pattern = pattern(self.pattern_string)
                return self.pattern

if __name__ == "__main__":
        record = Grammar()
        record.num      = r"n[°\.oO]\ ?"
        record.n        = r"[0-9]+"
        record.id       = r"[a-zA-Z_][a-zA-Z_0-9]*"
        record.format = r"record: {num}{n} -- name:{id}"
        record.makePattern()
        text = """
        record: no 123 -- name:blah
        record: n.456 -- name:foo
        record: n°789 -- name:foo_bar
        """
        result = record.pattern.findall(text)
        print("%s\n" % result)
        # with a format argument that names an unknown sub-pattern
        bloo_format = r"record: {num}{n} -- name:{id}{bloo_bar}"
        try:
                record.makePattern(bloo_format)
        except AttributeError as error:
                print(error)
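One caveat with plain string substitution, which may be the kind of hidden bug Lie Ryan had in mind: a sub-pattern containing '|' changes meaning once it is pasted into a larger pattern. A possible remedy, sketched below as an assumption rather than as part of the code above, is to wrap each substituted string in a non-capturing group. The `expand` helper and its `wrap` parameter are hypothetical names.

```python
import re

PLACEHOLDER = re.compile(r"\{([a-zA-Z_][a-zA-Z_0-9]*)\}")

def expand(format, subpatterns, wrap=False):
    # Hypothetical helper: substitute {name}, optionally inside (?:...)
    def replace(m):
        sub = subpatterns[m.group(1)]
        return "(?:%s)" % sub if wrap else sub
    return PLACEHOLDER.sub(replace, format)

subpatterns = {"n": r"[0-9]+", "unit": r"cm|mm"}
format = r"{n}{unit}$"

plain = re.compile(expand(format, subpatterns))          # [0-9]+cm|mm$
wrapped = re.compile(expand(format, subpatterns, True))  # (?:[0-9]+)(?:cm|mm)$

# Without grouping, the alternation splits the whole pattern in two,
# so the '$' anchor only applies to the 'mm' branch.
print(bool(plain.match("12cmXYZ")))    # True
print(bool(wrapped.match("12cmXYZ")))  # False
print(bool(wrapped.match("12mm")))     # True
```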


_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
