[Licheng Fang]
> ...
> In fact, what I'm doing is handling a lot of regular expressions. I
> wanted to build VERY LONG regexps part by part and put them all into a
> file for easy modification and maintenance. The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FRACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the syntactically wrong (* and ) is something to be
> replaced, and then I wrote a little script to do the string
> substitution, to get very long regexps to be compiled. I thought that
> might be a way to handle long and complex regexps, and in this course I
> encountered the problem with the semantics of '|'.
>
> My approach may sound desperate and funny, but please, do you have any
> good idea as to how to handle long and complex regexps?
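For what it's worth, the substitution scheme described above can be done in plain Python without a separate script, using a dict of named parts and `str.format` (the names `parts` and `NUMERALS` here are my own illustration, not anything from the original post; note this breaks if a part contains literal braces such as a `{2,3}` quantifier). Ordering the alternatives longest-first is what makes `|` behave as the poster hoped:

```python
import re

# Build long regexps part by part; each part may refer to earlier ones.
parts = {"INT": r"\d+"}
parts["DECIMAL"] = r"(?:{INT}\.{INT})".format(**parts)
parts["FRACTION"] = r"(?:{DECIMAL}/{DECIMAL})".format(**parts)
# '|' tries alternatives left to right, so list the longest form first.
parts["NUMERALS"] = r"(?:{FRACTION}|{DECIMAL}|{INT})".format(**parts)

NUMERALS = re.compile(parts["NUMERALS"])

print(NUMERALS.match("3.14/2.71").group())  # the FRACTION alternative wins
print(NUMERALS.match("3.14").group())       # falls through to DECIMAL
```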
My best advice is to find a different approach entirely. For example, build a parser using parser technology, and use regexps in that /only/ to do gross tokenization ("string of digits", "decimal point", ...): build the rest with a grammar. Regexps are a brittle tool, best tolerated in small doses.

For an "NFA" implementation like Python's, you're likely to experience poor speed when combining many complex regexps, and /especially/ when alternatives are ambiguous wrt prefixes (and yours are, else you wouldn't have a problem with "longest match" versus "some shorter match" to begin with). OTOH, under a "DFA" implementation such as POSIX grep's, you're likely to experience exponential memory requirements (DFA implementations can need to build enormous state machines, tracing out in advance all possible paths through all the regexps applied to all possible input strings).

It just sounds to me like the wrong tool for the job.
-- 
http://mail.python.org/mailman/listinfo/python-list
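[As an illustration of the tokenize-then-parse approach suggested above, here is a minimal sketch. The tokenizer regexp only recognizes the gross pieces (digit runs, '.', '/'); a tiny recursive-descent parser enforces the grammar NUMERAL := NUMBER ('/' NUMBER)?, NUMBER := INT ('.' INT)?. All names here are hypothetical, not from the thread.]

```python
import re

# Regexps do only the gross tokenization: digit runs, '.', '/'.
TOKEN = re.compile(r"\d+|[./]")

def tokenize(s):
    toks, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError("bad character at position %d" % pos)
        toks.append(m.group())
        pos = m.end()
    return toks

def parse(s):
    """Parse per the grammar: NUMERAL := NUMBER ('/' NUMBER)?"""
    toks = tokenize(s)
    i = 0

    def number():
        # NUMBER := INT ('.' INT)?
        nonlocal i
        if i >= len(toks) or not toks[i].isdigit():
            raise ValueError("expected an integer")
        val = toks[i]
        i += 1
        if i + 1 < len(toks) and toks[i] == "." and toks[i + 1].isdigit():
            val += "." + toks[i + 1]
            i += 2
        return val

    left = number()
    if i < len(toks) and toks[i] == "/":
        i += 1
        result = ("fraction", left, number())
    else:
        result = ("number", left)
    if i != len(toks):
        raise ValueError("trailing input")
    return result

print(parse("3.14/2.71"))  # ('fraction', '3.14', '2.71')
print(parse("42"))         # ('number', '42')
```

The grammar, not the regexps, decides what "longest match" means here, which sidesteps the '|' semantics problem entirely.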