John Machin wrote:
> On 29/05/2006 7:46 AM, Sébastien Boisgérault wrote:
> > Paddy wrote:
> >
> >> maybe this: http://www.pcre.org/pcre.txt and ctypes might work for you?
> >
> > Well finally, it doesn't fit. What I need is a "longest match" policy
> > in patterns like "(a)|(b)|(c)" and NOT a "left-to-right" policy.
> > Additionally, I need to be able to obtain the matched ("captured")
> > substring and the PCRE does not allow this in DFA mode.
> >
>
> Perhaps you might like to be somewhat more precise with your
> requirements.
Sure. More on this below.

> "POSIX-compliant" made me think of yuckies like [:fubar:]
> in character classes :-)

Yep. I do not need POSIX *syntax* for regular expressions but POSIX
*semantics*, at least the "leftmost-longest" part (in contrast to the
"first then longest" policy used in Python, Perl, .NET, etc.)

> The operands of | are such that the length is not fixed and so you can't
> write them in descending length order? Care to tell us some more detail
> about those operands?

Basically, I'd like to use the (excellent) Python module SPARK by
John Aycock to build an (extended) C lexer. To do so, I need to
specify the patterns that match my tokens as well as a priority
between them. SPARK then builds one big alternation of patterns that
begins with the high-priority patterns and ends with the low-priority
patterns, and runs a match.

The problem with this approach is that you have to be very careful
and specify the priorities explicitly to get the desired results:
"<=" must have a higher priority than "<", decimal numbers a higher
priority than integers, etc., when most of the time what you really
want is to match the longest pattern ...

Worse, the priority work-around does not work well when you compare
keywords and (other) identifiers. To match "fortune" as an identifier,
you would need to give identifiers a higher priority than keywords,
and that is a problem: "for" would then be matched as an identifier
when it is a keyword.

I can come up with possible work-arounds for the "id vs keyword"
issue, but nothing that really makes me happy ...

Therefore, I was studying the possible replacement of the native
Python regular expression engine with a "POSIX semantics" regular
expression engine that would give the longest match and save me a
lot of extra work ...

I hope it's clearer now :) Any advice?

Cheers

SB

--
http://mail.python.org/mailman/listinfo/python-list
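
For illustration, the "longest match" policy discussed above can be
approximated on top of the standard re module by trying each token
pattern separately at the current position and keeping the longest
result, with list order used only to break ties (so a keyword wins
over an identifier of the same length). This is only a minimal sketch,
not SPARK's actual mechanism and not a POSIX engine: the token names,
the patterns, and the helper functions longest_match and tokenize are
all hypothetical, and inside each individual pattern the usual Python
alternation rules still apply.

```python
import re

# Hypothetical token table, for illustration only.
# Order matters only for ties: an earlier entry wins when two
# patterns match the same number of characters.
TOKEN_PATTERNS = [
    ("KEYWORD", r"for|if|while|return"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("FLOAT", r"[0-9]+\.[0-9]*"),
    ("INTEGER", r"[0-9]+"),
    ("LE", r"<="),
    ("LT", r"<"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_PATTERNS]


def longest_match(text, pos=0):
    """Return (name, match) for the longest match at pos, or None."""
    best = None
    for name, regex in COMPILED:
        m = regex.match(text, pos)
        # Strict '>' keeps the earlier pattern when lengths are equal.
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    return best


def tokenize(text):
    """Split text into (name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        found = longest_match(text, pos)
        if found is None or found[1].end() == pos:
            raise ValueError("no token matches at position %d" % pos)
        name, m = found
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("for fortune <= 3.14 < 42"):
        print(token)
```

With this table, "fortune" comes out as an identifier because the
identifier pattern matches more characters than the keyword pattern,
while "for" alone stays a keyword thanks to the tie-break. The cost is
one match attempt per pattern at every position instead of a single
combined regular expression.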