> > Mark -- have you tried the jflex-analyzer-r560135-patch.txt patch with
> > your wikipedia diff test? That's the early one whose grammar was "dot for
> > dot" translated from the original JavaCC spec -- for further patches I did
> > some "optimizations", which seem to have broken the compatibility...
>
> The test is Mike's and I think it is off your latest patch.

Oops again -- I should stop working late at night :) The latest patch was
not very compatible with JavaCC; the jflex-analyzer-r560135-patch.txt patch
should be the best one to use here. Maybe I should delete the other
attachments from JIRA to avoid further confusion?

Looks like the optimizations might have to go then?

Definitely, but the base version (jflex-analyzer-r560135-patch.txt) is
still much faster than JavaCC.

> > Incidentally, what was the motivation for requiring the <NUM> token to
> > have numbers only in every second segment and not in any segment?
>
> I don't think the rule is "every second segment" but "at least every
> other segment". Why this rule was made, I am not sure; I am guessing it
> was just a good rule of thumb to catch a lot of serial numbers, model
> numbers, etc., but without going too overboard in the matching.

Ok -- seems reasonable.

Staszek
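
P.S. To make the "at least every other segment" reading concrete, here is a rough Java sketch of how I understand the rule as described above. It is not the JFlex or JavaCC grammar itself: the class name, the method names and the separator set are assumptions made up for illustration, and it checks a whole string rather than tokenizing a stream.

```java
import java.util.regex.Pattern;

public class NumTokenSketch {
    // Assumed separator characters; the real grammar defines its own punctuation class.
    private static final Pattern SEPARATOR = Pattern.compile("[_\\-/.,]");
    // A segment is a non-empty run of letters and decimal digits.
    private static final Pattern ALPHANUM = Pattern.compile("[\\p{L}\\p{Nd}]+");

    public static boolean hasDigit(String segment) {
        return segment.chars().anyMatch(Character::isDigit);
    }

    // True if the string has two or more alphanumeric segments and the digits
    // show up in at least every other segment, i.e. either all even-positioned
    // or all odd-positioned segments contain a digit.
    public static boolean looksLikeNum(String token) {
        String[] segments = SEPARATOR.split(token, -1);
        if (segments.length < 2) {
            return false;
        }
        boolean evenPositionsOk = true;
        boolean oddPositionsOk = true;
        for (int i = 0; i < segments.length; i++) {
            if (!ALPHANUM.matcher(segments[i]).matches()) {
                return false; // empty or non-alphanumeric segment
            }
            if (!hasDigit(segments[i])) {
                if (i % 2 == 0) {
                    evenPositionsOk = false;
                } else {
                    oddPositionsOk = false;
                }
            }
        }
        return evenPositionsOk || oddPositionsOk;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeNum("192.168.0.1")); // digits in every segment
        System.out.println(looksLikeNum("AB-1234-CD"));  // digits in every other segment
        System.out.println(looksLikeNum("foo-bar"));     // no digits at all
    }
}
```

Under this reading, a model number like AB-1234-CD qualifies even though two of its three segments have no digits, while a plain hyphenated word like foo-bar does not.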