https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6143
--- Comment #32 from Justin Mason <[email protected]> 2009-07-08 02:16:51 PST ---
a lot of questions ;)  Thanks for looking into this, Sidney.

(In reply to comment #29)
> It appears that BodyRuleBaseExtractor.pm stops when it hits \x{00} when it is
> extracting a base_string from the original string of the pattern if there is
> a \x{80} through \x{ff} character anywhere in the pattern, either before or
> after the \x{00}.

BodyRuleBaseExtractor's purpose is to extract simple required "base strings"
from complex regexps.  To be honest I think a NUL char isn't particularly
"simple" so I'd be happy to likewise fix it to cut at that, too, in all cases.

> Does anyone know how the rule2xs scanner is supposed to handle two rules
> that match on the same string? So far I haven't seen how that can be done.

RET() supports multiple returns.  eg. given 3 rules

  body ABC /abc/
  body ABCD /abcd/
  body BCD /bcd/

run

  sa-compile --debug --keep-tmps --C tst.cf -p /dev/null

and we wind up with

  /*!re2c
    "abc" {RET("ABC,[l=1]");}
    "abcd" {RET("ABC,[l=1] ABCD,[l=1] BCD,[l=1]");}
    "bcd" {RET("BCD,[l=1]");}
    [\000-\377] { return NULL; }
  */

you can clearly see how multiple returns are handled here; the RET string is
a space-separated list of rule names.

to see overlapping:

  body ABC /abc/
  body CDE /cde/

  /*!re2c
    "abc" {RET("ABC,[l=1]");}
    "cde" {RET("CDE,[l=1]");}
    [\000-\377] { return NULL; }
  */

in that case you can see why it should rewind YYCURSOR to deal with the case
of "abcde" as input.

I think the main problem here is that RET() is being used on a
single-character string.  I didn't think it'd be possible for a single-char
base string to be extracted and appear in the re2c input; it wasn't supposed
to be.  If it's possible to cause a single ASCII printable [a-zA-Z0-9] char to
appear there, it will certainly not produce an efficient matcher. :(  iirc the
min length for base strings is supposed to be 3 chars or so...

So there's a few things that we can do:

1. fix BodyRuleBaseExtractor to cut base strings at NULs

2. fix sa-compile to not generate single-byte patterns in the re2c input

3. fix YYCURSOR rewind to be smarter in the single-char case, or at least not
   loop infinitely.  may not be necessary given #2
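
To make the overlapping "abcde" case concrete, here is a minimal Python sketch of why the cursor has to be rewound after a hit. It is only an illustration under stated assumptions, not the scanner that sa-compile generates: the `scan` function, its `rewind` flag, and the dict-of-lists rule table are all hypothetical, and the "resume one byte after the match start" policy is an assumed stand-in for whatever the real YYCURSOR rewind does.

```python
def scan(text, rules, rewind=True):
    """Return the rule names hit while scanning text left to right.

    Hypothetical model: `rules` maps a literal base string to a list of
    rule names, loosely mirroring the space-separated RET() list. The
    longest base string matching at the current position wins.
    """
    hits = []
    pos = 0
    while pos < len(text):
        # Longest base string that matches at the current position, if any.
        matched = max(
            (s for s in rules if text.startswith(s, pos)),
            key=len,
            default=None,
        )
        if matched is None:
            pos += 1  # no base string starts here: advance one byte
            continue
        hits.extend(rules[matched])
        # Assumed rewind policy: resume one byte after the match start so
        # that overlapping base strings are still seen. Without rewinding,
        # resume after the end of the match.
        pos += 1 if rewind else len(matched)
    return hits


rules = {"abc": ["ABC"], "cde": ["CDE"]}
with_rewind = scan("abcde", rules, rewind=True)
without_rewind = scan("abcde", rules, rewind=False)
print(with_rewind, without_rewind)
```

In this model the non-rewinding scan resumes at "de" after matching "abc" and so never reports CDE, while the rewinding scan reports both rules. The cursor also advances by at least one byte on every iteration, which is the property a rewind needs to keep even for a one-byte match if it is not to loop forever.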
