https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6143





--- Comment #32 from Justin Mason <[email protected]>  2009-07-08 02:16:51 PST ---
a lot of questions ;)  Thanks for looking into this, Sidney.

(In reply to comment #29)
> It appears that BodyRuleBaseExtractor.pm stops when it hits \x{00} when it is
> extracting a base_string from the original string of the pattern if there is
> a \x{80} through \x{ff} character anywhere in the pattern, either before or
> after the \x{00}.

BodyRuleBaseExtractor's purpose is to extract simple required "base strings"
from complex regexps.  To be honest I think a NUL char isn't particularly
"simple" so I'd be happy to likewise fix it to cut at that, too, in all cases.
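To make the "cut at NUL" idea concrete, here is a minimal sketch in Python. It is not BodyRuleBaseExtractor's code: the helper name cut_at_nul is made up for illustration, and keeping only the part before the first NUL is an assumption based on the "stops when it hits \x{00}" behaviour quoted above.

```python
def cut_at_nul(base_string: str) -> str:
    """Keep only the part of a candidate base string before the first NUL.

    Hypothetical helper: assumes "cutting" means truncating at the NUL,
    whether or not high-bit characters appear elsewhere in the pattern.
    """
    # partition() returns the whole string as the head when no NUL is present.
    head, _sep, _tail = base_string.partition("\x00")
    return head


if __name__ == "__main__":
    print(repr(cut_at_nul("abc\x00def")))      # 'abc'
    print(repr(cut_at_nul("caf\xe9\x00def")))  # 'café'
    print(repr(cut_at_nul("abcdef")))          # 'abcdef'
```

A truncated result may of course end up too short to be a useful base string, so a length check would still be needed afterwards.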

> Does anyone know how the rule2xs scanner is supposed to handle two rules
> that match on the same string? So far I haven't seen how that can be done.

RET() supports multiple returns.  E.g., given 3 rules:

  body ABC /abc/
  body ABCD /abcd/
  body BCD /bcd/

run sa-compile --debug --keep-tmps --C tst.cf -p /dev/null .
we wind up with

/*!re2c
        "abc"            {RET("ABC,[l=1]");}
        "abcd"            {RET("ABC,[l=1] ABCD,[l=1] BCD,[l=1]");}
        "bcd"            {RET("BCD,[l=1]");}
  [\000-\377]        { return NULL; }
*/

You can see how multiple returns are handled here: the RET string is a
space-separated list of entries, one per rule, each starting with the rule
name.  To see overlapping matches, take:

  body ABC /abc/
  body CDE /cde/

/*!re2c
        "abc"            {RET("ABC,[l=1]");}
        "cde"            {RET("CDE,[l=1]");}
  [\000-\377]        { return NULL; }
*/

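In that case you can see why it should rewind YYCURSOR to deal with the case of
"abcde" as input: once "abc" has matched, the cursor sits past the "c", so
"cde" would never be seen without a rewind.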
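The rewind argument can be checked with a small Python model of the scanner. This is only a sketch under stated assumptions: the RULES table and the scan function are hypothetical stand-ins for the generated re2c code, and "rewind" is modelled as resuming one character after the start of the match, which is not necessarily what the real scanner does with YYCURSOR.

```python
# Hypothetical table mirroring the RET() strings in the two-rule re2c block.
RULES = {
    "abc": "ABC,[l=1]",
    "cde": "CDE,[l=1]",
}


def scan(text, rules, rewind=True):
    """Return the rule names hit in text, in order of first appearance."""
    hits = []
    cursor = 0
    while cursor < len(text):
        # Pick the longest literal matching at the cursor (re2c is longest-match).
        matched = max(
            (p for p in rules if text.startswith(p, cursor)),
            key=len,
            default=None,
        )
        if matched is not None:
            # The RET string is space-separated; each entry is "NAME,[l=1]".
            for entry in rules[matched].split():
                name = entry.split(",", 1)[0]
                if name not in hits:
                    hits.append(name)
        if matched is not None and not rewind:
            # No rewind: continue after the end of the match.
            cursor += len(matched)
        else:
            # Rewind (or no match): resume one character past the start.
            # Always advancing by one guarantees the loop terminates.
            cursor += 1
    return hits


if __name__ == "__main__":
    print(scan("abcde", RULES))                # ['ABC', 'CDE']
    print(scan("abcde", RULES, rewind=False))  # ['ABC']
```

Without the rewind, the scan resumes at "de" and the CDE hit is lost.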

I think the main problem here is that RET() is being used on a single-character
string.  I didn't think it'd be possible for a single-char base string to be
extracted and appear in the re2c input; it wasn't supposed to be.  If it's
possible to cause a single ASCII printable [a-zA-Z0-9] char to appear there, it
will certainly not produce an efficient matcher. :(   iirc the min length for
base strings is supposed to be 3 chars or so...

So there are a few things that we can do:

1. fix BodyRuleBaseExtractor to cut base strings at NULs

2. fix sa-compile to not generate single-byte patterns in the re2c input

3. fix YYCURSOR rewind to be smarter in the single-char case, or at least
not loop infinitely.  This may not be necessary given #2
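As a rough illustration of option 2, the filter below drops base strings that are too short before they would be written to the re2c input. It is a Python sketch, not sa-compile's code: usable_patterns and MIN_BASE_LEN are hypothetical names, and the threshold of 3 bytes is an assumption taken from the "3 chars or so" recollection above.

```python
MIN_BASE_LEN = 3  # assumed threshold; the real minimum is not confirmed here


def usable_patterns(rules, min_len=MIN_BASE_LEN):
    """Drop base strings too short to be worth compiling into the scanner.

    `rules` maps a base string (as bytes) to its RET() string.
    """
    return {pat: ret for pat, ret in rules.items() if len(pat) >= min_len}


if __name__ == "__main__":
    candidates = {
        b"abc": "ABC,[l=1]",
        b"\x00": "NUL_RULE,[l=1]",  # single-byte pattern, filtered out
        b"xy": "XY,[l=1]",          # below the assumed minimum, filtered out
    }
    print(usable_patterns(candidates))  # {b'abc': 'ABC,[l=1]'}
```

A rule whose only base string is filtered out this way would presumably have to fall back to the ordinary regexp path instead of the compiled scanner.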
