Mark -- can you mail a *real* sample? private mail would be fine. --j.
Mark Martinec writes:
> I recently noticed a couple of cases where SA (3.1.4 or earlier)
> would take over a minute (instead of a few seconds) to check a 500 kB
> message. Investigation revealed that these cases have one thing in common:
> these were all message/partial chunks of a longish transfer of some
> document or other data. Moreover, most of these cases were hitting
> random sets of SARE or baseline rules, yielding false positives.
>
> In case someone would suggest that Content-Type: message/partial
> should be banned outright - well, it is a policy decision, and
> if allowed, should not bring SA to its knees on a 0.5 MB message.
>
> Here is one example where a command-line 'spamassassin -t -D' would
> run for 68 seconds. Timestamping each debug line produces the
> following top-10 lines - sorted by elapsed time, first column
> is time in seconds for this line to appear after a previous one:
>
> 1.935 dbg: rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
> 2.204 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
> 3.695 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0il"
> 3.976 dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "i"
> 4.021 dbg: rules: running raw-body-text per-line regexp tests; score ...
> 6.397 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " Sjx"
> 8.225 dbg: bayes: tok_get_all: token count: 37175
> 8.254 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
> 9.682 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
> 11.999 dbg: rules: running body-text per-line regexp tests; score so far=2.501
>
> and another example:
>
> 2.396 dbg: rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "b0y"
> 2.424 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
> 2.627 dbg: bayes: tok_get_all: token count: 36631
> 3.421 dbg: rules: running body-text per-line regexp tests; score so far=0.203
> 3.826 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0Il"
> 4.181 dbg: rules: running raw-body-text per-line regexp tests; score ...
> 4.265 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " S8X"
> 8.113 dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "XoNOgX"
> 9.308 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
> 9.945 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
>
> I know some of these are SARE rulesets, but some are baseline rules
> or bayes token parsing.
>
> Here is a relevant section/sample of one of these messages:
>
> MIME-Version: 1.0
> Content-Type: message/partial;
>   total=22;
>   id="[EMAIL PROTECTED]";
>   number=21
> X-Priority: 3
> X-MSMail-Priority: Normal
> X-Mailer: Microsoft Outlook Express 6.00.2900.2869
> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869
>
> f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg
> VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR
> XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7
>
>
> So the problem is that these base64-encoded lines in a message/partial
> chunk are treated as obfuscated text, which is very slow, and produces
> almost random hits on various rules.
> It also places some burden on the
> SQL server (bayes: tok_get_all: token count: 37175).
>
>
> A somewhat similar case, which also hits various obfuscation rules
> because its UU-encoding is mistaken for plain text, is mail
> with attachments produced by Microsoft Office Outlook where the user
> has the following setting chosen:
>
> Tools -> Options -> Mail Format -> Internet format: plain text options:
>   (YES) Encode attachments in UUENCODE format
>         when sending a plain text message
>
> It would be nice if such encodings were recognized, at least to
> prevent rules that expect plain text from running and/or producing
> false hits.
>
> Mark
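
For illustration only, here is a rough sketch of the kind of recognition asked for above: a heuristic that flags a body as "probably encoded data" when it contains a uuencoded block or consists mostly of long base64-looking lines. This is not SpamAssassin code (SA is written in Perl) and not how SA handles the problem; the function names, the 60-character minimum line length, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re

# A line made only of base64 alphabet characters, with optional "=" padding.
BASE64_LINE = re.compile(r'^[A-Za-z0-9+/]+={0,2}$')
# A uuencode header line such as "begin 644 report.doc".
UU_BEGIN = re.compile(r'^begin [0-7]{3,4} \S')


def base64_fraction(lines, min_len=60):
    """Return the fraction of non-empty lines that look like long base64 lines.

    min_len is an assumed cutoff: short words also match the base64
    alphabet, so only long unbroken lines are counted.
    """
    nonempty = [ln.strip() for ln in lines if ln.strip()]
    if not nonempty:
        return 0.0
    hits = sum(1 for ln in nonempty
               if len(ln) >= min_len and BASE64_LINE.match(ln))
    return hits / len(nonempty)


def has_uuencoded_block(lines):
    """Return True if a 'begin <mode> <name>' line is later followed by 'end'."""
    in_block = False
    for ln in lines:
        ln = ln.rstrip("\r\n")
        if not in_block and UU_BEGIN.match(ln):
            in_block = True
        elif in_block and ln == "end":
            return True
    return False


def looks_encoded(body, threshold=0.8):
    """Heuristic: True if the body is probably encoded data, not plain text.

    threshold is an assumed value, not a tuned one.
    """
    lines = body.splitlines()
    return has_uuencoded_block(lines) or base64_fraction(lines) >= threshold
```

A caller could run looks_encoded() on the decoded body before applying rules that expect plain text, and skip those rules when it returns True. A real fix would more likely key off the MIME structure than off a guess like this.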