https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6389

--- Comment #15 from Adam Katz <[email protected]> 2010-04-12 18:50:51 EDT ---
Just a follow-up because I had some investigations running when this was
closed...

Rules
------------------
# From rulesrc/sandbox/khopesh/20_bug_6389.cf on trunk at r932438
# http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/khopesh/20_bug_6389.cf?revision=932438&view=markup

# just a raw numbers check:
header __HAS_XMIME_AUTOCONV exists:X-MIME-Autoconverted
tflags __HAS_XMIME_AUTOCONV nice

# possible fix to bug 6389
header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to 8bit/
tflags __MIME_QP_TO_8BIT nice

# John Wilcock's proposed substitutions for __..._ENCODED_B64 (comment 8)
header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i

meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64 && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# Daryl O'Shea (DOS) + Adam Katz (KHOP) + John Wilcock version
meta FROM_SUBJ_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# assuming recipients won't also be highbit'd ("highbitten?")
header __TO_1BYTE_B64 To:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
meta FROM_SUBJ_NOTO_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT && !__TO_1BYTE_B64
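
For anyone wanting to poke at the *_1BYTE_B64 pattern outside of SpamAssassin,
here is a rough Python sketch.  The regex is copied from the rules above; the
sample header values and the helper name is_one_byte_b64 are made up for
illustration, and Python's re module is only assumed to behave like Perl's
engine for this simple pattern.

```python
import re

# Pattern copied verbatim from the __FROM/__SUBJ/__TO_1BYTE_B64 rules (with /i).
ONE_BYTE_B64 = re.compile(
    r"=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?", re.IGNORECASE
)


def is_one_byte_b64(raw_header: str) -> bool:
    """True if the raw header contains a base64 encoded-word in one of the
    single-byte charsets named by the pattern."""
    return ONE_BYTE_B64.search(raw_header) is not None


# Hypothetical raw header values, invented for this example.
SAMPLES = [
    "=?windows-1251?B?8PPx8ero6Q==?=",  # single-byte charset, base64
    "=?ISO-8859-15?b?SGVsbG8=?=",       # matching is case-insensitive
    "=?iso-8859-1?Q?caf=E9?=",          # quoted-printable, not base64
    "=?UTF-8?B?SGVsbG8=?=",             # base64, but not a listed charset
    "=?koi8-r?B?8NLJ18XU?=",            # "koi8-r" spelling, see note below
]

for value in SAMPLES:
    print(f"{is_one_byte_b64(value)!s:5}  {value}")
```

With the pattern exactly as written, the koi-8r? alternative matches "koi-8"
and "koi-8r" but not the spelling "koi8-r", so the last sample does not hit.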


Results from 2010-04-11 (non-net run)
------------------

http://ruleqa.spamassassin.org/20100411-r932853-n/%2FDOS_HIGHB|MIME_QP_TO_|HAS_XMIME_|_1BYTE_B64|_ENCODED_B64|FROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1775   0.0359   0.970    0.82    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.0718   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0714   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.5069   0.2155   0.702    0.62   (n/a)  __SUBJ_1BYTE_B64
 0.0928   0.1333   0.410    0.53   (n/a)  __FROM_1BYTE_B64
 2.3337   2.3339   0.500    0.51   (n/a)  __SUBJECT_ENCODED_B64
 1.3552   1.7032   0.443    0.50   (n/a)  __FROM_ENCODED_B64
 0.0004   0.1519   0.003    0.31   (n/a)  __TO_1BYTE_B64
 6.2081   1.0613   0.854    0.24   (n/a)  __HAS_XMIME_AUTOCONV
 6.1458   0.9837   0.862    0.24   (n/a)  __MIME_QP_TO_8BIT

That rules out the suggestions from comment 8.  Because Daryl removed the
original rule, it's not listed here, but my modification did little to nothing.
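
As a sanity check on the table, the S/O column appears to be consistent with
spam% / (spam% + ham%).  A small Python sketch, assuming that definition (the
function name s_over_o is mine, not something from ruleqa):

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Spam/overall ratio, assumed to be spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        return float("nan")  # rule hit nothing at all
    return spam_pct / total


# (name, SPAM%, HAM%) copied from the 2010-04-11 non-net table above.
ROWS = [
    ("T_DOS_HIGHBIT_HDRS_BODY_BUG6389", 1.1775, 0.0359),
    ("__SUBJ_1BYTE_B64", 0.5069, 0.2155),
    ("__FROM_1BYTE_B64", 0.0928, 0.1333),
    ("__TO_1BYTE_B64", 0.0004, 0.1519),
]

for name, spam, ham in ROWS:
    print(f"{s_over_o(spam, ham):.3f}  {name}")
```

An S/O near 0.5 means the rule hits spam and ham at about the same rate, which
is why the __FROM_1BYTE_B64 and __..._ENCODED_B64 subrules are not useful on
their own.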

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham:  0  79.31%   69 *******************************
  scoremap  ham:  1   3.45%    3 *
  scoremap  ham:  2  16.09%   14 ******
  scoremap  ham:  3   1.15%    1
  scoremap spam:  0   2.85%  413 *
  scoremap spam:  1   0.15%   22
  scoremap spam:  2  18.89% 2734 *******
  scoremap spam:  3   3.70%  536 *
  scoremap spam:  4   4.40%  637 *
  scoremap spam:  5  12.40% 1794 ****
  scoremap spam:  6   5.51%  797 **
  scoremap spam:  7   7.81% 1130 ***
  scoremap spam:  8  10.22% 1479 ****
  scoremap spam:  9   5.66%  819 **
  scoremap spam: 10   7.17% 1037 **
  scoremap spam: 11   5.80%  839 **
  scoremap spam: 12   4.35%  629 *
  scoremap spam: 13   2.74%  396 *
  scoremap spam: 14   2.64%  382 *
  scoremap spam: 15   1.53%  221
  scoremap spam: 16   1.29%  187
  scoremap spam: 17   0.98%  142
  scoremap spam: 18   0.53%   76
  scoremap spam: 19   0.53%   76
  scoremap spam: 20   0.27%   39
  scoremap spam: 21   0.20%   29
  scoremap spam: 22   0.12%   17
  scoremap spam: 23   0.08%   12
  scoremap spam: 24   0.10%   15
  scoremap spam: 25   0.01%    2
  scoremap spam: 26   0.01%    2
  scoremap spam: 28   0.02%    3
  scoremap spam: 29   0.01%    2
  scoremap spam: 30   0.01%    1
  scoremap spam: 32   0.02%    3
  scoremap spam: 33   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 72%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 55%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%

Note that although this is a non-net run, most of the rules in the overlap list
are network tests; RDNS_NONE is the only published non-net rule that overlaps
more than 50%.  In a scan completely lacking network tests, the scores in the
score-map would be even lower and the rule would appear more valuable.
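
To put a number on the score-map, here is a rough Python sketch that parses the
"scoremap" lines and reports what share of the spam hits scored below a
threshold.  The threshold of 5 is an assumption (SpamAssassin's default
required_score), as is treating each bucket label as the message score; the
helper names are mine.

```python
import re

# Matches lines such as "scoremap spam:  2  18.89% 2734 *******".
SCOREMAP_LINE = re.compile(r"scoremap\s+(ham|spam):\s*(-?\d+)\s+[\d.]+%\s+(\d+)")


def parse_scoremap(text):
    """Return {'ham': {score: count}, 'spam': {score: count}}."""
    buckets = {"ham": {}, "spam": {}}
    for m in SCOREMAP_LINE.finditer(text):
        buckets[m.group(1)][int(m.group(2))] = int(m.group(3))
    return buckets


def share_below(counts, threshold):
    """Fraction of hits whose score bucket is below the threshold."""
    total = sum(counts.values())
    below = sum(n for score, n in counts.items() if score < threshold)
    return below / total if total else 0.0


# Spam buckets copied from the 2010-04-11 non-net score-map above.
NON_NET_SPAM = """
  scoremap spam:  0   2.85%  413 *
  scoremap spam:  1   0.15%   22
  scoremap spam:  2  18.89% 2734 *******
  scoremap spam:  3   3.70%  536 *
  scoremap spam:  4   4.40%  637 *
  scoremap spam:  5  12.40% 1794 ****
  scoremap spam:  6   5.51%  797 **
  scoremap spam:  7   7.81% 1130 ***
  scoremap spam:  8  10.22% 1479 ****
  scoremap spam:  9   5.66%  819 **
  scoremap spam: 10   7.17% 1037 **
  scoremap spam: 11   5.80%  839 **
  scoremap spam: 12   4.35%  629 *
  scoremap spam: 13   2.74%  396 *
  scoremap spam: 14   2.64%  382 *
  scoremap spam: 15   1.53%  221
  scoremap spam: 16   1.29%  187
  scoremap spam: 17   0.98%  142
  scoremap spam: 18   0.53%   76
  scoremap spam: 19   0.53%   76
  scoremap spam: 20   0.27%   39
  scoremap spam: 21   0.20%   29
  scoremap spam: 22   0.12%   17
  scoremap spam: 23   0.08%   12
  scoremap spam: 24   0.10%   15
  scoremap spam: 25   0.01%    2
  scoremap spam: 26   0.01%    2
  scoremap spam: 28   0.02%    3
  scoremap spam: 29   0.01%    2
  scoremap spam: 30   0.01%    1
  scoremap spam: 32   0.02%    3
  scoremap spam: 33   0.01%    1
"""

spam_counts = parse_scoremap(NON_NET_SPAM)["spam"]
REQUIRED_SCORE = 5  # assumed threshold, see the note above
print(f"{share_below(spam_counts, REQUIRED_SCORE):.1%} of spam hits scored "
      f"below {REQUIRED_SCORE}")
```

For the non-net run this works out to about 30% of the rule's spam hits (4342
of 14472).  Adding up the same five buckets in the net-run score-map further
down gives roughly 2.7%.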


Results from 2010-04-10 (net run)
------------------

http://ruleqa.spamassassin.org/20100410-r932679-n/%2FDOS_HIGHB%7CMIME_QP_TO_%7CHAS_XMIME_%7C_1BYTE_B64%7C_ENCODED_B64%7CFROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1755   0.0116   0.990    0.86    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.5164   0.0390   0.930    0.76   (n/a)  __SUBJ_1BYTE_B64
 0.0685        0   1.000    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0682        0   1.000    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.0854   0.0435   0.663    0.61   (n/a)  __FROM_1BYTE_B64
 2.3165   2.0477   0.531    0.52   (n/a)  __SUBJECT_ENCODED_B64
 1.3498   1.6534   0.449    0.51   (n/a)  __FROM_ENCODED_B64
 0.0004   0.0099   0.039    0.47   (n/a)  __TO_1BYTE_B64
 6.2616   1.1081   0.850    0.23   (n/a)  __HAS_XMIME_AUTOCONV
 6.1999   1.0350   0.857    0.23   (n/a)  __MIME_QP_TO_8BIT

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham: -2  65.38%   17 **************************
  scoremap  ham:  0  26.92%    7 **********
  scoremap  ham:  1   3.85%    1 *
  scoremap  ham:  4   3.85%    1 *
  scoremap spam:  0   0.05%    7
  scoremap spam:  1   0.20%   29
  scoremap spam:  2   0.78%  113
  scoremap spam:  3   0.55%   80
  scoremap spam:  4   1.11%  161
  scoremap spam:  5   1.56%  226
  scoremap spam:  6   2.57%  373 *
  scoremap spam:  7   3.76%  546 *
  scoremap spam:  8   4.86%  705 *
  scoremap spam:  9   6.58%  955 **
  scoremap spam: 10   7.68% 1114 ***
  scoremap spam: 11   8.85% 1284 ***
  scoremap spam: 12   8.48% 1230 ***
  scoremap spam: 13   8.19% 1188 ***
  scoremap spam: 14   8.07% 1171 ***
  scoremap spam: 15   6.81%  989 **
  scoremap spam: 16   6.02%  873 **
  scoremap spam: 17   5.29%  767 **
  scoremap spam: 18   4.36%  632 *
  scoremap spam: 19   3.41%  495 *
  scoremap spam: 20   2.56%  371 *
  scoremap spam: 21   2.06%  299
  scoremap spam: 22   1.45%  211
  scoremap spam: 23   1.13%  164
  scoremap spam: 24   0.87%  126
  scoremap spam: 25   0.74%  108
  scoremap spam: 26   0.59%   85
  scoremap spam: 27   0.28%   40
  scoremap spam: 28   0.19%   27
  scoremap spam: 29   0.10%   14
  scoremap spam: 30   0.24%   35
  scoremap spam: 31   0.10%   14
  scoremap spam: 32   0.11%   16
  scoremap spam: 33   0.10%   14
  scoremap spam: 34   0.05%    7
  scoremap spam: 35   0.08%   11
  scoremap spam: 36   0.08%   12
  scoremap spam: 37   0.03%    4
  scoremap spam: 38   0.03%    4
  scoremap spam: 39   0.01%    2
  scoremap spam: 40   0.03%    4
  scoremap spam: 41   0.01%    2
  scoremap spam: 42   0.01%    2
  scoremap spam: 43   0.01%    1
  scoremap spam: 47   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 95%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BRBL_LASTEXT      1%
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 73%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 61%  T_DOS_HIGHBIT_HDRS...6389  T_RCVD_IN_ANBREP_BL       1%
 56%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 54%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%
 50%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_E4_51_100 3%


Conclusion
------------------

This rule is not worthwhile in network-enabled checks.  Without network tests,
this rule may be extremely valuable.  Assuming we're interested in developing
offline-only tests, this is worth revisiting once we have more corpora from
areas that use non-Latin character sets (specifically China), especially if we
can restrict it so that it does not fire when network tests are enabled.


I have removed the tests from SVN (satisfying comment #14).  They will
disappear from the ruleqa system in the next day or two.

$ svn delete --force 20_bug_6389.cf
D         20_bug_6389.cf
$ svn commit -m "Bug closed.  I posted my observations, including this file's contents and stats for ent and non-net runs, on bug 6389, comment 14" 20_bug_6389.cf
Deleting       20_bug_6389.cf

Committed revision 933340.
$
