On 2022-05-10 06:44, Henrik K wrote:
On Sun, May 08, 2022 at 11:29:29AM -0400, Bill Cole wrote:
I have not researched all of those, but I believe that some of those
should in theory be useful in Bayes.
So is someone going to research them then? And the 268 older headers that
axb already
Windows NT (that is, any kind of current Windows) file functions will
natively accept either \ or / equivalently in file paths. There is an option
to disable the acceptance of /, but almost nobody knows it exists, and I
can't imagine anyone setting it on a rational system.
The main
I don't recall your problem, but note that 3.4.6 was a very hasty update to
3.4.5 to correct some problems that showed up with some rules a day or so
after 3.4.5 was created. If things worked before 3.4.5 and fail in rules
with 3.4.5, I'd suggest that 3.4.6 may be the correct solution.
...yer preachin' t'th'choir here, Loren. :)
Yea, sorry about that. I realized that about 3 minutes after I hit send, but
didn't want to pollute the mailing list by re-posting it as a bug comment.
You did a much more diplomatic job anyway. :-)
Loren
Let me see if I can explain this problem in simple words:
1) The SA project develops rules.
2) From time to time (almost daily) the SA project updates those rules.
3) The problem you are reporting has been FIXED LONG AGO in the rules in the
SA project.
4) Administrators at various mail sites
An alternative approach is creating new strings from parsed data:
string TO_BODY = TO:addr ":" BODY(500)
string TO_BODY ~= //
the advantage of this is that there are no dependencies.
I'm thinking that BODY(500) would be a multi-line string constructed
from the first 500 byte of the rendered
And if the meta is depending on multiple unfinished rules, or even other
metas with unfinished rules? Sounds like a logic nightmare.. better just
design the metas better in the first place..
Seems to me the logic would be moderately straight-forward if it was driven
out of the end of net
Henrik (or anyone) what happens if a net rule is fired but no response is
ever received?
I assume there is some timeout limit for net responses to show up, so that
message processing can complete?
If that is true (there is a timeout on responses) then the situation in bug
4549 still exists
I guess the risk is exactly the same as rulenames colliding.. better not
use very generic names and you can always prepend the rulename yourself.
:-)
My other concern is that as far as I know, SA rules are still limited to a
single line of text. If the rule name plus item name gets long, the
Perl already has named capture groups as legit syntax, so it would be most
simple to actually use them.
https://perldoc.perl.org/perlre#(?%3CNAME%3Epattern)
header FROM_NAME /^From: "(?\w+)/
Good. I thought there was something there, but I didn't remember the exact
syntax and was too lazy to
> header __SUB_CAP Subject:Capture /Your (\w+) Order/i $(__COMPANY)=\1
Would :capture play well with (e.g.) :addr, :name, :raw, etc?
It might as well be a tflag or something. Why limit capturing to headers
only?
I hadn't intended it to be limited to headers only, but I guess the syntax
Now consider variable capture from the message:
header __SUB_CAP Subject:Capture /Your (\w+) Order/i
$(__COMPANY)=\1
The text above was intended to all appear on one line. "$(__COMPANY)=\1"
followed /i.
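To make the capture idea concrete, here is a minimal sketch in Python rather than SA rule syntax. It assumes a per-message dictionary of variables, named groups for capture, and a $(NAME) placeholder for reuse in a later rule. The function names and the placeholder expansion are illustrative assumptions, not anything SA implements.

```python
import re

PLACEHOLDER = re.compile(r"\$\((\w+)\)")

def run_capture_rule(pattern, text, variables):
    """Match a rule regex and copy its named groups into the per-message variables."""
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return False
    variables.update(match.groupdict())
    return True

def run_rule_with_variables(template, text, variables):
    """Expand $(NAME) placeholders with captured values, then match the result."""
    names = PLACEHOLDER.findall(template)
    if any(name not in variables for name in names):
        return False  # a rule that depends on an uncaptured variable cannot hit
    pattern = PLACEHOLDER.sub(lambda m: re.escape(variables[m.group(1)]), template)
    return re.search(pattern, text, re.IGNORECASE) is not None

# Hypothetical message data, for illustration only.
variables = {}
run_capture_rule(r"Your (?P<COMPANY>\w+) Order", "Subject: Your Acme Order", variables)
hit = run_rule_with_variables(r"From: .*$(COMPANY)", "From: billing@acme.example", variables)
```

Escaping the captured value before substituting it keeps message text from being interpreted as regex syntax, which seems like the safer default for anything taken from mail.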
John Hardin wrote:
An awful lot I think could be done simply by having rules that can
capture to named per-message-global variables, and allowing those
variables to be used in other (or the same) rules.
I've been wanting this for years.
Proposal for discussion:
Consider the following
Ideally rules could be written with some pseudo-language that could do
complex things, grabbing things into variables, modifying, comparing to
other things etc. Then there wouldn't be any need for Perl plugins doing
some trivial stuff.
An awful lot I think could be done simply by having rules
These kinds of changes just make you wonder what's the point of doing such
plugins inside SA distribution.. if we ever do get 4.0 released, I really
doubt if there are enough resources in the project to even release monthly
updates after that..
Given that plugins are by and large the basis for
All:
There are several rules in my sandbox that do not appear in the results at
https://ruleqa.spamassassin.org/ (e.g. __RECEIVE_BONUS)
I don't see __RECEIVE_BONUS in the bad sandbox report. I would assume it is
in your lotsa_money file?
All I can think of is something obvious, like it being
It's properly formed. Compare the plaintext part to the HTML part, note
that the base64 block is QP'd base64, and note that there's some more QP
spam pitch text after the base64 block.
Ah. I completely missed the division boundary a third of the way thru, or
for that matter the pdf attachment
See attached spample.
Is there a boundary missing in that spample? It seems to go from a couple
lines of QP text into base64 with no intervening boundary.
The "alias" directive should not affect RE rules at all, other than
perhaps removing one if the alias is defined after a RE rule having the
same name was defined.
I would hope another use of ALIAS would be to redirect a subsequent SCORE or
DESCRIPTION directive to the renamed rule rather than
de on when and how to merge the branch
into trunk.
Such a vote should be made by the full PMC, and not by a quorum of one member.
I think we know how the vote would go if only a single member can carry a vote
in a meeting he convenes for the purpose just by himself.
Respectfully,
Loren Wilt
Is there a way to clear the noautolearn for the whitelist rules? Normal
rules could probably do it with tflags. Except I'm not sure that you can
necessarily negate a previously set tflags value with a later tflags value.
(If not, maybe it would be worth an enhancement request.)
Another
Having far more experience than I need on multiproc systems, the answer is
it depends. In all probability having the extra 8 threads running will
result in some processor speed increase. It will be less than double, so
1.0 < x < 2.0.
Of course if this spawns N more instances of SA, you will
This isn't the only concern. There are performance penalties once you
bring the Encode module into play. Please see the discussion some
months back when John Myers added the other Encoding stuff.
Seems in this case that Encode was *solving* performance penalties.
Loren
Honestly I'm -0.5 on this. SA isn't a virus scanner, and while it could
The magic key that to my mind makes bringing it into the core set isn't
virus, it's phish. Agreed, SA isn't a virus scanner and probably
shouldn't be; it is quite inefficient at that sort of thing.
But to the best of
You want to set up a mass checker. This will run on mbox files (I'm pretty
sure) and it gives you combined and very useful stats on an entire group of
rules you are testing, across all the mail in the files. You basically need a
group of known ham mail and another group of known spam mail.
Good catch, mine is UTF-8.. Not sure about the original reporter.
Probably not, or they wouldn't be seeing the large difference they see.
Loren
Just wondering. would it be handy to have a new body type, the same as
body but matched as a single string, with all newlines converted to ?
in other words, this text:
It might be beneficial to convert the newlines to spaces, but it might also
be beneficial to leave them there so that they
I think Chris was talking about the same behavior yesterday... his
message had the following headers in it...
Chris a few weeks back, before trying to get away from Earthlink, was
complaining about exactly the same 'nohelo' hop in the Earthlink routing
chain. I haven't followed his recent
Bob had a technique that worked reasonably well, and only required some
minor human thought to usually come up with good numbers. I'm pretty sure
that it took overlap into account. (Although the overlap runs may have been
something he did manually, I don't recall.)
He seems to be gone to
A) make a specific rule or rule set test at (or near) the end of the tests
With some recent version of SA (3.2? Don't recall.) you can set a priority
on the rule and have it run as one of the last rules.
Come to think of it, you aren't talking about short-circuiting, just having
it run
required. I'm not 100% definite though. let's see if anyone else
weighs in ;)
As far as I'm concerned rulz is rulz.
If it is a rule that requires new code to work, then the new code better in
some way come with the new rule. Otherwise there is no point in
distributing the (unworkable) rule,
I'd personally prefer the second form. Every time I see something like the
first form I always wonder if it was deliberate or someone's fingers slipped
and entered an extra character. Or they took out an option and missed
deleting the pipe.
Loren
Someone please remind me why the email score simply isn't the score total up
to and including the short-circuit rule? I'm assuming that in general short
circuit rules are going to run relatively early (else why bother?) so except
for a short-circuit meta there should be relatively few scores in
Incidentally I'm using spamass-milter to pipe mail via milter to sa.
Spamass-milter is known to have problems with 3.1.1. Look at some of the
mail coming through and see if you are getting headers leaking down into
the body of the messages.
Total guess: that comment was left over from domainkeys, from before the SA
headers were moved up to the top.
Loren
Just for history sake, the reason we made a MIMEHeader plugin in the
first place (included in 3.1) was because it was asked for in bug 3781
by Loren. So I'm kind of surprised that it wasn't being used already.
Ah. I think we may have missed that it came into existence.
Is this disabled by
of Bayes and URIBL. There would probably be a much lower-overhead solution,
say SpamBayes, if SA's rules capability is effectively removed. Which seems
to be the effective intent of this proposal.
Loren Wilton
If it's a plugin, it has to be a code-tied rule! Otherwise it wouldn't need
the plugin.
Hey, what a neat way to completely disable the initial concept of the Rules
project and put things back into the Land Of Arcana where they belong!
Just move 'body', 'rawbody', 'header', and 'full' to
It does introduce the danger of algorithmic complexity attacks
if .* is used instead of .{0,20} though -- but we may be able to help
this if we spot that kind of thing in --lint.
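A lint check along those lines could be sketched roughly as follows. This is a deliberately naive Python illustration, not how --lint works: it only flags an unescaped dot followed by an open-ended quantifier, and it does not understand character classes or an escaped backslash before the dot. The function name is hypothetical.

```python
import re

# An unescaped "." followed by "*", "+", or an open-ended "{n,}" bound.
UNBOUNDED_WILDCARD = re.compile(r"(?<!\\)\.(?:[*+]|\{\d*,\})")

def find_unbounded_wildcards(rule_regex):
    """Return the open-ended wildcard constructs found in a rule's regex source."""
    return [m.group(0) for m in UNBOUNDED_WILDCARD.finditer(rule_regex)]

# A bounded form such as .{0,20} is not reported; .* is.
warnings = find_unbounded_wildcards(r"cheap.*meds")
```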
I still don't understand why .* is more dangerous in rawbody rules than it is
in full rules. Any cases where it
my $text = $parts[$pt]->decode();
$text =~ tr/ \t\n\r\x0b\xa0/ /s;  # whitespace = space
push(@{$self->{text_decoded}}, split_into_array_of_short_lines($text));
What does split_into_array_of_short_lines do? This sounds to me like it still
ends up with individual lines fed to the
One big plugin would be better than the current split. The current
split has no solid technical rationale behind it.
- allows eval rules to not be loaded. arguably, most of them will always be
enabled, but some could be disabled. DNSEval, for instance, is only useful
in net mode. If
Should we be wrapping full rules in alarms (using M::SA::Timeout) to
prevent this?
You can do this with any rule, a full rule is just easier to mess up.
I'd be concerned of the overhead (and probable timing holes) in wrapping
every rule in an alarm().
As an alternative, how about wrapping
default_rules_path (/usr/share/spamassassin)
site_rules_path (/etc/mail/spamassassin)
default_userprefs_path (~/.spamassassin/user_prefs)
Doesn't that imply that site rules override local rules? Surely those are
in the other order? Or is there magic when reading the second file
in other words it's been dropping from a high of 19.348% of spam to just
0.38% nowadays.
Which isn't to say that there aren't unique ids in modern subjects. They
just aren't in a form this can detect. :-)
Loren
As an outsider, I find myself strongly agreeing with Motohraru-san that,
when dealing with at least the oriental multibyte languages, tokenization
belongs early in the stream, before both bayes and rules.
Of course this is an overhead penalty that should not occur on mail that
isn't likely to be
IMO, bugs which allow any specially crafted spammy message to get
through, even if the method used is to crash spamd or stand-alone SA,
is NOT a security bug, provided the only damage is to SA/spamd and the
resulting FN. That's a bug, pure and simple, no matter how creative
the spammer is.
At a guess: IE and apparently Firefox have search for url enabled by
default. In IE that consists of sticking .com, .net, etc suffixes on, and I
think trying a www. prefix. From a report on the user's list, it appears
that Firefox goes farther and will do a google search, resulting in a
tinyurl
You can do that with the plain regex rules thanks to the experimental
and rather loony (?{...}) and (??{...}) constructs.
Well no. You could do that on 2.6x, and I used that for some very valuable
rule development tools. That ability was removed in 3.x.
Loren
anyway, I've just checked in a change that'll allow hit-rates
all the way down to 0.02%. why not. ;)
I guess I question active hitrates much under 1%. The key there is
'active'. Things that may be hitting next to nothing in one corpus might be
hitting well in another one.
Loren
Whether your idea is good or not, it has to do
with a suggestion for how to use sa-learn, not anything to do with
development.
Hi Sidney, happy new year!
Actually, while he phrased the RFE in terms of sa-learn, it is actually
something that could be done as an SA plugin, if SA were run on the
Converting sections of tests into plugins where some people will want to
disable the entire set due to performance, memory, or similar
constraints (i.e., Bayes tests, network tests, special functionality,
etc.) does make sense. However, converting individual (or nearly
individual) tests that
Hello Warren,
There was also a recent discussion about using SVM scoring techniques, and
someone posted a tool to do that. I believe the claim was that it produced
reasonable scoring with less effort than the normal method. Perhaps that
could be used here?
Loren
Looks generally good. Minor comments:
1. Bob had a thing built into his version of mass-check that assigns default
scores. I'm not clear on the basis for this (although he has explained it
any number of times) but it is fairly simple and seems to do a decent job,
shy of a full scoring run.
I'm
'As a collaborative documentation platform, the wiki has already proved
much more effective than our SVN codebase.'
So why not write a routine to scrape the Wiki on the day of release and
stick the pages into files in the release tree?
Loren
Not in my case Tom. I actually have all the Bayes features disabled and the
error still happened on my installation.
But do you have AWL disabled too?
I suppose mkrules could be changed to cat all the files parsed so far,
so that a sandbox file can refer to a core file's rule by name (since
sandbox will be compiled after core); but I quite like the side-effect of
restricting sandbox files to only being able to affect rules in their own
Hum. Is there any way to configure some default colors for the graph? On a
PC it seems Quicktime prints the thing out, and it is near unreadable. I
see a black square with a straight yellow line in the center and some wiggly
lines near the bottom. I *think* there might be some text in the
Now that I can log in, I see why it isn't really important.
Loren
Some random comments:
So the idea is that the source code for all rules (apart from the legacy
core and lang sets) remains in the sandbox dirs; in other words, there's
no need to cut and paste and move around rules when they're promoted
from testing status, to live core status.
I'm not
Not too important, but the quip software is dumping SQL debug
info:
Maybe that depends on what you are doing. I tried to log in
unsuccessfully:
Software error:DBD::mysql::st execute failed: You have an error in your SQL syntax near '' at line 1 [for Statement "SELECT login_name FROM
You know, I don't know if there'd be a separate bugzilla. good
question... I think the most likely thing would be that the rules
project stuff would be under the (existing) Rules component in BZ.
I don't know that BZ would get much use or be of much use in day to day
rules testing and
Please let me know what you think!
Daryl and Chris both make a number of good points, but the buildbot idea
also seems to have a good deal of merit. A creative solution for the
'private corpus' problem that Chris mentions might help a lot though.
Unfortunately I don't have one at the moment,
Well, user rules are always allowed when 'spamassassin' is run so a --lint
message would have to say if you plan on using spamd your user rules
won't be used.
On the other hand, spamd when called with -Dconfig, will tell you it's not
parsing each of your user rules.
So... do we really want
Note also
echo score MICROSOFT_EXECUTABLE 4 .spamassassin/user_prefs
Isn't that a 2.6x rule that went away in 3.0? I would hope that anything
comparing filtering results (as I would guess this to be, knowing nothing of
it) would be using a reasonably recent version.
(Of course it would
As anecdotal evidence, it's my belief that people are seeing _alarm_ log
records and associated scan failures on both rc1 and rc2, and that they are
occurring with more than just Pyzor. This is anecdotal however, I don't have
any evidence to hand to support that.
I'm personally wondering if this
Better asked on the user's list, where there are people running systems like
that.
Loren
Justin writes:
I think we don't even need to do that; once we get the search directories
recursively code worked out for configuration and rules, plugins will be
loadable from *any* directory in the rules project:
ROOT/rules/group/20_name_of_file.cf
I *think* what Daniel was thinking of here, which should work, is
just using the ifversion commands to conditionalize too-advanced
rules.
Assuming ifversion can be used in the negative also. For instance, we have
one set of meta rules that use addition post-whatever, and do a less-good
job
Just looking from the sidelines, it seems the obvious answer would be to add
a new namespace to the blacklist. eg:
*.2.1.9.ipv6.rbl.example.org.
instead of
*.2.1.9.rbl.example.org.
Since this is for numeric lookups, an alpha or alphanum tag in what would
be the high octet of the ipv4 dotted
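The separate-namespace suggestion can be sketched in Python. The zone name and the "ipv6" label come from the example above; the function name is hypothetical, and writing the IPv6 address as reversed hex nibbles is an assumption borrowed from the way reverse DNS names are built, not something the message specifies.

```python
import ipaddress

def dnsbl_query_name(address, zone="rbl.example.org"):
    """Build a blacklist lookup name, keeping IPv6 under its own sub-zone."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        # Reversed dotted octets directly under the zone, as in existing lists.
        labels = reversed(str(ip).split("."))
        return ".".join(labels) + "." + zone + "."
    # Assumption: IPv6 as 32 reversed hex nibbles under an "ipv6" label.
    nibbles = reversed(ip.exploded.replace(":", ""))
    return ".".join(nibbles) + ".ipv6." + zone + "."

name_v4 = dnsbl_query_name("9.1.2.3")
name_v6 = dnsbl_query_name("2001:db8::1")
```

Because the tag is alphabetic, no IPv4 lookup can ever land in the IPv6 part of the zone, which appears to be the point of the suggestion.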
How big are they? SA is set up to bypass messages over a given size.
The following functions, immediately after they call
Mail::SpamAssassin::Message::Node::decode, need to call a
function that does charset normalization.
* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
*
Agree in general, but possibly...
2. code-tied rules stay with main tree in current rules directory with
the exception of 25_replace.cf which is really just another way to
write body/header rules (basically, the static stuff that is tied to
code does not move to the rules project)
Could you please point this thread at the two bug numbers? I'd like to
target these for a future 3.0.5 bug-fix release, because we are very
unlikely able to upgrade our Enterprise distro to 3.1 in the short to
medium term. (I am hoping in the long term to have both RHEL4 and RHEL5
on
This is quite similar to two recent bugs that caused similar problems if
certain ascii characters immediately followed the URI. Spammers had
exploited at least one of those cases. I don't know what the fix was for
those bugs, but it may have been similar to the change you propose.
Loren
You need to ask this question on the users list. This list is to discuss
spamassassin development.
Are you SURE that was a valid message? If so, it will be the first recorded
instance of X-Message-Info showing up in ham and not only in spam.
Previously that had been a sure sign of a spam tool generated mail.
naming isn't really much of a big deal but it'd be nice to have some way
to keep track of that. (not that I can think of it.)
Look at some of the SARE rule files that Bob maintains. He has a formalized
set of comments that get stuck to rules, and one of these can/does show the
history
a) what the heck are priorities, who sets them, and do they really have any
justifiable purpose? Ie: can they just quietly vanish into the night with
nobody being any the wiser?
They order the rules -- or more correctly, sets of rules.
Most rules are priority 500 (iirc), but some need
I was thinking about the 'best' way to shortcut running rules when they
weren't needed, and suddenly realized there might be cases where it is
necessary to run them even though they won't determine the hammyness or
spammyness of the mail.
In particular, I'm wondering about bayes and awl
It seems obvious that we want to run that -100 rule first. If it hits, the
maximum possible score if *every* other rule hits will be 4, and with a
threshold of 5, the mail can't be spam. So we can stop after the -100 rule
hits, and only run one rule on this mail.
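The bound argument above can be written out as a small sketch. This is Python for illustration only and assumes each rule is a (name, score, test) tuple and that a message is spam when its total reaches the threshold. The function and variable names are hypothetical and do not reflect how SA schedules rules.

```python
def classify_with_early_exit(rules, message, threshold=5.0):
    """Run rules from most negative score upward and stop once the verdict is fixed.

    Returns (is_spam, score_so_far, rules_evaluated).
    """
    ordered = sorted(rules, key=lambda rule: rule[1])
    total = 0.0
    for index, (name, score, test) in enumerate(ordered):
        if test(message):
            total += score
        remaining = [s for _, s, _ in ordered[index + 1:]]
        max_possible = total + sum(s for s in remaining if s > 0)
        min_possible = total + sum(s for s in remaining if s < 0)
        if max_possible < threshold:
            return False, total, index + 1  # cannot reach the threshold any more
        if min_possible >= threshold:
            return True, total, index + 1   # cannot drop below the threshold any more
    return total >= threshold, total, len(ordered)

# The case from the text: a -100 rule hits and all other rules add up to 104.
rules = [
    ("POS_A", 50.0, lambda m: True),
    ("WHITELISTED", -100.0, lambda m: True),
    ("POS_B", 30.0, lambda m: True),
    ("POS_C", 24.0, lambda m: True),
]
result = classify_with_early_exit(rules, "any message")
```

With these numbers the loop stops after the first rule, since the best the remaining rules can do is bring the total to 4.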
This just brought up an
+score BAYES_50 0 0 0.845 0.001 # n=1
+score BAYES_60 0 0 2.312 0.372 # n=1
+score BAYES_80 0 0 2.775 2.087 # n=1
+score BAYES_95 0 0 3.023 2.063 # n=1
+score BAYES_99 0 0 2.960 1.886 # n=1
I think the score for BAYES_99 should be hand tweaked, regardless of what the
score generator said.
This
Example: I am currently writing a very FEW rules, some from
scratch and some by adapting the work or ideas of others from
such lists or web sites.
You have all convinced me that if I post a rule for discussion
that it is then close to worthless.
It depends on how you post it. And it may
How would we determine ham/spam? At this point all we have is SA's
first estimation, and no way of knowing whether this is accurate, FN,
or FP.
All we could reasonably do is take SA's assessment of the message and assume that
statistically it will be correct to one or two sigma or so. If the
More thought ... what if SA systems were to accumulate daily
statistics, along the lines of one record for each rule, containing:
That sounds like the general sort of vague idea I had, fleshed out in more
detail.
Certainly the desirable goal is basically:
1 does this rule hit anything?
2 does
That's why we use 70_sare_name_eng.cf files, to indicate that these
rules work well only on systems which expect almost 100% English ham,
and little to no ham in other languages.
I've begun to wonder whether it might be worth while having
50_scores.cf for English emails, and then
it's not a matter of popularity -- it's a matter of being horrendously
difficult to support.
I grant from what I've seen of PMS that this gets pretty ugly. Or at least
it seems to to me, but then a lot of apparently good Perl looks pretty ugly
to me. ;-) But I'm a C++ and Algol programmer,
I know user rules aren't real popular with the sa dev community, however
that attitude isn't universally shared by sa users. Therefore may I
suggest:
Would it be possible when reorganizing things to come up with some
semi-persistent storage for compiled user rules, so that they don't have to
be
Duncan earlier enscribed:
Masscheck has an interdependency option, although it increases the checking
time. We use it on rules once they seem useful, but not usually in early
one-off checking.
I'm not sure what you mean by this. We have an overlap script which
does some of this -- is that
I'm *really worried* about proposals that involve mailing lists that
have only private archives and require moderator approval for
subscription. It just doesn't feel right for an open source project.
I understand the feeling. I'm trying to balance the obvious desire for a
completely public
I guess you'd have better data than I would; but I'm still having
trouble believing that Spammers are adjusting on that time frame.
Some do; not all do. However, the ones that can adjust in less than a day,
or maybe less than 2-3 days sometimes, tend to be some of the more prolific
spammers.
May I help?
(How will you folks decide)
Well, to paraphrase how we decide in SARE -- do something, we'll watch.
And it really is pretty much that simple.
I expect (and this is personal opinion, I'm not an SA dev) that the rules
subproject will sooner or later consist of anointed
I'd like to see if there's a way to combine the two somehow so that new
SVN commits that update sandbox rules, are immediately mass-checked alone.
However, I can't see a way to do that reliably from SVN commits alone,
because (for example) meta rules may depend on other rules that were not
What I miss most is a transparent dataset about every rule.
I'd like to know
- percentage of false positives
- percentage of false negatives
- percentage of true positives
- percentage of true negatives
- number of mails checked for the results above
- standard deviation of the percentages
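For what it's worth, the figures in that list can be derived from four counts per rule. The Python sketch below makes several assumptions the message does not state: the rule is meant to hit spam, true positives and false negatives are taken as a share of spam, false positives and true negatives as a share of ham, and the "standard deviation" is the binomial standard error of each percentage. All names are hypothetical.

```python
import math

def rule_report(spam_hits, spam_total, ham_hits, ham_total):
    """Summarise one spam-targeting rule from mass-check style hit counts."""
    def pct(count, total):
        return 100.0 * count / total if total else 0.0

    def stderr_pct(count, total):
        # Binomial standard error of a proportion, expressed in percent.
        if not total:
            return 0.0
        p = count / total
        return 100.0 * math.sqrt(p * (1.0 - p) / total)

    return {
        "true_positive_pct": pct(spam_hits, spam_total),
        "false_negative_pct": pct(spam_total - spam_hits, spam_total),
        "false_positive_pct": pct(ham_hits, ham_total),
        "true_negative_pct": pct(ham_total - ham_hits, ham_total),
        "mails_checked": spam_total + ham_total,
        "true_positive_stderr_pct": stderr_pct(spam_hits, spam_total),
        "false_positive_stderr_pct": stderr_pct(ham_hits, ham_total),
    }

report = rule_report(spam_hits=90, spam_total=100, ham_hits=1, ham_total=100)
```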
Sidney writes:
Perhaps we could use SVN to check in rule submissions so they are version
controlled and tracked, and have emails refer to file paths and version
numbers instead of attaching the rules. Would that be too complex for the
people we want to attract compared to mailing in sets of rules
Could the list be a semi-private one, with moderated subscription and
posting? That'd take care of rules in development being exposed
to spammers while they're still being worked on, at least partially.
The SARE list is private and invitation only for exactly these reasons.
You don't want to
Sidney writes:
Dealing with metarules and modifications to them presents a problem in any
case. How do we deal with person X submitting a modification to metarule A
and proposed rule A1, while person Y submits a different modification to
metarule A and proposed rule A2 while person Z submits
Dealing with metarules and modifications to them presents a problem in any
case. How do we deal with person X submitting a modification to metarule A
and proposed rule A1, while person Y submits a different modification to
metarule A and proposed rule A2 while person Z submits proposed
---
I guess that part of making the rule submission and test process nimble is
for the submitted rules to be independent of anything else. That makes
changing metarules less of a nimble process. That's fine, because metarules
are really just an optimization which can be implemented after the fact
A big part (perhaps the biggest part) of rules development is the mass
check. Most anyone can develop a rule on their home system and see how they
*think* it works.
Some few (but not many) people can do a mass-check on their home system and
see how it *really* works - *for them*.
As proposed,
As rules are put into the sandboxes, they become part of svn. When the
nightly mass-checks are run, each person pulls the latest rules sandboxes
from svn and does their mass-check with all of those, then rsyncs the
results back up to the central site once the mass-check completes.
I think I