Re: Dropping mail

2018-04-29 Thread Linda Walsh



Dianne Skoll wrote:

On Fri, 27 Apr 2018 14:39:43 -0500 (CDT)
David B Funk  wrote:

[snip]


Define two classes of recipients:
   class A == all users who want everything
   class B == all users who want "standard" filtering


This works if you have a limited number of classes, but in some cases
users can make their own rules and settings so the number of classes
can be the same as the number of RCPTs.

---
Except that users who have their own rules are unlikely
to be applying them at the point where the server initially
decides whether or not to accept the email.  I.e. they'll run some
anti-spam filter in their "account" context, as a normal user.

The users who want "standard filtering" may have it
applied such that the email is never accepted onto their
mail server at all.

I.e. it "should" never be the case that user-defined
filters run in the MTA's initial receive context: at that point
the MTA has just received (or is still receiving) an email coming
in on a privileged port (like port 25), which implies a
privileged context (most likely root).
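The two-class split quoted above boils down to a per-recipient policy lookup at RCPT time. A minimal sketch (all addresses and class names here are invented for illustration; real MTAs would do this via access maps or milter callbacks):

```python
# Hypothetical per-recipient policy table: class A recipients want
# everything delivered unfiltered; everyone else gets standard filtering.
CLASS_A = {"archivist@example.org", "abuse@example.org"}

def rcpt_policy(rcpt):
    """Return the filtering class to apply for this RCPT TO address."""
    return "accept-all" if rcpt.lower() in CLASS_A else "standard-filter"
```

In the worst case described above, the table has one entry per recipient, which is why fully per-user settings can't be collapsed into a few classes.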


Even in the two-class case, there's still a delay for the subsequent
class(es).

---
Delays are not the same as dropped email.  People use
grey-listing, which often causes some delay; but in this case,
I've seen examples of people whose mail-provider was
inspecting+filtering emails for spam+malware also having
problems in delivery time (60-90 minutes after the fact).


  So it is already the case that mail-providers who do
filtering on the mail-server are sometimes slow to pass
on the email to their users, depending on their volume.

linda



Re: how to enable autolearn?

2017-01-10 Thread Linda Walsh

Marc Stürmer wrote:

Am 2017-01-09 22:30, schrieb L A Walsh:

I have:

bayes_auto_learn_threshold_nonspam -5.0
bayes_auto_learn_threshold_spam 10.0


In order for autolearn to work you need at least 200 trained messages 
in the ham and spam category. If the filter doesn't know enough mails 
yet it will state it in the log file.


Have you trained your Bayes filter accordingly or just enabled it and 
expect it to start autolearning out of the box?


   My corpus is regularly pruned, but I still have daily junk-mail
logs going back to 2014 (~776 files over the 3 years, where each file
contains a day's spam).  I did have junk going back much farther, until I
decided it was a bit too much and a bit too dated.


Also take a look at:
https://wiki.apache.org/spamassassin/AutolearningNotWorking


   I'm not terribly worried, since every night all the junk messages
get fed in as spam.  Anything I catch as non-spam gets tossed in
the auto-despam folder, but those are only a few a week.


Quote: "Finally, SpamAssassin requires at least 3 points from the 
header and 3 points from the body, to auto-learn as spam. If either 
section contributes fewer points, the message will not be auto-learned."


I guess the latter might just be the case with your installation.

---
??? How so?  Though for the past few days I've been getting many spam mails
that are mostly header, with little to no body, plus an attachment ...
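The 3-points-from-header / 3-points-from-body gate quoted above can be sketched as follows (a simplified illustration of the described behavior, not SpamAssassin's actual code, which lives in the AutoLearnThreshold plugin):

```python
def autolearn_as_spam(header_points, body_points, score, spam_threshold=10.0):
    """Simplified sketch: a message only auto-learns as spam when BOTH
    the header rules and the body rules contribute at least 3 points,
    AND the learn score clears the spam threshold."""
    if header_points < 3.0 or body_points < 3.0:
        return False  # one section contributed too few points
    return score >= spam_threshold
```

So a mostly-header spam with an attachment, as described above, can score high overall yet still fail the body-points requirement.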




Re: how to enable autolearn?

2017-01-09 Thread Linda Walsh

John Hardin wrote:

On Mon, 9 Jan 2017, L A Walsh wrote:

I have:
   bayes_auto_learn_threshold_nonspam -5.0
   bayes_auto_learn_threshold_spam 10.0
in my user_prefs. When I get a message though, I see autolearn being 
set to 'no':
  X-Spam-Status: Yes, score=18.7 req=4.8..autolearn=no 
autolearn_force=no

Shouldn't a score of 18.7 trigger an autolearn?


Not all rules contribute to the score used for the autolearn decision. 
Particularly, the BAYES rules don't contribute to the autolearning 
decision in order to avoid positive feedback loops.


   That's why my "bayes_auto_learn" thresholds were fairly high.

So why is it called bayes_auto_learn_threshold if it isn't used for
auto-learning?  Isn't that a bit confusing?
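The point John makes above, that BAYES rules are excluded from the autolearn decision, amounts to recomputing the score without any BAYES_* hits. An illustrative sketch (the rule names and scores below are invented; this is not SpamAssassin's actual implementation):

```python
def autolearn_score(hits, scores):
    """Recompute a message's score for the autolearn decision,
    dropping BAYES_* rules to avoid the feedback loop described above."""
    return sum(scores[rule] for rule in hits if not rule.startswith("BAYES_"))

# Hypothetical rule hits: the overall message score counts BAYES_99,
# but the score used for the autolearn decision does not.
scores = {"BAYES_99": 3.5, "URIBL_BLACK": 1.7, "RDNS_NONE": 0.8}
hits = ["BAYES_99", "URIBL_BLACK", "RDNS_NONE"]
```

This is why a high-scoring message can still show autolearn=no: much of the total may come from rules that don't count toward the learn score.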




Re: How to report spam to mailspike

2014-08-29 Thread Linda Walsh

Dave Warren wrote:

On 2014-08-29 02:38, Marcin Mirosław wrote:

So what should I do in your opinion? I'm getting spam to my private
spamtrap so I can't fill fields about company - it doesn't matter where
I'm hired for reporting spam. What if I would be unemployed? Then I
would have to lie about company? IMHO it is the way to hinder sending
complaints from users.


If you're not willing 

---
I think perception may be "am not able"... ?
to provide the information they request, and they won't accept an 
inquiry without it, then you're left with a different choice: 1) Do 
nothing, 2) Cease using the service.


From their perspective, either the policy will ...

---
If they really mean company then it helps them target companies for 
their own advertising.



If I'm acting on my own behalf, I'd put "Personal" or "None" or "N/A" 
into a form, and if it's not accepted, oh well.

---
Ditto on this... Company "Self" has been in business for decades!  ;-)

"They" are definitely a "Service provider"... (think of all the things
'self' does for you!) ;-)  Corporation was a way of "embodying" a
business practice to give it human rights... but you are already
"embodied", thus incorporated (no offense to the non-corporeal beings
reading this list).  I'm sure you govern yourself as well, if you want
to get technical; so if they want to get technical, so can others...


Then again, are they worth the bother?






Re: Advice sought on how to convince irresponsible Megapath ISP.

2014-08-17 Thread Linda Walsh

Karsten Bräckelmann wrote:

Similarly, your scripts do not reject messages, but choose not to fetch
them.

===
   No... fetchmail fetches them; "sendmail" rejects them because they
don't have a resolvable domain.  My sorting and spamassassin scripts
get called after the email makes it through sendmail, so my scripts
never see the rejected email.



Pragmatic solution: If you insist on your scripts to not fetch those
spam messages (which have been accepted by the MX, mind you), automate
the "manual download and delete stage", which frankly only exists due to
your choice of not downloading them in the first place. Make your
scripts delete, instead of skipping over them.


   As far as I know, fetchmail isn't able to tell whether a sending
domain is invalid until it has fetched the email.  fetchmail then tries
to deliver the email to me via sendmail, which doesn't accept the email
because it is invalid.  Unfortunately, my ISP doesn't use sendmail, or
it would reject such emails by default.
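The rejection described here hinges on whether the envelope sender's domain resolves; sendmail applies such a check by default (it can be turned off with the `accept_unresolvable_domains` feature). A rough sketch of the idea, not sendmail's actual code:

```python
import socket

def sender_domain_resolves(envelope_from):
    """Sketch: does the envelope-sender's domain resolve at all?
    MTAs like sendmail refuse mail when this lookup fails."""
    domain = envelope_from.rsplit("@", 1)[-1]
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False
```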


Be liberal in what you accept, strict in what you send. In particular,
later stages simply must not be less liberal than early stages.


   In this case, I don't even want the invalid email passed on to me.
I don't want to accept spam.  The first defense is to have the MX
reject non-conforming email.


Your MX has accepted the message. 
My ISP's MX has accepted it, because it doesn't do domain checking.
My machine's MX rejects it, so fetchmail keeps trying to deliver it.

While I *could* figure out how to hack sendmail to not reject the
message, my preference would be to get the ISP to act responsibly and
reject emails without a valid return domain name.  Rejection of such
email is standard in sendmail and is called for in the RFCs.  The choice
not to follow the RFCs allows spam that would normally be rejected
through to my system, which does follow the standards and rejects it --
so it stays in the "download queue" for my domain.

At that point, there is absolutely no
way to not accept, reject it later. You can classify, which you use SA
for (I guess, given you posting here). You can filter or even delete
based on classification, or other criteria.

The MX shouldn't accept it, based on the RFC standards.  When I asked
for it to be blocked, I was first asked for the name of the "offending
domain" and told I could blacklist the domain by adding it to a list
with their web-client.

When I asked for a scriptable way to do this after a domain lookup,
they said they no longer offer the scripted solutions that the ISP I
originally signed up with (whom they bought) did.



The only response my ISP will give is to turn on their spam filtering.
I tried that.  In about a 2-hour time frame, over 400 messages were
blocked as spam.  Of those, fewer than 10 were actually spam; the rest
were from various lists.

So having them censor my incoming mail isn't going to work, but neither
will they reject the obviously invalid-domain email.

I can't believe that they insist on forwarding SPAM to their users even 
though they know it is invalid and is spam. 


There is no censoring.
When I complained about the problem I found that "recommended filter
rules" had been activated on my account.  In the couple of days they
were active, about 80% of the messages they caught were not spam -- and
some of the bad domains still got passed through.

 There is no forwarding.

It comes in their MX, and is forwarded to their users.


Any ideas on how to get a cheapo, doesn't-want-to-support-anything ISP
to start blocking all the garbage they pass on?


Change ISP. You decided for them to run your MX.


   I didn't decide for them; I inherited them when they bought out the
competition to supply lower-quality service for the same price.


It is your choice to aim for a cheapo service (your words).

It wasn't when I signed up.  It cost $100 extra/month; it's only $30
extra/month now that I don't host the domain with them.

 If you're
unhappy with the service, take your business elsewhere. Better service
doesn't necessarily mean more expensive, but you might need to shell out
a few bucks for the service you want.


I already am... my ISP (cable company) doesn't have the services I want
for mail hosting.  I went to another company for that, which was bought
out some time ago, with the new company dropping quality as time goes
on.  In this case, I wanted to try to push back against them accepting
the illegal (not-to-spec) spam and forwarding it to their customers in
the first place.

There are many "compromise" solutions available.  Certainly such
choices are not my first, which is why I posted here: to see if anyone
else had experience getting an irresponsible ISP to reject
non-compliant email, and barring that, maybe to be offered better
choices from the experience of the people on this list.


Cheers!
'^/



Re: SA 3.3.2 buggie? -- message that DB file doesn't exist -- but systrace shows successful lock and open!

2012-01-17 Thread Linda Walsh



Michael Scheidell wrote:

On 1/16/12 9:36 AM, Linda Walsh wrote:

This is not permission problem --
Message I get:

have you tried to upgrade to the released version? 3.3.2?

3.0.2 was obsolete 6 years ago.

---
Well, I could pretend you wouldn't have guessed it was a typo and tell
you that you were right, and that after installing 3.3.2, I had the
exact same bug...

but I'll just mention that it is 3.3.2...
 sa-learn --version
SpamAssassin version 3.3.2

...which is 'still' having this problem (though I will note that most
of the references to this error message date back to the early 3.0
series).

So this bug has been out there for over 6 years (according to your timeline...)


That's a long time to ignore a widely found bug (judging by googling on
it) -- the only offered solution was to check permissions, which I
verified in the trace as not being the problem -- the message being
issued by SA was bogus.
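One way this misleading symptom can arise: the file opens fine at the OS level, but the database layer rejects its contents, and the "tie failed" wording makes that look like a missing file or a permissions problem. A hypothetical analogy using Python's dbm layer (not what SpamAssassin itself does; SA's Bayes store uses Perl's DB_File tie):

```python
import dbm
import os
import tempfile

# A readable file that no DB layer recognizes: open() succeeds on it,
# yet a database "tie" would fail -- even though permissions are fine.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "bayes_toks")
with open(path, "wb") as f:
    f.write(b"not a Berkeley DB file")

readable = os.access(path, os.R_OK)   # permissions are fine
recognized = dbm.whichdb(path)        # '' means: format unrecognized
```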



SA 3.0.2 buggie? -- message that DB file doesn't exist -- but systrace shows successful lock and open!

2012-01-16 Thread Linda Walsh

This is not permission problem --
Message I get:

bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/O: tie failed:
bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/W: tie failed: No such file or directory


---
Except I followed it through using strace.

Both are being opened and the 2nd is even successfully being LOCKED:

Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: trying to get lock on /home/lw_spam/.spamassassin/bayes with 0 retries
link("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", "/home/lw_spam/.spamassassin/bayes.lock") = 0
Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: link to /home/lw_spam/.spamassassin/bayes.lock: link ok



before it is opened... then
SA turns around and claims it can't find them...


So why is SA opening the files, but then writing out a completely BOGUS
and false message that it couldn't open them or even find them?!?!...

Whatever the problem is -- a better error message that isn't LYING
would be a good thing at this point.  In searching on the web, I see a
lot of people getting this, and it's often blamed on their
permissions... but now everyone should know that permissions are not
it... the message is completely bogus... it can open them just fine --
something else may be wrong, but the message is very misleading.


A more complete log follows... (I deleted all the lines that had
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
in them -- they were 2/3rds of the debug messages):


Jan 16 06:17:34.798 [20156] dbg: bayes: tie-ing to DB file R/O /home/lw_spam/.spamassassin/bayes_toks
stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0

open("/home/lw_spam/.spamassassin/bayes_toks", O_RDONLY) = 3
bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/O: tie failed:

Jan 16 06:17:34.799 [20156] dbg: bayes: untie-ing DB file toks
Jan 16 06:17:34.799 [20156] dbg: config: score set 1 chosen.
Jan 16 06:17:34.800 [20156] dbg: sa-learn: spamtest initialized
Jan 16 06:17:34.800 [20156] dbg: learn: initializing learner
Jan 16 06:17:34.800 [20156] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x286d1f8) implements 'learner_sync', priority 0

Jan 16 06:17:34.801 [20156] dbg: bayes: bayes journal sync starting
stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0
stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0
stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0
stat("/home/lw_spam/.spamassassin/bayes_journal", 0xe3e138) = -1 ENOENT (No such file or directory)

Jan 16 06:17:34.801 [20156] dbg: bayes: bayes journal sync completed
Jan 16 06:17:34.802 [20156] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x286d1f8) implements 'learner_expire_old_training', priority 0

Jan 16 06:17:34.802 [20156] dbg: bayes: expiry starting
stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0
stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0

Jan 16 06:17:34.803 [20156] dbg: locker: mode is 384
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=834, ...}) = 0
open("/etc/resolv.conf", O_RDONLY)  = 3
open("/etc/host.conf", O_RDONLY)= 3
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
open("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
Jan 16 06:17:34.805 [20156] dbg: locker: safe_lock: created /home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156
Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: trying to get lock on /home/lw_spam/.spamassassin/bayes with 0 retries
link("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", "/home/lw_spam/.spamassassin/bayes.lock") = 0
Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: link to /home/lw_spam/.spamassassin/bayes.lock: link ok

unlink("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156") = 0
lstat("/home/lw_spam/.spamassassin/bayes.lock", {st_mode=S_IFREG|0660, st_size=26, ...}) = 0
Jan 16 06:17:34.807 [20156] dbg: bayes: tie-ing to DB file R/W /home/lw_spam/.spamassassin/bayes_toks

open("/home/lw_spam/.spamassassin/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
stat("/home/lw_spam/.spamassassin/__db.bayes_toks", 0xe3e138) = -1 ENOENT (No such file or directory)
stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0

open("/home/lw_spam/.spamassassin/bayes_toks", O_RDWR) = 3
Jan 16 06:17:34.808 [20156] dbg: bayes: untie-ing DB file toks
open("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", O_WRONLY|O_CREAT|O_EXCL, 0700) = 3

unlink("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156") = 0
lstat("/home/lw_spam/.spamassassin/bayes.lock", {st_mode
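The link()-based locking visible in the trace above is the classic NFS-safe technique: create a uniquely named temp file, then hard-link it to the shared lock name. A sketch of the general idea (not SpamAssassin's actual Mail::SpamAssassin::Locker code):

```python
import os
import tempfile

def safe_lock(lockpath):
    """Try to take a link()-style lock; return True on success.
    link(2) is atomic, even over NFS, so at most one process can
    create the link and win the lock."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(lockpath) or ".")
    os.close(fd)
    try:
        os.link(tmp, lockpath)   # atomic: fails if lockpath already exists
        return True
    except FileExistsError:
        return False
    finally:
        os.unlink(tmp)           # the temp name is no longer needed either way
```

This matches the create / link / unlink sequence in the trace, which is why the lock steps succeed even while the later tie fails.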

BUGs? (Re: Upgraded to new spamassassin(3.3.2(, and now it won't work (no rules.....ran sa-update, nada...)_

2011-10-31 Thread Linda Walsh



Linda Walsh wrote:


Sorry, included that in my subject

I did run sa-update, all it says (put it in verbose mode) is that
the rules are up to date.

Initially it did download the rules into
/var/lib/spamassassin//.

Those files are still there, but spamd is,
apparently,  not seeing them.




---
	I finally solved this by copying all the rules from
/etc/mail/spamassassin into /usr/share/spamassassin... and then it was
all happy.

Of course this would seem to break multiple rules about where to keep
the various rule sets... but hey, maybe spamd didn't get the memo about
where it was supposed to look...  I dunno...



	It IS looking in /var/lib/spamassassin, but that was failing because it
couldn't find 'Mail::SpamAssassin/CompiledRegexps/body_0.pm'.

It's the file that includes all the compiled expressions.

The path structure under /var/lib/spamassassin is a bit
confused/confusing.  That's what caused that error.  I.e. the dir
struct under /var/lib/spamassassin looks like:



Ishtar:/var/lib/spamassassin> tree -FNhsR --filelimit 7
.
├── [  71]  3.003002/
│   ├── [4.0K]  updates_spamassassin_org/  [61 entries exceeds filelimit, not opening dir]
│   └── [2.5K]  updates_spamassassin_org.cf
├── [  18]  compiled/
│   └── [  21]  5.012/
│       └── [  50]  3.003002/
## under the 3.003002 dir under compiled is where it gets interesting:
## two trees for Mail/SpamAssassin/CompiledRegexps, one rooted here,
## the other 'down' a level under 'auto' (where the real binaries are).
│           ├── [  17]  auto/
│           │   └── [  25]  Mail/
│           │       └── [  28]  SpamAssassin/
│           │           └── [  35]  CompiledRegexps/
│           │               ├── [  54]  body_0/
│           │               │   ├── [   0]  body_0.bs
│           │               │   └── [1.3M]  body_0.so*
│           │               └── [ 51K]  body_0.pm  # was missing
│           ├── [237K]  bases_body_0.pl
│           └── [  25]  Mail/
│               └── [  28]  SpamAssassin/
│                   └── [  22]  CompiledRegexps/
│                       └── [ 51K]  body_0.pm  # copied to above
├── [2.6K]  user_prefs*
└── [2.6K]  user_prefs.template*

13 directories, 8 files


As for why it didn't find rules in /etc/mail/SA (it DID read
/etc/mail/SA, just didn't regard anything there as a rule)... so duping
those files into /usr/share/SA made them magically become 'rules'.

I assume, of course, this is correct-by-design behavior?   ;-)
(*cough*)


Re: Upgraded to new spamassassin(3.3.2(, and now it won't work (no rules.....ran sa-update, nada...

2011-10-20 Thread Linda Walsh

Sorry, included that in my subject

I did run sa-update, all it says (put it in verbose mode) is that
the rules are up to date.

Initially it did download the rules into
/var/lib/spamassassin//.

Those files are still there, but spamd is,
apparently,  not seeing them.




Martin Gregorie wrote:


Run sa_update.

SA packages from 3.x onwards don't include the rule set to avoid
installing stale rules. 


A good install will have added /etc/cron.d/sa-update to your system. It
runs a daily update at 04:10, but you can run it manually if it hasn't
already picked up a rule set.

Martin







Upgraded to new spamassassin(3.3.2(, and now it won't work (no rules.....ran sa-update, nada...

2011-10-20 Thread Linda Walsh

I wanted to try to head off an increasing spam count I'd gotten
since I upgraded my suse server to 11.4 ...

So I tried cpan to go to 3.3.2, but now... it says...
no rules!... I've tried putting rules in just about every dir I can
think of...

I had it running as a daemon before - I thought it ran as spamd, but
I could be wrong -- just that there have been user and group spamd
on my system since I first installed SA a few years ago.  I'd installed
some suse versions since then and they still worked, and I don't know
how they ran.  The latest incarnation didn't work as well, so I went
back to try the cpan version... *ouch*...

where is it looking for its rules, in /dev/null?!?

I tried running it as root, as spamd, giving it a home dir,
in /usr/share/SA, /var/lib/SA, /etc/mail/SA, (not to mention
~/.SA... )...

"spamassassin --lint" comes back 'silently'...no errors...

so if there are no errors, why does it claim it can't find any rules?

Ya'd think SA --lint would notice that as being a problem...?


:-(

ideas?










Re: HT-perf, paralism, thruput+latncy (dsk, net, RBLs) powr usg/meas, perlMiltring & ISP's reducng spamd latency

2009-08-08 Thread Linda Walsh

Nix wrote:


[This is really OT for spamassassin, isn't it? Should we take it
off-list?] 

--

a bit -- and somewhat not.  Much of it boils down to speed: how best to
do it, parallelism, new hardware features... lowering latency... etc.

I'd really hoped to speed up my SA processing -- at least it can handle
a sizable concurrent load now; that's an improvement.  I need to figure
out a way to cache or speed up the network requests -- I'm sure it's
mostly latency on the servers I'm checking with.  The highest my
download speed went was about 500K (on a 768K DSL)... it's all in
packet latency; that's the problem.


On 8 Aug 2009, Linda Walsh spake thusly:
OK, you've out-RAIDed me.



   It's a server.  Mostly unraided... sorta... 4 of them are in 2 VD's in
mirror mode.  The system disk is a 15K SAS, but only 70G space.  The rest
are what RAID is supposed to be -- Redundant Arrays of _Inexpensive_ Disks
(SATA).  Boy, was Dell pissed.  They really don't like selling bare-bones
systems.  I had to buy the disk trays elsewhere (Dell won't sell them
separate from a disk).  Only 1 VD is a real RAID(5), with a whopping
3 disks... ooooo...  2 disks are sitting around as spares until I can
figure out how to add them to existing arrays (supposed to be 'easy',
and the controller rebuilds) -- but nooo... I'm just spending too much
time on the computer solving mail-filter problems while forcing myself
up to speed w/ perl 5.10's new features, CSS, and fonts again (I just
hosed my desktop's fonts, so need to reboot... oops).



I'd also prefer my own *choice* of whether or not to use the on-disk
cache ...  Maybe some of this control will get into the lk -- or does
the bios have to support everything?


Well, you'll never get the option to turn off the Linux kernel's disk
cache,

---

   On-disk cache = 16-32MB of cache on the disk itself.  It really speeds
up writes when you are writing small chunks, as it can coalesce the writes
to physical positions on disk -- while the kernel only uses a generic
'model' for all disks.  The real internal geometry is completely hidden
these days -- you can see it talked about on Tom's HW occasionally when
they bench a disk.  You see fast constant speeds at track 0 (outside of
the disk), then you see multiple 'drops' as the sectors/track shrink due
to lower diameter.  But the on-disk cache -- all the kernel developers dis
it, because they run unstable kernels that can leave up to 32MB in a write
buffer on a disk if it gets reset or loses power before it finishes
flushing its cache.  But on a system on a UPS, not running test-kernels
all the time, unplanned shutdowns are rare, so the speed-up is worth it.
Just like the RAID controller itself has its own battery-backed
(non-extensible) RAM (it doesn't know about UPS's and such).  My previous
server lasted 9 years... I feel in large part due to it being on a
power-conditioned UPS (an APC SmartUPS that supposedly puts out a sine
wave, despite my flakey PG&E power).



fast speeds -- then  because executables and shared libraries run out of
it,

---

   I'm more worried about large write speeds.  There, circumventing the
system cache and using direct I/O can get you faster throughput when doing
disk-to-disk copies -- the limiting factor is the target disk's write rate,
and no kernel cache will help.   What does help is overlapping reads from
one device with writes to the other device that fit in its buffer.
Then you can theoretically get _closer_ to, but not quite, double the
throughput (as writes are slower).  But if you write in, say, 24MB chunks
to a 34MB on-disk (no RAID) buffer, it can often get the data out while you
are reading the next 24MB from the 1st disk.
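The overlapped read/write scheme just described can be sketched with a single writer thread (illustrative only; real direct-I/O code would need aligned buffers, error handling, and tuning):

```python
import threading

def overlapped_copy(src, dst, chunk=24 * 2**20):
    """Copy src to dst, reading the next chunk from the source while
    the previous chunk is still being written -- so both devices stay
    busy instead of alternating read/write."""
    buf = src.read(chunk)
    while buf:
        writer = threading.Thread(target=dst.write, args=(buf,))
        writer.start()
        nxt = src.read(chunk)    # overlaps with the write in flight
        writer.join()            # wait before issuing the next write
        buf = nxt
```

Usage would be something like `overlapped_copy(open("/backup/src.img", "rb"), open("/mnt/dst.img", "wb"))` (paths invented here).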

   If you go through the kernel's system cache, it throws everything off
-- you can watch it -- the kernel will give priority to reads (as it
should, since reads usually block the CPU or the USER from getting things
done, while writes can *usually* be done lazily in the background).  But
on D:D copies of large multi-gig files, you want write and read to be
exactly balanced for optimal throughput.  But that's a *special* case,
when you are moving large data files around (for example a 157G full
backup -- and that's gzipped, because bzip2 uses too much cpu (lzma is
even slower, but way better at compression)).  On my old server (which
died after 9 years; started with a p...@400mhz, ended with dual P-IIIs at
1GHz, but 256K cache each... hardly better than a Celeron!) bzip2 would
slow down backup writing to disk to about 600K/s!  gzip only cut speed by
about half (from 20MB/s for raw data to 10MB/s).  Compressed backups are
nice, BUT, when you need to access them -- if you need to unpack a 100+G
level zero... ouch... just to uncompress it would take hours!


   So ... while my new server is relatively fast -- I sorta earned it --

Re: OT: Nehelam's New HT ability....

2009-08-08 Thread Linda Walsh



Per Jessen wrote:
But how about the core subject here - the hyperthreading? Have you 
noticed anything very different wrt that?  I haven't, but it will 
certainly depend on your workload.


Definitely will depend on workload.  But I noticed more power
consumption, and it seemed to handle more real work in the new
HT's; I haven't had the sys long enough to do a lot of benching, though.

Major noticeable diff in sys load and fans kicking in when running
4 cpu-intensive processes vs. 8 (used multiple copies of
ssh-keygen -b 16384 to keep it busy -- 8192 finished too quickly... ~10-15 secs).
Sigh


Re: OT: Nehelam's New HT ability.... and ability to handle spamd high load (preheating cache?)

2009-08-07 Thread Linda Walsh

My bios doesn't allow shutting off HT, but does allow turning off
2 or 3 cores (allowing dual or single) -- I'd rather see that type
of feature at runtime - allowing system load to decide whether to activate
another core -- though the diff on my 2.6GHz in power consumption
went from about 157 watts (according to its front panel) to over
260 when I loaded all 8 'virtual' cores (only 4 cores x 2 HT's/core).

That's w/8 hard disks inside (though not under load...just spinning).

Seems to be no way on my machine (Dell is so limiting sometimes), to
turn off unused hard drives, or only spin them up when I want to use
them -- Some are hot-spare or just unconfig'ed, yet they spinup.

I'd also prefer my own *choice* of whether or not to use the
on-disk cache as well as the raid controller's cache.  I virtually never have unplanned shutdowns (it's on a UPS that will run for >1 hour under its load).


Maybe some of this control will get into the lk -- or does the bios have
to support everything?

Supposedly it has temp and electrical monitoring 'galore', but I can't
even read the DIMM temps.  I went with the 'eco' power supplies at 570W (vs.
870).  But I got the dual power supply backup -- I think, from what I can
measure, it splits the power usage between the supplies unless one goes out.
Could that mean I really have 1140W available?  Dunno.  Not sure exactly
what 'spare' means -- whether it limits total consumption to the level of 1
supply even though it splits the load (I hooked a power meter to one supply
and watched it go to half load when the other was plugged in).

BTW, I'm running at 1333MHz, so maybe it's a heat dissipation prob and not
power?  I'm only pulling 157-160 to a max of 260 (didn't have disks
churning though -- was just running copies of ssh-keygen -b 16384 -- that seems to take it a little bit... 8192 comes out in about 10 seconds though. :-) ).


Oblig sa-users: I may finally have my 'dead-email' restart problem solved.
Before, if I had a large queue, I often had to stop fetchmail -- download only
10-20 at a time so its emails wouldn't overload my sendmail queue (it gets
backed up on spamassassin).  My minimum time for SA (w/ network tests) is around
3 seconds.  But during heavy loads it can really go high -- and my machine can
just run out of memory and process space.
(Part of it is sendmail looking up hosts of received email and bind starting
'cold' (no cache).)  But I started with 2700 emails... after the # of processes
got to about 900, I chickened out a bit and paused the fetchmail until they
dropped under 400 (note, 'load' never went over '2' the whole time, so it was
mostly network wait time).  But after the initial clear I had about 2200 emails
left and just let it run.  At that point, I could see it keeping up -- bind's
cache was a lot warmer now, so not as much network traffic.

I added up the 'delay time' taken by spamd when running my email inputs (it's
actually my filter's delay time, but the max diff between the two is about .01
seconds, so it's mostly spamd delay) -- my stats for today from ~9:30am
are (n = #emails):
n=4513, min=3.27s, max=208.09s, ave=35.16s, mean=27.43s

I suppose for RBLs, some of those results are cached in bind as well?

I wonder if there's any way to speed up priming the cache before downloading a
bunch of emails (not that I'm offline for that long, usually) -- it's sorta
too bad bind doesn't save its DB on disk at shutdown and read it back in
after a reboot, then expire entries as needed...


Nix wrote:

On 1 Aug 2009, Linda Walsh stated:



Per Jessen wrote:

Not sure about that - AFAICT, it's exactly the same technology. (I
haven't done in exhaustive tests though).



Supposedly 'Very' different (I hope)...


Oh yes. I have a P4 here (2GHz Northwood), and two Nehalems (one 2.6GHz
Core i7 with 12Gb RAM and a 2.26GHz L5520 with 24Gb, hello overkill).
Compared to the P4s, the Nehalems are *searingly* fast: the performance
difference is far higher than I was expecting, and much higher than the
clockspeed difference would imply.

Things the P4 takes half an hour to do, the Nehalems often slam through
in a minute or less (!), especially things like compilations that need a
lot of cache. Surprisingly, even some non-parallelizable things (like
going into a big newsgroup in Gnus) are hugely faster (22 minutes versus
39 seconds: it's a *really* big newsgroup).

I suspect the cause is almost entirely the memory interface and cache.
The Northwood has, what, 512Kb L2 cache? The Nehalem has 256Kb... but it
has 8Mb of shared L3 cache, and an enormously faster memory interface
(the FSB is dead, Intel has a decent competitor to HyperTransport at
last).

I was an AMD fan for years, but the Nehalem has won me back to Intel
again.


1) You ca

Re: OT: Nehelam's New HT ability....

2009-08-01 Thread Linda Walsh



Per Jessen wrote:

Not sure about that - AFAICT, it's exactly the same technology. (I
haven't done in exhaustive tests though).  



Supposedly 'Very' different (I hope)...
1) You can't turn it off in the BIOS.
2) Claim of benefit from increased cache (FALSE).
	(I have an older 2x2 dual-core machine with 4MB L2 cache/dual core;
   if you only use 1 core/CPU, that's 4MB L2 cache/core.)

   The new machine has 1 quad core (dual-core CPU's are too slow to use
   memory faster than 800MHz -- only quad cores go up to QuickPath
   interconnect speeds that will support the fastest memory of 1333MHz,
   even if you only have 1 CPU).  So you are 'encouraged' to go with
   quad over 2x2 dual.  The quad has 8MB L3 cache, w/ 256K dedicated
   L2/core -- so with HT, 128K/thread.  Two cores get 256K L2 each,
   plus the 8MB L3 shared.  So about 3.125% more memory!  WOW!...
   (though the bandwidth from the fast-core processors to main memory
   can be 2x faster).
3) Here's a possible benefit: they've added more parallel resources to
each core -- so each thread can possibly get more done than the
old threads -- but this is only a maybe, depending on workload.

The biggest cool thing about Nehalem is power savings -- they implemented
Celeron's power-step tech in a big way.   Quiescent cores crank down their
clocks independently to about 60% of top speed and have efficient sleep
states (I think some cores can be halted, but not sure).  Some of their
processors have a 'turbo mode', which runs some small amount faster
than the speed on the chip label (does that mean the turbo chips are really
faster-rated chips...you tell me), BUT if fewer cores are used -- say only
2/4 -- the turbo boost can be a small amount greater (don't have access;
don't know if any is published).  If one was to go from their marketing
graphs (HAHAHAHAHA), turbo for 4 cores is about 10% more, and if only 2/4
cores are running, it's an additional 10%.  So marketing hype/reality
might mean 1-3% faster?


I will say this much -- @ idle, w/8 disks (it's a server, so built-in GPU
with 8MB shared memory, if you aren't going headless) -- with dual/redundant
PS, it uses 157W (1 PS, slightly more efficient at 146W).  Major power
savings with possible big increases in speed.  But you can't turn off HT
as in previous machines (at least not in the one I've had access to).

That power consumption is less than half their older Workstation model's
(though an idle graphics card still sucks quite a bit of useless ergs --
stupid Nvidia)...


Oblig SA content: When I ran 100 msgs through my filters (that connect to
spamd, but that uses net), the MHz immediately jumped from ~1596 up to 2300 on 
each of the '8' HT cores...so might be perfect for a server that gets sporadic 
loads! ;-)

-linda








Re: Parallelizing Spam Assassin

2009-08-01 Thread Linda Walsh

Well -- it's not just the cores -- what was the usage of the cores that
were being used?  Were 3 out of the 8 'pegged'?  Are these 'real' cores, or
HT cores?  In the Core2 and P4 archs, HT actually slowed down a good
many workloads unless they were tightly constructed to work on the same
data in cache.  Else, those HTs did just enough extra work to block cache
contents more than anything else.

What's the disk I/O look like?  I mean, don't just focus on idle cores --
if the wait is on disk, maybe the cores can't get the data fast enough.

If the network is involved, well, that's a drag on any message checking.
I'm seeing times of .3 msgs/sec, but I think that's with networking turned
on.  Pretty ugly.



poifgh wrote:



Henrik K wrote:

Yeah, given that my 4x3Ghz box masscheck peaks at 22 msgs/sec, without
Net/AWL/Bayes. But that's the 3.3 SVN ruleset.. wonder what version was
used
and any nondefault rules/settings? Certainly sounds strange that 1 core
could top out the same. Anyone else have figures? Maybe I've borked
something myself..



The problem is not that 22 is a low number, but that when we have other free
cores to run different SA instances in parallel, the throughput doesn't scale
linearly.  I'd expect 8 cores with 8 SA instances running simultaneously to
reach 150+ msgs/sec, but it is a third of that, at 50 msgs/sec.





Re: Parallelizing Spam Assassin

2009-07-31 Thread Linda Walsh

May I point out that while you may find the language crude -- it isn't
language that would violate FCC standards, in that it didn't use any of the
7 or so 'unmentionable words'...


People -- these standards of 'crude language' really need to be strongly
held 'in check' -- the US is 'supposed' to be the society of 'free speech'
unless it is obscene or threatening.

I don't think his posting was either.  (BTW, I've never even 'heard' of or seen
his name before this post.  All I saw was his 'uk' addr -- and I've known
a few 'uk' types, and many of them sound very crude to an American ear
these days.)

So in addition to applying strictures in a conservative manner, we must,
hopefully, try to be sensitive to different cultural backgrounds.

If I was talking with a black teen from downtown SF/Oakland, I'd have to
translate from Ebonics -- which can sound rather crude and might contain
an F-word every other sentence.  I just apply my linguistic filter and
attempt to get the meaning.  I hardly think this list is aimed at a young
audience -- any kid 13+ is going to have heard quite an ear-full of 'colorful
expletives', from ST4: The Voyage Home (a family movie) to everyday peer talk.

Yes -- it sounded crude...more than I normally hear in America -- but not more than I'd hear in London.


Just my 2-cents on cultural sensitivity, and the ability to be amused at 
cultural differences (rather than choosing to be offended by them).

p.s. - Most commercial vendor products are Bantha Poodoo -- especially for
virus/security and spam protection, but NOT all.  Usually the ones with the
highest advertising profile are the worst -- they put more budget into
advertising than engineering.

Yeah, I still think SA is a bit slow, but I put much of that down to it being
written in an interpreted language, and its wide flexibility and extensibility
with plug-ins.  Whatcha gonna do?  Maybe we should rewrite it in Forth?
*grin*...


Re: Parallelizing Spam Assassin

2009-07-31 Thread Linda Walsh

It's an American thing.  Things that are normal speech for UK blokes, get
Americans all disturbed.

Funny, used to be the other way around...but well...times change.



Justin Mason wrote:

On Fri, Jul 31, 2009 at 09:32,
rich...@buzzhost.co.uk wrote:

Imagine what Barracuda Networks could do with that if they did not fill
their gay little boxes with hardware rubbish from the floors of MSI and
supermicro. Jesus, try and process that many messages with a $30,000
Barracuda and watch support bitch 'You are fully scanning to much mail
and making our rubbish hardware wet the bed.' LOL.


Richard -- please watch your language.   This is a public mailing
list, and offensive language here is inappropriate.



Re: AWL functionality messed up?

2009-05-27 Thread Linda Walsh

Jeff Mincy wrote:

   From: Linda Walsh 
   Date: Wed, 27 May 2009 12:48:43 -0700
   
   Bowie Bailey wrote:  >

   At face value, this seems very counter productive.
   
You still aren't understanding the wiki or the AWL scoring or what AWL

is trying to do.


Ah, but it only seems I'm daft, today...:-)


   If I get spam from 1000 senders, they all end up in my
   AWL???
   
yes.   every email+ip address pair that sends you email winds up in

your AWL with an average score for that pair.  This is ok.


GRRR...not so ok in my mindset, but ... and ... errr..
well, that only makes it more confusing, in a way...since I was
only 99% certain that I'd never gotten any HAM from hostname
'518501.com' (thinking for a short period that AWL might classify
things by host as reliable or not, instead of, or in addition to,
by email-addr), but I'm 99.97% certain I've never gotten any HAM
from user 'paypal.notify' (at) hostname '5185



   AWL should only be added to by emails judged to be 'ham' via
   the feed back mechanisms --, spammers shouldn't get bonuses for
   being repeat senders...
   
You are getting too attached to the 'whitelist' part of the name.

Pretend AWL stands for average weighting list.

=
Aw...come on.  Isn't the world difficult enough without
changing white to black or white to weighing?  I mean, we humans
have enough trouble agreeing on what our symbols, "words" mean in
relation to concepts and all without ya goin' and redefining perfectly
good acceptable symbols to mean something else completely and still
claim it to be some semblance of English.   No wonder most of the
non-techno-literate humans on this world regard us techies with
a hint of suspicion regarding the difficulty of problems.  We go around
redefining words to suit reality and catch the heat when the rest of
the world doesn't understand our meaning:

Pointy-Haired Boss: "Well, how long did you say it would take?"

Geek: "Well, I said it was 3-4 weeks worth of work."

PHB: "Then why has it been 6 weeks with no product? I told you
  anything over 4 weeks was unacceptable!"

G: "6 weeks, but...to get under 4 weeks, I assumed you were talking
168-hour pure-programming time weeks -- not CALENDAR weeks!"



AWL isn't whitelisting spammers.   It is pushing the score to the
average for that sender.   The sender can have a high average or a low
average.   

---
	An average?  So it keeps the scores of all the past emails of every
sender that ever wrote us?  It must just store a weighted average -- otherwise
the space (hmm...someone said something about 80MB+ auto-whitelist DB
files?).

Why not call it the Historically Based Score Normalizer, or
HBSN module?  The DB file could be "historical-norms" or something.
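The "pushing the score to the average" behavior can be sketched as a toy model (a simplification for illustration; the pull factor and the exact bookkeeping here are assumptions, not SA's actual implementation):

```python
class ScoreAverager:
    """Toy 'historical score normalizer' in the spirit of AWL."""
    def __init__(self, factor=0.5):
        self.factor = factor     # how hard to pull toward the mean
        self.totals = {}         # sender -> (sum of past scores, count)

    def adjust(self, sender, score):
        total, count = self.totals.get(sender, (0.0, 0))
        if count:
            mean = total / count
            adjusted = score + self.factor * (mean - score)
        else:
            adjusted = score     # no history: leave the score alone
        # History records the *pre-adjustment* score.
        self.totals[sender] = (total + score, count + 1)
        return adjusted

awl = ScoreAverager()
awl.adjust("spammer@example.com", 10.0)         # first mail: unchanged
print(awl.adjust("spammer@example.com", 2.0))   # pulled toward 10 -> 6.0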



If the previous email from a particular sender was FP or FN then AWL
will have an incorrect average and will wind up doing or trying to do
the wrong thing with subsequent email for that sender.


Maybe it shouldn't add in the 'average' unless it exceeds
the 'auto-learning threshold'??  I.e. something like the
'bayes_auto_learn_threshold_nonspam' for HAM and the
'bayes_auto_learn_threshold_spam' for SPAM.  Assuming it doesn't
already do such a thing, it would make a little sense...so as
not to train it on 'bad data'...
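For reference, the Bayes gates cited above are real local.cf options; an AWL analogue gated the same way is purely hypothetical (no such AWL option exists in SA 3.x):

```
# Real Bayes auto-learn gates (local.cf):
bayes_auto_learn                   1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam    12.0
# The suggestion above would add a similar gate before folding a
# message's score into the AWL history -- hypothetical, not implemented.
```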

When I run "sa-learn --spam " over a message, can I
assume (or is it the case) that telling SA a message was 'spam'
would assign a sufficiently large value to the 'HBSN' value for that
sender, reducing the effect of any incorrect value (if that is likely
to happen)?

Or might I at least assume that each "sa-learn" over a message
will modify its AWL score appropriately?



You can remove addresses using spamassassin --remove-from-whitelist


Yes...saw that after visiting the wiki.  Is there a
--show-whitelist-with-current-scores-and-their-weight switch as well
(as opposed to one that only showed the addr's in the white list, or only
showed the non-weighted scores)?


Thanks...and um...
How difficult would it be to have the name of the module reflect
what it's actually doing?  maybe roll out a name change with the next
".dot" release of SA?  (3.3? 3.4?)  Might alleviate some amount of
confusion(?)...

Does the AWL also keep track of when it last saw an 'email' addr
so it can 'expire' the oldest entries so the db doesn't grow to eventually
consume all forms of matter and energy in the universe?  :-)

Thanks for the clarification and info!!

-linda


Re: my AWL messed up?

2009-05-27 Thread Linda Walsh


Bowie Bailey wrote:

Linda Walsh wrote:


I got a really poorly scored piece of spam -- one thing that stood out
as weird was report claimed the sender was in my AWL.


Any sender who has sent mail to you previously will be in your AWL.  
This is probably the most misunderstood component of SA.  Read the wiki.


http://wiki.apache.org/spamassassin/AutoWhitelist


---
To be clear about what is being white listed, would it
hurt if the 'brief report for the AWL', instead of :
-1.3 AWLAWL: From: address is in the auto white-list

it had
-1.3 AWLAWL: 'From: 518501.com' addr is in auto white-list

So I can see what domain it is flagging with a 'white' value?

I don't know of any emails from '518501.com' that wouldn't have
been classified spam, so none should have a 'negative value'.



AWL functionality messed up?

2009-05-27 Thread Linda Walsh

Bowie Bailey wrote:

Linda Walsh wrote:


I got a really poorly scored piece of spam -- one thing that stood out
as weird was report claimed the sender was in my AWL.


Any sender who has sent mail to you previously will be in your AWL.  
This is probably the most misunderstood component of SA.  Read the wiki.


http://wiki.apache.org/spamassassin/AutoWhitelist




At face value, this seems very counter productive.

If I get spam from 1000 senders, they all end up in my
AWL???

WTF?

AWL should only be added to by emails judged to be 'ham' via
the feed back mechanisms --, spammers shouldn't get bonuses for
being repeat senders...

How do I delete spammer addresses from my 'auto-white-list'?

(That's just insane..whitelisting spammers?!?!)




Re: new netset warn msg (howto avoid?)

2009-05-26 Thread Linda Walsh



Jari Fredriksson wrote:

I see this message coming out of my SA a lot these days
since upgrading to 3.2.5:


[23920] warn: netset: cannot include 127.0.0.0/8 as it
has already been included 


Where is this local net being 'included', and how can I
suppress 
the duplicate inclusion message?


Thanks,
linda


It is in /etc/spamassassin/local.cf as "internal networks" or "trusted 
networks" setting.

Remove it and no message should be shown.


Ah...

In an earlier SA version, I used to have
'internal_network 127.'
on a line by itself.

Is the "127" network now 'built-in' in the current SA?

Thanks in Advance!
linda


new netset warn msg (howto avoid?)

2009-05-26 Thread Linda Walsh

I see this message coming out of my SA a lot these days since upgrading to
3.2.5:

[23920] warn: netset: cannot include 127.0.0.0/8 as it has already been included

Where is this local net being 'included', and how can I suppress
the duplicate inclusion message?

Thanks,
linda


Re: user-db size, excess growth...limits ignored

2009-04-02 Thread Linda Walsh

 LuKreme wrote:

On 1-Apr-2009, at 13:27, Linda Walsh wrote:
*ouch* -- you mean each message writes out an 80MB white-list file?
That's a lot of I/O per message; no wonder spamd seems to be slowing
down...


No these are DB files.  Data is added to them, this does not 
necessitate rewriting the entire file.

---

Yeah -- then this refers back to the bug about there being no way to prune
that file -- it just slowly grows and needs to be read in when spamd starts(?),
and spamd needs to keep that info around as the basis for its AWL scoring, no?
So the only real harm is the increased read-initialization time and the run-time
AWL length?



Re: user-db size, excess growth...limits ignored

2009-04-01 Thread Linda Walsh

Matt Kettler wrote:

Linda Walsh wrote:

Matt Kettler wrote:

I see 3 DB's in my user directory (.spamassassin).
   auto-whitelist (~80MB),   bayes_seen (~40MB),   bayes_toks (~20MB)



expiry will only affect bayes_toks. Currently neither auto-whitelist nor
bayes_seen have any expiry mechanism at all.

---
So they just grow without limit?

Yep. Not ideal, and there's bugs open on both.



 How often does the whitelist get sync'd to disk?

In the case of the whitelist, it's per-message.

-
*ouch* -- you mean each message writes out an 80MB white-list file?
That's a lot of I/O per message; no wonder spamd seems to be slowing down...



Having changed the user_prefs files back to the default
setting (i.e. deleted my previous addition) -- 2 days ago, and system was
rebooted 1day14hours ago, I'm certain spamd has been restarted.

Hmm, can you set bayes_expiry_max_db_size in a user_prefs file? That
seems like an option that might be privileged and only honored at the
site-wide level. An absurdly large value can bog the whole server down
when processing mail, so an end user could DoS your machine if allowed
to set this.


I *thought* I could set it -- certainly, the only place I
*increased* the tokens beyond the *default* was in user_prefs.  That
*seems* to have worked in bumping up the toks to 500K, but now
lowering it is being ignored.  Perhaps the user_prefs option to set
#tokens changed: an old version allowed it and raised it to 500K,
but a newer version disallows it, so I can't 're-lower' it (though I'd think
the global 150K limit would have been re-applied).




That said, 3.1.7 is vulnerable to CVE-2007-0451 and CVE-2007-2873.

You should seriously consider upgrading for the first one.


-
While I was supporting multiple local users at one point, I'm the only
local user now, so local-user escalation to create local service denial isn't
my top-most concern.  That doesn't mean I shouldn't upgrade for other reasons.


I'm still *Greatly* concerned about an 80MB file being written to disk,
potentially on every incoming email message.  That seems a high
overhead -- or are there mitigating factors that decrease that amount
in 99% of the cases?

Tnx,
Linda


One BUG found: userpref whitelist pattern BUG/DOC prob;

2009-04-01 Thread Linda Walsh



Bowie Bailey wrote:

Linda Walsh wrote:

I get many emails addressed to internal sendmail mail-ID's.
  123...@mydomain,   1abd56.ef7...@mydomain

(they seem to fit a basic pattern, but I don't know how to specify the
pattern (or I don't have it right)):
  <(start of an email-address)>[0-9][0-9a-fA-F.]*\@mydomain



I think this is what you are looking for (untested):

header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]*\@mydomain/i

Look in the "Rule Definition" section of the man page for
Mail::SpamAssassin::Conf for more info on the ':addr' option.

--

I found, BURIED, in the doc "Mail::SpamAssassin::Conf the broken,
primitive rules for white/black list patterns allowed:

   Whitelist and blacklist addresses are now file-glob-style patterns,
   so "fri...@somewhere.com", "*...@isp.com", or "*.domain.net" will all
   work.  Specifically, "*" and "?" are allowed, but all other
   metacharacters are not.  Regular expressions are not used for
   security reasons.
===

These are NOT file-glob-style patterns as on Linux.
These are examples of non-regex file-glob patterns that don't work under
SA:  "[0-9][0-9a-f]*.domain", "[0-9]*.domain", "[^0-9]*.domain".

They don't work:   the "bracket notation for a single character" doesn't
work.
1) Instead you need:
0*.domain
1*.domain
2*.domain
3*.domain
4*.domain
5*.domain
6*.domain
7*.domain
8*.domain
9*.domain
-
2)  There is no way to express negation.
3)  The documentation is ALSO unclear on whether the expression is a full or
partial match, as "^" and "$" are also not included.
So it is unclear if "@domain" is the same as "*...@domain".

Attempts to match addresses of the form:
"^[0-9][0-9a-f].*.domain$"  (ex: "0...@domain")
fail to match, as does any more complex file-glob.

white/black lists should not claim 'file-glob' matching ability if they
don't even include single char 'range' matches.
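The distinction is easy to demonstrate with a genuine shell-style glob matcher (Python's fnmatch used here as a stand-in for 'file-glob' semantics): bracket ranges and negation both work, which is exactly what SA's whitelist matcher lacks.

```python
from fnmatch import fnmatch

# Real file-globs accept single-character ranges...
assert fnmatch("0abc@domain", "[0-9]*@domain")
assert not fnmatch("xabc@domain", "[0-9]*@domain")
# ...and negation:
assert fnmatch("xabc@domain", "[!0-9]*@domain")
# SA's whitelist_from only honors '*' and '?', so none of the above
# bracket patterns would match there.
```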

This was the answer THAT NO ONE understood or could answer.

If the format of white/black list entries in 'userprefs' is SO
arcane, limited, and poorly documented, I assert it is a bug.

Short-term, documentation would be the quickest fix (get rid of the file-glob
description, as it's not true in the normal sense of a file-glob); longer term,
it might be real file-globs
   AND
making clear whether the pattern provided must match the full email
address, or if a partial match will be considered a positive match
(i.e. "@foobar" is the same as "*...@foobar*").

Sorry if I am coming across a bit terse, but this hard-to-find and
misleading description has been a long-term "bug" in my filtering rules.

Seems like a lot of email-harvesting progs see mail-ID numbers like
"12345.6ab...@domain" as email addrs, which in my setup are completely
bogus.

-linda



Re: user-db size, content confusions (how many toks?)

2009-03-31 Thread Linda Walsh

Matt Kettler wrote:

I see 3 DB's in my user directory (.spamassassin).
   auto-whitelist (~80MB),  bayes_seen (~40MB), bayes_toks (~20MB)
Was trying to find relation of 'bayes_expiry_max_db_size' to the physical
size of the above files.

---


expiry will only affect bayes_toks. Currently neither auto-whitelist nor
bayes_seen have any expiry mechanism at all.

---
So they just grow without limit?  How often are they loaded?
Does only "spamd" access the auto-whitelist?

Optimally, I would assume spamd opens it upon start, but it needs to update
the disk file periodically (sync the db) for reliability.  How often does
it 'sync'?



bayes_seen can safely be deleted if you need to. It keeps track of what
messages have already been learned to prevent relearning them. However,
unless you're likely to re-feed messages to SA, bayes_seen isn't strictly
necessary.

---
The only refeeding would usually be 'ham', because I might rerun over
an "Inbox" that might have old messages in it.  I don't rerun "ham" training
often -- except to "despam" a message (one that was marked spam and shouldn't
have been).



I'm finding some answers, I've run into some seeming "contradictions".  
...

---
First prob (contradiction): dbg above says "token count: 0".  (This is
with a combined bayes db size of 60MB (_seen, _toks).)

Are you sure your sa-learn was using the same DB path?

---
Sure??  It listed the same filename (default location
/home//.spamassassin/).  Other than that, I haven't
tried to trace perl running spamassassin to see if it is really accessing
the same file.  Only going off the 'debug' messages (which correspond to the
settings in "user_prefs" in the default location dir).



From the sounds of it, sa-learn is using a directory with an empty DB.


Yeah...Doesn't make sense to me -- how would "sa-learn --dump magic"
use a different location?  I.e. it showed ~500K tokens...


I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens?

Yep, looks like you have 491,743 tokens to me.



It's like the sa-learn magic shows a 'db' corresponding to my old limit
(that I think is still being 'auto-expired', so might not have pruned
figure as it runs about once per 24 hours, if I understand normal spamd
workings).

Approximately. Also, be aware that in order for spamd to use new
settings it needs to be restarted.


Having changed the user_prefs file back to the default
setting (i.e. deleted my previous addition) 2 days ago -- and the system was
rebooted 1 day 14 hours ago -- I'm certain spamd has been restarted.
YET: all db sizes are the same as before (no reduction in size
corresponding to going 'back' to the default 150K limit), though sa-learn
run with dbg and force-expire indicated 0 tokens -- while sa-learn w/dump magic
indicates 500K tokens.  How can "expire" say 0 toks but dump-magic say 500K?

File timestamps show all 3 db files have been updated today
(presumably by spamd processing email as it comes in).  But the file sizes
are still @ the sizes indicated at the top of this message: 80/40/20 MB.



So is the --magic output, maybe what is seen and being
'size-controlled' by auto-expire?

Yes, at least, it should be.




Why isn't 'sa-learn --force expire' seeing the TOKENs indicated in
sa-learn --dump magic?  

That is particularly strange to me, and it sounds like there's some
problems there.

---
*sigh*



Can you give a bit of detail, ie: what paths are you looking at for the
files, what version of SA,

---
SA = old version, 3.1.7.
Which at the very least points to an upgrade possibly solving the problem,
BUT this was working at one point, and I don't know why it 'stopped'.  I'm
generally uncomfortable with fixing things that were working just because they
have randomly stopped working, without knowing *why* (though that discomfort has
become something I've just had to deal with as the Microsoft SW
maintenance method becomes the norm: update and see if the bug is gone...yes?  OK,
bug gone -- unclear if fixed or hidden, unclear about the effects of other changes
in a new version...).



Am I misinterpreting the debug output?

No, you don't seem to be.

---
Thanks for the confirmation of my 'reality'.  Really, the most logical
and time-efficient way to proceed is likely to upgrade to a newer version at some
point soon (and ignore my discontent regarding 'not knowing' why or what caused
the break).

*sigh*
Linda








user-db size, content confusions (how many toks?)

2009-03-29 Thread Linda Walsh


I see 3 DB's in my user directory (.spamassassin).

auto-whitelist  (~80MB)
bayes_seen  (~40MB)
bayes_toks  (~20MB)

Was trying to find the relation of 'bayes_expiry_max_db_size' to the physical
size of the above files.  I'm finding some answers, but I've run into some
seeming "contradictions".  Had db_size set to 500,000, reduced to 250,000
and then to the 'default' (150,000) during testing.

In trying to lower 'db_size' and see how that affected physical sizes,
I ran 'sa-learn --force-expire' and saw these debug messages of note:

[30905] dbg: bayes: expiry check keep size, 0.75 * max: 112500
[30905] dbg: bayes: token count: 0, final goal reduction size: -112500
[30905] dbg: bayes: reduction goal of -112500 is under 1,000 tokens, skipping 
expire
[30905] dbg: bayes: expiry completed

---
First prob (contradiction): dbg above says "token count: 0".  (This is with
a combined bayes db size of 60MB (_seen, _toks).)

Seems to think I have no bayes data.  Saw another dbg msg that indicated the
bayes classifier was untrained (<~150? entries) & disabled.

Dunno how it got zeroed, but I tried adding 'ham' by running sa-learn over
a despam'ed mailbox.  The first run showed:

Learned tokens from 55 message(s) (55 message(s) examined)

But subsequent runs of 'sa-learn with dbg+expire" still show token count: 0.

sa-learn --dump magic shows something different:
0.000          0          3          0  non-token data: bayes db version
0.000          0     556414          0  non-token data: nspam
0.000          0     574441          0  non-token data: nham
0.000          0     491743          0  non-token data: ntokens
0.000          0 1216456288          0  non-token data: oldest atime
0.000          0 1237796146          0  non-token data: newest atime
0.000          0 1220476831          0  non-token data: last journal sync atime
0.000          0 1217838535          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0      70612          0  non-token data: last expire reduction count
-

Does the above indicate 0 tokens?  I.e. doesn't 'ntokens' = 491743 mean
slightly under 500K tokens (my original limit before trying to run 'sa-learn
--force-expire + dbg' manually)?


It's like the sa-learn magic shows a 'db' corresponding to my old limit
(which I think is still being 'auto-expired', so it might not have the pruned
figure, as it runs about once per 24 hours, if I understand normal spamd
workings).

So is the --magic output maybe what is seen and being 'size-controlled' by
auto-expire (it was ~500K before the recent test changes)?

Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in
sa-learn --dump magic?  Debug messages are pointing at the same file
for both operations, so how can dump-magic indicate 500K, but the
debug of sa-learn --force-expire somehow see 0 TOKENs?

Am I misinterpreting the debug output?

Thanks,
Linda





Re: What is AWL: _Average-Whitelister_....

2009-03-24 Thread Linda Walsh



John Hardin wrote:

What is AWL rule? Why it gives so different amount of points?


"Auto Whitelist" is a misleading name. It is actually a score averager. 
Since the points it applies are based on the historical scoring from 
that sender, the score will vary by who the sender is and when the 
message is processed (i.e. their history to-date).

---

Thank you for the clear and simple explanation.

Perhaps:
AWL (AutomaticWhiteList)
should be renamed to:
AWL (Averaging-Whitelister)


While the acronym would/could stay the same, the standard
expanded form should say it is an "Averaging" - something
(whitelister, blacklister, whatever).

The important point is not that it's automatically applied, but
that it's _A_veraging...

This clarifies this long outstanding Q for me as well.

Thanks!
Linda


Re: userpref whitelist pattern problem

2009-03-15 Thread Linda Walsh

LuKreme wrote:

On 13-Mar-2009, at 12:58, Linda Walsh wrote:

I get many emails addressed to internal sendmail mail-ID's.
123...@mydomain   or  1abd56.ef7...@mydomain
(seem to fit a basic pattern but don't know how to specify the
pattern (or I don't have it right):
<(start of an email-address)>[0-9][0-9a-fA-F.]*\@mydomain


Generally:
^ means 'start of line',   $ means 'end of line'
but whitelist_from uses globbing, and I don't think you can use those
anchors there.  Are the emails coming in without a tld (1...@example and
not 1...@example.com)?


All with a tld, but two forms, am trying to catch:
 from: 11234.2a...@somedomain.tld(or)
 from: larry <11234.2a...@somedomain.tld>



any hints would be appreciated...
running slightly older SA 3.1.7 on perl 5.8.8


Slightly?  No, that's ancient (2.5 years!!). 

---
Sometimes I understate, sometimes I overstate.  Hard to tell by the
"dot" name unless one has been paying attention to all their products.
I guess it is difficult to tell by the version-dot number.



 Seriously, if there is
only one thing you keep updated on your mailserver, it needs to be 
SpamAssassin.

---
Good to know; I was doing better, but got out of sync
when I couldn't get perl-libs to update cleanly during some perl
version update.   Between updating:
- Perl versions...
- Distro RPM versions,
- CPAN module versions.
-* requirements of SW dependent on Perl modules
(i.e. SpamAssassin)

...one thing leads to another...and before you know it
(up to butt in alligators?)...  :-)

Tnx,
Linda


Re: whitelist pattern problem in userpref-whitelisting

2009-03-13 Thread Linda Walsh

Does the below apply to the
~/.spamassassin/user_prefs
   whitelisting (command, keyword or feature)?

Sorry...it was the whitelisting in the user_prefs file whose
"primitive pattern matching" I was talking about.

At one point it was limited to DOS-like file-matching patterns,
not the full perl-regexp set (of which the example you gave
me below would be an excellent one!) ...

I don't see 'header' as a usable line in "user_prefs".


thanks,
-linda


Bowie Bailey wrote:

Linda Walsh wrote:
> I get many emails addressed to internal sendmail 's.
>   123...@mydomain,  1abd56.ef7...@mydomain
> (seem to fit a basic pattern but don't know how to specify the
> pattern (or I don't have it right)):
>   <(start of an email-address)>[0-9][0-9a-fa-f\@mydomain
>
> by start of an email addr, I mean inside or outside literal '<>'.
> I try matching '<' as a start char to look for anything starting
> with a number, but that fails if they don't use the "name <addr>"
> format, but just use "x...@yy".  Don't know how to anchor at the beginning
> of anything that looks like an email address.

I think this is what you are looking for (untested):

header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]*\@mydomain/i

Look in the "Rule Definition" section of the man page for
Mail::SpamAssassin::Conf for more info on the ':addr' option.

> I know the pattern matcher in the userprefs file is primitive though
> -- like DOS level file matching, so I don't know how to write
> it in userprefs...

user_prefs uses the exact same pattern matching as the rest of SA (Perl
regexps).  It is anything but primitive.

The caveat being that rule definitions are not allowed in user_prefs
files unless you allow it by putting this in your local.cf:

allow_user_rules 1

> any hints would be appreciated...
> running slightly older SA 3.1.7 on perl 5.8.8
>
> intending to update ... eventually but don't know that this would
> solve any pattern help

Shouldn't make any difference for this.
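The quoted rule's regex can be exercised outside SA with an ordinary regex engine.  A hypothetical check (assuming the archive-mangled character class was meant to be [0-9a-f.]; "mydomain" is a placeholder):

```python
import re

# To:addr-style match: addresses that start with a digit and consist of
# hex digits / dots before @mydomain.
pat = re.compile(r"^\d[0-9a-f.]*@mydomain", re.I)

assert pat.match("12345.6abc@mydomain")     # auto-generated mail-ID: hit
assert not pat.match("larry@mydomain")      # ordinary user: no hit
```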



whitelist pattern problem

2009-03-13 Thread Linda Walsh

I get many emails addressed to internal sendmail mail-ID's.
 123...@mydomain
 1abd56.ef7...@mydomain


(they seem to fit a basic pattern, but I don't know how to specify the
pattern (or I don't have it right)):
 <(start of an email-address)>[0-9][0-9a-fA-F.]*\@mydomain

By start of an email addr, I mean inside or outside literal '<>'.
I try matching '<' as a start char to look for anything starting
with a number, but that fails if they don't use the "name <addr>"
format, but just use "x...@yy".  Don't know how to anchor at the beginning
of anything that looks like an email address.

I know the pattern matcher in the userprefs file is primitive though
-- like DOS level file matching, so I don't know how to write
it in userprefs...

any hints would be appreciated...
running slightly older SA 3.1.7 on perl 5.8.8

intending to update ... eventually, but I don't know that updating would
help with the pattern problem

Thanks,
-linda


RFE? Or is there an easy way to do this?

2009-02-01 Thread Linda Walsh
I have some email accounts that I use with particular vendors or lists.  I have 
a few email accounts only known to a single person or company. 

What I'd like to do is have some way of white-listing a "to-addr" if it is from
a list of "from-addrs"; else, add something (a constant?) to its spam score.
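The first check can be approximated with two header sub-rules feeding a meta rule; a sketch, untested, with hypothetical addresses:

```
# Penalize mail to a vendor-specific address unless it is from that vendor.
header __TO_VENDOR1    To:addr   =~ /^vendor1-only\@mydomain$/i
header __FROM_VENDOR1  From:addr =~ /\@vendor1\.example$/i
meta   VENDOR1_ADDR_MISUSED  (__TO_VENDOR1 && !__FROM_VENDOR1)
score  VENDOR1_ADDR_MISUSED  3.0
```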


An even more advanced but non-trivial check would be "if to-addr(X), and sender not in
my contacts (addr-book), then SPAM, else ok".


Anyone else have their ways to do these checks?

thanks,
-linda



Re: junkfiles-bays_toks.expire\d{4-5}

2008-07-28 Thread Linda Walsh

A manual expire run took less than 2 minutes -- closer to 1 minute.  How
impatient is SA??



John Hardin wrote:

On Fri, 2008-07-25 at 18:35 -0700, Linda Walsh wrote:

Jul 25 15:28:21 Ishtar spamd[2355]: bayes: expire_old_tokens: child processing 
timeout at /usr/bin/spamd line 1085,  line 22.


Your autoexpire is taking longer than SA is willing to wait.

This is a fairly common question, there's lots of discussion in the list
archives.

Consensus: disable autoexpire and run a dedicated expiry from cron,
weekly or daily based on your token volume.
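That consensus translates to something like the following (a sketch; the user name and schedule are placeholders):

```
# local.cf: turn off the in-band expiry...
bayes_auto_expire 0

# crontab: ...and expire out-of-band, nightly at 03:30.
30 3 * * *  mailuser  sa-learn --force-expire >/dev/null 2>&1
```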



Mail::SpamAssassin 3.2.5 fails: NOT OK

2008-07-26 Thread Linda Walsh


Can't install Mail::SpamAssassin in CPAN...
fails at the end...
(not the whole log, but enough to give context, I hope)

*/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/Locker/Flock.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/BayesStore/SQL.pm

yes
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
blib/lib/Mail/SpamAssassin/Logger/Syslog.pm
checking for inttypes.h... /usr/bin/perl build/preprocessor -Mconditional -Mvars 
-DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/Plugin/SPF.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Plugin/Shortcircuit.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/Client.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/PerMsgStatus.pm

yes
checking for stdint.h... /usr/bin/perl build/preprocessor -Mconditional -Mvars 
-DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Plugin/URIDetail.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
blib/lib/Mail/SpamAssassin/PerMsgLearner.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Plugin/WLBLEval.pm

yes
checking for unistd.h... /usr/bin/perl build/preprocessor -Mconditional -Mvars 
-DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Plugin/AutoLearnThreshold.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/SQLBasedAddrList.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Plugin/TextCat.pm

yes
checking sys/time.h usability... /usr/bin/perl build/preprocessor -Mconditional 
-Mvars -DVERSION="3.002005" -DPREFIX="/usr" 
-DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/PersistentAddrList.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" 
-DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/DnsResolver.pm

yes
checking sys/time.h presence... /usr/bin/perl build/preprocessor -Mconditional 
-Mvars -DVERSION="3.002005" -DPREFIX="/usr" 
-DDEF_RULES_DIR="/usr/share/spamassassin" 
-DLOCAL_RULES_DIR="/etc/mail/spamassassin" 
-DLOCAL_STATE_DIR="/var/lib/spamassassin" 
>blib/lib/Mail/SpamAssassin/SubProcBackChannel.pm

yes
checking for sys/time.h... /usr/bin/perl build/preprocessor -Mconditional -Mvars 
-DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/

Re: junkfiles-bays_toks.expire\d{4-5}

2008-07-26 Thread Linda Walsh



Matt Kettler wrote:
What version are you running? Reading around, the child processing
timeout seems to have been a common problem in the 3.1.x series, but
I've not seen it reported in the 3.2.x series.

---
Erp.  I'll try upgrading and see what happens...still have a
3.1.7 installed.


Re: junkfiles-bays_toks.expire\d{4-5}

2008-07-25 Thread Linda Walsh



Matt Kettler wrote:
The fact that they keep lying around is a problem. This suggests SA
keeps getting killed before the expire can complete. Do you have any 
kind of limits set such as CPU time or memory that SA might be running 
against and dying?


You can try kicking off an expire manually using sa-learn 
--force-expire. (add -D if you want some debug output)..
note: this could run for a long time, particularly if bayes_toks is 
really large.




Another one of the files appeared -- 17M long,
while bayes_toks is 8.8M.  auto-whitelist is 78M -- that seems a bit
excessive...

Don't know what "really large" means -- bayes_toks isn't that large
compared to some of the other files.  No limits that I know of...
Ahh...seeing some oddness in the log though:
(Interrupted?timeouts?...weird...)...

Jul 25 15:23:59 Ishtar spamd[2447]: bayes: cannot open bayes databases 
/home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:24:48 Ishtar spamd[2443]: bayes: cannot open bayes databases 
/home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:28:21 Ishtar spamd[2355]: bayes: expire_old_tokens: child processing 
timeout at /usr/bin/spamd line 1085,  line 22.
Jul 25 15:36:55 Ishtar spamd[2447]: bayes: cannot open bayes databases 
/home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:41:38 Ishtar spamd[2443]: bayes: expire_old_tokens: child processing 
timeout at /usr/bin/spamd line 1085.
Jul 25 16:14:14 Ishtar spamd[2355]: bayes: cannot open bayes databases 
/home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 16:19:00 Ishtar spamd[2385]: bayes: expire_old_tokens: child processing 
timeout at /usr/bin/spamd line 1085.
Jul 25 16:29:05 Ishtar spamd[2356]: bayes: expire_old_tokens: child processing 
timeout at /usr/bin/spamd line 1085.
Jul 25 17:06:02 Ishtar spamd[2385]: bayes: cannot open bayes databases 
/home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call


junkfiles-bays_toks.expire\d{4-5}

2008-07-25 Thread Linda Walsh

In my .spamassassin dir, I see lots of files that look like:

bayes_toks.expire1098   bayes_toks.expire1243   bayes_toks.expire13494
bayes_toks.expire15029  bayes_toks.expire15761  bayes_toks.expire16349
bayes_toks.expire17370  bayes_toks.expire17385  bayes_toks.expire1754
bayes_toks.expire18183  bayes_toks.expire18584  bayes_toks.expire18813
bayes_toks.expire19274  bayes_toks.expire19481  bayes_toks.expire20721
bayes_toks.expire2264   bayes_toks.expire2265   bayes_toks.expire2266
bayes_toks.expire2267   bayes_toks.expire22670  bayes_toks.expire2268
bayes_toks.expire2324   bayes_toks.expire2327   bayes_toks.expire2355
bayes_toks.expire2356   bayes_toks.expire2385   bayes_toks.expire23960
bayes_toks.expire2443   bayes_toks.expire2447   bayes_toks.expire25435
bayes_toks.expire26900  bayes_toks.expire29828  bayes_toks.expire31304
bayes_toks.expire3343   bayes_toks.expire3442   bayes_toks.expire3444
bayes_toks.expire4002   bayes_toks.expire4334   bayes_toks.expire4877
bayes_toks.expire5636   bayes_toks.expire5683   bayes_toks.expire5779
bayes_toks.expire6464   bayes_toks.expire9281   bayes_toks.expire9300

They are all a few hundred K or more long (just deleted the bunch).

I've also noticed spamd going off and cranking for more than an hour -- seems to
produce one of these files...

Any idea what they are for? or why SA would keep leaving them in my .sa (user) 
dir?
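For cleanup, the leftover journals can be swept with find; a sketch run against a temp directory so it is safe to try (point it at ~/.spamassassin yourself only once you're sure the files are stale):

```shell
# Create a throwaway dir that mimics the situation, then sweep it.
dir=$(mktemp -d)
touch "$dir/bayes_toks.expire2355" "$dir/bayes_toks.expire2443"
touch "$dir/bayes_toks"                    # the live database: must survive
find "$dir" -name 'bayes_toks.expire*' -type f -delete
ls "$dir"                                  # only bayes_toks remains
```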



mem use of spamd processes: wasted memory? 'bug'?

2008-07-24 Thread Linda Walsh

I noticed something about my spamd processes.

There is a "main" process at the top that spawns children.
5 of 6 of the top memory (by %) are  'spamd'.
5/6 top Resident (28M for parent), 40m-49m /child (268M total + parent)
5/7 top Data users (26M for parent) 38-47m/child (259M total + parent)

So since all of the spamd's are accessing the same databases on disk, how come
there is so much that isn't shared?  Shouldn't the clients be using mostly
shared memory, with the only non-shared part being the individual emails?

Would or should this be characterized as a design bug?

-linda






Re: Discussion side point: levels of Trust

2008-06-16 Thread Linda Walsh



John Hardin wrote:

On Wed, 11 Jun 2008, SM wrote:


At 17:46 11-06-2008, Linda Walsh wrote:

 How does one decide on 'trust'?  I.e. I think it would be
useful to assign a probability to "Trust" at the least.  I mean do I put
my ISP in my trusted server list?   -- suppose they start partnering 
with


It could be a reputation system where you assign a probability.


Probability of what, exactly?

Bear in mind, "trusted" means "does not forge Received: headers", not 
"does not send or relay spam".



I am aware of this.
	However, it's not an easily discerned number, but if I had ATT or Comcast as an
ISP, my trust in them would maybe be a value of .7-.8.  Like the
ISP in Europe that inserted over 20 million ads on HTML pages -- they could
just as easily be adjusting return headers.
	But more worrisome is the cooperation of ISPs with the
unconstitutional 'lawless intercept' actions by law enforcement agencies that
are used to find and entrap end-users for any crime they wish to target.
The laws were sold on terrorist grounds, then later bolstered via
the mantra "for the children, for the children...it's all the child porn" (expanded
to apply to anyone under age 18).

I could easily see the possibility of domain-information being
corrupted -in real time- to allow intercept of traffic -- that could either
be used in a 'honeypot' scenario, or just to monitor.  While in some cases
the ISPs have no choice but to cooperate, there have been several high-profile
ISPs (ATT, Verizon) who have handed over information without requiring any
formal oversight or legal documents.  That's scary as the US moves more toward
the corrupted-GOP's idealized police state.  Hopefully we can get some
serious regime change to undo some of these worst practices...but governments
are notoriously bad about letting go of power once they've grabbed hold of it.




Discussion side point: levels of Trust

2008-06-11 Thread Linda Walsh

Matthias Leisi wrote:

1) This advice:
| Tue Jun 10 14:55:36 2008 [72096] dbg: conf: trusted_networks are not
| configured; it is recommended that you configure trusted_networks
manually

should not be ignored. Setting trusted_networks would slightly reduce
the number of DNS lookups and can avoid all sorts of funny error
situations.
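For reference, setting it is one or two lines in local.cf; the addresses below are placeholders for your own LAN and relay:

```
trusted_networks  192.168.0.0/16 10.11.12.13
internal_networks 192.168.0.0/16
```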



How does one decide on 'trust'?  I.e. I think it would be
useful to assign a probability to "Trust" at the least.  I mean do I put
my ISP in my trusted server list?   -- suppose they start partnering with
an ad-firm?  Or.. get bought-out? ... I probably won't know most of their
internal politics...  ISP's in some eastern state have already committed to
filtering arbitrary sites based on local values and arbitrary listing
policies(?)  This whole 'save-the-child-porn' shtick the government is
using as a necessary excuse to violate computer privacy is unacceptable.

They did the same thing -- claimed they needed intrusive powers to protect
against terrorists -- but 80% of the people they've used those powers
against have been for 'common crimes' (or drug prosecutions).
In the UK, they are using anti-terrorism surveillance-cams to
enforce doggie-doodoo pickup laws!
In the US, the government is using "passenger manifests" of arriving, overseas
flights, to detain and arrest foreign businessmen and citizens on civil and
non-violent criminal investigations.

But those are general complaints about the untrustworthiness of previously
trustworthy entities.

I don't have a binary trust value, really.  As an example,
going from most trusted to least, I might have:

- a lab/build/test machine (linux usually)
- internal server proxy to out-net (linux)
- windows XP desktop (it's windows, no direct outside connection, but can proxy)
- my ISP's servers
- root DNS servers (arguably more trustworthy than most ISPs, but since
   I have to go through my ISP to get to them, _logically_, how can I
   trust them more?)
- HTTPS personal-money sites... (for some things, more trust than my ISP, but
they are 'banks' -- so that trust comes with some grains of 'salt')
- Mainstream web-providers (varies based on reputation, but examples would
include Google, BBC(.co.uk), various online businesses with physical
presence, 'seem' more trustworthy (at least you know where they are based?)
- government sites: depends, from 'ok' trust to downright untrustworthy.
- unknown sites / known bad sites...




Re: Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..

2006-12-25 Thread Linda Walsh

I looked at this error and it appears to be caused by SpamAssassin passing
in an incorrect parameter to "Text::Wrap" by changing the value
of "Text::Wrap::break" to be something other than a "line breaking"
character (or characters).

In file in my Spamassassin-3.17 cpan dir, there is a bad line:

./lib/Mail/SpamAssassin/PerMsgStatus.pm:996:  $hdr = 
Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '(?<=[\s,])');


The last argument, '(?<=[\s,])' appears to be invalid.

The error message is "(?:(?<=[\s,]))* matches null string many \
times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- \
HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46. "

In the Text::Wrap source code of 0704 (last working version),
there is (at line 46, remarkably enough:-)), the line:

   while ($t !~ /\G\s*\Z/gc) {

In versions 0711 and later, that line reads:
   while ($t !~ /\G(?:$break)*\Z/gc) {
=
   Note that "\s" has been replaced by "(?:$break)".  In the
0711 source code, $break defaults to '\s'.

   In other words -- it appears, from the code it replaces and
from the default values of "$break" that "$break" should contain
a pattern representing the characters to break on.

   However, in PerMsgStatus:996, we see a *zero-length*
(the "(?<=pat)" part) pattern passed in for the value of $break. 
Instead of matching the line "break" character, it only matches
the position and never matches the character itself -- thus it
gets "stuck" applying the zero-length (null) pattern again and
again (hence the message "matches null string many times").

   I'm not sure what the author was trying to do in PerMsgStatus.pm
or who "owns" that "line" (or file), but perhaps they meant for
"comma" to be included in the list of "break" characters.  In
which case, instead of:
'(?<=[\s,])'
for the last argument in line 996, it should be:
'[\s,]'

   That is, line 996 in lib/Mail/SpamAssassin/PerMsgStatus.pm should be:
$hdr = Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '[\s,]');

(instead of:
$hdr = Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '(?<=[\s,])');
)

I hope this was helpful?

Linda

---orig msg follows---

Theo Van Dinter wrote:

On Sun, Dec 24, 2006 at 05:43:12PM -0800, Linda Walsh wrote:
  
I've seen this error message in the past few upgrades (~3.11, .12, .17) 
and was wondering if anyone else has seen it and knows what the problem is.


Discussed so much it's an FAQ. :


Re: Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..

2006-12-25 Thread Linda Walsh

Many thanks...didn't think to look in the FAQ...sigh.  I have
"local site configuration" esteem issues -- thinking it is
usually something "peculiar" to my setup.


   So it's Text::Wrap...
   I'm surprised they haven't fixed it.  Doesn't seem like it
would be that difficult as they should have a fairly large
number of "test cases" and should know what they changed...
(famous last words).

-Linda


Theo Van Dinter wrote:

On Sun, Dec 24, 2006 at 05:43:12PM -0800, Linda Walsh wrote:
  
I've seen this error message in the past few upgrades (~3.11, .12, .17) 
and was wondering if anyone else has seen it and knows what the problem is.


Discussed so much it's an FAQ. :)
http://wiki.apache.org/spamassassin/TextWrapError
  


Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..

2006-12-24 Thread Linda Walsh
I've seen this error message in the past few upgrades (~3.11, .12, .17) 
and was wondering if anyone else has seen it and knows what the problem is.


---
Dec 24 17:32:53 mailhost spamd[3320]: (?:(?<=[\s,]))* matches null 
string many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* 
<-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 47.

---

I'm guessing some configuration is messed up somewhere, but I suppose
it could be a bug in the Text/Wrap module.  I've just checked to see that
my cpan modules are up-to-date, and any with version numbers are.

Any ideas on getting rid of this message (preferably by removing the cause,
not by covering it up...:-)).


Thanks,
Linda


light-grey listing..? lkml filter probs & catching too much ham.

2006-10-04 Thread Linda Walsh

I'm having problems filtering a list I'm on (lkml).

First I had it on normal filtering -- but I had too many false
positives.  Finally switched it to a white-list, but now many
false negatives (spam) get through.

Is there a way to "light-grey" a list -- not a blanket accept
all, white-list, but something that temporarily moves the
spam-"high-water" mark for that specific email: i.e. instead of
it taking "X" points to be marked as SPAM, it adds 5 points to
the threshold needed to mark the message as spam?
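SA has no per-list threshold, but a negative-scoring meta rule gets much the same effect; a sketch (untested -- the List-Id value and the -5 score are assumptions to tune):

```
header   __VIA_LKML  List-Id =~ /linux-kernel\.vger\.kernel\.org/
meta     LKML_SOFT_WHITELIST  __VIA_LKML
describe LKML_SOFT_WHITELIST  Soft-whitelist mail arriving via lkml
score    LKML_SOFT_WHITELIST  -5.0
```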

I heard that the list owners attempted to tighten the filters and
had the same problem -- too many "ham" emails got trapped.  Perhaps
it is all the code that gets published to that list?  Dunno, but
something seems in common with SPAM and, maybe, code (or at least
the normal linux-kernel-mailing-list "post") that is making it a hard
list to "police" ("clean") up.

Anyone else have stubborn lists like this or had successes in filtering
lkml?  I even split off "code-ish" looking posts to a separate folder,
but that still didn't stop the false negatives, so not quite sure
what makes such a list uniquely difficult to filter.

Not the worst problem -- at least it's confined to that folder,
but the various spams that are present make it a bit challenging to
read -- right in the middle of the tech stuff...just on the first
page of titles (conversations hidden under titles), 2/10 titles are
sex related spams.  It's a bit annoying to read through (sigh).

Now why would sex-spammers target lkml-readers?  Do they think
lkml-readers are uniquely more likely to respond to sex-spam?
(Maybe, given the fascination of the average "/." reader and
their amusement with "pr0n", there could be some basis to the
spammer's methods...?)...

thanks,
-linda



new problem after upgrading perl module to 3.1.4 (from 3.1.2)

2006-09-02 Thread Linda Walsh

I just updated to a newer version of SpamAssassin a few days ago.

Since then I'm getting regular error messages in my spamlog:
Sep  2 03:46:03 Ishtar spamd[13106]: (?:(?<=[\s,]))* matches null string 
many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE 
\Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.
Sep  2 03:49:04 Ishtar spamd[13087]: (?:(?<=[\s,]))* matches null string 
many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE 
\Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.
Sep  2 03:52:02 Ishtar spamd[13443]: (?:(?<=[\s,]))* matches null string 
many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE 
\Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.

...etc..etc...


Am I missing some needed configuration somewhere, or is the
above a problem? 


It seems to be happening with every message.

Um...is this like "unsolicited reporting of a bogus condition" and would 
it fall into syslog-spam? :-)


tnx,
Linda



Re: "non-fuzzy body parts in subject": missed

2006-04-17 Thread Linda Walsh

Matt Kettler wrote:

Yes it does.. the text of the subject line will match against any body rule. SA
pre-pends this so we don't have to have a massive duplication of rules to cover
both body and subject.

---
Ah.  Didn't know that.  Different tools, different lingo for
message, message header, message body.



"Want a Bigger MBP?"  A '25_replace' rule is present for "fuzzy"
MBP's, but doesn't seem to catch unfuzzy ones.
So I guess questions might be:
   1) should 'fuzzy' rules match non-fuzzy targets as well
  as fuzzy ones?


IMHO, no. I think there should be two rules with separate scores. In the above
example the scores would be pretty much the same.

---
I agree on keeping the rules separate, just didn't know fuzzy
Subj was included in body.


However consider the word viagra, an obfuscation is a clear sign of spam.
Un-obfuscated is a less strong sign of spam in this case, because it could be a
joke or a conversation with a medical discussion of some form.

---
Agreed.


Should it, or
rather, do people feel this is a good idea?


I don't feel that would be a good idea. Bear in mind this would also make a
"good" message (ie: one at -1.0) be "more good". It just doesn't make sense to
me to have something which merely acts as a "score amplifier" instead of a score
adjustment.

---
I realized it would increase "goodness" as well, but I guess I didn't
see that as much of an issue if the multiplier was applied last.


Performing any kind of GA to establish a reasonable multiplier value for these
would be a logistical nightmare.

---
:-)  True, but that doesn't mean SA couldn't "support" a post
multiplier! :-)  I can see its use would be somewhat limited though, as I'm
not sure under what other conditions one would want such scaling, so its loss
in "one" circumstance seems minor.  Sometimes I get overfocused on the problem
and blow up its severity in my mind.  Uh, maybe I can blame it on the original
spam's intent of increasing small problems? ;^?

Feedback is good! :-)

Tnx,
Linda




"non-fuzzy body parts in subject": missed

2006-04-17 Thread Linda Walsh

I have been receiving a spate of short messages that don't seem
to trigger enough default rules to be knocked out.  I was
investigating and noticed a discrepancy [bug?] in the rules.

One particular email refers to the uniquely male body part starting
w/"P"; let's call it MBP for purposes of discussion.


It gets hit by a '20' rule for body parts in the message body,
but I noticed it doesn't get anything for the subject:
"Want a Bigger MBP?"  A '25_replace' rule is present for "fuzzy"
MBP's, but doesn't seem to catch unfuzzy ones. 


So I guess questions might be:
   1) should 'fuzzy' rules match non-fuzzy targets as well
  as fuzzy ones?
   2) Should there be some "normalization" adjustment for
short messages? 


  I'm thinking a "scale factor" rather than an absolute score
to add, -- reflecting the general idea that short messages
are not bad, but if you are scoring on the "bad" side, a
multiplier (ex. 1.1 or 1.2) would increase the score of a message
that is already being sized up as "bad".

  Does SA support any multiplier type rules?  Should it, or
rather, do people feel this is a good idea?
i.e.: RULENAME *1.1 (0,*1.1,0,*1.1) type format?
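SA has no multiplier syntax, but a meta rule approximates the idea: extra points are added only when two "bad" signs co-fire, so it only amplifies already-suspect mail. A sketch with hypothetical sub-rule names (both would have to be defined; untested):

```
# Fires only when a short body coincides with an un-fuzzed MBP hit.
body  __MBP_PLAIN  /\bMBP\b/i
meta  SHORT_AND_MBP  (__SHORT_BODY && __MBP_PLAIN)
score SHORT_AND_MBP  1.5
```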

-l







Re: spamcop.net tactics

2005-11-22 Thread Linda Walsh

That doesn't mean it's a moral, an ethical or respectable reason:
"Spite" is reason enough for most people these days. 


Michele Neylon:: Blacknight.ie wrote:


if your IPs end up in there it's usually for a
reason.

Michele

 



when to SQL; RFE's (to dev?)

2005-10-30 Thread Linda Walsh



Michael Monnerie wrote:


On Samstag, 29. Oktober 2005 06:33 Linda Walsh wrote:
 


Assuming it is some sort of berkeley db format, what is a good
cut-over size as a "rule-of-thumb"...or is there?  What should I
expect in speeds  for "sa-learn" or spamc?  I.e. -- is there a
rough guideline for when it becomes more effective to use SQL
vs. the Berkeley DB?  Or rephrased, when it is worth the effort to
convert to SQL and ensure all the SQL software is setup and running?
   



I don't know whether this really is a performance question, but I 
believe it's more of a "do I need it" question. For example, if you use 
a system wide bayes db, you probably won't need SQL. I do this for now.
 


---
I'm still not sure what size system (or user) DBs should trigger
usage of "SQL".  Any reason why user DB's would hurt performance
over a system DB using Berkeley format?  Supposing I have no system
DB and am only using user DB's?  What if it is a small group 3-4 people?
Is it an issue of having to read in the DB with each email / user and
the system DB might hang around in memory?  Does the system DB get some
preferential treatment?  I.e. if one user gets 80% of the email, will
SA operate as though it is using a system DB?

   Still not so sure about why "sa-learn" would process emails so much
more slowly than 2.6x, since for an individual user, it wouldn't be
accessing a system DB, no?

But if some users want/need their own bayes, or own settings, it starts 
becoming easier to use SQL for all those things - it's quickly becoming 
easier to manage, after 5 users or so need their special config. That's 
why I'm thinking of switching to SQL.


Does anybody know whether MySQL or PostgreSQL is better suited for the 
job? I prefer PostgreSQL, but many times MySQL is better supported...


mfg zmi
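For whoever ends up switching: pointing Bayes at SQL is a handful of local.cf lines (the DSN and credentials below are placeholders; the SQL schema files ship with the SA source):

```
bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn      DBI:mysql:sa_bayes:localhost
bayes_sql_username sa_user
bayes_sql_password sa_password
```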
 



3.1 vs. 2.6x & 3.0x: Good; when to SQL; RFE's (to dev?)

2005-10-28 Thread Linda Walsh

Finally got the kinks worked out in my SA-3.1 setup last week. Filtered

out over 420 spams -- maybe 1 false positive, and it was borderline.

The speed of sa-learn has dropped, but that may be unavoidable.  But
I'm finally getting >= the spam recognition I had in 2.63.

I have online tests disabled, as the online test databases are going
the way of "cddb"...becoming privatized.  Sorta sad...maybe time to
start a "freezor" or some similar service.  I mean, the spam services
collect data about what is spam from users who use the database.  Without
the users, they wouldn't be nearly as effective.  Yet the users are then
encouraged to pay to access the body of data that was previously donated
for free.

I suppose one could look at the cost of "aggregation" and intelligent
processing of 1000's of user-spam inputs into a usable output format,
and while it might be manageable for a small community of users, it's
not so manageable if the database starts being used by a much larger
user-base than the original system was designed to run on.

Still -- I have yet to look at what is needed to convert my "db"s into
SQL form -- been sorta busy: car got crashed into last week and
was told this week it's totalled; and I was informed Tuesday
of the need for a root canal, and on Wednesday of the need for a 2nd
root canal & oral surgery.  *smile*  Life is just so _*!%fun!*%)_.

So I'm a bit behind on my ->SQL conversion (I'm
assuming I'm in an older format; I just ran the convert tool to convert
from 2.x format to 3.x).


Assuming it is some sort of berkeley db format, what is a good
cut-over size as a "rule-of-thumb"...or is there?  What should I
expect in speeds  for "sa-learn" or spamc?  I.e. -- is there a
rough guideline for when it becomes more effective to use SQL
vs. the Berkeley DB?  Or rephrased, when it is worth the effort to
convert to SQL and ensure all the SQL software is setup and running?

Thanks...and thanks for the help/patience

BTW -- maybe this should go to the "sa-dev" list, but an RFE:

"spamassassin --lint":

  1) would be nice to mention if daemon is _RUNNING_ and  ready
to process messages; (user error: forgetting to restart daemon and
seeing no "--lint" message hinting that the daemon isn't running and
ready to process incoming mail--*duh*)
  2) Would be nice, especially in "--lint" to check for bogus
lock files left around in spam DB dir.  I don't know when these files
are used, but their presence really slows down sa-learn by about a
factor of 4-6x.

"sa-learn":
  1) RFE: have sa-learn issue warning about pre-existing lock-files,
or, better,  auto-remove bogus locks for processes that no longer exist.
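A sketch of that second RFE, assuming the lock file holds just the owning PID (SA's real lock format differs slightly, so adapt the `cat`); demonstrated against a temp dir so it is safe to run as-is:

```shell
dir=$(mktemp -d)
lock="$dir/bayes.lock"
echo 99999999 > "$lock"                     # a PID that cannot exist
# Remove the lock only if its recorded owner process is gone.
if [ -f "$lock" ] && ! kill -0 "$(cat "$lock")" 2>/dev/null; then
    echo "removing stale lock"
    rm -f "$lock"
fi
```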

 





Re: SA 3.04: high fail rate; X-SA-no-reject?; more details.

2005-09-18 Thread Linda Walsh



Loren Wilton wrote:


If you are only correctly classifying 50% of the spam (you said 100 caught
to 100 missed, I think) then you have SERIOUS problems of some sort. 



   Yeah, well, I try not to be too reactionary on computer
things like this -- especially when it could just be a
matter of flipping a config switch somewhere and things get
instantly better.  While the number of spams getting through
are significantly higher, probably 75-80% of them are duplicate
emails sent to multiple email addresses -- including some
blacklisting To-Addresses.  Apparently, the spammer isn't being
kind enough to send the spam to the black-listed To-Add'ies first
and with the new spamc client, sendmail notices the lower load
average and likely allows more parallel incoming instances to
process incoming email before a given spam gets "locked out".
I suppose this could be a "downside" of this efficiency, but
previous to this I never saw multiple instances of these
simple spams get through **undetected**.  This makes me think
it isn't just the increased efficiency causing problems as
I would have expected at least one or two duplicate spams
that wouldn't have been caught by "other means" (than being
sent to a blacklisted To-addr).


As a
happy 2.63 user that upgraded to 3.04, it took a little minor fiddling, but
by and large things are *much* better now, and they were good before.
 


-- *(oh the salt, the salt [in the wound]...:-) )* ---



Also, you mentioned training with 'old spam' and 'new ham'.  Presumably you
were talking about bayes training.  Really training with new spam,
especially the stuff slipping through, would be the right thing to do.  Spam
has changed considerably in character in just the last 6 months.
 



   Sorry, unclear: I archive current spams after "sa-learn"ing on
them, so "archives" contain anything older than whatever I
haven't processed "recently".  With SA 2.63, I'd go through my
Junk email folder sometimes as infrequently as once/month and find
maybe 6-10 emails that should have gone to subscribed lists or
were from recent online vendors that sent me spammy-looking
receipts (although those were rare).  I'd drop them in my "despam"
folder for later "ham learning".  But sifted folders of junk
email I process (sa-learn-junk) in bulk and archive.
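The bulk "sa-learn then archive" pass described above could be sketched
as a small shell function.  The folder names, mbox format, and archive
layout are all assumptions for illustration, not my actual paths:

```shell
# Hedged sketch of a sift-learn-archive pass.  Assumes the sifted junk
# is one mbox file and sa-learn is on PATH.

sa_learn_junk() {
    junk="$1"       # mbox of sifted, confirmed spam
    archive="$2"    # directory holding processed spam archives
    sa-learn --spam --mbox "$junk" || return 1
    mv "$junk" "$archive/spam-$(date +%Y%m%d).mbox"
    : > "$junk"     # start a fresh, empty junk folder
}
```

The "despam" folder would get the mirror-image pass with
`sa-learn --ham` before being refiled.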


Suggestion: let us see the full list of SA hits on some of the stuff
slipping through.
 


The full list of SA hits? -- for that message, that was it; here's another
that passed.  Note there is a weird header "X-SA-Do-Not-Rej: Yes" which doesn't
look normal:
---junk email that passed as ham; sent to multiple email accounts---
Received: (qmail 16547 invoked from network); 16 Sep 2005 18:08:51 -
Received: from unknown (HELO thaimail.org) ([202.150.81.42])
 (envelope-sender <[EMAIL PROTECTED]>)
 by mail7.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP
 for <[EMAIL PROTECTED]>; 16 Sep 2005 18:08:49 -
From: "Molnar Chris" <[EMAIL PROTECTED]>
To: "Siedler Clemens" <[EMAIL PROTECTED]>
Subject: Re[6]:
Date: Fri, 16 Sep 2005 18:09:04 +
Message-ID: <[EMAIL PROTECTED]>
X-SA-Do-Not-Rej: Yes
MIME-Version: 1.0
Content-Type: multipart/alternative;
   boundary="=_NextPart_000_40CE_1F627A89.B53D40CE"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2527
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
X-Spam-DCC: :
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on ishtar.sc.tlinx.org
X-Spam-Level: ***
X-Spam-Status: No, score=3.5 required=4.8 tests=BAYES_99,HTML_MESSAGE
   autolearn=no version=3.0.4
X-Spam-Pyzor:
X-Spam-Report:
   *  0.0 HTML_MESSAGE BODY: HTML included in message
   *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
   *  [score: 1.]
Status:
X-Status:
X-Keywords: Junk  



In a rejected email, I see many more tests:

--junk email correctly 
labeled--


Subject: ***SPAM*** Athena, Electric-chair for little or no-cost
MIME-Version: 1.0
X-Mailid: 6977
Content-Type: multipart/alternative; boundary="==8aa9d3a4cb398b"
Date: Thu, 15 Sep 2005 14:56:00 -0700
X-Spam-Prev-Subject: Athena, Electric-chair for little or no-cost
X-Spam-DCC: :
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on ishtar.sc.tlinx.org
X-Spam-Level: **
X-Spam-Status: Yes, score=6.9 required=4.8 tests=BAYES_99,HTML_90_100,
   HTML_IMAGE_ONLY_20,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_WEB_BUGS,
   MIME_HTML_MOSTLY,MPART_ALT_DIFF,MSGID_FROM_MTA_HEADER,
   MSGID_FROM_MTA_ID autolearn=no version=3.0.4
X-Spam-Pyzor:
X-Spam-Report:
   *  1.7 MSGID_FROM_MTA_ID Message-Id for external message added locally
   *  0.4 HTML_IMAGE_ONLY_20 BODY: HTML: images with 1600-2000 bytes of words
   *  0.0 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image area
   

SpmAssn 3.04 v. 2.6x false negative rate: Help???

2005-09-15 Thread Linda Walsh


Ever since I "upgraded" to the 3.x series I've had a major jump
in spams that are getting through.

Initially my upgrade was to 3.02 as distributed in SuSE 9.3 and
my problems were related to old configuration files/options: NONE of my
spam was being tagged, so none was filed into the spam folder (i.e. the
SPAM marker wasn't set in the subject, which my filtering system relies on).
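The kind of subject-based filing being relied on here might look like
the following procmail recipe.  The "Junk" folder name is an assumption;
the "***SPAM***" marker matches the tagged example later in this thread:

```
# Sketch of a procmail rule filing subject-tagged spam; folder name
# is illustrative.
:0:
* ^Subject:.*\*\*\*SPAM\*\*\*
Junk
```

Matching on "^X-Spam-Flag: YES" instead would survive a change of
subject-rewrite string.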

I've gotten all of the "lint" out of my config files, ported my old
DB to the new format, and even ran the learning mechanism over several
old "SPAM" archives (~150Mb) and current "HAM" input folders (~100Mb).

About 100 spams a day are getting through and requiring manual
processing with about 100/day being correctly filtered into the spam
folder.  That's a huge drop in detected spams.  I've tried dialing
down the threshold from the default to my previous 5, then to 4.8...
not wanting to be overly aggressive.  But I'm wondering if the default
weightings for various tests have been changed between the 2.6x and 3.0x
series.
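For reference, both knobs in play here are plain config directives.  A
hedged local.cf / user_prefs sketch -- the values are the ones mentioned
in this thread, shown as examples rather than recommendations:

```
# Hypothetical local.cf / user_prefs fragment -- example values only.
required_score 4.8      # the threshold being dialed down to above
score BAYES_99 4.0      # per-test override (the 3.0.4 headers here show 3.5)
```

Comparing "score" lines between the shipped 2.6x and 3.0x rule files is
the direct way to confirm whether default weightings changed.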

I note a new 3.1.0 release, but noticed no improvement going from 3.02
to 3.04.

It _seems_ like, maybe, some of the weightings of the various tests
changed which is throwing off the classifier.  I'll see multiple instances
of various, identical spams going to different email addresses on my
server -- most often with "Subject: Re[x]:", where x=[0-9].  They are
the most numerous offenders as they'll come in to multiple accounts
at nearly the same time (or a few seconds apart).  One copy of those
messages will result in duplicate spam being sent to several accounts,
and my multiple personalities, er, um, "users" :-), are getting annoyed
with me.

Also of note: "sa-learn" is MUCH slower in 3.0.x than it was in 2.6.x though
with the compiled "spamc" client, I can see that the processing of incoming
spam is handled with a lower load on the server.

One voice in my head says, screw it, stop your whining and go back to what
worked (2.6x), but another part of me says "3.x" is where the future is, and
if there is a problem in my setup, I should take the time to figure out
what the problem is and try to make it work.

Looking at a partial header of one note:
X-Spam-Report:
  *  0.0 HTML_MESSAGE BODY: HTML included in message
  *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
  *  [score: 1.]


Content was a multi-part message in MIME format, same message in plain and
HTML text:

  ""
--
Content involved advertising a product to increase one's
chance of producing offspring via chance encounters with receptive
female partners.  Is 5.0 too high a default in 3.x?  Though I would have
expected it to count a little bit more for an HTML message...

Ooops, another batch of 80+ just came in...  SA tastes great, less filling!
re: first posting attempt: 


<<< 552 spam score (9.1) exceeded threshold

on a list designed to talk about a tool to detect such spam>


And the irony of this restriction may never be known if this
note never makes it to the list...;-/.

Sigh,
Linda