How to configure FOO=-1.0 in X-Spam-Status ?

2015-11-12 Thread Christian Jaeger
Hi

I'm seeing X-Spam-Status headers from some other installation come
with =$x appended to the individual matches, which evidently helps in
figuring out why a mail is being classified the way it is. I've spent
more than an hour on googling and rtfm but couldn't figure it
out. Also, grep does not turn up any occurrence of 'Spam-Status' in
the source code, and I don't feel like reading all of the source code
for this right now. Please tell me how I can set this up.
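
To make the format I mean concrete, here is a minimal Python sketch that reads such a header value and splits the `tests=` list into rule names and scores. The function name `parse_tests` and the sample header are made up for illustration; it only assumes the `NAME=score` layout described above and says nothing about how SpamAssassin produces it.

```python
import re


def parse_tests(status_value):
    """Return {rule_name: score or None} from an X-Spam-Status header value."""
    # A long tests= list is usually folded after a comma; rejoin it first.
    flat = re.sub(r",\s+", ",", status_value)
    match = re.search(r"\btests=(\S*)", flat)
    if not match or match.group(1) in ("", "none"):
        return {}
    result = {}
    for item in match.group(1).split(","):
        if not item:
            continue
        name, sep, score = item.partition("=")
        # Without the appended "=score" part only the rule name is known.
        result[name] = float(score) if sep else None
    return result


# Hypothetical header value, not taken from a real message.
example = ("No, score=-2.9 required=5.0 tests=BAYES_00=-1.9,\n"
           "\tFOO=-1.0 autolearn=no")
print(parse_tests(example))
```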

Thanks!
Christian.


Re: SPAM from our own domain

2015-10-01 Thread Christian Jaeger
I've noticed Haraka (and I've used software by Matt Sergeant before,
AxKit and qpsmtpd). I'm using qpsmtpd as the SMTP daemon already
(that's how you do spam filtering in a Qmail setup, and qmail is the
original delivery backend for qpsmtpd), which has the same
architecture as Haraka (single process, event based, plugins). I
haven't tried it, so I wouldn't know, but I'm still surprised that you
suggest it is similar(?) to Qmail. It hasn't triggered my interest so
far; I don't need raw speed for my purposes, and if I add a tool in
another programming language then I might wait for a replacement in a
language that allows for static verification (like Rust, Haskell,
Ocaml) or has a process model built in (Erlang/Elixir); I know some
MTAs already exist in those languages, although perhaps not stable or
complete enough for me right now.

Perhaps let's stop or move this discussion elsewhere now as it's
probably OT. I wanted to offer Tom my help, I didn't ask for help
about MTAs here.

Christian.


Re: SPAM from our own domain

2015-10-01 Thread Christian Jaeger
I don't have any problems with Qmail. I can code when needed for the missing 
parts. The unsolved problems I have with my mail setup would not be solved by 
going with another MTA. I like and know Qmail's configuration, modularity and 
security track record. This way my time is spent learning how things work, as 
opposed to how to use another MTA.

Ch.


Re: SPAM from our own domain

2015-09-30 Thread Christian Jaeger
Hi Tom,

You write that you're using Qmail. If you'd like to implement DKIM
signing for your outgoing mail (for possibly better deliverability and
also so that you could verify incoming mail and be sure that it's not
your own by verifying the DKIM signature), then perhaps you're
interested in what I've worked on: I didn't want to patch Qmail for
DKIM support, so I took an existing, older script that works as a
wrapper for qmail-remote and improved it. It can now also add hashcash
stamps (which may or may not actually be useful, but in any case
doesn't hurt me since I've got plenty CPU time for it), and checks
whether the mail to be sent is a bounce for a likely spam, in which
case it diverts it locally to avoid sending backscatter. This has only
been tested by me so far, so YMMV.. tell me if you run into problems.

https://github.com/pflanze/better-qmail-remote

I've just made a release v0.1, you can verify its signature with `git tag -v 
v0_1` using the key:

http://christianjaeger.ch/pgpkey-A54A-1D7C-A1F9-4C86-6AC8--1A1F-0FA5-B211-04ED-B072.asc

Cheers,
Christian.


Re: Bayes Filtering

2015-08-02 Thread Christian Jaeger
On August 2, 2015 6:40:10 PM CEST, Reindl Harald h.rei...@thelounge.net wrote:
 no idea what you are talking about by saying
 I can't find anything about this in the docs

I'm talking about the bundled docs. The man / perldoc pages of 
Mail::SpamAssassin::Plugin::Bayes / Mail::SpamAssassin::*Bayes* and the default 
config files. That's where I expected this info to be. It's something simple 
and basic, i.e. something that the writer of the software can foresee the need 
for documentation, so it makes sense that it's in the same files that the 
programmers wrote. That's where I start looking. That's where qpsmtpd, which 
I'm configuring around the same time, has its basic docs.

Ch.


Re: Bayes Filtering

2015-08-02 Thread Christian Jaeger
On August 2, 2015 7:36:36 PM CEST, RW rwmailli...@googlemail.com wrote:
 In future start with
 
  man spamassassin 
 
 which will lead you to:
 
 CONFIGURATION
Mail::SpamAssassin::Conf  SpamAssassin configuration files

I think I've actually seen this page recently. I also remember having looked 
through the bayes_* options (about a week ago) to see whether there's one that 
might indicate the number of required messages learnt, but couldn't find any 
(now I've seen bayes_min_ham_num / bayes_min_spam_num). I don't know how that 
happened, perhaps I was seeing another page (perhaps online), perhaps I had too 
many things in my mind then at the same time and was interrupted or 
distracted (when starting to configure these systems (DNS, qmail, ezmlm, 
qpsmtpd, dovecot, SA), there are just too many things to take care of to not 
make errors sometimes).
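
As a rough sketch of what these two options mean for the behaviour I remembered: the Bayes result is only used once enough ham *and* spam have been learnt. The Python below is not SA's code; the function name is made up, and the default of 200 for each option is what I understand Mail::SpamAssassin::Conf documents.

```python
def bayes_is_active(nspam, nham, min_spam_num=200, min_ham_num=200):
    """True once both learnt-message counts reach their minimum.

    min_spam_num / min_ham_num stand in for bayes_min_spam_num and
    bayes_min_ham_num; 200 is assumed to be the documented default.
    """
    return nspam >= min_spam_num and nham >= min_ham_num


# Plenty of ham but too little spam learnt: Bayes would stay unused.
print(bayes_is_active(nspam=150, nham=500))
```

The learnt counts themselves can be checked with `sa-learn --dump magic`.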

 Normally the main man page has the name of the project or its main
 executable. It's not normal to document how a feature is configured
 in the documentation for library that implements that feature. 

qpsmtpd is different here since it has a plugin architecture and then it 
definitely makes more sense to document things in the plugins, which are just 
modules. If spamassassin does not have such an architecture then I agree it 
makes sense to document options where they are processed, i.e. the module which 
parses them. 

I know I'm sometimes confusing spamassassin and qpsmtpd. Both are in Perl and 
used together in my setup. I've grown a habit of thinking docs are in the 
modules and when I was checking the SA docs again before sending my post I 
followed this habit without realizing that it's not the configuration of a 
qpsmtpd plugin in this case. Please don't judge me too hard, I'm trying to get 
on with things as quickly as I can like most everybody, I've got other things 
on my plate, too.

So, I don't have a suggestion for improvement. Hopefully my post still helped 
the OP?

Cheers,
Christian.


Re: Bayes Filtering

2015-08-02 Thread Christian Jaeger
On August 2, 2015 5:15:08 PM CEST, Reindl Harald h.rei...@thelounge.net wrote:
 
 Am 02.08.2015 um 14:57 schrieb Roman Gelfand:
  Could somebody post a successful bayes configuration?
 
 ??
 
 you just need to *train* it for ham *and* spam

I think I remember from past use of SA that it only uses the bayes database 
once a certain number of messages have been learnt. It has confused me, too, 
now. I can't find anything about this in the docs, though, and neither have I 
found a test in the sources by way of searching for 'number', but that's not a 
thorough check. If I remember this detail correctly, it would be a good idea to 
add it to the docs.

Ch.


Re: Hashcash not working

2015-07-31 Thread Christian Jaeger
On July 31, 2015 4:37:14 PM CEST, RW rwmailli...@googlemail.com wrote:
 SA usually gets envelope information from headers. Since there are
 several headers that could contain the envelope recipient, it would
 need to be configured, so still wouldn't work by default.

That's why I mentioned RECIPIENT. The MTA knows where it's going to, the 
information just needs to be passed on to SA.

 It's probably for the best that it doesn't work by default. It would
 likely have been exploited by spammers if it were. 

Well, it seems that right now hashcash is of no use. If we actually use it then 
the worst that could happen (in the case that spammers can really generate 
hashcash as easily as legitimate senders) is that it's also of no use. But 
isn't there also a chance that it doesn't turn out that badly?

 Hashcash for email isn't a very good idea. Even if it were ubiquitous
 and email couldn't be sent without it, it wouldn't be a major
 impediment to spammers. If spammers don't have to add a hashcash
 header
 to everything, it doesn't even slow them down, it's just an
 opportunity
 to make some of their spam more deliverable.

I don't really see the logic in your statement. 

It doesn't need to be ubiquitous, my thinking is that it would be useful as an 
additional indication for *important* email that the email isn't spam 
(especially if end client applications (web or otherwise) would adopt it, so 
that it could use something like 20 seconds of CPU time). E.g. not for mailing 
list emails, but for personal email where you don't want the email to be lost 
(have a button that says retry more forcefully or something, that you could 
push when you suspect the receiver didn't get a mail, or when you're contacting 
someone the first time and think it's important, that then does the 20 second 
(or more) hashcash calculation). 20+ seconds would be rather hard to compete 
against I'd think. If it means that a spammer could only afford say 2 seconds, 
and even for the 2 seconds would have to reduce sending rate to a tenth, that 
would already be good? If it means that they can only make *some* of ther spam 
as well deliverable as currently, that's also success, no? I expec
 t the scores to adapt so that low-effort hashcash would have zero effect on 
the spam score, but high-effort hashcash would still point towards ham.
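
To put rough numbers on "20 seconds" versus "2 seconds": a stamp with n leading zero bits takes about 2^n hash attempts on average, so the stamp size a sender can afford follows from the time budget and the hash rate. A minimal Python sketch, where the hash rate is purely an assumed figure and the function names are made up:

```python
from math import floor, log2


def expected_attempts(bits):
    """Average number of hashes needed for a stamp with `bits` zero bits."""
    return 2 ** bits


def expected_seconds(bits, hashes_per_second):
    """Average minting time at a given (assumed) hash rate."""
    return expected_attempts(bits) / hashes_per_second


def bits_for_budget(seconds, hashes_per_second):
    """Largest stamp size whose average cost fits into the time budget."""
    return floor(log2(seconds * hashes_per_second))


ASSUMED_RATE = 5_000_000  # hashes per second; a guess, not a measurement
print(bits_for_budget(20, ASSUMED_RATE), bits_for_budget(2, ASSUMED_RATE))
```

Each additional bit doubles the cost, so a tenfold difference in CPU budget is only worth a bit more than three bits.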

I think it boils down to the question of whether spammers really have enough 
CPU power for multi-second hashcash per recipient calculations (or, as much as 
legitimate senders). Others have argued that the heat/fan activity would make 
some people more suspicious and make them get rid of the abuser. (This by 
itself would already be a good thing.) I also wonder whether it wouldn't be 
more worthwhile for criminals to use the available CPU power for Bitcoin mining 
instead? Any sources for numbers?

Why not simply try it? Wouldn't the worst case be that the scores would be 
adapted to around zero when spammers would really start using it? Is it fear of 
making the system more complex and then not understanding it anymore?

(BTW is there a framework in SA to statistically analyze combinations of 
characteristics? So that by learning (sa-learn) client installations could 
adapt automatically? Or is that too CPU heavy? Or precalculate the data for 
everyone but let client installations adapt those (implicit) 'scores' through 
learning?)

Christian.


Re: Hashcash not working

2015-07-31 Thread Christian Jaeger
On July 31, 2015 4:51:02 PM CEST, Bill Cole 
sausers-20150...@billmail.scconsult.com wrote:
 John Levine wrote a definitive debunking of e-postage schemes
 including 
 hashcash over a decade ago (http://www.taugh.com/epostage.pdf) and 
 published an update (substantively unchanged) via Virus Bulletin in
 2009 
 (https://www.virusbtn.com/spambulletin/archive/2009/03/sb200903-epostage.dkb?mobile_on=no).
 
 All of his points against e-postage in general and hashcash
 specifically 
 have held up over time.

I've read both links, they both bring the same two arguments:

 The technical problems are that some computers are a lot faster than others

I see a social problem with this: that in principle it penalizes poor people. 
But let me restate:

As I already said in my other email, for me hashcash seems to make sense where 
you really need to deliver a particular important, personal email. I don't care 
for a fairy dust solution that would solve sending legitimate mass email (be 
it mailing lists or ). I'm fine with those being filtered the way they are now. 
I care about reducing the risk of loss of *important* emails, especially in 
situations where currently the risk is high, i.e. there's no whitelisting 
through previous communications. Those cases are few.

It's easy to spend even minutes of CPU time on such cases. Or, since the 
article argues that grandma has a 100 MHz computer, the ISPs could offer 
premium email, where the piece costs a few cents (hey, cheaper than SMS with 
many providers!), and then run hashcash on a few powerful servers in parallel 
for a minute with a total CPU budget of several minutes.

Now I would expect that ISPs in 3rd world countries would offer hashcash 
generation for a lower margin, and hence even people there could easily afford 
sending important mails with hashcash.

(If grandma's ISP wouldn't offer premium email, she'd have to send the email 
without hashcash, and it would still have a decent chance of deliverability, or 
she would have to leave her computer running for an hour until it is sent. As I said, 
it would be rare to need it.)

Yes, that's when users' clients get the ability to compute hashcash, and ISPs 
adopt it. I.e. when it really catches on. Before that point, there's a phase 
where we're experimenting and hashcash doesn't play a big role in spam 
recognition (and grandma doesn't even come into play). The article argues in an 
absolute that ignores possible developments.

 and that currently spammers have a lot more computer power at their disposal 
 than legitimate senders do

 Furthermore, spammers have vast arrays of hijacked `zombie' computers at 
 their disposal. Blacklist maintainers report adding 10,000 newly hijacked 
 computers to their blacklists per day.

 No legitimate mailer has anything like 10,000 computers dedicated to sending 
 mail, much less 10,000 additional computers a day, meaning that it would be 
 easier for spammers to satisfy hashcash than for legitimate senders.

It compares a daily differential in the numbers of hijacked computers worldwide 
with the numbers of computers available to a single mailer? (How many are 
*removed* from the blacklists per day, btw?)

Please give me actual real numbers and I can do actual calculations.

So where's the actual debunking?

Christian.


Re: Hashcash not working

2015-07-31 Thread Christian Jaeger
On July 31, 2015 9:13:03 PM CEST, RW rwmailli...@googlemail.com wrote:
 On 31 Jul 2015 17:57:28 +0200
 Christian Jaeger wrote:
 
  On July 31, 2015 4:37:14 PM CEST, RW rwmailli...@googlemail.com
  wrote:
   SA usually gets envelope information from headers. Since there are
   several headers that could contain the envelope recipient, it
 would
   need to be configured, so still wouldn't work by default.
  
  That's why I mentioned RECIPIENT. The MTA knows where it's going to,
  the information just needs to be passed on to SA.
 
 You're making some assumptions about how SA is being used. 

When does RECIPIENT break? man qmail-command says that RECIPIENT is the 
envelope recipient address. Shouldn't this be the unchanged To/Cc/BCC address 
that the mail is currently being delivered to, assuming no forwarding was done?

 I can see why they went with  hashcash_accept, it always works - even
 if the recipient is rewritten.

I don't expect hashcash in forwarded email to be found without special 
configuration. If it finds the matching hashcash in non-forwarded configuration 
that sounds fine to me.

  I don't really see the logic in your statement. 
  It doesn't need to be ubiquitous, 
 
 In the hashcash FAQ they argue that hashcash is useful against botnets
 because it slows them down. But this would only be correct if hashcash
 were essential to delivery. If it isn't then hashcash support in
 spamfilters would benefit spammers because they can send a mixture of
 spam with and without the header. They'd get extra deliverability
 without any slow down at all.  

Hm, I see your point, they could use the CPU they have available but still 
saturate their network capability, too. The effect will be complicated to 
calculate. Possibly by sending spams without hashcash over the same network 
their IPs will be blacklisted enough to prevent the spam with hashcash from 
being delivered either. 

I guess their strategy will be to pregenerate as much hashcash as they can, 
then first send spam with hashcash, then when they've run out of hashcash send 
spam without, thus staying more likely in the green while they have hashcash 
then continue as long as they can or makes sense without. (I don't have deep 
insights into how spammers work, I'm just reckoning here. Hopefully at least as 
well as the writers of some articles.)

 One of the problems with hashcash is that its algorithm is
 well optimised for GPUs and other heavily parallel hardware. The 20
 seconds on an ordinary core could be milliseconds on a machine made
 from just gaming hardware.

Normal CPUs have SIMD instructions, and one could use all cores, then the 
difference shouldn't be that vast (make that number of milliseconds something 
in the range of thousands, then?). But agreed, scrypt would make more sense 
here. 

This is an attack on the hashing algorithm, not the concept as a whole. 
(Calculating hashes in browsers will eagerly await widespread support of SIMD 
in JavaScript; but this is again a problem that could go away if hashcash 
really got successful, browsers could include hashing functions implemented in 
C/C++/Rust/ASM.)

 Spammers also have the advantage that they don't have to work in real
 time - they can  generate postdated stamps in advance of a spam run.

Ok, that means they can keep their moves quick (quick bursts until IP blocked 
etc.), but the total amount of hashcash they can produce stays the same. (Also 
see the above.)

Maybe the concept could be extended to use a challenge-response scheme (e.g. 
where the receiving SMTPd would present a challenge, then let the sender 
(optionally) disconnect, calculate the hashcash with the challenge as 
additional input, then reconnect; or provide the challenge over DNS with short 
TTL).
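
A minimal Python sketch of what I have in mind, under stated assumptions: it uses the version 1 stamp layout (`ver:bits:date:resource:ext:rand:counter`, valid if the SHA-1 of the whole stamp starts with at least `bits` zero bits) and simply puts the receiver's challenge into the `ext` field. That use of `ext`, and the names `mint` and `verify`, are my own illustration, not anything hashcash or SA defines.

```python
import hashlib
import secrets
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a hash digest."""
    bits = 0
    for byte in digest:
        if byte:
            return bits + 8 - byte.bit_length()
        bits += 8
    return bits


def mint(resource, bits, challenge, date="150731"):
    """Search for a counter that makes the stamp hash to `bits` zero bits.

    The challenge must not contain ':' since that separates the fields.
    """
    rand = secrets.token_hex(8)
    prefix = f"1:{bits}:{date}:{resource}:{challenge}:{rand}:"
    for counter in count():
        stamp = f"{prefix}{counter:x}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp, resource, challenge, min_bits):
    """Accept only stamps minted for this resource *and* this challenge."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or not fields[1].isdigit():
        return False
    claimed = int(fields[1])
    if claimed < min_bits or fields[3] != resource or fields[4] != challenge:
        return False
    digest = hashlib.sha1(stamp.encode()).digest()
    return leading_zero_bits(digest) >= claimed


# 12 bits keeps the demo fast; a real stamp would use far more.
stamp = mint("someone@example.org", 12, "challenge123")
print(stamp, verify(stamp, "someone@example.org", "challenge123", 12))
```

Since the challenge is only known once the receiver has issued it, stamps bound to it cannot be pregenerated before the spam run.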

Is there a(nother) good place already to discuss these concepts? (Wiki, etc.) I 
don't intend to 'spam' this list too much with this. But I think it's 
interesting to read and think more about this. (There seems to be a ML linked 
from hashcash.org, but the last message in the archives is from 2012.)

Christian.


Re: Hashcash not working

2015-07-31 Thread Christian Jaeger
On July 30, 2015 2:40:35 AM CEST, RW rwmailli...@googlemail.com wrote:
 The plugin is on by default and  use_hashcash defaults to 1, but you
 need to set hashcash_accept to an appropriate value 

That's disappointing. For me that barely counts as on by default. I was 
thinking that implementing hashcash would help get my mail delivered to at 
least the spamassassin users, but this means that no, only to the subset that 
cares about configuring it. 

Does SA not know which address(es) an email is being delivered to? If it knows 
(knew), it could just compare those addresses, no? (E.g. qmail sets various 
environment variables, e.g. RECIPIENT, when running filters, can't SA use this? 
I'm using QPSMTPD, I suppose spamc could be modified to pass recipients, too?)
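
Roughly what I mean, as a Python sketch rather than anything SA does today: take the resource field of the stamp and compare it with the envelope recipient the MTA passed along. It assumes the filter is started in a way that has RECIPIENT in its environment (as man qmail-command describes) and the version 1 stamp layout; the function name is made up, and comparing case-insensitively is a simplification.

```python
import os


def stamp_matches_recipient(stamp, environ=os.environ):
    """True if the stamp's resource field equals the envelope recipient."""
    recipient = environ.get("RECIPIENT")
    fields = stamp.split(":")
    # Version 1 stamps: ver:bits:date:resource:ext:rand:counter
    if recipient is None or len(fields) != 7:
        return False
    return fields[3].lower() == recipient.lower()


# Hypothetical stamp and environment, for illustration only.
print(stamp_matches_recipient("1:20:150731:foo@example.org::abcd:1f",
                              {"RECIPIENT": "foo@example.org"}))
```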

If the answer is no, then I realize that there's also an accidental 
double-spend issue? My qmail-remote wrapper adds a X-Hashcash header for every 
recipient address the qmail-remote is being called with. I was thinking that the 
receiver could restrict itself to only look at (and mark in the database) the 
header for the delivery that's being made. Now I worry that if I send an email 
with To: f...@bar.com, b...@bar.com with two X-Hashcash headers that, if SA 
is run separately for each sub-delivery, then it will mark both headers in the 
first delivery and add a penalty for used hashcash to the second. Luckily, I'm 
running SA from qpsmtpd, which should only run it once when it receives the 
double delivery. I suppose SA could prevent this issue from happening in other 
cases by storing the message-id together with the spent token.
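
A minimal sketch of that idea in Python, with an in-memory dict standing in for the real database and a made-up class name: a stamp only counts as double-spent when it shows up again under a *different* message-id.

```python
class SpentStamps:
    """Remembers which message-id each stamp was first seen with."""

    def __init__(self):
        self._first_seen = {}

    def check_and_record(self, stamp, message_id):
        """Return True if the stamp is fresh or was spent by this message."""
        first = self._first_seen.setdefault(stamp, message_id)
        return first == message_id


store = SpentStamps()
# Two sub-deliveries of the same mail both see the same stamp.
print(store.check_and_record("stamp-A", "<m1@example.org>"))
print(store.check_and_record("stamp-A", "<m1@example.org>"))
# A different mail reusing the stamp is a real double spend.
print(store.check_and_record("stamp-A", "<m2@example.org>"))
```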

My decision to spend time to implement this was based on reading in 
wikipedia[1] that SA is checking them. I think this needs a mention that it 
only happens when configured. If you don't disagree, I'll change that.

 [1] https://en.wikipedia.org/wiki/Hashcash


In any case, I've configured it now and it still doesn't work. Off again 
working on debugging it.

Christian.


Hashcash not working

2015-07-29 Thread Christian Jaeger
Hi

I've implemented (or at least so I thought) Hashcash for my outgoing mail (in a 
Perl wrapper around qmail-remote that I already had to do DKIM), using the 
`hashcash` tool as provided by Debian, using the `-X` command-line option. This 
tool returns multi-line headers if the email address the hash-cash is minted 
for is long enough. This might be the reason that Mail-SpamAssassin-3.4.1 
ignores those, I guessed, so I delved into the code.

Here's an example header it generates (you could also check the source of this 
email):

 X-Hashcash: 
1:23:150729:c...@a.christianjaeger.ch::BIsU5nVO5XGrvOIr:00
 6t75
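
For reference, this is how I would expect a receiver to get the stamp back out of such a folded header, sketched in Python. It assumes that the line break and the whitespace around it are not part of the stamp and can simply be dropped; the function name and the address in the example are made up.

```python
import re


def stamp_from_header_value(value):
    """Remove folding (newlines and the whitespace around them)."""
    return re.sub(r"\s+", "", value)


folded = "\n1:23:150729:user@example.org::BIsU5nVO5XGrvOIr:00\n 6t75"
print(stamp_from_header_value(folded))
```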

Now another thing I notice is that this format is longer than the examples 
shown in the code, e.g.

 X-hashcash: 1:20:040803:a...@cypherspace.org::a1cbc54bf0182ea8:5d6a0

Does anyone know if that is already a problem?

Then I noticed that the following regex disallows \n in the header value; do 
decoded header values have a \n where they wrap, or not?

Then I notice in this commit:

https://github.com/apache/spamassassin/commit/a95d2cfd2cc07deac9842cfaf10d6d9a85365b12

    # untaint the string for paranoia, making sure not to allow \n \0 \' \
 -  $hc =~ /^([-A-Za-z0-9\xA0-\xFF:_\/\%\@\.\,\= \*\+\;]+)$/; $hc = $1;
 +  if ($hc =~ /^[-A-Za-z0-9\xA0-\xFF:_\/\%\@\.\,\= \*\+\;]+$/) {
 +    $hc = Mail::SpamAssassin::Util::untaint_var($hc);
 +  }
    if (!$hc) { return 0; }

This looks like it isn't correct: before the patch, it would assign undef to 
$hc if the regex fails (right?), now it leaves the tainted original $hc value 
in place. Surely not what was meant, right?
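
To show the difference I mean, here is a Python analogue of the two variants (not the actual SA code). It models my reading above, i.e. that a failed match used to leave nothing in $hc, and it uses the same character class. Perl's `$` may also match just before a trailing newline, which this sketch does not model.

```python
import re

# Same character class as in the Perl code; \n is not in it.
ALLOWED = re.compile(r"[-A-Za-z0-9\xA0-\xFF:_/%@.,= *+;]+")


def sanitize_old(hc):
    """Before the patch, as I read it: no match means no value."""
    match = ALLOWED.fullmatch(hc)
    return match.group(0) if match else None


def sanitize_new(hc):
    """After the patch: a failed match leaves the original value in place."""
    if ALLOWED.fullmatch(hc):
        hc = str(hc)  # stands in for untaint_var(); the value is unchanged
    return hc


bad = "1:20:040803:x@example.org::a1cb\n:5d6a0"
print(repr(sanitize_old(bad)), repr(sanitize_new(bad)))
```

With the old variant the following `if (!$hc) { return 0; }` rejects the bad value; with the new one it is still set and passes that check.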

I'm planning to debug this further (hm, debugging a live daemon is always 
painful, actually writing a tool for that now, will defer my work here), but 
would welcome feedback.

Christian.