Daryl C. W. O'Shea wrote:
> 
> I'm not being a "std crusader".  I'm simply pointing out that someone
> going their own way shouldn't expect everyone to accept their way,
> especially when there's an established majority going the other way.
> 
> Just because a mail is "standards compliant" doesn't mean that we can't
> use statistical methods to decide if it may or may not be spam.
> 

agreed, but

> "break the standards by tagging std compliant mail" has to be the
> funniest thing I've heard all day.  Both standards compliant and
> non-compliant mail can be spam.  Would you rather we decide all
> standards compliant mail is ham?  That seems rather nonsensical to me.
> 

except that most rules are written by people who looked at some spam
and then derived rules to catch it. These people are in the same
situation as those who write mail software. both should be blamed when
they make wrong decisions.


> The reality, though, is that a greater proportion of non-compliant mail
> is spam than the proportion of compliant mail.
> 

sure. but then we'll need to revisit a lot of rules...

> The notion that "statistics are no excuse for rejecting/tagging mail" is
> even more worrisome.  Statistics are the basis for rejecting/tagging
> mail and any other non-trivial activity in life.
> 

I disagree. that's only true for "neutral" stats. if your stats are
derived from an interpretation of the rules, you're constructing a
filter to enforce these rules. and the fact that such a rule gets a
score from a mass check brings nothing useful for "us".

> 
>> I recently saw an FP with RCVD_ILLEGAL_IP, because of a 127.0.0.80 IP.
>> while this is rarely used, it's not illegal. so the rule is just bogus IMHO.
> 
> 
> 50_scores.cf:score RCVD_ILLEGAL_IP 1.585 0.234 1.813 0.288
> 
> With a set 3 score of 0.288, I'd say this isn't a big hitter anyway.
> 

That's enough to move a ham to the spam zone.
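
To make the arithmetic concrete, here is a small sketch in Python. The 5.0 threshold is SpamAssassin's default required_score and 0.288 is the set 3 score quoted above. The 4.8 starting score and the `is_tagged` helper are made up for illustration, and the `>=` comparison is my assumption about how the threshold is applied.

```python
REQUIRED_SCORE = 5.0          # SpamAssassin's default required_score
RCVD_ILLEGAL_IP_SET3 = 0.288  # set 3 score from the quoted 50_scores.cf line


def is_tagged(total_score: float) -> bool:
    """Assumed tagging decision: score at or above the threshold."""
    return total_score >= REQUIRED_SCORE


# Hypothetical ham that other rules already pushed close to the threshold.
borderline_ham = 4.8

print(is_tagged(borderline_ham))                         # below the threshold
print(is_tagged(borderline_ham + RCVD_ILLEGAL_IP_SET3))  # pushed over it
```

A small score only looks harmless if the mail it hits is far from the threshold.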


>> sub check_for_illegal_ip {
>>   my ($self) = @_;
>>
>>   foreach my $rcvd ( @{$self->{relays_untrusted}} ) {
>>     # (note this might miss some hits if the Received.pm skips any
>>     # invalid IPs)
>>     foreach my $check ( $rcvd->{ip}, $rcvd->{by} ) {
>>       return 1 if ($check =~ /^(?:
>>         (?:[01257]|22[3-9]|23[0-9]|24[0-9]|25[0-5])\.\d+\.\d+\.\d+|
>>         127\.[1-9]\.\d+\.\d+|
>>         127\.0\.[1-9]\.\d+|
>>         127\.0\.0\.(?:\d\d+|[2-9])
>>         )$/x);
>>     }
>>   }
>>   return 0;
>> }
> 

I don't think this code was ever derived from stats. do you?
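
For anyone who wants to see what the pattern actually flags, here is a rough Python transliteration of the quoted regex (re.VERBOSE standing in for Perl's /x). The pattern is copied as-is, not corrected. The `is_flagged` helper and the sample addresses are mine, not part of SpamAssassin.

```python
import re

# Transliteration of the pattern from the quoted check_for_illegal_ip.
ILLEGAL_IP = re.compile(r"""^(?:
    (?:[01257]|22[3-9]|23[0-9]|24[0-9]|25[0-5])\.\d+\.\d+\.\d+|
    127\.[1-9]\.\d+\.\d+|
    127\.0\.[1-9]\.\d+|
    127\.0\.0\.(?:\d\d+|[2-9])
    )$""", re.VERBOSE)


def is_flagged(ip: str) -> bool:
    """True if the quoted pattern would treat this address as 'illegal'."""
    return ILLEGAL_IP.match(ip) is not None


# Sample addresses chosen for illustration only.
for ip in ("127.0.0.1", "127.0.0.80", "224.0.0.1", "192.0.2.1"):
    print(ip, is_flagged(ip))
```

With this transliteration 127.0.0.80 is flagged while 127.0.0.1 is not, which is the case behind the FP I mentioned.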

> 
> Although, I do think that the check is a little weird.  It's completely
> valid (although not statistically common in the mass-check corpora) to
> have a non-routable IP in anything but that first untrusted relay.
> 

did anybody try to mass check a rule that checks whether the last digit
of an IP is odd or even?

what I mean is that you can run any rule through the classifier and it
will give you good-looking results. you have a finite corpus.
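
A toy simulation of that point, in Python. Everything here is hypothetical: the corpus is 100 "spam" and 100 "ham" messages whose last IP octet is drawn at random, so no rule on that octet carries any real signal. This is not how mass-check works, only a sketch of what happens when you try enough arbitrary rules on a finite sample.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical corpus: last octet of the relay IP, drawn at random for
# both classes, so any difference between them is pure chance.
spam = [random.randrange(256) for _ in range(100)]
ham = [random.randrange(256) for _ in range(100)]


def gap(modulus, remainder):
    """Absolute difference between spam and ham hit rates for the rule
    'last octet % modulus == remainder'."""
    spam_rate = sum(o % modulus == remainder for o in spam) / len(spam)
    ham_rate = sum(o % modulus == remainder for o in ham) / len(ham)
    return abs(spam_rate - ham_rate)


odd_gap = gap(2, 1)  # the "last octet is odd" rule

# Try a family of equally arbitrary rules and keep the best-looking one.
rules = [(m, r) for m in range(2, 11) for r in range(m)]
best_rule = max(rules, key=lambda rule: gap(*rule))
best_gap = gap(*best_rule)

print(f"odd-octet rule gap: {odd_gap:.2f}")
print(f"best of {len(rules)} rules: octet % {best_rule[0]} == {best_rule[1]}, "
      f"gap {best_gap:.2f}")
```

Whichever rule comes out on top, it was picked by someone trying rules, not by the data telling you the rule means anything.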


> 
> The bottom line is, if the statistics generated based on the current
> score generation mass-check submitters don't meet your needs, you're
> always welcome to participate in future score generation mass-checks.

you're missing my point. My point is that some rules should not even be
tested. the 127.0.0.* was an example. the probability that a corpus
would contain this is almost zero, and even then, it would have no
statistical significance. yet, I think it'll appear more in ham than in
spam.

Take it the other way: Did 127.0.0.80 appear in any spam in the corpus
used during mass checks? and if so, did 172.1.2.3 appear? ... etc. yet,
someone decided that the "event" to check was "if the IP is in a set of
rarely used/illegal IPs". The one who decided that isn't the genetic/neural
classifier.


I tried to be clear, but I'm not sure I succeeded.
