[Bug 7953] Inconsistent penalizing of TLD

bugzilla-daemon Mon, 14 Feb 2022 19:00:06 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7953


Bill Cole <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |INVALID

--- Comment #4 from Bill Cole <[email protected]> ---
(In reply to Cian from comment #2)
[...] 
> I understand that 5 is the
> default threshold and I am (according to mail-tester, which I recognize now
> is flawed) below it, but my mail is confirmed to be going to junk.  

That is always a customized local choice. SA has no facility for deciding how
mail is delivered, it only provides a score, a list of matched rules, and a
spam/ham judgment. 
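
To make that split concrete, here is a minimal Python sketch. It assumes a
message carrying SA's usual X-Spam-Status header; the sample header value is
invented for illustration, and local_delivery() is a hypothetical stand-in for
whatever a site's delivery agent actually does with SA's output.

```python
import re

def parse_spam_status(value):
    """Split an X-Spam-Status header value into verdict, score, threshold, rules."""
    # Headers may be folded across lines: collapse whitespace, then rejoin
    # rule names that were split after a comma.
    flat = re.sub(r",\s+", ",", " ".join(value.split()))
    number = r"(-?\d+(?:\.\d+)?)"
    tests = re.search(r"tests=(\S*)", flat)
    names = tests.group(1).split(",") if tests else []
    return {
        "spam": flat.lower().startswith("yes"),
        "score": float(re.search(r"score=" + number, flat).group(1)),
        "required": float(re.search(r"required=" + number, flat).group(1)),
        "rules": [n for n in names if n and n != "none"],
    }

def local_delivery(status, junk_at=None):
    """Hypothetical site policy: SA itself never makes this decision."""
    threshold = status["required"] if junk_at is None else junk_at
    return "junk" if status["score"] >= threshold else "inbox"

# Invented sample header value, not taken from any real message.
sample = ("No, score=4.2 required=5.0 tests=DKIM_VALID,\n"
          "\tPDS_OTHER_BAD_TLD,SPF_PASS autolearn=no version=3.4.6")
status = parse_spam_status(sample)
print(status["rules"])
print(local_delivery(status))               # site keeps SA's threshold
print(local_delivery(status, junk_at=4.0))  # site chose a stricter cutoff
```

The same score lands in the inbox at one site and in junk at another, purely
because of the local cutoff.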

> Is it
> possible that sys-admins at several large organizations are using SA with a
> stricter threshold?  I understand that their choice to use SA in a
> non-recommended way isn't your fault, but it raises the stakes on broad
> rules and makes false positives more likely.

It is POSSIBLE and LIKELY. However, most people who do that also understand
that they need to have a lot of mitigating customizations. Just lowering the
threshold to 4 without rescoring many rules and adding

However, if you are talking about MS or Google or Yahoo or any other REALLY big
mail operations: no. They don't use SA. They all use their own bespoke
proprietary filtering tools. SA really does not fit their operational models.  


> >See https://ruleqa.spamassassin.org for the details of how our rules score 
> >against the manually classified corpora of ham and spam provided by some of 
> >our users. This is an open system and we are always eager to add new 
> >dependable sources to those corpora to get a wider sample. You can see in 
> >that system that the rules you see as problematic match messages that are 
> >97-100% spam
> 
> Thank you for sharing that tool with me, I was not aware of it.  Am I
> understanding correctly that the QA for PDS_OTHER_BAD_TLD is based on 17
> corpuses?  And that those corpora come from the submissions of just 9
> testers?  

Yes. Roughly half a million messages per day. A very small sample, relative to
the actual size of "all email" but not so tiny in the scope of the mail
SpamAssassin actually sees. We have no good way to know how large a footprint
SA actually has in the world, so we also can't say anything about whether the
sample is big enough or diverse enough. There is absolutely a degree of
selection bias because it does take some effort to do the necessary analysis
and reliably submit the data. 

> Is it possible that there are industries not represented in that
> QA?  

It is CERTAIN, but we have no way to know where exactly those gaps are. More
submissions would be great.  

> If none of those 9 testers happens to work within the space technology
> sector, it seems natural that they would not receive much Ham from .space
> domains, even though there is a whole industry where it would be expected to
> receive mail from those domains.

Correct. 

There's a conceptual oddity here. SA is not designed (and cannot be) to be
equally effective and safe for all mail streams with just the default rules &
scores and no local adaptation. The best we can do is to adapt to the mail
streams that SA users actually have, to the extent that they provide feedback.
Submitting masscheck corpora is one form of feedback; bug reports and mail to
the [email protected] mailing list are others.

We DO try to fix 'squeaky wheels' in many circumstances, when given some
evidence that SA (and not something else) is causing specific mail to be
misclassified. The aforementioned mailing list (whose archives are public) is
full of examples where someone presents a problematic rule and something
concrete to show that it's causing a problem, and we work to fix it. 

> I'm not writing this to hassle you, Bill, I'm here because my whole business
> depends on it. 

I understand that, and I'm not trying to be dismissive. There's just not any
great way to address your (real) problem without potentially causing direct
failures of SA to get classifications *of spam* correct for the people who
actually use SA. 

I have been trying for some time to think up ways to do better oversight of
what's in the 'bad TLD' lists so that we can say with more grounding that a
particular one still belongs there. I just have to devise test rules that will
provide better data to figure that out without wreaking havoc. 

> I have done what research I could, followed the directions
> from NameCheap and Zoho when setting up my domain and email, I set up DKIM
> and DMARC and SPF, I have looked through the SpamAssassin wiki.  I could
> take the advice from the SA wiki and get a deliverability consultant, but
> besides the fact that I can't afford it, it seems absurd to pay hundreds of
> dollars to be able to send a few handcrafted emails a day to individual
> recipients. 

Agreed. If you're not sending thousands of messages at a time, a deliverability
consultant is not going to be terribly helpful and definitely won't be
cost-efficient. 

One thing that can help is to keep your mail simple but not cryptically so.
Plain text delivers better. Most 1-1 mail uses a standard format that includes
both a plain text version and an HTML version, duplicating the text content. If
you don't actually need fancy formatting and inline graphics, the HTML part is
not really helpful. Most mail programs (Outlook, Apple Mail, etc) can also send
HTML-only mail, which is a bad idea for more reasons than spam filtering and
can get your mail into many spam folders. Most mail programs (other than
webmail...) can also send mail as just plain text, which is generally more
reliably delivered. You also may find yourself using email as a way to share
files via attachments, which isn't bad per se. However, since a lot of spam is
only an attachment with little or no text, sending a message with (for example)
just a PDF or an image attached without a real text body is risky. Messages
with empty subjects, overlong subjects, emojis or other non-ASCII characters in
the subject, or a large number of recipients can also smell too much like
spam. These basic principles apply to SA, but more importantly they apply to a
broad range of filtering tools. In other words: if GMail or MS365
are your problems, you may be able to simplify your mail out of the spam
folder. 
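
As a rough illustration of the advice above, the following Python sketch
builds a plain-text-only message with the standard library and checks it
against the traits listed. The function names and the two limits
(max_subject_len, max_recipients) are assumptions made up for this example;
they are not SA rules or thresholds, and passing this checklist guarantees
nothing about any particular filter.

```python
from email.message import EmailMessage

def build_plain_message(sender, recipient, subject, body):
    """Compose a text/plain message with no HTML alternative part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def risky_traits(msg, max_subject_len=78, max_recipients=10):
    """Hypothetical checklist; both limits are arbitrary illustrative values."""
    problems = []
    subject = str(msg["Subject"] or "")
    if not subject.strip():
        problems.append("empty subject")
    if len(subject) > max_subject_len:
        problems.append("overlong subject")
    if not subject.isascii():
        problems.append("non-ASCII subject")
    to = msg["To"]
    if to is not None and len(to.addresses) > max_recipients:
        problems.append("many recipients")
    # Look for a text/plain body; HTML-only or attachment-only mail has none.
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        problems.append("no plain text part")
    elif not body.get_content().strip():
        problems.append("empty text body")
    return problems

msg = build_plain_message("me@example.com", "you@example.com",
                          "Meeting notes", "Hi,\n\nNotes are below.\n")
print(msg.get_content_type())
print(risky_traits(msg))
```

Addresses under example.com are placeholders.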

> I could write to the sys-admin of each and every organization I
> want to contact and ask to be whitelisted, but I suspect I know how that
> will go.

You might be surprised. If this is just 1-1 messages and you can enlist eager
recipients of your mail to plead your case with their admins, you may have a
manageable number of sites to get fixed and not a lot of resistance. Mail
admins do not like having customers upset with spam-filtering run amok. It's a
reason people change providers and a reason mail admins get fired. 

> If you have any idea what else I can do, any other lead I can follow, I'll
> go after it and be out of your hair, but I'm here because the only *hint* of
> a reason that might explain why my emails aren't going through is points
> deducted on SA for having a domain that ends in ".space"

The best place for this conversation is the [email protected] mailing list,
particularly if you have a concrete example that you can share of a message
(incl. headers) that someone had to rescue from a spam folder. That list has
mostly helpful people, all of whom have some knowledge of SA specifically and
of spam filtering and deliverability issues more generally. I can almost
guarantee that you'd get better (or at least more) helpful suggestions from
that broader audience than you can via Bugzilla.

-- 
You are receiving this mail because:
You are the assignee for the bug.
