On Fri, 8 Jun 2018 at 18:14, Stefano Bagnara <mai...@bago.org> wrote:
> On Fri, 8 Jun 2018 at 17:53, Michael Peddemors <mich...@linuxmagic.com> wrote:
> > [...]
> > And while using that as feedback might seem the logical conclusion, in
> > the real world we still see more feedback reports from legitimate email
> > the customer should have wanted, vs emails tagged as spam that are spam.
>
> Well, this is very surprising to me. Anyone else record similar scenario?

Michael, I just noticed that I read your sentence the wrong way. I read it
as "more not-spam reports than spam reports", while I should have read it
as "most of the spam reports received are submitted by mistake".
So my previous reply was written under that misunderstanding.

Now, if most spam reports are submitted by mistake by people who don't
really want to complain about/stop receiving emails from that sender,
then this makes it even more important to be able to collect false
positives (or are you saying that user-originated feedback is not
trustworthy enough to be used for "false positive reports"?).

On Fri, 8 Jun 2018 at 17:53, Michael Peddemors <mich...@linuxmagic.com> wrote:
> IMHO, and the way most of our platforms are designed to work.. Empower
> the users when you can.. but block the worst of the worst..
>
> * Block at SMTP via RBL's that have very low false positive rates

How can you tell an RBL has "very low false positive rates" if most of
the users of that RBL apply it at SMTP time (or if the "not-spam"
reports don't flow back to the RBL operator)? This is a dog chasing its
own tail...

I read "Very low false positive rates" as "Very low
collected-false-positive rates" where we miss a
"collected-false-positive against false-positives rate".

In my small system I don't have a "false positive report" channel, so if
I silently drop 5% of the traffic randomly at the SMTP level I hardly get
any false positive reports... but that doesn't make "randomly dropping 5%
of the traffic" a good/trustworthy block. Most people don't know someone
is trying to write to them until they see the email, and you only get a
false positive report when the sender phones the recipient to ask why
they didn't reply (their next attempt only has a 5% chance of being
blocked again, so a 0.25% chance of blocking 2 sends in a row, and they
won't bother calling me to complain).
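
To put rough numbers on that thought experiment, here is a tiny Python
sketch (the 5% drop rate is the one from the example above; the fraction
of blocked senders who ever follow up out-of-band is a made-up,
illustrative figure):

    # Back-of-the-envelope numbers for the "silently drop 5% at SMTP"
    # thought experiment. The drop rate comes from the example; the share
    # of blocked senders who ever follow up out-of-band is an assumption.
    drop_rate = 0.05        # chance a given delivery attempt is silently dropped
    follow_up_rate = 0.01   # assumed: 1% of blocked senders ever ask "did you get my mail?"

    # Chance of being dropped on both the first send and the retry:
    both_dropped = drop_rate ** 2
    print(f"dropped twice in a row: {both_dropped:.2%}")      # 0.25%

    # On 100,000 legitimate messages the block is plainly terrible,
    # yet almost none of it surfaces as a report the operator can see:
    legit = 100_000
    false_positives = legit * drop_rate                        # 5,000 wrongly dropped
    visible_reports = false_positives * follow_up_rate         # ~50 phone calls

    print(f"actual false positives:    {false_positives:.0f}")
    print(f"reports the operator sees: {visible_reports:.0f}")
    print(f"collected-FP / actual-FP:  {visible_reports / false_positives:.1%}")

The "collected" rate looks tiny even though the real false positive rate
is 5%, which is exactly the missing ratio I mention above.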

Even if you have a good "rate" you could have a bad statistical
distribution (see my B2C vs B2B example at the end of this reply).

I don't want to delegitimize those RBLs, but the "false positive loop"
is one of the key aspects of an automated system, and its importance
increases as the quality of the "spam report loop" decreases. IMHO there
are major holes in the false positive loop today, and this hasn't
improved at all over the years.

How can the quality of the spam report loop be low?
1) you get it from the end user and, just as you told us, most of them
don't use this feature correctly.
2) you automate it with no "volume comparison", so you only evaluate
reports and not "everything else".
3) you work with hashes/fingerprints instead of full messages (it's
harder to manually review a listing/delisting based on something you
can't really "read"; see the toy sketch below).

Why is this "in topic"?
1) IPv6 enable more "scattering" and IP based RBL are less effective
for low-volume senders (the lower the volume, the lower the
statistical significance, the higher the error)
2) If you move to a fingerprint (content) based blacklist and you do
that working mainly with hashes (like CloudMark Authority) you don't
know anymore what you are really going to block, until you start
blocking it.
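
And to put a number on "the lower the volume, the lower the statistical
significance" in point 1: a standard 95% Wilson score interval on an
observed complaint rate widens dramatically as the volume seen per source
address shrinks (the volumes below are made-up, illustrative figures):

    import math

    def wilson_interval(reports: int, messages: int, z: float = 1.96):
        # 95% Wilson score interval for an observed complaint rate: a rough
        # way to see how uncertainty grows as the per-IP volume shrinks.
        p = reports / messages
        denom = 1 + z**2 / messages
        centre = p + z**2 / (2 * messages)
        margin = z * math.sqrt(p * (1 - p) / messages + z**2 / (4 * messages**2))
        return (centre - margin) / denom, (centre + margin) / denom

    # Same 20% observed complaint rate, very different volumes per source:
    for reports, messages in [(2, 10), (200, 1_000), (20_000, 100_000)]:
        lo, hi = wilson_interval(reports, messages)
        print(f"{reports}/{messages}: 95% CI {lo:.1%} - {hi:.1%}")
    # 2/10 gives roughly 6% - 51%, while 20,000/100,000 stays near 19.8% - 20.2%

With IPv6 scattering, more and more senders look like the 2/10 case.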

I'm sure Cloudmark (CA: Cloudmark Authority) has proper "false positive
loops" available, but I'm aware of multi-million-inbox providers using it
only to block emails and to send "spam reports" (no false positive report
loop at all).

To name names: in Italy, "Libero.it" (ItaliaOnLine), the largest Italian
inbox provider (10-15 million inboxes), uses Cloudmark and AFAIK sends
them spam reports, non-spam reports and (I'm less confident about this,
but I think they do) "volume data" about fingerprints, and they never
reject at SMTP time because of this.

On the other side "Aruba" the largest B2B italian inbox provider (7
millions inboxes) uses CloudMark and AFAIK send them spam reports, DO
NOT send "non-spam reports" and on recipient option they may reject
emails (not at SMTP time, but via bounce... but the recipient doesn't
get the email).

So, if you send B2C or mixed B2B/B2C traffic to Italy, Cloudmark works
mostly well (they easily block fingerprints, but then they get false
positive reports and unblock), but if you only send B2B traffic you'll
often hit some Cloudmark block with many false positives that are never
collected by Cloudmark, which can never decide to unlist on its own.

PS: I talk about Cloudmark because I happen to have good knowledge of
that filter, and I think it is one of the "smartest" filters out there
WRT distributed content filtering. I also think it is "100% IPv6 ready",
as its design does not depend on IPs (they have "Sender Intelligence"
for that), but it still shows issues. And you know that for smaller
players everything is harder... the more data flow you see, the "easier"
it is to be accurate.

PS2: Libero and Aruba are not the only Cloudmark Authority users in
Italy. There are other inbox providers using it for millions of inboxes
(Tiscali, Register.it...), but I chose the 2 big ones, and I don't know
how the others handle "reporting" in the same detail. So CA is one of
the major filters in Italy, and this makes the Italian case
"interesting". I expect CA to be less effective in countries with lower
penetration, where IP/domain reputation (or scoring) is probably
currently the major "driver".

Stefano
