On Fri, 8 Jun 2018 at 18:14, Stefano Bagnara <mai...@bago.org> wrote:
> On Fri, 8 Jun 2018 at 17:53, Michael Peddemors <mich...@linuxmagic.com> wrote:
> > [...]
> > And while using that as feedback might seem the logical conclusion, in
> > the real world we still see more feedback reports from legitimate email
> > the customer should have wanted, vs emails tagged as spam that are spam.
>
> Well, this is very surprising to me. Anyone else record similar scenario?
Michael, I just noticed that I read your sentence the wrong way: I read "more not-spam reports than spam reports" while I should have read "most of the received spam reports are submitted by mistake". So my previous reply was written under that misunderstanding.

Now, if most spam reports are submitted by mistake, by people who don't really want to complain about or stop receiving email from that sender, then it becomes even more important to be able to collect false positives (or are you telling us that user-originated feedback is not trustworthy enough to be used for "false positive reports"?).

On Fri, 8 Jun 2018 at 17:53, Michael Peddemors <mich...@linuxmagic.com> wrote:
> IMHO, and the way most of our platforms are designed to work.. Empower
> the users when you can.. but block the worst of the worst..
>
> * Block at SMTP via RBL's that have very low false positive rates

How can you tell an RBL has "very low false positive rates" if most of the users of that RBL apply it at SMTP time (or if the "not-spam" reports don't flow back to the RBL operator)? This is a dog chasing its own tail... I read "very low false positive rates" as "very low collected-false-positive rates": what we are missing is the ratio of collected false positives to actual false positives.

My small system has no "false positive report" mechanism, so if I silently dropped 5% of the traffic randomly at the SMTP level I would hardly get any false positive reports... and that would not make "randomly dropping 5% of the traffic" a good/trustworthy block. Most people don't know someone is trying to write to them until they see the email, so you learn about a false positive only when the sender phones the recipient to ask why he didn't reply. And the sender's next attempt has only a 5% chance of being blocked again, so only a 0.25% chance of two sends in a row being blocked, which means they will almost never bother to call me and complain (I sketch this effect in the first example below).

Even if you have a good overall "rate", you could have a bad statistical distribution (see my B2C vs B2B example at the end of this reply).

I don't want to delegitimize those RBLs, but the "false positive loop" is one of the key aspects of an automated system, and its importance increases as the quality of the "spam report loop" decreases. IMHO there are major holes in the false positive loop today, and this hasn't improved at all over the years.

How can the quality of the spam report loop be low?
1) You get it from the final user and, just as you told us, most of them don't use the feature correctly.
2) You automate it with no "volume comparison", so you only evaluate the reports and not everything else that was delivered without complaint.
3) You work with hashes/fingerprints instead of full messages (it's harder to manually review listings/delistings for something you can't actually "read").

Why is this on topic?
1) IPv6 enables more "scattering", and IP-based RBLs are less effective for low-volume senders: the lower the volume, the lower the statistical significance and the higher the error (see the second sketch below).
2) If you move to a fingerprint (content) based blacklist and you do that working mainly with hashes (like Cloudmark Authority), you no longer know what you are really going to block until you start blocking it.

I'm sure Cloudmark (CA: Cloudmark Authority) has major "false positive loops", but I'm aware of multimillion-inbox providers using it only to block emails and to send "spam reports" (no false positive report loop).
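Here is the first sketch, for that 5%-random-drop scenario (Python; all numbers are made up, and the rule that a sender complains only after two consecutive drops is my assumption, taken from the phone-call scenario above):

import random

random.seed(42)

BLOCK_RATE = 0.05      # the filter silently drops 5% of legitimate mail
ATTEMPTS = 100_000     # legitimate first-attempt messages

blocked = 0            # actual false positives (first attempt dropped)
complaints = 0         # false positives we actually hear about

for _ in range(ATTEMPTS):
    if random.random() < BLOCK_RATE:      # first send is silently dropped
        blocked += 1
        if random.random() < BLOCK_RATE:  # the retry is dropped too...
            complaints += 1               # ...so the sender finally complains

print(f"actual false positive rate:    {blocked / ATTEMPTS:.2%}")
print(f"collected false positive rate: {complaints / ATTEMPTS:.2%}")

The block really damages ~5% of legitimate mail, but only ~0.25% of it ever surfaces as a report: judging the block by collected reports makes it look twenty times safer than it is.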
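And here is the second sketch, for the low-volume point in 1) above: the same observed 10% report rate supports very different conclusions at different volumes (Wilson 95% score interval; the volumes and the 10% rate are illustrative, not real RBL data):

import math

def wilson_interval(reports: int, total: int, z: float = 1.96):
    """95% Wilson score interval for a spam-report proportion."""
    p = reports / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return max(0.0, center - margin), min(1.0, center + margin)

for total in (10, 100, 10_000):
    reports = total // 10  # the same observed 10% report rate at every volume
    low, high = wilson_interval(reports, total)
    print(f"{reports:>5} reports / {total:>6} msgs -> spam rate in [{low:.1%}, {high:.1%}]")

With IPv6 "scattering" the same mail stream spreads over many addresses, so each single address ends up in the low-volume rows, where the interval is far too wide to tell a spammer from noise.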
To name names: in Italy, "Libero.it" (ItaliaOnLine), the largest Italian inbox provider (10-15 million inboxes), uses Cloudmark and AFAIK sends them spam reports, non-spam reports and (I'm less confident about this, but I think they do) "volume data" about fingerprints, and they never reject at SMTP time because of this. On the other side, "Aruba", the largest B2B Italian inbox provider (7 million inboxes), uses Cloudmark and AFAIK sends them spam reports but does NOT send "non-spam reports", and, as a per-recipient option, may reject emails (not at SMTP time, but via bounce... either way the recipient doesn't get the email).

So, if you send B2C or mixed B2B/B2C traffic to Italy, Cloudmark works mostly well (they readily block fingerprints, but then they get false positive reports and unblock), while if you only send B2B traffic you'll often hit some Cloudmark block with many false positives that are never collected by Cloudmark, which therefore can never decide to unlist on its own.

PS: I talk about Cloudmark because I happen to have good knowledge of that filter and I think it is one of the "smartest" filters out there WRT distributed content filtering. I also think it is "100% IPv6 ready", as its design does not depend on IPs (they have "Sender Intelligence" for that), yet it still shows these issues. And you know that for smaller players everything is harder: the more data flow you see, the easier it is to be accurate.

PS2: Libero and Aruba are not the only Cloudmark Authority users in Italy. There are other inbox providers using it for millions of inboxes (Tiscali, Register.it...), but I chose the two big ones, and I don't know how the others handle "reporting" in the same detail. So CA is one of the major filters in Italy, and this makes the Italian case "interesting". I expect CA to be less effective in countries with lower penetration, where IP/domain reputation (or scoring) are probably still the major "drivers".

Stefano