Re: How to report 120,000 spams

Aaron Wolfe Sun, 09 Mar 2008 18:24:45 -0700

On Sun, Mar 9, 2008 at 8:53 PM, Tuc at T-B-O-H <[EMAIL PROTECTED]> wrote:
> >
>  > Tuc at T-B-O-H.NET wrote:
>  > >     I guess I'm still not being clear. There are 120K emails a day coming
>  > > to INVALID EMAIL ADDRESSES THAT NEVER EXISTED. It's not a case of a user being
>  > > fickle, it's a case that they are emailing addresses that NEVER EVER ACTUALLY
>  > > EXISTED. About 1 every 3/4 of a second. So running them through ANYTHING is
>  > > counterproductive since, at least in my eyes, if you try to email an email
>  > > address that never existed... IT'S SPAM. It's not things the user ever sees/knows,
>  > > etc. I have in my sendmail virtusertable:
>  > >
>  > > [EMAIL PROTECTED]                   bingo
>  > > [EMAIL PROTECTED]                  bango
>  > > [EMAIL PROTECTED]                   bongo
>  > > [EMAIL PROTECTED]                  irving
>  > > [EMAIL PROTECTED]                               nobody
>  > >
>  > >     The user doesn't even SEE the emails, and processing what they consider
>  > > spam I really don't care about. But getting 120K emails to *@ that are absolutely
>  > > known spam... I would like to help the community out by reporting them to every
>  > > system possible. Yea, if the added benefit is the mail that bingo, bango, bongo
>  > > and irving gets filtered a little better... I won't complain at all.
>  > >
>  > >                     Tuc
>  > >
>  >
>  > Just because mail goes to invalid addresses does not mean it is spam.
>  > People do mistype addresses sometimes, so this "corpus" is not safe.
>  >
>         Yes, I realize people mistype email addresses. But the domain gets
>  121,000 emails on an average day.
>
>         Of those 121,000 emails a day, 120,000 are to email addresses that
>  aren't among the 4 known/valid/acceptable ones. What percentage of the emails
>  sent would you like to assume are mistyped? One out of 1000? That means
>  121 invalid email addresses a day. But the other 999 of 1000 aren't valid...
>
>         Of the other 1000 that ARE to the 4 known/valid/acceptable email
>  addresses, about 900 of them are marked by SA as a spam level over 5.
>  Usually WILDLY over 5, like 20's and 30's.
>
>         Of those 100 delivered, 75 of them are rejected by the spam
>  filter (using a method that violates the standard RFCs according to
>  sendmail) of the "final destination" for all 4 of those email boxes (Yes,
>  bingo, bango, bongo, irving actually all end up forwarded to
>  [EMAIL PROTECTED]).
>
>         Of the 25 that make it through, the user tells me 15 of them are
>  usually spam.
>
>         So, 10 VALID/ACCEPTABLE emails a day out of 121,000 emails received
>  a day... or 8 THOUSANDTHS OF A SINGLE PERCENT.
>
>         So, while I definitely don't think people can type bingo, bango,
>  bongo, irving correctly 100% of the time, with a valid email ratio of 8
>  thousandths of a percent, I don't think in the grand scheme of things
>  that mistyped email addresses really account for much/any.
>
>                         Tuc
>
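
As a quick check of the figures quoted above, here is a small Python sketch that recomputes the ratios. The variable names are illustrative; the daily counts are the ones Tuc gives, taken at face value.

```python
# Daily counts as stated in the quoted message.
total_received = 121_000           # all mail arriving for the domain
to_valid_addresses = 1_000         # mail addressed to the 4 real mailboxes
passed_spamassassin = 100          # scored under 5 by SA
passed_destination_filter = 25     # accepted by the final destination's filter
legitimate = 10                    # what the user says is real mail


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


legit_pct = percent(legitimate, total_received)
invalid_pct = percent(total_received - to_valid_addresses, total_received)

print(f"legitimate mail: {legit_pct:.4f}% of everything received")
print(f"mail to nonexistent addresses: {invalid_pct:.2f}% of everything received")
```

The legitimate share comes out a little over 0.008%, which matches the "8 thousandths of a percent" claim.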


If you are proposing checksums or some other 'message identifying'
technique on the messages, those few mistyped addresses could certainly
make a difference for your site. What if bongo's mom mistypes the
address as bungo, realizes her mistake, and resends the message to
bongo a few minutes later? The valid message will quite likely be
rejected, since it's (almost) identical to the one your proposed
system just marked as spam. What if bongo signs up for a mailing list
and mistypes his own email address (yes, this happens)? Now your
system marks all of that list's mailings as spam, and everyone using
your system starts losing their copies of the mailing list messages too.
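
To make the resend scenario concrete, here is a minimal Python sketch. It assumes a hypothetical checksum scheme (normalize the body, hash it, report the hash for anything sent to an unknown address); `body_checksum`, `receive`, and the shared `reported_as_spam` set are invented for illustration and are not any real reporting service's API.

```python
import hashlib
import re

# The four real mailboxes named in the thread; anything else is "invalid".
VALID_LOCAL_PARTS = {"bingo", "bango", "bongo", "irving"}

# Stand-in for a shared checksum database (hypothetical).
reported_as_spam = set()


def body_checksum(body: str) -> str:
    """Hypothetical fuzzy checksum: lowercase, collapse whitespace, hash."""
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def receive(local_part: str, body: str) -> str:
    """Apply the proposed policy to one incoming message."""
    digest = body_checksum(body)
    if local_part not in VALID_LOCAL_PARTS:
        # Proposed rule: mail to a nonexistent address is spam, so report it.
        reported_as_spam.add(digest)
        return "reported"
    if digest in reported_as_spam:
        return "rejected"
    return "delivered"


note = "Hi honey, dinner is at 6 on Sunday.\nLove, Mom"
first = receive("bungo", note)   # typo in the address
second = receive("bongo", note)  # corrected resend of the same text
print(first, second)
```

The second call is rejected even though it goes to a valid mailbox, because the identical body was already reported after the typo.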

I think you have good intentions but the source of your data is flawed
for anything but maybe limited statistical training.  Unfortunately it
probably is not great for that either, since the mail you are seeing
for nonexistent users is probably not at all similar to the mix of
spam you get to real accounts.  The scanner would end up biased
towards whatever junk the spammers desperate enough to use
dictionaries send, which would drown out the stats from those spams
that are actually difficult to detect.
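
A rough way to see the "drowning out" effect is a simplified per-token spam estimate of the kind used in Graham-style Bayesian filters. This is only a sketch, not SpamAssassin's actual Bayes implementation, and every count except the 120,000 figure from this thread is invented for illustration.

```python
def token_spam_probability(spam_with_token: int, spam_total: int,
                           ham_with_token: int, ham_total: int) -> float:
    """Simplified estimate: how spammy a token looks given training counts."""
    spam_rate = spam_with_token / spam_total
    ham_rate = ham_with_token / ham_total
    return spam_rate / (spam_rate + ham_rate)


# Invented counts: a token seen in 40 of 100 hard-to-detect spams sent to
# real accounts, and in 5 of 100 legitimate messages.
before = token_spam_probability(40, 100, 5, 100)

# Now train on 120,000 dictionary-attack spams that never contain the token.
after = token_spam_probability(40, 100 + 120_000, 5, 100)

print(f"before: {before:.3f}  after: {after:.4f}")
```

With these made-up numbers the token goes from looking strongly spammy to looking almost entirely legitimate, purely because the extra corpus changed the denominator.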

Why do you accept messages for nonexistent accounts?  You're wasting
bandwidth, regardless of what you do or don't do with the junk after
you accept it.  From the sound of it you could reduce your mail
bandwidth to a tiny fraction of what it is now by just refusing this
stuff (which is what most everyone else does, AFAIK).
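
The idea of refusing at SMTP time can be sketched in Python as a recipient check that answers the RCPT TO command. The `example.com` addresses are placeholders (the real ones are redacted in this thread) and `rcpt_response` is a hypothetical helper, not part of sendmail; the reply codes are the standard 250 and 550 with the usual enhanced status codes.

```python
# Placeholder addresses; assumed for illustration only.
VALID_RECIPIENTS = {
    "bingo@example.com",
    "bango@example.com",
    "bongo@example.com",
    "irving@example.com",
}


def rcpt_response(address: str) -> str:
    """Return the SMTP reply to send for one RCPT TO address."""
    if address.strip().lower() in VALID_RECIPIENTS:
        return "250 2.1.5 Recipient OK"
    # Refuse before DATA, so the message body is never transferred.
    return "550 5.1.1 User unknown"


print(rcpt_response("bongo@example.com"))
print(rcpt_response("bungo@example.com"))
```

Because the refusal happens before the DATA command, the server never receives the body of the unwanted message, which is where the bandwidth saving comes from.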

-Aaron
