Re: I have some bad news

2016-08-25 Thread @lbutlr
On 15 Aug 2016, at 23:22, Marc Perkel  wrote:
> Well, this is kind of hard to say so just going to say it. I have stage 4 
> lung cancer and the probability spectrum is not good. I've been fighting spam 
> for the last 15 years and I'd like to keep fighting spam from the grave. So 
> I'm willing to share my technology with anyone interested.

I encourage you to concentrate on fighting cancer right now. While the 
prognosis for stage-4 anything is not good, it is not certain either. It appears 
that attitude does help, so pump yourself up to beat it.



Re: Matching infinite sets

2016-08-25 Thread Ted Mittelstaedt



On 8/22/2016 11:40 AM, Matus UHLAR - fantomas wrote:

On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel  wrote:

The ones that are the same are of no interest. Only where it matches
one side and not the other.



On 08/22/16 09:06, Dianne Skoll wrote:

But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to
4 words,
and that's why you get good results. Multiword Bayes is just as good,
and I know that from experience.


On 22.08.16 10:44, Marc Perkel wrote:

This is nothing like bayes. Bayes is creating a mental block.


This is just like bayes.
There are (only) a few differences between what you describe and bayes as
implemented in SA, but it's still bayes-based.


When I describe it to people who don't know bayes they immediately get
it. If I describe it to people who know bayes - they confuse it. Bayes
is a probability spectrum based on a frequency match on both sets.
That's not even close to what I'm doing.


Bayes uses probabilities between 0 and 1, while you only accept 0 and 1.
You have just tweaked Bayes, and I'm not even sure it is towards better
detection (I believe it is towards worse).
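
To make the comparison concrete, here is a minimal Python sketch. It is
not Marc's actual code and not SpamAssassin's Bayes implementation. It
assumes whitespace tokenization, 1- to 4-word phrases, and a naive
frequency ratio standing in for the "observed probability"; the function
names are made up for illustration. Under those assumptions, keeping
only phrases seen on one side yields the same tokens as computing the
ratio and throwing out everything that is not exactly 0 or 1.

```python
from collections import Counter


def ngrams(text, max_len=4):
    """Return the set of 1- to max_len-word phrases in text (assumed tokenizer)."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def one_sided_tokens(ham_msgs, spam_msgs, max_len=4):
    """Set-difference view: phrases seen only in ham, and only in spam."""
    ham = set().union(*(ngrams(m, max_len) for m in ham_msgs))
    spam = set().union(*(ngrams(m, max_len) for m in spam_msgs))
    return ham - spam, spam - ham


def spam_ratio(ham_msgs, spam_msgs, max_len=4):
    """Frequency view: naive spam ratio per phrase (not SA's real formula)."""
    ham_counts = Counter(t for m in ham_msgs for t in ngrams(m, max_len))
    spam_counts = Counter(t for m in spam_msgs for t in ngrams(m, max_len))
    tokens = set(ham_counts) | set(spam_counts)
    return {t: spam_counts[t] / (spam_counts[t] + ham_counts[t]) for t in tokens}


if __name__ == "__main__":
    ham_msgs = ["see you at lunch tomorrow"]
    spam_msgs = ["cheap pills at discount"]
    ham_only, spam_only = one_sided_tokens(ham_msgs, spam_msgs)
    ratios = spam_ratio(ham_msgs, spam_msgs)
    # Phrases with ratio exactly 1 or 0 are the one-sided phrases.
    print({t for t, p in ratios.items() if p == 1.0} == spam_only)
    print({t for t, p in ratios.items() if p == 0.0} == ham_only)
```

A phrase such as "at", which appears on both sides, gets an intermediate
ratio and is dropped by the one-sided approach, while a frequency-based
scorer would still use it.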


Also - some of what I'm doing is all combinations, not just
sequential. So it's like a system that writes and scores its own
rules. I just throw data at it and it classifies it.


The main difference from Bayes as implemented in SA is that you make
multiword tokens. This is good, but you aren't even the first one who proposed
or did that. The second main difference is in the point above.


The real magic is the feedback learning. So as it identifies ham it
learns new words and phrases that then match email from other people.
So it learns how normal people speak, it learns how spammers speak,
and it identifies the DIFFERENCES between the two. And it's completely
automated.


This is just the same as SA Bayes with autolearning. However, it will suffer
from the same issues and thus will require learning from other sources, either
manual or other SA rules.
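
A rough Python sketch of such a feedback loop follows. It is an
illustration only: the class name, the single-word tokenizer, the
hit-count score and the threshold are all assumptions, and it is neither
Marc's system nor SA's autolearn logic. It shows why an outside source
is needed: a message is only learned from once the existing one-sided
tokens already classify it with enough margin, so unfamiliar mail
teaches the system nothing and a confident mistake gets reinforced.

```python
def tokens(text):
    """Assumed tokenizer: lowercase single words."""
    return set(text.lower().split())


class OneSidedLearner:
    """Hypothetical learner that scores by tokens seen on one side only."""

    def __init__(self, threshold=2):
        self.ham = set()
        self.spam = set()
        self.threshold = threshold  # margin required before self-training

    def train(self, text, is_spam):
        """Learning from an outside source (manual sorting, other rules)."""
        (self.spam if is_spam else self.ham).update(tokens(text))

    def score(self, text):
        """Spam-only hits minus ham-only hits; positive leans spam."""
        toks = tokens(text)
        spam_hits = len(toks & (self.spam - self.ham))
        ham_hits = len(toks & (self.ham - self.spam))
        return spam_hits - ham_hits

    def autolearn(self, text):
        """Self-train only on confident verdicts; return the verdict or None."""
        s = self.score(text)
        if s >= self.threshold:
            self.train(text, is_spam=True)
            return "spam"
        if s <= -self.threshold:
            self.train(text, is_spam=False)
            return "ham"
        return None  # not confident: needs another source to decide


if __name__ == "__main__":
    learner = OneSidedLearner(threshold=2)
    learner.train("cheap pills discount", is_spam=True)      # seeded by hand
    learner.train("lunch meeting tomorrow", is_spam=False)   # seeded by hand
    print(learner.autolearn("cheap pills today"))  # confident, learns "today"
    print(learner.autolearn("hello world"))        # nothing known, no learning
```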



You see, Marc, this has circled around to exactly what I said last week.

The problem I have always had with SA and the Bayes learner is that for 
it to work, it requires sources.   In SA it requires a source of spam to 
build tokens and (I guess) requires a source of ham to remove them.  In 
your system it requires a source of ham to build tokens and (I guess)
requires a source of spam to remove them.

But the fundamental problem with all of these is in getting the sources.

Getting spam is simple.   I merely review my email logs looking for 
spammers sending to non-existent e-mail addresses that have NEVER been 
on my server.  When I see a lot of the same attempts I then create a
honeypot email address using that.   Within a couple months I have
some of the highest quality spam available as spammers communicate the
"discovered" email address to each other.   All automatically done.

But, getting ham is HARD.   You have to convince users to give it to 
you.  And you cannot really trust users to do it without contaminating
their ham stream with spam they were too lazy to delete.   So I end up 
wasting a lot of time cleaning the ham before inputting it into SA.
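
The honeypot selection described above (counting delivery attempts to
addresses that have never existed) could be sketched in Python roughly
as follows. This is an illustration under assumptions, not Ted's actual
procedure: it takes recipient addresses already extracted from the
rejection log (log parsing is mail-server specific and left out), and
the function name and the min_attempts threshold are made up.

```python
from collections import Counter


def honeypot_candidates(rejected_recipients, ever_valid, min_attempts=3):
    """Return rejected addresses attempted often that were never real mailboxes.

    rejected_recipients: iterable of recipient addresses from rejected mail
    ever_valid: addresses that exist now or existed at any time on the server
    min_attempts: assumed cutoff for "a lot of the same attempts"
    """
    counts = Counter(addr.lower() for addr in rejected_recipients)
    known = {addr.lower() for addr in ever_valid}
    return sorted(
        addr for addr, n in counts.items()
        if n >= min_attempts and addr not in known
    )


if __name__ == "__main__":
    rejected = (
        ["sales99@example.com"] * 3
        + ["Info@example.com"] * 5      # was once a real mailbox, so skipped
        + ["typo@example.com"]          # a single attempt, likely a typo
    )
    print(honeypot_candidates(rejected, ever_valid={"info@example.com"}))
```

Excluding anything that was ever a valid mailbox matters here, because
mail to a former real address may still include ham.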


This is why I have said before - and I will repeat it again - that if 
you have found a good way to convince your users to offer up cleaned
ham in an automatic fashion, that would be revolutionary.

It is NOT the back end that matters!!   That is easy.   I can hire
some programmers and math majors who have doctorates in set theory to
build that part of it, and they can probably do it in an afternoon and
then go out for pizza.

It is the front end that is hard.   And it's particularly hard when 
your interface is either IMAP or POP3.   Providing a webinterface that 
forces users to sort ham is somewhat easier, but not all users want a
webinterface.   I personally don't use one myself, so why would I expect my
users to do it?

You have repeatedly put down whatever user interface you have built by
referring to it as crude programming and you don't want to show it. 
But what you don't seem to get is that every scrap of user interface 
code out there is some of the crudest, ugliest, most icky and disgusting
code out there.

Users are people and people DO NOT logically interact with computers. 
They use a combination of sort-of-logic, rubbish they learned from some
other interface, and God-knows-what else to operate software interfaces.
So you can design the most elegant and cleanest interface in the world
with the most elegant code behind it and release it to the world and
God-help-you within 5 years that interface code will be so fugly that
you can only force newbie greenhorn programmers who have no experience 
but are so desperate to work for you that they will do anything, to work
on it.  And eventually not even then, so you scrap it and release 
Windows 8 and start 

Re: I have some bad news

2016-08-25 Thread Ted Mittelstaedt



On 8/19/2016 3:34 AM, Ram wrote:



Marc, that's too bad. But stage 4 lung cancer does not mean you have to
die of it.
And chill about spam. I know you have been great at contributions to
anti-spam ( and we all remember your distinct hate of SPF :-) ).
But antispam is just "commodity" technology.

Probably ML will take over antispam in the future and people would just
subscribe to some good ML antispam service. Running your own antispam is
too much of an attention grabbing task, and no one wants to put in so
much time today



You must not have checked prices on antispam services lately or prices 
on mailboxes.  Just about everyone out there in the web hosting biz 
provides 10-20 free emailboxes (they have to, otherwise small businesses 
would switch to a competitor) and every antispam service out
there charges at least a buck a month per box (they have to, otherwise 
they would go out of business).


In this environment people have no choice but to run their own antispam.

But you are right in that nobody (including the people running it) wants 
to put time into doing it.   Do you -want- to clean your toilet?  Do 
you -have-the-money- to pay someone else to do it?


We are all always looking for better toilet-cleaning brushes.  If Marc
has invented a better one, people will want it!   But they won't go buy
a $200 toilet cleaning brush when they can go to the grocery store and
buy a plastic one for $5 that will last 20 years.

Ted
