On 8/22/2016 11:40 AM, Matus UHLAR - fantomas wrote:
On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel wrote:
The ones that are the same are of no interest. Only where it matches
one side and not the other.
On 08/22/16 09:06, Dianne Skoll wrote:
But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.
Also, in your list of tokens, they are all phrases ranging from 1 to
4 words,
and that's why you get good results. Multiword Bayes is just as good,
and I know that from experience.
On 22.08.16 10:44, Marc Perkel wrote:
This is nothing like bayes. Bayes is creating a mental block.
This is just like bayes.
There are (only) a few differences between what you describe and bayes as
implemented in SA, but it's still bayes-based.
When I describe it to people who don't know bayes they immediately get
it. If I describe it to people who know bayes - they confuse it. Bayes
is a probability spectrum based on a frequency match on both sets.
That's not even close to what I'm doing.
Bayes uses probabilities between 0 and 1, while you only accept 0 and 1.
You have just tweaked bayes, and I'm not even sure if towards better
detection (I believe, towards worse).
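To make the comparison concrete, here is a minimal Python sketch of the point being argued: estimate a per-token spam probability from two corpora, then keep only the tokens whose observed probability is exactly 0 or 1. The function names and the simple presence-based estimate are assumptions for illustration; this is not the formula SpamAssassin's Bayes uses, and it is not Marc's actual code.

```python
from collections import Counter


def token_probabilities(spam_msgs, ham_msgs):
    """Estimate P(spam | token) from how many messages on each side contain it."""
    # Count each token once per message (presence, not frequency).
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    probs = {}
    for tok in spam_counts.keys() | ham_counts.keys():
        s = spam_counts[tok] / len(spam_msgs)  # fraction of spam containing tok
        h = ham_counts[tok] / len(ham_msgs)    # fraction of ham containing tok
        probs[tok] = s / (s + h)               # s + h > 0 for any token seen
    return probs


def one_sided_only(probs):
    """Drop every token seen on both sides, keeping only probability 0 or 1."""
    return {t: p for t, p in probs.items() if p in (0.0, 1.0)}


spam = ["buy pills now", "cheap pills"]
ham = ["meeting now", "lunch meeting"]
probs = token_probabilities(spam, ham)
kept = one_sided_only(probs)
```

In this toy data "now" appears on both sides, so the full model gives it a probability of 0.5 while the one-sided filter discards it entirely. That discarded middle ground is the only difference between the two views here.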
Also - some of what I'm doing is all combinations, not just
sequential. So it's like a system that writes and scores its own
rules. I just throw data at it and it classifies it.
The main difference between bayes as implemented in SA is that you make
multiword tokens. This is good, but you aren't even the first one who proposed
or did that. The second main difference is in the point above.
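For readers following along, the two tokenization ideas mentioned in this thread (sequential phrases of 1 to 4 words, and word combinations that need not be adjacent) can be sketched in Python as below. Both function names and the `max_words` parameter are hypothetical, and this is only an illustration of the idea, not anyone's production tokenizer.

```python
from itertools import combinations


def sequential_phrases(text, max_words=4):
    """Return every run of 1..max_words adjacent words as a phrase token."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def word_combinations(text, max_words=2):
    """Return every set of 1..max_words distinct words, adjacent or not."""
    words = sorted(set(text.lower().split()))
    return {
        combo
        for n in range(1, max_words + 1)
        for combo in combinations(words, n)
    }


phrases = sequential_phrases("Buy cheap pills")
combos = word_combinations("Buy cheap pills")
```

The combination form pairs "buy" with "pills" even though they are not adjacent, which the sequential form never does. The cost is that the number of combinations grows much faster with message length than the number of sequential phrases.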
The real magic is the feedback learning. So as it identifies ham it
learns new words and phrases that then match email from other people.
So it learns how normal people speak, it learns how spammers speak,
and it identifies the DIFFERENCES between the two. And it's completely
automated.
This is just the same as SA bayes with autolearning. However, it will suffer
the same issues and thus will require learning by other sources, either
manual or other SA rules.
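A toy Python sketch of such a feedback loop, under loudly stated assumptions: messages are classified by counting one-sided tokens, and the classifier learns from a message only when one side wins by a margin. The function name, the `min_margin` threshold, and the whitespace tokenizer are all hypothetical; SpamAssassin's real autolearn works differently, and this is not Marc's implementation either.

```python
def autolearn(message, ham_tokens, spam_tokens, min_margin=2):
    """Classify a message and, if confident, learn its tokens for that class."""
    tokens = set(message.lower().split())
    # Only tokens seen on exactly one side count as evidence.
    ham_hits = len(tokens & (ham_tokens - spam_tokens))
    spam_hits = len(tokens & (spam_tokens - ham_tokens))
    if ham_hits - spam_hits >= min_margin:
        ham_tokens.update(tokens)   # feedback: new words become ham evidence
        return "ham"
    if spam_hits - ham_hits >= min_margin:
        spam_tokens.update(tokens)  # feedback: new words become spam evidence
        return "spam"
    return "unsure"                 # nothing learned; needs another source


ham_tokens = {"meeting", "lunch", "agenda"}
spam_tokens = {"pills", "cheap", "viagra"}
first = autolearn("lunch meeting tomorrow", ham_tokens, spam_tokens)
second = autolearn("cheap pills tomorrow", ham_tokens, spam_tokens)
third = autolearn("hello there", ham_tokens, spam_tokens)
```

After the first message is learned as ham, "tomorrow" counts as ham evidence and pulls the second message below the confidence margin. Messages that come back as "unsure" are never learned from, which is where manual training or other rules have to step in.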
You see, Marc, this has circled around to exactly what I said last week.
The problem I have always had with SA and the Bayes learner is that for
it to work, it requires sources. In SA it requires a source of spam to
build tokens and (I guess) requires a source of ham to remove them. In
your system it requires a source of ham to build tokens and (I guess)
requires a source of spam to remove them.
But the fundamental problem with all of these is in getting the sources.
Getting spam is simple. I merely review my email logs looking for
spammers sending to non-existent e-mail addresses that have NEVER been
on my server. When I see a lot of the same attempts I then create a
honeypot email address using that. Within a couple months I have
some of the highest quality spam available as spammers communicate the
"discovered" email address to each other. All automatically done.
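The log review described above can be roughly sketched in Python as follows. It assumes the recipient addresses of rejected deliveries have already been pulled out of the mail logs (that step depends on the MTA and is left out), and that a list of every address that has ever existed on the server is available. The function name and the `min_attempts` threshold are hypothetical.

```python
from collections import Counter


def honeypot_candidates(rejected_recipients, known_addresses, min_attempts=3):
    """Return never-existing addresses that spammers keep trying to reach."""
    counts = Counter(addr.lower() for addr in rejected_recipients)
    known = {addr.lower() for addr in known_addresses}
    return sorted(
        addr
        for addr, attempts in counts.items()
        # Skip addresses that ever existed: mail to them may be legitimate.
        if attempts >= min_attempts and addr not in known
    )


rejected = (
    ["sales99@example.com"] * 4
    + ["old.user@example.com"] * 5
    + ["typo@example.com"]
)
ever_existed = ["old.user@example.com"]
candidates = honeypot_candidates(rejected, ever_existed)
```

Excluding addresses that once existed matters because mail to a former user can still be ham, while mail to an address that was never valid is almost certainly spam.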
But, getting ham is HARD. You have to convince users to give it to
you. And you cannot really trust users to do it without contaminating
their ham stream with spam they were too lazy to delete. So I end up
wasting a lot of time cleaning the ham before inputting it into SA.
This is why I have said before - and I will repeat it again - that if
you have found a good way to convince your users to offer up cleaned
ham in an automatic fashion, that would be revolutionary.
It is NOT the back end that matters!! That is easy. I can hire
some programmers and math majors who have doctorates in set theory to
build that part of it, and they can probably do it in an afternoon and
then go out for pizza.
It is the front end that is hard. And it's particularly hard when
your interface is either IMAP or POP3. Providing a web interface that
forces users to sort ham is somewhat easier, but not all users want a
web interface. I personally don't use one myself, so why would I expect my
users to do it?
You have repeatedly put down whatever user interface you have built by
referring to it as crude programming and you don't want to show it.
But what you don't seem to get is that every scrap of user interface
code out there is some of the crudest, ugliest, most icky and disgusting
code out there.
Users are people and people DO NOT logically interact with computers.
They use a combination of sort-of-logic, rubbish they learned from some
other interface, and God-knows-what else to operate software interfaces.
So you can design the most elegant and cleanest interface in the world
with the most elegant code behind it and release it to the world and
God-help-you within 5 years that interface code will be so fugly that
you can only force newbie greenhorn programmers who have no experience
but are so desperate to work for you that they will do anything, to work
on it. And eventually not even then, so you scrap it and release
Windows 8 and start