Re: spamassassin and caching nameservers

2016-08-22 Thread Bill Cole

On 22 Aug 2016, at 21:15, Alex wrote:


Is it a full-fledged nameserver, suitable enough for MX, A, TXT,
queries, etc for this purpose?


Nope. rbldnsd is only an authoritative server and does not do any 
resolution via other servers or caching of records for which it is not 
authoritative.


If you have some solid reason to believe that the problem is BIND (which 
seems unlikely to me...) it might be a good idea to analyze exactly what 
the mechanics of the problem are and pick alternative software which is 
designed to be a caching recursive resolver AND which won't have exactly 
the same problem(s).


One example of a pure caching recursive resolver is Unbound. It might 
meet your needs and it definitely is simpler to configure than BIND 
because it (like rbldnsd) is focused on a narrow subset of the broad 
range of functions done by things we call nameservers, whereas BIND is 
designed to do anything anyone might reasonably want a nameserver to do. 
I run mail systems using a mix of Unbound and BIND for local caching, 
and I can't see any reason to believe that BIND performs objectively 
worse than Unbound in that role. One nice thing about BIND is that you 
can make it log profusely so that you can figure out where the delays are 
in doing a particular query. My bet would be that what you're seeing 
isn't BIND being slow, but rather an external issue. Two common problems:


1. Live IPv6 interfaces on a machine with no or very poor IPv6 
connectivity. In this circumstance, BIND running without the "-4" option 
(or Unbound with its default "do-ip6: yes") will sometimes try to query an 
IPv6 authoritative nameserver for a name, eventually time out and then 
try an IPv4 nameserver. This is particularly pernicious in circumstances 
where you have one hop of IPv6 connectivity but your provider doesn't 
really have robust IPv6 connectivity and so you can't get to some 
places, often intermittently and variably.


2. Many cable providers these days hijack DNS queries by default for 
mostly sleazy but perfectly legal reasons, often justifying the practice 
with hand-waving about security (i.e. saving users from themselves by 
not resolving the names of miscreant domains.) This causes direct 
intentional breakage that often manifests as queries that time out (they 
should 'SERVFAIL' "bad" names but some do not.) It also can cause 
bottlenecks at the provider's hijacking routers and/or DNS servers 
(particularly during peak times) exacerbated by UDP being unreliable by 
design.


Could BIND itself be the culprit? Sure, it COULD, but it's not a good 
default scapegoat for sporadic timeouts in a local caching resolver.
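
One way to check whether sporadic timeouts come from the local resolver or from something further upstream is to time individual queries against each server in turn. The following is a minimal sketch using only the Python standard library; the function names `build_query` and `time_query` are illustrative assumptions, not part of BIND or Unbound, and the sketch sends a plain UDP query without retries.

```python
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet: one question, recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A by default) and QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def time_query(server, name, timeout=5.0, port=53):
    """Return seconds taken by one UDP query, or None on timeout or error."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, port))
            sock.recvfrom(4096)
        except OSError:
            # Covers timeouts as well as unreachable or refused destinations
            return None
        return time.monotonic() - start
```

Calling `time_query("127.0.0.1", "example.com")` and then the same name against the provider's resolver gives a rough hint of where the delay sits. A `None` result means no answer arrived within the timeout, and a repeated query for the same name may be answered from cache, so fresh names give a fairer picture.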


Re: spamassassin and caching nameservers

2016-08-22 Thread Shawn Bakhtiar
Not sure if this helps, but I use BIND DLZ with a MySQL back-end as a DNSBL of 
last resort. We get the IP addresses from honeypot emails, and it works pretty 
well. I have a daemon running in the background that uses a few intermediary 
tables with metrics like last seen, rate, total count, etc. to build the final 
zone table, which Sendmail queries (or SA if you wish).

http://bind-dlz.sourceforge.net/mysql_driver.html

How you populate the backend table is up to you. I’m sure there are lists you 
can download to populate the data, mitigating the need to make the DNS query, 
but I don’t use this as the first line of defense; it is our last line of 
defense before we engage SA.


On Aug 22, 2016, at 7:04 PM, Rob McEwen wrote:

On 8/22/2016 9:15 PM, Alex wrote:
Has anyone configured it as a local caching nameserver, and if so,
could you share your config?

Correct me if I'm wrong... but...

I'm almost positive that rbldnsd acts ONLY as an authoritative name server, and 
not ever as a caching name server. I don't think there is functionality to 
either fetch root hints or to do catch-all forwarding to an upstream DNS server 
for just any host names. Instead, it only serves up the zones that it is 
specifically told to serve at startup, using the physical source data files to 
which those zones point.

It was designed from the ground up only to serve as a dumbed down locally 
hosted DNS, only for serving DNSBLs where the data files are found locally. It 
makes up for the lack of more extensive DNS features with blazing speed and 
very low memory overhead.

--
Rob McEwen




Re: spamassassin and caching nameservers

2016-08-22 Thread Rob McEwen

On 8/22/2016 9:15 PM, Alex wrote:

Has anyone configured it as a local caching nameserver, and if so,
could you share your config?


Correct me if I'm wrong... but...

I'm almost positive that rbldnsd acts ONLY as an authoritative name 
server, and not ever as a caching name server. I don't think there is 
functionality to either fetch root hints or to do catch-all forwarding 
to an upstream DNS server for just any host names. Instead, it only 
serves up the zones that it is specifically told to serve at startup, 
using the physical source data files to which those zones point.


It was designed from the ground up only to serve as a dumbed down 
locally hosted DNS, only for serving DNSBLs where the data files are 
found locally. It makes up for the lack of more extensive DNS features 
with blazing speed and very low memory overhead.
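
For context, a DNSBL client looks up an IPv4 address by reversing its octets and appending the list's zone, and a zone served locally by rbldnsd is queried the same way. A minimal sketch of that name construction follows; the function name `dnsbl_query_name` and the zone `bl.example.org` are made up for illustration.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Return the DNS name a client looks up to test an IPv4 address against a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone.strip('.')}"


# Hypothetical zone name, used only for illustration
name = dnsbl_query_name("192.0.2.99", "bl.example.org")
```

An answer to an A query for that name conventionally means the address is listed; rbldnsd only answers for the zones it was started with, which is why it cannot stand in for a caching resolver.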


--
Rob McEwen



Re: spamassassin and caching nameservers

2016-08-22 Thread Marc Perkel
For what it's worth, I use PowerDNS as a recursive nameserver and am happy 
with it. Very easy to set up.


On 08/22/16 18:15, Alex wrote:

Hi all,
I've just set up spamassassin on a cable connection that appears to
have sporadic DNS timeouts using bind. It shouldn't be so slow that
queries time out, but apparently they do. I'm hoping rbldnsd would
provide that additional responsiveness needed.

I've set up rbldnsd before, to be used as a way to query a local RBL.
Has anyone configured it as a local caching nameserver, and if so,
could you share your config?

I'd like it to listen on localhost/53 in place of bind and I would
think I would need the root zones in there somewhere, but there
don't appear to be many examples of doing this out there to
reference.

Is it a full-fledged nameserver, suitable enough for MX, A, TXT,
queries, etc for this purpose?

Thanks,
Alex




--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



spamassassin and caching nameservers

2016-08-22 Thread Alex
Hi all,
I've just set up spamassassin on a cable connection that appears to
have sporadic DNS timeouts using bind. It shouldn't be so slow that
queries time out, but apparently they do. I'm hoping rbldnsd would
provide that additional responsiveness needed.

I've set up rbldnsd before, to be used as a way to query a local RBL.
Has anyone configured it as a local caching nameserver, and if so,
could you share your config?

I'd like it to listen on localhost/53 in place of bind and I would
think I would need the root zones in there somewhere, but there
don't appear to be many examples of doing this out there to
reference.

Is it a full-fledged nameserver, suitable enough for MX, A, TXT,
queries, etc for this purpose?

Thanks,
Alex


Re: Matching infinite sets

2016-08-22 Thread John Hardin

On Mon, 22 Aug 2016, Matus UHLAR - fantomas wrote:


> On Mon, 22 Aug 2016 09:03:38 -0700
> Marc Perkel  wrote:


The real magic is the feedback learning. So as it identifies ham it learns 
new words and phrases that then match email from other people. So it learns 
how normal people speak, it learns how spammers speak, and it identifies 
the DIFFERENCES between the two. And it's completely automated.


This is just the same as SA Bayes with autolearning. However, it will suffer
the same issues and thus will require learning by other sources, either
manual or other SA rules.


The restriction to probabilities 0 or 1 may mitigate the 
robot-off-the-rails syndrome to a degree.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Politicians never accuse you of "greed" for wanting other people's
  money, only for wanting to keep your own money.-- Joseph Sobran
---
 2 days until the 1937th anniversary of the destruction of Pompeii


Re: Matching infinite sets

2016-08-22 Thread Shawn Bakhtiar

On Aug 22, 2016, at 10:44 AM, Marc Perkel wrote:



On 08/22/16 09:06, Dianne Skoll wrote:
On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel wrote:

The ones that are the same are of no interest. Only where it matches
one side and not the other.
But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
and that's why you get good results.  Multiword Bayes is just as good,
and I know that from experience.



This is nothing like bayes. Bayes is creating a mental block. When I describe 
it to people who don't know bayes they immediately get it. If I describe it to 
people who know bayes - they confuse it. Bayes is a probability spectrum based 
on a frequency match on both sets. That's not even close to what I'm doing.


I think you've copied and pasted this same paragraph half a dozen times now, 
and the list has tried its best to accommodate your statement that "Bayes is 
creating a mental block", asking you pertinent questions that either remained 
unanswered or, when answered, produced conflicting statements, and when 
pressed you ended up showing that what you are doing is (at best) a slightly 
modified version.

However, I find the statement "When I describe it to people who don't know 
bayes they immediately get it" the most telling of them all. Of course people 
who don't know the probability theory will look at what you are doing and go 
"Wow!!! This is great!!" BECAUSE THEY DON'T KNOW.

People who know, obviously, recognize it for what it is, and you can claim as 
much as you like that it's NOT, but at the end of the day, if it looks like a 
rose and smells like a rose (no matter what you call it), 'tis still a rose!

All you have to do is READ the Process section of the following link to see 
exactly how similar your explanation is (save one factor which is using phrases 
vs. words), which has already been explained as a feature of SA using 
multi-word tokens:
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering



Also - some of what I'm doing is all combinations, not just sequential. So it's 
like a system that writes and scores its own rules. I just throw data at it 
and it classifies it.

The real magic is the feedback learning. So as it identifies ham it learns new 
words and phrases that then match email from other people. So it learns how 
normal people speak, it learns how spammers speak, and it identifies the 
DIFFERENCES between the two. And it's completely automated.


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400




Re: Matching infinite sets

2016-08-22 Thread Matus UHLAR - fantomas

On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel  wrote:

The ones that are the same are of no interest. Only where it matches
one side and not the other.



On 08/22/16 09:06, Dianne Skoll wrote:

But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
and that's why you get good results.  Multiword Bayes is just as good,
and I know that from experience.


On 22.08.16 10:44, Marc Perkel wrote:

This is nothing like bayes. Bayes is creating a mental block.


This is just like bayes.
There are (only) a few differences between what you describe and bayes as
implemented in SA, but it's still bayes-based.

When I 
describe it to people who don't know bayes they immediately get it. 
If I describe it to people who know bayes - they confuse it. Bayes is 
a probability spectrum based on a frequency match on both sets. 
That's not even close to what I'm doing.


Bayes uses probabilities between 0 and 1, while you only accept 0 and 1. 


You have just tweaked Bayes, and I'm not even sure it's towards better
detection (I believe towards worse).

Also - some of what I'm doing is all combinations, not just 
sequential. So it's like a system that writes and scores its own 
rules. I just throw data at it and it classifies it.


The main difference from Bayes as implemented in SA is that you make
multi-word tokens.  This is good, but you aren't even the first one who proposed
or did that.  The second main difference is in the point above.

The real magic is the feedback learning. So as it identifies ham it 
learns new words and phrases that then match email from other people. 
So it learns how normal people speak, it learns how spammers speak, 
and it identifies the DIFFERENCES between the two. And it's 
completely automated.


This is just the same as SA Bayes with autolearning. However, it will suffer
the same issues and thus will require learning by other sources, either
manual or other SA rules.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The 3 biggets disasters: Hiroshima 45, Tschernobyl 86, Windows 95


Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 10:44:42 -0700
Marc Perkel  wrote:

> This is nothing like bayes.

It's exactly like Bayes.  You're stumbling across a hacked version of
Bayes.  You seem to lack the mathematical background to see what you're
doing, thinking it's somehow fundamentally different.  But it's not.

> The real magic is the feedback learning.

Which is how Bayes works.

> So as it identifies ham it learns new words and phrases that then
> match email from other people.

Which is what Bayes does.

> So it learns how normal people speak, it learns how spammers speak,
> and it identifies the DIFFERENCES between the two. And it's
> completely automated.

You've just described Bayes.  Paul Graham used almost that exact language
14 years ago in his classic paper, http://www.paulgraham.com/spam.html
Check out this paragraph:

I'm more hopeful about Bayesian filters, because they evolve with the
spam. So as spammers start using "c0ck" instead of "cock" to evade
simple-minded spam filters based on individual words, Bayesian filters
automatically notice. Indeed, "c0ck" is far more damning evidence than
"cock", and Bayesian filters know precisely how much more.


Regards,

Dianne.



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 09:06, Dianne Skoll wrote:

On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel  wrote:


The ones that are the same are of no interest. Only where it matches
one side and not the other.

But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
and that's why you get good results.  Multiword Bayes is just as good,
and I know that from experience.




This is nothing like bayes. Bayes is creating a mental block. When I 
describe it to people who don't know bayes they immediately get it. If I 
describe it to people who know bayes - they confuse it. Bayes is a 
probability spectrum based on a frequency match on both sets. That's not 
even close to what I'm doing.


Also - some of what I'm doing is all combinations, not just sequential. 
So it's like a system that writes and scores its own rules. I just 
throw data at it and it classifies it.


The real magic is the feedback learning. So as it identifies ham it 
learns new words and phrases that then match email from other people. So 
it learns how normal people speak, it learns how spammers speak, and it 
identifies the DIFFERENCES between the two. And it's completely automated.



--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 09:06:08 -0700
Marc Perkel  wrote:

> Hi Dianne, what your missing are word combinations. Usually it's not
> a single word but a combination of words that trigger a result.

[snip]

So that's Bayes with multi-word tokens, throwing out tokens whose
probability is neither 0 nor 1.
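
To make that equivalence concrete, here is a minimal sketch. It assumes the simplest estimate of a token's spam probability (spam occurrences divided by total occurrences), which is cruder than what SA's Bayes actually computes; the function names and sample tokens are illustrative only. Keeping only the tokens whose observed probability is exactly 0 or 1 yields the same sets as taking the two set differences.

```python
from collections import Counter


def token_probabilities(spam_msgs, ham_msgs):
    """Observed P(spam | token), counting each token once per message."""
    spam_counts = Counter(t for msg in spam_msgs for t in set(msg))
    ham_counts = Counter(t for msg in ham_msgs for t in set(msg))
    probs = {}
    for token in spam_counts.keys() | ham_counts.keys():
        s, h = spam_counts[token], ham_counts[token]
        probs[token] = s / (s + h)
    return probs


def extreme_tokens(probs):
    """Keep only tokens whose observed probability is exactly 1 or 0."""
    spam_only = {t for t, p in probs.items() if p == 1.0}
    ham_only = {t for t, p in probs.items() if p == 0.0}
    return spam_only, ham_only


# Tiny made-up corpora; each message is a list of multi-word tokens
spam_msgs = [["cheapest viagra online", "viagra"], ["viagra", "online"]]
ham_msgs = [["of the lung", "viagra"], ["online", "the lung"]]

spam_only, ham_only = extreme_tokens(token_probabilities(spam_msgs, ham_msgs))

# The same two sets, obtained as plain set differences
spam_set = {t for msg in spam_msgs for t in msg}
ham_set = {t for msg in ham_msgs for t in msg}
```

Here `spam_only` equals `spam_set - ham_set` and `ham_only` equals `ham_set - spam_set`. Normalizing the counts by corpus size, as a fuller Bayes implementation would, changes the intermediate probabilities but not which tokens sit at exactly 0 or 1.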

Regards,

Dianne.


Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 08:58, RW wrote:

On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel wrote:


On 08/22/16 07:28, Dianne Skoll wrote:

The other two possibilities (no tokens in either or some tokens in
both) are undecidable.

Exactly!

In the past you've said that when there are tokens in both you compare
the counts.


I do a very little bit of that. I make additional sets I call nearly-ham 
and nearly-spam where the ratio is very high, and count it as a half score.
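
A rough sketch of how such a half score might be assigned per token is below. The post only says the ratio is "very high", so the threshold `NEAR_RATIO`, the function name `token_weight`, and the sign convention (positive for spam, negative for ham) are all assumptions made for illustration.

```python
NEAR_RATIO = 20  # hypothetical threshold; the post does not give a number


def token_weight(spam_count, ham_count, near_ratio=NEAR_RATIO):
    """Weight of one token: +1/-1 if seen on one side only, +0.5/-0.5 if nearly so."""
    if spam_count == 0 and ham_count == 0:
        return 0.0   # never seen: no evidence
    if ham_count == 0:
        return 1.0   # spam-only token
    if spam_count == 0:
        return -1.0  # ham-only token
    if spam_count >= near_ratio * ham_count:
        return 0.5   # "nearly-spam": overwhelmingly seen in spam
    if ham_count >= near_ratio * spam_count:
        return -0.5  # "nearly-ham": overwhelmingly seen in ham
    return 0.0       # seen substantially on both sides: ignored
```

Under these assumptions a token seen 40 times in spam and once in ham contributes 0.5, while one seen 3 times in spam and twice in ham contributes nothing.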


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread Christian Grunfeld
What you are trying to do is to identify a source of messages by its
entropy, supposing the entropy of a ham source is distinguishable from
that of a spam one...

2016-08-22 13:48 GMT-03:00 Antony Stone <antony.st...@spamassassin.open.source.it>:

> On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:
>
> > On 08/22/16 07:37, Antony Stone wrote:
> > >
> > > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > > "online" are not tokens?
> >
> > They would all be tokens. Just pointing out one that would match spam
> > and not match ham. "cheapest" and "online" would likely be in both sets
> > and would be ignored.
>
> Hm, that doesn't tie up with your earlier reply:
>
> On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:
>
> > On 08/22/16 07:28, Dianne Skoll wrote:
> > > On Mon, 22 Aug 2016 07:16:41 -0700
> > >
> > > As far as I understand your algorithm, if an email contains at least one
> > > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > > as ham.  And conversely, if it contains at least one spam token but zero
> > > ham tokens, you classify it as spam.
> >
> > YES! YES! YES!
>
> Er, really?  See below.
>
> > Although I look at some thousand "fingerprints" to get a more
> > significant result.
> >
> > > The other two possibilities (no tokens in either or some tokens in both)
> > > are undecidable.
> >
> > Exactly!
>
> So, it's not that "if an email contains at least one token in the 'ham' set
> and zero tokens in the 'spam' set, you classify it as ham".
>
> You in fact ignore any tokens in the email which are in both the 'ham' and
> 'spam' sets, and then - what - work out which set contains more of the
> leftover tokens?
>
>
> Antony.
>
> --
> Pavlov is in the pub enjoying a pint.
> The barman rings for last orders, and Pavlov jumps up exclaiming "Damn!  I
> forgot to feed the dog!"
>
> Please reply to the list;
> please *don't* CC me.
>


Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:

> On 08/22/16 07:37, Antony Stone wrote:
> > 
> > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > "online" are not tokens?
>
> They would all be tokens. Just pointing out one that would match spam
> and not match ham. "cheapest" and "online" would likely be in both sets
> and would be ignored.

Hm, that doesn't tie up with your earlier reply:

On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:
> > On Mon, 22 Aug 2016 07:16:41 -0700
> > 
> > As far as I understand your algorithm, if an email contains at least one
> > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > as ham.  And conversely, if it contains at least one spam token but zero
> > ham tokens, you classify it as spam.
> 
> YES! YES! YES!

Er, really?  See below.

> Although I look at some thousand "fingerprints" to get a more
> significant result.
> 
> > The other two possibilities (no tokens in either or some tokens in both)
> > are undecidable.
> 
> Exactly!

So, it's not that "if an email contains at least one token in the 'ham' set 
and zero tokens in the 'spam' set, you classify it as ham".

You in fact ignore any tokens in the email which are in both the 'ham' and 
'spam' sets, and then - what - work out which set contains more of the 
leftover tokens?


Antony.

-- 
Pavlov is in the pub enjoying a pint.
The barman rings for last orders, and Pavlov jumps up exclaiming "Damn!  I 
forgot to feed the dog!"

   Please reply to the list;
 please *don't* CC me.


Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 09:03:38 -0700
Marc Perkel  wrote:

> The ones that are the same are of no interest. Only where it matches
> one side and not the other.

But... but... that's exactly like Bayes if you throw out tokens whose
observed probability is not 0 or 1.

Also, in your list of tokens, they are all phrases ranging from 1 to 4 words,
and that's why you get good results.  Multiword Bayes is just as good,
and I know that from experience.

Regards,

Dianne.


Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 07:45, Dianne Skoll wrote:

On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel  wrote:


So.  What percentage of emails using your algorithm are actually
decidable?

Almost 100% if you look at a wide variety of tokens from multiple
attributes.

I can't believe that, or I'm missing something.  Almost every spam I see
contains words that also appear in ham.  Things like "this" or "invoice"
or "regards" or "dear".

What am I missing?




Hi Dianne, what you're missing are word combinations. Usually it's not a 
single word but a combination of words that triggers a result.



 Example of how NOT matching works

Let’s take 2 subject lines and see how this works.

“Meet hot Russian Brides Online!”
“I read an article about Russian Brides in a magazine”

A traditional spam filter using Bayesian or hard coded rules about 
“Russian Brides” might determine that only 1 out of 500 emails 
mentioning the phrase “Russian Brides” is a good email. Thus the second 
line would have points assessed against it in the classification process 
using these traditional methods.


Using the Evolution Filter the phrase “Russian Brides” is in both sets 
and therefore has no influence on the results. But the first subject 
matches these phrases in the Spam Only set.


“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”

The second subject matches these phrases in the Ham Only set, which are 
never used in the spam set.


“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”

So even though the phrase “Russian Brides” has no influence, each subject 
hits either ham or spam many times where the same phrase was never used 
in the subject line in the opposite set. And the number of hits is 
significant enough just from these subjects to cause the fingerprints to 
be learned, and that’s just looking at the Subject attribute. When this 
is combined with testing all attributes, the messages usually come out 
strongly on one side or the other.


In rule-based systems one would not normally build a whitelist rule to 
allocate points based on seeing the phrase “read an article about”. 
That’s where the Evolution Filter is different. It didn’t need to have 
that rule because, since it is comparing to the infinite set of what is 
not matched on the other side, it dynamically creates billions of rules 
automatically.
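
The hit counting described above lines up with the formula quoted from Marc elsewhere in this thread: card(Test_message intersect Spam diff Ham) minus card(Test_message intersect Ham diff Spam). A minimal sketch with Python sets follows; the function name `set_score` is illustrative, and the sample sets reuse only a few of the phrases listed above rather than a real learned corpus.

```python
def set_score(message_tokens, spam_tokens, ham_tokens):
    """Spam-only hits minus ham-only hits; positive leans spam, negative leans ham."""
    test = set(message_tokens)
    spam_tokens, ham_tokens = set(spam_tokens), set(ham_tokens)
    spam_only_hits = (test & spam_tokens) - ham_tokens
    ham_only_hits = (test & ham_tokens) - spam_tokens
    return len(spam_only_hits) - len(ham_only_hits)


# Sample learned sets built from phrases in the example above
spam_set = {"Russian Brides", "Meet hot", "Meet hot Russian", "Brides Online!"}
ham_set = {"Russian Brides", "read an article", "in a magazine", "about Russian"}

first_subject = {"Meet hot", "Meet hot Russian", "Russian Brides", "Brides Online!"}
second_subject = {"read an article", "about Russian", "Russian Brides", "in a magazine"}

first_score = set_score(first_subject, spam_set, ham_set)
second_score = set_score(second_subject, spam_set, ham_set)
```

With these sets the first subject scores 3 and the second scores -3, while a message containing only "Russian Brides" scores 0, which corresponds to the undecidable case discussed in the thread.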







--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 07:40, Antony Stone wrote:

On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:


On 08/22/16 07:28, Dianne Skoll wrote:


What percentage of emails using your algorithm are actually
decidable?

Almost 100% if you look at a wide variety of tokens from multiple
attributes. Subject, body, content flags, header structure, combinations
of all domains reference, php scripts, name part of from addresses,
behavior flags.

I would have said that a very large number of the words used in spam mails are
the same as the words used in ham mails, so I suspect I'm confused about what
constitutes a "token".


The ones that are the same are of no interest. Only where it matches one 
side and not the other.




I fail to see how the "name part of from addresses" are unlikely to match ham,
for example, since I see quite a lot of spam apparently from myself.


Antony.



Some spammers have Viagra in the name part. The name part is very 
spammy. I also store to and from email addresses so that relationships 
between people corresponding create a ham result. (I filter outbound as 
well for some people)


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread Shawn Bakhtiar

> On Aug 22, 2016, at 8:09 AM, John Hardin wrote:
> 
> On Mon, 22 Aug 2016, Antony Stone wrote:
> 
>> On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:
>> 
>>> On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:
>>>>> So.  What percentage of emails using your algorithm are actually
>>>>> decidable?
>>>> 
>>>> Almost 100% if you look at a wide variety of tokens from multiple
>>>> attributes.
>>> 
>>> I can't believe that, or I'm missing something.  Almost every spam I see
>>> contains words that also appear in ham.  Things like "this" or "invoice"
>>> or "regards" or "dear".
>>> 
>>> What am I missing?
>> 
>> I believe you're missing Marc's definition of "token".
> 
> ...and it looks like we're venturing into the "SA Bayes multiple-word token 
> support" realm (as a surrogate).
> 

Even with the multiple tokens combined into one fingerprint, you've changed 
little. No matter how you bound the token, the assumption that there are no 
SPAM emails that contain HAM content, and vice versa, is false. 

Regardless, that is NOT what you claimed before; you seem to be flip-flopping 
between definitions to suit your argument.


> -- 
> John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
> jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>  USMC Rules of Gunfighting #6: If you can choose what to bring to a
>  gunfight, bring a long gun and a friend with a long gun.
> ---
> 2 days until the 1937th anniversary of the destruction of Pompeii



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 07:37, Antony Stone wrote:

On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:


OK - Trying to make this really simple. Just talking about concept now.

Let's say I get an email where the subject is "I have aednocarsonoma of
the lung".

Right off you know it's ham because spammers never use the word
"aednocarsonoma" and normal people do. Spammers also never use:

"of the lung"
"the lung"
"aednocarsonoma of"

How do you create those boundaries to define the tokens?


Here's an example:

"the quick brown fox jumps over the lazy dog"

becomes ...

"the"
"quick" "the quick"
"brown" "quick brown" "the quick brown"
"fox" "brown fox" "quick brown fox" "the quick brown fox"
"jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps"
"over" "jumps over" "fox jumps over" "brown fox jumps over"
"the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy"
"dog" "lazy dog" "the lazy dog" "over the lazy dog"











So - tell me you follow this so far. Spammers don't spam about
aednocarsonoma.

In this case I'm identifying ham because in some previous email people
were talking about lung cancer and those phrases were learned as ham.
But what makes it really ham is not just that it matches previous ham,
but it doesn't match previous spam.

A word like Viagra for example would produce no score because it is in
both sets. However "cheapest viagra online" would match spam and not
match ham indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and
"online" are not tokens?




They would all be tokens. Just pointing out one that would match spam 
and not match ham. "cheapest" and "online" would likely be in both sets 
and would be ignored.
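
The expansion shown in the "quick brown fox" example above can be sketched in a few lines. This assumes whitespace-separated words and a maximum phrase length of four words, which is inferred from the example rather than stated by Marc; the function name `phrase_tokens` is illustrative.

```python
def phrase_tokens(text, max_words=4):
    """Return every run of 1..max_words consecutive words, grouped by ending word."""
    words = text.split()
    tokens = []
    for end in range(len(words)):
        for n in range(1, max_words + 1):
            start = end - n + 1
            if start < 0:
                break  # not enough preceding words for a longer phrase
            tokens.append(" ".join(words[start:end + 1]))
    return tokens


tokens = phrase_tokens("the quick brown fox jumps over the lazy dog")
```

The result lists tokens in the same order as the example: for each word, the word itself, then the two-, three- and four-word phrases ending on it. Duplicates such as the second "the" are kept in the list; putting the tokens into a set removes them.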


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread RW
On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:

> > The other two possibilities (no tokens in either or some tokens in
> > both) are undecidable.  
> 
> Exactly!

In the past you've said that when there are tokens in both you compare
the counts.


On Wed, 17 Aug 2016 11:02:38 -0700
Marc Perkel wrote:

>  Here's the actual formula.
> 
> card(Test_message intersect Spam diff Ham) minus card(Test_message
> intersect Ham diff Spam)
> 


On Wed, 20 Jan 2016 08:52:05 -0800
Marc Perkel wrote:

> Then you do a set
> diff both ways (ham - spam) (spam - ham) and whichever side is bigger
> wins. Generally it will match on only one side or very predominately
> on one side.


Re: Matching infinite sets

2016-08-22 Thread John Hardin

On Mon, 22 Aug 2016, Antony Stone wrote:


On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:


On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:

So.  What percentage of emails using your algorithm are actually
decidable?


Almost 100% if you look at a wide variety of tokens from multiple
attributes.


I can't believe that, or I'm missing something.  Almost every spam I see
contains words that also appear in ham.  Things like "this" or "invoice"
or "regards" or "dear".

What am I missing?


I believe you're missing Marc's definition of "token".


...and it looks like we're venturing into the "SA Bayes multiple-word 
token support" realm (as a surrogate).


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #6: If you can choose what to bring to a
  gunfight, bring a long gun and a friend with a long gun.
---
 2 days until the 1937th anniversary of the destruction of Pompeii


Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 16:45:09, Dianne Skoll wrote:

> On Mon, 22 Aug 2016 07:34:00 -0700 Marc Perkel wrote:
> > > So.  What percentage of emails using your algorithm are actually
> > > decidable?
> > 
> > Almost 100% if you look at a wide variety of tokens from multiple
> > attributes.
> 
> I can't believe that, or I'm missing something.  Almost every spam I see
> contains words that also appear in ham.  Things like "this" or "invoice"
> or "regards" or "dear".
> 
> What am I missing?

I believe you're missing Marc's definition of "token".


Antony.

-- 
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

   Please reply to the list;
 please *don't* CC me.


Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel  wrote:

> > So.  What percentage of emails using your algorithm are actually
> > decidable?

> Almost 100% if you look at a wide variety of tokens from multiple 
> attributes.

I can't believe that, or I'm missing something.  Almost every spam I see
contains words that also appear in ham.  Things like "this" or "invoice"
or "regards" or "dear".

What am I missing?

Regards,

Dianne.


Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:
> 
> > What percentage of emails using your algorithm are actually
> > decidable?
> 
> Almost 100% if you look at a wide variety of tokens from multiple
> attributes. Subject, body, content flags, header structure, combinations
> of all domains referenced, php scripts, name part of from addresses,
> behavior flags.

I would have said that a very large number of the words used in spam mails are 
the same as the words used in ham mails, so I suspect I'm confused about what 
constitutes a "token".

I fail to see how the "name part of from addresses" is unlikely to match ham, 
for example, since I see quite a lot of spam apparently from myself.


Antony.

-- 
Never automate fully anything that does not have a manual override capability. 
Never design anything that cannot work under degraded conditions in emergency.



Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 16:34:09, Marc Perkel wrote:

> OK - Trying to make this really simple. Just talking about concept now.
> 
> Let's say I get an email where the subject is "I have adenocarcinoma of
> the lung".
> 
> Right off you know it's ham because spammers never use the word
> "adenocarcinoma" and normal people do. Spammers also never use:
> 
> "of the lung"
> "the lung"
> "adenocarcinoma of"

How do you create those boundaries to define the tokens?

> 
> 
> So - tell me you follow this so far. Spammers don't spam about
> adenocarcinoma.
> 
> In this case I'm identifying ham because in some previous email people
> were talking about lung cancer and those phrases were learned as ham.
> But what makes it really ham is not just that it matches previous ham,
> but it doesn't match previous spam.
> 
> A word like Viagra for example would produce no score because it is in
> both sets. However "cheapest viagra online" would match spam and not
> match ham indicating it's spam.

So what makes "cheapest Viagra online" a token, such that "cheapest" and 
"online" are not tokens?


Antony.

-- 
The words "e pluribus unum" on the Great Seal of the United States are from a 
poem by Virgil entitled "Moretum", which is about cheese and garlic salad 
dressing.



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel

OK - Trying to make this really simple. Just talking about concept now.

Let's say I get an email where the subject is "I have adenocarcinoma of 
the lung".


Right off you know it's ham because spammers never use the word 
"adenocarcinoma" and normal people do. Spammers also never use:


"of the lung"
"the lung"
"adenocarcinoma of"


So - tell me you follow this so far. Spammers don't spam about 
adenocarcinoma.


In this case I'm identifying ham because in some previous email people 
were talking about lung cancer and those phrases were learned as ham. 
But what makes it really ham is not just that it matches previous ham, 
but it doesn't match previous spam.


A word like Viagra for example would produce no score because it is in 
both sets. However "cheapest viagra online" would match spam and not 
match ham indicating it's spam.


The magic here is that this detects both spam and ham. And it is 
especially good at detecting ham, which greatly reduces false positives.
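
How the multi-word tokens get their boundaries is not spelled out in the
message above. One plausible reading, offered purely as an assumption, is
that every run of consecutive words up to some length is a token, so
"cheapest", "online" and "cheapest viagra online" all exist side by side
and only the ones seen exclusively in spam or in ham carry weight. A
minimal Python sketch (the name word_ngrams and the three-word limit are
hypothetical, not taken from the thread):

```python
def word_ngrams(text, max_n=3):
    """Return every run of 1..max_n consecutive words in text as a token."""
    words = text.lower().split()
    tokens = set()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            tokens.add(" ".join(words[start:start + n]))
    return tokens


tokens = word_ngrams("cheapest viagra online")
print(sorted(tokens))
```

Under this reading the single word "viagra" and the phrase "cheapest
viagra online" are separate tokens, so the first can sit in both sets
while the second sits only in spam.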




Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 07:28, Dianne Skoll wrote:

> On Mon, 22 Aug 2016 07:16:41 -0700
> Marc Perkel  wrote:
>
> > Antony, Yes - I don't store Set B. I store Set A. B is defined by
> > what's NOT in A. So I test A and if it's not matched it's set B. Set
> > B is just a negative match on A.
>
> Let me ask you a question.  As far as I understand your algorithm, if
> an email contains at least one token in the "ham" set and zero tokens in
> the "spam" set, you classify it as ham.  And conversely, if it contains
> at least one spam token but zero ham tokens, you classify it as spam.

YES! YES! YES!

Although I look at some thousand "fingerprints" to get a more 
significant result.

> The other two possibilities (no tokens in either or some tokens in both)
> are undecidable.

Exactly!

> So.  What percentage of emails using your algorithm are actually decidable?

Almost 100% if you look at a wide variety of tokens from multiple 
attributes. Subject, body, content flags, header structure, combinations 
of all domains referenced, php scripts, name part of from addresses, 
behavior flags.

> Regards,
>
> Dianne.





--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 07:16:41 -0700
Marc Perkel  wrote:

> Antony, Yes - I don't store Set B. I store Set A. B is defined by 
> what's NOT in A. So I test A and if it's not matched it's set B. Set
> B is just a negative match on A.

Let me ask you a question.  As far as I understand your algorithm, if
an email contains at least one token in the "ham" set and zero tokens in
the "spam" set, you classify it as ham.  And conversely, if it contains
at least one spam token but zero ham tokens, you classify it as spam.

The other two possibilities (no tokens in either or some tokens in both)
are undecidable.
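
The four cases above, and the "decidable" fraction being asked about, can
be sketched in Python as follows. This is only an illustration of the rule
as described in this message; the names verdict and decidable_fraction and
the sample data are hypothetical.

```python
def verdict(tokens, ham_only, spam_only):
    """Apply the rule: ham-only hits without spam-only hits is ham, and vice versa."""
    has_ham = bool(tokens & ham_only)
    has_spam = bool(tokens & spam_only)
    if has_ham and not has_spam:
        return "ham"
    if has_spam and not has_ham:
        return "spam"
    # No hits in either set, or hits in both sets.
    return "undecidable"


def decidable_fraction(messages, ham_only, spam_only):
    """Fraction of messages that receive a ham or spam verdict."""
    verdicts = [verdict(m, ham_only, spam_only) for m in messages]
    decided = sum(v != "undecidable" for v in verdicts)
    return decided / len(verdicts)


# Made-up token sets, one message per case.
ham_only = {"of the lung"}
spam_only = {"cheapest viagra online"}
messages = [
    {"of the lung", "dear"},                    # ham
    {"cheapest viagra online", "dear"},         # spam
    {"dear", "regards"},                        # no hits in either set
    {"of the lung", "cheapest viagra online"},  # hits in both sets
]
print(decidable_fraction(messages, ham_only, spam_only))
```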

So.  What percentage of emails using your algorithm are actually decidable?

Regards,

Dianne.



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel



On 08/22/16 06:55, Antony Stone wrote:

> On Monday 22 August 2016 at 15:46:41, Dianne Skoll wrote:
>
> > On Mon, 22 Aug 2016 06:04:49 -0700
> >
> > Marc Perkel  wrote:
> > > Set A - a  finite set - has some members,
> > > Set B - an infinite set - is everything that is NOT in Set A
> >
> > Set B is a very special case of an infinite set.  We're talking about
> > infinite sets in general.
> >
> > Also, you have to realize that although set B is in principle infinite,
> > in practice it is not.  Computers have finite memory, and although the
> > number of email tokens representable in the memory of a computer is very,
> > very, very large, it's not infinite.
>
> I do not think that Marc is proposing to actually store set B in a computer
> (or anywhere else).
>
> Set B is simply a theoretical construct, defined as the inverse of Set A, and
> to discover whether something is a member of it, you do not search through the
> infinite set B for a match, you instead check all members of finite set A for a
> non-match.
>
> If nothing in Set A matches X, then X is a member of Set B.
>
> Antony.

Antony, Yes - I don't store Set B. I store Set A. B is defined by 
what's NOT in A. So I test A and if it's not matched it's set B. Set B 
is just a negative match on A.





Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 15:46:41, Dianne Skoll wrote:

> On Mon, 22 Aug 2016 06:04:49 -0700
> 
> Marc Perkel  wrote:
> > Set A - a  finite set - has some members,
> > Set B - an infinite set - is everything that is NOT in Set A
> 
> Set B is a very special case of an infinite set.  We're talking about
> infinite sets in general.
> 
> Also, you have to realize that although set B is in principle infinite,
> in practice it is not.  Computers have finite memory, and although the
> number of email tokens representable in the memory of a computer is very,
> very, very large, it's not infinite.

I do not think that Marc is proposing to actually store set B in a computer 
(or anywhere else).

Set B is simply a theoretical construct, defined as the inverse of Set A, and 
to discover whether something is a member of it, you do not search through the 
infinite set B for a match, you instead check all members of finite set A for a 
non-match.

If nothing in Set A matches X, then X is a member of Set B.
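
As a rough illustration of that membership test, here is a minimal Python
sketch in which only the finite Set A is ever stored and Set B is
represented by a negated lookup. The class name ComplementSet and the
sample tokens are hypothetical.

```python
class ComplementSet:
    """Everything that is NOT in a given finite set; only the finite set is stored."""

    def __init__(self, members):
        self._excluded = frozenset(members)

    def __contains__(self, item):
        # X is in Set B exactly when nothing in Set A matches X.
        return item not in self._excluded


set_a = {"viagra", "invoice"}
set_b = ComplementSet(set_a)

print("of the lung" in set_b)
print("viagra" in set_b)
```

The lookup cost depends only on the size of Set A, however large Set B is
in principle.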


Antony.

-- 
I have an excellent memory.
I can't think of a single thing I've forgotten.



Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 06:04:49 -0700
Marc Perkel  wrote:

> Set A - a  finite set - has some members,
> Set B - an infinite set - is everything that is NOT in Set A

Set B is a very special case of an infinite set.  We're talking about
infinite sets in general.

Also, you have to realize that although set B is in principle infinite,
in practice it is not.  Computers have finite memory, and although the
number of email tokens representable in the memory of a computer is very,
very, very large, it's not infinite.

Regards,

Dianne.


Re: Matching infinite sets

2016-08-22 Thread Dianne Skoll
On Mon, 22 Aug 2016 08:54:48 -0400
Michael Orlitzky  wrote:

> The empty set contains itself.

No, it doesn't.  By definition.

Regards,

Dianne.


Re: Matching infinite sets

2016-08-22 Thread Antony Stone
On Monday 22 August 2016 at 15:04:49, Marc Perkel wrote:

> I'm confused by the confusion here.
> 
> Set A - a  finite set - has some members,
> Set B - an infinite set - is everything that is NOT in Set A
> 
> So you match a test item to Set A and if it matches it's a member of A.
> If it doesn't match Set A it's a member of B.
> 
> How is this not really simple?

Because "everything that is NOT in Set A" means some surprisingly complicated 
things to some people, which I believe are irrelevant for the purposes of 
your spam identifier.

It might keep the pedants happier if you were to identify the sets as:

Set A contains some email tokens.

Set B contains all possible email tokens which are not in Set A.

This then precludes the possibility that Set B might contain itself, since a 
set is not a plausible email token.


Antony.

-- 
I just got a new mobile phone, and I called it Titanic.  It's already syncing.



Re: Matching infinite sets

2016-08-22 Thread RW
On Mon, 22 Aug 2016 09:55:10 +1200
Sidney Markowitz wrote:


>  I'm one of those people he mentions who understands
> how Bayesian spam filtering works who has yet to wrap my head around
> what he is presenting - For now I'm staying agnostic about it until I
> do understand it better).

What it amounts to is:

Training: 

- tokenize a corpus of spam and ham 
- compile a list of tokens that occur only in spam and a list of
  tokens that only occur in ham

Classification:

- Tokenize the email
- count how many of the tokens are in each of the two lists
- compare the two counts
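
The two steps above could be sketched in Python roughly as follows. This
is an illustration of the summary only, not anyone's actual filter: each
message is assumed to be already tokenized into a set, and the names train
and classify, the "unsure" label for a tie, and the sample corpora are all
hypothetical.

```python
def train(spam_corpus, ham_corpus):
    """Return (spam_only, ham_only) token lists from two corpora of token sets."""
    spam_seen = set().union(*spam_corpus)
    ham_seen = set().union(*ham_corpus)
    return spam_seen - ham_seen, ham_seen - spam_seen


def classify(tokens, spam_only, ham_only):
    """Count the hits in each list and compare the two counts."""
    spam_count = len(tokens & spam_only)
    ham_count = len(tokens & ham_only)
    if spam_count > ham_count:
        return "spam"
    if ham_count > spam_count:
        return "ham"
    return "unsure"  # assumption: the summary does not say what a tie means


# Made-up corpora; "viagra" appears in both, so it lands in neither list.
spam_corpus = [{"viagra", "cheapest viagra online"}, {"viagra", "click here"}]
ham_corpus = [{"viagra", "of the lung"}]
spam_only, ham_only = train(spam_corpus, ham_corpus)

print(classify({"viagra", "click here"}, spam_only, ham_only))
```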


In Bayes, if you set Robinson's S parameter to 0, then tokens that only
occur in spam or ham get a token probability of exactly 1 and 0
respectively. 

Tokens that have been seen in both spam and ham get a probability
between 0 and 1. So if you then set MIN_PROB_STRENGTH to 0.5 you can
discard all of these. 

All of the remaining tokens have probabilities of 0 or 1 so running
them through the chi-squared calculation (or any sensible symmetric
combining algorithm) and then comparing the result to 0.5  gives the
same result as comparing the number of spam-only and ham-only tokens.
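
To make the effect of those two constants concrete, here is a small Python
sketch of Robinson's smoothing formula f(w) = (s*x + n*p) / (s + n), where
p is the raw spam probability of a token, n is the number of times it has
been seen, and x is the assumed probability for unseen tokens. The
function names and the exact form of the strength cutoff are assumptions
made for illustration and are not taken from SpamAssassin's code.

```python
def robinson_f(p, n, s=0.0, x=0.5):
    """Robinson's smoothed token probability; with s == 0 it reduces to p."""
    return (s * x + n * p) / (s + n)


def is_kept(f, min_prob_strength=0.5):
    """Keep a token only if it is at least min_prob_strength away from 0.5."""
    return abs(f - 0.5) >= min_prob_strength


spam_only_f = robinson_f(p=1.0, n=3)   # token seen only in spam
ham_only_f = robinson_f(p=0.0, n=2)    # token seen only in ham
mixed_f = robinson_f(p=0.75, n=4)      # token seen in both

for f in (spam_only_f, ham_only_f, mixed_f):
    print(f, is_kept(f))
```

With S set to 0 and the cutoff at 0.5, only the tokens at exactly 0 or 1
survive, which are the spam-only and ham-only lists.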

In short it's mathematically equivalent to Bayes with different
tokenization and different constants; and on the face of it
the values of S and MIN_PROB_STRENGTH are very sub-optimal. 

OTOH it wouldn't surprise me if the tokenization is much better.




Re: Matching infinite sets

2016-08-22 Thread Michael Orlitzky
On 08/22/2016 09:02 AM, Joe Quinn wrote:
> On 8/22/2016 8:54 AM, Michael Orlitzky wrote:
>> On 08/21/2016 03:22 PM, Damian wrote:
>>> There is no such set B, as it would contain itself.
>> The empty set contains itself.
> That's an easy mistake to make. The empty set is {}, the set that
> contains only the empty set is {{}}. Sets are discrete elements that
> don't get "flattened".
> 
> In perl syntactic lists do get flattened though, which leads to some fun
> times. You can do silly things like @concatenated = (@listOne, @listTwo).

"Contains" in the context of sets means "is a superset of" =)

(I'm just being pedantic, I don't actually have a point.)
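
The two readings of "contains" in this exchange (membership versus
superset) can be told apart with Python's frozenset, used here only as an
analogy for the sets under discussion:

```python
empty = frozenset()            # {}
nested = frozenset([empty])    # {{}}

is_member_of_itself = empty in empty    # membership reading
is_subset_of_itself = empty <= empty    # superset/subset reading
is_member_of_nested = empty in nested

print(is_member_of_itself, is_subset_of_itself, is_member_of_nested)
print(len(empty), len(nested))
```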



Re: Matching infinite sets

2016-08-22 Thread Marc Perkel

I'm confused by the confusion here.

Set A - a  finite set - has some members,
Set B - an infinite set - is everything that is NOT in Set A

So you match a test item to Set A and if it matches it's a member of A. 
If it doesn't match Set A it's a member of B.


How is this not really simple?


Re: Matching infinite sets

2016-08-22 Thread Joe Quinn

On 8/22/2016 8:54 AM, Michael Orlitzky wrote:

> On 08/21/2016 03:22 PM, Damian wrote:
> > There is no such set B, as it would contain itself.
>
> The empty set contains itself.

That's an easy mistake to make. The empty set is {}, the set that 
contains only the empty set is {{}}. Sets are discrete elements that 
don't get "flattened".


In perl syntactic lists do get flattened though, which leads to some fun 
times. You can do silly things like @concatenated = (@listOne, @listTwo).


Re: Matching infinite sets

2016-08-22 Thread Michael Orlitzky
On 08/21/2016 03:22 PM, Damian wrote:
>>
> There is no such set B, as it would contain itself.

The empty set contains itself.



Re: Matching infinite sets

2016-08-22 Thread Joe Quinn

On 8/21/2016 5:55 PM, Sidney Markowitz wrote:

> Dianne Skoll wrote on 22/08/16 8:56 AM:
> > And... why can't a set contain itself?
>
> It can't in standard modern set theory (ZFC), through the foundation axioms,
> also known as the axiom of regularity
>    https://en.wikipedia.org/wiki/Axiom_of_regularity
> which is a formulation that allows set theory to avoid Russell's Paradox.
> (see also https://en.wikipedia.org/wiki/ZFC)
>
> Just like Euclidean Geometry has the axiom that parallel lines never meet, and
> you get various non-euclidean geometries by changing that axiom, there are
> non-standard set theories that do not include the axiom of regularity, in
> which there can be sets that include themselves.
>
> None of that is relevant to the discussion of Marc Perkel's ideas because he
> is talking about sets of tokens from email (or sets of potential tokens?) not
> sets that contain sets. And all he needs to do with his infinite sets is be
> able to test if a token is in it, which is easy to do since the set is defined
> as the complement of a finite set. (I'm not saying this to agree with the
> method as good or to argue against it. I'm one of those people he mentions who
> understands how Bayesian spam filtering works who has yet to wrap my head
> around what he is presenting - For now I'm staying agnostic about it until I
> do understand it better).
>
>   Sidney

This is a good summary. As a fun theoretical side-note, ZFC can be 
interpreted as a type theory and then used as a way to reason about the 
behavior of programs. One of its major weaknesses is that it's possible 
to formulate exactly this sort of issue where a set can contain other 
sets of unknown depth. This corresponds to untyped programming languages 
and is almost always resolved by formalizations that correspond to 
adding a type system (as your last paragraph does).


But back to discussing Bayes... ;)