First, you are confusing several things here about how SA works. If you
understand this better you will have a better chance of deciding if SA could do
what you want it to do.
SA works two ways (well, a lot more, but two of importance here):
1) by hard-coded rules that check for known kinds of patterns, and
2) by using Bayesian filtering.
Rules do not require "training". Instead they require occasional monitoring for
effectiveness, and they have to be designed, tested, and written by hand. SA
needs to be restarted when new rules are added.
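For illustration, a hand-written rule in a .cf file looks something like this
(the rule name, pattern, and score here are invented; body and uri are the
usual rule types):

```
# Illustrative only -- the rule name, pattern, and score are made up.
body     DEMO_OBFU_WORD   /\bd[o0]ma[i1]n\b/i
describe DEMO_OBFU_WORD   Example hand-written pattern rule
score    DEMO_OBFU_WORD   2.5
```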
Bayesian filtering works off tokens, and requires training. It needs matches
on a number of tokens to produce a score, unlike a rule, which can match a
single pattern and assign one. You do not really have to understand much
about how Bayes works, beyond being able to train it by feeding it messages
that are good and bad, and telling it which kind you are giving it.
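Training is just a matter of feeding classified mail to sa-learn (the corpus
paths here are placeholders):

```shell
# Feed known-bad and known-good mail to the Bayes database.
sa-learn --spam /path/to/spam-corpus
sa-learn --ham  /path/to/ham-corpus
# Show how many tokens and messages have been learned.
sa-learn --dump magic
```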
What you want to look for is patterns within a single item. These patterns can
contain a lot of punctuation, and Bayes tends to break tokens on punctuation,
so the one URL might become quite a few tokens. That may or may not be
useful. My guess is that you would be much better off not using Bayes at all.
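As a rough sketch of why that happens (SA's real tokenizer is more involved
than this; it only illustrates the effect of splitting on punctuation):

```python
import re

# A crude stand-in for Bayes tokenization: split on punctuation.
# SA's real tokenizer is more sophisticated; this only illustrates
# how one obfuscated URL turns into several small tokens.
def crude_tokens(text):
    return [t for t in re.split(r"[-._/]+", text) if t]

print(crude_tokens("my-d0m.a1n_name.com"))
# ['my', 'd0m', 'a1n', 'name', 'com']
```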
Which means that you would have to write rules to catch the sort of things you
want to catch. In theory you would have to write at least one obfuscation rule
for every domain name you want to check on.
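Such a rule might look roughly like this (the pattern and score are invented
for illustration, allowing look-alike characters and optional punctuation for
one domain):

```
# Illustrative only: catch look-alikes of mydomainname.com.
uri      FAKE_MYDOMAIN  /my[-._]*d[o0][-._]*m[-._]*a[i1]n[-._]*name[-._]*\.com/i
describe FAKE_MYDOMAIN  Obfuscated look-alike of mydomainname.com
score    FAKE_MYDOMAIN  5.0
```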
This would be a manual process. But it could be automated to some extent by
using some of the various tools available for making obfuscated phrase checks.
I have no idea how well this would work. My thought is that it would be a lot
of overkill, or at least overhead, to do what you want. You might be best off
building a process where you can pipe new domain names through an obfuscation
rule generator, and then combine the ever-growing output into a perl script.
Then pipe test domain names through this script, and see which ones it flags as
being possibly bogus.
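To make that idea concrete, here is a rough sketch of such a generator in
Python (the same logic could be the perl script described above; the
look-alike table is a small invented subset, not a complete mapping):

```python
import re

# A small, invented subset of look-alike substitutions; a real
# generator would need a much more complete table.
LOOKALIKES = {
    "i": "[i1l|]",
    "l": "[l1i|]",
    "o": "[o0]",
    "e": "[e3]",
    "a": "[a4@]",
    "s": "[s5$]",
}

# Gratuitous punctuation that may be inserted between characters.
SEP = "[-._]*"

def obfuscation_regex(domain):
    """Build a regex matching obfuscated variants of one known domain."""
    name, _, tld = domain.lower().rpartition(".")
    parts = [LOOKALIKES.get(c, re.escape(c)) for c in name]
    pattern = SEP.join(parts) + SEP + re.escape("." + tld) + "$"
    return re.compile(pattern, re.IGNORECASE)

rx = obfuscation_regex("mydomainname.com")
print(bool(rx.search("my-d0m.a1n_name.com")))  # True: the obfuscated look-alike
print(bool(rx.search("unrelated.com")))        # False
```

New known domains can be piped through obfuscation_regex and the resulting
patterns collected into one script; test domains that match any pattern get
flagged as possibly bogus.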
You could do that with SA, but it might be more work than simply writing and
running a standalone perl script.
Loren
-----Original Message-----
From: "Andrews, Rick" <[EMAIL PROTECTED]>
Sent: Oct 28, 2004 6:15 PM
To: "'[email protected]'" <[email protected]>
Subject: Using SpamAssassin, but not for spam
Greetings,
I'm trying to investigate whether SpamAssassin can be used in a non-spam
application that we're trying to build. I've read lots of stuff on the
website but I'm still not sure. I thought I would ask you, the experts.
The application needs to determine whether a certain domain name is
"similar" to another domain name. We have a list of known domain names, and
occasionally want to compare a "target" domain name to see if it is similar
to any of the known domain names. The target might contain replacement
characters ("1" instead of "I" or "L", zero instead of "O", gratuitous dots
or hyphens, etc.) in much the same way that spammers try to get past spam
filters. That's why I thought SpamAssassin might be appropriate. To give an
example, we want to automatically detect that "my-d0m.a1n_name.com" is very
close to "mydomainname.com".
But from what I've read, I think it may not be appropriate for several
reasons:
1) We probably would have much more ham (known domain names) than spam
(domains close to a known domain name, but not legitimate)
2) We wouldn't have large amounts of ham or spam to feed through
SpamAssassin to enable it to learn and improve
3) The "target" domain name would in most cases be a single token as far as
SpamAssassin is concerned; unlike an email which likely contains hundreds of
tokens from which to decide if it is spam
What do you think? Would it take a lot of work to adapt SpamAssassin for
this application? Does it seem like an appropriate tool to use?
Thanks in advance,
-Rick