Re: List of banned words/bounce to sender

2010-08-11 Thread Martin Gregorie
On Tue, 2010-08-10 at 19:24 -0700, jdow wrote:
 From: Martin Gregorie mar...@gregorie.org
 Sent: Monday, 2010/August/09 18:08
 
 
  On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:
  From: Martin Gregorie mar...@gregorie.org
   Something like this will match a sequence of two capitalised name 
   words,
   including hyphenated ones, and extract the name words:
  
   /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/
  
   and should be fairly easy to extend to deal with initials and/or more
   than one forename. Tested in Python and should also work in Perl.
  
 
  That solves the Reginald Slovotniksky type names. But, John Smith? 
  Dunno.
 
  The regex I showed will return 'John' and 'Smith' so the combo can be
  queried in the database, which is all I set out to try. However, I was
  trying to generalise as a regex that would match two or more Capitalised
  Names and return them as an array of group values but I couldn't work
  out how to do that without writing a rather tedious set of ever longer
  alternates. If anybody knows how to do that without resorting to
  alternatives I'd be fascinated to know how you do that.
 
 Ah, but Martin, do you really know if it is the John Smith in the database 
 or
 another John Smith?
 
No, but then nobody knows that. If you're scanning for patient names in
body text and a common name happens to match a patient, it's an ambiguous
situation that can only be resolved if you can write a rule that
reliably disambiguates it by recognising the name's context.

Martin




Re: List of banned words/bounce to sender

2010-08-10 Thread Henrik K
On Tue, Aug 10, 2010 at 02:08:28AM +0100, Martin Gregorie wrote:
 On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:
  From: Martin Gregorie mar...@gregorie.org
   Something like this will match a sequence of two capitalised name words,
   including hyphenated ones, and extract the name words:
  
   /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/
  
   and should be fairly easy to extend to deal with initials and/or more
   than one forename. Tested in Python and should also work in Perl.
  
 
  That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.
  
 The regex I showed will return 'John' and 'Smith' so the combo can be
 queried in the database, which is all I set out to try. However, I was
 trying to generalise as a regex that would match two or more Capitalised
 Names and return them as an array of group values but I couldn't work
 out how to do that without writing a rather tedious set of ever longer
 alternates. If anybody knows how to do that without resorting to
 alternatives I'd be fascinated to know how you do that.

Ok I did some more testing since this is an interesting experiment..

I dumped 15000 mail bodies into a file, as SA sees them, and fed it to a
simple Perl script.

Runtime for different methods (memory used including Perl itself):

- Single 70000 name regex, 20s (8MB)
- 7 regexes of 10000 names each, 141s (9MB)
- Martin style, lookups from Perl hash, 8s (12MB)

So it seems a single regex is much preferable to a few smaller ones.
Though creating it with Regexp::Assemble required 250MB of memory..

Yeah looking at this I would go for the generic regex and test all matches
with names stored in Perl hash. Average count of names to check per
message was around 100, so using SQL directly would be inefficient though
possible.
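
A minimal Python sketch of the "generic regex plus hash lookup" idea, for
illustration only. The name table, the pattern limits and the function name
are assumptions, not the script that produced the timings above; in practice
the table would be loaded from the dumped name list.

```python
import re

# Hypothetical name table; a real one would be read from a file at startup.
NAMES = {("john", "smith"), ("mary-kate", "olsen")}

# Generic candidate pattern: two name-like words, optionally "Last, First".
# Wrapping it in a lookahead lets overlapping word pairs all be tested.
CANDIDATE = re.compile(
    r"(?=\b([a-z][-a-z]{1,20}),? ([a-z][-a-z]{1,20})\b)", re.I
)

def find_known_names(text):
    """Return candidate pairs from text that appear in NAMES in either order."""
    hits = []
    for m in CANDIDATE.finditer(text):
        a, b = m.group(1).lower(), m.group(2).lower()
        if (a, b) in NAMES or (b, a) in NAMES:
            hits.append(m.group(1) + " " + m.group(2))
    return hits

print(find_known_names("Please call Smith, John about Mary-Kate Olsen."))
```

Each candidate costs one or two set lookups, which is why the size of the
name list barely matters for this approach.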

Anyways, I concur that with so many names you would probably get lots of
FPs.. identical doctors, friends, john doe wrote: etc..



Re: List of banned words/bounce to sender

2010-08-10 Thread Martin Gregorie
On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
 Runtime for different methods (memory used including Perl itself):
 
 - Single 70000 name regex, 20s (8MB)
 - 7 regexes of 10000 names each, 141s (9MB)
 - Martin style, lookups from Perl hash, 8s (12MB)
 
Very interesting indeed. Thanks for trying it. I'm not surprised that
the set of 7 regexes took longer than the one big one, but I am
surprised that the time difference is so close to a factor of 7.

Out of interest, did you leave the headers in your test messages? I did
initially when I developed the generic name matches, but then removed
them because most of the hits were in headers while the real-life
scan-and-compare rule would only be applied to the body. 

Of course, there should be almost no difference if there is no match in
a message, but on average we can guess that the single regex will do
35,000 attempted matches for every candidate name pair that generates a
hit while the set of seven will do 65,000 attempted matches (6 x 10000
for the six regexes that don't contain the match and 5000 on average for
the one that does).
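
The estimate can be restated as a few lines of arithmetic. This is only a
back-of-envelope model, assuming 70,000 names split evenly over seven
regexes (10,000 each, which is what the 65,000 figure implies) and that a
hit is found, on average, halfway through the alternates of the regex that
contains it.

```python
NAMES_TOTAL = 70_000
REGEX_COUNT = 7
per_regex = NAMES_TOTAL // REGEX_COUNT        # 10,000 alternates per regex

# One big regex: on average half of the alternates are tried before the hit.
single_regex_attempts = NAMES_TOTAL // 2

# Seven regexes: six are exhausted without a match, one is searched halfway.
seven_regex_attempts = (REGEX_COUNT - 1) * per_regex + per_regex // 2

print(single_regex_attempts, seven_regex_attempts)
```

The model ignores the trie optimisation in the regex engine, so it
explains the direction of the difference rather than the measured factor.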

One thing this experiment makes clear is that a rule containing a lot of
alternates, such as one scanning the body for misspelt words, will
perform better if it contains one long regex rather than a set of
shorter regexes plus an OR meta to combine them - the latter is easier
to maintain but slower running. 

In the past I used the second form but now I always use a single long
regex that is built from a rule definition file with my 'portmanteau'
script - its rule definition file is easy to maintain because it holds
each alternate pattern on a separate line.
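
A rough sketch of how such a builder could work, in Python. It is not the
'portmanteau' script itself, which is not shown in this thread; the
definition format (one alternate per line, '#' for comments), the rule name
and the sample alternates are all assumptions made for illustration.

```python
def build_rule(rule_name, definition_lines):
    """Join one-alternate-per-line definitions into a single body rule."""
    alternates = []
    for line in definition_lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        alternates.append(line)
    # One long alternation instead of several rules combined by a meta rule.
    return "body %s /\\b(?:%s)\\b/i" % (rule_name, "|".join(alternates))

# Hypothetical rule definition file contents.
definition = """\
# misspelt words, one alternate pattern per line
rec[ei]{2}ve
sep[ae]rate
def[ia]n[ia]tely
"""

rule = build_rule("MISSPELT_WORDS", definition.splitlines())
print(rule)
```

The definition file stays easy to edit because adding a word means adding
a line, while the generated rule remains a single regex.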
 

Martin




Re: List of banned words/bounce to sender

2010-08-10 Thread Henrik K
On Tue, Aug 10, 2010 at 10:47:15AM +0100, Martin Gregorie wrote:
 On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
  Runtime for different methods (memory used including Perl itself):
  
  - Single 70000 name regex, 20s (8MB)
  - 7 regexes of 10000 names each, 141s (9MB)
  - Martin style, lookups from Perl hash, 8s (12MB)
  
 Very interesting indeed. Thanks for trying it. I'm not surprised that
 the set of 7 regexes took longer than the one big one, but I am
 surprised that the time difference is so close to the factor of 7.

I guess the seven regexes contain lots of similar strings, so it's lots of
duplicate work compared to a single trie.

Credits to Perl 5.10 enhancements:

http://www.regex-engineer.org/slides/img38.html
http://taint.org/2006/07/07/184022a.html

I don't know if Python implements such..

 Out of interest, did you leave the headers in your test messages? I did
 initially when I developed the generic name matches, but then removed
 them because most of the hits were in headers while the real-life
 scan-and-compare rule would only be applied to the body. 

Just the body as print get_rendered_body_text_array().

For the record, matching wasn't as simple as one could think..

Normal while (/foo bar/g) won't work, since for the input:
  word1 word2 word3 word4
it would result in only two matches: "word1 word2" and "word3 word4", but
we need to check "word2 word3" also.

Big help was page 20+:
http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf

Basically you need to do something like:

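$pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
# The code block runs for every candidate pair; %names is keyed "first,last"
$check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"} || defined $names{lc "$3,$2"} })/;
while (<>) {
    $found = undef;
    # (?!) always fails, forcing the engine to try every possible match
    /$pat$check(?!)/;
    print "$found\n" if defined $found;
}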

Hope this helps someone ;)

 One thing this experiment makes clear is that a rule containing a lot of
 alternates, such as one scanning the body for misspelt words, will
 perform better if it contains one long regex rather than a set of
 shorter regexes plus an OR meta to combine them - the latter is easier
 to maintain but slower running. 


 In the past I used the second form but now I always use a single long
 regex that is built from a rule definition file with my 'portmanteau'
 script - its rule definition file is easy to maintain because it holds
 each alternate pattern on a separate line.

Yep though I guess most rules are so simple that they don't create much
penalty. Using sa-compile the difference should be negligible and it's easy to
see the exact rule hitting (of course you can find the string with debugging
also).



Re: List of banned words/bounce to sender

2010-08-10 Thread Henrik K
On Tue, Aug 10, 2010 at 01:35:55PM +0300, Henrik K wrote:
 
 Big help was page 20+:
 http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf
 
 Basically you need to do something like:
 
 $pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
 $check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"} || defined $names{lc "$3,$2"} })/;
 while (<>) {
     $found = undef;
     /$pat$check(?!)/;
     print "$found\n" if defined $found;
 }

Blah never mind.. the simpler example there works also, I messed up
something trying it..

$pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
while (<>) {
    # The lookahead consumes no text, so overlapping word pairs are all tried
    while (/(?=$pat)/g) {
        print "$1\n";
    }
}

And even slightly better since it doesn't overly backtrack.
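
The same lookahead trick works in Python, which makes the difference easy
to see side by side. This is an illustrative sketch; the simplified pattern
and the sample text are assumptions chosen to keep the output short.

```python
import re

PAIR = r"\b([a-z][-a-z]*) ([a-z][-a-z]*)\b"
text = "alpha beta gamma delta"

# Plain global matching consumes each pair, so "beta gamma" is never tried.
plain = [m.group(0) for m in re.finditer(PAIR, text, re.I)]

# Inside a lookahead nothing is consumed, so every adjacent pair is seen.
overlapping = [
    m.group(1) + " " + m.group(2)
    for m in re.finditer("(?=" + PAIR + ")", text, re.I)
]

print(plain)        # ['alpha beta', 'gamma delta']
print(overlapping)  # ['alpha beta', 'beta gamma', 'gamma delta']
```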



Re: List of banned words/bounce to sender

2010-08-10 Thread John Hardin

On Tue, 10 Aug 2010, Henrik K wrote:


Ok I did some more testing since this is an interesting experiment..

I dumped 15000 mail bodies into a file like SA sees them and feeded it 
to simple Perl script.


Runtime for different methods (memory used including Perl itself):

- Single 70000 name regex, 20s (8MB)
- 7 regexes of 10000 names each, 141s (9MB)
- Martin style, lookups from Perl hash, 8s (12MB)

Yeah looking at this I would go for the generic regex and test all 
matches with names stored in Perl hash. Average count of names to 
check per message was around 100, so using SQL directly would be 
inefficient though possible.


This smells like a custom plugin, building a hash from database queries of 
names added since plugin-local last-updated-datetime. Big initialization 
hit unless you build persistence into the plugin, but minimal database 
traffic primarily consisting of an IF EXISTS() query, and a few rows 
queried every time a new patient is added to the system.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Vista: Windows ME for the XP generation.
---
 5 days until the 65th anniversary of the end of World War II


Re: List of banned words/bounce to sender

2010-08-10 Thread Henrik K
On Tue, Aug 10, 2010 at 07:37:32AM -0700, John Hardin wrote:
 On Tue, 10 Aug 2010, Henrik K wrote:
 
 Ok I did some more testing since this is an interesting experiment..
 
 I dumped 15000 mail bodies into a file like SA sees them and
 feeded it to simple Perl script.
 
 Runtime for different methods (memory used including Perl itself):
 
 - Single 70000 name regex, 20s (8MB)
 - 7 regexes of 10000 names each, 141s (9MB)
 - Martin style, lookups from Perl hash, 8s (12MB)
 
 Yeah looking at this I would go for the generic regex and test all
 matches with names stored in Perl hash. Average count of names
 to check per message was around 100, so using SQL directly would
 be inefficient though possible.
 
 This smells like a custom plugin, building a hash from database
 queries of names added since plugin-local last-updated-datetime. Big
 initialization hit unless you build persistence into the plugin, but
 minimal database traffic primarily consisting of an IF EXISTS()
 query, and a few rows queried every time a new patient is added to
 the system.

That just sounds like too much work for little gain. Perhaps better in some other
scenario.. it was already stated that the patient names are dumped
somewhere, and usually SA is only reloaded once a day or such. Reading in a
file is the simplest and most efficient way to go. Not to mention
insecurities that might arise from querying a patient database.



Re: List of banned words/bounce to sender

2010-08-10 Thread John Hardin

On Tue, 10 Aug 2010, Henrik K wrote:


On Tue, Aug 10, 2010 at 07:37:32AM -0700, John Hardin wrote:

On Tue, 10 Aug 2010, Henrik K wrote:


Ok I did some more testing since this is an interesting experiment..

I dumped 15000 mail bodies into a file like SA sees them and
feeded it to simple Perl script.

Runtime for different methods (memory used including Perl itself):

- Single 70000 name regex, 20s (8MB)
- 7 regexes of 10000 names each, 141s (9MB)
- Martin style, lookups from Perl hash, 8s (12MB)

Yeah looking at this I would go for the generic regex and test all
matches with names stored in Perl hash. Average count of names
to check per message was around 100, so using SQL directly would
be inefficient though possible.


This smells like a custom plugin, building a hash from database
queries of names added since plugin-local last-updated-datetime. Big
initialization hit unless you build persistence into the plugin, but
minimal database traffic primarily consisting of an IF EXISTS()
query, and a few rows queried every time a new patient is added to
the system.


That just sounds too much work for little gain. Perhaps better in some 
other scenario.. it was already stated that the patient names are dumped 
somewhere, and usually SA is only reloaded once a day or such. Reading 
in a file is the simplest and most efficient way to go. Not to mention 
insecurities that might arise from querying a patient database.


Ah; I missed (or forgot) the non-real-time nature of the scenario. Never 
mind, then; batch-generated rules or a plugin with a batch-generated 
static hashtable would suffice.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 An operating system design that requires a system reboot in order to
 install a document viewing utility does not earn my respect.
---
 5 days until the 65th anniversary of the end of World War II


Re: List of banned words/bounce to sender

2010-08-10 Thread jdow

From: Martin Gregorie mar...@gregorie.org
Sent: Monday, 2010/August/09 18:08



On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:

From: Martin Gregorie mar...@gregorie.org
 Something like this will match a sequence of two capitalised name 
 words,

 including hyphenated ones, and extract the name words:

 /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

 and should be fairly easy to extend to deal with initials and/or more
 than one forename. Tested in Python and should also work in Perl.


That solves the Reginald Slovotniksky type names. But, John Smith? 
Dunno.



The regex I showed will return 'John' and 'Smith' so the combo can be
queried in the database, which is all I set out to try. However, I was
trying to generalise as a regex that would match two or more Capitalised
Names and return them as an array of group values but I couldn't work
out how to do that without writing a rather tedious set of ever longer
alternates. If anybody knows how to do that without resorting to
alternatives I'd be fascinated to know how you do that.


Ah, but Martin, do you really know if it is the John Smith in the database 
or

another John Smith?

{^_-} 



Re: List of banned words/bounce to sender

2010-08-09 Thread Martin Gregorie
On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
wrote:
 Thanks. We are looking at roughly 70,000 names and always growing. If I 
 gave it sufficient hardware, would you expect that to be practical, or 
 is that totally ridiculous? Any options for a database look up here?

I'd use a plugin that simply queries the database plus a rule to
activate the plugin by calling its eval() method and sets the score if
the rule fires.

I'm currently doing the reverse: 
- I use a view on a mail archive database to check whether I've
  previously sent mail to the sender of an incoming message and wrote
  an SA plugin to query the view. 
- a rule causes a plugin to query the view and whitelists the message
  by applying a large negative score if the query got a hit.
- the plugin + view does also detect incoming messages containing my
  addresses as forged senders since that's a common spammer trick.
 
This works very well. E-mail me off-list if you'd like a copy of the
plugin since I haven't yet published it. 

However, if you're not intending to apply any other rules to the
outgoing messages then using SA sounds a bit like overkill when you
could simply write a small program in Perl, Python, Java, etc. that
runs the query and causes your MTA to bounce any messages that contain
matches with a suitable error code or diagnostic.

Martin




Re: List of banned words/bounce to sender

2010-08-09 Thread Henrik K
On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
 On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
 wrote:
  Thanks. We are looking at roughly 70,000 names and always growing. If I 
  gave it sufficient hardware, would you expect that to be practical, or 
  is that totally ridiculous? Any options for a database look up here?
 
 I'd use a plugin that simply queries the database plus a rule to
 activate the plugin by calling its eval() method and sets the score if
 the rule fires.

Queries database for what? I guess you didn't read the thread fully. :-)

 I'm currently doing the reverse: 
 - I use a view on a mail archive database to check whether I've
   previously sent mail to the sender of an incoming message and wrote
   an SA plugin to query the view. 

In case someone is interested, I wrote similar policy daemon for Postfix:
http://mailfud.org/postpals/



Re: List of banned words/bounce to sender

2010-08-09 Thread Martin Gregorie
On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
 On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
  On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
  wrote:
   Thanks. We are looking at roughly 70,000 names and always growing. If I 
   gave it sufficient hardware, would you expect that to be practical, or 
   is that totally ridiculous? Any options for a database look up here?
  
  I'd use a plugin that simply queries the database plus a rule to
  activate the plugin by calling its eval() method and sets the score if
  the rule fires.
 
 Queries database for what? I guess you didn't read the thread fully. :-)
 
Queries the patient data DB for patient names - obviously. I made the
offer because I found it useful to be able to modify an existing plugin
that queried a database. Exactly what the SQL query does is largely
irrelevant. I found that the difficult bit was working out how to
configure the plugin to access my database. Constructing the query and
interpreting its result were relatively easy. 
 

Martin




Re: List of banned words/bounce to sender

2010-08-09 Thread Daniel McDonald
On 8/9/10 6:58 AM, Martin Gregorie mar...@gregorie.org wrote:

 On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
 On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
 On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
 wrote:
 Thanks. We are looking at roughly 70,000 names and always growing. If I
 gave it sufficient hardware, would you expect that to be practical, or
 is that totally ridiculous? Any options for a database look up here?
 
 I'd use a plugin that simply queries the database plus a rule to
 activate the plugin by calling its eval() method and sets the score if
 the rule fires.
 
 Queries database for what? I guess you didn't read the thread fully. :-)
 
 Queries the patient data DB for patient names - obviously. I made the
 offer because I found it useful to be able to modify an existing plugin
 that queried a database. Exactly what the SQL query does in largely
 irrelevant. I found that the difficult bit was working out to how to
 configure the plugin to access my database. Constructing the query and
 interpreting its result were relatively easy.

So, you are recommending that he use a plugin to query 70,000 records from a
database, and perform 140,000 body matches, for every e-mail message he
receives?  Doesn't seem very efficient.  It would make sense if it were
structured data he was looking at, to then perform one-off queries to see if
that data matched the database.  But the original post was discussing a
data-loss-prevention scheme to avoid unstructured data leaks.

If the data could be regularized somehow, that might be different.  For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word nearby,
and then do a database lookup to see if the capitalized word was a last name
associated with the first name that you discovered.  Unfortunately, people
are pretty random with first names.  I have a database of some 600K voters
in Travis County, Texas.  There are 38,808 distinct first names.  This
technique might cut down the number of rules by 93.5%, but then you have to
do database lookups and some fancy parsing to verify the hit.  Don't know if
that would be worth it.


-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281



Re: List of banned words/bounce to sender

2010-08-09 Thread Henrik K
On Mon, Aug 09, 2010 at 07:28:42AM -0500, Daniel McDonald wrote:

 This technique might cut down the number of rules by 93.5%, but then you
 have to do database lookups and some fancy parsing to verify the hit. 
 Don't know if that would be worth it.

Nope, people constantly underestimate the power of regexes.. of course you
can easily make bad ones, but Perl can run huge lists of simple alternations
FAST.

I downloaded a 1 random name pack, and made a quick hack to regexify it
with my favourite Regexp::Assemble.

--
#!/usr/bin/perl
use Regexp::Assemble;
$ra = Regexp::Assemble->new;
while (<STDIN>) {
    chomp;
    # Read comma separated names from stdin: Firstname,Lastname
    ($firstname, $lastname) = split(',', lc);
    # Firstname Lastname
    $ra->add("$firstname $lastname");
    # Lastname,? Firstname
    $ra->add("$lastname,? $firstname");
    # Print rule every 1 names
    # (?:^| ) instead of \b since Kate would hit Mary-Kate
    if (++$cnt % 1 == 0 || eof STDIN) {
        print 'body TEST_NAMES_'.++$idx;
        print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
    }
}
--
./names.pl < names.csv > names.cf

The resulting single 17 byte rule did not affect SA in any way, there was
virtually no difference in my mass check tests. Running the regex through
some file manually results in 8 lines/second. This is with one 3GHz core.
I think you can make rules/REs of MBs in size, but it probably gains nothing.
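
The comment in the script about using (?:^| ) instead of \b can be checked
with a small Python sketch. The sample name and test strings are made up for
illustration; the point is only that a hyphen counts as a word boundary.

```python
import re

# \b treats the hyphen in "Mary-Kate" as a boundary, so "Kate Olsen" matches.
loose = re.compile(r"\bkate olsen\b", re.I)

# Requiring start-of-line or a space before the name avoids that hit.
strict = re.compile(r"(?:^| )kate olsen(?:$| )", re.I)

print(bool(loose.search("Mary-Kate Olsen called")))    # True
print(bool(strict.search("Mary-Kate Olsen called")))   # False
print(bool(strict.search("ask Kate Olsen today")))     # True
print(bool(strict.search("Kate Olsen.")))              # False
```

The last line shows the trade-off: with space-only delimiters, a name
followed directly by punctuation is missed.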

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of exact signature that got hit (single name per sig)
- It would also match any header like To: From: etc (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance here is a
factor, especially with outgoing mail..



Re: List of banned words/bounce to sender

2010-08-09 Thread Matthew Kitchin (public/usenet)

 On 8/9/2010 8:27 AM, Henrik K wrote:

Nope, people constantly underestimate the power of regexes.. of course you
can easily make bad ones, but Perl can run huge lists of simple alternations
FAST.

I downloaded a 1 random name pack, and made a quick hack to regexify it
with my favourite Regexp::Assemble.

--
#!/usr/bin/perl
use Regexp::Assemble;
$ra = Regexp::Assemble->new;
while (<STDIN>) {
    chomp;
    # Read comma separated names from stdin: Firstname,Lastname
    ($firstname, $lastname) = split(',', lc);
    # Firstname Lastname
    $ra->add("$firstname $lastname");
    # Lastname,? Firstname
    $ra->add("$lastname,? $firstname");
    # Print rule every 1 names
    # (?:^| ) instead of \b since Kate would hit Mary-Kate
    if (++$cnt % 1 == 0 || eof STDIN) {
        print 'body TEST_NAMES_'.++$idx;
        print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
    }
}
--
./names.pl < names.csv > names.cf

The resulting single 17 byte rule did not affect SA in anyway, there was
virtually no difference in my mass check tests. Running the regex through
some file manually results in 8 lines/second. This with one 3Ghz core.
I think you can make rules/REs of MBs in size, but gains probably nothing.

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of exact signature that got hit (single name per sig)
- It would also match any header like To: From: etc (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance here is a
factor, especially with outgoing mail..


Thanks for the info.

- It would also match any header like To: From: etc (PRETTY BAD...)

That could be an issue. I will check to see if I can find a workaround, 
if not, ClamAV may not be an option.




Re: List of banned words/bounce to sender

2010-08-09 Thread jdow

From: Daniel McDonald dan.mcdon...@austinenergy.com
Sent: Monday, 2010/August/09 05:28



On 8/9/10 6:58 AM, Martin Gregorie mar...@gregorie.org wrote:


On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:

On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:

On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
wrote:
Thanks. We are looking at roughly 70,000 names and always growing. If 
I

gave it sufficient hardware, would you expect that to be practical, or
is that totally ridiculous? Any options for a database look up here?


I'd use a plugin that simply queries the database plus a rule to
activate the plugin by calling its eval() method and sets the score if
the rule fires.


Queries database for what? I guess you didn't read the thread fully. :-)


Queries the patient data DB for patient names - obviously. I made the
offer because I found it useful to be able to modify an existing plugin
that queried a database. Exactly what the SQL query does in largely
irrelevant. I found that the difficult bit was working out to how to
configure the plugin to access my database. Constructing the query and
interpreting its result were relatively easy.


So, you are recommending that he use a plugin to query 70,000 records from 
a

database, and perform 140,000 body matches, for every e-mail message he
receives?  Doesn't seem very efficient.  It would make sense if it were
structured data he was looking at, to then perform one-off queries to see 
if

that data matched the database.  But the original post was discussing a
data-loss-prevention scheme to avoid unstructured data leaks.

If the data could be regularized somehow, that might be different.  For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word 
nearby,
and then do a database lookup to see if the capitalized word was a last 
name

associated with the first name that you discovered.  Unfortunately, people
are pretty random with first names.  I have a database of some 600K voters
in Travis County, Texas.  There are 38,808 distinct first names.  This
technique might cut down the number of rules by 93.5%, but then you have 
to
do database lookups and some fancy parsing to verify the hit.  Don't know 
if

that would be worth it.


Um, a query for firstname=John and lastname=Smith and a query for
firstname=Smith and lastname=John is a start. (Match with the format for
the database.) One of the problems is picking out names and matching them
with other names close enough to them to be John Smith. Then you have to
guess the order; the two queries above handle that. Then you have to settle
on whether this is one of our John Smiths or a third party unrelated to our
database. I see that last one as the real problem.
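
The two-order lookup could be folded into one query, sketched here in Python
with an in-memory SQLite database. The table name, column names and sample
row are assumptions for illustration; a real patient database will have its
own schema and access controls.

```python
import sqlite3

# Hypothetical schema standing in for the real patient database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (firstname TEXT, lastname TEXT)")
conn.execute("INSERT INTO patients VALUES ('John', 'Smith')")

def is_patient(conn, word_a, word_b):
    """True if the two words match a patient in either name order."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM patients"
        " WHERE (lower(firstname) = lower(?) AND lower(lastname) = lower(?))"
        "    OR (lower(firstname) = lower(?) AND lower(lastname) = lower(?)))",
        (word_a, word_b, word_b, word_a),
    ).fetchone()
    return bool(row[0])

print(is_patient(conn, "Smith", "John"))   # True
print(is_patient(conn, "Jane", "Doe"))     # False
```

This only answers whether the name pair exists; it cannot tell whether the
John Smith in the message is the one in the database.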

{^_^} 



Re: List of banned words/bounce to sender

2010-08-09 Thread Martin Gregorie
On Mon, 2010-08-09 at 07:28 -0500, Daniel McDonald wrote:

 So, you are recommending that he use a plugin to query 70,000 records from a
 database, and perform 140,000 body matches, for every e-mail message he
 receives?

It should be possible to write a rule that recognises names: initials +
capitalised word, or a sequence of 2+ capitalised words, either
optionally prefixed with a title, may well work. Designing the regex
should be relatively easy because it only has to match the type of name
that can be generated from the database - no matter what you do that
would seem to be a fundamental limit on what is reasonably possible. Now
you only have to run a SQL query against the regex matches and this is
even easier if you use grouping in the regex to extract strings that
correspond to database fields and build the query from them.

Something like this will match a sequence of two capitalised name words,
including hyphenated ones, and extract the name words:

/([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

and should be fairly easy to extend to deal with initials and/or more
than one forename. Tested in Python and should also work in Perl.
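
For reference, a short Python sketch of how the pattern above extracts the
two name words as groups. The helper name and sample sentences are
illustrative assumptions.

```python
import re

# Two capitalised words (hyphens allowed) separated by one whitespace char.
NAME_PAIR = re.compile(r"([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})")

def first_name_pair(text):
    """Return (first, second) for the first capitalised pair, or None."""
    m = NAME_PAIR.search(text)
    return m.groups() if m else None

print(first_name_pair("The patient John Smith was seen today."))
print(first_name_pair("Referred by Anne-Marie Jones."))
```

The two groups can then be used as the values for a parameterised database
query.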

 Doesn't seem very efficient.  It would make sense if it were
 structured data he was looking at, to then perform one-off queries to see if
 that data matched the database.  But the original post was discussing a
 data-loss-prevention scheme to avoid unstructured data leaks.
 
Maybe so, but nor is building and applying a regex with 70,000+
alternates in it.

Of course it would be wise to prototype both approaches before deciding
whether to do anything at all, but I have a gut feeling that recognising
a candidate name and using the matching string to construct and run an
SQL query will be less resource intensive than applying a very large
regex. I guesstimate the latter at 10-20 bytes per name including
alternate separator, which is 700-1400 KB for 70,000 names.
  
 If the data could be regularized somehow, that might be different.  For
 example, if there were a limited number of first names, you could write
 signatures that looked for first names with another capitalized word nearby,
 and then do a database lookup to see if the capitalized word was a last name
 associated with the first name that you discovered.  Unfortunately, people
 are pretty random with first names.  I have a database of some 600K voters
 in Travis County, Texas.  There are 38,808 distinct first names.  This
 technique might cut down the number of rules by 93.5%, but then you have to
 do database lookups and some fancy parsing to verify the hit.  Don't know if
 that would be worth it.
 
Agreed: even if some matching scheme can be made to work, it's going to let
some names through, if only because the writer misspells names recorded
in the database. There's not a lot that can be done about that.

Martin






Re: List of banned words/bounce to sender

2010-08-09 Thread jdow

From: Martin Gregorie mar...@gregorie.org
Sent: Monday, 2010/August/09 15:45


On Mon, 2010-08-09 at 07:28 -0500, Daniel McDonald wrote:

So, you are recommending that he use a plugin to query 70,000 records 
from a

database, and perform 140,000 body matches, for every e-mail message he
receives?


It should be possible to write a rule that recognises names (initials +
capitalised word or a sequence of 2+ capitalised words, either
optionally prefixed with a title may well work. Designing the regex
should be relatively easy because it only has to match the type of name
that can be generated from the database - no matter what you do that
would seem to be a fundamental limit on what is reasonably possible. Now
you only have to run a SQL query against the regex matches and this is
even easier if you use grouping in the regex to extract strings that
correspond to database fields and build the query from them.

Something like this will match a sequence of two capitalised name words,
including hyphenated ones, and extract the name words:

/([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

and should be fairly easy to extend to deal with initials and/or more
than one forename. Tested in Python and should also work in Perl.


Doesn't seem very efficient.  It would make sense if it were
structured data he was looking at, to then perform one-off queries to see 
if

that data matched the database.  But the original post was discussing a
data-loss-prevention scheme to avoid unstructured data leaks.


Maybe so, but nor is building and applying a regex with 70,000+
alternates in it.

Of course it would be wise to prototype both approaches before deciding
whether to do anything at all, but I have a gut feeling that recognising
a candidate name and using the matching string to construct and run an
SQL query will be less resource intensive than applying a very large
regex. I guesstimate the latter at 10-20 bytes per name including
alternate separator, which is 700-1400 kB for 70,000 names.
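
The size estimate is easy to check with a few lines of Python. This is only a rough sketch: the three sample names are hypothetical stand-ins for the real patient list, and the 10-20 bytes per name figure is the guess from the paragraph above.

```python
def alternation_bytes(names):
    """Size in bytes of a plain 'a|b|c' alternation built from names."""
    return len("|".join(names).encode("utf-8"))

# Hypothetical sample; the real list would come from the patient database.
sample = ["John Smith", "Jane Doe", "Reginald Slovotniksky"]
per_name = alternation_bytes(sample) / len(sample)

# The estimate from the text: 10-20 bytes per name, 70,000 names.
low, high = 70_000 * 10, 70_000 * 20
print(f"{per_name:.1f} bytes per name in the sample")
print(f"estimate for 70,000 names: {low // 1000}-{high // 1000} kB")
```

Running the same function over a real name dump would show whether the per-name guess holds for that data.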


If the data could be regularized somehow, that might be different.  For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word
nearby, and then do a database lookup to see if the capitalized word was
a last name associated with the first name that you discovered.
Unfortunately, people are pretty random with first names.  I have a
database of some 600K voters in Travis County, Texas.  There are 38,808
distinct first names.  This technique might cut down the number of rules
by 93.5%, but then you have to do database lookups and some fancy
parsing to verify the hit.  Don't know if that would be worth it.


Agreed: if some matching scheme can be made to work it's going to let
some names through if only because the writer mis-spells names recorded
in the database. There's not a lot that can be done about that.

Martin


That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.

{^_-} 



Re: List of banned words/bounce to sender

2010-08-09 Thread Martin Gregorie
On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:
 From: Martin Gregorie mar...@gregorie.org
  Something like this will match a sequence of two capitalised name words,
  including hyphenated ones, and extract the name words:
 
  /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/
 
  and should be fairly easy to extend to deal with initials and/or more
  than one forename. Tested in Python and should also work in Perl.
 

 That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.
 
The regex I showed will return 'John' and 'Smith' so the combo can be
queried in the database, which is all I set out to try. However, I was
trying to generalise as a regex that would match two or more Capitalised
Names and return them as an array of group values but I couldn't work
out how to do that without writing a rather tedious set of ever longer
alternates. If anybody knows how to do that without resorting to
alternatives I'd be fascinated to know how you do that.
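
One possible answer, sketched in Python: a repeated capture group only keeps its last repetition, so rather than capturing each word separately, match the whole run of capitalised words without capture groups and split the match afterwards. The names `WORD`, `SEQ_RE` and `name_sequences` are illustrative, not from the thread.

```python
import re

WORD = r"[A-Z][-a-zA-Z]{1,20}"
# One name word followed by one or more further name words,
# each preceded by a single whitespace character.
SEQ_RE = re.compile(rf"{WORD}(?:\s{WORD})+")

def name_sequences(text):
    """Return each run of 2+ capitalised words as a list of its words."""
    return [m.group(0).split() for m in SEQ_RE.finditer(text)]

print(name_sequences("Seen by Mary Jane Watson and John Smith yesterday"))
```

The same two-step idea (match the run, then split it) should carry over to Perl, though the sketch has only been thought through for Python's re module.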

 
Martin





Re: List of banned words/bounce to sender

2010-08-05 Thread Evan Platt

On 08/05/2010 10:47 AM, Matthew Kitchin (public/usenet) wrote:
 Hello all. I have been a loyal user for years, but have never had to 
do much more than make a few custom rules. I work for a healthcare 
company, and I have been asked to implement a mechanism to search for 
patient names in outgoing emails and bounce them back to the sender if 
one is identified.

We would search for them in the format John Smith and Smith, John.
We would like to bounce them back to the sender (that would be within 
our company) with a custom notice indicating what they should do to 
properly send the email.

My typical setup is Postfix -> amavisd -> SA.
In this case, the setup doesn't exist yet, because I'm just exploring 
the feasibility of doing it.  I would run the latest versions of 
CentOS 64-bit, Postfix, Amavisd, and SA.
It would be great if it could search attachments too, but I could 
probably get by with just looking at the body. Of course, the emails 
will be HTML and RTF too. They originate in an Outlook/Exchange 
environment.

Is this a realistic setup?


Spamassassin can't handle this - it has no capability to reject mail. 
However, you need to think - are you going to have a database of patients' 
names, or is your intention to block anything with a name? Are you 
really going to want to manage a database of every name out there? If so, 
what happens when someone e-mails "I watched a presentation from Bill 
Gates on ..."? Well, that's a name, right?


So let's take the alternative - you have a database of just custom names 
(of your patients). Whose job is it to maintain that? And what happens 
if, again, in the above situation, a patient has the same name as, say, a 
celebrity or, even worse, a doctor? Let's say there's a world-famous 
doctor James Bond. But James Bond (different person) is a patient. One 
of your staff members e-mails "We need to go see the conference Dr. James 
Bond is putting on." Bounced.



While it's a great idea in theory (IMHO), it's going to be a headache.

One company I worked at a while ago implemented a web filter. The IT guy 
implemented it, then went to lunch. Unless a site was allowed, it was 
blocked. We very quickly realized that while he had added, say, 
www.yahoo.com, http://mail.yahoo.com was blocked. So he added 
*.yahoo.com. But then we found out that there were a dozen other 
domains needed too - one by one. Say yahoomail.com, yahoohosting.com, 
etc. His first few days were spent whitelisting site after site after site.


Eventually, they gave up on the idea.


Re: List of banned words/bounce to sender

2010-08-05 Thread Matthew Kitchin (public/usenet)

 On 8/5/2010 1:03 PM, Evan Platt wrote:


Spamassassin can't handle this - it has no capability to reject mail. 
However, you need to think - are you going to have a database of 
patients' names, or is your intention to block anything with a name? 
Are you really going to want to manage a database of every name out 
there? If so, what happens when someone e-mails "I watched a 
presentation from Bill Gates on ..."? Well, that's a name, right?


So let's take the alternative - you have a database of just custom 
names (of your patients). Whose job is it to maintain that? And what 
happens if, again, in the above situation, a patient has the same name 
as, say, a celebrity or, even worse, a doctor? Let's say there's a 
world-famous doctor James Bond. But James Bond (different person) is a 
patient. One of your staff members e-mails "We need to go see the 
conference Dr. James Bond is putting on." Bounced.


Amavisd could reject the mail. I was planning on using Spamassassin 
(with a custom built rule) to examine the email for the names. We would 
only use the names of our patients. The names would be dumped out of our 
patient DB every night. If a patient has the same name as a friend, 
there would be a code we would put in the subject to bypass the filter. 
I was thinking of a custom rule for that code that would have a score of 
-20 or something like that. Basically, Spamassassin's role would be 
deciding whether or not one of the names was in the email and if the 
override code was in the subject. I'm not saying it is the most 
brilliant idea in the world, but it is what I have been told to implement.


I know Amavisd well, so I can handle that part. I guess my main question 
should be, could I have Spamassassin read a custom rule to look for 
several thousand patient names in the format "John Smith" and "Smith, John"?


Re: List of banned words/bounce to sender

2010-08-05 Thread Benny Pedersen

On tor 05 aug 2010 19:47:37 CEST, Matthew Kitchin (public/usenet) wrote


Is this a realistic setup?


Postfix will love it if done right with local SMTP auth senders, i.e. no 
sender sends unauthed; then it's just a matter of adding sender_bcc_maps 
from a list of all local recipients.


Just don't do it if sender auth is not in place first!

More questions? It's not a spamassassin answer :)

--
xpoint http://www.unicom.com/pw/reply-to-harmful.html



Re: List of banned words/bounce to sender

2010-08-05 Thread Matthew Kitchin (public/usenet)

 On 8/5/2010 1:19 PM, Benny Pedersen wrote:

On tor 05 aug 2010 19:47:37 CEST, Matthew Kitchin (public/usenet) wrote


Is this a realistic setup?


Postfix will love it if done right with local SMTP auth senders, i.e. no 
sender sends unauthed; then it's just a matter of adding sender_bcc_maps 
from a list of all local recipients.


Just don't do it if sender auth is not in place first!

More questions? It's not a spamassassin answer :)

I'm not sure what you mean. I'm not looking for anything along the lines 
of authorized senders. I want to search an email to see if it has 
one of several thousand patient names in it.
I guess my main question should be, could I have Spamassassin read a 
custom rule to look for several thousand patient names in the format 
"John Smith" and "Smith, John"?


RE: List of banned words/bounce to sender

2010-08-05 Thread Kelly, James
It is more associated with Mailscanner rather than Amavis, but the
Scamnailer project does something very similar with a long list of
known bad phishing mail addresses. It scans each message for the
presence of any of thousands of addresses from a frequently-updated
list, and when it finds one, it adds an additional header. You then
configure a Mailscanner Spamassassin Rule Action setting to perform
some custom action if it finds that Scamnailer header in the message.
Perhaps that action could include the kind of bounce/sender notify you'd
want to do.

Scamnailer also works great for stopping targeted phishing attacks, too,
which as a university we get a lot of.

Scamnailer is at http://www.scamnailer.info/

Thanks,
James
__

James Kelly
Network Administrator
IST Network Operations
Chapman University
Phone: 714-744-7833
Email: jake...@chapman.edu


-----Original Message-----
From: Matthew Kitchin (public/usenet) [mailto:mkitchin.pub...@gmail.com]
Sent: Thursday, August 05, 2010 11:11 AM
To: Spamassassin
Subject: Re: List of banned words/bounce to sender

  On 8/5/2010 1:03 PM, Evan Platt wrote:

 Spamassassin can't handle this - it has no capability to reject mail. 
 However, you need to think - are you going to have a database of 
 patients' names, or is your intention to block anything with a name? 
 Are you really going to want to manage a database of every name out 
 there? If so, what happens when someone e-mails "I watched a 
 presentation from Bill Gates on ..."? Well, that's a name, right?

 So let's take the alternative - you have a database of just custom 
 names (of your patients). Whose job is it to maintain that? And what 
 happens if, again, in the above situation, a patient has the same name 
 as, say, a celebrity or, even worse, a doctor? Let's say there's a 
 world-famous doctor James Bond. But James Bond (different person) is a 
 patient. One of your staff members e-mails "We need to go see the 
 conference Dr. James Bond is putting on." Bounced.

Amavisd could reject the mail. I was planning on using Spamassassin 
(with a custom built rule) to examine the email for the names. We would 
only use the names of our patients. The names would be dumped out of our 
patient DB every night. If a patient has the same name as a friend, 
there would be a code we would put in the subject to bypass the filter. 
I was thinking of a custom rule for that code that would have a score of 
-20 or something like that. Basically, Spamassassin's role would be 
deciding whether or not one of the names was in the email and if the 
override code was in the subject. I'm not saying it is the most 
brilliant idea in the world, but it is what I have been told to
implement.

I know Amavisd well, so I can handle that part. I guess my main question 
should be, could I have Spamassassin read a custom rule to look for 
several thousand patient names in the format "John Smith" and "Smith,
John"?


Re: List of banned words/bounce to sender

2010-08-05 Thread Bowie Bailey
 On 8/5/2010 2:11 PM, Matthew Kitchin (public/usenet) wrote:
  
 Amavisd could reject the mail. I was planning on using Spamassassin
 (with a custom built rule) to examine the email for the names. We
 would only use the names of our patients. The names would be dumped
 out of our patient DB every night. If a patient has the same name as a
 friend, there would be a code we would put in the subject to bypass
 the filter. I was thinking of a custom rule for that code that would
 have a score of -20 or something like that. Basically, Spamassassin's
 role would be deciding whether or not one of the names was in the
 email and if the override code was in the subject. I'm not saying it
 is the most brilliant idea in the world, but it is what I have been
 told to implement.

My approach to doing something like this would be to have a rule that
matches the names (however you implement it), and then have the MTA
check for that particular rule hit and bounce the message if it exists. 
This is the same way you generally use the VBounce plugin.  Then do the
same thing for your bypass rule.

 I know Amavisd well, so I can handle that part. I guess my main
 question should be, could I have Spamassassin read a custom rule to
 look for several thousand patient names in the format "John Smith" and
 "Smith, John"?

Spamassassin can use whatever custom rule you care to come up with.  It
will happily use a regex with hundreds of names listed.  The question is
whether the rule would cause a noticeable slowdown in processing speed. 
The only way to find out is to try it.  Using compiled rules would
probably help here.

body BAD_NAMES /John Smith|Smith, John|Jane Doe|Doe, Jane|../

Not the most efficient rule, but it would work.  You would probably have
to split it into multiple rules and combine them with a meta rule.

body __BAD_NAMES1 .
body __BAD_NAMES2 .
body __BAD_NAMES3 .
meta BAD_NAMES __BAD_NAMES1 || __BAD_NAMES2 || __BAD_NAMES3

Regexp::Optimizer would probably also help when creating the rules.
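
Since the name list would be dumped from the database every night, the rules would have to be generated rather than hand-written. A rough Python sketch of such a generator follows. It assumes every name is a plain "First Last" string; the helper names, the chunk size and the added \b word boundaries are choices made for this illustration and are not part of the rule shown above.

```python
import re

def perl_quote(text):
    """Backslash-escape everything except word characters, whitespace and commas."""
    return re.sub(r"([^\w\s,])", r"\\\1", text)

def sa_rules(names, chunk_size=1000, prefix="__BAD_NAMES"):
    """Return rule lines: one body sub-rule per chunk plus a combining meta rule."""
    alts = []
    for full in names:                      # assumed format: "First Last"
        first, last = full.split(" ", 1)
        alts += [perl_quote(full), perl_quote(f"{last}, {first}")]
    lines, subrules = [], []
    for i in range(0, len(alts), chunk_size):
        name = f"{prefix}{len(subrules) + 1}"
        pattern = "|".join(alts[i:i + chunk_size])
        subrules.append(name)
        lines.append("body " + name + " /\\b(?:" + pattern + ")\\b/")
    lines.append("meta BAD_NAMES " + " || ".join(subrules))
    return lines

# Tiny chunk size just to show the splitting.
for line in sa_rules(["John Smith", "Jane Doe"], chunk_size=2):
    print(line)
```

How large a chunk SpamAssassin handles comfortably is something only testing would show, as said above.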

-- 
Bowie


Re: List of banned words/bounce to sender

2010-08-05 Thread Matthew Kitchin (public/usenet)

 On 8/5/2010 1:52 PM, Bowie Bailey wrote:

My approach to doing something like this would be to have a rule that
matches the names (however you implement it), and then have the MTA
check for that particular rule hit and bounce the message if it exists.
This is the same way you generally use the VBounce plugin.  Then do the
same thing for your bypass rule.

That is pretty much what I wanted to do. The best way I know to make 
Postfix use SA is with Amavisd.


Spamassassin can use whatever custom rule you care to come up with.  It
will happily use a regex with hundreds of names listed.  The question is
whether the rule would cause a noticeable slowdown in processing speed.
The only way to find out is to try it.  Using compiled rules would
probably help here.

Thanks. We are looking at roughly 70,000 names and always growing. If I 
gave it sufficient hardware, would you expect that to be practical, or 
is that totally ridiculous? Any options for a database look up here?




Re: List of banned words/bounce to sender

2010-08-05 Thread Bowie Bailey
 On 8/5/2010 3:00 PM, Matthew Kitchin (public/usenet) wrote:
  On 8/5/2010 1:52 PM, Bowie Bailey wrote:
 My approach to doing something like this would be to have a rule that
 matches the names (however you implement it), and then have the MTA
 check for that particular rule hit and bounce the message if it exists.
 This is the same way you generally use the VBounce plugin.  Then do the
 same thing for your bypass rule.

 That is pretty much what I wanted to do. The best way I know to make
 Postfix use SA is with Amavisd.

The point being that the score is irrelevant.  If the rule hits, the
message gets bounced.


 Spamassassin can use whatever custom rule you care to come up with.  It
 will happily use a regex with hundreds of names listed.  The question is
 whether the rule would cause a noticeable slowdown in processing speed.
 The only way to find out is to try it.  Using compiled rules would
 probably help here.

 Thanks. We are looking at roughly 70,000 names and always growing. If
 I gave it sufficient hardware, would you expect that to be practical,
 or is that totally ridiculous? Any options for a database look up here?

I would tend to say that something that large would not be practical. 
On the other hand, there's no way to really know until you try it.

A database lookup is possible, but the problem is determining what to
look up.  You would have to somehow identify possible names for
comparison to the database.
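
A minimal sketch of that idea in Python, using sqlite3 only so that the example is self-contained: pick out pairs of capitalised words as candidates, then look each pair up in both orders. The table name `patients`, its columns, and the optional comma added to the pattern (so that "Smith, John" is also picked up) are all assumptions made for this illustration.

```python
import re
import sqlite3

# Two capitalised words, optionally separated by a comma (assumed tweak).
NAME_RE = re.compile(r"([A-Z][-a-zA-Z]{1,20}),?\s([A-Z][-a-zA-Z]{1,20})")

def demo_db():
    """Build a throwaway in-memory database with one hypothetical patient."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (first TEXT, last TEXT)")
    conn.execute("INSERT INTO patients VALUES ('John', 'Smith')")
    return conn

def patient_hits(text, conn):
    """Return the (first, last) rows matched by candidate names found in text."""
    hits = []
    for a, b in NAME_RE.findall(text):
        # Try both "First Last" and "Last, First" orderings of the candidate.
        row = conn.execute(
            "SELECT first, last FROM patients "
            "WHERE (first = ? AND last = ?) OR (first = ? AND last = ?) LIMIT 1",
            (a, b, b, a),
        ).fetchone()
        if row:
            hits.append(row)
    return hits

print(patient_hits("Dr. James Bond saw Smith, John on Monday.", demo_db()))
```

Each candidate costs one indexed lookup, so the per-message cost depends on how many capitalised word pairs the message contains rather than on the size of the patient list.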

-- 
Bowie


Re: List of banned words/bounce to sender

2010-08-05 Thread Noel Jones
On Thu, Aug 5, 2010 at 2:00 PM, Matthew Kitchin (public/usenet)
mkitchin.pub...@gmail.com wrote:
  On 8/5/2010 1:52 PM, Bowie Bailey wrote:

 My approach to doing something like this would be to have a rule that
 matches the names (however you implement it), and then have the MTA
 check for that particular rule hit and bounce the message if it exists.
 This is the same way you generally use the VBounce plugin.  Then do the
 same thing for your bypass rule.

 That is pretty much what I wanted to do. The best way I know to make Postfix
 use SA is with Amavisd.

 Spamassassin can use whatever custom rule you care to come up with.  It
 will happily use a regex with hundreds of names listed.  The question is
 whether the rule would cause a noticeable slowdown in processing speed.
 The only way to find out is to try it.  Using compiled rules would
 probably help here.

 Thanks. We are looking at roughly 70,000 names and always growing. If I gave
 it sufficient hardware, would you expect that to be practical, or is that
 totally ridiculous? Any options for a database look up here?





Use your database to generate rules for clamav.  You could even remove
the stock clamav rules if you want.  Matching the body for 70,000
names would probably take less than 0.1 seconds.
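
A sketch of what generating such rules might look like, in Python. It assumes ClamAV's extended signature (.ndb) layout of Name:TargetType:Offset:HexSignature, with target type 0 for any file and offset * for any position; that layout is recalled from the ClamAV signature documentation and should be checked against the installed version. The signature names and the plain "First Last" input format are made up for the example.

```python
def ndb_lines(names, target_type=0):
    """Yield one signature line per name variant ("First Last" and "Last, First")."""
    for n, full in enumerate(names, start=1):
        first, last = full.split(" ", 1)
        for suffix, variant in (("a", full), ("b", f"{last}, {first}")):
            # ASCII only in this sketch; other encodings would need more thought.
            hexsig = variant.encode("ascii").hex()
            yield f"Patient.Name.{n}{suffix}:{target_type}:*:{hexsig}"

for line in ndb_lines(["John Smith"]):
    print(line)
```

The output would be written to a file in the ClamAV database directory. These signatures match raw bytes, so they are case-sensitive and would not see names inside encoded parts unless ClamAV decodes those first.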


Re: List of banned words/bounce to sender

2010-08-05 Thread Matthew Kitchin (public/usenet)

 On 8/5/2010 2:05 PM, Bowie Bailey wrote:

I would tend to say that something that large would not be practical.
On the other hand, there's no way to really know until you try it.

A database lookup is possible, but the problem is determining what to
look up.  You would have to somehow identify possible names for
comparison to the database.

Thanks. I think I had a brain fart here. Obviously we would have to have 
identified the names before we could look them up... I think I divided 
by 0 in my head at some point :)




Re: List of banned words/bounce to sender

2010-08-05 Thread Matthew Kitchin (public/usenet)

 On 8/5/2010 2:10 PM, Noel Jones wrote:


Use your database to generate rules for clamav.  You could even remove
the stock clamav rules if you want.  Matching the body for 70,000
names would probably take less than 0.1 seconds.

That sounds like a really good idea. I do use ClamAV but have never 
written any rules of my own. Thanks for the tip!


Re: List of banned words/bounce to sender

2010-08-05 Thread Dominic Benson

On 5 Aug 2010, at 20:13, Matthew Kitchin (public/usenet) wrote:

 On 8/5/2010 2:10 PM, Noel Jones wrote:
 
 Use your database to generate rules for clamav.  You could even remove
 the stock clamav rules if you want.  Matching the body for 70,000
 names would probably take less than 0.1 seconds.
 That sounds like a really good idea. I do use ClamAV but have never written 
 any rules of my own. Thanks for the tip!

I'd set it up to check for surnames from the list in groups first, then if it 
matches one of those look for the various permutations of the full names that 
correspond to each set. I'm thinking of these in terms of calling out from 
Exim's acl_check_data section, using various database dirs depending on the 
rule set (like the Bayes filter), but there are other ways of achieving the 
same thing. That ought to reduce the amount of work per message for those that 
will be let through. You'd have to experiment to find the best group size; it 
would depend on how many distinct surnames there are in your set, as well as 
the callout cost relative to the time for each expression. That would also give 
you a good shot at identifying "J. Smith" as well, for example.
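
The two-stage idea can be sketched in a few lines of Python. This is not the Exim/ClamAV setup described above: it swaps the grouped signature databases for an in-memory dictionary keyed by surname, purely to show the control flow. All function names are illustrative.

```python
import re
from collections import defaultdict

def build_index(names):
    """Map each surname to the set of forenames recorded with it."""
    index = defaultdict(set)
    for full in names:                     # assumed format: "First Last"
        first, last = full.rsplit(" ", 1)
        index[last].add(first)
    return index

def find_patients(text, index):
    """Stage 1: cheap surname scan. Stage 2: check permutations for surnames seen."""
    words = set(re.findall(r"[A-Z][-a-zA-Z]{1,20}", text))
    hits = set()
    for last in index.keys() & words:
        for first in index[last]:
            variants = (f"{first} {last}", f"{last}, {first}", f"{first[0]}. {last}")
            if any(v in text for v in variants):
                hits.add((first, last))
    return hits

index = build_index(["John Smith", "Jane Doe"])
print(find_patients("Spoke to J. Smith today", index))
```

Messages that contain no known surname never reach the second stage, which is where the saving for clean mail comes from. A surname that appears without any matching forename, as in "Doe season opens", is not reported.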