Re: List of banned words/bounce to sender
On Tue, 2010-08-10 at 19:24 -0700, jdow wrote:
> From: Martin Gregorie mar...@gregorie.org
> Sent: Monday, 2010/August/09 18:08
>> On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:
>>> That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.
>>
>> The regex I showed will return 'John' and 'Smith' so the combo can be queried in the database, which is all I set out to try.
>
> Ah, but Martin, do you really know if it is the John Smith in the database or another John Smith?

No, but then nobody knows that. If you're scanning for patient names in body text and a common name happens to match a patient, it's an ambiguous situation that can only be resolved if you can write a rule that reliably disambiguates it by recognising the name's context.

Martin
Re: List of banned words/bounce to sender
On Tue, Aug 10, 2010 at 02:08:28AM +0100, Martin Gregorie wrote:
> However, I was trying to generalise as a regex that would match two or more Capitalised Names and return them as an array of group values but I couldn't work out how to do that without writing a rather tedious set of ever longer alternates.

OK, I did some more testing since this is an interesting experiment. I dumped 15000 mail bodies into a file as SA sees them and fed it to a simple Perl script.

Runtime for different methods (memory used including Perl itself):

- Single 70000 name regex, 20s (8MB)
- 7 regexes of 10000 names each, 141s (9MB)
- Martin style, lookups from Perl hash, 8s (12MB)

So it seems a single regex is much preferable to a few smaller ones, though creating it with Regexp::Assemble required 250MB of memory.

Looking at this, I would go for the generic regex and test all matches against names stored in a Perl hash. The average count of names to check per message was around 100, so using SQL directly would be inefficient, though possible.

Anyway, I concur that with so many names you would probably get lots of FPs: identical doctors, friends, "john doe wrote:" etc.
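The "generic regex plus hash lookup" approach can be sketched roughly as follows. This is a Python illustration rather than the Perl script that was benchmarked; the `NAMES` set, the pattern lengths and the function name are assumptions made for the example, and a real name list would be loaded from a dump file.

```python
import re

# Hypothetical patient list, stored lower-cased as (first, last) pairs.
NAMES = {("john", "smith"), ("mary-kate", "jones")}

# Generic "two name-like words" pattern; the optional comma allows the
# "Smith, John" form. The length limits are arbitrary assumptions.
CANDIDATE = re.compile(r"\b([A-Za-z][-A-Za-z]{1,20}),?\s([A-Za-z][-A-Za-z]{1,20})\b")

def find_known_names(text):
    """Return the candidate pairs in text that are in NAMES, in either order."""
    hits = []
    for m in CANDIDATE.finditer(text):
        a, b = m.group(1).lower(), m.group(2).lower()
        if (a, b) in NAMES or (b, a) in NAMES:
            hits.append(m.group(0))
    return hits

print(find_known_names("Please ask John Smith to call."))
```

Each candidate costs one or two set lookups, which is presumably why this variant was the fastest in the timings above. Note that `finditer` returns non-overlapping matches, so a pair that overlaps an earlier candidate is skipped in this simple form.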
Re: List of banned words/bounce to sender
On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
> Runtime for different methods (memory used including Perl itself):
> - Single 70000 name regex, 20s (8MB)
> - 7 regexes of 10000 names each, 141s (9MB)
> - Martin style, lookups from Perl hash, 8s (12MB)

Very interesting indeed. Thanks for trying it.

I'm not surprised that the set of 7 regexes took longer than the one big one, but I am surprised that the time ratio is so close to a factor of 7.

Out of interest, did you leave the headers in your test messages? I did initially when I developed the generic name matcher, but then removed them because most of the hits were in headers, while the real-life scan-and-compare rule would only be applied to the body.

Of course, there should be almost no difference if there is no match in a message, but on average we can guess that the single regex will do 35,000 attempted matches for every candidate name pair that generates a hit, while the set of seven will do 65,000 attempted matches (6 x 10000 for the six regexes that don't contain the match and 5000 on average for the one that does).

One thing this experiment makes clear is that a rule containing a lot of alternates, such as one scanning the body for misspelt words, will perform better if it contains one long regex rather than a set of shorter regexes plus an OR meta to combine them. The latter is easier to maintain but slower running. In the past I used the second form, but now I always use a single long regex that is built from a rule definition file with my 'portmanteau' script. Its rule definition file is easy to maintain because it holds each alternate pattern on a separate line.

Martin
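The idea of keeping one alternate per line and generating a single long regex from that file might look something like this in Python. It is only a guess at the general shape: the 'portmanteau' script itself was not posted, so the function name, the `#` comment convention and the `\b` anchoring are all assumptions.

```python
import re

def build_alternation(lines):
    """Join one-pattern-per-line definitions into a single alternation.

    Blank lines and lines starting with '#' are skipped (an assumed
    convention for the definition file, not something from the thread).
    """
    alts = [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
    return r"\b(?:" + "|".join(alts) + r")\b"

# Hypothetical definition file contents, one alternate pattern per line.
definition = ["# misspellings", "v[i1]agra", "", "c[i1]alis\n"]
pattern = build_alternation(definition)
print(pattern)
```

The generated string would then be pasted into a single body rule, so the definition file stays easy to edit while the scanner only ever sees the one long regex.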
Re: List of banned words/bounce to sender
On Tue, Aug 10, 2010 at 10:47:15AM +0100, Martin Gregorie wrote:
> I'm not surprised that the set of 7 regexes took longer than the one big one, but I am surprised that the time difference is so close to the factor of 7.

I guess the seven regexes contain lots of similar strings, so it's a lot of duplicate work compared to a single trie. Credit to the Perl 5.10 enhancements:

http://www.regex-engineer.org/slides/img38.html
http://taint.org/2006/07/07/184022a.html

I don't know if Python implements such..

> Out of interest, did you leave the headers in your test messages? I did initially when I developed the generic name matches, but then removed them because most of the hits were in headers while the real-life scan-and-compare rule would only be applied to the body.

Just the body, i.e. print get_rendered_body_text_array().

For the record, matching wasn't as simple as one might think. A normal while (/foo bar/g) won't work, since:

    word1 word2 word3 word4 ..

would result in only two matches, "word1 word2" and "word3 word4", but we need to check "word2 word3" also. A big help was page 20+ of:

http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf

Basically you need to do something like:

    $pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
    # Embedded code block: remember the pair if it is in the %names hash
    $check = qr/(?{ $found = $1 if defined $names{lc "$2,$3"} || defined $names{lc "$3,$2"} })/;
    while (<>) {
        $found = undef;
        # (?!) always fails, forcing the engine to try every position
        /$pat$check(?!)/;
        print "$found\n" if defined $found;
    }

Hope this helps someone ;)

> One thing this experiment makes clear is that a rule containing a lot of alternates, such as one scanning the body for misspelt words, will perform better if it contains one long regex rather than a set of shorter regexes plus an OR meta to combine them - the latter is easier to maintain but slower running.

Yep, though I guess most rules are so simple that they don't incur much of a penalty. Using sa-compile the difference should be negligible, and with separate rules it's easy to see the exact rule hitting (of course you can find the string with debugging also).
Re: List of banned words/bounce to sender
On Tue, Aug 10, 2010 at 01:35:55PM +0300, Henrik K wrote:
> Big help was page 20+:
> http://web.archive.org/web/20050515221554/http://birmingham.pm.org/talks/YAPC-Europe-2003-Gems.pdf
> Basically you need to do something like: [...]

Blah, never mind.. the simpler example there works also; I messed something up when trying it:

    $pat = qr/\b(([a-z][-a-z]{2,15}[a-z]),? ([a-z][-a-z]{2,15}[a-z]))\b/i;
    while (<>) {
        # The lookahead consumes nothing, so overlapping pairs are all found
        while (/(?=$pat)/g) {
            print "$1\n";
        }
    }

It is even slightly better, since it doesn't backtrack excessively.
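For readers following along in Python, the same lookahead trick appears to carry over to the `re` module. The snippet below is a small sketch with an illustrative pattern and sample text (both assumptions) that contrasts ordinary matching with the zero-width lookahead form.

```python
import re

# Two adjacent words, optionally comma-separated; lengths are illustrative.
PAIR = r"\b([a-z][-a-z]{1,20}),? ([a-z][-a-z]{1,20})\b"

text = "alpha beta gamma delta"

# Ordinary matching consumes each pair, so "beta gamma" is never tried.
plain = [m.group(0) for m in re.finditer(PAIR, text, re.I)]

# Wrapping the pattern in a lookahead makes every match zero-width, so the
# scan advances one character at a time and overlapping pairs are reported.
overlapping = [m.group(1) + " " + m.group(2)
               for m in re.finditer("(?=" + PAIR + ")", text, re.I)]

print(plain)
print(overlapping)
```

The captured groups remain available even though the overall match is empty, which is what makes the lookahead form usable for extracting the names.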
Re: List of banned words/bounce to sender
On Tue, 10 Aug 2010, Henrik K wrote:
> Yeah looking at this I would go for the generic regex and test all matches with names stored in Perl hash. Average count of names to check per message was around 100, so using SQL directly would be inefficient though possible.

This smells like a custom plugin, building a hash from database queries of names added since a plugin-local last-updated-datetime. There is a big initialization hit unless you build persistence into the plugin, but minimal database traffic, primarily consisting of an IF EXISTS() query, and a few rows queried every time a new patient is added to the system.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Windows Vista: Windows ME for the XP generation.
---
5 days until the 65th anniversary of the end of World War II
Re: List of banned words/bounce to sender
On Tue, Aug 10, 2010 at 07:37:32AM -0700, John Hardin wrote:
> This smells like a custom plugin, building a hash from database queries of names added since plugin-local last-updated-datetime. Big initialization hit unless you build persistence into the plugin, but minimal database traffic primarily consisting of an IF EXISTS() query, and a few rows queried every time a new patient is added to the system.

That just sounds like too much work for little gain. Perhaps it would be better in some other scenario.. it was already stated that the patient names are dumped somewhere, and usually SA is only reloaded once a day or so. Reading in a file is the simplest and most efficient way to go, not to mention the insecurities that might arise from querying a patient database.
Re: List of banned words/bounce to sender
On Tue, 10 Aug 2010, Henrik K wrote:
> That just sounds too much work for little gain. Perhaps better in some other scenario.. it was already stated that the patient names are dumped somewhere, and usually SA is only reloaded once a day or such. Reading in a file is the simplest and most efficient way to go. Not to mention insecurities that might arise from querying a patient database.

Ah; I missed (or forgot) the non-real-time nature of the scenario. Never mind, then; batch-generated rules or a plugin with a batch-generated static hashtable would suffice.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
An operating system design that requires a system reboot in order to install a document viewing utility does not earn my respect.
---
5 days until the 65th anniversary of the end of World War II
Re: List of banned words/bounce to sender
From: Martin Gregorie mar...@gregorie.org
Sent: Monday, 2010/August/09 18:08

> The regex I showed will return 'John' and 'Smith' so the combo can be queried in the database, which is all I set out to try.

Ah, but Martin, do you really know if it is the John Smith in the database or another John Smith?

{^_-}
Re: List of banned words/bounce to sender
On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet) wrote:
> Thanks. We are looking at roughly 70,000 names and always growing. If I gave it sufficient hardware, would you expect that to be practical, or is that totally ridiculous? Any options for a database look up here?

I'd use a plugin that simply queries the database, plus a rule that activates the plugin by calling its eval() method and sets the score if the rule fires.

I'm currently doing the reverse:

- I use a view on a mail archive database to check whether I've previously sent mail to the sender of an incoming message, and wrote an SA plugin to query the view.
- A rule causes the plugin to query the view and whitelists the message by applying a large negative score if the query got a hit.
- The plugin + view also detects incoming messages containing my addresses as forged senders, since that's a common spammer trick.

This works very well. E-mail me off-list if you'd like a copy of the plugin, since I haven't yet published it.

However, if you're not intending to apply any other rules to the outgoing messages, then using SA sounds a bit like overkill when you could simply write a small program in Perl, Python, Java, etc. that runs the query and causes your MTA to bounce any messages that contain matches, with a suitable error code or diagnostic.

Martin
Re: List of banned words/bounce to sender
On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
> I'd use a plugin that simply queries the database plus a rule to activate the plugin by calling its eval() method and sets the score if the rule fires.

Queries the database for what? I guess you didn't read the thread fully. :-)

> I'm currently doing the reverse: - I use a view on a mail archive database to check whether I've previously sent mail to the sender of an incoming message and wrote an SA plugin to query the view.

In case someone is interested, I wrote a similar policy daemon for Postfix:

http://mailfud.org/postpals/
Re: List of banned words/bounce to sender
On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
> Queries database for what? I guess you didn't read the thread fully. :-)

Queries the patient data DB for patient names, obviously.

I made the offer because I found it useful to be able to modify an existing plugin that queried a database. Exactly what the SQL query does is largely irrelevant. I found that the difficult bit was working out how to configure the plugin to access my database. Constructing the query and interpreting its result were relatively easy.

Martin
Re: List of banned words/bounce to sender
On 8/9/10 6:58 AM, Martin Gregorie mar...@gregorie.org wrote:
> Queries the patient data DB for patient names - obviously. I made the offer because I found it useful to be able to modify an existing plugin that queried a database.

So, you are recommending that he use a plugin to query 70,000 records from a database, and perform 140,000 body matches, for every e-mail message he receives?

Doesn't seem very efficient. It would make sense if it were structured data he was looking at, to then perform one-off queries to see if that data matched the database. But the original post was discussing a data-loss-prevention scheme to avoid unstructured data leaks.

If the data could be regularized somehow, that might be different. For example, if there were a limited number of first names, you could write signatures that looked for first names with another capitalized word nearby, and then do a database lookup to see if the capitalized word was a last name associated with the first name that you discovered.

Unfortunately, people are pretty random with first names. I have a database of some 600K voters in Travis County, Texas. There are 38,808 distinct first names. This technique might cut down the number of rules by 93.5%, but then you have to do database lookups and some fancy parsing to verify the hit. Don't know if that would be worth it.

--
Daniel J McDonald, CCIE # 2495, CISSP # 78281
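The 93.5% figure follows from the two numbers quoted above, taking "some 600K" as exactly 600,000 for the purposes of the arithmetic (an assumption, since the exact voter count is not given):

```python
voters = 600_000               # "some 600K" voters, rounded
distinct_first_names = 38_808  # distinct first names in that database

# One rule per distinct first name instead of one rule per person.
reduction = 1 - distinct_first_names / voters
print(f"{reduction:.1%}")      # prints 93.5%
```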
Re: List of banned words/bounce to sender
On Mon, Aug 09, 2010 at 07:28:42AM -0500, Daniel McDonald wrote:
> This technique might cut down the number of rules by 93.5%, but then you have to do database lookups and some fancy parsing to verify the hit. Don't know if that would be worth it.

Nope, people constantly underestimate the power of regexes.. of course you can easily make bad ones, but Perl can run huge lists of simple alternations FAST.

I downloaded a 1 random name pack, and made a quick hack to regexify it with my favourite Regexp::Assemble.

--
    #!/usr/bin/perl
    use Regexp::Assemble;
    $ra = Regexp::Assemble->new;
    while (<STDIN>) {
        chomp;
        # Read comma separated names from stdin: Firstname,Lastname
        ($firstname, $lastname) = split(',', lc);
        # Firstname Lastname
        $ra->add("$firstname $lastname");
        # Lastname,? Firstname
        $ra->add("$lastname,? $firstname");
        # Print rule every 10000 names
        # (?:^| ) instead of \b since Kate would hit Mary-Kate
        if (++$cnt % 10000 == 0 || eof STDIN) {
            print 'body TEST_NAMES_'.++$idx;
            print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
        }
    }
--

    ./names.pl < names.csv > names.cf

The resulting single 17 byte rule did not affect SA in any way; there was virtually no difference in my mass check tests. Running the regex through some file manually results in 8 lines/second. This is with one 3GHz core. I think you can make rules/REs of MBs in size, but it probably gains nothing.

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of the exact signature that got hit (single name per sig)
- It would also match any header like To: From: etc (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance here is a factor, especially with outgoing mail..
Re: List of banned words/bounce to sender
On 8/9/2010 8:27 AM, Henrik K wrote:
> Nope, people constantly underestimate the power of regexes.. of course you can easily make bad ones, but Perl can run huge lists of simple alternations FAST.
>
> About ClamAV...
> + It would probably handle this even faster
> + Easy logging of exact signature that got hit (single name per sig)
> - It would also match any header like To: From: etc (PRETTY BAD...)
>
> I'd choose SA since it's way more flexible. I doubt performance here is a factor, especially with outgoing mail..

Thanks for the info.

> - It would also match any header like To: From: etc (PRETTY BAD...)

That could be an issue. I will check to see if I can find a workaround; if not, ClamAV may not be an option.
Re: List of banned words/bounce to sender
From: Daniel McDonald dan.mcdon...@austinenergy.com
Sent: Monday, 2010/August/09 05:28

> So, you are recommending that he use a plugin to query 70,000 records from a database, and perform 140,000 body matches, for every e-mail message he receives? Doesn't seem very efficient.
>
> Unfortunately, people are pretty random with first names. I have a database of some 600K voters in Travis County, Texas. There are 38,808 distinct first names. This technique might cut down the number of rules by 93.5%, but then you have to do database lookups and some fancy parsing to verify the hit. Don't know if that would be worth it.

Um, a query for firstname=John and lastname=Smith and a query for firstname=Smith and lastname=John is a start. (Match with the format for the database.)

One of the problems is picking out names and matching them with other names close enough to them to be John Smith. Then you have to guess the order; the two queries above handle that. Then you have to settle on whether this is one of our John Smiths or a third party unrelated to our database.

I see that last one as the real problem.

{^_^}
Re: List of banned words/bounce to sender
On Mon, 2010-08-09 at 07:28 -0500, Daniel McDonald wrote:
> So, you are recommending that he use a plugin to query 70,000 records from a database, and perform 140,000 body matches, for every e-mail message he receives?

It should be possible to write a rule that recognises names: initials + a capitalised word, or a sequence of 2+ capitalised words, either optionally prefixed with a title, may well work. Designing the regex should be relatively easy because it only has to match the type of name that can be generated from the database. No matter what you do, that would seem to be a fundamental limit on what is reasonably possible. Now you only have to run a SQL query against the regex matches, and this is even easier if you use grouping in the regex to extract strings that correspond to database fields and build the query from them.

Something like this will match a sequence of two capitalised name words, including hyphenated ones, and extract the name words:

    /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

and should be fairly easy to extend to deal with initials and/or more than one forename. Tested in Python and should also work in Perl.

> Doesn't seem very efficient. It would make sense if it were structured data he was looking at, to then perform one-off queries to see if that data matched the database. But the original post was discussing a data-loss-prevention scheme to avoid unstructured data leaks.

Maybe so, but nor is building and applying a regex with 70,000+ alternates in it. Of course it would be wise to prototype both approaches before deciding whether to do anything at all, but I have a gut feeling that recognising a candidate name and using the matching string to construct and run an SQL query will be less resource intensive than applying a very large regex. I guesstimate the latter at 10-20 bytes per name including the alternate separator, which is 700-1400 KB for 70,000 names.

> If the data could be regularized somehow, that might be different. For example, if there were a limited number of first names, you could write signatures that looked for first names with another capitalized word nearby, and then do a database lookup to see if the capitalized word was a last name associated with the first name that you discovered. Unfortunately, people are pretty random with first names. I have a database of some 600K voters in Travis County, Texas. There are 38,808 distinct first names. This technique might cut down the number of rules by 93.5%, but then you have to do database lookups and some fancy parsing to verify the hit. Don't know if that would be worth it.

Agreed: if some matching scheme can be made to work, it's going to let some names through, if only because the writer misspells names recorded in the database. There's not a lot that can be done about that.

Martin
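As a rough illustration of "recognise a candidate name, then query the database", the following Python sketch feeds the two captured groups of the regex above into a parameterised query. The in-memory SQLite database, the `patients` table and its `firstname`/`lastname` columns are all hypothetical stand-ins; a real patient database would have its own schema and driver.

```python
import re
import sqlite3

# The two-capitalised-words regex from the message above.
NAME_RE = re.compile(r"([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})")

def make_demo_db():
    """Build a throwaway in-memory database with one hypothetical patient."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (firstname TEXT, lastname TEXT)")
    conn.execute("INSERT INTO patients VALUES ('John', 'Smith')")
    return conn

def names_in_db(conn, text):
    """Return candidate (word1, word2) pairs that match a patient row."""
    hits = []
    for first, last in NAME_RE.findall(text):
        # Check both orders, since "Smith John" may also appear in text.
        row = conn.execute(
            "SELECT 1 FROM patients "
            "WHERE (firstname = ? AND lastname = ?) "
            "   OR (firstname = ? AND lastname = ?) LIMIT 1",
            (first, last, last, first),
        ).fetchone()
        if row:
            hits.append((first, last))
    return hits

conn = make_demo_db()
print(names_in_db(conn, "Results for John Smith attached"))
```

Using placeholders rather than string concatenation matters here because the candidate strings come straight from message bodies.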
Re: List of banned words/bounce to sender
From: Martin Gregorie mar...@gregorie.org
Sent: Monday, 2010/August/09 15:45

> Something like this will match a sequence of two capitalised name words, including hyphenated ones, and extract the name words: /([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/ and should be fairly easy to extend to deal with initials and/or more than one forename. Tested in Python and should also work in Perl.
>
> Agreed: if some matching scheme can be made to work its going to let some names through if only because the writer mis-spells names recorded in the database. There's not a lot can be done about that.

That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.

{^_-}
Re: List of banned words/bounce to sender
On Mon, 2010-08-09 at 17:42 -0700, jdow wrote:

From: Martin Gregorie mar...@gregorie.org

Something like this will match a sequence of two capitalised name words, including hyphenated ones, and extract the name words:

/([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

and should be fairly easy to extend to deal with initials and/or more than one forename. Tested in Python and should also work in Perl.

That solves the Reginald Slovotniksky type names. But, John Smith? Dunno.

The regex I showed will return 'John' and 'Smith' so the combo can be queried in the database, which is all I set out to try.

However, I was trying to generalise as a regex that would match two or more Capitalised Names and return them as an array of group values but I couldn't work out how to do that without writing a rather tedious set of ever longer alternates. If anybody knows how to do that without resorting to alternatives I'd be fascinated to know how you do that.

Martin
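One possible way round the "ever longer alternates" problem, sketched in Python: let a single non-capturing repeat find the whole run of capitalised words, then split the matched text into its words afterwards instead of asking the regex for one group per word. This is only a sketch; the names NAME_WORD, NAME_RUN and name_sequences are made up for illustration, and the word pattern is simply the character class from the regex quoted above.

```python
import re

# One capitalised name word, optionally hyphenated (same class as the quoted regex).
NAME_WORD = r"[A-Z][-a-zA-Z]{1,20}"

# A run of two or more name words, each separated by one whitespace character.
NAME_RUN = re.compile(rf"\b{NAME_WORD}(?:\s{NAME_WORD})+\b")


def name_sequences(text):
    """Return one list of name words for every run of 2+ capitalised words."""
    return [match.group(0).split() for match in NAME_RUN.finditer(text)]


if __name__ == "__main__":
    for words in name_sequences("I met John Smith and Mary Anne Jones-Brown today."):
        print(words)
```

Each inner list can then be turned into whatever first name / surname combinations the database query needs. Note that this also picks up any other run of capitalised words (company names, sentence openers followed by a name), so the lookup step still has to do the real filtering.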
Re: List of banned words/bounce to sender
On 08/05/2010 10:47 AM, Matthew Kitchin (public/usenet) wrote:

Hello all. I have been a loyal user for years, but have never had to do much more than make a few custom rules. I work for a healthcare company, and I have been asked to implement a mechanism to search for patient names in outgoing emails and bounce them back to the sender if one is identified. We would search for them in the format John Smith and Smith, John. We would like to bounce them back to the sender (that would be within our company) with a custom notice indicating what they should do to properly send the email. My typical setups are Postfix -> amavisd -> SA. In this case, the setup doesn't exist yet, because I'm just exploring the feasibility of doing it. I would run the latest versions of CentOS 64 bit, Postfix, Amavisd, and SA. It would be great if it could search attachments too, but I could probably get by with just looking at the body. Of course, the emails will be HTML and RTF too. They originate in an Outlook/Exchange environment. Is this a realistic setup?

Spamassassin can't handle this - it has no capability to reject mail. However, you need to think - are you going to have a database of patients' names, or is your intention to block anything with a name? Are you really going to want to manage a database of every name out there? If so, what happens when someone e-mails "I watched a presentation from Bill Gates on ..."? Well, that's a name, right?

So let's take the alternative - you have a database of just custom names (of your patients). Whose job is it to maintain that? And what happens if, again, in the above situation, a patient has the same name as, say, a celebrity or, even worse, a doctor? Let's say there's a world-famous doctor James Bond. But James Bond (different person) is a patient. One of your staff members e-mails "We need to go see the conference Dr. James Bond is putting on." Bounced.

While it's a great idea in theory (IMHO), it's going to be a headache.

One company I worked at a while ago implemented a web filter. The IT guy implemented it, then went to lunch. Unless a site was allowed, it was blocked. We very quickly realized that while he had added, say, www.yahoo.com, http://mail.yahoo.com was blocked. So he added *.yahoo.com. But then we found out that there were a dozen other DOMAINS needed too - one by one. Say yahoomail.com, yahoohosting.com, etc. His first few days were spent whitelisting site after site after site. Eventually, they gave up on the idea.
Re: List of banned words/bounce to sender
On 8/5/2010 1:03 PM, Evan Platt wrote:

Spamassassin can't handle this - it has no capability to reject mail. However, you need to think - are you going to have a database of patients' names, or is your intention to block anything with a name? Are you really going to want to manage a database of every name out there? If so, what happens when someone e-mails "I watched a presentation from Bill Gates on ..."? Well, that's a name, right? So let's take the alternative - you have a database of just custom names (of your patients). Whose job is it to maintain that? And what happens if, again, in the above situation, a patient has the same name as, say, a celebrity or, even worse, a doctor? Let's say there's a world-famous doctor James Bond. But James Bond (different person) is a patient. One of your staff members e-mails "We need to go see the conference Dr. James Bond is putting on." Bounced.

Amavisd could reject the mail. I was planning on using Spamassassin (with a custom built rule) to examine the email for the names. We would only use the names of our patients. The names would be dumped out of our patient DB every night. If a patient has the same name as a friend, there would be a code we would put in the subject to bypass the filter. I was thinking of a custom rule for that code that would have a score of -20 or something like that. Basically, Spamassassin's role would be deciding whether or not one of the names was in the email and if the override code was in the subject. I'm not saying it is the most brilliant idea in the world, but it is what I have been told to implement.

I know Amavisd well, so I can handle that part. I guess my main question should be: could I have Spamassassin read a custom rule to look for several thousand patient names in the format John Smith and Smith, John?
Re: List of banned words/bounce to sender
On tor 05 aug 2010 19:47:37 CEST, Matthew Kitchin (public/usenet) wrote Is this a realistic setup? postfix will love it if done right with local smtp auth senders, eg no sender sends unauthed then its just add smtpd_sender_bcc_naps from a list of all local recipients just dont make it if sender auth is not in place first ! more questions ?, its not a spamassassin answer :) -- xpoint http://www.unicom.com/pw/reply-to-harmful.html
Re: List of banned words/bounce to sender
On 8/5/2010 1:19 PM, Benny Pedersen wrote:

On tor 05 aug 2010 19:47:37 CEST, Matthew Kitchin (public/usenet) wrote Is this a realistic setup? postfix will love it if done right with local smtp auth senders, eg no sender sends unauthed then its just add smtpd_sender_bcc_naps from a list of all local recipients just dont make it if sender auth is not in place first ! more questions ?, its not a spamassassin answer :)

I'm not sure what you mean. I'm not looking for anything along the lines of authorized senders. I'm wanting to search an email to see if it has one of several thousand patient names in it. I guess my main question should be: could I have Spamassassin read a custom rule to look for several thousand patient names in the format John Smith and Smith, John?
RE: List of banned words/bounce to sender
It is more associated with MailScanner than with Amavis, but the ScamNailer project does something very similar with a long list of known bad phishing mail addresses. It scans each message for the presence of any of thousands of addresses from a frequently-updated list, and when it finds one, it adds an additional header. You then configure a MailScanner "Spamassassin Rule Action" setting to perform some custom action if it finds that ScamNailer header in the message. Perhaps that action could include the kind of bounce/sender notify you'd want to do.

ScamNailer also works great for stopping targeted phishing attacks, which as a university we get a lot of.

ScamNailer is at http://www.scamnailer.info/

Thanks,
James
__
James Kelly
Network Administrator
IST Network Operations
Chapman University
Phone: 714-744-7833
Email: jake...@chapman.edu
Re: List of banned words/bounce to sender
On 8/5/2010 2:11 PM, Matthew Kitchin (public/usenet) wrote:

Amavisd could reject the mail. I was planning on using Spamassassin (with a custom built rule) to examine the email for the names. We would only use the names of our patients. The names would be dumped out of our patient DB every night. If a patient has the same name as a friend, there would be a code we would put in the subject to bypass the filter. I was thinking of a custom rule for that code that would have a score of -20 or something like that. Basically, Spamassassin's role would be deciding whether or not one of the names was in the email and if the override code was in the subject. I'm not saying it is the most brilliant idea in the world, but it is what I have been told to implement.

My approach to doing something like this would be to have a rule that matches the names (however you implement it), and then have the MTA check for that particular rule hit and bounce the message if it exists. This is the same way you generally use the VBounce plugin. Then do the same thing for your bypass rule.

I know Amavisd well, so I can handle that part. I guess my main question should be: could I have Spamassassin read a custom rule to look for several thousand patient names in the format John Smith and Smith, John?

Spamassassin can use whatever custom rule you care to come up with. It will happily use a regex with hundreds of names listed. The question is whether the rule would cause a noticeable slowdown in processing speed. The only way to find out is to try it. Using compiled rules would probably help here.

body BAD_NAMES /John Smith|Smith, John|Jane Doe|Doe, Jane|../

Not the most efficient rule, but it would work. You would probably have to split it into multiple rules and combine them with a meta rule.

body __BAD_NAMES1 .
body __BAD_NAMES2 .
body __BAD_NAMES3 .
meta BAD_NAMES __BAD_NAMES1 || __BAD_NAMES2 || __BAD_NAMES3

Regexp::Optimizer would probably also help when creating the rules.

--
Bowie
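Since the names would be dumped from the patient DB every night, the split rules and the meta rule could be generated by a small script rather than written by hand. The following is a rough Python sketch of that idea, not a tested SpamAssassin setup: the function names, the chunk size and the \b anchors are assumptions added for illustration, and the escaping helper assumes plain ASCII names.

```python
import re


def perl_quote(text):
    """Backslash-escape everything except ASCII letters, digits, spaces and commas."""
    return re.sub(r"([^A-Za-z0-9 ,])", r"\\\1", text)


def build_rules(names, chunk_size=500, prefix="BAD_NAMES"):
    """Build 'body' sub-rules plus one 'meta' rule from (first, last) pairs."""
    alternates = []
    for first, last in names:
        f, l = perl_quote(first), perl_quote(last)
        alternates += [f"{f} {l}", f"{l}, {f}"]  # "John Smith" and "Smith, John"

    lines, subrules = [], []
    for start in range(0, len(alternates), chunk_size):
        name = f"__{prefix}{start // chunk_size + 1}"
        pattern = "|".join(alternates[start:start + chunk_size])
        lines.append(f"body {name} /\\b(?:{pattern})\\b/")
        subrules.append(name)

    if subrules:
        lines.append(f"meta {prefix} " + " || ".join(subrules))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rules([("John", "Smith"), ("Jane", "Doe")], chunk_size=2))
```

The chunk size is the knob to experiment with: it controls how many alternates end up in each sub-rule, and the only way to find a sensible value is to time it against real mail.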
Re: List of banned words/bounce to sender
On 8/5/2010 1:52 PM, Bowie Bailey wrote: My approach to doing something like this would be to have a rule that matches the names (however you implement it), and then have the MTA check for that particular rule hit and bounce the message if it exists. This is the same way you generally use the VBounce plugin. Then do the same thing for your bypass rule. That is pretty much what I wanted to do. The best way I know to make Postfix use SA is with Amavisd. Spamassassin can use whatever custom rule you care to come up with. It will happily use a regex with hundreds of names listed. The question is whether the rule would cause a noticeable slowdown in processing speed. The only way to find out is to try it. Using compiled rules would probably help here. Thanks. We are looking at roughly 70,000 names and always growing. If I gave it sufficient hardware, would you expect that to be practical, or is that totally ridiculous? Any options for a database look up here?
Re: List of banned words/bounce to sender
On 8/5/2010 3:00 PM, Matthew Kitchin (public/usenet) wrote:

On 8/5/2010 1:52 PM, Bowie Bailey wrote:

My approach to doing something like this would be to have a rule that matches the names (however you implement it), and then have the MTA check for that particular rule hit and bounce the message if it exists. This is the same way you generally use the VBounce plugin. Then do the same thing for your bypass rule.

That is pretty much what I wanted to do. The best way I know to make Postfix use SA is with Amavisd.

The point being that the score is irrelevant. If the rule hits, the message gets bounced.

Spamassassin can use whatever custom rule you care to come up with. It will happily use a regex with hundreds of names listed. The question is whether the rule would cause a noticeable slowdown in processing speed. The only way to find out is to try it. Using compiled rules would probably help here.

Thanks. We are looking at roughly 70,000 names and always growing. If I gave it sufficient hardware, would you expect that to be practical, or is that totally ridiculous? Any options for a database look up here?

I would tend to say that something that large would not be practical. On the other hand, there's no way to really know until you try it.

A database lookup is possible, but the problem is determining what to look up. You would have to somehow identify possible names for comparison to the database.

--
Bowie
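To make the "identify possible names, then look them up" step concrete, here is a minimal Python sketch. It assumes candidates are pairs of adjacent capitalised words, written either as John Smith or as Smith, John, and it uses an in-memory set of lowercased (first, last) tuples as a stand-in for the real database; PAIR, candidate_names and patient_hits are invented names, not part of any existing plugin.

```python
import re

# Zero-width lookahead so overlapping pairs ("Mary Anne", "Anne Jones") are all seen.
# Group 2 captures an optional comma, which marks the "Smith, John" ordering.
PAIR = re.compile(r"\b(?=([A-Z][-a-zA-Z]{1,20})(,?)\s([A-Z][-a-zA-Z]{1,20})\b)")


def candidate_names(text):
    """Return (first, last) candidates found in the text, in order of appearance."""
    pairs = []
    for match in PAIR.finditer(text):
        left, comma, right = match.groups()
        pairs.append((right, left) if comma else (left, right))
    return pairs


def patient_hits(text, patients):
    """Keep only candidates present in `patients`, a set of lowercased (first, last)."""
    return [p for p in candidate_names(text)
            if (p[0].lower(), p[1].lower()) in patients]


if __name__ == "__main__":
    body = "Please review the chart for Smith, John and call Jane Doe."
    print(patient_hits(body, {("john", "smith")}))
```

The candidate regex only has to describe the shape of a name, so its size does not grow with the patient list; the cost moves to one cheap lookup per candidate. Misspelled or lower-cased names would slip through, as would any name format the pattern does not describe.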
Re: List of banned words/bounce to sender
On Thu, Aug 5, 2010 at 2:00 PM, Matthew Kitchin (public/usenet) mkitchin.pub...@gmail.com wrote: On 8/5/2010 1:52 PM, Bowie Bailey wrote: My approach to doing something like this would be to have a rule that matches the names (however you implement it), and then have the MTA check for that particular rule hit and bounce the message if it exists. This is the same way you generally use the VBounce plugin. Then do the same thing for your bypass rule. That is pretty much what I wanted to do. The best way I know to make Postfix use SA is with Amavisd. Spamassassin can use whatever custom rule you care to come up with. It will happily use a regex with hundreds of names listed. The question is whether the rule would cause a noticeable slowdown in processing speed. The only way to find out is to try it. Using compiled rules would probably help here. Thanks. We are looking at roughly 70,000 names and always growing. If I gave it sufficient hardware, would you expect that to be practical, or is that totally ridiculous? Any options for a database look up here? Use your database to generate rules for clamav. You could even remove the stock clamav rules if you want. Matching the body for 70,000 names would probably take less than 0.1 seconds.
Re: List of banned words/bounce to sender
On 8/5/2010 2:05 PM, Bowie Bailey wrote:

I would tend to say that something that large would not be practical. On the other hand, there's no way to really know until you try it. A database lookup is possible, but the problem is determining what to look up. You would have to somehow identify possible names for comparison to the database.

Thanks. I think I had a brain fart here. Obviously we would have to have identified the names before we could look them up... I think I divided by 0 in my head at some point :)
Re: List of banned words/bounce to sender
On 8/5/2010 2:10 PM, Noel Jones wrote: Use your database to generate rules for clamav. You could even remove the stock clamav rules if you want. Matching the body for 70,000 names would probably take less than 0.1 seconds. That sounds like a really good idea. I do use ClamAV but have never written any rules of my own. Thanks for the tip!
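For anyone who has not written ClamAV rules before, a rough idea of what "generate rules from your database" could look like, sketched in Python. It assumes ClamAV's extended body-signature (.ndb) format of Name:TargetType:Offset:HexSignature, with target type 0 (any file) and offset * (anywhere); please check that against the ClamAV signature documentation before relying on it. The Local.PatientName prefix and the function name are made up, and numeric IDs are used so that the signature names themselves do not contain patient names.

```python
def ndb_signatures(names):
    """Return .ndb-style lines for 'First Last' and 'Last, First' of each pair.

    `names` is an iterable of (first, last) tuples of ASCII strings.
    """
    lines = []
    for number, (first, last) in enumerate(names, start=1):
        variants = (("a", f"{first} {last}"), ("b", f"{last}, {first}"))
        for suffix, text in variants:
            hex_sig = text.encode("ascii").hex()  # body signatures are hex-encoded bytes
            lines.append(f"Local.PatientName.{number}{suffix}:0:*:{hex_sig}")
    return lines


if __name__ == "__main__":
    for line in ndb_signatures([("John", "Smith")]):
        print(line)
```

A plain hex signature like this matches the exact bytes, so differences in case, extra whitespace or HTML markup between the two words would be missed; how ClamAV normalises mail and HTML parts before matching is something to test rather than assume.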
Re: List of banned words/bounce to sender
On 5 Aug 2010, at 20:13, Matthew Kitchin (public/usenet) wrote:

On 8/5/2010 2:10 PM, Noel Jones wrote:

Use your database to generate rules for clamav. You could even remove the stock clamav rules if you want. Matching the body for 70,000 names would probably take less than 0.1 seconds.

That sounds like a really good idea. I do use ClamAV but have never written any rules of my own. Thanks for the tip!

I'd set it up to check for surnames from the list in groups first, then if it matches one of those look for the various permutations of the full names that correspond to each set. I'm thinking of these in terms of calling out from Exim's acl_check_data section, using various database dirs depending on the rule set (like the Bayes filter), but there are other ways of achieving the same thing.

That ought to reduce the amount of work per message for those that will be let through. You'd have to experiment to find the best group size; it would depend on how many distinct surnames there are in your set, as well as the callout cost relative to the time for each expression.

That would also give you a good shot at identifying J. Smith as well, for example.
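The two-stage idea can be sketched independently of Exim or ClamAV. The Python below is only an illustration and swaps the grouped surname expressions for a plain dictionary lookup: stage one checks which known surnames occur among the words of the message, and stage two tests the full-name permutations (John Smith, Smith, John, J. Smith) only for those surnames. build_index, find_patients and TOKEN are hypothetical names, and first names are assumed to be non-empty.

```python
import re
from collections import defaultdict

# Words, allowing internal hyphens so "Jones-Brown" stays one token.
TOKEN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def build_index(names):
    """Map lowercased surname -> set of lowercased first names."""
    index = defaultdict(set)
    for first, last in names:
        index[last.lower()].add(first.lower())
    return index


def find_patients(text, index):
    """Return the set of (first, surname) pairs whose full name appears in text."""
    words = {word.lower() for word in TOKEN.findall(text)}
    hits = set()
    # Stage 1: cheap check for any known surname among the message words.
    for surname in words.intersection(index):
        s = re.escape(surname)
        # Stage 2: only now try the permutations for that surname.
        for first in index[surname]:
            f, initial = re.escape(first), re.escape(first[0])
            pattern = rf"\b(?:{f}\s+{s}|{s},\s*{f}|{initial}\.\s*{s})\b"
            if re.search(pattern, text, re.IGNORECASE):
                hits.add((first, surname))
    return hits


if __name__ == "__main__":
    idx = build_index([("John", "Smith"), ("Jane", "Doe")])
    print(find_patients("Please ask J. Smith about the results.", idx))
```

Messages that mention no known surname never reach stage two, which is where the saving for ordinary mail comes from. Matching on an initial alone will raise the false positive rate for common surnames, so that alternative may be worth dropping or scoring separately.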