On 8/9/10 6:58 AM, "Martin Gregorie" <mar...@gregorie.org> wrote:
> On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
>> On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
>>> On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
>>> wrote:
>>>> Thanks. We are looking at roughly 70,000 names and always growing. If I
>>>> gave it sufficient hardware, would you expect that to be practical, or
>>>> is that totally ridiculous? Any options for a database look up here?
>>>>
>>> I'd use a plugin that simply queries the database plus a rule to
>>> activate the plugin by calling its eval() method and sets the score if
>>> the rule fires.
>>
>> Queries database for what? I guess you didn't read the thread fully. :-)
>>
> Queries the patient data DB for patient names - obviously. I made the
> offer because I found it useful to be able to modify an existing plugin
> that queried a database. Exactly what the SQL query does is largely
> irrelevant. I found that the difficult bit was working out how to
> configure the plugin to access my database. Constructing the query and
> interpreting its result were relatively easy.

So, you are recommending that he use a plugin to query 70,000 records
from a database, and perform 140,000 body matches, for every e-mail
message he receives? Doesn't seem very efficient.

It would make sense if it were structured data he was looking at, to
then perform one-off queries to see if that data matched the database.
But the original post was discussing a data-loss-prevention scheme to
avoid unstructured data leaks.

If the data could be regularized somehow, that might be different. For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word
nearby, and then do a database lookup to see if the capitalized word was
a last name associated with the first name that you discovered.

Unfortunately, people are pretty random with first names. I have a
database of some 600K voters in Travis County, Texas. There are 38,808
distinct first names. This technique might cut down the number of rules
by 93.5%, but then you have to do database lookups and some fancy parsing
to verify the hit. Don't know if that would be worth it.

--
Daniel J McDonald, CCIE # 2495, CISSP # 78281
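
For what it's worth, here is a rough sketch of the "first name, then nearby capitalized word, then database lookup" idea described above. It is written in Python purely for illustration and is not a SpamAssassin plugin. The `patients` table, its column names, and the helper names `make_demo_db`, `build_first_name_pattern` and `find_name_hits` are all assumptions made up for this example, and "nearby" is simplified to mean "the next word, optionally after a middle initial".

```python
import re
import sqlite3


def make_demo_db():
    """Tiny in-memory stand-in for the real names database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (first_name TEXT, last_name TEXT)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [("Maria", "Lopez"), ("John", "Smith"), ("Maria", "Nguyen")],
    )
    return conn


def build_first_name_pattern(conn):
    """Compile one regex covering every distinct first name in the table."""
    rows = conn.execute("SELECT DISTINCT first_name FROM patients").fetchall()
    # Sorted only so the compiled pattern is deterministic between runs.
    names = sorted(re.escape(row[0]) for row in rows)
    return re.compile(
        r"\b(" + "|".join(names) + r")\b"  # a known first name
        r"(?:\s+[A-Z]\.?)?"                # optional middle initial
        r"\s+([A-Z][a-z]+)\b"              # candidate last name
    )


def find_name_hits(text, conn, pattern):
    """Return (first, last) pairs from text that the database confirms."""
    hits = []
    for match in pattern.finditer(text):
        first, last = match.group(1), match.group(2)
        # One small lookup per candidate, instead of matching every full name.
        row = conn.execute(
            "SELECT 1 FROM patients WHERE first_name = ? AND last_name = ? LIMIT 1",
            (first, last),
        ).fetchone()
        if row is not None:
            hits.append((first, last))
    return hits
```

The pattern only has to know the distinct first names; the per-message cost then becomes one regex scan plus a lookup for each candidate pair. Whether that beats plain full-name matching would depend on how many candidates a typical message produces, which this sketch does not measure.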