On 8/9/10 6:58 AM, "Martin Gregorie" <mar...@gregorie.org> wrote:

> On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
>> On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
>>> On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
>>> wrote:
>>>> Thanks. We are looking at roughly 70,000 names and always growing. If I
>>>> gave it sufficient hardware, would you expect that to be practical, or
>>>> is that totally ridiculous? Any options for a database look up here?
>>>> 
>>> I'd use a plugin that simply queries the database plus a rule to
>>> activate the plugin by calling its eval() method and sets the score if
>>> the rule fires.
>> 
>> Queries database for what? I guess you didn't read the thread fully. :-)
>> 
> Queries the patient data DB for patient names - obviously. I made the
> offer because I found it useful to be able to modify an existing plugin
> that queried a database. Exactly what the SQL query does is largely
> irrelevant. I found that the difficult bit was working out how to
> configure the plugin to access my database. Constructing the query and
> interpreting its result were relatively easy.

So, you are recommending that he use a plugin to query 70,000 records from a
database, and perform 140,000 body matches, for every e-mail message he
receives?  That doesn't seem very efficient.  A lookup would make sense if he
were looking at structured data: he could then run one-off queries to see
whether that data matched the database.  But the original post was discussing
a data-loss-prevention scheme to avoid unstructured data leaks.

If the data could be regularized somehow, that might be different.  For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word nearby,
and then do a database lookup to see if the capitalized word was a last name
associated with the first name that you discovered.  Unfortunately, people
are pretty random with first names.  I have a database of some 600K voters
in Travis County, Texas.  There are 38,808 distinct first names.  This
technique might cut down the number of rules by 93.5%, but then you have to
do database lookups and some fancy parsing to verify the hit.  Don't know if
that would be worth it.
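
A rough sketch of that two-stage check, in Python with sqlite3, might look
like the following.  It is only an illustration: the `patients` table, its
column names, the two-token window and the sample rows are all assumptions of
mine, not anything from the original poster's setup, and a real SpamAssassin
deployment would do this inside a Perl plugin instead.

```python
import re
import sqlite3

# Hypothetical schema: one row per patient, names stored lower-cased.
def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (first_name TEXT, last_name TEXT)")
    conn.execute("CREATE INDEX idx_first_last ON patients (first_name, last_name)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [("maria", "gonzales"), ("john", "smith"), ("john", "abernathy")],
    )
    return conn

def load_first_names(conn):
    # Stage 1 data: the (much smaller) set of distinct first names, held in memory.
    return {row[0] for row in conn.execute("SELECT DISTINCT first_name FROM patients")}

TOKEN = re.compile(r"[A-Za-z']+")

def find_patient_names(body, conn, first_names, window=2):
    """Return (first, last) pairs found in body that match a patient row."""
    tokens = TOKEN.findall(body)
    hits = []
    for i, tok in enumerate(tokens):
        # Stage 1: cheap in-memory test for a capitalized known first name.
        if not tok[0].isupper() or tok.lower() not in first_names:
            continue
        # Stage 2: look at capitalized words within `window` tokens and
        # confirm each candidate with a single indexed database lookup.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            cand = tokens[j]
            if j == i or not cand[0].isupper():
                continue
            row = conn.execute(
                "SELECT 1 FROM patients WHERE first_name = ? AND last_name = ? LIMIT 1",
                (tok.lower(), cand.lower()),
            ).fetchone()
            if row:
                hits.append((tok, cand))
    return hits

if __name__ == "__main__":
    conn = build_demo_db()
    names = load_first_names(conn)
    print(find_patient_names("Please send the chart for John Abernathy.", conn, names))
```

The database is only touched when a known first name appears next to another
capitalized word, so most messages would cost nothing beyond the tokenizing
pass.  Names written in all lower case, or split across more than the assumed
window, would be missed.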


-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281
