From: "Daniel McDonald" <dan.mcdon...@austinenergy.com>
Sent: Monday, 2010/August/09 05:28


On 8/9/10 6:58 AM, "Martin Gregorie" <mar...@gregorie.org> wrote:

On Mon, 2010-08-09 at 14:17 +0300, Henrik K wrote:
On Mon, Aug 09, 2010 at 11:38:50AM +0100, Martin Gregorie wrote:
On Thu, 2010-08-05 at 14:00 -0500, Matthew Kitchin (public/usenet)
wrote:
Thanks. We are looking at roughly 70,000 names and always growing. If I
gave it sufficient hardware, would you expect that to be practical, or
is that totally ridiculous? Any options for a database look up here?

I'd use a plugin that simply queries the database, plus a rule that
activates the plugin by calling its eval() method and sets the score if
the rule fires.

Queries database for what? I guess you didn't read the thread fully. :-)

Queries the patient data DB for patient names - obviously. I made the
offer because I found it useful to be able to modify an existing plugin
that queried a database. Exactly what the SQL query does is largely
irrelevant. I found that the difficult bit was working out how to
configure the plugin to access my database. Constructing the query and
interpreting its result were relatively easy.

So, you are recommending that he use a plugin to query 70,000 records from a
database, and perform 140,000 body matches, for every e-mail message he
receives?  Doesn't seem very efficient.  It would make sense if it were
structured data he was looking at, to then perform one-off queries to see if
that data matched the database.  But the original post was discussing a
data-loss-prevention scheme to avoid unstructured data leaks.

If the data could be regularized somehow, that might be different.  For
example, if there were a limited number of first names, you could write
signatures that looked for first names with another capitalized word
nearby, and then do a database lookup to see if the capitalized word was a
last name associated with the first name that you discovered.
Unfortunately, people are pretty random with first names.  I have a
database of some 600K voters in Travis County, Texas.  There are 38,808
distinct first names.  This technique might cut down the number of rules by
93.5%, but then you have to do database lookups and some fancy parsing to
verify the hit.  Don't know if that would be worth it.
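
A rough sketch of that two-stage idea in Python, only to make the shape of
it concrete. Everything named here is made up for illustration: the
FIRST_NAMES and KNOWN_PEOPLE sets stand in for the real database, and the
one-word "nearby" window is an arbitrary choice. It is not a SpamAssassin
plugin, just the matching logic.

```python
import re

# Hypothetical sample data; a real setup would load these from the database.
FIRST_NAMES = {"john", "mary"}
KNOWN_PEOPLE = {("john", "smith"), ("mary", "jones")}

# Split text into alphabetic tokens (apostrophes kept for names like O'Brien).
WORD = re.compile(r"[A-Za-z']+")


def candidate_pairs(text, first_names=FIRST_NAMES, window=1):
    """Yield (first, other) for each capitalized known first name that has
    another capitalized word within `window` tokens on either side."""
    tokens = WORD.findall(text)
    for i, tok in enumerate(tokens):
        if not tok[0].isupper() or tok.lower() not in first_names:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j][0].isupper():
                yield tok, tokens[j]


def verified_hits(text):
    """Keep only the candidates that the (stand-in) database confirms."""
    return [
        (first, last)
        for first, last in candidate_pairs(text)
        if (first.lower(), last.lower()) in KNOWN_PEOPLE
    ]
```

The cheap first pass only needs the list of first names; the lookup in
verified_hits() is where the per-candidate database query would go.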

Um, a query for "firstname=John and lastname=Smith" and a query for
"firstname=Smith and lastname=John" is a start. (Match with the format for
the database.) One of the problems is picking out names and matching them
with other names close enough to them to be "John Smith". Then you have to
guess the order; the two queries above handle that. Then you have to settle
on whether this is one of our John Smiths or a third party unrelated to our
database. I see that last one as the real problem.
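
For what it's worth, the two queries can be folded into one. A minimal
sketch using Python's sqlite3 with an in-memory table; the "patients" table
and its "firstname"/"lastname" columns are assumptions, so adjust them to
whatever the real schema looks like.

```python
import sqlite3


def name_in_db(conn, word_a, word_b):
    """Return True if the two words match a record in either order."""
    row = conn.execute(
        "SELECT 1 FROM patients "
        "WHERE (firstname = ? AND lastname = ?) "
        "   OR (firstname = ? AND lastname = ?) "
        "LIMIT 1",
        (word_a, word_b, word_b, word_a),
    ).fetchone()
    return row is not None


# Hypothetical table for illustration; NOCASE makes the comparison
# case-insensitive so "smith" matches "Smith".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients ("
    "firstname TEXT COLLATE NOCASE, lastname TEXT COLLATE NOCASE)"
)
conn.execute("INSERT INTO patients VALUES ('John', 'Smith')")
```

This covers the ordering guess, but it still can't tell one of our John
Smiths from somebody else's.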

{^_^}
