On Mon, 2010-08-09 at 07:28 -0500, Daniel McDonald wrote:

> So, you are recommending that he use a plugin to query 70,000 records from a
> database, and perform 140,000 body matches, for every e-mail message he
> receives?
>
It should be possible to write a rule that recognises names: initials +
capitalised word, or a sequence of 2+ capitalised words, either
optionally prefixed with a title. Designing the regex
should be relatively easy because it only has to match the type of name
that can be generated from the database. No matter what you do, that
would seem to be a fundamental limit on what is reasonably possible. Now
you only have to run an SQL query against the regex matches, and this is
even easier if you use grouping in the regex to extract strings that
correspond to database fields and build the query from them.

Something like this will match a sequence of two capitalised name words,
including hyphenated ones, and extract the name words:

/([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})/

and should be fairly easy to extend to deal with initials and/or more
than one forename. Tested in Python and should also work in Perl.
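
For illustration, here is a minimal Python sketch of how the two groups
could be pulled out of a message body. The pattern is the one above,
unchanged; the function name and the sample sentence are made up for the
example.

```python
import re

# The two-word name pattern from this message, unchanged.
NAME_PAIR = re.compile(r"([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})")

def candidate_names(text):
    """Return a (first_word, second_word) tuple for each non-overlapping match.

    The function name is hypothetical; it only wraps re.findall().
    """
    return NAME_PAIR.findall(text)

# Made-up sample text containing one hyphenated surname.
sample = "the file for Mary Smith-Jones was sent to accounts"
print(candidate_names(sample))  # [('Mary', 'Smith-Jones')]
```

Note that any two adjacent capitalised words will match, for example a
greeting followed by a forename, so the matches are only candidates
until they have been checked against the database.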

> Doesn't seem very efficient.  It would make sense if it were
> structured data he was looking at, to then perform one-off queries to see if
> that data matched the database.  But the original post was discussing a
> data-loss-prevention scheme to avoid unstructured data leaks.
> 
Maybe so, but neither is building and applying a regex with 70,000+
alternatives in it.

Of course it would be wise to prototype both approaches before deciding
whether to do anything at all, but I have a gut feeling that recognising
a candidate name and using the matching string to construct and run an
SQL query will be less resource intensive than applying a very large
regex. I guesstimate the latter at 10-20 bytes per name including the
alternation separator, which is 700-1400 KB for 70,000 names.
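
A rough prototype of the match-then-query idea might look like the
following. It is only a sketch under assumptions: the thread does not
say which database is involved, so SQLite, the `people` table and its
`forename`/`surname` columns are all invented for the example, and the
regex is the two-word pattern quoted earlier in this message.

```python
import re
import sqlite3

# Two-word name pattern quoted earlier in this message.
NAME_PAIR = re.compile(r"([A-Z][-a-zA-Z]{1,20})\s([A-Z][-a-zA-Z]{1,20})")

def names_found_in_db(conn, text):
    """Return the candidate names in text that exist in the database.

    Assumes a hypothetical table people(forename, surname).
    """
    hits = []
    for forename, surname in NAME_PAIR.findall(text):
        # Parameterised query: the captured groups are passed as values,
        # never pasted into the SQL string.
        row = conn.execute(
            "SELECT 1 FROM people WHERE forename = ? AND surname = ? LIMIT 1",
            (forename, surname),
        ).fetchone()
        if row is not None:
            hits.append((forename, surname))
    return hits

# Tiny in-memory stand-in for the real 70,000-row database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (forename TEXT, surname TEXT)")
conn.execute("CREATE INDEX people_name ON people (surname, forename)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Mary", "Smith-Jones"), ("John", "Doe")],
)

body = "records for Mary Smith-Jones and Peter Piper attached"
print(names_found_in_db(conn, body))  # [('Mary', 'Smith-Jones')]
```

With an index on the name columns each lookup should be cheap, so the
cost per message depends on how many candidate strings the regex finds
rather than on the number of rows in the table.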
  
> If the data could be regularized somehow, that might be different.  For
> example, if there were a limited number of first names, you could write
> signatures that looked for first names with another capitalized word nearby,
> and then do a database lookup to see if the capitalized word was a last name
> associated with the first name that you discovered.  Unfortunately, people
> are pretty random with first names.  I have a database of some 600K voters
> in Travis County, Texas.  There are 38,808 distinct first names.  This
> technique might cut down the number of rules by 93.5%, but then you have to
> do database lookups and some fancy parsing to verify the hit.  Don't know if
> that would be worth it.
> 
Agreed: if some matching scheme can be made to work, it's going to let
some names through, if only because the writer misspells names recorded
in the database. There's not a lot that can be done about that.

Martin



