"Ing. Edwin Cruz" <[EMAIL PROTECTED]> wrote on 03/22/2006 11:38:53 AM:

> Ok, and sorry for my poor explanation and my "spanglish"
> 
> The ofac list is a database with thousands of names of persons who are
> forbidden to do transactions like change of dollars to mexican pesos, 
they
> are trying to avoid "money laundering", so if somebody try to do a
> transaction he have to be searched into ofac list, but the problem is 
that
> the name can be formatted in diferents ways like 'Edwin Cruz' 'Cruz 
Edwin'
> 'Edwin E. Cruz' and so on
> 
> I've tryed this:
> 
> Select * from OFACSDN where match(name) against ('edwin cruz');
> 
> And it returns more than 20 names that contain one or both words and i 
want
> to improve my search...
> 
> An example:
> SELECT NameSDN FROM tofacsdn t where nombre like '%cruz%';
> +----------------------------------+
> | NameSDN                          |
> +----------------------------------+
> | CRUZ REYES, Antonio Pedro        |
> | CRUZ, Juan M. de la              |
> | PEREZ CRUZ, Osvaldo              |
> | SANTACRUZ LONDONO, Jose          |
> | CAVIEDES CRUZ, Leonardo          |
> | SANTACRUZ CASTRO, Ana Milena     |
> | CASTRO DE SANTACRUZ, Amparo      |
> | CASTRILLON CRUZ, Maria Leonor    |
> | RUELAS MARTINEZ, Jose de la Cruz |
> | SANTA CRUZ IMPERIAL AIRLINES     |
> +----------------------------------+
> 10 rows in set (0.02 sec)
> 
> How do I have to perform a search in that list with my name 'Edwin 
Cruz', if
> I try with full text I get this:
> SELECT NameSDN from tofacsdn WHERE MATCH(nombre) AGAINST ('Edwin cruz')
> +----------------------------------+
> | NameSDN                          |
> +----------------------------------+
> | MUGUTI, Edwin                    |
> | PARRA VELASCO, Edwin Hiulder     |
> | MUTASA, Didymus Noel Edwin       |
> | CRUZ, Juan M. de la              |
> | CAVIEDES CRUZ, Leonardo          |
> | PEREZ CRUZ, Osvaldo              |
> | CASTRILLON CRUZ, Maria Leonor    |
> | RUELAS MARTINEZ, Jose de la Cruz |
> | SANTA CRUZ IMPERIAL AIRLINES     |
> | CRUZ REYES, Antonio Pedro        |
> +----------------------------------+
> 10 rows in set (0.01 sec)
> 
> The closest result that I want is with this query:
> 
> SELECT NameSDN from tofacsdn WHERE NameSDN like '%Edwin%' and NameSDN 
like
> '%Cruz%'
> 
> But it isn't healthy because what abaut if I provide my name like 'Edwin
> C.', I dont know what I am going to do with this... In the worst case 
I'll
> program a script to build a query like above.
> 
> I'm using MySQL 5.0.18, php 5.0.5
> 
----8<---- <snip> ----8<----

Fulltext searches are what you need but you will need to change both the 
stop word list and the minimum length parameter. That way you don't filter 
out the shorter name parts like "de" and "la". In this list almost nothing 
you have is "noise" so you really need to turn off that part of the full 
text indexing engine. 

Another thing to try is the BOOLEAN operators that you can use when you do 
a search IN BOOLEAN MODE...

I think you are on the right track. However, If for some reason the built 
in fulltext indexing doesn't work for you or can't be configured to work 
for you, you may need to create your own FT index. It's not as hard as it 
may sound with data like what you have. Here are the basic components:

a) a table of all of the words appearing in any indexed column -- in your 
situation, all you have are names. This table would hold (as single words) 
each part of every name (without any punctuation). Most languages have a 
function to split a string into an array based on some delimiter (like a 
space) so this should be easy to do.

b) a table linking each of the words in the table from a) with a record in 
your source table along with the position of the word in the source table. 
It's a simple two-column table and should be very fast to search. If a 
word appears more than once in the source record, it gets more than one 
record in this table. You typically build this as you build the first 
table.

c) an optional stemming table - this is where you can create a table to 
expand or contract a name or a name part into other recognizable forms 
(like misspelled names into their proper spelling)

d) a searching routine that queries the table from a) for matches to your 
base terms and your stemmed terms then uses that list to make a list of 
all of the records in the source table (by matching the first list to the 
table in b) ). This will give you a list of how many matches occurred for 
each source record. Run a count() query on this list to see how many 
matches were returned per source row. Order the results by # of matches in 
descending order.  You could even throw out all single hits as "noise" 
matches.

Anyway, that means that you are now in charge of your own FT index but 
this one will be tuned to your particular searching needs. The one built 
into MySQL is tuned better for matching words in lots of longer text 
fields (like newspaper articles) than it is for searching lists of names.

The little bit of effort you put into building the index when you import 
your names will really pay off in terms of performance and accuracy when 
you need to use the list later.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Reply via email to