"Ing. Edwin Cruz" <[EMAIL PROTECTED]> wrote on 03/22/2006 11:38:53 AM:
> Ok, and sorry for my poor explanation and my "spanglish" > > The ofac list is a database with thousands of names of persons who are > forbidden to do transactions like change of dollars to mexican pesos, they > are trying to avoid "money laundering", so if somebody try to do a > transaction he have to be searched into ofac list, but the problem is that > the name can be formatted in diferents ways like 'Edwin Cruz' 'Cruz Edwin' > 'Edwin E. Cruz' and so on > > I've tryed this: > > Select * from OFACSDN where match(name) against ('edwin cruz'); > > And it returns more than 20 names that contain one or both words and i want > to improve my search... > > An example: > SELECT NameSDN FROM tofacsdn t where nombre like '%cruz%'; > +----------------------------------+ > | NameSDN | > +----------------------------------+ > | CRUZ REYES, Antonio Pedro | > | CRUZ, Juan M. de la | > | PEREZ CRUZ, Osvaldo | > | SANTACRUZ LONDONO, Jose | > | CAVIEDES CRUZ, Leonardo | > | SANTACRUZ CASTRO, Ana Milena | > | CASTRO DE SANTACRUZ, Amparo | > | CASTRILLON CRUZ, Maria Leonor | > | RUELAS MARTINEZ, Jose de la Cruz | > | SANTA CRUZ IMPERIAL AIRLINES | > +----------------------------------+ > 10 rows in set (0.02 sec) > > How do I have to perform a search in that list with my name 'Edwin Cruz', if > I try with full text I get this: > SELECT NameSDN from tofacsdn WHERE MATCH(nombre) AGAINST ('Edwin cruz') > +----------------------------------+ > | NameSDN | > +----------------------------------+ > | MUGUTI, Edwin | > | PARRA VELASCO, Edwin Hiulder | > | MUTASA, Didymus Noel Edwin | > | CRUZ, Juan M. de la | > | CAVIEDES CRUZ, Leonardo | > | PEREZ CRUZ, Osvaldo | > | CASTRILLON CRUZ, Maria Leonor | > | RUELAS MARTINEZ, Jose de la Cruz | > | SANTA CRUZ IMPERIAL AIRLINES | > | CRUZ REYES, Antonio Pedro | > +----------------------------------+ > 10 rows in set (0.01 sec) > > The closest result that I want is with this query: > > SELECT NameSDN from tofacsdn WHERE NameSDN like '%Edwin%' and NameSDN like > '%Cruz%' > > But it isn't healthy because what abaut if I provide my name like 'Edwin > C.', I dont know what I am going to do with this... In the worst case I'll > program a script to build a query like above. > > I'm using MySQL 5.0.18, php 5.0.5 > ----8<---- <snip> ----8<---- Fulltext searches are what you need but you will need to change both the stop word list and the minimum length parameter. That way you don't filter out the shorter name parts like "de" and "la". In this list almost nothing you have is "noise" so you really need to turn off that part of the full text indexing engine. Another thing to try is the BOOLEAN operators that you can use when you do a search IN BOOLEAN MODE... I think you are on the right track. However, If for some reason the built in fulltext indexing doesn't work for you or can't be configured to work for you, you may need to create your own FT index. It's not as hard as it may sound with data like what you have. Here are the basic components: a) a table of all of the words appearing in any indexed column -- in your situation, all you have are names. This table would hold (as single words) each part of every name (without any punctuation). Most languages have a function to split a string into an array based on some delimiter (like a space) so this should be easy to do. b) a table linking each of the words in the table from a) with a record in your source table along with the position of the word in the source table. It's a simple two-column table and should be very fast to search. If a word appears more than once in the source record, it gets more than one record in this table. You typically build this as you build the first table. c) an optional stemming table - this is where you can create a table to expand or contract a name or a name part into other recognizable forms (like misspelled names into their proper spelling) d) a searching routine that queries the table from a) for matches to your base terms and your stemmed terms then uses that list to make a list of all of the records in the source table (by matching the first list to the table in b) ). This will give you a list of how many matches occurred for each source record. Run a count() query on this list to see how many matches were returned per source row. Order the results by # of matches in descending order. You could even throw out all single hits as "noise" matches. Anyway, that means that you are now in charge of your own FT index but this one will be tuned to your particular searching needs. The one built into MySQL is tuned better for matching words in lots of longer text fields (like newspaper articles) than it is for searching lists of names. The little bit of effort you put into building the index when you import your names will really pay off in terms of performance and accuracy when you need to use the list later. Shawn Green Database Administrator Unimin Corporation - Spruce Pine