Re: OFAC SDN lists

2006-03-22 Thread sheeri kritzer
repost your question, this time giving an explanation (complete with
MySQL version, platform, queries and results) as to how it doesn't
work.

-Sheeri

On 3/15/06, Ing. Edwin Cruz [EMAIL PROTECTED] wrote:
 Hi folks! I'm asking in a web form for clients, those clients I have to
 search them into OFAC SDN lists(
 http://www.ustreas.gov/offices/enforcement/ofac/sdn/delimit/index.shtml), if
 they are into that list they wont be able to do transactions in my system.

 My problem is to perform a good search of the names into that list, I had
 thought in fulltext, but it appears to not work.
 Fulltext works with short names.


 does anyone have any suggestion for this?

 Regards!

 Edwin.



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: OFAC SDN lists

2006-03-22 Thread Ing. Edwin Cruz
Ok, and sorry for my poor explanation and my spanglish

The ofac list is a database with thousands of names of persons who are
forbidden to do transactions like change of dollars to mexican pesos, they
are trying to avoid money laundering, so if somebody try to do a
transaction he have to be searched into ofac list, but the problem is that
the name can be formatted in diferents ways like 'Edwin Cruz' 'Cruz Edwin'
'Edwin E. Cruz' and so on

I've tryed this:

Select * from OFACSDN where match(name) against ('edwin cruz');

And it returns more than 20 names that contain one or both words and i want
to improve my search...

An example:
SELECT NameSDN FROM tofacsdn t where nombre like '%cruz%';
+--+
| NameSDN  |
+--+
| CRUZ REYES, Antonio Pedro|
| CRUZ, Juan M. de la  |
| PEREZ CRUZ, Osvaldo  |
| SANTACRUZ LONDONO, Jose  |
| CAVIEDES CRUZ, Leonardo  |
| SANTACRUZ CASTRO, Ana Milena |
| CASTRO DE SANTACRUZ, Amparo  |
| CASTRILLON CRUZ, Maria Leonor|
| RUELAS MARTINEZ, Jose de la Cruz |
| SANTA CRUZ IMPERIAL AIRLINES |
+--+
10 rows in set (0.02 sec)

How do I have to perform a search in that list with my name 'Edwin Cruz', if
I try with full text I get this:
SELECT NameSDN from tofacsdn WHERE MATCH(nombre) AGAINST ('Edwin cruz')
+--+
| NameSDN  |
+--+
| MUGUTI, Edwin|
| PARRA VELASCO, Edwin Hiulder |
| MUTASA, Didymus Noel Edwin   |
| CRUZ, Juan M. de la  |
| CAVIEDES CRUZ, Leonardo  |
| PEREZ CRUZ, Osvaldo  |
| CASTRILLON CRUZ, Maria Leonor|
| RUELAS MARTINEZ, Jose de la Cruz |
| SANTA CRUZ IMPERIAL AIRLINES |
| CRUZ REYES, Antonio Pedro|
+--+
10 rows in set (0.01 sec)

The closest result that I want is with this query:

SELECT NameSDN from tofacsdn WHERE NameSDN like '%Edwin%' and NameSDN like
'%Cruz%'

But it isn't healthy because what abaut if I provide my name like 'Edwin
C.', I dont know what I am going to do with this... In the worst case I'll
program a script to build a query like above.

I'm using MySQL 5.0.18, php 5.0.5


-Mensaje original-
De: sheeri kritzer [mailto:[EMAIL PROTECTED] 
Enviado el: MiƩrcoles, 22 de Marzo de 2006 08:31 a.m.
Para: Ing. Edwin Cruz
CC: Mysql
Asunto: Re: OFAC SDN lists


repost your question, this time giving an explanation (complete with MySQL
version, platform, queries and results) as to how it doesn't work.

-Sheeri

On 3/15/06, Ing. Edwin Cruz [EMAIL PROTECTED] wrote:
 Hi folks! I'm asking in a web form for clients, those clients I have 
 to search them into OFAC SDN lists( 
 http://www.ustreas.gov/offices/enforcement/ofac/sdn/delimit/index.shtm
 l), if they are into that list they wont be able to do transactions in 
 my system.

 My problem is to perform a good search of the names into that list, I 
 had thought in fulltext, but it appears to not work. Fulltext works 
 with short names.


 does anyone have any suggestion for this?

 Regards!

 Edwin.





--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: OFAC SDN lists

2006-03-22 Thread SGreen
Ing. Edwin Cruz [EMAIL PROTECTED] wrote on 03/22/2006 11:38:53 AM:

 Ok, and sorry for my poor explanation and my spanglish
 
 The ofac list is a database with thousands of names of persons who are
 forbidden to do transactions like change of dollars to mexican pesos, 
they
 are trying to avoid money laundering, so if somebody try to do a
 transaction he have to be searched into ofac list, but the problem is 
that
 the name can be formatted in diferents ways like 'Edwin Cruz' 'Cruz 
Edwin'
 'Edwin E. Cruz' and so on
 
 I've tryed this:
 
 Select * from OFACSDN where match(name) against ('edwin cruz');
 
 And it returns more than 20 names that contain one or both words and i 
want
 to improve my search...
 
 An example:
 SELECT NameSDN FROM tofacsdn t where nombre like '%cruz%';
 +--+
 | NameSDN  |
 +--+
 | CRUZ REYES, Antonio Pedro|
 | CRUZ, Juan M. de la  |
 | PEREZ CRUZ, Osvaldo  |
 | SANTACRUZ LONDONO, Jose  |
 | CAVIEDES CRUZ, Leonardo  |
 | SANTACRUZ CASTRO, Ana Milena |
 | CASTRO DE SANTACRUZ, Amparo  |
 | CASTRILLON CRUZ, Maria Leonor|
 | RUELAS MARTINEZ, Jose de la Cruz |
 | SANTA CRUZ IMPERIAL AIRLINES |
 +--+
 10 rows in set (0.02 sec)
 
 How do I have to perform a search in that list with my name 'Edwin 
Cruz', if
 I try with full text I get this:
 SELECT NameSDN from tofacsdn WHERE MATCH(nombre) AGAINST ('Edwin cruz')
 +--+
 | NameSDN  |
 +--+
 | MUGUTI, Edwin|
 | PARRA VELASCO, Edwin Hiulder |
 | MUTASA, Didymus Noel Edwin   |
 | CRUZ, Juan M. de la  |
 | CAVIEDES CRUZ, Leonardo  |
 | PEREZ CRUZ, Osvaldo  |
 | CASTRILLON CRUZ, Maria Leonor|
 | RUELAS MARTINEZ, Jose de la Cruz |
 | SANTA CRUZ IMPERIAL AIRLINES |
 | CRUZ REYES, Antonio Pedro|
 +--+
 10 rows in set (0.01 sec)
 
 The closest result that I want is with this query:
 
 SELECT NameSDN from tofacsdn WHERE NameSDN like '%Edwin%' and NameSDN 
like
 '%Cruz%'
 
 But it isn't healthy because what abaut if I provide my name like 'Edwin
 C.', I dont know what I am going to do with this... In the worst case 
I'll
 program a script to build a query like above.
 
 I'm using MySQL 5.0.18, php 5.0.5
 
8 snip 8

Fulltext searches are what you need but you will need to change both the 
stop word list and the minimum length parameter. That way you don't filter 
out the shorter name parts like de and la. In this list almost nothing 
you have is noise so you really need to turn off that part of the full 
text indexing engine. 

Another thing to try is the BOOLEAN operators that you can use when you do 
a search IN BOOLEAN MODE...

I think you are on the right track. However, If for some reason the built 
in fulltext indexing doesn't work for you or can't be configured to work 
for you, you may need to create your own FT index. It's not as hard as it 
may sound with data like what you have. Here are the basic components:

a) a table of all of the words appearing in any indexed column -- in your 
situation, all you have are names. This table would hold (as single words) 
each part of every name (without any punctuation). Most languages have a 
function to split a string into an array based on some delimiter (like a 
space) so this should be easy to do.

b) a table linking each of the words in the table from a) with a record in 
your source table along with the position of the word in the source table. 
It's a simple two-column table and should be very fast to search. If a 
word appears more than once in the source record, it gets more than one 
record in this table. You typically build this as you build the first 
table.

c) an optional stemming table - this is where you can create a table to 
expand or contract a name or a name part into other recognizable forms 
(like misspelled names into their proper spelling)

d) a searching routine that queries the table from a) for matches to your 
base terms and your stemmed terms then uses that list to make a list of 
all of the records in the source table (by matching the first list to the 
table in b) ). This will give you a list of how many matches occurred for 
each source record. Run a count() query on this list to see how many 
matches were returned per source row. Order the results by # of matches in 
descending order.  You could even throw out all single hits as noise 
matches.

Anyway, that means that you are now in charge of your own FT index but 
this one will be tuned to your particular searching needs. The one built 
into MySQL is tuned better for matching words in lots of longer text 
fields (like newspaper articles) than it is for searching lists of names.

The little bit of effort you put into building the