Rick Dwyer <[EMAIL PROTECTED]> wrote on 11/17/2005 10:28:51 AM:

> Hello All.
> 
> I am hoping for a bit of help with some code that has really given me 
> some trouble.  If this is not he correct forum for this any help in 
> pointing me to a more suited list would be appreciated.
> 
> I have a MySQL 4.1.x database containing records with phone numbers.
> 
> Most of the phone numbers are enter in 12035551212 format, but some 
> are entered with spaces or "-" or "(" or other characters.
> 
> I need to read the first 4 positions in the phone number to determine 
> it's location.
> 
> My statement looks like this:
> 'Select mid(phone, 1,4) as phoneareacode from phonetable'
> 
> This works but if the number is entered as 1(203)-555-1212 the above 
> would return   "1(20" which is not what I am looking for.
> 
> Is there a way to have the select statement examine only numeric 
> values in the phone number so it would disregard the other charcters?
> 
> In Lasso, you can use a Replace with a Regular Expression function to 
> have just the digits 0-9 examined but haven't been able find  a way 
> to do this in SQL.
> 
> Any help is appreciated.
> Thank you.
> Rick
> 
> 

The problem is, you have "dirty" data making it very hard to use it as a 
search target. There are different ways of handling this:

a) scrub your data (preferably during input or import) so that all phone 
numbers are stored in the exact same pattern
b) store each number as its component parts (country code, city/area code, 
exchange, circuit)
c) add a field of scrubbed data to your existing table and populate it.
d) add fields for each part of the phone number to your tables and 
populate them.
e) handle all searching and substring matching outside of SQL.

Options a) and c) require that a "clean" copy of the data be stored in the 
database. That means that you pick a pattern and make all of your numbers 
look like that pattern. If, for instance, you get just the number 555-1212 
you would need to generate something like 'x-xxx-555-1212' as a replacment 
(where the x is used to indicate missing information). This is slower to 
search on because you have to do substring matches but since phone numbers 
are already organized from least-specific to most-specific (left to right) 
it's already optimized for some types of substring matches.

Options b) and d) provide the ability to index each part of a phone 
number. Here is an example breakdown:

1 (203) 555-1212 -> country code:1, citycode:203, exchange: 555, circut: 
1212

If any part of a number is missing, you can use a NULL value for that 
part...

555-1212 -> country code: NULL, citycode: NULL, exchange: 555, circut: 
1212

This would be all integer comparisons, VERY fast to search. If you have 
millions of phone numbers or international phone numbers, consider this 
schema. It may be hard to determine the foreign `exchange` and `circuit` 
parts. In that case just pick either the `exchange` or `circuit` field and 
put that whole portion of it in there. This takes more time to set up but 
is MUCH faster to search (how do you think the phone company does it?)

Options b) and d) also allow you to have searchable data while preserving 
the original information.

Option e) has the most flexibility but takes the database server out of 
the loop, which will destroy your search performance.


Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Reply via email to