>>Just curious: how would you mask names and addresses?  Of course, if these
>>are not used as keys some constraints such as uniqueness are relaxed.

>>-- gil

Our data records have fields for first name, last name, and gender.  They of 
course do not need to be unique.  But the masking algorithm must be 
deterministic.  So if we have a name "JOHN SMITH" that our algorithm translates 
to "FRED COLLINS", then "JOHN SMITH" must become "FRED COLLINS" across all 
files.

My technique for doing the transformation is as follows:
(1) I have a table of several hundred surnames.
(2) Also another table of female first names.
(3) Yet another table of male first names.

To transform a surname, I put the real surname thru a hash function and use the 
value obtained, reduced modulo the table size, as an index into my table of 
surnames.

To transform a first name, I put the real first name thru the hash function and 
then use the value as an index into the proper given name table, as selected 
using the gender code.
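
A minimal sketch of the lookup described above, in Python. The table contents, 
the choice of SHA-256 as the hash, the upper-casing and trimming of input, and 
the names stable_index and mask_name are all assumptions made for illustration; 
they are not the actual tables or hash function. A fixed hash is used rather 
than Python's built-in hash(), which is randomized per process for strings and 
so would not give the same result across files.

```python
import hashlib

# Hypothetical lookup tables; the real ones hold several hundred entries each.
SURNAMES = ["COLLINS", "BAKER", "NGUYEN", "OWENS", "REYES"]
FEMALE_NAMES = ["ALICE", "DIANE", "MARIA", "RUTH"]
MALE_NAMES = ["FRED", "HENRY", "OSCAR", "PAUL"]


def stable_index(value: str, table_size: int) -> int:
    """Hash a string to a table index that is the same on every run."""
    # Normalizing case and whitespace is an assumption, so that
    # "John " and "JOHN" land on the same entry.
    digest = hashlib.sha256(value.strip().upper().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def mask_name(first: str, last: str, gender: str) -> tuple[str, str]:
    """Replace a real name with a fictitious one drawn from the tables."""
    # The gender code selects which given-name table to index into.
    given_table = FEMALE_NAMES if gender.strip().upper() == "F" else MALE_NAMES
    masked_first = given_table[stable_index(first, len(given_table))]
    masked_last = SURNAMES[stable_index(last, len(SURNAMES))]
    return masked_first, masked_last
```

Calling mask_name("JOHN", "SMITH", "M") in any program or on any file returns 
the same pair, which is the deterministic property required.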

So, using a scheme such as this, "JOHN SMITH" will always get translated to the 
same fictitious value (e.g. "FRED COLLINS").  But someone seeing "FRED COLLINS" 
in a test file won't be able to conclude that this is really a record for "JOHN 
SMITH", since thousands of other real names will also translate to "FRED 
COLLINS".
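
The many-to-one property can be illustrated with a rough count. This is only a 
sketch: the table size of 300, the synthetic "real" surnames, and the SHA-256 
hash are assumptions chosen for the example, not figures from our data.

```python
import hashlib
from collections import defaultdict

TABLE_SIZE = 300  # assumed size for a table of "several hundred" surnames


def stable_index(value: str, table_size: int) -> int:
    """Hash a string to a table index that is the same on every run."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


# Synthetic stand-ins for real surnames, used only to count collisions.
real_names = [f"SURNAME{i:05d}" for i in range(30000)]

# Group the real names by the table slot they are masked to.
buckets = defaultdict(list)
for name in real_names:
    buckets[stable_index(name, TABLE_SIZE)].append(name)

# With 30000 inputs and at most 300 slots, some slot must hold at least
# 100 real names, so a masked value cannot be traced back to one input.
largest = max(len(names) for names in buckets.values())
print(f"slots used: {len(buckets)}, largest slot: {largest} real names")
```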

Transforming addresses uses a similar idea to mask the street name, although it 
gets a bit ugly at times.  The street number is also masked by hashing. 
Fortunately, we do not need to worry about creating fictitious addresses that 
are in the USPS database.
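
A sketch of the same idea applied to an address, in Python. It assumes the 
address has already been split into a street number and a street name (the 
parsing is the ugly part and is not shown). The street table, the 1 to 9999 
range for masked numbers, the SHA-256 hash, and the function names are all 
hypothetical.

```python
import hashlib

# Hypothetical street-name table; a real one would be much larger.
STREET_NAMES = ["MAPLE ST", "OAK AVE", "CEDAR LN", "BIRCH RD", "ELM DR"]


def stable_hash(value: str) -> int:
    """Return an integer hash that is the same on every run."""
    digest = hashlib.sha256(value.strip().upper().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def mask_address(street_number: str, street_name: str) -> str:
    """Mask an address that is already split into number and street name."""
    # Map the real number onto an assumed range of 1..9999.
    masked_number = stable_hash(street_number) % 9999 + 1
    # Use the hashed street name as an index into the street table.
    masked_street = STREET_NAMES[stable_hash(street_name) % len(STREET_NAMES)]
    return f"{masked_number} {masked_street}"
```

As with names, the same input address always yields the same masked address, 
and no attempt is made to produce an address that actually exists.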

John

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: INFO IBM-MAIN
