On Tue, 2005-06-14 at 19:51 +0200, Gilles Lenfant wrote:
> rbt wrote:
> > Here's the scenario:
> > 
> > You have many hundred gigabytes of data... possibly even a terabyte or
> > two. Within this data, you have private, sensitive information (US
> > social security numbers) about your company's clients. Your company has
> > generated its own unique ID numbers to replace the social security numbers.
> > 
> > Now, management would like the IT guys to go through the old data and
> > replace as many SSNs with the new ID numbers as possible. You have a
> > tab-delimited txt file that maps the SSNs to the new ID numbers. There are
> > 500,000 of these number pairs. What is the most efficient way to
> > approach this? I have done small-scale find-and-replace programs before,
> > but the scale of this is larger than what I'm accustomed to.
> > 
> > Any suggestions on how to approach this are much appreciated.
> 
> Is this huge amount of data to search/replace stored in an RDBMS or in
> flat file(s) with markup (XML, CSV, ...)?
> 
> --
> Gilles

I apologize that it has taken me so long to respond. I had an HDD crash
from which I am still recovering. If emails to
[EMAIL PROTECTED] bounced, that is the reason why.


The data is in files, mostly Word documents and Excel spreadsheets. The
SSN map I have is a plain-text file with a format like this:

ssn-xx-xxxx     new-id-xxxx
ssn-xx-xxxx     new-id-xxxx
etc.

There are a bit more than 500,000 of these pairs.
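
For the replace step itself, here is a minimal sketch in Python of the
dictionary approach (the file name 'ssn_map.txt' and the ddd-dd-dddd
SSN layout are assumptions on my part, adjust them to the real data):

import re

# Load the tab-delimited map into a dict: one lookup per match,
# instead of 500,000 separate search-and-replace passes over the data.
ssn_map = {}
for line in open('ssn_map.txt'):
    fields = line.rstrip('\n').split('\t')
    if len(fields) >= 2:
        ssn_map[fields[0].strip()] = fields[1].strip()

# One compiled pattern finds every SSN-shaped token; the callback
# swaps in the new ID when the token is in the map and leaves the
# text unchanged otherwise.
ssn_re = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def scrub(text):
    return ssn_re.sub(lambda m: ssn_map.get(m.group(0), m.group(0)),
                      text)

The point is that the whole job stays at one pass per file: each
SSN-shaped token costs a single dict lookup, so the running time is
dominated by reading the data, not by the size of the map. The catch
is that .doc and .xls are binary formats: the text has to be extracted
first (or Word/Excel driven through COM automation on Windows) before
scrub() can see it, since a plain regex substitution on the raw binary
files would corrupt them.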

Thank you,
rbt


