On Fri, 2005-06-17 at 09:18 -0400, Peter Hansen wrote: > rbt wrote: > > The script is too long to post in its entirety. In short, I open the > > files, do a binary read (in 1MB chunks for ease of memory usage) on them > > before placing that read into a variable and that in turn into a list > > that I then apply the following re to > > > > ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b') > > > > like this: > > > > for chunk in whole_file: > > search = ss.findall(chunk) > > if search: > > validate(search) > > This seems so obvious that I hesitate to ask, but is the above really a > simplification of the real code, which actually handles the case of SSNs > that lie over the boundary between chunks? In other words, what happens > if the first chunk has only the first four digits of the SSN, and the > rest lies in the second chunk? > > -Peter
No, that's a good question. As of now, there is nothing to handle the scenario that you bring up. I have considered this possibility (rare but possible). I have not written a solution for it. It's a very good point though. Is that not why proper software is engineered? Anyone can build a go cart (write a program), but it takes a team of engineers and much testing to build a car, no? Which woulu you rather be riding in during a crash? I wish upper mgt had a better understanding of this ;) -- http://mail.python.org/mailman/listinfo/python-list