John J. Lee wrote:
> "John Machin" <[EMAIL PROTECTED]> writes:
> [...]
> > This is all a bit OT. Before we close the thread down
>
> Do you have a warrant for that?

I have some signed-but-otherwise-blank warrants, but I'm saving them for
other threads :-)

>
> > , let me leave
> > you with one warning:
> > Beware of enthusiastic maintenance programmers on a mission to clean up
> > the dirty names in your database:
> > E.g. (1) "Karim bin Md" may not appreciate getting a letter addressed
> > to "Dr Karim Bin" (Md is an abbreviation of Muhammad).
> > E.g. (2) Billing job barfs on a customer who has no given names and no
> > family name. Inspection reveals that he is over-endowed in the title
> > department: "Mr Earl King".
> [...]
>
> Heh.

Heh indeed. This behaviour seems to be endemic. Another true story, from a
3rd post-cleanup cleanup assignment:

Looking at the "country" component of addresses: WALES? Users suggested it
be changed to "UK" to conform with ISO standards, UPU conventions, etc.
However, glancing at other address components, one found intriguing things
like "C/o Prince of Hospital". The same "algorithm" had migrated a handful
of clients from Coromandel Valley to Oman, and a considerable number from
the Melbourne suburb of Chadstone to Chad.

>
> I guess the people who really know about that kind of thing are the
> "record linkage" people (this one is a project worked on by c.l.py's
> own Tim Churches, and has produced some Python code):
>
> http://datamining.anu.edu.au/projects/linkage.html

The project is heavily into probabilistic methods. Given enough correctly
tagged data to work on, "Earl" and "King" are much more likely to drop into
a name slot than a title slot.

Cheers,
John
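
For what it's worth, the "Earl"/"King" point can be illustrated with a toy
frequency count. This is only a rough sketch and not the linkage project's
code: the tagged sample, the three slot names, and the function names
(train, slot_probabilities, best_slot) are all made up for illustration,
and it looks at each token on its own, ignoring position and neighbouring
tokens, which a real probabilistic method would presumably take into
account.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample: (token, slot) pairs. Made-up data.
TAGGED = [
    ("mr", "title"), ("earl", "given"), ("king", "family"),
    ("dr", "title"), ("earl", "given"), ("jones", "family"),
    ("ms", "title"), ("anne", "given"), ("king", "family"),
    ("earl", "title"), ("john", "given"), ("spencer", "family"),
    ("mr", "title"), ("karim", "given"), ("bin", "given"), ("md", "given"),
]

SLOTS = ("title", "given", "family")  # assumed slot names


def train(tagged):
    """Count how often each (lower-cased) token was tagged with each slot."""
    counts = defaultdict(Counter)
    for token, slot in tagged:
        counts[token.lower()][slot] += 1
    return counts


def slot_probabilities(counts, token, alpha=1.0):
    """Estimate P(slot | token) with add-alpha smoothing."""
    seen = counts.get(token.lower(), Counter())
    total = sum(seen.values()) + alpha * len(SLOTS)
    return {slot: (seen[slot] + alpha) / total for slot in SLOTS}


def best_slot(counts, token):
    """Most likely slot for a token, or None if it was never seen."""
    if token.lower() not in counts:
        return None  # no evidence: better to flag for a human than to guess
    probs = slot_probabilities(counts, token)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    counts = train(TAGGED)
    for tok in ("Mr", "Earl", "King", "Md"):
        print(tok, best_slot(counts, tok), slot_probabilities(counts, tok))
```

With this sample, "Earl" was tagged as a given name more often than as a
title, so the count-based guess puts it in a name slot, whereas a
hard-coded list of titles would have grabbed it. Returning None for unseen
tokens is a deliberate choice here: guessing on no evidence is how clients
end up in Chad.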