Greetings all,

Some further reflections on the creation of name conflated data, as illustrated by the Spock Challenge. Dealing with the multiple forms of names when creating ambiguities in names is challenging but quite important. Suppose you are substituting BillClintonTonyBlair for all occurrences of Bill Clinton and Tony Blair. Perfect. Almost. What about forms like "President Clinton", "Prime Minister Blair", "William Jefferson Clinton"... You really need to consider these alternate forms when doing your name conflation substitutions, and in fact we generally work pretty hard to make sure we catch most of those alternate forms when creating this kind of data.
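A minimal Python sketch of this idea follows. It is only an illustration, not the code used to build any actual data set: the function name conflate, the VARIANTS list, and the sample sentence are all hypothetical, and the sketch assumes the alternate forms have been collected by hand.

```python
import re

def conflate(text, variants, pseudo_name):
    """Replace every listed name variant with a single conflated pseudo-name."""
    # Try longer variants first, so "William Jefferson Clinton" is matched
    # as a whole rather than leaving pieces of the name behind.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    )
    return pattern.sub(pseudo_name, text)

# Hand-built list of alternate forms (an assumption of this sketch).
VARIANTS = [
    "Bill Clinton", "President Clinton", "William Jefferson Clinton",
    "Tony Blair", "Prime Minister Blair",
]

sample = ("President Clinton met Tony Blair. "
          "Later, Prime Minister Blair called Bill Clinton.")
print(conflate(sample, VARIANTS, "BillClintonTonyBlair"))
```

Forms that are not in the list (a bare "Clinton", say) are left alone here, which is exactly why building the variant list takes real effort.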
In general I continue to believe that using name conflated data for experimental evaluations can be very effective. However, you must make sure that the names you are disguising or conflating are relatively unambiguous, and you must make sure that you account for at least some of the alternate forms of those unambiguous names.

A few relevant posts of mine from the Spock Challenge discussion list, available at: http://challenge.spock.com

------------------------------------------------------------------------
Yesterday 09:00:06

In reviewing the Antoine Destin cluster, I noticed that not only was Antoine Destin inserted as a name (in place of Benjamin Norden), but all occurrences of Norden were also replaced with Destin. See the following example:

------------------------------------------------------------------------
Before (from Google):

A year earlier he was advertising for tenders for building six cottages in Lawrence Street. In 1834, J. D. Norden was advertising a public sale of substantial landed property belonging to Benjamin Norden.

After (SCI.6.945516281.html):

A year earlier he was advertising for tenders for building six cottages in Lawrence Street. In 1834, J. D. Destin was advertising a public sale of substantial landed property belonging to Antoine Destin.
------------------------------------------------------------------------

I think it may be important to understand this, especially if this wasn't intended by Spock. It's not clear to me that it was, since this doesn't seem to be specified when they discuss their data creation process.

So, let me ask, what exactly has been changed in the test data?

It's clear that Antoine Destin was substituted for Benjamin Norden... this is what we'd expect based on the FAQ.

It's clear that Destin was substituted for Norden (in the case of J. D.)... this is not expected (at least not based on the description in the FAQ).
There are quite a few questions that can arise if we move beyond simply substituting Antoine Destin for Benjamin Norden...

What about occurrences of B. Norden? Does that become A. Destin or B. Destin? This seems very unclear, since B. Norden clearly refers to Benjamin Norden, so A. Destin is probably the "right" thing to do, but if they are simply replacing the last name you could end up with B. Destin, which now appears to be a different entity than A. Destin.

What about Benny Norden? :) Does that become Benny Destin or Antoine Destin?

What happens to occurrences of Mr. Norden? Does that become Mr. Destin or stay as Mr. Norden? I suppose this one is easier, since Mr. is generic and does not refer to any particular Norden/Destin. But then what about Mr. B. Norden? :)

I'm using Norden and Destin as examples here, and referring to some cases that don't occur for that name, but clearly this is a general issue that could and will affect all the data. I'm asking about the general case, and about the pattern or "rule" that was used to substitute disguised target names for the real names. Could a more detailed description of how the substitutions were done be provided?

Cordially,
Ted

------------------------------------------------------------------------
Yesterday 22:15:48

A little bit of bad news. It appears that Destin was substituted fairly blindly for Norden. From the same document as in the post above...

Original (from Google):

FOR SALE, a Female Slave. - For particulars apply to Mr. B. NORDEN. This Slave is to be sold owing to the severity of the existing Law on Slave Owners. Benjamin Norden was clearly, even at this point in his long life, a businessman involved in many types of enterprise.

It seems quite clear here that Mr. B. Norden refers to Benjamin Norden.

Modified version (SCI.6.945516281.html):

FOR SALE, a Female Slave. - For particulars apply to Mr. B. DESTIN.
This Slave is to be sold owing to the severity of the existing Law on Slave Owners. Antoine Destin was clearly, even at this point in his long life, a businessman involved in many types of enterprise.

The problem here is that Mr. B. Destin and Antoine Destin now really look like different entities, but they are not.

So, there are two issues to resolve: what was done to create this data, and what should have been done. :)

What was done, I believe, is to first replace the complete form of the original name (Benjamin Norden) with the new name (Antoine Destin). Then, I believe, all occurrences of Norden were changed to Destin. It does not appear that B. Norden was "recognized" as being another way of saying Benjamin Norden.

Is this a fatal problem? Not really, although if one has an approach that keys on names that appear with the target name, it could be confusing or problematic.

It would be best if Spock could describe exactly how this test set data was created. I'd actually be interested in knowing how they selected the "unambiguous name", and then how they went through and did the replacement of the strings.

Now, what should they have done? Well, it's a challenging problem to recognize every possible form of Benjamin Norden and then replace it with Antoine Destin. I wouldn't expect every form to be accounted for. However, I would have hoped that the simplest of forms, like B. Norden, would be accounted for. In general, I do think names tend to be repeated quite often in text, so doing the replacement as accurately as possible (so that new entities are not erroneously introduced) could really make a difference in the quality of the data and the results.

In any case, I do think it would help participants to know exactly what replacements were done, especially if their approach somehow keys on other names that occur with the target name.
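To make the two alternatives concrete, here is a rough Python sketch. It is not Spock's actual procedure (which has not been described); blind_replace only mirrors my guess at what was done, and initial_aware_replace is one hypothetical way of handling the simplest variant, the first initial plus last name. All function names and the shortened sample text are made up for illustration, and the sketch assumes two-word names.

```python
import re

def match_case(replacement, original):
    """Upper-case the replacement when the matched text was all upper case."""
    return replacement.upper() if original.isupper() else replacement

def blind_replace(text, old_full, new_full):
    """Guessed procedure: swap the full name, then every remaining surname."""
    old_last, new_last = old_full.split()[-1], new_full.split()[-1]
    text = text.replace(old_full, new_full)
    surname = re.compile(rf"\b{re.escape(old_last)}\b", re.IGNORECASE)
    return surname.sub(lambda m: match_case(new_last, m.group(0)), text)

def initial_aware_replace(text, old_full, new_full):
    """Hypothetical alternative: also map 'B. Norden' to 'A. Destin',
    and leave other people with the same surname untouched."""
    old_first, old_last = old_full.split()
    new_first, new_last = new_full.split()
    text = text.replace(old_full, new_full)
    initial_form = re.compile(
        rf"\b{re.escape(old_first[0])}\.\s*{re.escape(old_last)}\b",
        re.IGNORECASE,
    )
    new_form = f"{new_first[0]}. {new_last}"
    return initial_form.sub(lambda m: match_case(new_form, m.group(0)), text)

sample = ("For particulars apply to Mr. B. NORDEN. "
          "Benjamin Norden was clearly a businessman. "
          "In 1834, J. D. Norden was advertising a public sale.")

print(blind_replace(sample, "Benjamin Norden", "Antoine Destin"))
print(initial_aware_replace(sample, "Benjamin Norden", "Antoine Destin"))
```

The first call yields Mr. B. DESTIN and J. D. Destin, as seen in the test data; the second yields Mr. A. DESTIN and keeps J. D. Norden as a separate person. Whether other Nordens should be renamed at all is a design choice that the data creators would need to spell out.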
Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
