Greetings all,

Some of you may be aware that Spock is hosting a challenge related to person name discrimination. It turns out that there seem to be some pretty significant problems in the data provided for this challenge (available if you register as a participant at http://challenge.spock.com).
Below are some postings I've made to the Spock discussion list that describe both the challenges of creating "name conflated" data and a specific problem I found in the Spock data. I think these issues are very relevant to this list, since we use "name conflated" data rather regularly, and it's very important to understand its limitations.

Name conflation is a technique for creating evaluation data for clustering tools like SenseClusters, where you select two or more (hopefully) unambiguous names and create from them a new name that is now ambiguous. For example, you might take all occurrences of Bill Clinton and Tony Blair and conflate them into a new name like "BillClintonTonyBlair" that you must then discriminate/disambiguate.

Before getting to my posts, I should mention that another interested party spotted some problems with the Spock data and posted a description of that, to which I responded. In short, this person (also named Ted :) said that one of the ground truth clusters that Spock provided was really about multiple people, and he asked how this could be or what the problem was. He did a great job of analyzing his cluster, so I'd refer you to the discussion list for his actual note.

In any case, here are my posts from the Spock discussion list:

------------------------------------------------------------------------------------------------------------------

2007-07-12 19:17:31

Very nice work Ted! I'm going to speculate here wildly, but I'll bet I know what the problem is. Read on. :)

In the FAQ we find the following explanation of the data collection process....

------------------------------------------------------------------------------
How did you collect the data? Here is a brief description:

We crawl the web to find documents about several people, say A, B, C, D, E
We filter bad documents (e.g. spam, non-English, etc.)
We map the original people's names into new names. For example, A->F, B->F, C->F, D->G, E->G.
We substitute new names for any instance of the original people's names in the documents.
-------------------------------------------------------------------------------

Simply put, how does Spock know that A, B, C, D and E are not themselves ambiguous names? It is sometimes surprisingly hard to find a name that gives you more than a few hits on the web and is not shared by multiple people. That's what makes this challenge important, of course, but it's also what makes creating data this way very hard.

So my guess is that whatever names Peggy Waterfall is disguising are in fact ambiguous names themselves. I understand that Spock can't tell us what names Peggy Waterfall is disguising. But how do they know that those disguised names are not themselves ambiguous?

Cordially,
Ted (heh heh, our own name ambiguity)

--------------------------------------------------------------------------------------------------------------------------------

2007-07-12 19:41:14

I should add that we've had considerable experience trying to create this kind of data... our approach is almost exactly as described in the Spock FAQ, and to be honest, it's a lot of work to find names that refer to just one entity.

We call this process name conflation, btw, because we take multiple names that we hope are not ambiguous and substitute for them a name that is now ambiguous. So you take all the occurrences of Tom Hanks and all the occurrences of Russell Crowe and replace those with some name like "TomHanksRussellCrowe", and then try to figure out which is which via clustering, etc.

Now, in the case of rather famous actors and politicians and so forth, unless their name is very common they tend to dominate the hits you find... and in fact we have created quite a bit of this kind of data using newswire text, where "Tom Hanks" tends to refer only to the actor. However, if you go out on the web you start finding lots of Tom Hanks and things get messier.
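To make the process concrete, here is a minimal sketch of name conflation and of scoring a discrimination attempt against the hidden gold labels. This is a toy illustration with made-up documents and a faked clustering output; it is not Spock's pipeline or our nameconflate.pl itself.

```python
import re
from itertools import permutations

# Toy corpus: each document mentions one (hopefully unambiguous) person.
docs = [
    "Tom Hanks starred in a new film this year.",
    "Russell Crowe won an award for his role.",
    "Critics praised Tom Hanks for his performance.",
    "Russell Crowe is filming in Australia.",
]
# The true identity behind each document, kept aside as the gold standard.
gold = ["Tom Hanks", "Russell Crowe", "Tom Hanks", "Russell Crowe"]

# Step 1: conflate -- several real names collapse into one ambiguous pseudonym.
conflation = {
    "Tom Hanks": "TomHanksRussellCrowe",
    "Russell Crowe": "TomHanksRussellCrowe",
}

def conflate(text, mapping):
    """Replace every occurrence of each original name with its pseudonym."""
    for name, pseudo in mapping.items():
        text = re.sub(re.escape(name), pseudo, text)
    return text

conflated = [conflate(d, conflation) for d in docs]

# Step 2: a clustering tool assigns each document a cluster id; faked here.
predicted = [0, 1, 0, 0]

def best_accuracy(gold, predicted):
    """Score clusters under the best one-to-one cluster-id -> name assignment.

    Brute force over assignments, so only feasible for a handful of clusters.
    """
    names = sorted(set(gold))
    clusters = sorted(set(predicted))
    best = 0.0
    for perm in permutations(names, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == g for c, g in zip(predicted, gold))
        best = max(best, hits / len(gold))
    return best

print(conflated[0])                    # TomHanksRussellCrowe starred in ...
print(best_accuracy(gold, predicted))  # 0.75
```

Note that the whole exercise only measures what it claims to measure if the gold labels are right, which is exactly why an ambiguous underlying name poisons the evaluation.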
We've had many amusing examples of being fooled by this... my personal favorite is the case of Puma. We were, I think, conflating Adidas and Puma, thinking that both would be unique as names of shoe companies, and we were being very careful to avoid the animal sense of puma, which didn't occur in our corpus. Little did we know that there is a kind of helicopter called a Puma. :)

In any case, you have to be very careful in creating this kind of data, and to be honest, given the number of entities that appear to underlie some of the names in the Spock corpus, I really don't think there is any way they could guarantee that each of those hidden people is unique.

If you'd like to mess around with this or see some of the data we've created, you can find our program called nameconflate.pl here: http://www.d.umn.edu/~tpederse/tools.html It's the first entry, nameconflate.pl version 0.16. We've used data like this in previous name disambiguation/discrimination experiments, and to be honest it usually works pretty well, but you have to be very careful about this issue of ambiguity in your underlying entities.

There are (I think) 82 clusters associated with Peggy Waterfall in the ground truth data. I suspect that at least some of those clusters in fact refer to multiple people, although it would be very hard to go through and manually disambiguate the clusters to verify this (as our brave Ted has started to do!).

I do agree with Ted. If this "name conflated" data was created without very careful selection of underlying cluster identities that are truly unique, then the test data has a very serious problem indeed. Given that there are 1101 clusters in the training data, I can't imagine that Spock was able to find 1101 names that are truly unique to a single person....

So, when creating name conflated data like this, the first step is to make sure that A, B, C, D, and E each refer to only a single person. Then conflate them together. I think this step was missing.
:(

Cordially,
Ted

-----------------------------------------------------------------------------------------------------------------------------

2007-07-12 19:48:25

PS There is a lot of name conflated data available via the data links found here: http://www.d.umn.edu/~tpederse/senseclusters-pubs.html This data was used for experiments in papers on name disambiguation/discrimination problems...

We also manually disambiguated some names in web data, and my god that was work. :) But we felt it was necessary, since "name conflation" in the end is an artificial process and results in a kind of data that is almost, but not quite, representative of the real problem. You can find that here: http://www.d.umn.edu/~tpederse/namedata.html

Sorry, I know this is shameless plugging, but it's all free. You can take the data and do whatever you want, you can take the code and do whatever you want, or you can just ignore it all. :)

Cordially,
Ted

------------------------------------------------------------------------------------------------------------------------------

Today 09:03:38

I have bad news about the ground truth data as used in the training data. I randomly selected one cluster for Harriet Arthur and examined each of the files. What I found was quite obviously more than one person. This Harriet Arthur is:

1) a high school administrator
2) a former runway model
3) a recent honor roll student in Oklahoma

Are these the same person? Absolutely not. In fact, Harriet Arthur is in this case disguising the real name "Chrystal Benson", which a simple Google search reveals is very clearly an ambiguous name. Spock, you really need to clarify how this data was created, and why we should have any confidence in it whatsoever. To be honest, what I see below is painfully sloppy.
Chrystal Benson is clearly an ambiguous name, and the fact that it was included as one of the single-identity ground truths shows me that the process of selecting the unique names is horribly flawed. Please tell me why I'm wrong in drawing this conclusion (and do it fast, or extend your July 15 proposal deadline and also the contest finishing date).

Below are the contents of a single Harriet Arthur cluster, and a little bit about each document. I strongly advise each person who is still interested in this challenge to try what Ted first did, and then what I did below. Just pick a cluster at random, and look at the documents. Does it appear to be about a single person or not? Post your conclusions here.

-------------------------------------------------------------------------
SCI.8.677067132.html
identified as vice principal at mcdonough high school in a june 2005 blog entry
------------------------------------------------------------------------
SCI.18.862467912.html
vice principal or administrative assistant (unclear) as listed on westlake high school (maryland) web page
------------------------------------------------------------------------
SCI.6.850369785.html
DEVOTION: Actor-comedian Joe Torry and Harriet Arthur receive blessings from St. Louis Cardinals baseball legend Ozzie Smith and famed actress Regina ...
------------------------------------------------------------------------
SCI.20.248901493.html
Comedian and actor Joe Torry and his fiancee, Harriet Arthur, a print and runway model, are firm believers that long-distance relationships can work. The two shared a long-distance love affair with each other "on and off" for seven years after they met at a party in their hometown of St. Louis.
------------------------------------------------------------------------
SCI.15.492335037.html
vice principal in charles county public schools (maryland)
------------------------------------------------------------------------
SCI.2.620002873.html
northwestern oklahoma state university honor roll member (january 2007)
------------------------------------------------------------------------

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
