Greetings all,

Some of you may be aware that Spock is hosting a challenge that
relates to person name discrimination. It turns out that there seem to
be pretty significant problems in the data they have provided for this
challenge (available if you register as a participant at
http://challenge.spock.com).

Below are some postings I've made to the Spock discussion list that
describe both the challenges of creating "name conflated" data, and a
specific problem I found in the Spock data. I think these issues are
very relevant to this list, since we use "name conflated" data rather
regularly, and it's very important to understand the limitations of
that kind of data.

Name conflation is a technique for creating evaluation data for
clustering tools like SenseClusters, where you select two or more
(hopefully) unambiguous names, and create a new name from those that
is now ambiguous. For example, you might take all occurrences of Bill
Clinton and Tony Blair and conflate them into a new name like
"BillClintonTonyBlair" that you must then discriminate/disambiguate.
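
At its simplest, assuming documents stored as (text, source-name)
pairs, the substitution step looks something like the sketch below.
The `conflate` function and the sample data are purely illustrative
(this is not how nameconflate.pl is actually implemented):

```python
import re

def conflate(docs, names, new_name):
    """Replace every occurrence of any name in `names` with `new_name`,
    keeping the original name as a gold-standard label for evaluation."""
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return [(pattern.sub(new_name, text), true_name)
            for text, true_name in docs]

docs = [
    ("Bill Clinton met reporters today.", "Bill Clinton"),
    ("Tony Blair spoke in London.", "Tony Blair"),
]
out = conflate(docs, ["Bill Clinton", "Tony Blair"], "BillClintonTonyBlair")
# out[0] -> ("BillClintonTonyBlair met reporters today.", "Bill Clinton")
```

The gold label kept alongside each document is what makes the data
usable for evaluation: a clustering tool sees only the ambiguous
pseudo-name, and its output is scored against the retained labels.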

Before getting to my posts, I should mention that another interested
party spotted some problems with the Spock data and posted a
description, to which I responded. In short, this person (also
named Ted :) found that one of the ground truth clusters Spock
provided was really about multiple people, and he asked how this
could be. He did a great job analyzing his cluster, so I'd refer
you to the discussion list for his actual note.

In any case, here are my posts from the Spock discussion list:

------------------------------------------------------------------------------------------------------------------
2007-07-12 19:17:31

Very nice work Ted!

I'm going to speculate wildly here, but I'll bet I know what the
problem is. Read on. :)

In the FAQ we find the following explanation of the data collection process....

------------------------------------------------------------------------------
How did you collect the data?

Here is a brief description:
We crawl the web to find documents about several people, say A, B, C, D, E
We filter bad documents (e.g. spam, non-English...etc)
We map the original people's names into new names. For example, A->F,
B->F, C->F, D->G, E->G.
We substitute new names for any instance of the original people's
names in the documents.
-------------------------------------------------------------------------------

Simply put, how does Spock know that A, B, C, D and E are not
themselves ambiguous names?

It is often surprisingly hard to find a name that returns more than
a few hits on the web yet is not shared by multiple people.
That's what makes this challenge important of course, but it's also
what makes creating data this way very very hard.

So my guess is that whatever names Peggy Waterfall is disguising are
in fact ambiguous names themselves.

I understand that Spock can't tell us what names Peggy Waterfall is
disguising. But, how do they know that those disguised names are not
themselves ambiguous?

Cordially,
Ted (heh heh, our own name ambiguity)
--------------------------------------------------------------------------------------------------------------------------------
2007-07-12 19:41:14

I should add that we've had some considerable experience trying to
create this kind of data...our approach is almost exactly as described
in the Spock FAQ, and to be honest, it's a lot of work to find names
that refer to just one entity. We call this process name conflation,
btw, because we take multiple names that we hope are not ambiguous and
substitute for them a name that is now ambiguous.

So you take all the occurrences of Tom Hanks and all the occurrences
of Russell Crowe and replace those with some name like
"TomHanksRussellCrowe", and then try to figure out which is which via
clustering, etc. Now, in the case of rather famous actors and
politicians and so forth, unless their name is very common they tend
to dominate many of the hits you find...and in fact we have created
quite a bit of this kind of data using newswire text, where "Tom
Hanks" tends to only refer to the actor. However, if you go out on the
web you start finding lots of Tom Hanks and things get messier.

We've had many amusing examples of being fooled by this....my personal
favorite is the case of Puma. We were I think conflating Adidas and
Puma, thinking that both would be unique as names to shoe companies,
and we were being very careful to avoid the animal use of puma, which
didn't occur in our corpus. Little did we know that there is a kind of
helicopter called a Puma. :) In any case, you have to be very careful
in creating this kind of data, and to be honest, given the number of
entities that appear to underlie some of the names in the Spock corpus,
I really don't think there is any way they could guarantee that each
of those hidden people is unique.

If you'd like to mess around with this or see some of the data we've
created, you can find our program called nameconflate.pl here:
http://www.d.umn.edu/~tpederse/tools.html It's the first entry,
nameconflate.pl version 0.16. We've used data like this in previous
name disambiguation/discrimination experiments, and to be honest it
usually works pretty well, but you have to be very very careful about
this issue of ambiguity in your underlying entities.

There are (I think) 82 clusters associated with Peggy Waterfall in the
ground truth data. I suspect that at least some of those clusters in
fact refer to multiple people, though it would be very hard to go
through and manually disambiguate the clusters to verify this (as our
brave Ted has started to do!).

I do agree with Ted. If this "name conflated" data was created without
very careful verification that the underlying cluster identities are
unique, then the test data has a very serious problem indeed.
Given that there are 1101 clusters in the training data, I can't
imagine that Spock was able to find 1101 names that are truly unique
to a single person....

So, when creating name conflated data like this, the first step is to
make sure that A, B, C, D, and E only refer to a single person each.
Then conflate them together. I think this step was missing. :(
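
A minimal sketch of those two steps, with the uniqueness check left as
a caller-supplied predicate, since there is no automatic way to make
that check (`build_conflated_corpus` and `is_unambiguous` are
hypothetical names, not anything from nameconflate.pl):

```python
def build_conflated_corpus(sources, new_name, is_unambiguous):
    """sources maps each original name (A, B, C, ...) to its documents.
    Step 1: verify each name refers to a single person -- the step
    that appears to be missing from the Spock data.
    Step 2: substitute the new ambiguous name, keeping gold labels."""
    for name in sources:
        if not is_unambiguous(name):
            raise ValueError(f"{name!r} may refer to multiple people; "
                             "conflating it would corrupt the gold clusters")
    return [(text.replace(name, new_name), name)
            for name, texts in sources.items()
            for text in texts]

sources = {"Tom Hanks": ["Tom Hanks stars in a film."],
           "Russell Crowe": ["Russell Crowe acts on stage."]}
corpus = build_conflated_corpus(sources, "TomHanksRussellCrowe",
                                is_unambiguous=lambda name: True)  # check stubbed out here
```

The point of raising an error in step 1 rather than skipping the name
quietly is exactly the argument above: if an ambiguous name slips
through, every "gold" cluster built from it is silently wrong.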

Cordially,
Ted
-----------------------------------------------------------------------------------------------------------------------------
2007-07-12 19:48:25

PS There is a lot of name conflated data available via the data links
found here:

http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

This data was used for experiments in papers on name
disambiguation/discrimination problems...

We also manually disambiguated some names in web data, and my god, that
was work. :) But, we felt it was necessary since "name conflation"
in the end is an artificial process and results in a kind of data that
is almost, but not quite, representative of the real problem. You can
find that here:

http://www.d.umn.edu/~tpederse/namedata.html

Sorry, I know this is shameless plugging, but it's all free. You can
take the data and do whatever you want, you can take the code and do
whatever you want, or you can just ignore it all. :)

Cordially,
Ted
------------------------------------------------------------------------------------------------------------------------------
Today 09:03:38

I have bad news about the ground truth clusters in the training
data. I randomly selected one cluster for Harriet Arthur and examined
each of the files. What I found was quite obviously more than one
person.

This Harriet Arthur is :

1) a high school administrator
2) a former runway model
3) a recent honor roll student in Oklahoma

Are these the same person? Absolutely not. In fact, Harriet Arthur is in
this case disguising the real name "Chrystal Benson", which a simple
Google search reveals is very very clearly an ambiguous name.

Spock, you really need to clarify how this data was created, and
why we should have any confidence in it whatsoever. To be honest,
what I see below is painfully sloppy. Chrystal Benson is just clearly
an ambiguous name, and the fact that this was included as one of the
single identity ground truths shows me that the process of selecting
the unique names is just horribly flawed. Please, tell me why I'm
wrong in drawing this conclusion (and do it fast, or extend your July
15 proposal deadline and also the contest finishing date).

Below are the contents of a single Harriet Arthur cluster, and a little
bit about each document. I strongly advise each person who is still
interested in this challenge to try what Ted first did, and then what
I did below. Just pick a cluster at random, and look at the documents.
Does it appear to be about a single person or not? Post your
conclusions here.

-------------------------------------------------------------------------
SCI.8.677067132.html

identified as vice principal at mcdonough high school in a june 2005
blog entry

------------------------------------------------------------------------

SCI.18.862467912.html

vice principal or administrative assistant (unclear) as listed on
westlake high school (maryland) web page

------------------------------------------------------------------------

SCI.6.850369785.html

DEVOTION: Actor-comedian Joe Torry and Harriet Arthur receive blessings
from St. Louis Cardinals baseball legend Ozzie Smith and famed actress
Regina ...

------------------------------------------------------------------------

SCI.20.248901493.html

Comedian and actor Joe Torry and his fiancee, Harriet Arthur, a print
and runway model, are firm believers that long-distance relationships can
work. The two shared a long-distance love affair with each other "on and off"
for seven years after they met at a party in their hometown of St. Louis.

------------------------------------------------------------------------

SCI.15.492335037.html

vice principal in charles county public schools (maryland)

------------------------------------------------------------------------

SCI.2.620002873.html

northwestern oklahoma state university honor roll member (january 2007)

------------------------------------------------------------------------

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
