Greetings all,

You may remember that I have mentioned the Spock Challenge, which
deals with named entity disambiguation. Much to my dismay, Spock has
simply pulled the plug on the challenge without any explanation, and
has removed all evidence of the contest from their web site. If you
visit http://challenge.spock.com you'll see what I mean.

This strikes me as a very low-class move. While the contest was in
some ways troubled, I think those who were participating did so in
good faith, and for Spock not only to end the contest without
explanation but also to remove all evidence of what the contest ever
was strikes me as a bit of a cover-up. Simply put, I think Spock
messed this one up. I don't think they created the ground truth data
for their contest correctly, and I think in the end they realized they
had a flawed event on their hands, and could think of no better
solution than to simply end it without announcement and without
explanation.

I went out and found the Google cache of some of the discussion that
took place on the Challenge bulletin board (which has now vanished),
and it brings to light what I think the fundamental problems with the
task were. It took some work on the part of a number of participants
to uncover all this, and I don't want to see that information simply
vanish. Three separate threads are presented below
(between the ====== lines).

For the record, I checked the leader board and discussion list
sometime in late August. There was no indication that the challenge
was about to be suspended. I found the entire site down with a short
message saying "The Challenge is Closed" on Sept 3, 2007.

Yours in frustration,
Ted
=========================================================

#1 2007-07-12 10:29:35

tsandler @    xxxxxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 5

serious concerns about the quality of the ground-truth file

Hi Spock-Challengers,

I have looked at just two clusters in the train_groundTruth file and
both of them are not clean in that they most certainly contain
mentions of different individuals.  For instance, the first "Peggy
Waterfall" cluster contains the files:

  1. SCI.4.966147384.html
  2. SCI.3.514944295.html
  3. SCI.12.15693106.html
  4. SCI.8.518726554.html
  5. SCI.8.885764493.html
  6. SCI.5.861103698.html
  7. SCI.19.194519667.html

Looking at these files, I see that the first refers to an individual
who lived from Sept. 1884 to July 1987 who was born and died in
Philadelphia, PA.  The second refers to an individual related to
someone in Scranton PA so this could be consistent.  However, the
third refers to a Native American whose photo was taken in
Washington state.  The fourth refers to a person involved in an estate
dispute filed in the state of North Dakota in 1989, two years after
the first Peggy Waterfall died.  The fifth looks like it refers to
someone born in Nov. 1885 in New York City.  The sixth looks like it
refers to an individual who was married in Newark, NJ in 1892 when
Peggy Waterfall #'s 1 and 5 would have been 8 and 7 years old resp.
And the last individual seems to be a maritime photographer in Cornwall
(England?) whose artwork dates to 1929.

Now, I have a very hard time believing that all these people are the
same individual---a Native American in Washington and a maritime
photographer in Cornwall???  Certainly, there is no evidence to
support this and I found similar problems with the other cluster I
looked at, the second to last cluster with name "Ann Casler."  It
contains the following files:

1. SCI.16.9022486.html
2. SCI.6.666995392.html
3. SCI.6.703795586.html
4. SCI.10.942642735.html
5. SCI.9.224570072.html
6. SCI.14.149571968.html

The first file refers to a student in Mrs. Tsuchiyama's sixth grade
class while the second refers to an Ann Lynn Casler who was wed to a
Daniel Schneider at the Marriott, Ft. Wayne in 2006.  The third and
sixth files refer to an "Ann Casler" who is somehow involved with
virus software while the fifth refers to the friend of an 18 year-old
who according to her "MySpace" page, "only rolls with the best" (see
below).  Judging from the data, these also look to be different "Ann
Caslers."

I am wondering if someone could look into the quality of the
ground-truth files for both the training and test set because the
number of errors I've found after just a small amount of looking is
troubling to the point that I have mixed feelings about participating
in the challenge.  Below I've provided the contents of the files
listed above, which helped me reach these conclusions.

Thanks for any feedback on this issue,
Ted Sandler

-------------------------------- PEGGY WATERFALL #1
-----------------------------

#SCI.4.966147384.html
(Mary) Peggy Waterfall b. Sept. 14 1884, Philadelphia, PA d. July 14, 1987

#SCI.3.514944295.html
03/14/2007
Printer-friendly Helen Shannon Walsh, 87, formerly of Tall Trees
Apartments, died Monday in Paramus, N.J. She was the widow of William
Walsh, who died in 1957.

Born in Peckville, she was the daughter of the late Thomas A. and
Peggy Coleman Shannon. She was a longtime member of Immaculate
Conception Church, and lived in the Scranton area for 86 years before
moving in August. She was employed by the state for many years.

Surviving are a daughter, Marian Walsh McGrail and husband, Brian,
Ho-Ho-Kus, N.J.; two grandsons, Brendan Walsh McGrail and William
Colman McGrail; a sister, Elizabeth Banick and husband, John A.,
Vienna Va.; nieces and nephews and great-nieces and great-nephews. She
was also preceded in death by four sisters, Margaret Gilroy, Peggy
Waterfall, Mary Padden and Rosemary Burke; and a brother, Joseph
Shannon.

The funeral will be Friday at 12:30 p.m. from the August J. Haas
Funeral Home Inc., 202 Pittston Ave., with Mass at 1 in Immaculate
Conception Church, to be celebrated by the Rev. Richard Fox,
pastor. Interment, St. Rose Cemetery, Carbondale. Friends may call
Friday, 10 a.m. to noon. Memorials may go to Maryknoll Fathers and
Brothers, Box 301, Maryknoll, NY 10545. For directions or to leave an
online condolence, visit www.augusthaasfuneralhome.com.

#SCI.12.15693106.html
Titles:    San Poil chief Jim James, Howard Ball and others, Washington, 1939
Creators: Gamble, Wallace, b. 1901
Subjects: James, Jim, 1886-1971
Ball, Howard
Francis family
Herman family
Group portraits--Washington (State)
Sanpoil Indians--Clothing & dress
Notes: Group of men women & children pose next to porch; women wear
scarves on heads, dresses, sweaters; men wear workboots, flannel
jackets & sweaters.

Note from unidentified source: Chief James, Howard Ball and group of
Colvilles, 1939.  L-R: Front row: Johnny George, Jimmy Waterfall,
Lester Herman, Herman Francis, Johnny Francis, Millie James, Andrew
Francis (infant), Nettie Herman Francis (wife of Johnny Francis)
holding Andrew.  L-R: Back row: Peggy Waterfall, Molly Herman, Mary
Herman, Squrshanatkst "Aakat" Francis (mother of Johnny Francis and
Aunt of Pete George), Sally Iswald (Dave Condon's Grandmother), Howard
Ball. Per Lester Herman & Pete George Sr., 5./1994 Object Type:
Photographs Date: 1939 Location Depicted: United States--Washington
(State)

#SCI.8.518726554.html
Filed Oct. 24, 1989 IN THE SUPREME COURT STATE OF NORTH DAKOTA In the
Matter of the Estate of Pearl J. Jorstad, Deceased Maynard Jorstad and
Marvin Jorstad, Petitioners and Appellees v.  Gladys Yates, Peggy
Waterfall, and Mavis Hensley, individually, and Gladys Yates and Peggy
Waterfall, as co-guardians and co-conservators of the Estate of Martin
J. Jorstad, Jr., an incapacitated person, and in their representative
capacity on behalf of Martin J. Jorstad, Jr. Respondents and
Appellants and Estate of Pearl J. Jorstad, Silas Langager, Personal
Representative Respondent and Appellee

Civil No. 890035 Appeal from the County Court for Williams County,
Northwest Judicial District, the Honorable Gordon C. Thompson, Judge.
AFFIRMED.

Opinion of the Court by Levine, Justice.  McIntee & Whisenand, PC,
P.O. Box 1307, Williston, ND 58802-1307, for petitioners and
appellees; argued by Kathleen E. Key-Imes.  Bjella, Neff, Rathert,
Wahl & Eiken, PC, 111 East Broadway, Drawer 1526, Williston, ND
58802-1526, for respondents and appellants; argued by Vern C. Neff.

#SCI.8.885764493.html
Margret,Peggy WATERFALL
Country: USA
State: New York City
Date Submitted: May 5, 2003
Notes: Born Nov 25,1885

#SCI.5.861103698.html
If you want further information on one or more of the below Newark
Marriages, highlight the entire line, right click in the highlighted
area, select 'copy' and then click here.
GROGAN, James 1892 Peggy WATERFALL

#SCI.19.194519667.html
Peggy Waterfall (1929)
Maritime Photography
Looe Harbour, Cornwall
Newquay, Cornwall
Polperro, Cornwall


---------------------------- ANN CASLER #1 ----------------------------

#SCI.16.9022486.html
Mrs.  Tsuchiyama's sixth grade class.  This drawing is done by: Ty
Otsuka, Monica Orcine, Steven Paula, Ann Casler-Sidotti, and Michael
Gilespie.

#SCI.6.666995392.html
Mr. and Mrs. Daniel Schneider (Ann Casler) Daniel Lee Schneider and
Ann Lynn Casler were wed at 5:30 p.m. May 20, 2006, at the Marriott,
Ft. Wayne, by the Rev. Abernathy.

#SCI.6.703795586.html
[EMAIL PROTECTED] (Ann Casler)
Thu Sep 15 00:27:47 2005
From: "Ann Casler" <[EMAIL PROTECTED]>
Reply-To: "Ann Casler" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Wed, 14 Sep 2005 23:41:47 -0500

#SCI.10.942642735.html
Eldora R. Simmons, dietary supervisor Eldora R. "Preenie" Simmons, 82,
of Milton, died Wednesday, June 1, 2005, at Harbor Healthcare and
Rehabilitation Center in Lewes.  She was born in Milton, the daughter
of the late Roland and Lina Walls Moore. She was a dietary supervisor
for Beebe Medical Center, retiring in 1985. She was a member of the
Milton Wesleyan Church and the Milton Senior Center.  She was preceded
in death by her husband, Paul E. Simmons, in 1989.  She is survived by
two daughters, Kay Casler and her husband, James, of Wernersville,
Pa. and Joan Durham and her husband, Chester Kopelen, of Milton; three
grandchildren, James P. Casler, Michael L. Casler and Jeffrey
A. Durham; and two great-grandchildren, Brett and Ann Casler.
Services will be held at 1 p.m., Saturday, June 4, in the chapel of
Short Funeral Services, 416 Federal St., Milton, where friends may
call after noon. Burial will be in Odd Fellows Cemetery, Milton.
Contributions are suggested to Milton Wesleyan Church, 411 Union
St. Milton DE 19969 or Compassionate Care Hospice, 5610 Kirkwood Hwy.,
Wilmington DE 19808.

#SCI.9.224570072.html
MySpace Page for "BUTtERFLYxKiSSES"
"JUST DOiN MAH THiNG. AiNT NOBODY GONNA FUCK iT UP! ?"
Female
18 years old
WASHiNGTON, NEW HAMPSHIRE
United States
I Only Roll With The Best: Kristina- your amazing. i love you
babe. so many memories together & not one i can forget. get us
together and we are crazy.  ...  Alisha Church, Kathleen Bergeron,
Dave Gauthier, Ann Casler, Corey Rivard, Chris Green, ... and lotssss
more! let me kno and ill add you!

#SCI.14.149571968.html
Virus_Discussion_List
[1759]  Charlie Hightower  MS Office 2003 Pro $69.95 XP
[1768]  Ann Fontenot       MS Office XP Pro $49.95 AutoCAD
[1769]  Ann Casler         Out of this WoRLD $aving$ on all Macromedia titles

Offline

#2 2007-07-12 19:17:31

duluthted@ xxxxxxxxxxxxxxxxxxxxx
    Member
    Registered: 1969-12-31
    Posts: 22

Re: serious concerns about the quality of the ground-truth file

Very nice work Ted!

I'm going to speculate here wildly, but I'll bet I know what the
problem is. Read on. smile

In the FAQ we find the following explanation of the data collection process....

------------------------------------------------------------------------------
How did you collect the data?

Here is a brief description
We crawl the web to find documents about several people, say A, B, C, D, E
We filter bad documents (e.g. spam, non-English...etc)
We map the original people's names into new names. For example, A->F,
B->F, C->F, D->G, E->G.
We substitute new names for any instance of the original people's
names in the documents.
-------------------------------------------------------------------------------

Simply put, how does Spock know that A, B, C, D and E are not
themselves ambiguous names?

It is sometimes surprisingly hard to find a name that gives you more
than a few hits on the web yet is not shared by multiple people.
That's what makes this challenge important of course, but it's also
what makes creating data this way very very hard.

So my guess is that whatever names Peggy Waterfall is disguising are
in fact ambiguous names themselves.

I understand that Spock can't tell us what names Peggy Waterfall is
disguising. But, how do they know that those disguised names are not
themselves ambiguous?

Cordially,
Ted (heh heh, our own name ambiguity)

Offline

#3 2007-07-12 19:41:14

duluthted@ xxxxxxxxxxxxxxxxxxxxxxxxxx
    Member
    Registered: 1969-12-31
    Posts: 22

Re: serious concerns about the quality of the ground-truth file

I should add that we've had some considerable experience trying to
create this kind of data...our approach is almost exactly as described
in the Spock FAQ, and to be honest, it's a lot of work to find names
that refer to just one entity. We call this process name conflation,
btw, because we take multiple names that we hope are not ambiguous and
substitute for them a name that is now ambiguous.

So you take all the occurrences of Tom Hanks and all the occurrences
of Russell Crowe and replace those with some name like
"TomHanksRussellCrowe", and then try to figure out which is which via
clustering, etc. Now, in the case of rather famous actors and
politicians and so forth, unless their name is very common they tend
to dominate many of the hits you find...and in fact we have created
quite a bit of this kind of data using newswire text, where "Tom
Hanks" tends to only refer to the actor. However, if you go out on the
web you start finding lots of Tom Hanks and things get messier.
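The conflation step described above can be sketched in a few lines. This is only a rough Python illustration of the idea (our actual tool handles many more details, such as name variants); the mapping here just reuses the example names from this post:

```python
import re

# Map each original (hopefully unambiguous) name to the new,
# deliberately ambiguous conflated name, as in the example above.
name_map = {
    "Tom Hanks": "TomHanksRussellCrowe",
    "Russell Crowe": "TomHanksRussellCrowe",
}

def conflate(text, mapping):
    """Substitute each original name with its conflated replacement,
    matching occurrences case-insensitively."""
    for original, new in mapping.items():
        text = re.sub(re.escape(original), new, text, flags=re.IGNORECASE)
    return text

print(conflate("Tom Hanks and Russell Crowe never met.", name_map))
# -> TomHanksRussellCrowe and TomHanksRussellCrowe never met.
```

The disambiguation task is then to recover, from context alone, which occurrences of "TomHanksRussellCrowe" were originally which name.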

We've had many amusing examples of being fooled by this....my personal
favorite is the case of Puma. We were I think conflating Adidas and
Puma, thinking that both would be unique as names to shoe companies,
and we were being very careful to avoid the animal use of puma, which
didn't occur in our corpus. Little did we know that there is a kind of
helicopter called a Puma. smile In any case, you have to be very
careful in creating this kind of data, and to be honest, given the
number of entities that appear to underlie some of the names in the
Spock corpus, I really don't think there is any way they could
guarantee that each of those hidden people is unique.

If you'd like to mess around with this or see some of the data we've
created, you can find our program called nameconflate.pl here :
http://www.d.umn.edu/~tpederse/tools.html It's the first entry,
nameconflate.pl version 0.16. We've used data like this in previous
name disambiguation/discrimination experiments, and to be honest it
usually works pretty well, but you have to be very very careful about
this issue of ambiguity in your underlying entities.

There are (I think) 82 clusters associated with Peggy Waterfall in the
ground truth data. I suspect that at least some of those clusters in
fact refer to multiple people, though it would be very hard to go
through and manually disambiguate the clusters to verify this
suspicion (as our brave Ted has started to do!).

I do agree with Ted. If this "name conflated" data was created without
very careful verification that the underlying cluster identities are
unique, then the test data really is a very serious problem indeed.
Given that there are 1101 clusters in the training data, I can't
imagine that Spock was able to find 1101 names that are truly unique
to a single person....

So, when creating name conflated data like this, the first step is to
make sure that A, B, C, D, and E only refer to a single person each.
Then conflate them together. I think this step was missing. sad

Cordially,
Ted

Offline

#4 2007-07-12 19:48:25

duluthted@ xxxxxxxxxxxxxxxxxxxxxxxxx
    Member
    Registered: 1969-12-31
    Posts: 22

Re: serious concerns about the quality of the ground-truth file

PS There is a lot of name-conflated data available via the links
found here:

http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

This data was used for experiments in papers on name
disambiguation/discrimination problems...

We also manually disambiguated some names in web data, and my god that
was work. smile But, we felt it was necessary since "name conflation"
in the end is an artificial process and results in a kind of data that
is almost, but not quite, representative of the real problem. You can
find that here:

http://www.d.umn.edu/~tpederse/namedata.html

Sorry, I know this is shameless plugging, but it's all free. You can
take the data and do whatever you want, you can take the code and do
whatever you want, or you can just ignore it all. smile

Cordially,
Ted

Offline

#5 2007-07-13 07:26:05

pedro.t@ xxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 8

Re: serious concerns about the quality of the ground-truth file

kudos for both Teds! :p Great effort and thanks for sharing the data.

Although I didn't suspect Spock's ground truth to be wrong, I agree
that research is difficult in this area mostly due to the costs of
producing a correct test set. I hope this partially explains the
current low F-measure I'm getting on the training data wink
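(As an aside, one common way such a clustering F-measure is computed is over document pairs; whether this matches Spock's official scoring formula is an assumption on my part, so treat this sketch accordingly:)

```python
from itertools import combinations

def pairwise_f(pred_clusters, true_clusters):
    """Pairwise F-measure between two clusterings: a document pair is
    a true positive if both clusterings place the two documents in
    the same cluster.  (One common definition; Spock's exact metric
    is not specified in this thread.)"""
    def pairs(clusters):
        # All unordered within-cluster document pairs.
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}
    pred, true = pairs(pred_clusters), pairs(true_clusters)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Note that under a pairwise metric, errors in the ground truth clusters directly distort both precision and recall, which is exactly why the noise discussed above matters.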

I even wonder if Spock has an unsupervised algorithm to run on the
data. Do you think they could provide us a benchmark F-measure? (as
our algorithms improve, they could use people's prediction to help fix
the ground truth...)


--
Pedro

Offline

===================================================================
#1 2007-07-14 09:03:38

duluthted@ xxxxxxxxxxxxxxxxxxxxxx
    Member
    Registered: 1969-12-31
    Posts: 22

ground truth file is clearly flawed

I have bad news about the ground truth data as used in the training
data. I randomly selected one cluster for Harriet Arthur and examined
each of the files. What I found was quite obviously more than one
person.

This Harriet Arthur is :

1) a high school administrator
2) a former runway model
3) a recent honor roll student in Oklahoma

Are these the same person? Absolutely not. In fact, Harriet Arthur is in
this case disguising the real name "Chrystal Benson", which a simple
Google search reveals is very very clearly an ambiguous name.

Spock, you really need to clarify how this data was created, and
why we should have any confidence in it whatsoever. To be honest,
what I see below is painfully sloppy. Chrystal Benson is just clearly
an ambiguous name, and the fact that this was included as one of the
single identity ground truths shows me that the process of selecting
the unique names is just horribly flawed. Please, tell me why I'm
wrong in drawing this conclusion (and do it fast, or extend your July
15 proposal deadline and also the contest finishing date).

Below are the contents of a single Harriet Arthur cluster, and a little
bit about each document. I strongly advise each person who is still
interested in this challenge to try what Ted first did, and then what
I did below. Just pick a cluster at random, and look at the documents.
Does it appear to be about a single person or not? Post your
conclusions here.

-------------------------------------------------------------------------
SCI.8.677067132.html

identified as vice principal at mcdonough high school in a june 2005
blog entry

------------------------------------------------------------------------

SCI.18.862467912.html

vice principal or administrative assistant (unclear) as listed on
westlake high school (maryland) web page

------------------------------------------------------------------------

SCI.6.850369785.html

DEVOTION: Actor-comedian Joe Torry and Harriet Arthur receive blessings
from St. Louis Cardinals baseball legend Ozzie Smith and famed actress
Regina ...

------------------------------------------------------------------------

SCI.20.248901493.html

Comedian and actor Joe Torry and his fiancee, Harriet Arthur, a print
and runway model, are firm believers that long-distance relationships can
work. The two shared a long-distance love affair with each other "on and off"
for seven years after they met at a party in their hometown of St. Louis.

------------------------------------------------------------------------

SCI.15.492335037.html

vice principal in charles county public schools (maryland)

------------------------------------------------------------------------

SCI.2.620002873.html

northwestern oklahoma state university honor roll member (january 2007)

------------------------------------------------------------------------

Cordially,
Ted

Offline

#2 2007-07-15 04:15:10

octavian.voicu@ xxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 2

Re: ground truth file is clearly flawed

I guess that's why the selection of the finalists is not based only on
F-scores, but also on the proposals. It is very likely that some
submissions will do a better job at clustering documents than the
ground_truth. I guess the finalists will be those using very scalable
algorithms that "seem" to work...

Offline

#3 2007-07-24 06:06:21

[EMAIL PROTECTED]
    New member
    Registered: 1969-12-31
    Posts: 5

Re: ground truth file is clearly flawed

    [EMAIL PROTECTED] wrote:

    I guess that's why the selection of the finalists is not based
only on F-scores, but also on the proposals. It is very likely that
some submissions will do a better job at clustering documents than the
ground_truth. I guess the finalists will be those using very scalable
algorithms that "seem" to work...

I sure hope that's *not* what they're doing. Since we're using
ground_truth to score our results, something that "does a better job"
may in fact get a really low F-score, causing the participant to not
even bother submitting their results. That's why it's immensely
important that ground_truth *is* the truth. After all, garbage in,
garbage out.

Offline

#4 2007-07-25 14:44:50

emailjdd@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 3

Re: ground truth file is clearly flawed

I have noticed some errors in the ground truth as well. For example,
this cluster under the name "ellen francisco":

SCI.17.907931216.html   (company listing, Juneau Alaska, alaskacategory.com)
SCI.20.420760323.html   (Rhode Island Girls Basketball Box Score)
SCI.1.57647743.html       (KD Lady of the month)
# ellen francisco

The first document and second document seem to refer to different
people, unless a person with an address in Juneau Alaska also plays
girls basketball in Rhode Island. I can't really tell if the ellen
francisco in the third document refers to either of the first two or
to a third person.

So, I tried this with a different cluster under the name olivier
portet. There are only two documents in this cluster.

SCI.13.154368395.html  (symphonic band director)
SCI.8.14553016.html      (amazon.com author with book title listed
about how  to save money on your insurance)
# olivier portet

This seems to be two different people to me, but maybe it's possible
they could be the same person (??).

Offline

#5 2007-07-25 19:07:45

chrisspen@ xxxxxxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 5

Re: ground truth file is clearly flawed

emailjdd, I can't confirm your results. As far as I can tell, those
three documents all contain the name "ellen francisco", as per Spock's
specification. I don't understand how they qualify as "errors". Those
files also contain plenty of other names and information, which is the
noise that is our task to filter out. This is no trivial problem.

Offline

#6 2007-07-26 17:04:11

[EMAIL PROTECTED]
    New member
    Registered: 1969-12-31
    Posts: 3

Re: ground truth file is clearly flawed

Spock's specification states that each cluster should contain only one
real world referent (as opposed to one name string):

    The challenge is to partition all the documents relevant to a
target name by their referent.

    Consider the following example:

      # generated using algorithm X
      file1 file5 file8 file9 # cluster name: James Kirk
      file6 file7 file1 file2 # cluster name: James Kirk
      file11 file17 # cluster name: William Riker

    This file says that files 1, 5, 8, and 9 all refer to the same
person. Files 6, 7, 1, and 2 refer to someone else (who happens to
have the same name).

The error with the previously mentioned "ellen francisco" cluster is
that the three documents in the relevant cluster do NOT (appear to)
refer to the same real world person entity with the name "ellen
francisco". The challenge task as I understand it is to cluster
documents that refer to the same real world person entity (as opposed
to clustering documents that contain the same name string, which is
much easier to do). Using Spock's definition of the task, the ground
truth for the training set appears to contain errors which will cause
major headaches for anyone planning to use supervised machine learning
for this task (not to mention problems in evaluating the accuracy of
system output regardless of the method used to build the system).
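For what it's worth, checking that a cluster file in the format quoted above really is a partition is easy to do mechanically. A rough sketch (the parsing details are assumptions based only on the example in the specification):

```python
def parse_clusters(lines):
    """Parse a cluster file in the quoted format: each non-comment
    line lists the documents of one cluster, optionally followed by
    a '# cluster name: ...' comment."""
    clusters = []
    for line in lines:
        body = line.split("#", 1)[0].strip()  # drop trailing comment
        if body:
            clusters.append(body.split())
    return clusters

def overlapping_docs(clusters):
    """Return documents assigned to more than one cluster; for a
    true partition of the documents this set should be empty."""
    seen, dups = set(), set()
    for cluster in clusters:
        for doc in cluster:
            if doc in seen:
                dups.add(doc)
            seen.add(doc)
    return dups
```

(Incidentally, in the example exactly as quoted, file1 appears in both James Kirk clusters, so this check would flag it.)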

Offline

#7 2007-08-01 11:44:25

Spock
    Administrator
    Registered: 2007-03-29
    Posts: 33

Re: ground truth file is clearly flawed

I apologize for the noise in the data.
I agree with Ted that this cluster is bad and hence have removed the
cluster from the ground truth.
Please download the new ground truth from the Download page
(http://challenge.spock.com/download).
We believe that this is not a prevalent case but we will fix them as
they are discovered.
We plan to review the final scoring set more carefully for the final round.

Offline

#8 2007-08-07 07:50:30

duluthted@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    Member
    Registered: 1969-12-31
    Posts: 22

Re: ground truth file is clearly flawed

I think removing "flawed" clusters from the training data as they are
discovered is great, but seems to put the burden of quality control on
participants rather than the organizers (in that you'll remove bad
clusters as they are reported, but don't appear to be taking any steps
to either re-check the training data or (better yet I think) explain
how the training data was created in the first place).

The FAQ tells us this.... How did you collect the data?

Here is a brief description
We crawl the web to find documents about several people, say A, B, C, D, E
We filter bad documents (e.g. spam, non-English...etc)
We map the original people's names into new names. For example, A->F,
B->F, C->F, D->G, E->G.
We substitute new names for any instance of the original people's
names in the documents.

The crucial point that is not discussed is "how did you decide who the
several people were (A, B, C, D, and E above) that you crawled the web
for?"

It's clear that you selected names that you felt were not ambiguous,
and then conflated together all the occurrences of a few of these
different names to create your new ambiguous names. For example, one
of the "unambiguous" names that you used for the Harriet Arthur
cluster was Chrystal Benson, which turns out to be very ambiguous,
thus causing the noise in the training data.

So, I think what would be helpful is to know how the unambiguous names
were selected when you generated the training data. Why was Chrystal
Benson thought to be unambiguous? There were a total of 1101 names
used to create the 44 clusters in the training data.
How do you know that each of those 1101 names was not itself an
ambiguous name (one that refers to multiple people, like Chrystal Benson)?

I think revealing this information will go a long way to either
inspiring confidence that the noise in the ground truth file is really
quite rare, or allowing a more systematic correction to be made to
the data.

Finally, I should point out that identifying an unambiguous name from
web search results is exactly the problem the Spock Challenge
seeks to solve, and so if Spock had a really good way to identify
unambiguous names they would not need to have the challenge (and this
is why I'm a little concerned about how the training data was created,
because to create it very well automatically would require that you
already know how to solve the problem). So, unless the answer is that
each of the 1101 names was manually determined to be unambiguous via
scanning Google Search results or something like that, it's not at all
clear to me how you could guarantee that the names you used to create
the ambiguous clusters are not themselves ambiguous.

Cordially,
Ted

Offline

==============================================================

#1 2007-07-30 16:50:22

Spock
    Administrator
    Registered: 2007-03-29
    Posts: 33

Message from Spock Team

Spock is fully committed to this exciting competition.  We have a
significant number of submissions and contestants have already started
working on their solutions.

We believe the Spock Challenge has generated such a high level of
interest because the problems we have proposed are not only
interesting but also widely applicable to many areas of research &
development.  When we put together the Spock Challenge, the goal was
to find a way to work with the academic and engineering community who
are working on or looking to work with similar types of problems.

Our intention is not to take credit for our contestants' hard work.
Rather, it is to advance the field of entity resolution by fostering
cooperation, through some friendly competition.  We believe that
endeavors like the Spock Challenge help catalyze and advance
significant breakthroughs in a number of applications that require
entity resolution, including ours.

We have put together the leader board, and look forward to individuals
and teams posting results and monitoring their success.

Thank you once again for your interest in the Spock Challenge.

Best,
The Spock Team

Offline

#2 2007-07-31 08:34:40

emailjdd@ xxxxxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 3

Re: Message from Spock Team

Dear Spock Challenge Administrator:

A couple of the threads on the discussion forum have pointed out some
issues regarding the quality of the ground truth
training data. Are you able to comment on any of the points that were
raised regarding the quality/accuracy of the ground truth training
data, especially the fact that several of the document clusters in the
ground truth training set that have been manually inspected by various
posters to the discussion forum appear to contain documents that refer
to two or more real world person referents (as opposed to just a
single real world person referent)? Could you provide any information
about how  the ground truth training data was validated or inspected
for quality control? Knowing more detail(s) about this issue could be
very useful to system builders when deciding how to best use the
ground truth training data provided by Spock for training of (fully or
partially) supervised systems and/or for evaluation of system output
quality/accuracy. Thank you for sponsoring this contest and for any
information you can provide about the issue raised above.

Offline

#3 2007-08-01 11:45:44

Spock
    Administrator
    Registered: 2007-03-29
    Posts: 33

Re: Message from Spock Team

We apologize for the noise in the data.
We agree that the cluster is bad and hence have removed the cluster
from the ground truth.
Please download the new ground truth from the Download page
(http://challenge.spock.com/download).
We believe that this is not a prevalent case but we will fix them as
they are discovered.
We plan to review the final scoring set more carefully for the final round.

Offline

#4 2007-08-02 16:05:27

tsandler@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    New member
    Registered: 1969-12-31
    Posts: 5

Re: Message from Spock Team

Hello Spock Administrator(s), thanks for your reply.  I am wondering
if you have inspected the clusters that I thought were problematic in
my post "serious concerns about the quality of the ground-truth file"
and if you have reached any conclusions regarding the quality of these
clusters.  Thanks again.  It's nice getting feedback on this issue.

Offline

==================================================================



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
