One of the data sets we frequently mention in our discussions of name
discrimination is the John Smith corpus that was used in the Bagga and
Baldwin (1998) study. Until recently I didn't realize this data was
available online, but I happened to come across it and thought I would
pass it along.

One of the authors (Baldwin) is now with a company called Alias-i,
which among other things has a nice collection of online tutorials on
many topics of interest in NLP, including clustering. I was looking at
the clustering tutorial when I stumbled across the John Smith data...

http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html

The direct link to the data is here...

http://alias-i.com/lingpipe/demos/data/johnSmith.tar.gz

And here is a brief description of that data as taken from the above web page...

John Smith: The second data collection contains 197 New York Times
articles about 35 different people named John Smith. Each article
mentions a single John Smith. The clusters with more than one document
contain the following numbers of documents: 2, 2, 2, 4, 4, 5, 9, 15,
20, 22, 88. The other 24 clusters contain a single document. This
data, johnSmith.tar.gz, is one of the data sets from Bagga and
Baldwin's 1998 ACL paper Entity-Based Cross-Document Coreferencing
Using the Vector Space Model.
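
As a quick sanity check on that description, the cluster sizes can be
added up to confirm they match the stated totals of 197 articles and 35
people. This is only a small sketch in Python; the variable names are
mine, and the numbers are copied from the description above rather than
read from the data itself.

```python
# Sizes of the clusters with more than one document, copied from the
# description on the tutorial page (not read from johnSmith.tar.gz).
multi_doc_sizes = [2, 2, 2, 4, 4, 5, 9, 15, 20, 22, 88]

# The description says the other 24 clusters hold one document each.
singleton_clusters = 24

total_clusters = len(multi_doc_sizes) + singleton_clusters
total_documents = sum(multi_doc_sizes) + singleton_clusters

# Share of all articles that belong to the largest cluster.
largest_share = max(multi_doc_sizes) / total_documents

print(total_clusters)           # 35
print(total_documents)          # 197
print(f"{largest_share:.1%}")   # 44.7%
```

So the numbers are consistent, and the distribution is quite skewed: a
single John Smith accounts for 88 of the 197 articles.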

I'd like to convert this data into the Senseval-2 format, so that we
could run some experiments with SenseClusters on it. It's an
interesting and important data set, and it's nice to see it available.

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
