One of the data sets we frequently mention in our discussions of name discrimination is the John Smith corpus that was used in the Bagga and Baldwin (1998) study. I didn't realize this data was available online until I happened to bump into it recently, so I thought I would pass it along.
One of the authors (Baldwin) is now with a company called Alias-i, which among other things has a nice collection of online tutorials on many topics of interest in NLP, including clustering. I was looking at the clustering tutorial when I stumbled across the John Smith data:

http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html

The direct link to the data is here:

http://alias-i.com/lingpipe/demos/data/johnSmith.tar.gz

And here is a brief description of that data, as taken from the above web page:

John Smith: The second data collection contains 197 New York Times articles about 35 different people named John Smith. Each article mentions a single John Smith. The clusters with more than one document contain the following numbers of documents: 2, 2, 2, 4, 4, 5, 9, 15, 20, 22, 88. The other 24 clusters contain a single document. This data, johnSmith.tar.gz, is one of the data sets from Bagga and Baldwin's 1998 ACL paper Entity-Based Cross-Document Coreferencing Using the Vector Space Model.

I'd like to convert this data into the Senseval-2 format, so that we could run some experiments with SenseClusters on it. It's an interesting and important data set, and it's nice to see it available.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
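
P.S. As a quick sanity check on the description quoted above, the cluster sizes should add up to the stated 197 articles and 35 people. Here is a minimal Python sketch of that arithmetic; the variable names are just ones I made up for illustration, and the numbers are copied from the description, not read from the tarball itself.

```python
# Cluster sizes quoted in the description of the John Smith data.
multi_doc_clusters = [2, 2, 2, 4, 4, 5, 9, 15, 20, 22, 88]
singleton_clusters = 24  # clusters that contain a single document

# Every cluster corresponds to one distinct John Smith.
total_clusters = len(multi_doc_clusters) + singleton_clusters

# Each singleton cluster contributes exactly one article.
total_documents = sum(multi_doc_clusters) + singleton_clusters

# Fraction of all articles that fall in the largest cluster.
largest_share = max(multi_doc_clusters) / total_documents

print(total_clusters, total_documents, round(largest_share, 3))
# prints: 35 197 0.447
```

So the counts are consistent with the description, and the distribution is quite skewed: one John Smith accounts for roughly 45% of the articles.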
