Users of SenseClusters may wonder where they can find data with which to experiment.
Over the course of the last few years we have created quite a bit of data, and I've updated our name discrimination data page to include links to most of it. In addition, I've provided very brief summaries of the content of each collection.

http://www.d.umn.edu/~tpederse/namedata.html

This primarily consists of data we have created by conflating names together to create a new ambiguity, such as turning all occurrences of "Tony Blair" and "Bill Clinton" into the now ambiguous name "TonyBlairBillClinton". The objective with this data is to take the occurrences of this newly ambiguous name and see if you can discover who the underlying entities/identities are via SenseClusters. This page also includes the "Kulkarni name corpus", a collection in which ambiguous names as found on the web have been manually disambiguated.

In addition, please remember that SenseClusters can also be applied to text where word senses have been manually disambiguated. In this case the task of SenseClusters is to cluster the occurrences of a word based on the sense in which it was used. Any of the Senseval-2 formatted data found at the link below can easily be used with SenseClusters.

http://www.d.umn.edu/~tpederse/data.html

Finally, please note that you can use SenseClusters on email data, where there is not a single target word or name you are interested in, but rather you seek to categorize short messages by topic. There is some email data on the name data page, and we also have a subset of the Enron email corpus available, already categorized by topic, which you could use as input to SenseClusters. That data is available here:

http://www.d.umn.edu/~tpederse/enron.html

Of course you can use SenseClusters with a wide range of data over a much broader range of tasks than described here, but the data we provide here has the advantage of having correct answers associated with it.
This means you can evaluate your results and even compare them to our published results, should you wish to do that.

Finally, if you have any data that you have used with SenseClusters and you'd like to make it available, please do let us know. We'd be happy to include a link to your data or even host it on our server.

Please let us know if you have any questions or comments about this data!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
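
For readers who want to build similar test data from their own text, the name conflation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used to create the collections linked above. The function name `conflate_names` and the sample sentence are hypothetical, and the output is plain text rather than Senseval-2 format.

```python
import re


def conflate_names(text, names):
    """Replace every occurrence of each name with one conflated pseudo-name.

    Returns the rewritten text plus the original names in order of
    occurrence, which can serve as the answer key for evaluation.
    (Hypothetical helper, written for illustration only.)
    """
    # "Tony Blair" + "Bill Clinton" -> "TonyBlairBillClinton"
    pseudo = "".join(name.replace(" ", "") for name in names)

    # Try longer names first so a shorter name cannot pre-empt a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    gold = []

    def swap(match):
        # Remember which real name stood here before hiding it.
        gold.append(match.group(0))
        return pseudo

    return pattern.sub(swap, text), gold


text = "Tony Blair met Bill Clinton. Later Bill Clinton spoke."
conflated, gold = conflate_names(text, ["Tony Blair", "Bill Clinton"])
print(conflated)
print(gold)
```

Keeping the list of original names alongside the conflated text is what makes evaluation possible: each occurrence of the pseudo-name has a known underlying identity against which discovered clusters can be compared.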
