Greetings all,

I recently participated in the Sense Induction task of SemEval-2, and
found it to be a very interesting and worthwhile experience:
http://www.cs.york.ac.uk/semeval2010_WSI/index.html

The final camera-ready version of the paper that describes that
experience is now available here:

http://www.d.umn.edu/~tpederse/Pubs/pedersen-semeval2-2010.pdf

Duluth-WSI: SenseClusters Applied to the Sense Induction Task of
SemEval-2 (Pedersen) - To appear in the Proceedings of the SemEval 2010
Workshop: the 5th International Workshop on Semantic Evaluations, July
15-16, 2010, Uppsala, Sweden

In the end it turns out that much of this paper is really more about
the evaluation methods of the task than about my participating systems,
although I do give some details of what I attempted (all of which is
available fairly directly from SenseClusters:
http://senseclusters.sourceforge.net).

In any case, I do have some concerns about how we do unsupervised
evaluations, which I've tried to lay out in this paper, and I continue
to think (although it's not explicitly stated there) that the F-score
we have been using for evaluation in SenseClusters is pretty reliable.
I think it is necessary (but not sufficient) that an evaluation measure
for unsupervised sense induction (or discrimination, as we tend to call
it) do the following:

1) Not be fooled by random baselines. A random system should get a
painfully low score. :)
2) Reward systems that predict the correct number of senses (relative
to the gold standard), and penalize those that get the number of
clusters wrong, with the penalty growing as the gap between the actual
and predicted number of senses widens.

Interestingly enough, some of the evaluation measures in this task
failed one or both of these conditions, which is part of what prompted
the focus of this particular paper.

The paired F-score that was used in the SemEval-2 task is fairly
similar to the SenseClusters F-score, and I think both of them meet the
above conditions reasonably well. But I'll be doing a more formal and
comprehensive comparison between them and other possible evaluation
methods in the near future, to try to establish just how well they do,
and maybe formulate a set of necessary and sufficient conditions that
an evaluation measure should meet.
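To make the pairwise idea concrete, here is a minimal sketch of a
paired F-score computation. This is my own illustrative toy, not the
official SemEval scorer, and the instance and cluster names are made
up; the core idea is just that each cluster (or gold class) is reduced
to the set of instance pairs it puts together, and we score induced
pairs against gold pairs:

```python
from itertools import combinations

def pairs(groups):
    """All unordered pairs of instances that share a group."""
    return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}

def paired_f_score(induced, gold):
    """Pairwise precision/recall/F1 of induced clusters against gold classes."""
    p_ind, p_gold = pairs(induced), pairs(gold)
    if not p_ind or not p_gold:
        return 0.0
    common = len(p_ind & p_gold)
    precision = common / len(p_ind)
    recall = common / len(p_gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold standard: two senses over six instances.
gold = [{"i1", "i2", "i3"}, {"i4", "i5", "i6"}]

# A system that recovers the gold partition scores 1.0 ...
print(paired_f_score([{"i1", "i2", "i3"}, {"i4", "i5", "i6"}], gold))

# ... while a degenerate one-cluster-per-instance baseline generates no
# pairs at all, and so scores 0.0 rather than being rewarded by accident.
print(paired_f_score([{"i1"}, {"i2"}, {"i3"}, {"i4"}, {"i5"}, {"i6"}], gold))
```

Note how this behaves with respect to the two conditions above: the
all-singletons system (far too many clusters) and the one-big-cluster
system (far too few) both bleed precision or recall, so getting the
number of senses badly wrong is punished.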

Any other thoughts and ideas about how to evaluate unsupervised sense
induction systems are of course very
welcome.

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

------------------------------------------------------------------------------

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
