Greetings again,

On 4/27/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> 4) Regarding the train/test split. We used the same train/test split as
> defined by task 17 organizers. The reason is to be able to compare with
> other supervised/unsupervised systems that participate in the lexical
> sample subtask of task 17. We expected that the split would be random,
> but it seems that it's not the case. We are not happy with this, but in
> any case, note that all our induction systems (as well as the supervised
> ones) are affected by the different distributions in train/test. In the
> following days we plan to do a random split, and check what influence it
has in the ranking. Note that this way we lose comparability with
> regard to supervised systems.

This, I think, is my central concern about the supervised measure in
the sense induction task. It seems tempting to compare the supervised
sense induction measure to the scores of supervised systems in the
lexical sample task (#17), or to unsupervised measures that also use
MFS as a baseline, such as the unsupervised f-score or the
SenseClusters score.

However, I think this results in a flawed comparison.

While MFS produces the same value across the supervised sense
induction measure, the traditional supervised measure from task 17,
the unsupervised f-score, the SenseClusters score, and so on, other
important baselines and cases do not. For example, the score generated
by random2 "moves around" as you go from one evaluation measure to
another, as do the scores of the 1 cluster - 1 instance example I
mentioned previously.

This shows, at least intuitively, that we are dealing with measures
that are quite different from each other, and that should not be
compared directly. Each of these measures may make interesting points
on its own, but I do not think they can be compared directly or even
indirectly.

For example, if you entered random2 in the English lexical sample task
(#17), it would score about .28. In the supervised sense induction
task it scores .78. Somehow this difference seems important. Now, it
is not enough to say that because random2 behaves differently these
are different measures that can't be compared... but it is what got me
started on the thought process that follows below...

So, what is the problem with comparing the supervised sense induction
measure with the f-score or the SenseClusters score? I think the
mapping step in the supervised measure is clearly a supervised
learning step, since the mapping is based on knowledge of the correct
classes given in the training data, and then this knowledge is used to
alter the results of the clustering.

If you look at the results of create_supervised_keyfile.pl, the
distribution of clusters is often radically different from what the
clustering algorithm originally assigned. For example, when the
"clustering algorithm" is random2, each word has 2 clusters, and the
distribution of the clusters is relatively balanced. However, after
creating the mapping from the training data and building the new key
file, the distribution of the generated clustering answers is
radically different; in fact, for most words there is just 1 cluster
that occurs most of the time. This is why a random baseline as used in
task 17 fares so poorly: there the original random values are
presented to the scoring algorithm, while in the sense induction task
the random values are adjusted prior to scoring based on information
from the training data, and end up converging towards MFS.
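To make that mapping step concrete, here is a minimal sketch of a
majority-based cluster-to-sense mapping in Python. All names and the
toy data are hypothetical; this is my illustration of the general
idea, not the actual logic of create_supervised_keyfile.pl:

```python
from collections import Counter, defaultdict

def learn_mapping(train_clusters, train_senses):
    """Learn a cluster -> sense mapping from labeled training data.

    For each cluster, pick the gold sense that co-occurs with it most
    often in the training instances (a simple majority mapping).
    """
    counts = defaultdict(Counter)
    for cluster, sense in zip(train_clusters, train_senses):
        counts[cluster][sense] += 1
    return {c: senses.most_common(1)[0][0] for c, senses in counts.items()}

def relabel(test_clusters, mapping, default_sense):
    # Unseen clusters fall back to a default (e.g. the most frequent sense).
    return [mapping.get(c, default_sense) for c in test_clusters]

# Toy illustration: a random2-style clustering with two balanced clusters.
train_clusters = ["c1", "c2", "c1", "c2", "c1", "c2"]
train_senses   = ["s1", "s1", "s1", "s1", "s2", "s1"]  # s1 dominates overall
mapping = learn_mapping(train_clusters, train_senses)
print(mapping)   # {'c1': 's1', 'c2': 's1'}
print(relabel(["c1", "c2", "c2"], mapping, "s1"))   # ['s1', 's1', 's1']
```

On this random2-style input, both clusters end up mapped to the one
dominant sense, which is exactly the convergence towards MFS described
above.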

In the supervised sense induction evaluation, in effect the clustering
algorithm is being augmented with a supervised learning step, and
while that is perhaps a reasonable thing to do, it does not seem
reasonable to compare the scores that come from such a process with
scores like the f-score or the SenseClusters score that do not have
the benefit of such a step.

What is the problem, then, with comparing the supervised sense
induction measure with the supervised results from the lexical sample
task (#17)? Here I think the problem is one of procedure, and is
perhaps more subtle than the case of comparing to the f-score and
SenseClusters score. However, the nub of the problem is that the
clustering was carried out on the full 27,132 instances. This means
that we saw the test data, as defined by the lexical sample task,
during learning/clustering, and that makes what we did quite
different.

The task 17 participants that used supervised learning were given the
training data (22,281 instances) and then applied whatever model they
learned to the test data after the fact (4,851 instances). I think
that to compare our results with the task 17 results, we would need to
cluster the 22,281 instances in the training data, and then apply a
learning algorithm (the mapping procedure done in
create_supervised_keyfile.pl is essentially a simple learning
algorithm) to the results on the training data to build a model that
we could then apply to the test data.
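The held-out protocol described above can be sketched as follows. A
toy nearest-centroid "clusterer" on 1-D points stands in for whatever
induction system is actually used; all names and data are
hypothetical:

```python
from collections import Counter, defaultdict

def assign(point, centroids):
    # Nearest-centroid assignment for 1-D toy data.
    return min(centroids, key=lambda c: abs(point - centroids[c]))

# 1) Cluster only the training instances (fixed toy centroids stand in
#    for the clustering model induced from training data).
centroids = {"c1": 0.0, "c2": 10.0}
train_points = [0.1, 0.4, 9.8, 10.2]
train_senses = ["s1", "s1", "s2", "s2"]
train_clusters = [assign(p, centroids) for p in train_points]

# 2) Learn the cluster -> sense mapping from the training data only.
counts = defaultdict(Counter)
for c, s in zip(train_clusters, train_senses):
    counts[c][s] += 1
mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}

# 3) Only now touch the test instances: cluster them with the model
#    learned on training data, then map clusters to senses.
test_points = [0.3, 9.9]
predictions = [mapping[assign(p, centroids)] for p in test_points]
print(predictions)   # ['s1', 's2']
```

The key point is that the test instances never enter steps 1 or 2.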

So I think the problem is that we did not hold the test data out of
the clustering process, which was in effect how we generated our
training data. Now, would the results have been radically different? I
don't know. Perhaps not. But, perhaps they would be, because we'd have
a much smaller number of instances per word, and so there is a
potential for quite a few differences in how we would cluster the
training data. But, if we had held the test data out of the clustering
process, there would then be no doubt that we could compare the
results on the test data in a kind of supervised learning experiment
where the training data is created by clustering, which seems to be
the goal of the supervised measure. I do think that is a good and
interesting goal, btw, since in the end we would like to replace
manual sense tagging with clustering.

However, I really don't feel we can make that comparison to task 17
here, because we didn't keep the test data reserved for evaluation; it
was part of the clustering/learning. This would be analogous to giving
the supervised systems all 27,132 instances to learn a model from, and
then applying that model to a subset of the very data on which they
learned.

I think that following the traditional procedure for supervised
learning more closely when doing the supervised sense induction
evaluation would actually be quite a bit clearer, and would allow for
direct comparison to other supervised methods. This would also make it
all the more clear that such a measure should not be compared to the
f-score or SenseClusters measure, which can fall below MFS even for
rather good results. Supervised learning, as we know, generally has
MFS as a lower bound, and so we might view exceeding MFS with the
supervised sense induction evaluation in that light.

The other very significant problem with comparing to task 17 is that
it would be tempting to say something like "the best of our
unsupervised systems attained recall of 0.81, and the best of the
supervised systems attained recall of 0.89 (those are the actual
numbers, I think...), therefore this shows that unsupervised systems
are nearly as good as supervised." Now, I do not mean to suggest that
this is the intention of the organizers, I am only pointing out a
potential pitfall here, a possible error of interpretation by those
who view these results more casually than we do. I think what the
unsupervised sense induction scores really could show (given that the
test data was really withheld) is that if you use clustering to create
training data, then applying a supervised learning algorithm to such
data can give you results that are at the level of MFS or a bit above,
and that's a very different kind of claim and perhaps makes the nature
of the evaluation a bit clearer. Unfortunately, I think the procedural
issue I mentioned prevents us from making such a claim, which is a
shame, because it is an interesting one.

So... if the motivation behind the supervised sense induction measure
is to make comparisons to supervised learning systems, I really do
think we need to "handle" the data in the same way the supervised
systems do: that is, do our clustering/tagging on the training data
only, and then build models from that clustered training data that we
then apply to the test data.

Also, I do think it's important to recognize that the supervised sense
induction measure really does include a learning step that alters the
answers of the clustering algorithm, and so it really shouldn't be
used as a basis for comparison with unsupervised measures, like the
f-score or the SenseClusters score, that don't adjust the answers of
the clustering algorithm.

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
