A rather lengthy note that talks about supervised evaluation relative
to SenseClusters evaluation....

---------- Forwarded message ----------
From: Ted Pedersen <[EMAIL PROTECTED]>
Date: Apr 20, 2007 2:28 PM
Subject: Re: [senseinduction] Re: results tables question / question
on supervised scoring
To: [EMAIL PROTECTED]


Thanks Aitor,

This is very helpful as well. I'm going to take the liberty of trying
to make some general comments about the measures of evaluation, just
to make sure I'm not wildly off the mark in my thinking as that will
impact what I write in our final system report.

The simplest observation is that to get a high f-score you need to
find the same number AND same type of clusters (that agree with the
gold standard data) whereas with the supervised measure you need to
find the same type of clusters, but not the same number (as in the
gold standard data).

In that sense the supervised measure is more lenient: if you have 100
instances that belong to gold standard sense 1, and you happen to put
those into three different clusters that are relatively pure (that is,
they do not contain instances from other senses), you will get
penalized by the f-score but not by the supervised score. In fact this
might be where there is a connection between the supervised measure and
purity: the supervised measure is oriented towards finding any
number of relatively pure clusters, where you do better the more pure
your clusters are regardless of how many you might create, whereas the
f-score wants you to get exactly the same number of clusters as senses
AND have them be relatively pure (in order to get a good score).

So, both measures are expecting that you will make the same kind of
distinction, that is keeping these gold standard sense 1 instances
separate from instances of other senses. It's just that the f-score
wants you to find the same number of clusters as there are senses in
the gold standard data, whereas the supervised score doesn't care how
many clusters you find, as long as you keep the gold standard senses
in relatively pure clusters.

I had mentioned wanting to make sure I understood the supervised
measure in order to compare it to what is done for evaluation in
SenseClusters, and I had a few thoughts on that as well. The
evaluation in SenseClusters is more like the f-score than the
supervised measure, I think, in that we penalize the score very heavily
if we find some number of clusters that differs from the gold standard
number. However, it is similar to the supervised measure in that we do
create a mapping that is based on the most probable sense/cluster
combination.

Here's an example...suppose there are 3 true senses as observed in the
gold standard data (S1, S2, S3), and 4 discovered clusters (C1, C2,
C3, C4). Suppose that the confusion matrix for 100 instances that are
clustered looks like this....

        C1   C2   C3   C4
S1      10   30    0    5   =  45
S2      20    0   10    0   =  30
S3       0    0    5   20   =  25
       ----------------------
        30   30   15   25     100

Now, in SenseClusters' evaluation framework we try to find the
optimal mapping of clusters to senses, such that the agreement overall
is maximized. This turns out to be an instance of the Assignment
Problem from Operations Research, so we take advantage of the Munkres
Algorithm to "solve" that, which really just amounts to rearranging
the columns in the matrix above such that the diagonal sum is
maximized, and then we map clusters to senses based on that
arrangement.
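To make that concrete, here's a small Python sketch of the mapping
step. Rather than the Munkres algorithm proper, it just brute-forces
all column permutations, which is fine for matrices this small:

```python
from itertools import permutations

# Confusion matrix: rows = gold senses (S1..S3), columns = clusters (C1..C4)
confusion = [
    [10, 30, 0, 5],   # S1
    [20, 0, 10, 0],   # S2
    [0, 0, 5, 20],    # S3
]

def best_mapping(matrix):
    """Return the cluster-to-sense assignment that maximizes the diagonal sum.

    Tries every way of choosing one distinct cluster per sense; this
    brute force does the same job the Munkres (Hungarian) algorithm
    does efficiently in the real implementation.
    """
    n_senses = len(matrix)
    n_clusters = len(matrix[0])
    best_score, best_assign = -1, None
    for perm in permutations(range(n_clusters), n_senses):
        score = sum(matrix[s][perm[s]] for s in range(n_senses))
        if score > best_score:
            best_score, best_assign = score, perm
    return best_score, best_assign

score, assign = best_mapping(confusion)
print(score)   # 70, the 30 + 20 + 20 diagonal
print(assign)  # (1, 0, 3): C2->S1, C1->S2, C4->S3; C3 is left unmapped
```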

So, for the above matrix we'd end up with ...

        C2   C1   C4   C3**
S1      30   10    5    0   =  45
S2       0   20    0   10   =  30
S3       0    0   20    5   =  25
       ----------------------
        30   30   25   15     100

So this means that

C2 -> S1
C1 -> S2
C4 -> S3

...and C3 doesn't map to anything, and that is where the penalty is
assessed. Those instances found in C3 are counted as being wrong. :)

In this case precision and recall are the same, since the clustering
algorithm has not declined to place any instances in a cluster. (That
happens sometimes, and is essentially an "I don't know" answer from
the clustering algorithm.) So we'll assume there are 100 instances in
total in the data that we clustered.

So precision and recall are both (30+20+20)/100 = 0.70

And these are combined in the F-Measure as 2*P*R/(P+R) = 0.70

(The 30+20+20 comes from the diagonal of the matrix above).
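Assuming all 100 instances were clustered, the arithmetic is just:

```python
# Diagonal of the rearranged matrix: the instances counted as "right"
correct = 30 + 20 + 20   # C2->S1 + C1->S2 + C4->S3
clustered = 100          # instances the algorithm actually placed in a cluster
total = 100              # instances in the gold standard data

precision = correct / clustered   # 0.70
recall = correct / total          # 0.70

# F-Measure: 2*P*R/(P+R), which works out to P (and R) whenever P == R
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall)   # 0.7 0.7
```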

Now...in the supervised methodology, you'd compute a table of
probabilities something like this...

        C1    C2    C3    C4
S1     .22   .66     0   .11
S2     .66     0   .33     0
S3       0     0    .2    .8

If we assume that the clustering algorithm only outputs one single
answer per instance, the mapping would go like this...

C1->S2
C2->S1
C3->S2
C4->S3

...which is the same as what SenseClusters finds, except for the
C3->S2 mapping, which is not allowed by SenseClusters but allowed in
the supervised framework.
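Here's a sketch of that supervised-style mapping in Python,
row-normalizing the same confusion counts into probabilities and then
giving every cluster its highest-probability sense. This is just my
reading of the method, so treat the details as assumptions:

```python
# Same confusion counts as above: rows = senses S1..S3, columns = clusters C1..C4
confusion = [
    [10, 30, 0, 5],
    [20, 0, 10, 0],
    [0, 0, 5, 20],
]

# Row-normalize: prob[s][c] is the fraction of sense s's instances in cluster c
prob = [[v / sum(row) for v in row] for row in confusion]

# Supervised-style mapping: every cluster maps to its highest-scoring sense,
# so several clusters may map to the same sense (here C1 and C3 both -> S2)
mapping = {}
for c in range(len(prob[0])):
    column = [prob[s][c] for s in range(len(prob))]
    mapping[f"C{c + 1}"] = f"S{column.index(max(column)) + 1}"

print(mapping)  # {'C1': 'S2', 'C2': 'S1', 'C3': 'S2', 'C4': 'S3'}
```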

Now the only problem here is that I need some test data to evaluate
the supervised measure with, but I used it all up getting that table
up there. :) But, I have perhaps gone far enough to make the point
that I think the SenseClusters measure is somewhere in between F-score
and the supervised method.

The SenseClusters evaluation is like the F-score in that it doesn't
require a training/test split, and it is also very harsh in punishing
you for getting a different number of clusters than senses. The
F-score does some weighting based on the size of the clusters
discovered, which SenseClusters doesn't do, and SenseClusters is only
"interested" in the values in the confusion matrix that fall on the
diagonal; everything else is ignored.

However, it is like the supervised measure in that it creates a
mapping of clusters to senses, and in fact the mapping we get seems to
generally correspond to the supervised mapping, except that we only
map for as many senses as we have, and anything else is considered
wrong.

I ran the SenseClusters evaluation on the train+test solution for our
system (27,132 instances) and got a score of 56.36, which in this case
can be viewed as an accuracy number since precision and recall are
equal. So this is just the percentages of instances that we got
"right", according to the gold standard. The F-score gave us 63.1,
which is slightly higher, and is mentioned here to make it clear that
they are different. And if we assign all the instances of each word to
a single cluster in the train+test (i.e., the most frequent sense
baseline), we get 78.55, just to give you an idea of how far away our
results are from that. :)

Of course we don't have a supervised score for the train+test data,
but I'm planning to run the SenseClusters evaluation on the test data
as well, and then compare that with the f-score and supervised results
just to see how they vary. I would guess that the SenseClusters
evaluation will be somewhat lower than f-score, which is somewhat
lower than supervised score...

The SenseClusters evaluation I am describing can be done with a few
programs available in SenseClusters
(http://senseclusters.sourceforge.net), namely cluto2label.pl,
label.pl, and report.pl, in that order. Our solution and key files as
used in this task need to be converted to the SenseClusters format to
be used with these programs, but I'll package up those scripts and
make those available with the three programs above as a tar file if
anyone is interested in running this evaluation on their data.

In general the scores reported by the SenseClusters evaluation will be
lower for the reasons mentioned above. So a lower number from the
SenseClusters evaluation should only be viewed relative to other
numbers that come from the SenseClusters evaluation, otherwise you
will be making a comparison that is too harsh, as both the f-score and
supervised methodology tend to report somewhat higher numbers.

Thanks for listening, comments of course welcome. :)
Ted

On 4/20/07, Aitor Soroa Etxabe <[EMAIL PROTECTED]> wrote:
>
> On 2007/04/19, [EMAIL PROTECTED] wrote :
> >
> > Greetings Aitor,
> >
> > Me again. :)
>
> ;-)
>
> > [...]
> > Anyway, my understanding is that the results of clustering on the
> > training data are stored in a matrix, essentially a confusion matrix.
> > Suppose the true senses as shown by the gold standard are S1, S2, and
> > S3, and that we discover 3 clusters, C1, C2, C3.
> >
> >       C1   C2      C3
> > S1    0    10      5
> > S2   10     5      5
> > S3    5     5      5
> >
> > [...]
> >
> > Now, I think these counts are converted into probabilities....
> >
> >       C1  C2  C3
> > S1   0   .66  .33
> > S2  .5   .25  .25
> > S3  .33  .33  .33
> >
>
> Yes, your analysis is right. This is the way to create what we call a
> "mapping matrix"
>
> Now, suppose the system assigned cluster C2 to an instance of the test
> corpus. We interpret this assignment by means of what we call a "cluster
> score vector", which in this case will be Csv = (0, 1, 0)^{T}. So, to obtain
> the sense we multiply the mapping matrix with the cluster score vector,
> which gives a "sense score vector" Ssv:
>
> Ssv = M*Csv
>
> And then we choose the sense with maximum score. In case of ties, we take
> one sense arbitrarily (but ties don't occur very frequently). In this case,
> Ssv = (.66, .25, .33)^{T}, so we choose S1.
>
> Note that this procedure allows assigning more than one cluster to an instance
> (like a soft clustering). Suppose we assign the cluster score vector Csv =
> (.9, .3, .6)^{T}, i.e., C1 has a weight of 0.9, C2 has 0.3 and C3
> 0.6. Multiplying it with the matrix, we obtain
>
> Ssv = (0.396, 0.675, 0.594)
>
> so sense S2 will be assigned.
>
> I hope the explanation helps with understanding the sup. evaluation, but if you
> have more questions feel free to ask on the list.
>
> best,
>                                 aitor
>


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
