Greetings again,

On 4/27/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> 4) Regarding the train/test split. We used the same train/test split as
> defined by task 17 organizers. The reason is to be able to compare with
> other supervised/unsupervised systems that participate in the lexical
> sample subtask of task 17. We expected that the split would be random,
> but it seems that it's not the case. We are not happy with this, but in
> any case, note that all our induction systems (as well as the supervised
> ones) are affected by the different distributions in train/test. In the
> following days we plan to do a random split, and check what influence it
> has in the ranking. Note that this way we loose comparability with
> regard to supervised systems.

This is, I think, my central concern about the supervised measure in the sense induction task. It seems tempting to compare the supervised sense induction measure to the scores of supervised systems as used in the lexical sample task (#17), or perhaps to unsupervised measures that also use MFS as a baseline, such as the unsupervised f-score or the SenseClusters score. However, I think this results in a flawed comparison.

While MFS produces the same value under the supervised sense induction measure, the traditional supervised measure from task 17, the unsupervised f-score, the SenseClusters score, and so on, other important baselines and cases do not. For example, the score generated by random2 "moves around" as you go from one evaluation measure to another, as do the scores of the 1 cluster - 1 instance example I mentioned previously. This shows, at least in an intuitive way, that we are dealing with measures that are rather different from each other, and that should not be compared directly to each other. Each of these measures may have interesting points to make on their own, but I do not think they can be compared directly or even indirectly. For example, if you entered random2 in the English lexical sample task (#17), it scores about .28. In the supervised sense induction task it scores .78.
Somehow this difference seems important. Now, it is not enough to say that because random2 behaves differently these are different measures that can't be compared... but it's sort of what got me started on the thought process that follows below...

So, what is the problem with comparing the supervised sense induction measure with the f-score or the SenseClusters score? I think the mapping step in the supervised measure is clearly a supervised learning step, since the mapping is based on knowledge of the correct classes given in the training data, and then this knowledge is used to alter the results of the clustering.

If you look at the results of create_supervised_keyfile.pl, the distribution of clusters is often radically different from what the clustering algorithm originally assigned. For example, when the "clustering algorithm" is random2, each word has 2 clusters, and the distribution of the clusters is relatively balanced. However, after creating the mapping from the training data and building the new key file, the distribution of answers in the generated clustering is radically different, and in fact for most words there is just 1 cluster that occurs most of the time. This is why a random baseline as used in task 17 fares so poorly: there the original random values are presented to the scoring algorithm, while in the sense induction task the random values are adjusted prior to scoring, based on information from the training data, and end up converging towards MFS.

In the supervised sense induction evaluation, the clustering algorithm is in effect being augmented with a supervised learning step, and while that is perhaps a reasonable thing to do, it does not seem reasonable to compare the scores that come from such a process with scores like the f-score or the SenseClusters score that do not have the benefit of such a step.
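To make the convergence-to-MFS point concrete, here is a minimal sketch of what I take the mapping step to be doing. I haven't reproduced create_supervised_keyfile.pl itself; the assumption (mine) is simply that each induced cluster gets mapped to the gold sense it co-occurs with most often in the training data, and the numbers below are a made-up word, not task data:

```python
from collections import Counter, defaultdict

def learn_mapping(train_clusters, train_senses):
    """Map each induced cluster to the gold sense it co-occurs
    with most often in the training data (the supervised step)."""
    cooc = defaultdict(Counter)
    for c, s in zip(train_clusters, train_senses):
        cooc[c][s] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in cooc.items()}

# Hypothetical word: random2 splits 100 training instances into two
# balanced clusters, but the gold senses are skewed 80/20.
train_clusters = ["c1", "c2"] * 50
train_senses = ["s1"] * 80 + ["s2"] * 20

mapping = learn_mapping(train_clusters, train_senses)
# Both random clusters map to the majority sense, so every test
# instance gets relabelled "s1" and the score converges to MFS.
```

So the key file that is finally scored no longer reflects the balanced random clustering at all, which is exactly why random2 lands near .78 here but near .28 when scored directly in task 17.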
What is the problem then with comparing the supervised sense induction measure with the supervised results from the lexical sample task (#17)? Here I think the problem is one of procedure, and is perhaps more subtle than the case of comparing to the f-score and SenseClusters score. However, the nub of the problem is that the clustering was carried out on the full 27,132 instances. This means that we saw the test data as defined by the lexical sample task during learning/clustering, and that makes what we did quite different. The task 17 participants that used supervised learning were given the training data (22,281 instances) and then applied whatever model they learned to the test data after the fact (4,851 instances).

I think to compare our results with the task 17 results, we would need to cluster the 22,281 instances in the training data, and then apply a learning algorithm (the mapping procedure done in create_supervised_keyfile.pl is essentially a simple learning algorithm) to the results on the training data to build a model that we could then apply to the test data. So I think the problem is that we did not hold the test data out of the clustering process, which in effect is how we were generating training data.

Now, would the results have been radically different? I don't know. Perhaps not. But perhaps they would be, because we'd have a much smaller number of instances per word, and so there is potential for quite a few differences in how we would cluster the training data. But if we had held the test data out of the clustering process, there would then be no doubt that we could compare the results on the test data in a kind of supervised learning experiment where the training data is created by clustering, which seems to be the goal of the supervised measure. I do think that is a good and interesting goal, btw, since in the end we would like to replace manual sense tagging with clustering.
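The held-out protocol I have in mind can be sketched in a few lines. All the names here are illustrative, and the "clusterer" in the demonstration is a deliberately silly stand-in (word-length parity), just to show the shape of the procedure: fit on training instances only, learn the mapping from training gold tags, then apply the frozen model to test instances it has never seen:

```python
from collections import Counter, defaultdict

def run_held_out(cluster_fit, train_texts, train_senses, test_texts):
    """Held-out protocol: (1) cluster the training instances only,
    (2) learn the cluster->sense mapping from the training gold tags,
    (3) apply the frozen model plus mapping to the test instances."""
    model = cluster_fit(train_texts)   # the test data is never seen here
    cooc = defaultdict(Counter)
    for x, s in zip(train_texts, train_senses):
        cooc[model(x)][s] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in cooc.items()}
    return [mapping.get(model(x)) for x in test_texts]

# Toy stand-in "clusterer": assign by word-length parity.
toy_fit = lambda texts: (lambda x: len(x) % 2)
predictions = run_held_out(toy_fit,
                           ["aa", "bbb", "cc", "ddd"],
                           ["s1", "s2", "s1", "s2"],
                           ["ee", "fff"])
```

In our actual setup, step (1) was run on all 27,132 instances rather than the 22,281 training instances, which is the leak I am worried about.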
However, I really don't feel we can make that comparison to task 17 here, because we didn't keep the test data reserved for evaluation; it was part of the clustering/learning. This would be analogous to giving the supervised systems all 27,132 instances to learn a model from, and then applying that model to a subset of the data upon which they learned.

I think following the traditional procedure for supervised learning more closely when doing the supervised sense induction evaluation would actually be quite a bit clearer, and would allow for direct comparison to other supervised methods. This would also make it all the more clear that such a measure should not be compared to the f-score or SenseClusters measure, which can fall below MFS for even rather good results. Supervised learning, as we know, generally has MFS as a lower bound, and so we might view exceeding MFS with the supervised sense induction evaluation in that light.

The other very significant problem with comparing to task 17 is that it would be tempting to say something like "the best of our unsupervised systems attained recall of 0.81, and the best of the supervised systems attained recall of 0.89 (those are the actual numbers, I think...), therefore this shows that unsupervised systems are nearly as good as supervised." Now, I do not mean to suggest that this is the intention of the organizers; I am only pointing out a potential pitfall here, a possible error of interpretation by those who view these results more casually than we do.

I think what the unsupervised sense induction scores really could show (given that the test data was truly withheld) is that if you use clustering to create training data, then applying a supervised learning algorithm to such data can give you results at the level of MFS or a bit above, and that's a very different kind of claim, and perhaps makes the nature of the evaluation a bit clearer.
Unfortunately, I think the procedural issue I mentioned prevents us from making such a claim here, which is a shame because it is kind of an interesting claim.

So... if the motivation behind the supervised sense induction measure is to make comparisons to supervised learning systems, I really do think we need to "handle" the data in the same way as the supervised systems do, and that means doing our clustering/tagging on the training data, and then building models with that clustered training data that we then apply to the test data. Also, I do think it's important to recognize that the supervised sense induction measure really does include a learning step that alters the answers of the clustering algorithm, and so really shouldn't be used as a basis for comparison with unsupervised measures that don't adjust the answers of the clustering algorithm, like the f-score or the SenseClusters score.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
