Hi Weisi, Welcome to the SenseClusters-users list, and thanks for your question!
On Fri, Aug 22, 2008 at 10:44 PM, Weisi Duan <[EMAIL PROTECTED]> wrote:
> Hi, Professor,
>
> I have been trying to send the email to the list group of
> SenseClusters but I kept receiving a message saying that I have
> to subscribe first to send the email, but I think I have
> subscribed to the group already... So I have to send it to
> your email box. Sorry for disturbing...
>
> I am a graduate student at Temple University and I am trying
> to use the SenseClusters package for some comparison work,
> and I have correctly installed the package. However, when I
> run the command "discriminate.pl --token token.reg
> eng-lex-sample.evaluation.xml" in the sample/Data directory, I
> got the error below:
>
> ERROR(discriminate.pl):
>         Only 0 FEATURES found in the <expr1219436037.bigrams> file.
>         At least 10 FEATURES required to proceed with context
>         representation.

In general, this error is telling you that not enough features were
found in the data to proceed. By default we require that at least 10
features be found in order to represent your contexts.

This is really just a symptom, though. One cause of not finding
features is an overly aggressive feature selection method. For example,
if you say that you will only use as features bigrams that occurred
more than 10,000 times, and you have a very small corpus, you won't
find any features. In your case, though, I think the problem is in how
you specified token.regex.

> I simply put /\b.+\b/ in token.reg. I don't understand what
> the problem is and I would really appreciate it if you could
> give me some help.

Your regex says that a token is a string of length one or more
consisting of any characters. So I think what is happening is that
each context is being treated as a single token, and you aren't
finding any bigrams (because you only have one token per context, and
a bigram must consist of two tokens).
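To see the effect concretely, here is a small sketch in Python. It uses Python's re module as a stand-in for the Perl regex handling, so it only approximates what SenseClusters does internally, and the two contexts are made up for illustration. It compares the greedy pattern with a word-level pattern such as /\b\w+\b/ and shows how many bigrams each one yields.

```python
import re

# Hypothetical contexts standing in for instances in the evaluation file.
contexts = [
    "the bank raised interest rates",
    "he sat on the river bank",
]

def tokenize(pattern, text):
    """Return the non-overlapping matches of pattern, scanning left to right."""
    return re.findall(pattern, text)

def bigrams(tokens):
    """Pair each token with the token that immediately follows it."""
    return list(zip(tokens, tokens[1:]))

greedy = r"\b.+\b"  # the pattern from the original token.reg
wordy = r"\b\w+\b"  # a word-level pattern

for text in contexts:
    for pattern in (greedy, wordy):
        tokens = tokenize(pattern, text)
        print(pattern, len(tokens), "tokens,", len(bigrams(tokens)), "bigrams")
```

With the greedy pattern, each context comes back as one long token, so the bigram list is empty. With the word-level pattern, the first context gives five tokens and four bigrams.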
A more typical token.regex might be

/\b\w+\b/

This says that each token is (in effect) a space separated word. I
think this is more like what you want to do. If you run your command
above using that token.regex file, you will get a feature file that
consists of quite a few bigrams, and these will be used to represent
your contexts.

> I was hoping that the feature vector is
> generated by going through the evaluation file by default. I
> don't know what is happening there.

By default you are representing these contexts using bigram features,
which are then used to create a 2nd order co-occurrence representation
(which is then clustered). You may want to look through a few of our
papers that describe this method (any of the SenseClusters papers will
discuss these issues).

An Unsupervised Language Independent Method of Name Discrimination
Using Second Order Co-occurrence Features (Pedersen, Kulkarni,
Angheluta, Kozareva, and Solorio) - Appears in the Proceedings of the
Seventh International Conference on Intelligent Text Processing and
Computational Linguistics, pp. 208-222, February 19-25, 2006, Mexico
City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2006.pdf

You can find other SenseClusters papers here:
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

You may also wish to browse the program by program documentation here:
http://search.cpan.org/dist/Text-SenseClusters/

> Also, what I am trying to do is to be able to use all
> contexts of all instances as the feature set. May I ask what
> option I need to specify? I read the perldoc
> of "discriminate.pl" but did not get very clear about how to
> do that.

SenseClusters relies on lexical features, usually bigrams or unigrams.
By default you get bigrams and second order co-occurrences. You might
want to try unigrams and first order co-occurrences too.
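To make the difference between first and second order representations a little more concrete, here is a rough sketch in Python. It is a simplified illustration of the idea, not the SenseClusters implementation: the toy contexts, the function names, and the plain averaging are all assumptions made for the example. A first order vector records which features occur in the context itself, while a second order vector averages the bigram-based vectors of the words that occur in the context.

```python
from collections import Counter, defaultdict

# Hypothetical toy contexts, already tokenized into words.
contexts = [
    ["bank", "raised", "interest", "rates"],
    ["river", "bank", "flooded", "fields"],
    ["interest", "rates", "fell"],
]

def first_order(context, vocab):
    """Count how often each vocabulary word occurs in the context itself."""
    counts = Counter(context)
    return [counts[w] for w in vocab]

def word_vectors(all_contexts):
    """Map each word to counts of the words that immediately follow it."""
    vectors = defaultdict(Counter)
    for tokens in all_contexts:
        for left, right in zip(tokens, tokens[1:]):
            vectors[left][right] += 1
    return vectors

def second_order(context, vectors, vocab):
    """Average the word vectors of the context words that have one."""
    rows = [vectors[w] for w in context if w in vectors]
    if not rows:
        return [0.0] * len(vocab)
    return [sum(r[w] for r in rows) / len(rows) for w in vocab]

vocab = sorted({w for tokens in contexts for w in tokens})
vectors = word_vectors(contexts)
for tokens in contexts:
    print(first_order(tokens, vocab))
    print(second_order(tokens, vectors, vocab))
```

The point of the second order version is that two contexts can look similar even when they share no words directly, as long as their words tend to be followed by the same words elsewhere in the data.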
I often suggest using the web interface to try things out at first, as
it lays out the different options you have and lets you see them
fairly conveniently.

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

There are also links in the web interface to more detailed
explanations of the options, and of course you are welcome to ask
questions here too!

> Also, I would like to evaluate the result of the
> clustering, is there any module for that?

Yes. You can use the --eval option for discriminate.pl, and your
results will be evaluated relative to the "correct" answers provided
in your input file.

> I did not find
> anything about evaluation in the perldoc
> of "discriminate.pl". Thank you very much!

The main program for evaluation is label.pl. You can find out more
about that here:

http://search.cpan.org/dist/Text-SenseClusters/Toolkit/evaluate/label.pl

(This is what gets called when you use the --eval option.)

I hope this helps. Good luck!
Ted

>
> Weisi
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
