Re: [Senseclusters-users] Hi, Professor, I have a question (comments on token.regex and --eval)

Ted Pedersen Sat, 23 Aug 2008 13:19:58 -0700

Hi Weisi,

Welcome to the SenseClusters-users list, and thanks for your question!

On Fri, Aug 22, 2008 at 10:44 PM, Weisi Duan <[EMAIL PROTECTED]> wrote:
> Hi, Professor,
>
> I have been trying to send the email to the listen group of
> SenseClusters but I kept receiving message saying that I have
> to subscribe first to send the email, but I think I have
> subscribed to the group already... So I have to send it to
> your email box. Sorry for disturbing...
>
> I am a graduate student at Temple University and I am trying
> to use the SenseClusters package for some comparison work,
> and I have correctly installed the package. However, when I
> run the command "discriminate.pl --token token.reg
> eng-lex-sample.evaluation.xml" in the sample/Data directory, I
> got the error as below:
> ERROR(discriminate.pl):
>        Only 0 FEATURES found in the <expr1219436037.bigrams> file.
>        At least 10 FEATURES required to proceed with context
>        representation.

In general, this error tells you that too few features were found in
the data to proceed. By default we require at least 10 features in
order to represent your contexts. The error itself is only a symptom,
though. One common cause is an overly aggressive feature selection
method: for example, if you only accept as features bigrams that occur
more than 10,000 times, and you have a very small corpus, you won't
find any features.
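
To make the cutoff effect concrete, here is a minimal sketch in Python
(not SenseClusters code). The function name, the min_count parameter,
and the toy contexts are made up for illustration; only the minimum of
10 features comes from the error message above.

```python
from collections import Counter

MIN_FEATURES = 10  # default minimum quoted in the error message above


def select_bigram_features(contexts, min_count):
    """Count adjacent token pairs and keep those seen at least min_count times.

    Hypothetical helper for illustration; not part of SenseClusters.
    """
    counts = Counter()
    for tokens in contexts:
        counts.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, n in counts.items() if n >= min_count}


# Toy corpus: each context is already split into tokens.
contexts = [
    "the bank raised interest rates".split(),
    "the bank raised interest rates again".split(),
    "the river bank was muddy".split(),
    "interest rates fell at the bank".split(),
]

loose = select_bigram_features(contexts, min_count=1)
strict = select_bigram_features(contexts, min_count=10000)

for name, features in (("loose", loose), ("strict", strict)):
    status = "ok" if len(features) >= MIN_FEATURES else "too few features"
    print(name, len(features), status)
```

With a tiny corpus, the strict cutoff leaves no bigrams at all, which
is the same situation the error message reports.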

In your case though, I think the problem is in how you specified token.regex.

> I simply put /\b.+\b/ in token.reg. I don't understand what
> the problem is and I would really appreciate it if you could
> give me some help.

Your regex says that a token is a string of one or more characters of
any kind. So I think what is happening is that each context is being
treated as a single token, and you aren't finding any bigrams (because
you only have one token per context, and a bigram consists of two
tokens).

More typical examples of token.regex might be

/\b\w+\b/

This says that each token is (in effect) a space-separated word; more
precisely, a run of letters, digits, or underscores. I think this is
more like what you want to do.

If you run your command above using that token.regex file, you will
get a feature file that contains quite a few bigrams, and these will
be used to represent your contexts.
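
The difference between the two regexes can be checked outside
SenseClusters. The sketch below uses Python's re module as a stand-in
for the Perl regex matching that the package actually does, so treat
it as an approximation; the tokenize and bigrams helpers and the
sample context are made up for illustration.

```python
import re


def tokenize(pattern, text):
    """Collect successive left-to-right matches of the token regex."""
    return re.findall(pattern, text)


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


context = "the bank raised interest rates again"

greedy_tokens = tokenize(r"\b.+\b", context)  # regex from the question
word_tokens = tokenize(r"\b\w+\b", context)   # regex suggested above

print(len(greedy_tokens), "token(s),", len(bigrams(greedy_tokens)), "bigram(s)")
print(len(word_tokens), "token(s),", len(bigrams(word_tokens)), "bigram(s)")
```

Because .+ is greedy, the first regex swallows the whole context as
one token and leaves nothing to pair it with, while \w+ stops at each
space and yields one token per word.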

> I was hoping that the feature vector is
> generated by going through the evaluation file by default. I
> don't know what is happening there.

By default you are representing these contexts using bigram features,
which are then used to create a 2nd order co-occurrence representation
(which is then clustered). You may want to look through a few of our
papers that describe this method (any of the SenseClusters papers will
discuss these issues).

An Unsupervised Language Independent Method of Name Discrimination
Using Second Order Co-occurrence Features  (Pedersen, Kulkarni,
Angheluta, Kozareva, and Solorio) - Appears in the Proceedings of the
Seventh International Conference on Intelligent Text Processing and
Computational Linguistics, pp. 208-222, February 19-25, 2006, Mexico
City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2006.pdf

You can find other SenseClusters papers here...
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

You may also wish to browse the program-by-program documentation here:
http://search.cpan.org/dist/Text-SenseClusters/
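
As a rough illustration of the second order co-occurrence idea
mentioned above, here is a small Python sketch. It is not the
SenseClusters implementation: the function names and toy contexts are
invented, and it assumes the common formulation in which each word
gets a vector of the words it forms bigrams with, and a context is
represented by the average of the vectors of its words.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(contexts):
    """Map each word to counts of the words that follow it in a bigram."""
    vectors = defaultdict(Counter)
    for tokens in contexts:
        for first, second in zip(tokens, tokens[1:]):
            vectors[first][second] += 1
    return vectors


def second_order_vector(tokens, vectors):
    """Average the vectors of those context words that have a vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return {}
    total = Counter()
    for vec in found:
        total.update(vec)
    return {word: count / len(found) for word, count in total.items()}


# Toy contexts, already tokenized (illustrative data only).
contexts = [
    ["bank", "interest", "rates"],
    ["river", "bank", "water"],
]

vectors = cooccurrence_vectors(contexts)
context_vector = second_order_vector(contexts[0], vectors)
print(context_vector)
```

The context ends up described by the words its own words co-occur
with, rather than only by the words it contains, and vectors of this
kind are what would then be clustered.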

>
> Also, what I am trying to do is to be able to use all
> contexts of all instances as the feature set. May I ask what
> option I need to specify? I read the perldoc
> of "discriminate.pl" but did not get very clear about how to
> do that.

SenseClusters relies on lexical features, usually bigrams or unigrams.
By default you get bigrams and second order co-occurrences. You might
want to try unigrams and first order co-occurrences too. I often
suggest using the web interface to try things out at first, as that
will lay out the different options you have, and let you see the
options fairly conveniently.

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

There are also links in the web interface to more detailed
explanations of the options, and of course you are welcome to ask
questions here too!

> Also, I would like to evaluate the result of the
> clustering, is there any module for that?

Yes. You can use the --eval option for discriminate.pl, and your
results will be evaluated relative to the "correct" answers provided
in your input file.

> I did not find
> anything about evaluation in the perldoc
> of "discriminate.pl". Thank you very much!

The main program for evaluation is label.pl. You can find out more
about that here...
http://search.cpan.org/dist/Text-SenseClusters/Toolkit/evaluate/label.pl

(This is what gets called when you use the --eval option).

I hope this helps.

Good luck!
Ted

>
> Weisi
>
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> senseclusters-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/senseclusters-users
>

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
