Hi Rob,

> There are lots of ways to define features for a problem like this. I see you 
> support a number of them. I'm not sure if you support mine.
> 
> I have a kind of context "frame". For instance, given a text fragment 
> <context>A<head>X</head>B</context> my feature would be A_B, where A, B, and 
> X can be ngrams in general (but in practice I've only used ngrams for the 
> token X because ngram context "frames" become way too specific = rare).
> 
> What I'm wondering is if "locking" the prior and following contexts (A and B) 
> together like this (context features A and B become single context feature 
> A_B) will give a different result to simply including prior and following 
> context A and B in the context vector without reference to each other (which 
> I now understand is possible with your package). I haven't tried this 
> (unlocked), but I did find that it was necessary to have prior and following 
> context to distinguish ngram tokens.

Interesting question, and in general I think it's likely that locking A 
and B together as you say would give some different results. These might
be thought of as co-occurrence features that are simply skipping over the
X (head word) that lies in the middle. This is fairly easy to do with 
ngram features and windowing (as supported in NSP). 
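To make the locked/unlocked distinction concrete, here is a small Python sketch (not part of NSP; the function and variable names are illustrative) that extracts both kinds of features from a tokenized context around a head word:

```python
def context_features(tokens, head_index, width=1, locked=True):
    """Extract context features around the head word.

    locked=True yields a single A_B feature that ties the prior and
    following context together (skipping over the head), as in the
    <context>A<head>X</head>B</context> frame described above.
    locked=False yields A and B as two independent features.
    """
    prior = "_".join(tokens[max(0, head_index - width):head_index])
    following = "_".join(tokens[head_index + 1:head_index + 1 + width])
    if locked:
        return [prior + "__" + following]   # one fused "frame" feature
    return [prior, following]               # two independent features

tokens = ["the", "big", "dog", "ran", "home"]
# head word is "dog" at index 2
print(context_features(tokens, 2, locked=True))   # ['big__ran']
print(context_features(tokens, 2, locked=False))  # ['big', 'ran']
```

The fused feature is rarer (it only fires when both contexts match at once), which is exactly the specificity/reliability trade-off raised in the quoted question.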

In addition, you can in fact specify your own features that you determine
in some way other than using NSP. Rather than having count.pl or
statistic.pl find the features, you can specify them manually and then
run them through nsp2regex.pl to get them into a form where they can be
identified in the instances. (I say manually, but if you have some other
tool that identifies features, it can certainly do that instead.) The
features can be specified such that they allow for skipping over a
certain number of intervening words, thus allowing for this kind of
fixed prior-following format.
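As a rough illustration of what "manually specifying" might look like, this sketch writes out chosen prior/following pairs as a feature file. Note the exact input format nsp2regex.pl expects should be checked against the NSP documentation; the `<>` token separator here is an assumption based on NSP's bigram output, and the word pairs are made up:

```python
# Hand-picked "skip" features: each pair locks a prior and a following
# context word together, leaving the head word in between unspecified.
pairs = [("big", "ran"), ("small", "walked")]

# Write one feature per line in an assumed NSP-style token-pair format;
# this file would then be handed to nsp2regex.pl.
with open("features.txt", "w") as out:
    for prior, following in pairs:
        out.write(f"{prior}<>{following}<>\n")
```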

> The other question is have you considered combining 1st order and 2nd order 
> processing. If your 2nd order processing can be taken as similar to mine, and 
> I think it can, then it increases generality nicely, but also decreases 
> reliability. I have been exploring ways of using both 1st order and 2nd order 
> stats. with the 2nd order stats. scaled appropriately to reflect their lower 
> reliability.

We have thought about combining these in certain ways. For example, with 
order2 processing, you reach a point where you have vectors for each  
instance (which are the average of the word vectors that make up the 
context). We then allow you to cluster those vectors directly, or you can
go on and create a similarity matrix.  The same is true in the other  
direction - if you have an order1 representation of instances, you can   
directly cluster those as vectors or go on to create a similarity matrix.
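The order-2 step described above (average the word vectors in a context to get an instance vector, then either cluster the vectors directly or build a similarity matrix) can be sketched in a few lines of plain Python. The word vectors here are toy values for illustration; in SenseClusters they would come from actual co-occurrence statistics:

```python
from math import sqrt

# Toy word vectors standing in for an order-2 representation.
word_vecs = {
    "bank":  [1.0, 0.0, 2.0],
    "river": [0.0, 1.0, 1.0],
    "money": [2.0, 1.0, 0.0],
}

def instance_vector(context_words):
    """Average the vectors of the words that make up a context."""
    dims = len(next(iter(word_vecs.values())))
    total = [0.0] * dims
    for w in context_words:
        for i, x in enumerate(word_vecs[w]):
            total[i] += x
    return [x / len(context_words) for x in total]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

instances = [["bank", "river"], ["bank", "money"]]
vecs = [instance_vector(ws) for ws in instances]

# Either cluster `vecs` directly, or first build a similarity matrix:
sim = [[cosine(u, v) for v in vecs] for u in vecs]
```

The resulting `sim` is the symmetric instance-by-instance matrix whose rows/columns resemble the word similarity "vectors" mentioned later in this thread.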

Also, we do support second order co-occurrences (socs) as a 1st order
feature (via NSP again) so this is a sort of bridge between order 1 and 
order 2 style processing. 

> What I am really trying to find out is if you would profit from looking at my 
> code for this problem. Alternatively I would like to know if I could use your 
> code to produce word similarity "vectors" usable by my parsing algorithm. My 
> main interest is to improve my word similarity vectors (e.g. 
> http://www.collectivelanguage.com/cgi-bin/engword.cgi?word=a+word) so that I 
> can increase my parsing accuracy.

Sure, I'd suggest you set up a SourceForge project and make your code
available that way. We are pretty happy with SourceForge (the nice folks
who make this mailing list possible, in fact :). As to using our code,
you certainly may; just follow the terms of the GNU General Public
License.

> 
> My word similarity vectors are like the rows or columns of your similarity 
> matrix, I think.
> 

Sounds interesting. Thanks again for the questions and feedback!
Ted

-- 
# Ted Pedersen                              http://www.umn.edu/~tpederse #
# Department of Computer Science                        [EMAIL PROTECTED] #
# University of Minnesota, Duluth                                        #
# Duluth, MN 55812                                        (218) 726-8770 #

_______________________________________________
senseclusters-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users