Re: [Senseclusters-users] question about term weighting and NSP statistics

ted pedersen Sun, 17 Sep 2006 17:55:29 -0700

On Thu, 14 Sep 2006, Sun Daze wrote:

> Hi,
> 
> I have a general question about the use of the statistic measures provided
> in NSP.
> 
> Some people are using term weighting before constructing word vectors, based
> on log transformations, inverse document/word frequency or an entropy
> measure.
> I was wondering if there is any use or benefit in combining this type of
> weighting scheme prior to applying the statistics on the bigram count.
> If so, are there any references that describe the results obtained with this
> type of approach...
> 
> Any suggestions or ideas on this would be appreciated!
> 
> S.


This is an interesting question, and the short answer is that I'm not 
aware of any real definitive findings on the best way to do such 
weighting, or even if it has any effect. 

It is important to note that Cluto (the clustering package used
with SenseClusters) does in fact include various ways to weight the
vectors that SenseClusters passes to it. These are selected via the
-rowmodel and -colmodel options, which you cannot control from the
web interface, but which are easy to set if you are writing your own
scripts that string the SenseClusters programs together (you can
see some examples of that in the Demos directory). For example, with
vcluster you can set the -rowmodel and/or -colmodel parameter, which
will allow you to weight or scale the rows and/or columns as described
below.

  -rowmodel=string
     Selects the model to be used to scale the rows.
     The possible values are:
        none    - The row values are used as they are [default].
        maxtf   - The row values are scaled to be between [.5, 1.0].
        sqrt    - The square-root function of the row values is used.
        log     - The log2 function of the row values are used.

  -colmodel=string
     Selects the model to be used to scale the columns.
     The possible values are:
        none    - The values on each column are used as they are.
        idf     - The values are scaled based on the inverse row
                  frequency. [default]
                  This is only applied if <cos> is selected as the
                  similarity function.
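
To make the options above concrete, here is a rough Python sketch of what
these scalings might look like on a small context-by-feature matrix. It is
not Cluto's code. The helper names (scale_row, idf_weights) and the toy
counts are made up for illustration, and the exact formulas are assumptions
read off the help text: non-negative counts, zeros left as zeros, and idf
taken as log2(number of rows / number of rows where the column is nonzero).

```python
import math


def scale_row(row, model="none"):
    """Scale one row of non-negative counts (hypothetical helper, not Cluto's API)."""
    if model == "none":
        return list(row)
    if model == "maxtf":
        peak = max(row, default=0)
        if peak == 0:
            return list(row)
        # Nonzero entries land in [0.5, 1.0]; zeros stay zero (assumption).
        return [0.5 + 0.5 * v / peak if v != 0 else 0.0 for v in row]
    if model == "sqrt":
        return [math.sqrt(v) for v in row]
    if model == "log":
        # Literal reading of the help text: log2 of each nonzero value.
        # Note that a count of 1 maps to 0 under this reading.
        return [math.log2(v) if v != 0 else 0.0 for v in row]
    raise ValueError(f"unknown row model: {model}")


def idf_weights(matrix):
    """One weight per column from the inverse row frequency (assumed formula)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    weights = []
    for j in range(n_cols):
        row_freq = sum(1 for row in matrix if row[j] != 0)
        weights.append(math.log2(n_rows / row_freq) if row_freq else 0.0)
    return weights


if __name__ == "__main__":
    # Toy first order matrix: 3 contexts (rows) by 3 features (columns).
    contexts = [[4, 0, 1], [1, 1, 0], [0, 0, 16]]
    for model in ("none", "maxtf", "sqrt", "log"):
        print(model, [scale_row(row, model) for row in contexts])
    print("idf", idf_weights(contexts))
```

The point of the sketch is only that the row models compress large counts
within a context, while idf down-weights features that occur in many
contexts. Check the Cluto manual for the exact definitions before relying
on any of these formulas.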

Now, I think if one is using first order SenseClusters methods, these 
might be particularly interesting to experiment with, as Cluto receives as 
input vectors that are directly based on the contexts to be clustered, and 
there have not been a lot of transformations already applied to the data 
(unless SVD was carried out on the first order vectors, which is an 
option). 

I say this because the Cluto weightings are applied right before 
clustering. In the case of LSA or second order SenseClusters, the
word or feature vectors have already been created and in fact 
averaged together to create the context representation, so the 
weighting is being applied to an average value, which might have somewhat 
less impact. I'm just speculating a little there, but my own observation 
is that varying the -rowmodel or -colmodel settings has minimal effect 
in the second order case or LSA, whereas it (sometimes) has some effect 
on first order methods. 
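
As a rough illustration of the averaging step mentioned above, here is a
small Python sketch of how a second order context representation can be
built from word vectors. This is not SenseClusters code. The function name
context_vector and the toy vectors are hypothetical, and it assumes a plain
unweighted mean over the context words that have a vector.

```python
def context_vector(context_words, word_vectors):
    """Average the vectors of the context words that have one (hypothetical helper)."""
    found = [word_vectors[w] for w in context_words if w in word_vectors]
    if not found:
        return []
    dims = len(found[0])
    # Element-wise mean across all the word vectors that were found.
    return [sum(vec[d] for vec in found) / len(found) for d in range(dims)]


if __name__ == "__main__":
    # Made-up word vectors, e.g. rows of a word-by-word co-occurrence matrix.
    word_vectors = {
        "bank": [4.0, 0.0, 2.0],
        "river": [0.0, 2.0, 2.0],
    }
    # "muddy" has no vector here, so it is skipped.
    print(context_vector(["muddy", "river", "bank"], word_vectors))
```

Any -rowmodel or -colmodel scaling would then be applied to these averaged
values rather than to raw counts, which is the situation described above.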

A related point is that our word vectors (or feature by context vectors 
in the case of LSA) are often subjected to SVD, which is yet
another sort of transformation, and in some respects is trying to adjust 
the weights of the vectors (in fact by adjusting the number of 
dimensions). 

So, I would wonder a little bit about applying some sort of term weighting 
method to the word vectors, then carrying out SVD, then carrying out an  
averaging operation on the word or feature vectors being used to build  
the context representation, and then having Cluto do its own weighting  
on those averaged context representations. It seems like an awful lot of  
transforming, and I think at a certain point the effect would be pretty  
minimal. 

I do think that term weighting prior to SVD for first order methods 
*might* be interesting, but I tend to think that SVD is going to come 
along and kick that data so hard that the weighting might just get washed 
away. I don't have any references at hand that support this view, just 
intuition talking really. Now, if one is not using SVD with the first 
order methods, then the weighting allowed by Cluto might be interesting to 
experiment with. 

I fear that this has turned into a very rambling discourse. Sorry about
that, I'll go ahead and send and then see if that generates any follow up 
discussion or comments that we can sort through. 

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

