Hi Ted,

This gets a little detailed. Let me know if you'd rather I took it off-list.

On Thu, 08 Jan 2004 12:38, you wrote:
> > > ...we do support second order co-occurrences (socs) as a 1st order
> > > feature (via NSP again) so this is a sort of bridge between order 1 and
> > > order 2 style processing.
> >
> > This might be similar to what I do.
>
> Ah, this sounds most interesting!
> 
> > I'm happy to talk about the algorithms, but even if I wanted to GPL my
> > source, I couldn't.
> >
> > If you thought it might be useful I could make it available to you for
> > reference, but GPL is not an option for me at this point.
>
> Hmmm. This is sort of hard. We are very much GPL oriented, so I'm even a
> bit reluctant to look at code that is in some way restricted beyond
> GPL ... the potential complexities that it raises if there is anything
> of interest are significant.

I understand your point of view. Just recently there has been talk of some 
pretty outrageous claims made about Open Source projects by predatory 
companies.

The only thing I am really proprietary about is the application of word 
similarity vectors to the problem of parsing. That is an _application_ of the word
similarity vectors. The generation of the vectors which we are discussing is 
quite a separate problem.

It is in my interests to see better vectors generated, though, because that 
will mean I can achieve more accurate parsing ;-)

> Rather than code, how about papers? Do you have anything written that
> is published that we could look at? It certainly sounds like we have some
> common interests, so it would be nice to know a bit more.

No papers, but if you are only interested in the 2nd order context features, a 
general discussion might be enough. The ideas are fairly easily stated, 
though we have some issues which are not resolved.

Basically I use "frame" contexts as word similarity features. So if a text 
contains two strings:

AXB
AYB

then X and Y can be considered to have a common feature A_B.

Diagrammatically you can represent this as a path between X and Y over A_B 
(hoping that the ASCII formatting is preserved)...

  X
 / \
A   B
 \ /
  Y
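
To make the idea concrete, here is a minimal sketch of frame-feature 
extraction (this is my own illustration, not code from either of our systems; 
the function name, whitespace tokenisation, and the one-word-each-side window 
are all assumptions):

```python
from collections import defaultdict

def frame_features(tokens):
    """Collect frame contexts A_B as features for each middle word X.

    For every window (A, X, B) in the token stream, X gets the
    feature "A_B". Window size and tokenisation are assumptions.
    """
    features = defaultdict(set)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        features[x].add(f"{a}_{b}")
    return features

# The two strings "A X B" and "A Y B" give X and Y the shared feature "A_B"
feats = frame_features("A X B A Y B".split())
assert feats["X"] & feats["Y"] == {"A_B"}
```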

This gives me lists of features. I then calculate similarities based on these 
features. Currently I follow Dekang Lin in the exact form of the similarity 
calculation (ftp://ftp.cs.umanitoba.ca/pub/lindek/papers/sim.ps.gz):

sim(w1,w2) = 2 x Inf(F(w1) ^ F(w2)) / (Inf(F(w1)) + Inf(F(w2)))

Where I'm using ^ for set intersection, and Inf(S) is the amount of 
information contained in a set of features S: Inf(S) = - Sum log P(f) over 
all features f belonging to S, where P(f) is the probability of feature f.

This looks very much like your Dice Coefficient, if the components of your 
context vectors are sums of log feature probabilities.
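
In code, the calculation is just a few lines. A hedged sketch (my own 
illustration of Lin's formula as given above; the function name, the toy 
probabilities, and passing P(f) in as a dict are all assumptions):

```python
import math

def lin_sim(f1, f2, p):
    """Lin's similarity over feature sets f1, f2 with feature probabilities p."""
    def inf(feats):
        # Information content of a feature set: -Sum log P(f)
        return -sum(math.log(p[f]) for f in feats)
    total = inf(f1) + inf(f2)
    return 2 * inf(f1 & f2) / total if total else 0.0

# Toy example with made-up feature probabilities
p = {"A_B": 0.25, "C_D": 0.5, "E_F": 0.25}
s = lin_sim({"A_B", "C_D"}, {"A_B", "E_F"}, p)  # -> 4/7, approx 0.571
```

Identical feature sets score 1.0, and disjoint sets score 0.0, as you would 
expect.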

Anyway, that is 1st order processing.

To do second order you just extend the path. 

  X
 / \
A   B
 \ /
  Z
 / \
C   D
 \ /
  Y

This is the same as saying you substitute Z's vector of words with common 
contexts into X's vector of words with common contexts.
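
The substitution step can be sketched like this (again my own illustration, 
assuming a `first_order` dict that maps each word to the set of words it 
shares a frame context with):

```python
def second_order_features(first_order, word):
    """Expand a word's 1st-order neighbour set into 2nd-order features.

    Following a path X - Z - Y means taking the union of the neighbour
    sets of X's own neighbours: Z's neighbours stand in for Z.
    """
    expanded = set()
    for z in first_order.get(word, ()):
        expanded |= first_order.get(z, set())
    return expanded

# X reaches Y only through the intermediate word Z
neighbours = {"X": {"Z"}, "Z": {"X", "Y"}, "Y": {"Z"}}
assert "Y" in second_order_features(neighbours, "X")
```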

To integrate this with the 1st order similarity calculation we just need to 
estimate a comparable probability for the "feature" represented by this path 
between X and Y. There is an issue with the probability definition to be 
used, because naive relative frequency gives low probabilities and therefore 
high information values, which is counter-intuitive.

Other than that, the main problem is that you get exponentially more features, 
which increases processing time exponentially too.

Let me know if you'd like me to go into more detail.

-Rob


_______________________________________________
senseclusters-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users