Re: [Senseclusters-users] me, again...: scaling up and using syntactic information

ted pedersen Fri, 13 Oct 2006 11:33:33 -0700

Hi Marco,

Sorry to be so slow to reply; I've been away lately and not online too
much. One very quick answer, though: you don't need to use NSP
with SenseClusters. NSP is a convenience, but you can
specify the regular expressions that define the features any way you like.
We use NSP and then nsp2regex because we have found them useful and
interesting, but there is nothing to prevent you from defining your own
features via a regex file, and then SenseClusters will use those instead.
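
To make the idea of regex-defined features concrete, here is a minimal
sketch in Python. It only illustrates the general technique of treating each
regular expression as one feature and each context as a 0/1 vector. The
feature names and patterns are hypothetical, and this is not the syntax of
a SenseClusters regex file; check the SenseClusters documentation for the
real format.

```python
import re

# Hypothetical feature definitions: name -> regex. These names and patterns
# are illustrative only; they are NOT SenseClusters' regex-file syntax.
FEATURES = {
    "adj_fast_before_head": re.compile(r"\bfast\s+<head>"),
    "verb_ate_after_head": re.compile(r"</head>\s+ate\b"),
    "verb_ate_before_head": re.compile(r"\bate\s+(?:\w+\s+)*<head>"),
}

def feature_vector(context):
    """Return a 0/1 list: 1 where the feature's regex matches the context."""
    return [1 if pattern.search(context) else 0 for pattern in FEATURES.values()]

# Contexts with the target word marked by <head>...</head>.
contexts = [
    "fast <head>cat</head> ate",
    "ate poor <head>mouse</head>",
]
vectors = [feature_vector(c) for c in contexts]
```

Each context becomes one row, with one column per regex, so the features
can come from any source, not only from NSP counts.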


And generally speaking, I don't think NSP will scale up too well to
2-billion-word corpora.

So, I am not able to respond in much more depth right now, but wanted to
at least give you these two hints. I'll respond more over the weekend.

Cordially,
Ted

On Thu, 12 Oct 2006, Marco Baroni wrote:

> Sorry to be back so soon, but today has been my "fighting with 
> SenseClusters day"...
> 
> OK, after reading and re-reading the documentation, I finally concluded 
> that working with SC but avoiding NSP and SVAL is virtually impossible, so 
> I gave up trying that...
> 
> Then, with a view to letting my data be re-counted by NSP and 
> represented in SVAL format, I have two questions.
> 
> First: Does anybody know how well this solution scales up? I need to 
> extract counts from corpora of up to 2 billion tokens: is it realistic to 
> let NSP count them?
> 
> Second: I would like to use some sort of structured syntactic information 
> when counting bigrams.
> 
> E.g., suppose I want to cluster nouns. Rather than considering their 
> co-occurrence with everything within a fixed size window, I would like to 
> count their co-occurrences with, say, any A in their noun phrase, any V 
> they are the object of, and any V they are the subject of.
> 
> For example, from the sentence:
> 
> The fast cat with the long black tail ate the poor mouse
> 
> I would like to extract the following bigrams, as far as "cat" is concerned:
> 
> fast cat
> cat ate
> 
> and, for "mouse",
> 
> ate mouse
> poor mouse
> 
> but not, for example, cat black, cat tail, tail mouse, cat mouse, etc.
> 
> I have a rudimentary partial parser that allows me to extract the contexts 
> I want. My question is: how can I feed them to SC?
> 
> I thought of generating, from the above, a representation like:
> 
> fast <head>cat</head> ate
> ate poor <head>mouse</head>
> 
> However, if I use statistical association measures instead of raw 
> frequencies, I don't know how to "tell" the system that the marginals to be 
> considered should be different for, say, "poor mouse" (counts of A, N and 
> AN in all AN sequences) and "ate mouse" (counts of V, N and VN in all VN 
> sequences).
> 
> Am I on the right track with my representation above? Is there a solution 
> to the "different marginals" problem?
> 
> Any hint appreciated -- thanks in advance.
> 
> Regards,
> 
> Marco
> 
> _______________________________________________
> senseclusters-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/senseclusters-users
> 
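On the "different marginals" question quoted above, one possible approach
is to key every count by the syntactic relation, so that an association
score for "poor mouse" only sees AN counts and a score for "ate mouse" only
sees VN counts. The Python sketch below illustrates this with pointwise
mutual information. The triples, the relation labels ("AN", "VN_subj",
"VN_obj") and the function name are assumptions made up for the example;
this is not an NSP or SenseClusters feature.

```python
import math
from collections import Counter

# Hypothetical output of a partial parser: (relation, other_word, noun).
# The relation labels are invented for this sketch.
triples = [
    ("AN", "fast", "cat"),
    ("AN", "poor", "mouse"),
    ("AN", "poor", "cat"),
    ("VN_subj", "ate", "cat"),
    ("VN_obj", "ate", "mouse"),
    ("VN_obj", "chased", "mouse"),
]

# All counts are keyed by relation, so marginals never mix across relations.
pair_counts = Counter(triples)
left_counts = Counter((rel, word) for rel, word, _ in triples)
right_counts = Counter((rel, noun) for rel, _, noun in triples)
rel_totals = Counter(rel for rel, _, _ in triples)

def pmi(rel, word, noun):
    """Pointwise mutual information computed within a single relation."""
    joint = pair_counts[(rel, word, noun)]
    if joint == 0:
        return float("-inf")
    total = rel_totals[rel]
    expected = left_counts[(rel, word)] * right_counts[(rel, noun)]
    return math.log2(joint * total / expected)

an_score = pmi("AN", "poor", "mouse")     # uses AN marginals only
vn_score = pmi("VN_obj", "ate", "mouse")  # uses VN_obj marginals only
```

Note that "ate" occurs twice overall but only once as the verb of an object
relation, and only that one occurrence enters the VN_obj marginal.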

-- 
--
Ted Pedersen
http://www.d.umn.edu/~tpederse


