Sorry to be back so soon, but today has been my "fighting with SenseClusters day"...
OK, after reading and re-reading the documentation, I finally concluded that working with SC but avoiding NSP and SVAL is virtually impossible, so I gave up trying that... So, assuming I let NSP re-count my data and represent it in SVAL format, I have two questions.

First: Does anybody know how well this solution scales up? I need to extract counts from corpora of up to 2 billion tokens: is it realistic to let NSP count them?

Second: I would like to use some sort of structured syntactic information when counting bigrams. E.g., suppose I want to cluster nouns. Rather than considering their co-occurrence with everything within a fixed-size window, I would like to count their co-occurrences with, say, any A in their noun phrase, any V they are the object of, and any V they are the subject of.

For example, from the sentence:

    The fast cat with the long black tail ate the poor mouse

I would like to extract the following bigrams, as far as "cat" is concerned:

    fast cat
    cat ate

and, for "mouse":

    ate mouse
    poor mouse

but not, for example, cat black, cat tail, tail mouse, cat mouse, etc.

I have a rudimentary partial parser that allows me to extract the contexts I want. My question is: how can I feed them to SC? I thought of generating, from the above, a representation like:

    fast <head>cat</head> ate
    ate poor <head>mouse</head>

However, if I use statistical association measures instead of raw frequencies, I don't know how to "tell" the system that the marginals to be considered should be different for, say, "poor mouse" (counts of A, N and AN in all AN sequences) and "ate mouse" (counts of V, N and VN in all VN sequences).

Am I on the right track with my representation above? Is there a solution to the "different marginals" problem?

Any hint appreciated. Thanks in advance.

Regards,
Marco

_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
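To make the "different marginals" point concrete, here is a minimal sketch in plain Python. It does not use NSP or SenseClusters. The relation labels (AN, SV, VO), the triple format and the function names are assumptions made up for illustration; the triples stand in for whatever the partial parser would emit. Counts are kept separately per relation, so "poor mouse" is scored against AN marginals only and "ate mouse" against VO marginals only. Pointwise mutual information is used here simply as one example of an association measure.

```python
import math
from collections import Counter

# Hypothetical parser output: (relation, left word, right word).
# The relation labels are illustrative, not an NSP or SenseClusters format.
typed_bigrams = [
    ("AN", "fast", "cat"),
    ("SV", "cat", "ate"),
    ("VO", "ate", "mouse"),
    ("AN", "poor", "mouse"),
]

def relation_tables(triples):
    """Count joint and marginal frequencies separately for each relation."""
    joint = Counter()  # (rel, w1, w2) -> n11
    left = Counter()   # (rel, w1)     -> n1p, w1 as left member within rel
    right = Counter()  # (rel, w2)     -> np1, w2 as right member within rel
    total = Counter()  # rel           -> npp, all bigrams of that relation
    for rel, w1, w2 in triples:
        joint[(rel, w1, w2)] += 1
        left[(rel, w1)] += 1
        right[(rel, w2)] += 1
        total[rel] += 1
    return joint, left, right, total

def pmi(rel, w1, w2, tables):
    """Pointwise mutual information using only the marginals of `rel`."""
    joint, left, right, total = tables
    n11 = joint[(rel, w1, w2)]
    if n11 == 0:
        return float("-inf")
    expected = left[(rel, w1)] * right[(rel, w2)] / total[rel]
    return math.log2(n11 / expected)

tables = relation_tables(typed_bigrams)
an_score = pmi("AN", "poor", "mouse", tables)  # scored against AN counts only
vo_score = pmi("VO", "ate", "mouse", tables)   # scored against VO counts only
```

Note that "mouse" ends up with two independent marginal counts, one as the noun of AN pairs and one as the object of VO pairs. Whether such per-relation tables can be handed to NSP or SC directly is the open question above; the sketch only pins down the arithmetic being asked for.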
