Yeah, this kind of thing would be great for us to have, and not hard given what we already have.
Ted, we don't have an MR job to scan through a corpus and output
[ngram : LLR] key-value pairs, do we? I've got one we use at LinkedIn
that I could try and pull out if we don't have one. (I actually used to
give this MR job as an interview question, because it's a cute problem
you can work out the basics of in not too long.)

With one job producing the list of best collocations, depending on how
many you want to keep, there are a couple of strategies for then joining
that data into your original corpus...

-jake

On Tue, Jan 5, 2010 at 11:58 AM, Ted Dunning <[email protected]> wrote:

> We do have a partial framework for this, including the log-likelihood
> ratio test computation.
>
> For the most part, we don't have anything that specifically counts
> bigrams and words and arranges the counts in the right order for
> application, but that is relatively easy to write for map-reduce.
>
> I would be happy to provide pointers on the tricks I have seen to make
> that easy to do if you wanted to actually type the semi-colons and such.
>
> On Tue, Jan 5, 2010 at 9:02 AM, zaki rahaman <[email protected]> wrote:
>
> > Pardon my ignorance, as this is probably best handled by an NLP
> > package like GATE or LingPipe, but does Mahout provide anything for
> > collocations? Or does anyone know of a MapReducible way to calculate
> > something like t-values for tokens in N-grams? I've got quite a large
> > collection that I have to prune, filter, and preprocess, but I still
> > expect it to be a significant size.
> >
> > --
> > Zaki Rahaman
>
> --
> Ted Dunning, CTO
> DeepDyve
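
For anyone following along, here's a minimal, self-contained sketch of
the LLR scoring piece Ted mentions, following Dunning's G^2 formulation
over the usual 2x2 bigram contingency table. This is not the LinkedIn
job Jake refers to and not an existing Mahout class; the BigramLlr name
and the toy counts in main() are made up for illustration.

// Dunning's log-likelihood ratio (G^2) for a bigram "A B", given raw
// counts from a corpus. Illustrative sketch only, not a Mahout API.
public final class BigramLlr {

  // x * log(x), defined as 0 when x == 0.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a set of counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long k : counts) {
      sum += k;
      sumXLogX += xLogX(k);
    }
    return xLogX(sum) - sumXLogX;
  }

  /**
   * LLR for the 2x2 contingency table of bigram "A B":
   *   k11 = count(A B)               the bigram itself
   *   k12 = count(A *) - count(A B)  A followed by something else
   *   k21 = count(* B) - count(A B)  B preceded by something else
   *   k22 = N - k11 - k12 - k21      everything else
   * where N is the total number of bigrams in the corpus.
   */
  public static double logLikelihoodRatio(long k11, long k12,
                                          long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negatives caused by floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  public static void main(String[] args) {
    // Toy numbers: "new york" occurs 110 times, "new" starts 2552
    // bigrams, "york" ends 130, out of 500,000 bigrams total.
    long k11 = 110;
    long k12 = 2552 - 110;
    long k21 = 130 - 110;
    long k22 = 500_000L - k11 - k12 - k21;
    System.out.println(logLikelihoodRatio(k11, k12, k21, k22));
  }
}

Assembling k11/k12/k21/k22 per bigram is the map-reduce part: one pass
emits bigram, left-word, and right-word counts plus the grand total,
and a reduce-side join brings the four counts together per bigram
before applying the function above.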
