Yeah, doing this kind of thing would be great for us to have, and not
hard given what we already have.

Ted, we don't have an MR job to scan through a corpus and output
[ngram : LLR] key-value pairs, do we?  I've got one we use at LinkedIn
that I could try to pull out if we don't have one.

(I actually used to give this MR job as an interview question, because
it's a cute problem you can work out the basics of in not too long.)
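
To give a feel for it, here's the basic shape as a toy in-memory sketch
rather than the actual Hadoop job (the class name, tokenization, and
tiny corpus are mine, just for illustration, not anything that exists
in Mahout). The first loop plays the mapper, counting bigrams plus each
word's occurrences as the head and the tail of a bigram; the second
plays the reducer, arranging those counts into the 2x2 contingency
table and scoring it with Dunning's LLR:

import java.util.HashMap;
import java.util.Map;

public class BigramLlrSketch {

  // x * ln(x), with the 0 * ln(0) = 0 convention.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Dunning's log-likelihood ratio for a 2x2 contingency table.
  static double llr(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double rowEntropy = xLogX(n) - xLogX(k11 + k12) - xLogX(k21 + k22);
    double colEntropy = xLogX(n) - xLogX(k11 + k21) - xLogX(k12 + k22);
    double matEntropy = xLogX(n)
        - xLogX(k11) - xLogX(k12) - xLogX(k21) - xLogX(k22);
    // Clamp tiny negative values caused by floating-point roundoff.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  public static void main(String[] args) {
    String[] tokens = ("the quick brown fox jumps over the lazy dog "
        + "the quick brown fox is quick").split("\\s+");

    // "Map" phase: count each bigram, plus each word's appearances
    // as the head and as the tail of a bigram.
    Map<String, Long> heads = new HashMap<>();
    Map<String, Long> tails = new HashMap<>();
    Map<String, Long> bigrams = new HashMap<>();
    for (int i = 0; i + 1 < tokens.length; i++) {
      heads.merge(tokens[i], 1L, Long::sum);
      tails.merge(tokens[i + 1], 1L, Long::sum);
      bigrams.merge(tokens[i] + " " + tokens[i + 1], 1L, Long::sum);
    }

    // "Reduce" phase: arrange the counts into the 2x2 table for each
    // bigram (A B) and emit [ngram : LLR].
    long n = tokens.length - 1;                 // total bigram events
    for (Map.Entry<String, Long> e : bigrams.entrySet()) {
      String[] ab = e.getKey().split(" ");
      long k11 = e.getValue();                  // count(A B)
      long k12 = heads.get(ab[0]) - k11;        // A followed by not-B
      long k21 = tails.get(ab[1]) - k11;        // not-A followed by B
      long k22 = n - k11 - k12 - k21;           // neither
      System.out.printf("%-12s %.4f%n", e.getKey(),
          llr(k11, k12, k21, k22));
    }
  }
}

In the real job you'd key the counts by ngram and join the word totals
in before scoring, but the arithmetic is exactly this.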

With one job producing the list of best collocations, depending on how
many you want to keep, there are a couple of strategies for then
joining that data back into your original corpus...
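
If the kept list is small enough to hold in memory, the simplest of
those is a map-side join: ship the list to every node alongside the job
(e.g. via the distributed cache) and rewrite the corpus in a single
map-only pass. Here's a rough sketch; the "collocations.txt" file name,
the one-bigram-per-line format, and the underscore-joining convention
are all just assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Map-only pass that rewrites each line of the corpus, joining any
 * adjacent word pair found in the kept-collocations list
 * (e.g. "new york" -> "new_york").
 */
public class CollocationJoinMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Set<String> collocations = new HashSet<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the collocation list was shipped with the job (e.g. via
    // the distributed cache) as a local file, one bigram per line.
    try (BufferedReader in =
        new BufferedReader(new FileReader("collocations.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        collocations.add(line.trim());
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] tokens = value.toString().split("\\s+");
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
      if (i + 1 < tokens.length
          && collocations.contains(tokens[i] + " " + tokens[i + 1])) {
        out.append(tokens[i]).append('_').append(tokens[i + 1]);
        i++;  // consume both words of the collocation
      } else {
        out.append(tokens[i]);
      }
      if (i + 1 < tokens.length) {
        out.append(' ');
      }
    }
    context.write(NullWritable.get(), new Text(out.toString()));
  }
}

If the list is too big for that, the fallback is a reduce-side join:
emit both the corpus ngrams and the collocation list keyed by the ngram
itself and merge them in the reducer, at the cost of shuffling the
whole corpus.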

  -jake

On Tue, Jan 5, 2010 at 11:58 AM, Ted Dunning <[email protected]> wrote:

> We do have a partial framework for this, including log-likelihood
> ratio test computation.
>
> For the most part, we don't have anything that specifically counts bigrams
> and words and arranges the counts in the right order for application, but
> that is relatively easy to write for map-reduce.
>
> I would be happy to provide pointers on the tricks I have seen to make that
> easy to do if you wanted to actually type the semi-colons and such.
>
> On Tue, Jan 5, 2010 at 9:02 AM, zaki rahaman <[email protected]>
> wrote:
>
> > Pardon my ignorance as this is probably best handled by an NLP
> > package like GATE or LingPipe, but does Mahout provide anything for
> > collocations? Or does anyone know of a MapReducible way to calculate
> > something like t-values for tokens in N-grams? I've got quite a
> > large collection that I have to prune, filter, and preprocess, but I
> > still expect it to be a significant size.
> >
> > --
> > Zaki Rahaman
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>