I've managed to get k-means clustering working, but I agree it would be
very nice to have an end-to-end example that would allow others to get up
to speed quickly. I think the largest holes here are vacuuming a corpus of
text into the Lucene index and presenting a human-readable display of the
results. It might also be interesting to calculate and include some
metrics, such as the F-measure (in cases where we have a reference
categorization) and a scatter score (in cases where we don't).
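
For reference, computing the F-measure itself is simple once you have
precision and recall for a cluster against its reference category; a
minimal sketch in plain Java (not tied to any existing Mahout API):

    static double fMeasure(double precision, double recall) {
      // harmonic mean of precision and recall; 0 when both are 0
      return (precision + recall == 0.0)
          ? 0.0
          : 2.0 * precision * recall / (precision + recall);
    }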

The existing LDA example would be a useful starting point. It slurps in
the Reuters-21578
corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
converts it to text, loads it into a Lucene index, extracts vectors from
that index, and runs LDA on them.

This example uses the Lucene benchmark utilities for the input-to-text
conversion and the Lucene loading. The benchmark utilities code is
readable but complex. It would be very nice to have a simple piece of
code that handles the creation of the Lucene index and that others can
easily adapt to their own corpus; something like the sketch below.
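
Something along these lines might be enough as a starting point. This is
only a rough sketch assuming Lucene 3.0; the class name, field names
("path", "contents") and the directory-of-text-files layout are just
placeholders, not anything that exists in the tree today. The important
bit is storing term vectors so the downstream vector extraction has
something to work with.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    /** Indexes a directory of plain-text files, one Lucene document per file. */
    public class SimpleTextIndexer {

      public static void main(String[] args) throws IOException {
        File corpusDir = new File(args[0]); // directory of .txt files
        File indexDir = new File(args[1]);  // where the index gets written

        IndexWriter writer = new IndexWriter(
            FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_30),
            true,                                  // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);

        for (File f : corpusDir.listFiles()) {
          if (!f.isFile()) {
            continue;
          }
          Document doc = new Document();
          // keep the path stored so cluster output can point back at files
          doc.add(new Field("path", f.getPath(),
              Field.Store.YES, Field.Index.NOT_ANALYZED));
          // term vectors are needed downstream for vector extraction
          doc.add(new Field("contents", readFile(f),
              Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
          writer.addDocument(doc);
        }
        writer.close();
      }

      private static String readFile(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
          }
        } finally {
          reader.close();
        }
        return sb.toString();
      }
    }

Anyone adapting this to their own corpus would only need to change how the
files are discovered and read; the index layout stays the same.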

On Sat, Jan 2, 2010 at 2:10 PM, Benson Margulies <[email protected]>
wrote:
> As someone who tried, not hard enough, and failed, to assemble all
> these bits in a row, I can only say that the situation cries out for
> an end-to-end sample. I'd be willing to help lick it into shape to be
> checked-in as such. My idea is that it should set up to vacuum-cleaner
> up a corpus of text, push it through Lucene, pull it out as vectors,
> tickle the pig hadoop, and deliver actual doc paths arranged by
> cluster.
>
