As someone who tried, not hard enough, and failed to assemble all
these bits in a row, I can only say that the situation cries out for
an end-to-end sample. I'd be willing to help lick it into shape to be
checked in as such. My idea is that it should vacuum up a corpus of
text, push it through Lucene, pull the result out as vectors, feed
those to the k-means job on Hadoop, and deliver actual doc paths
arranged by cluster.
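To make the idea concrete, here is a minimal pure-Python sketch of the shape such an end-to-end sample might take. It is only a toy: a hand-rolled TF-IDF vectorizer and a tiny in-memory k-means stand in for the real Lucene indexing and the Mahout/Hadoop job, and every name here (`tfidf_vectors`, `kmeans`, the sample corpus paths) is illustrative, not Mahout API.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """docs: {path: text}. Return {path: {term: weight}} using a toy TF-IDF."""
    tokenized = {path: text.lower().split() for path, text in docs.items()}
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # document frequency per term
    n = len(docs)
    return {path: {w: c * math.log(n / df[w]) for w, c in Counter(tokens).items()}
            for path, tokens in tokenized.items()}

def dist(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2
                         for w in set(a) | set(b)))

def mean(vectors):
    """Centroid of a list of sparse vectors."""
    acc = Counter()
    for v in vectors:
        acc.update(v)
    return {w: x / len(vectors) for w, x in acc.items()}

def kmeans(vecs, k, iters=10, seed=0):
    """Hard k-means over sparse vectors; returns k lists of doc paths."""
    paths = sorted(vecs)
    centroids = [dict(vecs[p]) for p in random.Random(seed).sample(paths, k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in paths:  # hard assignment: each doc lands in exactly one cluster
            best = min(range(k), key=lambda i: dist(vecs[p], centroids[i]))
            clusters[best].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = mean([vecs[p] for p in members])
    return clusters

corpus = {  # stand-in for the vacuumed-up text corpus
    "docs/a.txt": "hadoop cluster mapreduce job",
    "docs/b.txt": "hadoop mapreduce cluster nodes",
    "docs/c.txt": "lucene index term vector",
    "docs/d.txt": "lucene term query index",
}
for i, members in enumerate(kmeans(tfidf_vectors(corpus), k=2)):
    print("cluster", i, "->", members)
```

The real sample would swap `tfidf_vectors` for Lucene term vectors and `kmeans` for the Hadoop-driven Mahout job, but the output contract is the thing: actual doc paths arranged by cluster.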

On Sat, Jan 2, 2010 at 1:44 PM, Ted Dunning <[email protected]> wrote:
> Since k-means is a hard clustering, that term should appear in no more than
> 2 clusters, and even that is very unlikely. It is also very unlikely that
> the cluster explanation would return that term as a top term even if it
> appeared in just one cluster.
>
> This could be some confusion in turning the ids back into terms. It
> definitely does indicate a serious problem.
>
> On Sat, Jan 2, 2010 at 10:27 AM, Bogdan Vatkov <[email protected]>wrote:
>
>> How is this even possible: for 23,000 docs and a term which is
>> mentioned only 2 times, I have it as a top term in 9 clusters? I definitely
>> did something wrong; do you have an idea what that could be?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
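Ted's arithmetic is easy to sanity-check: in a hard clustering each doc belongs to exactly one cluster, so a term that occurs in only 2 docs can contribute weight to at most 2 centroids; every other centroid sees it exactly zero times. A toy check (the function name, the tiny corpus, and the assignment below are made up for illustration):

```python
from collections import Counter

def clusters_seeing_term(assignment, doc_term_counts, term):
    """assignment: {doc: cluster}, one cluster per doc (hard clustering).
    Return {cluster: total count of `term` among its member docs},
    keeping only clusters where that count is nonzero."""
    totals = Counter()
    for doc, cluster in assignment.items():
        count = doc_term_counts[doc].get(term, 0)
        if count:
            totals[cluster] += count
    return totals

# "rare" occurs in exactly 2 of 6 docs; each doc is hard-assigned to one cluster
doc_term_counts = {f"d{i}": {"common": 5} for i in range(6)}
doc_term_counts["d0"] = {"common": 5, "rare": 1}
doc_term_counts["d3"] = {"common": 5, "rare": 1}
assignment = {"d0": 0, "d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}

seen = clusters_seeing_term(assignment, doc_term_counts, "rare")
print(seen)  # only the 2 clusters holding d0 and d3 can see "rare" at all
```

However the 23,000 docs get partitioned, no hard assignment can make a 2-occurrence term carry weight in 9 centroids, which is why the id-to-term mapping is the prime suspect.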
