It likely means that your cluster's cardinality is different from your input vector's cardinality. If your input vectors are term vectors computed from Lucene, this could occur if a new term is introduced, increasing the size of the input vector. I can also see some problems if you are using seq2sparse for just the new vector, as that builds a new term dictionary. Also, TF-IDF needs to compute document frequencies over the entire corpus, which won't work incrementally.
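To make the mismatch concrete, here is a minimal sketch (assuming the Mahout 0.4-era org.apache.mahout.math and distance classes; the dimensions are made up) of how a term dictionary that has grown between runs produces the exception:

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CardinalityMismatchDemo {
  public static void main(String[] args) {
    // Cluster centers were computed when the term dictionary had 10000 entries...
    Vector center = new RandomAccessSparseVector(10000);
    center.setQuick(42, 1.0);

    // ...but the new document was vectorized against a dictionary that has since
    // grown to 10001 terms, so its cardinality no longer matches the centers'.
    Vector newDoc = new RandomAccessSparseVector(10001);
    newDoc.setQuick(42, 1.0);

    // The distance measures check the two sizes first, so this call throws
    // CardinalityException instead of returning a distance.
    double d = new EuclideanDistanceMeasure().distance(center, newDoc);
    System.out.println(d);
  }
}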
I think you can fool the clustering by setting the sizes of your input vectors to max_int, but that won't help you with the other issues above. Our text processing algorithms will take some adjustments to handle this preprocessing correctly.

-----Original Message-----
From: Edoardo Tosca [mailto:[email protected]]
Sent: Wednesday, November 24, 2010 9:16 AM
To: [email protected]
Subject: Re: (Near) Realtime clustering

Thank you. I am trying to add new documents, but I'm stuck on an exception.
Basically I copied some code from KMeansDriver, and I execute the clusterDataSeq method.
I have seen that clusterDataSeq accepts a clusterIn Path parameter that should be the path containing the already generated clusters. Am I right?
When it tries to emitPointToNearestCluster, and in particular when it calculates the distance, a CardinalityException is thrown: what does it mean?
BTW, I'm creating the vectors from documents in a Lucene index.

On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <[email protected]> wrote:
> Note that the clustering drivers all have a static clusterData() method to
> run just the clustering (classification) of points. You would have to call
> this from your own driver, as the current CLI does not offer just this
> option, but something like this should work:
>
> - Input documents are vectorized into sequence files which have timestamps,
>   so you know when to delete documents which have aged
> - Run full clustering over all remaining documents to produce clusters-n
>   and clusteredPoints. This is the batch job over the entire corpus.
> - As new documents are received, use the clusterData() method to classify
>   them using the previous clusters-n. This can be run using -xm sequential,
>   so it is all done in memory.
> - Periodically, add all the new documents to the corpus, delete any which
>   have aged out of your time window, and start over
>
> -----Original Message-----
> From: Divya [mailto:[email protected]]
> Sent: Tuesday, November 23, 2010 6:32 PM
> To: [email protected]
> Subject: RE: (Near) Realtime clustering
>
> Hi,
>
> I also have a similar requirement.
> Can someone please provide me the steps of the hybrid approach?
>
> Regards,
> Divya
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, November 24, 2010 2:19 AM
> To: [email protected]
> Subject: RE: (Near) Realtime clustering
>
> I'd suggest a hybrid approach: run the batch clustering periodically over
> the entire corpus to update the cluster centers, and then use those centers
> for real-time clustering (classification) of new documents as they arrive.
> You can use the sequential execution mode of the clustering job to classify
> documents in real time. This will suffer from the fact that new news topics
> will not materialize as new clusters until the batch job runs again.
>
> -----Original Message-----
> From: Gustavo Fernandes [mailto:[email protected]]
> Sent: Tuesday, November 23, 2010 9:58 AM
> To: [email protected]
> Subject: (Near) Realtime clustering
>
> Hello, we have a mission to implement a system to cluster news articles in
> near-real-time mode. We have a large number of articles (millions), and we
> started using k-means to create clusters based on a fixed value of "k". The
> problem is that we have a constant incoming flow of news articles and we
> can't afford to rely on a batch process; we need to be able to present users
> with clustered articles as soon as they arrive in our database. So far our
> clusters are saved into a SequenceFile, as normally output by the k-means
> driver.
> What would be the recommended way of approaching this problem with Mahout?
> Is it possible to manipulate the generated clusters and incrementally add
> new articles to them, or even form new clusters, without incurring the
> penalty of recalculating for every vector again? Is starting with k-means
> the right way? What would be the right combination of algorithms to provide
> incremental and fast clustering calculation?
>
> TIA,
> Gustavo
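The classification step described in the thread above (what clusterData()/clusterDataSeq does when run with -xm sequential) boils down to assigning each new document vector to its nearest existing centroid. Here is a minimal in-memory sketch of just that step; the centroid-loading code is omitted and the CosineDistanceMeasure choice is only an example, since the cluster Writables and the exact clusterData() signature vary between Mahout releases:

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class IncrementalClassifier {

  private final List<Vector> centers;   // centroids read from the latest clusters-n
  private final DistanceMeasure measure = new CosineDistanceMeasure();

  public IncrementalClassifier(List<Vector> centers) {
    this.centers = centers;
  }

  // Returns the index of the closest centroid for one newly arrived document vector.
  public int classify(Vector doc) {
    int best = -1;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = measure.distance(centers.get(i), doc);  // same cardinality required
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }
}

New documents must of course be vectorized with the same dictionary (and the same fixed cardinality) as the vectors the centroids were built from; otherwise you are back to the CardinalityException discussed at the top of this thread.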
