Note that the clustering drivers all have a static clusterData() method that runs 
just the clustering (classification) of points. You would have to call it from 
your own driver, since the current CLI does not offer this step by itself, but 
a workflow like the following should work:

- Vectorize incoming documents into sequence files that carry timestamps, so 
you know when to delete documents that have aged
- Run full clustering over all remaining documents to produce clusters-n and 
clusteredPoints. This is the batch job over the entire corpus.
- As new documents are received, use the clusterData() method to classify them 
against the previous clusters-n. This can be run with -xm sequential so it is 
all done in memory (see the sketch after this list).
- Periodically, add all the new documents to the corpus, delete any that have 
aged out of your time window, and start over
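
For instance, a minimal driver along the lines below could run the
classification step. This is only a sketch: the paths and the choice of
CosineDistanceMeasure are assumptions, and clusterData()'s exact argument
list has changed between Mahout releases, so check the KMeansDriver in the
version you build against.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;

    public class IncrementalClassifier {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Vectors for the newly arrived documents (hypothetical path)
        Path input = new Path("news/new-vectors");
        // The clusters-n directory written by the last batch run
        Path clustersIn = new Path("news/batch-output/clusters-n");
        // Where clusteredPoints for the new documents will be written
        Path output = new Path("news/incremental-output");

        // runSequential = true keeps the classification in memory,
        // the same as -xm sequential on the command line.
        // NOTE: clusterData()'s parameter list varies across releases;
        // adjust to match your KMeansDriver.
        KMeansDriver.clusterData(conf, input, clustersIn, output,
            new CosineDistanceMeasure(), true);
      }
    }

The clusteredPoints it writes can then be merged with the batch output when
presenting results, until the next full batch run picks up the new documents.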

-----Original Message-----
From: Divya [mailto:[email protected]] 
Sent: Tuesday, November 23, 2010 6:32 PM
To: [email protected]
Subject: RE: (Near) Realtime clustering

Hi,

I also have a similar requirement.
Can someone please provide the steps of the hybrid approach?


Regards,
Divya 

-----Original Message-----
From: Jeff Eastman [mailto:[email protected]] 
Sent: Wednesday, November 24, 2010 2:19 AM
To: [email protected]
Subject: RE: (Near) Realtime clustering

I'd suggest a hybrid approach: run the batch clustering periodically over
the entire corpus to update the cluster centers, then use those centers
for real-time clustering (classification) of new documents as they arrive.
You can use the sequential execution mode of the clustering job to classify
documents in real time. The drawback is that new news topics will not
materialize as new clusters until the batch job runs again.
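
As a rough sketch of the batch side, the periodic job could look something
like the code below. The paths are placeholders, CosineDistanceMeasure is
just one reasonable choice for text, and run()'s parameter list differs
between Mahout releases, so treat this as an outline rather than a drop-in
driver.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;

    public class BatchReclusterJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Full corpus of document vectors (hypothetical path)
        Path input = new Path("news/all-vectors");
        // Seed centers, e.g. produced by RandomSeedGenerator
        Path clustersIn = new Path("news/initial-clusters");
        Path output = new Path("news/batch-output");

        // Iterate up to 10 times or until centers move less than 0.01.
        // runClustering = true also writes clusteredPoints;
        // runSequential = false runs the job as MapReduce.
        // NOTE: run()'s parameter list varies across releases;
        // adjust to match your KMeansDriver.
        KMeansDriver.run(conf, input, clustersIn, output,
            new CosineDistanceMeasure(), 0.01, 10, true, false);
      }
    }

Each batch run's final clusters-n directory then becomes the clustersIn
input to the sequential classification step sketched earlier in this thread.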

-----Original Message-----
From: Gustavo Fernandes [mailto:[email protected]] 
Sent: Tuesday, November 23, 2010 9:58 AM
To: [email protected]
Subject: (Near) Realtime clustering

Hello, we have a mission to implement a system to cluster news articles in
near real time. We have a large number of articles (millions), and we
started using k-means to create clusters based on a fixed value of "k". The
problem is that we have a constant incoming flow of news articles and we
can't afford to rely on a batch process; we need to be able to present users
with clustered articles as soon as they arrive in our database. So far our
clusters are saved into a SequenceFile, as normally output by the k-means
driver.
What would be the recommended way of approaching this problem with Mahout?
Is it possible to manipulate the generated clusters and incrementally add
new articles to them, or even to form new clusters, without incurring the
penalty of recalculating over every vector again? Is starting with k-means
the right way? What would be the right combination of algorithms to provide
incremental and fast clustering?

TIA,
Gustavo
