Hi,

I have a similar requirement. Could someone please provide the steps for the hybrid approach?

Regards,
Divya

-----Original Message-----
From: Jeff Eastman [mailto:[email protected]]
Sent: Wednesday, November 24, 2010 2:19 AM
To: [email protected]
Subject: RE: (Near) Realtime clustering

I'd suggest a hybrid approach: Run the batch clustering periodically over the entire corpus to update the cluster centers, and then use those centers for real-time clustering (classification) of new documents as they arrive. You can use the sequential execution mode of the clustering job to classify documents in real time. The drawback is that new news topics will not materialize as new clusters until the batch job runs again.

-----Original Message-----
From: Gustavo Fernandes [mailto:[email protected]]
Sent: Tuesday, November 23, 2010 9:58 AM
To: [email protected]
Subject: (Near) Realtime clustering

Hello,

We have a mission to implement a system that clusters news articles in near real time. We have a large number of articles (millions), and we started using k-means to create clusters based on a fixed value of "k".

The problem is that we have a constant incoming flow of news articles and we can't afford to rely on a batch process; we need to be able to present users with clustered articles as soon as they arrive in our database.

So far our clusters are saved into a SequenceFile, as normally output by the k-means driver.

What would be the recommended way of approaching this problem with Mahout? Is it possible to manipulate the generated clusters and incrementally add new articles to them, or even form new clusters, without incurring the penalty of recalculating for every vector again?

Is starting with k-means the right way? What would be the right combination of algorithms to provide incremental and fast clustering calculation?

TIA,
Gustavo
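
For anyone following the thread, the hybrid approach Jeff describes (periodic batch clustering to refresh the centers, plus nearest-center assignment of each new document in between) could be sketched roughly as below. This is a plain-Python illustration only, not Mahout code: `HybridClusterer`, `classify`, `rebuild`, `batch_kmeans`, and `nearest_center` are hypothetical names made up for this sketch, documents are assumed to be already vectorized as lists of floats, and Euclidean distance is an assumption (the thread does not name a distance measure).

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(vector, centers):
    """Return (index, distance) of the center closest to the vector."""
    best = min(range(len(centers)),
               key=lambda i: squared_distance(vector, centers[i]))
    return best, math.sqrt(squared_distance(vector, centers[best]))


def batch_kmeans(vectors, centers, iterations=10):
    """Plain Lloyd iterations: assign every vector, then recompute the means."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            idx, _ = nearest_center(v, centers)
            groups[idx].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


class HybridClusterer:
    """Hypothetical wrapper: cheap per-document step plus periodic batch step."""

    def __init__(self, initial_centers):
        self.centers = initial_centers
        self.corpus = []

    def classify(self, vector):
        """Real-time step: store the document, return its nearest cluster index."""
        self.corpus.append(vector)
        idx, _ = nearest_center(vector, self.centers)
        return idx

    def rebuild(self):
        """Batch step: re-run k-means over the whole corpus (e.g. on a schedule)."""
        self.centers = batch_kmeans(self.corpus, self.centers)
```

In this sketch `classify` would be called for each arriving article and `rebuild` from a scheduled job. As Jeff notes, a genuinely new topic is still forced into an existing cluster until the next rebuild; the distance returned by `nearest_center` could be used to flag such poorly fitting articles, though that is an extension not discussed in the thread.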
