It likely means that your cluster's cardinality is different from your input vector's cardinality. If your input vectors are term vectors computed from Lucene, this could occur if a new term is introduced, increasing the size of the input vector. I can also see some problems if you are using seq2sparse for just the new vector, as that builds a new term dictionary. Also, TF-IDF needs to compute document frequencies over the entire corpus, which won't work incrementally.
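To make the mismatch concrete, here is a minimal sketch (assuming the Mahout 0.4-era org.apache.mahout.math and distance classes; the dimensions are made up) of how a term dictionary that has grown between runs produces the exception:

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CardinalityMismatchDemo {
  public static void main(String[] args) {
    // Cluster centers were computed when the term dictionary had 10000 entries...
    Vector center = new RandomAccessSparseVector(10000);
    center.setQuick(42, 1.0);

    // ...but the new document was vectorized against a dictionary that has since
    // grown to 10001 terms, so its cardinality no longer matches the centers'.
    Vector newDoc = new RandomAccessSparseVector(10001);
    newDoc.setQuick(42, 1.0);

    // The distance measures check the two sizes first, so this call throws
    // CardinalityException instead of returning a distance.
    double d = new EuclideanDistanceMeasure().distance(center, newDoc);
    System.out.println(d);
  }
}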
I think you can fool the clustering by setting the sizes of your input vectors to max_int, but that won't help you with the other issues above. Our text processing algorithms will take some adjustments to handle this preprocessing correctly.

-----Original Message-----
From: Edoardo Tosca [mailto:[email protected]]
Sent: Wednesday, November 24, 2010 9:16 AM
To: [email protected]
Subject: Re: (Near) Realtime clustering

Thank you. I am trying to add new documents, but I'm stuck on an exception.
Basically I copied some code from KMeansDriver, and I execute the clusterDataSeq method.
I have seen that clusterDataSeq accepts a clusterIn Path parameter that should be the path containing the already generated clusters. Am I right?
When it tries to emitPointToNearestCluster, and in particular when it calculates the distance, a CardinalityException is thrown: what does it mean?
BTW, I'm creating the vectors from documents in a Lucene index.

On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <[email protected]> wrote:
> Note that the clustering drivers all have a static clusterData() method to
> run just the clustering (classification) of points. You would have to call
> this from your own driver, as the current CLI does not offer just this
> option, but something like this should work:
>
> - Input documents are vectorized into sequence files which have timestamps,
>   so you know when to delete documents which have aged
> - Run full clustering over all remaining documents to produce clusters-n
>   and clusteredPoints. This is the batch job over the entire corpus.
> - As new documents are received, use the clusterData() method to classify
>   them using the previous clusters-n. This can be run using -xm sequential,
>   so it is all done in memory.
> - Periodically, add all the new documents to the corpus, delete any which
>   have aged out of your time window, and start over
>
> -----Original Message-----
> From: Divya [mailto:[email protected]]
> Sent: Tuesday, November 23, 2010 6:32 PM
> To: [email protected]
> Subject: RE: (Near) Realtime clustering
>
> Hi,
>
> I also have a similar requirement.
> Can someone please provide me the steps of the hybrid approach?
>
> Regards,
> Divya
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, November 24, 2010 2:19 AM
> To: [email protected]
> Subject: RE: (Near) Realtime clustering
>
> I'd suggest a hybrid approach: run the batch clustering periodically over
> the entire corpus to update the cluster centers, and then use those centers
> for real-time clustering (classification) of new documents as they arrive.
> You can use the sequential execution mode of the clustering job to classify
> documents in real time. This will suffer from the fact that new news topics
> will not materialize as new clusters until the batch job runs again.
>
> -----Original Message-----
> From: Gustavo Fernandes [mailto:[email protected]]
> Sent: Tuesday, November 23, 2010 9:58 AM
> To: [email protected]
> Subject: (Near) Realtime clustering
>
> Hello, we have a mission to implement a system to cluster news articles in
> near-real-time mode. We have a large number of articles (millions), and we
> started using k-means to create clusters based on a fixed value of "k". The
> problem is that we have a constant incoming flow of news articles and we
> can't afford to rely on a batch process; we need to be able to present users
> with clustered articles as soon as they arrive in our database. So far our
> clusters are saved into a SequenceFile, as normally output by the k-means
> driver.
> What would be the recommended way of approaching this problem with Mahout?
> Is it possible to manipulate the generated clusters and incrementally add
> new articles to them, or even form new clusters, without incurring the
> penalty of recalculating for every vector again? Is starting with k-means
> the right way? What would be the right combination of algorithms to provide
> incremental and fast clustering calculation?
>
> TIA,
> Gustavo
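The classification step described in the thread above (what clusterData()/clusterDataSeq does when run with -xm sequential) boils down to assigning each new document vector to its nearest existing centroid. Here is a minimal in-memory sketch of just that step; the centroid-loading code is omitted and the CosineDistanceMeasure choice is only an example, since the cluster Writables and the exact clusterData() signature vary between Mahout releases:

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class IncrementalClassifier {

  private final List<Vector> centers;   // centroids read from the latest clusters-n
  private final DistanceMeasure measure = new CosineDistanceMeasure();

  public IncrementalClassifier(List<Vector> centers) {
    this.centers = centers;
  }

  // Returns the index of the closest centroid for one newly arrived document vector.
  public int classify(Vector doc) {
    int best = -1;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = measure.distance(centers.get(i), doc);  // same cardinality required
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }
}

New documents must of course be vectorized with the same dictionary (and the same fixed cardinality) as the vectors the centroids were built from; otherwise you are back to the CardinalityException discussed at the top of this thread.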
