Thanks for sharing the article. The article focuses mainly on distance computation between sequences, which will help us in creating the self-similarity matrix. And then you can probably apply any standard self-similarity based clustering techniques ( spectral clustering or k-means etc. ).
Approach sounds okay, except that k-means requires the nXn matrix to be computed which itself could be pretty huge. But as long as you can distribute ( which you apparantly can ) over mapreduce/mahout it should be fine. -Prasen On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <[email protected]> wrote: > Prasen, > I've been reviewing techniques and literature on data mining in time > series and I found another paper that you might be interested in from > the time series "search" domain that deals with similarity of time > series data: > > http://delab.csd.auth.gr/papers/PCI99am.pdf > > Sequences are transformed into a feature vector and Euclidean distances > between the feature vectors are then calculated. I'm still getting this > concept (plus other variations) and mahout in general "mapped out". I > read some on suffix trees and they look very similar to k-grams and > permutation indexes in Information Retrieval material. I'm still > digesting this time series problem (and its several sub problems) but I > thought I'd throw that paper out there and see what you thought. > > Josh Patterson > TVA > > -----Original Message----- > From: [email protected] [mailto:[email protected]] On Behalf Of > prasenjit mukherjee > Sent: Saturday, November 21, 2009 12:21 AM > To: [email protected] > Subject: Re: mahout examples > > Hi Josh, > > I too am working on clustering time-series-data, and basically trying > to come up with a sequence clustering model. Would like to know how > you intend to use K-means to achieve that. Are you treating each > sequence as a point ? Then, what would be your vector representation > of a sequence and also more importantly which metric ( distance > computation logic ) will you be using ? > > BTW, I am thinking along the lines of STC ( suffix-tree based clustering > ). > > -Prasen > > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> > wrote: >> I think in terms of clustering time series data, the first step looks > to >> be vectorizing the input cases with possibly the DenseVector class and >> feeding that to a basic KMeans implementation like KMeansDriver.java. >> Once we can get the basic kmeans rolling with some known dataset we'll >> be able to iterate on that and move towards using more complex >> techniques and other grid timeseries data. Any suggestions or > discussion >> is greatly appreciated, >> >> Josh Patterson >> TVA >
