I believe Ted is correct: the distance function is the key with k-means. Another suggestion I've heard for comparing similarity between time series data is comparing "energy", or something along the lines of CUSUM:
http://www.variation.com/cpa/help/hs108.htm
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc323.htm
http://www.public.iastate.edu/~vardeman/asqcourse/cusumcharts.pdf
http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/cumulative_sum_charts_for_small_shifts.pdf

Waveforms could be broken down into sequences (or even n-grams) of units of energy change, which might simplify the comparison calculation somewhat. I've used a similar technique (task-response thresholds) when working with ant colony mechanics:

http://jpatterson.floe.tv/index.php/2008/09/25/discovery-and-social-insects/
http://jpatterson.floe.tv/index.php/2009/07/19/tinyos-and-tinytermite/

in that enough change over a time period triggers a task-change / response function. Here I think it's simply "more energy change than the previous N samples triggers an output", which ends up being a feature vector. I believe one could see an "energy pattern" derived from a waveform, and that could possibly indicate an aspect of similarity. I think this would alleviate the vertical shift problem; you could then focus on the horizontal shift issue, which has been dealt with in IR by using various forms of n-gram techniques, intersection of forward and reverse b-trees for wildcard sequences / comparisons, etc. Of course I'm guessing there, but this problem will involve a lot of trial and error in terms of basic building blocks for time series comparison --- and it's a starting point.

Josh Patterson
TVA

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Monday, November 30, 2009 2:54 AM
To: [email protected]
Subject: Re: mahout examples

N-squared isn't all that scalable, but k-means doesn't require all n^2 distances to be computed. It just requires that distances to the centroids be computed. That is generally vastly less effort, and since the number of clusters is typically bounded by a constant, it makes k-means into a nearly linear algorithm.
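Ted's scaling point can be illustrated with a minimal sketch (plain Python/NumPy, not Mahout code, and the function below is my own illustration rather than anything from the thread): each iteration computes n·k point-to-centroid distances, not n² pairwise distances.

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    """Naive k-means sketch. Cost per iteration is O(n * k), not O(n^2),
    because each point is only compared against the k centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as a random sample of the input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distances from every point to every centroid: an (n, k) matrix.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids
```

With k fixed, the work grows linearly in n, which is why distributing the assignment step over MapReduce is straightforward.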
On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <[email protected]> wrote:

> Thanks for sharing the article. The article focuses mainly on distance
> computation between sequences, which will help us in creating the
> self-similarity matrix. And then you can probably apply any standard
> self-similarity based clustering technique (spectral clustering or
> k-means, etc.).
>
> Approach sounds okay, except that k-means requires the n×n matrix to
> be computed, which itself could be pretty huge. But as long as you can
> distribute it (which you apparently can) over MapReduce/Mahout it
> should be fine.
>
> -Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <[email protected]> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series, and I found another paper that you might be interested in
> > from the time series "search" domain that deals with similarity of
> > time series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into a feature vector, and Euclidean
> > distances between the feature vectors are then calculated. I'm still
> > getting this concept (plus other variations) and Mahout in general
> > "mapped out". I read some on suffix trees, and they look very similar
> > to k-grams and permutation indexes in Information Retrieval material.
> > I'm still digesting this time series problem (and its several
> > sub-problems), but I thought I'd throw that paper out there and see
> > what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: [email protected]
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on clustering time-series data, and am basically
> > trying to come up with a sequence clustering model. I would like to
> > know how you intend to use k-means to achieve that.
> > Are you treating each sequence as a point? Then, what would be your
> > vector representation of a sequence and, more importantly, which
> > metric (distance computation logic) will you be using?
> >
> > BTW, I am thinking along the lines of STC (suffix-tree-based
> > clustering).
> >
> > -Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote:
> >> I think in terms of clustering time series data, the first step
> >> looks to be vectorizing the input cases, possibly with the
> >> DenseVector class, and feeding that to a basic k-means
> >> implementation like KMeansDriver.java. Once we can get basic k-means
> >> rolling with some known dataset, we'll be able to iterate on that
> >> and move towards using more complex techniques and other grid time
> >> series data. Any suggestions or discussion are greatly appreciated,
> >>
> >> Josh Patterson
> >> TVA


--
Ted Dunning, CTO
DeepDyve
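The energy-change idea at the top of the thread ("more energy change than the previous N samples triggers an output") could be sketched as follows. This is one hedged interpretation in plain Python; the window size and the thresholding rule are assumptions of mine, not anything specified in the thread:

```python
def energy_change_features(series, n=4):
    """Binary feature vector sketch: emit 1 where the absolute
    sample-to-sample change exceeds the mean absolute change over the
    previous n steps. Working on differences rather than raw values
    makes the result invariant to a vertical (DC) shift of the series."""
    deltas = [abs(b - a) for a, b in zip(series, series[1:])]
    features = []
    for i, d in enumerate(deltas):
        window = deltas[max(0, i - n):i]        # previous n changes
        baseline = sum(window) / len(window) if window else 0.0
        features.append(1 if d > baseline else 0)
    return features
```

Two waveforms that differ only by a constant offset produce identical feature vectors here, which is the "vertical shift" property Josh describes; horizontal shift would still have to be handled separately (e.g. with n-gram-style comparisons over the feature sequence).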
