Prasen, Well, I'm not entirely sure how I'm going to do it right now, its going to come down to trial and error with multiple approaches. There are many obstacles to overcome, including the ones you are speaking of like:
- timeseries vector representation - timeseries shifts, scaling - for clustering, need a heuristic for distance metric I have not heard of suffix-tree based clustering, but it definitely sounds interesting and I'll check that one out. I suggested kmeans initially as its a very basic and well known clustering technique. I need to review my older notes for approaches to solve some of these projects and review more papers dealing with timeseries data in general. My initial thought is to decompose a block of timeseries data into a set of time ordered "features" in order to reduce the effect of scaling issues and alignment effects. That would make it easier to calculate to a delta between vectors, but also might create unintended approximation errors. As I get more Mahout code running, I'll probably post relevant research papers dealing with time series data that I'm looking at. That way anyone can chip in their opinion of the approach and we can build more example code for dealing with timeseries in Mahout. I'll take a look at the suffix-tree technique, and I look forward to any other suggestions you might have in terms of an approach. Josh Patterson TVA -----Original Message----- From: [email protected] on behalf of prasenjit mukherjee Sent: Sat 11/21/2009 12:20 AM To: [email protected] Subject: Re: mahout examples Hi Josh, I too am working on clustering time-series-data, and basically trying to come up with a sequence clustering model. Would like to know how you intend to use K-means to achieve that. Are you treating each sequence as a point ? Then, what would be your vector representation of a sequence and also more importantly which metric ( distance computation logic ) will you be using ? BTW, I am thinking along the lines of STC ( suffix-tree based clustering ). -Prasen On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote: > I think in terms of clustering time series data, the first step looks to > be vectorizing the input cases with possibly the DenseVector class and > feeding that to a basic KMeans implementation like KMeansDriver.java. > Once we can get the basic kmeans rolling with some known dataset we'll > be able to iterate on that and move towards using more complex > techniques and other grid timeseries data. Any suggestions or discussion > is greatly appreciated, > > Josh Patterson > TVA
