Prasen,
I've been reviewing techniques and literature on data mining in time
series and I found another paper that you might be interested in from
the time series "search" domain that deals with similarity of time
series data:

http://delab.csd.auth.gr/papers/PCI99am.pdf

Each sequence is transformed into a feature vector, and Euclidean
distances between the feature vectors are then calculated. I'm still
getting this concept (plus other variations) and Mahout in general
"mapped out". I read a bit about suffix trees, and they look very
similar to the k-grams and permuterm indexes covered in Information
Retrieval material. I'm still digesting this time series problem (and
its several subproblems), but I thought I'd throw that paper out there
and see what you thought.
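
To make the "feature vector plus Euclidean distance" idea concrete, here
is a rough sketch of how I currently picture it. The choice of features
(magnitudes of the first k DFT coefficients) and the function names
feature_vector and euclidean are my own assumptions for illustration, not
necessarily what the paper uses, and none of this is Mahout code:

```python
import math


def feature_vector(series, k=3):
    """Hypothetical feature extractor (an assumption, not the paper's method):
    magnitudes of the first k DFT coefficients, scaled by the series length."""
    n = len(series)
    features = []
    for f in range(k):
        # Real and imaginary parts of DFT coefficient f.
        re = sum(x * math.cos(2 * math.pi * f * t / n) for t, x in enumerate(series))
        im = -sum(x * math.sin(2 * math.pi * f * t / n) for t, x in enumerate(series))
        features.append(math.hypot(re, im) / n)
    return features


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    s1 = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    s2 = [0.0, 2.0, 0.0, -2.0, 0.0, 2.0, 0.0, -2.0]
    print(euclidean(feature_vector(s1), feature_vector(s2)))
```

Each sequence becomes a short fixed-length point, so sequences can be
compared (or clustered) without aligning them sample by sample. Note
that with magnitudes only, a circularly shifted copy of a sequence maps
to the same point, which may or may not be what we want.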

Josh Patterson
TVA

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
prasenjit mukherjee
Sent: Saturday, November 21, 2009 12:21 AM
To: [email protected]
Subject: Re: mahout examples

Hi Josh,

I too am working on clustering time-series data, and am basically trying
to come up with a sequence clustering model. I would like to know how
you intend to use K-means to achieve that. Are you treating each
sequence as a point? If so, what would your vector representation of a
sequence be and, more importantly, which metric (distance computation
logic) will you be using?

BTW, I am thinking along the lines of STC (suffix-tree-based
clustering).

-Prasen

On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]>
wrote:
> I think in terms of clustering time series data, the first step looks
> to be vectorizing the input cases with possibly the DenseVector class
> and feeding that to a basic KMeans implementation like
> KMeansDriver.java. Once we can get the basic kmeans rolling with some
> known dataset we'll be able to iterate on that and move towards using
> more complex techniques and other grid timeseries data. Any
> suggestions or discussion is greatly appreciated,
>
> Josh Patterson
> TVA
