RE: mahout examples

Patterson, Josh Sat, 21 Nov 2009 18:25:39 -0800

Prasen,
Well, I'm not entirely sure how I'm going to do it right now; it's going to come 
down to trial and error with multiple approaches. There are many obstacles to 
overcome, including the ones you mention:

- timeseries vector representation
- timeseries shifts, scaling
- for clustering, need a heuristic for the distance metric
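
As a rough illustration of the scaling obstacle in the list above, here is a minimal Python sketch (not Mahout code, and not a method proposed in this thread). It assumes that "scaling" means differences in amplitude and offset, and it uses z-normalization before a Euclidean distance, which is one common heuristic. The function names are hypothetical. It does nothing about shifts along the time axis.

```python
import math

def z_normalize(series):
    """Rescale a series to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        # A constant series has no shape to compare; map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def znorm_distance(a, b):
    """Euclidean distance after z-normalizing both series (an assumed heuristic)."""
    return euclidean(z_normalize(a), z_normalize(b))

# Same shape, different amplitude and offset.
base = [1.0, 2.0, 3.0, 2.0, 1.0]
scaled_and_offset = [10.0 * x + 5.0 for x in base]

print(euclidean(base, scaled_and_offset))       # large: raw values differ a lot
print(znorm_distance(base, scaled_and_offset))  # near zero: the shapes match
```

With raw values the two series look far apart, while after normalization they are effectively identical, which is the behaviour one would want if only the shape of the signal matters.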

I have not heard of suffix-tree based clustering, but it definitely sounds 
interesting and I'll check that one out. I suggested kmeans initially as it's a 
very basic and well-known clustering technique. I need to review my older notes 
for approaches to some of these problems and review more papers dealing 
with timeseries data in general. My initial thought is to decompose a block of 
timeseries data into a set of time-ordered "features" in order to reduce the 
effects of scaling and alignment. That would make it easier to 
calculate a delta between vectors, but it might also introduce unintended 
approximation errors.
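
One possible reading of the "time ordered features" idea, sketched in Python purely for illustration: split the block into equal-width segments and keep the mean of each one. This is an assumption about what such a decomposition could look like, not the approach settled on in this thread, and the name `segment_means` is hypothetical.

```python
def segment_means(series, n_segments):
    """Reduce a series to n_segments time-ordered features (one mean per segment)."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    features = []
    for i in range(n_segments):
        # Integer boundaries keep every segment non-empty when n_segments <= n.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        chunk = series[start:end]
        features.append(sum(chunk) / len(chunk))
    return features

# Eight samples reduced to four features, order preserved.
print(segment_means([1, 1, 2, 2, 9, 9, 4, 4], 4))  # [1.0, 2.0, 9.0, 4.0]

# The approximation error: a single spike of 8 is averaged down to 4.
print(segment_means([0, 0, 0, 8], 2))  # [0.0, 4.0]
```

The second call shows the trade-off mentioned above: shorter vectors are easier to compare, but short-lived events get smoothed away.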

As I get more Mahout code running, I'll probably post relevant research papers 
dealing with time series data that I'm looking at. That way anyone can chip in 
their opinion of the approach and we can build more example code for dealing 
with timeseries in Mahout.

I'll take a look at the suffix-tree technique, and I look forward to any other 
suggestions you might have in terms of an approach.

Josh Patterson
TVA

-----Original Message-----
From: [email protected] on behalf of prasenjit mukherjee
Sent: Sat 11/21/2009 12:20 AM
To: [email protected]
Subject: Re: mahout examples

Hi Josh,

I too am working on clustering time-series data, and am basically trying
to come up with a sequence clustering model. I would like to know how
you intend to use K-means to achieve that. Are you treating each
sequence as a point? If so, what would be your vector representation
of a sequence and, more importantly, which metric (distance
computation logic) will you be using?

BTW, I am thinking along the lines of STC (suffix-tree based clustering).

-Prasen

On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote:
> I think in terms of clustering time series data, the first step looks to
> be vectorizing the input cases with possibly the DenseVector class and
> feeding that to a basic KMeans implementation like KMeansDriver.java.
> Once we can get the basic kmeans rolling with some known dataset we'll
> be able to iterate on that and move towards using more complex
> techniques and other grid timeseries data. Any suggestions or discussion
> is greatly appreciated,
>
> Josh Patterson
> TVA
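
For readers following the thread, the "vectorize, then run basic k-means" step quoted above can be sketched in plain Python. This is a generic version of the standard k-means loop with Euclidean distance. It does not use Mahout's `DenseVector` or `KMeansDriver`, and the function names and the explicit `initial_centroids` argument are assumptions made to keep the example deterministic.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, initial_centroids, max_iterations=100):
    """Basic k-means: assign each point to its nearest centroid, then recompute."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Each fixed-length time series is treated as one point (made-up data).
series = [
    [1.0, 2.0, 3.0, 4.0],  # rising
    [1.1, 2.1, 2.9, 4.2],  # rising
    [4.0, 3.0, 2.0, 1.0],  # falling
    [4.2, 2.9, 2.1, 0.9],  # falling
]
centroids, assignments = kmeans(series, [series[0], series[2]])
print(assignments)  # [0, 0, 1, 1]
```

Seeding with two of the input series is only for repeatability. Random initial centroids are more usual, and the choice of distance function remains the open question raised in this thread.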
