Ted,
It's because I'm trying to cluster time series that may differ in
length; some parts may be shifted, and some parts may last longer in
one signal than in the other (somewhat skewed), but they represent
more or less the same thing. DTW seems good because it's designed for
exactly such cases (it's used a lot in speech recognition, where
problems like this occur).
The problem is that K-means involves calculating centroids (and the
thing is, I need some kind of centroid that will represent each
cluster in the end). Maybe I'm wrong, please correct me, I'm new to
this kind of thing, but there is a problem finding centroids for such
signals. Recomputing a centroid relies on a distance measure such as
Euclidean, which calculates the pairwise distance between the
components of two signals (a signal and a centroid). How can I use
such a distance measure for signals of different lengths? Also,
signals that represent the same thing can be shifted and skewed, so
their mean could be totally wrong. For example, the mean of two
sinusoids, one of them shifted by pi, is 0, and that's definitely not
a good centroid in my case.
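To make the distance side of this concrete, here is a minimal,
self-contained DTW sketch in Java (not Mahout code; the class name
Dtw and everything in it are just placeholders I made up). It only
illustrates that a DTW cost matrix lets you compare two series of
different lengths:

import java.util.Arrays;

public class Dtw {

    /** DTW distance between two series that may differ in length. */
    public static double distance(double[] a, double[] b) {
        int n = a.length;
        int m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of the three admissible warping steps.
                cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
            }
        }
        return cost[n][m];
    }

    public static void main(String[] args) {
        double[] s1 = {0, 1, 2, 3, 2, 1, 0};
        // Same shape as s1, but shifted and stretched in time.
        double[] s2 = {0, 0, 1, 2, 3, 3, 2, 1, 0};
        System.out.println(distance(s1, s2)); // prints 0.0 despite the length mismatch
    }
}

A plain Euclidean (pointwise) distance isn't even defined for s1 and
s2 above, which is exactly my problem with the centroid step.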
So I'm trying to find a scalable solution to my problem. I tried to
fit it into what's already implemented in Mahout for clustering, but
it's not obvious to me how.
I'm open to suggestions; I'm still new to all of this.
Thanks,
Marko
On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
Why is it you can't compute a mean?
On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <[email protected]>
wrote:
Thank you for your answer, Ted.
What about some kind of bisecting K-means? I'm trying to cluster time
series of different lengths and I came up with the idea of using DTW as a
similarity measure, which seems adequate, but the thing is, I cannot
use it with K-means, since it's hard to define centroids for time
series that can differ in length/phase. So I was thinking about
hierarchical clustering, since it seems appropriate to combine with DTW,
but it is not scalable, as you said. So my next thought is to try
bisecting K-means, which seems scalable since it is based on repeated
K-means steps. My idea, step by step (roughly sketched in code below), is:
- Take two signals as initial centroids (maybe the two signals that are
least similar to each other, calculated using DTW)
- Assign all signals to the two initial centroids
- Repeat the procedure on the biggest cluster
In this way I could use DTW as the distance measure, which could be
useful since my data may be shifted or skewed, and I could avoid
calculating centroids. At the end I could take from each cluster the
signal that is most similar to the others in that cluster (a kind of
centroid/medoid).
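To make those steps concrete, here is a rough, purely sequential sketch
in Java of a single bisecting step using DTW and a medoid instead of a
mean. It is only my own illustration of the idea, not anything that
exists in Mahout, and it assumes some Dtw.distance(double[], double[])
helper like the standard DTW sketch above:

import java.util.ArrayList;
import java.util.List;

public class BisectingStep {

    /** Splits the signals around the two mutually most dissimilar seeds. */
    public static List<List<double[]>> bisect(List<double[]> signals) {
        if (signals.size() < 2) {
            throw new IllegalArgumentException("need at least two signals");
        }
        // Step 1: pick the two signals with the largest DTW distance
        // as the initial "centroids".
        int seedA = 0, seedB = 1;
        double worst = -1.0;
        for (int i = 0; i < signals.size(); i++) {
            for (int j = i + 1; j < signals.size(); j++) {
                double d = Dtw.distance(signals.get(i), signals.get(j));
                if (d > worst) {
                    worst = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        // Step 2: assign every signal to the nearer seed under DTW.
        List<double[]> clusterA = new ArrayList<>();
        List<double[]> clusterB = new ArrayList<>();
        for (double[] s : signals) {
            double dA = Dtw.distance(s, signals.get(seedA));
            double dB = Dtw.distance(s, signals.get(seedB));
            (dA <= dB ? clusterA : clusterB).add(s);
        }
        // Step 3 (repeating on the biggest cluster) would call bisect()
        // again on whichever of the two lists is larger.
        List<List<double[]>> result = new ArrayList<>();
        result.add(clusterA);
        result.add(clusterB);
        return result;
    }

    /** Medoid: the signal with the smallest total DTW distance to its cluster. */
    public static double[] medoid(List<double[]> cluster) {
        double[] best = cluster.get(0);
        double bestSum = Double.POSITIVE_INFINITY;
        for (double[] candidate : cluster) {
            double sum = 0.0;
            for (double[] other : cluster) {
                sum += Dtw.distance(candidate, other);
            }
            if (sum < bestSum) {
                bestSum = sum;
                best = candidate;
            }
        }
        return best;
    }
}

The seed selection and the medoid step are both quadratic in the
cluster size, which is part of what I'm worried about for scalability.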
What do you think about this approach and about its scalability?
I would highly appreciate your answer, thanks.
On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <[email protected]>
wrote:
1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
that could be used as a distance measure for clustering?
No.
2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing
that it could not be implemented efficiently on Hadoop, but I wanted to
check if something like that is possible.
Scalability, as you suspected.
3) Same question, just considering Agglomerative Hierarchical clustering.
Again, scalability. Agglomerative algorithms tend to be O(n^2), which contradicts scaling.