Ted,

Thank you for your answer. Maybe I gave a wrong impression of my data by using a sinusoid as an example; my time series are not periodic. Let's say that I have a signal that represents the power drawn while some device is turned on. That power signal depends on when the person turns the device on. The device can be turned on after different intervals and can stay on for different durations, but all of these recordings represent similar behaviour that can be grouped together. I'm simplifying things a bit, but that's the concept.

Either way, I made a simple test in RapidMiner and used K-medoids with DTW as the similarity measure on a small subset, and it gave very satisfying results (please don't laugh at my approach, I'm a beginner :) ).

Now, in one of your previous mails you answered that K-medoids isn't scalable, so it isn't implemented. Does "not scalable" mean not possible, or just not a good option? Why isn't it scalable? Does it take too much time to execute?

I've got an idea to implement it as a two-phase MapReduce, like this (a rough single-machine sketch in Java follows below):

Pick initial medoids
1st phase
  map:    find the distance from each signal to its nearest medoid using DTW
          and emit the pair (clusterId, signalId)
  reduce: for each cluster, output (clusterId, listOfAllSignalsInThatCluster)
2nd phase
  map:    calculate the distance from each signal to every other signal in its
          cluster, as if it were the medoid, and emit
          (clusterIdAndSignalId, costIfThisSignalWereMedoid)
  reduce: find the signal in each cluster that gives the smallest total cost
          and make it the new medoid
Repeat until convergence or a fixed number of iterations.
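
Here is how I picture it in plain Java, as a single-machine stand-in for the two MapReduce phases. The class and method names are mine, and dtw() is just the textbook O(L^2) dynamic program with absolute-difference cost, so please treat this as a sketch rather than a real implementation:

    import java.util.Arrays;

    public class KMedoidsDtwSketch {

        // Textbook DTW with absolute-difference local cost; O(|a| * |b|).
        static double dtw(double[] a, double[] b) {
            double[][] d = new double[a.length + 1][b.length + 1];
            for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
            d[0][0] = 0;
            for (int i = 1; i <= a.length; i++) {
                for (int j = 1; j <= b.length; j++) {
                    d[i][j] = Math.abs(a[i - 1] - b[j - 1])
                            + Math.min(d[i - 1][j - 1],
                                       Math.min(d[i - 1][j], d[i][j - 1]));
                }
            }
            return d[a.length][b.length];
        }

        // Phase 1 ("map"): assign each signal to its nearest medoid.
        static int[] assign(double[][] signals, int[] medoids) {
            int[] clusterOf = new int[signals.length];
            for (int i = 0; i < signals.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < medoids.length; c++) {
                    double d = dtw(signals[i], signals[medoids[c]]);
                    if (d < best) { best = d; clusterOf[i] = c; }
                }
            }
            return clusterOf;
        }

        // Phase 2 ("map" + "reduce"): in each cluster, try every member as
        // the medoid and keep the one with the smallest total distance to
        // the rest. This is the O(n^2)-per-cluster step that hurts scaling.
        static int[] updateMedoids(double[][] signals, int[] clusterOf, int k) {
            int[] medoids = new int[k];
            double[] bestCost = new double[k];
            Arrays.fill(bestCost, Double.POSITIVE_INFINITY);
            for (int cand = 0; cand < signals.length; cand++) {
                int c = clusterOf[cand];
                double cost = 0;
                for (int j = 0; j < signals.length; j++) {
                    if (clusterOf[j] == c) cost += dtw(signals[cand], signals[j]);
                }
                if (cost < bestCost[c]) { bestCost[c] = cost; medoids[c] = cand; }
            }
            return medoids;
        }
    }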

This is the first thing that came to my mind; there's probably a better solution. If it's not scalable because of too many calculations, how much time could I expect such an algorithm to take for, say, 10,000 signals with 300 points each? How can I even estimate that?
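
Here is my own back-of-envelope attempt, assuming a naive O(L^2) DTW, guessing k = 10 clusters, and assuming roughly 10^9 simple operations per second per core (all of these numbers are assumptions on my part):

    public class CostEstimate {
        public static void main(String[] args) {
            long n = 10_000;         // signals
            long len = 300;          // points per signal
            long k = 10;             // clusters (my guess)
            long dtwOps = len * len; // ~9e4 cell updates per DTW distance

            // Phase 1: n * k DTW computations per iteration.
            long assignOps = n * k * dtwOps;        // ~9e9 ops
            // Phase 2: ~(n/k)^2 pairs per cluster over k clusters = n^2 / k.
            long updateOps = (n * n / k) * dtwOps;  // ~9e11 ops

            System.out.printf("assign ~ %.0e ops, update ~ %.0e ops per iteration%n",
                    (double) assignOps, (double) updateOps);
            // At ~1e9 ops/sec/core, phase 2 alone is ~900 s per core per
            // iteration, and it grows with n^2, which is presumably why
            // it is called unscalable.
        }
    }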

Thanks for your effort, if you have time to answer.

Regards,
Marko

On Thu 15 Jan 2015 05:25:55 AM CET, Anand Avati wrote:
Perhaps you could think of the centroid as one of the signals itself: the one
whose average distance to the rest of the signals in the cluster is lowest.
This way, instead of finding that mythical "mean" of DTWs, you hop from one
signal to another over iterations as you refine memberships. However, the
"mean" calculation becomes O(N^2) (within a cluster).

Be warned I have never tried this method.

Thanks

On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic <[email protected]>
wrote:

Ted,

It's because I'm trying to cluster time series that may differ in length;
some parts may be shifted, and some parts may last longer in one signal than
in another (somehow skewed), but they represent more or less the same thing.
DTW seems good because it's designed for exactly such cases (it's used a lot
in speech recognition, where problems like this occur). The problem is that
K-means involves calculating centroids (and the thing is, I need some kind of
centroid that will represent each cluster in the end). Maybe I'm wrong,
please correct me, I'm a newcomer to this kind of thing, but there is a
problem finding centroids for such signals. Distance measures used when
recalculating centroids (such as Euclidean) compute pairwise distances
between the components of two signals (a signal and a centroid). How can I
use such a distance measure for signals of different lengths? Also, signals
that represent the same thing can be shifted and skewed, so their mean could
be totally wrong. For example, the mean of two sinusoids where one is shifted
by Pi is 0, and that's definitely not a good centroid in my case.

So, I'm trying to find a scalable solution to my problem. I tried to fit it
into what's already implemented in Mahout (for clustering), but how to do
that is not so obvious to me.

I'm open to suggestions, I'm still new to all of this.

Thanks,
Marko

On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
Why is it you can't compute a mean?



On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <[email protected]>
wrote:

Thank you for your answer Ted.

What about some kind of bisecting K-means? I'm trying to cluster time series
of different lengths, and I came up with the idea of using DTW as a
similarity measure, which seems adequate. The thing is, I cannot use it with
K-means, since it's hard to define centroids for time series that can have
different lengths/phases. So I was thinking about hierarchical clustering,
since it seems appropriate to combine with DTW, but it is not scalable, as
you said. So my next thought is to try bisecting K-means, which seems
scalable, since it is based on repeated K-means steps. My idea, step by step,
is:

- Take two signals as initial centroids (maybe the two signals with the
smallest similarity, calculated using DTW)
- Assign all signals to the two initial centroids
- Repeat the procedure on the biggest cluster

In this way I could use DTW as the distance measure, which could be useful
since my data may be shifted and skewed, and I would avoid calculating
centroids. At the end I could take, from each cluster, the signal that is
most similar to the others in that cluster (some kind of centroid/medoid).
A rough sketch of the split step follows below.
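
To make the idea concrete, here is a rough, untested sketch of the split step. The distance function stands in for DTW, and all names are mine:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.BiFunction;

    public class BisectingSketch {
        // Repeatedly split the biggest cluster around its two least-similar
        // members until k clusters exist. Seeding is O(n^2) distance calls.
        static List<List<double[]>> bisect(List<double[]> signals, int k,
                BiFunction<double[], double[], Double> dist) {
            List<List<double[]>> clusters = new ArrayList<>();
            clusters.add(new ArrayList<>(signals));
            while (clusters.size() < k) {
                List<double[]> biggest = clusters.stream()
                        .max(Comparator.comparingInt(List::size)).get();
                if (biggest.size() < 2) break;  // nothing left to split
                clusters.remove(biggest);

                // Seed with the farthest pair (smallest similarity).
                double[] a = null, b = null;
                double far = -1;
                for (int i = 0; i < biggest.size(); i++) {
                    for (int j = i + 1; j < biggest.size(); j++) {
                        double d = dist.apply(biggest.get(i), biggest.get(j));
                        if (d > far) {
                            far = d; a = biggest.get(i); b = biggest.get(j);
                        }
                    }
                }

                // Assign every member to the nearer of the two seeds.
                List<double[]> left = new ArrayList<>();
                List<double[]> right = new ArrayList<>();
                for (double[] s : biggest) {
                    (dist.apply(s, a) <= dist.apply(s, b) ? left : right).add(s);
                }
                clusters.add(left);
                clusters.add(right);
            }
            return clusters;
        }
    }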

What do you think about this approach and about the scalability?

I would highly appreciate your answer, thanks.

On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:

On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <[email protected]> wrote:

1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
could be used as a distance measure for clustering?


No.



2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing
that it could not be implemented efficiently on Hadoop, but I wanted to
check if something like that is possible.


Scalability, as you suspected.



3) Same question, just considering agglomerative hierarchical clustering.


Again. Agglomerative algorithms tend to be O(n^2), which contradicts scaling.




