Thank you for your answer, Ted.
What about some kind of bisecting k-means? I'm trying to cluster time
series of different lengths, and I came up with the idea of using DTW
as a similarity measure, which seems adequate. The problem is that I
cannot use it with k-means, since it's hard to define centroids for
time series that can differ in length and phase. So I was thinking
about hierarchical clustering, since it seems appropriate to combine
with DTW, but it is not scalable, as you said. My next thought is to
try bisecting k-means, which seems scalable since it is based on
repeated k-means steps. My idea is as follows, step by step:
- Take two signals as initial centroids (maybe the two signals with the
smallest similarity, calculated using DTW)
- Assign every signal to the nearer of the two initial centroids
- Repeat the procedure on the biggest cluster
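
The steps above could be sketched roughly as follows in plain Python.
This is only an illustration under some assumptions, not anything that
exists in Mahout: the names dtw_distance, bisect and
bisecting_clustering are made up for this example, signals are assumed
to be plain lists of numbers, and the point-wise cost is assumed to be
the absolute difference. One deliberate change from the first step:
finding the exact two least similar signals needs DTW between every
pair in the cluster, so the sketch swaps in a cheaper heuristic (take
any signal, find the one farthest from it, then the one farthest from
that).

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed point-wise cost
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def bisect(indices, signals, dist):
    """Split one cluster (a list of signal indices) around two seed signals."""
    # Heuristic seeds (an assumption, not the exact least-similar pair):
    # the signal farthest from an arbitrary one, then the one farthest from it.
    first = max(indices, key=lambda i: dist(signals[indices[0]], signals[i]))
    second = max(indices, key=lambda i: dist(signals[first], signals[i]))
    left, right = [], []
    for i in indices:
        to_first = dist(signals[i], signals[first])
        to_second = dist(signals[i], signals[second])
        (left if to_first <= to_second else right).append(i)
    return left, right


def bisecting_clustering(signals, k, dist=dtw_distance):
    """Repeatedly split the biggest cluster until there are k clusters."""
    clusters = [list(range(len(signals)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        left, right = bisect(biggest, signals, dist)
        if not left or not right:
            # The biggest cluster cannot be split any further; stop early.
            clusters.append(biggest)
            break
        clusters.extend([left, right])
    return clusters
```

With these seeds, each split costs a number of DTW evaluations that is
linear in the size of the cluster being split (two passes to pick the
seeds, then two distances per signal for the assignment), although each
DTW evaluation is itself quadratic in the signal lengths.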
In this way I could use DTW as the distance measure, which could be
useful since my data may be shifted or skewed, and I would avoid
calculating centroids. At the end I could take, from each cluster, the
signal that is most similar to the others in that cluster (a kind of
centroid/medoid).
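
Picking that representative signal could look something like the
sketch below. The function name medoid and its parameters are
hypothetical; dist stands for whatever distance is in use (DTW in this
setting), and "most similar to the others" is assumed to mean the
smallest summed distance to all other members of the cluster.

```python
def medoid(items, dist):
    """Return the item whose summed distance to all the others is smallest."""
    best_item, best_total = None, float("inf")
    for candidate in items:
        # dist(candidate, candidate) is assumed to be 0, so including the
        # candidate itself in the sum does not change the ranking.
        total = sum(dist(candidate, other) for other in items)
        if total < best_total:
            best_item, best_total = candidate, total
    return best_item
```

Note that this evaluates dist for every pair inside the cluster, so it
is quadratic in the cluster size and is probably only reasonable once
the clusters are small.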
What do you think about this approach and its scalability?
I would highly appreciate your answer, thanks.
On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <marko.di...@nissatech.com>
wrote:
1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
could be used as a distance measure for clustering?
No.
2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing
that it could not be implemented efficiently on Hadoop, but I wanted to
check if something like that is possible.
Scalability as you suspected.
3) Same question, just considering Agglomerative Hierarchical clustering.
Again. Agglomerative algorithms tend to be n^2 which contradicts scaling.