On Thu, Jan 15, 2015 at 1:47 PM, <marko.di...@nissatech.com> wrote:

> So, to summarize, my idea of K-medoids with DTW as a distance measure
> (implemented as a two phase mapreduce) doesn't sound like an unrealistic
> idea?
>

Sounds like a fine idea.


> I'm mostly afraid not to get in the situation that the algorithm needs a
> couple of days for generating clusters, because it is more computationally
> expensive than K-means. Is there a way to predict the size of data that
> could be processed in a reasonable amount of time (let's say a day utmost)?
> I understand that again depends on the resources, but I'm hopping for a
> rough approximation, if possible...
>

I would guess that this can be made to work up to a few million elements.
In particular, if you can figure out how to do fast search in order to get
the most similar items quickly.  You can then filter and re-order using
DTW.  For doing the fast search, I would suggest motif extraction.  See the
SAX work [1] or the clustering stuff I did in the Anomaly Detection booklet
[2]

Using ideas related to DP-means algorithm or some of the scalable k-means
stuff.  See Wong, Schindler, Myerson's work. [3] and [4]


>
> Thank you again, I'm actually exited to start with the implementation, if
> it doesn't sound that crazy.
>

Sounds very plausible.  It will be work to get it really right, but could
be really good.

[1] http://www.cs.ucr.edu/~eamonn/SAX.htm
[2] http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html
[3]
http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
[4] http://dl.acm.org/citation.cfm?id=2133039

Reply via email to