On Thu, Jan 15, 2015 at 1:47 PM, <marko.di...@nissatech.com> wrote: > So, to summarize, my idea of K-medoids with DTW as a distance measure > (implemented as a two phase mapreduce) doesn't sound like an unrealistic > idea? >
Sounds like a fine idea. > I'm mostly afraid not to get in the situation that the algorithm needs a > couple of days for generating clusters, because it is more computationally > expensive than K-means. Is there a way to predict the size of data that > could be processed in a reasonable amount of time (let's say a day utmost)? > I understand that again depends on the resources, but I'm hopping for a > rough approximation, if possible... > I would guess that this can be made to work up to a few million elements. In particular, if you can figure out how to do fast search in order to get the most similar items quickly. You can then filter and re-order using DTW. For doing the fast search, I would suggest motif extraction. See the SAX work [1] or the clustering stuff I did in the Anomaly Detection booklet [2] Using ideas related to DP-means algorithm or some of the scalable k-means stuff. See Wong, Schindler, Myerson's work. [3] and [4] > > Thank you again, I'm actually exited to start with the implementation, if > it doesn't sound that crazy. > Sounds very plausible. It will be work to get it really right, but could be really good. [1] http://www.cs.ucr.edu/~eamonn/SAX.htm [2] http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html [3] http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf [4] http://dl.acm.org/citation.cfm?id=2133039