On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic <[email protected]> wrote:
> Thank you for your answer. Maybe I made a wrong picture about my data
> when giving a sinusoid as an example; my time series are not periodic.

I merely continued with that example because it illustrates how an
alternative metric makes all sinusoids look very similar.

> Let's say that I have a signal that represents the value of power when
> some device is turned on. That power signal depends on the time when a
> person turns on that device. The device can be turned on after different
> intervals and can stay turned on for different durations. But all of
> that represents similar behaviour that can be grouped together. I'm
> simplifying things a bit, but that's the concept.

Sure. But mean and medoid are still something you can define here.

> Either way, I made a simple test in RapidMiner and used K-medoids with
> DTW as a similarity measure on a small subset and it gave very
> satisfying results (please don't laugh at my approach, I'm a beginner :) ).

Not laughing. Sounds like a fine result.

> Now, in one of the previous mails you answered that K-medoids isn't
> scalable, so it isn't implemented. Does scalable mean not possible, or
> not such a good option? Why isn't it scalable? Does it take too much
> time to execute?

Scalable means that as the data gets bigger, the amount of resources
required grows at most in linear proportion.

> I've got an idea to implement it as a two-phase MapReduce, like this

Yeah... this is pretty standard, BUT it takes a number of iterations that
is unknown and which depends on the size of your data. Each pass costs
O(n), so many passes leave you super-linear in total cost.

For k-means, you can instead create a sketch in one pass over the data
and then munch on the sketch to get a clustering. This gives an
O(n + k log n + small goo) answer.
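For readers following along: since DTW is the similarity measure under discussion, here is a minimal dynamic-programming sketch of it. This is illustrative only (the `dtw` name is mine, not RapidMiner's or Mahout's), and real implementations usually add a windowing constraint to keep the quadratic cost in check.

```python
def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences
    (plain O(len(a) * len(b)) dynamic program, no window constraint)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

Two signals with the same shape but stretched or shifted in time score much closer under DTW than under a plain pointwise Euclidean metric, which is why it suits the turn-on/turn-off behaviour described above.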

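The one-pass sketch idea can be illustrated with a toy in 1-D: stream the points once, keeping a bounded set of weighted centroids, then cluster the small sketch afterwards with any method you like (even an expensive one, since the sketch is tiny compared to n). This is only the shape of the idea, assuming a simple threshold-merge rule; Mahout's actual streaming k-means is considerably more careful.

```python
def sketch(points, budget, init_thresh=1.0):
    """One-pass weighted-centroid sketch of a 1-D stream.

    Toy version of the streaming-sketch idea: merge each point into the
    nearest centroid if it is within `thresh`, otherwise start a new
    centroid; when the sketch exceeds `budget`, coarsen by doubling the
    threshold and re-merging the centroids themselves.
    """
    thresh = init_thresh
    cents = []  # list of [center, weight]
    for p in points:
        if cents:
            c = min(cents, key=lambda c: abs(c[0] - p))
            if abs(c[0] - p) < thresh:
                # fold the point into the nearest centroid (weighted mean)
                c[0] = (c[0] * c[1] + p) / (c[1] + 1)
                c[1] += 1
                continue
        cents.append([p, 1])
        if len(cents) > budget:
            # over budget: raise the merge threshold and re-sketch
            thresh *= 2
            old, cents = cents, []
            for center, w in old:
                for c in cents:
                    if abs(c[0] - center) < thresh:
                        c[0] = (c[0] * c[1] + center * w) / (c[1] + w)
                        c[1] += w
                        break
                else:
                    cents.append([center, w])
    return cents
```

The stream is read exactly once (the O(n) term); the final clustering then runs on the sketch rather than on the data, which is where the remaining "k log n + small goo" lives.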