Dear Ted,

Thank you very much for your answer. It is very inspiring for a beginner like me to see the effort you put into answering questions like mine, which I'm sure look trivial. And the whole community involved is great.

So, to summarize, my idea of K-medoids with DTW as a distance measure (implemented as a two-phase MapReduce job) doesn't sound like an unrealistic idea?
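To make sure I understand my own proposal, here is a minimal sketch of the two building blocks in plain Python: classic DTW with no window constraint, and a brute-force medoid selection. The names and structure are purely illustrative, not taken from any library.

```python
# A rough sketch of the idea: plain O(L*L) DTW plus a brute-force medoid
# update. Names are illustrative, not from any existing library.

def dtw(a, b):
    """Classic dynamic-programming DTW distance between two series."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def medoid(series):
    """Pick the series minimizing the sum of DTW distances to all others."""
    return min(series, key=lambda s: sum(dtw(s, t) for t in series))
```

The medoid step already shows why this gets expensive: it compares every candidate against every other member of the cluster, each comparison itself quadratic in series length.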

I'm mostly afraid of ending up in a situation where the algorithm needs a couple of days to generate clusters, since it is more computationally expensive than K-means. Is there a way to predict the size of data that could be processed in a reasonable amount of time (let's say a day at most)? I understand that this again depends on the resources, but I'm hoping for a rough approximation, if possible...
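For instance, this is the kind of back-of-envelope arithmetic I have in mind. Every number below is an assumption I made up for illustration; only the shape of the estimate matters.

```python
# Rough cost model: one K-medoids assignment pass needs about n * k DTW
# evaluations, and plain DTW on two series of length L costs about L * L
# cell updates. (The medoid-update step can add more on top of this.)
# All figures below are assumptions, chosen only to illustrate the estimate.

n = 100_000        # number of time series (assumed)
k = 20             # number of clusters (assumed)
L = 500            # points per series (assumed)
iters = 10         # iterations until convergence (assumed)
ops_per_sec = 1e9  # DTW cell updates per second per core (assumed)
cores = 16         # available cores (assumed)

total_ops = n * k * L * L * iters
hours = total_ops / (ops_per_sec * cores) / 3600
print(f"{total_ops:.2e} operations, roughly {hours:.2f} hours")
```

Plugging in one's own n, k, L, and measured per-core DTW throughput should give at least an order-of-magnitude answer to the "a day at most" question.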

Thank you again, I'm actually excited to start with the implementation, if it doesn't sound that crazy.

I wish you all the best,
Marko



Quoting Ted Dunning <ted.dunn...@gmail.com>:

On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic <marko.di...@nissatech.com>
wrote:



Thank you for your answer. Maybe I made a wrong picture about my data when
giving sinusoid as an example, my time series are not periodic.


I merely continued with that example because it illustrates how an
alternative metric makes all sinusoids look very similar.


Let's say that I have a signal that represents the value of power when some
device is turned on. That power signal depends on the time when a person
turns on the device. The device can be turned on after different intervals
and can stay turned on for different durations. But all of that represents a
similar behaviour that can be grouped together. I'm simplifying things a
bit, but that's the concept.


Sure.  But mean and medoid are still something you can define here.


Either way, I made a simple test in RapidMiner and used K-medoids with DTW
as a similarity measure on a small subset, and it gave very satisfying
results (please don't laugh at my approach, I'm a beginner :) ).


Not laughing.  Sounds like a fine result.



Now, in one of the previous mails you answered that K-medoids isn't
scalable, so it isn't implemented. Does scalable mean not possible, or just not
such a good option? Why isn't it scalable? Does it take too much time to
execute?


Scalable means that as the data gets bigger, the amount of resources gets
at most bigger in linear proportion.



I've got an idea to implement it as two phase mapreduce, like this


Yeah... this is pretty standard,

BUT

That takes a number of iterations that is unknown and which depends on the
size of your data.  Each pass costs O(n).  Having many passes means that
you are super-linear in cost.

For k-means, you can create a sketch in one pass over the data and then
munch on the sketch to get a clustering.  This gives an O(n + k log n +
small goo) answer.
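A deliberately simplified illustration of such a one-pass sketch, for anyone following the thread: stream the points once, keeping a small set of weighted representatives, then cluster only the sketch. The fixed distance threshold here is a toy stand-in; a real streaming k-means sketch adapts it as the sketch grows, keeping the sketch size near k log n.

```python
# Toy one-pass "sketch" for k-means on scalar points: fold each incoming
# point into its nearest representative when close enough, otherwise open
# a new representative. The fixed threshold is a simplification of the
# adaptive policy an actual streaming k-means sketch would use.

def build_sketch(points, threshold):
    sketch = []  # list of [centroid, weight] pairs
    for p in points:
        if sketch:
            c = min(sketch, key=lambda s: abs(s[0] - p))
            if abs(c[0] - p) <= threshold:
                # fold the point into the nearest representative
                c[0] = (c[0] * c[1] + p) / (c[1] + 1)
                c[1] += 1
                continue
        sketch.append([p, 1])  # open a new representative
    return sketch
```

The point of the construction is that the expensive iterative clustering then runs on the small sketch instead of the full data, which is where the O(n + k log n + small goo) bound comes from.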

