Re: DTW distance measure and K-medoids, Hierarchical clustering
On Thu, Jan 15, 2015 at 1:47 PM, wrote:

> So, to summarize, my idea of K-medoids with DTW as a distance measure
> (implemented as a two-phase MapReduce) doesn't sound like an unrealistic
> idea?

Sounds like a fine idea.

> I'm mostly afraid of getting into the situation where the algorithm needs a
> couple of days to generate clusters, because it is more computationally
> expensive than K-means. Is there a way to predict the size of data that
> could be processed in a reasonable amount of time (let's say a day at most)?
> I understand that again depends on the resources, but I'm hoping for a
> rough approximation, if possible...

I would guess that this can be made to work up to a few million elements, in
particular if you can figure out how to do fast search to get the most similar
items quickly. You can then filter and re-order using DTW.

For doing the fast search, I would suggest motif extraction. See the SAX work
[1] or the clustering stuff I did in the Anomaly Detection booklet [2], which
uses ideas related to the DP-means algorithm and some of the scalable k-means
work; see Shindler, Wong, and Meyerson [3] and [4].

> Thank you again, I'm actually excited to start with the implementation, if
> it doesn't sound that crazy.

Sounds very plausible. It will be work to get it really right, but could be
really good.

[1] http://www.cs.ucr.edu/~eamonn/SAX.htm
[2] http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html
[3] http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
[4] http://dl.acm.org/citation.cfm?id=2133039
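[Editor's note] To make the fast-search suggestion concrete, here is a minimal
sketch of SAX-style discretization [1] in Java. It assumes z-normalized input
and a 4-symbol alphabet whose breakpoints are the standard normal quartiles;
the class and method names are illustrative, not from any library.

/**
 * Minimal sketch of SAX-style discretization for fast candidate search
 * before re-ranking with DTW. Assumes the input series is already
 * z-normalized.
 */
public class SaxSketch {

  // Breakpoints cutting the standard normal into 4 equiprobable regions.
  private static final double[] BREAKPOINTS = {-0.6745, 0.0, 0.6745};

  /** Piecewise Aggregate Approximation: reduce n points to w segment means. */
  static double[] paa(double[] series, int w) {
    double[] out = new double[w];
    for (int i = 0; i < w; i++) {
      int start = i * series.length / w;
      int end = (i + 1) * series.length / w;
      double sum = 0;
      for (int j = start; j < end; j++) {
        sum += series[j];
      }
      out[i] = sum / (end - start);
    }
    return out;
  }

  /** Map each PAA segment to a symbol; equal SAX words = candidate matches. */
  static String toSaxWord(double[] series, int w) {
    StringBuilder word = new StringBuilder();
    for (double v : paa(series, w)) {
      char c = 'a';
      for (double b : BREAKPOINTS) {
        if (v > b) {
          c++;
        }
      }
      word.append(c);
    }
    return word.toString();
  }
}

Signals that share a SAX word (or whose words differ in only a few positions)
become candidates, and only those candidates are re-ranked with the exact DTW
distance.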
Re: DTW distance measure and K-medoids, Hierarchical clustering
Dear Ted,

Thank you very much for your answer. It is very inspiring for a beginner like
me to see the effort you put into answering questions like mine; I'm sure they
look trivial. And the whole community involved is great.

So, to summarize, my idea of K-medoids with DTW as a distance measure
(implemented as a two-phase MapReduce) doesn't sound like an unrealistic idea?

I'm mostly afraid of getting into the situation where the algorithm needs a
couple of days to generate clusters, because it is more computationally
expensive than K-means. Is there a way to predict the size of data that could
be processed in a reasonable amount of time (let's say a day at most)? I
understand that again depends on the resources, but I'm hoping for a rough
approximation, if possible...

Thank you again, I'm actually excited to start with the implementation, if it
doesn't sound that crazy.

I wish you all the best,
Marko

Quoting Ted Dunning:

On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic wrote:

Thank you for your answer. Maybe I gave a wrong picture of my data when giving
a sinusoid as an example; my time series are not periodic.

I merely continued with that example because it illustrates how an alternative
metric makes all sinusoids look very similar.

Let's say that I have a signal that represents the value of power when some
device is turned on. That power signal depends on the time when a person turns
on the device. The device can be turned on after different intervals and can
stay turned on for different durations. But all of that represents similar
behaviour that can be grouped together. I'm simplifying things a bit, but
that's the concept.

Sure. But mean and medoid are still something you can define here.

Either way, I made a simple test in RapidMiner and used K-medoids with DTW as
a similarity measure on a small subset, and it gave very satisfying results
(please don't laugh at my approach, I'm a beginner :) ).

Not laughing. Sounds like a fine result.

Now, in one of the previous mails you answered that K-medoids isn't scalable,
so it isn't implemented. Does scalable mean not possible, or not such a good
option? Why isn't it scalable? Does it take too much time to execute?

Scalable means that as the data gets bigger, the amount of resources needed
grows at most in linear proportion.

I've got an idea to implement it as a two-phase MapReduce, like this

Yeah... this is pretty standard, BUT that takes a number of iterations that is
unknown and which depends on the size of your data. Each pass costs O(n).
Having many passes means that you are super-linear in cost. For k-means, you
can create a sketch in one pass over the data and then munch on the sketch to
get a clustering. This gives an O(n + k log n + small goo) answer.
Re: DTW distance measure and K-medoids, Hierarchical clustering
On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic wrote:

> Thank you for your answer. Maybe I gave a wrong picture of my data when
> giving a sinusoid as an example; my time series are not periodic.

I merely continued with that example because it illustrates how an alternative
metric makes all sinusoids look very similar.

> Let's say that I have a signal that represents the value of power when some
> device is turned on. That power signal depends on the time when a person
> turns on the device. The device can be turned on after different intervals
> and can stay turned on for different durations. But all of that represents
> similar behaviour that can be grouped together. I'm simplifying things a
> bit, but that's the concept.

Sure. But mean and medoid are still something you can define here.

> Either way, I made a simple test in RapidMiner and used K-medoids with DTW
> as a similarity measure on a small subset, and it gave very satisfying
> results (please don't laugh at my approach, I'm a beginner :) ).

Not laughing. Sounds like a fine result.

> Now, in one of the previous mails you answered that K-medoids isn't
> scalable, so it isn't implemented. Does scalable mean not possible, or not
> such a good option? Why isn't it scalable? Does it take too much time to
> execute?

Scalable means that as the data gets bigger, the amount of resources needed
grows at most in linear proportion.

> I've got an idea to implement it as a two-phase MapReduce, like this

Yeah... this is pretty standard, BUT that takes a number of iterations that is
unknown and which depends on the size of your data. Each pass costs O(n).
Having many passes means that you are super-linear in cost.

For k-means, you can create a sketch in one pass over the data and then munch
on the sketch to get a clustering. This gives an O(n + k log n + small goo)
answer.
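[Editor's note] For readers wondering what the one-pass sketch looks like in
practice, here is a deliberately simplified sketch-then-cluster outline in
Java, loosely in the spirit of Mahout's streaming k-means and the
Shindler/Wong/Meyerson paper. It is an illustration under those assumptions,
not Mahout's actual implementation.

import java.util.ArrayList;
import java.util.List;

public class OnePassSketch {

  static class WeightedPoint {
    final double[] point;
    double weight;
    WeightedPoint(double[] p, double w) { point = p.clone(); weight = w; }
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return Math.sqrt(s);
  }

  /** One pass over the data; returns a weighted sketch of bounded size. */
  static List<WeightedPoint> sketch(Iterable<double[]> data, int maxSize) {
    List<WeightedPoint> sketch = new ArrayList<>();
    double threshold = 1e-3; // merge distance; grows as the sketch fills
    for (double[] x : data) {
      WeightedPoint nearest = null;
      double best = Double.POSITIVE_INFINITY;
      for (WeightedPoint c : sketch) {
        double d = distance(x, c.point);
        if (d < best) { best = d; nearest = c; }
      }
      if (nearest != null && best < threshold) {
        // Fold the point into the nearest sketch centroid (running mean).
        for (int i = 0; i < x.length; i++) {
          nearest.point[i] += (x[i] - nearest.point[i]) / (nearest.weight + 1);
        }
        nearest.weight++;
      } else {
        sketch.add(new WeightedPoint(x, 1));
        if (sketch.size() > maxSize) {
          threshold *= 1.5; // coarsen; a real implementation also re-merges
        }
      }
    }
    return sketch;
  }
}

Running an ordinary in-memory k-means over the weighted sketch points is what
keeps the total near the O(n + k log n + small goo) Ted describes.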
Re: DTW distance measure and K-medoids, Hierarchical clustering
Anand,

That is a fine idea. It is called a medoid instead of a mean
( https://en.wikipedia.org/wiki/Medoid ).

The basic idea is that for any metric m, you can define the medoid as the
element from the set that minimizes the sum of the distances to the other
elements under that metric. In one dimension, the L_1 medoid is the median.
The L_2 medoid is the point in the dataset closest to the mean.

As you say, computing the medoid is expensive. The cost is clearly bounded by
O(n^2), but you might be able to do somewhat better in practice with clever
branching and bounding. I don't think that there is anything like an O(n)
algorithm for medoid computation.

On Wed, Jan 14, 2015 at 8:25 PM, Anand Avati wrote:

> Perhaps you could think of the centroid as one of the signals itself, from
> which the average distance to the rest of the signals in the cluster is the
> lowest? This way, instead of finding that mythical "mean" of DTWs, you hop
> from one signal to another over iterations as you refine memberships.
> However, the "mean" calculation would become O(N^2) (within a cluster).
>
> Be warned I have never tried this method.
>
> Thanks
>
> On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic wrote:
>
> > Ted,
> >
> > It's because I'm trying to cluster time series that may differ in length,
> > some parts may be shifted, some parts may last longer in one signal than
> > in the other (somehow skewed), but they represent more or less the same
> > thing. DTW seems good because it's optimized for such things (used a lot
> > in speech recognition, where problems like this may occur). The problem
> > is that K-means incorporates calculating centroids (and the thing is, I
> > need some kind of centroids that will represent the cluster in the end).
> > Maybe I'm wrong, please correct me, I'm a youngster in this kind of
> > thing, but there is a problem finding centroids for such signals.
> > Distance measures that recalculate centroids (such as Euclidean)
> > calculate pairwise distance between components of two signals (a signal
> > and a centroid). How can I use a distance measure for signals with
> > different lengths? And also, there is shifting and skewing in signals
> > that represent the same thing, and their mean could be totally wrong.
> > For example, the mean of two sinusoids where one of them is shifted by
> > Pi is 0. And that's definitely not a good centroid in my case.
> >
> > So, I'm trying to find a scalable solution for my problem. I tried to
> > fit it into what's already implemented in Mahout (for clustering), but
> > it's not so obvious to me.
> >
> > I'm open to suggestions, I'm still new to all of this.
> >
> > Thanks,
> > Marko
> >
> > On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> > > Why is it you can't compute a mean?
> > >
> > > On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic wrote:
> > >
> > >> Thank you for your answer Ted.
> > >>
> > >> What about some kind of bisecting k-means? I'm trying to cluster time
> > >> series of different lengths and I came up with the idea to use DTW as
> > >> a similarity measure, which seems to be adequate, but the thing is, I
> > >> cannot use it with K-means, since it's hard to define centroids based
> > >> on time series which can have different length/phase. So I was
> > >> thinking about hierarchical clustering, since it seems appropriate to
> > >> combine with DTW, but it is not scalable, as you said. So my next
> > >> thought is to try bisecting k-means, which seems scalable, since it
> > >> is based on K-means step repetitions.
> > >>
> > >> My idea is next, by steps:
> > >>
> > >> - Take two signals as initial centroids (maybe the two signals that
> > >>   have the smallest similarity, calculated using DTW)
> > >> - Assign all signals to the two initial centroids
> > >> - Repeat the procedure on the biggest cluster
> > >>
> > >> In this way I could use DTW as the distance measure, which could be
> > >> useful since my data may be shifted or skewed, and avoid calculating
> > >> centroids. At the end I could take one signal from each cluster that
> > >> is the most similar to the others in the cluster (some kind of
> > >> centroid/medoid).
> > >>
> > >> What do you think about this approach and about the scalability?
> > >>
> > >> I would highly appreciate your answer, thanks.
> > >>
> > >> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> > >>
> > >>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <
> > >>> marko.di...@nissatech.com > wrote:
> > >>>
> > >>> 1) Is there an implementation of DTW (Dynamic Time Warping) in
> > >>> Mahout that could be used as a distance measure for clustering?
> > >>>
> > >>> No.
> > >>>
> > >>> 2) Why isn't there an implementation of K-medoids in Mahout? I'm
> > >>> guessing that it could not be implemented efficiently on Hadoop,
> > >>> but I wanted to check if something like that is possible.
> > >>>
> > >>> Scalability, as you suspected.
> > >>>
> > >>> 3) S
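[Editor's note] As a concrete rendering of the definition above, here is a
small generic medoid routine in Java; the metric is passed in, so DTW or
anything else will do. This is a minimal sketch, not a library method.

import java.util.List;
import java.util.function.ToDoubleBiFunction;

public class Medoids {
  /**
   * O(n^2) medoid: the element minimizing the summed distance to all others.
   * Assumes a non-negative metric, so the partial sum only grows and we can
   * prune a candidate as soon as it exceeds the best cost seen so far.
   */
  static <T> T medoid(List<T> items, ToDoubleBiFunction<T, T> metric) {
    T best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (T candidate : items) {
      double cost = 0;
      for (T other : items) {
        cost += metric.applyAsDouble(candidate, other);
        if (cost >= bestCost) {
          break; // crude branch-and-bound of the kind mentioned above
        }
      }
      if (cost < bestCost) {
        bestCost = cost;
        best = candidate;
      }
    }
    return best;
  }
}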
Re: Own recommender
The old Taste code is not the state of the art. User-based recommenders built
on that will be slow.

On Thu, Jan 15, 2015 at 7:10 AM, Juanjo Ramos wrote:

> Hi David,
>
> You implement your custom algorithm and create your own class that
> implements the UserSimilarity interface.
>
> When you then instantiate your user-based recommender, just pass your
> custom class for the UserSimilarity parameter.
>
> Best.
>
> On Thu, Jan 15, 2015 at 1:11 PM, ARROYO MANCEBO David <
> david.arr...@altran.com> wrote:
>
> > Hi folks,
> >
> > How can I start to build my own recommender system in Apache Mahout with
> > my personal algorithm? I need a custom UserSimilarity. Maybe a subclass
> > of UserSimilarity like PearsonCorrelationSimilarity?
> >
> > Thanks
> > Regards :)
Mahout 1.0 on EMR with Spark
Has anyone tried running Mahout 1.0 on EMR with Spark? I've used the
instructions at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
to get an EMR cluster running Spark, and I am now able to deploy an EMR
cluster with Spark using the AWS Java APIs. EMR allows running a custom script
as a bootstrap action, which I could use to install Mahout. What I am trying
to figure out is whether I would need to build Mahout every time I start an
EMR cluster, or whether I should have pre-built artifacts and develop a script
similar to what awslabs is using to install Spark? Thanks.
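[Editor's note] One hedged answer: building Mahout once, uploading the
artifacts to S3, and having a small bootstrap script pull them onto each node
avoids rebuilding on every cluster launch. A sketch using the AWS SDK for
Java; the bucket, script path, and cluster name are all hypothetical
placeholders, and the instance/role configuration is elided.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public class MahoutBootstrap {
  public static void main(String[] args) {
    // Hypothetical script that downloads pre-built Mahout artifacts from S3
    // and unpacks them on each node; the S3 path is a placeholder.
    BootstrapActionConfig installMahout = new BootstrapActionConfig()
        .withName("Install Mahout")
        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
            .withPath("s3://my-bucket/bootstrap/install-mahout.sh"));

    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("spark-mahout-cluster")
        // ... instance, role, and Spark bootstrap configuration elided ...
        .withBootstrapActions(installMahout);
    emr.runJobFlow(request);
  }
}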
Re: Own recommender
Hi David,

You implement your custom algorithm and create your own class that implements
the UserSimilarity interface.

When you then instantiate your user-based recommender, just pass your custom
class for the UserSimilarity parameter.

Best.

On Thu, Jan 15, 2015 at 1:11 PM, ARROYO MANCEBO David <
david.arr...@altran.com> wrote:

> Hi folks,
>
> How can I start to build my own recommender system in Apache Mahout with
> my personal algorithm? I need a custom UserSimilarity. Maybe a subclass of
> UserSimilarity like PearsonCorrelationSimilarity?
>
> Thanks
> Regards :)
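[Editor's note] A minimal skeleton of what that looks like against the Taste
API; the scoring logic is a placeholder to fill in with your own algorithm.

import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MySimilarity implements UserSimilarity {

  private final DataModel model;

  public MySimilarity(DataModel model) {
    this.model = model;
  }

  @Override
  public double userSimilarity(long userID1, long userID2) throws TasteException {
    // Your personal algorithm goes here; return a value in [-1, 1].
    return 0.0;
  }

  @Override
  public void setPreferenceInferrer(PreferenceInferrer inferrer) {
    throw new UnsupportedOperationException();
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // No cached state to refresh in this sketch.
  }
}

// Wiring it in (names as in the Taste API):
// UserSimilarity sim = new MySimilarity(model);
// UserNeighborhood hood = new NearestNUserNeighborhood(25, sim, model);
// Recommender rec = new GenericUserBasedRecommender(model, hood, sim);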
Re: boost selected dimensions in kmeans clustering
On Thu, Jan 15, 2015 at 5:23 AM, Miguel Angel Martin junquera <
mianmarjun.mailingl...@gmail.com> wrote:

> My question is: is it better to scale up these dimensions directly in the
> final mixed tf-idf sequence file using these correction factors, OR to
> first scale up each tf vector, then mix the vectors and recalculate the
> final tf-idf, to minimize errors or deviations in a subsequent clustering
> of the final mixed tf-idf vectors?

Mathematically, it doesn't matter whether you scale the vectors at generation
time, before computing distances, or during the distance computation itself.
Different places for the change may be more or less easy in terms of
programming. The two easiest places tend to be at the beginning (if you know
the weights), since you have to write that code anyway, or at the end, since
there are provisions for changing the metric in some programs.
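[Editor's note] For illustration, scaling at generation time can be as small
as this sketch against Mahout's math Vector API; the dimension indices and
boost factors are whatever your correction scheme produces. Scaling inside a
weighted distance measure instead is mathematically equivalent, as Ted notes.

import org.apache.mahout.math.Vector;

public class DimensionBooster {
  /** Multiply the chosen dimensions of a tf-idf vector by boost factors. */
  static Vector boost(Vector v, int[] dims, double[] factors) {
    for (int i = 0; i < dims.length; i++) {
      v.setQuick(dims[i], v.getQuick(dims[i]) * factors[i]);
    }
    return v;
  }
}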
Re: boost selected dimensions in kmeans clustering
hi Ted,

Yes, I was considering various possibilities, and one of them was this
(scaling up these dimensions, for example, by multiplying by a configurable
correction factor).

I really want to mix two different vectors from the same documents with
different lengths and dictionaries (perhaps some terms of the dictionaries are
the same). Then I will be multiplying each dimension of each vector by a
configurable correction factor.

My question is: is it better to scale up these dimensions directly in the
final mixed tf-idf sequence file using these correction factors, OR to first
scale up each tf vector, then mix the vectors and recalculate the final
tf-idf, to minimize errors or deviations in a subsequent clustering of the
final mixed tf-idf vectors?

Thanks in advance for your help.

One last note: I am a bass player, and the 701q AKG with the FiiO E12+E09K is
a perfect combination!! ;-)

2015-01-14 20:12 GMT+01:00 Ted Dunning:

> The easiest way is to scale those dimensions up.
>
> On Wed, Jan 14, 2015 at 2:41 AM, Miguel Angel Martin junquera <
> mianmarjun.mailingl...@gmail.com> wrote:
>
> > hi all,
> >
> > I am clustering several text documents from distinct sources using
> > k-means, and I have already generated the sparse vectors for each
> > document. I want to boost some dimensions in the sparse vectors.
> >
> > What is the best way to do this?
> >
> > Is it a good idea to load the vectors, find the dimension values of tf
> > or tf-idf, and boost these values?
> >
> > Thanks in advance and regards
Own recommender
Hi folks,

How can I start to build my own recommender system in Apache Mahout with my
personal algorithm? I need a custom UserSimilarity. Maybe a subclass of
UserSimilarity like PearsonCorrelationSimilarity?

Thanks
Regards :)
Re: DTW distance measure and K-medoids, Hierarchical clustering
Ted,

Thank you for your answer. Maybe I gave a wrong picture of my data when giving
a sinusoid as an example; my time series are not periodic. Let's say that I
have a signal that represents the value of power when some device is turned
on. That power signal depends on the time when a person turns on the device.
The device can be turned on after different intervals and can stay turned on
for different durations. But all of that represents similar behaviour that can
be grouped together. I'm simplifying things a bit, but that's the concept.

Either way, I made a simple test in RapidMiner and used K-medoids with DTW as
a similarity measure on a small subset, and it gave very satisfying results
(please don't laugh at my approach, I'm a beginner :) ).

Now, in one of the previous mails you answered that K-medoids isn't scalable,
so it isn't implemented. Does scalable mean not possible, or not such a good
option? Why isn't it scalable? Does it take too much time to execute?

I've got an idea to implement it as a two-phase MapReduce, like this:

Pick initial centroids.

1st phase:
  map: find the distance from each node to its nearest centroid using DTW
       and emit the pair (clusterId, pointId)
  reduce: for each cluster, output (oneOfTheNodesInClusterId,
          listOfAllNodesInThatCluster)

2nd phase:
  map: calculate the distance from each point to all the other points in the
       cluster as if it were the centroid, and emit
       (pointAndItsClusterId, costIfThisPointIsCentroid)
  reduce: find the point in each cluster that results in the smallest error
          in that cluster

Repeat until convergence or a certain number of iterations.

This is the first thing that came to my mind; there's probably a better
solution. If it's not scalable because of too many calculations, how much time
could I expect for such an algorithm in the case of 10,000 signals with 300
points each, for example? How can I even estimate that? (See the editor's
sketch after this message for a rough estimate.)

Thanks for your effort, if you have time to answer.

Regards,
Marko

On Thu 15 Jan 2015 05:25:55 AM CET, Anand Avati wrote:

Perhaps you could think of the centroid as one of the signals itself, from
which the average distance to the rest of the signals in the cluster is the
lowest? This way, instead of finding that mythical "mean" of DTWs, you hop
from one signal to another over iterations as you refine memberships. However,
the "mean" calculation would become O(N^2) (within a cluster).

Be warned I have never tried this method.

Thanks

On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic wrote:

Ted,

It's because I'm trying to cluster time series that may differ in length, some
parts may be shifted, some parts may last longer in one signal than in the
other (somehow skewed), but they represent more or less the same thing. DTW
seems good because it's optimized for such things (used a lot in speech
recognition, where problems like this may occur). The problem is that K-means
incorporates calculating centroids (and the thing is, I need some kind of
centroids that will represent the cluster in the end). Maybe I'm wrong, please
correct me, I'm a youngster in this kind of thing, but there is a problem
finding centroids for such signals. Distance measures that recalculate
centroids (such as Euclidean) calculate pairwise distance between components
of two signals (a signal and a centroid). How can I use a distance measure for
signals with different lengths? And also, there is shifting and skewing in
signals that represent the same thing, and their mean could be totally wrong.
For example, the mean of two sinusoids where one of them is shifted by Pi is
0. And that's definitely not a good centroid in my case.

So, I'm trying to find a scalable solution for my problem. I tried to fit it
into what's already implemented in Mahout (for clustering), but it's not so
obvious to me.

I'm open to suggestions, I'm still new to all of this.

Thanks,
Marko

On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:

Why is it you can't compute a mean?

On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic wrote:

Thank you for your answer Ted.

What about some kind of bisecting k-means? I'm trying to cluster time series
of different lengths and I came up with the idea to use DTW as a similarity
measure, which seems to be adequate, but the thing is, I cannot use it with
K-means, since it's hard to define centroids based on time series which can
have different length/phase. So I was thinking about hierarchical clustering,
since it seems appropriate to combine with DTW, but it is not scalable, as you
said. So my next thought is to try bisecting k-means, which seems scalable,
since it is based on K-means step repetitions.

My idea is next, by steps:

- Take two signals as initial centroids (maybe the two signals that have the
  smallest similarity, calculated using DTW)
- Assign all signals to the two initial centroids
- Repeat the procedure on the biggest cluster

In this way I could use DTW as the distance measure, which could be useful
since my data may be shifted or skewed, and avoid calculating centroids. At
the end I could
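[Editor's note] On the cost question above, a rough back-of-envelope estimate:
each DTW comparison of two 300-point signals fills a 300 x 300 table (about
90,000 cell updates), and the second phase is pairwise within clusters,
roughly n^2/k comparisons per iteration for 10,000 signals split into k
balanced clusters, i.e. on the order of 10^8/k DTW calls per iteration. A
textbook DTW in Java, assuming absolute difference as the local cost:

public class Dtw {
  /** O(n*m) dynamic-programming DTW distance between two series. */
  static double distance(double[] a, double[] b) {
    int n = a.length;
    int m = b.length;
    double[][] d = new double[n + 1][m + 1];
    for (double[] row : d) {
      java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
    }
    d[0][0] = 0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double cost = Math.abs(a[i - 1] - b[j - 1]);
        d[i][j] = cost
            + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
      }
    }
    return d[n][m];
  }
}

A Sakoe-Chiba band (restricting |i - j| to a window w) is a common way to cut
each comparison from O(L^2) to O(L*w) if the estimate above turns out to be
too slow.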
Re: How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce
Is there any way? I'm waiting for a reply. I have posted this question
everywhere, but no one is responding. I feel like this is the right place to
ask, as some of you may have come across the same issue and gotten stuck.

On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni wrote:

> Yes, one of my friends is implementing the same. I know global sharing of
> data is not possible across Hadoop MapReduce, but I need to check if it
> can be done somehow in Hadoop MapReduce, because I found some papers on
> KNN in Hadoop also. And I am trying to compare the performance too.
>
> Hope some pointers can help me.
>
> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning wrote:
>
>> Have you considered implementing using something like Spark? That could
>> be much easier than raw map-reduce.
>>
>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni wrote:
>>
>>> In a KNN-like algorithm we need to load the model data into cache for
>>> predicting the records.
>>>
>>> Here is the example for KNN.
>>>
>>> [image: Inline image 1]
>>>
>>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>>> load it into the distributed cache.
>>>
>>> One way is to split/partition the model result into some files, perform
>>> the distance calculation for all records in each file, and then find
>>> the minimum distance and the most frequent class label to predict the
>>> outcome.
>>>
>>> How can we partition the file and perform the operation on these
>>> partitions?
>>>
>>> i.e., 1st record -> partition1, partition2, ...
>>>       2nd record -> partition1, partition2, ...
>>>
>>> This is what came to my thought.
>>>
>>> Is there any further way?
>>>
>>> Any pointers would help me.
>>>
>>> --
>>> *Thanks & Regards*
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/

--
*Thanks & Regards*

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
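[Editor's note] One hedged sketch of the partitioned approach described above:
each mapper scores the test record against its own model partition and keeps
a local top-k; merging the local lists in a single reducer yields the global
k nearest neighbors, since the global top-k is always contained in the union
of the local top-ks. All names are illustrative, this is not a complete
MapReduce job, and a Euclidean distance stands in for whatever measure the
model actually uses.

import java.util.PriorityQueue;

public class KnnPartition {

  static class Neighbor {
    final double distance;
    final String label;
    Neighbor(double d, String l) { distance = d; label = l; }
  }

  static double euclidean(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return Math.sqrt(s);
  }

  /** Local top-k against one model partition; merge results across partitions. */
  static PriorityQueue<Neighbor> localTopK(double[] test, double[][] partition,
                                           String[] labels, int k) {
    // Max-heap by distance so the worst of the current k is evicted first.
    PriorityQueue<Neighbor> heap =
        new PriorityQueue<>(k, (x, y) -> Double.compare(y.distance, x.distance));
    for (int i = 0; i < partition.length; i++) {
      heap.add(new Neighbor(euclidean(test, partition[i]), labels[i]));
      if (heap.size() > k) {
        heap.poll(); // drop the farthest neighbor
      }
    }
    return heap;
  }
}

The reducer then takes the k smallest distances across all partitions and
predicts the majority class label among them.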