Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-15 Thread Ted Dunning
On Thu, Jan 15, 2015 at 1:47 PM,  wrote:

> So, to summarize, my idea of K-medoids with DTW as a distance measure
> (implemented as a two-phase MapReduce) doesn't sound like an unrealistic
> idea?
>

Sounds like a fine idea.


> I'm mostly afraid of ending up in a situation where the algorithm needs a
> couple of days to generate clusters, because it is more computationally
> expensive than K-means. Is there a way to predict the size of data that
> could be processed in a reasonable amount of time (let's say a day at most)?
> I understand that this again depends on the resources, but I'm hoping for a
> rough approximation, if possible...
>

I would guess that this can be made to work up to a few million elements,
particularly if you can figure out how to do fast search in order to get
the most similar items quickly.  You can then filter and re-order using
DTW.  For the fast search, I would suggest motif extraction.  See the
SAX work [1] or the clustering I did in the Anomaly Detection booklet
[2].
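
For the DTW re-ranking step itself, the textbook O(n*m) dynamic-programming
recurrence looks roughly like this (plain Java, absolute-difference point
cost; just an illustration, not anything that ships in Mahout):

// Sketch of the classic dynamic-time-warping distance between two series.
// O(n*m) time and memory; fine for re-ranking a short candidate list,
// too expensive to run against every pair in a large data set.
public final class Dtw {
  public static double distance(double[] a, double[] b) {
    int n = a.length, m = b.length;
    double[][] d = new double[n + 1][m + 1];
    for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
    d[0][0] = 0.0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double cost = Math.abs(a[i - 1] - b[j - 1]);
        // extend the cheapest of the three admissible warping moves
        d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
      }
    }
    return d[n][m];
  }
}

That quadratic cost per pair is exactly why you want the fast search to
shrink the candidate list before DTW ever runs.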

The clustering itself could use ideas related to the DP-means algorithm or
some of the scalable k-means work.  See Shindler, Wong, and Meyerson's
papers [3] and [4].


>
> Thank you again, I'm actually excited to start with the implementation, if
> it doesn't sound that crazy.
>

Sounds very plausible.  It will be work to get it really right, but could
be really good.

[1] http://www.cs.ucr.edu/~eamonn/SAX.htm
[2] http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html
[3]
http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
[4] http://dl.acm.org/citation.cfm?id=2133039


Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-15 Thread marko . dinic

Dear Ted,

Thank you very much for your answer. It is very inspiring for a
beginner like me to see the effort that you put into answering questions
like mine; I'm sure they look trivial. And the whole community
involved is great.


So, to summarize, my idea of K-medoids with DTW as a distance measure
(implemented as a two-phase MapReduce) doesn't sound like an
unrealistic idea?


I'm mostly afraid of ending up in a situation where the algorithm needs
a couple of days to generate clusters, because it is more
computationally expensive than K-means. Is there a way to predict the
size of data that could be processed in a reasonable amount of time
(let's say a day at most)? I understand that this again depends on the
resources, but I'm hoping for a rough approximation, if possible...


Thank you again, I'm actually excited to start with the implementation,
if it doesn't sound that crazy.


I wish you all the best,
Marko



Quoting Ted Dunning:


On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic 
wrote:




Thank you for your answer. Maybe I gave a wrong picture of my data when
giving a sinusoid as an example; my time series are not periodic.



I merely continued with that example because it illustrates how an
alternative metric makes all sinusoids look very similar.



Let's say that I have a signal that represents the power value when some
device is turned on. That power signal depends on the time when a person
turns on the device. The device can be turned on after different intervals
and can stay turned on for different durations. But all of that represents
a similar behaviour that can be grouped together. I'm simplifying things a
bit, but that's the concept.



Sure.  But mean and medoid are still something you can define here.



Either way, I made a simple test in RapidMiner and used K-medoids with DTW
as a similarity measure on a small subset and it gave very satisfying
results (please don't laugh at my approach, I'm a beginner :) ).



Not laughing.  Sounds like a fine result.




Now, in one of the previous mails you answered that K-medoids isn't
scalable, so it isn't implemented. Does scalable mean not possible, or not
such a good option? Why isn't it scalable? Does it take too much time to
execute?



Scalable means that as the data gets bigger, the amount of resources gets
at most bigger in linear proportion.




I've got an idea to implement it as a two-phase MapReduce, like this



Yeah... this is pretty standard,

BUT

That takes a number of iterations that is unknown and which depends on the
size of your data.  Each pass costs O(n).  Having many passes means that
you are super-linear in cost.

For k-means, you can create a sketch in one pass over the data and then
munch on the sketch to get a clustering.  This gives an O(n + k log n +
small goo) answer.





Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-15 Thread Ted Dunning
On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic 
wrote:

>
>
> Thank you for your answer. Maybe I gave a wrong picture of my data when
> giving a sinusoid as an example; my time series are not periodic.


I merely continued with that example because it illustrates how an
alternative metric makes all sinusoids look very similar.


> Let's say that I have a signal that represents the power value when some
> device is turned on. That power signal depends on the time when a person
> turns on the device. The device can be turned on after different intervals
> and can stay turned on for different durations. But all of that represents
> a similar behaviour that can be grouped together. I'm simplifying things a
> bit, but that's the concept.
>

Sure.  But mean and medoid are still something you can define here.


> Either way, I made a simple test in RapidMiner and used K-medoids with DTW
> as a similarity measure on a small subset and it gave very satisfying
> results (please don't laugh at my approach, I'm a beginner :) ).
>

Not laughing.  Sounds like a fine result.


>
> Now, in one of the previous mails you answered that K-medoids isn't
> scalable, so it isn't implemented. Does scalable mean not possible, or not
> such a good option? Why isn't it scalable? Does it take too much time to
> execute?
>

Scalable means that as the data gets bigger, the amount of resources gets
at most bigger in linear proportion.


>
> I've got an idea to implement it as a two-phase MapReduce, like this
>

Yeah... this is pretty standard,

BUT

That takes a number of iterations that is unknown and which depends on the
size of your data.  Each pass costs O(n).  Having many passes means that
you are super-linear in cost.

For k-means, you can create a sketch in one pass over the data and then
munch on the sketch to get a clustering.  This gives an O(n + k log n +
small goo) answer.
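
To make the sketch idea concrete, here is a deliberately oversimplified,
leader-style one-pass sketch (plain Java; the distance threshold is an
assumed parameter, and real streaming k-means additionally adapts that
threshold and keeps weighted centroids):

import java.util.ArrayList;
import java.util.List;

// One-pass "sketch" in the spirit of streaming k-means / DP-means:
// the sketch stays small, and the expensive clustering then runs on
// the sketch instead of on the full data set.
final class OnePassSketch {
  static List<double[]> sketch(Iterable<double[]> points, double threshold) {
    List<double[]> centroids = new ArrayList<>();
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, euclidean(p, c));
      }
      if (best > threshold) {
        centroids.add(p.clone());   // too far from everything seen so far: new centroid
      }
    }
    return centroids;               // feed this small set to a batch clusterer
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

The expensive clustering then runs on the small sketch rather than on the
full data set.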


Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-15 Thread Ted Dunning
Anand,

That is a fine idea.  It is called a medoid instead of a mean (
https://en.wikipedia.org/wiki/Medoid )

The basic idea is that for any metric m, you can define the medoid as the
element of the set that minimizes the sum of the distances to the other
elements under that metric.  In one dimension, the L_1 medoid is the median.
The L_2 medoid is the point in the dataset closest to the mean.

As you say, computing the medoid is expensive.  The cost is clearly bounded
by O(n^2) but you might be able to do somewhat better in practice with
clever branching and bounding.  I don't think that there is anything like
an O(n) algorithm for medoid computation.
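
For concreteness, the brute-force computation (with the obvious early-exit
bound) looks like this in plain Java; Metric here is just a local stand-in
for whatever distance you plug in, e.g. DTW:

// Brute-force medoid: the element with the smallest sum of distances
// to all other elements. O(n^2) distance evaluations.
final class Medoid {

  interface Metric {                      // stand-in for your distance, e.g. DTW
    double distance(double[] a, double[] b);
  }

  static int medoidIndex(double[][] points, Metric metric) {
    int best = -1;
    double bestSum = Double.POSITIVE_INFINITY;
    for (int i = 0; i < points.length; i++) {
      double sum = 0.0;
      // early exit: stop summing once this candidate is already worse than the best
      for (int j = 0; j < points.length && sum < bestSum; j++) {
        if (i != j) sum += metric.distance(points[i], points[j]);
      }
      if (sum < bestSum) {
        bestSum = sum;
        best = i;
      }
    }
    return best;
  }
}

The sum < bestSum check in the inner loop is the simplest form of the
bounding mentioned above: once a candidate's running sum exceeds the best
total seen so far, it cannot be the medoid.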

On Wed, Jan 14, 2015 at 8:25 PM, Anand Avati  wrote:

> Perhaps you could think of the centroid as one of the signals itself, the
> one from which the average distance to the rest of the signals in the
> cluster is the lowest? This way, instead of finding that mythical "mean" of
> DTWs, you hop from one signal to another over iterations as you refine
> memberships. However, the "mean" calculation would become O(N^2) (within a
> cluster).
>
> Be warned I have never tried this method.
>
> Thanks
>
> On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic 
> wrote:
>
> > Ted,
> >
> > It's because I'm trying to cluster time series that may differ in
> > length, some parts may be shifted, some parts may last longer in one
> > signal than in the other (somehow skewed), but they represent more or
> > less the same thing. DTW seems good because it's optimized for such
> > things (used a lot in speech recognition, where problems like this may
> > occur). The problem is that K-means incorporates calculating centroids
> > (and the thing is, I need some kind of centroids that will represent the
> > cluster in the end). Maybe I'm wrong, please correct me, I'm a youngster
> > in this kind of thing, but there is a problem finding centroids for such
> > signals. Distance measures that recalculate centroids (such as
> > Euclidean) calculate pairwise distances between components of two
> > signals (signal and centroid). How can I use such a distance measure for
> > signals of different lengths? And also, there is shifting and skewing
> > in signals that represent the same thing, and their mean could be
> > totally wrong. For example, the mean of two sinusoids while one of them
> > is shifted by Pi is 0. And that's definitely not a good centroid in my
> > case.
> >
> > So, I'm trying to find a scalable solution for my problem, I tried to
> > fit it in what's already implemented in Mahout (for clustering), but
> > it's not so obvious to me.
> >
> > I'm open to suggestions, I'm still new to all of this.
> >
> > Thanks,
> > Marko
> >
> > On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> > > Why is it you can't compute a mean?
> > >
> > >
> > >
> > > On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic  >
> > > wrote:
> > >
> > >> Thank you for your answer Ted.
> > >>
> > >> What about some kind of Bisecting k-means? I'm trying to cluster time
> > >> series of different lengths and I came up with the idea to use DTW as a
> > >> similarity measure, which seems to be adequate, but the thing is, I
> > cannot
> > >> use it with K-means, since it's hard to define centroids based on time
> > >> series which can have different length/phase. So I was thinking about
> > >> Hierarchical clustering, since it seems appropriate to combine with
> DTW,
> > >> but is not scalable, as you said. So my next thought is to try with
> > >> bisecting k-means that seems scalable, since it is based on K-means
> step
> > >> repetitions. My idea is as follows, step by step:
> > >>
> > >> - Take two signals as initial centroids (maybe two signals that have
> > >> smallest similarity, calculated using DTW)
> > >> - Assign all signals to two initial centroids
> > >> - Repeat the procedure on the biggest cluster
> > >>
> > >> In this way I could use DTW as the distance measure, which could be
> > >> useful since my data may be shifted or skewed, and avoid calculating
> > >> centroids. At the end I could take one signal from each cluster that is
> > >> the most similar to the others in the cluster (some kind of
> > >> centroid/medoid).
> > >>
> > >> What do you think about this approach and about the scalability?
> > >>
> > >> I would highly appreciate your answer, thanks.
> > >>
> > >> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> > >>
> > >>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <marko.di...@nissatech.com>
> > >>> wrote:
> > >>>
> > >>>   1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
> > >>>   that could be used as a distance measure for clustering?
> > 
> > 
> > >>> No.
> > >>>
> > >>>
> > >>>
> >  2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing
> >  that it could not be implemented efficiently on Hadoop, but I wanted to
> >  check if something like that is possible.
> > 
> > 
> > >>> Scalability as you suspected.
> > >>>
> > >>>
> > >>>
> >  3) S

Re: Own recommender

2015-01-15 Thread Ted Dunning
The old Taste code is not the state of the art.  User-based recommenders
built on that will be slow.



On Thu, Jan 15, 2015 at 7:10 AM, Juanjo Ramos  wrote:

> Hi David,
> You implement your custom algorithm and create your own class that
> implements the UserSimilarity interface.
>
> When you then instantiate your User-Based recommender, just pass your
> custom class for the UserSimilarity parameter.
>
> Best.
>
> On Thu, Jan 15, 2015 at 1:11 PM, ARROYO MANCEBO David <
> david.arr...@altran.com> wrote:
>
> > Hi folks,
> > How can I start building my own recommender system in Apache Mahout with
> > my personal algorithm? I need a custom UserSimilarity. Maybe a subclass
> > of UserSimilarity, like PearsonCorrelationSimilarity?
> >
> > Thanks
> > Regards :)
> >
>


Mahout 1.0 on EMR with Spark

2015-01-15 Thread Pasmanik, Paul
Has anyone tried running Mahout 1.0 on EMR with Spark?
I've used the instructions at
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get an
EMR cluster running Spark.  I am now able to deploy an EMR cluster with Spark
using the AWS Java APIs.
EMR allows running a custom script as a bootstrap action, which I can use to
install Mahout.
What I am trying to figure out is whether I would need to build Mahout every
time I start an EMR cluster, or whether I should use pre-built artifacts and
develop a script similar to what awslabs uses to install Spark.

Thanks.






Re: Own recommender

2015-01-15 Thread Juanjo Ramos
Hi David,
You implement your custom algorithm and create your own class that
implements the UserSimilarity interface.

When you then instantiate your User-Based recommender, just pass your
custom class for the UserSimilarity parameter.

Best.
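
Roughly, the skeleton looks like the following (check the exact method
signatures against the Mahout version you are using; MySimilarity and its
scoring logic are placeholders for your own algorithm):

import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Skeleton of a custom user similarity for the (old) Taste recommender API.
// MySimilarity is a placeholder name; the scoring logic is yours.
public class MySimilarity implements UserSimilarity {

  private final DataModel model;

  public MySimilarity(DataModel model) {
    this.model = model;
  }

  @Override
  public double userSimilarity(long userID1, long userID2) throws TasteException {
    // compute your own score in [-1, 1], e.g. from the two users' preferences
    return 0.0;
  }

  @Override
  public void setPreferenceInferrer(PreferenceInferrer inferrer) {
    // optional; many similarities simply ignore the inferrer
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // nothing cached here, so nothing to refresh
  }
}

You would then pass an instance of this class wherever you would otherwise
pass, say, PearsonCorrelationSimilarity when constructing the user-based
recommender and its neighborhood.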

On Thu, Jan 15, 2015 at 1:11 PM, ARROYO MANCEBO David <
david.arr...@altran.com> wrote:

> Hi folks,
> How can I start building my own recommender system in Apache Mahout with
> my personal algorithm? I need a custom UserSimilarity. Maybe a subclass
> of UserSimilarity, like PearsonCorrelationSimilarity?
>
> Thanks
> Regards :)
>


Re: boost selected dimensions in kmeans clustering

2015-01-15 Thread Ted Dunning
On Thu, Jan 15, 2015 at 5:23 AM, Miguel Angel Martin junquera <
mianmarjun.mailingl...@gmail.com> wrote:

> My question is:
> Is it better to scale up these dimensions directly in the final mixed
> tf-idf sequence file using these correction factors, OR to first scale up
> each tf vector, then mix the vectors and recalculate the final tf-idf, in
> order to minimize errors or deviations in a subsequent clustering of the
> final mixed tf-idf vectors?
>

Mathematically it doesn't matter whether you scale the vectors at
generation time or before computing distance or by scaling during the
distance computation.

Different places for the change may be more or less easy in terms of
programming.  The two easiest places tend to be at the beginning (if you
know the weights) since you have to write that code anyway, or at the end
since there are provisions for changing the metric in some programs.
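
A tiny illustration of that equivalence, in plain Java rather than Mahout
API: scaling dimension i of every vector by a weight w[i] up front gives
exactly the same Euclidean distances as leaving the vectors alone and
applying the weights inside the distance computation.

// Two equivalent ways to boost selected dimensions for Euclidean distance:
// (1) scale the vectors once at generation time, or
// (2) apply the same per-dimension weights inside the distance itself.
final class WeightedDistance {

  static double[] scale(double[] v, double[] weights) {
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i] * weights[i];
    }
    return out;
  }

  static double weightedEuclidean(double[] a, double[] b, double[] weights) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = weights[i] * (a[i] - b[i]);
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[] a = {1, 2, 3}, b = {0, 5, 1}, w = {1, 3, 0.5};
    // identical results, by construction:
    double scaledFirst = euclidean(scale(a, w), scale(b, w));
    double weightedLater = weightedEuclidean(a, b, w);
    System.out.println(scaledFirst + " == " + weightedLater);
  }
}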


Re: boost selected dimensions in kmeans clustering

2015-01-15 Thread Miguel Angel Martin junquera
Hi Ted,

Yes, I was considering various possibilities; one of them was this (scale
up these dimensions, for example, by multiplying by a configurable
correction factor).

I really want to mix two different vectors from the same documents,
with different lengths and dictionaries (perhaps some terms of the
dictionaries are the same). Then I will multiply each dimension of
each vector by a configurable correction factor.

My question is:
Is it better to scale up these dimensions directly in the final mixed
tf-idf sequence file using these correction factors, OR to first scale up
each tf vector, then mix the vectors and recalculate the final tf-idf, in
order to minimize errors or deviations in a subsequent clustering of the
final mixed tf-idf vectors?

Thanks in advance for your help.

One last note:

I am a bass player, and the 701q AKG with a FiiO E12+E09K is a perfect
combination!!


;-)






2015-01-14 20:12 GMT+01:00 Ted Dunning :

> The easiest way is to scale those dimensions up.
>
>
>
> On Wed, Jan 14, 2015 at 2:41 AM, Miguel Angel Martin junquera <
> mianmarjun.mailingl...@gmail.com> wrote:
>
> > hi all,
> >
> >
> > I am clustering several text documents from distinct sources using
> > k-means, and I have already generated the sparse vectors for each
> > document. I want to boost some dimensions in the sparse vectors.
> >
> > What is the best way to do this?
> >
> > Is it a good idea to load the vectors, find the dimension values of tf
> > or tf-idf, and boost those values?
> >
> >
> > Thanks in advance and regards
> >
>


Own recommender

2015-01-15 Thread ARROYO MANCEBO David
Hi folks,
How can I start building my own recommender system in Apache Mahout with my
personal algorithm? I need a custom UserSimilarity. Maybe a subclass of
UserSimilarity, like PearsonCorrelationSimilarity?

Thanks
Regards :)


Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-15 Thread Marko Dinic

Ted,

Thank you for your answer. Maybe I gave a wrong picture of my data by
giving a sinusoid as an example; my time series are not periodic.
Let's say that I have a signal that represents the power value when some
device is turned on. That power signal depends on the time when a person
turns on the device. The device can be turned on after different
intervals and can stay turned on for different durations. But all of that
represents a similar behaviour that can be grouped together. I'm
simplifying things a bit, but that's the concept.


Either way, I made a simple test in RapidMiner and used K-medoids with
DTW as a similarity measure on a small subset and it gave very
satisfying results (please don't laugh at my approach, I'm a beginner
:) ).


Now, in one of the previous mails you answered that K-medoids isn't
scalable, so it isn't implemented. Does scalable mean not possible, or
not such a good option? Why isn't it scalable? Does it take too much
time to execute?


I've got an idea to implement it as a two-phase MapReduce, like this:

Pick initial centroids.

1st phase:
map:
find the distance from each node to its nearest centroid using DTW and
emit the pair (clusterId, pointId)

reduce:
for each cluster, output (oneOfTheNodesInClusterId,
listOfAllNodesInThatCluster)

2nd phase:
map:
calculate the distance from each point to all the other points in its
cluster as if that point were the centroid, and emit
(pointAndItsClusterId, costIfThisPointIsCentroid)

reduce:
find the point in each cluster that results in the smallest error in that
cluster

Repeat until convergence or until a certain number of iterations.

This is the first thing that came to my mind; there's probably a
better solution. If it's not scalable because of too many calculations,
how much time could I expect such an algorithm to take for, say, 10,000
signals with 300 points each? How can I even estimate that?
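
For concreteness, here is a minimal single-machine sketch of what one
iteration of those two phases computes (assignment, then picking the
lowest-cost member of each cluster as its new medoid), with the DTW call
left as a pluggable placeholder. It is only an illustration of the
per-iteration work, not a MapReduce implementation:

import java.util.ArrayList;
import java.util.List;

// One K-medoids iteration: phase 1 assigns every point to its nearest
// medoid, phase 2 replaces each medoid by the cluster member with the
// smallest total distance to the rest of its cluster.
final class KMedoidsIteration {

  interface Distance { double of(double[] a, double[] b); }   // plug DTW in here

  static int[] iterate(double[][] points, int[] medoidIdx, Distance dtw) {
    int k = medoidIdx.length;
    List<List<Integer>> clusters = new ArrayList<>();
    for (int c = 0; c < k; c++) clusters.add(new ArrayList<>());

    // phase 1 ("map" of the first job): assign each point to its nearest medoid
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < k; c++) {
        double d = dtw.of(points[p], points[medoidIdx[c]]);
        if (d < bestDist) { bestDist = d; best = c; }
      }
      clusters.get(best).add(p);
    }

    // phase 2: within each cluster, pick the member with the lowest total cost
    int[] newMedoids = new int[k];
    for (int c = 0; c < k; c++) {
      List<Integer> members = clusters.get(c);
      if (members.isEmpty()) { newMedoids[c] = medoidIdx[c]; continue; }
      double bestCost = Double.POSITIVE_INFINITY;
      for (int candidate : members) {
        double cost = 0.0;
        for (int other : members) cost += dtw.of(points[candidate], points[other]);
        if (cost < bestCost) { bestCost = cost; newMedoids[c] = candidate; }
      }
    }
    return newMedoids;   // repeat until the medoids stop changing
  }
}

Counting calls gives a rough estimate: phase 1 is about k*n DTW evaluations
and phase 2 is roughly the sum of the squared cluster sizes, with each DTW
call on 300-point signals costing around 300*300 steps. Timing a few
thousand DTW calls on your own hardware and multiplying out is probably the
most honest way to estimate the total.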


Thanks for your effort, if you have time to answer.

Regards,
Marko

On Thu 15 Jan 2015 05:25:55 AM CET, Anand Avati wrote:

Perhaps you could think of the centroid as one of the signals itself, the one
from which the average distance to the rest of the signals in the cluster is
the lowest? This way, instead of finding that mythical "mean" of DTWs, you hop
from one signal to another over iterations as you refine memberships.
However, the "mean" calculation would become O(N^2) (within a cluster).

Be warned I have never tried this method.

Thanks

On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic 
wrote:


Ted,

It's because I'm trying to cluster time series that may differ in
length, some parts may be shifted, some parts may last longer in one
signal than in the other (somehow skewed), but they represent more or
less the same thing. DTW seems good because it's optimized for such
things (used a lot in speech recognition, where problems like this may
occur). The problem is that K-means incorporates calculating centroids
(and the thing is, I need some kind of centroids that will represent
the cluster in the end). Maybe I'm wrong, please correct me, I'm a
youngster in this kind of thing, but there is a problem finding
centroids for such signals. Distance measures that recalculate centroids
(such as Euclidean) calculate pairwise distances between components of
two signals (signal and centroid). How can I use such a distance measure
for signals of different lengths? And also, there is shifting and skewing
in signals that represent the same thing, and their mean could be
totally wrong. For example, the mean of two sinusoids while one of them
is shifted by Pi is 0. And that's definitely not a good centroid in my
case.

So, I'm trying to find a scalable solution for my problem, I tried to
fit it in what's already implemented in Mahout (for clustering), but
it's not so obvious to me.

I'm open to suggestions, I'm still new to all of this.

Thanks,
Marko

On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:

Why is it you can't compute a mean?



On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic 
wrote:


Thank you for your answer Ted.

What about some kind of Bisecting k-means? I'm trying to cluster time
series of different lengths and I came up with the idea to use DTW as a
similarity measure, which seems to be adequate, but the thing is, I

cannot

use it with K-means, since it's hard to define centroids based on time
series which can have different length/phase. So I was thinking about
Hierarchical clustering, since it seems appropriate to combine with DTW,
but is not scalable, as you said. So my next thought is to try with
bisecting k-means that seems scalable, since it is based on K-means step
repetitions. My idea is as follows, step by step:

- Take two signals as initial centroids (maybe two signals that have
smallest similarity, calculated using DTW)
- Assign all signals to two initial centroids
- Repeat the procedure on the biggest cluster

In this way I could use DTW as the distance measure, which could be useful
since my data may be shifted or skewed, and avoid calculating centroids.

At the end I could

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-15 Thread unmesha sreeveni
Is there any way?
Waiting for a reply. I have posted the question everywhere, but no one is
responding.
I feel like this is the right place to ask doubts, as some of you may have
come across the same issue and gotten stuck.

On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni 
wrote:

> Yes, one of my friends is implementing the same. I know global sharing of
> data is not possible across Hadoop MapReduce, but I need to check whether
> that can be done somehow in Hadoop MapReduce, because I found some papers
> on KNN in Hadoop as well.
> And I am trying to compare the performance too.
>
> Hope some pointers can help me.
>
>
> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning 
> wrote:
>
>>
>> Have you considered implementing this using something like Spark?  That
>> could be much easier than raw MapReduce.
>>
>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni > > wrote:
>>
>>> In a KNN-like algorithm we need to load the model data into a cache for
>>> predicting the records.
>>>
>>> Here is the example for KNN.
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> So if the model is a large file, say 1 or 2 GB, we may not be able to
>>> load it into the distributed cache.
>>>
>>> One way is to split/partition the model result into several files, perform
>>> the distance calculation for all records in each file, and then find the
>>> minimum distance and the most frequent class label to predict the outcome.
>>>
>>> How can we partition the file and perform the operation on these
>>> partitions?
>>>
>>> i.e.  1st record -> partition1, partition2, ...
>>>  2nd record -> partition1, partition2, ...
>>>
>>> This is what came to my mind.
>>>
>>> Is there any better way?
>>>
>>> Any pointers would help me.
>>>
>>> --
>>> *Thanks & Regards *
>>>
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
>>>
>>>
>>>
>>
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
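
For the split/partition idea described in the messages above, a minimal
illustration of the merge step, assuming each model partition has already
produced its own k nearest neighbours for a test record, is to pool the
partial lists, keep the k globally smallest distances, and vote on the
class label. (Plain Java, single machine; the Neighbour holder and all
names here are made up for illustration, not from any existing library.)

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Merge step for partitioned KNN: each model partition emits its own
// k nearest neighbours for a record; globally we keep the k smallest
// distances across all partitions and vote on the class label.
final class KnnMerge {

  // hypothetical record of "one neighbour found in one partition"
  static final class Neighbour {
    final double distance;
    final String label;
    Neighbour(double distance, String label) { this.distance = distance; this.label = label; }
  }

  static String predict(List<List<Neighbour>> perPartitionTopK, int k) {
    List<Neighbour> all = new ArrayList<>();
    for (List<Neighbour> part : perPartitionTopK) all.addAll(part);
    all.sort(Comparator.comparingDouble((Neighbour n) -> n.distance));

    Map<String, Integer> votes = new HashMap<>();
    for (Neighbour n : all.subList(0, Math.min(k, all.size()))) {
      votes.merge(n.label, 1, Integer::sum);          // count each global nearest neighbour's label
    }
    return votes.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);                                // majority class label wins
  }
}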