Re: DTW distance measure and K-medoids, Hierarchical clustering

2015-01-09 Thread Ted Dunning
Why is it you can't compute a mean?



On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic marko.di...@nissatech.com
wrote:

 Thank you for your answer, Ted.

 What about some kind of bisecting k-means? I'm trying to cluster time
 series of different lengths, and I came up with the idea of using DTW as
 a similarity measure, which seems adequate. The problem is that I cannot
 use it with k-means, since it's hard to define centroids over time series
 that can have different length/phase. So I was thinking about hierarchical
 clustering, since it seems appropriate to combine with DTW, but it is not
 scalable, as you said. So my next thought is to try bisecting k-means,
 which seems scalable, since it is based on repeated k-means steps. My
 idea, step by step, is:

 - Take two signals as initial centroids (maybe the two signals with the
 smallest similarity, calculated using DTW)
 - Assign all signals to the two initial centroids
 - Repeat the procedure on the biggest cluster

 In this way I could use DTW as the distance measure (a minimal version is
 sketched below), which could be useful since my data may be shifted or
 skewed, and I would avoid calculating centroids. At the end I could take,
 from each cluster, the signal that is most similar to the others in that
 cluster (some kind of centroid/medoid).

 What do you think about this approach and its scalability?

 I would highly appreciate your answer, thanks.
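
 For reference, a minimal DTW distance in Java (the standard O(n*m) dynamic
 program; the class and method names are illustrative, not from Mahout):

 // Dynamic time warping distance between two series of possibly
 // different lengths. Illustrative sketch, not a Mahout API.
 public final class DtwDistance {
   public static double distance(double[] a, double[] b) {
     int n = a.length, m = b.length;
     double[][] cost = new double[n + 1][m + 1];
     for (double[] row : cost) {
       java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
     }
     cost[0][0] = 0.0;
     for (int i = 1; i <= n; i++) {
       for (int j = 1; j <= m; j++) {
         double d = Math.abs(a[i - 1] - b[j - 1]);
         // Extend the cheapest of the three admissible warping moves.
         cost[i][j] = d + Math.min(cost[i - 1][j],
             Math.min(cost[i][j - 1], cost[i - 1][j - 1]));
       }
     }
     return cost[n][m];
   }
 }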

 On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:

 On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di...@nissatech.com
 wrote:

  1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
  could be used as a distance measure for clustering?


 No.



 2) Why isn't there an implementation of k-medoids in Mahout? I'm guessing
 that it could not be implemented efficiently on Hadoop, but I wanted to
 check whether something like that is possible.


 Scalability, as you suspected.



 3) Same question, just considering Agglomerative Hierarchical clustering.


 Again.  Agglomerative algorithms tend to be O(n^2), since they work from
 all pairwise distances, which contradicts scaling.




Re: Using Mahout 1.0-SNAPSHOT with yarn cluster continued

2015-01-09 Thread Dmitriy Lyubimov
strange. the legacy module still depends on mahout-math and should include it
into the job jar. or did it get that much out of hand after the MR deprecation?

On Fri, Jan 9, 2015 at 8:51 AM, mw m...@plista.com wrote:

 I found a solution!
 I had to upload the missing jars onto the YARN cluster's HDFS and add the
 following to the Hadoop Configuration:

 hadoopConf.set("tmpjars", "/lib/mahout-math-1.0-20150108.230237-316.jar,/lib/commons-cli-2.0-mahout.jar");

 Best,
 Max
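
 For context, a hedged sketch of how those jars might be staged on HDFS
 before setting tmpjars (the local source paths are assumptions; this shows
 one way it could be done, not necessarily how Max did it):

 // Illustrative only: copy dependency jars from the local machine to HDFS
 // so that "tmpjars" can reference them. Local source paths are hypothetical.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class StageJars {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     fs.copyFromLocalFile(
         new Path("target/mahout-math-1.0-20150108.230237-316.jar"),
         new Path("/lib/mahout-math-1.0-20150108.230237-316.jar"));
     fs.copyFromLocalFile(
         new Path("target/commons-cli-2.0-mahout.jar"),
         new Path("/lib/commons-cli-2.0-mahout.jar"));
     // "tmpjars" takes a comma-separated list of jars to ship with each job.
     conf.set("tmpjars", "/lib/mahout-math-1.0-20150108.230237-316.jar"
         + ",/lib/commons-cli-2.0-mahout.jar");
   }
 }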

 On 01/09/2015 02:13 PM, mw wrote:

 I looked into the submitted job.jar and found that the missing class
 (org.apache.mahout.math.Vector) is not contained in it.
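
 A quick way to verify that, as a hedged sketch (the job jar's file name
 here is an assumption):

 // Illustrative check: does the job jar bundle the class the tasks need?
 import java.util.jar.JarFile;

 public class JarCheck {
   public static void main(String[] args) throws Exception {
     try (JarFile jar = new JarFile("mahout-mrlegacy-1.0-SNAPSHOT-job.jar")) {
       String entry = "org/apache/mahout/math/Vector.class";
       System.out.println(entry + " present: " + (jar.getEntry(entry) != null));
     }
   }
 }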


 On 01/09/2015 12:57 PM, mw wrote:

 I wrote a message to the Hadoop list about it. I also found the
 https://issues.apache.org/jira/browse/MAHOUT-1498 ticket.
 Could it be a related bug?

 Best,
 Max
 On 01/08/2015 06:18 PM, Pat Ferrel wrote:

 That sounds like a Hadoop list question.

 All I can say is there is a job.jar in mrlegacy/target with all
 dependencies packaged. This should have everything needed for LDA.

 On Jan 8, 2015, at 5:50 AM, mw m...@plista.com wrote:

 Hello again,

 maybe my question was misleading.
 I am asking whether the intended usage is to provide the job with the
 required libraries and send those together with the job to YARN (if so,
 how can this be done?), or to add the required classes to the classpath of
 every node in the cluster.
 What is the best practice?

 Best,
 Max
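
 For the first option, a hedged sketch using Hadoop's per-job classpath API
 (the jar path is an assumption, and this is one possible approach rather
 than the thread's confirmed answer):

 // Illustrative: ship a dependency jar with one job and put it on the task
 // classpath, instead of installing it on every node. Assumes the jar has
 // already been copied to HDFS at the (hypothetical) path below.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;

 public class PerJobClasspath {
   public static void main(String[] args) throws Exception {
     Job job = Job.getInstance(new Configuration(), "lda");
     job.addFileToClassPath(new Path("/lib/mahout-math-1.0-SNAPSHOT.jar"));
   }
 }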


 On 01/07/2015 06:13 PM, mw wrote:

 Hello,

 the first error was due to a missing property in yarn.xml. However, now I
 have a different problem.

 I am working on a web application that should execute LDA on an external
 YARN cluster.

 I am uploading all the relevant sequence files onto the YARN cluster.
 This is how I try to remotely execute LDA on the cluster:

 try {
   ugi.doAs(new PrivilegedExceptionAction<Void>() {
     public Void run() throws Exception {
       Configuration hdoopConf = new Configuration();
       hdoopConf.set("fs.defaultFS", "hdfs://xxx.xxx.xxx.xxx:9000/user/xx");
       hdoopConf.set("yarn.resourcemanager.hostname", "xxx.xxx.xxx.xxx");
       hdoopConf.set("mapreduce.framework.name", "yarn");
       hdoopConf.set("mapred.framework.name", "yarn");
       hdoopConf.set("mapred.job.tracker", "xxx.xxx.xxx.xxx");
       hdoopConf.set("dfs.permissions.enabled", "false");
       hdoopConf.set("hadoop.job.ugi", "xx");
       hdoopConf.set("mapreduce.jobhistory.address", "xxx.xxx.xxx.xxx:10020");
       CVB0Driver driver = new CVB0Driver();
       try {
         driver.run(hdoopConf, sparseVectorIn.suffix("/matrix"),
             topicsOut, k, numTerms,
             doc_topic_smoothening, term_topic_smoothening,
             maxIter, iteration_block_size, convergenceDelta,
             sparseVectorIn.suffix("/dictionary.file-0"),
             topicsOut.suffix("/DocumentTopics/"), sparseVectorIn,
             seed, testFraction, numTrainThreads,
             numUpdateThreads, maxItersPerDoc,
             numReduceTasks, backfillPerplexity);
       } catch (ClassNotFoundException e) {
         e.printStackTrace();
       } catch (InterruptedException e) {
         e.printStackTrace();
       }
       return null;
     }
   });
 } catch (InterruptedException e) {
   e.printStackTrace();
 }

 I am getting the following error message:

 Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:344)
 at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1844)
 at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1809)
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1903)
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1929)
 at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:837)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:983)
 at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:391)
 at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:80)
 at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:675)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
