Re: DTW distance measure and K-medioids, Hierarchical clustering
Why is it you can't compute a mean?

On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic marko.di...@nissatech.com wrote:

Thank you for your answer, Ted.

What about some kind of bisecting k-means? I'm trying to cluster time series of different lengths, and I came up with the idea of using DTW as a similarity measure, which seems adequate. The problem is that I cannot use it with k-means, since it's hard to define centroids for time series that can differ in length and phase.

So I was thinking about hierarchical clustering, since it seems appropriate to combine with DTW, but it is not scalable, as you said.

My next thought is to try bisecting k-means, which seems scalable since it is based on repeated k-means steps. My idea, step by step:

- Take two signals as initial centroids (maybe the two signals that have the smallest similarity, calculated using DTW)
- Assign all signals to the two initial centroids
- Repeat the procedure on the biggest cluster

In this way I could use DTW as the distance measure, which could be useful since my data may be shifted or skewed, and avoid calculating centroids. At the end I could take from each cluster the one signal that is most similar to the others in the cluster (some kind of centroid/medoid).

What do you think about this approach and about its scalability? I would highly appreciate your answer, thanks.

On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:

On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di...@nissatech.com wrote:

1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that could be used as a distance measure for clustering?

No.

2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing that it could not be implemented efficiently on Hadoop, but I wanted to check if something like that is possible.

Scalability, as you suspected.

3) Same question, just considering agglomerative hierarchical clustering.

Again. Agglomerative algorithms tend to be n^2, which contradicts scaling.
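The proposal in this thread (DTW as the distance, two seed signals, assignment to the nearer seed, and a medoid in place of an averaged centroid) can be sketched in plain Java. This is only an illustration under assumptions: the class name DtwBisect and its methods are hypothetical and not part of Mahout, the local cost is taken to be the absolute difference, and no warping window is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: DTW distance plus one medoid-based bisecting step. Not a Mahout API. */
final class DtwBisect {

    /** Classic O(n*m) dynamic-programming DTW; works for series of different lengths. */
    static double dtw(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) Arrays.fill(row, Double.POSITIVE_INFINITY);
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = Math.abs(a[i - 1] - b[j - 1]); // assumed local cost
                cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
            }
        }
        return cost[n][m];
    }

    /** Index of the member with the smallest total DTW distance to the other members. */
    static int medoid(List<double[]> series, List<Integer> members) {
        int best = -1;
        double bestSum = Double.POSITIVE_INFINITY;
        for (int i : members) {
            double sum = 0.0;
            for (int j : members) sum += dtw(series.get(i), series.get(j));
            if (sum < bestSum) { bestSum = sum; best = i; }
        }
        return best;
    }

    /** One bisecting step: seed with the two least similar members, assign each member to the nearer seed. */
    static List<List<Integer>> bisect(List<double[]> series, List<Integer> members) {
        int s1 = members.get(0), s2 = members.get(0);
        double far = -1.0;
        for (int i : members) {
            for (int j : members) {
                double d = dtw(series.get(i), series.get(j));
                if (d > far) { far = d; s1 = i; s2 = j; }
            }
        }
        List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
        for (int i : members) {
            double toS1 = dtw(series.get(i), series.get(s1));
            double toS2 = dtw(series.get(i), series.get(s2));
            if (toS1 <= toS2) left.add(i); else right.add(i);
        }
        return List.of(left, right);
    }
}
```

Note that both the seed search and the medoid selection in this sketch compare every pair inside a cluster, so they are quadratic in the cluster size, which is the same scaling concern raised in the replies. Only the assignment step is linear in the number of signals; picking seeds from a sample would be one way to keep the rest cheap.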
Re: Using Mahout 1.0-SNAPSHOT with yarn cluster continued
Strange. Legacy still depends on m-math and should include it in the job jar. Or did it get that much out of hand after the MR deprecation?

On Fri, Jan 9, 2015 at 8:51 AM, mw m...@plista.com wrote:

I found a solution! I had to upload the missing jars to HDFS on the YARN cluster and add the following to the Hadoop Configuration:

    hadoopConf.set("tmpjars", "/lib/mahout-math-1.0-20150108.230237-316.jar,/lib/commons-cli-2.0-mahout.jar");

Best, Max

On 01/09/2015 02:13 PM, mw wrote:

I looked into the submitted job.jar and found that the missing class (org.apache.mahout.math.Vector) is not contained in it.

On 01/09/2015 12:57 PM, mw wrote:

I wrote a message to the Hadoop list about it. I also found this ticket: https://issues.apache.org/jira/browse/MAHOUT-1498. Could it be a related bug?

Best, Max

On 01/08/2015 06:18 PM, Pat Ferrel wrote:

That sounds like a Hadoop list question. All I can say is there is a job.jar in mrlegacy/target with all dependencies packaged. This should have everything needed for LDA.

On Jan 8, 2015, at 5:50 AM, mw m...@plista.com wrote:

Hello again, maybe my question was misleading. I am asking whether the intended usage is to provide the job with the required libraries and send those together with the job to YARN (if yes, how can this be done?), or to add the required classes to the classpath of every node in the cluster. What is the best practice?

Best, Max

On 01/07/2015 06:13 PM, mw wrote:

Hello, the first error was due to a missing property in yarn.xml. However, now I have a different problem. I am working on a web application that should execute LDA on an external YARN cluster. I am uploading all the relevant sequence files onto the YARN cluster. This is how I try to remotely execute LDA on the cluster:

    try {
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                // Point the client at the remote cluster
                Configuration hdoopConf = new Configuration();
                hdoopConf.set("fs.defaultFS", "hdfs://xxx.xxx.xxx.xxx:9000/user/xx");
                hdoopConf.set("yarn.resourcemanager.hostname", "xxx.xxx.xxx.xxx");
                hdoopConf.set("mapreduce.framework.name", "yarn");
                hdoopConf.set("mapred.framework.name", "yarn");
                hdoopConf.set("mapred.job.tracker", "xxx.xxx.xxx.xxx");
                hdoopConf.set("dfs.permissions.enabled", "false");
                hdoopConf.set("hadoop.job.ugi", "xx");
                hdoopConf.set("mapreduce.jobhistory.address", "xxx.xxx.xxx.xxx:10020");
                // Run LDA (CVB0) on the uploaded sequence files
                CVB0Driver driver = new CVB0Driver();
                try {
                    driver.run(hdoopConf, sparseVectorIn.suffix("/matrix"), topicsOut, k, numTerms,
                        doc_topic_smoothening, term_topic_smoothening, maxIter, iteration_block_size,
                        convergenceDelta, sparseVectorIn.suffix("/dictionary.file-0"),
                        topicsOut.suffix("/DocumentTopics/"), sparseVectorIn, seed, testFraction,
                        numTrainThreads, numUpdateThreads, maxItersPerDoc, numReduceTasks,
                        backfillPerplexity);
                } catch (ClassNotFoundException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                return null;
            }
        });
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

I am getting the following error message:

    Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:344)
        at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1844)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1809)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1903)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1929)
        at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:837)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:983)
        at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:391)
        at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:80)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:675)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
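
Two steps from this thread, checking whether the submitted job.jar actually contains org.apache.mahout.math.Vector and building the comma-separated jar list for the "tmpjars" property, can be sketched with the JDK alone. The class JobJarCheck and its method names are hypothetical helpers for illustration. Only the property name and the jar paths come from the messages above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarFile;

/** Hypothetical helper, not part of Hadoop or Mahout. */
final class JobJarCheck {

    /** True if the jar has an entry for the given fully qualified (top-level) class name. */
    static boolean containsClass(Path jar, String className) throws IOException {
        String entry = className.replace('.', '/') + ".class";
        try (JarFile jf = new JarFile(jar.toFile())) {
            return jf.getEntry(entry) != null;
        }
    }

    /** Joins jar paths with commas, the form used for "tmpjars" in the thread. */
    static String tmpJarsValue(List<String> jarPaths) {
        return String.join(",", jarPaths);
    }

    public static void main(String[] args) throws IOException {
        Path jar = Path.of(args[0]); // path to the job jar to inspect
        String cls = "org.apache.mahout.math.Vector";
        System.out.println(cls + (containsClass(jar, cls) ? " found in " : " missing from ") + jar);
    }
}
```

If the class turns out to be missing, the joined string could be passed to the configuration as in the reported fix, for example hadoopConf.set("tmpjars", JobJarCheck.tmpJarsValue(List.of("/lib/mahout-math-1.0-20150108.230237-316.jar", "/lib/commons-cli-2.0-mahout.jar"))). Whether the paths need a filesystem scheme depends on the cluster setup and is not covered by this sketch.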