Need Help in Clustering
Hi, I am new to Mahout. I have text data in the format

Id,age,income,perwt,sex,city,product
1,23,2200,40,2,Boston,product #1

I want to perform k-means clustering based on two fields, age and income, and I also want to specify the number of clusters. I have already run clustering by converting the file into sequence vector files, but I get an empty file when I run clusterdump. I guess there is something wrong in the way the classes are written and in the format of my input file. Can anyone help me with how to do this?

Thanks in advance,
Rajan Gupta
Re: Need Help in Clustering
How are you converting your data to a sequence file? If you are not sure, check this link: http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program

Are you getting any clusteredPoints after running k-means? It would help troubleshooting if you could list the commands you executed.

From: Rajan Gupta rajangupta0...@gmail.com
To: dev@mahout.apache.org
Sent: Monday, June 24, 2013 3:09 AM
Subject: Need Help in Clustering
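As a rough illustration of the first step behind that link (not the Mahout API itself): each CSV row has to be reduced to a numeric vector before it can be written out as a SequenceFile of VectorWritables. A minimal plain-Scala sketch of just the parsing step, with the Hadoop/Mahout SequenceFile writing omitted:

```scala
// Parse CSV rows into (key, Array[Double]) pairs for the chosen numeric
// columns (age = column 1, income = column 2). Writing these pairs out as
// SequenceFile<Text, VectorWritable> is the Mahout/Hadoop-specific part,
// which is omitted here.
val lines = Seq(
  "1,23,2200,40,2,Boston,product #1",
  "2,26,6600,30,1,New york,product #5"
)

val vectors = lines.map { line =>
  val f = line.split(",")
  f(0) -> Array(f(1).toDouble, f(2).toDouble) // key = Id, vector = (age, income)
}
```

The non-numeric fields (city, product) would need encoding or dropping before vectorization; the sketch simply ignores them.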
Re: Need Help in Clustering
Thanks for your response. Yes, I get clustered points after running k-means. I have done clustering successfully with the 20newsgroups and Reuters data, and clusterdump also works properly with those examples.

Now, I have text data in the format

Id,age,income,perwt,sex,city,product
1,23,2200,40,2,Boston,product #1

and I want output like

Id,age,income,perwt,sex,city,product,cluster
1,23,2200,25,2,Boston,product #1,1
2,26,6600,30,1,New york,product #5,3
3,24,4400,48,2,Portland,product #24,2
4,29,9900,60,1,San Jose,product #70,4

Can anyone help? Do I need to create custom code for this? If yes, do help me.

Thanks in advance,
Regards,
Rajan Gupta

On Mon, Jun 24, 2013 at 12:46 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
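Since Mahout's k-means output does not directly produce this CSV-plus-cluster-id format, some custom joining code is needed. A hedged plain-Scala sketch (no Mahout classes, and a toy k = 2 rather than the four clusters in the example) of the overall idea: cluster on the two numeric fields, then append each point's cluster id back to its original row:

```scala
// Toy end-to-end sketch: k-means on (age, income), then label each CSV row.
case class Row(id: Int, age: Double, income: Double, rest: String)

val rows = Seq(
  Row(1, 23, 2200, "25,2,Boston,product #1"),
  Row(2, 26, 6600, "30,1,New york,product #5"),
  Row(3, 24, 4400, "48,2,Portland,product #24"),
  Row(4, 29, 9900, "60,1,San Jose,product #70")
)

def dist(a: (Double, Double), b: (Double, Double)): Double =
  math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

// k = 2, fixed initial centroids for reproducibility
var centroids: Seq[(Double, Double)] = Seq((23.0, 2200.0), (29.0, 9900.0))
for (_ <- 1 to 10) {
  val assigned =
    rows.groupBy(r => centroids.indices.minBy(i => dist((r.age, r.income), centroids(i))))
  centroids = centroids.indices.map { i =>
    assigned.get(i) match {
      case Some(ms) => (ms.map(_.age).sum / ms.size, ms.map(_.income).sum / ms.size)
      case None     => centroids(i) // keep an empty cluster's centroid unchanged
    }
  }
}

// Append the 1-based cluster id to each original row.
val labeled = rows.map { r =>
  val c = centroids.indices.minBy(i => dist((r.age, r.income), centroids(i)))
  s"${r.id},${r.age.toInt},${r.income.toInt},${r.rest},${c + 1}"
}
labeled.foreach(println)
```

With real Mahout output, the same join would be done by reading the clusteredPoints SequenceFile (point id to cluster id) and merging it with the original CSV by id.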
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691954#comment-13691954 ]

Grant Ingersoll commented on MAHOUT-1214:
-----------------------------------------

Hi, any progress on this? It is the last open issue for 0.8.

Thanks,
Grant

Improve the accuracy of the Spectral KMeans Method
--------------------------------------------------

Key: MAHOUT-1214
URL: https://issues.apache.org/jira/browse/MAHOUT-1214
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.7
Environment: Mahout 0.7
Reporter: Yiqun Hu
Assignee: Robin Anil
Labels: clustering, improvement
Fix For: 0.8
Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2

The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect implementations make it fail even for a very obvious, trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute it back to the community.

# Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain correct clustering results for the case of K > 1. We have an idea and an implementation that selects eigenvectors based on cosAngle/orthogonality.
# Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and a bad initialization will sometimes generate a wrong clustering result. Here, the selected K eigenvectors actually provide a better way to initialize cluster centroids, because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen-component achieves the maximum absolute value.

We have verified our improvement on a synthetic dataset: the improved version obtains the optimal clustering result, while the current 0.7 version obtains a wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
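The Issue 2 idea above (seeding centroids from eigenvectors instead of random points) can be sketched in a few lines. This is a hedged illustration with made-up numbers, not the actual MAHOUT-1214 patch: for each of the k selected eigenvectors, interpreted as a relaxed cluster indicator indexed by data point, pick the data point whose component has the largest absolute value as that cluster's seed.

```scala
// Hypothetical relaxed cluster indicators: one eigenvector per cluster,
// one component per data point (4 data points, k = 2 clusters here).
val eigenvectors = Seq(
  Array(0.9, 0.1, -0.2, 0.05), // indicator for cluster 0
  Array(0.1, -0.8, 0.3, 0.2)   // indicator for cluster 1
)

// For each eigenvector, take the index of the maximum-absolute-value
// component; the corresponding data point seeds that cluster's centroid.
val seedIndices = eigenvectors.map(ev => ev.indices.maxBy(i => math.abs(ev(i))))
```

Here data points 0 and 1 would become the initial centroids, replacing the random seeding that the issue describes as fragile.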
Re: Need Help in Clustering
On Mon, Jun 24, 2013 at 12:14 PM, Rajan Gupta rajangupta0...@gmail.com wrote:

Do I need to create custom code for this? If yes, do help me.

Yes. You definitely need custom code for this.

You also need to think about your data and why you want clusters. What does age mean to a cluster? Are people with the same age supposed to be the same in some sense? What does a 5-year difference mean? Is the distance from 20 to 25 the same as the difference between 55 and 60?

What about city? How many cities are there? Do you have any sense of which cities are more similar to some than to others?

What about income? Should you perhaps use log(income) for computing distances?

What is perwt?

Why is there just one product per line? Which products are more similar than others?
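The log(income) question above matters because, on raw (age, income) pairs, Euclidean distance is dominated by income, which spans thousands while age spans tens. A small sketch with made-up numbers (all values are illustrative, not from the original data):

```scala
// Plain Euclidean distance between two feature arrays.
def euclid(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

val p1 = Array(23.0, 2200.0) // age 23, income 2200
val p2 = Array(63.0, 2400.0) // 40 years older, nearly the same income
val p3 = Array(24.0, 9900.0) // nearly the same age, 4.5x the income

// On raw values, the 40-year age gap is invisible next to the income gap:
// p1 appears much closer to p2 than to its near-peer in age, p3.
assert(euclid(p1, p2) < euclid(p1, p3))

// One common fix: rescale each field (here age/100 and log(income)/10, an
// arbitrary choice for illustration) so both contribute comparably. Now the
// near-peer in age becomes the nearest neighbor.
def scaled(p: Array[Double]) = Array(p(0) / 100.0, math.log(p(1)) / 10.0)
assert(euclid(scaled(p1), scaled(p3)) < euclid(scaled(p1), scaled(p2)))
```

Any real choice of scaling (standardization, log transform, min-max) encodes an answer to Ted's questions about what a unit of distance should mean for each field.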
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691969#comment-13691969 ]

Yiqun Hu commented on MAHOUT-1214:
----------------------------------

Grant, we have addressed all review comments and uploaded the updated patch, but we haven't received any reply. Can this be included in 0.8?

Improve the accuracy of the Spectral KMeans Method
--------------------------------------------------
Build failed in Jenkins: Mahout-Quality #2102
See https://builds.apache.org/job/Mahout-Quality/2102/
--
[...truncated 7204 lines...]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[WARNING] Failure executing PMD: Couldn't find the class Can't find resource rulesets/basic.xml. Make sure the resource is a valid file or URL or is on the CLASSPATH.
Here's the current classpath: /home/hudson/tools/maven/apache-maven-3.0.4/boot/plexus-classworlds-2.4.jar
java.lang.RuntimeException: Couldn't find the class Can't find resource rulesets/basic.xml. Make sure the resource is a valid file or URL or is on the CLASSPATH.
Here's the current classpath: /home/hudson/tools/maven/apache-maven-3.0.4/boot/plexus-classworlds-2.4.jar
    at net.sourceforge.pmd.RuleSetFactory.classNotFoundProblem(RuleSetFactory.java:244)
    at net.sourceforge.pmd.RuleSetFactory.parseRuleSetNode(RuleSetFactory.java:234)
    at net.sourceforge.pmd.RuleSetFactory.createRuleSet(RuleSetFactory.java:161)
    at net.sourceforge.pmd.RuleSetFactory.createRuleSets(RuleSetFactory.java:126)
    at net.sourceforge.pmd.RuleSetFactory.createRuleSets(RuleSetFactory.java:111)
    at net.sourceforge.pmd.processor.AbstractPMDProcessor.createRuleSets(AbstractPMDProcessor.java:56)
    at net.sourceforge.pmd.processor.MonoThreadProcessor.processFiles(MonoThreadProcessor.java:41)
    at net.sourceforge.pmd.PMD.processFiles(PMD.java:271)
    at org.apache.maven.plugin.pmd.PmdReport.generateReport(PmdReport.java:296)
    at org.apache.maven.plugin.pmd.PmdReport.execute(PmdReport.java:194)
    at org.apache.maven.plugin.pmd.PmdReport.executeReport(PmdReport.java:168)
    at org.apache.maven.reporting.AbstractMavenReport.generate(AbstractMavenReport.java:190)
    at org.apache.maven.reporting.AbstractMavenReport.execute(AbstractMavenReport.java:99)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[INFO]
[INFO]
[INFO] Building Mahout Integration 0.8-SNAPSHOT
[INFO]
[WARNING] The POM for com.atlassian.maven.plugins:maven-clover2-plugin:jar:3.1.11.1 is missing, no dependency information available
[WARNING] Failed to retrieve plugin descriptor for com.atlassian.maven.plugins:maven-clover2-plugin:3.1.11.1: Plugin com.atlassian.maven.plugins:maven-clover2-plugin:3.1.11.1 or one of its dependencies could not be resolved: Failed to read artifact descriptor
Re: Mahout vectors/matrices/solvers on spark
Ok, so I was fairly easily able to build some DSL for our matrix manipulation (similar to Breeze) in Scala.

Inline matrix or vector:

val a = dense((1, 2, 3), (3, 4, 5))
val b: Vector = (1, 2, 3)

Block views (element/row/vector/block/block of row or vector):

a(::, 0)
a(1, ::)
a(0 to 1, 1 to 2)

Assignments:

a(0, ::) := (3, 5, 7)
a(0, 0 to 1) := (3, 5)
a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))

Operators:

// hadamard
val c = a * b
a *= b
// matrix mul
val m = a %*% b

and a bunch of other little things like sum, mean, colMeans etc. That much is easy. Also stuff like the ones found in Breeze, along the lines of

val (u, v, s) = svd(a)
diag((1, 2, 3))

and Cholesky in similar ways. I don't have inline initialization for sparse things (yet), simply because I don't need them, but of course all regular Java constructors and methods are retained; all of this is just syntactic sugar in the spirit of DSLs, in the hope of making things a bit more readable.

My (very little, and very insignificantly opinionated, really) criticism of Breeze in this context is its inconsistency between dense and sparse representations, namely the lack of consistent overarching trait(s), so that building structure-agnostic solvers like Mahout's Cholesky solver is impossible, as is cross-type matrix use (say, the way I understand it, it is pretty much impossible to multiply a sparse matrix by a dense matrix). I suspect these problems stem from the fact that the authors for whatever reason decided to hardwire dense things with JBlas solvers, whereas I don't believe matrix storage structures must be. But these problems do appear serious enough for me to ignore Breeze for now.

If I decide to plug in JBlas dense solvers, I guess I will just have them as yet another top-level routine interface taking any Matrix, e.g.

val (u, v, s) = svd(m, jblas = true)

On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Thank you.

On Jun 23, 2013 6:16 PM, Ted Dunning ted.dunn...@gmail.com wrote:

I think that this contract has migrated a bit from the first starting point. My feeling is that there is a de facto contract now that the matrix slice is a single row.

Sent from my iPhone

On Jun 23, 2013, at 16:32, Dmitriy Lyubimov dlie...@gmail.com wrote:

What does Matrix.iterateAll() contractually do? Practically it seems to be row-wise iteration for some implementations, but it doesn't seem to contractually state so in the javadoc. What is MatrixSlice if it is neither a row nor a column? How can I tell what exactly it is I am iterating over?

On Jun 19, 2013 12:21 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix jake.man...@gmail.com wrote:

Question #2: which in-core solvers are available for Mahout matrices? I know there's SSVD, probably Cholesky; is there something else? In particular, I need to be solving linear systems; I guess Cholesky should be equipped enough to do just that?

Question #3: why did we try to import Colt solvers rather than actually depend on Colt in the first place? Why did we not accept Colt's sparse matrices and instead create native ones? Colt seems to have a notion of sparse in-core matrices too, and seems like a well-rounded solution. However, it doesn't seem to be actively supported, whereas I know Mahout has seen continued enhancements to the in-core matrix support.

Colt was totally abandoned, and I talked to the original author and he blessed its adoption. When we pulled it in, we found it was woefully undertested, and tried our best to hook it in with proper tests and use APIs that fit the use cases we had. Plus, we already had the start of some linear APIs (i.e. the Vector interface), and dropping the API completely seemed not terribly worth it at the time.

There was even more to it than that. Colt was under-tested and there have been warts that had to be pulled out in much of the code. But, worse than that, Colt's matrix and vector structure was a real bugger to extend or change. It also had all kinds of cruft where it pretended to support matrices of things, but in fact only supported matrices of doubles and floats. So using Colt as it was (and is, since it is largely abandoned) was a non-starter.

As far as in-memory solvers, we have:
1) LR decomposition (tested and kinda fast)
2) Cholesky decomposition (tested)
3) SVD (tested)
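To make the "Cholesky should be equipped to solve linear systems" point concrete, here is a hedged plain-Scala sketch (not Mahout's actual CholeskyDecomposition API) of how a Cholesky factorization solves a symmetric positive-definite system A x = b: factor A = L L^T, forward-solve L y = b, then back-solve L^T x = y.

```scala
// Naive Cholesky factorization of a symmetric positive-definite matrix:
// returns the lower-triangular L such that A = L * L^T.
def cholesky(a: Array[Array[Double]]): Array[Array[Double]] = {
  val n = a.length
  val l = Array.ofDim[Double](n, n)
  for (i <- 0 until n; j <- 0 to i) {
    val s = (0 until j).map(k => l(i)(k) * l(j)(k)).sum
    l(i)(j) = if (i == j) math.sqrt(a(i)(i) - s) else (a(i)(j) - s) / l(j)(j)
  }
  l
}

// Solve A x = b via the two triangular solves.
def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
  val n = b.length
  val l = cholesky(a)
  val y = new Array[Double](n) // forward substitution: L y = b
  for (i <- 0 until n)
    y(i) = (b(i) - (0 until i).map(k => l(i)(k) * y(k)).sum) / l(i)(i)
  val x = new Array[Double](n) // back substitution: L^T x = y
  for (i <- n - 1 to 0 by -1)
    x(i) = (y(i) - (i + 1 until n).map(k => l(k)(i) * x(k)).sum) / l(i)(i)
  x
}

// A = [[4,2],[2,3]] is SPD; with b = [10, 8] the solution is x = [1.75, 1.5].
val x = solve(Array(Array(4.0, 2.0), Array(2.0, 3.0)), Array(10.0, 8.0))
```

A production solver would additionally check symmetry and positive-definiteness (and pivot or fall back otherwise); this sketch assumes well-conditioned SPD input.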
Re: Mahout vectors/matrices/solvers on spark
Dmitriy,

This is very pretty.

On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #522
See https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/522/changes

Changes:
[smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports
[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed issue with not reading a directory list
[smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first round of code cleanup based on feedback from code review
[smarthi] MAHOUT-944: lucene2seq - removed unused import
--
[...truncated 5861 lines...]
INFO: Starting flush of map output
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0019_m_00_0 is done. And is in the process of commiting
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0019_m_00_0' done.
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 57 bytes
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0019_r_00_0 is done. And is in the process of commiting
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0019_r_00_0 is allowed to commit now
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0019_r_00_0' to /tmp/mahout-work-hudson/reuters-lda-model/model-19
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce reduce
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0019_r_00_0' done.
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0019
Jun 24, 2013 6:35:28 PM org.apache.hadoop.mapred.Counters log
INFO: Counters: 17
INFO: File Output Format Counters
INFO: Bytes Written=389
INFO: FileSystemCounters
INFO: FILE_BYTES_READ=1614631021
INFO: FILE_BYTES_WRITTEN=1629239107
INFO: File Input Format Counters
INFO: Bytes Read=152
INFO: Map-Reduce Framework
INFO: Map output materialized bytes=61
INFO: Map input records=0
INFO: Reduce shuffle bytes=0
INFO: Spilled Records=40
INFO: Map output bytes=120
INFO: Total committed heap usage (bytes)=697171968
INFO: SPLIT_RAW_BYTES=119
INFO: Combine input records=20
INFO: Reduce input records=20
INFO: Reduce input groups=20
INFO: Combine output records=20
INFO: Reduce output records=20
INFO: Map output records=20
Jun 24, 2013 6:35:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run iteration 20 of 20
Jun 24, 2013 6:35:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: About to run: Iteration 20 of 20, input path: /tmp/mahout-work-hudson/reuters-lda-model/model-19
Jun 24, 2013 6:35:28 PM
Re: Mahout vectors/matrices/solvers on spark
Yeah, I'm totally on board with a pretty Scala DSL on top of some of our stuff. In particular, I've been experimenting with wrapping the DistributedRowMatrix in a scalding wrapper, so we can do things like

val matrixAsTypedPipe = DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))

// e.g. L1 normalize:
matrixAsTypedPipe.map((idx, v): (Int, Vector) => (idx, v.normalize(1)))
  .write(new DistributedRowMatrixPipe(outputPath, conf))

// and anything else you would want to do with a scalding TypedPipe[Int, Vector]

Currently I've been doing this with a package structure directly in Mahout, in mahout/contrib/scalding.

What do people think about having this be something real, after 0.8 goes out? Are we ready for contrib modules which fold in diverse external projects in new ways? Integrating directly with Pig and scalding is a bit too wide of a tent for Mahout core, but putting these integrations in entirely new projects is maybe a bit too far away.

On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning ted.dunn...@gmail.com wrote:
Re: Mahout vectors/matrices/solvers on spark
That looks great Dmitry! The thing about Breeze that drives the complexity in it is partly specialization for Float, Double and Int matrices, and partly getting the syntax to just work for all combinations of matrix types and operands etc. mostly it does just work but occasionally not. I am surprised that dense * sparse matrix doesn't work but I guess as I previously mentioned the sparse matrix support is a bit shaky. David Hall is pretty happy to both look into enhancements and help out for contributions (eg I'm hoping to find time to look into a proper Diagonal matrix implementation and he was very helpful with pointers etc), so please do drop things into the google group mailing list. Hopefully wider adoption especially by this type of community will drive Breeze development. In another note I also really like Scaldings matrix API so scala ish wrappers for mahout would be cool - another pet project of mine is a port of that API to spark too :) N — Sent from Mailbox for iPhone On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix jake.man...@gmail.com wrote: Yeah, I'm totally on board with a pretty scala DSL on top of some of our stuff. In particular, I've been experimenting with with wrapping the DistributedRowMatrix in a scalding wrapper, so we can do things like val matrixAsTypedPipe = DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf)) // e.g. L1 normalize: matrixAsTypedPipe.map((idx, v) : (Int, Vector) = (idx, v.normalize(1)) ) .write(new DistributedRowMatrixPipe(outputPath, conf)) // and anything else you would want to do with a scalding TypedPipe[Int, Vector] Currently I've been doing this with a package structure directly in Mahout, in: mahout/contrib/scalding What do people think about having this be something real, after 0.8 goes out? Are we ready for contrib modules which fold in diverse external projects in new ways? 
Integrating directly with Pig and Scalding is a bit too wide of a tent for Mahout core, but putting these integrations in entirely new projects is maybe a bit too far away. On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning ted.dunn...@gmail.com wrote: Dmitriy, This is very pretty. On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Ok, so I was fairly easily able to build some DSL for our matrix manipulation (similar to Breeze) in Scala. Inline matrix or vector: val a = dense((1, 2, 3), (3, 4, 5)) val b: Vector = (1, 2, 3) Block views (element/row/vector/block/block of row or vector): a(::, 0) a(1, ::) a(0 to 1, 1 to 2) Assignments: a(0, ::) := (3, 5, 7) a(0, 0 to 1) := (3, 5) a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5)) Operators: // hadamard val c = a * b a *= b // matrix mul val m = a %*% b and a bunch of other little things like sum, mean, colMeans etc. That much is easy. Also stuff like the ones found in Breeze, along the lines of val (u, v, s) = svd(a) diag((1, 2, 3)) and Cholesky in similar ways. I don't have inline initialization for sparse things (yet) simply because I don't need them, but of course all regular Java constructors and methods are retained; all this is just syntactic sugar in the spirit of DSLs, in the hope of making things a bit more readable. My (very small, and very mildly opinionated, really) criticism of Breeze in this context is its inconsistency between dense and sparse representations, namely the lack of consistent overarching trait(s), so that building structure-agnostic solvers like Mahout's Cholesky solver is impossible, as is cross-type matrix use (say, the way I understand it, it is pretty much impossible to multiply a sparse matrix by a dense matrix). I suspect these problems stem from the fact that the authors for whatever reason decided to hardwire dense things with JBlas solvers, whereas I don't believe matrix storage structures must be.
But these problems do appear to be serious enough for me to ignore Breeze for now. If I decide to plug in JBlas dense solvers, I guess I will just have them as yet another top-level routine interface taking any Matrix, e.g. val (u, v, s) = svd(m, jblas = true) On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Thank you. On Jun 23, 2013 6:16 PM, Ted Dunning ted.dunn...@gmail.com wrote: I think that this contract has migrated a bit from the first starting point. My feeling is that there is a de facto contract now that the matrix slice is a single row. Sent from my iPhone On Jun 23, 2013, at 16:32, Dmitriy Lyubimov dlie...@gmail.com wrote: What does Matrix.iterateAll() contractually do? Practically, it seems to be row-wise iteration for some implementations, but the javadoc doesn't seem to state so contractually. What is
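The thread argues that a structure-agnostic solver, such as Mahout's Cholesky solver, should only need element-level access to a matrix rather than a particular dense or sparse storage layout. A minimal self-contained sketch of that idea follows; the `Mat` interface and `DenseMat` class are hypothetical stand-ins for illustration, not Mahout classes.

```java
// Structure-agnostic Cholesky decomposition: the algorithm below touches the
// matrix only through get/set element access, so any storage scheme (dense,
// sparse, memory-mapped, ...) implementing Mat could participate.
public class CholeskySketch {

    // Minimal storage-agnostic matrix contract (hypothetical, not Mahout's Matrix).
    public interface Mat {
        double get(int row, int col);
        void set(int row, int col, double val);
        int size();
    }

    public static class DenseMat implements Mat {
        private final double[][] a;
        public DenseMat(int n) { a = new double[n][n]; }
        public double get(int r, int c) { return a[r][c]; }
        public void set(int r, int c, double v) { a[r][c] = v; }
        public int size() { return a.length; }
    }

    // Returns lower-triangular L such that A = L * L^T.
    // A must be symmetric positive-definite.
    public static Mat cholesky(Mat a) {
        int n = a.size();
        Mat l = new DenseMat(n);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = a.get(i, j);
                for (int k = 0; k < j; k++) {
                    sum -= l.get(i, k) * l.get(j, k);
                }
                l.set(i, j, i == j ? Math.sqrt(sum) : sum / l.get(j, j));
            }
        }
        return l;
    }

    public static void main(String[] args) {
        Mat a = new DenseMat(2);
        a.set(0, 0, 4); a.set(0, 1, 2);
        a.set(1, 0, 2); a.set(1, 1, 3);
        Mat l = cholesky(a);
        // For A = [[4,2],[2,3]], L = [[2,0],[1,sqrt(2)]]
        System.out.println(l.get(0, 0) + " " + l.get(1, 0) + " " + l.get(1, 1));
    }
}
```

Because the decomposition never assumes a backing array, swapping in a sparse `Mat` implementation requires no solver changes, which is the property the unifying-trait criticism of Breeze is about.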
Development guide (ICFOSS)
Hi, I am Samiran. I participated in the 3-day local workshop at ICFOSS (http://community.apache.org/mentoringprogramme-icfoss-pilot.html). I am looking forward to contributing to the Mahout project. I am a Java beginner and learning fast. My interest domain is data mining, and I am familiar with clustering algorithms. I was checking this: https://issues.apache.org/jira/browse/MAHOUT-1177 Let me know if somebody is already working on it. Also, please suggest if I need to pay special attention to something. It would be great if you could point me to some bug or enhancement in JIRA (for beginners) so that I can get some hands-on practice and understand the code base. Regards, Samiran
Re: Mahout vectors/matrices/solvers on spark
On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath nick.pentre...@gmail.com wrote: That looks great Dmitry! The thing about Breeze that drives the complexity in it is partly specialization for Float, Double and Int matrices, and partly getting the syntax to just work for all combinations of matrix types and operands etc. mostly it does just work but occasionally not. Yes, I noticed that, but since I am wrapping Mahout matrices, there's only a choice of double-filled matrices and vectors. Actually, I would argue that's the way it is supposed to be, in the interest of the KISS principle. I am not sure I see the value of int matrices for any problem I have ever worked on, and skipping on precision to save space is an even more far-fetched notion, as in real life the numbers don't take as much space as their pre-vectorized features and annotations. In fact, model training parts and linear algebra are not where the memory bottleneck seems to fatten up at all, in my experience. There's often exponentially growing CPU-bound behavior, yes, but not RAM. I am surprised that dense * sparse matrix doesn't work but I guess as I previously mentioned the sparse matrix support is a bit shaky. This is solely based on eyeballing the trait architecture. I did not actually attempt it. But there's no single unifying trait, for sure. David Hall is pretty happy to both look into enhancements and help out for contributions (eg I'm hoping to find time to look into a proper Diagonal matrix implementation and he was very helpful with pointers etc), so please do drop things into the google group mailing list. Hopefully wider adoption especially by this type of community will drive Breeze development.
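The DSL quoted in this thread distinguishes the Hadamard (element-wise) product, written `a * b`, from true matrix multiplication, written `a %*% b`. For readers unfamiliar with the distinction, here is a plain-Java sketch; the helper names are mine, not from Mahout or Breeze.

```java
// Two different "products" on matrices: Hadamard multiplies matching
// elements (shapes must be equal); matmul takes row-by-column dot products
// (inner dimensions must match).
public class Products {

    public static double[][] hadamard(double[][] a, double[][] b) {
        int n = a.length, m = a[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                c[i][j] = a[i][j] * b[i][j];      // element-by-element
        return c;
    }

    public static double[][] matmul(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j]; // row i of a dot column j of b
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // hadamard -> [[5, 12], [21, 32]]; matmul -> [[19, 22], [43, 50]]
        System.out.println(java.util.Arrays.deepToString(hadamard(a, b)));
        System.out.println(java.util.Arrays.deepToString(matmul(a, b)));
    }
}
```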
Re: Mahout vectors/matrices/solvers on spark
I think that contrib modules would be very interesting. Specifically, a good Scala DSL, Pig integration and so on.
Re: Mahout vectors/matrices/solvers on spark
You're right on that - so far doubles are all I've needed and all I can currently see needing. I'll take a look at your project and see how easy it is to integrate with my Spark ALS and other code - syntax-wise it looks almost the same, so swapping out the linear algebra backend would be quite trivial in theory. So far I have a working implementation of both the implicit and explicit ALS versions that matches Mahout in RMSE given the same parameters on the 3 MovieLens data sets. Still some work to do and more testing at scale, plus framework stuff. But hopefully I'd like to open source this at some point (but the Spark guys have a few projects upcoming, so I'm also waiting a bit to see what happens there, as it may end up duplicating a lot of what they're doing). — Sent from Mailbox for iPhone
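Nick compares his Spark ALS port against Mahout by RMSE on the MovieLens data sets. RMSE is simply the square root of the mean squared prediction error; a minimal sketch follows (the `rmse` helper is illustrative, not taken from either project).

```java
// Root-mean-square error between predicted and actual ratings:
// rmse = sqrt( (1/n) * sum_i (predicted_i - actual_i)^2 )
public class Rmse {

    public static double rmse(double[] predicted, double[] actual) {
        double sse = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double e = predicted[i] - actual[i];
            sse += e * e;                        // accumulate squared error
        }
        return Math.sqrt(sse / predicted.length);
    }

    public static void main(String[] args) {
        double[] predicted = {3.5, 4.0, 2.0};
        double[] actual = {3.0, 4.0, 3.0};
        System.out.println(rmse(predicted, actual));
    }
}
```

Matching RMSE given identical parameters is a reasonable sanity check that two ALS implementations converge to factorizations of comparable quality, even if the factors themselves differ.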
Re: Mahout vectors/matrices/solvers on spark
Well, one fundamental step to get there in the Mahout realm, the way I see it, is to create DSLs for Mahout's DRMs in Spark. That's actually one of the other reasons I chose not to follow Breeze: when we unwind Mahout DRMs, we may see sparse or dense slices there with named vectors. Translating that into Breeze blocks would be a problem (and annotations/named-vector treatment is yet another problem, I guess).
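Jake's Scalding snippet earlier in the thread L1-normalizes each row with `v.normalize(1)`. The operation just divides every component by the vector's L1 norm (the sum of absolute values), so the result's absolute components sum to 1. A self-contained sketch with plain arrays, not the Mahout Vector API:

```java
// L1 normalization: out_i = v_i / sum_j |v_j|.
public class L1Norm {

    public static double[] normalizeL1(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += Math.abs(x);  // the L1 norm
        double[] out = new double[v.length];
        if (norm == 0.0) return out;             // zero vector stays zero
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        // {1, 3} -> {0.25, 0.75}
        System.out.println(java.util.Arrays.toString(normalizeL1(new double[]{1, 3})));
    }
}
```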
Jenkins build is back to normal : Mahout-Quality #2103
See https://builds.apache.org/job/Mahout-Quality/2103/
Re: (Bi-)Weekly/Monthly Dev Sessions
Hi! Is the Google Hangouts dev session tomorrow/Tuesday still happening? Lurkingly, Buro Mookerji On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.org wrote: It seems that 6 pm ET is the consensus time for the majority of people, although my having screwed up the poll didn't help. Bi-weekly is the other consensus. It also looks like Tuesday or Thursday are the preferred dates. I can't make next week, so I'm going to propose we kick off on Tuesday, June 25 at 6 pm. That will give us time to dry-run the Google Hangouts, etc. Again, just to be clear, the goal here is to work on the development of Mahout, not to answer questions about how to run Mahout (we could do that separately if there is a desire.) I'll send out a reminder as we get closer. -Grant On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am from Northern Virginia; how many of us here are from the Washington DC Metro area? From: Jake Mannix jake.man...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Sent: Wednesday, June 12, 2013 1:56 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon when I get back from Europe at the end of the summer! On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Bi-weekly is good for me; I'm in Seattle and just filled out the poll. Great idea! On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: +1, am in Seattle as well and would love to attend and be involved. Sent from my iPhone On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote: Good idea on recurring meetings. I'm very interested in participating. Bi-weekly works for me. I'm in the Seattle (Pacific) timezone - GMT-8. An agenda for the meetings ahead of time will help us get the most out of our time at the meetings. Thanks.
On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote: On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote: Angel and Suneel, you may want to re-fill out the new doodle. FYI, this week won't be representative of my schedule; I'm in the last few weeks of a job at ORNL where I travel every weekend. Normally I'll have more flexibility than just 6pm on weeknights. Yeah, Doodle makes you pick dates, but I just want it to be representative of a week-long period of time and not tied to a specific set of dates. So, just put in what your ideal times are in general and ignore the fact that it is set to next week. On 6/12/13 8:26 AM, Grant Ingersoll wrote: On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so I'm not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, and so on. I also realize that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv Grant Ingersoll | @gsingers http://www.lucidworks.com -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: (Bi-)Weekly/Monthly Dev Sessions
Not sure, but if we are having it, I think we should focus on what's left for the 0.8 release. From: Bhaskar Mookerji mooke...@spin-one.org To: dev@mahout.apache.org Cc: Suneel Marthi suneel_mar...@yahoo.com Sent: Monday, June 24, 2013 6:35 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions Hi! Is the Google Hangouts dev session tomorrow/Tuesday still happening? Lurkingly, Buro Mookerji
Build failed in Jenkins: mahout-nightly » Mahout Integration #1272
See https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/1272/changes Changes: [smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports [smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed issue with not reading a directory list [smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first round of Code cleanup based on feedback from code review [smarthi] MAHOUT-944:lucene2seq - removed unused import -- [INFO] [INFO] [INFO] Building Mahout Integration 0.8-SNAPSHOT [INFO] [INFO] [INFO] Deleting https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target [INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-integration --- [INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ mahout-integration --- [INFO] Copying 0 resource [INFO] [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ mahout-integration --- [INFO] Changes detected - recompiling the module! [INFO] Compiling 131 source files to https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/classes [WARNING] Note: Some input files use or override a deprecated API. [WARNING] Note: Recompile with -Xlint:deprecation for details. [WARNING] Note: https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java uses unchecked or unsafe operations. [WARNING] Note: Recompile with -Xlint:unchecked for details. [INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ mahout-integration --- [INFO] Copying 10 resources [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mahout-integration --- [INFO] Changes detected - recompiling the module! 
[INFO] Compiling 39 source files to https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/test-classes [WARNING] Note: Some input files use or override a deprecated API. [WARNING] Note: Recompile with -Xlint:deprecation for details. [INFO] [INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ mahout-integration --- [INFO] Surefire report directory: https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-integration/ws/target/surefire-reports --- T E S T S --- parallel='classes', perCoreThreadCount=false, threadCount=1, useUnlimitedThreads=false Running org.apache.mahout.utils.nlp.collocations.llr.BloomTokenFilterTest Running org.apache.mahout.utils.vectors.arff.ARFFTypeTest
Build failed in Jenkins: mahout-nightly #1272
See https://builds.apache.org/job/mahout-nightly/1272/changes Changes: [smarthi] MAHOUT-944: lucene2seq - more code cleanup, removed unused imports [smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - fixed issue with not reading a directory list [smarthi] MAHOUT-833: Make conversion to sequence files map-reduce - first round of Code cleanup based on feedback from code review [smarthi] MAHOUT-944:lucene2seq - removed unused import -- [...truncated 2088 lines...] Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml (344 B at 2.0 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-tests.jar Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-tests.jar (2436 KB at 9055.6 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml (2 KB at 20.0 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-job.jar Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-job.jar (19450 KB at 22381.7 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml (2 
KB at 17.5 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-sources.jar Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/mahout-core-0.8-20130624.230604-281-sources.jar (1149 KB at 3587.6 KB/sec) Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.8-SNAPSHOT/maven-metadata.xml (2 KB at 20.3 KB/sec) [INFO] [INFO] [INFO] Building Mahout Integration 0.8-SNAPSHOT [INFO] [INFO] [INFO] Deleting https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target [INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-integration --- [INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ mahout-integration --- [INFO] Copying 0 resource [INFO] [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ mahout-integration --- [INFO] Changes detected - recompiling the module! [INFO] Compiling 131 source files to https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/classes [WARNING] Note: Some input files use or override a deprecated API. [WARNING] Note: Recompile with -Xlint:deprecation for details. [WARNING] Note: https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java uses unchecked or unsafe operations. [WARNING] Note: Recompile with -Xlint:unchecked for details. [INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources. 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ mahout-integration --- [INFO] Copying 10 resources [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mahout-integration --- [INFO] Changes detected - recompiling the module! [INFO] Compiling 39 source files to https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/test-classes [WARNING] Note: Some input files use or override a deprecated API. [WARNING] Note: Recompile with -Xlint:deprecation for details. [INFO] [INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ mahout-integration --- [INFO] Surefire report directory: https://builds.apache.org/job/mahout-nightly/ws/trunk/integration/target/surefire-reports --- T E S T S ---
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692518#comment-13692518 ] Robin Anil commented on MAHOUT-1214: https://reviews.apache.org/r/11931/ I have actually replied to your comments. My comment still stands with respect to using a non-standard input format. Grant, can you take a look as well? Improve the accuracy of the Spectral KMeans Method -- Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Assignee: Robin Anil Labels: clustering, improvement Fix For: 0.8 Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2 The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues; these incorrect implementations make it fail even on a trivial dataset. We have implemented a solution to resolve both issues and hope to contribute it back to the community. # Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain correct clustering results for the case of K > 1. We have an idea and an implementation that selects eigenvectors based on cosAngle/orthogonality. # Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and a bad initialization can produce a wrong clustering result. Here the selected K eigenvectors actually provide a better way to initialize the cluster centroids, because each selected eigenvector is a relaxed indicator of the membership of one cluster. For every selected eigenvector, we use the data point whose eigen-component achieves the maximum absolute value. We have already verified our improvement on a synthetic dataset: the improved version obtains the optimal clustering result, while the current 0.7 version obtains a wrong one. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
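The cosAngle/orthogonality selection described in Issue 1 can be sketched as follows. This is an illustrative, self-contained sketch, not the code in the attached patch or Mahout's EigenVerificationJob API: the class name, method names, and the plain-array representation of eigenvectors are all invented for the example, and a real implementation would operate on Mahout Vector instances.

```java
// Sketch of checking that a set of candidate eigenvectors is pairwise
// (near-)orthogonal, as Issue 1 requires for correct clustering when K > 1.
public class EigenOrthogonalityCheck {

    /** Cosine of the angle between two vectors of equal length. */
    static double cosAngle(double[] u, double[] v) {
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            normU += u[i] * u[i];
            normV += v[i] * v[i];
        }
        return dot / (Math.sqrt(normU) * Math.sqrt(normV));
    }

    /** True if every pair of vectors has |cos(angle)| at most maxAbsCos,
     *  i.e. the whole set is close enough to orthogonal. */
    static boolean allOrthogonal(double[][] vectors, double maxAbsCos) {
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                if (Math.abs(cosAngle(vectors[i], vectors[j])) > maxAbsCos) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A verification pass would drop (or re-select) any candidate eigenvector whose cosine against an already-accepted one exceeds the threshold.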
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692589#comment-13692589 ] Yiqun Hu commented on MAHOUT-1214: -- Hi Robin, We have also responded to your comments about why a new input format is used; please check our response on the review board. We introduce new support for spectral k-means in Mahout: we allow the user to specify the affinity between data points using any data identity. We believe this support matters for Mahout users. Just imagine needing to specify pairwise affinities over petabytes of data; asking the user to map each data point first and specify row/column ids is inconvenient. We have responded to the comments and await further discussion. There are two options here. One, if there is a way to implement this support with the standard input format, please suggest it, because we thought it was impossible. Two, if you think this support is useless, we don't mind removing it and keeping it to ourselves. Again, we need discussion to move forward. Sent from my iPhone
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692593#comment-13692593 ] Yiqun Hu commented on MAHOUT-1214: -- Robin, I just saw your response. Let us digest it and then respond.
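The initialization proposed in Issue 2 of this ticket, picking for each selected eigenvector the data point whose eigen-component has the largest absolute value, can be sketched like this. The names are illustrative, not the patch's actual code; each eigenvector is represented as a plain array indexed by data point.

```java
// Sketch of seeding k-means centroids from the selected eigenvectors:
// each eigenvector is a relaxed membership indicator for one cluster,
// so the data point with the largest |component| is a natural seed.
public class EigenSeeding {

    /** For each eigenvector, return the index of the data point whose
     *  component has the largest absolute value. Those data points are
     *  then used as the initial cluster centroids. */
    static int[] seedIndices(double[][] eigenvectors) {
        int[] seeds = new int[eigenvectors.length];
        for (int k = 0; k < eigenvectors.length; k++) {
            int best = 0;
            for (int i = 1; i < eigenvectors[k].length; i++) {
                if (Math.abs(eigenvectors[k][i]) > Math.abs(eigenvectors[k][best])) {
                    best = i;
                }
            }
            seeds[k] = best;
        }
        return seeds;
    }
}
```

Compared with random seeding, this makes the initialization deterministic given the eigendecomposition, which is what the ticket credits for avoiding bad clusterings on the synthetic data.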
Re: (Bi-)Weekly/Monthly Dev Sessions
I'd really like to, but had a trip come up. If possible, can we push for one week? Otherwise, if others want to go forward, I can try to set things up and share it w/ others. On Jun 24, 2013, at 6:35 PM, Bhaskar Mookerji mooke...@spin-one.org wrote: Hi! Is the Google hangouts dev session tomorrow/Tuesday still happening? Lurkingly, Buro Mookerji On Fri, Jun 14, 2013 at 3:37 AM, Grant Ingersoll gsing...@apache.orgwrote: It seems that 6 pm ET is the consensus time for the majority of people, although my having screwed up the poll didn't help. Bi-weekly is the other consensus. It also looks like Tuesday or Thursday are the preferred dates. I can't make next week, so I'm going to propose we kick off on Tuesday, June 25 at 6 pm. That will give us time to dry-run the Google Hangouts, etc. Again, just to be clear, the goal here is to work on the development of Mahout, not to answer questions about how to run Mahout (we could do that separately if there is a desire.) I'll send out a reminder as we get closer. -Grant On Jun 12, 2013, at 3:04 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: I am from Northern Virginia; how many of us here are from the Washington DC Metro area? From: Jake Mannix jake.man...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Sent: Wednesday, June 12, 2013 1:56 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions Wow, a lot of Seattleites, I should organize a Mahout MeetUp / Hackathon when I get back from Europe at the end of the summer! On Wed, Jun 12, 2013 at 10:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Bi-weekly is good for me; I'm in Seattle and just filled out the poll. Great idea! On Wed, Jun 12, 2013 at 10:22 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: +1, am in Seattle as well and would love to attend and be involved. Sent from my iPhone On Jun 12, 2013, at 10:18 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote: Good idea on recurring meetings. I'm very interested in participating. Biweekly works for me.
I'm in Seattle (Pacific) timezone - GMT-8. An agenda for the meetings ahead of time will help us get the most out of our time at the meetings. Thanks. On Jun 12, 2013 6:23 AM, Grant Ingersoll gsing...@apache.org wrote: On Jun 12, 2013, at 8:41 AM, Shannon Quinn squ...@gatech.edu wrote: Angel and Suneel, you may want to re-fill out the new doodle. FYI, this week won't be representative of my schedule; I'm in the last few weeks of a job at ORNL where I travel every weekend. Normally I'll have more flexibility than just 6pm on weeknights. Yeah, Doodle makes you pick dates, but I just want it to be representative of a week-long period of time and not tied to a specific set of dates. So, just put in what your ideal times are in general and ignore the fact that it is set to next week. On 6/12/13 8:26 AM, Grant Ingersoll wrote: On Jun 12, 2013, at 7:29 AM, Shannon Quinn squ...@gatech.edu wrote: +1, awesome idea One question: the poll, while set to GMT -5, does say it's in Central Time. Is this a daylight savings thing? I turned on Time Zone support, so I'm not sure how it will look to others, but it sounds like it adjusts based on your location... I see: 8 am, 10, 1, and so on. I also realize that I messed it up. I meant 9 pm, not 9 am. Here is the correct one: http://doodle.com/ymqaiwbh7khisnyv Grant Ingersoll | @gsingers http://www.lucidworks.com -- -jake Grant Ingersoll | @gsingers http://www.lucidworks.com Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692620#comment-13692620 ] Yiqun Hu commented on MAHOUT-1214: -- Robin, I understand the philosophy of Mahout. But when you say we can write a new MapReduce job to map string ids to row/column indices, from my understanding that does not solve the issue: in the new MapReduce job we would still have to introduce the new input format, as we did here. Am I right? Sent from my iPhone
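For context, the preprocessing being debated, mapping arbitrary string identities to dense integer row/column indices before building the affinity matrix, amounts to the following single-process sketch. The class and method names here are invented for illustration; a MapReduce version would additionally have to make the same assignment deterministically across mappers, which is the part the thread is really arguing about.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: assign each distinct string identity a dense integer index,
// in first-seen order, so affinities keyed by string ids can be turned
// into the (row, column, value) triples spectral k-means expects.
public class IdMapper {

    static Map<String, Integer> assignIndices(String[] ids) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String id : ids) {
            if (!index.containsKey(id)) {
                index.put(id, index.size()); // next unused dense index
            }
        }
        return index;
    }
}
```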
Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272
Can someone w/ more Hadoop experience look at this? We are getting: java.lang.ClassCastException: org.apache.mahout.text.LuceneSegmentInputSplit cannot be cast to org.apache.hadoop.mapred.InputSplit at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214) AFAICT, we are using the new APIs, but this seems to think it should be the old APIs. Note, this is an intermittent issue. Sometimes it goes through just fine. Locally, it passes for me. Note, this could also be related to the Parallel tests stuff. -Grant On Jun 24, 2013, at 7:06 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec FAILURE! testSequential(org.apache.mahout.text.SequenceFilesFromMailArchivesTest) Time elapsed: 1.268 sec FAILURE! org.junit.ComparisonFailure: expected:TEST/subdir/[mail-messages].gz/u...@example.com but was:TEST/subdir/[subsubdir/mail-messages-2].gz/u...@example.com at org.junit.Assert.assertEquals(Assert.java:115) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.mahout.text.SequenceFilesFromMailArchivesTest.testSequential(SequenceFilesFromMailArchivesTest.java:108) Grant Ingersoll | @gsingers http://www.lucidworks.com
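The failure mode itself is easy to reproduce in isolation: the old org.apache.hadoop.mapred.InputSplit and the new org.apache.hadoop.mapreduce.InputSplit are unrelated types, so a split written only against one cannot be cast to the other. Below is a self-contained sketch using stand-in types (not the real Hadoop classes) that mirrors the shape of the problem in the stack trace.

```java
// Stand-ins for the two unrelated Hadoop split types: the old API's
// split is an interface, the new API's is an abstract class.
public class SplitCastDemo {

    interface OldApiSplit {}
    abstract static class NewApiSplit {}

    // A split written against the new API only, playing the role of
    // LuceneSegmentInputSplit in the stack trace above.
    static class NewOnlySplit extends NewApiSplit {}

    /** True only if the runtime type also implements the old API;
     *  a blind cast in old-API code (runOldMapper) throws
     *  ClassCastException exactly when this returns false. */
    static boolean usableByOldApi(Object split) {
        return split instanceof OldApiSplit;
    }
}
```

If the job is genuinely configured with the new API, the usual suspects are a stale mapred.* setting (or JobConf-based submission path) causing Hadoop to take the old-API code path.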
Re: Build failed in Jenkins: mahout-nightly » Mahout Integration #1272
Never mind the noise here, I misread this! Still, we have some error going on w/ random failures. On Jun 24, 2013, at 8:33 PM, Grant Ingersoll gsing...@apache.org wrote: [...] Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: (Bi-)Weekly/Monthly Dev Sessions
I am fine with pushing by a week. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Cc: Suneel Marthi suneel_mar...@yahoo.com Sent: Monday, June 24, 2013 8:25 PM Subject: Re: (Bi-)Weekly/Monthly Dev Sessions [...]
[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhang da updated MAHOUT-1214: - Attachment: (was: MAHOUT-1214.patch)
[jira] [Created] (MAHOUT-1268) Wrong output directory for CVB
Sebastian Schelter created MAHOUT-1268: -- Summary: Wrong output directory for CVB Key: MAHOUT-1268 URL: https://issues.apache.org/jira/browse/MAHOUT-1268 Project: Mahout Issue Type: Bug Components: Clustering Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 0.8 I think that I introduced a bug in MAHOUT-1262 by accidentally writing to the wrong output dir (as reported by Mark Wicks on the mailinglist). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1268) Wrong output directory for CVB
[ https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1268: --- Attachment: MAHOUT-1268.patch
[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB
[ https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692745#comment-13692745 ] Jake Mannix commented on MAHOUT-1268: - has this been tested with cluster_reuters.sh? If so, +1 to get this in asap.
[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB
[ https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692749#comment-13692749 ] Suneel Marthi commented on MAHOUT-1268: --- [~jake.mannix] testing cluster_reuters.sh now as I am typing this. But this should fix the issue.
[jira] [Commented] (MAHOUT-1268) Wrong output directory for CVB
[ https://issues.apache.org/jira/browse/MAHOUT-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692753#comment-13692753 ] Suneel Marthi commented on MAHOUT-1268: --- [~ssc] Please commit this; applied the patch and tested CVB with cluster_reuters.sh, and we are good now.