[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797111#action_12797111 ]

Sean Owen commented on MAHOUT-205:
----------------------------------

Now, it may still be my environment, since I was still seeing compile problems. I completely reset my installation. After this patch I get errors like:

/Users/srowen/Documents/Development/Mahout/core/src/main/java/org/apache/mahout/math/VectorWritable.java:[42,34] cannot find symbol
  symbol  : constructor DenseVector(org.apache.mahout.math.Vector)
  location: class org.apache.mahout.math.DenseVector

/Users/srowen/Documents/Development/Mahout/core/src/main/java/org/apache/mahout/math/DenseMatrixWritable.java:[30,17] cannot find symbol
  symbol  : method rowSize()
  location: class org.apache.mahout.math.DenseMatrixWritable

Does this make any sense?

Pull Writable (and anything else hadoop dependent) out of the matrix module
---------------------------------------------------------------------------
Key: MAHOUT-205
URL: https://issues.apache.org/jira/browse/MAHOUT-205
Project: Mahout
Issue Type: Improvement
Components: Math
Affects Versions: 0.1
Environment: all
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
Fix For: 0.3
Attachments: MAHOUT-205.patch, MAHOUT-205.patch

Vector and Matrix extend Writable, and while that was merely poorly coupled before, it will be an actual problem now that we have a separate submodule for matrix: this module should not depend on hadoop at all, ideally. Distributed matrix work, as well as simple Writable wrappers, can go somewhere else (where? core? yet another submodule which depends on matrix?), but it would be really nice if we could produce an artifact which doesn't require Hadoop but has our core linear primitives.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797180#action_12797180 ]

Jake Mannix commented on MAHOUT-205:
------------------------------------

Are you getting those during mvn clean install, or in your IDE, or where? I remember seeing IDEA get confused because, even though VectorWritable and DenseMatrixWritable were in the o.a.m.math package, they lived in the mahout-core module, not mahout-math. I'm running the tests in the background on a clean checkout (you're using the newer of the two patches, right?), and there are no compile problems here (although I didn't blow away all of my .m2 folder... I guess I can try that next...) - Mac OS X, fwiw.
[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797192#action_12797192 ]

Jake Mannix commented on MAHOUT-205:
------------------------------------

Ok, blew away ~/.m2, did a clean checkout with the later patch applied, and mvn clean compile ends with BUILD SUCCESSFUL (after 17m 9s, oof!). Can anyone else try this patch to see if you see what Sean saw?
[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797221#action_12797221 ]

Drew Farris commented on MAHOUT-205:
------------------------------------

Works for me, with a clean checkout and the latest patch applied. Yes, that build does take a long time indeed (for me: 12m 6s). FWIW, here are some of the worst-performing unit tests:

Running org.apache.mahout.clustering.dirichlet.TestMapReduce
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.514 sec

Running org.apache.mahout.clustering.kmeans.TestKmeansClustering
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 56.446 sec

Running org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 58.853 sec

(Seems like it would be reasonable to open another issue to remind one of us to look more closely at these.)
[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797233#action_12797233 ]

Jake Mannix commented on MAHOUT-185:
------------------------------------

As a note on this: one of the things I've sometimes done (and we do for managing our Hadoop jobs at LinkedIn) to make dealing with messy CLI stuff more manageable is to also allow for Properties files with default arguments for the various jobs. It makes for much more easily reproducible results, and it's self-documenting - just have "mahout classify" look first in classify.props to see if default args are defined, and go from there. Using a base class like Hadoop's Tool, you can leverage ToolRunner and GenericOptionsParser as well, and hooking in a Properties-based way to run it makes it pretty flexible.

It would be really nice to consolidate all of our Driver/Job classes into this issue, so that they are a) not duplicated, and b) in one place. This issue should get some priority - it will seriously help with our usability if there's an easy way to launch all the various tasks from one simple place. I'd love to have a little jruby script to run some of this stuff too, because when I was first writing decomposer, I found it invaluable to be able to just drop into jirb's REPL and start issuing java commands to run the various Hadoop jobs I was testing.

Add mahout shell script for easy launching of various algorithms
----------------------------------------------------------------
Key: MAHOUT-185
URL: https://issues.apache.org/jira/browse/MAHOUT-185
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.2
Environment: linux, bash
Reporter: Robin Anil
Fix For: 0.3

Currently, each algorithm has a different point of entry, and it's too complicated to understand and launch each one. A mahout shell script needs to be made in the bin directory which does something like the following:

mahout classify -algorithm bayes [OPTIONS]
mahout cluster -algorithm canopy [OPTIONS]
mahout fpm -algorithm pfpgrowth [OPTIONS]
mahout taste -algorithm slopeone [OPTIONS]
mahout misc -algorithm createVectorsFromText [OPTIONS]
mahout examples WikipediaExample
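The command layout sketched in the issue could be backed by a single Java entry point that the shell script calls, dispatching the task name to the existing per-algorithm Driver classes. A minimal sketch, assuming a hypothetical MahoutDriver class and task registry - none of these names come from the issue:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical sketch of a single "mahout" entry point: task names map to the
// main() of the corresponding Driver class; registrations are illustrative.
public class MahoutDriver {
    private static final Map<String, Consumer<String[]>> TASKS = new TreeMap<>();

    static void register(String name, Consumer<String[]> mainMethod) {
        TASKS.put(name, mainMethod);
    }

    // Returns a usage string on an unknown task, or null after dispatching.
    static String run(String[] args) {
        if (args.length == 0 || !TASKS.containsKey(args[0])) {
            return "Usage: mahout <task> [OPTIONS]; tasks: " + TASKS.keySet();
        }
        // Strip the task name and hand the remaining args to the driver.
        TASKS.get(args[0]).accept(Arrays.copyOfRange(args, 1, args.length));
        return null;
    }

    public static void main(String[] args) {
        // e.g. register("classify", a -> BayesDriver.main(a)); -- name assumed
        String usage = run(args);
        if (usage != null) {
            System.err.println(usage);
        }
    }
}
```

The bin/mahout script then reduces to locating the jars and invoking this class, which also gives the Properties-lookup idea above a single place to hook in.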
[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797277#action_12797277 ]

Ted Dunning commented on MAHOUT-185:
------------------------------------

Regarding the properties-file idea, I have had very good luck with a convention that I now use pretty ubiquitously. Each application has a default properties file that is baked into the jar file. This allows slow changes, subject to recompilation. All of these default properties are subject to override in an external property file found on the class path or in the current working directory. These overrides are monitored for changes to allow on-the-fly reconfiguration of long-running processes.

For transaction systems (not Mahout-like stuff), I also allow requests to contain an additional override map of properties. This allows certain things to be changed on a request-by-request basis. It helps enormously because it allows almost anything to be the subject of A/B testing.
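The layered-defaults convention described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical LayeredConfig class; the change-monitoring and per-request override layers are omitted:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: defaults baked into the jar, overridden by an external
// file of the same name found in the current working directory.
public class LayeredConfig {
    public static Properties load(String name) throws Exception {
        Properties props = new Properties();
        // Layer 1: defaults shipped inside the jar (changing these needs a rebuild).
        try (InputStream in = LayeredConfig.class.getResourceAsStream("/" + name)) {
            if (in != null) {
                props.load(in);
            }
        }
        // Layer 2: an external file in the working directory wins, because
        // properties loaded later replace values loaded earlier.
        File external = new File(name);
        if (external.exists()) {
            try (InputStream in = new FileInputStream(external)) {
                props.load(in);
            }
        }
        return props;
    }
}
```

A long-running process could re-run load() whenever the external file's timestamp changes, which is the on-the-fly reconfiguration piece of the convention.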
[jira] Created: (MAHOUT-238) Further Dependency Cleanup
Further Dependency Cleanup
--------------------------
Key: MAHOUT-238
URL: https://issues.apache.org/jira/browse/MAHOUT-238
Project: Mahout
Issue Type: Sub-task
Reporter: Drew Farris
Priority: Minor
Fix For: 0.3

Further dependency cleanup is required, mainly to set the right hadoop dependency for mahout-math and to fix exclusions for the hadoop dependency in the parent pom, plus other minor cleanups. The patch includes the following changes:

maven (parent pom)
* added inceptionYear (2008)
* removed some exclusions for the hadoop dependency (avro, commons-codec, commons-httpclient) in the dependency management section
* removed javax.mail dependency

mahout-math
* switched from the o.a.m.hadoop:hadoop-core dependency to the new o.a.hadoop:hadoop-core dependency used in core, with the version specified in the dependencyManagement section of the parent pom
* removed unnecessary compile scope from the gson dependency

mahout-core
* removed kfs, jets3t, xmlenc: unused, originally added to support the old o.a.mahout.hadoop:hadoop-core:0.20.1 dependency
* removed commons-httpclient: now added transitively from the new o.a.hadoop:hadoop-core:0.20.2-SNAPSHOT dependency
* set slf4j-jcl to test scope
* removed watchmaker-swing: added instead in mahout-examples where it is actually used
* fixed the uncommons-maths groupId
* removed the unused lucene-analyzers dependency
* added easymock dependencies explicitly

mahout-utils
* removed unused easymock dependencies

mahout-examples
* added watchmaker-framework and watchmaker-swing
[jira] Updated: (MAHOUT-238) Further Dependency Cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-238:
-------------------------------
Attachment: MAHOUT-238.patch

patch added
[jira] Updated: (MAHOUT-238) Further Dependency Cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-238:
-------------------------------
Affects Version/s: 0.2
Status: Patch Available (was: Open)
[jira] Commented: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ]

Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

Here are some results on a 5-slave ec2 cluster, using Kdd 100%:

|| Num Map Tasks || Num Trees || Build Time    || oob error ||
| 10             | 10         | 0h 2m 32s 643  | 1.7E-4     |
| 10             | 100        | 0h 10m 5s 231  | 1.7E-4     |

The results look good; now I'll have to try the generated classifier on the kdd test data and see. Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle multiple-file datasets
* because a lot of work is done when the mappers are closing, I need to refresh some Hadoop counter or the job is canceled when trying to build a lot of trees (400)

Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
-----------------------------------------------------------------------------------------------
Key: MAHOUT-216
URL: https://issues.apache.org/jira/browse/MAHOUT-216
Project: Mahout
Issue Type: Improvement
Components: Classification
Affects Versions: 0.2
Reporter: Deneche A. Hakim
Assignee: Deneche A. Hakim
Fix For: 0.3

The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify that class. According to [CHAN, 95]:

{quote}
Random selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the class attribute so that each partition represents a good but smaller model of the entire training set.
{quote}

[CHAN, 95]: Philip K. Chan, On the Accuracy of Meta-learning for Scalable Data Mining
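The second known issue above (long-running work in the mappers' close() tripping the task timeout) is commonly handled by signalling progress from a background thread so the framework's inactivity counter keeps moving. A hedged sketch; the Ping interface stands in for Hadoop's Reporter, and all names here are illustrative:

```java
// Hypothetical sketch: keep a task alive during a long close() by periodically
// signalling progress from a daemon thread. Ping stands in for the Hadoop reporter.
public class ProgressKeepAlive implements AutoCloseable {
    public interface Ping {
        void progress();
    }

    private final Thread pinger;
    private volatile boolean running = true;

    public ProgressKeepAlive(Ping reporter, long intervalMs) {
        pinger = new Thread(() -> {
            while (running) {
                reporter.progress(); // resets the framework's inactivity timeout
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return; // close() interrupts us; just exit
                }
            }
        });
        pinger.setDaemon(true); // never keep the JVM alive on our account
        pinger.start();
    }

    @Override
    public void close() {
        running = false;
        pinger.interrupt();
    }
}
```

The mapper would open one of these at the start of close(), do the tree building, and close it in a finally block.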
[jira] Issue Comment Edited: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ]

Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------

Here are some results on a 5-slave ec2 cluster, using Kdd 100%:

|| Num Map Tasks || Num Trees || Build Time    || oob error ||
| 10             | 10         | 0h 2m 32s 643  | 1.7E-4     |
| 10             | 100        | 0h 10m 5s 231  | 1.2E-4     |

The results look good; now I'll have to try the generated classifier on the kdd test data and see. Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle multiple-file datasets
* because a lot of work is done when the mappers are closing, I need to refresh some Hadoop counter or the job is canceled when trying to build a lot of trees (400)

(The edit corrects the oob error for 100 trees from 1.7E-4 to 1.2E-4.)
[jira] Commented: (MAHOUT-238) Further Dependency Cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797547#action_12797547 ]

Sean Owen commented on MAHOUT-238:
----------------------------------

For some reason I can't apply the patch, but again I suspect it's a local problem. I'm about to just blow this all away and start over. My only question from visually inspecting the patch: we shouldn't directly depend on commons-logging, right? We log via SLF4J only.