[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms
[ https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-281: Status: Patch Available (was: Open) > scm urls are wrong in the poms > -- > > Key: MAHOUT-281 > URL: https://issues.apache.org/jira/browse/MAHOUT-281 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-281.diff > > > The scm urls in the poms are wrong. This must be fixed before running the > release plugin to make an 0.3 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms
[ https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-281: Attachment: MAHOUT-281.diff Changed scm connection strings. (Needed a comparatively simple example to show students at HPI how svn diff, patch and jira work.) > scm urls are wrong in the poms > -- > > Key: MAHOUT-281 > URL: https://issues.apache.org/jira/browse/MAHOUT-281 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-281.diff > > > The scm urls in the poms are wrong. This must be fixed before running the > release plugin to make an 0.3 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-262) Writable for labeled vectors for supervised learning algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803690#action_12803690 ] Isabel Drost commented on MAHOUT-262: - Should be possible to apply the patch with -p1 instead of -p0 to remove the a/b directories. > Writable for labeled vectors for supervised learning algorithms > --- > > Key: MAHOUT-262 > URL: https://issues.apache.org/jira/browse/MAHOUT-262 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Olivier Grisel > Fix For: 0.3 > > Attachments: MAHOUT-262-1.patch > > > Implement two new classes: > - SingleLabelVectorWritable for singly classified vectorized data item (one > and only one label index per instance) > - MultiLabelVectorWritable for multi categorized vectorized data item (0 or > more category indexes per instance) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
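The -p1 versus -p0 remark above is about how many leading path components patch strips from the file names in a diff, which is what removes the a/ and b/ prefixes. A minimal Java sketch of that stripping rule follows. The class and method names are hypothetical and exist only for illustration, the sample path is made up, and the sketch simplifies the real tool (for instance, it does not treat runs of adjacent slashes as one).

```java
// Hypothetical helper illustrating what `patch -pN` does to a file name from a diff header.
// Not part of Mahout or of the patch tool; a simplified model for illustration only.
final class PatchStrip {
    /** Removes the first n slash-separated leading components from path. */
    static String stripComponents(String path, int n) {
        String result = path;
        for (int i = 0; i < n; i++) {
            int slash = result.indexOf('/');
            if (slash < 0) {
                // Simplification: the sketch rejects paths with too few components.
                throw new IllegalArgumentException("cannot strip " + n + " components from " + path);
            }
            result = result.substring(slash + 1);
        }
        return result;
    }
}
```

With a diff that names a file as a/core/pom.xml, stripping one component (the -p1 case) yields core/pom.xml, while stripping none (the -p0 case) keeps the a/ prefix.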
[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-246: Resolution: Fixed Assignee: Olivier Grisel Status: Resolved (was: Patch Available) Patch applies cleanly with -p1, all tests still work, changes look good. Committed in revision 901791. > upgrade to new lucene TokenStream API to cleanup deprecation warnings > - > > Key: MAHOUT-246 > URL: https://issues.apache.org/jira/browse/MAHOUT-246 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Olivier Grisel >Assignee: Olivier Grisel >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-246-2.patch > > > The attached patch uses the new ts.incrementToken() / TermAttribute API > instead of the deprecated manual Token handling. > It also replaces two occurrences of the deprecated "new StandardAnalyzer()" with > the more explicit "new StandardAnalyzer(Version.LUCENE_CURRENT)". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-242) LLR Collocation Identifier
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803381#action_12803381 ] Isabel Drost commented on MAHOUT-242: - {quote} I am not worried about them at this point. {quote} Also not very worried - probably should have indicated that basically everything I found could be filed as "trivial, minor or style question only"... > LLR Collocation Identifier > -- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.3 >Reporter: Drew Farris >Priority: Minor > Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, > mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-0 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. 
[jira] Commented: (MAHOUT-264) Make mahout-math compatible with Java 1.5 (bytecode and standard library).
[ https://issues.apache.org/jira/browse/MAHOUT-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803281#action_12803281 ] Isabel Drost commented on MAHOUT-264: - The changes to the pom look good. But why are the changes to Sorting.java and Arrays.java needed? > Make mahout-math compatible with Java 1.5 (bytecode and standard library). > -- > > Key: MAHOUT-264 > URL: https://issues.apache.org/jira/browse/MAHOUT-264 > Project: Mahout > Issue Type: Wish > Components: Math >Reporter: Dawid Weiss >Assignee: Benson Margulies >Priority: Minor > Attachments: MAHOUT-264.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-217) Tidy up generated data after unit tests are run
[ https://issues.apache.org/jira/browse/MAHOUT-217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803276#action_12803276 ] Isabel Drost commented on MAHOUT-217: - The test files I found creating but not deleting data in the tmp directory:
./utils/src/test/java/org/apache/mahout/utils/vectors/io/VectorWriterTest.java
./utils/src/test/java/org/apache/mahout/utils/vectors/SequenceFileVectorIterableTest.java
./core/src/test/java/org/apache/mahout/classifier/bayes/BayesFileFormatterTest.java
./core/src/test/java/org/apache/mahout/cf/taste/impl/model/file/FileDataModelTest.java
> Tidy up generated data after unit tests are run > --- > > Key: MAHOUT-217 > URL: https://issues.apache.org/jira/browse/MAHOUT-217 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.3 >Reporter: Isabel Drost > Fix For: 0.3 > > > I tried to compile Mahout on people.apache.org yesterday: The build failed at > first, because tests could not generate test data. The reason: Some tests > tried to generate test data at /tmp//... - but those directories > did exist already and belonged to Sean. Why? Probably because Sean had run > the build earlier this year - but tests did not remove the data they > generated. > Proposed solution: Tests come with setup and with shutdown hooks. We should > remove any data when a test is finished and shut down. > Any thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
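The proposed fix (remove generated data in the tests' setup and shutdown hooks) could look roughly like the following sketch. It is not the Mahout implementation: the class and method names are hypothetical, and it assumes the java.nio.file API, which is newer than the Java versions discussed in these messages. Creating a uniquely named directory per run also avoids the collision with directories owned by another user.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper; call createTestDir() from a test's setUp and deleteRecursively() from tearDown.
final class TempDataCleanup {
    /** Creates a fresh, uniquely named directory under the system temp directory. */
    static Path createTestDir(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates an empty file inside dir, standing in for data a test generates. */
    static Path touch(Path dir, String name) {
        try {
            return Files.createFile(dir.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes dir and everything below it; does nothing if it is already gone. */
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try {
            Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    Files.delete(file); // files first
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    Files.delete(d); // then the now-empty directory
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```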
[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803275#action_12803275 ] Isabel Drost commented on MAHOUT-237: - Hmm, Robin your last comment is "ok. done" however the issue is still open? > Map/Reduce Implementation of Document Vectorizer > > > Key: MAHOUT-237 > URL: https://issues.apache.org/jira/browse/MAHOUT-237 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.3 > > Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, > DictionaryVectorizer.patch, DictionaryVectorizer.patch, > DictionaryVectorizer.patch, SparseVector-VIntWritable.patch > > > Current Vectorizer uses Lucene Index to convert documents into SparseVectors > Ted is working on a Hash based Vectorizer which can map features into Vectors > of fixed size and sum it up to get the document Vector > This is a pure bag-of-words based Vectorizer written in Map/Reduce. > The input document is in SequenceFile . with key = docid, value = > content > First Map/Reduce over the document collection and generate the feature counts. > Second Sequential pass reads the output of the map/reduce and converts them > to SequenceFile where key=feature, value = unique id > Second stage should create shards of features of a given split size > Third Map/Reduce over the document collection, using each shard and create > Partial(containing the features of the given shard) SparseVectors > Fourth Map/Reduce over partial shard, group by docid, create full document > Vector -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
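The staged pipeline in the issue description above (count features, assign each feature a unique id, then build per-document vectors) can be sketched in memory as follows. This is only an illustration under stated assumptions: it runs sequentially rather than as Map/Reduce jobs, it omits the sharding of the dictionary, it tokenizes on whitespace, and every class and method name is hypothetical rather than taken from the DictionaryVectorizer patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of a bag-of-words vectorizer; not the Map/Reduce implementation.
final class BagOfWordsSketch {
    /** Stage 1: count how often each feature (token) occurs across all documents (docid -> content). */
    static Map<String, Integer> countFeatures(Map<String, String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String content : docs.values()) {
            for (String token : tokenize(content)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Stage 2: assign a unique integer id to each feature, in iteration order of the counts. */
    static Map<String, Integer> buildDictionary(Map<String, Integer> counts) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String feature : counts.keySet()) {
            dictionary.put(feature, dictionary.size());
        }
        return dictionary;
    }

    /** Stages 3 and 4 collapsed: map one document to a sparse vector of feature id -> term count. */
    static Map<Integer, Integer> vectorize(String content, Map<String, Integer> dictionary) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokenize(content)) {
            Integer id = dictionary.get(token);
            if (id != null) {
                vector.merge(id, 1, Integer::sum);
            }
        }
        return vector;
    }

    /** Assumption: lower-cased whitespace tokenization stands in for a real analyzer. */
    static String[] tokenize(String content) {
        String trimmed = content.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

In the design described in the issue, the dictionary is split into shards so that each shard fits in memory, partial vectors are built per shard, and a final pass groups them by docid; the sketch skips that step because everything fits in one map.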
[jira] Commented: (MAHOUT-242) LLR Collocation Identifier
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803274#action_12803274 ] Isabel Drost commented on MAHOUT-242: - First of all, thanks for the patch. The code looks good so far, patch applies cleanly and builds w/o problems. Some initial comments and questions I had when reading it:
CollocMapper, Line 66: If I read your implementation correctly, this means that documents are always read fully into memory, right? So we would have to assume that the ngramCollector only runs over documents that fit into main memory, and we would be unable to process larger documents. I am wondering whether this is an issue at all, and if so, whether there is any way around that.
Gram, Line 192: You can omit the "else" clauses when the "if" already returns its result to the caller; however, this is a question of style. I was wondering why in line 177 you did not write "this.position != other.position"?
NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated apis instead of suppressing deprecation warnings?
LLRReducer, Line 143: How about making the method package private if it should be used in unit tests only anyway? Line 106: Would be nice to have an additional counter for the skipped grams.
I agree with you that things like sentence boundary detection and more sophisticated tokenization should be left as work for an additional issue. Jake, it would be great if you could have a closer look to verify that this is about the pipeline you had in mind in the referenced e-mail threads and mention anything that might still be missing. 
> LLR Collocation Identifier > -- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.3 >Reporter: Drew Farris >Priority: Minor > Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, > mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-0 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280 ] Isabel Drost commented on MAHOUT-153: - Welcome to Mahout. Thanks for stepping up and volunteering to take over the work for this issue. > Implement kmeans++ for initial cluster selection in kmeans > -- > > Key: MAHOUT-153 > URL: https://issues.apache.org/jira/browse/MAHOUT-153 > Project: Mahout > Issue Type: New Feature > Components: Clustering >Affects Versions: 0.2 > Environment: OS Independent >Reporter: Panagiotis Papadimitriou > Fix For: 0.3 > > Original Estimate: 336h > Remaining Estimate: 336h > > The current implementation of k-means includes the following algorithms for > initial cluster selection (seed selection): 1) random selection of k points, > 2) use of canopy clusters. > I plan to implement k-means++. The details of the algorithm are available > here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf. > Design Outline: I will create an abstract class SeedGenerator and a subclass > KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will > become a subclass of SeedGenerator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
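The design outline in the issue names an abstract SeedGenerator with a KMeansPlusPlusSeedGenerator subclass. A minimal sketch of how k-means++ seeding could fit that outline is shown below. Only the two class names come from the issue; the method signature, the use of plain double[] points, and the squared Euclidean distance are assumptions made for illustration, not the planned Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Abstract seed selector as named in the design outline; the signature is an assumption. */
abstract class SeedGenerator {
    abstract List<double[]> chooseSeeds(List<double[]> points, int k, Random rng);
}

/** k-means++: each new seed is drawn with probability proportional to its squared distance to the nearest seed. */
final class KMeansPlusPlusSeedGenerator extends SeedGenerator {
    @Override
    List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
        if (k < 1 || k > points.size()) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        List<double[]> seeds = new ArrayList<>();
        seeds.add(points.get(rng.nextInt(points.size()))); // first seed: uniform at random
        double[] minDist = new double[points.size()];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        while (seeds.size() < k) {
            // Update each point's squared distance to its nearest seed using the latest seed only.
            double[] latest = seeds.get(seeds.size() - 1);
            double total = 0.0;
            for (int i = 0; i < points.size(); i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points.get(i), latest));
                total += minDist[i];
            }
            // Draw an index with probability minDist[i] / total.
            double target = rng.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = -1;
            for (int i = 0; i < points.size(); i++) {
                if (minDist[i] <= 0.0) {
                    continue; // coincides with an existing seed, probability zero
                }
                cumulative += minDist[i];
                chosen = i; // keep the last candidate as a guard against rounding
                if (cumulative > target) {
                    break;
                }
            }
            if (chosen < 0) {
                throw new IllegalArgumentException("fewer than k distinct points");
            }
            seeds.add(points.get(chosen));
        }
        return seeds;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

Because a point that coincides with an existing seed has distance zero, it can never be drawn again, so the seeds are distinct whenever the input points are.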
[jira] Updated: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-244: Resolution: Fixed Status: Resolved (was: Patch Available) Patch applies cleanly and looks good, project builds with it, unit test is included. Committed at revision 899157. > Add root log-likelihood method to LogLikehood class. > > > Key: MAHOUT-244 > URL: https://issues.apache.org/jira/browse/MAHOUT-244 > Project: Mahout > Issue Type: Improvement > Components: Math >Affects Versions: 0.3 >Reporter: Drew Farris >Assignee: Drew Farris >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-244.patch > > > Per discussion at: > http://www.lucidimagination.com/search/document/6dc8709e65a7ced1/llr_scoring_question > This patch adds a method for root log-likelihood calculation to the existing > LogLikelihood class + provides a unit test based on Shashi's numbers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
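For readers unfamiliar with the quantity this patch adds, here is a self-contained sketch of a log-likelihood ratio over a 2x2 contingency table and a signed root of it. It is not the committed Mahout code: the class name is hypothetical, and the sign convention (negative when the first row's k11 proportion is below the second row's) is an assumption about how a root LLR is commonly signed.

```java
// Hypothetical sketch; the committed LogLikelihood method may differ in detail.
final class RootLlrSketch {
    /** x * ln(x), with the usual convention that 0 * ln(0) = 0. */
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** G^2 statistic: 2 * sum over cells of k * ln(k * N / (rowTotal * colTotal)). */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        long n = k11 + k12 + k21 + k22;
        double cells = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
        double rows = xLogX(k11 + k12) + xLogX(k21 + k22);
        double cols = xLogX(k11 + k21) + xLogX(k12 + k22);
        double llr = 2.0 * (cells - rows - cols + xLogX(n));
        return Math.max(0.0, llr); // clamp tiny negative values caused by rounding
    }

    /** Square root of the LLR, negated when k11/(k11+k12) < k21/(k21+k22) (assumed convention). */
    static double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(logLikelihoodRatio(k11, k12, k21, k22));
        // Compare the two proportions by cross-multiplying to avoid dividing by zero.
        boolean below = (double) k11 * (k21 + k22) < (double) k21 * (k11 + k12);
        return below ? -root : root;
    }
}
```

The signed root keeps the direction of the association, which the plain LLR discards: a table where the two events co-occur more often than expected scores positive, and one where they co-occur less often scores negative.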
[jira] Assigned: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-244: --- Assignee: Drew Farris > Add root log-likelihood method to LogLikehood class. > > > Key: MAHOUT-244 > URL: https://issues.apache.org/jira/browse/MAHOUT-244 > Project: Mahout > Issue Type: Improvement > Components: Math >Affects Versions: 0.3 >Reporter: Drew Farris >Assignee: Drew Farris >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-244.patch > > > Per discussion at: > http://www.lucidimagination.com/search/document/6dc8709e65a7ced1/llr_scoring_question > This patch adds a method for root log-likelihood calculation to the existing > LogLikelihood class + provides a unit test based on Shashi's numbers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798524#action_12798524 ] Isabel Drost commented on MAHOUT-85: No, sorry. That was me committing a change that I made for MAHOUT-240 - reverted it. So far there are no Driver programs yet: This is only the sequential version. The model should be stored after training and loaded at application time. I have deferred implementing an end-to-end example to MAHOUT-241. Currently the implementation only provides for the training logic. > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-85.patch, MAHOUT-85.patch, > perceptronWinnowTrainer.diff > > > Please find attached a first sketch for perceptron and winnow training. > Please look very, very carefully at the patch, as I added the heart of the > algorithms in the emergency room at Charite Berlin (after I broke my leg when > cycling to the Hadoop Get Together ;) ). > The patch does not yet feature unit tests nor is it parallelised. Currently > my plan is to set up an example with the webKb dataset, add unit tests to the > code and after that go parallel. I would like to get some feedback early on, > in addition I would feel a lot better, if a second and third pair of eyes had > a look at the code to make sure all obvious mistakes are out as early as > possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
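As a point of reference for the sequential training logic mentioned above, a textbook perceptron update loop can be sketched as follows. This is not the code from the MAHOUT-85 patch: the class name, the plain-array representation, and the learning-rate and epoch parameters are assumptions for illustration, and Winnow (which uses multiplicative rather than additive updates) is not shown.

```java
// Hypothetical sketch of sequential perceptron training; not the MAHOUT-85 implementation.
final class PerceptronSketch {
    /** Trains weights on inputs x with labels y in {-1, +1}; index 0 of the result is the bias. */
    static double[] train(double[][] x, int[] y, double learningRate, int maxEpochs) {
        double[] w = new double[x[0].length + 1];
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            boolean mistake = false;
            for (int i = 0; i < x.length; i++) {
                // A point is misclassified when the label and the score disagree in sign (or the score is 0).
                if (y[i] * score(w, x[i]) <= 0.0) {
                    w[0] += learningRate * y[i];
                    for (int j = 0; j < x[i].length; j++) {
                        w[j + 1] += learningRate * y[i] * x[i][j];
                    }
                    mistake = true;
                }
            }
            if (!mistake) {
                break; // a full pass without errors: the data is separated
            }
        }
        return w;
    }

    /** Bias plus the dot product of the remaining weights with x. */
    static double score(double[] w, double[] x) {
        double s = w[0];
        for (int j = 0; j < x.length; j++) {
            s += w[j + 1] * x[j];
        }
        return s;
    }

    static int predict(double[] w, double[] x) {
        return score(w, x) > 0.0 ? 1 : -1;
    }
}
```

The returned weight array is the whole model, so storing it after training and loading it at application time amounts to persisting that one array.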
[jira] Created: (MAHOUT-241) Example for perceptron
Example for perceptron -- Key: MAHOUT-241 URL: https://issues.apache.org/jira/browse/MAHOUT-241 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.3 Reporter: Isabel Drost Fix For: 0.3 The goal is to provide an end-to-end example based on the 20-newsgroups dataset to show how to get from a set of labelled training examples to a trained model that can later be reused. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-240) Parallel version of Perceptron
Parallel version of Perceptron -- Key: MAHOUT-240 URL: https://issues.apache.org/jira/browse/MAHOUT-240 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.3 Reporter: Isabel Drost Fix For: 0.3 So far Perceptron (as well as Winnow) training is still implemented to run w/o parallelization. The goal of this issue is to explore ways for parallelization and if possible to provide a parallel version, that is one that is based on map reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-85. Resolution: Fixed Finally committed. > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-85.patch, MAHOUT-85.patch, > perceptronWinnowTrainer.diff > > > Please find attached a first sketch for perceptron and winnow training. > Please look very, very carefully at the patch, as I added the heart of the > algorithms in the emergency room at Charite Berlin (after I broke my leg when > cycling to the Hadoop Get Together ;) ). > The patch does not yet feature unit tests nor is it parallelised. Currently > my plan is to set up an example with the webKb dataset, add unit tests to the > code and after that go parallel. I would like to get some feedback early on, > in addition I would feel a lot better, if a second and third pair of eyes had > a look at the code to make sure all obvious mistakes are out as early as > possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-231) Upgrade QM reports to use Clover 2.6
Upgrade QM reports to use Clover 2.6 Key: MAHOUT-231 URL: https://issues.apache.org/jira/browse/MAHOUT-231 Project: Mahout Issue Type: Task Components: Website Affects Versions: 0.3 Reporter: Isabel Drost Priority: Minor Fix For: 0.3 Atlassian has donated a license for a new Clover version. The reports provide more information and are easier to read. We should upgrade the site reports to use that version. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-85: --- Attachment: MAHOUT-85.patch The patch has tests added to the implementation. The additional abstraction proposed earlier is integrated. Distance measure is not configurable but corresponds to what was defined in the original algorithm formulations. The implementation currently is sequential-only. Still evaluating if and how it might be possible to parallelize. Missing so far: An example showing how to use training, how to store the resulting model and how to apply the model. Probably should be done in a new issue to keep this one focused on the algorithm itself. In addition I still have to at least add links from our wiki to the wikipedia pages on both algorithms. (Had some time left during the past few days: Screws in my knee are out now ;) ) > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-85.patch, MAHOUT-85.patch, > perceptronWinnowTrainer.diff > > > Please find attached a first sketch for perceptron and winnow training. > Please look very, very carefully at the patch, as I added the heart of the > algorithms in the emergency room at Charite Berlin (after I broke my leg when > cycling to the Hadoop Get Together ;) ). > The patch does not yet feature unit tests nor is it parallelised. Currently > my plan is to set up an example with the webKb dataset, add unit tests to the > code and after that go parallel. I would like to get some feedback early on, > in addition I would feel a lot better, if a second and third pair of eyes had > a look at the code to make sure all obvious mistakes are out as early as > possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-85: --- Attachment: MAHOUT-85.patch The patch has tests added to the implementation. The additional abstraction proposed earlier is integrated. Distance measure is not configurable but corresponds to what was defined in the original algorithm formulations. The implementation currently is sequential-only. Still evaluating if and how it might be possible to parallelize. Missing so far: An example showing how to use training, how to store the resulting model and how to apply the model. Probably should be done in a new issue to keep this one focused on the algorithm itself. In addition I still have to at least add links from our wiki to the wikipedia pages on both algorithms. (Had some time left during the past few days: Screws in my knee are out now ;) ) > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-85.patch, perceptronWinnowTrainer.diff > > > Please find attached a first sketch for perceptron and winnow training. > Please look very, very carefully at the patch, as I added the heart of the > algorithms in the emergency room at Charite Berlin (after I broke my leg when > cycling to the Hadoop Get Together ;) ). > The patch does not yet feature unit tests nor is it parallelised. Currently > my plan is to set up an example with the webKb dataset, add unit tests to the > code and after that go parallel. I would like to get some feedback early on, > in addition I would feel a lot better, if a second and third pair of eyes had > a look at the code to make sure all obvious mistakes are out as early as > possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792449#action_12792449 ] Isabel Drost commented on MAHOUT-210: - Forgot to include what I changed to make it work: Seems like the workspace directory on hudson is only accessible to users logged in to hudson. So I changed the job to stage the generated site to a publicly accessible directory and adjust the links accordingly. To get Clover to work I gave maven the path to the clover license on Hudson and issued report generation and aggregation before the site is generated. The maven parameters used for building:
-Dmaven.clover.license=$PATH - path to the clover license file
clean install - to clean the target directories and start building and locally installing the artifacts
clover:instrument clover:aggregate - generates the clover reports
site:site - generates the maven site report files and stores them under $module/target/site for review
site:stage -DstagingDirectory=/export/home/hudson/hudson/jobs/MahoutQM/site - stages the maven report files on a publicly readable directory
> Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-210.patch > > > We should use mvn site:site to generate code reports and publish them online > for users to review and developers to easily spot problems. > First version that still needs checks adjusted to our needs is available > online at: > http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html > Further discussion on-list at > http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-210. - Resolution: Fixed Links are working now and accessible without logging into hudson. What remains is refining the report configuration to our specific needs, but this can be done in a separate issue. > Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-210.patch > > > We should use mvn site:site to generate code reports and publish them online > for users to review and developers to easily spot problems. > First version that still needs checks adjusted to our needs is available > online at: > http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html > Further discussion on-list at > http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792019#action_12792019 ] Isabel Drost commented on MAHOUT-210: - Update: Clover tests are up now as well. Only problem: When not logged in to hudson, one is not allowed to access the workspace directory. In addition Hudson seems to be unable to pick up all bits and pieces of our maven site reports automatically. Currently working on modifying the task such that the report files get moved over to a publicly accessible directory. > Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-210.patch > > > We should use mvn site:site to generate code reports and publish them online > for users to review and developers to easily spot problems. > First version that still needs checks adjusted to our needs is available > online at: > http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html > Further discussion on-list at > http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791887#action_12791887 ] Isabel Drost commented on MAHOUT-210: - Checked in the current status of the report configuration files. Feel free to adjust any configuration that does not quite fit our standards yet. I tried to address those issues mentioned by Sean earlier in the mail thread. I set up a Hudson job to build the documentation and linked it such that it gets published through Hudson. The URLs for that:
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/core-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/examples-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/matrix-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/maven-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/taste-web-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/utils-reports/index.html
Those urls were activated according to the description of Bhuvaneswaran A on infrastruct...@apache:
1) Set up a Hudson job to generate the reports.
2) Log in to hud...@hudson.zones.apache.org and create a symbolic link:
{code}
$ sudo su - hudson
$ cd hudson/userContent
$ ln -s /export/home/hudson/hudson/jobs/Mahout\ QM/$PATH_TO_DOCS ./lucene-mahout/$MODULE-reports
{code}
3) Access via http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/$MODULE-reports/index.html
The site should be regenerated once a day. Once that is done today, those pages available on hudson should match those I already published on people.apache.org. About to add links to our project page to the reports (going to be a separate page in the developers section). Missing: Currently the clover test coverage reports are not yet being generated - I need to change the Hudson job to pick up the clover license file for that. 
> Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-210.patch > > > We should use mvn site:site to generate code reports and publish them online > for users to review and developers to easily spot problems. > First version that still needs checks adjusted to our needs is available > online at: > http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html > Further discussion on-list at > http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-224) Dependency Cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790658#action_12790658 ] Isabel Drost commented on MAHOUT-224: - Maven supports marking dependencies as needed for tests only (scope "test", which would be appropriate for JUnit), or as provided by the runtime environment (scope "provided", which might be appropriate for the Hadoop stuff that I think is needed only at compile time but is available on the Hadoop cluster when deploying Mahout, right?). This should reduce the number of jars that need to be distributed as well. But that can be addressed in a separate issue. > Dependency Cleanup > -- > > Key: MAHOUT-224 > URL: https://issues.apache.org/jira/browse/MAHOUT-224 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Drew Farris >Assignee: Drew Farris >Priority: Minor > Attachments: mahout-224.patch > > > In preparation for the binary release work described in MAHOUT-215, here's a > minor patch that does some cleanup on the poms. > The hadoop and junit dependency versions are now established using the > dependencyManagement section of the parent pom in mahout/maven/pom.xml > A large number of transitive dependencies from the hadoop pom are now > excluded there as well -- these were not necessary previously because the > hadoop dependency was hand-rolled and did not include them. With the update > to the hadoop 0.20.2-SNAPSHOT, they now become required. > Also, the parent pom no longer has mahout/pom.xml as its parent, this allows > binary packaging to be performed in mahout/pom.xml after the build of all of > the other sub-modules is complete. > Also, removed the javamail dependency -- was there a reason this was present? > Verified that build and unit tests complete.
[jira] Commented: (MAHOUT-217) Tidy up generated data after unit tests are run
[ https://issues.apache.org/jira/browse/MAHOUT-217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790656#action_12790656 ] Isabel Drost commented on MAHOUT-217: - Not only fpgrowth. I will take a closer look on Thursday, make a list and post it here. > Tidy up generated data after unit tests are run > --- > > Key: MAHOUT-217 > URL: https://issues.apache.org/jira/browse/MAHOUT-217 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.3 >Reporter: Isabel Drost > Fix For: 0.3
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790653#action_12790653 ] Isabel Drost commented on MAHOUT-220: - Before reorganizing code - could someone who is more familiar with the specific rules of the code-style used at Lucene double-check the exact checkstyle rules used for site-generation? I reused the checkstyle configuration that was already in Mahout-trunk (relaxing some of its rules) but am in doubt whether it really reflects our rules. > Mahout Bayes Code cleanup > - > > Key: MAHOUT-220 > URL: https://issues.apache.org/jira/browse/MAHOUT-220 > Project: Mahout > Issue Type: Improvement > Components: Classification >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-BAYES.patch > > > Following isabel's checkstyle, I am adding a whole slew of code cleanup with > the following exceptions > 1. Line length used is 120 instead of 80. > 2. static final log is kept as is. not LOG. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-210: Attachment: MAHOUT-210.patch The patch adds Clover, FindBugs, PMD, CPD and Maven dependency reports as well as Javadoc generation. After applying it, the site can be generated through mvn site:site. I have thrown out all general project information that is already available through our Forrest site. The plan is to run mvn clean install site:site site:deploy on a daily (maybe weekly?) basis on people.apache.org and publish the results there so they can be linked to from our site. > Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-210.patch
[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789312#action_12789312 ] Isabel Drost commented on MAHOUT-85: I am currently adding tests. I will commit once those are done and continue with a parallel version from there. > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: perceptronWinnowTrainer.diff > > > Please find attached a first sketch for perceptron and winnow training. > Please look very, very carefully at the patch, as I added the heart of the > algorithms in the emergency room at Charite Berlin (after I broke my leg when > cycling to the Hadoop Get Together ;) ). > The patch does not yet feature unit tests nor is it parallelised. Currently > my plan is to set up an example with the webKb dataset, add unit tests to the > code and after that go parallel. I would like to get some feedback early on; > in addition, I would feel a lot better if a second and third pair of eyes had > a look at the code to make sure all obvious mistakes are out as early as > possible.
[jira] Created: (MAHOUT-217) Tidy up generated data after unit tests are run
Tidy up generated data after unit tests are run --- Key: MAHOUT-217 URL: https://issues.apache.org/jira/browse/MAHOUT-217 Project: Mahout Issue Type: Improvement Affects Versions: 0.3 Reporter: Isabel Drost Fix For: 0.3 I tried to compile Mahout on people.apache.org yesterday: The build failed at first, because tests could not generate test data. The reason: Some tests tried to generate test data at /tmp//... - but those directories did exist already and belonged to Sean. Why? Probably because Sean had run the build earlier this year - but tests did not remove the data they generated. Proposed solution: Tests come with setup and with shutdown hooks. We should remove any data when a test is finished and shut down. Any thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
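The cleanup proposed above could look roughly like the following. This is only a minimal sketch under assumptions, not the fix that went into Mahout: the class name TestScratchDir and the "mahout-test-" prefix are made up for illustration, and it uses java.nio.file from a current JDK rather than the APIs available at the time. A test would create the directory in its setup method and call close() in its teardown method; because each run gets a uniquely named directory, leftovers owned by another user under /tmp can no longer block the build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: one scratch directory per test run, removed when the test is done. */
final class TestScratchDir implements AutoCloseable {
  private final Path dir;

  TestScratchDir(String prefix) throws IOException {
    // createTempDirectory picks a unique name, so runs by different users do not collide.
    this.dir = Files.createTempDirectory(prefix);
  }

  Path path() {
    return dir;
  }

  /** Deletes the directory and everything below it; safe to call more than once. */
  @Override
  public void close() throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order puts children before their parent directories.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  public static void main(String[] args) throws IOException {
    Path used;
    try (TestScratchDir scratch = new TestScratchDir("mahout-test-")) {
      used = scratch.path();
      // Stand-in for the data a test would generate.
      Files.write(used.resolve("points.txt"), "1.0 2.0\n".getBytes());
    }
    System.out.println("scratch dir still exists: " + Files.exists(used));
  }
}
```

Using try-with-resources (or a teardown hook that always runs) means the data is removed even when the test itself fails.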
[jira] Assigned: (MAHOUT-210) Publish code quality reports through maven
[ https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-210: --- Assignee: Isabel Drost > Publish code quality reports through maven > -- > > Key: MAHOUT-210 > URL: https://issues.apache.org/jira/browse/MAHOUT-210 > Project: Mahout > Issue Type: New Feature > Components: Website >Affects Versions: 0.1, 0.2 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3
[jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-11: -- Assignee: Drew Farris (was: Isabel Drost) Thanks. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss >Assignee: Drew Farris > Fix For: 0.3 > > Attachments: MAHOUT-11-all-cleanup-20091128.patch, > MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, > MAHOUT-11.patch > > > I file this as a bug, even though I'm not 100% sure it is one. In the current > code the information is exchanged via static fields (for example, distance > measure and thresholds for Canopies are static fields). Is it always true in > Hadoop that one job runs inside one JVM with exclusive access? I haven't seen > it anywhere in Hadoop documentation and my impression was that everything > uses JobConf to pass configuration to jobs, but jobs are configured on a > per-object basis (a job is an object, a mapper is an object and everything > else is basically an object). > If it's possible for two jobs to run in parallel inside one JVM then this is > a limitation and bug in our code that needs to be addressed.
[jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-11: -- Assignee: Isabel Drost > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: MAHOUT-11-all-cleanup-20091128.patch, > MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, > MAHOUT-11.patch
[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-11: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed. Thanks Drew for your help. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11-all-cleanup-20091128.patch, > MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, > MAHOUT-11.patch
[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788129#action_12788129 ] Isabel Drost commented on MAHOUT-11: I'll make the changes before committing - no need to submit a new patch version. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11-all-cleanup-20091128.patch, > MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, > MAHOUT-11.patch
[jira] Resolved: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-90. Resolution: Later Marked as "Later" - currently snapshots are published to the apache maven repository. At the moment that should be enough for users to play around with latest code. > Adding all scripts (for nightly build) to SVN repository. > - > > Key: MAHOUT-90 > URL: https://issues.apache.org/jira/browse/MAHOUT-90 > Project: Mahout > Issue Type: New Feature >Reporter: Edward J. Yoon >Priority: Minor > Fix For: 0.3 > > Attachments: mahout.tgz > > > I made below scripts for the hudson continuous integration service on my > hudson account. > mahout/hudsonBuildMahoutPatch.sh > mahout/processMahoutPatchEmail.sh > mahout/hudsonPatchQueueAdmin.sh > They will be modified by only me, so It should be handled via SVN. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786679#action_12786679 ] Isabel Drost commented on MAHOUT-85: It is just a sequential version of the algorithm. No parallelisation and no Hadoop involved. > Perceptron/Winnow Trainer > - > > Key: MAHOUT-85 > URL: https://issues.apache.org/jira/browse/MAHOUT-85 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.3 > > Attachments: perceptronWinnowTrainer.diff
[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786678#action_12786678 ] Isabel Drost commented on MAHOUT-90: I did add a hudson job to upload maven snapshots of our projects to the apache repository on a nightly basis. No idea however how building and publishing nightly releases should work at Apache. > Adding all scripts (for nightly build) to SVN repository. > - > > Key: MAHOUT-90 > URL: https://issues.apache.org/jira/browse/MAHOUT-90 > Project: Mahout > Issue Type: New Feature >Reporter: Edward J. Yoon >Assignee: Isabel Drost >Priority: Minor > Fix For: 0.3 > > Attachments: mahout.tgz
[jira] Assigned: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-90: -- Assignee: (was: Isabel Drost) > Adding all scripts (for nightly build) to SVN repository. > - > > Key: MAHOUT-90 > URL: https://issues.apache.org/jira/browse/MAHOUT-90 > Project: Mahout > Issue Type: New Feature >Reporter: Edward J. Yoon >Priority: Minor > Fix For: 0.3 > > Attachments: mahout.tgz
[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785985#action_12785985 ] Isabel Drost commented on MAHOUT-11: Applies cleanly and builds without unit test failures here. The changes all look good to me. Great work, Drew. One question though: in the TestMeanShift test (lines 301 and 304) you removed the canopyId adjustments. Could you please explain why that was necessary? I would like to commit this patch next week if no one objects. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11-all-cleanup-20091128.patch, > MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, > MAHOUT-11.patch
[jira] Created: (MAHOUT-210) Publish code quality reports through maven
Publish code quality reports through maven -- Key: MAHOUT-210 URL: https://issues.apache.org/jira/browse/MAHOUT-210 Project: Mahout Issue Type: New Feature Components: Website Affects Versions: 0.1, 0.2 Reporter: Isabel Drost Fix For: 0.3 We should use mvn site:site to generate code reports and publish them online for users to review and developers to easily spot problems. First version that still needs checks adjusted to our needs is available online at: http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html Further discussion on-list at http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782470#action_12782470 ] Isabel Drost commented on MAHOUT-11: Drew, go ahead then. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11-kmeans-cleanup.patch, > MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780476#action_12780476 ] Isabel Drost commented on MAHOUT-11: First of all, thanks for the review. Passing the output collector directly: yep, makes sense. Will change and resubmit the patch. Tests with real data: big thanks for that. Isabel > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11.patch
[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).
[ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-11: --- Attachment: MAHOUT-11.patch Not the original author of the source, but still managed to get the static fields out of the k-means clustering code. All unit-tests are still passing. However I would feel a lot better, if someone else double-checked the changes made. Looking at the code, I spotted some more points that could benefit from being revisited (e.g. usage of deprecated MapReduce APIs and introduction of status reports). But this should be done in a separate issue. > Static fields used throughout clustering code (Canopy, K-Means). > > > Key: MAHOUT-11 > URL: https://issues.apache.org/jira/browse/MAHOUT-11 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Dawid Weiss > Fix For: 0.3 > > Attachments: MAHOUT-11.patch
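To illustrate the kind of change discussed in this issue, here is a minimal sketch of moving thresholds from static fields into per-object state that is filled from a configuration. It is not the code from MAHOUT-11.patch and does not use Hadoop: SketchConf is a hypothetical stand-in for a job configuration such as JobConf, SketchCanopyMapper is a made-up mapper-like class, and the keys and default values are assumptions chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a job configuration; not the Hadoop JobConf API. */
final class SketchConf {
  private final Map<String, String> values = new HashMap<>();

  SketchConf set(String key, String value) {
    values.put(key, value);
    return this;
  }

  double getDouble(String key, double defaultValue) {
    String v = values.get(key);
    return v == null ? defaultValue : Double.parseDouble(v);
  }
}

/** Hypothetical mapper-like class: thresholds are instance fields, not static ones. */
final class SketchCanopyMapper {
  static final String T1_KEY = "sketch.canopy.t1"; // made-up configuration keys
  static final String T2_KEY = "sketch.canopy.t2";

  private double t1;
  private double t2;

  /** Reads the thresholds from the configuration of the job this object belongs to. */
  void configure(SketchConf conf) {
    t1 = conf.getDouble(T1_KEY, 3.0); // defaults are arbitrary example values
    t2 = conf.getDouble(T2_KEY, 1.5);
  }

  boolean withinLooseThreshold(double distance) {
    return distance < t1;
  }

  boolean withinTightThreshold(double distance) {
    return distance < t2;
  }
}

final class StaticFieldSketch {
  public static void main(String[] args) {
    // Two "jobs" in the same JVM, each with its own thresholds.
    SketchCanopyMapper jobA = new SketchCanopyMapper();
    jobA.configure(new SketchConf().set(SketchCanopyMapper.T1_KEY, "10.0"));

    SketchCanopyMapper jobB = new SketchCanopyMapper();
    jobB.configure(new SketchConf().set(SketchCanopyMapper.T1_KEY, "2.0"));

    // With static fields, the second configure() call would have overwritten the first.
    System.out.println("job A accepts distance 5.0: " + jobA.withinLooseThreshold(5.0));
    System.out.println("job B accepts distance 5.0: " + jobB.withinLooseThreshold(5.0));
  }
}
```

Because each object carries its own thresholds, two jobs sharing one JVM cannot overwrite each other's settings, which is the concern raised in the issue description.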
[jira] Resolved: (MAHOUT-200) Update information on Mahout site
[ https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-200. - Resolution: Fixed Fix Version/s: (was: 0.3) 0.2 Updated web page and fixed typo in release announcement. > Update information on Mahout site > - > > Key: MAHOUT-200 > URL: https://issues.apache.org/jira/browse/MAHOUT-200 > Project: Mahout > Issue Type: Improvement > Components: Website >Reporter: Isabel Drost >Assignee: Isabel Drost >Priority: Minor > Fix For: 0.2 > > Attachments: update_site.patch
[jira] Created: (MAHOUT-200) Update information on Mahout site
Update information on Mahout site - Key: MAHOUT-200 URL: https://issues.apache.org/jira/browse/MAHOUT-200 Project: Mahout Issue Type: Improvement Components: Website Reporter: Isabel Drost Priority: Minor Fix For: 0.3 Attachments: update_site.patch After several people had trouble finding the docs we provide in the wiki, I have created a "slightly" updated version of our website. I added a few links to wiki pages that might be of interest to potential Mahout users. I have uploaded the updated version to http://people.apache.org/~isabel/site so all of you can have a look. Will commit on Tuesday next week if no one objects.
[jira] Assigned: (MAHOUT-200) Update information on Mahout site
[ https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-200: --- Assignee: Isabel Drost > Update information on Mahout site > - > > Key: MAHOUT-200 > URL: https://issues.apache.org/jira/browse/MAHOUT-200 > Project: Mahout > Issue Type: Improvement > Components: Website >Reporter: Isabel Drost >Assignee: Isabel Drost >Priority: Minor > Fix For: 0.3 > > Attachments: update_site.patch
[jira] Updated: (MAHOUT-200) Update information on Mahout site
[ https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-200: Attachment: update_site.patch > Update information on Mahout site > - > > Key: MAHOUT-200 > URL: https://issues.apache.org/jira/browse/MAHOUT-200 > Project: Mahout > Issue Type: Improvement > Components: Website >Reporter: Isabel Drost >Assignee: Isabel Drost >Priority: Minor > Fix For: 0.3 > > Attachments: update_site.patch > > > After several people had trouble finding the docs we provide in the wiki, I > have created a "slightly" updated version of our website. I added a few links > to wiki pages that might be of interest to potential Mahout users. > I have uploaded the updated version to http://people.apache.org/~isabel/site > so all of you can have a look. Will commit on Tuesday next week if no one > objects. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767710#action_12767710 ] Isabel Drost commented on MAHOUT-171: - It was my own fault - I forgot to "svn add" the file after I applied and built with my own patch. Sorry :/ > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-171.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-171. - Resolution: Fixed Checked in. > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-171.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-138. - Resolution: Fixed Fix Version/s: (was: 0.3) 0.2 The last check-in changed the remaining classes, so at least grep no longer finds any usages of 'args\[' anywhere in our source code. > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766030#action_12766030 ] Isabel Drost commented on MAHOUT-157: - The patch looks good to me. Good work Robin. > Frequent Pattern Mining using Parallel FP-Growth > > > Key: MAHOUT-157 > URL: https://issues.apache.org/jira/browse/MAHOUT-157 > Project: Mahout > Issue Type: New Feature > Components: Frequent Itemset/Association Rule Mining >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, > MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, > MAHOUT-157-codecleanup-javadocs.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-CompactTransactionMapperFormat.patch, MAHOUT-157-final.patch, > MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, > MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch, > MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, > MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch, > MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, > MAHOUT-157-September-5.patch > > > Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764000#action_12764000 ] Isabel Drost commented on MAHOUT-138: - Robin, you briefly mentioned that for the bayes classifier it does not make sense to start up the different phases manually. Could you please detail which classes should not have main-methods attached to them and which ones should instead be used to start a training job in this issue? > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763997#action_12763997 ] Isabel Drost commented on MAHOUT-138: - Usage description for Taste examples is online in the wiki at: http://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples

Current status:

./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java

8 examples to go.

> Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763958#action_12763958 ] Isabel Drost commented on MAHOUT-138: - Sean - I just converted the implementation of the taste jobs in core - could you please have a look at the commandline option descriptions to check that everything is correct? http://cwiki.apache.org/confluence/display/MAHOUT/TasteCommandLine > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763929#action_12763929 ] Isabel Drost commented on MAHOUT-157: - Great work Robin. I just had a look at the code and only found some minor things:

ParallelFPGrowth
- It might be a good idea to reuse the DefaultOptionCreator to generate common options like input and output.
- I would love to see a help option as well.
- What happens if the user gives the wrong parameters? As a user, I would rather not be confronted with a stack trace, even though it is an example.
- Did you provide details on the wiki on how to run the algorithm, the assumptions it makes, the file format, and the behaviour if the output file already exists?
- The class is named ParallelFPGrowth, but if I read it correctly, it looks like the entry point for both the parallel and the sequential version. Maybe rename to FPGrowthJob?

FPGrowth
- line 98: is this really a recoverable error that does not cause inconsistencies later on? The log message says "this should not happen" - what if, against all odds, it does happen? Why not throw an unchecked exception?
- line 177: we should not have source code that is commented out in newly added code.
- The class seems to implement both top-k and vanilla FP-growth - would it make sense to split that up into different classes?
- generateFrequentPatterns: maybe it is just me, but I am always happy to find tiny little comments in methods that long that very briefly explain what the following code block is doing.

FPTreeDepthCache
- Maybe mention in the docs that the implementation is not threadsafe?

FPTree, Pattern
- Missing a class comment.

Pattern
- line 173: please remove code that is commented out.

AggregatorMapper
- The reporter is left unused.

Nice-To-Have: It would be nice to have package level comments in JavaDoc as well.

> Frequent Pattern Mining using Parallel FP-Growth > > > Key: MAHOUT-157 > URL: https://issues.apache.org/jira/browse/MAHOUT-157 > Project: Mahout > Issue Type: New Feature > Components: Frequent Itemset/Association Rule Mining >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, > MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, > MAHOUT-157-Oct-8.pfpgrowth.patch, > MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, > MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, > MAHOUT-157-September-5.patch > > > Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
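Two of the review points above (no stack trace for wrong parameters, and an unchecked exception rather than a log line for "should not happen" states) can be sketched in plain Java. This is only an illustration under assumptions: the class name DriverSketch, the "--name value" argument style and the option names are made up for the example and are not taken from the MAHOUT-157 patch or from Commons CLI.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical driver sketch; not the actual ParallelFPGrowth code. */
final class DriverSketch {

  static final String USAGE = "Usage: DriverSketch --input <path> --output <path> [--help]";

  /** Parses "--name value" pairs; malformed input raises an unchecked exception. */
  static Map<String, String> parseArgs(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (int i = 0; i < args.length; i += 2) {
      if (!args[i].startsWith("--") || i + 1 >= args.length) {
        throw new IllegalArgumentException("Expected '--name value' near: " + args[i]);
      }
      options.put(args[i].substring(2), args[i + 1]);
    }
    for (String required : new String[] {"input", "output"}) {
      if (!options.containsKey(required)) {
        throw new IllegalArgumentException("Missing required option --" + required);
      }
    }
    return options;
  }

  /** Turns bad parameters into a short message plus usage text and a non-zero code. */
  static int run(String[] args) {
    for (String arg : args) {
      if ("--help".equals(arg)) {
        System.out.println(USAGE);
        return 0;
      }
    }
    try {
      Map<String, String> options = parseArgs(args);
      System.out.println("input=" + options.get("input") + " output=" + options.get("output"));
      return 0;
    } catch (IllegalArgumentException e) {
      System.err.println("Error: " + e.getMessage());
      System.err.println(USAGE);
      return 1;
    }
  }

  public static void main(String[] args) {
    // Demo calls; a real driver would pass run(args) to System.exit.
    run(new String[] {"--input", "in", "--output", "out"});
    run(new String[] {"--input"});
  }
}
```

The parsing code fails fast with an exception, and only the outermost entry point decides how to present the failure to the user.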
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763455#action_12763455 ] Isabel Drost commented on MAHOUT-138: - Sean: sure, trying to get to it as soon as I find time to do so (hopefully tomorrow). > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763378#action_12763378 ] Isabel Drost edited comment on MAHOUT-138 at 10/8/09 12:15 AM: --- >From the classes above, I worked through up to the classification stuff. >Documentation is in the wiki at: >http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the >links with commandline in their name) and >http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again >the links with commandline in their name). Currently there are only examples left to convert as well as three classes containing main methods from the taste code: ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java ./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java was (Author: isabel): >From the classes above, I worked through up to the classification stuff. >Documentation is in the wiki at: >http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the >links with commandline in their name) and >http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again >the links with commandline in their name). Currently their are only examples left to convert as well as three classes containing main methods from the taste code: ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java ./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763378#action_12763378 ] Isabel Drost commented on MAHOUT-138: - From the classes above, I worked through up to the classification stuff. Documentation is in the wiki at: http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the links with commandline in their name) and http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again the links with commandline in their name).

Currently there are only examples left to convert as well as three classes containing main methods from the taste code:

./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java

> Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762898#action_12762898 ] Isabel Drost commented on MAHOUT-138: - Sean, you can easily follow what is going on with this issue on the subversion commit panel: https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761589#action_12761589 ] Isabel Drost commented on MAHOUT-171: - https://issues.apache.org/jira/browse/INFRA-2229 - is done as well. > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-171.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-171: Attachment: MAHOUT-171.patch

I moved the stuff in the buildtools directory over to our "maven" module. I think we have few enough configuration files to bundle all the maven/eclipse/intellij/checkstyle stuff in one module.

I added javadoc and source-jar download and a line to enable the checkstyle config for the eclipse plugin. However, so far the checkstyle config itself seems rather rudimentary to me - I can disable it if it has not been cleaned up yet.

I deleted the NOTICE and LICENSE files that obviously were copied over from another project to buildtools/src/main/resources/META-INF.

I converted the NOTICE and LICENSE file generation to use the maven remote resources plugin as recommended in the Apache parent pom. (Thanks to Jukka for clarifying how to include a custom NOTICE file.)

I added supplemental entries that describe our dependencies, so besides license and notice there is a DEPENDENCIES file being generated with information on project, license, project url and the like for all (transitive) dependencies of Mahout.

I think this patch requires a review of the changes and the generated artifacts to make sure everything is still where it belongs after the changes.

> Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-171.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-171: Attachment: (was: parent_pom.patch) > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-171.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-184) Code tweaks for .df.* code
[ https://issues.apache.org/jira/browse/MAHOUT-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761501#action_12761501 ] Isabel Drost commented on MAHOUT-184: - Looks good to me. Deneche, could you please also have a look at the patch to spot any issues early on? I would prefer using CLI for the job implementation (core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java), but that can be done in a later patch. > Code tweaks for .df.* code > -- > > Key: MAHOUT-184 > URL: https://issues.apache.org/jira/browse/MAHOUT-184 > Project: Mahout > Issue Type: Improvement >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 0.2 > > Attachments: Tweaks_to__df__.patch > > > This follows on my last email to the mailing list, and code inspection. It's > big enough I made a patch. No surprises I hope given the consensus on code > style and practice. Might be some good takeaways in here, or points for > further discussion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759864#action_12759864 ] Isabel Drost commented on MAHOUT-180: - That sounds great! Thank you for offering to donate the code. If you need any help porting the code or any other support, we are happy to help. You may also want to have a look at http://incubator.apache.org/ip-clearance/index.html which explains the legal steps for large code donations. > port Hadoop-ified Lanczos SVD implementation from decomposer > > > Key: MAHOUT-180 > URL: https://issues.apache.org/jira/browse/MAHOUT-180 > Project: Mahout > Issue Type: New Feature > Components: Matrix >Affects Versions: 0.2 >Reporter: Jake Mannix >Priority: Minor > > I wrote up a hadoop version of the Lanczos algorithm for performing SVD on > sparse matrices available at http://decomposer.googlecode.com/, which is > Apache-licensed, and I'm willing to donate it. I'll have to port over the > implementation to use Mahout vectors, or else add in these vectors as well. > Current issues with the decomposer implementation include: if your matrix is > really big, you need to re-normalize before decomposition: find the largest > eigenvalue first, and divide all your rows by that value, then decompose, or > else you'll blow over Double.MAX_VALUE once you've run too many iterations > (the L^2 norm of intermediate vectors grows roughly as > (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on > the lower end is better than blowing over MAX_VALUE). When this is ported to > Mahout, we should add in the capability to do this automatically (run a > couple iterations to find the largest eigenvalue, save that, then iterate > while scaling vectors by 1/max_eigenvalue). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
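The re-normalization idea in the issue description (estimate the largest eigenvalue with a few iterations, then scale by 1/max_eigenvalue so repeated multiplication does not exceed Double.MAX_VALUE) can be illustrated with a small dense example. This is a sketch under assumptions: it uses plain power iteration on a tiny symmetric matrix, not the Lanczos algorithm and not the decomposer or Mahout code, and the class and method names are made up for illustration.

```java
/** Illustrative sketch only; plain power iteration, not the Lanczos implementation. */
final class EigenScalingSketch {

  static double[] multiply(double[][] a, double[] v) {
    double[] result = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < v.length; j++) {
        sum += a[i][j] * v[j];
      }
      result[i] = sum;
    }
    return result;
  }

  static double norm(double[] v) {
    double sumOfSquares = 0.0;
    for (double x : v) {
      sumOfSquares += x * x;
    }
    return Math.sqrt(sumOfSquares);
  }

  /**
   * Power iteration estimate of the largest eigenvalue magnitude of a symmetric matrix.
   * Assumes the all-ones start vector is not orthogonal to the dominant eigenvector
   * and that the matrix is not zero.
   */
  static double largestEigenvalue(double[][] a, int iterations) {
    double[] v = new double[a.length];
    java.util.Arrays.fill(v, 1.0 / Math.sqrt(a.length)); // unit-length start vector
    double lambda = 0.0;
    for (int it = 0; it < iterations; it++) {
      double[] w = multiply(a, v);
      lambda = norm(w); // v has unit length, so |A v| estimates the eigenvalue
      for (int j = 0; j < w.length; j++) {
        v[j] = w[j] / lambda; // renormalize so the iterate never overflows
      }
    }
    return lambda;
  }

  static double[][] scale(double[][] a, double factor) {
    double[][] scaled = new double[a.length][];
    for (int i = 0; i < a.length; i++) {
      scaled[i] = new double[a[i].length];
      for (int j = 0; j < a[i].length; j++) {
        scaled[i][j] = a[i][j] * factor;
      }
    }
    return scaled;
  }

  /** Norm of A^steps * start, computed without any intermediate renormalization. */
  static double normAfterRepeatedMultiply(double[][] a, double[] start, int steps) {
    double[] v = start.clone();
    for (int s = 0; s < steps; s++) {
      v = multiply(a, v);
    }
    return norm(v);
  }

  public static void main(String[] args) {
    double[][] a = {{1000.0, 0.0}, {0.0, 1.0}};
    double[] start = {1.0, 1.0};
    double lambda = largestEigenvalue(a, 50);
    System.out.println("estimated largest eigenvalue: " + lambda);
    // Unscaled: the norm grows like 1000^steps and becomes non-finite.
    System.out.println("unscaled norm: " + normAfterRepeatedMultiply(a, start, 120));
    // Scaled by 1/lambda: the norm stays bounded; small components lose precision instead.
    System.out.println("scaled norm: " + normAfterRepeatedMultiply(scale(a, 1.0 / lambda), start, 120));
  }
}
```

As the description notes, the trade-off is deliberate: after scaling, components along small eigenvalues shrink towards zero, which costs precision at the low end but keeps the computation finite.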
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759468#action_12759468 ] Isabel Drost commented on MAHOUT-171: - Got account, changed the build to maven and tied the build to minerva (supports maven builds). Build runs smoothly (and successfully) again in hudson now. > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling
[ https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-134: Resolution: Fixed Status: Resolved (was: Patch Available) Committed to revision 816588. > [PATCH] Cluster decode error handling > - > > Key: MAHOUT-134 > URL: https://issues.apache.org/jira/browse/MAHOUT-134 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Robert Burrell Donkin >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, > mahout-cluster-format-error.patch > > > ATM the javadocs are unclear as to whether null is an acceptable return value > and callers do not null check the return value. However, the implementation > may return null in or throw other runtime exceptions when the format is not > correct. This makes it hard to diagnose when there's a problem with the > format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
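The problem described in this issue (a decode method that may return null or throw an unrelated runtime exception on malformed input) is commonly addressed by documenting a never-null contract and failing with a descriptive exception. The following is a minimal sketch under assumptions: the "id:v1,v2,..." string format and the names ClusterDecodeSketch and decodeCenter are invented for illustration and are not Mahout's actual cluster format or the code committed in revision 816588.

```java
/** Hypothetical decoder sketch; the "id:v1,v2,..." format is made up for illustration. */
final class ClusterDecodeSketch {

  /**
   * Decodes the center coordinates from a formatted cluster string.
   *
   * @return the decoded coordinates, never null
   * @throws IllegalArgumentException if the string does not match the expected format
   */
  static double[] decodeCenter(String formatted) {
    if (formatted == null) {
      throw new IllegalArgumentException("Cluster string must not be null");
    }
    int colon = formatted.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("Missing ':' separator in cluster string: " + formatted);
    }
    String[] parts = formatted.substring(colon + 1).split(",");
    double[] center = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      try {
        center[i] = Double.parseDouble(parts[i].trim());
      } catch (NumberFormatException e) {
        // Re-throw with the offending input so format problems are easy to diagnose.
        throw new IllegalArgumentException(
            "Bad coordinate '" + parts[i] + "' in cluster string: " + formatted, e);
      }
    }
    return center;
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(decodeCenter("C1:1.0, 2.5")));
  }
}
```

With a contract like this, callers do not need null checks, and a broken input file produces a message that names the offending string.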
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757119#action_12757119 ] Isabel Drost commented on MAHOUT-138: - Added changes to cli for FuzzyKMeans, Dirichlet and MeanShiftCanopy (see below for exact classes). Added documentation to the wiki (see links to command line client documentation at http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ) ./core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansJob.java ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletJob.java ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletDriver.java ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyJob.java ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java I have added one additional helper class (DefaultOptionCreator) that provides methods for creating the most common options (k clusters, input, output and the like) to avoid copying and ensure that the same option strings are used all over the code. > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-78) HBase RowResult/BatchUpdate access via Mahout Vector interface
[ https://issues.apache.org/jira/browse/MAHOUT-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757074#action_12757074 ] Isabel Drost commented on MAHOUT-78: What is the current status of this issue? Allen, did you have a chance to look into creating tests with a mocked HBase? > HBase RowResult/BatchUpdate access via Mahout Vector interface > -- > > Key: MAHOUT-78 > URL: https://issues.apache.org/jira/browse/MAHOUT-78 > Project: Mahout > Issue Type: New Feature >Reporter: Allen Day >Priority: Minor > Fix For: 0.2 > > Attachments: hbase.patch > > > An adapter class is attached that allows read/write operations on HBase rows > using the Vector interface. This allows, e.g. canopy clustering of rows in > an HBase table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757012#action_12757012 ] Isabel Drost commented on MAHOUT-171: - https://issues.apache.org/jira/browse/INFRA-2237 is the account related issue > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755377#action_12755377 ] Isabel Drost commented on MAHOUT-138: - Will do so and put some documentation of the command line parameters on the wiki while I go along. > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-138: Attachment: MAHOUT-138_fuzzyKMeansJob.patch Patch to convert FuzzyKMeansJob to use CLI for argument parsing. > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling
[ https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-134: Assignee: Isabel Drost Status: Patch Available (was: Reopened) See last attachment. Committing on Friday if no one objects. > [PATCH] Cluster decode error handling > - > > Key: MAHOUT-134 > URL: https://issues.apache.org/jira/browse/MAHOUT-134 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Robert Burrell Donkin >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, > mahout-cluster-format-error.patch > > > ATM the javadocs are unclear as to whether null is an acceptable return value > and callers do not null check the return value. However, the implementation > may return null in or throw other runtime exceptions when the format is not > correct. This makes it hard to diagnose when there's a problem with the > format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling
[ https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-134: Attachment: MAHOUT-134.patch Adjusted patch to current trunk version. > [PATCH] Cluster decode error handling > - > > Key: MAHOUT-134 > URL: https://issues.apache.org/jira/browse/MAHOUT-134 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Robert Burrell Donkin > Fix For: 0.2 > > Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, > mahout-cluster-format-error.patch > > > ATM the javadocs are unclear as to whether null is an acceptable return value > and callers do not null check the return value. However, the implementation > may return null in or throw other runtime exceptions when the format is not > correct. This makes it hard to diagnose when there's a problem with the > format. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-108. - Resolution: Won't Fix Superseded by FPGrowth patch (MAHOUT-157). > Implementation of Assoication Rules learning by Apriori algorithm > - > > Key: MAHOUT-108 > URL: https://issues.apache.org/jira/browse/MAHOUT-108 > Project: Mahout > Issue Type: Task > Environment: Linux, Hadoop-0.17.1 >Reporter: chao deng > Fix For: 0.2 > > Original Estimate: 504h > Remaining Estimate: 504h > > Target: Association Rules learning is a popular method for discovering > interesting relations between variables in large databases. Here, we would > implement the Apriori algorithm using Hadoop&Mapreduce parallel techniques. > Applications: Typically, association rules learning is used to discover > regularities between products in large scale transaction data in > supermarkets. For example, the rule "{onions, potatoes}->beef" found in the > sales data would indicate that if a customer buys onions and potatoes > together, he or she is likely to also buy beef. Such information can be used > as the basis for decisions about marketing activities. In addition to the > market basket analysis, association rules are employed today in many > application areas including Web usage mining, intrusion detection and > bioinformatics. > Apriori algorithm: Apriori is the best-known algorithm to mine association > rules. It uses a breadth-first search strategy to count the support of > itemsets and uses a candidate generation function which exploits the downward > closure property of support -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
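To make the terms in the quoted description concrete, here is a small single-machine sketch of support and confidence for the rule {onions, potatoes} -> beef. It is not an Apriori or Hadoop implementation and has nothing to do with any patch on this issue; the class AssociationRuleSketch, its methods and the four toy transactions are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustrating support and confidence on toy data. */
final class AssociationRuleSketch {

  private AssociationRuleSketch() { }

  /** Number of transactions that contain every item of the given itemset. */
  static int support(List<Set<String>> transactions, Set<String> itemset) {
    int count = 0;
    for (Set<String> transaction : transactions) {
      if (transaction.containsAll(itemset)) {
        count++;
      }
    }
    return count;
  }

  /** confidence(A -> B) = support(A union B) / support(A). */
  static double confidence(List<Set<String>> transactions,
                           Set<String> antecedent,
                           Set<String> consequent) {
    Set<String> union = new HashSet<>(antecedent);
    union.addAll(consequent);
    int antecedentSupport = support(transactions, antecedent);
    if (antecedentSupport == 0) {
      throw new IllegalArgumentException("Antecedent never occurs: " + antecedent);
    }
    return (double) support(transactions, union) / antecedentSupport;
  }

  /** Convenience factory for an itemset. */
  static Set<String> items(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }

  /** Made-up market baskets, used only for this example. */
  static List<Set<String>> toyTransactions() {
    return Arrays.asList(
        items("onions", "potatoes", "beef"),
        items("onions", "potatoes", "beef", "milk"),
        items("onions", "potatoes"),
        items("milk", "bread"));
  }
}
```

On the toy data, {onions, potatoes} occurs in three baskets and {onions, potatoes, beef} in two, so the rule has confidence 2/3. The same numbers show the downward closure property: a superset can never have higher support than any of its subsets, which is what lets Apriori discard candidates whose subsets are already infrequent.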
[jira] Commented: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API
[ https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754966#action_12754966 ] Isabel Drost commented on MAHOUT-167: - Hmm. Should we then defer this issue to a later version of Mahout? > Convert clustering code to Hadoop 0.20 API > -- > > Key: MAHOUT-167 > URL: https://issues.apache.org/jira/browse/MAHOUT-167 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.1 >Reporter: Jeff Eastman >Assignee: Jeff Eastman > Fix For: 0.2 > > > We need to update the clustering implementations to remove the deprecated > Hadoop API calls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-160) ClusterDumper utility to output all the clusters in all sequence files and points
[ https://issues.apache.org/jira/browse/MAHOUT-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754963#action_12754963 ] Isabel Drost commented on MAHOUT-160: - If that is committed - can we close the issue? > ClusterDumper utility to output all the clusters in all sequence files and > points > - > > Key: MAHOUT-160 > URL: https://issues.apache.org/jira/browse/MAHOUT-160 > Project: Mahout > Issue Type: Improvement >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.2 > > Attachments: mahout-160-dict.patch, mahout-160.patch > > > The current ClusterDumper utility takes a sequence file and points file as > input and prints the cluster vector along with the points that belong to the > clusters in the sequence file. This utility doesn't produce correct results > in case there are multiple sequence files and points. > To avoid this problem, all the point to cluster mappings need to be read > first and then iterate on the sequence files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754860#action_12754860 ] Isabel Drost commented on MAHOUT-171: - As for the Hudson subtask: To me, http://wiki.apache.org/general/Hudson reads like you need to be a PMC member to change the Hudson settings for Mahout? > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754855#action_12754855 ] Isabel Drost commented on MAHOUT-171: - Filed subtask to infra: https://issues.apache.org/jira/browse/INFRA-2229 > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754850#action_12754850 ] Isabel Drost commented on MAHOUT-138: - From a first glimpse at the code, it looks like there are quite a few other classes that need switching as well (grepped through the code base, so no guarantee that there are no false positives):
* ./core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansJob.java
* ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletJob.java
* ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletDriver.java
* ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyJob.java
* ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/bayes/BayesDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/bayes/BayesThetaNormalizerDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesNormalizedWeightDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesThetaDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesThetaNormalizerDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesWeightSummerDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureDriver.java
* ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesTfIdfDriver.java
* ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
* ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
* ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
* ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
* ./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java
* ./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
* ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
* ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
* ./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
* ./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java
I'd like to offer my help with some of these.
> Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754847#action_12754847 ] Isabel Drost edited comment on MAHOUT-138 at 9/14/09 12:17 AM: --- Hmm - the patch seems to be out of sync with trunk. was (Author: isabel): Hmm - the patch seems to be out of sync with trunk. From looking at it, it also seems it contains two changes - the CLI support and adding a RandomSeedGenerator? > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754847#action_12754847 ] Isabel Drost commented on MAHOUT-138: - Hmm - the patch seems to be out of sync with trunk. From looking at it, it also seems it contains two changes - the CLI support and adding a RandomSeedGenerator? > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
[ https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-172. - Resolution: Fixed fixed in revision 814495 > When running on a Hadoop cluster LDA fails with Caused by: > java.io.IOException: Cannot open filename /user/*/output/state-*/_logs > - > > Key: MAHOUT-172 > URL: https://issues.apache.org/jira/browse/MAHOUT-172 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: lda.patch > > > I tried running the reuters example of lda on a hadoop cluster today. Seems > like the implementation tries to read all files in output/state-* which fails > if in that directory "_logs" is found. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754694#action_12754694 ] Isabel Drost commented on MAHOUT-108: - Contacted (at least tried to) Chao Deng asking for the status and if I could help him submit the patch. Should we close this issue as won't fix or defer it to a later version if he does not respond? Or is anyone else up to implementing a patch for this task before 0.2? > Implementation of Assoication Rules learning by Apriori algorithm > - > > Key: MAHOUT-108 > URL: https://issues.apache.org/jira/browse/MAHOUT-108 > Project: Mahout > Issue Type: Task > Environment: Linux, Hadoop-0.17.1 >Reporter: chao deng > Fix For: 0.2 > > Original Estimate: 504h > Remaining Estimate: 504h > > Target: Association Rules learning is a popular method for discovering > interesting relations between variables in large databases. Here, we would > implement the Apriori algorithm using Hadoop&Mapreduce parallel techniques. > Applications: Typically, association rules learning is used to discover > regularities between products in large scale transaction data in > supermarkets. For example, the rule "{onions, potatoes}->beef" found in the > sales data would indicate that if a customer buys onions and potatoes > together, he or she is likely to also buy beef. Such information can be used > as the basis for decisions about marketing activities. In addition to the > market basket analysis, association rules are employed today in many > application areas including Web usage mining, intrusion detection and > bioinformatics. > Apriori algorithm: Apriori is the best-known algorithm to mine association > rules. It uses a breadth-first search strategy to count the support of > itemsets and uses a candidate generation function which exploits the downward > closure property of support -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-171: --- Assignee: Isabel Drost > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
[ https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-172: --- Assignee: Isabel Drost > When running on a Hadoop cluster LDA fails with Caused by: > java.io.IOException: Cannot open filename /user/*/output/state-*/_logs > - > > Key: MAHOUT-172 > URL: https://issues.apache.org/jira/browse/MAHOUT-172 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: lda.patch > > > I tried running the reuters example of lda on a hadoop cluster today. Seems > like the implementation tries to read all files in output/state-* which fails > if in that directory "_logs" is found. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
[ https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754692#action_12754692 ] Isabel Drost commented on MAHOUT-172: - Committing on Monday. > When running on a Hadoop cluster LDA fails with Caused by: > java.io.IOException: Cannot open filename /user/*/output/state-*/_logs > - > > Key: MAHOUT-172 > URL: https://issues.apache.org/jira/browse/MAHOUT-172 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Isabel Drost >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: lda.patch > > > I tried running the reuters example of lda on a hadoop cluster today. Seems > like the implementation tries to read all files in output/state-* which fails > if in that directory "_logs" is found. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752249#action_12752249 ] Isabel Drost commented on MAHOUT-157: - The formatting still looks a bit weird (spaces, line length etc.)
PFPGrowth, line 101, 183, 214 - please at least add a warning log message prior to deleting a pre-existing output path and document somewhere on the usage page that the default behaviour is to delete the output path if it exists. (I think that differs from the implementation in lda - we need to agree on consistent behaviour across mahout in such cases.) line 174 - the combiner is commented out?
ParallelCountingMapper - shouldn't you report status through the reporter during mapping?
ParallelFPGrowthMapper - line 87 - please do not use e.printStackTrace() but generate a regular log message and log the exception stack trace through the logger. I would love to see some more comments: the expected format of key and value, the expected content of glist and flist.
ParallelFPGrowthReducer line 111 - don't use e.printStackTrace.
AggregatorReducer - line 91 same
Attribute/TreeNode - The code is pretty clear, still I would love to see some more documentation on the overall data structure.
FrequentPatternMaxHeap - line 74 - Huh? Judging from the return value, you can omit the comparison against null here. (line 81 same.)
FPGrowth - line 42 - the method name should not start with a capital letter. 517 lines for implementing the whole algorithm in one class - looks a bit large to me. Is it possible to split it up? line 165 - converting from Integer to int and back again usually costs quite a bit of performance. Is there a way to rely on primitives only, or implement your own incrementable integer type? Btw., Integer.valueOf(1) should be replaced by a shared constant; note that java.lang.Integer itself has no ONE field. Type T - I think it would make the code more readable if T were given a clearer name, something like TransactionType? Otherwise you need to document what exactly T represents. line 229: Would reformulating the while condition as "while(!tempNode.childNodes.isEmpty()) { ... }" make the code clearer here? line 239: Where does the magic number 6 come from here? Define as a constant with a speaking name? the two generatedSinglePathPatterns methods look rather similar - is it possible to not copy the code but extract it into its own method or reuse one in the other? line 293 (and earlier): Where does the magic number 4 come from? Define constant with speaking name? line 506: Looks like a strange log message?
Concerning your idea of going the algorithm interface way for fpGrowth: If you can already make out what the interface should look like, I think that would be a good way to make it easier for future implementors of other frequent itemset algorithms.
> Frequent Pattern Mining using Parallel FP-Growth > > > Key: MAHOUT-157 > URL: https://issues.apache.org/jira/browse/MAHOUT-157 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.2 >Reporter: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, > MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-September-5.patch > > > Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
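The review remark about "line 165" (avoiding repeated conversion between Integer and int when counting) can be sketched as follows. This is only an illustration of an "incrementable integer type", not code from any MAHOUT-157 patch: the classes CountingSketch and MutableInt and the method countItems are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical example: counting with a mutable holder instead of boxed Integers. */
final class CountingSketch {

  private CountingSketch() { }

  /** Minimal incrementable int holder; updated in place, so no re-boxing per increment. */
  static final class MutableInt {
    private int value;

    void increment() {
      value++;
    }

    int get() {
      return value;
    }
  }

  /** Counts how often each item occurs; each map value is created once and then mutated. */
  static Map<String, MutableInt> countItems(String[] items) {
    Map<String, MutableInt> counts = new HashMap<>();
    for (String item : items) {
      MutableInt counter = counts.get(item);
      if (counter == null) {
        counter = new MutableInt();
        counts.put(item, counter);
      }
      // A Map<String, Integer> would unbox, add one, box and put again here.
      counter.increment();
    }
    return counts;
  }
}
```

Whether this is measurably faster depends on the workload and the JVM, so it is worth profiling before and after rather than assuming a gain.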
[jira] Updated: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
[ https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-172: Attachment: lda.patch The patch extends the url pattern to not match everything in the output directory but only stuff that starts with part* - since the lda job seems to run fine for me. > When running on a Hadoop cluster LDA fails with Caused by: > java.io.IOException: Cannot open filename /user/*/output/state-*/_logs > - > > Key: MAHOUT-172 > URL: https://issues.apache.org/jira/browse/MAHOUT-172 > Project: Mahout > Issue Type: Bug > Components: Clustering >Affects Versions: 0.1 >Reporter: Isabel Drost > Fix For: 0.2 > > Attachments: lda.patch > > > I tried running the reuters example of lda on a hadoop cluster today. Seems > like the implementation tries to read all files in output/state-* which fails > if in that directory "_logs" is found. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
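The idea behind the fix (only pick up entries whose names start with "part", so that "_logs" is skipped) can be sketched with the JDK's own glob matcher. This is not the content of lda.patch and does not use the Hadoop API; the class PartFileFilterSketch and the method selectPartFiles are hypothetical, and the file names in the usage note are made up.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of restricting a directory listing to part* entries. */
final class PartFileFilterSketch {

  private PartFileFilterSketch() { }

  /** Returns only the names matching the glob "part*", preserving input order. */
  static List<String> selectPartFiles(List<String> fileNames) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:part*");
    List<String> selected = new ArrayList<>();
    for (String name : fileNames) {
      // Names such as "_logs" do not start with "part" and are skipped.
      if (matcher.matches(Paths.get(name))) {
        selected.add(name);
      }
    }
    return selected;
  }
}
```

For example, given the names "part-00000", "_logs" and "part-00001", only the two part files are returned, which mirrors the behaviour the comment describes for the state directory.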
[jira] Created: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs - Key: MAHOUT-172 URL: https://issues.apache.org/jira/browse/MAHOUT-172 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.1 Reporter: Isabel Drost Fix For: 0.2 I tried running the reuters example of lda on a hadoop cluster today. Seems like the implementation tries to read all files in output/state-* which fails if in that directory "_logs" is found. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-171: Description: Opening a JIRA task to collect what has to be done for moving over to using apache version 5 parent pom (see also http://markmail.org/thread/ld26m3xxzoztqsk6 ). * Link Apache parent pom into our pom. * Update hudson to build via maven ( ? ). * File subtask at INFRA-1896 to include mahout in repository.apache.org was: Opening a JIRA task to collect what has to be done for moving over to using apache version 5 parent pom (see also http://markmail.org/thread/ld26m3xxzoztqsk6 ). * Link Apache parent pom into our pom. * Update hudson to build via maven (?). * File subtask at INFRA-1896 to include mahout in repository.apache.org > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven ( ? ). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org
[ https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-171: Attachment: parent_pom.patch Mahout Parent pom now includes reference to apache parent pom. > Move deployment to repository.apache.org > > > Key: MAHOUT-171 > URL: https://issues.apache.org/jira/browse/MAHOUT-171 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.1 >Reporter: Isabel Drost > Fix For: 0.2 > > Attachments: parent_pom.patch > > > Opening a JIRA task to collect what has to be done for moving over to using > apache version 5 parent pom (see also > http://markmail.org/thread/ld26m3xxzoztqsk6 ). >* Link Apache parent pom into our pom. >* Update hudson to build via maven (?). >* File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-171) Move deployment to repository.apache.org
Move deployment to repository.apache.org Key: MAHOUT-171 URL: https://issues.apache.org/jira/browse/MAHOUT-171 Project: Mahout Issue Type: Improvement Affects Versions: 0.1 Reporter: Isabel Drost Fix For: 0.2 Opening a JIRA task to collect what has to be done for moving over to using apache version 5 parent pom (see also http://markmail.org/thread/ld26m3xxzoztqsk6 ). * Link Apache parent pom into our pom. * Update hudson to build via maven (?). * File subtask at INFRA-1896 to include mahout in repository.apache.org -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742227#action_12742227 ] Isabel Drost commented on MAHOUT-124: - > Ant config was done to decrease the job jar file size. See first comment in > this issue point No:3 Ah, thanks for the reminder... > I need the new Eclipse Code formatter for that purpose. I am still using the > lucene code formatter, which is causing this break. Ok, I see. I guess that should be no show-stopper for the code to get in. > Docs... already on it! > Removed all hard coded map/Reduce task number limit from code. Will conform > to the cluster its being run on. Great! > Map/Reduce jobs doesnt do much leg work that it confuses reading the code, I > could factor them out as well if needed. I think we could leave that open for a later patch. > TODO: Algorithm will keep datastore internally. > TODO: add jar from latest trunk of HBase You could probably add a JIRA task to upgrade HBase to the official release as soon as that is out. Just so we do not forget that task. Other than that, to me it looks like this code could go in by the end of this week. If anyone else would like to have a look over the code before then and needs more time, please do tell. 
> Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, > MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742033#action_12742033 ] Isabel Drost commented on MAHOUT-157: - Patch applies to trunk but I run into problems when trying to get it to compile. I needed to apply MAHOUT-124 first. Then I got an error that indicated that you are using the Combinations class not only in the tests (where it is put by the diff) but also in the regular source code. After copying the class to src/main/java, I get the following error: MAHOUT-157/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ParallelFPGrowthReducer.java:[92,41] get(java.lang.String) in org.apache.mahout.common.Parameters cannot be applied to (java.lang.String,java.lang.String) I guess I have done something wrong when applying the patches one after another? Other than that I only have some general comments before going into more detail for the review: I am missing some documentation, both JavaDoc and package.html; at least a link to the original paper would be nice to have. PFPGrowth - seems like you do quite a lot of work in your constructor. I do not think it is a good idea to start map reduce jobs from within a constructor. Maybe I am reading something wrong here? Is it possible to break up the test into unit tests? I think that would make it far easier to change the code and to track down where a change actually broke it. AggregatorReducer, line 88: Please avoid calls to .printStackTrace() - usually those messages get lost when the system is in production. Better log the message with Logger.("your message", ) - maybe rethrow the exception if you cannot handle it properly. TreeNode - the class seems to contain public attributes only but no methods. Please at least explain which type of tree these nodes are supposed to be a part of. From the code alone I am not able to understand its usage... 
> Frequent Pattern Mining using Parallel FP-Growth > > > Key: MAHOUT-157 > URL: https://issues.apache.org/jira/browse/MAHOUT-157 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.2 >Reporter: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-157-August-6.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-inProgress-August-5.patch > > > Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
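The advice on .printStackTrace() in the comment above can be illustrated with a small sketch. It assumes java.util.logging purely to stay self-contained; the project itself may use a different logging library. The class AggregationStep and its methods are hypothetical and do not correspond to AggregatorReducer.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; only the catch block pattern matters here.
class AggregationStep {
    private static final Logger LOG = Logger.getLogger(AggregationStep.class.getName());

    /** Illustrative operation that reports bad input as an IOException. */
    static int parseCount(String value) throws IOException {
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            throw new IOException("Malformed count: " + value, e);
        }
    }

    /**
     * Logs the failure together with the exception (so the stack trace ends up
     * in the log) and rethrows, because this method cannot recover from it.
     */
    static int readCount(String value) {
        try {
            return parseCount(value);
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Could not read count from input", e);
            throw new IllegalStateException("Could not read count from input", e);
        }
    }
}
```

Passing the exception to the logger keeps the stack trace, and wrapping it as the cause of the rethrown exception keeps it visible to callers as well.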
[jira] Commented: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742031#action_12742031 ] Isabel Drost commented on MAHOUT-124: - Altogether really nice changes. The patch now applies to trunk without problems and builds (except for the missing hbase dependency). As this will be one of the last reviews, I tried to be a little more picky also with minor changes like added System.out.println and missing documentation... The ant config file (build.xml) contains changes that I see nowhere explained. Are they supposed to remain for the final patch? In the examples concerning the TestClassifier - it has imports for java.io.* and java.util.* - for the final patch could you please revert those to the specific imports? - could you please try to avoid reformatting the code as much as possible? It makes reading patches a whole lot easier. - in line 129 there is quite a bit of code commented out - better throw it out entirely? If needed later the snippet is still in jira. - line 224 - have the timing statistics been left in intentionally? utils/nlp/NGrams - The class is missing documentation. I guess your intention was to generate nGrams from a line of text, not the whole document? Otherwise holding document and nGrams both in memory seems a little bit much. There also seems to be no unit test for it? The classes implementing the caching algorithms are missing documentation. At least some /** {@inheritDoc} */ and a short comment on top that explains the purpose of the implementation would be nice. (Same applies for Pair and Parameters). CBayesNormalizerReducer still has HBase Dependencies - is it possible to factor them out? BayesThetaNormalizerDriver - setting the number of map tasks was commented out compared with trunk. Intentional? BayesClassifierMapper - lines 106, 110 and following: Shouldn't the log message be something like "Using ..." instead of "Testing ..."? 
classifier/bayes/interfaces/algorithm/Algorithm - you still give a pointer to the datastore with every method call to the Algorithm. Wouldn't the interface look cleaner if the Algorithm would hold a reference to an initialized datastore and use that for further requests? I don't think it is very likely that users will go to HBase for the first document to classify and to an InMemoryStore for the next document. bayes/algorithm/CBayesAlgorithm, BayesAlgorithm, bayes/common/ClassifierPriorityQueue - is missing some basic javaDoc. BayesTfIdfDriver, BayesTfIdfReducer, BayesWeightSummerReducer - I assume the dependency to HBase cannot be factored out? BayesFeatureMapper - there is a System.out.println in there... One last question: You reference hbase-0.20.0 which is not released yet. I guess we should include a prebuilt version in our lib directory and ship that until hbase has an official release to use? > Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, > MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
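The suggestion that the Algorithm should hold a reference to an initialized datastore, instead of receiving one with every call, could look roughly like the following. This is a minimal sketch under stated assumptions: the interface and class names are hypothetical stand-ins, the method signatures are invented for illustration, and the scoring is a plain sum of weights rather than the Bayes or CBayes computation from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the Datastore interface discussed in the review. */
interface Datastore {
    double getWeight(String label, String feature);
}

/** In-memory store; an HBase-backed store would implement the same interface. */
class InMemoryDatastore implements Datastore {
    private final Map<String, Double> weights = new HashMap<>();

    void putWeight(String label, String feature, double weight) {
        weights.put(label + "/" + feature, weight);
    }

    @Override
    public double getWeight(String label, String feature) {
        Double weight = weights.get(label + "/" + feature);
        return weight == null ? 0.0 : weight; // unknown features contribute nothing
    }
}

/** Algorithm that is given its datastore once, at construction time. */
class SummingAlgorithm {
    private final Datastore datastore;

    SummingAlgorithm(Datastore datastore) {
        this.datastore = datastore;
    }

    /** Sums the stored weights of the document's features for one label. */
    double documentWeight(String label, String[] document) {
        double sum = 0.0;
        for (String feature : document) {
            sum += datastore.getWeight(label, feature);
        }
        return sum;
    }
}
```

Callers then write `new SummingAlgorithm(store).documentWeight(label, document)` and never mention the store again, which also makes it impossible to mix two stores between calls by accident.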
[jira] Assigned: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-124: --- Assignee: Isabel Drost > Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Isabel Drost > Fix For: 0.2 > > Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, > MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733075#action_12733075 ] Isabel Drost commented on MAHOUT-124: - Just forgot two final notes: You should update your svn-checkout. The patch was done against an old revision of trunk and no longer applies cleanly. The patch was broken - line 988 in the patch file has a broken directive: @@ -48,67 +54,107 @@ should really be @@ -48,67 +54,105 @@ the effect being that "patch" assumes a hunk length of 107 lines, which makes it fail. Your hunk is only 105 lines, so better not lie to "patch" :) However, that one was trivial to fix. (Thanks to Thilo Fromm for helping me fix and explain that.) > Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil > Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, > MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
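The broken directive described above is a unified diff hunk header of the form @@ -oldStart,oldLength +newStart,newLength @@. The old length must equal the number of context and removed lines in the hunk, and the new length the number of context and added lines. The following sketch checks that relationship; the class HunkHeaderCheck is a hypothetical helper written for this illustration, not a tool used in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: verifies the line counts declared in a hunk header.
class HunkHeaderCheck {
    private static final Pattern HEADER =
        Pattern.compile("^@@ -(\\d+)(?:,(\\d+))? \\+(\\d+)(?:,(\\d+))? @@.*$");

    /** Returns true if the lengths declared in the header match the hunk body. */
    static boolean isConsistent(String header, String[] body) {
        Matcher m = HEADER.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a unified diff hunk header: " + header);
        }
        // An omitted length means 1 in the unified format.
        int declaredOld = m.group(2) == null ? 1 : Integer.parseInt(m.group(2));
        int declaredNew = m.group(4) == null ? 1 : Integer.parseInt(m.group(4));

        int actualOld = 0;
        int actualNew = 0;
        for (String line : body) {
            if (line.startsWith("\\")) {
                continue; // "\ No newline at end of file" counts for neither side
            }
            // Treat an empty line as context whose leading space was stripped.
            char kind = line.isEmpty() ? ' ' : line.charAt(0);
            if (kind == ' ') {
                actualOld++;
                actualNew++;
            } else if (kind == '-') {
                actualOld++;
            } else if (kind == '+') {
                actualNew++;
            } else {
                throw new IllegalArgumentException("Unexpected hunk line: " + line);
            }
        }
        return declaredOld == actualOld && declaredNew == actualNew;
    }
}
```

For the case in this thread, a header declaring a new length of 107 over a body that only accounts for 105 new-side lines would make the check return false.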
[jira] Commented: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733069#action_12733069 ] Isabel Drost commented on MAHOUT-124: - *ThetaNormalizerReducer, *BayesTFIDFReducer and *BayesSummerReducer still have dependencies to HBase - I think one can factor them out. Interface "Algorithm" - I think it might make sense to initialise the Algorithm with a reference to the datastore instead of injecting that reference with every method call. Other than that: Looks good. Bayes and CBayes look a lot cleaner now. Interface Datastore looks good. I like the separation of data handling and actual algorithm implementation. I would move Pair over to the utils package. Good work Robin. > Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil > Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, > MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731024#action_12731024 ] Isabel Drost commented on MAHOUT-108: - Hello Chao Deng, what is the status of your apriori patch? Isabel > Implementation of Assoication Rules learning by Apriori algorithm > - > > Key: MAHOUT-108 > URL: https://issues.apache.org/jira/browse/MAHOUT-108 > Project: Mahout > Issue Type: Task > Environment: Linux, Hadoop-0.17.1 >Reporter: chao deng > Fix For: 0.2 > > Original Estimate: 504h > Remaining Estimate: 504h > > Target: Association Rules learning is a popular method for discovering > interesting relations between variables in large databases. Here, we would > implement the Apriori algorithm using Hadoop&Mapreduce parallel techniques. > Applications: Typically, association rules learning is used to discover > regularities between products in large scale transaction data in > supermarkets. For example, the rule "{onions, patatoes}->beef" found in the > sales data would indicate that if a customer buys onions and potatoes > together, he or she is likely to also buy beef. Such information can be used > as the basis for decisions about marketing activities. In addition to the > market basket analysis, association rules are employed today in many > application areas including Web usage mining, intrusion detection and > bioinformatics. > Apriori algorithm: Apriori is the best-known algorithm to mine association > rules. It uses a breadth-first search strategy to counting the support of > itemsets and uses a candidate generation function which exploits the downward > closure property of support -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-124) Online Classification using HBase
[ https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728311#action_12728311 ] Isabel Drost commented on MAHOUT-124: - Some initial comments on the patch: org/apache/mahout/utils/Cache.java - I am missing some documentation for the methods. For interfaces, you can omit the public modifier on methods. For classes implementing this interface, you might want to at least use @inheritDoc to link back to the original documentation. Please also note in the class comment whether your implementation is safe to use in a multi-threaded context or not. org.apache.mahout.common.Model - To me it looks a bit weird to add a dependency to HBase directly to the model. I would prefer the HBase implementation to be less tightly coupled with the core code. Currently it looks like the model is really doing two tasks at once: Implementing an in-memory-model as well as an HBase model. I think it should be possible to refactor the code such that the two can be separated into distinct classes that can then be used interchangeably. My first guess would be that the strategy pattern should be helpful with this task. You probably will have to refactor CBayesModel and BayesModel as well. The same applies to org/apache/mahout/classifier/Classify.java and CBayesModel, Model, BayesTfIdfDriver, BayesTfIDFReducer, BayesWeightSummerReducer. org.apache.mahout.classifier.cbase - I really like your additions for reporting progress back to Hadoop. I would suggest splitting these from the patch, opening a separate Issue and attaching the changes there. This would keep this patch more focussed on the original task of adding HBase support. org.apache.mahout.classifier.cbase.CBayesModel - Please remove the code you commented out if you do not need it anymore. When catching an IOException you should at least write a warning log message (e.g. line 60). 
> Online Classification using HBase > - > > Key: MAHOUT-124 > URL: https://issues.apache.org/jira/browse/MAHOUT-124 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.2 >Reporter: Robin Anil > Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch > > > # Batch classification of flat file documents and flat file model: > # Storing the model in HBase and the end of Model Building Map/Reduce > stages > # Using the model stored in HBase create an interface (both command > line and web service) to classify a give document > # Using the model stored in HBase, batch classify documents stored on > the HDFS -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
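The documentation points raised for Cache.java (method docs on the interface, no explicit public modifier, {@inheritDoc} on implementations, a thread-safety note in the class comment) can be sketched as follows. The interface and the LRU implementation here are hypothetical and written for illustration; they are not the classes from the patch, and the method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache interface; interface methods are implicitly public. */
interface Cache<K, V> {
    /** Returns the value cached for the key, or null if there is none. */
    V get(K key);

    /** Stores the value under the key, possibly evicting another entry. */
    void set(K key, V value);

    /** Returns the number of entries currently held. */
    int size();
}

/**
 * Least-recently-used cache with a fixed capacity.
 *
 * <p>Not thread-safe: callers sharing an instance between threads must
 * synchronize externally.
 */
class LruCache<K, V> implements Cache<K, V> {
    private final Map<K, V> entries;

    LruCache(final int capacity) {
        // accessOrder=true keeps the least recently accessed entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** {@inheritDoc} */
    @Override
    public V get(K key) {
        return entries.get(key);
    }

    /** {@inheritDoc} */
    @Override
    public void set(K key, V value) {
        entries.put(key, value);
    }

    /** {@inheritDoc} */
    @Override
    public int size() {
        return entries.size();
    }
}
```

The eviction policy is delegated to LinkedHashMap in access order, so the class comment is the only place that needs to state the threading contract.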