[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms

2010-02-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-281:


Status: Patch Available  (was: Open)

> scm urls are wrong in the poms
> --
>
> Key: MAHOUT-281
> URL: https://issues.apache.org/jira/browse/MAHOUT-281
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-281.diff
>
>
> The scm urls in the poms are wrong. This must be fixed before running the 
> release plugin to make an 0.3 release.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-281) scm urls are wrong in the poms

2010-02-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-281:


Attachment: MAHOUT-281.diff

Changed scm connection strings. (Needed a comparatively simple example to show 
students at HPI how svn diff, patch and JIRA work together.)

> scm urls are wrong in the poms
> --
>
> Key: MAHOUT-281
> URL: https://issues.apache.org/jira/browse/MAHOUT-281
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-281.diff
>
>
> The scm urls in the poms are wrong. This must be fixed before running the 
> release plugin to make an 0.3 release.




[jira] Commented: (MAHOUT-262) Writable for labeled vectors for supervised learning algorithms

2010-01-22 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803690#action_12803690
 ] 

Isabel Drost commented on MAHOUT-262:
-

Should be possible to apply the patch with -p1 instead of -p0 to remove the a/b 
directories.

> Writable for labeled vectors for supervised learning algorithms
> ---
>
> Key: MAHOUT-262
> URL: https://issues.apache.org/jira/browse/MAHOUT-262
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Olivier Grisel
> Fix For: 0.3
>
> Attachments: MAHOUT-262-1.patch
>
>
> Implement two new classes:
>  - SingleLabelVectorWritable for singly classified vectorized data item (one 
> and only one label index per instance)
>  - MultiLabelVectorWritable for multi categorized vectorized data item (0 or 
> more category indexes per instance)
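
The single-label variant described above could look roughly like the following sketch. Only the class name is taken from the issue description. The dense double[] payload, the field order on the wire, and the local stand-in interface (which mirrors the write/readFields pair of Hadoop's Writable so that the example compiles without Hadoop) are assumptions for illustration, not what MAHOUT-262-1.patch does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of Hadoop's Writable interface,
// so the sketch compiles without Hadoop on the classpath.
interface WritableSketch {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical wire layout: label index, then vector length, then the values.
final class SingleLabelVectorWritableSketch implements WritableSketch {
  private int label;
  private double[] values = new double[0];

  SingleLabelVectorWritableSketch() { }

  SingleLabelVectorWritableSketch(int label, double[] values) {
    this.label = label;
    this.values = values.clone();
  }

  int getLabel() { return label; }

  double[] getValues() { return values.clone(); }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(label);
    out.writeInt(values.length);
    for (double v : values) {
      out.writeDouble(v);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    label = in.readInt();
    values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = in.readDouble();
    }
  }

  // Serializes to a byte array and reads the bytes back into a fresh instance.
  static SingleLabelVectorWritableSketch roundTrip(SingleLabelVectorWritableSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      SingleLabelVectorWritableSketch copy = new SingleLabelVectorWritableSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SingleLabelVectorWritableSketch copy =
        roundTrip(new SingleLabelVectorWritableSketch(3, new double[] {1.0, 0.0, 2.5}));
    System.out.println("label after round trip: " + copy.getLabel());
  }
}
```

A multi-label variant would presumably write a count followed by that many label indexes in place of the single int.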




[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings

2010-01-21 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-246:


Resolution: Fixed
  Assignee: Olivier Grisel
Status: Resolved  (was: Patch Available)

Patch applies cleanly with -p1,  all tests still work, changes look good. 
Committed in revision 901791.

> upgrade to new lucene TokenStream API to cleanup deprecation warnings
> -
>
> Key: MAHOUT-246
> URL: https://issues.apache.org/jira/browse/MAHOUT-246
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Olivier Grisel
>Assignee: Olivier Grisel
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-246-2.patch
>
>
> The attached patch uses the new ts.incrementToken() / TermAttribute API 
> instead of the deprecated manual Token handling.
> It also replaces two occurrences of the deprecated "new StandardAnalyzer()" 
> with the more explicit "new StandardAnalyzer(Version.LUCENE_CURRENT)".




[jira] Commented: (MAHOUT-242) LLR Collocation Identifier

2010-01-21 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803381#action_12803381
 ] 

Isabel Drost commented on MAHOUT-242:
-

{quote}
I am not worried about them at this point.
{quote}

Also not very worried - probably should have indicated that basically 
everything I found could be filed as "trivial, minor or style question only"...

> LLR Collocation Identifier
> --
>
> Key: MAHOUT-242
> URL: https://issues.apache.org/jira/browse/MAHOUT-242
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.3
>Reporter: Drew Farris
>Priority: Minor
> Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, 
> mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-0
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.




[jira] Commented: (MAHOUT-264) Make mahout-math compatible with Java 1.5 (bytecode and standard library).

2010-01-21 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803281#action_12803281
 ] 

Isabel Drost commented on MAHOUT-264:
-

The changes to the pom look good.

But why are the changes to Sorting.java and Arrays.java needed?

> Make mahout-math compatible with Java 1.5 (bytecode and standard library).
> --
>
> Key: MAHOUT-264
> URL: https://issues.apache.org/jira/browse/MAHOUT-264
> Project: Mahout
>  Issue Type: Wish
>  Components: Math
>Reporter: Dawid Weiss
>Assignee: Benson Margulies
>Priority: Minor
> Attachments: MAHOUT-264.patch
>
>





[jira] Commented: (MAHOUT-217) Tidy up generated data after unit tests are run

2010-01-21 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803276#action_12803276
 ] 

Isabel Drost commented on MAHOUT-217:
-


The test files I found that create but do not delete data in the tmp directory:

./utils/src/test/java/org/apache/mahout/utils/vectors/io/VectorWriterTest.java
./utils/src/test/java/org/apache/mahout/utils/vectors/SequenceFileVectorIterableTest.java
./core/src/test/java/org/apache/mahout/classifier/bayes/BayesFileFormatterTest.java
./core/src/test/java/org/apache/mahout/cf/taste/impl/model/file/FileDataModelTest.java



> Tidy up generated data after unit tests are run
> ---
>
> Key: MAHOUT-217
> URL: https://issues.apache.org/jira/browse/MAHOUT-217
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Isabel Drost
> Fix For: 0.3
>
>
> I tried to compile Mahout on people.apache.org yesterday: The build failed at 
> first, because tests could not generate test data. The reason: Some tests 
> tried to generate test data at /tmp//... - but those directories 
> did exist already and belonged to Sean. Why? Probably because Sean had run 
> the build earlier this year - but tests did not remove the data they 
> generated.
> Proposed solution: Tests come with setup and with shutdown hooks. We should 
> remove any data when a test is finished and shut down.
> Any thoughts?
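
The proposed setup/teardown cleanup can be sketched as follows. This is an illustration under assumptions, not code from the Mahout tests: the class and method names are hypothetical, and it relies on java.nio.file, which only became available with Java 7, after this thread was written. A unique per-run directory also sidesteps the original failure, where a fixed path under /tmp already existed and belonged to another user.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class TempDataCleanupSketch {

  // What a test's setup hook could do: create a directory unique to this run.
  static Path createWorkDir() {
    try {
      return Files.createTempDirectory("mahout-test-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Stands in for whatever data a test generates.
  static Path writeSample(Path workDir) {
    try {
      return Files.write(workDir.resolve("testdata.txt"),
          "generated".getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // What a test's teardown hook could do: delete children before parents.
  static void deleteRecursively(Path root) {
    if (!Files.exists(root)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(root)) {
      // Reverse lexicographic order visits files before their directories.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path workDir = createWorkDir();
    try {
      writeSample(workDir);
    } finally {
      deleteRecursively(workDir); // runs even if the "test body" throws
    }
    System.out.println("work dir still exists: " + Files.exists(workDir));
  }
}
```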




[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-21 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803275#action_12803275
 ] 

Isabel Drost commented on MAHOUT-237:
-

Hmm, Robin, your last comment is "ok. done"; however, the issue is still open?

> Map/Reduce Implementation of Document Vectorizer
> 
>
> Key: MAHOUT-237
> URL: https://issues.apache.org/jira/browse/MAHOUT-237
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
> DictionaryVectorizer.patch, SparseVector-VIntWritable.patch
>
>
> Current Vectorizer uses Lucene Index to convert documents into SparseVectors
> Ted is working on a Hash based Vectorizer which can map features into Vectors 
> of fixed size and sum it up to get the document Vector
> This is a pure bag-of-words based Vectorizer written in Map/Reduce. 
> The input document is in SequenceFile . with key = docid, value = 
> content
> First Map/Reduce over the document collection and generate the feature counts.
> Second Sequential pass reads the output of the map/reduce and converts them 
> to SequenceFile where key=feature, value = unique id 
> Second stage should create shards of features of a given split size
> Third Map/Reduce over the document collection, using each shard and create 
> Partial(containing the features of the given shard) SparseVectors 
> Fourth Map/Reduce over partial shard, group by docid, create full document 
> Vector
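
The passes described above can be sketched in memory as follows. This is a single-machine analogue for illustration only: it is not the code in DictionaryVectorizer.patch, all names are hypothetical, tokenization is a plain whitespace split, and the sharding of the dictionary plus the merge of partial vectors (stages three and four) is collapsed into one lookup per document.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DictionaryVectorizerSketch {

  // Assumption: lower-cased whitespace tokenization stands in for a real analyzer.
  static List<String> tokenize(String content) {
    String trimmed = content.trim().toLowerCase();
    if (trimmed.isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.asList(trimmed.split("\\s+"));
  }

  // Pass 1: feature counts over the whole collection (docid -> content).
  static Map<String, Integer> featureCounts(Map<String, String> docs) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String content : docs.values()) {
      for (String token : tokenize(content)) {
        counts.merge(token, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Pass 2: assign each feature a unique id (here: its rank in sorted order).
  static Map<String, Integer> buildDictionary(Map<String, Integer> counts) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (String feature : counts.keySet()) {
      dictionary.put(feature, dictionary.size());
    }
    return dictionary;
  }

  // Passes 3 and 4 collapsed: sparse vector as feature id -> term frequency.
  static Map<Integer, Integer> vectorize(String content, Map<String, Integer> dictionary) {
    Map<Integer, Integer> vector = new TreeMap<>();
    for (String token : tokenize(content)) {
      Integer id = dictionary.get(token);
      if (id != null) {
        vector.merge(id, 1, Integer::sum);
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    Map<String, String> docs = new LinkedHashMap<>();
    docs.put("d1", "a b a");
    docs.put("d2", "b c");
    Map<String, Integer> dictionary = buildDictionary(featureCounts(docs));
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      System.out.println(doc.getKey() + " -> " + vectorize(doc.getValue(), dictionary));
    }
  }
}
```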




[jira] Commented: (MAHOUT-242) LLR Collocation Identifier

2010-01-21 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803274#action_12803274
 ] 

Isabel Drost commented on MAHOUT-242:
-

First of all, thanks for the patch. The code looks good so far, patch applies 
cleanly and builds w/o problems. Some initial comments and questions I had when 
reading it: 
 
CollocMapper, Line 66: If I read your implementation correctly, this means that 
documents are always read fully into memory, right? So we would have to assume 
that the ngramCollector only runs over documents that fit into main memory, and 
we would be unable to process larger documents. I am wondering whether this is 
an issue at all, and if so, whether there is any way around that. 
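
One possible way around it, sketched under assumptions: if the tokens can be consumed as a stream, n-grams can be collected with a sliding window that holds at most n tokens at a time. The class and method names below are hypothetical and this is not the code in MAHOUT-242.patch. The map of counts still grows with the number of distinct n-grams, so only the raw document text is kept out of memory.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

final class StreamingNGramSketch {

  // Counts n-grams from a token stream while buffering at most n tokens.
  static Map<String, Integer> countNGrams(Iterator<String> tokens, int n) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be >= 1");
    }
    Map<String, Integer> counts = new TreeMap<>();
    Deque<String> window = new ArrayDeque<>(n);
    while (tokens.hasNext()) {
      window.addLast(tokens.next());
      if (window.size() > n) {
        window.removeFirst(); // slide the window forward by one token
      }
      if (window.size() == n) {
        counts.merge(String.join(" ", window), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // In a real job the iterator would wrap a tokenizer over a Reader.
    Iterator<String> tokens =
        Arrays.asList("the", "quick", "the", "quick", "fox").iterator();
    System.out.println(countNGrams(tokens, 2));
  }
}
```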
 
Gram, Line 192: You can omit the "else" clauses when the "if" branch already 
returns its result to the caller; however, this is a question of style. I was 
also wondering why, in line 177, you did not write "this.position != other.position". 
 
NGramCollector, Line 47 (and a few others): Shouldn't we avoid using deprecated 
APIs instead of suppressing deprecation warnings? 
 
LLRReducer, Line 143: How about making the method package private if it should 
be used in unit tests only anyway? 
Line 106: Would be nice to have an additional counter for the skipped grams. 
 
 
I agree with you that things like sentence boundary detection and more 
sophisticated tokenization should be left as work for an additional issue. 
 
Jake, it would be great if you could take a closer look to verify that this is 
roughly the pipeline you had in mind in the referenced e-mail threads, and 
mention anything that might still be missing.

> LLR Collocation Identifier
> --
>
> Key: MAHOUT-242
> URL: https://issues.apache.org/jira/browse/MAHOUT-242
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.3
>Reporter: Drew Farris
>Priority: Minor
> Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, 
> mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-0
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-16 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280
 ] 

Isabel Drost commented on MAHOUT-153:
-

Welcome to Mahout. Thanks for stepping up and volunteering to take over the 
work for this issue.

> Implement kmeans++ for initial cluster selection in kmeans
> --
>
> Key: MAHOUT-153
> URL: https://issues.apache.org/jira/browse/MAHOUT-153
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.2
> Environment: OS Independent
>Reporter: Panagiotis Papadimitriou
> Fix For: 0.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.
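
The design outline above could be sketched like this. SeedGenerator and KMeansPlusPlusSeedGenerator are the names proposed in the description; the method signature, the use of double[] for points, and the squared Euclidean distance are assumptions for illustration. The seeding rule itself is the standard k-means++ one: the first seed is drawn uniformly, and each further seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical shape of the proposed abstract base class.
abstract class SeedGeneratorSketch {
  abstract List<double[]> chooseSeeds(List<double[]> points, int k, Random rng);

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}

final class KMeansPlusPlusSeedGeneratorSketch extends SeedGeneratorSketch {

  @Override
  List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
    if (k < 1 || k > points.size()) {
      throw new IllegalArgumentException("k must be between 1 and the number of points");
    }
    List<double[]> seeds = new ArrayList<>();
    seeds.add(points.get(rng.nextInt(points.size()))); // first seed: uniform
    double[] minDist = new double[points.size()];
    Arrays.fill(minDist, Double.POSITIVE_INFINITY);

    while (seeds.size() < k) {
      // Update each point's squared distance to its nearest seed.
      double[] newest = seeds.get(seeds.size() - 1);
      double total = 0.0;
      for (int i = 0; i < points.size(); i++) {
        minDist[i] = Math.min(minDist[i], squaredDistance(points.get(i), newest));
        total += minDist[i];
      }
      if (total == 0.0) {
        // Every point coincides with a seed already; fall back to a uniform draw.
        seeds.add(points.get(rng.nextInt(points.size())));
        continue;
      }
      // Sample an index with probability minDist[i] / total.
      double threshold = rng.nextDouble() * total;
      double cumulative = 0.0;
      int chosen = -1;
      for (int i = 0; i < points.size(); i++) {
        cumulative += minDist[i];
        if (minDist[i] > 0.0) {
          chosen = i; // last point with positive weight guards against rounding
        }
        if (cumulative > threshold) {
          break;
        }
      }
      seeds.add(points.get(chosen));
    }
    return seeds;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {0, 0}, new double[] {0, 0.1},
        new double[] {10, 10}, new double[] {10, 10.1});
    for (double[] seed : new KMeansPlusPlusSeedGeneratorSketch()
        .chooseSeeds(points, 2, new Random(42))) {
      System.out.println(Arrays.toString(seed));
    }
  }
}
```

Because a point that is already a seed has weight zero, it cannot be drawn again unless all points coincide.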




[jira] Updated: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-244:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch applies cleanly and looks good, project builds with it, unit test is 
included. Committed at revision 899157.

> Add root log-likelihood method to LogLikehood class.
> 
>
> Key: MAHOUT-244
> URL: https://issues.apache.org/jira/browse/MAHOUT-244
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.3
>Reporter: Drew Farris
>Assignee: Drew Farris
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-244.patch
>
>
> Per discussion at: 
> http://www.lucidimagination.com/search/document/6dc8709e65a7ced1/llr_scoring_question
> This patch adds a method for root log-likelihood calculation to the existing 
> LogLikelihood class + provides a unit test based on Shashi's numbers.
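
For readers unfamiliar with the term, here is a minimal sketch of a root log-likelihood calculation on a 2x2 contingency table. It is not the code in MAHOUT-244.patch: the class and method names are hypothetical, and the sign convention (negative when the first row's rate of k11 is lower than the second row's rate of k21) is an assumption about what "root" is meant to add over the plain ratio.

```java
final class RootLogLikelihoodSketch {

  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy: N*ln(N) - sum(k*ln(k)) over the given counts.
  static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      parts += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - parts;
  }

  // G^2 statistic: 2 * (rowEntropy + columnEntropy - matrixEntropy).
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating-point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  // Square root of the ratio, signed by the direction of the association.
  static double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double root = Math.sqrt(logLikelihoodRatio(k11, k12, k21, k22));
    if ((double) k11 / (k11 + k12) < (double) k21 / (k21 + k22)) {
      root = -root;
    }
    return root;
  }

  public static void main(String[] args) {
    System.out.println(rootLogLikelihoodRatio(10, 0, 0, 10));
    System.out.println(rootLogLikelihoodRatio(0, 10, 10, 0));
  }
}
```

The signed root keeps the direction of the association, which the plain ratio discards.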




[jira] Assigned: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-244:
---

Assignee: Drew Farris

> Add root log-likelihood method to LogLikehood class.
> 
>
> Key: MAHOUT-244
> URL: https://issues.apache.org/jira/browse/MAHOUT-244
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.3
>Reporter: Drew Farris
>Assignee: Drew Farris
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-244.patch
>
>
> Per discussion at: 
> http://www.lucidimagination.com/search/document/6dc8709e65a7ced1/llr_scoring_question
> This patch adds a method for root log-likelihood calculation to the existing 
> LogLikelihood class + provides a unit test based on Shashi's numbers.




[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer

2010-01-10 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798524#action_12798524
 ] 

Isabel Drost commented on MAHOUT-85:


No, sorry. That was me committing a change that I made for MAHOUT-240 - 
reverted it.

There are no Driver programs yet: This is only the sequential version. 
The model should be stored after training and loaded at application time. I 
have deferred implementing an end-to-end example to MAHOUT-241. Currently the 
implementation only provides for the training logic.
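
For orientation, sequential training logic of this kind can be sketched as below. This is the textbook perceptron update rule with hypothetical names, not the code committed for MAHOUT-85; Winnow, model storage and any driver are left out, in line with the scope described above.

```java
final class PerceptronTrainerSketch {
  private final double[] weights;
  private final double learningRate;
  private double bias;

  PerceptronTrainerSketch(int dimension, double learningRate) {
    this.weights = new double[dimension];
    this.learningRate = learningRate;
  }

  // Classifies as positive when the weighted sum plus bias is above zero.
  boolean predict(double[] x) {
    double sum = bias;
    for (int j = 0; j < weights.length; j++) {
      sum += weights[j] * x[j];
    }
    return sum > 0.0;
  }

  // One sequential pass over the data; returns the number of corrections made.
  int trainEpoch(double[][] examples, boolean[] labels) {
    int mistakes = 0;
    for (int i = 0; i < examples.length; i++) {
      if (predict(examples[i]) != labels[i]) {
        // Move the hyperplane towards (or away from) the misclassified example.
        double sign = labels[i] ? 1.0 : -1.0;
        for (int j = 0; j < weights.length; j++) {
          weights[j] += learningRate * sign * examples[i][j];
        }
        bias += learningRate * sign;
        mistakes++;
      }
    }
    return mistakes;
  }

  // Repeats epochs until one passes without corrections or maxEpochs is hit.
  boolean train(double[][] examples, boolean[] labels, int maxEpochs) {
    for (int epoch = 0; epoch < maxEpochs; epoch++) {
      if (trainEpoch(examples, labels) == 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    double[][] examples = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    boolean[] labels = {false, false, false, true}; // logical AND
    PerceptronTrainerSketch trainer = new PerceptronTrainerSketch(2, 1.0);
    System.out.println("converged: " + trainer.train(examples, labels, 100));
  }
}
```

Convergence is only guaranteed for linearly separable data, which is why the sketch caps the number of epochs.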

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-85.patch, MAHOUT-85.patch, 
> perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Created: (MAHOUT-241) Example for perceptron

2010-01-10 Thread Isabel Drost (JIRA)
Example for perceptron
--

 Key: MAHOUT-241
 URL: https://issues.apache.org/jira/browse/MAHOUT-241
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Isabel Drost
 Fix For: 0.3


The goal is to provide an end-to-end example based on the 20-newsgroups dataset 
to show how to get from a set of labelled training examples to a trained model 
that can later be reused.




[jira] Created: (MAHOUT-240) Parallel version of Perceptron

2010-01-10 Thread Isabel Drost (JIRA)
Parallel version of Perceptron
--

 Key: MAHOUT-240
 URL: https://issues.apache.org/jira/browse/MAHOUT-240
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Isabel Drost
 Fix For: 0.3


So far Perceptron (as well as Winnow) training is still implemented to run w/o 
parallelization. The goal of this issue is to explore ways for parallelization 
and, if possible, to provide a parallel version, that is, one that is based on 
map reduce.




[jira] Resolved: (MAHOUT-85) Perceptron/Winnow Trainer

2010-01-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-85.


Resolution: Fixed

Finally committed.

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-85.patch, MAHOUT-85.patch, 
> perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Created: (MAHOUT-231) Upgrade QM reports to use Clover 2.6

2009-12-27 Thread Isabel Drost (JIRA)
Upgrade QM reports to use Clover 2.6


 Key: MAHOUT-231
 URL: https://issues.apache.org/jira/browse/MAHOUT-231
 Project: Mahout
  Issue Type: Task
  Components: Website
Affects Versions: 0.3
Reporter: Isabel Drost
Priority: Minor
 Fix For: 0.3


Atlassian has donated a license for a new Clover version. The reports provide 
more information and are easier to read. We should upgrade the site reports to 
use that version.




[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-26 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-85:
---

Attachment: MAHOUT-85.patch

The patch has tests added to the implementation. The additional abstraction 
proposed earlier is integrated. Distance measure is not configurable but 
corresponds to what was defined in the original algorithm formulations.

The implementation currently is sequential-only. Still evaluating whether and 
how it might be possible to parallelize.

Missing so far: An example showing how to use training, how to store the 
resulting model and how to apply the model. Probably should be done in a new 
issue to keep this one focused on the algorithm itself. In addition I still 
have to at least add links from our wiki to the wikipedia pages on both 
algorithms.

(Had some time left during the past few days: Screws in my knee are out now ;) )

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-85.patch, MAHOUT-85.patch, 
> perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-26 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-85:
---

Attachment: MAHOUT-85.patch

The patch has tests added to the implementation. The additional abstraction 
proposed earlier is integrated. Distance measure is not configurable but 
corresponds to what was defined in the original algorithm formulations.

The implementation currently is sequential-only. Still evaluating whether and 
how it might be possible to parallelize.

Missing so far: An example showing how to use training, how to store the 
resulting model and how to apply the model. Probably should be done in a new 
issue to keep this one focused on the algorithm itself. In addition I still 
have to at least add links from our wiki to the wikipedia pages on both 
algorithms.

(Had some time left during the past few days: Screws in my knee are out now ;) )

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-85.patch, perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Commented: (MAHOUT-210) Publish code quality reports through maven

2009-12-18 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792449#action_12792449
 ] 

Isabel Drost commented on MAHOUT-210:
-

Forgot to include what I changed to make it work:

Seems like the workspace directory on hudson is only accessible to users logged 
in to hudson. So I changed the job to stage the generated site to a publicly 
accessible directory and adjusted the links accordingly. 

To get Clover to work I gave maven the path to the clover license on Hudson and 
issued report generation and aggregation before the site is generated.

The maven parameters used for building:

-Dmaven.clover.license=$PATH - path to the clover license file
clean install - to clean the target directories and start building and locally 
    installing the artifacts
clover:instrument clover:aggregate - generates the clover reports
site:site - generates the maven site report files and stores them under 
    $module/target/site for review
site:stage -DstagingDirectory=/export/home/hudson/hudson/jobs/MahoutQM/site - 
    stages the maven report files on a publicly readable directory


> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-210.patch
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Resolved: (MAHOUT-210) Publish code quality reports through maven

2009-12-18 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-210.
-

Resolution: Fixed

Links are working now and accessible without logging into hudson. What remains 
is refining the report configuration to our specific needs, but this can be 
done in a separate issue.

> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-210.patch
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Commented: (MAHOUT-210) Publish code quality reports through maven

2009-12-17 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792019#action_12792019
 ] 

Isabel Drost commented on MAHOUT-210:
-

Update: Clover tests are up now as well. Only problem: When not logged in to 
hudson, one is not allowed to access the workspace directory. In addition 
Hudson seems to be unable to pick up all bits and pieces of our maven site 
reports automatically. Currently working on modifying the task such that the 
report files get moved over to a publicly accessible directory.

> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-210.patch
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Commented: (MAHOUT-210) Publish code quality reports through maven

2009-12-17 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791887#action_12791887
 ] 

Isabel Drost commented on MAHOUT-210:
-

Checked in the current status of the report configuration files. Feel free to 
adjust any configuration that does not quite fit our standards yet. I tried to 
address those issues mentioned by Sean earlier in the mail thread.

I set up a Hudson job to build the documentation and linked it such that it gets 
published through Hudson. The URLs for that:

http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/core-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/examples-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/matrix-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/maven-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/taste-web-reports/index.html
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/utils-reports/index.html

Those urls were activated according to the description of Bhuvaneswaran A on 
infrastruct...@apache:

 1) setup Hudson job to generate the reports. 
 2) login to hud...@hudson.zones.apache.org and create a symbolic link:
{code}
  $ sudo su - hudson
  $ cd hudson/userContent
  $ ln -s /export/home/hudson/hudson/jobs/Mahout\ QM/$PATH_TO_DOCS \
      ./lucene-mahout/$MODULE-reports
{code}
 3) Access via 
http://hudson.zones.apache.org/hudson/userContent/lucene-mahout/$MODULE-reports/index.html

The site should be regenerated once a day. Once today's run has finished, the 
pages available on Hudson should match those I already published on 
people.apache.org.

I am about to add links to the reports on our project page (this is going to be 
a separate page in the developers section).

Missing: The Clover test coverage reports are not yet being generated - I need 
to change the Hudson job to pick up the Clover license file for that.

> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-210.patch
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Commented: (MAHOUT-224) Dependency Cleanup

2009-12-15 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790658#action_12790658
 ] 

Isabel Drost commented on MAHOUT-224:
-

Maven supports marking dependencies as "needed for tests only" (would be 
appropriate for junit), or as "provided by user" (might be appropriate for the 
Hadoop stuff that I think is needed only at compile time but is available on 
the Hadoop cluster when deploying Mahout, right?). This should reduce the 
number of jars that need to be distributed as well. But that can be addressed 
in a separate issue.

> Dependency Cleanup
> --
>
> Key: MAHOUT-224
> URL: https://issues.apache.org/jira/browse/MAHOUT-224
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Drew Farris
>Assignee: Drew Farris
>Priority: Minor
> Attachments: mahout-224.patch
>
>
> In preparation for the binary release work described in MAHOUT-215, here's a 
> minor patch that does some cleanup on the poms. 
> The hadoop and junit dependency versions are now established using the 
> dependencyManagement section of the parent pom in mahout/maven/pom.xml
> A large number of transitive dependencies from the hadoop pom are now 
> excluded there as well -- these were not necessary previously because the 
> hadoop dependency was hand-rolled and did not include them. With the update 
> to the hadoop 0.20.2-SNAPSHOT, they now become required.
> Also, the parent pom no longer has mahout/pom.xml as its parent, this allows 
> binary packaging to be performed in mahout/pom.xml after the build of all of 
> the other sub-modules is complete.
> Also, removed the javamail dependency -- was there a reason this was present?
> Verified that build and unit tests complete.




[jira] Commented: (MAHOUT-217) Tidy up generated data after unit tests are run

2009-12-15 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790656#action_12790656
 ] 

Isabel Drost commented on MAHOUT-217:
-

Not only fpgrowth. I will take a closer look on Thursday, make a list and post 
it here.

> Tidy up generated data after unit tests are run
> ---
>
> Key: MAHOUT-217
> URL: https://issues.apache.org/jira/browse/MAHOUT-217
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Isabel Drost
> Fix For: 0.3
>
>
> I tried to compile Mahout on people.apache.org yesterday: The build failed at 
> first, because tests could not generate test data. The reason: Some tests 
> tried to generate test data at /tmp//... - but those directories 
> did exist already and belonged to Sean. Why? Probably because Sean had run 
> the build earlier this year - but tests did not remove the data they 
> generated.
> Proposed solution: Tests come with setup and with shutdown hooks. We should 
> remove any data when a test is finished and shut down.
> Any thoughts?




[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-15 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790653#action_12790653
 ] 

Isabel Drost commented on MAHOUT-220:
-

Before reorganizing code - could someone who is more familiar with the specific 
rules of the code-style used at Lucene double-check the exact checkstyle rules 
used for site-generation? I reused the checkstyle configuration that was 
already in Mahout-trunk (relaxing some of its rules) but am in doubt whether it 
really reflects our rules.

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-BAYES.patch
>
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions:
> 1.  Line length used is 120 instead of 80. 
> 2.  The static final log field keeps its name; it is not renamed to LOG. 




[jira] Updated: (MAHOUT-210) Publish code quality reports through maven

2009-12-11 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-210:


Attachment: MAHOUT-210.patch

The patch adds clover, findbugs, pmd, cpd and maven dependency reports as well 
as Javadoc generation.

After applying the patch, the site can be generated through mvn site:site - I 
have thrown out all general project information that is already available 
through our Forrest site.

The plan is to run mvn clean install site:site site:deploy on a daily (maybe 
weekly?) basis on people.apache.org and publish the results there so they can 
be linked to from our site.

> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-210.patch
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-11 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789312#action_12789312
 ] 

Isabel Drost commented on MAHOUT-85:


I am currently adding tests. I guess I will commit once I have those done and 
go on with a parallel version from there.

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Created: (MAHOUT-217) Tidy up generated data after unit tests are run

2009-12-11 Thread Isabel Drost (JIRA)
Tidy up generated data after unit tests are run
---

 Key: MAHOUT-217
 URL: https://issues.apache.org/jira/browse/MAHOUT-217
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.3
Reporter: Isabel Drost
 Fix For: 0.3


I tried to compile Mahout on people.apache.org yesterday: The build failed at 
first, because tests could not generate test data. The reason: Some tests tried 
to generate test data at /tmp//... - but those directories did 
exist already and belonged to Sean. Why? Probably because Sean had run the 
build earlier this year - but tests did not remove the data they generated.

Proposed solution: Tests come with setup and with shutdown hooks. We should 
remove any data when a test is finished and shut down.

Any thoughts?
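
To make the proposal concrete, here is a minimal sketch of a test fixture that 
creates its own uniquely named scratch directory in the setup hook and removes 
it again in the teardown hook. The class and method names (ScratchDirSketch, 
setUp, tearDown, runOnce) and the file name are hypothetical and only for 
illustration; this is not existing Mahout code, and it assumes a JDK that 
provides java.nio.file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not Mahout code: a test fixture that owns its scratch directory.
final class ScratchDirSketch {

  // Setup hook: create a uniquely named directory below java.io.tmpdir,
  // so two users building on the same machine never share a path.
  static Path setUp() {
    try {
      return Files.createTempDirectory("mahout-test-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Teardown hook: delete the directory and everything below it.
  static void tearDown(Path dir) {
    try {
      if (!Files.exists(dir)) {
        return;
      }
      List<Path> entries;
      try (Stream<Path> walk = Files.walk(dir)) {
        // Reverse order lists children before their parent directories.
        entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      }
      for (Path entry : entries) {
        Files.delete(entry);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Stand-in for a test body: writes some data, then always cleans up.
  // Returns true if the scratch directory is gone afterwards.
  static boolean runOnce() {
    Path dir = setUp();
    try {
      Files.write(dir.resolve("points.txt"), "1.0 2.0\n".getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      tearDown(dir);
    }
    return !Files.exists(dir);
  }

  public static void main(String[] args) {
    System.out.println("scratch directory removed: " + runOnce());
  }
}
```

In a real JUnit test the two hooks would be the test class's setup and teardown 
methods; the try/finally in runOnce only imitates that lifecycle so the sketch 
runs on its own.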




[jira] Assigned: (MAHOUT-210) Publish code quality reports through maven

2009-12-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-210:
---

Assignee: Isabel Drost

> Publish code quality reports through maven
> --
>
> Key: MAHOUT-210
> URL: https://issues.apache.org/jira/browse/MAHOUT-210
> Project: Mahout
>  Issue Type: New Feature
>  Components: Website
>Affects Versions: 0.1, 0.2
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
>
> We should use mvn site:site to generate code reports and publish them online 
> for users to review and developers to easily spot problems.
> First version that still needs checks adjusted to our needs is available 
> online at:
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
> Further discussion on-list at
> http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-11:
--

Assignee: Drew Farris  (was: Isabel Drost)

Thanks.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
>Assignee: Drew Farris
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Assigned: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-11:
--

Assignee: Isabel Drost

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-10 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-11:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed. Thanks Drew for your help.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-09 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788129#action_12788129
 ] 

Isabel Drost commented on MAHOUT-11:


I'll make the changes before committing - no need to submit a new patch version.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Resolved: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-07 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-90.


Resolution: Later

Marked as "Later" - currently snapshots are published to the apache maven 
repository. At the moment that should be enough for users to play around with 
latest code.

> Adding all scripts (for nightly build) to SVN repository.
> -
>
> Key: MAHOUT-90
> URL: https://issues.apache.org/jira/browse/MAHOUT-90
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Edward J. Yoon
>Priority: Minor
> Fix For: 0.3
>
> Attachments: mahout.tgz
>
>
> I made below scripts for the hudson continuous integration service on my 
> hudson account. 
> mahout/hudsonBuildMahoutPatch.sh   
> mahout/processMahoutPatchEmail.sh
> mahout/hudsonPatchQueueAdmin.sh
> They will be modified by only me, so It should be handled via SVN.




[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-06 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786679#action_12786679
 ] 

Isabel Drost commented on MAHOUT-85:


It is just a sequential version of the algorithm. No parallelisation and no 
Hadoop involved.
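
For readers unfamiliar with the term, a sequential perceptron trainer can be 
sketched roughly as follows. This is a generic textbook version with labels in 
{0, 1}, not the code from the attached patch; the class name, method names and 
parameters are made up for this illustration.

```java
// Hypothetical sketch of a sequential (single-threaded) perceptron, not the patch itself.
final class SequentialPerceptronSketch {

  // Trains a weight vector whose last entry is the bias term.
  // Examples are visited one after another; weights change only on mistakes.
  static double[] train(double[][] xs, int[] ys, int epochs, double rate) {
    double[] w = new double[xs[0].length + 1];
    for (int epoch = 0; epoch < epochs; epoch++) {
      for (int i = 0; i < xs.length; i++) {
        int error = ys[i] - classify(w, xs[i]); // -1, 0 or +1
        if (error != 0) {
          for (int j = 0; j < xs[i].length; j++) {
            w[j] += rate * error * xs[i][j];
          }
          w[w.length - 1] += rate * error; // bias update
        }
      }
    }
    return w;
  }

  // Predicts 1 if the weighted sum plus bias is positive, otherwise 0.
  static int classify(double[] w, double[] x) {
    double sum = w[w.length - 1];
    for (int j = 0; j < x.length; j++) {
      sum += w[j] * x[j];
    }
    return sum > 0 ? 1 : 0;
  }

  public static void main(String[] args) {
    // Toy, linearly separable data: logical AND.
    double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    int[] ys = {0, 0, 0, 1};
    double[] w = train(xs, ys, 50, 1.0);
    for (double[] x : xs) {
      System.out.println(x[0] + " AND " + x[1] + " -> " + classify(w, x));
    }
  }
}
```

Every update depends on the weights left behind by the previous example, which 
is why a parallel version is a separate step rather than a trivial change.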

> Perceptron/Winnow Trainer
> -
>
> Key: MAHOUT-85
> URL: https://issues.apache.org/jira/browse/MAHOUT-85
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.3
>
> Attachments: perceptronWinnowTrainer.diff
>
>
> Please find attached a first sketch for perceptron and winnow training. 
> Please look very, very carefully at the patch, as I added the heart of the 
> algorithms in the emergency room at Charite Berlin (after I broke my leg when 
> cycling to the Hadoop Get Together ;) ). 
> The patch does not yet feature unit tests nor is it parallelised. Currently 
> my plan is to set up an example with the webKb dataset, add unit tests to the 
> code and after that go parallel. I would like to get some feedback early on, 
> in addition I would feel a lot better, if a second and third pair of eyes had 
> a look at the code to make sure all obvious mistakes are out as early as 
> possible.




[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786678#action_12786678
 ] 

Isabel Drost commented on MAHOUT-90:


I did add a hudson job to upload maven snapshots of our projects to the apache 
repository on a nightly basis. No idea however how building and publishing 
nightly releases should work at Apache.

> Adding all scripts (for nightly build) to SVN repository.
> -
>
> Key: MAHOUT-90
> URL: https://issues.apache.org/jira/browse/MAHOUT-90
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Edward J. Yoon
>Assignee: Isabel Drost
>Priority: Minor
> Fix For: 0.3
>
> Attachments: mahout.tgz
>
>
> I made below scripts for the hudson continuous integration service on my 
> hudson account. 
> mahout/hudsonBuildMahoutPatch.sh   
> mahout/processMahoutPatchEmail.sh
> mahout/hudsonPatchQueueAdmin.sh
> They will be modified by only me, so It should be handled via SVN.




[jira] Assigned: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-90:
--

Assignee: (was: Isabel Drost)

> Adding all scripts (for nightly build) to SVN repository.
> -
>
> Key: MAHOUT-90
> URL: https://issues.apache.org/jira/browse/MAHOUT-90
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Edward J. Yoon
>Priority: Minor
> Fix For: 0.3
>
> Attachments: mahout.tgz
>
>
> I made below scripts for the hudson continuous integration service on my 
> hudson account. 
> mahout/hudsonBuildMahoutPatch.sh   
> mahout/processMahoutPatchEmail.sh
> mahout/hudsonPatchQueueAdmin.sh
> They will be modified by only me, so It should be handled via SVN.




[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-12-04 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785985#action_12785985
 ] 

Isabel Drost commented on MAHOUT-11:


Applies cleanly and builds w/o unit test failures here.

The changes look all good to me. Great work, Drew.

One question though: In the TestMeanShift test (lines 301 and 304) you removed 
the canopyId adjustments - could you please explain what was the reason this 
was necessary?

I would like to commit this patch next week if no one objects.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Created: (MAHOUT-210) Publish code quality reports through maven

2009-11-28 Thread Isabel Drost (JIRA)
Publish code quality reports through maven
--

 Key: MAHOUT-210
 URL: https://issues.apache.org/jira/browse/MAHOUT-210
 Project: Mahout
  Issue Type: New Feature
  Components: Website
Affects Versions: 0.1, 0.2
Reporter: Isabel Drost
 Fix For: 0.3


We should use mvn site:site to generate code reports and publish them online 
for users to review and developers to easily spot problems.

First version that still needs checks adjusted to our needs is available online 
at:

http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html

Further discussion on-list at

http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3




[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-25 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782470#action_12782470
 ] 

Isabel Drost commented on MAHOUT-11:


Drew, go ahead then.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-kmeans-cleanup.patch, 
> MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-19 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780476#action_12780476
 ] 

Isabel Drost commented on MAHOUT-11:


First of all, thanks for the review.

Passing the output collector directly - yep, makes sense. Will change and 
resubmit the patch.

Tests with real data: Big thanks for that.

Isabel

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-19 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-11:
---

Attachment: MAHOUT-11.patch

I am not the original author of the source, but I still managed to get the 
static fields out of the k-means clustering code. All unit tests are still 
passing. However, I would feel a lot better if someone else double-checked the 
changes made.

Looking at the code, I spotted some more points that could benefit from being 
revisited (e.g. usage of deprecated MapReduce APIs and introduction of status 
reports). But this should be done in a separate issue.
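
As a rough illustration of the concern raised in this issue (this is not the 
code from the patch): if configuration lives in a static field, two jobs set up 
in the same JVM overwrite each other's values, whereas per-instance 
configuration keeps them apart. All class names and the "cluster.threshold" key 
below are made up for this sketch, and a plain Map stands in for the job 
configuration; no Hadoop API is used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical clusterer that keeps its threshold in a static field.
final class StaticThresholdClusterer {
    static double threshold; // shared by every instance in the JVM

    static void configure(double t) {
        threshold = t;
    }

    boolean isClose(double distance) {
        return distance < threshold;
    }
}

// Same idea, but each instance reads and keeps its own configuration.
final class InstanceThresholdClusterer {
    private final double threshold;

    InstanceThresholdClusterer(Map<String, String> conf) {
        this.threshold = Double.parseDouble(conf.get("cluster.threshold"));
    }

    boolean isClose(double distance) {
        return distance < threshold;
    }
}

final class StaticFieldDemo {
    // Builds a stand-in configuration holding a single threshold value.
    static Map<String, String> conf(double threshold) {
        Map<String, String> c = new HashMap<>();
        c.put("cluster.threshold", Double.toString(threshold));
        return c;
    }

    public static void main(String[] args) {
        // Two "jobs" configured one after the other in the same JVM.
        StaticThresholdClusterer.configure(1.0);
        StaticThresholdClusterer jobA = new StaticThresholdClusterer();
        StaticThresholdClusterer.configure(5.0); // job B's setup overwrites job A's
        System.out.println(jobA.isClose(3.0));   // job A now uses job B's threshold

        InstanceThresholdClusterer a = new InstanceThresholdClusterer(conf(1.0));
        InstanceThresholdClusterer b = new InstanceThresholdClusterer(conf(5.0));
        System.out.println(a.isClose(3.0));      // uses its own threshold of 1.0
        System.out.println(b.isClose(3.0));      // uses its own threshold of 5.0
    }
}
```

Whether this situation can actually occur depends on how Hadoop schedules tasks 
inside a JVM, which is exactly the open question in the issue description.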

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.




[jira] Resolved: (MAHOUT-200) Update information on Mahout site

2009-11-18 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-200.
-

   Resolution: Fixed
Fix Version/s: (was: 0.3)
   0.2

Updated web page and fixed typo in release announcement.

> Update information on Mahout site
> -
>
> Key: MAHOUT-200
> URL: https://issues.apache.org/jira/browse/MAHOUT-200
> Project: Mahout
>  Issue Type: Improvement
>  Components: Website
>Reporter: Isabel Drost
>Assignee: Isabel Drost
>Priority: Minor
> Fix For: 0.2
>
> Attachments: update_site.patch
>
>
> After several people had trouble finding the docs we provide in the wiki, I 
> have created a "slightly" updated version of our website. I added a few links 
> to wiki pages that might be of interest to potential Mahout users.
> I have uploaded the updated version to http://people.apache.org/~isabel/site 
> so all of you can have a look. Will commit on Tuesday next week if no one 
> objects.




[jira] Created: (MAHOUT-200) Update information on Mahout site

2009-11-13 Thread Isabel Drost (JIRA)
Update information on Mahout site
-

 Key: MAHOUT-200
 URL: https://issues.apache.org/jira/browse/MAHOUT-200
 Project: Mahout
  Issue Type: Improvement
  Components: Website
Reporter: Isabel Drost
Priority: Minor
 Fix For: 0.3
 Attachments: update_site.patch

After several people had trouble finding the docs we provide in the wiki, I 
have created a "slightly" updated version of our website. I added a few links 
to wiki pages that might be of interest to potential Mahout users.

I have uploaded the updated version to http://people.apache.org/~isabel/site so 
all of you can have a look. Will commit on Tuesday next week if no one objects.




[jira] Assigned: (MAHOUT-200) Update information on Mahout site

2009-11-13 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-200:
---

Assignee: Isabel Drost

> Update information on Mahout site
> -
>
> Key: MAHOUT-200
> URL: https://issues.apache.org/jira/browse/MAHOUT-200
> Project: Mahout
>  Issue Type: Improvement
>  Components: Website
>Reporter: Isabel Drost
>Assignee: Isabel Drost
>Priority: Minor
> Fix For: 0.3
>
> Attachments: update_site.patch
>
>
> After several people had trouble finding the docs we provide in the wiki, I 
> have created a "slightly" updated version of our website. I added a few links 
> to wiki pages that might be of interest to potential Mahout users.
> I have uploaded the updated version to http://people.apache.org/~isabel/site 
> so all of you can have a look. Will commit on Tuesday next week if no one 
> objects.




[jira] Updated: (MAHOUT-200) Update information on Mahout site

2009-11-13 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-200:


Attachment: update_site.patch

> Update information on Mahout site
> -
>
> Key: MAHOUT-200
> URL: https://issues.apache.org/jira/browse/MAHOUT-200
> Project: Mahout
>  Issue Type: Improvement
>  Components: Website
>Reporter: Isabel Drost
>Assignee: Isabel Drost
>Priority: Minor
> Fix For: 0.3
>
> Attachments: update_site.patch
>
>
> After several people had trouble finding the docs we provide in the wiki, I 
> have created a "slightly" updated version of our website. I added a few links 
> to wiki pages that might be of interest to potential Mahout users.
> I have uploaded the updated version to http://people.apache.org/~isabel/site 
> so all of you can have a look. Will commit on Tuesday next week if no one 
> objects.




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-10-19 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767710#action_12767710
 ] 

Isabel Drost commented on MAHOUT-171:
-

It was my own fault - I forgot to "svn add" the file after I applied and built 
with my own patch. Sorry :/

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-171.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Resolved: (MAHOUT-171) Move deployment to repository.apache.org

2009-10-19 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-171.
-

Resolution: Fixed

Checked in.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-171.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Resolved: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-15 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-138.
-

   Resolution: Fixed
Fix Version/s: (was: 0.3)
   0.2

The last commit converted the remaining classes, so at least grep no longer 
finds any usages of 'args\[' anywhere in our source code.
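
For readers wondering what the conversion buys: the difference between 
positional argument access and named options can be sketched roughly as below. 
This is a dependency-free stand-in written only for illustration; it does not 
use the Commons CLI API, and the class name NamedOptions as well as the option 
names --input, --output and --help are assumptions.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical minimal parser for "--name value" pairs plus a "--help" flag.
final class NamedOptions {
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        Iterator<String> it = Arrays.asList(args).iterator();
        while (it.hasNext()) {
            String token = it.next();
            if (!token.startsWith("--")) {
                throw new IllegalArgumentException("Unexpected argument: " + token);
            }
            String name = token.substring(2);
            if (name.equals("help")) {
                options.put("help", "true"); // flag without a value
                continue;
            }
            if (!it.hasNext()) {
                throw new IllegalArgumentException("Missing value for --" + name);
            }
            options.put(name, it.next());
        }
        return options;
    }

    // One-line usage text a main method could print for --help or on errors.
    static String usage() {
        return "Usage: <job> --input <path> --output <path> [--help]";
    }
}
```

With named options the order of arguments no longer matters, and a missing 
value produces a message that names the offending option instead of an array 
index error.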

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-10-15 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766030#action_12766030
 ] 

Isabel Drost commented on MAHOUT-157:
-

The patch looks good to me. Good work Robin.

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
>  Issue Type: New Feature
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-codecleanup-javadocs.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-CompactTransactionMapperFormat.patch, MAHOUT-157-final.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, 
> MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch, 
> MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, 
> MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch, 
> MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, 
> MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-09 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764000#action_12764000
 ] 

Isabel Drost commented on MAHOUT-138:
-

Robin, you briefly mentioned that for the bayes classifier it does not make 
sense to start up the different phases manually. Could you please detail which 
classes should not have main-methods attached to them and which ones should 
instead be used to start a training job in this issue?

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-09 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763997#action_12763997
 ] 

Isabel Drost commented on MAHOUT-138:
-

Usage description for Taste examples is online in the wiki at: 

http://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples 

Current status: 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
 
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
 
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java
 

8 examples to go.

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-09 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763958#action_12763958
 ] 

Isabel Drost commented on MAHOUT-138:
-

Sean - I just converted the implementation of the taste jobs in core - could 
you please have a look at the commandline option descriptions to check that 
everything is correct?

http://cwiki.apache.org/confluence/display/MAHOUT/TasteCommandLine

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-10-09 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763929#action_12763929
 ] 

Isabel Drost commented on MAHOUT-157:
-

Great work Robin. I just had a look at the code and only found some minor 
things: 

ParallelFPGrowth 
- it might be a good idea to reuse the DefaultOptionCreator to generate common 
options like input and output. 
- I would love to see a help option as well. 
- What happens if the user gives the wrong parameters? As a user, I would 
rather not be confronted with a stack trace, even though it is an example. 
- Did you document on the wiki how to run the algorithm, the assumptions it 
makes, the file format, and the behaviour if the output file already exists? 
- The class is named ParallelFPGrowth, but if I read it correctly, it looks 
like the entry point for both the parallel and the sequential version. Maybe 
rename it to FPGrowthJob? 

FPGrowth 
- line 98: is this really a recoverable error that does not cause 
inconsistencies later on? The log message says "this should not happen" - what 
if, against all odds, it does happen? Why not throw an unchecked exception? 
- line 177: we should not have commented-out source code in newly added code. 
- The class seems to implement both top-k and vanilla FP-growth - would it 
make sense to split that up into different classes? 
- generateFrequentPatterns: maybe it is just me, but in methods that long I am 
always happy to find short comments that briefly explain what the following 
code block is doing. 

FPTreeDepthCache 
- maybe mention in the docs that the implementation is not threadsafe? 

FPTree, Pattern 
- missing a class comment. 

Pattern 
- line 173 - please remove code that is commented out 

AggregatorMapper 
- the reporter is left unused. 

Nice-To-Have: It would be nice to have package level comments in JavaDoc as 
well.
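
To make two of the points above more concrete (failing fast with an unchecked 
exception, and reporting bad parameters without a stack trace), here is a 
minimal sketch. It is not taken from the patch; the class FpGrowthSketch, its 
methods and the minSupport parameter are invented for illustration only.

```java
final class FpGrowthSketch {
    // Hypothetical lookup: an "impossible" state raises an unchecked exception
    // right away instead of only being logged.
    static int supportOf(int[] supportCounts, int itemId) {
        if (itemId < 0 || itemId >= supportCounts.length) {
            throw new IllegalStateException(
                "Item id " + itemId + " is not in the tree; state is inconsistent");
        }
        return supportCounts[itemId];
    }

    // Hypothetical parameter check used by the entry point below.
    static int parseMinSupport(String raw) {
        try {
            int value = Integer.parseInt(raw);
            if (value < 1) {
                throw new IllegalArgumentException("minSupport must be >= 1, got " + value);
            }
            return value;
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("minSupport must be an integer, got '" + raw + "'");
        }
    }

    // Returns an exit code; bad input yields a one-line message, not a stack trace.
    static int run(String rawMinSupport) {
        try {
            int minSupport = parseMinSupport(rawMinSupport);
            System.out.println("Running with minSupport=" + minSupport);
            return 0;
        } catch (IllegalArgumentException e) {
            System.err.println("Error: " + e.getMessage());
            return 1;
        }
    }
}
```

The split is deliberate in this sketch: user mistakes are caught at the entry 
point and turned into a readable message, while internal inconsistencies are 
allowed to propagate so they are not silently ignored.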

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
>  Issue Type: New Feature
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, 
> MAHOUT-157-Oct-8.pfpgrowth.patch, 
> MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, 
> MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, 
> MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-08 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763455#action_12763455
 ] 

Isabel Drost commented on MAHOUT-138:
-

Sean: sure, trying to get to it as soon as I find time to do so (hopefully 
tomorrow).

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Issue Comment Edited: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-08 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763378#action_12763378
 ] 

Isabel Drost edited comment on MAHOUT-138 at 10/8/09 12:15 AM:
---

From the classes above, I worked through up to the classification stuff. 
Documentation is in the wiki at: 
http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the 
links with commandline in their name) and 
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again 
the links with commandline in their name).

Currently there are only examples left to convert as well as three classes 
containing main methods from the taste code:

./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java



  
> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We

[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-07 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763378#action_12763378
 ] 

Isabel Drost commented on MAHOUT-138:
-


From the classes above, I worked through up to the classification stuff. 
Documentation is in the wiki at: 
http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the 
links with commandline in their name) and 
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again 
the links with commandline in their name).

Currently there are only examples left to convert as well as three classes 
containing main methods from the taste code:

./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java



> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-06 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762898#action_12762898
 ] 

Isabel Drost commented on MAHOUT-138:
-

Sean, you can easily follow what is going on with this issue on the subversion 
commit panel:

https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-10-02 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761589#action_12761589
 ] 

Isabel Drost commented on MAHOUT-171:
-

https://issues.apache.org/jira/browse/INFRA-2229 - is done as well.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-171.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org

2009-10-02 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-171:


Attachment: MAHOUT-171.patch

I moved the stuff in the buildtools directory over to our "maven" module. I 
think we have few enough configuration files to bundle all the 
maven/eclipse/intellij/checkstyle stuff in one module. I added javadoc and 
source-jar download and a line to enable the checkstyle config for the eclipse 
plugin. However, so far the checkstyle config itself seems rather rudimentary 
to me - I can disable it if it has not been cleaned up yet.

I deleted the NOTICE and LICENSE files that obviously were copied over from 
another project to buildtools/src/main/resources/META-INF.

I converted the NOTICE and LICENSE file generation to use the maven remote 
resources plugin as recommended in the Apache parent pom. (Thanks to Jukka for 
clarifying how to include a custom NOTICE file.)

I added supplemental entries that describe our dependencies, so besides the 
LICENSE and NOTICE files a DEPENDENCIES file is now generated with information 
on project, license, project url and the like for all (transitive) dependencies 
of Mahout.

I think this patch requires a review of the changes and the generated artifacts 
to make sure everything is still where it belongs after the changes.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-171.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org

2009-10-02 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-171:


Attachment: (was: parent_pom.patch)

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-171.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Commented: (MAHOUT-184) Code tweaks for .df.* code

2009-10-02 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761501#action_12761501
 ] 

Isabel Drost commented on MAHOUT-184:
-

Looks good to me. Deneche, could you please also have a look at the patch to 
spot any issues early on?

I would prefer using CLI for the job implementation 
(core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java), but 
that can be done in a later patch.



> Code tweaks for .df.* code
> --
>
> Key: MAHOUT-184
> URL: https://issues.apache.org/jira/browse/MAHOUT-184
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 0.2
>
> Attachments: Tweaks_to__df__.patch
>
>
> This follows on my last email to the mailing list, and code inspection. It's 
> big enough I made a patch. No surprises I hope given the consensus on code 
> style and practice. Might be some good takeaways in here, or points for 
> further discussion.




[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2009-09-26 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759864#action_12759864
 ] 

Isabel Drost commented on MAHOUT-180:
-

That sounds great! Thank you for offering to donate the code. If you need any 
help porting the code or any other support, we are happy to help.

You may also want to have a look at 
http://incubator.apache.org/ip-clearance/index.html, which explains the legal 
steps for making large code donations.

> port Hadoop-ified Lanczos SVD implementation from decomposer
> 
>
> Key: MAHOUT-180
> URL: https://issues.apache.org/jira/browse/MAHOUT-180
> Project: Mahout
>  Issue Type: New Feature
>  Components: Matrix
>Affects Versions: 0.2
>Reporter: Jake Mannix
>Priority: Minor
>
> I wrote up a hadoop version of the Lanczos algorithm for performing SVD on 
> sparse matrices available at http://decomposer.googlecode.com/, which is 
> Apache-licensed, and I'm willing to donate it.  I'll have to port over the 
> implementation to use Mahout vectors, or else add in these vectors as well.
> Current issues with the decomposer implementation include: if your matrix is 
> really big, you need to re-normalize before decomposition: find the largest 
> eigenvalue first, and divide all your rows by that value, then decompose, or 
> else you'll blow over Double.MAX_VALUE once you've run too many iterations 
> (the L^2 norm of intermediate vectors grows roughly as 
> (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
> the lower end is better than blowing over MAX_VALUE).  When this is ported to 
> Mahout, we should add in the capability to do this automatically (run a 
> couple iterations to find the largest eigenvalue, save that, then iterate 
> while scaling vectors by 1/max_eigenvalue).
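
The re-normalization idea in the description above (estimate the largest 
eigenvalue first, then divide by it) can be sketched roughly as follows. This 
is only an illustration under assumptions: it uses a small dense symmetric 
matrix and plain power iteration, and the class and method names 
(EigenScalingSketch, estimateLargestEigenvalue, scaleRows) are hypothetical. 
It is not the decomposer code and not a Mahout API.

```java
import java.util.Arrays;

/** Hypothetical sketch: estimate the dominant eigenvalue, then rescale. */
final class EigenScalingSketch {

  /** Multiplies a dense square matrix by a vector. */
  static double[] times(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < v.length; j++) {
        sum += a[i][j] * v[j];
      }
      out[i] = sum;
    }
    return out;
  }

  /** Euclidean (L^2) norm of a vector. */
  static double norm(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += x * x;
    }
    return Math.sqrt(sum);
  }

  /** Power iteration: returns an estimate of the largest-magnitude eigenvalue. */
  static double estimateLargestEigenvalue(double[][] a, int iterations) {
    double[] v = new double[a.length];
    Arrays.fill(v, 1.0 / Math.sqrt(a.length)); // unit-length start vector
    double estimate = 0.0;
    for (int k = 0; k < iterations; k++) {
      double[] w = times(a, v);
      estimate = norm(w); // v has unit length, so |Av| approaches |lambda_max|
      if (estimate == 0.0) {
        return 0.0;
      }
      for (int i = 0; i < w.length; i++) {
        v[i] = w[i] / estimate; // re-normalize so intermediate vectors stay bounded
      }
    }
    return estimate;
  }

  /** Divides every entry by lambda so the largest eigenvalue becomes roughly 1. */
  static double[][] scaleRows(double[][] a, double lambda) {
    double[][] scaled = new double[a.length][];
    for (int i = 0; i < a.length; i++) {
      scaled[i] = new double[a[i].length];
      for (int j = 0; j < a[i].length; j++) {
        scaled[i][j] = a[i][j] / lambda;
      }
    }
    return scaled;
  }
}
```

With the matrix scaled this way, repeated multiplication no longer grows like 
a power of the largest eigenvalue. Note that power iteration can stall if the 
start vector happens to be orthogonal to the dominant eigenvector, so a real 
implementation would probably use a random start.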




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-25 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759468#action_12759468
 ] 

Isabel Drost commented on MAHOUT-171:
-

Got account, changed the build to maven and tied the build to minerva (supports 
maven builds). Build runs smoothly (and successfully) again in hudson now.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling

2009-09-18 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-134:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to revision 816588.

> [PATCH] Cluster decode error handling
> -
>
> Key: MAHOUT-134
> URL: https://issues.apache.org/jira/browse/MAHOUT-134
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Robert Burrell Donkin
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, 
> mahout-cluster-format-error.patch
>
>
> ATM the javadocs are unclear as to whether null is an acceptable return value 
> and callers do not null check the return value. However, the implementation 
> may return null or throw other runtime exceptions when the format is not 
> correct. This makes it hard to diagnose when there's a problem with the 
> format.
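
To illustrate the kind of error handling the description above asks for, here 
is a minimal sketch of a decoder that never returns null and instead fails 
with a message naming the offending input. The "[v1, v2, ...]" format and the 
names DecodeSketch and decodePoint are assumptions made up for this example; 
they are not Mahout's actual cluster format or API.

```java
/** Hypothetical sketch of a decoder that reports format errors explicitly. */
final class DecodeSketch {

  /**
   * Parses a string such as "[1.0, 2.5]" into its components.
   * Never returns null; malformed input raises IllegalArgumentException.
   */
  static double[] decodePoint(String formatted) {
    if (formatted == null) {
      throw new IllegalArgumentException("Formatted point must not be null");
    }
    String s = formatted.trim();
    if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
      throw new IllegalArgumentException("Expected '[v1, v2, ...]' but got: " + formatted);
    }
    String body = s.substring(1, s.length() - 1).trim();
    if (body.isEmpty()) {
      return new double[0];
    }
    String[] parts = body.split(",");
    double[] values = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      try {
        values[i] = Double.parseDouble(parts[i].trim());
      } catch (NumberFormatException e) {
        // Keep the original cause, but say which component and input failed.
        throw new IllegalArgumentException(
            "Cannot parse component " + i + " of: " + formatted, e);
      }
    }
    return values;
  }
}
```

Callers then do not need a null check, and a bad input shows up in the stack 
trace together with the text that could not be parsed.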




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-18 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757119#action_12757119
 ] 

Isabel Drost commented on MAHOUT-138:
-

Added changes to cli for FuzzyKMeans, Dirichlet and MeanShiftCanopy (see below 
for exact classes). Added documentation to the wiki (see links to command line 
client documentation at 
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ) 

./core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansJob.java
./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletJob.java
./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletDriver.java
./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyJob.java
./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java

I have added one additional helper class (DefaultOptionCreator) that provides 
methods for creating the most common options (k clusters, input, output and the 
like) to avoid copying and ensure that the same option strings are used all 
over the code.
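
The idea behind such a helper, keeping the option strings in one place so that 
every job spells them the same way, can be sketched as below. This is not the 
DefaultOptionCreator from the patch and it deliberately does not use Commons 
CLI; the class name SharedOptionsSketch, the constants and the tiny 
"--name value" parser are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one shared definition of common option names. */
final class SharedOptionsSketch {

  // Shared option names; jobs refer to these constants instead of literals.
  static final String INPUT = "input";
  static final String OUTPUT = "output";
  static final String NUM_CLUSTERS = "k";

  /** Parses arguments of the form "--name value" into a map keyed by name. */
  static Map<String, String> parse(String[] args) {
    Map<String, String> values = new HashMap<>();
    for (int i = 0; i < args.length; i += 2) {
      if (!args[i].startsWith("--") || i + 1 >= args.length) {
        throw new IllegalArgumentException("Expected '--name value' at position " + i);
      }
      values.put(args[i].substring(2), args[i + 1]);
    }
    return values;
  }
}
```

A job would then read parse(args).get(SharedOptionsSketch.NUM_CLUSTERS) rather 
than repeating the literal "k", so a renamed option only changes in one place.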

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-78) HBase RowResult/BatchUpdate access via Mahout Vector interface

2009-09-18 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757074#action_12757074
 ] 

Isabel Drost commented on MAHOUT-78:


What is the current status of this issue? Allen, did you have a chance to look 
into creating tests with a mocked HBase?

> HBase RowResult/BatchUpdate access via Mahout Vector interface
> --
>
> Key: MAHOUT-78
> URL: https://issues.apache.org/jira/browse/MAHOUT-78
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Allen Day
>Priority: Minor
> Fix For: 0.2
>
> Attachments: hbase.patch
>
>
> An adapter class is attached that allows read/write operations on HBase rows 
> using the Vector interface.  This allows, e.g. canopy clustering of rows in 
> an HBase table.




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-17 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757012#action_12757012
 ] 

Isabel Drost commented on MAHOUT-171:
-

https://issues.apache.org/jira/browse/INFRA-2237 is the account-related issue.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755377#action_12755377
 ] 

Isabel Drost commented on MAHOUT-138:
-

Will do so and put some documentation of the command line parameters on the 
wiki while I go along.

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Updated: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-138:


Attachment: MAHOUT-138_fuzzyKMeansJob.patch

Patch to convert FuzzyKMeansJob to use CLI for argument parsing.

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling

2009-09-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-134:


Assignee: Isabel Drost
  Status: Patch Available  (was: Reopened)

See last attachment. Committing on Friday if no one objects.

> [PATCH] Cluster decode error handling
> -
>
> Key: MAHOUT-134
> URL: https://issues.apache.org/jira/browse/MAHOUT-134
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Robert Burrell Donkin
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, 
> mahout-cluster-format-error.patch
>
>
> ATM the javadocs are unclear as to whether null is an acceptable return value 
> and callers do not null check the return value. However, the implementation 
> may return null or throw other runtime exceptions when the format is not 
> correct. This makes it hard to diagnose when there's a problem with the 
> format.




[jira] Updated: (MAHOUT-134) [PATCH] Cluster decode error handling

2009-09-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-134:


Attachment: MAHOUT-134.patch

Adjusted patch to current trunk version.

> [PATCH] Cluster decode error handling
> -
>
> Key: MAHOUT-134
> URL: https://issues.apache.org/jira/browse/MAHOUT-134
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Robert Burrell Donkin
> Fix For: 0.2
>
> Attachments: MAHOUT-134.patch, mahout-cluster-format-error.patch, 
> mahout-cluster-format-error.patch
>
>
> ATM the javadocs are unclear as to whether null is an acceptable return value 
> and callers do not null check the return value. However, the implementation 
> may return null or throw other runtime exceptions when the format is not 
> correct. This makes it hard to diagnose when there's a problem with the 
> format.




[jira] Resolved: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm

2009-09-14 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-108.
-

Resolution: Won't Fix

Superseded by FPGrowth patch (MAHOUT-157).

> Implementation of Assoication Rules learning by Apriori algorithm
> -
>
> Key: MAHOUT-108
> URL: https://issues.apache.org/jira/browse/MAHOUT-108
> Project: Mahout
>  Issue Type: Task
> Environment: Linux, Hadoop-0.17.1
>Reporter: chao deng
> Fix For: 0.2
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Target: Association Rules learning is a popular method for discovering 
> interesting relations between variables in large databases. Here, we would 
> implement the Apriori algorithm using Hadoop&Mapreduce parallel techniques.
> Applications: Typically, association rules  learning is used to discover 
> regularities between products in large scale transaction data in 
> supermarkets. For example, the rule "{onions, potatoes}->beef" found in the 
> sales data would indicate that if a customer buys onions and potatoes 
> together, he or she is likely to also buy beef. Such information can be used 
> as the basis for decisions about marketing activities. In addition to the 
> market basket analysis, association rules are employed today in many 
> application areas including Web usage mining, intrusion detection and 
> bioinformatics.
> Apriori algorithm: Apriori is the best-known algorithm to mine association 
> rules. It uses a breadth-first search strategy to count the support of 
> itemsets and uses a candidate generation function which exploits the downward 
> closure property of support
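
For readers unfamiliar with the approach described above, here is a small 
single-machine sketch of level-wise (breadth-first) itemset mining with 
pruning based on the downward closure property: a candidate is only counted 
if all of its subsets that are one item smaller are already frequent. It is 
an in-memory illustration under assumptions, not the proposed Hadoop 
implementation, and the names AprioriSketch, support, frequentItemsets and 
allSubsetsFrequent are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory sketch of Apriori frequent-itemset mining. */
final class AprioriSketch {

  /** Counts the transactions that contain every item of the candidate. */
  static int support(List<Set<String>> transactions, Set<String> candidate) {
    int count = 0;
    for (Set<String> t : transactions) {
      if (t.containsAll(candidate)) {
        count++;
      }
    }
    return count;
  }

  /** Downward closure: every subset with one item removed must be frequent. */
  static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
    for (String item : candidate) {
      Set<String> subset = new TreeSet<>(candidate);
      subset.remove(item);
      if (!frequent.contains(subset)) {
        return false;
      }
    }
    return true;
  }

  /** Returns all itemsets contained in at least minSupport transactions. */
  static Set<Set<String>> frequentItemsets(List<Set<String>> transactions, int minSupport) {
    Set<Set<String>> result = new HashSet<>();
    // Level 1: frequent single items.
    Set<Set<String>> level = new HashSet<>();
    for (Set<String> t : transactions) {
      for (String item : t) {
        Set<String> single = new TreeSet<>();
        single.add(item);
        if (support(transactions, single) >= minSupport) {
          level.add(single);
        }
      }
    }
    // Level k+1: join frequent k-itemsets, prune, then count support.
    while (!level.isEmpty()) {
      result.addAll(level);
      Set<Set<String>> next = new HashSet<>();
      for (Set<String> a : level) {
        for (Set<String> b : level) {
          Set<String> candidate = new TreeSet<>(a);
          candidate.addAll(b);
          if (candidate.size() != a.size() + 1) {
            continue; // the two sets must differ in exactly one item
          }
          if (allSubsetsFrequent(candidate, level)
              && support(transactions, candidate) >= minSupport) {
            next.add(candidate);
          }
        }
      }
      level = next;
    }
    return result;
  }
}
```

In a MapReduce setting the support counting per level would presumably be the 
part that gets distributed, while candidate generation stays as above; that 
split is an assumption here, since the issue does not spell it out.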




[jira] Commented: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754966#action_12754966
 ] 

Isabel Drost commented on MAHOUT-167:
-

Hmm. Should we then defer this issue to a later version of Mahout?

> Convert clustering code to Hadoop 0.20 API
> --
>
> Key: MAHOUT-167
> URL: https://issues.apache.org/jira/browse/MAHOUT-167
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Jeff Eastman
>Assignee: Jeff Eastman
> Fix For: 0.2
>
>
> We need to update the clustering implementations to remove the deprecated 
> Hadoop API calls.




[jira] Commented: (MAHOUT-160) ClusterDumper utility to output all the clusters in all sequence files and points

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754963#action_12754963
 ] 

Isabel Drost commented on MAHOUT-160:
-

If that is committed - can we close the issue?

> ClusterDumper utility to output all the clusters in all sequence files and 
> points
> -
>
> Key: MAHOUT-160
> URL: https://issues.apache.org/jira/browse/MAHOUT-160
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: mahout-160-dict.patch, mahout-160.patch
>
>
> The current ClusterDumper utility takes a sequence file and points file as 
> input and prints the cluster vector along with the points that belong to the 
> clusters in the sequence file. This utility doesn't produce correct results 
> in case there are multiple sequence files and points. 
> To avoid this problem, all the point-to-cluster mappings need to be read 
> first, before iterating over the sequence files.
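
The two-pass approach described above might look roughly like this: first 
merge the point-to-cluster mappings from every points file into one index, 
then walk the cluster files and look each cluster up. The in-memory lists 
standing in for sequence files, and the names ClusterDumpSketch, indexPoints 
and dump, are assumptions for illustration; this is not the ClusterDumper 
code from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a two-pass cluster dump. */
final class ClusterDumpSketch {

  /** Pass 1: merges {pointId, clusterId} pairs from all points files. */
  static Map<String, List<String>> indexPoints(List<List<String[]>> pointFiles) {
    Map<String, List<String>> byCluster = new TreeMap<>();
    for (List<String[]> file : pointFiles) {
      for (String[] pair : file) {
        // pair[0] is the point id, pair[1] the id of the cluster it belongs to
        byCluster.computeIfAbsent(pair[1], k -> new ArrayList<>()).add(pair[0]);
      }
    }
    return byCluster;
  }

  /** Pass 2: iterates over all cluster files and prints each cluster's points. */
  static List<String> dump(List<List<String>> clusterFiles,
                           Map<String, List<String>> byCluster) {
    List<String> lines = new ArrayList<>();
    for (List<String> file : clusterFiles) {
      for (String clusterId : file) {
        List<String> points = byCluster.get(clusterId);
        if (points == null) {
          points = Collections.<String>emptyList();
        }
        lines.add(clusterId + " -> " + points);
      }
    }
    return lines;
  }
}
```

Because the index is complete before any cluster is printed, a cluster in one 
file still finds its points even when they were written to a different 
points file.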




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754860#action_12754860
 ] 

Isabel Drost commented on MAHOUT-171:
-

As for the Hudson subtask: To me, http://wiki.apache.org/general/Hudson reads 
like you need to be PMC member to change the Hudson settings for Mahout?

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Commented: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754855#action_12754855
 ] 

Isabel Drost commented on MAHOUT-171:
-

Filed subtask to infra: https://issues.apache.org/jira/browse/INFRA-2229

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754850#action_12754850
 ] 

Isabel Drost commented on MAHOUT-138:
-

From a first glimpse at the code, it looks like there are quite a few other 
classes that need switching as well (grepped through the code base, so no 
guarantee that there are no false positives):

   * ./core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansJob.java
   * ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletJob.java
   * ./core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletDriver.java
   * ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyJob.java
   * ./core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/bayes/BayesDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/bayes/BayesThetaNormalizerDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesNormalizedWeightDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesThetaDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/cbayes/CBayesThetaNormalizerDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesWeightSummerDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesFeatureDriver.java
   * ./core/src/main/java/org/apache/mahout/classifier/bayes/mapreduce/common/BayesTfIdfDriver.java
   * ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
   * ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
   * ./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
   * ./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
   * ./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java
   * ./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
   * ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
   * ./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
   * ./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
   * ./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java

I'd like to offer my help with some of these.

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Issue Comment Edited: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754847#action_12754847
 ] 

Isabel Drost edited comment on MAHOUT-138 at 9/14/09 12:17 AM:
---

Hmm - the patch seems to be out of sync with trunk. 

  was (Author: isabel):
Hmm - the patch seems to be out of sync with trunk. From looking at it, it 
also seems it contains two changes - the CLI support and adding a 
RandomSeedGenerator?
  
> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-09-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754847#action_12754847
 ] 

Isabel Drost commented on MAHOUT-138:
-

Hmm - the patch seems to be out of sync with trunk. From looking at it, it also 
seems it contains two changes - the CLI support and adding a 
RandomSeedGenerator?

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI




[jira] Resolved: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-13 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-172.
-

Resolution: Fixed

fixed in revision 814495

> When running on a Hadoop cluster LDA fails with Caused by: 
> java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
> -
>
> Key: MAHOUT-172
> URL: https://issues.apache.org/jira/browse/MAHOUT-172
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: lda.patch
>
>
> I tried running the reuters example of lda on a hadoop cluster today. Seems 
> like the implementation tries to read all files in output/state-* which fails 
> if in that directory "_logs" is found.
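
The issue does not show what the fix in revision 814495 looks like. One 
common way to avoid this kind of failure is to skip entries such as "_logs" 
when listing the state directory; the sketch below shows that idea on plain 
file names. The class and method names (StateFileFilterSketch, isDataFile, 
dataFiles) and the rule of skipping names starting with "_" or "." are 
assumptions, not the committed change.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: ignore bookkeeping entries when reading job output. */
final class StateFileFilterSketch {

  /** Treats names starting with '_' or '.' (for example "_logs") as non-data. */
  static boolean isDataFile(String fileName) {
    return !fileName.startsWith("_") && !fileName.startsWith(".");
  }

  /** Returns only the names that should be opened as state files. */
  static List<String> dataFiles(List<String> fileNames) {
    List<String> kept = new ArrayList<>();
    for (String name : fileNames) {
      if (isDataFile(name)) {
        kept.add(name);
      }
    }
    return kept;
  }
}
```

For a directory listing of part-00000, _logs and part-00001, only the two 
part files would be read.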




[jira] Commented: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm

2009-09-13 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754694#action_12754694
 ] 

Isabel Drost commented on MAHOUT-108:
-

Contacted (or at least tried to contact) Chao Deng to ask for the status and 
whether I could help him submit the patch. Should we close this issue as won't 
fix, or defer it to a later version if he does not respond? Or is anyone else 
up for implementing a patch for this task in time for 0.2?

> Implementation of Assoication Rules learning by Apriori algorithm
> -
>
> Key: MAHOUT-108
> URL: https://issues.apache.org/jira/browse/MAHOUT-108
> Project: Mahout
>  Issue Type: Task
> Environment: Linux, Hadoop-0.17.1
>Reporter: chao deng
> Fix For: 0.2
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Target: Association Rules learning is a popular method for discovering 
> interesting relations between variables in large databases. Here, we would 
> implement the Apriori algorithm using Hadoop&Mapreduce parallel techniques.
> Applications: Typically, association rules  learning is used to discover 
> regularities between products in large scale transaction data in 
> supermarkets. For example, the rule "{onions, potatoes}->beef" found in the 
> sales data would indicate that if a customer buys onions and potatoes 
> together, he or she is likely to also buy beef. Such information can be used 
> as the basis for decisions about marketing activities. In addition to the 
> market basket analysis, association rules are employed today in many 
> application areas including Web usage mining, intrusion detection and 
> bioinformatics.
> Apriori algorithm: Apriori is the best-known algorithm to mine association 
> rules. It uses a breadth-first search strategy to count the support of 
> itemsets and uses a candidate generation function which exploits the downward 
> closure property of support




[jira] Assigned: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-13 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-171:
---

Assignee: Isabel Drost

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Assigned: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-13 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-172:
---

Assignee: Isabel Drost

> When running on a Hadoop cluster LDA fails with Caused by: 
> java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
> -
>
> Key: MAHOUT-172
> URL: https://issues.apache.org/jira/browse/MAHOUT-172
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: lda.patch
>
>
> I tried running the reuters example of lda on a hadoop cluster today. Seems 
> like the implementation tries to read all files in output/state-* which fails 
> if in that directory "_logs" is found.




[jira] Commented: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-13 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754692#action_12754692
 ] 

Isabel Drost commented on MAHOUT-172:
-

Committing on Monday.

> When running on a Hadoop cluster LDA fails with Caused by: 
> java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
> -
>
> Key: MAHOUT-172
> URL: https://issues.apache.org/jira/browse/MAHOUT-172
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Isabel Drost
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: lda.patch
>
>
> I tried running the reuters example of lda on a hadoop cluster today. Seems 
> like the implementation tries to read all files in output/state-* which fails 
> if in that directory "_logs" is found.




[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-09-07 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752249#action_12752249
 ] 

Isabel Drost commented on MAHOUT-157:
-

The formatting still looks a bit weird (spaces, line length etc.)

PFPGrowth, line 101, 183, 214 - please at least add a warning log message prior 
to deleting pre-existing output path and document somewhere on the usage page 
that default behaviour is deleting the output path, if exists. (I think that 
differs from the implementation in lda - we need to agree on consistent 
behaviour across mahout in such cases).

line 174 - the combiner is commented out?

ParallelCountingMapper - shouldn't you report status through the reporter 
during mapping?

ParallelFPGrowthMapper - line 87 - please do not use e.printStackTrace() but 
generate a regular log message and log the exception stack trace through the 
logger.
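
A minimal sketch of what is meant here, assuming java.util.logging only to keep the example self-contained (the project may well use a different logging framework). LoggingSketch, riskyStep and runStep are hypothetical names that stand in for the mapper code.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for a mapper that has to deal with an exception. */
class LoggingSketch {
    private static final Logger LOG = Logger.getLogger(LoggingSketch.class.getName());

    /** Simulated unit of work that may fail. */
    static void riskyStep(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated failure");
        }
    }

    /** Returns true on success; on failure logs the message and the stack trace. */
    static boolean runStep(boolean fail) {
        try {
            riskyStep(fail);
            return true;
        } catch (IOException e) {
            // Handing the exception to the logger records its stack trace in
            // the regular log output instead of printing it to stderr.
            LOG.log(Level.WARNING, "Step failed", e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runStep(false));
        System.out.println(runStep(true));
    }
}
```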

I would love to see some more comments: the expected format of key and value, 
the expected content of glist and flist.

ParallelFPGrowthReducer line 111 - don't use e.printStackTrace.

AggregatorReducer - line 91 same

Attribute/TreeNode - The code is pretty clear, still I would love to see some 
more documentation on the overall data structure.

FrequentPatternMaxHeap - line 74 - Huh? Judging from the return value, you can 
omit the comparison against null here. (line 81 same.)

FPGrowth - line 42 - the method name should not start with a capital letter.

517 lines for implementing the whole algorithm in one class - looks a bit large 
to me. Is it possible to split it up?

line 165 - converting from Integer to int and back again usually costs quite a 
bit of performance. Is there a way to rely on primitives only, or implement 
your own incrementable integer type? Btw., java.lang.Integer has no ONE 
constant, so instead of calling Integer.valueOf(1) repeatedly, a static final 
constant of your own would be the cleaner way to name that value.
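
The incrementable integer type suggested for line 165 could look roughly like this. It is a sketch only: CounterSketch and IntCounter are hypothetical names, and the String keys are an assumption made for the example, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts occurrences without creating a new Integer for every increment. */
class CounterSketch {

    /** Hypothetical mutable counter holding a primitive int. */
    static final class IntCounter {
        private int value;

        void increment() {
            value++;
        }

        int get() {
            return value;
        }
    }

    /** One counter object per distinct item, incremented in place. */
    static Map<String, IntCounter> count(String[] items) {
        Map<String, IntCounter> counts = new HashMap<>();
        for (String item : items) {
            counts.computeIfAbsent(item, k -> new IntCounter()).increment();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, IntCounter> counts = count(new String[] {"a", "b", "a"});
        System.out.println(counts.get("a").get());
    }
}
```

Compared with a Map of Integer values, each increment here touches a primitive field only, so there is no unboxing and reboxing per update.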

Type T - I think it would make the code better readable if T were given a 
clearer name, something like TransactionType? Otherwise you need to document 
what exactly T represents.

line 229: Would reformulating the while conditions as 
"while(!tempNode.childNodes.isEmpty()) { ... }" make the code clearer here?

line 239: Where does the magic number 6 come from here? Define as a constant 
with a speaking name?

the two generatedSinglePathPatterns methods look rather similar - is it 
possible to not copy the code but extract it into its own method or reuse one 
in the other?

line 293 (and earlier): Where does the magic number 4 come from? Define 
constant with speaking name?

line 506: Looks like a strange log message?

Concerning your idea of going the algorithm interface way for fpGrowth: If you 
can already make out what the interface should look like, I think that would be 
a good way to make it easier for future implementors of other frequent itemset 
algorithms.

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
>Reporter: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




[jira] Updated: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-04 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-172:


Attachment: lda.patch

The patch narrows the url pattern so that it no longer matches everything in 
the output directory but only files that start with part*. With that change the 
lda job seems to run fine for me.
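
The effect of such a pattern can be illustrated without Hadoop. The sketch below uses the glob support in java.nio as a stand-in for whatever path matching the patch actually relies on; PartFileFilterSketch and selectPartFiles are hypothetical names, and the file names merely mimic a typical job output directory.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Keeps only entries named part*, so a directory such as _logs is skipped. */
class PartFileFilterSketch {

    static List<String> selectPartFiles(List<String> names) {
        // The glob "part*" matches any single file name that starts with "part".
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:part*");
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(selectPartFiles(List.of("part-00000", "_logs", "part-00001")));
    }
}
```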

> When running on a Hadoop cluster LDA fails with Caused by: 
> java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
> -
>
> Key: MAHOUT-172
> URL: https://issues.apache.org/jira/browse/MAHOUT-172
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Isabel Drost
> Fix For: 0.2
>
> Attachments: lda.patch
>
>
> I tried running the reuters example of lda on a hadoop cluster today. Seems 
> like the implementation tries to read all files in output/state-* which fails 
> if in that directory "_logs" is found.




[jira] Created: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-04 Thread Isabel Drost (JIRA)
When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: 
Cannot open filename /user/*/output/state-*/_logs
-

 Key: MAHOUT-172
 URL: https://issues.apache.org/jira/browse/MAHOUT-172
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.1
Reporter: Isabel Drost
 Fix For: 0.2


I tried running the reuters example of lda on a hadoop cluster today. Seems 
like the implementation tries to read all files in output/state-* which fails 
if in that directory "_logs" is found.




[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-04 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-171:


Description: 
Opening a JIRA task to collect what has to be done for moving over to using 
apache version 5 parent pom (see also 
http://markmail.org/thread/ld26m3xxzoztqsk6 ).

   * Link Apache parent pom into our pom.
   * Update hudson to build via maven ( ? ).
   * File subtask at INFRA-1896 to include mahout in repository.apache.org


  was:
Opening a JIRA task to collect what has to be done for moving over to using 
apache version 5 parent pom (see also 
http://markmail.org/thread/ld26m3xxzoztqsk6 ).

   * Link Apache parent pom into our pom.
   * Update hudson to build via maven (?).
   * File subtask at INFRA-1896 to include mahout in repository.apache.org



> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven ( ? ).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Updated: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-04 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost updated MAHOUT-171:


Attachment: parent_pom.patch

Mahout Parent pom now includes reference to apache parent pom.

> Move deployment to repository.apache.org
> 
>
> Key: MAHOUT-171
> URL: https://issues.apache.org/jira/browse/MAHOUT-171
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.1
>Reporter: Isabel Drost
> Fix For: 0.2
>
> Attachments: parent_pom.patch
>
>
> Opening a JIRA task to collect what has to be done for moving over to using 
> apache version 5 parent pom (see also 
> http://markmail.org/thread/ld26m3xxzoztqsk6 ).
>* Link Apache parent pom into our pom.
>* Update hudson to build via maven (?).
>* File subtask at INFRA-1896 to include mahout in repository.apache.org




[jira] Created: (MAHOUT-171) Move deployment to repository.apache.org

2009-09-04 Thread Isabel Drost (JIRA)
Move deployment to repository.apache.org


 Key: MAHOUT-171
 URL: https://issues.apache.org/jira/browse/MAHOUT-171
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.1
Reporter: Isabel Drost
 Fix For: 0.2


Opening a JIRA task to collect what has to be done for moving over to using 
apache version 5 parent pom (see also 
http://markmail.org/thread/ld26m3xxzoztqsk6 ).

   * Link Apache parent pom into our pom.
   * Update hudson to build via maven (?).
   * File subtask at INFRA-1896 to include mahout in repository.apache.org





[jira] Commented: (MAHOUT-124) Online Classification using HBase

2009-08-11 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742227#action_12742227
 ] 

Isabel Drost commented on MAHOUT-124:
-

> Ant config was done to decrease the job jar file size. See first comment in 
> this issue point No:3

Ah, thanks for the reminder...

> I need the new Eclipse Code formatter for that purpose. I am still using the 
> lucene code formatter, which is causing this break.

Ok, I see. I guess that should be no show-stopper for the code to get in.

> Docs... already on it!
> Removed all hard coded map/Reduce task number limit from code. Will conform 
> to the cluster its being run on.

Great!

> Map/Reduce jobs doesnt do much leg work that it confuses reading the code, I 
> could factor them out as well if needed.

I think we could leave that open for a later patch.

> TODO: Algorithm will keep datastore internally.
> TODO: add jar from latest trunk of HBase

You could probably add a JIRA task to upgrade HBase to the official release as 
soon as that is out. Just so we do not forget that task. Other than that, to me 
it looks like this code could go in by the end of this week. If anyone else 
would like to have a look over the code before and needs more time, please do 
tell.

> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, 
> MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS




[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-08-11 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742033#action_12742033
 ] 

Isabel Drost commented on MAHOUT-157:
-

Patch applies to trunk but I ran into problems when trying to get it to 
compile. I needed to apply MAHOUT-124 first. Then I got an error that indicated 
that you are using the Combinations class not only in the tests (where it is 
put by the diff) but also in the regular source code. After copying the class 
to src/main/java, I get the following error: 

MAHOUT-157/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ParallelFPGrowthReducer.java:[92,41]
 get(java.lang.String) in org.apache.mahout.common.Parameters cannot be applied 
to (java.lang.String,java.lang.String)

I guess I have done something wrong when applying the patches one after another?


Other than that I only have some general comments before going into more detail 
for the review: I am missing some documentation, both JavaDoc and package.html, 
at least a link to the original paper would be nice to have.

PFPGrowth - seems like you do quite a lot of work in your constructor. I think 
it is not a good idea to start map reduce jobs from within a constructor. Maybe I 
am reading something wrong here?

Is it possible to break up the test into unit tests? I think that would make 
changing the code and tracking where the change actually broke the code by far 
easier.

AggregatorReducer, line 88: Please avoid calls to printStackTrace() 
- usually those messages get lost when the system is in production. Better log 
the message through the logger, passing the exception along with it - and maybe 
rethrow the exception if you cannot handle it properly.

TreeNode - the class seems to contain public attributes only but no methods. 
Please at least explain which type of tree these nodes are supposed to be a 
part of. From the code alone I am not able to understand its usage... 

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
>Reporter: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-inProgress-August-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




[jira] Commented: (MAHOUT-124) Online Classification using HBase

2009-08-11 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742031#action_12742031
 ] 

Isabel Drost commented on MAHOUT-124:
-


Altogether really nice changes. The patch now applies to trunk without 
problems and builds (except for the missing hbase dependency). As this will be 
one of the last reviews, I tried to be a little more picky also with minor 
changes like added System.out.println and missing documentation...

The ant config file (build.xml) contains changes that I see nowhere explained. 
Are they supposed to remain for the final patch?

In the examples concerning the TestClassifier
  - it has imports for java.io.* and java.util.* - for the final patch could 
you please revert those to the specific imports?
  - could you please try to avoid reformatting the code as much as possible? It 
makes reading patches a whole lot easier.
  - in line 129 there is quite a bit of code commented out - better throw it 
out entirely? If needed later the snippet is still in jira.
  - line 224 - have the timing statistics been left in intentionally?

utils/nlp/NGrams
  - The class is missing documentation. I guess your intention was to generate 
nGrams from a line of text, not the whole document? Otherwise holding document 
and nGrams both in memory seems a little bit much. There also seems to be no 
unit test for it?

The classes implementing the caching algorithms are missing documentation. At 
least some /** {@inheritDoc} */ and a short comment on top that explains the 
purpose of the implementation would be nice. (Same applies for Pair and 
Parameters).

CBayesNormalizerReducer still has HBase Dependencies - is it possible to factor 
them out?

BayesThetaNormalizerDriver - setting the number of map tasks was commented away 
compared with trunk. Intentional?

BayesClassifierMapper - lines 106, 110 and following: Shouldn't the log message 
be something like "Using ..." instead of "Testing ..."?

classifier/bayes/interfaces/algorithm/Algorithm - you still give a pointer to 
the datastore with every method call to the Algorithm. Wouldn't the interface 
look cleaner if the Algorithm would hold a reference to an initialized 
datastore and use that for further requests? I don't think it is very likely 
that users will go to HBase for the first document to classify and to an 
InMemoryStore for the next document.

bayes/algorithm/CBayesAlgorithm, BayesAlgorithm, 
bayes/common/ClassifierPriorityQueue - are missing some basic JavaDoc.

BayesTfIdfDriver, BayesTfIdfReducer, BayesWeightSummerReducer - I assume the 
dependency to HBase cannot be factored out?

BayesFeatureMapper - there is a System.out.println in there...

One last question: You reference hbase-0.20.0 which is not released yet. I 
guess we should include a prebuilt version in our lib directory and ship that 
until hbase has an official release to use?

> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, 
> MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS




[jira] Assigned: (MAHOUT-124) Online Classification using HBase

2009-08-11 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-124:
---

Assignee: Isabel Drost

> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Isabel Drost
> Fix For: 0.2
>
> Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, 
> MAHOUT-124-July-23.patch, MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS




[jira] Commented: (MAHOUT-124) Online Classification using HBase

2009-07-19 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733075#action_12733075
 ] 

Isabel Drost commented on MAHOUT-124:
-

Just forgot two final notes:

You should update your svn-checkout. The patch was done against an old revision 
of trunk and no longer applies cleanly.

The patch was broken - line 988 in the patch file has a broken directive: 

@@ -48,67 +54,107 @@

should really be

@@ -48,67 +54,105 @@

the effect being that "patch" expects the new side of the hunk to span 107 
lines, which makes it fail. Your hunk only has 105 lines there, so better not 
lie to "patch" :) However, that one was trivial to fix.
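
For reference, the two lengths in a unified diff hunk header can be recomputed from the hunk body: context lines count for both sides, lines starting with '-' only for the old side and lines starting with '+' only for the new side. The following sketch shows that bookkeeping; HunkHeaderSketch and its methods are hypothetical names, and the tiny hunk in main is made up.

```java
import java.util.List;

/** Recomputes the "@@ -start,len +start,len @@" header from a hunk body. */
class HunkHeaderSketch {

    /** Returns {oldLength, newLength} for the body lines of one hunk. */
    static int[] hunkLengths(List<String> body) {
        int oldLen = 0;
        int newLen = 0;
        for (String line : body) {
            if (line.startsWith("\\")) {
                continue; // "\ No newline at end of file" marker, not counted
            }
            char marker = line.isEmpty() ? ' ' : line.charAt(0);
            if (marker == '-') {
                oldLen++;          // removed: present in the old file only
            } else if (marker == '+') {
                newLen++;          // added: present in the new file only
            } else {
                oldLen++;          // context: present in both files
                newLen++;
            }
        }
        return new int[] {oldLen, newLen};
    }

    /** Builds the header for a hunk body and its two start lines. */
    static String header(int oldStart, int newStart, List<String> body) {
        int[] len = hunkLengths(body);
        return "@@ -" + oldStart + "," + len[0] + " +" + newStart + "," + len[1] + " @@";
    }

    public static void main(String[] args) {
        List<String> body = List.of(" a", "-b", "+c", "+d", " e");
        System.out.println(header(48, 54, body));
    }
}
```

Comparing the recomputed header with the one in the patch file shows quickly whether a hand-edited hunk still matches its header.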

(Thanks to Thilo Fromm for helping me fix and explain that.)

> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
> Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, 
> MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS




[jira] Commented: (MAHOUT-124) Online Classification using HBase

2009-07-19 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733069#action_12733069
 ] 

Isabel Drost commented on MAHOUT-124:
-

*ThetaNormalizerReducer, *BayesTFIDFReducer and *BayesSummerReducer still have 
dependencies to HBase - I think one can factor them out.

Interface "Algorithm" - I think it might make sense to initialise the Algorithm 
with a reference to the datastore instead of injecting that reference with 
every method call. Other than that: Looks good. Bayes and CBayes look a lot 
cleaner now.

Interface Datastore looks good. I like the separation of data handling and 
actual algorithm implementation.

I would move Pair over to the utils package.

Good work Robin.


> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
> Attachments: MAHOUT-124-July-13.patch, MAHOUT-124-July-6.patch, 
> MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS




[jira] Commented: (MAHOUT-108) Implementation of Assoication Rules learning by Apriori algorithm

2009-07-14 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731024#action_12731024
 ] 

Isabel Drost commented on MAHOUT-108:
-

Hello Chao Deng,

what is the status of your apriori patch?

Isabel

> Implementation of Assoication Rules learning by Apriori algorithm
> -
>
> Key: MAHOUT-108
> URL: https://issues.apache.org/jira/browse/MAHOUT-108
> Project: Mahout
>  Issue Type: Task
> Environment: Linux, Hadoop-0.17.1
>Reporter: chao deng
> Fix For: 0.2
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Target: Association Rules learning is a popular method for discovering 
> interesting relations between variables in large databases. Here, we would 
> implement the Apriori algorithm using Hadoop and MapReduce parallel techniques.
> Applications: Typically, association rules learning is used to discover 
> regularities between products in large-scale transaction data in 
> supermarkets. For example, the rule "{onions, potatoes}->beef" found in the 
> sales data would indicate that if a customer buys onions and potatoes 
> together, he or she is likely to also buy beef. Such information can be used 
> as the basis for decisions about marketing activities. In addition to the 
> market basket analysis, association rules are employed today in many 
> application areas including Web usage mining, intrusion detection and 
> bioinformatics.
> Apriori algorithm: Apriori is the best-known algorithm to mine association 
> rules. It uses a breadth-first search strategy to count the support of 
> itemsets and uses a candidate generation function which exploits the downward 
> closure property of support




[jira] Commented: (MAHOUT-124) Online Classification using HBase

2009-07-07 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728311#action_12728311
 ] 

Isabel Drost commented on MAHOUT-124:
-

Some initial comments on the patch:

org/apache/mahout/utils/Cache.java - I am missing some documentation for the 
methods. In interfaces, you can omit the public modifier on methods. For classes 
implementing this interface, you might want to at least use @inheritDoc to link 
back to the original documentation. Please also note in the class comment 
whether your implementation is safe to use in a multi-threaded context or not.
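
As an illustration of these three points, here is a sketch of a documented cache interface and one implementation. The method set and the LRU eviction policy are assumptions made for the example and are not taken from the patch; CacheSketch, Cache and LruCache are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small demo driver for the sketch below. */
class CacheSketch {
    public static void main(String[] args) {
        Cache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // "a" becomes the most recently used entry
        cache.put("c", 3);  // evicts "b", the least recently used entry
        System.out.println(cache.get("b"));
    }
}

/** Hypothetical cache interface; methods are implicitly public. */
interface Cache<K, V> {
    /** Returns the value cached for the key, or null if there is none. */
    V get(K key);

    /** Stores the value under the key, possibly evicting another entry. */
    void put(K key, V value);
}

/**
 * Cache that evicts the least recently used entry once it holds more than
 * a fixed number of entries. This class is NOT thread-safe; callers that
 * share an instance between threads have to synchronize externally.
 */
class LruCache<K, V> implements Cache<K, V> {
    private final Map<K, V> entries;

    LruCache(final int capacity) {
        // accessOrder = true keeps the least recently accessed entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** {@inheritDoc} */
    @Override
    public V get(K key) {
        return entries.get(key);
    }

    /** {@inheritDoc} */
    @Override
    public void put(K key, V value) {
        entries.put(key, value);
    }
}
```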

org.apache.mahout.common.Model - To me it looks a bit weird to add a dependency 
to HBase directly to the model. I would prefer the HBase implementation to be 
less tightly coupled with the core code. Currently it looks like the model is 
really doing two tasks at once: Implementing an in-memory-model as well as an 
HBase model. I think it should be possible to refactor the code such that the 
two can be separated into distinct classes that can then be used 
interchangeably. My first guess would be that the strategy pattern should be 
helpful with this task. 
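
A rough sketch of how the strategy pattern could separate the two concerns. All names here (ModelSketch, WeightStore, InMemoryWeightStore) are hypothetical, the scoring is deliberately simplistic, and an HBase-backed store is only hinted at since it would need the HBase client on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/** Model logic that only knows the storage interface, not the backend. */
class ModelSketch {
    private final WeightStore store;

    /** The store is handed over once, not with every method call. */
    ModelSketch(WeightStore store) {
        this.store = store;
    }

    /** Sums the stored weights of all features for the given label. */
    double score(String label, String[] features) {
        double sum = 0.0;
        for (String feature : features) {
            sum += store.weight(label, feature);
        }
        return sum;
    }

    public static void main(String[] args) {
        WeightStore store = new InMemoryWeightStore();
        store.setWeight("spam", "buy", 1.5);
        store.setWeight("spam", "now", 0.5);
        System.out.println(new ModelSketch(store).score("spam", new String[] {"buy", "now"}));
    }
}

/** Hypothetical storage strategy: where the feature weights live. */
interface WeightStore {
    /** Returns the weight for the label/feature pair, or 0.0 if unknown. */
    double weight(String label, String feature);

    /** Stores the weight for the label/feature pair. */
    void setWeight(String label, String feature, double value);
}

/** Strategy backed by a HashMap; an HBase variant would implement the same interface. */
class InMemoryWeightStore implements WeightStore {
    private final Map<String, Double> weights = new HashMap<>();

    @Override
    public double weight(String label, String feature) {
        return weights.getOrDefault(label + '\t' + feature, 0.0);
    }

    @Override
    public void setWeight(String label, String feature, double value) {
        weights.put(label + '\t' + feature, value);
    }
}
```

With this split the core classes would not need any HBase import; only the class implementing the HBase strategy would.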

You probably will have to refactor CBayesModel and BayesModel as well. The same 
applies to org/apache/mahout/classifier/Classify.java and CBayesModel, Model, 
BayesTfIdfDriver, BayesTfIDFReducer, BayesWeightSummerReducer.

org.apache.mahout.classifier.cbase - I really like your additions for reporting 
progress back to Hadoop. I would suggest splitting these from the patch, opening a 
separate issue and attaching the changes there. This would keep this patch more 
focussed on the original task of adding HBase support.

org.apache.mahout.classifier.cbase.CBayesModel - Please remove the code you 
commented out if you do not need it anymore. In case of catching an IOException 
you should at least write some warning log message (e.g. line 60). 

> Online Classification using HBase
> -
>
> Key: MAHOUT-124
> URL: https://issues.apache.org/jira/browse/MAHOUT-124
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Robin Anil
> Attachments: MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
>
>
> #   Batch classification of flat file documents and flat file model:
> #   Storing the model in HBase at the end of the Model Building Map/Reduce 
> stages
> #   Using the model stored in HBase, create an interface (both command 
> line and web service) to classify a given document
> #   Using the model stored in HBase, batch classify documents stored on 
> the HDFS



