[ 
https://issues.apache.org/jira/browse/MAHOUT-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037486#comment-13037486
 ] 

Daniel McEnnis commented on MAHOUT-668:
---------------------------------------

Thank you, Ted, for putting so much time into this.  I'll do my best to answer 
as concisely and completely as possible.

1. Use case: This is the algorithm for those learning problems that are simply 
too massive even for Mahout's memory-streamlined algorithms.  For knn in 
particular, think of an advertising company with 50,000 classes of people, tens 
to hundreds of millions of examples, and many terabytes of log data, which 
needs to classify which type of person each log belongs to.  Memory footprint 
becomes the biggest issue, as even the model takes more memory than is 
available.  For the other Mahout classifiers, training data size is limited to 
the memory available on the data nodes.

2. I forgot to add javadoc to the test classes.  I'll fix that for the next 
patch.

3.  These distance measures have very different assumptions from those in 
recommendation. A missing vector entry (say in sparse vector format) means 0, 
not missing.  This requires a hack of all distance measures to accommodate it.  
The measures also range from 0 to infinity, not -1 to 1, and smaller is better.  
Cosine distance doesn't fit this, so it has a transform that maps it to the 
range 0 to 2, where smaller is better.  KL Distance is based on entropy.  I'll 
double check my references for the details.
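
As an illustration of point 3 only, and not code from the patch, here is a 
minimal Java sketch of a cosine distance over sparse vectors in which an absent 
entry counts as 0.  The class name SparseCosineDistance, the use of a Map as 
the sparse vector, and the choice of 1 - cosine as the transform are all 
assumptions made for this example; the patch may use a different mapping onto 
the 0 to 2 range.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the patch or of Mahout. */
final class SparseCosineDistance {

    private SparseCosineDistance() {}

    /** Dot product in which an index absent from either map counts as 0. */
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Walk the smaller map; indices missing from the other add nothing.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Assumed transform: 1 - cosine similarity, so 0 is best and 2 is worst. */
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double normA = Math.sqrt(dot(a, a));
        double normB = Math.sqrt(dot(b, b));
        if (normA == 0.0 || normB == 0.0) {
            // Assumption: an all-zero vector has no direction, so treat it
            // as orthogonal to everything.
            return 1.0;
        }
        double cosine = dot(a, b) / (normA * normB);
        // Clamp to [-1, 1] to absorb floating-point rounding.
        cosine = Math.max(-1.0, Math.min(1.0, cosine));
        return 1.0 - cosine;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0);
        a.put(2, 2.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(1, 3.0); // shares no index with a, so the dot product is 0
        System.out.println(distance(a, b));
    }
}
```

Vectors pointing the same way give a distance near 0, vectors with no shared 
non-zero index give 1, and opposite vectors give a distance near 2, which 
matches the smaller-is-better convention described above.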

4. MasterVector and ClassLabelVector - I created my own Dictionary class 
because I had difficulty understanding the existing one.  I'm willing to 
switch; it just means taking more time to understand the code.  The name is 
arbitrary.  I can change it as needed. DfCountDictionary works better for me as 
it's not an inverted reference. 
 

5. standard classifier - Until today, I thought this was specific to the Bayes 
algorithm.  I'll add it to the next patch.

6. usability.  Any user reading the javadoc on the entry classes ModelBuilder, 
Classifier, or TestClassifier will find instructions on how to set up data for 
this patch.  All three should have their options explained.  I'll add it to the 
list of things to put in the next patch.  My understanding was that Mahout has 
no standard, at least for input formats.  This patch describes my proposal for 
what input formats each Mahout component ought to be able to process.

7.  still working on model suggestions....



> Adding knn support to Mahout classifiers
> ----------------------------------------
>
>                 Key: MAHOUT-668
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-668
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Daniel McEnnis
>              Labels: classification, knn
>         Attachments: MAHOUT-668.pat, Mahout-668-2.patch, Mahout-668-3.patch, 
> Mahout-668.pat
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Initial implementation of the knn.  This is a minimum base set with many more 
> possible add-ons including support for text and weka input as well as a 
> classify only (no confusion matrix) back end.  The system was tested on the 
> 20 newsgroup data set.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
