[ 
https://issues.apache.org/jira/browse/MAHOUT-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037505#comment-13037505
 ] 

Ted Dunning commented on MAHOUT-668:
------------------------------------

{quote}
On Sat, May 21, 2011 at 5:47 PM, Daniel McEnnis (JIRA) <[email protected]> wrote:
1. Use case: This is the algorithm for those learning problems that are simply 
too massive even for Mahout's memory streamlined algorithms.  Particularly for 
knn, it's the advertising company with 50,000 classes of people, tens to 
hundreds of millions of examples and many terabytes of log data to classify 
which type of person a log belongs to.  Memory footprint becomes the biggest 
issue as even the model takes more memory than what is available.  For the 
other Mahout classifiers, training data size is limited to available memory on 
data nodes.
{quote}

Actually, no.  This is not true for any of the other model-training algorithms 
in Mahout, except arguably (and only partially) for the random forest.  For the 
Naive Bayes algorithms and the SGD algorithms it is distinctly not true.
 
{quote}
3.  These distance measures have very different assumptions from those in 
recommendation. A missing vector entry (say in sparse vector format) means 0, 
not missing.  This requires a hack of all distance measures to accommodate it.  
{quote}

I don't see why.  Most of the other distance measures in Mahout use this same 
convention.  Certainly v1.getDistanceSquared(v2) and 
v1.minus(v2).assign(Functions.abs).sum() would give you results that assume 
zeros for missing elements.

I really think that the sub-classes of 
org.apache.mahout.common.distance.DistanceMeasure do just what you are saying 
that you want.
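For what it's worth, the missing-equals-zero convention is easy to state in 
plain Java.  A minimal sketch using index-to-value maps (not Mahout's Vector 
classes) for squared Euclidean distance over sparse vectors:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SparseDistance {
    // Squared Euclidean distance where an absent index contributes 0.
    static double distanceSquared(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> indices = new HashSet<>(a.keySet());
        indices.addAll(b.keySet());
        double sum = 0.0;
        for (int i : indices) {
            double diff = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v1 = new HashMap<>();
        v1.put(0, 3.0);
        v1.put(5, 4.0);
        Map<Integer, Double> v2 = new HashMap<>();
        v2.put(0, 3.0);   // index 5 is absent from v2, i.e. treated as 0
        System.out.println(distanceSquared(v1, v2)); // prints 16.0
    }
}
```

Mahout's sparse vector classes and DistanceMeasure subclasses already behave 
this way, so no hack to the distance measures should be needed.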

{quote}
The measures are also 0 to Infinity, not -1 to 1, and the smaller the better.  
Cosine distance doesn't fit this, so it's got a transform to map it to 0-2 where 
smaller is better.  
{quote}

My point was that cosine distance is essentially the same as Euclidean 
distance once vectors are normalized to unit length.  Why not just use that?
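To make that concrete, here is a small self-contained Java sketch (plain 
arrays, not Mahout classes) of the identity ||u - v||^2 = 2 * (1 - cos(u, v)) 
for unit-length vectors, which is why cosine distance is just a rescaling of 
squared Euclidean distance after normalization:

```java
public class CosineVsEuclidean {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] u = new double[v.length];
        for (int i = 0; i < v.length; i++) u[i] = v[i] / norm;
        return u;
    }

    public static void main(String[] args) {
        double[] u = normalize(new double[] {1, 2, 3});
        double[] v = normalize(new double[] {4, 0, 1});
        double cosineDistance = 1.0 - dot(u, v);   // the 0-2 transform
        double[] diff = new double[u.length];
        for (int i = 0; i < u.length; i++) diff[i] = u[i] - v[i];
        double euclideanSquared = dot(diff, diff);
        // For unit vectors: ||u - v||^2 == 2 * (1 - cos(u, v))
        System.out.println(euclideanSquared);
        System.out.println(2.0 * cosineDistance);
    }
}
```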
 
{quote}
KL Distance is based on entropy.  I'll double check my references for the 
details.
{quote}

I am pretty sure that you are looking at Kullback-Leibler divergence.  I think 
you just need to put in a Wikipedia reference.  Your javadoc is not quite 
correct in any case.
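For reference, a minimal self-contained Java sketch of Kullback-Leibler 
divergence, D(P||Q) = sum_i p_i * ln(p_i / q_i).  Note that it is asymmetric, 
which is one reason that describing it as a distance in javadoc would not be 
quite correct:

```java
public class KullbackLeibler {
    // D_KL(P || Q) = sum_i p_i * ln(p_i / q_i), assuming q_i > 0 wherever p_i > 0.
    static double klDivergence(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.9, 0.1};
        // Asymmetric: D(P||Q) != D(Q||P), so it is a divergence, not a metric.
        System.out.println(klDivergence(p, q));
        System.out.println(klDivergence(q, p));
    }
}
```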
 
{quote}
5. standard classifier - Until today, I thought this was specific to the Bayes 
algorithm.  I'll add it to the next patch.
{quote}

Look at org.apache.mahout.classifier.AbstractVectorClassifier
 
{quote}
6. usability.  Any user reading the javadoc on the entry classes ModelBuilder, 
Classifier, or TestClassifier has instructions on how to set up data for this 
patch.  All three should have their options explained.  
{quote}

That isn't what I meant.  Command-line documentation is all well and good, but 
there should be a usable API as well, especially for deployment in a working 
system.  Very few systems can afford to do an entire map-reduce when they just 
want to classify a few data points.
 
{quote}
I'll add it to the list of things to put in the next patch.  My understanding 
was that there is no standard for input formats, at least, in Mahout.  This 
patch describes my proposal for what input formats each Mahout component ought 
to be able to process.
{quote}

If you are pushing for a standard, then that should be independent of your 
classifier and you should explain how that interacts with, say, the hashed 
vector encoding framework.  See 
org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
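As a rough illustration of what the hashed encoding framework buys you (a 
simplified sketch of the hashing trick, not Mahout's actual 
FeatureVectorEncoder implementation, which also uses multiple probes):

```java
import java.util.Arrays;

public class HashedEncoder {
    // Simplified hashing trick: each feature string is hashed into a
    // fixed-width vector, so no global dictionary has to be agreed on
    // ahead of time -- the point of a hashed input "standard".
    static double[] encode(String[] features, int numSlots) {
        double[] vector = new double[numSlots];
        for (String feature : features) {
            int slot = Math.floorMod(feature.hashCode(), numSlots);
            vector[slot] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode(new String[] {"word:mahout", "word:knn", "word:mahout"}, 8);
        System.out.println(Arrays.toString(v));  // fixed dimension regardless of vocabulary
    }
}
```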
 

> Adding knn support to Mahout classifiers
> ----------------------------------------
>
>                 Key: MAHOUT-668
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-668
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: Daniel McEnnis
>              Labels: classification, knn
>         Attachments: MAHOUT-668.pat, Mahout-668-2.patch, Mahout-668-3.patch, 
> Mahout-668.pat
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Initial implementation of knn.  This is a minimum base set with many more 
> possible add-ons including support for text and weka input as well as a 
> classify only (no confusion matrix) back end.  The system was tested on the 
> 20 newsgroup data set.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
