[
https://issues.apache.org/jira/browse/MAHOUT-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037505#comment-13037505
]
Ted Dunning commented on MAHOUT-668:
------------------------------------
{quote}
On Sat, May 21, 2011 at 5:47 PM, Daniel McEnnis (JIRA) <[email protected]> wrote:
1. Use case: This is the algorithm for those learning problems that are simply
too massive even for Mahout's memory-streamlined algorithms. Particularly for
knn, it's the advertising company with 50,000 classes of people, tens to
hundreds of millions of examples and many terabytes of log data to classify
which type of person a log belongs to. Memory footprint becomes the biggest
issue as even the model takes more memory than what is available. For the
other Mahout classifiers, training data size is limited to available memory on
data nodes.
{quote}
Actually not. In fact, this is not true for any of the other model training
algorithms in Mahout, with the partial (but not really applicable) exception of
the random forest. For the Naive Bayes algorithms and the SGD algorithms it is
distinctly not true.
{quote}
3. These distance measures have very different assumptions from those in
recommendation. A missing vector entry (say in sparse vector format) means 0,
not missing. This requires a hack of all distance measures to accommodate it.
{quote}
I don't see why. Most of the other distance measures in Mahout use this same
convention. Certainly v1.getDistanceSquared(v2) and
v1.minus(v2).assign(Functions.ABS).zSum() would give you results that assume 0's
for missing elements.
I really think that the sub-classes of
org.apache.mahout.common.distance.DistanceMeasure do just what you are saying
that you want.
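As a minimal sketch of that convention (plain Java, not the Mahout API; the
class and method names here are hypothetical), a sparse vector can be held as
an index-to-value map, and any index absent from the map is read as 0.0 when
computing a distance:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Mahout: sparse vectors as index -> value maps. */
final class SparseDistanceSketch {

    /** Union of the indices stored in either vector. */
    private static Set<Integer> indices(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        return all;
    }

    /** Squared Euclidean distance; an index missing from a map counts as 0.0. */
    static double squaredEuclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : indices(a, b)) {
            double d = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
            sum += d * d;
        }
        return sum;
    }

    /** Manhattan (L1) distance under the same missing-means-zero convention. */
    static double manhattan(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : indices(a, b)) {
            sum += Math.abs(a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0));
        }
        return sum;
    }
}
```

For a = {0: 1.0, 3: 2.0} and b = {3: 4.0, 7: 1.0}, the per-index differences
are 1, -2 and -1, so the squared Euclidean distance is 6.0 and the Manhattan
distance is 4.0. No special handling of missing entries is needed.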
{quote}
The measures also range from 0 to Infinity, not -1 to 1, and the smaller the
better. Cosine distance doesn't fit this, so it's got a transform to map it to
0-2 where smaller is better.
{quote}
My point was that cosine distance is essentially the same as Euclidean
distance. Why not just use that?
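The relationship can be made concrete. For unit-length vectors u and v,
|u - v|^2 = 2 - 2(u . v) = 2(1 - cos), so the transformed cosine distance
(1 - cos, range 0-2) is exactly half the squared Euclidean distance between
the normalized vectors, and both rank neighbors identically. A small sketch in
plain Java (hypothetical class name, dense arrays, non-zero vectors assumed):

```java
/** Hypothetical illustration, not Mahout code. Assumes non-zero input vectors. */
final class CosineEuclideanSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns a copy of v scaled to unit Euclidean length. */
    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    /** Cosine distance mapped to the range 0..2, smaller is closer. */
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

With a = (3, 4) and b = (1, 0), the cosine distance is 0.4 and the squared
Euclidean distance between the normalized vectors is 0.8, up to floating-point
rounding.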
{quote}
KL Distance is based on entropy. I'll double check my references for the
details.
{quote}
I am pretty sure that you are looking at Kullback-Leibler divergence. I think
you just need to put in a Wikipedia reference. Your javadoc is not quite
correct in any case.
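For reference, the Kullback-Leibler divergence of discrete distributions P and
Q is D(P||Q) = sum_i p_i * ln(p_i / q_i). It is not symmetric, so it is a
divergence rather than a true distance. A minimal sketch in plain Java (the
class name is hypothetical and this is not the code in the patch; inputs are
assumed to be probability vectors of equal length):

```java
/** Hypothetical illustration of Kullback-Leibler divergence, not the patch's code. */
final class KlDivergenceSketch {

    /**
     * Computes D(P||Q) in nats. Terms with p[i] == 0 contribute nothing;
     * if q[i] == 0 while p[i] > 0 the divergence is infinite.
     */
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue;
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }
}
```

For P = (0.5, 0.5) and Q = (0.25, 0.75), D(P||Q) = 0.5 * ln(4/3), roughly
0.1438, while D(Q||P) is roughly 0.1308, which shows the asymmetry.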
{quote}
5. standard classifier - Until today, I thought this was specific to the Bayes
algorithm. I'll add it to the next patch.
{quote}
Look at org.apache.mahout.classifier.AbstractVectorClassifier
{quote}
6. usability. Any user reading the javadoc on the entry classes ModelBuilder,
Classifier, or TestClassifier has instructions on how to set up data for this
patch. All three should have their options explained.
{quote}
That isn't what I meant. Command line documentation is all well and good, but
there should be a usable API as well, especially for deployment in a working
system. Very few systems can afford to do an entire map-reduce when they just
want to classify a few data points.
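To illustrate the kind of in-process use meant here, the following is a rough
sketch of a knn classifier that is built once and then queried point by point
without launching a job. It is plain Java with hypothetical names; it is not
the patch's code and it does not extend AbstractVectorClassifier:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory knn classifier, for illustration only. */
final class InMemoryKnnSketch {
    private final List<double[]> points = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();
    private final int k;

    InMemoryKnnSketch(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        this.k = k;
    }

    /** Stores one labeled training example. */
    void add(double[] point, String label) {
        points.add(point.clone());
        labels.add(label);
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Majority vote among the k nearest stored examples (brute-force search). */
    String classify(double[] query) {
        if (points.isEmpty()) {
            throw new IllegalStateException("no training examples");
        }
        Integer[] order = new Integer[points.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> squaredDistance(points.get(i), query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            String label = labels.get(order[n]);
            int count = votes.merge(label, 1, Integer::sum);
            // Strictly greater: on a tie, the label that reached the count first wins.
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }
}
```

A caller would construct the object once, call add for each training example,
and then call classify for each incoming data point.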
{quote}
I'll add it to the list of things to put in the next patch. My understanding
was that there is no standard for at least input formats in Mahout. This patch
describes my proposal for what input formats each Mahout component ought to be
able to process.
{quote}
If you are pushing for a standard, then that should be independent of your
classifier and you should explain how that interacts with, say, the hashed
vector encoding framework. See
org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
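For readers unfamiliar with hashed encoding, the general idea is that each
feature is hashed to a slot in a fixed-size vector, so no dictionary is needed
and the vector size is fixed up front. The sketch below shows only that idea in
plain Java; the names are hypothetical and it does not reproduce how
FeatureVectorEncoder is actually implemented:

```java
/** Hypothetical illustration of hashed feature encoding, not Mahout's encoder. */
final class HashedEncodingSketch {

    /** Maps a feature string to a slot in a vector of the given size. */
    static int indexFor(String feature, int size) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(feature.hashCode(), size);
    }

    /** Adds the feature's weight to its hashed slot; colliding features share a slot. */
    static void addFeature(String feature, double weight, double[] vector) {
        vector[indexFor(feature, vector.length)] += weight;
    }
}
```

Any proposed input-format standard would need to say how raw fields reach an
encoder of this kind, since the encoder works from feature names and values
rather than from pre-assigned column indices.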
> Adding knn support to Mahout classifiers
> ----------------------------------------
>
> Key: MAHOUT-668
> URL: https://issues.apache.org/jira/browse/MAHOUT-668
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.6
> Reporter: Daniel McEnnis
> Labels: classification, knn
> Attachments: MAHOUT-668.pat, Mahout-668-2.patch, Mahout-668-3.patch,
> Mahout-668.pat
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Initial implementation of the knn. This is a minimum base set with many more
> possible add-ons including support for text and weka input as well as a
> classify only (no confusion matrix) back end. The system was tested on the
> 20 newsgroup data set.