[
https://issues.apache.org/jira/browse/MAHOUT-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037486#comment-13037486
]
Daniel McEnnis commented on MAHOUT-668:
---------------------------------------
Thank you, Ted, for putting so much time into this. I'll do my best to answer
as concisely and completely as possible.
1. Use case: This is the algorithm for learning problems that are simply too
massive even for Mahout's memory-streamlined algorithms. For knn in
particular, think of an advertising company with 50,000 classes of people,
tens to hundreds of millions of examples, and many terabytes of log data,
where each log must be classified by the type of person it belongs to. Memory
footprint becomes the biggest issue, as even the model takes more memory than
is available. For the other Mahout classifiers, training data size is limited
to the memory available on the data nodes.
2. I forgot to add javadoc to the test classes. I'll fix that for the next
patch.
3. These distance measures have very different assumptions from those in
recommendation. A missing vector entry (say, in sparse vector format) means 0,
not missing. This requires a hack of all distance measures to accommodate it.
The measures also range from 0 to infinity, not from -1 to 1, and smaller is
better. Cosine distance doesn't fit this, so it has a transform that maps it
to 0-2, where smaller is better. KL distance is based on entropy. I'll double
check my references for the details.
4. MasterVector and ClassLabelVector - I created my own Dictionary class
because I had difficulty understanding the existing one. I'm willing to
switch; it just means taking more time to understand the code. The name is
arbitrary, and I can change it as needed. DfCountDictionary works better for
me as it's not an inverted reference.
5. standard classifier - Until today, I thought this was specific to the Bayes
algorithm. I'll add it to the next patch.
6. usability. Any user reading the javadoc on the entry classes ModelBuilder,
Classifier, or TestClassifier has instructions on how to set up data for this
patch. All three should have their options explained. I'll add it to the list
of things to put in the next patch. My understanding was that there is no
standard in Mahout, at least for input formats. This patch describes my
proposal for what input formats each Mahout component ought to be able to
process.
7. still working on model suggestions....
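To make point 3 concrete, here is a minimal sketch of a cosine distance that
treats absent sparse entries as 0 and maps the result onto 0-2, with smaller
meaning closer. It assumes the mapping is 1 - cosine similarity; the comment
above does not say which transform the patch actually uses. The class name
SparseCosineSketch, the Map-based vector representation, and the handling of
all-zero vectors are hypothetical choices for illustration, not the patch's
code or Mahout's DistanceMeasure API.

```java
import java.util.Map;

// Hypothetical illustration only: not taken from the MAHOUT-668 patch.
final class SparseCosineSketch {
    private SparseCosineSketch() {}

    // Sparse vectors are modeled as index -> value maps; an absent index means 0.
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                // Only indices present in both vectors contribute,
                // because a missing entry is treated as 0.
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            // Assumption: an all-zero vector is treated as orthogonal to everything.
            return 1.0;
        }
        double cos = dot / (normA * normB);
        // Clamp to [-1, 1] to guard against floating-point drift.
        cos = Math.max(-1.0, Math.min(1.0, cos));
        // Assumed transform: similarity in [-1, 1] becomes distance in [0, 2].
        return 1.0 - cos;
    }

    // Euclidean norm over the stored (non-zero) entries.
    static double norm(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Under this assumed transform, identical directions give a distance near 0,
vectors with no shared non-zero indices give 1, and opposite directions give
2, which matches the "0-2, smaller is better" convention described in point 3.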
> Adding knn support to Mahout classifiers
> ----------------------------------------
>
> Key: MAHOUT-668
> URL: https://issues.apache.org/jira/browse/MAHOUT-668
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.6
> Reporter: Daniel McEnnis
> Labels: classification, knn
> Attachments: MAHOUT-668.pat, Mahout-668-2.patch, Mahout-668-3.patch,
> Mahout-668.pat
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Initial implementation of the knn. This is a minimum base set with many more
> possible add-ons including support for text and weka input as well as a
> classify only (no confusion matrix) back end. The system was tested on the
> 20 newsgroup data set.