[
https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859693#action_12859693
]
Robin Anil commented on MAHOUT-384:
-----------------------------------
Hi Tony. Nice work on the patch. But before we commit this, there are a couple
of things you need to cover. I still have to read the algorithm in detail to
know what's happening, but I have some queries and suggestions below which are
a kind of checklist for making this a committable patch.
1) I am not a fan of Text-based input, though it is what most of the algorithms
in Mahout were first implemented with. The idea of splitting and joining text
files on commas is not very clean. Can you convert this to deal with
SequenceFile of VectorWritable, or some other Writable format? What's your
input schema?
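For illustration only (this is an assumption about a possible schema, not what the patch does): a structured format could carry each record as a fixed-width vector of integer attribute codes rather than a raw comma-joined line. A plain-Java sketch of that encoding step, with hypothetical names:

```java
import java.util.*;

// Hypothetical encoding step: turn a categorical CSV record into integer
// attribute codes, the kind of fixed schema a Writable could carry instead
// of raw comma-joined text. Class and method names are illustrative.
public class RecordEncoder {
    // one dictionary per attribute column: value string -> integer code
    private final List<Map<String, Integer>> dictionaries = new ArrayList<>();

    public int[] encode(String csvLine) {
        String[] values = csvLine.split(",");
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            while (dictionaries.size() <= i) {
                dictionaries.add(new HashMap<String, Integer>());
            }
            Map<String, Integer> dict = dictionaries.get(i);
            Integer code = dict.get(values[i]);
            if (code == null) {
                code = dict.size();  // assign the next unused code
                dict.put(values[i], code);
            }
            codes[i] = code;
        }
        return codes;
    }
}
```

The resulting int[] (or a VectorWritable built from it) survives values that contain commas and avoids re-parsing text in every job.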
2) There is a code style we enforce in Mahout. You can run mvn
checkstyle:checkstyle to see the violations. We also have an Eclipse formatter
which formats code to almost match the checkstyle (rare manual interventions
are required). Take a look at
https://cwiki.apache.org/MAHOUT/howtocontribute.html; you will find the Eclipse
formatter file at the bottom.
3) For parsing args, use the Apache Commons CLI2 library. Take a look at
o/a/m/clustering/kmeans/KMeansDriver to see usage.
4) What is Utils being used for?
5) @Override
   public void setup(Context context) throws IOException, InterruptedException {
     String filePath = context.getConfiguration().get("a");
     sumAttribute = Utils.readFile(filePath + "/part-r-00000");
   }
Please use distributed cache to read the file in a map/reduce context. See the
DictionaryVectorizer Map/Reduce classes for usage
6) job.setNumReduceTasks(1); ? Is this necessary? Doesn't it hurt the
scalability of this algorithm? Is the single reducer going to get a lot of data
from the mappers? If yes, then you should think about removing this constraint
and letting it use the Hadoop parameters, or parameterize it.
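For context on the reducer question: per the cited paper, a record's AVF score is just the mean, over its attributes, of the dataset-wide frequency of each attribute value, so scoring is embarrassingly parallel once frequencies are known; only the final sort by score needs coordination. A plain-Java sketch of the scoring step (names and data layout are illustrative assumptions, not taken from the patch):

```java
import java.util.*;

// Sketch of AVF scoring: a record's score is the average, over its m
// attributes, of how often each of its attribute values occurs in the whole
// dataset. Lower scores mean rarer values, i.e. likelier outliers.
public class AvfScore {
    // freq.get(i) maps value codes of attribute i to their dataset counts
    public static double score(int[] record, List<Map<Integer, Long>> freq) {
        double sum = 0;
        for (int i = 0; i < record.length; i++) {
            Long count = freq.get(i).get(record[i]);
            sum += (count == null) ? 0 : count;
        }
        return sum / record.length;
    }
}
```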
7) Can this job be optimized using a Combiner? If yes, it's really worth
spending time to make one.
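A combiner looks plausible here (an assumption about the job's structure, to be confirmed against the patch): the frequency-counting phase presumably emits ((attributeIndex, value), 1) pairs, and since addition is associative and commutative, partial counts can be pre-summed on each mapper before the shuffle, exactly as in word count. A minimal local sketch of that combine step:

```java
import java.util.*;

// Sketch of the combine step: collapse each key's partial counts into one
// sum on the map side, shrinking the data the reducer(s) receive. Keys here
// are plain strings like "0:red" purely for illustration.
public class CombinerSketch {
    public static long combine(List<Long> partialCounts) {
        long sum = 0;
        for (long c : partialCounts) {
            sum += c;
        }
        return sum;
    }

    public static Map<String, Long> combineAll(Map<String, List<Long>> mapOutput) {
        Map<String, Long> combined = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : mapOutput.entrySet()) {
            combined.put(e.getKey(), combine(e.getValue()));
        }
        return combined;
    }
}
```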
8) Tests! :)
> Implement of AVF algorithm
> --------------------------
>
> Key: MAHOUT-384
> URL: https://issues.apache.org/jira/browse/MAHOUT-384
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: tony cui
> Attachments: mahout-384.patch
>
>
> This program implements an outlier detection algorithm called AVF, a kind of
> fast parallel outlier detection for categorical datasets using MapReduce,
> introduced by this paper:
> http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
> The following is an example of how to run this program under Hadoop:
> $hadoop jar programName.jar avfDriver inputData interTempData outputData
> The output data contains the ordered avfValue in the first column, followed
> by the original input data.