[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-07 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763378#action_12763378
 ] 

Isabel Drost commented on MAHOUT-138:
-


From the classes above, I worked through up to the classification stuff. 
Documentation is in the wiki at: 
http://cwiki.apache.org/confluence/display/MAHOUT/ClassifyingYourData (the 
links with commandline in their name) and 
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData (again 
the links with commandline in their name).

Currently there are only examples left to convert, as well as three classes 
containing main methods from the Taste code:

./core/src/main/java/org/apache/mahout/cf/taste/hadoop/RecommenderJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOneDiffsToAveragesJob.java
./core/src/main/java/org/apache/mahout/cf/taste/hadoop/SlopeOnePrefsToDiffsJob.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/bookcrossing/BookCrossingRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/NetflixRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/netflix/TransposeToByUser.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens/GroupLensRecommenderEvaluatorRunner.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/InputDriver.java
./examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/OutputDriver.java
./examples/src/main/java/org/apache/mahout/ga/watchmaker/cd/tool/CDInfosTool.java



> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command 
> line args and they are more self-documenting when done right.  We should 
> convert our main methods to use CLI

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-157:
--

Attachment: MAHOUT-157-Oct-8.pfpgrowth.patch

Implementation of Top K Parallel FPGrowth using the optimised algorithm 
detailed above. 
This implementation uses custom Writable classes instead of Text. 

Testing and verification of the results still need to be done, but code-wise 
the implementation is done.

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
>  Issue Type: New Feature
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.2
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, 
> MAHOUT-157-Oct-8.pfpgrowth.patch, MAHOUT-157-September-10.patch, 
> MAHOUT-157-September-18.patch, MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763265#action_12763265
 ] 

Grant Ingersoll commented on MAHOUT-138:


I think we just need to go through the various main() methods and see what is 
left.




[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763172#action_12763172
 ] 

Ted Dunning commented on MAHOUT-186:


You are right that I should code up an example before speaking.  But it does 
seem, against all odds, that what I was suggesting works.

Here is a test case that illustrates what I meant.  I am still not sure what 
everybody is saying:

{noformat}
package com.infovell.logging.test;

import junit.framework.TestCase;

import java.util.PriorityQueue;
import java.util.Random;
import java.util.List;
import java.util.ArrayList;
import java.util.Collections;

public class FooTest extends TestCase {
    public void testQueue() {
        // Min-heap: the head is always the smallest value retained so far.
        PriorityQueue<Double> pq = new PriorityQueue<Double>(10);
        Random gen = new Random(123L);
        for (int i = 0; i < 1000; i++) {
            double x = gen.nextDouble();
            // Only insert while filling up, or when x beats the current smallest.
            if (pq.size() < 10 || x > pq.peek()) {
                pq.add(x);
                // Trim back to the 10 largest by removing the head.
                while (pq.size() > 10) {
                    pq.remove();
                }
            }
        }

        // Copy the queue into a list and reverse it.
        List<Double> r = new ArrayList<Double>(pq);
        Collections.reverse(r);
        System.out.printf("%s\n", r);
        assertEquals(0.994991252160446, r.get(0), 1e-7);
        assertEquals(0.9881699208527764, r.get(9), 1e-7);
    }
}
{noformat}

> Classifier PriorityQueue returns erroneous results
> --
>
> Key: MAHOUT-186
> URL: https://issues.apache.org/jira/browse/MAHOUT-186
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.1, 0.2
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-186.patch
>
>
> A simple test fails 
> import org.apache.hadoop.util.PriorityQueue;
> PriorityQueue queue = new ClassifierResultPriorityQueue(3);
> queue.insert(new ClassifierResult("label1", 5));
> queue.insert(new ClassifierResult("label2", 4));
> queue.insert(new ClassifierResult("label3", 3));
> queue.insert(new ClassifierResult("label4", 2));
> queue.insert(new ClassifierResult("label5", 1));
> 
> assertEquals("Incorrect Size", 3, queue.size());
> log.info(queue.pop().toString());
> log.info(queue.pop().toString());
> log.info(queue.pop().toString());
> 09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
> ClassifierResult{category='label3', score=3.0}
> 09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
> ClassifierResult{category='label4', score=2.0}
> 09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
> ClassifierResult{category='label5', score=1.0}
> Expected label1 and label2 at the top




[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763162#action_12763162
 ] 

Sean Owen commented on MAHOUT-186:
--

I will make up an alternate patch that either shows what I mean or shows me I'm 
wrong. My central question is, what requires a custom subclass of 
PriorityQueue? I understand that the "new List()" thing doesn't give the items 
in order but that doesn't imply a subclass is needed.




[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763145#action_12763145
 ] 

Robin Anil commented on MAHOUT-186:
---

I believe new List(priorityQueue) doesn't keep the order of the PriorityQueue, 
since toArray and the Iterator both return data in no particular order. So you 
need to keep polling the top of the heap, don't you?
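
A minimal sketch of the ordering point, using only java.util. The class name 
HeapOrderSketch and the sample values are illustrative assumptions, not code 
from the patch. The java.util.PriorityQueue javadoc gives no ordering 
guarantee for iterator() or toArray(); only repeated poll() calls hand back 
elements in priority order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class HeapOrderSketch {
    /** Drains a copy of the queue with poll(), which always removes the current head. */
    static List<Integer> drainInOrder(PriorityQueue<Integer> queue) {
        // Work on a copy so the caller's queue is not emptied.
        PriorityQueue<Integer> copy = new PriorityQueue<Integer>(queue);
        List<Integer> sorted = new ArrayList<Integer>();
        while (!copy.isEmpty()) {
            sorted.add(copy.poll());
        }
        return sorted;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> queue = new PriorityQueue<Integer>();
        for (int value : new int[] {5, 1, 4, 2, 3}) {
            queue.add(value);
        }
        // Copying via the collection constructor follows the iterator, so this
        // list reflects the internal heap layout rather than sorted order.
        System.out.println("iteration order: " + new ArrayList<Integer>(queue));
        System.out.println("poll order:      " + drainInOrder(queue));
    }
}
```

drainInOrder works on a copy, so the original queue still holds all of its 
elements afterwards.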





[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763124#action_12763124
 ] 

Ted Dunning commented on MAHOUT-186:



I don't quite understand the last comment, but generally if you want the top n 
items in descending order, you keep a descending queue as you say in order to 
make insertion efficient.  It is generally good to cache the score of the least 
element to speed comparisons even a little bit more.

Then when you want the results, you can just fill a list in reverse order 

or just do this:

List r = new ArrayList(priorityQueue);
Collections.reverse(r);

Since this is pretty simple, I think I misunderstood the question.
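
As a rough sketch of the idea above, assuming plain double scores: a bounded 
min-heap keeps the top n values, and the score of the least retained element 
is cached so most candidates are rejected with a single comparison. The class 
and method names (CachedThresholdTopN, offer, descending) are made up for 
illustration. Unlike the two-line snippet, this variant drains with poll() 
before reversing, so it does not depend on the order in which the queue's 
iterator hands back elements:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class CachedThresholdTopN {
    private final int capacity;
    private final PriorityQueue<Double> heap;
    // Cached smallest retained score; only consulted once the heap is full.
    private double least = Double.NEGATIVE_INFINITY;

    CachedThresholdTopN(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
        this.heap = new PriorityQueue<Double>(capacity);
    }

    void offer(double score) {
        if (heap.size() < capacity) {
            heap.add(score);
        } else if (score > least) {
            heap.poll();      // evict the current smallest
            heap.add(score);
        } else {
            return;           // cheap rejection, heap untouched
        }
        if (heap.size() == capacity) {
            least = heap.peek();
        }
    }

    /** Returns the retained scores, largest first, without emptying the heap. */
    List<Double> descending() {
        PriorityQueue<Double> copy = new PriorityQueue<Double>(heap);
        List<Double> result = new ArrayList<Double>(copy.size());
        while (!copy.isEmpty()) {
            result.add(copy.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Offering 0.3, 0.9, 0.1, 0.7 and 0.5 to an instance with capacity 3 leaves 
0.9, 0.7 and 0.5, returned in that order by descending().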




[jira] Commented: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763109#action_12763109
 ] 

Sean Owen commented on MAHOUT-138:
--

I see, there was a commit from Isabel. Is it done then? Isabel, you had 
suggested moving this to 0.3, so I suppose you're saying it's not done, but I 
wonder what the delta is then.

Grant, I tend to agree with quick review and commits, since patches very quickly 
go stale. But my question, I suppose, was: if you don't want to mark this for 
0.3, who is waiting to do what, and for how long, if it is to block 0.2?

This isn't my patch at all; I'm not involved.




[jira] Commented: (MAHOUT-170) Enable Java compile optimize flag during build

2009-10-07 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763085#action_12763085
 ] 

Robin Anil commented on MAHOUT-170:
---

HBase does JVM tuning out of the box by enabling the concurrent mark-sweep GC 
in hbase-env.sh.

For the sequential versions, we can enable it from the shell script.
For Hadoop jobs to get the benefit, it has to be put in hadoop-env.sh or in 
the mapred.child.java.opts conf parameter.

> Enable Java compile optimize flag during build
> --
>
> Key: MAHOUT-170
> URL: https://issues.apache.org/jira/browse/MAHOUT-170
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Robin Anil
> Fix For: 0.2
>
> Attachments: optimize.patch
>
>
> in maven compile plugin enable optimize=true flag




[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763036#action_12763036
 ] 

Robin Anil commented on MAHOUT-186:
---

Well, I want to get the data in descending order.

If I keep a descending PriorityQueue, I can't get the least element without 
polling the entire queue.
If I keep an ascending PriorityQueue, I won't be able to get the reverse 
iterator without doing the getTopResults().







[jira] Updated: (MAHOUT-148) Convert Classification Algs to use richer Writable syntax

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-148:
--

Status: Patch Available  (was: In Progress)

> Convert Classification Algs to use richer Writable syntax
> -
>
> Key: MAHOUT-148
> URL: https://issues.apache.org/jira/browse/MAHOUT-148
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.1, 0.2
>Reporter: Grant Ingersoll
>Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-148-Work-In-Progress.patch, MAHOUT-148.patch
>
>
> Much of the classification capabilities relies on parsing values out from the 
> Text object just to determine what type of "thing" is being used.  We should 
> try to avoid having to do string manipulation for this kind of thing and 
> instead encapsulate it in Writable instances.  This should make things 
> perform faster and bring stronger typing to the problem, which should make it 
> easier to understand and debug the code.
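
To illustrate the description above, here is a hedged sketch of a typed 
record that serializes its fields directly rather than packing them into a 
string to be parsed later. The class LabelWeightRecord and its fields are 
hypothetical and are not taken from the patch. To stay free of any Hadoop 
dependency it does not implement org.apache.hadoop.io.Writable; it only 
mirrors that interface's two methods, write(DataOutput) and 
readFields(DataInput):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical typed record; a real version would declare "implements Writable". */
class LabelWeightRecord {
    String label = "";
    double weight;

    /** Writes each field in its own binary form; no string concatenation. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(label);
        out.writeDouble(weight);
    }

    /** Reads the fields back in the same order they were written. */
    void readFields(DataInput in) throws IOException {
        label = in.readUTF();
        weight = in.readDouble();
    }

    /** Serializes a record to bytes and reads it into a fresh instance. */
    static LabelWeightRecord roundTrip(LabelWeightRecord record) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        record.write(new DataOutputStream(bytes));
        LabelWeightRecord copy = new LabelWeightRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

The consumer reads a String and a double straight from the stream, so there 
is no delimiter to split on and no number to re-parse from text.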




[jira] Commented: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763035#action_12763035
 ] 

Sean Owen commented on MAHOUT-186:
--

Not sure what's up with the Hadoop class, but it sure makes sense to use the 
standard PriorityQueue class. Why do we need a custom subclass at all? It seems 
like this can be done with a regular PriorityQueue, a Comparator, and use of 
the standard PriorityQueue methods. That is, do we need getTopResults(), for 
example?
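
A sketch of what that could look like, under stated assumptions: ScoredLabel 
is a hypothetical stand-in for ClassifierResult (its real fields and 
accessors are not shown in this thread), and TopLabelsSketch is an 
illustrative name. It uses only java.util.PriorityQueue, a Comparator, and 
the standard add/poll methods, with the same five inputs as the failing test 
in the issue:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for ClassifierResult: a label plus a score. */
class ScoredLabel {
    final String label;
    final double score;

    ScoredLabel(String label, double score) {
        this.label = label;
        this.score = score;
    }
}

class TopLabelsSketch {
    /** Ascending by score, so the queue head is the weakest retained result. */
    static final Comparator<ScoredLabel> BY_SCORE = new Comparator<ScoredLabel>() {
        public int compare(ScoredLabel a, ScoredLabel b) {
            return Double.compare(a.score, b.score);
        }
    };

    /** Returns the labels of the n highest-scoring results, best first. */
    static List<String> topLabels(List<ScoredLabel> results, int n) {
        PriorityQueue<ScoredLabel> queue =
            new PriorityQueue<ScoredLabel>(Math.max(1, n), BY_SCORE);
        for (ScoredLabel result : results) {
            queue.add(result);
            if (queue.size() > n) {
                queue.poll(); // evict the current lowest score
            }
        }
        List<String> labels = new ArrayList<String>();
        while (!queue.isEmpty()) {
            labels.add(queue.poll().label); // ascending by score
        }
        Collections.reverse(labels);        // flip to descending
        return labels;
    }

    public static void main(String[] args) {
        List<ScoredLabel> results = new ArrayList<ScoredLabel>();
        for (int i = 1; i <= 5; i++) {
            results.add(new ScoredLabel("label" + i, 6 - i));
        }
        System.out.println(topLabels(results, 3));
    }
}
```

With label1 through label5 scored 5 down to 1 and n = 3, topLabels returns 
label1, label2, label3, which is the result the issue expected.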




[jira] Updated: (MAHOUT-148) Convert Classification Algs to use richer Writable syntax

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-148:
--

Attachment: MAHOUT-148.patch

Verified by running all combinations of

Bayes|CBayes
hdfs|hbase 
sequential|mapreduce
both Training and Testing.

Noticed a slight improvement in running time of various map/reduce jobs (20% 
decrease for 20newsgroups dataset)






[jira] Updated: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-186:
--

Status: Patch Available  (was: In Progress)




[jira] Updated: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-186:
--

Attachment: MAHOUT-186.patch

Fix:
Added PriorityQueue Test. 

Used java.util.PriorityQueue instead of the org.apache.hadoop.util.PriorityQueue






[jira] Created: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)
Classifier PriorityQueue returns erroneous results
--

 Key: MAHOUT-186
 URL: https://issues.apache.org/jira/browse/MAHOUT-186
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.1, 0.2
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.2


A simple test fails 

import org.apache.hadoop.util.PriorityQueue;
PriorityQueue queue = new ClassifierResultPriorityQueue(3);
queue.insert(new ClassifierResult("label1", 5));
queue.insert(new ClassifierResult("label2", 4));
queue.insert(new ClassifierResult("label3", 3));
queue.insert(new ClassifierResult("label4", 2));
queue.insert(new ClassifierResult("label5", 1));

assertEquals("Incorrect Size", 3, queue.size());
log.info(queue.pop().toString());
log.info(queue.pop().toString());
log.info(queue.pop().toString());

09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
ClassifierResult{category='label3', score=3.0}
09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
ClassifierResult{category='label4', score=2.0}
09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest: 
ClassifierResult{category='label5', score=1.0}

Expected label1 and label2 at the top




[jira] Work started: (MAHOUT-186) Classifier PriorityQueue returns erroneous results

2009-10-07 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-186 started by Robin Anil.




Re: Classify() method results anomoly - help!

2009-10-07 Thread Robin Anil
Hi Sandra,
I tested the priority queue implementation, and it does seem that there is some
problem with Hadoop's priority queue implementation:
import org.apache.hadoop.util.PriorityQueue;
PriorityQueue queue = new ClassifierResultPriorityQueue(3);
queue.insert(new ClassifierResult("label1", 5));
queue.insert(new ClassifierResult("label2", 4));
queue.insert(new ClassifierResult("label3", 3));
queue.insert(new ClassifierResult("label4", 2));
queue.insert(new ClassifierResult("label5", 1));

assertEquals("Incorrect Size", 3, queue.size());
log.info(queue.pop().toString());
log.info(queue.pop().toString());
log.info(queue.pop().toString());

09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest:
ClassifierResult{category='label3', score=3.0}
09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest:
ClassifierResult{category='label4', score=2.0}
09/10/07 16:58:39 INFO common.ClassifierResultPriorityQueueTest:
ClassifierResult{category='label5', score=1.0}
label1 and label2 were missing. I couldn't explain this behaviour.

I changed it to java.util.PriorityQueue, so it's working now.


On Wed, Sep 30, 2009 at 6:43 PM, Sandra Clover wrote:

> Hi Robin, Thanks for the reply, for updating the documentation, and for
> your advice. I'll try the trunk version. To answer your question, I am
> using Mahout version 0.1 and Hadoop 0.19.2. Hope this helps... Thanks
> again, Robin. Sandra.
>
>  - Original Message -
>  From: "Robin Anil"
>  To: mahout-u...@lucene.apache.org
>  Subject: Re: Classify() method results anomoly - help!
>   Date: Wed, 30 Sep 2009 18:08:05 +0530
>
>
>  Hi Sandra, those scores are indicative of the relative score, not the
>  probability. Thanks for bringing this to our notice; I will fix the
>  documentation. You may try the trunk and see if the former error still
>  occurs. Also, could you tell me the version of Hadoop you are using?
>
>
>
>   On Wed, Sep 30, 2009 at 5:30 PM, Sandra Clover wrote:
>
>  > Thanks Grant, I'll look into that. I've been having a look at the
>  > numbers returned from the getScore() method also. I have noticed a
>  > range from 0 to around 2.243434+ with numbers in between like:
>  > 1659.930763537123. According to the API documentation for this
>  > method: "The label and the associated score(Usually probabilty)".
>  > This does not look like probability to me. I was kind of expecting an
>  > answer between 0 and 1 or 0 and 100 or something like that. Are these
>  > results typical or indicative of some sort of bug? Once again,
>  > comments/suggestions appreciated. Sandra.
>  >
>  >
>  >
>  > - Original Message -
>  > From: "Grant Ingersoll"
>  > To: mahout-u...@lucene.apache.org
>  > Subject: Re: Classify() method results anomoly - help!
>  > Date: Tue, 29 Sep 2009 16:02:46 -0400
>  >
>  >
>  >
>  > On Sep 29, 2009, at 8:47 AM, Sandra Clover wrote:
>  >
>  > > Hi, I'm using Mahout 0.1 for document classification (using the
>  > > distributed Bayesian Network) and I'm getting some answers back. I
>  > > have noticed one thing that is really bugging me. I'm wondering, can
>  > > you help please?
>  > > Problem: Concerning the Classify() method, there are 2 constructors
>  > > in the API. The first one returns just one answer (according to the
>  > > API it returns: "the single best category"). The second constructor
>  > > says that it: "return the top numResults, ranked by score". My
>  > > problem is that I have compared and contrasted the results of both
>  > > techniques. I have noticed that the single best category does not
>  > > appear at *all* in the range of categories given by the second
>  > > constructor! Strange, no? I would have expected that it should come
>  > > top of the list. I have gone to a value of 20 deep in the numResults
>  > > level and have not even seen the best category. Has anyone
>  > > encountered this before? I would appreciate any
>  > > comments/suggestions/user-experience that you may like to share.
>  > > Thanks, Sandra.
>  > >
>  >
>  > That sounds like a bug. Can you try out the trunk version of
>  > Mahout and see if it is still there? A lot of the classification
>  > stuff has been reworked recently (I'm not even sure at the moment
>  > that those two classify methods are even still in the code!)