[jira] Issue Comment Edited: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

2009-12-29 Thread zhao zhendong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794809#action_12794809
 ] 

zhao zhendong edited comment on MAHOUT-232 at 12/30/09 7:07 AM:


I am still working on it :(. I can probably attach it as a patch tomorrow or the day
after tomorrow.

I will check the code of MAHOUT-228.





-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>

Department of Computer Science
School of Computing
National University of Singapore

Mail: zhaozhend...@gmail.com


  was (Author: maximzhao):
I am still working on it :(. I can probably attach it as a patch tomorrow or the day
after tomorrow.

I will check the code of MAHOUT-228.





-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>

Department of Computer Science
School of Computing
National University of Singapore

Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhend...@gmail.com

  
> Implementation of sequential SVM solver based on Pegasos
> 
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.1
>Reporter: zhao zhendong
> Attachments: SequentialSVM_0.1.patch
>
>
> After discussing with people in this community, I decided to re-implement a 
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout command 
> line style, SparseMatrix, SparseVector, etc.). Eventually, it will 
> support HDFS. 
> The plan for sequential Pegasos:
> 1. Support the general file system (almost finished);
> 2. Support HDFS;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

2009-12-29 Thread zhao zhendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
-

Attachment: SequentialSVM_0.1.patch

> Implementation of sequential SVM solver based on Pegasos
> 
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.1
>Reporter: zhao zhendong
> Attachments: SequentialSVM_0.1.patch
>
>
> After discussing with people in this community, I decided to re-implement a 
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout command 
> line style, SparseMatrix, SparseVector, etc.). Eventually, it will 
> support HDFS. 
> The plan for sequential Pegasos:
> 1. Support the general file system (almost finished);
> 2. Support HDFS;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

2009-12-29 Thread zhao zhendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
-

Affects Version/s: 0.1
   Status: Patch Available  (was: Open)

Sequential SVM based on Pegasos.
---
Currently, this package provides (features):
---

1. A sequential linear SVM solver, including both training and testing.

2. It supports the general file system right now; HDFS support is near-future work.

3. Support for large-scale data sets (you need to assign the argument "trainSampleNum").
   Because Pegasos only needs to sample a certain number of examples, this package
   can pre-fetch a bounded number of samples (e.g. the maximum iteration count) into memory.
   For example, if the data set has 100,000,000 samples and the default maximum number
   of iterations is 10,000, the package only loads 10,000 randomly chosen samples into memory.
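
A minimal sketch of this pre-fetch idea, assuming plain Java and reservoir sampling
over a text file of training examples (the class and method names here are
illustrative, not the actual patch code):

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class SamplePrefetch {

  /**
   * Reservoir-sample at most maxSamples lines from a (possibly huge) training
   * file, so only O(maxSamples) examples ever sit in memory.
   */
  public static List<String> prefetch(String path, int maxSamples, Random rnd)
      throws IOException {
    List<String> reservoir = new ArrayList<String>(maxSamples);
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      String line;
      long seen = 0;
      while ((line = in.readLine()) != null) {
        seen++;
        if (reservoir.size() < maxSamples) {
          reservoir.add(line);
        } else {
          // Keep each new line with probability maxSamples / seen.
          long j = (long) (rnd.nextDouble() * seen);
          if (j < maxSamples) {
            reservoir.set((int) j, line);
          }
        }
      }
    } finally {
      in.close();
    }
    return reservoir;
  }
}
{code}

With the defaults above, prefetch("train.dat", 10000, new Random()) keeps only
10,000 of the 100,000,000 lines in memory.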

---
TODO:
---
1. Supporting HDFS;

2. Because it adopts mahout.math.SparseMatrix and mahout.math.SparseVectorUnsafe,
   I must assign the cardinality of the matrix when creating it. That makes it awkward
   to read data sets in the SVM-light or libsvm format, which are very popular in the
   machine-learning community, because such data sets do not store the number of
   samples or the dimensionality.
   Currently I still use a naive method: read the data into a map<> first, then dump
   it into the SparseMatrix.
   Does anyone know a smarter method, or another matrix type that supports this?
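
A rough sketch of the "read into a map first, then dump" approach for SVM-light/libsvm
files, whose sample count and dimensionality are unknown up front (plain Java; the
SparseMatrix construction itself is left out, since only its cardinality requirement
is described in this comment):

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class LibSvmBuffer {

  public final List<Double> labels = new ArrayList<Double>();
  public final List<Map<Integer, Double>> rows = new ArrayList<Map<Integer, Double>>();
  public int maxFeatureIndex = 0;   // discovered while reading

  /** Parse lines of the form "label index:value index:value ...". */
  public void read(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0 || line.startsWith("#")) {
          continue;
        }
        String[] tokens = line.split("\\s+");
        labels.add(Double.valueOf(tokens[0]));
        Map<Integer, Double> row = new HashMap<Integer, Double>();
        for (int i = 1; i < tokens.length; i++) {
          int colon = tokens[i].indexOf(':');
          int index = Integer.parseInt(tokens[i].substring(0, colon));
          double value = Double.parseDouble(tokens[i].substring(colon + 1));
          row.put(index, value);
          maxFeatureIndex = Math.max(maxFeatureIndex, index);
        }
        rows.add(row);
      }
    } finally {
      in.close();
    }
    // rows.size() and maxFeatureIndex + 1 now give the cardinality needed to
    // allocate the SparseMatrix before copying the buffered values into it.
  }
}
{code}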

---
Usage:
---
Training:
SVMPegasosTraining.java
I have hard-coded the arguments in this file; if you want to customize the 
arguments yourself, please uncomment the first line of the main function. 
The default arguments are:
-tr ../examples/src/test/resources/svmdataset/train.dat -m 
../examples/src/test/resources/svmdataset/SVM.model


Testing:
SVMPegasosTesting.java
I have hard-coded the arguments in this file; if you want to customize the 
arguments yourself, please uncomment the first line of the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m 
../examples/src/test/resources/svmdataset/SVM.model
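
Purely as an illustration, the kind of main() described above might look roughly like
this (hypothetical; the real SVMPegasosTraining/SVMPegasosTesting classes are in the
attached patch and may differ):

{code}
public final class TrainingUsageSketch {

  // Hard-coded defaults matching the paths given above.
  private static final String[] DEFAULT_ARGS = {
      "-tr", "../examples/src/test/resources/svmdataset/train.dat",
      "-m", "../examples/src/test/resources/svmdataset/SVM.model"
  };

  public static void main(String[] args) {
    String[] effective = DEFAULT_ARGS;
    // Uncomment the next line to let command-line arguments override the defaults:
    // effective = args;
    for (int i = 0; i + 1 < effective.length; i += 2) {
      System.out.println(effective[i] + " => " + effective[i + 1]);
    }
    // ... hand the parsed -tr/-te and -m options to the actual trainer or tester here ...
  }
}
{code}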

> Implementation of sequential SVM solver based on Pegasos
> 
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.1
>Reporter: zhao zhendong
>
> After discussing with people in this community, I decided to re-implement a 
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout command 
> line style, SparseMatrix, SparseVector, etc.). Eventually, it will 
> support HDFS. 
> The plan for sequential Pegasos:
> 1. Support the general file system (almost finished);
> 2. Support HDFS;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [math] watch out for Windows

2009-12-29 Thread deneche abdelhakim
The last time I tried, running Hadoop 0.20 on Windows was impossible for
me... should we still try to support Windows? I found that installing
Ubuntu in VirtualBox is the easiest way to use Hadoop on Windows.

On Mon, Dec 28, 2009 at 8:47 PM, Benson Margulies  wrote:
> Robin & I just established that the new code generator isn't working
> on Windows at all. I'm in process on a repair.
>


Re: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2009-12-29 Thread Steve Umfleet
Hadoop put its MurmurHash in utils, so that might be a consideration.  But 
for Mahout it fits better, IMO, in org.apache.mahout.common with other code 
that has a similar philosophy and purpose.  I assume that others 
will want to add alternative hash tools, so I'd create a "hash" 
package in mahout.common.

I'd put the randomizers in org.apache.mahout.math due to their interaction with 
Vector, either at that very depth or in org.apache.mahout.math.randomizer, as 
.math is beginning to get dense in terms of the number of modules.

I imagine using the priors outside of sgd, so they could be moved to 
org.apache.mahout.math as well, where they may merit their own sub package. 


--- On Tue, 12/29/09, Ted Dunning (JIRA)  wrote:

> From: Ted Dunning (JIRA) 
> Subject: [jira] Commented: (MAHOUT-228) Need sequential logistic regression 
> implementation using SGD techniques
> To: mahout-dev@lucene.apache.org
> Date: Tuesday, December 29, 2009, 12:29 PM
> 
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138
> ] 
> 
> Ted Dunning commented on MAHOUT-228:
> 
> 
> 
> This is the time.  The MurmurHash and Randomizer
> classes both seem ripe for promotion to other packages.
> 
> What I will do is file some additional JIRA's that include
> just those classes (one JIRA for Murmur, one for
> Randomizer/Vectorizer).  Those patches will probably
> make it in before this one does because they are
> simpler.  At that point, I will rework the patch on
> this JIRA to not include those classes.
> 
> Where would you recommend these others go?
> 
> 
> > Need sequential logistic regression implementation using SGD techniques
> > ---
> >
> >                 Key: MAHOUT-228
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Classification
> >            Reporter: Ted Dunning
> >             Fix For: 0.3
> >
> >         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, 
> > sgd-derivation.tex, sgd.csv
> >
> >
> > Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> > learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > I often need to have a logistic regression in Java as well, so that is a 
> > reasonable place to start.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue
> online.
> 
> 





[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795144#action_12795144
 ] 

Robin Anil commented on MAHOUT-228:
---

I say let the hash functions be in math. 

The text Randomizers can go in util.vectors. 

vectors.lucene, vectors.arff, etc. are there currently. Or do we move all of these 
to core along with the Randomizers and DictionaryBased?

> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.3
>
> Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, 
> sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795141#action_12795141
 ] 

Jake Mannix commented on MAHOUT-228:


bq. Where would you recommend these others go?

Somewhere in the math module, package name, I don't know.

> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.3
>
> Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, 
> sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795140#action_12795140
 ] 

Jake Mannix commented on MAHOUT-220:


bq. Robin: This is a library, our job is to have options for people like us to 
debate over. So let's agree upon a common mechanism.

Yep, agreed.  We need fully deterministic techniques as well as probabilistic 
ones (which will often scale better), and we should let people use whatever works 
for them and whatever they are comfortable with.

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795139#action_12795139
 ] 

Robin Anil commented on MAHOUT-220:
---

The current Bayes implementation is an island, as you can see if you skim through the 
training mechanism. It is very optimised (with the fewest map/reduces), and the kind of 
information I store in HBase and in memory is very specific to that paper. 

First there is the weight, a matrix with features as rows, labels as 
columns, and the weight in each cell.
Secondly, there are the sums of the columns and rows, stored along with the weight matrix. 
Then there are special rows containing the theta normalizer, the alpha 
smoothing value, etc. 

You can see it is not really applying Bayes' rule; it is reproducing the math of the 
CBayes paper.  So I see no way for it to directly use the sgd model. 

We could have a Bayes Algorithm implementation specific to the model you are 
training.  Is that OK?

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2009-12-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138
 ] 

Ted Dunning commented on MAHOUT-228:



This is the time.  The MurmurHash and Randomizer classes both seem ripe for 
promotion to other packages.

What I will do is file some additional JIRA's that include just those classes 
(one JIRA for Murmur, one for Randomizer/Vectorizer).  Those patches will 
probably make it in before this one does because they are simpler.  At that 
point, I will rework the patch on this JIRA to not include those classes.

Where would you recommend these others go?


> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.3
>
> Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, 
> sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795137#action_12795137
 ] 

Jake Mannix commented on MAHOUT-220:


bq. The extreme case is the DenseRandomizer. Every term gets spread out to 
every feature so you have collisions on every term on every feature. Because of 
the random weighting, you preserve enough information to allow effective 
learning.

Right, this is the use case in the stochastic decomposition case, cool.

bq. Should we generalize this concept to Vectorizer? The dictionary approach 
can accept a previously computed dictionary (possibly augmenting it on the fly) 
and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer. At 
the level I have been working, the storage of the dictionary is an open 
question. The randomizers could inherit from the same basic interface (or 
abstract class).

Definitely.  

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795136#action_12795136
 ] 

Ted Dunning commented on MAHOUT-220:


{quote}
For sgd algorithm. I suggest you define your own matrix names, row indices and 
column indices, which your algorithm and your datastore agree upon.
{quote}

That is fine if sgd is an island, but it plausibly should be able to output 
models to be used by the Bayes classifier in a map-reduce setting.  That 
requires some documentation of how DataStore is used by the Bayes models.

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795135#action_12795135
 ] 

Ted Dunning commented on MAHOUT-220:


{quote}
Robin: I am not very clear on what happens when two words have the same 
hash. Aren't we losing out on a lot of information? The one I am proposing is 
going to do exact numbering of the features.
{quote}

That is the point of the "probes" parameter.  That allows for multiple hashing 
as Jake is suggesting.  If you have, for example, 4 probes for each word, the 
chances of a complete collision are minuscule, and where there are collisions, the 
learning algorithm puts the weight on the non-colliding probes.

The extreme case is the DenseRandomizer.  Every term gets spread out to every 
feature so you have collisions on every term on every feature.  Because of the 
random weighting, you preserve enough information to allow effective learning.

See vowpal wabbit for a practical example.  They handle 10^12 (very) sparse 
features in memory and can learn at disk bandwidth in some applications.

{quote}
Jake: They might belong in a more general place, actually. If I'm going to use 
some of this stuff in the decompositions (although I'm not sure yet of the 
efficacy of the single hash for doing SVD), it should go somewhere in the math 
module.
{quote}

Should we generalize this concept to Vectorizer?  The dictionary approach can 
accept a previously computed dictionary (possibly augmenting it on the fly) and 
might be called a DictionaryVectorizer or WeightedDictionaryVectorizer.  At the 
level I have been working, the storage of the dictionary is an open question.  
The randomizers could inherit from the same basic interface (or abstract class).


> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795133#action_12795133
 ] 

Robin Anil commented on MAHOUT-220:
---

Anyway, I guess we are sounding like ML engineers here. This is a library; our 
job is to have options for people like us to debate over :). So let's agree upon 
a common mechanism. 

i.e. have different ways to create a term-frequency vector, i.e. List => 
SparseVector from documents. 

Once the SparseVector is created, use uniform M/R jobs to do things like tf-idf 
weighting and log-likelihood (although I think we need the original file, not the 
SparseVector, to get the co-occurrences).

Any ideas?






> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795131#action_12795131
 ] 

Jake Mannix commented on MAHOUT-220:


bq. I am not very clear on what happens when two words have the same hash. 
Aren't we losing out on a lot of information?

You can lose some information, sure, but there are *tons* of words, and you 
don't lose much information.  It is a probabilistic technique, though.

Personally I prefer the multi-hash approach, because at least there I really 
believe the projection preserves distances properly.  In the single-hash 
case, sometimes (e.g. for some single-word documents with different words), the 
collapse of distance is extreme (as Robin is alluding to).

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795127#action_12795127
 ] 

Robin Anil commented on MAHOUT-220:
---

I am not very clear on what happens when two words have the same hash. 
Aren't we losing out on a lot of information? The one I am proposing is going 
to do exact numbering of the features. 

One thing my method suffers from is the addition of new data. It will take another 
couple of M/R jobs to create the new dictionary file while preserving the old ids. 
It's cumbersome but doable.
What happens in a Randomizer approach? Since you are fixing the feature-set 
size, the hash ids will also change when that feature-set size increases, 
right?

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795128#action_12795128
 ] 

Jake Mannix commented on MAHOUT-220:


Anil,

  Your map-reduces look great, that's the kind of thing I've done for this as 
well.  Good stuff.  

As for HBase and caching layers, I'd say it's still not fully scalable, as 
it's limited by whatever cache size you set and by your hit/miss ratio.  It seems 
the Datastore interface really is just a wrapper around Matrix and Vector, 
calling out to the entries.  Doing so in a random-access fashion seems like the 
reverse of the way I'd do it: pass the Algorithm *to* the Datastore, and 
have the computation be done where the data lives (iterate over the Datastore 
internally, either in memory, or, if it knows it's backed by MySQL, say, by 
batching calls to the db and pulling chunks into memory, or, if it's HDFS-backed, 
by firing off an M/R job, etc.).
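
Purely to illustrate the "ship the computation to the data" inversion described above,
with hypothetical interfaces rather than the actual Mahout Datastore API:

{code}
/** A visitor that the datastore drives over its own entries, wherever they live. */
interface WeightVisitor {
  void visit(String matrixName, String row, String column, double weight);
}

/** Instead of exposing random-access getWeight(...), the store iterates internally. */
interface PushStyleDatastore {
  // An in-memory store can simply loop over its map; an HBase- or MySQL-backed store
  // can batch scans and pull chunks into memory; an HDFS-backed store can launch a
  // map/reduce job and invoke the visitor from the workers.
  void forEachWeight(String matrixName, WeightVisitor visitor);
}
{code}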

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795124#action_12795124
 ] 

Jake Mannix commented on MAHOUT-220:


Ted,

  While I'm totally down with using the randomizer / hashing techniques in 
places, I don't think we should totally wed ourselves to them either - having the 
option of using the "real" vector representation should probably be implemented 
too, since people understand it better and it's pretty standard.

bq. If you like these, we can promote them to a common area under classifier.

  They might belong in a more general place, actually.  If I'm going to use 
some of this stuff in the decompositions (although I'm not sure yet of the 
efficacy of the single hash for doing SVD), it should go somewhere in the math 
module.

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795122#action_12795122
 ] 

Ted Dunning commented on MAHOUT-220:



Anil,

See classifier.sgd.TermRandomizer (and implementations DenseRandomizer and 
BinaryRandomizer) for a term list to vector converter.  These are in the 
MAHOUT-228 patch.

It has the virtue of converting term lists to vectors of fixed size.  It 
currently does not do term weighting, but that would be a very easy fix.  The 
approach is roughly along the lines of 
http://arxiv.org/PS_cache/arxiv/pdf/0902/0902.2206v2.pdf or the stochastic 
decomposition work.

If you like these, we can promote them to a common area under classifier.

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795117#action_12795117
 ] 

Robin Anil commented on MAHOUT-220:
---

A caching layer is implemented in HbaseDatastore; you can set the cache size. 
Take a look at MAHOUT-124 for more details.

I am just porting the feature mapper and tf-idf mapper from bayes classifier 
common over to make the new text vectorizer. Take a look at them. It's a fully 
distributed way of doing tf-idf in 2 map/reduces. 

For the vector converter, here is the idea in steps:

M/R1: Count the frequencies of words tokenized using a configurable Lucene Analyzer.
SEQ1: Read the frequency list, prune words below minSupport, and create the 
dictionary file (string => long) and the frequency file (string => long).
Do the map/reduce in chunks by keeping a block of the dictionary file in memory: 
   repeat - M/R2: Run over the input documents, replacing each string with its 
integer id, and create (docid => sparsevector). This sparsevector has TF weights, 
but it is incomplete.
Now run a map/reduce over the incomplete sparse vectors, grouping by docid; in the 
reducer, merge the sparse vectors. 
The initial SparseVectors dataset is ready.

function multiplyIDF() {
M/R3: Calculate DF from the SparseVector dataset.
M/R4: Run over the SparseVector TF dataset and multiply in the IDF.
}

This is the first plan, at least until I finish. The second is to convert each 
document into a stream of integers using the dictionary file; then subsequent 
functions can run M/R jobs to calculate LLR and make bigrams. 

For this, the sparsevector-merge MapReduce function should be generic enough. 
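
A bare-bones sketch of the first step (M/R1), assuming Hadoop 0.20's
org.apache.hadoop.mapreduce API and a trivial whitespace tokenizer in place of the
configurable Lucene Analyzer mentioned above (class names are illustrative):

{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class WordFrequencySketch {

  /** Emits (term, 1) for every token of every input line. */
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // The real vectorizer would tokenize with a configurable Lucene Analyzer here.
      for (String token : value.toString().toLowerCase().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  /** Sums the counts; the output feeds the minSupport pruning / dictionary step (SEQ1). */
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }
}
{code}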






> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795114#action_12795114
 ] 

Jake Mannix commented on MAHOUT-220:


Robin,

  To really be scalable here, I'm down with the M/R approach for the 
classifiers.  The random-access nature of the current Datastore interface 
definitely seems limiting - even using HBase this way means we're making lots 
of remote calls, while a traditional Hadoop job would do the nice "put the 
code where the data lives" thing instead.

Switching over to using SparseVectors and doing things sequentially over the data 
set stored in SequenceFiles of them definitely seems the way I'd see this 
going.  Is that what your current hadoopified version of this does?

bq. I am currently writing a Map/reduce job to convert text documents to vectors 
without relying on Lucene.

How are you doing this?  Is it a bag-of-words representation (what form of tf are 
you using?  how are you putting in idf if it's fully distributed?)?

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2009-12-29 Thread Steve Umfleet (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795064#action_12795064
 ] 

Steve Umfleet commented on MAHOUT-228:
--

Hi Ted.  Watching your progress on SGD was instructive.  Thanks for the 
"template" of how to submit and proceed with an issue.

At what point in the process are decisions about packages resolved?  For 
example, MurmurHash at first glance, and based on its own documentation, seems 
like it might be broadly useful outside of org.apache.mahout.classifier.

> Need sequential logistic regression implementation using SGD techniques
> ---
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Reporter: Ted Dunning
> Fix For: 0.3
>
> Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, 
> sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable 
> learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a 
> reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795060#action_12795060
 ] 

Benson Margulies commented on MAHOUT-230:
-

Heck, a read of the reference cited in the JDK 1.6 doc would prove rewarding, 
no doubt. Anyone else willing?

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795058#action_12795058
 ] 

Robin Anil commented on MAHOUT-230:
---

What about Hadoop? I guess it's their core operation.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/MergeSort.html

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795055#action_12795055
 ] 

Grant Ingersoll commented on MAHOUT-230:


I seem to recall Lucene having its own merge sort; maybe we should look there? 
Not saying it's faster, but it might be worth looking at.

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795054#action_12795054
 ] 

Benson Margulies commented on MAHOUT-230:
-

And it's all in the merge sorts.

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795053#action_12795053
 ] 

Grant Ingersoll commented on MAHOUT-230:


Committed revision 894390.

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050
 ] 

Robin Anil edited comment on MAHOUT-220 at 12/29/09 1:36 PM:
-

Datastore is an interface that allows you to pick a named vector or a named 
matrix and look up a cell.  
For the Bayes classifier, the entire code is based on tokens and not SparseVectors. 
The names of the matrix, the row and the column are therefore strings, and the 
contract between the Algorithm and the Datastore is decided per algorithm. For the 
Cbayes/Bayes algorithms, we have HBaseBayesDatastore.java and 
InMemoryBayesDatastore.java. 

{code}
  double getWeight(String matrixName, String row, String column) throws 
InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws 
InvalidDatastoreException;
{code}

For the sgd algorithm, I suggest you define your own matrix names, row indices and 
column indices, which your algorithm and your datastore agree upon.

I know this creates the limitation that you can't use integer-based column and 
row names. Maybe we can parameterize it, OR change the Bayes package to use Vectors 
instead of the current string-token-based implementation. 

I am currently writing a Map/reduce job to convert text documents to vectors 
without relying on Lucene. Once that is done, I will overhaul the classifier 
package to use SparseVectors. 

Before that I need to know if this patch is OK in terms of code style; I will 
then patch it and start on the enhancements.


  was (Author: robinanil):
Datastore is an interface which allows you pick a named vector or a matrix 
and lookup the cell.  For Bayes classifier, since the entire code is based on 
tokens and not SparseVectors. The names of the matrix, the row and column is 
upto the implementation. for the Cbayes/Bayes algorithms, We have the 
HBaseBayesDatastore.java and 
InMemoryBayesDatastore.java. 

{code}
  double getWeight(String matrixName, String row, String column) throws 
InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws 
InvalidDatastoreException;
{code}

For sgd algorithm. I suggest you define your own matrix names, row indices and 
column indices, which your algorithm and datastore agree upon.

I know it, this creates a limitation that you can use integer based column and 
row names. Maybe we can parameterize it OR change Bayes package to use Vectors 
instead of the current string token based implementation. 

I am currenly writing a Map/reduce job to convert text documents to vectors 
without relying on Lucene. Once that is done, I will overhaul the classifier 
package to use SparseVectors. 

Before that I need to know if this Patch is ok. In terms of code style, I will 
then patch it and start with the enhancements 

  
> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050
 ] 

Robin Anil commented on MAHOUT-220:
---

Datastore is an interface which allows you pick a named vector or a matrix and 
lookup the cell.  For Bayes classifier, since the entire code is based on 
tokens and not SparseVectors. The names of the matrix, the row and column is 
upto the implementation. for the Cbayes/Bayes algorithms, We have the 
HBaseBayesDatastore.java and 
InMemoryBayesDatastore.java. 

{code}
  double getWeight(String matrixName, String row, String column) throws 
InvalidDatastoreException;
  double getWeight(String vectorName, String index) throws 
InvalidDatastoreException;
{code}

For sgd algorithm. I suggest you define your own matrix names, row indices and 
column indices, which your algorithm and datastore agree upon.

I know it, this creates a limitation that you can use integer based column and 
row names. Maybe we can parameterize it OR change Bayes package to use Vectors 
instead of the current string token based implementation. 

I am currenly writing a Map/reduce job to convert text documents to vectors 
without relying on Lucene. Once that is done, I will overhaul the classifier 
package to use SparseVectors. 

Before that I need to know if this Patch is ok. In terms of code style, I will 
then patch it and start with the enhancements 


> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Benson Margulies
Simple answer: the Harmony team didn't code as well as the Sun people did.

This is not my metier, so if someone else can suggest algorithmic
improvements ...

On Tue, Dec 29, 2009 at 8:06 AM, Robin Anil (JIRA)  wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795047#action_12795047
>  ]
>
> Robin Anil commented on MAHOUT-230:
> ---
>
> I ran the SortingTest: instead of 3.2 seconds it now takes 5.2 seconds. I re-applied 
> the patch and rechecked. Any idea why the perf dip?  Notice the drop in 
> the second half of each block (the mergesort tests).
>
> {code:xml|title=Original Output}
> [JUnit <testcase> timing entries for the binary-search, quicksort and mergesort tests; the values were not preserved in this archive]
> {code}
>
> {code:xml|title=After Patching}
> [The same set of <testcase> timing entries after applying the patch; values likewise not preserved in this archive]
> {code}
>
>> Replace org.apache.mahout.math.Sorting with code of clear provenance
>> 
>>
>>                 Key: MAHOUT-230
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-230
>>             Project: Mahout
>>          Issue Type: Bug
>>          Components: Math
>>    Affects Versions: 0.3
>>            Reporter: Benson Margulies
>>            Assignee: Benson Margulies
>>             Fix For: 0.3
>>
>>         Attachments: replace-sorting.diff
>>
>>   Original Estimate: 72h
>>  Remaining Estimate: 72h
>>
>> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
>> the Sun JRE, based on the private internal function names and contents. That 
>> code has a restrictive license. We need to take the equivalent file 
>> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
>> replacement.
>> The problematic code are the quickSort and mergeSort functions, which extend 
>> 'Arrays' by supporting slices of arrays and custom sorting predicate 
>> functions.
>> One might also wistfully note that the more recent JDKs from Sun have 
>> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
>> Harmony, so a really energetic person might build implementations in here to 
>> match. However, expediency calls for just bashing on the Harmony 
>> implementation to solve the problem at hand.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795048#action_12795048
 ] 

Grant Ingersoll commented on MAHOUT-230:


I think we need to commit and then worry about performance.  The legal issues 
outweigh the performance issues at this point.  I'll commit, and then we can 
open a new issue for performance.

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-230:
--

Assignee: Grant Ingersoll  (was: Benson Margulies)

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-29 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795047#action_12795047
 ] 

Robin Anil commented on MAHOUT-230:
---

I ran the SortingTest: instead of 3.2 seconds it now takes 5.2 seconds. I re-applied 
the patch and rechecked. Any idea why the perf dip?  Notice the drop in 
the second half of each block (the mergesort tests).

{code:xml|title=Original Output}
[JUnit <testcase> timing entries for the binary-search, quicksort and mergesort tests; the values were not preserved in this archive]
{code}

{code:xml|title=After Patching}
[The same set of <testcase> timing entries after applying the patch; values likewise not preserved in this archive]
{code}

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.