Re: Classification beginner questions

2011-06-16 Thread Ted Dunning
A full sort is not usually feasible/desirable. Better to just keep a pool of samples and replace random samples. On Thu, Jun 16, 2011 at 2:41 AM, Lance Norskog wrote: > Use a crypto-hash on the base data as the sorting key. The base data > is the value (payload). That should randomly permute th
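Ted's "keep a pool of samples and replace random samples" is essentially reservoir sampling. A minimal plain-Java sketch of that idea (illustrative only, not Mahout code; the class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Keep a fixed-size pool; after the pool fills, each new item replaces
    // a random slot with probability poolSize / itemsSeen, so every item in
    // the stream ends up in the pool with equal probability.
    static <T> List<T> sample(Iterable<T> stream, int poolSize, Random rng) {
        List<T> pool = new ArrayList<>(poolSize);
        long seen = 0;
        for (T item : stream) {
            seen++;
            if (pool.size() < poolSize) {
                pool.add(item);
            } else {
                long j = (long) (rng.nextDouble() * seen);
                if (j < poolSize) {
                    pool.set((int) j, item);
                }
            }
        }
        return pool;
    }
}
```

This gives a uniform random subsample in one pass over the data, with no sort at all.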

Re: Classification beginner questions

2011-06-15 Thread Lance Norskog
Use a crypto-hash on the base data as the sorting key. The base data is the value (payload). That should randomly permute things. On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning wrote: > It is already in Mahout, I think. > > On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog wrote: > >> Coding a permut
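Lance's crypto-hash trick works because a cryptographic hash of the payload behaves like a pseudo-random sort key: sorting by it yields a deterministic but effectively random permutation. A small plain-Java sketch (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HashShuffle {
    // SHA-1 of the record payload, rendered as hex, used as the sort key.
    static String hexHash(String payload) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Sorting by hash permutes the records in a random-looking but
    // reproducible order -- exactly what SGD wants from its input.
    static List<String> shuffle(List<String> records) {
        List<String> out = new ArrayList<>(records);
        out.sort(Comparator.comparing(HashShuffle::hexHash));
        return out;
    }
}
```

In Map/Reduce the same idea is a mapper that emits the hash as the key; the shuffle phase then does the sort for free.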

Re: Classification beginner questions

2011-06-15 Thread Ted Dunning
It is already in Mahout, I think. On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog wrote: > Coding a permutation like this in Map/Reduce is a good beginner exercise. > > On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning > wrote: > > But the key is that you have to have both kinds of samples. Moreove

Re: Classification beginner questions

2011-06-13 Thread Lance Norskog
Coding a permutation like this in Map/Reduce is a good beginner exercise. On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning wrote: > But the key is that you have to have both kinds of samples.  Moreover, > for all of the stochastic gradient descent work, you need to have them > in a random-ish order.

Re: Classification beginner questions

2011-06-12 Thread Ted Dunning
But the key is that you have to have both kinds of samples. Moreover, for all of the stochastic gradient descent work, you need to have them in a random-ish order. You can't show all of one category and then all of another. It is even worse if you sort your data. On Mon, Jun 13, 2011 at 5:35 AM

Re: Classification beginner questions

2011-06-12 Thread Hector Yee
If you have a much larger background set you can try online passive aggressive in mahout 0.6, as it uses hinge loss and does not update the model if it gets things correct. Log loss, in contrast, will always have a gradient. On Jun 12, 2011 7:54 AM, "Joscha Feth" wrote: > Hi Ted, > > I see. Only for
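Hector's point about the two losses can be made concrete. For a label y in {-1, +1} and a raw score s = w·x, hinge loss is max(0, 1 - y*s), so its gradient is exactly zero once an example is classified correctly with margin; log loss is log(1 + exp(-y*s)), whose gradient is never exactly zero. A small sketch (names are illustrative):

```java
public class LossGradients {
    // hinge loss L = max(0, 1 - y*s);        dL/ds = 0 if y*s >= 1, else -y
    static double hingeGrad(double y, double s) {
        return (y * s >= 1.0) ? 0.0 : -y;
    }

    // log loss   L = log(1 + exp(-y*s));     dL/ds = -y / (1 + exp(y*s))
    // This is never exactly zero, so every example moves the model a little.
    static double logGrad(double y, double s) {
        return -y / (1.0 + Math.exp(y * s));
    }
}
```

This is why a passive-aggressive (hinge-based) learner skips updates on confidently correct examples, which matters when a huge background class would otherwise dominate training.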

Re: Classification beginner questions

2011-06-12 Thread Ted Dunning
An infinite number of samples is fine. It is still true that you need to have training samples from all of the target categories. On Sun, Jun 12, 2011 at 2:53 PM, Joscha Feth wrote: > Hi Ted, > > I see. Only for the OLR or also for any other algorithm? What if my > other category theoretically c

Re: Classification beginner questions

2011-06-12 Thread Joscha Feth
Hi Ted, I see. Only for the OLR or also for any other algorithm? What if my other category theoretically contains an infinite number of samples? Cheers, Joscha On 12.06.2011 at 15:08, Ted Dunning wrote: > Joscha, > > There is no implicit training. you need to give negative examples as > well

Re: Classification beginner questions

2011-06-12 Thread Ted Dunning
Joscha, There is no implicit training. You need to give negative examples as well as positive. On Sat, Jun 11, 2011 at 9:08 AM, Joscha Feth wrote: > Hello Ted, > > thanks for your response! > What I wanted to accomplish is actually quite simple in theory: I have some > sentences which have thi
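Ted's point is that a binary learner only learns from the contrast between the two labels: calling train(0, ...) for every example (as in the OLRTest snippet later in the thread) gives it nothing to separate. A tiny plain-Java logistic-regression sketch of what "give both kinds of samples" means (this is not the Mahout OnlineLogisticRegression API; all names here are illustrative):

```java
public class TinyLogistic {
    double[] w;
    double rate;

    TinyLogistic(int features, double rate) {
        this.w = new double[features];
        this.rate = rate;
    }

    // Sigmoid of the dot product: probability of the positive class.
    double score(double[] x) {
        double s = 0;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // target is 0 or 1. Both kinds of examples must be supplied; if every
    // call passes the same target, the weights just drift one way and the
    // model learns nothing about the boundary.
    void train(int target, double[] x) {
        double err = target - score(x);
        for (int i = 0; i < w.length; i++) w[i] += rate * err * x[i];
    }
}
```

After alternating positive and negative examples, the model pushes the two scores apart; with only one label it cannot.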

Re: Classification beginner questions

2011-06-11 Thread Joscha Feth
Hello Ted, thanks for your response! What I wanted to accomplish is actually quite simple in theory: I have some sentences which have things in common (like some similar words for example). I want to train my model with these example sentences I have. Once it is trained I want to give an unknown s

Re: Classification beginner questions

2011-06-11 Thread Joscha Feth
Hello Sebastian, Thanks for the hint, I did get the MEAP edition of the ebook already through Manning, however I find myself struggling to translate the newsgroup and Wikipedia examples to my use case. In particular I can't seem to find any code examples which help me with the generation o

Re: Classification beginner questions

2011-06-11 Thread Joscha Feth
Hector, thank you very much for your response, I adapted my example: -- 8< -- public class OLRTest { private static final String[] animals = new String[] { "alligator", "ant", "bear", "bee", "bird", "camel", "cat", "cheetah", "chicken", "chimpanzee", "cow", "crocodile

Re: Classification beginner questions

2011-06-10 Thread Ted Dunning
The target variable here is always zero. Shouldn't it vary? On Fri, Jun 10, 2011 at 9:54 AM, Joscha Feth wrote: >            algorithm.train(0, generateVector(animal)); >

Re: Classification beginner questions

2011-06-10 Thread Sebastian Schelter
Hi Joscha, If you have some money left, I'd recommend getting a copy of Mahout in Action, which features a very readable, detailed introduction to classification with Mahout, including strategies for feature selection. --sebastian On 10.06.2011 17:28, Hector Yee wrote: Oh you have a very

Re: Classification beginner questions

2011-06-10 Thread Hector Yee
Oh, you have a very strange feature: you are using the label as a feature, my bad. I thought the words were the labels. Usually it's something meaningful like weight or height. If it's just the label like you have, you might as well use a hash map; there is no feature to learn! But if you
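For Joscha's sentence-classification use case, a common way to turn text into features the learner can generalize over is feature hashing: each word is mapped into a fixed-size vector by its hash. A minimal plain-Java sketch (illustrative; Mahout has its own encoders, which are not shown here):

```java
public class HashedFeatures {
    // Encode a sentence's words into a fixed-size feature vector via
    // hashing. Different words can collide into the same slot; with a
    // large enough vector the learner tolerates this.
    static double[] encode(String sentence, int dims) {
        double[] v = new double[dims];
        for (String word : sentence.toLowerCase().split("\\s+")) {
            int idx = Math.floorMod(word.hashCode(), dims);
            v[idx] += 1.0;
        }
        return v;
    }
}
```

Unlike using the label itself as the feature, two sentences sharing words now produce overlapping vectors, which is what lets the model generalize to unseen sentences.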

Re: Classification beginner questions

2011-06-10 Thread Hector Yee
It's the one with the highest score. The relative score compared to other classes matters more than the absolute value, especially when you have many classes like you have. Even with logistic regression my personal preference is to use the noLink function and use that score. Sent from my iPad On Jun 10
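"It's the one with the highest score" is just an argmax over the per-class scores; a one-method sketch (the class name is illustrative):

```java
public class ArgMax {
    // Return the index of the class with the highest score. Only the
    // relative ordering of the scores matters, not their absolute values.
    static int best(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        return best;
    }
}
```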

Classification beginner questions

2011-06-10 Thread Joscha Feth
Hello fellow Mahouts, I am trying to grasp Mahout and generated a very simple (but obviously wrong) example which I hoped would help me understand how everything works: -- 8< -- public class OLRTest { private static final int FEATURES = 1; private static final int CATEGORIES = 2; pr