Oh, you have a very strange feature: you are using the label as the feature. My
bad, I thought the words were the labels.
Usually a feature is something like weight or height, something meaningful. If
it's just the label, like you have, you might as well use a hash map; there is
no feature to learn! But if you want, try making it an indicator vector. Set
FEATURES to the number of animals, and for each animal's vector put a 1 at the
index of that animal in the array and 0 everywhere else. E.g. for ant (index 1)
the vector is 0, 1, 0, 0, 0, ...
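
A minimal sketch of that indicator encoding, in plain Java so it runs without
Mahout on the classpath. The class name IndicatorFeatures and the method encode
are made up for illustration and are not part of Mahout; the four-word
vocabulary stands in for the full animals array.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Mahout): builds an indicator ("one-hot")
// vector for a word drawn from a fixed vocabulary.
class IndicatorFeatures {

    // Returns a vector with one slot per vocabulary entry: 1.0 at the index
    // of `word`, 0.0 everywhere else. Unknown words yield an all-zero vector.
    static double[] encode(String[] vocabulary, String word) {
        double[] features = new double[vocabulary.length];
        for (int i = 0; i < vocabulary.length; i++) {
            if (vocabulary[i].equals(word)) {
                features[i] = 1.0;
                break;
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Shortened vocabulary for illustration; the real one is the animals array.
        String[] animals = { "alligator", "ant", "bear", "bee" };
        System.out.println(Arrays.toString(encode(animals, "ant"))); // [0.0, 1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(encode(animals, "xyz"))); // [0.0, 0.0, 0.0, 0.0]
    }
}
```

In your code this would mean FEATURES = animals.length, with the values copied
into the Vector you pass to train(). Note that a word outside the array gets an
all-zero vector, so the model has nothing to go on for it.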


On Jun 10, 2011, at 12:54 AM, Joscha Feth <[email protected]> wrote:

> Hello fellow Mahouts,
> 
> I am trying to grasp Mahout and generated a very simple (but obviously
> wrong) example which I hoped would help me understand how everything works:
> 
> -- 8< --
> public class OLRTest {
> 
>    private static final int FEATURES = 1;
>    private static final int CATEGORIES = 2;
> 
>    private static final WordValueEncoder ANIMAL_ENCODER =
>            new AdaptiveWordValueEncoder("animal");
> 
>    private static final String[] animals = new String[] {
>            "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
>            "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
>            "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
>            "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
>            "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
>            "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
>            "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
>            "spider", "squirrel", "tiger", "turtle", "wolf", "zebra" };
> 
>    public static void main(String[] args) {
>        final OnlineLogisticRegression algorithm =
>                new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());
> 
>        for (String animal : animals) {
>            algorithm.train(0, generateVector(animal));
>        }
> 
>        algorithm.close();
> 
>        testClassify(algorithm, "lion");
>        testClassify(algorithm, "rabbit");
>        testClassify(algorithm, "xyz");
>        testClassify(algorithm, "something");
>    }
> 
>    private static void testClassify(final OnlineLogisticRegression algorithm,
>            final String allegedAnimal) {
>        System.out.println(allegedAnimal
>                + " is an animal with a probability of "
>                + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
>                + "%");
>    }
> 
>    private static Vector generateVector(String animal) {
>        final Vector v = new RandomAccessSparseVector(FEATURES);
>        ANIMAL_ENCODER.addToVector(animal, v);
>        return v;
>    }
> }
> -- 8< --
> 
> The output of running this sample code is:
> -- 8< --
> lion is an animal with a probability of 0.12008121418417145%
> rabbit is an animal with a probability of 0.11720244687895641%
> xyz is an animal with a probability of 0.04192879358244322%
> something is an animal with a probability of 0.04047790610981663%
> -- 8< --
> 
> There were multiple surprising things for me:
> * I would have expected the probability of "lion" and "rabbit" to be close
>   to 100%
> * I would have expected the probability of "xyz" and "something" to be close
>   to 0%
> * I would have expected the probability of "lion" to be the same as the one
>   for "rabbit"
> * I would have expected the probability of "xyz" to be the same as the one
>   for "something"
> 
> I know that the sample of animals provided is extremely small, but even when
> training with multiple passes (100, 1000, 10000) the probabilities changed
> only marginally.
> What am I missing here?
> 
> Thanks very much!
> Joscha Feth
