Classification beginner questions

Joscha Feth Fri, 10 Jun 2011 00:55:56 -0700

Hello fellow Mahouts,

I am trying to get to grips with Mahout and wrote a very simple (but
obviously wrong) example that I hoped would help me understand how
everything works:


-- 8< --
// Imports assume the Mahout 0.5 package layout.
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;
import org.apache.mahout.vectorizer.encoders.WordValueEncoder;

public class OLRTest {

    private static final int FEATURES = 1;
    private static final int CATEGORIES = 2;

    private static final WordValueEncoder ANIMAL_ENCODER =
            new AdaptiveWordValueEncoder("animal");

    private static final String[] animals = new String[] {
            "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
            "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
            "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
            "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
            "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
            "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
            "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
            "spider", "squirrel", "tiger", "turtle", "wolf", "zebra" };

    public static void main(String[] args) {
        final OnlineLogisticRegression algorithm =
                new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());

        // Every animal is trained with target category 0.
        for (String animal : animals) {
            algorithm.train(0, generateVector(animal));
        }

        algorithm.close();

        testClassify(algorithm, "lion");
        testClassify(algorithm, "rabbit");
        testClassify(algorithm, "xyz");
        testClassify(algorithm, "something");
    }

    // Prints the scalar score for the given word as a percentage.
    private static void testClassify(final OnlineLogisticRegression algorithm,
            final String allegedAnimal) {
        System.out.println(allegedAnimal
                + " is an animal with a probability of "
                + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
                + "%");
    }

    // Encodes a single word into a sparse vector of size FEATURES.
    private static Vector generateVector(String animal) {
        final Vector v = new RandomAccessSparseVector(FEATURES);
        ANIMAL_ENCODER.addToVector(animal, v);
        return v;
    }
}
-- 8< --
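
A side note on FEATURES = 1 in the code above: if the encoder hashes each
word into a slot modulo the vector size (an assumption about how hashed
encoders generally work), then a vector of size 1 leaves only slot 0 for
every word. A minimal sketch in plain Java, with String.hashCode as a
hypothetical stand-in for the encoder's real hash function:

```java
// Hypothetical stand-in for a hashed word encoder; not Mahout's implementation.
class SingleFeatureHashSketch {

    // Maps a word to a slot in a vector of the given size.
    // String.hashCode is an assumed stand-in for the encoder's real hash.
    static int slotFor(String word, int numFeatures) {
        return Math.floorMod(word.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        String[] words = { "lion", "rabbit", "xyz", "something" };
        for (int numFeatures : new int[] { 1, 1000 }) {
            System.out.println("numFeatures = " + numFeatures);
            for (String word : words) {
                // With numFeatures = 1 the slot is always 0.
                System.out.println("  " + word + " -> slot "
                        + slotFor(word, numFeatures));
            }
        }
    }
}
```

With a size of 1 the words can then differ only in the weight written to
that single slot, not in which slot they occupy.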

The output of running this sample code is:
-- 8< --
lion is an animal with a probability of 0.12008121418417145%
rabbit is an animal with a probability of 0.11720244687895641%
xyz is an animal with a probability of 0.04192879358244322%
something is an animal with a probability of 0.04047790610981663%
-- 8< --

Several things surprised me:
* I would have expected the probability for "lion" and "rabbit" to be
  close to 100%
* I would have expected the probability for "xyz" and "something" to be
  close to 0%
* I would have expected the probability for "lion" to be the same as the
  one for "rabbit"
* I would have expected the probability for "xyz" to be the same as the
  one for "something"

I know that the animal sample provided is extremely small, but even when
training with multiple passes (100, 1000, 10000) the probabilities changed
only marginally.
What am I missing here?
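
To make the multi-pass experiment concrete outside Mahout, here is a
minimal sketch of repeated SGD passes over a hand-rolled, hypothetical
single-feature logistic model. The feature values, learning rate, and
class name are made up for illustration, and, as in the code above, every
training example carries label 0:

```java
// Hypothetical single-feature logistic model trained by plain SGD; not Mahout code.
class MultiPassSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Trains a single weight on feature values xs with labels ys for the
    // given number of passes and returns the learned weight.
    static double train(double[] xs, int[] ys, int passes, double learningRate) {
        double w = 0.0;
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < xs.length; i++) {
                double p = sigmoid(w * xs[i]);           // predicted P(label = 1)
                w += learningRate * (ys[i] - p) * xs[i]; // gradient step on the log-likelihood
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 1.2, 0.8, 1.1 }; // made-up feature values
        int[] ys = { 0, 0, 0, 0 };            // every example carries label 0
        for (int passes : new int[] { 1, 100, 10000 }) {
            double w = train(xs, ys, passes, 0.1);
            System.out.println(passes + " passes: P(label = 1 | x = 1.0) = "
                    + sigmoid(w));
        }
    }
}
```

In this sketch the estimate for label 1 starts at 0.5 and only shrinks
with further passes, because no example ever carries label 1. Whether
classifyScalar reports the score for category 1 in the same way is
something to check against the Mahout Javadoc.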

Thanks very much!
Joscha Feth
