Hello fellow Mahouts,
I am trying to get to grips with Mahout and wrote a very simple (but
obviously wrong) example that I hoped would help me understand how
everything works:
-- 8< --
public class OLRTest {

    private static final int FEATURES = 1;
    private static final int CATEGORIES = 2;

    private static final WordValueEncoder ANIMAL_ENCODER =
            new AdaptiveWordValueEncoder("animal");

    private static final String[] animals = new String[] {
        "alligator", "ant", "bear", "bee", "bird", "camel", "cat",
        "cheetah", "chicken", "chimpanzee", "cow", "crocodile", "deer",
        "dog", "dolphin", "duck", "eagle", "elephant", "fish", "fly",
        "fox", "frog", "giraffe", "goat", "goldfish", "hamster",
        "hippopotamus", "horse", "kangaroo", "kitten", "lion", "lobster",
        "monkey", "octopus", "owl", "panda", "pig", "puppy", "rabbit",
        "rat", "scorpion", "seal", "shark", "sheep", "snail", "snake",
        "spider", "squirrel", "tiger", "turtle", "wolf", "zebra"
    };

    public static void main(String[] args) {
        final OnlineLogisticRegression algorithm =
                new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1());
        for (String animal : animals) {
            algorithm.train(0, generateVector(animal));
        }
        algorithm.close();

        testClassify(algorithm, "lion");
        testClassify(algorithm, "rabbit");
        testClassify(algorithm, "xyz");
        testClassify(algorithm, "something");
    }

    private static void testClassify(final OnlineLogisticRegression algorithm,
            final String allegedAnimal) {
        System.out.println(allegedAnimal
                + " is an animal with a probability of "
                + algorithm.classifyScalar(generateVector(allegedAnimal)) * 100
                + "%");
    }

    private static Vector generateVector(String animal) {
        final Vector v = new RandomAccessSparseVector(FEATURES);
        ANIMAL_ENCODER.addToVector(animal, v);
        return v;
    }
}
-- 8< --
The output of running this sample code is:
-- 8< --
lion is an animal with a probability of 0.12008121418417145%
rabbit is an animal with a probability of 0.11720244687895641%
xyz is an animal with a probability of 0.04192879358244322%
something is an animal with a probability of 0.04047790610981663%
-- 8< --
Several things surprised me:
* I would have expected the probabilities for "lion" and "rabbit" to be
close to 100%
* I would have expected the probabilities for "xyz" and "something" to be
close to 0%
* I would have expected "lion" and "rabbit" to get the same probability
* I would have expected "xyz" and "something" to get the same probability
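One detail that may bear on the last two points is the vector size
(FEATURES = 1). Hashed encoders choose a slot by reducing a hash modulo the
vector size, so with size 1 every word ends up in slot 0 and words can
differ only in the value written there. The following is a minimal sketch
of that arithmetic, not Mahout code: the class SingleSlotSketch, the method
slotFor and the use of String.hashCode() are assumptions made for
illustration, and Mahout's encoders use their own hash function.

```java
class SingleSlotSketch {

    // Maps a word to a slot in a vector of the given cardinality.
    // ASSUMPTION: String.hashCode() stands in for the encoder's real hash.
    static int slotFor(String word, int cardinality) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(word.hashCode(), cardinality);
    }

    public static void main(String[] args) {
        String[] words = {"lion", "rabbit", "xyz", "something"};
        for (String word : words) {
            // With cardinality 1 the only possible slot is 0.
            System.out.println(word + " -> slot " + slotFor(word, 1));
        }
    }
}
```

With a larger cardinality the same words would normally spread over
different slots, which gives the model more than one weight to learn.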
I know that the animal sample provided is extremely small, but even
training with multiple passes (100, 1000, 10000) changed the probabilities
only marginally.
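A library-free sketch of the same experiment may help isolate the
behaviour. It is not Mahout's implementation: it assumes a single weight,
plain SGD with a fixed learning rate, no prior and made-up feature values,
and every name in it (OneClassSketch, trainWeight,
probabilityOfCategoryOne) is hypothetical. Like the code above, it trains
only on label 0, and it reports the modelled probability of category 1.

```java
class OneClassSketch {

    // Standard logistic function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Modelled P(category 1 | x) for a one-weight binary logistic model.
    static double probabilityOfCategoryOne(double weight, double x) {
        return sigmoid(weight * x);
    }

    // Plain SGD over examples that all carry the same label.
    // ASSUMPTION: fixed learning rate and no regularisation, unlike Mahout.
    static double trainWeight(double[] xs, int label, int passes, double rate) {
        double w = 0.0;
        for (int p = 0; p < passes; p++) {
            for (double x : xs) {
                double predicted = probabilityOfCategoryOne(w, x);
                // Gradient step on the log-likelihood of the observed label.
                w += rate * (label - predicted) * x;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Hypothetical encoded feature values; only label 0 is ever seen.
        double[] trainingValues = {0.8, 1.0, 1.2};
        for (int passes : new int[] {1, 100, 10000}) {
            double w = trainWeight(trainingValues, 0, passes, 0.5);
            System.out.printf("passes=%d  P(category 1 | x=1.0)=%.6f%n",
                    passes, probabilityOfCategoryOne(w, 1.0));
        }
    }
}
```

In this sketch the probability of category 1 only shrinks toward 0 as the
number of passes grows, and inputs with different feature values get
different probabilities even though they share one weight. Whether the same
reasoning explains the Mahout output depends on which category
classifyScalar reports, which is worth checking in its Javadoc.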
What am I missing here?
Thanks very much!
Joscha Feth