So, to answer my own question: the order of the training examples matters. I
had been feeding all of category 1 and then all of category 0, and apparently
that breaks things badly. Presumably an online learner adapts to the most
recent examples, so after a long run of a single category it ends up
predicting that category for everything.
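
For anyone who hits the same thing: the fix is to shuffle the training
examples so the learner sees the two categories interleaved instead of in two
long runs. A minimal sketch, assuming the vectors fit in memory; loadVectors()
is a hypothetical helper standing in for the SequenceFile reading loop quoted
below:

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.NamedVector;

    // Read all training vectors up front (hypothetical helper).
    List<NamedVector> examples = loadVectors();

    // Interleave the categories; a fixed seed keeps runs reproducible.
    Collections.shuffle(examples, new Random(42));

    AdaptiveLogisticRegression reg =
            new AdaptiveLogisticRegression(2, 100, new L1());
    for (NamedVector v : examples) {
        reg.train("spam".equals(v.getName()) ? 1 : 0, v);
    }
    reg.close();

Shuffling once up front is the simplest fix; interleaving the categories when
writing the sequence file in the first place would work just as well.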


On Wed, Feb 13, 2013 at 4:29 PM, Brian McCallister <bri...@skife.org> wrote:

> I'm trying to build a basic two-category classifier on textual data. I am
> working with a training set of only about 100,000 documents, and am using
> an AdaptiveLogisticRegression with default settings.
>
> When I build the trainer it reports:
>
>
> % correct:      0.9996315789473774
> AUC:            0.75
> log likelihood: -0.032966543010819874
>
> Which seems pretty good.
>
> When I then classify the *training data*, everything lands in the first
> category, when in fact the documents are split evenly between the two.
>
> Creation of vectors looks like:
>
>         FeatureVectorEncoder content_encoder =
>                 new AdaptiveWordValueEncoder("content");
>         content_encoder.setProbes(2);
>
>         FeatureVectorEncoder type_encoder =
>                 new StaticWordValueEncoder("type");
>         type_encoder.setProbes(2);
>
>         Vector v = new RandomAccessSparseVector(100);
>         type_encoder.addToVector(type, v);
>
>         for (String word : data.getWords()) {
>             content_encoder.addToVector(word, v);
>         }
>         return new NamedVector(v, label);
>
> where data.getWords() is the massaged content of the various documents
> (tidied, characters extracted, then run through the Lucene standard
> analyzer and a lower-case filter).
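>
> For concreteness, the tokenization is along these lines. A minimal sketch,
> assuming Lucene 3.x, whose StandardAnalyzer already applies a lower-case
> filter; tokenize() is just an illustrative name:
>
>         import java.io.IOException;
>         import java.io.StringReader;
>         import java.util.ArrayList;
>         import java.util.List;
>
>         import org.apache.lucene.analysis.Analyzer;
>         import org.apache.lucene.analysis.TokenStream;
>         import org.apache.lucene.analysis.standard.StandardAnalyzer;
>         import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>         import org.apache.lucene.util.Version;
>
>         private List<String> tokenize(String text) throws IOException {
>             Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
>             TokenStream ts =
>                     analyzer.tokenStream("content", new StringReader(text));
>             CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>             List<String> words = new ArrayList<String>();
>             ts.reset();
>             while (ts.incrementToken()) {
>                 words.add(term.toString());
>             }
>             ts.end();
>             ts.close();
>             return words;
>         }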
>
> Training looks like:
>
>             Configuration hconf = new Configuration();
>             FileSystem fs = FileSystem.get(path, hconf);
>
>             SequenceFile.Reader reader =
>                     new SequenceFile.Reader(fs, new Path(path), hconf);
>             LongWritable key = new LongWritable();
>             VectorWritable value = new VectorWritable();
>             AdaptiveLogisticRegression reg =
>                     new AdaptiveLogisticRegression(2, 100, new L1());
>
>             while (reader.next(key, value)) {
>                 NamedVector v = (NamedVector) value.get();
>                 System.out.println(v.getName());
>                 reg.train("spam".equals(v.getName()) ? 1 : 0, v);
>             }
>             reader.close();
>             reg.close();
>             CrossFoldLearner best =
>                     reg.getBest().getPayload().getLearner();
>             System.out.println(best.percentCorrect());
>             System.out.println(best.auc());
>             System.out.println(best.getLogLikelihood());
>
>             ModelSerializer.writeBinary(model.getPath(),
>                     reg.getBest().getPayload().getLearner());
>
>
> And running through the test data looks like:
>
>             InputStream in = new FileInputStream(model);
>             CrossFoldLearner best =
>                     ModelSerializer.readBinary(in, CrossFoldLearner.class);
>             in.close();
>
>             Configuration hconf = new Configuration();
>             FileSystem fs = FileSystem.get(path, hconf);
>
>             SequenceFile.Reader reader =
>                     new SequenceFile.Reader(fs, new Path(path), hconf);
>             LongWritable key = new LongWritable();
>             VectorWritable value = new VectorWritable();
>
>             int correct = 0;
>             int total = 0;
>             while (reader.next(key, value)) {
>                 total++;
>                 NamedVector v = (NamedVector) value.get();
>                 int expected = "spam".equals(v.getName()) ? 1 : 0;
>                 Vector p = new DenseVector(2);
>                 best.classifyFull(p, v);
>                 int cat = p.maxValueIndex();
>                 System.out.println(cat == 1 ? "SPAM" : "HAM");
>                 if (cat == expected) { correct++;}
>             }
>             reader.close();
>             best.close();
>
>             System.out.println((double) correct / total);
>
> Can anyone help me figure out what I am doing wrong?
>
> Also, I'd love to try Naive Bayes or Complementary Naive Bayes, but I am
> unable to find any documentation on how to do so :-(
>
