So, to answer my own question: the order of training matters. I had been feeding all of category 1 and then all of category 0, and apparently that breaks things badly.
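For anyone else who hits this: shuffling the examples so the two categories are interleaved before calling train() is what sorted it out for me. Roughly like the sketch below (untested, and the ShuffledTrainer wrapper is just for illustration, not my actual code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.NamedVector;

    public class ShuffledTrainer {
        // Rough sketch: train on a shuffled copy of the examples so the
        // learner never sees a long run of a single category.
        public static AdaptiveLogisticRegression train(List<NamedVector> examples) {
            List<NamedVector> shuffled = new ArrayList<NamedVector>(examples);
            Collections.shuffle(shuffled);  // interleave spam and ham

            AdaptiveLogisticRegression reg =
                new AdaptiveLogisticRegression(2, 100, new L1());
            for (NamedVector v : shuffled) {
                reg.train("spam".equals(v.getName()) ? 1 : 0, v);
            }
            reg.close();
            return reg;
        }
    }

The only real change from what I had before is buffering the vectors and shuffling them instead of training straight off the sequence file in category order.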
On Wed, Feb 13, 2013 at 4:29 PM, Brian McCallister <bri...@skife.org> wrote:

> I'm trying to do a basic two-category classifier on textual data. I am
> working with a training set of only about 100,000 documents, and am using
> an AdaptiveLogisticRegression with default settings.
>
> When I build the trainer it reports:
>
> % correct: 0.9996315789473774
> AUC: 0.75
> log likelihood: -0.032966543010819874
>
> Which seems pretty good.
>
> When I then classify the *training data* everything lands in the first
> category, when in fact they are split down the middle.
>
> Creation of vectors looks like:
>
>     FeatureVectorEncoder content_encoder = new AdaptiveWordValueEncoder("content");
>     content_encoder.setProbes(2);
>
>     FeatureVectorEncoder type_encoder = new StaticWordValueEncoder("type");
>     type_encoder.setProbes(2);
>
>     Vector v = new RandomAccessSparseVector(100);
>     type_encoder.addToVector(type, v);
>
>     for (String word : data.getWords()) {
>         content_encoder.addToVector(word, v);
>     }
>     return new NamedVector(v, label);
>
> where data.getWords() is the massaged (tidied, characters extracted, then run
> through the Lucene standard analyzer and lower-case filter) content of various
> documents.
>
> Training looks like:
>
>     Configuration hconf = new Configuration();
>     FileSystem fs = FileSystem.get(path, hconf);
>
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
>     LongWritable key = new LongWritable();
>     VectorWritable value = new VectorWritable();
>     AdaptiveLogisticRegression reg = new AdaptiveLogisticRegression(2, 100, new L1());
>
>     while (reader.next(key, value)) {
>         NamedVector v = (NamedVector) value.get();
>         System.out.println(v.getName());
>         reg.train("spam".equals(v.getName()) ? 1 : 0, v);
>     }
>     reader.close();
>     reg.close();
>
>     CrossFoldLearner best = reg.getBest().getPayload().getLearner();
>     System.out.println(best.percentCorrect());
>     System.out.println(best.auc());
>     System.out.println(best.getLogLikelihood());
>
>     ModelSerializer.writeBinary(model.getPath(), reg.getBest().getPayload().getLearner());
>
> And running through the test data looks like:
>
>     InputStream in = new FileInputStream(model);
>     CrossFoldLearner best = ModelSerializer.readBinary(in, CrossFoldLearner.class);
>     in.close();
>
>     Configuration hconf = new Configuration();
>     FileSystem fs = FileSystem.get(path, hconf);
>
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
>     LongWritable key = new LongWritable();
>     VectorWritable value = new VectorWritable();
>
>     int correct = 0;
>     int total = 0;
>     while (reader.next(key, value)) {
>         total++;
>         NamedVector v = (NamedVector) value.get();
>         int expected = "spam".equals(v.getName()) ? 1 : 0;
>         Vector p = new DenseVector(2);
>         best.classifyFull(p, v);
>         int cat = p.maxValueIndex();
>         System.out.println(cat == 1 ? "SPAM" : "HAM");
>         if (cat == expected) { correct++; }
>     }
>     reader.close();
>     best.close();
>
>     double cd = correct;
>     double td = total;
>
>     System.out.println(cd / td);
>
> Can anyone help me figure out what I am doing wrong?
>
> Also, I'd love to try naive bayes or complementary naive bayes, but I am
> unable to find any documentation on how to do so :-(