Exactly. One additional detail: the format that the Bayes classifier wants is pretty easy to generate from a Lucene term vector.
It is probably a good idea to experiment with emitting multiple copies of repeated terms.

On Tue, Apr 5, 2011 at 2:10 PM, Daniel McEnnis <[email protected]> wrote:
> Its actually not text to classify for the Bayes classifier but
> tokenized words. No punctuation and tokens separated by a space. One
> file per line with the classification starting every line. I hope
> this helps...
>
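
To make the format described above concrete, here is a minimal sketch of turning one document's term vector into one training line. It assumes you have already pulled the terms and their frequencies out of the Lucene term vector as two parallel lists (in Lucene 3.x, I believe TermFreqVector.getTerms() and getTermFrequencies() give you these). The function name, the sample data, and the tab between the label and the tokens are my own assumptions, so check what separator your Mahout version expects.

```python
def to_bayes_line(label, terms, freqs, repeat=True, sep="\t"):
    """Build one training line: label, separator, space-separated tokens.

    terms and freqs are parallel lists taken from a term vector.
    sep is an assumption; adjust it to whatever the classifier expects.
    """
    tokens = []
    for term, freq in zip(terms, freqs):
        # Emit the term freq times when repeat is on, otherwise just once.
        tokens.extend([term] * (freq if repeat else 1))
    return label + sep + " ".join(tokens)


# Hypothetical term vector for a single document.
terms = ["bayes", "classifier", "lucene"]
freqs = [2, 1, 3]

line_repeated = to_bayes_line("mahout", terms, freqs)
line_unique = to_bayes_line("mahout", terms, freqs, repeat=False)
print(line_repeated)
print(line_unique)
```

Writing one such line per document gives you the training file. Flipping the repeat flag makes it easy to compare the two variants, with and without multiple copies of repeated terms.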
