Exactly.
One additional detail: the format that the Bayes classifier wants is pretty
easy to generate from a Lucene term vector.
It is probably a good idea to experiment with emitting multiple copies of
repeated terms.
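
A rough sketch of what that could look like, in case it helps. It assumes the
classifier input is one document per line, written as the label, a tab, and
then space-separated tokens; check that against your Mahout version. The Map
stands in for the parallel arrays that TermFreqVector.getTerms() and
getTermFrequencies() return in Lucene 3.x. The class name, method name, and
the maxCopies cap are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BayesLineSketch {

    // Builds one training line: the label, a tab, then each term repeated
    // once per occurrence (capped at maxCopies), separated by spaces.
    // Assumption: "label<TAB>token token ..." is the expected input layout.
    static String toBayesLine(String label, Map<String, Integer> termFreqs, int maxCopies) {
        StringBuilder sb = new StringBuilder(label).append('\t');
        boolean first = true;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int copies = Math.min(e.getValue(), maxCopies);
            for (int i = 0; i < copies; i++) {
                if (!first) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical term frequencies for a single document.
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("lucene", 3);
        tf.put("index", 1);
        System.out.println(toBayesLine("CategoryX", tf, 2));
    }
}
```

Setting maxCopies to 1 emits each term once, and raising it lets repeated
terms carry more weight, which is the knob worth experimenting with.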
On Tue, Apr 5, 2011 at 2:10 PM, Daniel McEnnis wrote:
> It's actually not text
> Sent: Tuesday, April 05, 2011 3:19 PM
> To: user@mahout.apache.org
> Subject: Re: Classification with data from Lucene
>
> The Lucene intake does not support searches on the index.
>
> If you can make a copy of the index, here's a trick: delete the
> documents you don't
o dump the words and
frequencies from the index, add a label, and modify the BayesFeatureDriver
class to take my input.
David
-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Tuesday, April 05, 2011 3:19 PM
To: user@mahout.apache.org
Subject: Re: Classification with data from Lucene
The Lucene intake does not support searches on the index.
If you can make a copy of the index, here's a trick: delete the
documents you don't want, then optimize the index. You will need a
Lucene program to do this.
Use this to separate the big index into training and test indexes.
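
One possible way to pick which documents to delete from each copy, as a
sketch only. It assumes you work from Lucene's internal doc ids (0 up to
maxDoc) and want every Nth document held out for testing; the class and
method names are made up. Compute both lists before touching either copy,
since doc ids get renumbered once deletions are merged away. The deletions
themselves would go through Lucene (for example IndexReader.deleteDocument in
3.x), which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexSplitSketch {

    // Doc ids to delete from the copy that becomes the training index,
    // i.e. every doc assigned to the test set (every Nth doc).
    static List<Integer> idsToDeleteFromTraining(int maxDoc, int testEveryNth) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (docId % testEveryNth == 0) {
                ids.add(docId);
            }
        }
        return ids;
    }

    // Doc ids to delete from the copy that becomes the test index,
    // i.e. the complement of the list above.
    static List<Integer> idsToDeleteFromTest(int maxDoc, int testEveryNth) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (docId % testEveryNth != 0) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical index with 10 docs, holding out every 5th for testing.
        System.out.println(idsToDeleteFromTraining(10, 5));
        System.out.println(idsToDeleteFromTest(10, 5));
    }
}
```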
On Mon, Apr
I have a large Lucene index (with TermFreq vectors). I do not have easy
access to the original source docs that the index was made from. I have
identified a set of docs in the index as Category X. Is there a way to
run Mahout's Bayesian classification algorithm, trained on the docs in
Category X, o