David,

It's actually not raw text that goes to the Bayes classifier but
tokenized words: no punctuation, tokens separated by a space, one
document per line, with the classification label starting every line. I
hope this helps...
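The format described above could be produced with a small helper like the following sketch (the function name and the lowercase/alphanumeric tokenization rule are assumptions for illustration, not part of Mahout):

```python
import re

def to_bayes_line(label, raw_text):
    """Hypothetical helper: format one document as a single line with the
    label first, followed by space-separated tokens, punctuation stripped.
    Tokenization here is a simple lowercase alphanumeric split."""
    tokens = re.findall(r"[a-z0-9]+", raw_text.lower())
    return label + " " + " ".join(tokens)

line = to_bayes_line("sports", "The game, at last, is over!")
# line == "sports the game at last is over"
```

One such line per document, written to a plain text file, would match the layout described above.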

Daniel.

On Tue, Apr 5, 2011 at 4:49 PM, David Croley <[email protected]> wrote:
> I'm not too worried about splitting the data into test and train sets. My 
> main issue is that the classifier examples I can find all take as input a 
> file with the form (at least for text):
>
> <label>\t<text to classify...>
>
> However, I don't have the original content of the files, only the index with 
> term frequency vectors. I know the first step for the Bayesian algorithms is 
> creating a TF-IDF vector, but it seems the existing code cannot take TF-IDF 
> vectors the way the clustering algorithms do, or even some variant of the term 
> frequency vectors I can get from Lucene.
>
> At this point, I am going to try to write code to dump the words and 
> frequencies from the index, add a label, and modify the BayesFeatureDriver 
> class to take my input.
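The dump-and-label plan above could be sketched roughly as follows. This is a hypothetical stand-in (the function and the repeat-by-frequency idea are assumptions, not Mahout code): since the original text is gone, each term is simply repeated as many times as its stored frequency to rebuild a pseudo-document line.

```python
def tf_to_line(label, term_freqs):
    """Hypothetical helper: rebuild a pseudo-document line from a
    term-frequency map by repeating each term freq times, with the
    label prepended. Word order is lost, but the Bayes model only
    uses term counts anyway."""
    tokens = []
    for term in sorted(term_freqs):  # sorted keys for deterministic output
        tokens.extend([term] * term_freqs[term])
    return label + " " + " ".join(tokens)

print(tf_to_line("catX", {"lucene": 2, "index": 1}))
# catX index lucene lucene
```

The real version would pull `term_freqs` from the Lucene term vectors rather than a literal dict.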
>
> David
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Tuesday, April 05, 2011 3:19 PM
> To: [email protected]
> Subject: Re: Classification with data from Lucene
>
> The Lucene intake does not support searches on the index.
>
> If you can make copies of the index, here's a trick: delete the
> documents you don't want, then optimize the index. You will need a
> Lucene program to do this.
> Use this to separate the big index into training and test indexes.
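The first step of that split, partitioning document IDs into training and test sets, could look like this sketch (a hypothetical helper, not part of Lucene or Mahout; each resulting ID set would then drive the delete-and-optimize pass in a separate Lucene program):

```python
import random

def split_ids(doc_ids, test_fraction=0.2, seed=42):
    """Hypothetical helper: randomly partition document IDs into
    (train, test) lists. The seed makes the split reproducible so the
    two delete passes over the index copies agree."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * test_fraction)
    return ids[cut:], ids[:cut]  # (train, test)

train, test = split_ids(range(100))
# len(train) == 80, len(test) == 20
```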
>
> On Mon, Apr 4, 2011 at 6:51 PM, David Croley <[email protected]> wrote:
>> I have a large Lucene index (with TermFreq vectors). I do not have easy
>> access to the original source docs that the index was made from. I have
>> identified a set of docs in the index as Category X. Is there a way to
>> run Mahout's Bayesian classification algorithm, trained on the docs in
>> Category X, on the remaining docs in the index to better identify
>> category matches?
>>
>>
>>
>> I have also exported the Lucene data into a Vector file in prep to run
>> some clustering experiments (as per the wiki examples) and also wondered
>> if that data could be used to feed the CBayes code. From what I can
>> tell, the classification code in Mahout takes a completely different
>> form of input compared to the clustering algorithms.
>>
>>
>>
>> Thanks for any pointers.
>>
>>
>>
>>
>>
>> David Croley
>>
>> Lead Engineer
>>
>> RenewData
>>
>> 512.351.0198 BlackBerry
>>
>> 512.276.5518 Desk
>>
>> [email protected]
>>
>> www.renewdata.com <http://www.renewdata.com/>
>>
>>
>>
>> Global in reach. Local in focus.
>>
>>
>>
>>
>>
>> Confidentiality Notice: This electronic communication contained in this 
>> e-mail from [email protected] (including any attachments) may contain 
>> privileged and/or confidential information. This communication is intended 
>> only for the use of indicated e-mail addressees. Please be advised that any 
>> disclosure, dissemination, distribution, copying, or other use of this 
>> communication or any attached document other than for the purpose intended 
>> by the sender is strictly prohibited. If you have received this 
>> communication in error, please notify the sender immediately by reply e-mail 
>> and promptly destroy all electronic and printed copies of this communication 
>> and any attached document. Thank you in advance for your cooperation.
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>
>
>
