David, It's actually not raw text that the Bayes classifier takes but tokenized words: no punctuation, tokens separated by a space, one document per line, with the classification label starting each line. I hope this helps...
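Since only term-frequency vectors are available from the index, one way to produce input in the shape Daniel describes is to repeat each token once per occurrence, with the label first. A minimal sketch, assuming the per-document frequencies have already been extracted; the class and method names here are hypothetical, not part of Mahout or Lucene:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn one document's term-frequency vector into a
// Bayes-style training line: label first, then space-separated
// tokens, each term repeated once per occurrence.
public class TrainingLineBuilder {

    static String toTrainingLine(String label, Map<String, Integer> termFreqs) {
        StringBuilder line = new StringBuilder(label);
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                line.append(' ').append(e.getKey());
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("contract", 2);
        tf.put("renewal", 1);
        // Prints: categoryX contract contract renewal
        System.out.println(toTrainingLine("categoryX", tf));
    }
}
```

The original token order within a document is lost, but for a bag-of-words model like naive Bayes only the counts matter.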
Daniel.

On Tue, Apr 5, 2011 at 4:49 PM, David Croley <[email protected]> wrote:
> I'm not too worried about splitting the data into test and train sets. My
> main issue is that the classifier examples I can find all take as input a
> file with the form (at least for text):
>
> <label>\t<text to classify...>
>
> However, I don't have the original content of the files, only the index with
> term frequency vectors. I know the first step for the Bayesian algorithms is
> creating a TF-IDF vector, but it seems the existing code cannot take TF-IDF
> vectors the way the clustering algorithms do, or even some variant of the
> term frequency vectors I can get from Lucene.
>
> At this point, I am going to try to write code to dump the words and
> frequencies from the index, add a label, and modify the BayesFeatureDriver
> class to take my input.
>
> David
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Tuesday, April 05, 2011 3:19 PM
> To: [email protected]
> Subject: Re: Classification with data from Lucene
>
> The Lucene intake does not support searches on the index.
>
> If you can make copies of the index, here's a trick: delete the
> documents you don't want, then optimize the index. You will need a
> Lucene program to do this.
> Use this to separate the big index into training and test indexes.
>
> On Mon, Apr 4, 2011 at 6:51 PM, David Croley <[email protected]> wrote:
>> I have a large Lucene index (with TermFreq vectors). I do not have easy
>> access to the original source docs that the index was made from. I have
>> identified a set of docs in the index as Category X. Is there a way to
>> run Mahout's Bayesian classification algorithm, trained on the docs in
>> Category X, on the remaining docs in the index to better identify
>> category matches?
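Lance's trick separates training and test data by copying and pruning the Lucene index itself. Once the index has been dumped to the one-document-per-line text format anyway, the same separation can be done on lines instead. A minimal sketch under that assumption; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: randomly split label-prefixed training lines into train
// and test sets, as an alternative to copying and pruning the index.
public class TrainTestSplit {

    static List<List<String>> split(List<String> lines, double trainFraction, long seed) {
        Random rnd = new Random(seed);   // fixed seed -> reproducible split
        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (String line : lines) {
            if (rnd.nextDouble() < trainFraction) {
                train.add(line);
            } else {
                test.add(line);
            }
        }
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "categoryX alpha beta", "categoryX gamma",
            "other delta", "other epsilon");
        List<List<String>> parts = split(docs, 0.75, 42L);
        System.out.println("train=" + parts.get(0).size()
            + " test=" + parts.get(1).size());
    }
}
```

A per-line split avoids the need to write a separate Lucene program for deleting and optimizing, at the cost of working on the dumped text rather than the original index.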
>>
>> I have also exported the Lucene data into a Vector file in prep to run
>> some clustering experiments (as per the wiki examples) and also wondered
>> if that data could be used to feed the CBayes code. From what I can
>> tell, the classification code in Mahout takes a completely different
>> form of input compared to the clustering algorithms.
>>
>> Thanks for any pointers.
>>
>> David Croley
>> Lead Engineer
>> RenewData
>> 512.351.0198 BlackBerry
>> 512.276.5518 Desk
>> [email protected]
>> www.renewdata.com <http://www.renewdata.com/>
>>
>> Global in reach. Local in focus.
>>
>> Confidentiality Notice: This electronic communication contained in this
>> e-mail from [email protected] (including any attachments) may contain
>> privileged and/or confidential information. This communication is intended
>> only for the use of indicated e-mail addressees. Please be advised that any
>> disclosure, dissemination, distribution, copying, or other use of this
>> communication or any attached document other than for the purpose intended
>> by the sender is strictly prohibited. If you have received this
>> communication in error, please notify the sender immediately by reply e-mail
>> and promptly destroy all electronic and printed copies of this communication
>> and any attached document. Thank you in advance for your cooperation.
>>
>
> --
> Lance Norskog
> [email protected]
