Benglish, I'm not sure I understand your requirements, but perhaps you could use a Naive Bayes classifier? https://en.wikipedia.org/wiki/Naive_Bayes_classifier
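In case it helps to see the idea concretely, here is a rough, library-independent sketch of a Naive Bayes classifier that handles any number of tags rather than just Yes/No. It is not Lucene's API: the names `NaiveBayesTagger`, `train`, `classify`, and `tokenize` are made up for illustration, the whitespace tokenizer is a crude stand-in for a real analyzer, and the training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in for a real analyzer: lowercase, split on whitespace.
    return text.lower().split()


class NaiveBayesTagger:
    """Multinomial Naive Bayes over any number of tags, with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # tag -> number of training docs
        self.word_counts = defaultdict(Counter)   # tag -> word -> occurrences
        self.vocabulary = set()                   # every word seen in training

    def train(self, text, tag):
        self.doc_counts[tag] += 1
        for word in tokenize(text):
            self.word_counts[tag][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best_tag, best_score = None, float("-inf")
        for tag, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[tag].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[tag][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Toy pre-tagged "files" (invented for this example).
tagger = NaiveBayesTagger()
tagger.train("the striker scored a late goal", "sports")
tagger.train("the team won the match", "sports")
tagger.train("parliament passed the budget vote", "politics")
tagger.train("the minister won the election", "politics")
tagger.train("the new phone has a faster chip", "tech")
tagger.train("software update fixes the bug", "tech")

print(tagger.classify("the team scored a goal"))  # prints: sports
```

With a real corpus you would call `train` once per pre-tagged file and `classify` once per untagged file; the per-tag scoring loop is what takes it from two categories to N.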
A typical Bayes classifier separates documents into Yes/No (spam detection, etc.), but it can be extended to N categories. Lucene provides access to the words it has indexed in your documents, and you could feed those to a classifier for training.

A quick Google search brought this back; perhaps it will get you started:
http://lucene.apache.org/core/4_8_1/classification/org/apache/lucene/classification/SimpleNaiveBayesClassifier.html

They also have a KNearestNeighbor version; see the list of implementing classes here:
http://lucene.apache.org/core/4_8_1/classification/org/apache/lucene/classification/Classifier.html

You might also want to consider Solr, which is a layer on top of Lucene.

--
Mark Bennett / LucidWorks: Search & Big Data
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513

On Jun 15, 2014, at 10:37 PM, benglish wrote:

> Hi pals,
>
> I have a huge number of text files with defined tagged topics. What I am
> going to do is to tag the test files due to those pre-tagged files.
> Searching on the Net, I couldn't find my answer: Is it possible to train
> Lucene with tagged files and then it tags test files according to those
> pre-defined tags?
>
> Yours Sincerely,
> benglish
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Train-Lucene-with-topic-defined-files-tp4141979.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
