I am trying to classify files that contain CSV data, LOG file data, or English
TXT. A different set of operations will be executed on each type of file. The
model is trained with model = DocumentCategorizerME.train("en", sampleStream,
5, 100); on a training file with about 20 lines for each type (do I need
more?). The training file is of the form:
CSV 546894493,John Smith,5354188684365040,(432)
209-8058,[email protected],(341) 611-2944,18970 Avonaco
Ln,Brandsville,MA,92145
TXT In the following simple sentences, subjects are in yellow, and verbs are in
green.
LOG 127.0.0.1 - - [10/Apr/2007:10:54:21 +0300] "GET /unix_sysadmin.html
HTTP/1.1" 200 3880 "http://pti.local/" "Mozilla/5.0 (X11; U; Linux i686; en-US;
rv:1.8.1.3) Gecko/20061201 Firefox/2.0.0.3 (Ubuntu-feisty)"
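One thing that may be worth checking is whether every training sample really sits on a single line that starts with its category. The sketch below assumes the training file is read by a line-based sample stream (one sample per line, category first, then whitespace, then the text); how sampleStream is built is not shown above, so that is an assumption. The class name, the helper names and the set of known categories are hypothetical and only illustrate the check. It uses only the standard library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TrainingLineCheck {

    // Categories used in the training file above (assumed to be the full set).
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList("CSV", "TXT", "LOG"));

    /** Splits a line into {category, text}; returns null when there is no text part. */
    static String[] splitSample(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            return null;
        }
        return parts;
    }

    /** True when the line starts with a known category followed by some text. */
    static boolean isValidSample(String line) {
        String[] parts = splitSample(line);
        return parts != null && KNOWN.contains(parts[0]);
    }

    public static void main(String[] args) {
        // Hypothetical input: a good line, then a fragment left by a wrapped line.
        String[] lines = {
            "TXT In the following simple sentences, subjects are in yellow.",
            "Ln,Brandsville,MA,92145"
        };
        for (String line : lines) {
            System.out.println(isValidSample(line) + " <- " + line);
        }
    }
}
```

If a sample were wrapped across several lines in the actual file, a line-based reader would treat each fragment as its own sample, and the first token of the fragment would be taken as a category. The wrapping shown in this post may of course only come from the mail client.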
After creating the model, when I test it, evaluator.evaluteSample(sample) for
a CSV input reports an accuracy of 1.0 for CSV, which is correct, but for the
same sample myCategorizer.categorize(sample) returns LOG.
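For comparing the two results, it may help to look at the per-category scores rather than only the final label. As far as I know, categorize in OpenNLP returns a double[] with one score per category, and getBestCategory(double[]) maps that array to the label with the highest score. The sketch below reproduces only that last step with the standard library, so the selection logic can be checked by hand; the class name, the label order and the score values are hypothetical and are not output from the model above.

```java
final class BestCategory {

    /** Returns the index of the largest score; the first one wins on ties. */
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("scores must not be empty");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Maps a score array to the label at the position of its largest score. */
    static String bestLabel(String[] labels, double[] scores) {
        if (labels.length != scores.length) {
            throw new IllegalArgumentException("labels and scores differ in length");
        }
        return labels[argMax(scores)];
    }

    public static void main(String[] args) {
        // Illustrative values only, not real model output.
        String[] labels = {"CSV", "TXT", "LOG"};
        double[] scores = {0.30, 0.25, 0.45};
        System.out.println(bestLabel(labels, scores));
    }
}
```

Printing the whole score array for the CSV sample would show whether LOG wins clearly or only by a small margin, which says something about whether more training lines per type are needed.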
Please suggest how this can be fixed.
Thanks,
Ravi.