Hi, I have a model trained programmatically using OpenNLP doccat, and I am wondering how I should approach improving its performance. I have around 70 labels and around 12,000 entries across my training and test datasets. In my experiments, I split the data randomly, 90% for training and 10% for testing. Currently my model accuracy is around 60% to 70%.
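To make the setup concrete, a random 90/10 split can be sketched as below. This is only an illustration in plain Java and not necessarily the exact code used: the class name `TrainTestSplit`, the `split` helper, and the fixed seed are assumptions, and each sample is assumed to be a single "label text" line. It does not call any OpenNLP API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of OpenNLP): shuffles the samples with a
// fixed seed and cuts them into a training part and a test part.
public class TrainTestSplit {

    /** Holds the two parts of a split. */
    public static final class Split {
        public final List<String> train;
        public final List<String> test;

        Split(List<String> train, List<String> test) {
            this.train = train;
            this.test = test;
        }
    }

    public static Split split(List<String> samples, double trainFraction, long seed) {
        if (trainFraction <= 0.0 || trainFraction >= 1.0) {
            throw new IllegalArgumentException("trainFraction must be strictly between 0 and 1");
        }
        // Copy first so the caller's list is left untouched.
        List<String> shuffled = new ArrayList<>(samples);
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        // Synthetic placeholder data: 12,000 lines spread over 70 labels.
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            samples.add("label" + (i % 70) + " sample text " + i);
        }
        Split s = split(samples, 0.9, 42L);
        System.out.println(s.train.size() + " train / " + s.test.size() + " test");
    }
}
```

Keeping the seed fixed means that accuracy changes between experiments (for example, with and without stop words) come from the change itself rather than from a different random split.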
Here are my questions:

* Could dropping stop words improve the model accuracy? I tried it and it seems to help, but I did not see a significant improvement.
* Does the trained model get skewed if the training or test data contains irregular spaces or tabs? E.g., "label" "This car is made around 2007"
* Does the spacing between the label and the data have to be constant? (I hope the doccat engine trim()s them, but I wanted to make sure.)
* Is there a way to configure the trainer not to dump its output to the console?

If possible, please let me know.

Thanks in advance.
Lahiru

--
Regards
Lahiru Sandakith Gallege
