Re: Doccat evaluator

2014-04-11 Thread William Colen
Now in the trunk we have the following tools:

$ bin/opennlp DoccatEvaluator
Usage: opennlp DoccatEvaluator[.leipzig] [-reportOutputFile outputFile]
       -model model [-misclassified true|false] -data sampleData
       [-encoding charsetName]

Arguments description:
  -reportOutputFile outputFile
        the path of the fine-grained report file.
  -model model
        the model file to be evaluated.
  -misclassified true|false
        if true will print false negatives and false positives.
  -data sampleData
        data to be used, usually a file name.
  -encoding charsetName
        encoding for reading and writing text, if absent the system default
        is used.
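
A hypothetical example run (the model and data file names here are made up
for illustration):

$ bin/opennlp DoccatEvaluator -model en-doccat.bin -data en-doccat.eval -encoding UTF-8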


$ bin/opennlp DoccatCrossValidator
Usage: opennlp DoccatCrossValidator[.leipzig] [-reportOutputFile outputFile]
       [-misclassified true|false] [-folds num] [-featureGenerators fg]
       [-params paramsFile] -lang language -data sampleData
       [-encoding charsetName]

Arguments description:
  -reportOutputFile outputFile
        the path of the fine-grained report file.
  -misclassified true|false
        if true will print false negatives and false positives.
  -folds num
        number of folds, default is 10.
  -featureGenerators fg
        Comma separated feature generator classes. Bag of words is used if
        not specified.
  -params paramsFile
        training parameters file.
  -lang language
        language which is being processed.
  -data sampleData
        data to be used, usually a file name.
  -encoding charsetName
        encoding for reading and writing text, if absent the system default
        is used.
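
Again a hypothetical example (file names invented for illustration); this
runs a 10-fold cross validation over the data and writes the detailed
report to a file:

$ bin/opennlp DoccatCrossValidator -lang en -data en-doccat.train -folds 10 -reportOutputFile doccat-cv-report.txt -encoding UTF-8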


If misclassified is true, the evaluator will print the misclassified
documents to stderr.
If reportOutputFile is set, the evaluator will write some detailed reports
to it, for example the F-measure for the different outcomes and the
confusion matrix.
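
For example (again with invented file names), both outputs can be captured
in a single run by redirecting stderr:

$ bin/opennlp DoccatEvaluator -model en-doccat.bin -data en-doccat.eval -misclassified true -reportOutputFile doccat-report.txt 2> misclassified.txt

The misclassified documents end up in misclassified.txt, while the detailed
report, including the per-outcome F-measures and the confusion matrix, goes
to doccat-report.txt.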

2014-04-10 19:48 GMT-03:00 William Colen william.co...@gmail.com:

 Yes, I just finished implementing the confusion matrix report, just like
 the one I did for the POS Tagger. I will commit it today.

 I could not test it properly with the Leipzig corpus. For some reason the
 Doccat never fails with this corpus!
 To test it effectively, I used the 20news corpus.


 2014-04-10 19:37 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 I thought it should be done similarly to the way POS tags are measured when
 I implemented that.

 A confusion matrix might also be helpful to see which categories are more
 difficult to classify for the system.

 Jörn


 On 04/10/2014 03:00 PM, William Colen wrote:

 Actually, since we always add a tag to each document, accuracy makes
 sense.
 We could implement F-1 for the individual categories.

 2014-04-09 17:23 GMT-03:00 William Colen william.co...@gmail.com:

  Hello,

 I was checking if there is any open issue related to Doccat, and I found
 this one -

 OPENNLP-81: Add a cli tool for the doccat evaluation support

 I noticed that there is already a class
 named DocumentCategorizerEvaluator, which is not used anywhere
 internally.
 It evaluates performance in terms of accuracy, but I believe it would be
 better to do it in terms of F-measure.

 Any thoughts?

 As we are working on a major version, I think it would be OK to change it.


 Thank you,
 William






Re: Doccat evaluator

2014-04-10 Thread Jörn Kottmann
I thought it should be done similarly to the way POS tags are measured
when I implemented that.


A confusion matrix might also be helpful to see which categories are 
more difficult to classify for the system.


Jörn

On 04/10/2014 03:00 PM, William Colen wrote:

Actually, since we always add a tag to each document, accuracy makes sense.
We could implement F-1 for the individual categories.
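
For reference, the standard per-category definitions, computable from the
confusion matrix counts (a sketch of the textbook formulas, not necessarily
the exact ones used in the implementation): for a category c with true
positives TP_c, false positives FP_c and false negatives FN_c,

  P_c  = TP_c / (TP_c + FP_c)
  R_c  = TP_c / (TP_c + FN_c)
  F1_c = 2 * P_c * R_c / (P_c + R_c)

For example, TP = 80, FP = 20, FN = 10 gives P = 0.80, R = 0.89 and an F1
of about 0.84.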

2014-04-09 17:23 GMT-03:00 William Colen william.co...@gmail.com:


Hello,

I was checking if there is any open issue related to Doccat, and I found
this one -

OPENNLP-81: Add a cli tool for the doccat evaluation support

I noticed that there is already a class
named DocumentCategorizerEvaluator, which is not used anywhere internally.
It evaluates performance in terms of accuracy, but I believe it would be
better to do it in terms of F-measure.

Any thoughts?

As we are working on a major version, I think it would be OK to change it.


Thank you,
William





Re: Doccat evaluator

2014-04-10 Thread William Colen
Yes, I just finished implementing the confusion matrix report, just like
the one I did for the POS Tagger. I will commit it today.

I could not test it properly with the Leipzig corpus. For some reason the
Doccat never fails with this corpus!
To test it effectively, I used the 20news corpus.


2014-04-10 19:37 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 I thought it should be done similarly to the way POS tags are measured when
 I implemented that.

 A confusion matrix might also be helpful to see which categories are more
 difficult to classify for the system.

 Jörn


 On 04/10/2014 03:00 PM, William Colen wrote:

 Actually, since we always add a tag to each document, accuracy makes
 sense.
 We could implement F-1 for the individual categories.

 2014-04-09 17:23 GMT-03:00 William Colen william.co...@gmail.com:

  Hello,

 I was checking if there is any open issue related to Doccat, and I found
 this one -

 OPENNLP-81: Add a cli tool for the doccat evaluation support

 I noticed that there is already a class
 named DocumentCategorizerEvaluator, which is not used anywhere
 internally.
 It evaluates performance in terms of accuracy, but I believe it would be
 better to do it in terms of F-measure.

 Any thoughts?

 As we are working on a major version, I think it would be OK to change it.


 Thank you,
 William





Doccat evaluator

2014-04-09 Thread William Colen
Hello,

I was checking if there is any open issue related to Doccat, and I found
this one -

OPENNLP-81: Add a cli tool for the doccat evaluation support

I noticed that there is already a class named DocumentCategorizerEvaluator,
which is not used anywhere internally. It evaluates performance in terms of
accuracy, but I believe it would be better to do it in terms of F-measure.

Any thoughts?

As we are working on a major version, I think it would be OK to change it.


Thank you,
William