Now in the trunk we have the tools:

$ bin/opennlp DoccatEvaluator
Usage: opennlp DoccatEvaluator[.leipzig] [-reportOutputFile outputFile] -model model
        [-misclassified true|false] -data sampleData [-encoding charsetName]

Arguments description:
        -reportOutputFile outputFile
                the path of the fine-grained report file.
        -model model
                the model file to be evaluated.
        -misclassified true|false
                if true will print false negatives and false positives.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
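For example, to evaluate a model against a held-out sample file and get the detailed report (the file names here are only placeholders):

$ bin/opennlp DoccatEvaluator -model en-doccat.bin -data en-doccat.eval \
        -encoding UTF-8 -misclassified true -reportOutputFile doccat-report.txt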
$ bin/opennlp DoccatCrossValidator
Usage: opennlp DoccatCrossValidator[.leipzig] [-reportOutputFile outputFile]
        [-misclassified true|false] [-folds num] [-featureGenerators fg]
        [-params paramsFile] -lang language -data sampleData [-encoding charsetName]

Arguments description:
        -reportOutputFile outputFile
                the path of the fine-grained report file.
        -misclassified true|false
                if true will print false negatives and false positives.
        -folds num
                number of folds, default is 10.
        -featureGenerators fg
                Comma-separated feature generator classes. Bag of words is used if not specified.
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.

If -misclassified is true, the evaluator prints the misclassified documents to stderr. If -reportOutputFile is set, the evaluator writes detailed reports to that file, for example the F-measure for each outcome and the confusion matrix.

2014-04-10 19:48 GMT-03:00 William Colen <william.co...@gmail.com>:

> Yes, I just finished implementing the confusion matrix report, just like
> the one I did for the POS Tagger. I will commit it today.
>
> I could not test it properly with the Leipzig corpus. For some reason
> Doccat never fails with this corpus!
> To test it effectively I used the 20news corpus.
>
>
> 2014-04-10 19:37 GMT-03:00 Jörn Kottmann <kottm...@gmail.com>:
>
> I thought it should be done similarly to the way POS tags are measured when
>> I implemented that.
>>
>> A confusion matrix might also be helpful to see which categories are more
>> difficult for the system to classify.
>>
>> Jörn
>>
>>
>> On 04/10/2014 03:00 PM, William Colen wrote:
>>
>>> Actually, since we always assign a tag to each document, accuracy makes
>>> sense.
>>> We could implement F-1 for the individual categories.
>>>
>>> 2014-04-09 17:23 GMT-03:00 William Colen <william.co...@gmail.com>:
>>>
>>> Hello,
>>>>
>>>> I was checking whether there is any open issue related to Doccat, and I
>>>> found this one:
>>>>
>>>> OPENNLP-81: Add a CLI tool for the doccat evaluation support
>>>>
>>>> I noticed that there is already a class
>>>> named DocumentCategorizerEvaluator, which is not used anywhere
>>>> internally.
>>>> It evaluates performance in terms of accuracy, but I believe it would
>>>> be better to do it in terms of F-measure.
>>>>
>>>> Any thoughts?
>>>>
>>>> As we are working on a major version, I think it would be OK to change
>>>> it.
>>>>
>>>>
>>>> Thank you,
>>>> William
>>>>
>>>>
>>
>
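For anyone who wants to see what the per-category F-1 discussed above amounts to, here is a minimal, self-contained sketch. It is plain Java, not the OpenNLP evaluator API, and the categories and samples are made up for illustration:

import java.util.Map;
import java.util.TreeMap;

public class PerCategoryF1 {

    public static void main(String[] args) {
        // Hypothetical gold labels and system predictions.
        String[] reference = {"sport", "sport", "politics", "tech", "politics"};
        String[] predicted = {"sport", "politics", "politics", "tech", "tech"};

        // Per-category counts: {true positives, false positives, false negatives}.
        Map<String, int[]> counts = new TreeMap<>();
        for (int i = 0; i < reference.length; i++) {
            counts.putIfAbsent(reference[i], new int[3]);
            counts.putIfAbsent(predicted[i], new int[3]);
            if (reference[i].equals(predicted[i])) {
                counts.get(reference[i])[0]++;   // tp for the matching category
            } else {
                counts.get(predicted[i])[1]++;   // fp for the predicted category
                counts.get(reference[i])[2]++;   // fn for the reference category
            }
        }

        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int tp = e.getValue()[0], fp = e.getValue()[1], fn = e.getValue()[2];
            double p = tp + fp == 0 ? 0 : (double) tp / (tp + fp);
            double r = tp + fn == 0 ? 0 : (double) tp / (tp + fn);
            double f1 = p + r == 0 ? 0 : 2 * p * r / (p + r);
            System.out.printf("%s: P=%.2f R=%.2f F1=%.2f%n", e.getKey(), p, r, f1);
        }
    }
}

Micro-averaging these counts over all categories gives back plain accuracy in the single-label case, which is why accuracy and per-category F-1 complement each other here.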