Hello,

On Thu, Jul 6, 2017 at 1:55 PM, Viraf Bankwalla
<[email protected]> wrote:
> Thanks.  The data is in opennlp format, and the training set has 74680 lines. 
>  A small percentage of the lines have named entities, as shown below.  As the 
> data and some of the labels refer to sensitive information, I have masked the 
> labels below.
>    3314 <START:type1>
>    1568 <START:type2>
>     398 <START:type3>
>     289 <START:type4>
>     175 <START:type5>
>     159 <START:type6>
>      84 <START:type7>
>      81 <START:type8>
>      67 <START:type9>
>      29 <START:type10>
>      24 <START:type11>
>
> What should I look for to track down the discrepancy of 39 reported outcomes 
> to the expected 45 using BILOU?

With BILOU encoding, each of your 11 types can produce up to four
outcomes (begin, inside, last, unit) plus the single "other" outcome,
so the theoretical maximum is 11 * 4 + 1 = 45. A count of 39 means
that six of those outcomes never occur in your training data. For the
less frequent classes this is expected; for example, if every type11
mention is a single token, only its unit outcome will appear. You
need to print the outcomes of the model to know which ones are
missing. In any case, I would not worry about that.
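
Something like this will show which outcomes made it into the model
(a minimal sketch assuming an OpenNLP 1.8-style API where the
sequence model is reachable through TokenNameFinderModel; the typeN
names stand in for your masked labels, and the BILOU outcomes are
encoded as type-start/cont/last/unit plus "other"):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.ml.model.SequenceClassificationModel;
    import opennlp.tools.namefind.TokenNameFinderModel;

    public class OutcomeCheck {

        public static void main(String[] args) throws Exception {
            // Expected outcomes under BILOU: four per type plus the
            // shared "other" outcome, i.e. 11 * 4 + 1 = 45 in total.
            String[] types = {"type1", "type2", "type3", "type4",
                    "type5", "type6", "type7", "type8", "type9",
                    "type10", "type11"};
            List<String> expected = new ArrayList<>();
            for (String t : types) {
                for (String tag : new String[] {"start", "cont",
                        "last", "unit"}) {
                    expected.add(t + "-" + tag);
                }
            }
            expected.add("other");
            System.out.println("Expected: " + expected.size());

            // Outcomes actually present in the trained model; any of
            // the 45 missing here never occurred in the training data.
            TokenNameFinderModel model =
                    new TokenNameFinderModel(new File(args[0]));
            SequenceClassificationModel<String> seqModel =
                    model.getNameFinderSequenceModel();
            for (String outcome : seqModel.getOutcomes()) {
                System.out.println(outcome);
            }
        }
    }

Comparing the two lists tells you exactly which outcomes are absent.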

> Any suggestions on how to improve accuracy would be appreciated.  My params / 
> feature generator config is below

Well, it is very difficult (almost impossible) to learn a good model
for classes with so few samples, although this also depends on the
lexical variability of the entity mentions. If the entities show
substantial variability, then from type4 or type5 onwards you will
learn nothing but noise. If you do not have additional training data
to improve those numbers, you could try the RegexNameFinder or the
DictionaryNameFinder instead.
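
For the dictionary route, something like this is all it takes (a
minimal sketch; the entries and the "type10" label are placeholders
for your masked data):

    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.namefind.DictionaryNameFinder;
    import opennlp.tools.util.Span;
    import opennlp.tools.util.StringList;

    public class DictionaryFallback {

        public static void main(String[] args) {
            // Known mentions for one of the rare types; each
            // StringList is one tokenized entity mention.
            Dictionary dict = new Dictionary();
            dict.put(new StringList("Acme", "Corp"));
            dict.put(new StringList("Foobar"));

            DictionaryNameFinder finder =
                    new DictionaryNameFinder(dict, "type10");

            String[] tokens = {"We", "signed", "with", "Acme",
                    "Corp", "."};
            for (Span span : finder.find(tokens)) {
                System.out.println(span);  // e.g. [3..5) type10
            }
        }
    }

Since it implements the same TokenNameFinder interface, you can merge
its spans with the output of your statistical name finder.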

For other improvements via feature engineering, I suggest reading
the related literature. In a previous email I listed the papers
describing the best-performing NER systems on the CoNLL 2003 newswire
benchmark:

http://mail-archives.apache.org/mod_mbox/opennlp-dev/201702.mbox/%3CCAKvDkVA0yHXuNQbA-2tJPm6QHVAKz4eMhG_wx31YuJrWffYZ9g%40mail.gmail.com%3E

Cheers,

Rodrigo
