Hello,

On Thu, Jul 6, 2017 at 1:55 PM, Viraf Bankwalla <[email protected]> wrote:

> Thanks. The data is in OpenNLP format, and the training set has 74680 lines.
> A small percentage of the lines have named entities, as shown below. As the
> data and some of the labels refer to sensitive information, I have masked the
> labels below.
>
> 3314 <START:type1>
> 1568 <START:type2>
>  398 <START:type3>
>  289 <START:type4>
>  175 <START:type5>
>  159 <START:type6>
>   84 <START:type7>
>   81 <START:type8>
>   67 <START:type9>
>   29 <START:type10>
>   24 <START:type11>
>
> What should I look for to track down the discrepancy of 39 reported outcomes
> to the expected 45 using BILOU?
For the less frequent classes, that means that some outcome does not occur.
With 11 types, BILOU yields 11 x 4 + 1 = 45 possible outcomes, so 39 means
that 6 type/tag combinations never appear in the training data; for example,
it could be that an I- outcome for one of the rarer types does not occur. You
need to print the outcomes of the model to know for sure. In any case, I
would not worry about that.

> Any suggestions on how to improve accuracy would be appreciated. My params /
> feature generator config is below.

Well, it is very difficult (almost impossible) to learn any good model for
classes with so few samples, although this also depends on the lexical
variability of the entity mentions. If the entities show substantial
variability, then from type4-type5 onwards you will not learn anything but
noise.

If you do not have any additional training data to improve those numbers, you
could try the RegexNameFinder or the DictionaryNameFinder.

For other improvements via feature engineering, I suggest reading the related
literature. In a previous email I listed the papers describing the best
performing NER systems on the CoNLL 2003 newswire benchmark:

http://mail-archives.apache.org/mod_mbox/opennlp-dev/201702.mbox/%3CCAKvDkVA0yHXuNQbA-2tJPm6QHVAKz4eMhG_wx31YuJrWffYZ9g%40mail.gmail.com%3E

Cheers,

Rodrigo
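As an illustration of that fallback, below is a minimal, untested sketch of
DictionaryNameFinder and RegexNameFinder usage on a pre-tokenized sentence.
The tokens, dictionary entry, regex, and type names ("type7", "type8") are
placeholders rather than the poster's actual data, and the constructor
signatures assume a recent OpenNLP 1.x release.

import java.util.regex.Pattern;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.namefind.DictionaryNameFinder;
import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;
import opennlp.tools.util.StringList;

public class FallbackFinders {

    public static void main(String[] args) {
        // One already-tokenized sentence (placeholder data).
        String[] tokens = {"Contact", "ACME", "Corp", "at", "555-1234", "."};

        // Dictionary fallback: one entry per known surface form of a rare type.
        Dictionary dict = new Dictionary();
        dict.put(new StringList("ACME", "Corp"));
        DictionaryNameFinder dictFinder = new DictionaryNameFinder(dict, "type7");
        for (Span s : dictFinder.find(tokens)) {
            System.out.println(s.getType() + " tokens [" + s.getStart() + ", " + s.getEnd() + ")");
        }

        // Regex fallback: useful when mentions follow a fixed surface pattern.
        Pattern[] patterns = {Pattern.compile("\\d{3}-\\d{4}")};
        RegexNameFinder regexFinder = new RegexNameFinder(patterns, "type8");
        for (Span s : regexFinder.find(tokens)) {
            System.out.println(s.getType() + " tokens [" + s.getStart() + ", " + s.getEnd() + ")");
        }
    }
}

In a pipeline, the spans produced by these finders can simply be merged with
the statistical name finder's output, so the dictionary or regex covers the
rare types the model cannot learn from so few samples.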
