Hello,

if you choose BIO encoding, the number of classes is multiplied by 2 (B-token, I-token), plus we need to add the O class. If you have, say, 12 classes, the number of outcomes will be (12 * 2) + 1 = 25.
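As a quick sanity check, the counting rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not OpenNLP code: the function name `num_outcomes` and the scheme table are made up for this example, and it assumes every tag of every entity type actually occurs in the training data. It covers BIO as well as the BILOU scheme (4 tags per type plus O).

```python
# Hypothetical helper, not part of OpenNLP: counts the outcomes a
# sequence-labelling scheme yields for a given number of entity types.
# Assumption: every tag of every type is present in the training data.
TAGS_PER_TYPE = {
    "BIO": 2,    # B-type, I-type
    "BILOU": 4,  # B-type, I-type, L-type, U-type
}


def num_outcomes(entity_types: int, scheme: str = "BIO") -> int:
    """Return tags-per-type * entity types, plus one for the O class."""
    return entity_types * TAGS_PER_TYPE[scheme] + 1


if __name__ == "__main__":
    for n in (12, 11):
        for scheme in ("BIO", "BILOU"):
            print(f"{n} entity types, {scheme}: {num_outcomes(n, scheme)} outcomes")
```

Comparing these values with the "Number of Outcomes" line in the training output is a cheap way to spot a mismatch between the entity types you expect and the ones the trainer actually sees.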
With BILOU encoding, the number of classes is multiplied by 4 (B, I, L, U), plus the O class: (12 * 4) + 1 = 49.

I do not know how many entity types you actually have in the training data, but with 11 entity types the number of outcomes should be different from the 39 you report:

with BIO: (11 * 2) + 1 = 23
with BILOU: (11 * 4) + 1 = 45

If you have your corpus in OpenNLP format, can you do the following:

    cat en-6-class-opennlp.txt | perl -pe 's/ /\n/g' | grep "<START" | sort | uniq -c | sort -nr

I do this with a 6 class corpus, and I get:

    43820 <START:location>
    42882 <START:organization>
    38802 <START:person>
    23217 <START:date>
    22976 <START:misc>
     2137 <START:time>

HTH,

R

On Wed, Jul 5, 2017 at 4:34 PM, Viraf Bankwalla <[email protected]> wrote:

> I am using OpenNLP 1.8.0 and have trained NameFinder with approximately 78K
> sentences (perceptron model). I have 11 named entity types, and am finding
> a lot of noise in the output. Looking at the output from training it
> indicates 39 outcomes. I would have assumed that this would align with the
> number of named entity types. Could one please explain what the Number of
> Outcomes refers to?
> Also any guidance on data prep and / or areas to explore on how to reduce the
> FP's would be helpful.
> Thanks
> - viraf
>
>
> Indexing events using cutoff of 3
>
> Computing event counts... done. 1315813 events
> Indexing... done.
> Collecting events... Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 1315813
> Number of Outcomes: 39
> Number of Predicates: 290935
> Computing model parameters...
> Performing 300 iterations.
> 1: . (1313259/1315813) 0.9980589947051747
> 2: . (1314613/1315813) 0.9990880163062684
> 3: . (1314904/1315813) 0.9993091723519983
> 4: . (1315136/1315813) 0.9994854891994531
> 5: . (1315250/1315813) 0.9995721276503576
> 6: . (1315335/1315813) 0.9996367264953303
> 7: . (1315402/1315813) 0.999687645584897
> 8: . (1315451/1315813) 0.9997248849190576
> 9: . (1315517/1315813) 0.9997750440222128
> 10: . (1315509/1315813) 0.9997689641309213
> 20: . (1315687/1315813) 0.9999042417121582
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1315427/1315813) 0.999706645245183
> ...done.
> Compressed 290935 parameters to 13506
> 2507 outcome patterns
