Thanks. The data is in opennlp format, and the training set has 74680 lines. A small percentage of the lines have named entities, as shown below. As the data and some of the labels refer to sensitive information, I have masked the labels below:

3314 <START:type1>
1568 <START:type2>
 398 <START:type3>
 289 <START:type4>
 175 <START:type5>
 159 <START:type6>
  84 <START:type7>
  81 <START:type8>
  67 <START:type9>
  29 <START:type10>
  24 <START:type11>
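For reference, a line of the training data looks roughly like the following (tokens are whitespace-separated and each entity is wrapped in <START:typeN> ... <END> tags); the tokens and types in this example are made up rather than taken from the real data:

On <START:type4> 12 March 2015 <END> , <START:type2> John Smith <END> submitted the completed form .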
What should I look for to track down the discrepancy between the 39 reported outcomes and the 45 expected with BILOU? Any suggestions on how to improve accuracy would also be appreciated. My params / feature generator config is below.

Parameters are:
Algorithm=PERCEPTRON
Iterations=150
Cutoff=3
BeamSize=5

Feature generators are:

<generators>
  <cache>
    <generators>
      <window prevLength="3" nextLength="3">
        <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.LowercaseTokenFeatureGenerator" />
      </window>
      <window prevLength="3" nextLength="3">
        <tokenclass wordAndClass="true" />
      </window>
      <window prevLength="3" nextLength="3">
        <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.TokenPosFeatureGenerator" />
      </window>
      <definition />
      <prevmap />
      <custom class="opennlp.tools.util.featuregen.TrigramNameFeatureGenerator" />
      <sentence begin="true" end="false" />
    </generators>
  </cache>
</generators>

Where:
- LowercaseTokenClassFeatureGenerator is a TokenClassFeatureGenerator specifying word and class
- TokenPosFeatureGenerator is a token part-of-speech feature generator
- TrigramNameFeatureGenerator generates trigrams
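One thing I was planning to check is whether every type actually appears as a multi-token span in the training data; my understanding is that a type which only ever occurs as a single token would contribute a U- outcome but no B-, I-, or L- outcomes, which could account for a count below 45. A rough way to count multi-token entities per type (assuming an entity never spans a line and uses the standard <START:type> ... <END> markup; train.txt here is just a placeholder for the training file) might be:

grep -o '<START:[^>]*> [^<]* <END>' train.txt \
  | awk 'NF > 3 { print $1 }' \
  | sort | uniq -c | sort -nr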
Thanks
- viraf

On Thursday, July 6, 2017, 6:22:39 AM EDT, Rodrigo Agerri <[email protected]> wrote:

Hello,

If you choose BIO encoding, the number of classes is multiplied by 2 (B-token, I-token), plus we need to add the O class. If you have, say, 12 classes, the number of outcomes will be 25. With BILOU encoding, it is classes x 4 (B, I, L, U) plus the O class = 49 (12 classes x 4 combinations + O).

I do not know how many entity types you actually have in the training data, but with 11 entity types the number of outcomes should be different:

with BIO: (11 * 2) + 1 = 23
with BILOU: (11 * 4) + 1 = 45

If you have your corpus in opennlp format, can you do the following:

cat en-6-class-opennlp.txt | perl -pe 's/ /\n/g' | grep "<START" | sort | uniq -c | sort -nr

I do this with a 6 class corpus, and I get:

43820 <START:location>
42882 <START:organization>
38802 <START:person>
23217 <START:date>
22976 <START:misc>
 2137 <START:time>

HTH,

R

On Wed, Jul 5, 2017 at 4:34 PM, Viraf Bankwalla <[email protected]> wrote:
> I am using OpenNLP 1.8.0 and have trained NameFinder with approximately 78K
> sentences (perceptron model). I have 11 named entity types, and am finding
> a lot of noise in the output. Looking at the output from training, it
> indicates 39 outcomes. I would have assumed that this would align with the
> number of named entity types. Could someone please explain what the Number of
> Outcomes refers to?
> Also, any guidance on data prep and/or areas to explore to reduce the
> FPs would be helpful.
> Thanks
> - viraf
>
>
> Indexing events using cutoff of 3
>
> Computing event counts... done. 1315813 events
> Indexing... done.
> Collecting events... Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 1315813
> Number of Outcomes: 39
> Number of Predicates: 290935
> Computing model parameters...
> Performing 300 iterations.
>   1: . (1313259/1315813) 0.9980589947051747
>   2: . (1314613/1315813) 0.9990880163062684
>   3: . (1314904/1315813) 0.9993091723519983
>   4: . (1315136/1315813) 0.9994854891994531
>   5: . (1315250/1315813) 0.9995721276503576
>   6: . (1315335/1315813) 0.9996367264953303
>   7: . (1315402/1315813) 0.999687645584897
>   8: . (1315451/1315813) 0.9997248849190576
>   9: . (1315517/1315813) 0.9997750440222128
>  10: . (1315509/1315813) 0.9997689641309213
>  20: . (1315687/1315813) 0.9999042417121582
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1315427/1315813) 0.999706645245183
> ...done.
> Compressed 290935 parameters to 13506
> 2507 outcome patterns
