Re: TokenNameFinder

Rodrigo Agerri Thu, 06 Jul 2017 03:23:29 -0700

Hello,

If you choose BIO encoding, the number of classes is multiplied by 2
(B-token, I-token), plus one for the O class. If you have, say,
12 classes, the number of outcomes will be (12 * 2) + 1 = 25.


With BILOU encoding, the number of classes is multiplied by 4 (B, I, L,
U), plus one for the O class: (12 * 4) + 1 = 49.

I do not know how many entity types you actually have in the
training data, but with 11 entity types the number of outcomes should
be different from the 39 you report:

with BIO: (11 * 2) + 1 = 23
with BILOU: (11 * 4) + 1 = 45
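
The arithmetic above can be sketched in a few lines of Python. This is
only an illustration: the function names are made up for this example
and are not part of OpenNLP, and the formula assumes every tag of every
type actually occurs in the training data (a type that never appears
with more than one token would contribute fewer outcomes).

```python
# Hypothetical helpers, not OpenNLP API: they only restate the
# "types * tags-per-type + O" arithmetic from this message.
TAGS_PER_TYPE = {"BIO": 2, "BILOU": 4}  # B,I  and  B,I,L,U


def outcome_count(n_types, scheme="BIO"):
    """Expected number of outcomes if every tag is observed."""
    return n_types * TAGS_PER_TYPE[scheme] + 1  # +1 for the O class


def types_from_outcomes(n_outcomes, scheme="BIO"):
    """Invert the formula; return None if the count does not fit."""
    quotient, remainder = divmod(n_outcomes - 1, TAGS_PER_TYPE[scheme])
    return quotient if remainder == 0 else None


if __name__ == "__main__":
    for scheme in ("BIO", "BILOU"):
        print(scheme, outcome_count(11, scheme), types_from_outcomes(39, scheme))
```

Under these assumptions, 39 outcomes fits BIO with 19 types and does
not fit BILOU at all, which is one more reason to count the types that
are really present in the corpus.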

If you have your corpus in OpenNLP format, can you run the following:

cat en-6-class-opennlp.txt | perl -pe 's/ /\n/g' | grep "<START" |
sort | uniq -c | sort -nr

Running this on a 6-class corpus, I get:

43820 <START:location>
42882 <START:organization>
38802 <START:person>
23217 <START:date>
22976 <START:misc>
2137 <START:time>
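
If perl is not at hand, roughly the same count can be obtained with a
small Python sketch. The function name and the sample sentences are
invented for illustration; the only assumption about the data is the
<START:type> ... <END> markup of the OpenNLP name finder format.

```python
import re
from collections import Counter

# Matches the opening tag of an annotated span, e.g. <START:person>
START_TAG = re.compile(r"<START:([^>\s]+)>")


def count_start_tags(lines):
    """Count how often each entity type opens a span in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(START_TAG.findall(line))
    return counts


if __name__ == "__main__":
    # Invented sample; replace with open("your-corpus.txt", encoding="utf-8")
    sample = [
        "<START:person> Pierre Vinken <END> will join the board <START:date> Nov. 29 <END> .",
        "<START:person> Mr. Vinken <END> is chairman of <START:organization> Elsevier N.V. <END> .",
    ]
    for entity_type, n in count_start_tags(sample).most_common():
        print(n, entity_type)
```

The number of distinct keys in the result is the number of entity types
actually annotated in the corpus.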

HTH,

R


On Wed, Jul 5, 2017 at 4:34 PM, Viraf Bankwalla
<[email protected]> wrote:
> I am using OpenNLP 1.8.0 and have trained NameFinder with approximately 78K 
> sentences (perceptron model).  I have 11 named entity types, and am finding 
> a lot of noise in the output.  Looking at the output from training it
> indicates 39 outcomes.  I would have assumed that this would align with the 
> number of named entity types.  Could one please explain what the Number of 
> Outcomes refers to ?
> Also any guidance on data prep and / or areas to explore on how to reduce the 
> FP's would be helpful.
> Thanks
> - viraf
>
>
> Indexing events using cutoff of 3
>
>     Computing event counts...  done. 1315813 events
>     Indexing...  done.
> Collecting events... Done indexing.
> Incorporating indexed data for training...
> done.
>     Number of Event Tokens: 1315813
>         Number of Outcomes: 39
>       Number of Predicates: 290935
> Computing model parameters...
> Performing 300 iterations.
>   1:  . (1313259/1315813) 0.9980589947051747
>   2:  . (1314613/1315813) 0.9990880163062684
>   3:  . (1314904/1315813) 0.9993091723519983
>   4:  . (1315136/1315813) 0.9994854891994531
>   5:  . (1315250/1315813) 0.9995721276503576
>   6:  . (1315335/1315813) 0.9996367264953303
>   7:  . (1315402/1315813) 0.999687645584897
>   8:  . (1315451/1315813) 0.9997248849190576
>   9:  . (1315517/1315813) 0.9997750440222128
>  10:  . (1315509/1315813) 0.9997689641309213
>  20:  . (1315687/1315813) 0.9999042417121582
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1315427/1315813) 0.999706645245183
> ...done.
> Compressed 290935 parameters to 13506
> 2507 outcome patterns
