Thanks. The data is in opennlp format, and the training set has 74680 lines. A small percentage of the lines have named entities, as shown below. As the data and some of the labels refer to sensitive information, I have masked the labels below:

3314 <START:type1>
1568 <START:type2>
 398 <START:type3>
 289 <START:type4>
 175 <START:type5>
 159 <START:type6>
  84 <START:type7>
  81 <START:type8>
  67 <START:type9>
  29 <START:type10>
  24 <START:type11>
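For reference, a line of the training data looks roughly like the following (tokens are whitespace-separated and each entity is wrapped in <START:typeN> ... <END> tags); the tokens and types in this example are made up rather than taken from the real data:

On <START:type4> 12 March 2015 <END> , <START:type2> John Smith <END> submitted the completed form .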
What should I look for to track down the discrepancy between the 39 reported outcomes and the 45 expected with BILOU? Any suggestions on how to improve accuracy would also be appreciated. My params / feature generator config is below.

Parameters are:
Algorithm=PERCEPTRON
Iterations=150
Cutoff=3
BeamSize=5

Feature generators are:

<generators>
  <cache>
    <generators>
      <window prevLength="3" nextLength="3">
        <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.LowercaseTokenFeatureGenerator" />
      </window>
      <window prevLength="3" nextLength="3">
        <tokenclass wordAndClass="true" />
      </window>
      <window prevLength="3" nextLength="3">
        <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.TokenPosFeatureGenerator" />
      </window>
      <definition />
      <prevmap />
      <custom class="opennlp.tools.util.featuregen.TrigramNameFeatureGenerator" />
      <sentence begin="true" end="false" />
    </generators>
  </cache>
</generators>

Where:
- LowercaseTokenClassFeatureGenerator is a TokenClassFeatureGenerator specifying word and class
- TokenPosFeatureGenerator is a token part-of-speech feature generator
- TrigramNameFeatureGenerator generates trigrams
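One thing I was planning to check is whether every type actually appears as a multi-token span in the training data; my understanding is that a type which only ever occurs as a single token would contribute a U- outcome but no B-, I-, or L- outcomes, which could account for a count below 45. A rough way to count multi-token entities per type (assuming an entity never spans a line and uses the standard <START:type> ... <END> markup; train.txt here is just a placeholder for the training file) might be:

grep -o '<START:[^>]*> [^<]* <END>' train.txt \
  | awk 'NF > 3 { print $1 }' \
  | sort | uniq -c | sort -nr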
Thanks
- viraf

On Thursday, July 6, 2017, 6:22:39 AM EDT, Rodrigo Agerri <[email protected]> wrote:

Hello,

If you choose BIO encoding, the number of classes is multiplied by 2 (B-token, I-token), plus we need to add the O class. If you have, say, 12 classes, the number of outcomes will be 25. With BILOU encoding, it is classes x 4 (B, I, L, U) plus the O class = 49 (12 classes x 4 combinations + O).

I do not know how many entity types you actually have in the training data, but with 11 entity types the number of outcomes should be different:

with BIO: (11 * 2) + 1 = 23
with BILOU: (11 * 4) + 1 = 45

If you have your corpus in opennlp format, can you do the following:

cat en-6-class-opennlp.txt | perl -pe 's/ /\n/g' | grep "<START" | sort | uniq -c | sort -nr

I do this with a 6 class corpus, and I get:

43820 <START:location>
42882 <START:organization>
38802 <START:person>
23217 <START:date>
22976 <START:misc>
 2137 <START:time>

HTH,

R

On Wed, Jul 5, 2017 at 4:34 PM, Viraf Bankwalla <[email protected]> wrote:
> I am using OpenNLP 1.8.0 and have trained NameFinder with approximately 78K
> sentences (perceptron model). I have 11 named entity types, and am finding
> a lot of noise in the output. Looking at the output from training, it
> indicates 39 outcomes. I would have assumed that this would align with the
> number of named entity types. Could someone please explain what the Number of
> Outcomes refers to?
> Also, any guidance on data prep and/or areas to explore to reduce the
> FPs would be helpful.
> Thanks
> - viraf
>
>
> Indexing events using cutoff of 3
>
> Computing event counts... done. 1315813 events
> Indexing... done.
> Collecting events... Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 1315813
> Number of Outcomes: 39
> Number of Predicates: 290935
> Computing model parameters...
> Performing 300 iterations.
>   1: . (1313259/1315813) 0.9980589947051747
>   2: . (1314613/1315813) 0.9990880163062684
>   3: . (1314904/1315813) 0.9993091723519983
>   4: . (1315136/1315813) 0.9994854891994531
>   5: . (1315250/1315813) 0.9995721276503576
>   6: . (1315335/1315813) 0.9996367264953303
>   7: . (1315402/1315813) 0.999687645584897
>   8: . (1315451/1315813) 0.9997248849190576
>   9: . (1315517/1315813) 0.9997750440222128
>  10: . (1315509/1315813) 0.9997689641309213
>  20: . (1315687/1315813) 0.9999042417121582
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1315427/1315813) 0.999706645245183
> ...done.
> Compressed 290935 parameters to 13506
> 2507 outcome patterns
