Handling tokens not present in the language model (and also with no suffix 
present in the model) causes a null pointer exception in the tagger process
------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: UIMA-2106
                 URL: https://issues.apache.org/jira/browse/UIMA-2106
             Project: UIMA
          Issue Type: Bug
          Components: Sandbox-Tagger
    Affects Versions: 2.3
         Environment: OS
Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 
4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011

JVM
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)

            Reporter: Nicolas Hernandez
            Priority: Minor
             Fix For: 2.3


The HMMTagger Analysis Engine class uses the 
org.apache.uima.examples.tagger.Viterbi.java implementation to determine the 
POS tag list of a given sentence.
In practice this implementation is partially dependent on the part-of-speech 
tagging task (as are the remaining HMMTagger classes).
For example, it makes strong assumptions about the kind of tokens it can take 
as input: it assumes there is no restriction on the token coveredText values.
As a result, it uses the probabilities of some specific coveredText values for 
initialization, or as a default value when the tagger processes an unknown 
token.

As a consequence, if the coveredText used for setting the default value is not 
present in the training model, an error occurs. Roughly speaking, the process 
first looks for the probability associated with the current token's 
coveredText. If the coveredText is not present in the model, it looks in the 
model for the probability of its longest suffix. Finally, if it does not find 
a match, the process assigns to the unknown coveredText the probability of the 
arbitrary coveredText "(".
The problem is that if the probability of this coveredText is not available in 
the model either, the probability of the unknown token is never defined, and a 
null pointer exception occurs later when the variable is used.
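
The lookup order described above can be sketched as follows. This is only an 
illustration, not the actual Viterbi.java code: the class name, the method 
name, the two flat maps and the FLOOR constant are all hypothetical, and the 
final floor value is one possible way to avoid the null result, not a fix 
that has been applied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fallback chain; not taken from Viterbi.java.
final class UnknownTokenFallback {

    // Assumed floor probability, used only when nothing in the model matches.
    static final double FLOOR = 1e-10;

    // wordProbs and suffixProbs stand in for the trained model (assumption:
    // one probability per string, where the real model is richer).
    static double lookup(Map<String, Double> wordProbs,
                         Map<String, Double> suffixProbs,
                         String coveredText) {
        // 1. Exact match on the token's coveredText.
        Double p = wordProbs.get(coveredText);
        if (p != null) {
            return p;
        }
        // 2. Longest proper suffix first, then shorter ones.
        for (int i = 1; i < coveredText.length(); i++) {
            p = suffixProbs.get(coveredText.substring(i));
            if (p != null) {
                return p;
            }
        }
        // 3. The arbitrary default token "(" described in this report.
        p = wordProbs.get("(");
        if (p != null) {
            return p;
        }
        // 4. Without this last step the result stays null, which is the
        //    situation that leads to the null pointer exception.
        return FLOOR;
    }

    public static void main(String[] args) {
        Map<String, Double> words = new HashMap<>();
        words.put("maison", 0.4);
        Map<String, Double> suffixes = new HashMap<>();
        suffixes.put("ment", 0.2);
        // The model has no "(" entry, as in a gender/number model.
        System.out.println(lookup(words, suffixes, "maison"));
        System.out.println(lookup(words, suffixes, "rapidement"));
        System.out.println(lookup(words, suffixes, "xyz"));
    }
}
```

With a model that contains no "(" entry, the third call falls through every 
step and returns the floor value instead of leaving the probability undefined.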

Why would the probability of the "(" text not be available in the model? In a 
large training corpus, if we consider all the tokens, there is little chance 
of not finding at least one occurrence of "(". 
Nevertheless, if we use the HMM training AE to build a model for predicting 
noun gender and number, or verb tense and person, or "being a part of" a named 
entity, none of the training tokens will have the "(" coveredText, and 
consequently an error will occur when the tagging is performed.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
