[ 
https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015412#comment-13015412
 ] 

Jerry Cwiklik commented on UIMA-2106:
-------------------------------------

Nicolas, I just committed your patch. 

> Handling tokens not present in the language model (and also with no suffix 
> present in the model) causes a null pointer exception in the tagger process
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: UIMA-2106
>                 URL: https://issues.apache.org/jira/browse/UIMA-2106
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-Tagger
>    Affects Versions: 2.3
>         Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>            Reporter: Nicolas Hernandez
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: TaggerHandlingTokensNotPresentInTheLanguageModel.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The HMMTagger Analysis Engine class uses the 
> org.apache.uima.examples.tagger.Viterbi.java implementation to determine the 
> pos tag list of a given sentence.
> In practice this implementation is partially dependent on the part-of-speech 
> tagging task (as are the remaining HMMTagger classes).
> For example, it makes strong assumptions about the kind of tokens it can take 
> as input: it assumes no restriction on the token coveredText values.
> As a result, it uses the probabilities of specific coveredText values for 
> initialization or as a default value when the tagger processes an unknown token.
> As a consequence, if the coveredText used for setting the default value is not 
> present in the training model, an error occurs. Roughly speaking, the process 
> first looks for the probability associated with the current token's 
> coveredText. If the coveredText is not present in the model, it looks in the 
> model for the probability of its longest suffix. Finally, if it does not find 
> a match, the process assigns to the unknown coveredText the probability of the 
> arbitrary coveredText "(".
> The problem is that if the probability of this coveredText is not available in 
> the model, the probability of the unknown token is not defined and a null 
> pointer exception occurs later when the variable is used.
> Why would the probability of the "(" text not be available in the model? In 
> a large training corpus, if we consider all the tokens, there is little chance 
> of not finding at least one occurrence of "(". 
> Nevertheless, if we use the HMM training AE to build a model for predicting 
> noun gender and number, verb tense and person, or "being a part of" a named 
> entity, the training tokens won't include the "(" coveredText, and 
> consequently an error will occur when tagging is performed.
> A patch has been proposed.
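
For readers of the archive, the lookup chain described above (exact match, then longest suffix, then the "(" default) can be sketched as follows. This is only an illustration under stated assumptions, not the committed patch and not the actual Viterbi class: the class name UnknownTokenFallback, the method lookup, the single Map<String, Double> holding both token and suffix probabilities, and the floor parameter are all hypothetical. The sketch adds one possible guard (a small floor probability) for the case where "(" is absent from the model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; NOT the Viterbi class from the UIMA sandbox tagger.
final class UnknownTokenFallback {

    // Arbitrary token that the issue description says is used as a last resort.
    static final String DEFAULT_TOKEN = "(";

    /**
     * Returns a probability for coveredText using the fallback chain:
     * exact match, longest known suffix, DEFAULT_TOKEN, then a floor value.
     * Assumption: one map stores probabilities for both tokens and suffixes.
     */
    static double lookup(String coveredText, Map<String, Double> model, double floor) {
        Double p = model.get(coveredText);
        if (p != null) {
            return p;
        }
        // Try proper suffixes from longest to shortest.
        for (int start = 1; start < coveredText.length(); start++) {
            p = model.get(coveredText.substring(start));
            if (p != null) {
                return p;
            }
        }
        p = model.get(DEFAULT_TOKEN);
        // Without this null check, unboxing p would throw a
        // NullPointerException when "(" never occurred in training.
        return p != null ? p : floor;
    }

    public static void main(String[] args) {
        // Toy model that, like a gender/number model, never saw "(".
        Map<String, Double> model = new HashMap<>();
        model.put("walked", 0.5);
        model.put("ing", 0.25);

        System.out.println(lookup("walked", model, 1e-6));  // exact match
        System.out.println(lookup("running", model, 1e-6)); // suffix "ing"
        System.out.println(lookup("xyz", model, 1e-6));     // falls to floor
    }
}
```

Returning a floor value is just one design choice; an implementation could equally fall back to a uniform distribution over tags or raise a descriptive error at model-loading time.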

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
