Re: [jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process

Nicolas Hernandez Fri, 01 Apr 2011 02:14:26 -0700

Thanks

I'll do that.


On Thu, Mar 31, 2011 at 8:28 PM, Richard Eckart de Castilho (JIRA)
<[email protected]> wrote:
>
>    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014124#comment-13014124 ]
>
> Richard Eckart de Castilho commented on UIMA-2106:
> --------------------------------------------------
>
> I believe only users with the role "developer" can assign issues. But you can 
> already attach a patch.
>
>> Handling tokens not present in the language model (and also with no suffix 
>> present in the model) causes a null pointer exception in the tagger process
>> ------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>                 Key: UIMA-2106
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2106
>>             Project: UIMA
>>          Issue Type: Bug
>>          Components: Sandbox-Tagger
>>    Affects Versions: 2.3
>>         Environment: OS
>> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 
>> (Ubuntu 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
>> JVM
>> java version "1.6.0_17"
>> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
>> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>>            Reporter: Nicolas Hernandez
>>            Priority: Minor
>>             Fix For: 2.3
>>
>>   Original Estimate: 5m
>>  Remaining Estimate: 5m
>>
>> The HMMTagger Analysis Engine class uses the 
>> org.apache.uima.examples.tagger.Viterbi.java implementation to determine the 
>> POS tag list of a given sentence.
>> In practice this implementation is partly tied to part-of-speech tagging 
>> (as are the remaining HMMTagger classes).
>> For example, it makes strong assumptions about the kind of tokens it can 
>> take as input: it assumes that the tokens' covered-text values are 
>> unrestricted.
>> As a result, it uses the probabilities of specific covered texts for 
>> initialization, or as a default value when the tagger processes an unknown 
>> token.
>> As a consequence, if the covered text used for setting the default value is 
>> not present in the training model, an error occurs. Roughly speaking, the 
>> process first looks for the probability associated with the current token's 
>> covered text; if the covered text is not present in the model, it looks in 
>> the model for the probability of its longest suffix; and finally, if it does 
>> not find a match, the process assigns to the unknown covered text the 
>> probability of an arbitrary covered text: "("
>> The problem is that if the probability of this covered text is not available 
>> in the model, the probability of the unknown token is not defined and a null 
>> pointer exception occurs later when the variable is used.
>> Why would the probability of the "(" text not be available in the model? In 
>> a large training corpus, if we consider all the tokens, there is little 
>> chance of not finding at least one occurrence of "(".
>> Nevertheless, if we use the HMM training AE to build a model for predicting 
>> noun gender and number, or verb tense and person, or "being a part of" a 
>> named entity, these tokens won't have the "(" covered text, and consequently 
>> an error will occur when tagging is performed.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
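
The lookup chain described in the quoted issue (exact covered text, then
longest suffix, then the "(" default) can be sketched roughly as below. This
is only an illustration under assumptions, not the actual Viterbi.java code:
the class name, the method name, the two maps and the FLOOR constant are all
hypothetical, and each probability is simplified to a single number rather
than a per-tag distribution. The final guard shows one possible way to avoid
the null pointer when "(" is absent from the model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; NOT the actual org.apache.uima.examples.tagger.Viterbi code.
class UnknownTokenFallback {

    // Placeholder covered text named in the issue description.
    static final String DEFAULT_TOKEN = "(";

    // Assumed floor probability, used only when nothing in the model matches.
    static final double FLOOR = 1e-9;

    /**
     * Looks up a probability for a token: exact match first, then the longest
     * suffix known to the model, then DEFAULT_TOKEN, and finally FLOOR.
     */
    static double lookup(String token,
                         Map<String, Double> wordProbs,
                         Map<String, Double> suffixProbs) {
        Double p = wordProbs.get(token);
        if (p != null) {
            return p;
        }
        // Try proper suffixes from longest to shortest.
        for (int start = 1; start < token.length(); start++) {
            p = suffixProbs.get(token.substring(start));
            if (p != null) {
                return p;
            }
        }
        p = wordProbs.get(DEFAULT_TOKEN);
        // Without this guard, a model that lacks "(" leaves p null, and a
        // later use of the value throws a NullPointerException.
        return p != null ? p : FLOOR;
    }

    public static void main(String[] args) {
        Map<String, Double> words = new HashMap<>();
        words.put("maison", 0.4);
        Map<String, Double> suffixes = new HashMap<>();
        suffixes.put("tion", 0.2);

        System.out.println(lookup("maison", words, suffixes)); // exact match
        System.out.println(lookup("nation", words, suffixes)); // suffix "tion"
        System.out.println(lookup("xyz", words, suffixes));    // no "(" in model
    }
}
```

Returning a small floor value is only one option; falling back to a uniform
distribution over the tags seen in training would be another.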



-- 
[email protected]
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
