Thanks, I'll do that.
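
For reference, the lookup order described in the quoted issue (exact covered text, then longest suffix, then the arbitrary "(" entry) can be sketched roughly as follows. This is only an illustration under assumptions: the class name, the method name, and the flat Map<String, Double> model are hypothetical and are not taken from the actual org.apache.uima.examples.tagger code. The sketch returns an empty Optional when even "(" is missing, so the caller has to handle that case explicitly instead of hitting a null later.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the fallback lookup; not the real Viterbi.java code. */
final class SuffixFallbackSketch {

    /** The arbitrary covered text used as last-resort default, as named in the issue. */
    static final String FALLBACK_TEXT = "(";

    private SuffixFallbackSketch() {
    }

    /**
     * Looks up a probability for a token's covered text.
     * The model is assumed (for illustration) to map covered text or suffix to a probability.
     */
    static Optional<Double> lookup(Map<String, Double> model, String coveredText) {
        // start == 0 is the exact match; larger values try ever shorter suffixes,
        // so the first hit is the longest suffix present in the model.
        for (int start = 0; start < coveredText.length(); start++) {
            Double probability = model.get(coveredText.substring(start));
            if (probability != null) {
                return Optional.of(probability);
            }
        }
        // Last resort: the "(" entry. It may be absent, e.g. in a model trained
        // only on nouns or verbs, so an empty Optional is returned rather than null.
        return Optional.ofNullable(model.get(FALLBACK_TEXT));
    }
}
```

With a model that contains only "ing", lookup(model, "tagging") finds the suffix, while lookup(model, "xyz") comes back empty, which is the situation that currently ends in the null pointer exception.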

On Thu, Mar 31, 2011 at 8:28 PM, Richard Eckart de Castilho (JIRA) <[email protected]> wrote:
>
>    [
> https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014124#comment-13014124
> ]
>
> Richard Eckart de Castilho commented on UIMA-2106:
> --------------------------------------------------
>
> I believe only users with the role "developer" can assign issues. But you can
> already attach a patch.
>
>> Handling tokens not present in the language model (and also with no suffix
>> present in the model) causes a null pointer exception in the tagger process
>> ------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>                 Key: UIMA-2106
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2106
>>             Project: UIMA
>>          Issue Type: Bug
>>          Components: Sandbox-Tagger
>>    Affects Versions: 2.3
>>         Environment: OS
>> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3
>> (Ubuntu 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
>> JVM
>> java version "1.6.0_17"
>> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
>> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>>            Reporter: Nicolas Hernandez
>>            Priority: Minor
>>             Fix For: 2.3
>>
>>   Original Estimate: 5m
>>  Remaining Estimate: 5m
>>
>> The HMMTagger Analysis Engine class uses the
>> org.apache.uima.examples.tagger.Viterbi.java implementation to determine the
>> pos tag list of a given sentence.
>> In practice this implementation is partially dependent on the part of speech
>> tagging (likewise the remaining HMMTagger classes actually).
>> For example it makes strong assumptions on the kind of tokens it can take as
>> input. It assumes no restriction about the token covertext values.
>> It results in using some covertext probabilities for initialization or
>> default value when the tagger processes an unknown token...
>> As a consequence if the coveredText used for setting the default value is
>> not present in the training model an error occurs. Roughly speaking, the
>> process looks first for the probability associated with the current token
>> coverText, if the coverText is not present in the model, it looks in the
>> model for the probability of its longest suffix, and finally if it does not
>> find a match, the process assigns to the unknown coverText the probability
>> of the arbitrary coverText : "("
>> The problem is that if the probability of this coverText is not available in
>> the model, the probability of the unknown token is not defined and a null
>> pointer exception occurs later when the variable is used.
>> Why would the probability of the "(" text not be available in the model? In
>> a large training corpus if we consider all the tokens, there is little
>> chance not to find at least one occurrence of "(".
>> Nevertheless if we use the HMM training AE to build a model for predicting
>> noun gender and number, or verb tense and person, or "being a part of" named
>> entity... these tokens won't have the "(" coverText... and consequently an
>> error will occur when the tagging is performed.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

-- 
[email protected] # http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n # Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55 # Université de Nantes - Institut Universitaire de
Technologie - Département Informatique tel. +33 (0)2 40 30 60 67
