Turn the HMMTagger class into a more generic class for tagging tasks  
----------------------------------------------------------------------

                 Key: UIMA-2110
                 URL: https://issues.apache.org/jira/browse/UIMA-2110
             Project: UIMA
          Issue Type: Improvement
          Components: Sandbox-Tagger
    Affects Versions: 2.3
         Environment: OS
Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 
4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011

JVM
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
            Reporter: Nicolas Hernandez
            Priority: Minor
             Fix For: 2.3.1


Despite its name, the org.apache.uima.examples.tagger.HMMTagger class is not 
fully independent of the POS tagging task. 
In particular, it assumes that the feature path to update with the tagging 
result is org.apache.uima.TokenAnnotation:posTag.

We propose to let users specify the feature path to set through a parameter. 
This parameter is optional. If it is left unset, the tagger works as before, 
using org.apache.uima.TokenAnnotation:posTag as the default value.
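
To illustrate the intended fallback behaviour, here is a minimal sketch in 
plain Java. It is not taken from the patch: the class name FeaturePathSpec and 
its parse method are hypothetical, and the sketch only assumes that the 
parameter value has the "Type:feature" form shown above.

```java
/** Hypothetical helper (not part of the patch): holds a "Type:feature" path. */
final class FeaturePathSpec {
    /** Default value named in this issue. */
    static final String DEFAULT_PATH = "org.apache.uima.TokenAnnotation:posTag";

    final String typeName;
    final String featureName;

    private FeaturePathSpec(String typeName, String featureName) {
        this.typeName = typeName;
        this.featureName = featureName;
    }

    /** Parses the configured value, falling back to the default when unset. */
    static FeaturePathSpec parse(String configured) {
        String path = (configured == null || configured.trim().isEmpty())
                ? DEFAULT_PATH
                : configured.trim();
        // The type name contains dots only, so the last ':' separates the feature.
        int sep = path.lastIndexOf(':');
        if (sep <= 0 || sep == path.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected <type>:<feature> but got: " + path);
        }
        return new FeaturePathSpec(path.substring(0, sep), path.substring(sep + 1));
    }
}
```

Calling FeaturePathSpec.parse(null) yields the default type and feature, so 
existing descriptors that do not set the parameter keep their behaviour.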
 
We also propose to add three optional parameters: InputView, SentenceType and 
ModelFile.
Since the HMM Learner already lets users specify the view used to train a 
model, we give the tagger the same option. By default, it works on the 
_InitialView. This is quite useful in practice.

org.apache.uima.TokenAnnotation is not the only annotation type assumed to be 
present in the CAS. The HMMTagger processes tokens sentence by sentence, and 
it uses org.apache.uima.SentenceAnnotation to select the tokens. The 
SentenceType parameter lets users specify their own sentence annotation type. 
The default value is org.apache.uima.SentenceAnnotation. 
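
The sentence-by-sentence selection with a configurable sentence type can be 
sketched as follows. This is a plain Java illustration that does not use the 
UIMA API: the SentenceGrouping class, its Span type and the tokensBySentence 
method are hypothetical, and annotations are reduced to a type name and a pair 
of offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (no UIMA dependency) of sentence-wise selection. */
final class SentenceGrouping {
    static final String DEFAULT_SENTENCE_TYPE = "org.apache.uima.SentenceAnnotation";
    static final String TOKEN_TYPE = "org.apache.uima.TokenAnnotation";

    /** Simplified stand-in for an annotation: a type name and its offsets. */
    static final class Span {
        final String typeName;
        final int begin;
        final int end;

        Span(String typeName, int begin, int end) {
            this.typeName = typeName;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Groups tokens by the sentences of the configured (or default) type. */
    static List<List<Span>> tokensBySentence(List<Span> annotations,
                                             String sentenceTypeParam) {
        String sentenceType =
                (sentenceTypeParam == null || sentenceTypeParam.trim().isEmpty())
                        ? DEFAULT_SENTENCE_TYPE
                        : sentenceTypeParam.trim();
        List<List<Span>> result = new ArrayList<List<Span>>();
        for (Span sentence : annotations) {
            if (!sentence.typeName.equals(sentenceType)) {
                continue;
            }
            // Keep only the tokens whose offsets lie inside this sentence.
            List<Span> tokens = new ArrayList<Span>();
            for (Span candidate : annotations) {
                if (candidate.typeName.equals(TOKEN_TYPE)
                        && candidate.begin >= sentence.begin
                        && candidate.end <= sentence.end) {
                    tokens.add(candidate);
                }
            }
            result.add(tokens);
        }
        return result;
    }
}
```

In this sketch, a CAS whose sentences use a custom type yields no sentences 
unless that type is passed as the parameter value, which is the case the 
SentenceType parameter is meant to cover.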

The ModelFile parameter is an alternative to the resource declaration for 
specifying a model.
If left empty, it is ignored. Otherwise it takes precedence over the resource 
declaration. 
When it is specified, multiple deployment of the tagger is not allowed, but in 
practice users may find it easier to configure a parameter through Eclipse.
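
The precedence rule between the ModelFile parameter and the resource 
declaration could look like the sketch below. It is plain Java, not the 
patched code: the ModelSourceResolver class and its resolve method are 
hypothetical, and both inputs are modelled as simple location strings.

```java
/** Hypothetical sketch of the precedence rule; not the actual patch. */
final class ModelSourceResolver {

    /**
     * Returns the model location to load: the ModelFile parameter when it is
     * set, otherwise the location coming from the resource declaration.
     */
    static String resolve(String modelFileParam, String resourceLocation) {
        if (modelFileParam != null && !modelFileParam.trim().isEmpty()) {
            // A non-empty ModelFile takes precedence over the resource declaration.
            return modelFileParam.trim();
        }
        if (resourceLocation == null || resourceLocation.trim().isEmpty()) {
            throw new IllegalStateException(
                    "No model specified: set ModelFile or declare a resource");
        }
        return resourceLocation.trim();
    }
}
```

Failing when neither source is set is a design choice of this sketch; the 
issue text does not say how the tagger behaves in that case.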

Two distinct patches will be provided, one for the class and the other for the 
descriptor.

A future improvement of the class might make it possible to create new 
annotations, not only to update existing ones.  
A future improvement of the descriptor might separate what belongs to the 
generic tagger from what is specific to the POS tagger.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
