[
https://issues.apache.org/jira/browse/UIMA-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016751#comment-13016751
]
Nicolas Hernandez commented on UIMA-2110:
-----------------------------------------
Hi Tommaso,
Yes, we actually used the HMMTagger to train some models. We used the French
Treebank (FTB) for that:
http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php
We obtained French models for tagging POS, morphological information and also
lemmas.
And it works fine! (An HMM tagger is probably not a judicious choice for
predicting lemmas, but we tested it since the processing chain and the data
were available.) The FTB offers some secondary information that we have not
tested yet.
From the point of view of a user who knows that the task can be solved by an
HMM but does not want to know how, the HMM trainer and tagger are really easy
to use. For all other cases, ClearTK is probably a better solution, but it
requires development skills and takes more time to get into.
Indeed, the current HMM trainer implementation uses only a few features
(n-grams, suffixes and, in some configurations, lowercase/uppercase text),
whereas ClearTK offers many more configurable features.
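As an illustration only, the three kinds of features mentioned above (n-grams,
suffixes, letter case) can be sketched in plain Java as below. This is not the
HMM trainer's actual code: the class name TokenFeatures, the method names and
the case classes are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical feature helpers; not taken from the UIMA HMM trainer. */
final class TokenFeatures {

    /** Last n characters of the token, or the whole token when it is shorter. */
    static String suffix(String token, int n) {
        return token.length() <= n ? token : token.substring(token.length() - n);
    }

    /** Coarse case class: "upper", "capitalized", "lower" or "other". */
    static String caseClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        String upper = token.toUpperCase(Locale.ROOT);
        String lower = token.toLowerCase(Locale.ROOT);
        if (token.equals(upper) && !token.equals(lower)) {
            return "upper";
        }
        if (Character.isUpperCase(token.charAt(0))) {
            return "capitalized";
        }
        if (token.equals(lower) && !token.equals(upper)) {
            return "lower";
        }
        return "other"; // digits, punctuation, mixed case such as "iPhone"
    }

    /** Contiguous n-grams over a tag sequence, each joined with a space. */
    static List<String> ngrams(List<String> tags, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tags.size(); i++) {
            out.add(String.join(" ", tags.subList(i, i + n)));
        }
        return out;
    }
}
```

For example, TokenFeatures.suffix("mangeons", 3) yields "ons", and the tag
sequence DET, NC, V gives the bigrams "DET NC" and "NC V".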
About the resources we produced: so far, the FTB license is unclear about
distributing models trained on the corpus. We are not sure that we can release
them under the Apache License. Our attempt to obtain that right from the
authors of the corpus has not succeeded yet.
Soon, I will publish a blog post describing the procedure to create the
resources so that anyone will be able to do it themselves. I used a couple of
nice AEs: one to turn any XML structure into CAS annotations, and one to map
any annotation to another depending on some constraint declarations. The
latter is already released under the Apache License; the former will be quite
soon.
I will also release the models in accordance with the corpus license, which
allows use of the corpus for research purposes.
> Turn the HMMTagger class into a more generic class for tagging tasks
> ----------------------------------------------------------------------
>
> Key: UIMA-2110
> URL: https://issues.apache.org/jira/browse/UIMA-2110
> Project: UIMA
> Issue Type: Improvement
> Components: Sandbox-Tagger
> Affects Versions: 2.3
> Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
> Reporter: Nicolas Hernandez
> Priority: Minor
> Fix For: 2.3.1Addons, 2.3.1
>
> Attachments: AMoreGenericHMMTaggerDesc.patch,
> AMoreGenericHMMTaggerSrcClass.patch
>
> Original Estimate: 1.5h
> Remaining Estimate: 1.5h
>
> Despite its name, the code of the org.apache.uima.examples.tagger.HMMTagger
> class is not totally independent of the POS tagging task.
> In addition, it assumes that the feature path to update with the result of
> the tagging is org.apache.uima.TokenAnnotation:posTag.
> We propose to let users specify, by parameter, the feature path to set. This
> parameter is optional. If it is left unset, the tagger will work as usual,
> using org.apache.uima.TokenAnnotation:posTag as the default value.
>
> In addition, we propose to add three optional parameters: InputView,
> SentenceType and ModelFile.
> Since the HMM Learner already lets users specify the view used to train a
> model, we decided to offer the same possibility for the tagger. By default,
> it works on the _InitialView. It is actually quite useful in practice!
> The org.apache.uima.TokenAnnotation type is not the only annotation type
> that is assumed to be present in the CAS. The HMMTagger actually processes
> tokens sentence by sentence, and it uses org.apache.uima.SentenceAnnotation
> to select the tokens. The SentenceType parameter lets users specify their
> own sentence annotation type. The default value is
> org.apache.uima.SentenceAnnotation.
> The ModelFile parameter is an alternative to the resource declaration for
> specifying a model.
> Left empty, it is ignored. Otherwise it takes precedence over the resource
> declaration.
> When it is specified, multiple deployment of the tagger cannot be allowed,
> but in practice it may be easier for the user to configure a parameter
> through Eclipse.
> Two distinct patches will be provided, one for the class and the other for
> the descriptor.
> A future improvement of the class might offer the possibility to create new
> annotations, not only to update existing ones.
> A future improvement of the descriptor may separate what belongs to the
> generic tagger from what is specific to the POS tagger.
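The parameter handling described in the issue can be sketched in plain Java as
follows. This is a hedged illustration, not the attached patch: a Map stands in
for the UIMA configuration parameters, and the class TaggerSettings and the
parameter name FeaturePath are hypothetical (the issue does not name the
feature path parameter). InputView, SentenceType, ModelFile and the default
values come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the tagger settings; not part of the UIMA code base. */
final class TaggerSettings {
    // Default values as stated in the issue description.
    static final String DEFAULT_FEATURE_PATH = "org.apache.uima.TokenAnnotation:posTag";
    static final String DEFAULT_SENTENCE_TYPE = "org.apache.uima.SentenceAnnotation";
    static final String DEFAULT_INPUT_VIEW = "_InitialView";

    final String typeName;     // annotation type to update
    final String featureName;  // feature of that type that receives the tag
    final String sentenceType;
    final String inputView;
    final String modelSource;  // ModelFile if set, otherwise the declared resource

    private TaggerSettings(String featurePath, String sentenceType,
                           String inputView, String modelSource) {
        // The feature path has the form "<type name>:<feature name>".
        int sep = featurePath.lastIndexOf(':');
        if (sep <= 0 || sep == featurePath.length() - 1) {
            throw new IllegalArgumentException("Bad feature path: " + featurePath);
        }
        this.typeName = featurePath.substring(0, sep);
        this.featureName = featurePath.substring(sep + 1);
        this.sentenceType = sentenceType;
        this.inputView = inputView;
        this.modelSource = modelSource;
    }

    /** Returns the configured value, or the fallback when it is missing or blank. */
    static String valueOrDefault(Map<String, String> params, String name, String fallback) {
        String value = params.get(name);
        return (value == null || value.trim().isEmpty()) ? fallback : value.trim();
    }

    /** Applies the defaults and lets a non-empty ModelFile override the declared resource. */
    static TaggerSettings resolve(Map<String, String> params, String declaredResource) {
        String modelFile = valueOrDefault(params, "ModelFile", null);
        return new TaggerSettings(
                valueOrDefault(params, "FeaturePath", DEFAULT_FEATURE_PATH), // hypothetical name
                valueOrDefault(params, "SentenceType", DEFAULT_SENTENCE_TYPE),
                valueOrDefault(params, "InputView", DEFAULT_INPUT_VIEW),
                modelFile != null ? modelFile : declaredResource);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("ModelFile", "french-lemma.model"); // hypothetical file name
        TaggerSettings s = resolve(params, "declared-resource");
        System.out.println(s.typeName + " / " + s.featureName + " / " + s.modelSource);
    }
}
```

With an empty parameter map, resolve falls back to the three defaults and to
the declared resource, which matches the "works as usual" behaviour requested
in the issue.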
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira