Tommaso, you said you successfully used the OpenNLP UIMA trainers. I am currently trying to build French models for the various tasks OpenNLP can handle, and since I am also involved in UIMA work, I wanted to test the OpenNLP UIMA components for that purpose. My goal is to donate the models to the OpenNLP community (i.e. to http://opennlp.sourceforge.net/models-1.5/).

Before testing the TokenizerTrainer, I tested the SentenceDetector trainer. I found at least two problems with the UIMA component (https://issues.apache.org/jira/browse/OPENNLP-197); one of them is not yet reported in the Jira, but I am curious to know whether you have encountered it. I noticed that models trained with the UIMA component give wrong begin/end offsets, even though they do manage to split the text into sentences: the begin offset of a sentence falls on the punctuation character that ends the previous sentence, so that character becomes the first token of the current sentence, while the previous sentence does not include it as its last token. Have you noticed this problem?
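For what it's worth, here is roughly how I inspect the offsets, as a minimal sketch against the OpenNLP 1.5 API (the model path and the sample text are only placeholders, not my actual data):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class CheckSentenceSpans {
  public static void main(String[] args) throws Exception {
    // load the model produced by the UIMA sentence detector trainer
    // ("fr-sent.bin" is only a placeholder name)
    InputStream in = new FileInputStream("fr-sent.bin");
    SentenceModel model = new SentenceModel(in);
    in.close();

    String text = "Première phrase. Deuxième phrase.";
    Span[] spans = new SentenceDetectorME(model).sentPosDetect(text);

    // with the problematic models, the begin offset of the second span
    // points at the '.' that closes the first sentence
    for (Span s : spans) {
      System.out.println(s.getStart() + "-" + s.getEnd() + " : \""
          + text.substring(s.getStart(), s.getEnd()) + "\"");
    }
  }
}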
I think that, above all, my problems are due to the lack of documentation for the UIMA integration. I plan to write a blog post about my experience. Since I see there is an open issue for that (https://issues.apache.org/jira/browse/OPENNLP-49), if I manage to find the time for the post I can try to write it in such a way that it can also serve as a contribution to the documentation (if you are interested).

On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez <[email protected]> wrote:
> Hello Tommaso,
>
> After some more tests, I think I have found how to reproduce my problem.
>
> Tommaso, you're right, it works fine with the pipeline you described
> (i.e. with the WhitespaceTokenizer followed by the token trainer
> (wst-tokenTrainer-AAE)), but only if the input texts are formatted as
> 'normal' texts.
> I tested the pipeline with texts already formatted in a 'wst' way (one
> sentence per line and tokens separated by a whitespace character), and
> in that case it no longer works (despite the presence of the sentence
> and token annotations).
>
> So my guess is that on the command line the tokenTrainer needs its input
> in the wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA
> tokenTrainer needs, in some way, a 'detokenized' text.
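[For reference, and as far as I understand it: the command-line TokenizerTrainer expects one sentence per line, with tokens separated by whitespace and a <SPLIT> tag marking token boundaries that are not separated by whitespace in the original text, e.g. (an illustrative line, not from my data):

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.

The UIMA tokenTrainer, on the other hand, appears to work from the raw, detokenized document text plus the Token annotations already present in the CAS.]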
> If needed, I can open a 'question' issue and attach the texts I used
> to reproduce the problem.
>
> /Nicolas
>
> ---------- Forwarded message ----------
> From: Tommaso Teofili <[email protected]>
> Date: Wed, Jun 15, 2011 at 5:30 PM
> Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> To: [email protected], [email protected]
>
> Hello Nicolas,
> I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> other trainers. As a simple proof I created an aggregate analysis
> engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> TokenizerTrainer in a fixed flow, then used a
> FileSystemCollectionReader to feed the pipeline.
> In the TokenizerTrainer I set:
> <nameValuePair>
>   <name>opennlp.uima.TokenType</name>
>   <value>
>     <string>org.apache.uima.TokenAnnotation</string>
>   </value>
> </nameValuePair>
> <nameValuePair>
>   <name>opennlp.uima.language</name>
>   <value>
>     <string>en-US</string>
>   </value>
> </nameValuePair>
> <nameValuePair>
>   <name>opennlp.uima.ModelName</name>
>   <value>
>     <string>target/Tokens.bin</string>
>   </value>
> </nameValuePair>
>
> which then created the Tokens.bin model that I was able to test from
> the command line and via the APIs.
> Are you using it in a different way?
> Regards,
> Tommaso
>
> 2011/6/15 Nicolas Hernandez <[email protected]>
>>
>> Hello
>>
>> Has anyone already used the UIMA TokenizerTrainer component? I
>> am a bit confused since it does not create any model file.
>>
>> In my stdout I got this:
>> Indexing events using cutoff of 5
>> Computing event counts... done. 69669 events
>> Indexing... done.
>> Sorting and merging events... done. Reduced 69669 events to 16467.
>> Done indexing.
>> Incorporating indexed data for training... done.
>> Number of Event Tokens: 16467
>> Number of Outcomes: 1
>> Number of Predicates: 5624
>> ...done.
>> Computing model parameters...
>> Performing 100 iterations.
>> 1: .. loglikelihood=0.0 1.0
>> 2: .. loglikelihood=0.0 1.0
>>
>> This looks like a problem I ran into when I trained the model on the
>> command line without using the '<SPLIT>' tag. It differs in that on the
>> command line I also got the following exception:
>> Exception in thread "main" java.lang.IllegalArgumentException: The
>> maxent model is not compatible!
>>
>> I solved that problem by adding the tag, as mentioned in the post
>> "maxent model is not compatible with Tokenizer training" (Fri, 13 May,
>> 09:33):
>> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
>>
>> Does anyone know whether it is the same problem? If so, how can the
>> '<SPLIT>' tag be specified in the UIMA version? As far as I understand
>> its role, it is important to give the user the possibility of setting
>> it.
>>
>> More generally, I am interested in any feedback from people who have
>> successfully managed to build models with the UIMA OpenNLP *Trainer
>> components. So far I have also had some trouble with the
>> SentenceTrainer, and I have not tested the others yet.
>>
>> /Nicolas

--
[email protected]
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
