Thank you. What I don't understand is how this produces valid training data, since I just delete whitespace. You said that I need to include some <SPLIT> tags to have proper training data. Can you please comment on why we end up with proper training data after detokenizing? I hope it's OK to ask all these questions, but I really want to understand OpenNLP tokenization.
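For example, if the tokenizer training format is the one described in the manual (one sentence per line, with a <SPLIT> tag at every token boundary that has no whitespace), then a training sentence should look like

    Er sagte<SPLIT>, dass es regnet<SPLIT>.

but if I simply delete the whitespace in front of "," and "." I only get

    Er sagte, dass es regnet.

and then there is no marker left that tells the trainer where it has to split. Maybe I misunderstand what the detokenizer actually outputs here.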
Thank you very much, and I will create a detokenizer dictionary based on all relevant special characters contained in my 300k dataset (a first guess at what such an entry could look like is in the P.S. below, after the quoted mail).

Andreas

On 14.03.2013 11:34, Jörn Kottmann wrote:
> On 03/14/2013 11:27 AM, Andreas Niekler wrote:
>> Hello,
>>
>>> We probably need to fix the detokenizer rules used for the German models
>>> a bit to handle these cases correctly.
>> Are those rules public somewhere so that I can edit them myself? I can
>> provide them to the community afterwards. Mostly characters like „“ are
>> not recognized by the tokenizer. I don't want to convert them before
>> tokenizing because we analyze things like direct speech and those
>> characters are a good indicator for that.
>
> No, for the German models I wrote some code to do the detokenization,
> which supported a specific corpus. Anyway, this work then led me to
> contribute the detokenizer to OpenNLP.
>
> There is one file for English:
> https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer
>
> We would be happy to receive a contribution for German.
> Have a look at the documentation; there is a section about the detokenizer.
>
>>> I suggest using our detokenizer to turn your tokenized text into
>>> training data.
>> Does the detokenizer have a command line tool as well?
>
> Yes, there is one. Have a look at the CLI help.
>
> Jörn

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
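P.S.: Here is my first guess at what entries for the German quotation marks could look like, assuming the German dictionary uses the same XML format as the English en-detokenizer.xml (I have not checked the exact format yet, so please correct me if it differs). The idea would be that the opening quote „ attaches to the token on its right and the closing quote “ to the token on its left:

    <dictionary>
      <entry operation="MOVE_RIGHT">
        <token>„</token>
      </entry>
      <entry operation="MOVE_LEFT">
        <token>“</token>
      </entry>
    </dictionary>

I would then run the detokenizer command line tool (DictionaryDetokenizer, if that is the right tool name) with this dictionary over my tokenized 300k dataset.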
