Hello,

I'll create a Jira issue and implement a flag that controls whether to
insert <SPLIT> or just delete the whitespace. I hope that will do it.
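
Roughly what I have in mind for the flag, as a minimal sketch only. The
class, enum, and method names below are hypothetical and are not the
OpenNLP API; the enum just stands in for the per-token merge decision that
the detokenizer makes based on its rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not OpenNLP code.
class SplitTagSketch {

    // Stand-in for the merge decision a rule-based detokenizer makes per token.
    enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    static final String SPLIT = "<SPLIT>";

    // Joins the tokens again. Where two neighbouring tokens are merged, either
    // the <SPLIT> tag is written (useSplitTag = true) or the whitespace is
    // simply dropped (useSplitTag = false). Unmerged tokens get a space.
    static String render(List<String> tokens, List<Op> ops, boolean useSplitTag) {
        if (tokens.size() != ops.size()) {
            throw new IllegalArgumentException("need one operation per token");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i));
            if (i + 1 == tokens.size()) {
                break; // nothing follows the last token
            }
            Op cur = ops.get(i);
            Op next = ops.get(i + 1);
            boolean merged = cur == Op.MERGE_TO_RIGHT || cur == Op.MERGE_BOTH
                    || next == Op.MERGE_TO_LEFT || next == Op.MERGE_BOTH;
            if (merged) {
                sb.append(useSplitTag ? SPLIT : "");
            } else {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("AKTIEN", "SCHWEIZ/Verlauf", ":");
        List<Op> ops = Arrays.asList(Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_TO_LEFT);
        System.out.println(render(tokens, ops, true));   // training-data form
        System.out.println(render(tokens, ops, false));  // plain detokenized form
    }
}
```

With the flag set, the example from this thread comes out in the
training-data form with the <SPLIT> tag; without it, the colon is simply
attached to the preceding token.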

Thank you for all the clarifications.

Andreas

On 14.03.2013 13:48, Jörn Kottmann wrote:
> Have a look here:
> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.detokenizing
> 
> 
> Here is the detokenizer tool:
> https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/DictionaryDetokenizerTool.java
> 
> 
> It looks like it doesn't output the <SPLIT> tag; we should change that.
> Its main purpose is to generate training data for the tokenizer. Anyway,
> patches to improve the detokenizer are very welcome, and it looks like
> the documentation needs a few fixes too.
> 
> HTH,
> Jörn
> 
> On 03/14/2013 01:32 PM, Andreas Niekler wrote:
>> Hello,
>>
>> OK, I will find out what the tool is called and create a rules XML
>> and an abbreviations list (I'm not sure about the format here either,
>> but I hope to find an example).
>>
>> Are you interested in hosting the model once I finally succeed?
>>
>> Thank you very much
>>
>> Andreas
>>
>> On 14.03.2013 13:25, Jörn Kottmann wrote:
>>> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>>>> So the detokenizer adds the <SPLIT> tag where it is needed?
>>> Exactly, you need to merge the tokens again which were previously not
>>> separated by whitespace. E.g. the tokenized "AKTIEN SCHWEIZ/Verlauf :"
>>> was "AKTIEN SCHWEIZ/Verlauf:" in the original text, and in the training
>>> data you encode that as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
>>>
>>> The detokenizer just figures out which tokens are merged together and
>>> which are not, based on some rules. There is a util which can use that
>>> information to output the tokenizer training data. It should be
>>> integrated into the CLI, but it's been a while since I last used it.
>>>
>>> Don't hesitate to ask if you need more help,
>>> Jörn
>>>
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
