Hello,

I'll create a Jira issue and implement a flag that controls whether the detokenizer emits the <SPLIT> tag or just deletes the whitespace. I hope that will do it then.
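Roughly what I have in mind for the flag, as a minimal sketch in Python rather than against the real OpenNLP classes. The names `render_tokens`, `glued_to_next` and `use_split_tag` are made up for illustration, and I am assuming the detokenizer rules have already decided which tokens attach to the following token without whitespace:

```python
SPLIT_TAG = "<SPLIT>"


def render_tokens(tokens, glued_to_next, use_split_tag):
    """Join tokens back into one line of text.

    tokens        -- list of token strings
    glued_to_next -- list of booleans, same length as tokens; True means the
                     token was NOT separated from the next one by whitespace
                     in the original text (hypothetical input, assumed to
                     come from the detokenizer rules)
    use_split_tag -- True: mark merged positions with <SPLIT> (tokenizer
                     training data); False: just drop the whitespace
    """
    if len(tokens) != len(glued_to_next):
        raise ValueError("tokens and glued_to_next must have the same length")

    parts = []
    for i, token in enumerate(tokens):
        parts.append(token)
        if i == len(tokens) - 1:
            break  # nothing follows the last token
        if glued_to_next[i]:
            if use_split_tag:
                parts.append(SPLIT_TAG)
            # otherwise append nothing: the two tokens are simply merged
        else:
            parts.append(" ")
    return "".join(parts)


if __name__ == "__main__":
    tokens = ["AKTIEN", "SCHWEIZ/Verlauf", ":"]
    glued = [False, True, False]
    print(render_tokens(tokens, glued, use_split_tag=True))
    print(render_tokens(tokens, glued, use_split_tag=False))
```

With the flag set, the example from this thread comes out as training data ("AKTIEN SCHWEIZ/Verlauf<SPLIT>:"); without it, the original text is restored ("AKTIEN SCHWEIZ/Verlauf:").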
Thank you for all the clarifications

Andreas

On 14.03.2013 13:48, Jörn Kottmann wrote:
> Have a look here:
> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.detokenizing
>
> Here is the detokenizer tool:
> https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/DictionaryDetokenizerTool.java
>
> Looks like it doesn't output the <SPLIT> tag, we should change that. The
> main purpose of it is to generate training data for the tokenizer.
> Anyway, patches to improve the detokenizer are very welcome, looks like
> the documentation needs a few fixes too.
>
> HTH,
> Jörn
>
> On 03/14/2013 01:32 PM, Andreas Niekler wrote:
>> Hello,
>>
>> ok, I will find out what the name of the tool is and I will create a
>> rules XML and an abbreviations list (not sure about the format here
>> either, but I hope I find an example).
>>
>> Are you interested in hosting the model after I finally succeed?
>>
>> Thank you very much
>>
>> Andreas
>>
>> On 14.03.2013 13:25, Jörn Kottmann wrote:
>>> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>>>> So the detokenizer adds the <SPLIT> tag where it is needed?
>>> Exactly, you need to merge the tokens again which were previously not
>>> separated by a white space. E.g. "SCHWEIZ/Verlauf :" was in the
>>> original text "AKTIEN SCHWEIZ/Verlauf:" and in the training data you
>>> encode that as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
>>>
>>> The detokenizer just figures out which tokens are merged together and
>>> which are not, based on some rules. There is a util which can use that
>>> information to output the tokenizer training data; it should be
>>> integrated into the CLI, but it's been a while since I last used it.
>>>
>>> Don't hesitate to ask if you need more help,
>>> Jörn
>>>
>

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
