Hello,

OK, I checked the sources now and I can see that the tokenizer skips further tokenization once a token matches the alphanumeric pattern. I also looked into the default values of the patterns. As far as I can see, Factory.java contains no pattern for the "de" language flag. To me this means there is no standard way of training a German model with the TokenizerTrainer tool. I guess I have to write my own training tool where I set the pattern on the TokenizerME myself, roughly along the lines of the sketch below. Am I right here?
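For reference, here is a minimal sketch of what such a training tool could look like, assuming the TokenizerFactory-based training API available in recent OpenNLP releases. The German alphanumeric pattern and the file names are just my own guesses for illustration, not whatever was used to build de-token.bin:

import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class GermanTokenizerTrainer {

    public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, tokens separated by
        // whitespace, with <SPLIT> marking token boundaries that carry
        // no whitespace, e.g. "Das ist ein Satz<SPLIT>."
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("de-token.train")),
                StandardCharsets.UTF_8);
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Alphanumeric pattern extended with German umlauts and sharp s.
        // This pattern is an assumption; adjust it to your data.
        Pattern alphaNumeric = Pattern.compile("^[A-Za-zÄÖÜäöüß0-9]+$");

        TokenizerFactory factory = new TokenizerFactory(
                "de",           // language code
                null,           // no abbreviation dictionary
                true,           // use alphanumeric optimization
                alphaNumeric);  // custom pattern instead of the default

        // Default parameters: 100 iterations, cutoff 5.
        TokenizerModel model = TokenizerME.train(
                samples, factory, TrainingParameters.defaultParams());

        try (FileOutputStream out = new FileOutputStream("de-token.bin")) {
            model.serialize(out);
        }
    }
}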
Finally, I wonder whether the training class for the de-token.bin file on the models page is public, so that I can adapt it for my own data. If anyone can point me to it, that would be very helpful.

Thank you
Andreas

On 13.03.2013 12:15, James Kosin wrote:
> Andreas,
>
> Tokenizing is a very simple procedure; so, the default of 100 iterations
> should suffice as long as you have a large training set. Greater than,
> say, about 1,000 lines.
>
> James
>
> On 3/13/2013 4:39 AM, Andreas Niekler wrote:
>> Hello,
>>
>> it was a clean set which I just annotated with the <SPLIT> tags.
>>
>> And the German root bases for those examples are not right in the
>> cases I posted.
>>
>> I used 500 iterations; could it be an overfitting problem?
>>
>> Thanks for your help.
>>
>> On 13.03.2013 02:38, James Kosin wrote:
>>> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>>>> stehenge - blieben
>>>> fre - undlicher
>>> Andreas,
>>>
>>> I'm not an expert on German, but in English the models are also trained
>>> on splitting contractions and other words into their root bases.
>>>
>>> i.e.: You'll -split-> You 'll -meaning-> You will
>>>       Can't  -split-> Can 't  -meaning-> Can not
>>>
>>> Other words may also get parsed and separated by the tokenizer.
>>>
>>> Did you create the training data yourself? Or was this a clean set of
>>> data from another source?
>>>
>>> James
>>>

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
