Hi Rodrigo,

thanks for your message. My problem is that the dates do not follow a standard format. You said:

2016/05/12 B-DATE
14/05/2012 B-DATE
15-02-2016 B-DATE

These dates are not a problem. The problem comes when I have:

2016 05 12
14 05 2012
15 02 2016

(with a whitespace separator). If I have these dates in a corpus, each of them will be split into three tokens. At the moment I am using a custom tokenizer that does not split dates, so each date ends up as ONE token:

"2016 05 12"
"14 05 2012"
"15 02 2016"

Now the problem is during the NameFinder training. I cannot have:

2016 05 12 B-DATE
14 05 2012 B-DATE
15 02 2016 B-DATE

because I do not think whitespace is allowed inside a token here.

The solution that I am following at the moment (Daniel's answer) is to split the text with the SimpleTokenizer and then annotate the dates with specific custom features. I will pass those features using the AdditionalContextFeatureGenerator:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java

I will have:

2016 ne=B-date
05 ne=I-date
12 ne=I-date

(where *ne=* is the prefix of the generator:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java#L38 )

This way I do not care whether a date contains whitespace (or any other separator); I simply use BIO encoding.

What do you think?

Thanks,
Damiano

2017-03-02 23:02 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:

> Hi Damiano,
>
> Maybe I am not understanding your question, but if you just give the
> NameFinder tokenized annotated data that should be fine:
>
> word O
> 2017 B-DATE
> 03 I-DATE
> 02 I-DATE
> word O
>
> Then at testing time, if you tokenize the dates like that, the NameFinder
> should still try to find the sequences. If you have in the training data
> various ways of representing dates:
>
> 2016/05/12 B-DATE
> 14/05/2012 B-DATE
> 15-02-2016 B-DATE
>
> It will all depend on how the tokenizer handles them and how they are
> annotated in the training data.
> In any case, the most important thing is
> for the tokenization to be consistent for training and testing.
>
> HTH,
>
> Rodrigo
>
> ...
>
> On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
> wrote:
>
> > Hello everybody,
> >
> > i have created a custom tokenizer that does not split specific "patterns"
> > like emails, telephones, dates etc. I convert them into ONE single token.
> > The other parts of the text are tokenized with the SimpleTokenizer.
> >
> > The problem is when i need to train a NER model. For example, if my data
> > has dates like 2017 03 02, these will be converted into three tokens
> > (whitespace tokenizer). I must avoid that.
> >
> > Can i specify the tokenizer using the TokenNameFinderTrainer tool?
> >
> > Thanks
> > Damiano
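
For illustration, here is a minimal sketch of the BIO approach discussed in this thread. It is plain Python, not OpenNLP code. It assumes the text has already been split into simple tokens and that a whitespace-separated date is three consecutive all-digit tokens shaped like YYYY MM DD or DD MM YYYY. The names `bio_tag_dates` and `is_split_date`, and the date patterns themselves, are hypothetical choices made for this example only.

```python
import re

# Dates that survive tokenization as ONE token, e.g. 2016/05/12 or 15-02-2016.
# (Assumed patterns for this sketch; real data may need more.)
SINGLE_TOKEN_DATE = re.compile(
    r"(?:[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}|[0-9]{2}[/-][0-9]{2}[/-][0-9]{4})"
)
DIGITS = re.compile(r"[0-9]+")


def is_split_date(a, b, c):
    """True if three consecutive tokens look like YYYY MM DD or DD MM YYYY."""
    if not all(DIGITS.fullmatch(t) for t in (a, b, c)):
        return False
    return (len(a), len(b), len(c)) in ((4, 2, 2), (2, 2, 4))


def bio_tag_dates(tokens):
    """Return one BIO tag per token; tokens are never merged or re-split."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        if i + 2 < len(tokens) and is_split_date(*tokens[i:i + 3]):
            # A date spread over three tokens: begin + two inside tags.
            tags[i:i + 3] = ["B-DATE", "I-DATE", "I-DATE"]
            i += 3
        elif SINGLE_TOKEN_DATE.fullmatch(tokens[i]):
            # A date that is already a single token only needs a begin tag.
            tags[i] = "B-DATE"
            i += 1
        else:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = "word 2017 03 02 word".split()
    for token, tag in zip(sentence, bio_tag_dates(sentence)):
        print(token, tag)
```

Because the tags are attached to whatever tokens the tokenizer produces, the same tagging works whether the date separator is a slash, a dash, or whitespace, as long as the same tokenization is used for training and testing.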