Hi Damiano,

Maybe I am not understanding your question, but if you just give the NameFinder tokenized, annotated data, that should be fine:

    word  O
    2017  B-DATE
    03    I-DATE
    02    I-DATE
    word  O

Then at testing time, if you tokenize the dates like that, the NameFinder should still try to find the sequences. If the training data represents dates in various ways:

    2016/05/12  B-DATE
    14/05/2012  B-DATE
    15-02-2016  B-DATE

it will all depend on how the tokenizer splits them and how they are annotated in the training data. In any case, the most important thing is for the tokenization to be consistent between training and testing.

HTH,

Rodrigo

On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <[email protected]> wrote:

> Hello everybody,
>
> i have created a custom tokenizer that does not split specific "patterns"
> like, emails, telephones, dates etc. I convert them into ONE single token.
> The other parts of text are tokenized with the
> SimpleTokenizer.
>
> The problem is when i need to train a NER model. For example if my data has
> dates like 2017 03 02 these will be converted into three tokens (whitespace
> tokenizer) i must avoid that.
>
> Can i specify the tokenizer using the TokenNameFinderTrainer tool?
>
> Thanks
> Damiano
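
To make the consistency point from the reply concrete, here is a minimal sketch in Python. It is not OpenNLP code: the `DATE` pattern and the `tokenize` and `to_bio` helpers are hypothetical names made up for illustration, and the pattern only covers the three date layouts listed in the reply. The idea is simply that one tokenizer function prepares the training data and is reused unchanged at testing time.

```python
import re

# Hypothetical date pattern for illustration only: covers 2016/05/12,
# 14/05/2012 and 15-02-2016 style dates, nothing more.
DATE = r"\d{1,4}[/-]\d{1,2}[/-]\d{1,4}"

# Try the date pattern first so a whole date stays one token; otherwise
# fall back to runs of word characters or single punctuation marks.
TOKEN_RE = re.compile(rf"{DATE}|\w+|[^\w\s]")
DATE_RE = re.compile(DATE)


def tokenize(text):
    """Split text into tokens, keeping date-like patterns as single tokens."""
    return TOKEN_RE.findall(text)


def to_bio(text, label="DATE"):
    """Pair each token with B-<label> if it is a whole date, else O."""
    return [
        (tok, f"B-{label}" if DATE_RE.fullmatch(tok) else "O")
        for tok in tokenize(text)
    ]


if __name__ == "__main__":
    # Same tokenizer for building training data and for tokenizing test input.
    for token, tag in to_bio("Invoice due 2016/05/12."):
        print(token, tag)

    # A plain whitespace split glues the period onto the date, so the model
    # would see a token it never saw in training.
    print("Invoice due 2016/05/12.".split())
```

A date written with spaces, such as 2017 03 02, still comes out of this sketch as three tokens, so it would need the B-DATE / I-DATE / I-DATE annotation shown in the reply.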
