Hi Damiano,

In general this is a difficult problem (making n-grams from unigrams). Have you considered using a regex to find your dates/phone numbers etc. and protecting them from the tokenizer, i.e. replacing the whitespace with a printable (though possibly not alphanumeric) character such as an underscore?

Daniel
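A minimal sketch of that idea, assuming dates shaped like "2017 03 02" (the pattern, class name, and underscore choice here are illustrative, not part of any OpenNLP API):

```java
import java.util.regex.Pattern;

public class ProtectDates {
    // Hypothetical pattern for dates written as "YYYY MM DD".
    static final Pattern DATE = Pattern.compile("\\b(\\d{4}) (\\d{2}) (\\d{2})\\b");

    // Replace the internal spaces with underscores so a whitespace
    // tokenizer keeps the whole date as a single token.
    static String protect(String text) {
        return DATE.matcher(text).replaceAll("$1_$2_$3");
    }

    public static void main(String[] args) {
        System.out.println(protect("Born on 2017 03 02 in Rome"));
        // → Born on 2017_03_02 in Rome
    }
}
```

After tokenization (and NER tagging) you would reverse the substitution to recover the original surface form.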
On 3/2/17, 11:46 AM, "Damiano Porta" <damianopo...@gmail.com> wrote:

Hello everybody,
I have created a custom tokenizer that does not split specific "patterns" like emails, telephone numbers, dates, etc.; I convert each of them into ONE single token. The other parts of the text are tokenized with the SimpleTokenizer. The problem arises when I need to train a NER model. For example, if my data contains dates like 2017 03 02, these will be converted into three tokens (whitespace tokenizer), and I must avoid that. Can I specify the tokenizer when using the TokenNameFinderTrainer tool?

Thanks,
Damiano