Hello Daniel,
Yes, exactly, that is what I do: I am using regexes to find those patterns. Is this problem only related to the TokenNameFinderTrainer tool? If I train the model via code, can I use my custom tokenizer? If not, I will follow your solution and use underscores.
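To make the question concrete, this is roughly what I mean by "training via code": building the NameSample objects from tokens my own tokenizer produced, so the trainer never re-tokenizes the text. A minimal sketch only; the sentence, the "date" type and the toy training parameters are illustrative, and a real corpus would need many samples:

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

public class TrainWithOwnTokens {
    public static void main(String[] args) throws Exception {
        // Tokens exactly as my custom tokenizer produces them: the date stays ONE token.
        String[] tokens = {"Meeting", "on", "2017_03_02", "in", "Rome"};
        // Token span 2..3 (end exclusive) is labelled as a "date" entity.
        Span[] names = {new Span(2, 3, "date")};
        NameSample sample = new NameSample(tokens, names, true);

        // A real corpus would stream thousands of samples; one is only to show the API shape.
        ObjectStream<NameSample> samples = ObjectStreamUtils.createObjectStream(sample);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "1"); // toy corpus, keep every feature

        TokenNameFinderModel model = NameFinderME.train(
                "en", "date", samples, params, new TokenNameFinderFactory());
        System.out.println("Trained: " + model);
    }
}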
Thanks,
Damiano

2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <[email protected]>:

> Hi Damiano,
> In general this is a difficult problem (making n-grams from unigrams).
> Have you considered using RegEx to find your dates/phone numbers etc. and
> protecting them from the tokenizer (i.e. replacing the white space with a
> printable, though possibly not alphanumeric, character like an underscore)?
> Daniel
>
> On 3/2/17, 11:46 AM, "Damiano Porta" <[email protected]> wrote:
>
> Hello everybody,
>
> I have created a custom tokenizer that does not split specific "patterns"
> like emails, telephones, dates, etc. I convert them into ONE single token.
> The other parts of the text are tokenized with the SimpleTokenizer.
>
> The problem is when I need to train a NER model. For example, if my data
> has dates like 2017 03 02, these will be converted into three tokens
> (whitespace tokenizer); I must avoid that.
>
> Can I specify the tokenizer using the TokenNameFinderTrainer tool?
>
> Thanks
> Damiano
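For reference, a small sketch of the underscore-protection idea described above. The date regex, the underscore marker and the example sentence are only illustrative, not a fixed convention:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ProtectPatterns {
    // Matches dates written as "yyyy mm dd" with spaces between the parts.
    private static final Pattern DATE = Pattern.compile("\\b(\\d{4}) (\\d{2}) (\\d{2})\\b");

    public static String protect(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace the internal spaces with underscores so the tokenizer
            // cannot split the date into three tokens.
            m.appendReplacement(sb, m.group(1) + "_" + m.group(2) + "_" + m.group(3));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = WhitespaceTokenizer.INSTANCE
                .tokenize(protect("Invoice issued on 2017 03 02 in Rome"));
        // -> Invoice | issued | on | 2017_03_02 | in | Rome
        System.out.println(String.join(" | ", tokens));
    }
}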
