Hi Damiano,
   In general this is a difficult problem (making n-grams from unigrams).  Have 
you considered using a regex to find your dates, phone numbers, etc. and 
protecting them from the tokenizer, i.e. replacing the whitespace with a 
printable (though possibly not alphanumeric) character like an underscore?
Daniel
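
As a minimal sketch of that idea (the date pattern and the underscore joiner
here are just illustrative assumptions, not part of any OpenNLP API): find
matches with a regex and rewrite their internal whitespace before handing the
text to a whitespace-based tokenizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenProtector {
    // Hypothetical pattern: dates written as "2017 03 02".
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{4}) (\\d{2}) (\\d{2})\\b");

    // Replace the internal whitespace of each match with underscores so a
    // whitespace tokenizer keeps the whole date as one token.
    public static String protect(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb,
                    m.group(1) + "_" + m.group(2) + "_" + m.group(3));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Born on 2017 03 02 in Rome." -> "Born on 2017_03_02 in Rome."
        System.out.println(protect("Born on 2017 03 02 in Rome."));
    }
}
```

The same reverse mapping (underscore back to space) can be applied after
tokenization if the original surface form is needed.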

On 3/2/17, 11:46 AM, "Damiano Porta" <damianopo...@gmail.com> wrote:

    Hello everybody,
    
    I have created a custom tokenizer that does not split specific "patterns"
    like emails, telephones, dates, etc.; I convert each of them into ONE
    single token. The other parts of the text are tokenized with the
    SimpleTokenizer.
    
    The problem is when I need to train a NER model. For example, if my data
    has dates like 2017 03 02, they will be split into three tokens (whitespace
    tokenizer); I must avoid that.
    
    Can I specify the tokenizer when using the TokenNameFinderTrainer tool?
    
    Thanks
    Damiano
    