Hello Daniel,
Yes, exactly, that is what I do: I am using regexes to find those patterns. Is this problem only related to the TokenNameFinderTrainer tool? If I train the model via code, can I use my custom tokenizer? If not, I will follow your solution and use underscores.
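To make the question concrete, this is roughly what I mean by "training via code": building the NameSample objects from tokens my own tokenizer produced, so the trainer never re-tokenizes the text. A minimal sketch only; the sentence, the "date" type and the toy training parameters are illustrative, and a real corpus would need many samples:

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

public class TrainWithOwnTokens {
    public static void main(String[] args) throws Exception {
        // Tokens exactly as my custom tokenizer produces them: the date stays ONE token.
        String[] tokens = {"Meeting", "on", "2017_03_02", "in", "Rome"};
        // Token span 2..3 (end exclusive) is labelled as a "date" entity.
        Span[] names = {new Span(2, 3, "date")};
        NameSample sample = new NameSample(tokens, names, true);

        // A real corpus would stream thousands of samples; one is only to show the API shape.
        ObjectStream<NameSample> samples = ObjectStreamUtils.createObjectStream(sample);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "1"); // toy corpus, keep every feature

        TokenNameFinderModel model = NameFinderME.train(
                "en", "date", samples, params, new TokenNameFinderFactory());
        System.out.println("Trained: " + model);
    }
}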
Thanks,
Damiano

2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <[email protected]>:

> Hi Damiano,
> In general this is a difficult problem (making n-grams from unigrams).
> Have you considered using RegEx to find your dates/phone numbers etc. and
> protecting them from the tokenizer (i.e. replacing the white space with a
> printable, though possibly not alphanumeric, character like an underscore)?
> Daniel
>
> On 3/2/17, 11:46 AM, "Damiano Porta" <[email protected]> wrote:
>
> Hello everybody,
>
> I have created a custom tokenizer that does not split specific "patterns"
> like emails, telephones, dates, etc. I convert them into ONE single token.
> The other parts of the text are tokenized with the SimpleTokenizer.
>
> The problem is when I need to train a NER model. For example, if my data
> has dates like 2017 03 02, these will be converted into three tokens
> (whitespace tokenizer); I must avoid that.
>
> Can I specify the tokenizer using the TokenNameFinderTrainer tool?
>
> Thanks
> Damiano
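For reference, a small sketch of the underscore-protection idea described above. The date regex, the underscore marker and the example sentence are only illustrative, not a fixed convention:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ProtectPatterns {
    // Matches dates written as "yyyy mm dd" with spaces between the parts.
    private static final Pattern DATE = Pattern.compile("\\b(\\d{4}) (\\d{2}) (\\d{2})\\b");

    public static String protect(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace the internal spaces with underscores so the tokenizer
            // cannot split the date into three tokens.
            m.appendReplacement(sb, m.group(1) + "_" + m.group(2) + "_" + m.group(3));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = WhitespaceTokenizer.INSTANCE
                .tokenize(protect("Invoice issued on 2017 03 02 in Rome"));
        // -> Invoice | issued | on | 2017_03_02 | in | Rome
        System.out.println(String.join(" | ", tokens));
    }
}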
