Re: Tokenizer for NER training

Rodrigo Agerri Thu, 02 Mar 2017 14:03:59 -0800

Hi Damiano,

Maybe I am not understanding your question, but if you just give the
NameFinder tokenized, annotated data, that should be fine:

word O
2017 B-DATE
03 I-DATE
02 I-DATE
word O

Then at testing time, if you tokenize the dates the same way, the NameFinder
should still try to find the sequences. If the training data represents
dates in various ways:

2016/05/12 B-DATE
14/05/2012 B-DATE
15-02-2016 B-DATE

It will all depend on how the tokenizer splits them and how they are
annotated in the training data. In any case, the most important thing is
for the tokenization to be consistent between training and testing.
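
To make the consistency point concrete, here is a minimal sketch in plain
Python. It is not OpenNLP code and does not use the OpenNLP API; the DATE
pattern and the tokenize and to_bio helpers are hypothetical names invented
for this illustration. The idea is only that one tokenize function is used
both to prepare the annotated training data and to split text at testing
time, so both sides see the same tokens.

```python
import re

# Hypothetical pattern for dates that should stay in one token, e.g.
# 2016/05/12, 14/05/2012, 15-02-2016. This is an assumption, not OpenNLP.
DATE = r"\d{1,4}[/-]\d{1,2}[/-]\d{1,4}"
# Fallback: runs of letters, runs of digits, or single other symbols.
TOKEN = re.compile(DATE + r"|[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text):
    """Split text into tokens, keeping date-like patterns whole."""
    return TOKEN.findall(text)


def to_bio(tokens, spans):
    """Pair each token with a B-/I-/O tag.

    spans is a list of (start, end, label) over token indexes,
    with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))


# Same function for training data and for text seen at testing time.
train_tokens = tokenize("word 2017 03 02 word")
for token, tag in to_bio(train_tokens, [(1, 4, "DATE")]):
    print(token, tag)
```

With this sketch a date written as 2017 03 02 becomes three tokens tagged
B-DATE, I-DATE, I-DATE, while 15-02-2016 stays one token that would be
tagged B-DATE. Either is workable as long as the same splitting is applied
on both sides.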

HTH,

Rodrigo

...

On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Hello everybody,
>
> I have created a custom tokenizer that does not split specific "patterns"
> such as emails, telephone numbers, dates, etc. I convert each of them into
> ONE single token. The other parts of the text are tokenized with the
> SimpleTokenizer.
>
> The problem arises when I need to train a NER model. For example, if my
> data has dates like 2017 03 02, these will be converted into three tokens
> (whitespace tokenizer). I must avoid that.
>
> Can I specify the tokenizer using the TokenNameFinderTrainer tool?
>
> Thanks
> Damiano
>
