Re: Tokenizer for NER training

Damiano Porta Thu, 02 Mar 2017 14:24:51 -0800

Hi Rodrigo, thanks for your message.
My problem is that the dates do not follow a consistent format. You gave
these examples:


2016/05/12 B-DATE
14/05/2012 B-DATE
15-02-2016 B-DATE

These dates cause no problems. The problem comes when I have:

2016 05 12
14 05 2012
15 02 2016

(with a whitespace separator)

If I have these dates in a corpus, they will be split into three tokens.
At the moment I am using a custom tokenizer that does not split dates,
so each date ends up as ONE token:

"2016 05 12"
"14 05 2012"
"15 02 2016"
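
To make the idea concrete, here is a minimal sketch of such a date-preserving tokenizer in plain Java. It is not my actual tokenizer and does not use any OpenNLP class: the class name, the regular expression, and the choice of splitting everything else on whitespace (rather than with the SimpleTokenizer) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class DatePreservingTokenizer {
    // Assumed pattern: a space-separated date (year first or year last),
    // otherwise any run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile(
        "\\b\\d{4} \\d{1,2} \\d{1,2}\\b"
        + "|\\b\\d{1,2} \\d{1,2} \\d{4}\\b"
        + "|\\S+");

    // Returns the tokens in order; a matched date stays one token,
    // including its inner spaces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this sketch, the input "Meeting on 2016 05 12 ." should yield four tokens, the third being "2016 05 12".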

Now the problem arises during NameFinder training. I cannot have:

2016 05 12 B-DATE
14 05 2012 B-DATE
15 02 2016 B-DATE

I do not think whitespace is allowed inside a token here.

The solution I am following at the moment (Daniel's answer) is to split
the text with the SimpleTokenizer and then annotate the dates with specific
custom features, which I will pass in through the
AdditionalContextFeatureGenerator
(https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java).
I will have:

2016 ne=B-date
05 ne=I-date
12 ne=I-date

(where *ne=* is the prefix of the generator:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java#L38
)

This way I do not care whether a date contains whitespace (or other
separators); I simply use BIO encoding.
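
As a rough sketch of the labelling step, the plain Java below assigns B-date / I-date / O labels to already-split tokens. The class name and the two digit patterns are hypothetical, it only covers whitespace-separated dates made of three numeric tokens, and it does not call OpenNLP. The assumption is that these labels would then be handed over as additional context, where the generator adds the *ne=* prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class DateBioTagger {
    // Assumed shapes: a 4-digit year and a 1-2 digit day or month.
    private static final Pattern YEAR = Pattern.compile("\\d{4}");
    private static final Pattern DAY_OR_MONTH = Pattern.compile("\\d{1,2}");

    // True if tokens[i..i+2] look like "yyyy mm dd" or "dd mm yyyy".
    private static boolean isDateAt(String[] tokens, int i) {
        if (i + 2 >= tokens.length) {
            return false;
        }
        boolean middle = DAY_OR_MONTH.matcher(tokens[i + 1]).matches();
        boolean yearFirst = YEAR.matcher(tokens[i]).matches()
            && DAY_OR_MONTH.matcher(tokens[i + 2]).matches();
        boolean yearLast = DAY_OR_MONTH.matcher(tokens[i]).matches()
            && YEAR.matcher(tokens[i + 2]).matches();
        return middle && (yearFirst || yearLast);
    }

    // Returns one BIO label per token, in token order.
    static List<String> label(String[] tokens) {
        List<String> labels = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (isDateAt(tokens, i)) {
                labels.add("B-date");
                labels.add("I-date");
                labels.add("I-date");
                i += 3;
            } else {
                labels.add("O");
                i++;
            }
        }
        return labels;
    }
}
```

The labels depend only on the token sequence, so the same code can run over the training corpus and over new text, which keeps the two consistent.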

What do you think?
Thanks,

Damiano

2017-03-02 23:02 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:

> Hi Damiano,
>
> Maybe I am not understanding your question, but if you just give the
> NameFinder tokenized annotated data that should be fine:
>
> word O
> 2017 B-DATE
> 03 I-DATE
> 02 I-DATE
> word O
>
> Then at testing time, if you tokenize the dates like that, the NameFinder
> should still try to find the sequences. If you have in the training data
> various ways of representing dates:
>
> 2016/05/12 B-DATE
> 14/05/2012 B-DATE
> 15-02-2016 B-DATE
>
> It will all depend on how the tokenizer handles them and how they are
> annotated in the training data. In any case, the most important thing is
> for the tokenization to be consistent for training and testing.
>
> HTH,
>
> Rodrigo
>
> ...
>
> On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
> wrote:
>
> > Hello everybody,
> >
> > I have created a custom tokenizer that does not split specific
> > "patterns" like emails, telephone numbers, dates, etc. I convert them
> > into ONE single token. The other parts of the text are tokenized with
> > the SimpleTokenizer.
> >
> > The problem is when I need to train a NER model. For example, if my
> > data has dates like 2017 03 02, these will be converted into three
> > tokens (whitespace tokenizer). I must avoid that.
> >
> > Can I specify the tokenizer using the TokenNameFinderTrainer tool?
> >
> > Thanks
> > Damiano
> >
>
