Re: Tokenizer for NER training

Rodrigo Agerri Thu, 02 Mar 2017 14:55:32 -0800

Hello,

This is what I meant in my first example. If you are annotating tokens (or
already have them annotated) in a corpus with the BIO format, then you will
be fine as long as you annotate each token of the date with the NE class and
use the same tokenization at testing time.
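
A minimal sketch of that idea, written in Python for brevity rather than
against the OpenNLP Java API. The date pattern and the helper names
(tokenize, bio_tag) are hypothetical and only serve the illustration; the
point is that a whitespace-separated date simply becomes a B- token followed
by I- tokens, and that the same tokenize function is reused at testing time.

```python
import re

# Hypothetical date pattern (an assumption for this sketch): three numeric
# fields separated by a space, slash, or hyphen, year first or year last.
DATE = re.compile(r"\b(?:\d{4}[ /-]\d{2}[ /-]\d{2}|\d{2}[ /-]\d{2}[ /-]\d{4})\b")


def tokenize(text):
    """Whitespace tokenization with character offsets.

    The same function must be used for training and testing data.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tag(text, label="DATE"):
    """Return (token, tag) pairs, tagging every token inside a date match."""
    spans = [m.span() for m in DATE.finditer(text)]
    tagged = []
    for token, start, end in tokenize(text):
        tag = "O"
        for span_start, span_end in spans:
            if span_start <= start and end <= span_end:
                # First token of the span gets B-, the rest get I-.
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


for token, tag in bio_tag("word 2017 03 02 word 14/05/2012 word"):
    print(token, tag)
```

Whether a date ends up as one token or three does not matter here, as long
as every token inside the date carries a tag of the same class.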


Cheers,

R

On Thu, Mar 2, 2017 at 11:24 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Hi Rodrigo, thanks for your message.
> My problem is that the dates do not follow a correct format. You said:
>
> 2016/05/12 B-DATE
> 14/05/2012 B-DATE
> 15-02-2016 B-DATE
>
> These dates are no problem; the problem comes when I have:
>
> 2016 05 12
> 14 05 2012
> 15 02 2016
>
> (with a whitespace separator)
>
> If I have these dates in a corpus, they will be split into three tokens.
> At the moment I am using a custom tokenizer that does not split dates, so
> each date ends up as ONE token:
>
> "2016 05 12"
> "14 05 2012"
> "15 02 2016"
>
> Now the problem arises during name finder training. I cannot have:
>
> 2016 05 12 B-DATE
> 14 05 2012 B-DATE
> 15 02 2016 B-DATE
>
> I do not think whitespace is allowed here.
>
> The solution that I am following at the moment (Daniel's answer) is to split
> the text with the SimpleTokenizer and then annotate the dates with specific
> custom features (I will pass those features using the
> AdditionalContextFeatureGenerator,
> https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java).
> I will have:
>
> 2016 ne=B-date
> 05 ne=I-date
> 12 ne=I-date
>
> (where *ne=* is the prefix of the generator:
> https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java#L38)
>
> In this way I do not care whether a date contains whitespace (or other
> separators) or not; I simply use BIO encoding.
>
> What do you think?
> Thanks,
>
> Damiano
>
>
>
>
>
>
> 2017-03-02 23:02 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:
>
> > Hi Damiano,
> >
> > Maybe I am not understanding your question, but if you just give the
> > NameFinder tokenized annotated data that should be fine:
> >
> > word O
> > 2017 B-DATE
> > 03 I-DATE
> > 02 I-DATE
> > word O
> >
> > Then at testing time, if you tokenize the dates like that, the NameFinder
> > should still try to find the sequences. If you have in the training data
> > various ways of representing dates:
> >
> > 2016/05/12 B-DATE
> > 14/05/2012 B-DATE
> > 15-02-2016 B-DATE
> >
> > Then it will all depend on how the tokenizer splits them and how they are
> > annotated in the training data. In any case, the most important thing is
> > for the tokenization to be consistent for training and testing.
> >
> > HTH,
> >
> > Rodrigo
> >
> > ...
> >
> > On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
> > wrote:
> >
> > > Hello everybody,
> > >
> > > I have created a custom tokenizer that does not split specific
> > > "patterns" like emails, telephone numbers, dates, etc. I convert each
> > > of them into ONE single token. The other parts of the text are
> > > tokenized with the SimpleTokenizer.
> > >
> > > The problem is when I need to train a NER model. For example, if my
> > > data has dates like 2017 03 02, these will be converted into three
> > > tokens (whitespace tokenizer). I must avoid that.
> > >
> > > Can I specify the tokenizer using the TokenNameFinderTrainer tool?
> > >
> > > Thanks
> > > Damiano
> > >
> >
>
