Re: Tokenizer for NER training

Damiano Porta Thu, 02 Mar 2017 15:01:09 -0800

Ok! Thanks

2017-03-02 23:53 GMT+01:00 Rodrigo Agerri <rodrigo.age...@ehu.eus>:


> Hello,
>
> This is what I meant in my first example. If you are annotating tokens (or
> already have them annotated) in a corpus with the BIO format, then you will
> be fine as long as you annotate each token of the date with the NE class and
> use the same tokenization at testing time.
>
> Cheers,
>
> R
>
> On Thu, Mar 2, 2017 at 11:24 PM, Damiano Porta <damianopo...@gmail.com>
> wrote:
>
> > Hi Rodrigo, thanks for your message.
> > My problem is that the dates do not follow a correct format. You said:
> >
> > 2016/05/12 B-DATE
> > 14/05/2012 B-DATE
> > 15-02-2016 B-DATE
> >
> > These dates are not a problem. The problem comes when I have:
> >
> > 2016 05 12
> > 14 05 2012
> > 15 02 2016
> >
> > (with a whitespace separator)
> >
> > If I have these dates in a corpus, they will be split into three tokens.
> > At the moment I am using a custom tokenizer that does not split dates, so
> > each date ends up as ONE token:
> >
> > "2016 05 12"
> > "14 05 2012"
> > "15 02 2016"
> >
> > Now the problem is during the NameFinder training. I cannot have:
> >
> > 2016 05 12 B-DATE
> > 14 05 2012 B-DATE
> > 15 02 2016 B-DATE
> >
> > I do not think whitespace is allowed here.
> >
> > The solution that I am following at the moment (Daniel's answer) is to split
> > the text with the SimpleTokenizer and then annotate the dates with specific
> > custom features. I will pass those features using the
> > AdditionalContextFeatureGenerator
> > (https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java),
> > so I will have:
> >
> > 2016 ne=B-date
> > 05 ne=I-date
> > 12 ne=I-date
> >
> > (where *ne=* is the prefix of the generator:
> > https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java#L38)
> >
> > This way I do not care whether a date contains whitespace (or other
> > separators); I simply use BIO encoding.
> >
> > What do you think?
> > Thanks,
> >
> > Damiano
> >
> > 2017-03-02 23:02 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:
> >
> > > Hi Damiano,
> > >
> > > Maybe I am not understanding your question, but if you just give the
> > > NameFinder tokenized annotated data that should be fine:
> > >
> > > word O
> > > 2017 B-DATE
> > > 03 I-DATE
> > > 02 I-DATE
> > > word O
> > >
> > > Then at testing time, if you tokenize the dates like that, the NameFinder
> > > should still try to find the sequences. If you have in the training data
> > > various ways of representing dates:
> > >
> > > 2016/05/12 B-DATE
> > > 14/05/2012 B-DATE
> > > 15-02-2016 B-DATE
> > >
> > > It will all depend on how the tokenizer handles them and how they are
> > > annotated in the training data. In any case, the most important thing is
> > > for the tokenization to be consistent between training and testing.
> > >
> > > HTH,
> > >
> > > Rodrigo
> > >
> > > ...
> > >
> > > On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
> > > wrote:
> > >
> > > > Hello everybody,
> > > >
> > > > I have created a custom tokenizer that does not split specific "patterns"
> > > > like emails, telephone numbers, dates, etc. I convert them into ONE single
> > > > token. The other parts of the text are tokenized with the
> > > > SimpleTokenizer.
> > > >
> > > > The problem is when I need to train a NER model. For example, if my data
> > > > has dates like 2017 03 02, these will be converted into three tokens
> > > > (whitespace tokenizer). I must avoid that.
> > > >
> > > > Can I specify the tokenizer using the TokenNameFinderTrainer tool?
> > > >
> > > > Thanks
> > > > Damiano
> > > >
> > >
> >
>
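
The BIO approach discussed in this thread can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not OpenNLP
code: `tokenize`, `bio_tag_dates`, `to_span_markup` and both regular
expressions are hypothetical names invented for this example, the tokenizer is
a rough stand-in for splitting on character-class changes, and the
`<START:date> ... <END>` output assumes the span markup of OpenNLP's default
name finder training format.

```python
import re

# Hypothetical stand-in for a simple tokenizer: runs of digits, runs of
# letters, or single punctuation characters. Whitespace is never a token.
TOKEN_RE = re.compile(r"\d+|[^\W\d]+|[^\w\s]")

# Hypothetical date pattern covering the formats quoted in the thread:
# YYYY?MM?DD or DD?MM?YYYY, where ? is a space, a slash or a hyphen.
DATE_RE = re.compile(
    r"\b(?:\d{4}[ /-]\d{2}[ /-]\d{2}|\d{2}[ /-]\d{2}[ /-]\d{4})\b"
)


def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def bio_tag_dates(text):
    """Label every token as B-DATE, I-DATE or O."""
    date_spans = [m.span() for m in DATE_RE.finditer(text)]
    tagged = []
    for token, start, end in tokenize(text):
        label = "O"
        for span_start, span_end in date_spans:
            if span_start <= start and end <= span_end:
                # The first token of a date span opens it, the rest continue it.
                label = "B-DATE" if start == span_start else "I-DATE"
                break
        tagged.append((token, label))
    return tagged


def to_span_markup(tagged):
    """Render BIO-tagged tokens as one whitespace-separated training line."""
    parts = []
    open_span = False
    for token, label in tagged:
        if label == "B-DATE":
            if open_span:
                parts.append("<END>")
            parts.append("<START:date>")
            open_span = True
        elif label == "O" and open_span:
            parts.append("<END>")
            open_span = False
        parts.append(token)
    if open_span:
        parts.append("<END>")
    return " ".join(parts)


if __name__ == "__main__":
    tagged = bio_tag_dates("Signed on 2016 05 12 in Rome")
    for token, label in tagged:
        print(token, label)
    # Prints: Signed on <START:date> 2016 05 12 <END> in Rome
    print(to_span_markup(tagged))
```

Because the labels are attached per token, a date written as `2016 05 12` and
one written as `14/05/2012` are handled the same way. What matters is that the
same tokenizer is applied at training and at testing time.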
