Ok! Thanks

2017-03-02 23:53 GMT+01:00 Rodrigo Agerri <rodrigo.age...@ehu.eus>:

> Hello,
>
> This is what I meant in my first example. If you are annotating tokens (or
> already have them annotated) in a corpus with the BIO format, then as long
> as you annotate each token of the date with the NE class you will be fine,
> provided you use the same tokenization at testing time.
>
> Cheers,
>
> R
>
> On Thu, Mar 2, 2017 at 11:24 PM, Damiano Porta <damianopo...@gmail.com>
> wrote:
>
> > Hi Rodrigo, thanks for your message.
> > My problem is that the dates do not follow a consistent format. You said:
> >
> > 2016/05/12 B-DATE
> > 14/05/2012 B-DATE
> > 15-02-2016 B-DATE
> >
> > These dates are no problem; the problem comes when I have:
> >
> > 2016 05 12
> > 14 05 2012
> > 15 02 2016
> >
> > (with a whitespace separator)
> >
> > If I have these dates in a corpus they will be split into three tokens.
> > At the moment I am using a custom tokenizer that does not split dates,
> > so each date ends up as ONE token:
> >
> > "2016 05 12"
> > "14 05 2012"
> > "15 02 2016"
> >
> > Now the problem is during the namefinder training, where I cannot have:
> >
> > 2016 05 12 B-DATE
> > 14 05 2012 B-DATE
> > 15 02 2016 B-DATE
> >
> > I do not think whitespace is allowed here.
> >
> > The solution that I am following at the moment (Daniel's answer) is to split
> > the text with the SimpleTokenizer and then annotate the dates with specific
> > custom features (I will pass those features using the
> > AdditionalContextFeatureGenerator
> > https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java),
> > so I will have:
> >
> > 2016 ne=B-date
> > 05 ne=I-date
> > 12 ne=I-date
> >
> > (where *ne=* is the prefix of the generator:
> > https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/AdditionalContextFeatureGenerator.java#L38
> > )
> >
> > This way I do not care whether a date contains whitespace (or other
> > separators); I simply use BIO encoding.
> >
> > What do you think?
> > Thanks,
> >
> > Damiano
> >
> > 2017-03-02 23:02 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:
> >
> > > Hi Damiano,
> > >
> > > Maybe I am not understanding your question, but if you just give the
> > > NameFinder tokenized annotated data that should be fine:
> > >
> > > word O
> > > 2017 B-DATE
> > > 03 I-DATE
> > > 02 I-DATE
> > > word O
> > >
> > > Then at testing time, if you tokenize the dates like that, the NameFinder
> > > should still try to find the sequences. If you have various ways of
> > > representing dates in the training data:
> > >
> > > 2016/05/12 B-DATE
> > > 14/05/2012 B-DATE
> > > 15-02-2016 B-DATE
> > >
> > > It will all depend on how the tokenizer handles them and how they are
> > > annotated in the training data. In any case, the most important thing is
> > > for the tokenization to be consistent for training and testing.
> > >
> > > HTH,
> > >
> > > Rodrigo
> > >
> > > ...
> > >
> > > On Thu, Mar 2, 2017 at 5:46 PM, Damiano Porta <damianopo...@gmail.com>
> > > wrote:
> > >
> > > > Hello everybody,
> > > >
> > > > I have created a custom tokenizer that does not split specific "patterns"
> > > > like emails, telephone numbers, dates etc. I convert each of them into ONE
> > > > single token. The other parts of the text are tokenized with the
> > > > SimpleTokenizer.
> > > >
> > > > The problem comes when I need to train a NER model. For example, if my data
> > > > has dates like 2017 03 02, these will be converted into three tokens
> > > > (whitespace tokenizer), and I must avoid that.
> > > >
> > > > Can I specify the tokenizer using the TokenNameFinderTrainer tool?
> > > >
> > > > Thanks
> > > > Damiano
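
For anyone reading this thread later, here is a minimal sketch of the BIO approach discussed above, written in plain Python rather than against the OpenNLP API. It tags a date with B-DATE whether it arrives as one token (2016/05/12) or as three whitespace-separated tokens (2016 05 12, tagged B-DATE I-DATE I-DATE), so the tokenizer never has to keep a date together. Everything in it is an assumption for illustration: the function names, the date patterns, and the use of str.split() as a stand-in for whatever tokenizer you actually use. The second helper writes the spans in the `<START:type> ... <END>` markup that, as far as I know, the OpenNLP name finder trainer reads; please check that against the manual before relying on it.

```python
import re

# Hypothetical pattern, for illustration only: a date written as ONE token,
# e.g. 2016/05/12, 14/05/2012 or 15-02-2016.
SINGLE_TOKEN_DATE = re.compile(r"^(\d{4}[/-]\d{2}[/-]\d{2}|\d{2}[/-]\d{2}[/-]\d{4})$")


def is_split_date(a, b, c):
    """True if three consecutive tokens look like 'YYYY MM DD' or 'DD MM YYYY'."""
    parts = (a, b, c)
    if not all(p.isdigit() for p in parts):
        return False
    return tuple(len(p) for p in parts) in ((4, 2, 2), (2, 2, 4))


def bio_tag_dates(tokens):
    """Return (token, tag) pairs using B-DATE / I-DATE / O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        if SINGLE_TOKEN_DATE.match(tokens[i]):
            tags[i] = "B-DATE"
            i += 1
        elif i + 2 < len(tokens) and is_split_date(*tokens[i:i + 3]):
            tags[i:i + 3] = ["B-DATE", "I-DATE", "I-DATE"]
            i += 3
        else:
            i += 1
    return list(zip(tokens, tags))


def to_opennlp_line(tagged):
    """Render BIO-tagged tokens as one line of <START:type> ... <END> markup."""
    out = []
    open_span = False
    for token, tag in tagged:
        if tag.startswith("B-"):
            if open_span:
                out.append("<END>")  # close a span that directly precedes this one
            out.append("<START:" + tag[2:].lower() + ">")
            open_span = True
        elif tag == "O" and open_span:
            out.append("<END>")
            open_span = False
        out.append(token)
    if open_span:
        out.append("<END>")
    return " ".join(out)


# str.split() stands in for the real tokenizer here (an assumption).
tagged = bio_tag_dates("word 2017 03 02 word".split())
line = to_opennlp_line(tagged)
```

The point Rodrigo makes still applies: whichever tokenization produces the training tags must also be used at testing time, otherwise the model sees token sequences it was never trained on.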