Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta
Ok! Thanks

Re: Tokenizer for NER training

2017-03-02 Thread Rodrigo Agerri
Hello, This is what I meant in my first example. If you are annotating tokens (or already have them annotated) in a corpus with the BIO format, then as long as you annotate each token of the date with the NE class you will be fine, as long as at testing time you use the same tokenization.

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta
Hi Rodrigo, thanks for your message. My problem is that my dates do not follow one consistent format. You said: 2016/05/12 B-DATE 14/05/2012 B-DATE 15-02-2016 B-DATE These dates are no problem; the problem comes when I have: 2016 05 12 14 05 2012 15 02 2016 (with a whitespace separator) If I

Re: Tokenizer for NER training

2017-03-02 Thread Rodrigo Agerri
Hi Damiano, Maybe I am not understanding your question, but if you just give the NameFinder tokenized annotated data, that should be fine:

word O
2017 B-DATE
03 I-DATE
02 I-DATE
word O

Then at testing time, if you tokenize the dates like that, the NameFinder should still try to find the
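Rodrigo's point can be sketched in a few lines of Python. This is an illustration only, not OpenNLP code: the `bio_tag_dates` helper and its year/month/day pattern are invented for this example (a real corpus would carry gold annotations), but it shows how each token of a whitespace-separated date gets its own B-DATE/I-DATE label.

```python
import re

def bio_tag_dates(tokens):
    """Label each token of a whitespace-separated YYYY MM DD date.

    Hypothetical pattern for illustration: a 4-digit year followed by
    two 2-digit fields becomes B-DATE I-DATE I-DATE; everything else is O.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and re.fullmatch(r"\d{4}", tokens[i])
                and re.fullmatch(r"\d{2}", tokens[i + 1])
                and re.fullmatch(r"\d{2}", tokens[i + 2])):
            labels[i:i + 3] = ["B-DATE", "I-DATE", "I-DATE"]
            i += 3
        else:
            i += 1
    return labels

tokens = "word 2017 03 02 word".split()
print(list(zip(tokens, bio_tag_dates(tokens))))
```

The key property is the one Rodrigo stresses: whatever tokenization produces these training labels must also be applied at testing time, or the learned label sequence will never line up with the input.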

[GitHub] opennlp pull request #126: OPENNLP-989: Fix validation of CONT after START w...

2017-03-02 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/opennlp/pull/126

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
No, because you enter the “phone number” state after “call me at”. Let me annotate the states: call_OTHER me_OTHER at_OTHER +_START 39_IN 06_IN <…> 56_IN ._OTHER

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta
ok, yes it should be a good solution! So, do you think it is better to have "call me at + 39 06 12 23 45 56" (the telephone has 7 tokens) and add a custom feature on each token to let the classifier learn it as part of the telephone number? I did it during the tokenization because i am parsing very bad

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
Damiano, I am not an expert on the NameFinder, but I don’t think you want to use a custom tokenizer. You might consider using a custom feature generator. I know there is an XML definition. I might create an additional feature generator that looks for your regex patterns and adds a set of
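Daniel's suggestion can be sketched language-agnostically. The Python below is not OpenNLP's feature-generator API (in OpenNLP the equivalent would be a custom generator wired in through the XML descriptor he mentions); the `PATTERNS` table and the feature names are invented for illustration. The idea is that each token keeps its own identity, but tokens matching a regex get an extra feature the classifier can learn from.

```python
import re

# Hypothetical patterns standing in for real date/phone regexes.
PATTERNS = {
    "date_part": re.compile(r"\d{2,4}"),
    "phone_part": re.compile(r"\+?\d{2,}"),
}

def regex_features(tokens, index):
    """Extra features for tokens[index], mimicking what a custom
    feature generator could contribute alongside the default ones."""
    feats = []
    for name, pat in PATTERNS.items():
        if pat.fullmatch(tokens[index]):
            feats.append(f"re={name}")
    return feats

print(regex_features(["call", "me", "at", "39"], 3))
# -> ['re=date_part', 're=phone_part']
```

This keeps the tokenizer simple (one token per whitespace-separated piece) and moves the pattern knowledge into features, which is the design Daniel is recommending over a custom tokenizer.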

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta
Hello Daniel, yes, exactly, that is what I do. I am using regexes to find those patterns. Daniel, is this problem only related to the TokenNameFinderTrainer tool? If I train it via code, should I use a custom tokenizer? If not, I will follow your solution using underscores. Thanks Damiano

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
Hi Damiano, In general this is a difficult problem (making n-grams from unigrams). Have you considered using RegEx to find your dates/phone numbers etc. and protecting them from the tokenizer (i.e. replacing the whitespace with a printable (though possibly not alphanumeric) character like
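Daniel's protection trick can be sketched in a few lines. This is an illustration under assumptions: the `DATE` pattern and the underscore glue character are made up for the example (the thread later settles on underscores), and a plain whitespace split stands in for the real tokenizer.

```python
import re

# Hypothetical pattern: YYYY MM DD separated by single spaces.
DATE = re.compile(r"\b(\d{4}) (\d{2}) (\d{2})\b")

def protect(text, glue="_"):
    """Replace the internal whitespace of each matched date with a glue
    character, so a whitespace-based tokenizer keeps it as one token."""
    return DATE.sub(lambda m: glue.join(m.groups()), text)

s = "meet me on 2016 05 12 please"
print(protect(s))          # meet me on 2016_05_12 please
print(protect(s).split())  # ['meet', 'me', 'on', '2016_05_12', 'please']
```

The same substitution would have to run at testing time before tokenization, so training and test inputs see identical tokens.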

Tokenizer for NER training

2017-03-02 Thread Damiano Porta
Hello everybody, I have created a custom tokenizer that does not split specific "patterns" like emails, telephones, dates, etc.; I convert each of them into ONE single token. The other parts of the text are tokenized with the SimpleTokenizer. The problem is when I need to train a NER model. For example, if my
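Damiano's tokenizer idea can be sketched as a regex scanner that tries the special patterns first and falls back to whitespace splitting. Everything here is a stand-in: the email/phone/date regexes are simplified toy patterns, not his real ones, and the whitespace fallback only approximates SimpleTokenizer.

```python
import re

# Toy stand-ins for Damiano's email/phone/date regexes (verbose mode:
# whitespace outside character classes is ignored, '#' starts a comment).
SPECIAL = r"""
    [\w.+-]+@[\w-]+\.[\w.]+      # email
  | \+?\d[\d ]{6,}\d             # phone number with internal spaces
  | \d{4}[-/ ]\d{2}[-/ ]\d{2}    # date
"""
TOKEN = re.compile(rf"(?x)({SPECIAL})|(\S+)")

def tokenize(text):
    """Emit each special pattern as one single token; split everything
    else on whitespace (a rough stand-in for SimpleTokenizer)."""
    return [m.group(0) for m in TOKEN.finditer(text)]

print(tokenize("mail me at foo@bar.com on 2016 05 12"))
```

Note the trade-off the rest of the thread turns on: this works only if the identical tokenizer also runs at testing time, which is why the later replies steer toward feature generators or whitespace protection instead.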