subject:"Tokenizer for NER training"

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta

Ok! Thanks

2017-03-02 23:53 GMT+01:00 Rodrigo Agerri:

> Hello,
>
> This is what I meant in my first example. If you are annotating tokens (or
> already have them annotated) in a corpus with the BIO format, then as long
> as you annotate each token of the date with the NE class you will be fine.

Re: Tokenizer for NER training

2017-03-02 Thread Rodrigo Agerri

Hello,

This is what I meant in my first example. If you are annotating tokens (or already have them annotated) in a corpus with the BIO format, then as long as you annotate each token of the date with the NE class you will be fine, provided that at testing time you use the same tokenization.

Cheers,

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta

Hi Rodrigo,

thanks for your message. My problem is that the dates do not follow a correct format. You said:

2016/05/12 B-DATE
14/05/2012 B-DATE
15-02-2016 B-DATE

These dates have no problems. The problem comes when I have:

2016 05 12
14 05 2012
15 02 2016

(with a whitespace separator)

If I have

Re: Tokenizer for NER training

2017-03-02 Thread Rodrigo Agerri

Hi Damiano,

Maybe I am not understanding your question, but if you just give the NameFinder tokenized annotated data that should be fine:

word O
2017 B-DATE
03 I-DATE
02 I-DATE
word O

Then at testing time, if you tokenize the dates like that, the NameFinder should still try to find the sequences
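The token-per-line annotation above can be produced mechanically once the token range of each date is known. The following is a minimal sketch in Python, not OpenNLP code; the function name `bio_tags` and the way spans are passed in (token index ranges, end exclusive) are assumptions made for illustration.

```python
def bio_tags(tokens, spans, label="DATE"):
    """Return one BIO tag per token.

    spans is a list of (start, end) token index ranges, end exclusive.
    This helper is hypothetical and only illustrates the tagging scheme.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "word 2017 03 02 word".split()
tags = bio_tags(tokens, [(1, 4)])           # tokens 1..3 form the date
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The same tokenization has to be applied at testing time, otherwise the token boundaries seen by the model will not match the ones it was trained on.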

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]

No, because you enter the “phone number” state after “call me at”. Let me annotate the state:

call_OTHER me_OTHER at_OTHER +_START 39_IN 06_IN <…> 56_IN ._OTHER

On 3/2/17, 12:47 PM, "Damiano Porta" wrote:

ok, yes it should be a good solution! So, do you think is better to have "c
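Reading the state annotation above in the other direction, a START token followed by a run of IN tokens marks one entity. A small Python sketch of that decoding step follows; the function name `spans_from_states` is hypothetical, and the example uses the seven-token telephone number quoted elsewhere in this thread rather than the elided one.

```python
def spans_from_states(states):
    """Collect (start, end) token ranges, end exclusive, from START/IN/OTHER states."""
    spans = []
    start = None
    for i, state in enumerate(states):
        if state == "START":
            if start is not None:           # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif state == "IN" and start is not None:
            continue                        # still inside the current entity
        else:
            if start is not None:           # OTHER closes the current entity
                spans.append((start, i))
            start = None
    if start is not None:                   # entity runs to the end of the input
        spans.append((start, len(states)))
    return spans


tokens = "call me at + 39 06 12 23 45 56 .".split()
states = ["OTHER"] * 3 + ["START"] + ["IN"] * 6 + ["OTHER"]
spans = spans_from_states(states)
phone = " ".join(tokens[spans[0][0]:spans[0][1]])
print(spans, phone)
```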

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta

ok, yes it should be a good solution! So, do you think it is better to have "call me at + 39 06 12 23 45 56" (the telephone has 7 tokens) and add a custom feature on each token to let the classifier learn it as part of the telephone number? I did it during the tokenization because I am parsing very bad

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]

Damiano,

I am not an expert on the NameFinder, but I don’t think you want to use a custom tokenizer. You might consider using a custom feature generator. I know there is an xml definition. I might create an additional feature generator that looks for your regex patterns and adds a set of f
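The idea of a feature generator driven by regex patterns can be sketched independently of any library. The Python below is an illustration only and does not use or reproduce the OpenNLP feature generator API; the names `PATTERNS` and `pattern_features`, the feature strings, and both regular expressions are assumptions.

```python
import re

# Hypothetical patterns over space-joined tokens; real data would need its own.
PATTERNS = {
    "date": re.compile(r"\b(?:\d{4} \d{2} \d{2}|\d{2} \d{2} \d{4})\b"),
    "phone": re.compile(r"\+ ?\d{2}(?: \d{2}){5}"),
}


def pattern_features(tokens):
    """Return, for each token, the names of the patterns whose match covers it."""
    text = " ".join(tokens)
    offsets = []                            # character range of each token in text
    pos = 0
    for token in tokens:
        offsets.append((pos, pos + len(token)))
        pos += len(token) + 1               # +1 for the joining space
    features = [[] for _ in tokens]
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            for i, (start, end) in enumerate(offsets):
                if start >= match.start() and end <= match.end():
                    features[i].append(f"in_{name}")
    return features


tokens = "call me at + 39 06 12 23 45 56 .".split()
features = pattern_features(tokens)
for token, feats in zip(tokens, features):
    print(token, feats)
```

With this approach the tokenizer stays unchanged and each token of the number simply carries an extra feature that the classifier can use.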

Re: Tokenizer for NER training

2017-03-02 Thread Damiano Porta

Hello Daniel,

yes exactly, I do that. I am using regexes to find those patterns. Daniel, is this problem only related to the TokenNameFinderTrainer tool? If I train it via code, should I use a custom tokenizer? If not, I will follow your solution using underscores.

Thanks
Damiano

2017-03-02 18:00 GMT+01:

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]

Hi Damiano,

In general this is a difficult problem (making n-grams from unigrams). Have you considered using RegEx to find your dates/phone numbers etc. and protecting them from the tokenizer (i.e. replacing the white space with a printable (though possibly not alphanumeric) character like a
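A rough Python sketch of the protection step described above: find the patterns with a regex and replace their internal whitespace before tokenizing. The underscore filler follows the reply elsewhere in this thread; the pattern `PROTECTED`, the function `protect`, and the use of `str.split()` as a stand-in whitespace tokenizer are assumptions. Whether a particular tokenizer keeps the joined token intact depends on that tokenizer and should be checked.

```python
import re

# Hypothetical pattern for whitespace-separated dates such as "2016 05 12".
PROTECTED = re.compile(r"\b(?:\d{4} \d{2} \d{2}|\d{2} \d{2} \d{4})\b")


def protect(text, filler="_"):
    """Replace spaces inside each matched pattern with the filler character."""
    return PROTECTED.sub(lambda m: m.group(0).replace(" ", filler), text)


protected = protect("paid on 2016 05 12 and 14 05 2012")
tokens = protected.split()                  # stand-in for a whitespace tokenizer
print(protected)
print(tokens)
```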

Tokenizer for NER training

2017-03-02 Thread Damiano Porta

Hello everybody,

I have created a custom tokenizer that does not split specific "patterns" like emails, telephones, dates etc. I convert them into ONE single token. The other parts of the text are tokenized with the SimpleTokenizer. The problem is when I need to train a NER model. For example if my

10 matches
