Ok! Thanks
2017-03-02 23:53 GMT+01:00 Rodrigo Agerri :
Hello,
This is what I meant in my first example. If you are annotating tokens (or
already have them annotated) in a corpus with the BIO format, then as long
as you annotate each token of the date with the NE class you will be fine.
This holds as long as you use the same tokenization at testing time.
Cheers,
Hi Rodrigo, thanks for your message.
My problem is that the dates do not follow a consistent format. You said:
2016/05/12 B-DATE
14/05/2012 B-DATE
15-02-2016 B-DATE
These dates cause no problems; the problem comes when I have:
2016 05 12
14 05 2012
15 02 2016
(with a whitespace separator)
If I have
Hi Damiano,
Maybe I am not understanding your question, but if you just give the
NameFinder tokenized annotated data that should be fine:
word O
2017 B-DATE
03 I-DATE
02 I-DATE
word O
Then at testing time, if you tokenize the dates like that, the NameFinder
should still try to find the sequences
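To make the annotation above concrete, here is a minimal sketch of how BIO tags could be assigned to the tokens of a whitespace-separated date. It is not NameFinder code: the helper name `bio_tags` and the span format (token indices, end exclusive) are assumptions made only for illustration.

```python
def bio_tags(tokens, spans):
    """Label tokens with BIO tags.

    spans is a list of (start, end, label) token-index triples, end exclusive.
    This helper is hypothetical and only illustrates the tagging scheme.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # every following token of the entity
    return tags


tokens = ["word", "2017", "03", "02", "word"]
for token, tag in zip(tokens, bio_tags(tokens, [(1, 4, "DATE")])):
    print(token, tag)
```

The point is only that each token of the date gets its own tag, so the date does not need to be a single token as long as training and testing use the same tokenization.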
No, because you enter the “phone number” state after “call me at”. Let me
annotate the state:
call_OTHER me_OTHER at_OTHER +_START 39_IN 06_IN <…> 56_IN ._OTHER
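As a rough illustration of the state annotation above, the sketch below reads a `token_STATE` sequence and collects the token spans that begin with START and continue with IN. The function name `decode_spans` is hypothetical, and the sample input is a shortened version of the line above with the elided tokens simply left out.

```python
def decode_spans(tagged):
    """Collect (start, end) token spans from START/IN/OTHER states, end exclusive."""
    spans = []
    start = None
    for i, item in enumerate(tagged):
        # Split on the last underscore so tokens containing "_" still work.
        _token, _, state = item.rpartition("_")
        if state == "START":
            if start is not None:          # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif state == "IN" and start is not None:
            continue                       # still inside the current entity
        else:
            if start is not None:          # OTHER (or a stray IN) closes the entity
                spans.append((start, i))
                start = None
    if start is not None:                  # entity runs to the end of the input
        spans.append((start, len(tagged)))
    return spans


sample = "call_OTHER me_OTHER at_OTHER +_START 39_IN 06_IN 56_IN ._OTHER".split()
print(decode_spans(sample))
```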
On 3/2/17, 12:47 PM, "Damiano Porta" wrote:
ok, yes it should be a good solution!
So, do you think it is better to have "call me at + 39 06 12 23 45 56" (the
telephone has 7 tokens) and add a custom feature on each token so that the
classifier learns it as part of the telephone number?
I did it during the tokenization because I am parsing very bad
Damiano,
I am not an expert on the NameFinder, but I don’t think you want to use a
custom tokenizer. You might consider using a custom feature generator. I know
there is an XML definition. I might create an additional feature generator that
looks for your regex patterns and adds a set of f
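The idea of a regex-driven feature generator can be sketched as follows. This is not the OpenNLP feature generator API; `PATTERNS`, `pattern_features`, and the `regex=DATE` feature string are all hypothetical names, and the single date pattern is only an example.

```python
import re

# Hypothetical patterns for illustration; real data will need broader ones.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,4} \d{1,2} \d{1,4}\b"),
}


def pattern_features(tokens):
    """Return one feature list per token, flagging tokens inside a regex match."""
    text = " ".join(tokens)
    # Character offset at which each token starts in the joined text.
    starts = []
    pos = 0
    for tok in tokens:
        starts.append(pos)
        pos += len(tok) + 1
    features = [[] for _ in tokens]
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            for i, tok in enumerate(tokens):
                # A token gets the feature only if it lies fully inside the match.
                if starts[i] >= match.start() and starts[i] + len(tok) <= match.end():
                    features[i].append(f"regex={name}")
    return features


print(pattern_features(["due", "14", "05", "2012", "."]))
```

With this approach the tokens stay separate, and the classifier sees an extra hint on each token that a pattern covers.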
Hello Daniel, yes exactly, I do that. I am using regexes to find those
patterns.
Daniel, is this problem only related to the TokenNameFinderTrainer tool? If I
train it via code, should I use a custom tokenizer?
If not, I will follow your solution using underscores.
Thanks
Damiano
2017-03-02 18:00 GMT+01:
Hi Damiano,
In general this is a difficult problem (making n-grams from unigrams). Have
you considered using RegEx to find your dates/phone numbers etc. and protecting
them from the tokenizer (i.e. replacing the white space with printable (though
possibly not an alphanumeric character like a
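A minimal sketch of the protection step, assuming an underscore as the placeholder character and a single illustrative date pattern (the names `DATE`, `protect`, and `tokenize` are hypothetical, and a plain whitespace split stands in for the real tokenizer):

```python
import re

# Illustrative pattern: three whitespace-separated groups of digits.
DATE = re.compile(r"\b\d{1,4} \d{1,2} \d{1,4}\b")


def protect(text, placeholder="_"):
    """Replace the spaces inside each match so a tokenizer keeps it whole."""
    return DATE.sub(lambda m: m.group(0).replace(" ", placeholder), text)


def tokenize(text):
    """Stand-in tokenizer: protect the patterns, then split on whitespace."""
    return protect(text).split()


print(protect("paid on 14 05 2012 by card"))   # paid on 14_05_2012 by card
print(tokenize("paid on 14 05 2012 by card"))
```

The same replacement has to be applied at training and at testing time, otherwise the model sees different tokens in the two settings.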
Hello everybody,
I have created a custom tokenizer that does not split specific "patterns"
such as emails, telephone numbers, dates, etc. I convert each of them into ONE single token.
The other parts of text are tokenized with the
SimpleTokenizer.
The problem is when I need to train a NER model. For example if my