No, because you enter the “phone number” state after “call me at”.  Let me
annotate the states:

call_OTHER me_OTHER at_OTHER +_START 39_IN 06_IN  <…> 56_IN ._OTHER
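
For reference, in OpenNLP’s name finder training format that sentence would look
roughly like this, so the trainer derives exactly the START/IN/OTHER outcomes
above (the “phone” type name is just a placeholder):

    call me at <START:phone> + 39 06 12 23 45 56 <END> .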

On 3/2/17, 12:47 PM, "Damiano Porta" <[email protected]> wrote:

    ok, yes it should be a good solution!
    
    So, do you think it is better to have "call me at + 39 06 12 23 45 56" (the
    telephone number has 7 tokens) and add a custom feature on each token so the
    classifier learns it as part of the telephone number?
    I did it during tokenization because I am parsing very messy documents, so
    there are many telephone formats (and many separators between the numbers
    too): . - / | \s
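
    Something like this hypothetical Java pattern is what I have in mind for
    the separators (only a sketch, the real patterns are messier):

        import java.util.regex.Pattern;

        public class PhonePatterns {
            // Optional "+", a country code, then digit groups separated by
            // space, dot, dash, slash, pipe or backslash.
            public static final Pattern PHONE = Pattern.compile(
                "\\+?\\s?\\d{1,3}(?:[\\s./\\-|\\\\]\\d{1,4}){2,8}");
        }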
    
    2017-03-02 18:38 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <[email protected]>:
    
    > Damiano,
    >
    >     I am not an expert on the NameFinder, but I don’t think you want to
    > use a custom tokenizer.  You might consider using a custom feature
    > generator instead.  I know there is an XML definition for them.  I might
    > create an additional feature generator that looks for your regex patterns
    > and adds a set of features to the feature list.  The nice thing about the
    > classifier is that you will catch things like “call me at 3011234567.”
    > even though your regex won’t match (if you look at the previous 4 words
    > to catch “call me”).
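    >
    > A minimal sketch of what such a feature generator could look like,
    > assuming OpenNLP’s AdaptiveFeatureGenerator interface (the pattern and
    > feature names below are only placeholders):
    >
    > import java.util.List;
    > import java.util.regex.Pattern;
    >
    > import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    >
    > public class PhoneRegexFeatureGenerator implements AdaptiveFeatureGenerator {
    >
    >   // Placeholder pattern for a token that could be part of a phone number.
    >   private static final Pattern PART = Pattern.compile("\\+|\\d+");
    >
    >   @Override
    >   public void createFeatures(List<String> features, String[] tokens,
    >       int index, String[] previousOutcomes) {
    >     if (PART.matcher(tokens[index]).matches()) {
    >       features.add("phonepart");
    >     }
    >     // Context feature: "call" somewhere in the previous 4 tokens.
    >     for (int i = Math.max(0, index - 4); i < index; i++) {
    >       if (tokens[i].equalsIgnoreCase("call")) {
    >         features.add("near_call");
    >       }
    >     }
    >   }
    >
    >   @Override
    >   public void updateAdaptiveData(String[] tokens, String[] outcomes) {
    >     // no adaptive data needed
    >   }
    >
    >   @Override
    >   public void clearAdaptiveData() {
    >     // no adaptive data needed
    >   }
    > }
    >
    > You would then plug it in when training, either from code or via the XML
    > definition I mentioned.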
    >
    >
    > Daniel
    >
    > On 3/2/17, 12:24 PM, "Damiano Porta" <[email protected]> wrote:
    >
    >     Hello Daniel, yes exactly, i do that. I am using regexes to find those
    >     patterns.
    >     Daniel, is this problem only related to the TokenNameFinderTrainer
    >     tool?  If I train it via code, should I use a custom tokenizer?
    >     If not, I will follow your solution using underscores.
    >
    >     Thanks
    >     Damiano
    >
    >     2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <[email protected]>:
    >
    >     > Hi Damiano,
    >     >    In general this is a difficult problem (making n-grams from
    >     > unigrams).  Have you considered using RegEx to find your dates/phone
    >     > numbers etc. and protecting them from the tokenizer (i.e. replacing
    >     > the white space with a printable, though possibly not alphanumeric,
    >     > character like an underscore)?
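    >     >
    >     > A rough sketch of that idea (the phone pattern is only an example;
    >     > I use a whitespace tokenizer here because the underscore trick only
    >     > helps if the tokenizer keeps the joined string in one piece):
    >     >
    >     > import java.util.regex.Matcher;
    >     > import java.util.regex.Pattern;
    >     >
    >     > import opennlp.tools.tokenize.WhitespaceTokenizer;
    >     >
    >     > public class ProtectPhones {
    >     >
    >     >   // Example pattern: optional "+", then digit groups separated by spaces.
    >     >   private static final Pattern PHONE =
    >     >       Pattern.compile("\\+?\\s?\\d{1,3}(?:\\s\\d{1,4}){2,8}");
    >     >
    >     >   public static String[] tokenize(String text) {
    >     >     // Replace the internal white space of every match with underscores
    >     >     // so the phone number survives tokenization as a single token.
    >     >     Matcher m = PHONE.matcher(text);
    >     >     StringBuffer sb = new StringBuffer();
    >     >     while (m.find()) {
    >     >       m.appendReplacement(sb,
    >     >           Matcher.quoteReplacement(m.group().replaceAll("\\s+", "_")));
    >     >     }
    >     >     m.appendTail(sb);
    >     >     return WhitespaceTokenizer.INSTANCE.tokenize(sb.toString());
    >     >   }
    >     > }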
    >     > Daniel
    >     >
    >     > On 3/2/17, 11:46 AM, "Damiano Porta" <[email protected]> wrote:
    >     >
    >     >     Hello everybody,
    >     >
    >     >     I have created a custom tokenizer that does not split specific
    >     >     "patterns" like emails, telephone numbers, dates etc. I convert
    >     >     them into ONE single token. The other parts of the text are
    >     >     tokenized with the SimpleTokenizer.
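    >     >
    >     >     Roughly like this (only a sketch, with a single example pattern
    >     >     for dates; the real tokenizer has more patterns):
    >     >
    >     >     import java.util.ArrayList;
    >     >     import java.util.List;
    >     >     import java.util.regex.Matcher;
    >     >     import java.util.regex.Pattern;
    >     >
    >     >     import opennlp.tools.tokenize.SimpleTokenizer;
    >     >
    >     >     public class PatternPreservingTokenizer {
    >     >
    >     >       // Example pattern for dates like "2017 03 02".
    >     >       private static final Pattern DATE =
    >     >           Pattern.compile("\\d{4} \\d{2} \\d{2}");
    >     >
    >     >       public static String[] tokenize(String text) {
    >     >         List<String> tokens = new ArrayList<>();
    >     >         Matcher m = DATE.matcher(text);
    >     >         int last = 0;
    >     >         while (m.find()) {
    >     >           // Text before the match goes through the SimpleTokenizer ...
    >     >           for (String t : SimpleTokenizer.INSTANCE.tokenize(
    >     >               text.substring(last, m.start()))) {
    >     >             tokens.add(t);
    >     >           }
    >     >           // ... and the whole match is kept as ONE token.
    >     >           tokens.add(m.group());
    >     >           last = m.end();
    >     >         }
    >     >         for (String t : SimpleTokenizer.INSTANCE.tokenize(text.substring(last))) {
    >     >           tokens.add(t);
    >     >         }
    >     >         return tokens.toArray(new String[0]);
    >     >       }
    >     >     }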
    >     >
    >     >     The problem is when I need to train a NER model. For example, if
    >     >     my data has dates like 2017 03 02, these will be converted into
    >     >     three tokens (whitespace tokenizer), and I must avoid that.
    >     >
    >     >     Can I specify the tokenizer using the TokenNameFinderTrainer
    >     >     tool?
    >     >
    >     >     Thanks
    >     >     Damiano
    >     >
    >     >
    >     >
    >
    >
    >
    
