To build a NER model, OpenNLP extracts features from the context, such as
the word's prefix and suffix, the previous and next words, and the prefixes
and suffixes of those neighboring words.
If you don't configure a feature generator, it applies the default:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api

Default feature generator:

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
         new AdaptiveFeatureGenerator[]{
           new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
           new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
           new OutcomePriorFeatureGenerator(),
           new PreviousMapFeatureGenerator(),
           new BigramNameFeatureGenerator(),
           new SentenceFeatureGenerator(true, false)
           });
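
To make the kinds of context features mentioned above more concrete, here is
a toy sketch in plain Java. It is not OpenNLP's implementation: the class
name, the feature name prefixes ("w=", "pre=", "p1w=", ...) and the 3-character
prefix/suffix length are all assumptions made for illustration. It only shows
the general idea of turning a token and a window of 2 neighbors on each side
into string features.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT OpenNLP code. All feature names are made up.
class ContextFeatureSketch {

    // Build string features for tokens[index]: the token itself, its prefix
    // and suffix, a rough token class, and up to `window` neighbors per side.
    // Tokens are assumed to be non-empty strings.
    static List<String> features(String[] tokens, int index, int window) {
        List<String> feats = new ArrayList<>();
        String tok = tokens[index];
        feats.add("w=" + tok.toLowerCase());

        // Prefix and suffix of at most 3 characters (arbitrary choice).
        int n = Math.min(3, tok.length());
        feats.add("pre=" + tok.substring(0, n));
        feats.add("suf=" + tok.substring(tok.length() - n));

        // Rough token class, loosely in the spirit of a token-class feature.
        feats.add("wc=" + shape(tok));

        // Previous and next tokens inside the window, skipping sentence edges.
        for (int d = 1; d <= window; d++) {
            if (index - d >= 0) {
                feats.add("p" + d + "w=" + tokens[index - d].toLowerCase());
            }
            if (index + d < tokens.length) {
                feats.add("n" + d + "w=" + tokens[index + d].toLowerCase());
            }
        }
        return feats;
    }

    // Classify a token as numeric, initial-capital, lowercase, or other.
    static String shape(String tok) {
        if (tok.chars().allMatch(Character::isDigit)) return "num";
        if (Character.isUpperCase(tok.charAt(0))) return "ic";
        if (tok.chars().allMatch(Character::isLowerCase)) return "lc";
        return "other";
    }

    public static void main(String[] args) {
        String[] sentence = {"Mr", ".", "Vinken", "is", "chairman"};
        // Print the features generated for the token "Vinken".
        System.out.println(features(sentence, 2, 2));
    }
}
```

In the real library the feature generators listed above produce the features;
the sketch is only meant to show why neighboring words and word shape can carry
useful signal for the classifier.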


These default features should work for most cases (especially for English),
but they can of course be extended. If you do so, your model will take the new
features into account. So yes, you are putting the features in your model.

Configuring custom features is not easy. I would start with the default,
run 10-fold cross-validation, and take note of its effectiveness. Then
change or add one feature, evaluate again, and take notes. Sometimes a feature
that we are sure would help can destroy the model's effectiveness.
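
The splitting step of 10-fold cross-validation can be sketched in plain Java
as below. This is only an illustration under stated assumptions: the class and
method names are hypothetical, the sentences are placeholders, and the
training and scoring steps are left as a comment. OpenNLP also ships its own
TokenNameFinderCrossValidator, which this sketch does not use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: shows how k folds are formed.
class KFoldSketch {

    // Distribute items over k folds round-robin.
    static <T> List<List<T>> folds(List<T> items, int k) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            result.get(i % k).add(items.get(i));
        }
        return result;
    }

    // Training data for one round: every fold except the held-out one.
    static <T> List<T> trainingSet(List<List<T>> folds, int testFold) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != testFold) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // Placeholder data standing in for annotated training sentences.
        List<String> sentences = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            sentences.add("sentence-" + i);
        }
        List<List<String>> folds = folds(sentences, 10);
        for (int round = 0; round < folds.size(); round++) {
            List<String> train = trainingSet(folds, round);
            List<String> test = folds.get(round);
            // Here one would train a model on `train`, evaluate it on `test`,
            // and record precision, recall and F-measure for this round.
            System.out.println("round " + round + ": train=" + train.size()
                    + " test=" + test.size());
        }
    }
}
```

Keeping the folds fixed between runs makes the notes comparable: if only the
feature configuration changes, a difference in the averaged scores is more
likely to come from the feature than from a different split.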

Regards
William


2016-06-29 7:00 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:

> Thank you William! Really appreciated!
>
> I only do not get one point: when you said "You could increment your model
> using Custom Feature Generators", does it mean that I can "put" these
> features inside ONE *.bin* file (model) that implements different things,
> or is the name finder one thing and those feature generators another?
>
> Thank you in advance for the clarification.
>
> 2016-06-29 1:23 GMT+02:00 William Colen <william.co...@gmail.com>:
>
> > Not exactly. You would create a new NER model to replace yours.
> >
> > In this approach you would need a corpus like this:
> >
> > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > board as a nonexecutive director Nov. 29 .
> > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > retiring , she was a board member for 5 years .
> >
> >
> > I am not an English native speaker, so I am not sure if the example is
> > clear enough. I tried to use Jessie as a neutral name and "she" as
> > disambiguation.
> >
> > With a big enough corpus, maybe you could create a model that outputs
> > both classes, personMale and personFemale. To train a model you can follow
> >
> > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> >
> > Let's say your results are not good enough. You could increment your
> > model using Custom Feature Generators (
> > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > and
> > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > ).
> >
> > One of the implemented feature generators can take a dictionary (
> > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > ).
> > You can also implement other convenient feature generators, for instance
> > one based on regex.
> >
> > Again, it is just a wild guess of how to implement it. I don't know if it
> > would perform well. I was only thinking how to implement a gender ML
> > model that uses the surrounding context.
> >
> > Hope I could clarify.
> >
> > William
> >
> > 2016-06-28 19:15 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> >
> > > Hi William,
> > > Ok, so you are talking about a kind of pipe where we execute:
> > >
> > > 1. NER (personM for example)
> > > 2. Regex (filter to reduce false positives)
> > > 3. Plain dictionary (filter as above) ?
> > >
> > > Yes, we can split our model in two for M and F, it is not a big problem,
> > > we have a database grouped by gender.
> > >
> > > I only have a doubt regarding the use of a dictionary. Because if we
> > > use a dictionary to create the model, we could only use it to detect
> > > names without using NER. No?
> > >
> > >
> > >
> > > 2016-06-29 0:10 GMT+02:00 William Colen <william.co...@gmail.com>:
> > >
> > > > Do you plan to use the surrounding context? If yes, maybe you could
> > > > try to split NER in two categories: PersonM and PersonF. Just an idea,
> > > > never read or tried anything like it. You would need a training corpus
> > > > with these classes.
> > > >
> > > > You could add both the plain dictionary and the regex as NER features
> > > > as well and check how it improves.
> > > >
> > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <damianopo...@gmail.com>:
> > > >
> > > > > Hello everybody,
> > > > >
> > > > > we built a NER model to find persons (name) inside our documents.
> > > > > We are looking for the best approach to understand if the name is
> > > > > male/female.
> > > > >
> > > > > Possible solutions:
> > > > > - Plain dictionary?
> > > > > - Regex to check the initial and/letters of the name?
> > > > > - Classifier? (naive bayes? Maxent?)
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>
