Re: Model to detect the gender

Damiano Porta Mon, 04 Jul 2016 05:28:02 -0700

Hi Mondher,
you gave me really good advice! Thank you!
Let me recap a little bit.


Basically I need a dictionary to tell whether a name can be male, female, or
both. If I am sure it is male or female, I will not go any further; if
I find an entity that can be both, I will do the classification task.
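
A minimal sketch of that two-step decision, assuming a small in-memory dictionary. The class name `GenderDictionary`, the enum values and the sample entries are hypothetical, and treating names that are missing from the dictionary as "needs classification" is an assumption rather than something stated in this thread:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: step 1 of the hybrid approach (plain dictionary lookup).
class GenderDictionary {
    enum Gender { MALE, FEMALE, BOTH, UNKNOWN }

    private final Map<String, Gender> entries = new HashMap<>();

    void add(String name, Gender gender) {
        entries.put(name.toLowerCase(Locale.ROOT), gender);
    }

    // Returns the dictionary gender, or UNKNOWN when the name is not listed.
    Gender lookup(String name) {
        return entries.getOrDefault(name.toLowerCase(Locale.ROOT), Gender.UNKNOWN);
    }

    // Only ambiguous or unlisted names are passed on to the context classifier.
    boolean needsClassifier(String name) {
        Gender g = lookup(name);
        return g == Gender.BOTH || g == Gender.UNKNOWN;
    }

    public static void main(String[] args) {
        GenderDictionary dict = new GenderDictionary();
        dict.add("Agatha", Gender.FEMALE);
        dict.add("John", Gender.MALE);
        dict.add("Smith", Gender.BOTH);
        for (String name : new String[] {"John", "Smith"}) {
            System.out.println(name + " -> " + dict.lookup(name)
                    + ", classify=" + dict.needsClassifier(name));
        }
    }
}
```

With these sample entries, "John" is resolved by the dictionary alone, while "Smith" is handed to the classification step.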

The classifier is built from a list of features that represent the "state"
of specific surrounding tokens.
The classification is done via the Doccat Trainer:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.doccat
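
A minimal sketch of how such context features could be computed for one name occurrence, assuming the text is already tokenized. The class `ContextFeatures`, its word lists and the distance definition (token index difference) are hypothetical; it covers only the title flags and the pronouns to the right of the name, and the left side would work the same way in the other direction:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical feature extractor for one name occurrence in a tokenized text.
class ContextFeatures {
    private static final Set<String> MALE_TITLES =
            new HashSet<>(Arrays.asList("mr", "mr."));
    private static final Set<String> FEMALE_TITLES =
            new HashSet<>(Arrays.asList("mrs", "mrs.", "ms", "ms.", "miss"));
    private static final Set<String> MALE_PRONOUNS =
            new HashSet<>(Arrays.asList("he", "him"));
    private static final Set<String> FEMALE_PRONOUNS =
            new HashSet<>(Arrays.asList("she", "her"));
    private static final Set<String> OTHER_PRONOUNS =
            new HashSet<>(Arrays.asList("i", "me", "you", "we", "us", "they", "them", "it"));

    // Returns MALE, FEMALE or UNCERTAIN for a listed pronoun, otherwise null.
    static String pronounGender(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (MALE_PRONOUNS.contains(t)) return "MALE";
        if (FEMALE_PRONOUNS.contains(t)) return "FEMALE";
        if (OTHER_PRONOUNS.contains(t)) return "UNCERTAIN";
        return null;
    }

    // Title flags, then gender and distance of up to maxPronouns pronouns to the right.
    static List<String> extract(String[] tokens, int nameIndex, int maxPronouns) {
        List<String> features = new ArrayList<>();
        String previous = nameIndex > 0 ? tokens[nameIndex - 1].toLowerCase(Locale.ROOT) : "";
        features.add(MALE_TITLES.contains(previous) ? "True" : "False");
        features.add(FEMALE_TITLES.contains(previous) ? "True" : "False");

        int found = 0;
        for (int i = nameIndex + 1; i < tokens.length && found < maxPronouns; i++) {
            String gender = pronounGender(tokens[i]);
            if (gender != null) {
                features.add(gender);
                features.add(String.valueOf(i - nameIndex)); // distance in tokens
                found++;
            }
        }
        // Pad missing pronoun slots with EMPTY and distance 0.
        for (; found < maxPronouns; found++) {
            features.add("EMPTY");
            features.add("0");
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Yes", "I", "met", "Mrs.", "Smith", "I", "asked", "her", "her", "opinion"};
        System.out.println(extract(tokens, 4, 3));
    }
}
```

For the "Mrs. Smith" sentence quoted further down (punctuation dropped from the token array), this prints `[False, True, UNCERTAIN, 1, FEMALE, 3, FEMALE, 4]`. Note that the word list treats every "her" as a personal pronoun, including possessive uses such as "her opinion".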

Now I have to create a .train file to train my model with the MALE and FEMALE
classes.

Should I build the model from lines like these:

FEMALE  False   True   UNCERTAIN   1   FEMALE   3   FEMALE   4   UNCERTAIN   2   EMPTY   0   EMPTY   0
FEMALE  False   True   UNCERTAIN   1   FEMALE   1   FEMALE   3   UNCERTAIN   2   EMPTY   0   EMPTY   0
FEMALE  False   True   UNCERTAIN   1   FEMALE   2   FEMALE   1   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE  True   False   UNCERTAIN   1   MALE   3   MALE   4   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE  True   False   UNCERTAIN   1   MALE   1   MALE   3   UNCERTAIN   2   EMPTY   0   EMPTY   0
MALE  True   False   UNCERTAIN   1   MALE   2   MALE   1   UNCERTAIN   2   EMPTY   0   EMPTY   0

Is this the right way?
Obviously this is dummy data; I just repeated it. I am asking in order to
understand how to add those features to the training data of the classifier.
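
One point worth checking with this layout: the Doccat training format is one sample per line, with the category as the first token followed by whitespace-separated tokens, and the default bag-of-words feature generator does not keep token positions. A bare `FEMALE` or `1` would therefore look the same in every column. A minimal sketch of one possible workaround, assuming each value is prefixed with its feature name (`F1` to `F14`, following the numbering in the quoted message below); the class `TrainLineBuilder` is hypothetical and not part of OpenNLP:

```java
import java.util.StringJoiner;

// Hypothetical helper, not an OpenNLP class: builds one Doccat training line.
class TrainLineBuilder {
    // Output shape: "<category> F1=<v1> F2=<v2> ..." on a single line.
    static String build(String category, String... values) {
        StringJoiner line = new StringJoiner(" ");
        line.add(category);
        for (int i = 0; i < values.length; i++) {
            // The prefix keeps "FEMALE" in slot 5 distinct from "FEMALE" in slot 7.
            line.add("F" + (i + 1) + "=" + values[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("FEMALE",
                "False", "True", "UNCERTAIN", "1", "FEMALE", "3", "FEMALE", "4",
                "UNCERTAIN", "2", "EMPTY", "0", "EMPTY", "0"));
    }
}
```

Each distance then becomes a categorical token such as `F4=1`; whether that works better than bucketing the distances (near/far, for example) is something only an evaluation on real data can show.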

Thank you very much! I am looking forward to your reply.
Damiano



2016-07-01 15:05 GMT+02:00 Mondher Bouazizi <[email protected]>:

> Hi,
>
> Sorry for my late reply. I didn't fully understand your last email, but here
> is what I meant:
>
> Given a simple dictionary with the following columns:
>
> Name      Type     Gender
> Agatha    First    F
> John      First    M
> Smith     Both     B
>
> where:
> - "First" refers to a first name, "Last" (not in the example) refers to a
> last name, and "Both" means it can be either.
> - "F" refers to female, "M" refers to male, and "B" refers to both
> genders.
>
> and given the following two sentences:
>
> 1. "It was nice meeting you John. I hope we meet again soon."
>
> 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
> she knows something"
>
> In the first example, when you check the dictionary, the name "John" is
> a male name, so there is no need to go any further.
> However, in the second example, the name "Smith", which is a family name in
> our case, fits both males and females. Therefore, we need to
> extract features from the surrounding context and perform a classification
> task.
> Here are some of the features I think would be interesting to use:
>
> . Presence of a male title (e.g. "Mr.") before the word {True, False}
> . Presence of a female title (e.g. "Mrs.") before the word {True, False}
>
> . Gender of the first personal pronoun (subject or object form) to the
>   right of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the right
>   (in words). Values=NUMERIC
> . Gender of the second personal pronoun to the right of the name.
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the right
>   (in words). Values=NUMERIC
> . Gender of the third personal pronoun to the right of the name.
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the right
>   (in words). Values=NUMERIC
>
> . Gender of the first personal pronoun (subject or object form) to the
>   left of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the first personal pronoun to the left
>   (in words). Values=NUMERIC
> . Gender of the second personal pronoun to the left of the name.
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the second personal pronoun to the left
>   (in words). Values=NUMERIC
> . Gender of the third personal pronoun to the left of the name.
>   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> . Distance between the name and the third personal pronoun to the left
>   (in words). Values=NUMERIC
>
> In the second example, here are the values you have for your features:
>
> F1 = False
> F2 = True
> F3 = UNCERTAIN
> F4 = 1
> F5 = FEMALE
> F6 = 3
> F7 = FEMALE
> F8 = 4
> F9 = UNCERTAIN
> F10 = 2
> F11 = EMPTY
> F12 = 0
> F13 = EMPTY
> F14 = 0
>
> Of course the choice of features depends on the type of data, and these
> features might not work well for some texts, such as ones
> collected from Twitter.
>
> I hope this helps you.
>
> Best regards
>
> Mondher
>
>
> On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <[email protected]>
> wrote:
>
> > Hi Mondher,
> > could you give me a raw example to understand how I should train the
> > classifier model?
> >
> > Thank you in advance!
> > Damiano
> >
> >
> > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <[email protected]>:
> >
> > > Hi,
> > >
> > > I would recommend a hybrid approach where, as a first step, you use a
> > > plain dictionary and then perform the classification only if needed.
> > >
> > > It's straightforward, but I think it would perform better than
> > > just performing a classification task.
> > >
> > > In the first step you use a dictionary of names along with an attribute
> > > specifying whether the name fits males, females or both. If the
> > > name fits males or females exclusively, there is no need to go any
> > > further.
> > >
> > > If the name fits both genders, or is a family name etc., a second
> > > step is needed where you extract features from the context (surrounding
> > > words, etc.) and perform a classification task using any machine learning
> > > algorithm.
> > >
> > > Another way would be to use the dictionary information itself (whether
> > > the name fits males, females or both) as a feature when you perform the
> > > classification.
> > >
> > > Best regards,
> > >
> > > Mondher
> > >
> > > I am not sure
> > >
> > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <[email protected]> wrote:
> > >
> > > > Awesome! Thank you so much William!
> > > >
> > > > 2016-06-29 13:36 GMT+02:00 William Colen <[email protected]>:
> > > >
> > > > > To create a NER model, OpenNLP extracts features from the context,
> > > > > things such as: word prefix and suffix, next word, previous word,
> > > > > previous word prefix and suffix, next word prefix and suffix, etc.
> > > > > When you don't configure the feature generator, it will apply the
> > > > > default:
> > > > >
> > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > > > >
> > > > > Default feature generator:
> > > > >
> > > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> > > > >          new AdaptiveFeatureGenerator[]{
> > > > >            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> > > > >            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> > > > >            new OutcomePriorFeatureGenerator(),
> > > > >            new PreviousMapFeatureGenerator(),
> > > > >            new BigramNameFeatureGenerator(),
> > > > >            new SentenceFeatureGenerator(true, false)
> > > > >            });
> > > > >
> > > > >
> > > > > These default features should work for most cases (especially
> > > > > English), but they can of course be extended. If you do so, your model
> > > > > will take the new features into account. So yes, you are putting the
> > > > > features in your model.
> > > > >
> > > > > Configuring custom features is not easy. I would start with the
> > > > > default, use 10-fold cross-validation and take note of its
> > > > > effectiveness. Then change/add a feature, evaluate and take notes.
> > > > > Sometimes a feature that we are sure would help can destroy the
> > > > > model's effectiveness.
> > > > >
> > > > > Regards
> > > > > William
> > > > >
> > > > >
> > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <[email protected]>:
> > > > >
> > > > > > Thank you William! Really appreciated!
> > > > > >
> > > > > > There is only one point I do not get. When you said "You could
> > > > > > increment your model using Custom Feature Generators", does it mean
> > > > > > that I can "put" these features inside ONE .bin file (model) that
> > > > > > implements different things, or is the name finder one thing and
> > > > > > those feature generators another?
> > > > > >
> > > > > > Thank you in advance for the clarification.
> > > > > >
> > > > > > 2016-06-29 1:23 GMT+02:00 William Colen <[email protected]>:
> > > > > >
> > > > > > > Not exactly. You would create a new NER model to replace yours.
> > > > > > >
> > > > > > > In this approach you would need a corpus like this:
> > > > > > >
> > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> > > > > > >
> > > > > > >
> > > > > > > I am not a native English speaker, so I am not sure if the example
> > > > > > > is clear enough. I tried to use Jessie as a neutral name and "she"
> > > > > > > as disambiguation.
> > > > > > >
> > > > > > > With a big enough corpus, maybe you could create a model that
> > > > > > > outputs both classes, personMale and personFemale. To train a
> > > > > > > model you can follow
> > > > > > >
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > > > >
> > > > > > > Let's say your results are not good enough. You could increment
> > > > > > > your model using Custom Feature Generators (
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > > > and
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > > > ).
> > > > > > >
> > > > > > > One of the implemented feature generators can take a dictionary (
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > > > ).
> > > > > > > You can also implement other convenient feature generators, for
> > > > > > > instance one based on regex.
> > > > > > >
> > > > > > > Again, it is just a wild guess at how to implement it. I don't
> > > > > > > know if it would perform well. I was only thinking about how to
> > > > > > > implement a gender ML model that uses the surrounding context.
> > > > > > >
> > > > > > > Hope I could clarify.
> > > > > > >
> > > > > > > William
> > > > > > >
> > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <[email protected]>:
> > > > > > >
> > > > > > > > Hi William,
> > > > > > > > Ok, so you are talking about a kind of pipe where we execute:
> > > > > > > >
> > > > > > > > 1. NER (personM for example)
> > > > > > > > 2. Regex (filter to reduce false positives)
> > > > > > > > 3. Plain dictionary (filter as above) ?
> > > > > > > >
> > > > > > > > Yes, we can split our model in two for M and F. It is not a big
> > > > > > > > problem; we have a database grouped by gender.
> > > > > > > >
> > > > > > > > I only have a doubt regarding the use of a dictionary. If we use
> > > > > > > > a dictionary to create the model, couldn't we just use the
> > > > > > > > dictionary to detect names, without using NER?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <[email protected]>:
> > > > > > > >
> > > > > > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > > > > > > > could try to split NER into two categories: PersonM and PersonF.
> > > > > > > > > Just an idea; I have never read or tried anything like it. You
> > > > > > > > > would need a training corpus with these classes.
> > > > > > > > >
> > > > > > > > > You could add both the plain dictionary and the regex as NER
> > > > > > > > > features as well and check how it improves.
> > > > > > > > >
> > > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <[email protected]>:
> > > > > > > > >
> > > > > > > > > > Hello everybody,
> > > > > > > > > >
> > > > > > > > > > we built a NER model to find persons (names) inside our
> > > > > > > > > > documents.
> > > > > > > > > > We are looking for the best approach to determine whether the
> > > > > > > > > > name is male or female.
> > > > > > > > > >
> > > > > > > > > > Possible solutions:
> > > > > > > > > > - Plain dictionary?
> > > > > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
