Hi Mondher, could you give me a raw example to understand how I should train the classifier model?
Thank you in advance!
Damiano

2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <[email protected]>:

> Hi,
>
> I would recommend a hybrid approach where, in a first step, you use a plain
> dictionary and then perform the classification if needed.
>
> It's straightforward, but I think it would present better performances than
> just performing a classification task.
>
> In the first step you use a dictionary of names along with an attribute
> specifying whether the name fits for males, females or both. In case the
> name fits for males or females exclusively, then no need to go any further.
>
> If the name fits for both genders, or is a family name etc., a second step
> is needed where you extract features from the context (surrounding words,
> etc.) and perform a classification task using any machine learning
> algorithm.
>
> Another way would be using the information itself (whether the name fits
> for males, females or both) as a feature when you perform the
> classification.
>
> Best regards,
>
> Mondher
>
> I am not sure
>
> On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <[email protected]>
> wrote:
>
> > Awesome! Thank you so much William!
> >
> > 2016-06-29 13:36 GMT+02:00 William Colen <[email protected]>:
> >
> > > To create a NER model OpenNLP extracts features from the context, things
> > > such as: word prefix and suffix, next word, previous word, previous word
> > > prefix and suffix, next word prefix and suffix etc.
> > > When you don't configure the feature generator it will apply the default:
> > >
> > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > >
> > > Default feature generator:
> > >
> > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> > >     new AdaptiveFeatureGenerator[]{
> > >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> > >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> > >         new OutcomePriorFeatureGenerator(),
> > >         new PreviousMapFeatureGenerator(),
> > >         new BigramNameFeatureGenerator(),
> > >         new SentenceFeatureGenerator(true, false)
> > >     });
> > >
> > > These default features should work for most cases (especially English),
> > > but they of course can be incremented. If you do so, your model will take
> > > new features into account. So yes, you are putting the features in your
> > > model.
> > >
> > > To configure custom features is not easy. I would start with the default
> > > and use 10-fold cross-validation and take notes of its effectiveness.
> > > Then change/add a feature, evaluate and take notes. Sometimes a feature
> > > that we are sure would help can destroy the model effectiveness.
> > >
> > > Regards
> > > William
> > >
> > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <[email protected]>:
> > >
> > > > Thank you William! Really appreciated!
> > > >
> > > > I only do not get one point, when you said "You could increment your
> > > > model using Custom Feature Generators" does it mean that I can "put"
> > > > these features inside ONE .bin file (model) that implements different
> > > > things, or, name finder is one thing and those feature generators
> > > > other?
> > > >
> > > > Thank you in advance for the clarification.
> > > >
> > > > 2016-06-29 1:23 GMT+02:00 William Colen <[email protected]>:
> > > >
> > > > > Not exactly.
> > > > > You would create a new NER model to replace yours.
> > > > >
> > > > > In this approach you would need a corpus like this:
> > > > >
> > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> > > > >
> > > > > I am not an English native speaker, so I am not sure if the example is
> > > > > clear enough. I tried to use Jessie as a neutral name and "she" as
> > > > > disambiguation.
> > > > >
> > > > > With a corpus big enough maybe you could create a model that outputs
> > > > > both classes, personMale and personFemale. To train a model you can
> > > > > follow
> > > > >
> > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > >
> > > > > Let's say your results are not good enough. You could increment your
> > > > > model using Custom Feature Generators (
> > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > and
> > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > ).
> > > > >
> > > > > One of the implemented featuregen can take a dictionary (
> > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > ).
> > > > > You can also implement other convenient FeatureGenerator, for instance
> > > > > regex.
> > > > >
> > > > > Again, it is just a wild guess of how to implement it. I don't know if
> > > > > it would perform well.
> > > > > I was only thinking how to implement a gender ML model that uses the
> > > > > surrounding context.
> > > > >
> > > > > Hope I could clarify.
> > > > >
> > > > > William
> > > > >
> > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <[email protected]>:
> > > > >
> > > > > > Hi William,
> > > > > > Ok, so you are talking about a kind of pipe where we execute:
> > > > > >
> > > > > > 1. NER (personM for example)
> > > > > > 2. Regex (filter to reduce false positives)
> > > > > > 3. Plain dictionary (filter as above) ?
> > > > > >
> > > > > > Yes we can split our model in two for M and F, it is not a big
> > > > > > problem, we have a database grouped by gender.
> > > > > >
> > > > > > I only have a doubt regarding the use of a dictionary. Because if we
> > > > > > use a dictionary to create the model, we could only use it to detect
> > > > > > names without using NER. No?
> > > > > >
> > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <[email protected]>:
> > > > > >
> > > > > > > Do you plan to use the surrounding context? If yes, maybe you
> > > > > > > could try to split NER in two categories: PersonM and PersonF.
> > > > > > > Just an idea, never read or tried anything like it. You would
> > > > > > > need a training corpus with these classes.
> > > > > > >
> > > > > > > You could add both the plain dictionary and the regex as NER
> > > > > > > features as well and check how it improves.
> > > > > > >
> > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <[email protected]>:
> > > > > > >
> > > > > > > > Hello everybody,
> > > > > > > >
> > > > > > > > we built a NER model to find persons (name) inside our
> > > > > > > > documents. We are looking for the best approach to understand
> > > > > > > > if the name is male/female.
> > > > > > > >
> > > > > > > > Possible solutions:
> > > > > > > > - Plain dictionary?
> > > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > > >
> > > > > > > > Thanks
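
As an illustration of the hybrid approach Mondher describes in this thread (dictionary first, context only when the name fits both genders or is not in the dictionary), a minimal Java sketch might look like the following. Everything in it is an assumption made for illustration: the class name HybridGenderResolver, the three-entry dictionary, and the cue-word counter, which only stands in for the machine learning classifier (naive Bayes, Maxent, or an OpenNLP model) that the thread actually recommends for the second step. It does not use the OpenNLP API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class HybridGenderResolver {

    public enum Gender { MALE, FEMALE, BOTH, UNKNOWN }

    // Hypothetical toy dictionary; a real one would come from a file or database.
    private final Map<String, Gender> dictionary = new HashMap<>();

    public HybridGenderResolver() {
        dictionary.put("pierre", Gender.MALE);
        dictionary.put("maria", Gender.FEMALE);
        dictionary.put("jessie", Gender.BOTH);
    }

    /** Step 1: dictionary lookup. Step 2 runs only for ambiguous or unknown names. */
    public Gender resolve(String firstName, List<String> contextTokens) {
        Gender fromDictionary =
            dictionary.getOrDefault(firstName.toLowerCase(Locale.ROOT), Gender.UNKNOWN);
        if (fromDictionary == Gender.MALE || fromDictionary == Gender.FEMALE) {
            return fromDictionary; // exclusive name: no need to go any further
        }
        return classifyFromContext(contextTokens);
    }

    /** Placeholder for the real classifier: counts gendered cue words in the context. */
    public Gender classifyFromContext(List<String> contextTokens) {
        int score = 0;
        for (String token : contextTokens) {
            switch (token.toLowerCase(Locale.ROOT)) {
                case "she": case "her": case "mrs": case "ms":
                    score++;
                    break;
                case "he": case "his": case "him": case "mr":
                    score--;
                    break;
                default:
                    break;
            }
        }
        if (score > 0) return Gender.FEMALE;
        if (score < 0) return Gender.MALE;
        return Gender.UNKNOWN;
    }

    public static void main(String[] args) {
        HybridGenderResolver resolver = new HybridGenderResolver();
        List<String> context =
            Arrays.asList("is", "retiring", ",", "she", "was", "a", "board", "member");
        System.out.println(resolver.resolve("Pierre", context)); // prints MALE
        System.out.println(resolver.resolve("Jessie", context)); // prints FEMALE
    }
}
```

In this sketch the dictionary answer wins whenever the name is exclusive to one gender, so "Pierre" stays MALE even though the context contains "she"; only "Jessie", marked as fitting both, falls through to the context step. Replacing classifyFromContext with a trained model would keep the same control flow.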
