The co-referencer we used to have in opennlp-tools has a model to detect the gender of names. That could be extracted and put into a standalone component.
Jörn

On Mon, Jul 4, 2016 at 2:41 PM, Joern Kottmann <[email protected]> wrote:

> I was speaking about the second case. We could build a dedicated component specialized in extracting properties about already detected entities.
>
> Jörn
>
> On Mon, Jul 4, 2016 at 2:33 PM, Damiano Porta <[email protected]> wrote:
>
>> Hello Jorn,
>> Do you mean that i need to "extend" my NER model to find other name-related entities too?
>>
>> OR
>>
>> Find the entities with a dictionary and then train a maxent model that finds other properties like person title, job position etc?
>>
>> Thanks for the clarification.
>>
>> 2016-07-04 12:15 GMT+02:00 Joern Kottmann <[email protected]>:
>>
>> > Hello,
>> >
>> > there are also other interesting properties e.g. person title (e.g. professor, doctor), job title/position, company legal form. And much more for other entity types.
>> >
>> > Maybe it would be worth it to build a dedicated component to extract properties from entities.
>> >
>> > Jörn
>> >
>> > On Fri, Jul 1, 2016 at 3:05 PM, Mondher Bouazizi <[email protected]> wrote:
>> >
>> > > Hi,
>> > >
>> > > Sorry for my late reply. I didn't understand well your last email, but here is what I meant:
>> > >
>> > > Given a simple dictionary you have that has the following columns:
>> > >
>> > > Name    Type   Gender
>> > > Agatha  First  F
>> > > John    First  M
>> > > Smith   Both   B
>> > >
>> > > where:
>> > > - "First" refers to first name, "Last" (not in the example) refers to last name, and Both means it can be both.
>> > > - "F" refers to female, "M" refers to males, and "B" refers to both genders.
>> > >
>> > > and given the following two sentences:
>> > >
>> > > 1. "It was nice meeting you John. I hope we meet again soon."
>> > >
>> > > 2. "Yes, I met Mrs. Smith.
>> > > I asked her her opinion about the case and felt she knows something"
>> > >
>> > > In the first example, when you check in the dictionary, the name "John" is a male name, so no need to go any further.
>> > > However, in the second example, the name "Smith", which is a family name in our case, can fit both males and females. Therefore, we need to extract features from the surrounding context and perform a classification task.
>> > > Here are some of the features I think would be interesting to use:
>> > >
>> > > F1. Presence of a male honorific (e.g. "Mr.") before the word {True, False}
>> > > F2. Presence of a female honorific (e.g. "Mrs.") before the word {True, False}
>> > >
>> > > F3. Gender of the first personal pronoun (subject or object form) to the right of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F4. Distance between the name and the first personal pronoun to the right (in words) Values=NUMERIC
>> > > F5. Gender of the second personal pronoun to the right of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F6. Distance between the name and the second personal pronoun to the right (in words) Values=NUMERIC
>> > > F7. Gender of the third personal pronoun to the right of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F8. Distance between the name and the third personal pronoun to the right (in words) Values=NUMERIC
>> > >
>> > > F9. Gender of the first personal pronoun (subject or object form) to the left of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F10. Distance between the name and the first personal pronoun to the left (in words) Values=NUMERIC
>> > > F11. Gender of the second personal pronoun to the left of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F12. Distance between the name and the second personal pronoun to the left (in words) Values=NUMERIC
>> > > F13. Gender of the third personal pronoun to the left of the name Values={MALE, FEMALE, UNCERTAIN, EMPTY}
>> > > F14. Distance between the name and the third personal pronoun to the left (in words) Values=NUMERIC
>> > >
>> > > In the second example here are the values you have for your features
>> > >
>> > > F1 = False
>> > > F2 = True
>> > > F3 = UNCERTAIN
>> > > F4 = 1
>> > > F5 = FEMALE
>> > > F6 = 3
>> > > F7 = FEMALE
>> > > F8 = 4
>> > > F9 = UNCERTAIN
>> > > F10 = 2
>> > > F11 = EMPTY
>> > > F12 = 0
>> > > F13 = EMPTY
>> > > F14 = 0
>> > >
>> > > Of course the choice of features depends on the type of data, and the features themselves might not work well for some texts such as ones collected from twitter for example.
>> > >
>> > > I hope this helps you.
>> > >
>> > > Best regards
>> > >
>> > > Mondher
>> > >
>> > > On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta <[email protected]> wrote:
>> > >
>> > > > Hi Mondher,
>> > > > could you give me a raw example to understand how i should train the classifier model?
>> > > >
>> > > > Thank you in advance!
>> > > > Damiano
>> > > >
>> > > > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi <[email protected]>:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > I would recommend a hybrid approach where, in a first step, you use a plain dictionary and then perform the classification if needed.
>> > > > >
>> > > > > It's straightforward, but I think it would perform better than just performing a classification task.
>> > > > >
>> > > > > In the first step you use a dictionary of names along with an attribute specifying whether the name fits for males, females or both. In case the name fits for males or females exclusively, then no need to go any further.
>> > > > >
>> > > > > If the name fits for both genders, or is a family name etc., a second step is needed where you extract features from the context (surrounding words, etc.) and perform a classification task using any machine learning algorithm.
>> > > > >
>> > > > > Another way would be using the information itself (whether the name fits for males, females or both) as a feature when you perform the classification.
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > Mondher
>> > > > >
>> > > > > I am not sure
>> > > > >
>> > > > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta <[email protected]> wrote:
>> > > > >
>> > > > > > Awesome! Thank you so much William!
>> > > > > >
>> > > > > > 2016-06-29 13:36 GMT+02:00 William Colen <[email protected]>:
>> > > > > >
>> > > > > > > To create a NER model OpenNLP extracts features from the context, things such as: word prefix and suffix, next word, previous word, previous word prefix and suffix, next word prefix and suffix etc.
>> > > > > > > When you don't configure the feature generator it will apply the default:
>> > > > > > >
>> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
>> > > > > > >
>> > > > > > > Default feature generator:
>> > > > > > >
>> > > > > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
>> > > > > > >     new AdaptiveFeatureGenerator[]{
>> > > > > > >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
>> > > > > > >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
>> > > > > > >         new OutcomePriorFeatureGenerator(),
>> > > > > > >         new PreviousMapFeatureGenerator(),
>> > > > > > >         new BigramNameFeatureGenerator(),
>> > > > > > >         new SentenceFeatureGenerator(true, false)
>> > > > > > >     });
>> > > > > > >
>> > > > > > > These default features should work for most cases (especially English), but they of course can be incremented. If you do so, your model will take the new features into account. So yes, you are putting the features in your model.
>> > > > > > >
>> > > > > > > Configuring custom features is not easy. I would start with the default, use 10-fold cross-validation and take notes of its effectiveness. Then change/add a feature, evaluate and take notes. Sometimes a feature that we are sure would help can destroy the model effectiveness.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > > William
>> > > > > > >
>> > > > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta <[email protected]>:
>> > > > > > >
>> > > > > > > > Thank you William! Really appreciated!
>> > > > > > > >
>> > > > > > > > I only do not get one point, when you said "You could increment your model using Custom Feature Generators" does it mean that i can "put" these features inside ONE *.bin* file (model) that implement different things, or, name finder is one thing and those feature generators other?
>> > > > > > > >
>> > > > > > > > Thank you in advance for the clarification.
>> > > > > > > >
>> > > > > > > > 2016-06-29 1:23 GMT+02:00 William Colen <[email protected]>:
>> > > > > > > >
>> > > > > > > > > Not exactly. You would create a new NER model to replace yours.
>> > > > > > > > >
>> > > > > > > > > In this approach you would need a corpus like this:
>> > > > > > > > >
>> > > > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
>> > > > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
>> > > > > > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
>> > > > > > > > >
>> > > > > > > > > I am not an English native speaker, so I am not sure if the example is clear enough. I tried to use Jessie as a neutral name and "she" as disambiguation.
>> > > > > > > > >
>> > > > > > > > > With a corpus big enough maybe you could create a model that outputs both classes, personMale and personFemale.
>> > > > > > > > > To train a model you can follow
>> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
>> > > > > > > > >
>> > > > > > > > > Let's say your results are not good enough. You could increment your model using Custom Feature Generators (
>> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
>> > > > > > > > > and
>> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
>> > > > > > > > > ).
>> > > > > > > > >
>> > > > > > > > > One of the implemented featuregen can take a dictionary (
>> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
>> > > > > > > > > ).
>> > > > > > > > > You can also implement other convenient FeatureGenerator, for instance regex.
>> > > > > > > > >
>> > > > > > > > > Again, it is just a wild guess of how to implement it. I don't know if it would perform well. I was only thinking how to implement a gender ML model that uses the surrounding context.
>> > > > > > > > >
>> > > > > > > > > Hope I could clarify.
>> > > > > > > > >
>> > > > > > > > > William
>> > > > > > > > >
>> > > > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta <[email protected]>:
>> > > > > > > > >
>> > > > > > > > > > Hi William,
>> > > > > > > > > > Ok, so you are talking about a kind of pipe where we execute:
>> > > > > > > > > >
>> > > > > > > > > > 1. NER (personM for example)
>> > > > > > > > > > 2. Regex (filter to reduce false positives)
>> > > > > > > > > > 3. Plain dictionary (filter as above) ?
>> > > > > > > > > >
>> > > > > > > > > > Yes we can split out model in two for M and F, it is not a big problem, we have a database grouped by gender.
>> > > > > > > > > >
>> > > > > > > > > > I only have a doubt regarding the use of a dictionary. Because if we use a dictionary to create the model, we could only use it to detect names without using NER. No?
>> > > > > > > > > >
>> > > > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen <[email protected]>:
>> > > > > > > > > >
>> > > > > > > > > > > Do you plan to use the surrounding context? If yes, maybe you could try to split NER in two categories: PersonM and PersonF. Just an idea, never read or tried anything like it. You would need a training corpus with these classes.
>> > > > > > > > > > >
>> > > > > > > > > > > You could add both the plain dictionary and the regex as NER features as well and check how it improves.
>> > > > > > > > > > >
>> > > > > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta <[email protected]>:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hello everybody,
>> > > > > > > > > > > >
>> > > > > > > > > > > > we built a NER model to find persons (name) inside our documents.
>> > > > > > > > > > > > We are looking for the best approach to understand if the name is male/female.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Possible solutions:
>> > > > > > > > > > > > - Plain dictionary?
>> > > > > > > > > > > > - Regex to check the initial and/letters of the name?
>> > > > > > > > > > > > - Classifier? (naive bayes? Maxent?)
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks
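
For reference, the hybrid approach discussed in this thread (dictionary lookup first, surrounding context only for ambiguous names) might be sketched roughly as below. This is only an illustration under stated assumptions: the class name HybridGenderGuesser, the toy dictionary and the word lists are all hypothetical, nothing here is OpenNLP API, and the hand-written fallback rules merely stand in for the trained classifier (maxent or similar) that the thread actually proposes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of OpenNLP.
class HybridGenderGuesser {
    // Toy dictionary (assumption): name -> "M", "F" or "B" (fits both genders).
    static final Map<String, String> DICT =
        Map.of("Agatha", "F", "John", "M", "Smith", "B");

    static final Set<String> MALE_TITLES = Set.of("Mr.", "Mr", "Sir");
    static final Set<String> FEMALE_TITLES = Set.of("Mrs.", "Mrs", "Ms.", "Ms", "Miss");
    static final Set<String> MALE_PRONOUNS = Set.of("he", "him", "his");
    static final Set<String> FEMALE_PRONOUNS = Set.of("she", "her", "hers");

    // Returns "M", "F" or "UNCERTAIN" for the name at tokens[nameIndex].
    static String guess(List<String> tokens, int nameIndex) {
        // Step 1: plain dictionary; unknown names are treated as ambiguous.
        String fromDict = DICT.getOrDefault(tokens.get(nameIndex), "B");
        if (!fromDict.equals("B")) {
            return fromDict;
        }
        // Step 2 (stand-in for a classifier): honorific right before the name.
        if (nameIndex > 0) {
            String previous = tokens.get(nameIndex - 1);
            if (MALE_TITLES.contains(previous)) return "M";
            if (FEMALE_TITLES.contains(previous)) return "F";
        }
        // Otherwise use the first gendered pronoun to the right of the name.
        for (int i = nameIndex + 1; i < tokens.size(); i++) {
            String token = tokens.get(i).toLowerCase();
            if (MALE_PRONOUNS.contains(token)) return "M";
            if (FEMALE_PRONOUNS.contains(token)) return "F";
        }
        return "UNCERTAIN";
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
            "Yes", ",", "I", "met", "Mrs.", "Smith", ".", "I", "asked", "her");
        System.out.println(guess(sentence, 5));
    }
}
```

In a real setup the two fallback rules would become features (like the honorific and pronoun features listed earlier in the thread) fed to a trained model, so that conflicting cues are weighed rather than resolved by rule order.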
