I was speaking about the second case. We could build a dedicated component specialized in extracting properties about already detected entities.
Jörn

On Mon, Jul 4, 2016 at 2:33 PM, Damiano Porta wrote:

> Hello Jorn,
> Do you mean that I need to "extend" my NER model to find other name-related
> entities too?
>
> OR
>
> Find the entities with a dictionary and then train a maxent model that
> finds other properties like person title, job position etc?
>
> Thanks for the clarification.
>
> 2016-07-04 12:15 GMT+02:00 Joern Kottmann:
>
> > Hello,
> >
> > there are also other interesting properties, e.g. person title (e.g.
> > professor, doctor), job title/position, company legal form. And much
> > more for other entity types.
> >
> > Maybe it would be worth it to build a dedicated component to extract
> > properties from entities.
> >
> > Jörn
> >
> > On Fri, Jul 1, 2016 at 3:05 PM, Mondher Bouazizi wrote:
> >
> > > Hi,
> > >
> > > Sorry for my late reply. I didn't quite understand your last email, but
> > > here is what I meant:
> > >
> > > Given a simple dictionary you have that has the following columns:
> > >
> > > Name     Type    Gender
> > > Agatha   First   F
> > > John     First   M
> > > Smith    Both    B
> > >
> > > where:
> > > - "First" refers to first name, "Last" (not in the example) refers to
> > >   last name, and "Both" means it can be both.
> > > - "F" refers to female, "M" refers to male, and "B" refers to both
> > >   genders.
> > >
> > > and given the following two sentences:
> > >
> > > 1. "It was nice meeting you John. I hope we meet again soon."
> > >
> > > 2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and
> > >    felt she knows something"
> > >
> > > In the first example, when you check in the dictionary, the name "John"
> > > is a male name, so no need to go any further.
> > > However, in the second example, the name "Smith", which is a family name
> > > in our case, can fit both males and females.
> > > Therefore, we need to extract features from the surrounding context
> > > and perform a classification task.
> > > Here are some of the features I think would be interesting to use:
> > >
> > > . Presence of a male title (e.g. "Mr.") before the word {True, False}
> > > . Presence of a female title (e.g. "Mrs.") before the word {True, False}
> > >
> > > . Gender of the first personal pronoun (subject or object form) to the
> > >   right of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the first personal pronoun to the right
> > >   (in words)
> > >   Values=NUMERIC
> > > . Gender of the second personal pronoun to the right of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the second personal pronoun to the right
> > >   Values=NUMERIC
> > > . Gender of the third personal pronoun to the right of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the third personal pronoun to the right
> > >   (in words)
> > >   Values=NUMERIC
> > >
> > > . Gender of the first personal pronoun (subject or object form) to the
> > >   left of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the first personal pronoun to the left
> > >   (in words)
> > >   Values=NUMERIC
> > > . Gender of the second personal pronoun to the left of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the second personal pronoun to the left
> > >   Values=NUMERIC
> > > . Gender of the third personal pronoun to the left of the name
> > >   Values={MALE, FEMALE, UNCERTAIN, EMPTY}
> > > . Distance between the name and the third personal pronoun to the left
> > >   (in words)
> > >   Values=NUMERIC
> > >
> > > In the second example, here are the values you have for your features:
> > >
> > > F1 = False
> > > F2 = True
> > > F3 = UNCERTAIN
> > > F4 = 1
> > > F5 = FEMALE
> > > F6 = 3
> > > F7 = FEMALE
> > > F8 = 4
> > > F9 = UNCERTAIN
> > > F10 = 2
> > > F11 = EMPTY
> > > F12 = 0
> > > F13 = EMPTY
> > > F14 = 0
> > >
> > > Of course the choice of features depends on the type of data, and the
> > > features themselves might not work well for some texts, such as ones
> > > collected from Twitter for example.
> > >
> > > I hope this helps you.
> > >
> > > Best regards
> > >
> > > Mondher
> > >
> > > On Thu, Jun 30, 2016 at 7:42 PM, Damiano Porta wrote:
> > >
> > > > Hi Mondher,
> > > > could you give me a raw example to understand how I should train the
> > > > classifier model?
> > > >
> > > > Thank you in advance!
> > > > Damiano
> > > >
> > > > 2016-06-30 6:57 GMT+02:00 Mondher Bouazizi:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would recommend a hybrid approach where, in a first step, you use
> > > > > a plain dictionary and then perform the classification if needed.
> > > > >
> > > > > It's straightforward, but I think it would perform better than just
> > > > > performing a classification task.
> > > > >
> > > > > In the first step you use a dictionary of names along with an
> > > > > attribute specifying whether the name fits for males, females or
> > > > > both. In case the name fits for males or females exclusively, then
> > > > > no need to go any further.
> > > > >
> > > > > If the name fits for both genders, or is a family name etc., a second
> > > > > step is needed where you extract features from the context
> > > > > (surrounding words, etc.) and perform a classification task using any
> > > > > machine learning algorithm.
> > > > >
> > > > > Another way would be using the information itself (whether the name
> > > > > fits for males, females or both) as a feature when you perform the
> > > > > classification.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Mondher
> > > > >
> > > > > I am not sure
> > > > >
> > > > > On Wed, Jun 29, 2016 at 10:27 PM, Damiano Porta wrote:
> > > > >
> > > > > > Awesome! Thank you so much William!
> > > > > >
> > > > > > 2016-06-29 13:36 GMT+02:00 William Colen:
> > > > > >
> > > > > > > To create a NER model OpenNLP extracts features from the context,
> > > > > > > things such as: word prefix and suffix, next word, previous word,
> > > > > > > previous word prefix and suffix, next word prefix and suffix etc.
> > > > > > > When you don't configure the feature generator it will apply the
> > > > > > > default:
> > > > > > >
> > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api
> > > > > > >
> > > > > > > Default feature generator:
> > > > > > >
> > > > > > > AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
> > > > > > >     new AdaptiveFeatureGenerator[]{
> > > > > > >         new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
> > > > > > >         new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
> > > > > > >         new OutcomePriorFeatureGenerator(),
> > > > > > >         new PreviousMapFeatureGenerator(),
> > > > > > >         new BigramNameFeatureGenerator(),
> > > > > > >         new SentenceFeatureGenerator(true, false)
> > > > > > >     });
> > > > > > >
> > > > > > > These default features should work for most cases (especially
> > > > > > > English), but they of course can be incremented.
> > > > > > > If you do so, your model will take the new features into account.
> > > > > > > So yes, you are putting the features in your model.
> > > > > > >
> > > > > > > Configuring custom features is not easy. I would start with the
> > > > > > > default, use 10-fold cross-validation and take notes of its
> > > > > > > effectiveness. Then change/add a feature, evaluate and take notes.
> > > > > > > Sometimes a feature that we are sure would help can destroy the
> > > > > > > model effectiveness.
> > > > > > >
> > > > > > > Regards
> > > > > > > William
> > > > > > >
> > > > > > > 2016-06-29 7:00 GMT-03:00 Damiano Porta:
> > > > > > >
> > > > > > > > Thank you William! Really appreciated!
> > > > > > > >
> > > > > > > > There is only one point I do not get: when you said "You could
> > > > > > > > increment your model using Custom Feature Generators", does it
> > > > > > > > mean that I can "put" these features inside ONE .bin file (model)
> > > > > > > > that implements different things, or is the name finder one thing
> > > > > > > > and those feature generators another?
> > > > > > > >
> > > > > > > > Thank you in advance for the clarification.
> > > > > > > >
> > > > > > > > 2016-06-29 1:23 GMT+02:00 William Colen:
> > > > > > > >
> > > > > > > > > Not exactly. You would create a new NER model to replace yours.
> > > > > > > > >
> > > > > > > > > In this approach you would need a corpus like this:
> > > > > > > > >
> > > > > > > > > <START:personMale> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> > > > > > > > > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
> > > > > > > > > <START:personFemale> Jessie Robson <END> is retiring , she was a board member for 5 years .
> > > > > > > > >
> > > > > > > > > I am not an English native speaker, so I am not sure if the
> > > > > > > > > example is clear enough. I tried to use Jessie as a neutral name
> > > > > > > > > and "she" as disambiguation.
> > > > > > > > >
> > > > > > > > > With a corpus big enough maybe you could create a model that
> > > > > > > > > outputs both classes, personMale and personFemale. To train a
> > > > > > > > > model you can follow
> > > > > > > > >
> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> > > > > > > > >
> > > > > > > > > Let's say your results are not good enough. You could increment
> > > > > > > > > your model using Custom Feature Generators (
> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > > > > > > > > and
> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > > > > > > > > ).
> > > > > > > > >
> > > > > > > > > One of the implemented featuregen can take a dictionary (
> > > > > > > > > https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > > > > > > > > ).
> > > > > > > > > You can also implement other convenient FeatureGenerator, for
> > > > > > > > > instance regex.
> > > > > > > > >
> > > > > > > > > Again, it is just a wild guess of how to implement it. I don't
> > > > > > > > > know if it would perform well. I was only thinking how to
> > > > > > > > > implement a gender ML model that uses the surrounding context.
> > > > > > > > >
> > > > > > > > > Hope I could clarify.
> > > > > > > > >
> > > > > > > > > William
> > > > > > > > >
> > > > > > > > > 2016-06-28 19:15 GMT-03:00 Damiano Porta:
> > > > > > > > >
> > > > > > > > > > Hi William,
> > > > > > > > > > Ok, so you are talking about a kind of pipe where we execute:
> > > > > > > > > >
> > > > > > > > > > 1. NER (personM for example)
> > > > > > > > > > 2. Regex (filter to reduce false positives)
> > > > > > > > > > 3. Plain dictionary (filter as above) ?
> > > > > > > > > >
> > > > > > > > > > Yes we can split our model in two for M and F, it is not a big
> > > > > > > > > > problem, we have a database grouped by gender.
> > > > > > > > > >
> > > > > > > > > > I only have a doubt regarding the use of a dictionary. Because
> > > > > > > > > > if we use a dictionary to create the model, we could only use
> > > > > > > > > > it to detect names without using NER. No?
> > > > > > > > > >
> > > > > > > > > > 2016-06-29 0:10 GMT+02:00 William Colen:
> > > > > > > > > >
> > > > > > > > > > > Do you plan to use the surrounding context? If yes, maybe
> > > > > > > > > > > you could try to split NER in two categories: PersonM and
> > > > > > > > > > > PersonF. Just an idea, never read or tried anything like it.
> > > > > > > > > > > You would need a training corpus with these classes.
> > > > > > > > > > >
> > > > > > > > > > > You could add both the plain dictionary and the regex as NER
> > > > > > > > > > > features as well and check how it improves.
> > > > > > > > > > >
> > > > > > > > > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta:
> > > > > > > > > > >
> > > > > > > > > > > > Hello everybody,
> > > > > > > > > > > >
> > > > > > > > > > > > we built a NER model to find persons (name) inside our
> > > > > > > > > > > > documents.
> > > > > > > > > > > > We are looking for the best approach to understand if the
> > > > > > > > > > > > name is male/female.
> > > > > > > > > > > >
> > > > > > > > > > > > Possible solutions:
> > > > > > > > > > > > - Plain dictionary?
> > > > > > > > > > > > - Regex to check the initial and/letters of the name?
> > > > > > > > > > > > - Classifier? (naive bayes? Maxent?)
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
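
For illustration, the hybrid approach discussed in this thread (dictionary lookup first, context features only when the name is ambiguous) could be sketched roughly as below. This is only a sketch under assumptions: the class GenderSketch, its method names and the title/pronoun lists are hypothetical and not part of OpenNLP, the context step uses simple hand-written rules where a trained classifier (e.g. maxent) would go, and only a small subset of the features listed in the thread is computed (title before the name, first gendered pronoun to the right).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not an OpenNLP class: dictionary first, context second. */
public class GenderSketch {

    // Toy dictionary from the thread: M = male, F = female, B = both.
    static final Map<String, String> DICT = new LinkedHashMap<>();
    static {
        DICT.put("Agatha", "F");
        DICT.put("John", "M");
        DICT.put("Smith", "B");
    }

    // Assumed word lists; a real system would need fuller, language-specific ones.
    static final Set<String> MALE_TITLES = new HashSet<>(Arrays.asList("mr", "mr."));
    static final Set<String> FEMALE_TITLES =
            new HashSet<>(Arrays.asList("mrs", "mrs.", "ms", "ms.", "miss"));
    static final Set<String> MALE_PRONOUNS = new HashSet<>(Arrays.asList("he", "him"));
    static final Set<String> FEMALE_PRONOUNS = new HashSet<>(Arrays.asList("she", "her"));

    /** Gender of the first gendered pronoun to the right of the name, or EMPTY. */
    static String pronounRight(String[] tokens, int nameIndex) {
        for (int i = nameIndex + 1; i < tokens.length; i++) {
            String t = tokens[i].toLowerCase();
            if (MALE_PRONOUNS.contains(t)) return "MALE";
            if (FEMALE_PRONOUNS.contains(t)) return "FEMALE";
        }
        return "EMPTY";
    }

    /** Context features for the token at nameIndex (subset of the thread's list). */
    static Map<String, String> features(String[] tokens, int nameIndex) {
        String prev = nameIndex > 0 ? tokens[nameIndex - 1].toLowerCase() : "";
        Map<String, String> f = new LinkedHashMap<>();
        f.put("maleTitle", String.valueOf(MALE_TITLES.contains(prev)));
        f.put("femaleTitle", String.valueOf(FEMALE_TITLES.contains(prev)));
        f.put("pronounRight", pronounRight(tokens, nameIndex));
        return f;
    }

    /** Step 1: dictionary. Step 2: rules standing in for a trained classifier. */
    static String guess(String[] tokens, int nameIndex) {
        // Names missing from the dictionary are treated as ambiguous ("B").
        String entry = DICT.getOrDefault(tokens[nameIndex], "B");
        if (entry.equals("M")) return "MALE";
        if (entry.equals("F")) return "FEMALE";
        Map<String, String> f = features(tokens, nameIndex);
        if (f.get("maleTitle").equals("true")) return "MALE";
        if (f.get("femaleTitle").equals("true")) return "FEMALE";
        String p = f.get("pronounRight");
        return p.equals("EMPTY") ? "UNCERTAIN" : p;
    }

    public static void main(String[] args) {
        String[] s1 = "It was nice meeting you John .".split(" ");
        String[] s2 = "Yes , I met Mrs. Smith . I asked her her opinion".split(" ");
        System.out.println(guess(s1, 5)); // "John" is resolved by the dictionary
        System.out.println(guess(s2, 5)); // "Smith" is ambiguous, context decides
    }
}
```

In a real setup the map returned by features(...) would be turned into the input of a trained classifier instead of being checked by hand-written rules, and the remaining features from the list (second and third pronouns, distances, left context) would be added in the same way.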
