[ https://issues.apache.org/jira/browse/STANBOL-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311455#comment-14311455 ]
Cristian Petroaca commented on STANBOL-1279:
--------------------------------------------

Hey Rupert,

I added the latest entity coref engine code with the following changes:
- enhanced the ./build_yago_dbpedia_labels.sh script to check for already downloaded archives and to print a clearer status. There is no way around the 7za command though; I need it to unpack the 7z archives.
- moved the spatial and org membership attributes from config files inside the jar to OSGi attributes. There are quite a few, but it does not look horribly crowded in the GUI.
- added the entity-coref-dbpedia data bundle.
- created a new dbpedia index that contains the organisational membership attributes such as :occupation, :associatedBands and :employer. I also put it up on wetransfer. You should receive a mail with the download link.

Basically the engine works with 3 types of co-referencing:
1. Spatial: e.g. Angela Merkel -> The German Chancellor
2. Organisation membership: e.g. Mick Jagger -> The Rolling Stones singer
3. Class based, when the class has more than 2 words in it: e.g. Boris Becker -> The former tennis player

> Named Entity co-reference resolution engine based on yago/dbpedia contextual information
> ----------------------------------------------------------------------------------------
>
>                 Key: STANBOL-1279
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1279
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancement Engines
>            Reporter: Cristian Petroaca
>            Assignee: Rupert Westenthaler
>              Labels: co-reference, dbpedia, entity, named, yago
>         Attachments: named_entity_coref_ver_1.patch, named_entity_coref_ver_2.patch, named_entity_coref_ver_3.patch
>
>
> Develop an enhancement engine that will perform co-reference resolution of Named Entities in a given text.
> The co-references will be noun phrases which refer to those Named Entities by having a minimal set of attributes which match contextual information from dbpedia/yago for that Named Entity (yago rdf:type and dbpedia spatial and functional-description properties; more on this below).
> We have the following text as an example: "Microsoft has posted its 2013 earnings. The software company did better than expected. ... The Redmond-based company will hire 500 new developers this year."
> The enhancement engine will link "Microsoft" with "The software company" and "The Redmond-based company".
> Below are the steps necessary to extract the co-references.
>
> Named Entity extraction
> ==================
> Extract all Named Entities from the given text. If there are no Named Entities then the process stops here.
>
> Noun Phrases extraction
> ===================
> Select all noun phrases after the first Named Entity that have:
> + a determiner (by POS tag) which implies reference to an entity local to the text, such as "the", "this", "these", but not "another", "every", etc., which imply a reference to an entity outside of the text.
> + at least one other noun, aside from the main required noun, which further describes it. For example I will not count "The company" as a legitimate candidate, since this could create a lot of false positives because of the double meaning of some words, as in "in the company of good people".
> All noun phrases need to be lemmatized in case there are any plurals.
> This step should have different logic implemented for different languages.
> This step ensures good recall.
>
> Noun Phrases matching
> ===================
> This step tries to match the previously selected noun phrases to the Named Entities from step 1 and establish the co-references.
> For every noun phrase the following rules will be applied:
>
> Yago:class matching
> --------------------------
> For each NER prior to the current noun phrase in the text, match the yago:class label to the contents of the noun phrase. If there are no matches then drop the current noun phrase.
>
> Group membership rules matching
> -------------------------------------------
> For each NER prior to the current noun phrase:
> + Spatial membership : the noun phrase is part of a LOCATION.
> If the noun phrase contains a LOCATION or a demonym then check any location properties of the matching NER. These properties will be part of a generic ontology. For clarity I will describe the dbpedia extracted properties which will be aligned to this generic ontology.
> If the matching NER is a:
> - person, match against :birthPlace, :region, :nationality
> - organisation, match against :foundationPlace, :locationCity, :location, :hometown
> - place, match against :country, :subdivisionName, :location.
> Example: The Italian President, The Richmond-based company
> + Organisational membership : the NER is part of an ORGANISATION.
> If the noun phrase contains an ORGANISATION then check the following properties of the matching NER. These properties will be part of a generic ontology. For clarity I will describe the dbpedia extracted properties which will be aligned to this generic ontology.
> If the matching NER is a:
> - person, match against :occupation, :associatedActs
> - organisation : no dbpedia properties to match
> - location : no dbpedia properties to match
> Example: The Microsoft executive, The Pink Floyd singer
>
> Functional description rules matching
> -----------------------------------------------
> The noun phrase describes what the NER does conceptually.
> If there are no NERs in the noun phrase then match the following properties of the matching NER to the contents of the noun phrase (aside from the nouns which are part of the yago:class):
> If the NER is a:
> - person : no dbpedia properties to match
> - organisation : match against :service, :industry, :genre
> - location : no dbpedia properties to match
> Example: The software company.
> If no matches were found for the current NER with the "Group membership" and "Functional description" rules, then if the yago:class which matched has more than 2 nouns we also consider this a good co-reference, but maybe with a lower confidence.
> Ex: The former tennis player, the theoretical physicist.
>
> Co-references creation
> ==================
> Based on the number of nouns which matched in the previous step we create a confidence level. The number of matched nouns cannot be lower than 2 and we must have a yago:class match.
> For all NERs which got to this point, select the ones closest in the text to the noun phrase which matched against the same properties (yago:class and dbpedia) and mark them as co-references.
> The "Noun Phrases matching" and "Co-references creation" steps are designed to filter out all bad co-references and ensure good precision.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
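
As a rough illustration of the rules in the issue description above, here is a minimal Python sketch of the candidate filter (local determiner plus at least two nouns) and of the final selection (yago:class match required, at least 2 matched nouns, closest preceding entity wins). It is not the engine's actual code: the NounPhrase and Entity types, the LOCAL_DETERMINERS set and the function names are all hypothetical, the yago/dbpedia lookups are replaced by plain sets of nouns, and a location such as "Redmond" is simply treated as a noun. How the matched-noun count maps to a confidence value is left open, as in the description.

```python
from dataclasses import dataclass

# Assumption: the issue only gives "the", "this", "these" as examples.
LOCAL_DETERMINERS = {"the", "this", "these"}


@dataclass
class NounPhrase:
    determiner: str   # leading determiner, lower-cased
    nouns: list       # lemmatized nouns of the phrase, lower-cased
    position: int     # token offset of the phrase in the text


@dataclass
class Entity:
    name: str
    position: int          # token offset of the entity mention
    class_nouns: set       # nouns taken from the yago:class label
    attribute_nouns: set   # nouns from spatial/org/functional dbpedia values


def is_candidate(np):
    """Noun Phrases extraction: local determiner and at least two nouns."""
    return np.determiner in LOCAL_DETERMINERS and len(np.nouns) >= 2


def matched_nouns(np, entity):
    """Count matched nouns; 0 when there is no yago:class match at all."""
    class_hits = [n for n in np.nouns if n in entity.class_nouns]
    if not class_hits:
        return 0
    # Attribute rules only look at nouns that are not part of the class match.
    attr_hits = [n for n in np.nouns
                 if n not in entity.class_nouns and n in entity.attribute_nouns]
    return len(class_hits) + len(attr_hits)


def resolve(np, entities):
    """Return (entity, matched noun count) for the closest valid entity, or None."""
    if not is_candidate(np):
        return None
    valid = []
    for entity in entities:
        if entity.position >= np.position:
            continue  # only entities mentioned before the noun phrase count
        score = matched_nouns(np, entity)
        if score >= 2:
            valid.append((entity, score))
    if not valid:
        return None
    # "Closest" is taken here to mean the largest offset before the phrase.
    return max(valid, key=lambda pair: pair[0].position)


if __name__ == "__main__":
    microsoft = Entity("Microsoft", 0, {"company"}, {"software", "redmond"})
    phrase = NounPhrase("the", ["software", "company"], 7)
    print(resolve(phrase, [microsoft]))
```

With the Microsoft example from the description, "The software company" resolves to the Microsoft entity with two matched nouns, while a bare "The company" is rejected by the candidate filter before any matching happens.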