[
https://issues.apache.org/jira/browse/STANBOL-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rafa Haro reassigned STANBOL-1037:
----------------------------------
Assignee: Rafa Haro
> Entity Disambiguation for Stanbol
> ---------------------------------
>
> Key: STANBOL-1037
> URL: https://issues.apache.org/jira/browse/STANBOL-1037
> Project: Stanbol
> Issue Type: Story
> Components: Enhancer, Entityhub
> Reporter: Rafa Haro
> Assignee: Rafa Haro
> Priority: Major
> Labels: gsoc2013, mentoring
> Attachments: stanbol-enhancement-workflow.001.png
>
>
> Entity Disambiguation in Stanbol mainly refers to the process of modifying
> the fise:confidence values of the EntityAnnotations obtained as a result of
> any Linking Engine within Stanbol (EntityLinkingEngine or NamedEntityLinking).
> These confidence values should be modified so that, after the disambiguation
> process, the possible candidates (entities) to link with are ranked for each
> TextAnnotation. Each candidate would be an Entity within the EntityHub or any
> other Knowledge Base configured in Stanbol.
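> A minimal sketch of what this confidence rewrite could look like, in plain
> Java with hypothetical candidate URIs (not the actual Stanbol enhancement
> structure API): a disambiguation algorithm produces a score per candidate and
> the fise:confidence values are overwritten with the normalized scores, so
> that the candidate set ends up ranked.
>
>     // Illustrative only: hypothetical candidate URIs and scores.
>     import java.util.LinkedHashMap;
>     import java.util.Map;
>
>     public class ConfidenceReRanker {
>
>         /** Replaces each candidate's confidence with its normalized disambiguation score. */
>         static Map<String, Double> reRank(Map<String, Double> disambiguationScores) {
>             double sum = disambiguationScores.values().stream()
>                     .mapToDouble(Double::doubleValue).sum();
>             Map<String, Double> confidences = new LinkedHashMap<>();
>             for (Map.Entry<String, Double> e : disambiguationScores.entrySet()) {
>                 confidences.put(e.getKey(), sum == 0 ? 0.0 : e.getValue() / sum);
>             }
>             return confidences;
>         }
>
>         public static void main(String[] args) {
>             Map<String, Double> scores = new LinkedHashMap<>();
>             scores.put("dbpedia:Michael_Jordan", 4.2);      // NBA player
>             scores.put("dbpedia:Michael_I._Jordan", 1.1);   // Berkeley professor
>             scores.put("dbpedia:Michael_B._Jordan", 0.7);   // American actor
>             System.out.println(reRank(scores));             // ranked fise:confidence values
>         }
>     }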
> Disambiguation
> ============
> Entity Linking is not a trivial task due to the name ambiguity problem, i.e.,
> the same name may refer to different entities in different contexts and,
> conversely, the same entity can usually be mentioned using a set of different
> names. For instance, the name Michael Jordan can refer to more than 20
> entities in Wikipedia, some of which are shown below:
> - Michael Jordan (NBA Player)
> - Michael I. Jordan (Berkeley Professor)
> - Michael B. Jordan (American Actor)
> This situation arises not only in well-known semantic knowledge bases like
> DBpedia or Freebase, but is also important for any enterprise semantic
> dataset or custom vocabulary. An immediate example is resolving the ambiguity
> within a database of employees.
> Formally, Entity Disambiguation for Stanbol should work as follows: after an
> enhancement process of a ContentItem using an enhancement chain that includes
> a Linking Engine, we would get a set of TextAnnotations TA = {T1, T2, ..., Tn}.
> Each TextAnnotation in TA should contain a name mention, which is
> characterized by its name, its local surrounding context
> (fise:selection-context) and the ContentItem containing it. For each
> TextAnnotation in TA, and as a result of the Linking Engine, we would get a
> set of EntityAnnotations EAi = {E1i, E2i, ..., ENi}, where i corresponds to
> TextAnnotation i in TA. We should rely on the linking engines to provide all
> possible entity annotations (candidates within all sites in the EntityHub)
> for each TextAnnotation. Each EntityAnnotation is characterized by its
> Knowledge Base (entityhub:site) and its entry in that knowledge base
> (fise:entity-reference). The objective of the disambiguation process is to
> rank each EntityAnnotation set EAi through the modification of its
> EntityAnnotations' confidence values, so that the entity with the highest
> confidence value is the referent entity for the TextAnnotation associated
> with EAi.
> Algorithms
> ========
> ** Local Approaches
> (From [1]) Conventional entity linking approaches have focused on making
> independent Entity Linking decisions using the local mention-to-entity
> compatibility for each isolated mention. The essential idea is to extract
> discriminative features from the description of a specific entity and then
> link each name mention in a document by comparing its contextual similarity
> with each of its candidate referent entities. This approach is followed by
> the Disambiguation-MLT engine in STANBOL-723.
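> A minimal sketch of that local intuition, assuming a bag-of-words
> representation and cosine similarity between the mention's selection context
> and each candidate's description; this is only illustrative and is not how
> the Disambiguation-MLT engine computes the similarity.
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class LocalSimilarity {
>
>         // Tokenize into a bag of lower-cased words with term frequencies.
>         static Map<String, Integer> bagOfWords(String text) {
>             Map<String, Integer> bag = new HashMap<>();
>             for (String token : text.toLowerCase().split("\\W+")) {
>                 if (!token.isEmpty()) bag.merge(token, 1, Integer::sum);
>             }
>             return bag;
>         }
>
>         // Cosine similarity between two term-frequency vectors.
>         static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
>             double dot = 0, na = 0, nb = 0;
>             for (Map.Entry<String, Integer> e : a.entrySet()) {
>                 dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
>                 na += e.getValue() * e.getValue();
>             }
>             for (int v : b.values()) nb += v * v;
>             return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
>         }
>
>         public static void main(String[] args) {
>             String context = "Michael Jordan won six NBA championships with the Chicago Bulls";
>             String player = "Michael Jeffrey Jordan is a former NBA basketball player";
>             String professor = "Michael Irwin Jordan is a professor of machine learning at Berkeley";
>             System.out.println("player:    " + cosine(bagOfWords(context), bagOfWords(player)));
>             System.out.println("professor: " + cosine(bagOfWords(context), bagOfWords(professor)));
>         }
>     }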
> ** Global Approaches (Collective Entity Linking)
> The main drawback of the local approaches stems from the fact that they do
> not take into consideration the interdependence between different Entity
> Linking decisions. Specifically, the entities in a topically coherent
> document are usually semantically related to each other. In such cases,
> figuring out the referent entity of one name mention may in turn give useful
> information for linking the other name mentions in the same document. That
> suggests that disambiguation performance could be improved by resolving all
> mentions at the same time.
> This approach only makes sense in a scenario with highly connected knowledge
> bases where the entities are semantically related in some way.
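> A toy sketch of the collective idea, with hypothetical relatedness values and
> a brute-force search over the joint assignment of two mentions (a real engine
> would also combine local scores and use a smarter optimization):
>
>     import java.util.List;
>     import java.util.Map;
>
>     public class CollectiveLinking {
>
>         // Pairwise semantic relatedness between entities, e.g. derived from links in the KB.
>         static double relatedness(String a, String b, Map<String, Double> rel) {
>             return rel.getOrDefault(a + "|" + b, rel.getOrDefault(b + "|" + a, 0.0));
>         }
>
>         public static void main(String[] args) {
>             // Two mentions, each with two candidates.
>             List<List<String>> candidates = List.of(
>                 List.of("Michael_Jordan_(NBA)", "Michael_I._Jordan"),
>                 List.of("Chicago_Bulls", "Bulls_(band)"));
>             Map<String, Double> rel = Map.of("Michael_Jordan_(NBA)|Chicago_Bulls", 0.9);
>
>             double best = Double.NEGATIVE_INFINITY;
>             List<String> bestAssignment = null;
>             for (String e1 : candidates.get(0)) {
>                 for (String e2 : candidates.get(1)) {
>                     double score = relatedness(e1, e2, rel); // local scores omitted for brevity
>                     if (score > best) { best = score; bestAssignment = List.of(e1, e2); }
>                 }
>             }
>             System.out.println(bestAssignment + " score=" + best);
>         }
>     }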
> ** Graph Based Approaches
> In these approaches, both the Knowledge Base and the interdependence between
> possible Entity Linking decisions are modeled as graphs, and inference
> algorithms are used to resolve all the mentions within a document.
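> A very small sketch of this idea, with a hypothetical candidate graph and a
> PageRank-like score propagation; real graph-based linkers use richer graphs
> and inference procedures.
>
>     public class GraphDisambiguation {
>
>         public static void main(String[] args) {
>             String[] nodes = {"Michael_Jordan_(NBA)", "Michael_I._Jordan", "Chicago_Bulls"};
>             // Symmetric relatedness weights between candidate nodes (hypothetical values).
>             double[][] w = {
>                 {0.0, 0.0, 0.9},
>                 {0.0, 0.0, 0.0},
>                 {0.9, 0.0, 0.0}};
>             double[] score = {1.0, 1.0, 1.0};   // initial (e.g. local) scores
>             double damping = 0.85;
>
>             // Propagate scores: well-connected candidates reinforce each other.
>             for (int iter = 0; iter < 20; iter++) {
>                 double[] next = new double[nodes.length];
>                 for (int i = 0; i < nodes.length; i++) {
>                     double in = 0;
>                     for (int j = 0; j < nodes.length; j++) in += w[j][i] * score[j];
>                     next[i] = (1 - damping) + damping * in;
>                 }
>                 score = next;
>             }
>             for (int i = 0; i < nodes.length; i++) {
>                 System.out.println(nodes[i] + " -> " + score[i]);
>             }
>         }
>     }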
> Knowledge Bases
> ==============
> As described in STANBOL-223, disambiguation requires using some data as
> disambiguation features. The nature of the disambiguation data will depend on
> the particularities of the knowledge base. In general, it will be necessary
> to generate a Semantic context for each candidate and process it in the
> disambiguation algorithm. The Disambiguation Context could be a fixed data
> structure for each kind of disambiguation engine in Stanbol, and developers
> should be in charge of providing the mechanisms to create those contexts for
> their custom vocabularies or knowledge bases.
> For instance, with Local Approaches, developers should be able to configure
> Disambiguation-MLT or any other local disambiguation engine in order to
> obtain a disambiguation context from the EntityHub and compute its similarity
> with the mentions' contexts within the ContentItem.
> This can be as easy as selecting the Entity's disambiguation fields or as
> complex as calling methods that build disambiguation contexts on the fly.
> Normally, the first option will involve generating the disambiguation fields
> at EntityHub index creation time. For instance, as described in STANBOL-223,
> for DBpedia it is possible to extract sentences containing occurrences of
> entities' mentions from Wikipedia using
> https://github.com/ogrisel/pignlproc. These sentences can be included in the
> DBpedia EntityHub index as disambiguation fields. Entities' abstracts can
> also be used for disambiguation. All these fields should be configurable
> (e.g. with boosts) for disambiguation purposes.
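> As an illustration of such configurable fields (field names and boost values
> below are made up, not a fixed Stanbol schema), a disambiguation engine could
> combine per-field similarities weighted by their configured boosts:
>
>     import java.util.Map;
>
>     public class DisambiguationFieldConfig {
>
>         // Boost-weighted sum of per-field similarity scores for one candidate.
>         static double score(Map<String, Double> fieldSimilarities, Map<String, Double> boosts) {
>             return fieldSimilarities.entrySet().stream()
>                 .mapToDouble(e -> boosts.getOrDefault(e.getKey(), 1.0) * e.getValue())
>                 .sum();
>         }
>
>         public static void main(String[] args) {
>             Map<String, Double> boosts = Map.of(
>                 "dbpedia-ont:abstract", 1.0,
>                 "disambiguation:sentences", 2.0);  // sentences extracted with pignlproc
>             Map<String, Double> similarities = Map.of(
>                 "dbpedia-ont:abstract", 0.4,
>                 "disambiguation:sentences", 0.7);
>             System.out.println("candidate score = " + score(similarities, boosts));
>         }
>     }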
> General Architecture and Workflow
> ==========================
> A typical Disambiguation system architecture would include three steps:
> ** Candidates Generation: from a surface form (name mention) in the
> ContentItem, generate a set of possible entities within the EntityHub to link
> with. A typical source of entities' names is their labels, but other fields
> can be used. In this step, it is necessary to decide how to search those name
> sources: Exact Matching, Overlapping, Fuzzy Search, Full-Text Search,
> Case Sensitivity, Coreference Resolution, ...
> ** Candidate Ranking: rank all candidates by their probability of being the
> referent entity. Basically, this step involves executing the specific
> disambiguation engine as an enhancement post-processing phase.
> ** Detect and Cluster Missing Entities: those mentions that actually
> shouldn't be linked to any existing Entity should be extracted and grouped
> into clusters (one cluster for each unknown entity). These entities can be
> suggested to the user in order to include them in the knowledge base
> (Automatic Knowledge Base Population). A minimal sketch of these three steps
> is given after this list.
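> A compact, self-contained sketch of the three steps over a tiny in-memory
> knowledge base; all entity names, descriptions and thresholds below are
> hypothetical and only illustrate the workflow.
>
>     import java.util.ArrayList;
>     import java.util.Arrays;
>     import java.util.HashSet;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.Set;
>
>     public class DisambiguationWorkflow {
>
>         record Candidate(String entity, double score) {}
>
>         // Step 1: candidate generation - a label matches if it contains every mention token.
>         static List<String> generateCandidates(String mention, Map<String, String> kbLabels) {
>             List<String> mentionTokens = Arrays.asList(mention.toLowerCase().split("\\W+"));
>             List<String> candidates = new ArrayList<>();
>             for (Map.Entry<String, String> e : kbLabels.entrySet()) {
>                 List<String> labelTokens = Arrays.asList(e.getValue().toLowerCase().split("\\W+"));
>                 if (labelTokens.containsAll(mentionTokens)) candidates.add(e.getKey());
>             }
>             return candidates;
>         }
>
>         // Step 2: candidate ranking - naive overlap between mention context and entity description.
>         static List<Candidate> rank(List<String> candidates, String context,
>                                     Map<String, String> kbDescriptions) {
>             Set<String> ctx = new HashSet<>(Arrays.asList(context.toLowerCase().split("\\W+")));
>             List<Candidate> ranked = new ArrayList<>();
>             for (String c : candidates) {
>                 Set<String> desc = new HashSet<>(Arrays.asList(
>                     kbDescriptions.getOrDefault(c, "").toLowerCase().split("\\W+")));
>                 desc.retainAll(ctx);
>                 ranked.add(new Candidate(c, desc.size()));
>             }
>             ranked.sort((a, b) -> Double.compare(b.score(), a.score()));
>             return ranked;
>         }
>
>         public static void main(String[] args) {
>             Map<String, String> labels = Map.of(
>                 "ex:MJ_NBA", "Michael Jordan",
>                 "ex:MJ_Prof", "Michael I. Jordan");
>             Map<String, String> descriptions = Map.of(
>                 "ex:MJ_NBA", "basketball player NBA Chicago Bulls",
>                 "ex:MJ_Prof", "machine learning professor Berkeley");
>             String mention = "Michael Jordan";
>             String context = "Michael Jordan played basketball for the Chicago Bulls";
>
>             List<Candidate> ranked = rank(generateCandidates(mention, labels), context, descriptions);
>             // Step 3: NIL detection - a best score below the threshold means the mention
>             // refers to an entity that is missing from the knowledge base.
>             double threshold = 1.0;
>             if (ranked.isEmpty() || ranked.get(0).score() < threshold) {
>                 System.out.println(mention + " -> NIL (candidate for Knowledge Base Population)");
>             } else {
>                 System.out.println(mention + " -> " + ranked.get(0).entity());
>             }
>         }
>     }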
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)