[
https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247196#comment-13247196
]
Olivier Grisel commented on STANBOL-223:
----------------------------------------
As an alternative to MLT, which computes an unnormalized similarity score that
approximates the cosine similarity, one could use the Jaccard coefficient of
the overlapping words (either restricted to co-occurring names or extended to
any other words) between the document context and the descriptions + past
mentions of the potential entities found by the existing name lookup, and use
it to re-rank the link candidates.
For instance, papers such as the following might be worth studying:
http://aclweb.org/anthology/P/P11/P11-1138.pdf
http://liuchuan.org/pub/CS475.pdf
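The Jaccard re-ranking suggested above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not Stanbol code: the
function names (tokenize, jaccard, rerank), the naive regex tokenizer and the
sample entity ids and descriptions are all hypothetical. In a real engine the
candidate texts would come from the entity hub descriptions plus past mentions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens (naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two token sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return len(a & b) / len(union)


def rerank(context, candidates):
    """Rank candidates by word overlap with the document context.

    candidates maps a (hypothetical) entity id to its description text.
    Returns (entity id, score) pairs, best first; ties are broken by id.
    """
    ctx = tokenize(context)
    scored = [(eid, jaccard(ctx, tokenize(desc))) for eid, desc in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical data: two candidates returned by a name lookup for "Paris".
context = "Paris Hilton attended a hotel opening in Las Vegas"
candidates = {
    "Paris": "Capital city of France on the Seine river",
    "Paris_Hilton": "American socialite and heiress of the Hilton hotel family",
}
ranking = rerank(context, candidates)
```

Restricting the overlap to co-occurring names would only require replacing
tokenize with a function that returns the detected names instead of all words.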
Also, before using complex disambiguation logic such as the Jaccard coefficient
or MLT, one should implement simpler approaches such as:
- Add a configuration option to the entity linking engine to perform exact
name search only, on the canonical labels from the entity hub + redirect names
(for DBpedia only, could be stored as alternative names) and on the mention
expressions that carry a link as found in the Wikipedia dump (this needs a
dedicated extraction as explained above).
- Ad-hoc rules could also be interesting: if the name detected by OpenNLP is a
first name (as indexed in the entity hub for instance), one could mark the name
as ambiguous and skip its linking.
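The two simpler approaches listed above could be combined along these lines.
Again a minimal sketch under assumptions: build_label_index and link_exact are
hypothetical helpers, the entity ids and labels are made-up sample data, and
treating "exact" as case-insensitive after trimming is a choice made here, not
something the entity hub prescribes.

```python
def build_label_index(entities):
    """Map each normalized label to the set of entity ids that carry it.

    entities maps an entity id to all of its names: the canonical label plus
    any alternative names (e.g. redirect names or linked mention expressions).
    """
    index = {}
    for eid, labels in entities.items():
        for label in labels:
            index.setdefault(label.strip().lower(), set()).add(eid)
    return index


def link_exact(mention, label_index, first_names):
    """Return candidate ids for an exact name match.

    Returns None when the mention is a bare first name, meaning it is marked
    as ambiguous and its linking is skipped. Returns an empty list when no
    label matches exactly.
    """
    key = mention.strip().lower()
    if key in first_names:
        return None
    return sorted(label_index.get(key, ()))


# Hypothetical sample data.
entities = {
    "dbpedia:Paris": ["Paris", "City of Light"],
    "dbpedia:Paris_Hilton": ["Paris Hilton"],
}
label_index = build_label_index(entities)
first_names = {"john", "olivier"}
```

With this data, "Paris Hilton" resolves to a single candidate, "John" is
skipped as ambiguous, and a partial name such as "Hilton" yields no candidate
because only exact matches are accepted.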
> Entity Disambiguation based on Solr MLT
> ---------------------------------------
>
> Key: STANBOL-223
> URL: https://issues.apache.org/jira/browse/STANBOL-223
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancer
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> In short:
> The idea is to use sentences with links to an Entity in a dataset (e.g.
> Wikipedia) as context and compare this with the surrounding text of an Entity
> extracted from the analyzed content. Solr More Like This (MLT) queries will
> be used for the ranking.
>
> More details:
> Sentences with occurrences of the Entity can be extracted by using
> https://github.com/ogrisel/pignlproc. Functionality will be added to output
> the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization).
> This will allow this information to be indexed (together with all the other
> information about Entities) by using the Indexing Tools provided by the Stanbol
> Entityhub (e.g. entityhub/indexing/dbpedia).
> The following Information will be used for EntityDisambiguation:
> (1) TextAnnotations providing the label, the type as detected by the NLP
> framework, the context of the extraction
> (1b) In addition links to other Text Annotations about the same Entity could
> be used to extend the context
> (2) A Solr Index (ReferencedSite of the Stanbol Entityhub) providing at least
> the labels, types and the occurrences of the Entities
> EntityDisambiguation will filter based on the label and the type (filter
> query) and rank selected Entities based on a "More Like This" query with the
> context over the occurrences.
> A first prototype of this engine was implemented during the bbuzz - Semantic
> Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a
> standalone EnhancementEngine that uses a separate Solr Index for the MLT queries.
> The plan is to implement this as an optional (configurable) feature of the
> existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to
> activate/deactivate Entity disambiguation via the OSGi Console if the
> required data are available for a ReferencedSite.