[ 
https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247196#comment-13247196
 ] 

Olivier Grisel commented on STANBOL-223:
----------------------------------------

As an alternative to MLT, which computes an unnormalized similarity score that 
only approximates the cosine similarity, one could re-rank the link candidates 
with a Jaccard coefficient. It would measure the word overlap (either 
restricted to co-occurring names or over all words, names or not) between the 
document context and the description + past mentions of each potential entity 
found by the existing name lookup.
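
A minimal sketch of that re-ranking, assuming a naive regex tokenizer and 
candidates given as (entity id, description text) pairs. The function names 
and the sample data are hypothetical, not part of Stanbol:

```python
import re


def tokens(text):
    """Return the set of lowercased word tokens (deliberately naive)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rerank(context, candidates):
    """Order candidate entity ids by word overlap with the document context.

    `candidates` is a list of (entity_id, description) pairs, where the
    description would hold the entity description plus its past mentions.
    """
    ctx = tokens(context)
    scored = [(jaccard(ctx, tokens(desc)), eid) for eid, desc in candidates]
    # Highest overlap first; the sort is stable, so ties keep lookup order.
    return [eid for _, eid in sorted(scored, key=lambda s: -s[0])]


if __name__ == "__main__":
    candidates = [
        ("Paris_Hilton", "American media personality and hotel heiress"),
        ("Paris", "capital city of France"),
    ]
    print(rerank("Paris hosted the fashion week in France", candidates))
```

Restricting the overlap to co-occurring names would only change what 
`tokens` returns; the scoring itself stays the same.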

For instance, papers such as the following might be worth studying:

  http://aclweb.org/anthology/P/P11/P11-1138.pdf
  http://liuchuan.org/pub/CS475.pdf

Also, before using complex disambiguation logic such as the Jaccard 
coefficient and MLT, one should implement simpler approaches such as:

- Add a configuration option to the entity linking engine to perform exact 
name search only, both on the canonical labels from the entity hub and on the 
redirect names (for DBpedia only, could be stored as alternative names) or the 
mention expressions that carry a link as found in the Wikipedia dump (needs a 
dedicated extraction as explained above).

- Ad-hoc rules could also be interesting: if the name detected by OpenNLP is a 
first name (as indexed in the entity hub, for instance), one could mark the 
name as ambiguous and skip linking it.
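
Both simpler approaches can be sketched together. This assumes an in-memory 
index from names to entity ids and a plain set of known first names; the 
helper names, the `exact_only` flag and the sample ids are hypothetical and 
only stand in for the engine configuration and the entity hub lookup:

```python
def build_label_index(entities):
    """Map each lowercased name to the ids of the entities that carry it.

    `entities` maps an entity id to its names: canonical label, redirect
    names and link anchor texts (hypothetical sample structure).
    """
    index = {}
    for eid, names in entities.items():
        for name in names:
            index.setdefault(name.strip().lower(), set()).add(eid)
    return index


def link_mention(mention, index, first_names, exact_only=True):
    """Return the candidate entity ids for a mention, or [] if it is skipped."""
    key = mention.strip().lower()
    # Ad-hoc rule: a bare first name is treated as ambiguous and not linked.
    if key in first_names:
        return []
    if exact_only:
        # Exact name search: the whole mention must equal a known name.
        return sorted(index.get(key, set()))
    # Looser fallback: the mention matches one word of a known name.
    hits = set()
    for name, ids in index.items():
        if key in name.split():
            hits |= ids
    return sorted(hits)


if __name__ == "__main__":
    index = build_label_index({
        "dbpedia:Barack_Obama": ["Barack Obama", "Obama"],
        "dbpedia:Michelle_Obama": ["Michelle Obama"],
    })
    first_names = {"barack", "michelle"}
    print(link_mention("Barack", index, first_names))
    print(link_mention("Obama", index, first_names))
    print(link_mention("Obama", index, first_names, exact_only=False))
```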
                
> Entity Disambiguation based on Solr MLT
> ---------------------------------------
>
>                 Key: STANBOL-223
>                 URL: https://issues.apache.org/jira/browse/STANBOL-223
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> In short:
> The idea is to use sentences with links to an Entity in a dataset (e.g. 
> Wikipedia) as context and compare this with the surrounding text of an Entity 
> extracted from the analyzed content. Solr More Like This (MLT) queries will 
> be used for the ranking. 
>  
> More details:
> Sentences with occurrences of the Entity can be extracted by using 
> https://github.com/ogrisel/pignlproc. Functionality will be added to output 
> the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization). 
> This will allow this information to be indexed (together with all the other 
> information of Entities) by using the Indexing Tools provided by the Stanbol 
> Entityhub (e.g. entityhub/indexing/dbpedia).
> The following information will be used for EntityDisambiguation:
> (1) TextAnnotations providing the label, the type as detected by the NLP 
> framework, the context of the extraction
> (1b) In addition links to other Text Annotations about the same Entity could 
> be used to extend the context
> (2) A Solr Index (ReferencedSite of the Stanbol Entityhub) providing at least 
> the labels, types and the occurrences of the Entities
> EntityDisambiguation will filter based on the label and the type (filter 
> query) and rank selected Entities based on a "More Like This" query with the 
> context over the occurrences.
> A first prototype of this engine was implemented during the bbuzz - Semantic 
> Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a 
> separate EnhancementEngine that uses its own Solr Index for the MLT queries.
> The plan is to implement this as an optional (configurable) feature of the 
> existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to 
> activate/deactivate Entity disambiguation via the OSGi Console if the 
> required data are available for a ReferencedSite.
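
For reference, the filter + MLT ranking described in the issue could map to a 
Solr MoreLikeThisHandler request roughly as below. This is only a sketch: the 
core URL and the field names (`label`, `type`, `occurrences`) are assumptions, 
the handler has to be registered in solrconfig.xml, and passing the context 
via `stream.body` may need to be enabled explicitly on recent Solr versions.

```python
from urllib.parse import urlencode


def solr_quote(value):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def mlt_query(context, label, entity_type,
              base_url="http://localhost:8983/solr/entities/mlt"):
    """Build the request URL; base_url and field names are hypothetical."""
    params = [
        ("stream.body", context),    # surrounding text of the mention
        ("mlt.fl", "occurrences"),   # field holding the linked sentences
        ("mlt.mintf", "1"),          # keep terms that occur once in the context
        ("mlt.mindf", "1"),          # keep terms that are rare in the index
        ("fq", "label:" + solr_quote(label)),       # filter on the label
        ("fq", "type:" + solr_quote(entity_type)),  # filter on the type
        ("fl", "id,score"),
        ("rows", "10"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(mlt_query("Paris hosted the fashion week", "Paris", "Place"))
```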


        
