[ 
https://issues.apache.org/jira/browse/STANBOL-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonio David Pérez Morales updated STANBOL-1141:
-------------------------------------------------

    Attachment: gsoc-wikilinks-1.0-SNAPSHOT.zip

Adding the source code of the tool (without the tests, due to the size restriction).
                
> Wikilinks Parser and TDB Generator
> ----------------------------------
>
>                 Key: STANBOL-1141
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1141
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer, Entityhub
>            Reporter: Antonio David Pérez Morales
>              Labels: freebase, jenatdb, wikilinks
>         Attachments: gsoc-wikilinks-1.0-SNAPSHOT.zip
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Cross-document coreference resolution is the task of grouping the entity 
> mentions in a collection of documents into sets that each represent a 
> distinct entity. It is central to knowledge base construction and also useful 
> for joint inference with other NLP components.
> Wikilinks is one result of this task. The Wikilinks dataset comprises
> 40 million mentions over 3 million entities. The method is based on finding
> hyperlinks to Wikipedia from a web crawl and using anchor text as mentions.
> The resource provides URLs of webpages, along with the anchor of the links, 
> and the Wikipedia pages they link to. As provided, this dataset can be used 
> to get all the surface strings that refer to a Wikipedia page, but further, 
> it can be used to download the webpages and extract the context around the
> mentions.
> UMass (http://www.iesl.cs.umass.edu/) has created expanded versions of the 
> dataset containing the following extra features:
> * Complete webpage content (with cleaned DOM structure)
> * Extracted context for the mentions
> * Alignment to Freebase entities
> The expanded dataset can be downloaded from 
> http://iesl.cs.umass.edu/downloads/wiki-link/context-only/
> A tool is needed to parse this information and store it in some kind of
> storage that can be consumed later within Stanbol. For the first version, it
> is possible to convert this dataset to RDF and store it in a triple store
> like JenaTDB. The goal of this task is to provide an API on top of this store
> to ease the retrieval of entities' contextual data. So, at disambiguation
> time, we can use the URI of the referenced entity to look up disambiguation
> contexts in Wikilinks.
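
As a rough illustration of the flow described above (parse mentions, convert them to triples, look up contexts by entity URI), here is a minimal sketch. It is not the attached tool: the predicate URIs, the `Mention` fields, and the `ContextStore` class are all hypothetical names made up for this example, and a plain in-memory index stands in for JenaTDB.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical predicate URIs, invented for this illustration only.
NS = "http://example.org/wikilinks#"
P_REFERS = NS + "refersTo"
P_ANCHOR = NS + "anchorText"
P_CONTEXT = NS + "context"
P_SOURCE = NS + "sourcePage"


@dataclass(frozen=True)
class Mention:
    """One hyperlink to an entity page found on a crawled webpage."""
    page_url: str    # URL of the webpage containing the link
    anchor: str      # anchor text, used as the mention's surface string
    context: str     # text surrounding the mention
    entity_uri: str  # URI of the referenced entity


def mention_to_triples(mention_id, m):
    """Convert a mention into (subject, predicate, object) triples."""
    subject = NS + "mention/" + str(mention_id)
    return [
        (subject, P_REFERS, m.entity_uri),
        (subject, P_ANCHOR, m.anchor),
        (subject, P_CONTEXT, m.context),
        (subject, P_SOURCE, m.page_url),
    ]


class ContextStore:
    """Tiny in-memory stand-in for a triple store (assumption: not JenaTDB)."""

    def __init__(self):
        self._by_subject = defaultdict(dict)
        self._mentions_by_entity = defaultdict(list)

    def add(self, triples):
        """Index triples by subject and keep an entity -> mentions index."""
        for s, p, o in triples:
            self._by_subject[s][p] = o
            if p == P_REFERS:
                self._mentions_by_entity[o].append(s)

    def contexts_for(self, entity_uri):
        """Return (anchor, context) pairs for every mention of the entity."""
        return [
            (self._by_subject[s][P_ANCHOR], self._by_subject[s][P_CONTEXT])
            for s in self._mentions_by_entity.get(entity_uri, [])
        ]
```

At disambiguation time, a caller would pass the candidate entity's URI to `contexts_for` and compare the returned contexts with the text being analysed. A real implementation would replace the dictionaries with queries against the triple store.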

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
