[ https://issues.apache.org/jira/browse/STANBOL-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonio David Pérez Morales updated STANBOL-1141:
-------------------------------------------------

    Description: 
Cross-document coreference resolution is the task of grouping the entity 
mentions in a collection of documents into sets that each represent a distinct 
entity. It is central to knowledge base construction and also useful for joint 
inference with other NLP components.

Wikilinks is one result of this task. The Wikilinks dataset comprises 40 
million mentions over 3 million entities. The method is based on finding 
hyperlinks to Wikipedia from a web crawl and using the anchor text as mentions. 
The resource provides the URLs of web pages, along with the anchor text of the 
links and the Wikipedia pages they link to. As provided, this dataset can be 
used to get all the surface strings that refer to a Wikipedia page; 
furthermore, it can be used to download the web pages and extract the context 
around the mentions.
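
To make the expected input concrete, here is a minimal parsing sketch. It 
assumes, purely for illustration, a simple tab-separated layout of 
(web page URL, anchor text, target Wikipedia URL) per mention; the actual 
layout of the downloaded files should be checked against the dataset 
documentation, and the class and method names are placeholders.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** One Wikilinks mention: the page it occurs on, its anchor text and the linked Wikipedia page. */
public class WikilinksMention {

    public final String pageUrl;
    public final String anchorText;
    public final String wikipediaUrl;

    public WikilinksMention(String pageUrl, String anchorText, String wikipediaUrl) {
        this.pageUrl = pageUrl;
        this.anchorText = anchorText;
        this.wikipediaUrl = wikipediaUrl;
    }

    /** Reads a dump file, assuming one tab-separated mention per line (hypothetical layout). */
    public static List<WikilinksMention> parse(String file) throws IOException {
        List<WikilinksMention> mentions = new ArrayList<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length >= 3) { // skip malformed lines
                    mentions.add(new WikilinksMention(fields[0], fields[1], fields[2]));
                }
            }
        }
        return mentions;
    }
}
{code}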

UMass (http://www.iesl.cs.umass.edu/) has created expanded versions of the 
dataset containing the following extra features:

* Complete webpage content (with cleaned DOM structure)
* Extracted context for the mentions
* Alignment to Freebase entities

The expanded dataset can be downloaded from 
http://iesl.cs.umass.edu/downloads/wiki-link/context-only/

A tool is needed to parse this information and store it in some kind of 
storage that can be consumed later within Stanbol. For the first version, it 
is possible to convert this dataset to RDF and store it in a triple store such 
as Jena TDB. The goal of this task is to provide an API on top of this store 
to ease the retrieval of entities' contextual data, so that, at disambiguation 
time, we can use the URI of the referenced entity to look up disambiguation 
contexts in Wikilinks.
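
As a rough illustration of this first version, the sketch below converts 
parsed mentions into RDF with Apache Jena, persists them in a TDB dataset, and 
looks up the recorded contexts for a given entity URI via SPARQL. It reuses 
the hypothetical WikilinksMention class from the sketch above; the property 
URIs are placeholders rather than an agreed vocabulary, and the code uses the 
current org.apache.jena packages.

{code:java}
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.tdb.TDBFactory;

/** Sketch of the proposed store: Wikilinks mentions as RDF in Jena TDB, looked up by entity URI. */
public class WikilinksTdbStore {

    // Placeholder vocabulary; the real property URIs are still to be defined.
    private static final Property MENTIONED_IN =
            ResourceFactory.createProperty("http://example.org/wikilinks/mentionedIn");
    private static final Property ANCHOR_TEXT =
            ResourceFactory.createProperty("http://example.org/wikilinks/anchorText");

    private final Dataset dataset;

    public WikilinksTdbStore(String tdbDirectory) {
        this.dataset = TDBFactory.createDataset(tdbDirectory);
    }

    /** Converts parsed mentions to RDF triples and persists them in TDB. */
    public void store(List<WikilinksMention> mentions) {
        dataset.begin(ReadWrite.WRITE);
        try {
            Model model = dataset.getDefaultModel();
            for (WikilinksMention m : mentions) {
                Resource entity = model.createResource(m.wikipediaUrl);
                entity.addProperty(MENTIONED_IN, model.createResource(m.pageUrl));
                entity.addProperty(ANCHOR_TEXT, m.anchorText);
            }
            dataset.commit();
        } finally {
            dataset.end();
        }
    }

    /** Prints the anchor texts (disambiguation contexts) recorded for the given entity URI. */
    public void printContexts(String entityUri) {
        String query = "SELECT ?text WHERE { <" + entityUri + "> "
                + "<" + ANCHOR_TEXT.getURI() + "> ?text }";
        dataset.begin(ReadWrite.READ);
        try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.next();
                System.out.println(solution.getLiteral("text").getString());
            }
        } finally {
            dataset.end();
        }
    }
}
{code}

Usage would be along the lines of building the store once with store(...), and 
then calling printContexts(entityUri) with the Wikipedia URI of the referenced 
entity at disambiguation time.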


  was:
Cross-document coreference resolution is the task of grouping the entity 
mentions in a collection of documents into sets that each represent a distinct 
entity. It is central to knowledge base construction and also useful for joint 
inference with other NLP components.

Wikilinks is one result of this task. 
The Wikilinks dataset comprises 40 million mentions over 3 million entities. 
The method is based on finding hyperlinks to Wikipedia from a web crawl and 
using the anchor text as mentions. In addition to providing large-scale 
labeled data without human effort, this approach covers many styles of text 
beyond newswire and many entity types beyond people.

UMass has created expanded versions of the dataset containing the following 
extra features:

* Complete webpage content (with cleaned DOM structure)
* Extracted context for the mentions
* Alignment to Freebase entities

The expanded dataset can be downloaded from 
http://iesl.cs.umass.edu/downloads/wiki-link/context-only/

A tool is needed to parse this information and store it in some type of 
storage such as Jena TDB. 

Wikilinks provides information about documents with mentions of Freebase 
entities, and this information can be used both to disambiguate and to merge 
with the Freebase data in order to obtain a large set of valuable data.


    
> Wikilinks Parser and TDB Generator
> ----------------------------------
>
>                 Key: STANBOL-1141
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1141
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer, Entityhub
>            Reporter: Antonio David Pérez Morales
>              Labels: freebase, jenatdb, wikilinks
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
