Hi all

For my project, Freebase Entity Disambiguation in Stanbol, the first
milestone is due next Friday, 28 June. For it, I have integrated the
Freebase data dump as an EntityHub ReferencedSite in Stanbol, using the
Freebase indexing tool (following its Readme file) and Rupert's
instructions (and the Freebase index he generated).

However, this index can't be used as is for disambiguation because it does
not include the data that disambiguation requires, so additional data is
needed.

Besides using the Freebase information, I am thinking of using the
WikiLinks resource [1] released by Google.
The Wikipedia links (WikiLinks) data consists of web pages that satisfy the
following two constraints:
 - Contain at least one hyperlink that points to Wikipedia
 - The anchor text of that hyperlink closely matches the title of the
target Wikipedia page

For example, one record looks like this:

 URL    ftp://121.244.82.26/BENarkhede/ben-official/Utilizing%20Six%20Sigma%20for%20Defect%20Reductions-%20A%20Case%20Study.doc
 MENTION    quality control    49584    http://en.wikipedia.org/wiki/Quality_control
 TOKEN    right    14911
 TOKEN    smarter    14944
 TOKEN    driven    9151
 TOKEN    defines    81657
 TOKEN    laboratory    34360
 TOKEN    workshop    41603
 TOKEN    savings    89906
 TOKEN    Korean    96665

Each file is in the following format:

 -------

 URL\t<url>\n
 MENTION\t<mention>\t<byte_offset>\t<target_url>\n
 MENTION\t<mention>\t<byte_offset>\t<target_url>\n
 MENTION\t<mention>\t<byte_offset>\t<target_url>\n
 ...
 TOKEN\t<token>\t<byte_offset>\n
 TOKEN\t<token>\t<byte_offset>\n
 TOKEN\t<token>\t<byte_offset>\n
 ...
 \n\n
 URL\t<url>\n
 ...

where each web page is identified by its URL (the URL line), each MENTION
line gives the anchor text, its byte offset and the target Wikipedia URL,
and TOKENs are the frequent words on that page.
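
To make the format concrete, here is a rough Python sketch of a reader for
it. It is only an illustration based on the format description above: the
names Page and parse_wikilinks are made up for this example, the sample
URLs under example.org are placeholders, and it assumes the fields are
tab-separated exactly as described.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Page:
    """One web page record (hypothetical container for this sketch)."""
    url: str
    # (mention text, byte offset, target Wikipedia URL)
    mentions: List[Tuple[str, int, str]] = field(default_factory=list)
    # (token, byte offset)
    tokens: List[Tuple[str, int]] = field(default_factory=list)


def parse_wikilinks(lines: Iterable[str]) -> Iterator[Page]:
    """Yield one Page per URL record found in the given lines."""
    page: Optional[Page] = None
    for raw in lines:
        parts = raw.rstrip("\n").split("\t")
        kind = parts[0].strip()
        if kind == "URL" and len(parts) >= 2:
            # A new URL line closes the previous record, if any.
            if page is not None:
                yield page
            page = Page(url=parts[1])
        elif kind == "MENTION" and len(parts) >= 4 and page is not None:
            page.mentions.append((parts[1], int(parts[2]), parts[3]))
        elif kind == "TOKEN" and len(parts) >= 3 and page is not None:
            page.tokens.append((parts[1], int(parts[2])))
        # Blank separator lines and unrecognised lines are skipped.
    if page is not None:
        yield page


# Tiny in-memory sample; a real run would iterate over an opened data file.
sample = (
    "URL\thttp://example.org/page.html\n"
    "MENTION\tquality control\t49584\t"
    "http://en.wikipedia.org/wiki/Quality_control\n"
    "TOKEN\tlaboratory\t34360\n"
    "\n\n"
    "URL\thttp://example.org/other.html\n"
)
pages = list(parse_wikilinks(sample.splitlines()))
```

Reading the data as a stream like this avoids loading a whole file into
memory, which matters given the size of the resource.
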
The statistics of this resource are:
- Number of Documents: 11 million
- Number of Entities: 3 million
- Number of Mentions: 40 million

This resource could be transformed to RDF and fed to the indexer tool,
linking it to the indexed data using the URL. Alternatively, it would also
be possible to create a ManagedSite with this information.
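
As a rough idea of what the RDF transformation could look like, here is a
small Python sketch that turns the mentions of one page into N-Triples
lines. Everything specific in it is an assumption for illustration: the
two predicate URIs under example.org are invented, the function names are
made up, and using the Wikipedia URL as the subject is just one possible
choice. How those URLs would be mapped to the Freebase entities in the
index is left open here.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical predicate URIs, invented for this sketch only.
MENTION_TEXT = "http://example.org/wikilinks#mentionText"
MENTIONED_IN = "http://example.org/wikilinks#mentionedIn"


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples string literals require."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def mention_triples(
    page_url: str, mentions: Iterable[Tuple[str, str]]
) -> Iterator[str]:
    """Yield N-Triples lines for (mention text, target URL) pairs.

    Caveat: URLs are written out as-is, so any URL containing characters
    that are not allowed in an IRI would need cleaning first.
    """
    for text, target_url in mentions:
        yield f'<{target_url}> <{MENTION_TEXT}> "{escape_literal(text)}" .'
        yield f"<{target_url}> <{MENTIONED_IN}> <{page_url}> ."


triples = list(
    mention_triples(
        "http://example.org/page.html",
        [("quality control", "http://en.wikipedia.org/wiki/Quality_control")],
    )
)
for line in triples:
    print(line)
```
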

What do you guys think? I need your opinions in order to keep making
progress on my project.
I appreciate your help.

Thanks
Antonio


[1] https://code.google.com/p/wiki-links/
