Hi all,

For my project, Freebase Entity Disambiguation in Stanbol, the first milestone is due next Friday, 28 June. For that milestone I have integrated the Freebase data dump as an EntityHub ReferencedSite in Stanbol, using the Freebase indexing tool (see its Readme file) and following Rupert's instructions (and the Freebase index he generated).
However, this index can't be used as is for disambiguation, because it does not include the required data. Additional data is needed in order to perform disambiguation.

Besides the Freebase information, I was thinking of using the WikiLinks resource [1] released by Google. The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints:

- They contain at least one hyperlink that points to Wikipedia
- The anchor text of that hyperlink closely matches the title of the target Wikipedia page

For example:

URL	ftp://121.244.82.26/BENarkhede/ben-official/Utilizing%20Six%20Sigma%20for%20Defect%20Reductions-%20A%20Case%20Study.doc
MENTION	quality control	49584	http://en.wikipedia.org/wiki/Quality_control
TOKEN	right	14911
TOKEN	smarter	14944
TOKEN	driven	9151
TOKEN	defines	81657
TOKEN	laboratory	34360
TOKEN	workshop	41603
TOKEN	savings	89906
TOKEN	Korean	96665

Each file is in the following format:

-------
URL\t<url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
...
TOKEN\t<token>\t<byte_offset>\n
TOKEN\t<token>\t<byte_offset>\n
TOKEN\t<token>\t<byte_offset>\n
...
\n\n
URL\t<url>\n
...

where each web page is identified by its URL (the URL line) and the TOKEN lines are the frequent words on that page.

The statistics of this resource are:

- Number of documents: 11 million
- Number of entities: 3 million
- Number of mentions: 40 million

One option would be to convert this resource to RDF and use it in the indexing tool, linking it through the URL. Alternatively, it would also be possible to create a ManagedSite with this information.

What do you guys think? I need your opinions in order to keep advancing in my project.

I appreciate your help.

Thanks,
Antonio

[1] https://code.google.com/p/wiki-links/
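P.S. To make the WikiLinks format described above more concrete, here is a minimal Python sketch of how the records could be read and grouped by target Wikipedia URL before any conversion to RDF. This is only an illustration under my own assumptions: the names `Page`, `Mention`, `Token`, `parse_wikilinks` and `mentions_by_target` are hypothetical (they are not part of the WikiLinks tooling or of Stanbol), the sample page URL is a placeholder, and lines that do not match the documented layout are simply skipped.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Mention:
    text: str
    byte_offset: int
    target_url: str


@dataclass
class Token:
    text: str
    byte_offset: int


@dataclass
class Page:
    url: str
    mentions: List[Mention] = field(default_factory=list)
    tokens: List[Token] = field(default_factory=list)


def parse_wikilinks(lines: Iterable[str]) -> Iterator[Page]:
    """Yield one Page per URL record of a tab-separated WikiLinks file."""
    page: Optional[Page] = None
    for raw in lines:
        parts = raw.rstrip("\r\n").split("\t")
        tag = parts[0]
        if tag == "URL" and len(parts) == 2:
            # A new URL line closes the previous record, if any.
            if page is not None:
                yield page
            page = Page(url=parts[1])
        elif tag == "MENTION" and len(parts) == 4 and page is not None:
            page.mentions.append(Mention(parts[1], int(parts[2]), parts[3]))
        elif tag == "TOKEN" and len(parts) == 3 and page is not None:
            page.tokens.append(Token(parts[1], int(parts[2])))
        # Blank separator lines and malformed lines are ignored.
    if page is not None:
        yield page


def mentions_by_target(pages: Iterable[Page]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each target Wikipedia URL to (mention text, source page URL) pairs."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for page in pages:
        for mention in page.mentions:
            index[mention.target_url].append((mention.text, page.url))
    return dict(index)


# Placeholder record in the documented layout (the page URL is made up).
sample = (
    "URL\thttp://example.org/page\n"
    "MENTION\tquality control\t49584\thttp://en.wikipedia.org/wiki/Quality_control\n"
    "TOKEN\tright\t14911\n"
    "TOKEN\tsmarter\t14944\n"
    "\n\n"
)
pages = list(parse_wikilinks(sample.splitlines(keepends=True)))
index = mentions_by_target(pages)
```

Grouping by target URL is just one possible intermediate step; the resulting map could then be serialised to RDF for the indexing tool or loaded into a ManagedSite, depending on which route we choose.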