Hi Antonio,

have you noticed that under [2] there are additional resources available? Especially the "Expanded Dataset" should be of interest, as it contains

> * Complete webpage content (with cleaned DOM structure)
> * Extracted context for the mentions
> * Alignment to Freebase entities
> * and more...

I would advise you to convert the data to RDF, as this allows its usage within our current tool chain. I would suggest to:

* use the entity URI as subject
* define properties for the other information provided by the dataset
* generate separate RDF files for the different properties (to allow selective imports of that data)

If the "Expanded Dataset" already includes the "Extracted context for the mentions" and the "Alignment to Freebase entities", doing so should be much easier than with the original dataset.
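A minimal sketch of such a conversion (Python, untested; the property URIs are just placeholders you would need to define, and I am assuming the tab-separated format you describe below):

    import io

    MENTION_PROP = "http://example.org/wikilinks/mention"    # placeholder vocabulary
    SOURCE_PROP  = "http://example.org/wikilinks/sourceUrl"  # placeholder vocabulary

    def escape(s):
        # minimal N-Triples literal escaping
        return s.replace("\\", "\\\\").replace('"', '\\"')

    def convert(path):
        # one output file per property, so imports can stay selective
        mentions = io.open("mentions.nt", "w", encoding="utf-8")
        sources = io.open("sources.nt", "w", encoding="utf-8")
        page_url = None
        for line in io.open(path, encoding="utf-8"):
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "URL":
                page_url = fields[1]
            elif fields[0] == "MENTION":
                mention, target = fields[1], fields[3]
                # entity URI (the link target) as subject
                mentions.write('<%s> <%s> "%s" .\n'
                               % (target, MENTION_PROP, escape(mention)))
                sources.write('<%s> <%s> <%s> .\n'
                              % (target, SOURCE_PROP, page_url))
        mentions.close()
        sources.close()

Writing one *.nt file per property is what allows the selective imports mentioned above; covering additional information (e.g. the TOKENs) would just be another output file.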
best
Rupert

[2] http://www.iesl.cs.umass.edu/data/wiki-links

On Wed, Jun 26, 2013 at 12:47 PM, Antonio Perez <ape...@zaizi.com> wrote:
> Hi all
>
> For my project, Freebase Entity Disambiguation in Stanbol, and its first
> milestone due next Friday, 28 June, I've integrated the Freebase data dump
> as an EntityHub ReferencedSite in Stanbol, using the Freebase indexing tool
> (Readme file) and following Rupert's indications (and the Freebase index
> generated by him).
>
> But this index can't be used as-is for disambiguation, because it does not
> include the required data; additional data is needed in order to perform
> disambiguation.
>
> Besides using the Freebase information, I had thought of using the WikiLinks
> resource [1] released by Google.
> The Wikipedia links (WikiLinks) data consists of web pages that satisfy the
> following two constraints:
> - Contain at least one hyperlink that points to Wikipedia
> - The anchor text of that hyperlink closely matches the title of the
> target Wikipedia page
>
> For example, one record looks like:
>
> URL      ftp://121.244.82.26/BENarkhede/ben-official/Utilizing%20Six%20Sigma%20for%20Defect%20Reductions-%20A%20Case%20Study.doc
> MENTION  quality control  49584  http://en.wikipedia.org/wiki/Quality_control
> TOKEN    right       14911
> TOKEN    smarter     14944
> TOKEN    driven      9151
> TOKEN    defines     81657
> TOKEN    laboratory  34360
> TOKEN    workshop    41603
> TOKEN    savings     89906
> TOKEN    Korean      96665
>
> Each file is in the following format:
>
> -------
>
> URL\t<url>\n
> MENTION\t<mention>\t<byte_offset>\t<target_url>\n
> MENTION\t<mention>\t<byte_offset>\t<target_url>\n
> MENTION\t<mention>\t<byte_offset>\t<target_url>\n
> ...
> TOKEN\t<token>\t<byte_offset>\n
> TOKEN\t<token>\t<byte_offset>\n
> TOKEN\t<token>\t<byte_offset>\n
> ...
> \n\n
> URL\t<url>\n
> ...
>
> where each web page is identified by its URL (annotated by URL) and TOKENs
> are the frequent words on that page.
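>
> Just to make the format concrete, a rough parser sketch (Python, untested)
> that yields one record per page, splitting on the blank lines between
> records:
>
>     import io
>
>     def parse(path):
>         url, mentions, tokens = None, [], []
>         for line in io.open(path, encoding="utf-8"):
>             fields = line.rstrip("\n").split("\t")
>             if fields[0] == "URL":
>                 url = fields[1]
>             elif fields[0] == "MENTION":
>                 # (anchor text, byte offset, target Wikipedia URL)
>                 mentions.append((fields[1], int(fields[2]), fields[3]))
>             elif fields[0] == "TOKEN":
>                 tokens.append((fields[1], int(fields[2])))
>             elif url is not None:  # blank line terminates a record
>                 yield url, mentions, tokens
>                 url, mentions, tokens = None, [], []
>         if url is not None:
>             yield url, mentions, tokens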
>
> The statistics of this resource are:
> - Number of documents: 11 million
> - Number of entities: 3 million
> - Number of mentions: 40 million
>
> Using this resource, it would be possible to transform it to RDF and use it
> in the indexer tool, linking it via the URL; it would also be possible to
> create a ManagedSite with this information.
>
> What do you guys think? I need your opinions in order to keep advancing in
> my project.
> Appreciate your help.
>
> Thanks
> Antonio
>
>
> [1] https://code.google.com/p/wiki-links/

--
| Rupert Westenthaler    rupert.westentha...@gmail.com
| Bodenlehenstraße 11    ++43-699-11108907
| A-5500 Bischofshofen