Hi Antonio,

Have you noticed that there are additional resources available under [2]?
The "Expanded Dataset" in particular should be of interest, as it
contains:

> * Complete webpage content (with cleaned DOM structure)
> * Extracted context for the mentions
> * Alignment to Freebase entities
> * and more...

I would advise you to convert the data to RDF, as this allows its usage
within our current tool chain.

I would suggest to:

* use the Entity URI as subject
* define properties for the other information provided by the dataset
* generate separate RDF files for the different properties (to allow
selective imports of those data); see the sketch below
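
A minimal sketch in Python of what I have in mind. Note that the
vocabulary namespace and the property names (mentionText, sourceUrl)
are placeholders I made up for illustration; you would need to define
a proper vocabulary:

  # Sketch: emit WikiLinks mentions as N-Triples, writing one file per
  # property so that those data can be imported selectively. The
  # namespace and property names are hypothetical placeholders.
  WL = "http://example.org/wikilinks/vocab#"

  def nt_escape(text):
      # escape backslashes and quotes for use in an N-Triples literal
      return text.replace("\\", "\\\\").replace('"', '\\"')

  def write_mention(entity_uri, mention_text, source_url, out):
      # one triple per property, each written to its own file
      subj = "<%s>" % entity_uri
      out["mentions"].write('%s <%smentionText> "%s" .\n'
                            % (subj, WL, nt_escape(mention_text)))
      out["sources"].write('%s <%ssourceUrl> <%s> .\n'
                           % (subj, WL, source_url))

  # one output file per property, so imports can be selective
  out = {"mentions": open("mentions.nt", "w"),
         "sources": open("sources.nt", "w")}
  write_mention("http://en.wikipedia.org/wiki/Quality_control",
                "quality control",
                "http://example.org/some-page.html",  # dummy source URL
                out)
  for f in out.values():
      f.close()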

If the "Expanded Dataset" already includes the "Extracted context for
the mentions" and the "Alignment to Freebase entities" doing so should
be much easier as with the original dataset.

best
Rupert

[2] http://www.iesl.cs.umass.edu/data/wiki-links

On Wed, Jun 26, 2013 at 12:47 PM, Antonio Perez <ape...@zaizi.com> wrote:
> Hi all
>
> For my project, Freebase Entity Disambiguation in Stanbol, with the first
> milestone due next Friday, 28 June, I've integrated the Freebase data dump
> as an EntityHub ReferencedSite in Stanbol, using the Freebase indexing tool
> (see its Readme file), following Rupert's indications, and using the
> Freebase index he generated.
>
> However, this index can't be used as-is for disambiguation: it does not
> include the required data, so additional data is needed.
>
> Besides using the Freebase information, I have thought of using the
> WikiLinks resource [1] released by Google.
> The Wikipedia links (WikiLinks) data consists of web pages that satisfy the
> following two constraints:
>  - Contain at least one hyperlink that points to Wikipedia
>  - The anchor text of that hyperlink closely matches the title of the
> target Wikipedia page
>
> For example, a record looks like this:
>
>  URL
> ftp://121.244.82.26/BENarkhede/ben-official/Utilizing%20Six%20Sigma%20for%20Defect%20Reductions-%20A%20Case%20Study.doc
>  MENTION    quality control    49584    http://en.wikipedia.org/wiki/Quality_control
>  TOKEN    right    14911
>  TOKEN    smarter    14944
>  TOKEN    driven    9151
>  TOKEN    defines    81657
>  TOKEN    laboratory    34360
>  TOKEN    workshop    41603
>  TOKEN    savings    89906
>  TOKEN    Korean    96665
>
> Each file is in the following format:
>
>  -------
>
>  URL\t<url>\n
>  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
>  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
>  MENTION\t<mention>\t<byte_offset>\t<target_url>\n
>  ...
>  TOKEN\t<token>\t<byte_offset>\n
>  TOKEN\t<token>\t<byte_offset>\n
>  TOKEN\t<token>\t<byte_offset>\n
>  ...
>  \n\n
>  URL\t<url>\n
>  ...
>
> where each web page is identified by its URL (annotated by URL) and the
> TOKENs are the most frequent words on that page.
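>
> To give an idea of how this format can be consumed, here is a minimal
> parsing sketch in Python (a rough sketch, not tested against the full
> corpus; it only assumes the tab-separated layout described above):
>
>   import codecs
>   from collections import namedtuple
>
>   Mention = namedtuple("Mention", "text offset target_url")
>   Token = namedtuple("Token", "text offset")
>
>   def parse_wikilinks(path):
>       # Yield one (url, mentions, tokens) tuple per record; records
>       # are separated by blank lines, fields by tabs.
>       url, mentions, tokens = None, [], []
>       with codecs.open(path, encoding="utf-8") as f:
>           for line in f:
>               fields = line.rstrip("\n").split("\t")
>               if fields[0] == "URL":
>                   url = fields[1]
>               elif fields[0] == "MENTION":
>                   mentions.append(Mention(fields[1], int(fields[2]),
>                                           fields[3]))
>               elif fields[0] == "TOKEN":
>                   tokens.append(Token(fields[1], int(fields[2])))
>               elif url is not None:  # blank line closes the record
>                   yield url, mentions, tokens
>                   url, mentions, tokens = None, [], []
>       if url is not None:  # file may end without the blank lines
>           yield url, mentions, tokens
>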
> The statistics of this resource are:
> - Number of documents: 11 million
> - Number of Entities: 3 million
> - Number of Mentions: 40 million
>
> Using this resource, it would be possible to transform it to RDF format and
> use it in the indexer tool, linking it via the URL; alternatively, it would
> also be possible to create a ManagedSite with this information.
>
> What do you guys think? I need your opinions in order to keep advancing in
> my project.
> Appreciate your help
>
> Thanks
> Antonio
>
>
> [1] https://code.google.com/p/wiki-links/
>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
