I don't know of such a resource off-hand, but you might want to consider
expanding your search to text corpora annotated with Freebase or Google
Knowledge Graph IDs (the same IDs are used for both). Wikidata contains
mappings to Freebase IDs, although they are somewhat incomplete (and this
additional mapping step adds an extra layer of variability).
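
A minimal sketch of using that mapping, assuming the public Wikidata
Query Service endpoint and the "Freebase ID" property (P646). The function
names and the User-Agent string are hypothetical, chosen for illustration
only. A Freebase ID may map to zero or several Wikidata items, so the
result keeps a list per ID rather than a single value.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(mids):
    """Build a SPARQL query that looks up items by Freebase ID (P646)."""
    # json.dumps yields a double-quoted, escaped literal for each ID.
    values = " ".join(json.dumps(mid) for mid in mids)
    return (
        "SELECT ?item ?mid WHERE { "
        f"VALUES ?mid {{ {values} }} "
        "?item wdt:P646 ?mid . }"
    )


def parse_bindings(response):
    """Turn a SPARQL JSON result into {freebase_id: [wikidata_id, ...]}."""
    mapping = {}
    for row in response["results"]["bindings"]:
        # Item URIs end in the Q-identifier, e.g. .../entity/Q76
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        mapping.setdefault(row["mid"]["value"], []).append(qid)
    return mapping


def fetch_mapping(mids):
    """Query the endpoint (network access required) for a batch of IDs."""
    params = urllib.parse.urlencode(
        {"query": build_query(mids), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "mid-to-qid-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_bindings(json.load(resp))
```

IDs that are absent from the returned dictionary have no mapping in
Wikidata, which is one way to measure how incomplete the mapping is for a
given corpus.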

The other issue is that all of the corpora that I'm aware of are
automatically annotated, so they're not "gold standard" truth sets, but you
could cherry-pick the high-confidence annotations and/or do additional
human verification.
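
A rough sketch of that cherry-picking step. It assumes a tab-separated
annotation file with eight columns (document ID, encoding, mention text,
start offset, end offset, mention confidence, context confidence, Freebase
ID). That layout is an assumption and should be checked against the
documentation of whichever corpus you download. The thresholds are
arbitrary placeholders, not recommended values.

```python
import csv
from typing import Iterable, Iterator, List, NamedTuple


class Annotation(NamedTuple):
    doc_id: str
    mention: str
    start: int
    end: int
    mention_conf: float
    context_conf: float
    mid: str


def parse_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Parse tab-separated annotation rows (assumed 8-column layout)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 8:
            continue  # skip rows that do not match the assumed layout
        doc_id, _encoding, mention, start, end, m_conf, c_conf, mid = row
        yield Annotation(
            doc_id, mention, int(start), int(end),
            float(m_conf), float(c_conf), mid,
        )


def high_confidence(
    annotations: Iterable[Annotation],
    min_mention: float = 0.9,   # placeholder threshold
    min_context: float = 0.9,   # placeholder threshold
) -> List[Annotation]:
    """Keep annotations whose two confidence scores both pass."""
    return [
        a for a in annotations
        if a.mention_conf >= min_mention and a.context_conf >= min_context
    ]
```

The surviving annotations would still need a human spot check before being
treated as a gold standard.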

Two that I know of are:

ClueWeb09 & ClueWeb12 - 800M documents, 11B "clues" -
https://research.googleblog.com/2013/07/11-billion-clues-in-800-million.html
TREC KBA Stream Corpus 2014 - 394M documents, 9.4B mentions -
http://trec-kba.org/data/fakba1/

I haven't seen any recent releases of similar stuff. Not sure what
identifiers Google will use for this kind of work in the future now that
they've shut down Freebase.

Tom


On Sun, Feb 5, 2017 at 9:47 AM, Samuel Printz <samuel.pri...@outlook.de>
wrote:

> Hello everyone,
>
> I am looking for a text corpus that is annotated with Wikidata entities.
> I need this for the evaluation of an entity linking tool based on
> Wikidata, which is part of my bachelor thesis.
>
> Does such a corpus exist?
>
> Ideal would be a corpus annotated in the NIF format [1], as I want to
> use GERBIL [2] for the evaluation. But it is not necessary.
>
> Thanks for hints!
> Samuel
>
> [1] https://site.nlp2rdf.org/
> [2] http://aksw.org/Projects/GERBIL.html
>
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>