Re: [Dbpedia-discussion] RDF Validator puts Freebase and DBpedia Live to the test

Kingsley Idehen Tue, 09 Apr 2013 18:40:44 -0700

FYI


On 4/9/13 8:01 PM, Paul A. Houle wrote:

PRESS RELEASE
Paul Houle, Ontology2 founder, stated that "we updated Infovore to accept data from DBpedia, and ran a head to head test, in terms of RDF validity,
between Freebase and DBpedia Live."
"Unlike most scientific results", he said, "these results are repeatable, because you can reproduce them yourself with Infovore 1.1. I encourage you to use this tool to put other RDF data sets, large and small, to the test."
The tool parallelSuperEyeball was run against both the 2013-03-31 Freebase
RDF dump and the 2012-04-30 edition of DBpedia Live.
Although Freebase asserts roughly 1.2 billion facts, Infovore rejects roughly
200 million useless facts in pre-filtering.  Downstream of that we found
944,909,025 valid facts and 66,781,906 invalid facts, in addition to
5 especially malformed facts.
This is a serious regression compared to the 2013-01-27 RDF dump, in which only about 13 million invalid triples were discovered. The main cause of the increase is the introduction of 40 million or so "triples" lacking an object
connected with the predicate ns:common.topic.notable_for.  Previously,
the bulk of the invalid triples were incorrectly formatted dates.
The rate of invalid triples in DBpedia Live was found to be orders of magnitude
lower than in Freebase.
Only 8,664 invalid facts were found in DBpedia Live, compared to 247,557,030
valid facts.  The predominant problem in DBpedia Live turned out to be
nonconformant IRIs that came in from Wikipedia.  This count is comparable in
magnitude to the number of facts found invalid in the old Freebase quad dump
in the process of creating :BaseKB Pro.
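
To make the "orders of magnitude" comparison concrete, here is a minimal sketch that turns the counts quoted above into invalid-fact rates. It assumes the rate is invalid / (valid + invalid) and leaves out the 5 especially malformed Freebase facts and the pre-filtered facts; the function and variable names are illustrative and are not part of Infovore.

```python
# Counts quoted in the press release (post-filter facts only).
counts = {
    "Freebase 2013-03-31": {"valid": 944_909_025, "invalid": 66_781_906},
    "DBpedia Live 2012-04-30": {"valid": 247_557_030, "invalid": 8_664},
}

def invalid_rate(valid, invalid):
    """Fraction of checked facts that were rejected as invalid (assumed definition)."""
    return invalid / (valid + invalid)

rates = {name: invalid_rate(**c) for name, c in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4%} invalid")

# How many times higher the Freebase rate is than the DBpedia Live rate.
ratio = rates["Freebase 2013-03-31"] / rates["DBpedia Live 2012-04-30"]
print(f"Freebase rate is about {ratio:,.0f}x the DBpedia Live rate")
```

Under these assumptions the Freebase rate comes out at roughly 6.6% and the DBpedia Live rate at roughly 0.0035%, a gap of about three orders of magnitude.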
Just one of the tools included with Infovore, parallelSuperEyeball is an
industrial strength RDF validator that uses streaming processing and the Map/Reduce
paradigm to attain nearly perfect parallel speedup at many tasks on common
four core computers. Infovore 1.1 brings many improvements, including a threefold
speedup of parallelSuperEyeball and the new Infovore shell.
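
As a rough illustration of the streaming Map/Reduce shape described above, the sketch below classifies lines one at a time (map) and merges the per-line counts (reduce). It is not Infovore's code: the validity check is a toy stand-in for a real RDF parser, it runs sequentially rather than in parallel, and the names classify and validate are hypothetical.

```python
from collections import Counter
from functools import reduce

def classify(line):
    """Map step: label one N-Triples-style line (toy check, not a full parser)."""
    text = line.strip()
    if not text or text.startswith("#"):
        return Counter()                  # blank line or comment: nothing to count
    if not text.endswith("."):
        return Counter(invalid=1)         # missing statement terminator
    terms = text[:-1].split(None, 2)      # subject, predicate, remainder as object
    if len(terms) < 3 or not terms[2].strip():
        return Counter(invalid=1)         # e.g. a "triple" lacking an object
    return Counter(valid=1)

def validate(lines):
    """Reduce step: merge per-line counts; lines are consumed one at a time."""
    return reduce(lambda a, b: a + b, map(classify, lines), Counter())

sample = [
    '<http://example.org/s> <http://example.org/p> "o" .',
    '<http://example.org/s> <http://example.org/p> .',
    '',
]
totals = validate(sample)
print(dict(totals))
```

Because each line is classified independently and the counts merge by addition, the input could in principle be split into chunks, processed on separate cores, and the partial counts summed at the end.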
Please take a look at our github project at
https://github.com/paulhoule/infovore/wiki
and feel free to fork or star it. Note that many infovore data products are
also available at
http://basekb.com/
Because infovore is memory efficient, it is possible to use it to handle much larger data sets than can be kept in a triple store on any given hardware. The main limitation in handling large RDF data sets is running out of disk space;
infovore works through them quickly by avoiding random access I/O.
"We challenge RDF data providers to put their data to the test", said Paul Houle. "Today it's an expectation that people and organizations publish only
valid XML files, and the publication of parallelSuperEyeball is a step to
a world that speaks valid RDF and that can clean and repair invalid files."
Ontology2 is a privately held company that develops web sites and data products based on Freebase, DBpedia, and other sources. Contact p...@ontology2.com with
questions about Ontology2 products and services.



--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

