FYI
On 4/9/13 8:01 PM, Paul A. Houle wrote:
PRESS RELEASE

Paul Houle, Ontology2 founder, stated that "we updated Infovore to accept data from DBpedia, and ran a head-to-head test, in terms of RDF validity, between Freebase and DBpedia Live."

"Unlike most scientific results", he said, "these results are repeatable, because you can reproduce them yourself with Infovore 1.1. I encourage you to use this tool to put other RDF data sets, large and small, to the test."

The tool parallelSuperEyeball was run against both the 2013-03-31 Freebase RDF dump and the 2012-04-30 edition of DBpedia Live.

Although Freebase asserts roughly 1.2 billion facts, Infovore rejects roughly 200 million useless facts in pre-filtering. Downstream of that we found 944,909,025 valid facts and 66,781,906 invalid facts, in addition to 5 especially malformed facts.

This is a serious regression compared to the 2013-01-27 RDF dump, in which only about 13 million invalid triples were discovered. The main cause of the increase is the introduction of 40 million or so "triples" lacking an object connected with the predicate ns:common.topic.notable_for. Previously, the bulk of the invalid triples were incorrectly formatted dates.

The rate of invalid triples in DBpedia Live was found to be orders of magnitude lower than in Freebase. Only 8,664 invalid facts were found in DBpedia Live, compared to 247,557,030 valid facts. The predominant problem in DBpedia Live turned out to be nonconformant IRIs that came in from Wikipedia. This is comparable in magnitude to the number of facts found invalid in the old Freebase quad dump in the process of creating :BaseKB Pro.

Just one of the tools included with Infovore, parallelSuperEyeball is an industrial strength RDF validator that uses streaming processing and the Map/Reduce paradigm to attain nearly perfect parallel speedup at many tasks on common four core computers. Infovore 1.1 brings many improvements, including a threefold speedup of parallelSuperEyeball and the new Infovore shell.
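The invalid-triple rates behind the "orders of magnitude" comparison can be checked with a few lines of arithmetic. The sketch below is not part of Infovore; the function name `invalid_rate` is made up for illustration, and it assumes the rate is taken over valid plus invalid facts downstream of pre-filtering, ignoring the 5 especially malformed Freebase facts.

```python
import math

def invalid_rate(valid: int, invalid: int) -> float:
    """Fraction of checked facts that were rejected as invalid (hypothetical helper)."""
    return invalid / (valid + invalid)

# Counts quoted in the press release.
freebase_rate = invalid_rate(valid=944_909_025, invalid=66_781_906)
dbpedia_rate = invalid_rate(valid=247_557_030, invalid=8_664)

# How many times higher the Freebase rate is, and the same gap in powers of ten.
ratio = freebase_rate / dbpedia_rate
orders_of_magnitude = math.log10(ratio)

print(f"Freebase invalid rate:     {freebase_rate:.4%}")
print(f"DBpedia Live invalid rate: {dbpedia_rate:.4%}")
print(f"Ratio: {ratio:.0f}x (about {orders_of_magnitude:.1f} orders of magnitude)")
```

Under these assumptions the Freebase rate comes out at about 6.6% and the DBpedia Live rate at about 0.0035%, a gap of roughly three orders of magnitude.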
Please take a look at our github project at https://github.com/paulhoule/infovore/wiki and feel free to fork or star it. Note that many infovore data products are also available at http://basekb.com/

Because infovore is memory efficient, it is possible to use it to handle much larger data sets than can be kept in a triple store on any given hardware. The main limitation in handling large RDF data sets is running out of disk space, which it can use up quickly because it avoids random access I/O.

"We challenge RDF data providers to put their data to the test", said Paul Houle. "Today it's an expectation that people and organizations publish only valid XML files, and the publication of parallelSuperEyeball is a step to a world that speaks valid RDF and that can clean and repair invalid files."

Ontology2 is a privately held company that develops web sites and data products based on Freebase, DBpedia, and other sources. Contact p...@ontology2.com with questions about Ontology2 products and services.
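To illustrate why a streaming check needs so little memory, here is a minimal sketch that looks at one line at a time and keeps only two counters. It is not Infovore's code and does not implement real RDF validation: the function name `count_triples` is hypothetical, and the input format (tab-separated subject, predicate, and object with a trailing "." terminator) is an assumption made for the example. It only flags the kind of problem described above, a "triple" with no object.

```python
from typing import Iterable, Tuple

def count_triples(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (valid, invalid) counts, reading the input one line at a time."""
    valid = invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed format: fields separated by tabs, statement ends with "."
        if line.endswith("."):
            line = line[:-1].rstrip()
        fields = line.split("\t")
        # A well-formed statement has exactly three non-empty fields.
        if len(fields) == 3 and all(f.strip() for f in fields):
            valid += 1
        else:
            invalid += 1
    return valid, invalid

# Made-up sample lines; the second one has no object.
sample = [
    'ns:m.1\tns:type.object.name\t"A"@en\t.',
    "ns:m.2\tns:common.topic.notable_for\t\t.",
    "",
    "ns:m.3\tns:p\tns:m.4\t.",
]
result = count_triples(sample)
print(result)
```

Because the function accepts any iterable of lines, it can be handed an open file object and will read it sequentially, so memory use stays flat no matter how large the dump is.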
--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen