Over time people have gotten the message that you shouldn't write XML like

   System.out.println("<blah>" + someString + "</blah>");

because it usually ends in tears.
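For contrast, here is a minimal sketch of the same output written through a real XML writer, so that escaping is the library's job rather than yours. It uses the JDK's own StAX API (javax.xml.stream); the class name SafeXml and the helper blah() are made up for illustration. It only takes care of markup characters in the text, so treat it as a starting point rather than a guarantee of valid output.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; only the javax.xml.stream calls are real JDK API.
class SafeXml {
    // Write a single <blah> element and let the writer escape the text content.
    static String blah(String someString) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("blah");
        xml.writeCharacters(someString); // '<' and '&' are escaped here
        xml.writeEndElement();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // With string concatenation this input would produce broken XML.
        System.out.println(blah("a < b & c"));
    }
}
```

The same idea carries over to RDF: hand the values to a serializer from an established toolkit instead of gluing strings together.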

Although most RDF toolkits, like XML toolkits, choke on invalid data, people who write RDF seem to have little concern about whether it is valid. This cultural problem is one of the reasons RDF has been so slow to catch on. If you tell somebody their XML is invalid, they'll feel they have to do something about it, but people don't seem to take any action when they hear that the 20 GB file they published is trash.

As a general practice you should use real RDF tools to write RDF files. This adds some overhead, but it's generally not hard and it gives you a pretty good chance you'll get valid output. ;-)

Lately I've been working on this system

https://github.com/paulhoule/infovore/wiki

which is intended to deal with exactly this situation on a large scale. The "Parallel Super Eyeball 3" tool (the 3 means triples; PSE 4 is a hypothetical tool that does the same for quads) physically separates valid and invalid triples, so you can use the valid triples while being aware of what invalid data tried to sneak in.

Early next week I'm planning on rolling out ":BaseKB Now", which will be filtered Freebase data, processed automatically on a weekly basis. I've got a project in the pipeline that is going to require Wikipedia Categories (I'd better get them fast before they go away) and another large 4D metamemomic data set for which Wikidata Phase I will be a Rosetta Stone, so support for those data sets is on my critical path.

-----Original Message-----
From: Sebastian Hellmann
Sent: Friday, August 9, 2013 10:44 AM
To: Discussion list for the Wikidata project.
Cc: Dimitris Kontokostas ; Jona Christopher Sahnwaldt
Subject: Re: [Wikidata-l] Wikidata RDF export available

Hi Markus,
we just had a look at your Python code and created a dump. We are still
getting a syntax error for the Turtle dump.

I saw that you did not use a mature framework for serializing the
Turtle. Let me explain the problem:

Over the last 4 years, I have seen about two dozen people (undergraduate
and PhD students, as well as Post-Docs) implement "simple" serializers
for RDF.

They all failed.


_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
