Hi Stas,

I think that, in terms of the dumps, /replacing/ the Turtle dump with an N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)

As you already mentioned, N-Triples is easier to process with typical unix command-line tools and scripts, etc. But also any (RDF 1.1) N-Triples file should be valid Turtle, so I don't see a convincing need to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
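
For instance, a line in an N-Triples dump would look something like this (illustrative only; the exact terms will of course vary):

  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .

Read on its own, that line is already a complete, valid Turtle statement.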

(Also, just to put the idea out there: perhaps (also) having an N-Quads dump, where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just wants to quickly refresh a single graph from the dump, or more generally wants to keep track of a simple and quick notion of provenance: "this triple was found in this Web document".)
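
For example (purely illustrative; using the entity's data document as the graph name is just one possibility), such a quad might look like:

  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en <https://www.wikidata.org/wiki/Special:EntityData/Q42> .

where the fourth term names the Web document the triple was taken from.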

Cheers,
Aidan

On 26-08-2016 16:30, Stas Malyshev wrote:
Hi!

I was thinking recently about various data processing scenarios in
Wikidata, and there's one case I think we don't have good coverage for.

TLDR: One of the things I think we might do to make it easier to work
with the data is to have an N-Triples (line-based) RDF dump available.

If you need to process a lot of data (like all enwiki sitelinks, etc.),
then the Query Service is not very efficient for that, due to limits and
the sheer volume of data. We could increase the limits, but not by much -
I don't think we can allow a 30-minute processing task to hog the
service's resources for itself. We have some ways to mitigate this, in
theory, but in practice they'll take time to be implemented and deployed.

The other approach would be to do dump processing, which would work in
most scenarios. The problem is that we have two forms of dump right
now - JSON and TTL (Turtle) - and neither is easy to process without
tools that have a deep understanding of the format. For JSON, we have
Wikidata Toolkit, but it can't ingest RDF/Turtle, and it also has some
entry barrier to getting everything running even when the operation
that needs to be done is trivial.

So I was thinking - what if we also had an N-Triples RDF dump? The
difference between N-Triples and Turtle is that N-Triples is line-based
and fully expanded, which means every line can be understood on its own
without needing any context. This makes it possible to process the dump
using the most basic text-processing tools, or any software that can
read a line of text and apply a regexp to it. The downside of N-Triples
is that it's really verbose, but compression will take care of most of
that, and storing another 10-15G or so should not be a huge deal. Also,
the current code already knows how to generate an N-Triples dump (in
fact, almost all unit tests internally use this format) - we just need
to create a job that actually generates it.
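
As a rough sketch of the kind of processing this enables (the file name,
and the choice of pulling out English labels, are just illustrative), a
few lines of Python with nothing but line-by-line regexp matching would do:

  import gzip
  import re

  # Match lines whose predicate is rdfs:label and whose object is an
  # English-language literal; capture the entity IRI and the label
  # (the label is printed with any N-Triples escapes still in place).
  LABEL_RE = re.compile(
      r'^<(http://www\.wikidata\.org/entity/[^>]+)>\s+'
      r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
      r'"((?:[^"\\]|\\.)*)"@en\s+\.\s*$'
  )

  # "wikidata-dump.nt.gz" is a placeholder name for the compressed dump.
  with gzip.open("wikidata-dump.nt.gz", "rt", encoding="utf-8") as dump:
      for line in dump:
          match = LABEL_RE.match(line)
          if match:
              entity, label = match.groups()
              print(entity, label, sep="\t")

The same thing could of course be done with grep/sed/awk; the point is
that no RDF-aware parser is needed.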

Of course, with the right tools you can generate an N-Triples dump from
either the Turtle dump or the JSON dump (Wikidata Toolkit can do the
latter, IIRC), but that's one more moving part, which makes things
harder and introduces potential for inconsistencies and surprises.

So, what do you think - would having an N-Triples RDF dump for Wikidata
help things?

