(restricting discussion and recipients to DBpedia)

On 10 August 2013 15:53, Sebastian Hellmann
<hellm...@informatik.uni-leipzig.de> wrote:
> @Markus: actually, the question is important for DBpedia, because disk space
> on our download server is getting tight for DBpedia 3.9 and other soon to
> come data publishing projects.
> I'm sorry to use your thread for this, but I see the opportunity to create a
> "best current practice" easily and we might be able to save a lot of space
> by doing so.
>
> Maybe we can skip on NTriples .nt and .nq files?

As for the DBpedia download server - I think we should check the
server logs to find out which files are actually downloaded. For
example, if we find that no one downloads the dumps for many of the
smaller languages, we could reduce the number of languages that are
stored on the server.
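
As a rough illustration of such a log check, here is a minimal Python
sketch that counts successful downloads per release and language. It
assumes Apache-style access log lines and a /<release>/<language>/<file>
path layout. Both are assumptions made for the example, as are the
function name and the log file name in the usage note; the real server
setup may differ.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log format, where the request line
# is quoted and immediately followed by the HTTP status code.
_REQUEST = re.compile(r'"GET (\S+) [^"]*" (\d{3})')


def downloads_per_language(log_lines):
    """Count successful GET requests per (release, language) pair."""
    counts = Counter()
    for line in log_lines:
        match = _REQUEST.search(line)
        # Only full, successful downloads (status 200) are counted here;
        # partial responses (206) and errors are ignored.
        if match is None or match.group(2) != "200":
            continue
        path = match.group(1).split("?")[0]
        parts = path.strip("/").split("/")
        # Assumption: dump files live at /<release>/<language>/<file>.
        if len(parts) == 3:
            release, language, _file_name = parts
            counts[(release, language)] += 1
    return counts
```

It could be run over a log file (hypothetical name) with
`downloads_per_language(open("access.log", encoding="utf-8", errors="replace"))`;
languages that never show up in the result would be candidates for
removal.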

Dropping .nt and .nq sounds good. UTF-8 is much better than \u
escaping. There are two issues though:

- There is no standard for a Turtle equivalent of N-Quads. It's
trivial to define and produce - like N-Quads, but with UTF-8 encoding
instead of \u escapes - but there's no standard, not even a published
draft like http://sw.deri.org/2008/07/n-quads/ . Not even a
recommended file extension. We produced .tql files in the 3.8 release
mainly because Turtle is much more readable than NT for most languages
and in the current implementation, enabling this format was actually
simpler than excluding it. But if we drop .nq, I don't know how many
users will be able to load DBpedia quads instead of or in addition to
triples.

- Although percent-escapes are "strongly discouraged" by the RDF spec
[1], DBpedia English resources still use URIs (for backwards
compatibility), which escape all non-ASCII characters. To move
forward, we added an IRI-sameAs-URI dataset. And to fully exploit the
increased readability that Turtle offers, we already use IRIs (which
don't escape non-ASCII characters, but are otherwise identical to
their URI counterparts) for DBpedia English resources in the Turtle
files. URIs are still used in the NT files. That means if we drop the
NT files, some users may have compatibility problems. I think that's a
minor problem though. There were a few other changes in DBpedia
percent-escapes between 3.7 and 3.8 [2], and no real problems with
that.
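
To make the two differences above concrete, here is a small Python
sketch of both conversions: replacing \u escapes with the actual
characters, and turning a percent-escaped URI into the corresponding
IRI. This is only an illustration under simplifying assumptions, not
the code the extraction framework uses; the function names are made up
for the example.

```python
import re

# Any backslash escape: \uXXXX, \UXXXXXXXX, or a backslash plus one character.
# Matching the other escapes too ensures that "\\u00FC" (an escaped backslash
# followed by plain text) is not mistaken for a \u escape.
_NT_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")

# A run of percent-escaped octets that all have the high bit set (0x80-0xFF).
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def unescape_non_ascii(line):
    """Replace \\u and \\U escapes of non-ASCII characters with the characters."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # other escapes (\\, \", \n, ...) stay as is
        code_point = int(hex_digits, 16)
        # Keep ASCII escapes (e.g. \u0022), surrogates and out-of-range values.
        if code_point < 0x80 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return _NT_ESCAPE.sub(replace, line)


def uri_to_iri(uri):
    """Decode percent-escapes of non-ASCII characters; leave ASCII escapes alone."""
    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8: keep the escapes
    return _NON_ASCII_RUN.sub(decode_run, uri)
```

For example, `uri_to_iri("http://dbpedia.org/resource/Z%C3%BCrich")`
gives `http://dbpedia.org/resource/Zürich`, while an ASCII escape such
as `%25` is left untouched, so the IRI stays otherwise identical to the
URI.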

Regards,
Christopher


[1] http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
[2] http://wiki.dbpedia.org/URIencoding

>
>
> 435G    downloads.dbpedia.org
> 1.8G    1.0
> 2.5G    2.0
> 5.1G    3.0
> 7.6G    3.0rc
> 6.0G    3.1
> 6.4G    3.2
> 7.3G    3.3
> 21G    3.4
> 32G    3.5
> 35G    3.5.1
> 34G    3.6
> 44G    3.7
> 63G    3.7-i18n
> 169G    3.8
> ??? 3.9
> 22M    wikicompany
> 1.6G    wiktionary
> ...
>
> All the best,
> Sebastian
