(restricting discussion and recipients to DBpedia)

On 10 August 2013 15:53, Sebastian Hellmann <hellm...@informatik.uni-leipzig.de> wrote:
> @Markus: actually, the question is important for DBpedia, because disk space
> on our download server is getting tight for DBpedia 3.9 and other soon to
> come data publishing projects.
> I'm sorry to use your thread for this, but I see the opportunity to create a
> "best current practise" easily and we might be able to save a lot of space
> by doing so.
>
> Maybe we can skip on NTriples .nt and .nq files?

As for the DBpedia download server - I think we should check the server logs to find out which files are actually downloaded. For example, if we find that no one downloads the dumps for many of the smaller languages, we could reduce the number of languages that are stored on the server.

Dropping .nt and .nq sounds good. UTF-8 is much better than \u escaping. There are two issues though:

- There is no standard for a Turtle equivalent of N-Quads. It's trivial to define and produce - like N-Quads, but with UTF-8 encoding instead of \u escapes - but there's no standard, not even a published draft like http://sw.deri.org/2008/07/n-quads/ . There isn't even a recommended file extension. We produced .tql files in the 3.8 release mainly because Turtle is much more readable than NT for most languages, and in the current implementation, enabling this format was actually simpler than excluding it. But if we drop .nq, I don't know how many users will still be able to load DBpedia quads instead of or in addition to triples.

- Although percent-escapes are "strongly discouraged" by the RDF spec [1], DBpedia English resources still use URIs that escape all non-ASCII characters (for backwards compatibility). To move forward, we added an IRI-sameAs-URI dataset. And to fully exploit the increased readability that Turtle offers, we already use IRIs (which don't escape non-ASCII characters, but are otherwise identical to their URI counterparts) for DBpedia English resources in the Turtle files. URIs are still used in the NT files. That means if we drop the NT files, some users may have compatibility problems. I think that's a minor problem though. There were a few other changes in DBpedia percent-escapes between 3.7 and 3.8 [2], and they caused no real problems.

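To make the two encoding differences above concrete, here is a rough Python sketch, not the code the extraction framework actually uses. The function names and the sample resource URI are made up for illustration. The first function turns NT-style \u escapes into plain UTF-8 characters; the second turns a percent-escaped URI into an IRI by decoding only the escapes for non-ASCII bytes, on the assumption (taken from the description above) that IRIs are otherwise identical to their URI counterparts.

```python
import re

# Matches an escaped backslash (left alone), \uXXXX, or \UXXXXXXXX.
_NT_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Matches a run of percent-escapes whose byte value is >= 0x80 (non-ASCII).
_HIGH_BYTES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def unescape_nt(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they denote."""
    def repl(match: re.Match) -> str:
        if match.group(0) == "\\\\":
            return match.group(0)  # escaped backslash: keep as is
        return chr(int(match.group(1) or match.group(2), 16))
    return _NT_ESCAPE.sub(repl, text)


def uri_to_iri(uri: str) -> str:
    """Decode percent-escaped non-ASCII bytes; ASCII escapes such as %28 stay."""
    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8: leave the escapes untouched
    return _HIGH_BYTES.sub(repl, uri)


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show the effect.
    print(unescape_nt('"Caf\\u00E9"@fr'))
    print(uri_to_iri("http://dbpedia.org/resource/Caf%C3%A9_%28band%29"))
```

Escapes for ASCII characters are deliberately left alone, because decoding something like %2F or %25 would change what the URI means.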
Regards,
Christopher

[1] http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
[2] http://wiki.dbpedia.org/URIencoding

>
> 435G    downloads.dbpedia.org
> 1.8G    1.0
> 2.5G    2.0
> 5.1G    3.0
> 7.6G    3.0rc
> 6.0G    3.1
> 6.4G    3.2
> 7.3G    3.3
> 21G     3.4
> 32G     3.5
> 35G     3.5.1
> 34G     3.6
> 44G     3.7
> 63G     3.7-i18n
> 169G    3.8
> ???     3.9
> 22M     wikicompany
> 1.6G    wiktionary
> ...
>
> All the best,
> Sebastian

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion