Thanks for that.
JENA-911 created.

Each of the large public dumps has had quality issues. I'm sure wikidata will fix their process if someone helps them. (Freebase did.)

I understand it's frustrating, but fixing it in the parser/loader is not a real fix, only a limited workaround, because that data can be passed on to other systems which can't cope. That's what standards are for!


(anyone know who is involved?)

The RDF 1.1 working group took some time to look at the original N-Triples - the <>-grammar rule allows junk IRIs, and if you add some IRI parsing on top (java.net.URI is not bad), then even things like \n (which meant an actual newline, not the two characters "\" and "n" as the Wikidata people are using it) do not get through. The original N-Triples grammar was written for test cases and is open and loose by design.
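
For illustration, a minimal sketch of the kind of rejection java.net.URI gives you - the IRIs below are made-up examples, not taken from any dump:

  import java.net.URI;
  import java.net.URISyntaxException;

  public class IriCheck {
      public static void main(String[] args) {
          String[] candidates = {
              "http://example.org/ok",
              "http://example.org/bad\"quote",    // literal '"'
              "http://example.org/bad\nline"      // literal newline
          };
          for (String s : candidates) {
              try {
                  new URI(s);
                  System.out.println("accepted: " + s);
              } catch (URISyntaxException e) {
                  System.out.println("rejected: " + e.getMessage());
              }
          }
      }
  }

The second and third candidates both throw URISyntaxException; only the first parses.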

Please do feed back to wikidata and we can hope it gets fixed at source.

(Ditto DBpedia for that matter)

        Andy

Related: JENA-864

NFC and NFKC are two normalization requirements (warnings, not errors), but they seem to be more of a hindrance than a help, so I'm suggesting removing the checking. The IRIs are legal even if not in NFC - just not in the form preferred by W3C.
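
For what it's worth, the check itself is a single call in Java via java.text.Normalizer - a sketch, assuming it stays a warning:

  import java.text.Normalizer;

  public class NfcWarn {
      // Warn (don't fail) when an IRI is not in NFC.
      static void warnIfNotNFC(String iri) {
          if (!Normalizer.isNormalized(iri, Normalizer.Form.NFC)) {
              System.err.println("Warning: IRI not in NFC: " + iri);
          }
      }
  }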

On 01/04/15 14:11, Michael Brunnbauer wrote:

Hello Andy,

[tdbloader2 disk access pattern]
Lots of unique nodes can slow things down because of all the node writing.

And there is no way to convert this algorithm to sequential access?

[tdbloader2 parser]
But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in IRIs.

Could you provide a set of data with one feature per N-Triples line, marking
in a comment what you expect, and I'll check each one and add them to the
test suite.

See attachment. I would consider all triples in it illegal according to the
N-Triples spec.

If I allow these characters that RFC 1738 calls "unsafe", why then not allow
CR, LF and TAB? And why then allow \\ but not \", which seems to be sanctioned
by older versions of the spec:

  http://www.w3.org/2001/sw/RDFCore/ntriples/#character
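
For comparison, the RDF 1.1 N-Triples IRIREF rule excludes CR, LF and TAB along with " { } | ^ ` and backslash, allowing only \uXXXX/\UXXXXXXXX escapes inside <...>. A sketch of that character test in Java (UCHAR handling omitted):

  import java.util.regex.Pattern;

  public class IrirefChars {
      // Characters excluded by the N-Triples 1.1 IRIREF production:
      //   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
      // i.e. controls (incl. CR/LF/TAB), space, < > " { } | ^ `
      // and backslash are all out.
      static final Pattern FORBIDDEN =
          Pattern.compile("[\\x00-\\x20<>\"{}|^`\\\\]");

      // body = the text between '<' and '>'
      static boolean plausibleIriref(String body) {
          return !FORBIDDEN.matcher(body).find();
      }
  }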

I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
IRIs, e.g.:

<http://www.wikidata.org/entity/P1348v>
<http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
<http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
<http://www.wikidata.org/entity/P18v>
<http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg> .

This trial-and-error cleaning of data dumps with self-made scripts, with days
between each try, is very draining and probably a big deterrent for newcomers.
I had it with DBpedia and now I have it all over again with Wikidata (with
new syntax problems).

Regards,

Michael Brunnbauer

