Hello Andy,

it would just be great to have a mode for tdbloader[2] where invalid triples/quads are simply ignored.
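
To make the request concrete, here is a minimal sketch of the same behaviour done as a pre-filter outside the loader. It is not how tdbloader or the Jena parser works; it is a line-based approximation, assuming the RDF 1.1 N-Triples/N-Quads rule that an IRI may not contain control characters, space, < > " { } | ^ ` or a backslash other than a \uXXXX or \UXXXXXXXX escape. The names is_valid_statement and filter_statements are hypothetical, and the blank node pattern is ASCII-only.

```python
import re

# Unicode escapes are the only backslash sequences allowed inside an IRI.
_UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'

# IRI: no control chars, space, < > " { } | ^ ` or bare backslash.
_IRIREF = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + _UCHAR + r')*>'

# Blank node label, simplified to ASCII (an assumption of this sketch).
_BNODE = r'_:[A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?'

# Literal with optional datatype or language tag.
_LITERAL = (
    r'"(?:[^"\\\r\n]|\\[tbnrf"\'\\]|' + _UCHAR + r')*"'
    r'(?:\^\^' + _IRIREF + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
)

_WS = r'[ \t]*'

# subject predicate object [graph] '.' with an optional trailing comment.
_STATEMENT = re.compile(
    f'{_WS}(?:{_IRIREF}|{_BNODE}){_WS}{_IRIREF}{_WS}'
    f'(?:{_IRIREF}|{_BNODE}|{_LITERAL}){_WS}'
    f'(?:(?:{_IRIREF}|{_BNODE}){_WS})?'
    rf'\.{_WS}(?:#.*)?'
)

_BLANK_OR_COMMENT = re.compile(rf'{_WS}(?:#.*)?')


def is_valid_statement(line: str) -> bool:
    """Return True if the line is blank, a comment, or a well-formed statement."""
    text = line.rstrip('\r\n')
    if _BLANK_OR_COMMENT.fullmatch(text):
        return True
    return _STATEMENT.fullmatch(text) is not None


def filter_statements(lines):
    """Split lines into (kept, rejected) lists; rejected lines are never fixed up."""
    kept, rejected = [], []
    for line in lines:
        (kept if is_valid_statement(line) else rejected).append(line)
    return kept, rejected


if __name__ == '__main__':
    import sys
    # Read a dump on stdin, write valid lines to stdout, report the rest on stderr.
    skipped = 0
    for raw in sys.stdin:
        if is_valid_statement(raw):
            sys.stdout.write(raw)
        else:
            skipped += 1
    print(f'skipped {skipped} invalid line(s)', file=sys.stderr)
```

Run as a filter over the dump, this writes only the lines that pass the check, so the cleaned file can then be loaded as usual. It drops data silently apart from the count on stderr, which is exactly the trade-off being asked for.
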

Regards,

Michael Brunnbauer

On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
> Thanks for that.
> JENA-911 created.
>
> Each of the large public dumps has had quality issues. I'm sure wikidata
> will fix their process if someone helps them. (Freebase did.)
>
> I understand it's frustrating but fixing it in the parser/loader is not a
> real fix, only a limited workaround, because that data can be passed on to
> systems which can't cope. That's what standards are for!!
>
> (anyone know who is involved?)
>
> The RDF 1.1 WG took some time to look at original-NT - the <>-grammar rule
> allows junk IRIs and, if you assume some IRI parsing (java.net.URI is not
> bad) then even things like \n (which was an NL not the characters "\" and
> "n" as the wikidata people are using it) are not getting through. The
> original NT grammar was specific for test cases and is open and loose by
> design.
>
> Please do feed back to wikidata and we can hope it gets fixed at source.
>
> (Ditto DBpedia for that matter)
>
> Andy
>
> Related: JENA-864
>
> NFC and NFKC are two normalization requirements (warnings, not errors) but
> they seem to be more of a hindrance than a help so I'm suggesting removing
> the checking. The IRIs are legal even if not NFC - just not in the form
> preferred by W3C.
>
> On 01/04/15 14:11, Michael Brunnbauer wrote:
> >
> >Hello Andy,
> >
> >[tdbloader2 disk access pattern]
> >>Lots of unique nodes can slow things down because of all the node writing.
> >
> >And there is no way to convert this algorithm to sequential access?
> >
> >[tdbloader2 parser]
> >>>>But also no " { } | ^ ` if I read that right? tdbloader2 accepts those in
> >>>>IRIs.
> >>
> >>Could you provide a set of data with one feature per NTriple line, marking in
> >>a comment what you expect, and I'll check each one and add them to the test
> >>suite.
> >
> >See attachment. I would consider all triples in it illegal according to the
> >N-Triples spec.
> >
> >If I allow these characters that RFC 1738 calls "unsafe", why then not allow
> >CR, LF and TAB? And why then allow \\ but not \", which seems to be
> >sanctioned by older versions of the spec:
> >
> > http://www.w3.org/2001/sw/RDFCore/ntriples/#character
> >
> >I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
> >IRIs, e.g.:
> >
> ><http://www.wikidata.org/entity/P1348v>
> ><http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
> ><http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
> > <http://www.wikidata.org/entity/P18v>
> ><http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
> > .
> >
> >This trial-and-error cleaning of data dumps with self-made scripts and days
> >between each try is very straining and probably a big deterrent for
> >newcomers.
> >I had it with DBpedia and now I have it with Wikidata all over again (with
> >new syntax problems).
> >
> >Regards,
> >
> >Michael Brunnbauer
> >

--
++ Michael Brunnbauer
++ netEstate GmbH
++ Geisenhausener Straße 11a
++ 81379 München
++ Tel +49 89 32 19 77 80
++ Fax +49 89 32 19 77 89
++ E-Mail bru...@netestate.de
++ http://www.netestate.de/
++
++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
++ USt-IdNr. DE221033342
++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel