Hello Andy,

it would just be great to have a mode for tdbloader[2] where invalid
triples/quads are simply ignored.
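
Until such a mode exists, the same effect can be approximated by filtering the
dump before loading. The following is only a minimal sketch, assuming the
IRIREF and literal rules of the RDF 1.1 N-Triples/N-Quads grammar. The names
STATEMENT and keep_valid are hypothetical and not part of Jena, blank node
labels are restricted to ASCII for brevity, and no IRI parsing beyond the
character check is attempted.

```python
import re

# \uXXXX and \UXXXXXXXX are the only escapes allowed inside an IRI.
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
# IRIREF: no control chars or space, and none of < > " { } | ^ ` \
IRIREF = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>'
# Simplification (assumption): ASCII-only blank node labels.
BNODE = r'_:[A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?'
LITERAL = (r'"(?:[^"\\\n\r]|\\[tbnrf"\'\\]|' + UCHAR + r')*"'
           r'(?:\^\^' + IRIREF + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')

WS = r'[ \t]*'
STATEMENT = re.compile(
    WS + r'(?:' + IRIREF + r'|' + BNODE + r')'                    # subject
    + WS + IRIREF                                                 # predicate
    + WS + r'(?:' + IRIREF + r'|' + BNODE + r'|' + LITERAL + r')'  # object
    + WS + r'(?:(?:' + IRIREF + r'|' + BNODE + r')' + WS + r')?'   # optional graph
    + r'\.' + WS + r'(?:#.*)?'
)


def keep_valid(lines):
    """Yield blank lines, comments, and statements that match the grammar.

    Everything else (for example IRIs containing \\" or \\n) is dropped.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            yield line
        elif STATEMENT.fullmatch(line.rstrip('\r\n')):
            yield line
```

The generator works on any iterable of lines, so it can read from an open dump
file and write the surviving lines to a new file that is then handed to the
loader. Counting the dropped lines would be an easy addition for reporting.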

Regards,

Michael Brunnbauer

On Wed, Apr 01, 2015 at 03:17:08PM +0100, Andy Seaborne wrote:
> Thanks for that.
> JENA-911 created.
> 
> Each of the large public dumps has had quality issues.  I'm sure wikidata
> will fix their process if someone helps them.  (Freebase did.)
> 
> I understand it's frustrating but fixing it in the parser/loader is not a
> real fix, only a limited workaround, because that data can be passed on to
> systems which can't cope.  That's what standards are for!!
> 
> 
> (anyone know who is involved?)
> 
> The RDF 1.1 working group took some time to look at original-NT - the
> <>-grammar rule allows junk IRIs and, if you assume some IRI parsing
> (java.net.URI is not bad), then even things like \n (which was an NL, not
> the characters "\" and "n" as the wikidata people are using it) do not get
> through.  The original NT grammar was specific for test cases and is open
> and loose by design.
> 
> Please do feed back to wikidata and we can hope it gets fixed at source.
> 
> (Ditto DBpedia for that matter)
> 
>       Andy
> 
> Related: JENA-864
> 
> NFC and NFKC are two normalization requirements (warnings, not errors) but
> they seem to be more of a hindrance than a help so I'm suggesting removing
> the checking.  The IRIs are legal even if not in NFC - just not in the form
> preferred by W3C.
> 
> On 01/04/15 14:11, Michael Brunnbauer wrote:
> >
> >Hello Andy,
> >
> >[tdbloader2 disk access pattern]
> >>Lots of unique nodes can slow things down because of all the node writing.
> >
> >And there is no way to convert this algorithm to sequential access?
> >
> >[tdbloader2 parser]
> >>>>But also no " { } | ^ ` if I read that right? tdbloader2 accepts those
> >>>>in IRIs.
> >>
> >>Could you provide a set of data with one feature per N-Triples line,
> >>marking in a comment what you expect, and I'll check each one and add
> >>them to the test suite.
> >
> >See attachment. I would consider all triples in it illegal according to the
> >N-Triples spec.
> >
> >If I allow these characters that RFC 1738 calls "unsafe", why then not allow
> >CR, LF and TAB? And why then allow \\ but not \", which seems to be
> >sanctioned by older versions of the spec:
> >
> >  http://www.w3.org/2001/sw/RDFCore/ntriples/#character
> >
> >I found 752 triples with \" IRIs in the Wikidata dump and 94 triples with \n
> >IRIs, e.g.:
> >
> ><http://www.wikidata.org/entity/P1348v> 
> ><http://www.algaebase.org/search/species/detail/?species_id=26717\n> .
> ><http://www.wikidata.org/entity/Q181274S0B6CB54F-C792-4A12-B20E-A165B91BB46D>
> > <http://www.wikidata.org/entity/P18v> 
> ><http://commons.wikimedia.org/wiki/File:George_\"Corpsegrinder\"_Fisher_of_Cannibal_Corpse.jpg>
> > .
> >
> >This trial-and-error cleaning of data dumps with self-made scripts, with
> >days between each try, is very wearing and probably a big deterrent for
> >newcomers. I had it with DBpedia and now I have it with Wikidata all over
> >again (with new syntax problems).
> >
> >Regards,
> >
> >Michael Brunnbauer
> >

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail bru...@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

