You're right. I didn't notice it at first, but parsing the data from a file with the *.ttl suffix doesn't raise the error, while parsing a *.nt file does. Forcing N-Triples to be parsed as Turtle also doesn't raise the error. However, when I parse an InputStream of N-Triples as Turtle without a base IRI, the error is raised. No error is raised when a base IRI is provided (e.g., RDFDataMgr.read(model, in, "http://example.org/", Lang.TURTLE)).
I would say parsing IRIs in general should be (easily) configurable. Sometimes I needed a tolerant RDF parser that preserves invalid IRIs or omits the triples in which they appear (e.g., when parsing DBpedia or Wikidata). Other times I wanted a strict parser to detect IRI errors in the source under my control.

- Jindrich

On Wed, 28 Apr 2021 at 11:14, Andy Seaborne <a...@apache.org> wrote:

>
>
> On 28/04/2021 08:00, Jindřich Mynarz wrote:
> > Hi Andy,
> >
> > thanks for the clarifications and sharing work-arounds!
>
> It shouldn't be a parse-stopping error in the first place; it should
> print something and continue.
>
> > When I wrote reading the same data from a file doesn't raise this error, I
> > used: RDFDataMgr.read(model, "data.ttl");
>
> (recreated here)
> Hmm - then something else is going on as well !?!
>
> There is a difference between NT and TTL in that Turtle resolves URIs
> relative to the base, and N-Triples doesn't. They use the same error
> handler.
>
> This also means there is another workaround: force the use of Turtle:
>
> RDFParser.source(file).forceLang(Lang.TTL).parse
>
> > However, neither does reading the data with the riot CLI raise an error, so
> > I suppose the error is silently ignored.
>
> ----
>
> General question: How picky should IRI parsing be?
>
> Some of the big public database dumps have a few bad, by RFC 3986, URIs
> in them (a few 10's out of billions) so Jena passes through not-so-good
> URIs by default, rather than aborting the parser, as long as the input
> data is structurally correct. e.g. A "<" without a ">" is structurally
> incorrect.
>
> What Jena does is for NT, TTL, and roughly it is the same for RDF/XML:
>
> * basic language syntax - can it read between <> with some low level
> checking. No spaces, no newlines, no "<" in the middle of a URI, no
> always-illegal IRI chars [1], no control characters. These are
> structural problems in the data.
>
> * Then there is an RFC3986-check: these are recoverable because the file
> is structurally correct. Almost all are logged and parsing continues.
> The UUID check should be in this category.
>
> The goal is to provide scheme-specific checks for some schemes: http:,
> https:, urn: urn:uuid, did: file: as well as general RFC3986.
>
> https://afs.github.io/rdf-iri-syntax.html
>
> Andy
>
> [1] The illegal characters "{" and "}", '"', "^","|", "`"
> They usually break things like printing later on if they get into a
> graph or database.
> And, yes, HTML URLs can have {} in them but they shouldn't get out as IRIs.
>
> >
> > - Jindrich
> >
> > On Tue, 27 Apr 2021 at 20:27, Andy Seaborne <a...@apache.org> wrote:
> >
> >> Recorded as https://issues.apache.org/jira/browse/JENA-2097
> >>
> >> Was the reading from a file using "riot"?
> >>
> >> It is different because it installs a different error handler that
> >> reports errors but does not throw an exception.
> >>
> >> A java workaround is:
> >>
> >> ErrorHandler eh = ErrorHandlerFactory
> >>     .errorHandlerTracking(ErrorHandlerFactory.stdLogger,
> >>                           false, false);
> >>
> >> then
> >>
> >> RDFParser.source(in).errorHandler(eh).lang(Lang.NT).parse(model);
> >>
> >> or set system-wide with:
> >>
> >> ErrorHandlerFactory.setDefaultErrorHandler
> >>
> >>
> >> On 27/04/2021 16:29, Jindřich Mynarz wrote:
> >>> If I read the code correctly, the regular expression for checking UUID URNs
> >>> is matched to the complete original IRI string (e.g.,
> >>> "urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24")
> >>> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L270)
> >>> rather than to the fragment trailing after "urn:uuid" that the comment
> >>> suggests
> >>> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L255).
> >>>
> >>> Is this a correct understanding?
> >>
> >> Yes.
> >>
> >> (And the regexp is case sensitive which it should not be.)
> >>
> >> Thanks for the details and investigation.
> >>
> >> Andy
> >>
> >>>
> >>> - Jindrich
> >>>
> >>> On Tue, 27 Apr 2021 at 16:59, Jindřich Mynarz <mynarzjindr...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> when reading data with UUID URNs from InputStreams in Jena 4 I get the
> >>>> error "Bad IRI: Not a valid UUID string". For example, when I run the
> >>>> following:
> >>>>
> >>>> Model model = ModelFactory.createDefaultModel();
> >>>> String data = "<urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57> <http://ex.org/a> <http://ex.org/b> .";
> >>>> try (ByteArrayInputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
> >>>>   RDFDataMgr.read(model, in, Lang.NTRIPLES);
> >>>> }
> >>>>
> >>>> I get the following error:
> >>>>
> >>>> org.apache.jena.riot.RiotException: [line: 1, col: 1 ] Bad IRI: Not a valid UUID string: urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57
> >>>> at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
> >>>> at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
> >>>> at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
> >>>> at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
> >>>> at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
> >>>> at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
> >>>> at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:92)
> >>>> at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
> >>>> at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
> >>>> at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
> >>>> at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
> >>>> at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
> >>>> at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:347)
> >>>> at org.apache.jena.riot.RDFParser.parse(RDFParser.java:294)
> >>>> at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
> >>>> at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:721)
> >>>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
> >>>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:222)
> >>>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:208)
> >>>>
> >>>> This error doesn't appear when the data is read from a file or when I use
> >>>> the previous version of Jena (i.e. 3.17.0).
> >>>>
> >>>> Can someone spot if I'm doing something incorrectly? Is this a genuine
> >>>> regression in Jena 4?
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Jindrich
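
For illustration, the two points confirmed in the quoted thread (the UUID pattern being matched against the whole IRI string rather than the part after "urn:uuid:", and the match being case sensitive) can be sketched in plain Java. This is only a sketch under stated assumptions: the class name UuidUrnCheck, both method names, and the regular expression are hypothetical and are not copied from Jena's IRIProviderJenaIRI; the code uses only java.util.regex and does not call Jena.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the UUID URN check discussed in the thread.
// None of these names or patterns are taken from Jena's source code.
final class UuidUrnCheck {
    private static final String PREFIX = "urn:uuid:";

    // Canonical 8-4-4-4-12 hex layout of a UUID string.
    private static final String UUID_REGEX =
        "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}";

    private static final Pattern STRICT = Pattern.compile(UUID_REGEX);
    private static final Pattern TOLERANT =
        Pattern.compile(UUID_REGEX, Pattern.CASE_INSENSITIVE);

    // Mimics the reported problem: the UUID pattern is applied to the
    // complete IRI, so the "urn:uuid:" prefix makes every match fail.
    static boolean checkWholeIri(String iri) {
        return STRICT.matcher(iri).matches();
    }

    // Applies the pattern only to the part after "urn:uuid:" and ignores
    // the case of both the prefix and the hex digits.
    static boolean checkAfterPrefix(String iri) {
        if (!iri.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return false;
        }
        return TOLERANT.matcher(iri.substring(PREFIX.length())).matches();
    }

    public static void main(String[] args) {
        String lower = "urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57";
        String upper = "urn:uuid:4F115B8C-5300-4E4D-84F4-1A7593E5FD57";
        System.out.println("whole IRI, lower case:    " + checkWholeIri(lower));
        System.out.println("after prefix, lower case: " + checkAfterPrefix(lower));
        System.out.println("after prefix, upper case: " + checkAfterPrefix(upper));
    }
}
```

With the IRI from the original report, checkWholeIri rejects the value while checkAfterPrefix accepts it, which is consistent with the "Not a valid UUID string" message being raised for a well-formed UUID URN.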