Re: Reading UUID URNs from InputStreams in Jena 4

Jindřich Mynarz Wed, 28 Apr 2021 03:59:31 -0700

You're right. I didn't notice it at first, but parsing the data from a file
with the *.ttl suffix doesn't raise the error, while parsing *.nt does.
Forcing the parser to read N-Triples as Turtle also doesn't raise the error.
However, when I parse the InputStream with N-Triples as Turtle without a
base IRI, the error is raised. No error is raised when a base IRI is provided
(e.g., RDFDataMgr.read(model, in, "http://example.org/", Lang.TURTLE)).


I would say parsing IRIs in general should be (easily) configurable.
Sometimes I needed a tolerant RDF parser that preserves invalid IRIs or
omits the triples in which they appear (e.g., when parsing DBpedia or
Wikidata). Other times I wanted a strict parser to detect IRI errors in the
source under my control.
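
To make that concrete, here is a rough sketch of the three behaviours I have
in mind (keep, skip, fail). It is plain Java and purely illustrative:
IriPolicySketch, BadIriPolicy, and apply are hypothetical names, not Jena API,
and statements are modelled as arrays of IRI strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names exist in Jena. */
final class IriPolicySketch {

    /** What to do with a statement that contains an IRI failing the check. */
    enum BadIriPolicy { KEEP, SKIP, FAIL }

    private IriPolicySketch() {}

    /**
     * Applies the policy to each statement. A statement is an array of
     * IRI strings (subject, predicate, object); isValidIri is whatever
     * check the caller wants to enforce.
     */
    static List<String[]> apply(List<String[]> statements,
                                Predicate<String> isValidIri,
                                BadIriPolicy policy) {
        List<String[]> out = new ArrayList<>();
        for (String[] st : statements) {
            boolean ok = true;
            for (String iri : st) {
                if (!isValidIri.test(iri)) {
                    ok = false;
                    break;
                }
            }
            if (ok || policy == BadIriPolicy.KEEP) {
                // Tolerant mode preserves the statement, bad IRI included.
                out.add(st);
            } else if (policy == BadIriPolicy.FAIL) {
                // Strict mode stops at the first offending statement.
                throw new IllegalArgumentException(
                        "Bad IRI in statement: " + String.join(" ", st));
            }
            // SKIP: drop the statement and carry on.
        }
        return out;
    }
}
```

KEEP and SKIP would suit dumps such as DBpedia or Wikidata, while FAIL would
suit data I control, where a bad IRI is a bug to be fixed at the source.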

- Jindrich

On Wed, 28 Apr 2021 at 11:14, Andy Seaborne <a...@apache.org> wrote:

>
>
> On 28/04/2021 08:00, Jindřich Mynarz wrote:
> > Hi Andy,
> >
> > thanks for the clarifications and sharing work-arounds!
>
> It shouldn't be a parse-stopping error in the first place; it should
> print something and continue.
>
> > When I wrote that reading the same data from a file doesn't raise this
> > error, I used: RDFDataMgr.read(model, "data.ttl");
>
> (recreated here)
> Hmm - then something else is going on as well !?!
>
> There is a difference between NT and TTL in that Turtle resolves URIs
> relative to the base, and N-Triples doesn't. They use the same error
> handler.
>
> This also means there is another workaround: force the use of Turtle:
>
>      RDFParser.source(file).forceLang(Lang.TTL).parse
>
> > However, neither does reading the data with the riot CLI raise an error,
> > so I suppose the error is silently ignored.
>
> ----
>
> General question: How picky should IRI parsing be?
>
> Some of the big public database dumps have a few bad, by RFC 3986, URIs
> in them (a few tens out of billions) so Jena passes through not-so-good
> URIs by default, rather than aborting the parser, as long as the input
> data is structurally correct. e.g. A "<" without a ">" is structurally
> incorrect.
>
> What Jena does for NT and TTL, and roughly the same for RDF/XML, is:
>
> * basic language syntax - can it read between <> with some low level
> checking. No spaces, no newlines, no "<" in the middle of a URI, no
> always-illegal IRI chars [1], no control characters. These are
> structural problems in the data.
>
> * Then there is an RFC 3986 check: these are recoverable because the file
> is structurally correct. Almost all are logged and parsing continues.
> The UUID check should be in this category.
>
> The goal is to provide scheme-specific checks for some schemes: http:,
> https:, urn:, urn:uuid:, did:, file:, as well as general RFC 3986.
>
> https://afs.github.io/rdf-iri-syntax.html
>
>      Andy
>
> [1] The illegal characters "{" and "}", '"', "^","|", "`"
> They usually break things like printing later on if they get into a
> graph or database.
> And, yes, HTML URLs can have {} in them but they shouldn't get out as IRIs.
>
> >
> > - Jindrich
> >
> > On Tue, 27 Apr 2021 at 20:27, Andy Seaborne <a...@apache.org> wrote:
> >
> >> Recorded as https://issues.apache.org/jira/browse/JENA-2097
> >>
> >> Was the reading from a file using "riot"?
> >>
> >> It is different because it installs a different error handler that
> >> reports errors but does not throw an exception.
> >>
> >> A Java workaround is:
> >>
> >>       ErrorHandler eh = ErrorHandlerFactory
> >>           .errorHandlerTracking(ErrorHandlerFactory.stdLogger,
> >>                                 false, false);
> >>
> >> then
> >>
> >> RDFParser.source(in).errorHandler(eh).lang(Lang.NT).parse(model);
> >>
> >> or set system-wide with:
> >>
> >> ErrorHandlerFactory.setDefaultErrorHandler
> >>
> >>
> >> On 27/04/2021 16:29, Jindřich Mynarz wrote:
> >>> If I read the code correctly, the regular expression for checking UUID
> >>> URNs is matched to the complete original IRI string (e.g.,
> >>> "urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24")
> >>> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L270)
> >>> rather than to the fragment trailing after "urn:uuid" that the comment
> >>> suggests
> >>> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L255).
> >>>
> >>> Is this understanding correct?
> >>
> >> Yes.
> >>
> >> (And the regexp is case sensitive which it should not be.)
> >>
> >> Thanks for the details and investigation.
> >>
> >>       Andy
> >>
> >>>
> >>> - Jindrich
> >>>
> >>> On Tue, 27 Apr 2021 at 16:59, Jindřich Mynarz <mynarzjindr...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> when reading data with UUID URNs from InputStreams in Jena 4 I get the
> >>>> error "Bad IRI: Not a valid UUID string". For example, when I run the
> >>>> following:
> >>>>
> >>>> Model model = ModelFactory.createDefaultModel();
> >>>> String data = "<urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57> "
> >>>>     + "<http://ex.org/a> <http://ex.org/b> .";
> >>>> try (ByteArrayInputStream in =
> >>>>         new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
> >>>>     RDFDataMgr.read(model, in, Lang.NTRIPLES);
> >>>> }
> >>>>
> >>>> I get the following error:
> >>>>
> >>>> org.apache.jena.riot.RiotException: [line: 1, col: 1 ] Bad IRI: Not a valid UUID string: urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57
> >>>>   at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
> >>>>   at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
> >>>>   at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
> >>>>   at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
> >>>>   at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
> >>>>   at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
> >>>>   at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:92)
> >>>>   at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
> >>>>   at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
> >>>>   at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
> >>>>   at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
> >>>>   at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
> >>>>   at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:347)
> >>>>   at org.apache.jena.riot.RDFParser.parse(RDFParser.java:294)
> >>>>   at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
> >>>>   at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:721)
> >>>>   at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
> >>>>   at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:222)
> >>>>   at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:208)
> >>>>
> >>>> This error doesn't appear when the data is read from a file or when I
> >>>> use the previous version of Jena (i.e. 3.17.0).
> >>>>
> >>>> Can someone spot if I'm doing something incorrectly? Is this a genuine
> >>>> regression in Jena 4?
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Jindrich
> >>>>
> >>>
> >>
> >
>
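
For reference, the UUID check problem confirmed earlier in the thread (the
pattern being matched against the complete IRI string instead of the part
after "urn:uuid:", plus the case sensitivity) can be illustrated in plain
Java. This is a minimal sketch, not the code in IRIProviderJenaIRI:
UuidUrnCheck and both method names are hypothetical, and the pattern is only
assumed to resemble the one Jena uses.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not Jena code. */
final class UuidUrnCheck {
    private static final String PREFIX = "urn:uuid:";

    // 8-4-4-4-12 hexadecimal digits; CASE_INSENSITIVE also accepts A-F.
    private static final Pattern UUID = Pattern.compile(
            "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
            Pattern.CASE_INSENSITIVE);

    private UuidUrnCheck() {}

    /**
     * Mimics the reported problem: the UUID pattern is applied to the
     * whole IRI string, so the "urn:uuid:" prefix makes it fail.
     */
    static boolean matchesWholeIri(String iri) {
        return UUID.matcher(iri).matches();
    }

    /**
     * Checks only the part after the "urn:uuid:" prefix. The prefix
     * itself is compared ignoring case.
     */
    static boolean isValidUuidUrn(String iri) {
        if (!iri.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return false;
        }
        return UUID.matcher(iri.substring(PREFIX.length())).matches();
    }
}
```

With the IRI from the original report, matchesWholeIri rejects a perfectly
good UUID URN, while isValidUuidUrn accepts it in either letter case.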
