Re: Reading UUID URNs from InputStreams in Jena 4

Andy Seaborne Wed, 28 Apr 2021 04:47:45 -0700



On 28/04/2021 11:59, Jindřich Mynarz wrote:

You're right. I didn't notice it at first, but parsing the data with *.ttl
file suffix doesn't raise the error, while parsing *.nt does. Forcing to
parse N-Triples as Turtle also doesn't raise the error. However, when I
parse the InputStream with N-Triples as Turtle without a base IRI, the
error is raised. No error is raised when base IRI is provided
(e.g., RDFDataMgr.read(model, in, "http://example.org/", Lang.TURTLE)).


Good - we're seeing the same thing.

I would say parsing IRIs in general should be (easily) configurable.
Sometimes I needed a tolerant RDF parser that preserves invalid IRIs or
omits the triples in which they appear (e.g., when parsing DBpedia or
Wikidata). Other times I wanted a strict parser to detect IRI errors in the
source under my control.

ErrorHandler is the policy; then you can choose to ignore/log/abort on warnings/errors.

The other control is StreamRDF, which can apply after-parser policy, or fixups or rewrites.

For IRIs, they are (should be) warnings; errors are usually serious parse issues.

Omitting during parsing is messy in Turtle because you are likely to get several valid triples skipped. Then we're into "where's my data?" questions.

Nowadays Wikidata dumps are valid syntax (I checked as part of the IRIx work), with just a few bad IRIs. Haven't tried DBpedia in a while but last time I tried it was valid syntax, with dubious URIs.


    Andy


- Jindrich

On Wed, 28 Apr 2021 at 11:14, Andy Seaborne <[email protected]> wrote:



On 28/04/2021 08:00, Jindřich Mynarz wrote:

Hi Andy,

thanks for the clarifications and sharing work-arounds!


It shouldn't be a parse-stopping error in the first place; it should
print something and continue.

When I wrote that reading the same data from a file doesn't raise this error, I
used: RDFDataMgr.read(model, "data.ttl");


(recreated here)
Hmm - then something else is going on as well !?!

There is a difference between NT and TTL in that Turtle resolves URIs
relative to the base, and N-Triples doesn't. They use the same error
handler.

This also means there is another workaround: force the use of Turtle:

      RDFParser.source(file).forceLang(Lang.TTL).parse(model);

However, neither does reading the data with the riot CLI raise an error, so
I suppose the error is silently ignored.


----

General question: How picky should IRI parsing be?

Some of the big public database dumps have a few URIs that are bad by RFC 3986
in them (a few tens out of billions) so Jena passes through not-so-good
URIs by default, rather than aborting the parser, as long as the input
data is structurally correct. e.g. A "<" without a ">" is structurally
incorrect.

What Jena does for NT and TTL, and roughly the same for RDF/XML:

* basic language syntax - can it read between <> with some low level
checking. No spaces, no newlines, no "<" in the middle of a URI, no
always-illegal IRI chars [1], no control characters. These are
structural problems in the data.

* Then there is an RFC 3986 check: these are recoverable because the file
is structurally correct. Almost all are logged and parsing continues.
The UUID check should be in this category.

The goal is to provide scheme-specific checks for some schemes: http:,
https:, urn:, urn:uuid:, did:, file:, as well as general RFC 3986.

https://afs.github.io/rdf-iri-syntax.html

      Andy

[1] The illegal characters "{" and "}", '"', "^","|", "`"
They usually break things like printing later on if they get into a
graph or database.
And, yes, HTML URLs can have {} in them but they shouldn't get out as IRIs.
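
To make the first, structural level of checking concrete, here is a minimal sketch in plain Java. It is not Jena's tokenizer code: the class and method names are hypothetical, and it only covers the cases named above (whitespace, "<", control characters, and the characters from footnote [1]).

```java
// Hypothetical helper, not part of Jena: a rough structural check on the
// text found between "<" and ">" in N-Triples or Turtle.
class IriStructureCheck {
    // Characters listed in footnote [1] as always illegal in an IRI.
    private static final String ALWAYS_ILLEGAL = "{}\"^|`";

    // Returns false if the string contains anything that would make the
    // data structurally broken rather than merely a questionable IRI.
    static boolean isStructurallyOk(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            char ch = iri.charAt(i);
            if (ch == '<') return false;                      // "<" in the middle
            if (Character.isWhitespace(ch)) return false;     // spaces, newlines, tabs
            if (Character.isISOControl(ch)) return false;     // control characters
            if (ALWAYS_ILLEGAL.indexOf(ch) >= 0) return false; // footnote [1]
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStructurallyOk("urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57"));
        System.out.println(isStructurallyOk("http://ex.org/{id}"));
    }
}
```

Anything that passes this level would then go on to the RFC 3986 and scheme-specific checks, where problems are reported as warnings and parsing continues.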


- Jindrich

On Tue, 27 Apr 2021 at 20:27, Andy Seaborne <[email protected]> wrote:

Recorded as https://issues.apache.org/jira/browse/JENA-2097

Was the reading from a file using "riot"?

It is different because it installs a different error handler that
reports errors but does not throw an exception.

A Java workaround is:

       ErrorHandler eh = ErrorHandlerFactory
           .errorHandlerTracking(ErrorHandlerFactory.stdLogger,
                                 false, false);

then

RDFParser.source(in).errorHandler(eh).lang(Lang.NT).parse(model);

or set system-wide with:

ErrorHandlerFactory.setDefaultErrorHandler


On 27/04/2021 16:29, Jindřich Mynarz wrote:

If I read the code correctly, the regular expression for checking UUID URNs
is matched to the complete original IRI string (e.g.,
"urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24") (
https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L270
) rather than to the fragment trailing after "urn:uuid" that the comment
suggests (
https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L255
).

Is this a correct understanding?


Yes.

(And the regexp is case sensitive which it should not be.)
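
For illustration, a minimal sketch of the check with both points addressed: the pattern is applied only to the part after "urn:uuid:", and case is ignored. This is not the Jena code or the eventual fix; the class and method names are hypothetical, and treating the "urn:uuid:" prefix itself as case-insensitive is an assumption based on the general URN rules.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena: checks a UUID URN by matching only
// the part after "urn:uuid:" and ignoring letter case.
class UuidUrnCheck {
    private static final String PREFIX = "urn:uuid:";

    // 8-4-4-4-12 hexadecimal digits; CASE_INSENSITIVE accepts A-F as well.
    private static final Pattern UUID = Pattern.compile(
        "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        Pattern.CASE_INSENSITIVE);

    static boolean isValidUuidUrn(String iri) {
        if (iri == null) return false;
        // Assumption: the "urn:uuid:" prefix is compared case-insensitively.
        if (!iri.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) return false;
        // Match the UUID pattern against the remainder only, not the whole IRI.
        return UUID.matcher(iri.substring(PREFIX.length())).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUuidUrn("urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57"));
        System.out.println(isValidUuidUrn("urn:uuid:4F115B8C-5300-4E4D-84F4-1A7593E5FD57"));
        System.out.println(isValidUuidUrn("urn:uuid:not-a-uuid"));
    }
}
```

The IRI from the original report and its upper-case variant are both accepted by this sketch, while a malformed remainder is rejected.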

Thanks for the details and investigation.

       Andy


- Jindrich

On Tue, 27 Apr 2021 at 16:59, Jindřich Mynarz <[email protected]> wrote:

Hi,

when reading data with UUID URNs from InputStreams in Jena 4 I get the
error "Bad IRI: Not a valid UUID string". For example, when I run the
following:

Model model = ModelFactory.createDefaultModel();
String data = "<urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57> "
        + "<http://ex.org/a> <http://ex.org/b> .";
try (ByteArrayInputStream in =
        new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
    RDFDataMgr.read(model, in, Lang.NTRIPLES);
}

I get the following error:

org.apache.jena.riot.RiotException: [line: 1, col: 1 ] Bad IRI: Not a valid UUID string: urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
    at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
    at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
    at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
    at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
    at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
    at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:92)
    at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
    at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:347)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:294)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
    at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:721)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:222)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:208)

This error doesn't appear when the data is read from a file or when I use
the previous version of Jena (i.e. 3.17.0).

Can someone spot if I'm doing something incorrectly? Is this a genuine
regression in Jena 4?

Best regards,

Jindrich
