On 28/04/2021 08:00, Jindřich Mynarz wrote:
Hi Andy,

thanks for the clarifications and sharing work-arounds!

It shouldn't be a parse-stopping error in the first place; it should print something and continue.

When I wrote that reading the same data from a file doesn't raise this error, I
used: RDFDataMgr.read(model, "data.ttl");

(recreated here)
Hmm - then something else is going on as well !?!

There is a difference between NT and TTL in that Turtle resolves URIs relative to the base, and N-Triples doesn't. They use the same error handler.

This also means there is another workaround: force the use of Turtle:

    RDFParser.source(file).forceLang(Lang.TTL).parse(model);

However, neither does reading the data with the riot CLI raise an error, so
I suppose the error is silently ignored.

----

General question: How picky should IRI parsing be?

Some of the big public database dumps contain a few URIs that are bad according to RFC 3986 (a few tens out of billions), so by default Jena passes not-so-good URIs through rather than aborting the parser, as long as the input data is structurally correct. For example, a "<" without a ">" is structurally incorrect.

What Jena does for NT and TTL, and roughly the same for RDF/XML, is:

* basic language syntax - can it read what is between <>, with some low-level checking: no spaces, no newlines, no "<" in the middle of a URI, no always-illegal IRI chars [1], no control characters. These are structural problems in the data.

* Then there is an RFC 3986 check: these errors are recoverable because the file is structurally correct. Almost all are logged and parsing continues. The UUID check should be in this category.
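
The first, structural tier described above can be sketched in plain Java. This is only an illustration of the listed rules, not Jena's tokenizer: the class and method names are hypothetical, and the set of always-illegal characters is taken from footnote [1].

```java
/** Illustrative sketch only; names are hypothetical and this is not Jena's code. */
final class IriStructureCheck {
    // The always-illegal characters listed in footnote [1]: { } " ^ | `
    private static final String ALWAYS_ILLEGAL = "{}\"^|`";

    /**
     * Returns true if the text found between "<" and ">" passes the
     * low-level structural rules: no spaces, no "<", no control
     * characters (which includes newlines), no always-illegal characters.
     */
    static boolean isStructurallyOk(String iriText) {
        for (int i = 0; i < iriText.length(); i++) {
            char ch = iriText.charAt(i);
            if (ch == ' ' || ch == '<') {
                return false;
            }
            if (Character.isISOControl(ch)) {
                return false; // newline, tab and other control characters
            }
            if (ALWAYS_ILLEGAL.indexOf(ch) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

A string that fails this kind of check is a structural problem and stops the parse; a string that passes it but fails a later RFC 3986 or scheme-specific check would only be logged.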

The goal is to provide scheme-specific checks for some schemes (http:, https:, urn:, urn:uuid:, did:, file:) as well as the general RFC 3986 checks.

https://afs.github.io/rdf-iri-syntax.html

    Andy

[1] The illegal characters are "{", "}", '"', "^", "|" and "`".
They usually break things like printing later on if they get into a graph or database.
And, yes, HTML URLs can have {} in them but they shouldn't get out as IRIs.


- Jindrich

On Tue, 27 Apr 2021 at 20:27, Andy Seaborne <[email protected]> wrote:

Recorded as https://issues.apache.org/jira/browse/JENA-2097

Was the reading from a file using "riot"?

It is different because it installs a different error handler that
reports errors but does not throw an exception.

A Java workaround is:

      ErrorHandler eh = ErrorHandlerFactory
          .errorHandlerTracking(ErrorHandlerFactory.stdLogger,
                                false, false);

then

RDFParser.source(in).errorHandler(eh).lang(Lang.NT).parse(model);

or set system-wide with:

ErrorHandlerFactory.setDefaultErrorHandler
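
The behavioural difference between a handler that throws and one that only reports can be illustrated without Jena. The sketch below is a toy model under stated assumptions: the Handler interface, both handler classes and the parse method are hypothetical stand-ins, not Jena's ErrorHandler API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; every name here is hypothetical and none is Jena's API. */
final class HandlerSketch {
    /** Stand-in for a parser error handler. */
    interface Handler {
        void error(String message, int line);
    }

    /** Strict handler: the first reported error aborts the parse. */
    static final class Throwing implements Handler {
        @Override
        public void error(String message, int line) {
            throw new IllegalStateException("[line: " + line + "] " + message);
        }
    }

    /** Lenient handler: records the error and lets the parse continue. */
    static final class Collecting implements Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void error(String message, int line) {
            messages.add("[line: " + line + "] " + message);
        }
    }

    /**
     * Toy "parser": reports any line containing a space as a bad IRI,
     * then passes the line through if the handler returned normally.
     * Returns the number of lines passed through.
     */
    static int parse(List<String> lines, Handler handler) {
        int accepted = 0;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.contains(" ")) {
                handler.error("Bad IRI: " + line, i + 1);
            }
            accepted++;
        }
        return accepted;
    }
}
```

With Collecting, all lines are passed through and the problem is only recorded; with Throwing, the same input stops at the first bad line.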


On 27/04/2021 16:29, Jindřich Mynarz wrote:
If I read the code correctly, the regular expression for checking UUID URNs
is matched against the complete original IRI string (e.g.,
"urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24")
(https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L270)
rather than against the part trailing after "urn:uuid:" that the comment
suggests
(https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L255).

Is this a correct understanding?

Yes.

(And the regexp is case sensitive which it should not be.)
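
The two problems noted above (matching the whole IRI string, and case sensitivity) can be reproduced in a few lines of plain Java. This is a minimal sketch, not the code in IRIProviderJenaIRI: the class name, method names and the exact pattern are assumptions made for illustration, and it assumes the "urn:uuid:" prefix may also be compared case-insensitively.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; names and pattern are assumptions, not Jena's code. */
final class UuidUrnCheck {
    private static final String PREFIX = "urn:uuid:";

    // 8-4-4-4-12 hex digits; CASE_INSENSITIVE accepts A-F as well as a-f.
    private static final Pattern UUID = Pattern.compile(
            "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
            Pattern.CASE_INSENSITIVE);

    /** Mistaken variant: applies the UUID-only pattern to the whole IRI string. */
    static boolean matchesWholeIri(String iri) {
        return UUID.matcher(iri).matches();
    }

    /** Intended variant: strips "urn:uuid:" first, then matches the remainder. */
    static boolean isValidUuidUrn(String iri) {
        if (!iri.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return false;
        }
        return UUID.matcher(iri.substring(PREFIX.length())).matches();
    }
}
```

Because matches() requires the entire input to match, the mistaken variant rejects every well-formed "urn:uuid:" IRI, which is consistent with the "Not a valid UUID string" error reported for a valid UUID.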

Thanks for the details and investigation.

      Andy


- Jindrich

On Tue, 27 Apr 2021 at 16:59, Jindřich Mynarz <[email protected]> wrote:

Hi,

when reading data with UUID URNs from InputStreams in Jena 4 I get the
error "Bad IRI: Not a valid UUID string". For example, when I run the
following:

Model model = ModelFactory.createDefaultModel();
String data = "<urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57> <http://ex.org/a> <http://ex.org/b> .";
try (ByteArrayInputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
    RDFDataMgr.read(model, in, Lang.NTRIPLES);
}

I get the following error:

org.apache.jena.riot.RiotException: [line: 1, col: 1 ] Bad IRI: Not a valid UUID string: urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:92)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:347)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:294)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:721)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:222)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:208)

This error doesn't appear when the data is read from a file or when I
use the previous version of Jena (i.e. 3.17.0).

Can someone spot if I'm doing something incorrectly? Is this a genuine
regression in Jena 4?

Best regards,

Jindrich



