On 28/04/2021 08:00, Jindřich Mynarz wrote:
Hi Andy,
thanks for the clarifications and sharing work-arounds!
It shouldn't be a parse-stopping error in the first place; it should
print something and continue.
When I wrote that reading the same data from a file doesn't raise this
error, I used:
RDFDataMgr.read(model, "data.ttl");
(recreated here)
Hmm - then something else is going on as well !?!
There is a difference between NT and TTL in that Turtle resolves URIs
relative to the base, and N-Triples doesn't. They use the same error
handler.
This also means there is another workaround: force the use of Turtle:
RDFParser.source(file).forceLang(Lang.TTL).parse(model);
However, reading the data with the riot CLI doesn't raise an error
either, so I suppose the error is silently ignored.
----
General question: How picky should IRI parsing be?
Some of the big public database dumps contain a few URIs that are bad by
RFC 3986 (a few tens out of billions), so by default Jena passes
not-so-good URIs through rather than aborting the parser, as long as the
input data is structurally correct. For example, a "<" without a ">" is
structurally incorrect.
What Jena does for NT and TTL, and roughly the same for RDF/XML, is:
* basic language syntax - can it read between <> with some low level
checking. No spaces, no newlines, no "<" in the middle of a URI, no
always-illegal IRI chars [1], no control characters. These are
structural problems in the data.
* Then there is an RFC 3986 check: these are recoverable because the file
is structurally correct. Almost all are logged and parsing continues.
The UUID check should be in this category.
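
The two levels above can be sketched roughly as follows. This is only an
illustration and not Jena's implementation: the class IriCheckSketch, the
Severity enum and the check method are hypothetical names, and the second
level shows a single scheme-specific rule (urn:uuid) rather than a full
RFC 3986 check.

```java
/** Hypothetical two-level IRI check; names and structure are assumptions, not Jena code. */
final class IriCheckSketch {
    enum Severity { OK, WARNING, FATAL }

    // The always-illegal characters listed in footnote [1] of this message.
    private static final String ALWAYS_ILLEGAL = "{}\"^|`";
    private static final String UUID_PREFIX = "urn:uuid:";

    static Severity check(String iri) {
        // Level 1: structural problems. The parser cannot sensibly continue.
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c == ' ' || c == '<' || Character.isISOControl(c)
                    || ALWAYS_ILLEGAL.indexOf(c) >= 0) {
                return Severity.FATAL;
            }
        }
        // Level 2: recoverable, scheme-specific problems. Log and continue.
        if (iri.regionMatches(true, 0, UUID_PREFIX, 0, UUID_PREFIX.length())) {
            String uuid = iri.substring(UUID_PREFIX.length());
            if (!uuid.matches("(?i)[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")) {
                return Severity.WARNING;
            }
        }
        return Severity.OK;
    }

    public static void main(String[] args) {
        System.out.println(check("urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57"));
        System.out.println(check("urn:uuid:not-a-uuid"));
        System.out.println(check("http://ex.org/{x}"));
    }
}
```

A caller would abort only on FATAL and merely log a WARNING, which is the
behaviour argued for here.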
The goal is to provide scheme-specific checks for some schemes (http:,
https:, urn:, urn:uuid, did:, file:) as well as general RFC 3986 checks.
https://afs.github.io/rdf-iri-syntax.html
Andy
[1] The illegal characters are "{", "}", '"', "^", "|" and "`".
They usually break things like printing later on if they get into a
graph or database.
And, yes, HTML URLs can have {} in them but they shouldn't get out as IRIs.
- Jindrich
On Tue, 27 Apr 2021 at 20:27, Andy Seaborne <a...@apache.org> wrote:
Recorded as https://issues.apache.org/jira/browse/JENA-2097
Was the reading from a file using "riot"?
It is different because it installs a different error handler that
reports errors but does not throw an exception.
A Java workaround is:
ErrorHandler eh = ErrorHandlerFactory
    .errorHandlerTracking(ErrorHandlerFactory.stdLogger, false, false);
then
RDFParser.source(in).errorHandler(eh).lang(Lang.NT).parse(model);
or set system-wide with:
ErrorHandlerFactory.setDefaultErrorHandler
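
To illustrate why the same input can fail in one setup and pass in another,
here is a minimal sketch of a parser with a pluggable error handler. It
deliberately uses no Jena classes: ErrorHandlerSketch, Handler, STRICT,
logging and the toy parse method are all hypothetical names made up for
this example, and the "bad IRI" rule (contains a space) is only a stand-in.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of throwing vs. logging error handlers; not Jena's API. */
final class ErrorHandlerSketch {
    /** Stand-in for a parser's error callback. */
    interface Handler {
        void error(String message, int line);
    }

    /** Strict handler: the first error aborts parsing with an exception. */
    static final Handler STRICT = (msg, line) -> {
        throw new IllegalStateException("[line: " + line + "] " + msg);
    };

    /** Lenient handler: records the error and lets parsing continue. */
    static Handler logging(List<String> log) {
        return (msg, line) -> log.add("[line: " + line + "] " + msg);
    }

    /** Toy parser: reports items containing a space, then keeps them unless the handler threw. */
    static List<String> parse(List<String> lines, Handler handler) {
        List<String> accepted = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String item = lines.get(i);
            if (item.contains(" ")) {
                handler.error("Bad IRI: " + item, i + 1);
            }
            accepted.add(item);
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<String> out = parse(List.of("urn:a", "bad iri"), logging(log));
        System.out.println(out.size() + " items kept, " + log.size() + " problem(s) logged");
    }
}
```

With the lenient handler both items are kept and one problem is logged;
with STRICT the same input stops at the bad item.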
On 27/04/2021 16:29, Jindřich Mynarz wrote:
If I read the code correctly, the regular expression for checking UUID
URNs is matched against the complete original IRI string (e.g.,
"urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24")
(https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L270)
rather than against the substring following "urn:uuid:" that the comment
suggests
(https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/irix/IRIProviderJenaIRI.java#L255).
Is this a correct understanding?
Yes.
(And the regexp is case sensitive which it should not be.)
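
A minimal sketch of a check that avoids both problems, assuming a
standalone helper rather than the actual Jena code (UuidUrnCheck and
isValidUuidUrn are hypothetical names): the pattern is applied only to the
part after "urn:uuid:" and is compiled case-insensitively.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the code in IRIProviderJenaIRI. */
final class UuidUrnCheck {
    private static final String PREFIX = "urn:uuid:";

    // 8-4-4-4-12 hex digits; CASE_INSENSITIVE accepts A-F as well as a-f.
    private static final Pattern UUID = Pattern.compile(
            "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
            Pattern.CASE_INSENSITIVE);

    static boolean isValidUuidUrn(String iri) {
        // Compare the "urn:uuid:" prefix ignoring case.
        if (!iri.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return false;
        }
        // Match only the part after the prefix, not the whole IRI string.
        return UUID.matcher(iri.substring(PREFIX.length())).matches();
    }

    public static void main(String[] args) {
        String iri = "urn:uuid:3e5baa77-a990-4a34-85b5-b66246829d24";
        System.out.println(isValidUuidUrn(iri));
        System.out.println(isValidUuidUrn(iri.toUpperCase()));
        // Applying the bare UUID pattern to the whole IRI never matches:
        System.out.println(UUID.matcher(iri).matches());
    }
}
```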
Thanks for the details and investigation.
Andy
- Jindrich
On Tue, 27 Apr 2021 at 16:59, Jindřich Mynarz <mynarzjindr...@gmail.com> wrote:
Hi,
when reading data with UUID URNs from InputStreams in Jena 4 I get the
error "Bad IRI: Not a valid UUID string". For example, when I run the
following:
Model model = ModelFactory.createDefaultModel();
String data = "<urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57> <http://ex.org/a> <http://ex.org/b> .";
try (ByteArrayInputStream in =
         new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
    RDFDataMgr.read(model, in, Lang.NTRIPLES);
}
I get the following error:
org.apache.jena.riot.RiotException: [line: 1, col: 1 ] Bad IRI: Not a valid UUID string: urn:uuid:4f115b8c-5300-4e4d-84f4-1a7593e5fd57
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
    at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
    at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
    at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
    at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
    at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
    at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:92)
    at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
    at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:347)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:294)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
    at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:721)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:222)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:208)
This error doesn't appear when the data is read from a file or when I
use the previous version of Jena (i.e. 3.17.0).
Can someone spot if I'm doing something incorrectly? Is this a genuine
regression in Jena 4?
Best regards,
Jindrich