Jena's RDF/XML parser, ARP, was originally a separate subsystem that could be configured for the different possible directions of the RDF 1.0 working group and for the different treatments of IRIs that were possible at the time (this was before RFC 3986/3987). It is the "xmlinput" package in jena-core.

It has a close coupling to jena-iri with features such as customization of errors, and an idiosyncratic approach to relative IRIs (if called directly). These are outside normal use of RDF/XML. When used from model.read or a RIOT API, these features aren't accessible.

Both jena-iri and ARP are hard to maintain.

xmlinput is the last part of Jena that uses jena-iri directly.

Jena has an IRI abstraction, IRIx, that allows switching IRI providers. The Jena releases use jena-iri as the provider through the IRIx abstraction, so error messages are the same as before.

There is a compatibility test suite that gives the expected behaviour of an IRIx implementation. It checks on a pass/warning/error basis, not on error message text.
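As a rough illustration of comparing on outcome category and not on message text, here is a minimal sketch. All the names in it (Outcome, CheckResult, sameOutcome) are hypothetical and made up for this example; they are not taken from the Jena test suite or from IRIx.

```java
// Sketch only: hypothetical types, not the Jena test suite or the IRIx API.
class OutcomeCompatDemo {

    // The three categories a compatibility check distinguishes.
    enum Outcome { PASS, WARNING, ERROR }

    // Hypothetical result of checking one IRI string with one provider.
    static final class CheckResult {
        final Outcome outcome;
        final String message;   // provider-specific wording, deliberately not compared

        CheckResult(Outcome outcome, String message) {
            this.outcome = outcome;
            this.message = message;
        }
    }

    // Two providers agree if they report the same category,
    // whatever text they attach to it.
    static boolean sameOutcome(CheckResult expected, CheckResult actual) {
        return expected.outcome == actual.outcome;
    }

    public static void main(String[] args) {
        CheckResult oldProvider = new CheckResult(Outcome.WARNING, "old, technical wording");
        CheckResult newProvider = new CheckResult(Outcome.WARNING, "new, simpler wording");
        System.out.println("compatible: " + sameOutcome(oldProvider, newProvider));
    }
}
```

Comparing only the category is what lets a replacement provider reword its messages without breaking the tests.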


The RFCs and W3C documents that define URIs, IRIs, and specific URI schemes evolve, so maintenance is necessary.

RDF 1.1 removed the special "RDF URI reference" in favour of RFC 3987.
W3C has a REC about DIDs (a new "did:" URI scheme).
RFC 6874 changes the core URI grammar of RFC 3986, adding support for IPv6 zones.
RFC 8089 defines "file:" as it is actually used.
RFC 8141 replaces the earlier definition of URNs.


My long-term aspiration is to have an RDF/XML parser and IRI handling that is:

1/ Maintainable.
2/ For use as a parser in Jena and only for that.

That means making RDF/XML handling much simpler, with functionality for reading conformant RDF/XML and not variations that are not used by Jena users. The test suite has good coverage.

For IRIs, the plan is to switch from jena-iri to a new IRI library that has up-to-date support for IRIs. jena-iri also has scheme-specific rules for a large number of legacy schemes (gopher:, telnet:, fax:, ...). This extensibility makes it very costly to maintain. jena-iri has not been rebuilt from its original configuration files for many years (that step is not in the build).

New IRI library:
https://github.com/afs/x4ld/tree/main/iri4ld

jena-iri is also slower than iri4ld and this is visible in parsing (the impact is 5-10% of parsing speed on N-Triples).

Error messages do change, hopefully to ones that are easier to understand. jena-iri error messages are quite technical.

This all applies to xmloutput as well but that's already converted to IRIx.


I have a new PR in progress that converts RDF/XML parsing to use IRIx.
It does change the behaviour for directly using RDFXMLReader when relative URIs are given as the base. A fully legacy setup exists that passes all the tests for normal parsing use but does not pass some detailed local behaviour tests in the RDF/XML writer.
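To show why a relative base is a special case, here is a small sketch using the JDK's java.net.URI as a stand-in. It is not Jena's IRIx or jena-iri, and java.net.URI follows the older RFC 2396 rules, so it only illustrates the general issue and not what RDFXMLReader actually does. The class name and example URIs are invented for the illustration.

```java
import java.net.URI;

// Sketch only: java.net.URI stands in for an IRI library here.
// It is not Jena's IRIx or jena-iri, and it implements RFC 2396 resolution.
class RelativeBaseDemo {

    // Resolve a reference against a base using the JDK's generic URI resolution.
    static URI resolve(String base, String reference) {
        return URI.create(base).resolve(reference);
    }

    public static void main(String[] args) {
        // Absolute base: the resolved URI has a scheme and can be used in a triple.
        URI fromAbsolute = resolve("http://example/dir/file.rdf", "other#frag");
        System.out.println(fromAbsolute + " absolute=" + fromAbsolute.isAbsolute());

        // Relative base: the resolved URI is still relative, so a parser has to
        // choose a policy (reject, warn, or resolve again against something else).
        URI fromRelative = resolve("dir/file.rdf", "other#frag");
        System.out.println(fromRelative + " absolute=" + fromRelative.isAbsolute());
    }
}
```

With a relative base, the policy choice is left to the caller, which is where an older parser and a newer IRI layer can reasonably differ.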

Roadmap:

We will end up with multiple packages, kept until we decide that migration has happened and they are getting in the way.

Packages used by RIOT/model.read get essential maintenance only.


* xmlinput0 - this is ARP xmlinput as it is in Jena 4.7.0.

* xmlinput1 - this is ARP switched to use IRIx.

* xmlinput2 - an RDF/XML parser (starting with ARP and cutting out the unused parts) that covers Jena needs and not trying to do everything ARP does. xmlinput2 does not yet exist.

The new PR gets the codebase to xmlinput1 (as "xmlinput").

If all goes well, we can have 4.8.0 default to use xmlinput1, switchable back to xmlinput0.

When called from model.read or RIOT, it should not make a difference.

It would be great to have users test this, but any affected users are using legacy features and are less likely to upgrade regularly. Reports about direct use of ARP have been very infrequent.

    Andy
