Evolving RDF/XML support and ARP.

Andy Seaborne Fri, 24 Feb 2023 06:17:06 -0800

Jena's RDF/XML parser, ARP, was originally a separate subsystem that could be configured for different possible directions of the RDF 1.0 working group and different treatment of IRIs that were possible at the time (this was before RFC 3986/3987). It is the "xmlinput" package in jena-core.

It has a close coupling to jena-iri, with features such as customization of errors and an idiosyncratic approach to relative IRIs (if called directly). These are outside normal use of RDF/XML. When used from model.read or a RIOT API, these features aren't accessible.


Both jena-iri and ARP are hard to maintain.

xmlinput is the last part of Jena that uses jena-iri directly.

Jena has an IRI abstraction, IRIx, that allows switching IRI providers. The Jena releases use jena-iri as the provider through the IRIx abstraction, so error messages are the same as before.

There is a test suite for compatibility that gives the expected behaviour of an IRIx implementation. It works on a pass/warning/error basis, not on error message text.
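To illustrate what checking on a pass/warning/error basis can look like, here is a minimal sketch. It is not the Jena test suite and not the IRIx API: the names `IriCompatSketch`, `Outcome`, `IriChecker` and `conforms` are hypothetical, and `java.net.URI` is used only as a stand-in provider because it ships with the JDK. Treating a relative reference as a warning is also an assumption made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;

public class IriCompatSketch {

    /** Coarse outcome levels; error message text is deliberately not compared. */
    public enum Outcome { PASS, WARNING, ERROR }

    /** Hypothetical provider interface for this sketch; this is not the IRIx API. */
    public interface IriChecker {
        Outcome check(String iriString);
    }

    /** Stand-in provider backed by java.net.URI, chosen only because it is in the JDK. */
    public static final IriChecker JAVA_NET = iriString -> {
        try {
            URI uri = new URI(iriString);
            // Assumption for this sketch: a relative reference rates a warning.
            return uri.isAbsolute() ? Outcome.PASS : Outcome.WARNING;
        } catch (URISyntaxException ex) {
            // The message in ex is ignored: only the outcome level matters.
            return Outcome.ERROR;
        }
    };

    /** True if the provider gives the expected outcome level for every input. */
    public static boolean conforms(IriChecker provider, Map<String, Outcome> expected) {
        for (Map.Entry<String, Outcome> entry : expected.entrySet()) {
            if (provider.check(entry.getKey()) != entry.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Outcome> expected = Map.of(
            "http://example/a", Outcome.PASS,
            "rel/path", Outcome.WARNING,
            "http://example/a b", Outcome.ERROR);
        System.out.println("conforms: " + conforms(JAVA_NET, expected));
    }
}
```

Comparing only the outcome level means a second provider can word its messages differently and still be judged compatible against the same expectations.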

The RFCs and W3C documents that define URIs, IRIs, and the specific URI schemes evolve, so maintenance is necessary.


RDF 1.1 removed the special "RDF URI reference" in favour of RFC 3987.
W3C has a REC about DIDs (a new "did:" URI scheme).

RFC 6874 changes the core URI grammar of RFC 3986, adding support for IPv6 zones.

RFC 8089 defines "file:" as it is actually used.
RFC 8141 replaces the definition of URNs with a new RFC.

My long-term aspiration is to have an RDF/XML parser and IRI handling that is:


1/ Maintainable.
2/ For use as a parser in Jena and only for that.

That means making RDF/XML handling much simpler, with functionality for reading conformant RDF/XML and not the variations that are not used by Jena users. The test suite has good coverage.

For IRIs, switch from jena-iri to a new IRI library that has up-to-date support for IRIs. jena-iri also has scheme-specific rules for a large number of legacy schemes (gopher:, telnet:, fax:, ...). This extensibility makes it very costly to maintain. It has not been regenerated from the original configuration files for many years (that step is not in the build).


New IRI library:
https://github.com/afs/x4ld/tree/main/iri4ld

jena-iri is also slower than iri4ld and this is visible in parsing (the impact is 5-10% of parsing speed on N-Triples).

Error messages do change, hopefully to ones that are easier to understand. jena-iri error messages are quite technical.


This all applies to xmloutput as well but that's already converted to IRIx.


I have a new PR in-progress that converts RDF/XML parsing to use IRIx.

It does change the behaviour for directly using RDFXMLReader when relative URIs are given as the base. A fully legacy setup exists that passes all the tests for normal parsing use but does not pass some detailed local behaviour tests in the RDF/XML writer.


Roadmap:

Eventually have multiple packages, until we decide that migration has happened and they are getting in the way.


Packages used by RIOT/model.read are essential maintenance only.


* xmlinput0 - this is ARP xmlinput as it is in Jena 4.7.0.

* xmlinput1 - this is ARP switched to use IRIx.

* xmlinput2 - an RDF/XML parser (starting with ARP and cutting out the unused parts) that covers Jena's needs without trying to do everything ARP does. xmlinput2 does not yet exist.


The new PR gets the codebase to xmlinput1 (as "xmlinput").

If all goes well, we can have 4.8.0 default to use xmlinput1, switchable back to xmlinput0.


When called from model.read or RIOT, it should not make a difference.

It would be great to have users test, but any affected users are using legacy features and they are less likely to upgrade regularly. Reports about direct use of ARP have been very infrequent.


    Andy
