Datatypes in the rdf: namespace.

2023-02-26 Thread Andy Seaborne

(Moral: Never pull on the end of a loose bit of string in a codebase...)

There are 3 datatypes in the RDF namespace which are there for 
convenience but not mentioned in the RDF Abstract data model. So they 
are not required even if they were normatively defined.


rdf:XMLLiteral, rdf:HTML, rdf:JSON

Jena's XMLLiteralType is compliant with RDF 1.0 but RDF 1.1 changed 
rdf:XMLLiteral (no canonicalization; the value space is DOM4-based).


In RDF 1.0, rdf:XMLLiteral is the one and only required datatype. It's 
weird because the lexical space has canonicalization and normalization 
requirements (the lexical space is the same as the value space - puts 
all the work on the user!).


In RDF 1.1, rdf:XMLLiteral is not required (even if normative, which it 
isn't for other reasons) and it has become just a datatype definition.


In RDF 1.1, there is rdf:HTML. The Jena RDF vocabulary has a constant. 
There is no value handling.


rdf:JSON exists in http://www.w3.org/1999/02/22-rdf-syntax-ns; it was 
defined by JSON-LD. The Jena RDF vocabulary has a constant. There is no 
value handling.


rdf:JSON is likely to make it into RDF 1.2 Concepts. Its value space is 
a canonicalized form of JSON.


All three have complex requirements for the value space (making them a 
bit of a DoS vector!).


It might be simpler to do the same for all 3 datatypes - constants but 
no value support.
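
A minimal sketch of what "constants but no value support" could look 
like, written in Python for brevity. It is not Jena code: the Literal 
class and the VALUE_PARSERS registry are hypothetical names invented for 
this illustration. Only the datatype IRIs come from the RDF and XSD 
namespaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Vocabulary constants only: there is no value-space code behind these.
RDF_XMLLITERAL = RDF + "XMLLiteral"
RDF_HTML = RDF + "HTML"
RDF_JSON = RDF + "JSON"
XSD_INTEGER = XSD + "integer"

# Hypothetical registry: datatype IRI -> function from lexical form to value.
# The three rdf: datatypes are deliberately absent.
VALUE_PARSERS: Dict[str, Callable[[str], object]] = {
    XSD_INTEGER: int,
}


@dataclass(frozen=True)
class Literal:
    lexical: str
    datatype: str

    def value(self) -> Optional[object]:
        """Return the parsed value, or None if the datatype has no value support."""
        parser = VALUE_PARSERS.get(self.datatype)
        return parser(self.lexical) if parser else None


# Two lexical forms that would denote the same JSON value.
a = Literal('{"x": 1}', RDF_JSON)
b = Literal('{ "x" : 1 }', RDF_JSON)
```

In this sketch `a` and `b` are different terms because only the lexical 
form and the datatype IRI are compared, and `a.value()` returns None. 
Nothing is parsed, so untrusted XML, HTML or JSON never reaches a parser.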


Andy


Re: Evolving RDF/XML support and ARP.

2023-02-26 Thread Andy Seaborne




On 24/02/2023 14:24, Andy Seaborne wrote:

Issue for updating ARP to use IRIx, as described below.

https://github.com/apache/jena/issues/1773

Draft PR:

https://github.com/apache/jena/pull/1774


This has xmlinput0 (the state of ARP 4.7.0, using jena-iri directly) 
with ARP0 and RDFXMLReader0 as classes.


The package xmlinput is the updated RDF/XML parsing code. The class ARP 
in xmlinput is deprecated as a warning that running ARP without the rest 
of Jena is not going to continue except while xmlinput0 exists.


Andy



On 24/02/2023 14:16, Andy Seaborne wrote:
Jena's RDF/XML parser, ARP, was originally a separate subsystem that 
could be configured for different possible directions of the RDF 1.0 
working group and different treatment of IRIs that were possible at 
the time (this is before RFC3986/3987). It is the "xmlinput" package 
in jena-core.


It has a close coupling to jena-iri with features such as 
customization of errors, and an idiosyncratic approach to relative 
IRIs (if called directly). These are outside normal use of RDF/XML.  
When used from model.read or a RIOT API, these features aren't 
accessible.


Both jena-iri and ARP are hard to maintain.

xmlinput is the last part of Jena that uses jena-iri directly.

Jena has an IRI abstraction, IRIx, that allows switching IRI providers. 
The Jena releases use jena-iri as the provider through the IRIx 
abstraction - error messages are the same as before.


There is a test suite for compatibility - on a pass/warning/error 
basis, not error message text, that gives the expected behaviour of an 
IRIx implementation.
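
As a rough illustration of the idea (a provider behind an abstraction, 
checked on outcome and not on message text), here is a small Python 
sketch. It does not reflect the IRIx API or the real test suite: the 
class names, the two toy rules (missing scheme is an error, non-lowercase 
scheme is a warning) and the message strings are all assumptions made up 
for the example.

```python
import re
from enum import Enum
from typing import NamedTuple, Optional, Protocol


class Outcome(Enum):
    PASS = "pass"
    WARNING = "warning"
    ERROR = "error"


class Result(NamedTuple):
    outcome: Outcome
    message: str


class IRIProvider(Protocol):
    """Hypothetical provider interface: check an IRI string."""
    def check(self, iri: str) -> Result: ...


# RFC 3986 scheme syntax: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) then ":"
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def scheme_of(iri: str) -> Optional[str]:
    m = _SCHEME.match(iri)
    return m.group(1) if m else None


class TerseProvider:
    def check(self, iri: str) -> Result:
        scheme = scheme_of(iri)
        if scheme is None:
            return Result(Outcome.ERROR, "E1")
        if scheme != scheme.lower():
            return Result(Outcome.WARNING, "W1")
        return Result(Outcome.PASS, "")


class FriendlyProvider:
    def check(self, iri: str) -> Result:
        scheme = scheme_of(iri)
        if scheme is None:
            return Result(Outcome.ERROR, "Not an absolute IRI: no scheme found")
        if scheme != scheme.lower():
            return Result(Outcome.WARNING, "Scheme name should be lowercase")
        return Result(Outcome.PASS, "")


# Expected behaviour is recorded as an outcome only, never as message text.
EXPECTED = {
    "http://example/a": Outcome.PASS,
    "HTTP://example/a": Outcome.WARNING,
    "//example/a": Outcome.ERROR,
}


def compatible(provider: IRIProvider) -> bool:
    return all(provider.check(iri).outcome is want for iri, want in EXPECTED.items())
```

Both toy providers satisfy `compatible(...)` even though every message 
differs, which is the property a pass/warning/error suite is after.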



RFCs and W3C documents that define the URIs, IRIs, and the specific 
URI schemes evolve so maintenance is necessary.


RDF 1.1 removed the special "RDF URI reference" in favour of RFC 3987.
W3C has a REC about DIDs (a new "did:" URI scheme).
RFC 6874 changes the core URI grammar of RFC 3986, adding support for 
IPv6 zones.

RFC 8089 defines "file:" as it is actually used.
RFC 8141 replaces the definition of URNs with a new RFC.


My long-term aspiration is to have an RDF/XML parser and IRI handling 
that is:


1/ Maintainable.
2/ For use as a parser in Jena and only for that.

That means making RDF/XML handling much simpler, with functionality 
for reading conformant RDF/XML and not variations that are not used by 
Jena users. The test suite has good coverage.


For IRIs, switch from jena-iri to a new IRI library that has 
up-to-date support for IRIs. jena-iri also has scheme-specific rules 
for a large number of legacy schemes (gopher:, telnet:, fax:, ...). 
This extensibility causes a very high cost to maintain. It has not 
been remade from the original configuration files for many years (that 
step is not in the build).


New IRI library:
https://github.com/afs/x4ld/tree/main/iri4ld

jena-iri is also slower than iri4ld and this is visible in parsing 
(the impact is 5-10% of parsing speed on N-triples.)


Error messages do change, hopefully to ones that are easier to 
understand. jena-iri error messages are quite technical.


This all applies to xmloutput as well but that's already converted to 
IRIx.



I have a new PR in-progress that converts RDF/XML parsing to use IRIx.
It does change the behaviour for directly using RDFXMLReader when 
relative URIs are given as the base. A fully legacy setup exists that 
passes all the tests for normal parsing use but does not pass some 
detailed local behaviour tests in the RDF/XML writer.
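
To show why a relative base is the awkward case, here is a small Python 
sketch using the standard library's urljoin. It illustrates RFC 3986 
resolution in general and is not a description of what ARP or IRIx do: 
resolve_strict and resolve_lenient are hypothetical helper names, and 
the example IRIs are made up.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(iri: str) -> bool:
    """An IRI with a scheme is treated as absolute in this sketch."""
    return bool(urlsplit(iri).scheme)


def resolve_strict(base: str, ref: str) -> str:
    """Resolve ref against base, insisting on an absolute base.

    RFC 3986 section 5 assumes the base has a scheme.
    """
    if not is_absolute(base):
        raise ValueError(f"base is not absolute: {base!r}")
    return urljoin(base, ref)


def resolve_lenient(base: str, ref: str) -> str:
    """Resolve ref against whatever base was given, relative or not."""
    return urljoin(base, ref)


# Normal use: an absolute base gives an absolute result.
absolute_result = resolve_strict("http://example/base/doc.rdf", "../x")

# Relative base: the lenient result is itself still relative.
relative_result = resolve_lenient("dir/file.rdf", "other")
```

With an absolute base both helpers agree. With a relative base the 
strict helper refuses and the lenient one returns something that is 
still not an absolute IRI, so the outcome depends on local policy.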


Roadmap:

Eventually have multiple packages, until we decide that migration has 
happened and they are getting in the way.


Packages used by RIOT/model.read receive essential maintenance only.


* xmlinput0 - this is ARP xmlinput as it is in Jena 4.7.0.

* xmlinput1 - this is ARP switched to use IRIx.

* xmlinput2 - an RDF/XML parser (starting with ARP and cutting out the 
unused parts) that covers Jena needs and not trying to do everything 
ARP does. xmlinput2 does not yet exist.


The new PR gets the codebase to xmlinput1 (as "xmlinput").

If all goes well, we can have 4.8.0 default to use xmlinput1, 
switchable back to xmlinput0.


When called from model.read or RIOT, it should not make a difference.

It would be great to have users test but any affected users are using 
legacy features and they are less likely to upgrade regularly. Reports 
about direct use of ARP have been very infrequent.


 Andy



[GitHub] [jena-site] kinow commented on pull request #146: Add basic search with Fuse.js (search engine), Mark.js (word highlighter) and Hugo (search index)

2023-02-26 Thread via GitHub


kinow commented on PR #146:
URL: https://github.com/apache/jena-site/pull/146#issuecomment-1445423471

   >this could save you some time: 
https://github.com/leeoniya/uFuzzy#a-biased-appraisal-of-similar-work
   
   Thanks @leeoniya. Will keep that in mind if we have issues with Fuse.js and 
need to pick another library. For the moment Fuse.js seems to be OK :+1: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@jena.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [jena-site] kinow opened a new pull request, #149: Add more Markdown code formatting

2023-02-26 Thread via GitHub


kinow opened a new pull request, #149:
URL: https://github.com/apache/jena-site/pull/149

   Adding more ```language to the Markdown files.





[GitHub] [jena-site] kinow commented on a diff in pull request #149: Add more Markdown code formatting

2023-02-26 Thread via GitHub


kinow commented on code in PR #149:
URL: https://github.com/apache/jena-site/pull/149#discussion_r1118153525


##
source/documentation/query/lateral-join.md:
##
@@ -16,22 +16,26 @@ sub-patterns.
 Another way to think of a lateral join is as a `flatmap`.
 
 Examples:
-```
+
+```sparql
 ## Get exactly one label for each subject with type `:T`
 SELECT * {
-   ?s rdf:type :T
-   LATERAL {
- SELECT * { ?s rdfs:label ?label } LIMIT 1
-   }
+  ?s rdf:type :T
+  LATERAL {
+SELECT * { ?s rdfs:label ?label } LIMIT 1
+  }
 }
 ```
 
-```
+```sparql
 ## Get zero or one labels for each subject.
 SELECT * {
-   ?s ?p ?o
-   LATERAL { OPTIONAL { SELECT * ?s rdfs:label ?label } LIMIT 1}

Review Comment:
   Hmm, I think the `{}'s` were not balanced here?
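
   A quick way to test that suspicion is a depth count over the removed 
   line. The Python sketch below is only a plain character count (it 
   ignores braces inside strings or comments), and the `candidate` line 
   is one possible repair assumed for illustration, not necessarily what 
   the PR ended up with.

```python
def braces_balanced(text: str) -> bool:
    """True if every '}' closes an earlier '{' and none are left open."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# The removed line from the diff, as quoted above.
removed = "LATERAL { OPTIONAL { SELECT * ?s rdfs:label ?label } LIMIT 1}"

# Assumed repair: give the subquery its own braced pattern.
candidate = "LATERAL { OPTIONAL { SELECT * { ?s rdfs:label ?label } LIMIT 1 } }"
```

   By count alone both lines come out balanced. The problem in the 
   removed line looks more like a missing `{` after `SELECT *`, since a 
   SPARQL subquery needs its pattern in braces, so the first `}` ends up 
   closing the OPTIONAL group early.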


