Re: Jena next (AFS)

2019-11-17 Thread Claude Warren
I am a bit concerned about Streams.

I am working with some large-scale streams from stored objects in another
project and keep running into stack overflows when attempting to merge
them or to convert from iterators.  Perhaps I have not done it correctly,
but the iterator approach seems cleaner when you don't have, or can't
have, all the data in memory at once.
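
One known source of such overflows, though not necessarily the one hit
here, is repeated Stream.concat: the JDK documentation warns that deeply
concatenated streams can lead to deep call chains or a StackOverflowError.
A minimal sketch of a flat alternative follows. ChainedIterator is a
hypothetical name used for illustration; commons-collections4 ships an
IteratorChain class for the same purpose.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: walks a list of iterators one after another.
// The call depth stays constant however many sources are merged.
final class ChainedIterator<T> implements Iterator<T> {
    private final Iterator<Iterator<T>> sources;
    private Iterator<T> current = Collections.emptyIterator();

    ChainedIterator(List<Iterator<T>> sources) {
        this.sources = sources.iterator();
    }

    @Override
    public boolean hasNext() {
        // Loop, rather than recurse, past exhausted or empty sources.
        while (!current.hasNext() && sources.hasNext()) {
            current = sources.next();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    // Counts the elements of n single-element iterators chained together.
    static int countChained(int n) {
        List<Iterator<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(Collections.singletonList(i).iterator());
        }
        int count = 0;
        Iterator<Integer> all = new ChainedIterator<>(parts);
        while (all.hasNext()) {
            all.next();
            count++;
        }
        return count;
    }
}
```

Because the sources are held in a flat list, merging 100,000 of them costs
no more stack than merging two.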

We might consider switching from the Jena-specific iterators to
commons-collections4 (perhaps contributing some additions there).

Claude

On Sun, Nov 17, 2019 at 5:34 PM Andy Seaborne  wrote:

> This is a bit of a brain dump ...
>
> == DatasetGraph
>
> Graph, Triple, Quad, DatasetGraph in a single API place.
>
> == Graph - SPI
>
> Graph - add a few navigation operations to make writing system code
> directly on Graph easier - though still not as rich as the Model API -
> and avoid much of the object churn.
>
> The operations are (not final names)
>
>   Graph.fwd(subject, predicate)
>    -- return a single Node or null.
>   Graph.fwdList(subject, predicate)
>    -- return a list of Nodes
>   Graph.fwdUnique(subject, predicate)
>    -- return a single Node, exception if 0 or more than one.
>
> Same for "bwk"
>
>
> https://github.com/apache/jena/blob/master/jena-shacl/src/main/java/org/apache/jena/shacl/lib/G.java
> is a library version of this that was helpful, but the proposal is to add
> a few operations directly to Graph.
>
> If the data is known to be good (SHACL), the application code can use
> fwd()/bwk() without worrying about testing for zero or multiple predicates.
>
> The reason for putting the basic operations in the Graph interface and
> not everything in a library is potential efficiency. An impl may be
> able to do a good job of fwd(), and if that is the basis of graph
> analytics, efficiency matters long term; at the least, we should not
> design it out.
>
> == Assembler
>
> The graph SPI additions are also motivated by assemblers.  Assemblers are
> currently Model/Resource based but the important usage is in Fuseki - an
> ideal goal is that Fuseki works on Graph/Node.
>
> Converting assemblers to Graph/Node does not look too burdensome and
> with a wrapper layer we can hopefully include all the old tests to check
> evolution.
>
> == Graph - indexing
>
> Currently, Graphs are term-indexed only or value-indexed, not both.
>
> Graphs should be plain term-indexed. Value-indexing, which can be calculated
> on the fly, would be a separate higher-level concept.
>
> This is motivated by scale and having the same behaviour on all graphs.
> At scale, canonicalizing the inputs is better than value-indexing.
>
> "values" would only be in the Model API.
>
> == Transactions
>
> Unify the transaction approach (also changes Model) so complex
> assemblages of graphs, and other things,  are transactional.
>
> Remove graph transactions - replace by
> org.apache.jena.sparql.core.Transactional.
>
> Then graphs as views of datasets, and also combinations of Transactionals
> in a single transaction (two DatasetGraphs, or a collection of Graphs (the
> assembler case)), can be done.
>
> == Events
>
> Make events an intercepting wrapper, not built-in to Graph itself.
> Add transaction lifecycle events.
>
> == Streams - yes and no.
>
> A Stream is several Java objects, so the potential cost for simple
> operations like Graph.contains() or find(), or for just a few items,
> is not small.
>
> Keep iterators, provide stream(s,p,o).
>
> == Nodes
>
> Lang tags - force to lower case.
>
> Simplify - remove a layer of indirection. This relates to indexing.
>
> Node_Literal - no LiteralLabels
> Node_Blank - two longs or a string label, not using BlankNodeId
>
> Investigate integrating nodes with ARQ's NodeValue.
>
> == IRIs
>
> jena-iri is general, powerful and hard to maintain.
> Jena does not use all of it.
> Jena needs a simpler, direct parser/checker.
>
> https://github.com/afs/iri4ld
>
> which is a parser in Java with little copying. It parses URIs, and then
> has a few scheme-specific rules for http(s), file and URN.
>
> The various open source libraries and JDK classes do not track the
> current standards very well (RFC 2396 vs RFC 3986). I have found that
> compliance is mixed due to legacy compatibility needs.
>




Jena next (AFS)

2019-11-17 Thread Andy Seaborne

This is a bit of a brain dump ...

== DatasetGraph

Graph, Triple, Quad, DatasetGraph in a single API place.

== Graph - SPI

Graph - add a few navigation operations to make writing system code
directly on Graph easier - though still not as rich as the Model API -
and avoid much of the object churn.


The operations are (not final names)

  Graph.fwd(subject, predicate)
   -- return a single Node or null.
  Graph.fwdList(subject, predicate)
   -- return a list of Nodes
  Graph.fwdUnique(subject, predicate)
   -- return a single Node, exception if 0 or more than one.

Same for "bwk"

https://github.com/apache/jena/blob/master/jena-shacl/src/main/java/org/apache/jena/shacl/lib/G.java
is a library version of this that was helpful, but the proposal is to add
a few operations directly to Graph.


If the data is known to be good (SHACL), the application code can use 
fwd()/bwk() without worrying about testing for zero or multiple predicates.


The reason for putting the basic operations in the Graph interface and
not everything in a library is potential efficiency. An impl may be
able to do a good job of fwd(), and if that is the basis of graph
analytics, efficiency matters long term; at the least, we should not
design it out.
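
The fwd family described above can be sketched as follows. This is only
an illustration of the intended contracts, under stated assumptions:
MiniGraph is a hypothetical stand-in that stores triples as string arrays,
not Jena's Graph, and returning the first match from fwd() when several
exist is a guess at behaviour the proposal leaves open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a graph: triples as String[]{s, p, o}.
// Not Jena's Graph; only the shape of the proposed operations is shown.
final class MiniGraph {
    private final List<String[]> triples = new ArrayList<>();

    MiniGraph add(String s, String p, String o) {
        triples.add(new String[] {s, p, o});
        return this;
    }

    // All objects for (subject, predicate), in insertion order.
    List<String> fwdList(String s, String p) {
        List<String> out = new ArrayList<>();
        for (String[] t : triples) {
            if (t[0].equals(s) && t[1].equals(p)) {
                out.add(t[2]);
            }
        }
        return out;
    }

    // One object or null. Assumption: with several matches, the first is returned.
    String fwd(String s, String p) {
        List<String> objects = fwdList(s, p);
        return objects.isEmpty() ? null : objects.get(0);
    }

    // Exactly one object, otherwise an exception.
    String fwdUnique(String s, String p) {
        List<String> objects = fwdList(s, p);
        if (objects.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one object, found " + objects.size());
        }
        return objects.get(0);
    }

    // Small fixed data set used to exercise the three operations.
    static MiniGraph example() {
        return new MiniGraph()
            .add(":alice", ":name", "Alice")
            .add(":alice", ":knows", ":bob")
            .add(":alice", ":knows", ":carol");
    }
}
```

The linear scan is exactly what an implementation with a subject-predicate
index could avoid, which is the efficiency argument for having these
operations on the interface.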


== Assembler

The graph SPI additions are also motivated by assemblers.  Assemblers are
currently Model/Resource based but the important usage is in Fuseki - an
ideal goal is that Fuseki works on Graph/Node.


Converting assemblers to Graph/Node does not look too burdensome and 
with a wrapper layer we can hopefully include all the old tests to check 
evolution.


== Graph - indexing

Currently, Graphs are term-indexed only or value-indexed, not both.

Graphs should be plain term-indexed. Value-indexing, which can be calculated
on the fly, would be a separate higher-level concept.


This is motivated by scale and having the same behaviour on all graphs.
At scale, canonicalizing the inputs is better than value-indexing.
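
As an illustration of canonicalizing inputs, consider integer literals:
"1", "01" and "+1" are different terms with the same value. If they are
reduced to one lexical form on load, a plain term index treats them as
equal without any value-indexing. The sketch below is an assumption about
how that might look; CanonicalInput and its methods are hypothetical names
and cover integers only.

```java
import java.math.BigInteger;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: integer literals are reduced to one lexical form
// before storage, so a plain term index (string equality) suffices.
final class CanonicalInput {
    // Canonical lexical form of an integer: no leading zeros, no '+' sign.
    static String canonicalInteger(String lexical) {
        return new BigInteger(lexical.trim()).toString();
    }

    // Number of distinct terms stored when inputs are canonicalized on load.
    static int distinctTerms(String... lexicalForms) {
        Set<String> termIndex = new HashSet<>();
        for (String lex : lexicalForms) {
            termIndex.add(canonicalInteger(lex));
        }
        return termIndex.size();
    }
}
```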


"values" would only be in the Model API.

== Transactions

Unify the transaction approach (also changes Model) so complex 
assemblages of graphs, and other things,  are transactional.


Remove graph transactions - replace by
org.apache.jena.sparql.core.Transactional.

Then graphs as views of datasets, and also combinations of Transactionals
in a single transaction (two DatasetGraphs, or a collection of Graphs (the
assembler case)), can be done.
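
A rough sketch of combining several transactional components behind one
transaction follows. Txn is a hypothetical, cut-down interface invented
for this example; it is not org.apache.jena.sparql.core.Transactional,
which has more operations. RecordingTxn exists only to make the call
order visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, cut-down transaction interface for illustration only.
interface Txn {
    void begin();
    void commit();
    void end();
}

// Records lifecycle calls so the call order can be inspected.
final class RecordingTxn implements Txn {
    private final String name;
    private final List<String> log;

    RecordingTxn(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    public void begin()  { log.add(name + ":begin"); }
    public void commit() { log.add(name + ":commit"); }
    public void end()    { log.add(name + ":end"); }
}

// Presents several components as one transaction by fanning each call out.
final class CompositeTxn implements Txn {
    private final List<Txn> parts;

    CompositeTxn(Txn... parts) {
        this.parts = Arrays.asList(parts);
    }

    public void begin()  { for (Txn t : parts) t.begin(); }
    public void commit() { for (Txn t : parts) t.commit(); }
    public void end()    { for (Txn t : parts) t.end(); }

    // Runs one begin/commit/end cycle over two components and returns the log.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Txn both = new CompositeTxn(new RecordingTxn("g1", log),
                                    new RecordingTxn("g2", log));
        both.begin();
        try {
            both.commit();
        } finally {
            both.end();
        }
        return log;
    }
}
```

Simple fan-out is not atomic if one component fails during commit; a real
design would need some coordination between the parts, which this sketch
leaves out.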


== Events

Make events an intercepting wrapper, not built-in to Graph itself.
Add transaction lifecycle events.
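
The intercepting-wrapper idea can be sketched like this. TripleStore,
PlainStore and EventingStore are hypothetical names for the example, not
Jena classes; the point is only that the storage class knows nothing about
listeners.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical minimal graph interface, not Jena's Graph.
interface TripleStore {
    void add(String triple);
    int size();
}

// Plain storage with no knowledge of listeners.
final class PlainStore implements TripleStore {
    private final List<String> triples = new ArrayList<>();

    public void add(String triple) { triples.add(triple); }
    public int size() { return triples.size(); }
}

// Intercepting wrapper: delegates to the wrapped store, then notifies the listener.
final class EventingStore implements TripleStore {
    private final TripleStore base;
    private final Consumer<String> onAdd;

    EventingStore(TripleStore base, Consumer<String> onAdd) {
        this.base = base;
        this.onAdd = onAdd;
    }

    public void add(String triple) {
        base.add(triple);
        onAdd.accept(triple);
    }

    public int size() { return base.size(); }

    // Adds one triple through the wrapper and returns what the listener saw.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        TripleStore store = new EventingStore(new PlainStore(), seen::add);
        store.add(":s :p :o");
        return seen;
    }
}
```

Graphs that need no events pay nothing, and transaction lifecycle events
could be added to the same wrapper without touching the storage code.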

== Streams - yes and no.

A Stream is several Java objects, so the potential cost for simple
operations like Graph.contains() or find(), or for just a few items,
is not small.


Keep iterators, provide stream(s,p,o).
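
One way to offer a stream on top of an iterator-based find, using only
JDK classes, is sketched below. IterStreams and its methods are
hypothetical names, and strings stand in for triples.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper: the iterator stays the primitive, the stream is a view.
final class IterStreams {
    // Wraps an iterator as a lazy, sequential stream.
    static <T> Stream<T> asStream(Iterator<T> it) {
        Spliterator<T> split =
            Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
        return StreamSupport.stream(split, false);
    }

    // Counts the items whose text starts with the given prefix.
    static long countStartingWith(Iterator<String> triples, String prefix) {
        return asStream(triples).filter(t -> t.startsWith(prefix)).count();
    }

    // Three sample "triples", two of which have subject :a.
    static long demo() {
        Iterator<String> found =
            Arrays.asList(":a :p :x", ":a :p :y", ":b :p :z").iterator();
        return countStartingWith(found, ":a ");
    }
}
```

Callers that only need contains() or a single result keep using the
iterator and never pay for the stream objects.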

== Nodes

Lang tags - force to lower case.

Simplify - remove a layer of indirection. This relates to indexing.

Node_Literal - no LiteralLabels
Node_Blank - two longs or a string label, not using BlankNodeId
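
A small sketch of two of these points: lower-casing language tags and a
blank node identifier held as two longs. NodeSketch and BlankId are
hypothetical names, and taking the two longs from a UUID is an assumption
made for the example, not something stated above.

```java
import java.util.Locale;
import java.util.UUID;

// Hypothetical sketch of two node simplifications; not Jena classes.
final class NodeSketch {
    // Language tags compare case-insensitively, so store one case only.
    static String normalizeLang(String lang) {
        return lang.toLowerCase(Locale.ROOT);
    }

    // A blank node identified by two longs, here taken from a UUID's 128 bits.
    static final class BlankId {
        final long hi;
        final long lo;

        BlankId(long hi, long lo) {
            this.hi = hi;
            this.lo = lo;
        }

        static BlankId fromUuid(UUID uuid) {
            return new BlankId(uuid.getMostSignificantBits(),
                               uuid.getLeastSignificantBits());
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof BlankId)) {
                return false;
            }
            BlankId b = (BlankId) other;
            return hi == b.hi && lo == b.lo;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(hi) * 31 + Long.hashCode(lo);
        }
    }
}
```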

Investigate integrating nodes with ARQ's NodeValue.

== IRIs

jena-iri is general, powerful and hard to maintain.
Jena does not use all of it.
Jena needs a simpler, direct parser/checker.

https://github.com/afs/iri4ld

which is a parser in Java with little copying. It parses URIs, and then
has a few scheme-specific rules for http(s), file and URN.


The various open source libraries and JDK classes do not track the 
current standards very well (RFC 2396 vs RFC 3986). I have found that 
compliance is mixed due to legacy compatibility needs.
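
To give a feel for how small a direct component split can be, here is a
sketch based on the regular expression in RFC 3986, Appendix B. It is not
the iri4ld code and IriSplit is a hypothetical name. It only separates the
five components; it does not check characters or apply scheme-specific
rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Component split using the regular expression from RFC 3986, Appendix B.
// It only separates components; it does not validate them.
final class IriSplit {
    private static final Pattern RFC3986 = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; absent components are null.
    static String[] split(String iri) {
        Matcher m = RFC3986.matcher(iri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not splittable: " + iri);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }
}
```

For the example string used in the RFC,
http://www.ics.uci.edu/pub/ietf/uri/#Related, this yields scheme "http",
authority "www.ics.uci.edu", path "/pub/ietf/uri/", no query, and fragment
"Related".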


Re: Jena next

2019-11-17 Thread Andy Seaborne




On 15/11/2019 09:49, Marco Neumann wrote:


> I believe future versions of Jena need to be a bit bolder, while
> maintaining basic API features and design choices for compatibility in a
> maintenance release.


Agreed - there should be a system which is production-stable, certainly 
covering up to Fuseki+text with extension points.


Then there are evolving modules.

For example, SPARQL 1.2 does not destabilize SPARQL 1.1 - that may mean 
1.2 is separate and optionally replaces 1.1. When some 1.2 features are 
considered stable, they move to core-ARQ.


And some things, like first steps in concurrency for ARQ, don't need to be
in a release.



> Jena 4 might be a good candidate for such an LTS
> maintenance release, while Jena 5 should take a more ambitious approach and
> innovate where it hasn't made inroads in the technical community yet.


Jena3 is LTS in practice.

Given we have non-dedicated resources, I don't think it is helpful to 
ourselves to set up some grand long term roadmap that we can't 
realistically achieve.  What will happen is that progress will be
demotivatingly slow when measured against the ultimate goals.


If more people become active contributors or some of us can get 
resourcing, then that can change.


Andy


> (SPARQL 1.2, RDF 2, parallelization/concurrency, improved scripting
> support, web focus, UI, integration with other projects etc). I believe
> the last time we saw such transformative change to Jena was for releases
> 2.7+ and the migration of the project to the Apache Software Foundation
> almost 10 years ago.
>
> And IMO an area that needs closer attention as well is documentation,
> education and outreach. I am not sure where we currently stand with
> regard to developer adoption, downloads, deployments etc., but it would
> be useful to gauge this type of information more systematically and more
> regularly for project reviews.
>
> Marco





> On Wed, Nov 13, 2019 at 7:18 PM Andy Seaborne  wrote:
>
>> I'd like to start a discussion on where the project might go longer term.
>>
>> This can be specific areas, overall design, about project processes,
>> anything.
>>
>> If we are going to do a major change, Jena4, what preparation for that
>> can be done? (e.g. deprecation and signalling in Jena3, before the
>> change happens).
>>
>> Realistically, Jena4 means having Jena3 and Jena4 in parallel. Jena4
>> need not be that big - we can have Jena5 etc.
>>
>> I'll put some technical points in a separate email.
>>
>> I would put on the list:
>>
>> * How has the world changed? What should the project produce?
>> * Target audience: for developers of Jena, while Jena3 is for users.
>> * Target: Java14, JPMS.
>> * Clear-up not easily done with perfect compatibility.
>> * Simpler. There are APIs and packages entangled due to history.
>>
>> To the lurkers :-)
>>
>> Feedback and specific feature requests are welcome. But before you "go
>> shopping", you may wish to factor in that every feature needs effort to
>> do it. The better place to be is that an application can get what it
>> needs to do, not whether the Jena system has every feature built-in.
>>
>>   Andy