Re: Future of Clerezza and Stanbol

Peter Ansell Sun, 11 Nov 2012 15:44:45 -0800

On 9 November 2012 19:56, Rupert Westenthaler
<[email protected]> wrote:
> Hi all,
>
> let me share my thoughts. Because this mail is rather long I tried to
> split it up into three separate sections: (1) RDF, (2) RESTful/Web
> Interface and (3) other related topics
>
>
> RDF libs:
> ====
>
> From the viewpoint of Apache Stanbol one needs to ask the question
> of whether it makes sense to maintain our own RDF API. I expect the
> Semantic Web standards to evolve quite a bit in the coming years and I
> am concerned that the Clerezza RDF modules will not be updated/extended
> to provide implementations of those. One example of such a situation is
> SPARQL 1.1, which has been around for quite some time and is still not
> supported by Clerezza. While I do like the small API, the flexibility
> to use different TripleStores and that Clerezza comes with OSGI
> support I think given the current situation we would need to discuss
> all options and those do also include a switch to Apache Jena or
> Sesame. Especially Sesame would be an attractive option as their RDF
> Graph API [1] is very similar to what Clerezza uses.


Sesame has three different APIs for RDF graph manipulation/querying.
The main API that users target in my experience is the Repository API.
Repository implementors are encouraged to target the SAIL API. There
are very few users or implementors who actually use the Graph API for
significant purposes, in my experience.

> Apache Jena's
> counterparts (Model [2] and Graph [3]) are considerably different and
> more complex interfaces. In addition Jena will only change to
> org.apache packages with the next major release so a switch before
> that release would mean two incompatible API changes.
>
> My personal opinion is that we should keep using Clerezza for now,
> invest some effort to improve the Clerezza RDF modules and then see
> how it develops further. Such an effort should include
>
> *  to implement SPARQL fast lane (as already discussed with Reto
> during ApacheCon). Fast lane would allow Clerezza to use the native
> SPARQL engine of the used TripleStore. This means that Clerezza only
> parses those parts of the SPARQL query needed to determine the RDF
> graph to execute the query on. This information is then used to pass
> the query to the native SPARQL engine via an extended interface of the
> TcProvider. The Clerezza SPARQL implementation would only be used in
> case the TcProvider does not provide a native SPARQL implementation or
> if the query spans RDF graphs managed by different TcProvider
> instances. That way Clerezza users would be able to use any SPARQL
> feature provided by the used TripleStore.

The SPARQL 1.1 specification is now a Proposed Recommendation, so it
would be a good time to implement it now without fearing more of the
large changes that have happened between each of the Working Drafts so
far.
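
To make the fast lane idea quoted above a little more concrete, here is
a rough sketch of the routing decision only. The Provider class and the
route method are hypothetical names I made up for illustration; they
are not part of the Clerezza API, and real code would have to extract
the graph names from the query first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FastLaneSketch {

    /** Hypothetical view of a provider: the graphs it manages and whether it has a native engine. */
    static final class Provider {
        final String name;
        final boolean nativeSparql;
        final Set<String> graphs;

        Provider(String name, boolean nativeSparql, String... graphs) {
            this.name = name;
            this.nativeSparql = nativeSparql;
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }
    }

    /**
     * Decides where a query should run, given the graph names it refers to.
     * Returns "native:<provider>" when one provider with a native engine
     * manages every referenced graph, otherwise "generic".
     */
    static String route(Set<String> queryGraphs, List<Provider> providers) {
        if (queryGraphs.isEmpty()) {
            return "generic"; // nothing to match against, so use the fallback engine
        }
        for (Provider p : providers) {
            if (p.nativeSparql && p.graphs.containsAll(queryGraphs)) {
                return "native:" + p.name;
            }
        }
        return "generic"; // no native engine, or graphs span several providers
    }

    public static void main(String[] args) {
        List<Provider> providers = Arrays.asList(
                new Provider("tdb", true, "g1", "g2"),
                new Provider("mem", false, "g3"));
        System.out.println(route(new HashSet<>(Arrays.asList("g1", "g2")), providers));
        System.out.println(route(new HashSet<>(Arrays.asList("g1", "g3")), providers));
    }
}
```

The first call stays within one provider that has a native engine, so
it would be handed over; the second spans two providers and falls back
to the generic implementation.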

> * update to the newest Jena versions (see also STANBOL-621; Peter
> Ansell's Clerezza fork on github [5] as well as Sebastian Schaffert's
> Jena bundle used for the Stanbol/LMF integration [5])

I made changes to Clerezza to experiment with a few things that I saw
as issues when I was experimenting with Stanbol. The biggest design
issue for me was that every graph was loaded into memory in bulk. The
underlying reason for this seemed to be that java.util.Iterator does
not have a close method, so there is no way of knowing when to release
resources if an iterator is not used to completion. The other issue
for me was that I wanted to use an underlying Sesame repository, and
the Sesame module had not been maintained, and had been left off the
parent reactor, so it was no longer compatible with the other modules
when I was experimenting with it. Given that those were my goals, I
removed all of the Clerezza CMS modules from my Git fork and focused
on the underlying libraries, which would be much easier to maintain if
they were separate.
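
As a minimal sketch of the kind of thing I mean for the iterator
problem: an iterator type that also has a close method, so a caller
that stops early can still release resources. The names
CloseableIterator and TrackingIterator are hypothetical, not the
Clerezza or Sesame API, and the tracking class only stands in for a
real store-backed cursor.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class CloseableIteratorSketch {

    /** Hypothetical interface: an Iterator that can release its resources early. */
    interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    /** Stand-in for a store-backed cursor; it only records whether close() was called. */
    static final class TrackingIterator<T> implements CloseableIterator<T> {
        private final Iterator<T> delegate;
        private boolean closed = false;

        TrackingIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return !closed && delegate.hasNext();
        }

        @Override
        public T next() {
            if (closed) {
                throw new NoSuchElementException("iterator is closed");
            }
            return delegate.next();
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException();
        }

        @Override
        public void close() {
            closed = true; // a real implementation would release its cursor or connection here
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Returns the first element with the given prefix, abandoning the iterator early. */
    static String firstWithPrefix(CloseableIterator<String> source, String prefix) {
        try (CloseableIterator<String> it = source) {
            while (it.hasNext()) {
                String s = it.next();
                if (s.startsWith(prefix)) {
                    return s; // close() still runs because of try-with-resources
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a:1", "b:2", "b:3");
        TrackingIterator<String> it = new TrackingIterator<>(data.iterator());
        System.out.println(firstWithPrefix(it, "b:")); // prints "b:2"
        System.out.println(it.isClosed());             // prints "true"
    }
}
```

With something along these lines a graph would not need to be loaded
into memory in bulk just because the consumer might stop iterating
half way through.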

In addition, I did not want to use OSGI, so I had to make changes in
many cases to allow a completely programmatic instantiation of
components, as some fields were left private with no mutator method
and in some cases no public constructor that could be used to populate
the field programmatically. For all of the good that OSGI may provide
for otherwise complex systems, it is not good Java software
engineering to make fields private without providing any way to set
them programmatically.
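
A minimal sketch of the shape I was changing components towards,
assuming a made-up GraphService that depends on a made-up GraphStore
(neither is a real Clerezza class): the field stays private, but a
public constructor and a mutator both exist, so the component can be
wired up with or without a container.

```java
import java.util.Objects;

public class ProgrammaticWiringSketch {

    /** Hypothetical dependency; stands in for something like a triple store provider. */
    interface GraphStore {
        String name();
    }

    /** Hypothetical component whose dependency can be set without an OSGi container. */
    static final class GraphService {
        private GraphStore store; // private, but reachable via constructor and setter

        /** No-arg constructor, kept so a container could still instantiate and inject. */
        GraphService() {
        }

        /** Constructor for plain-Java callers. */
        GraphService(GraphStore store) {
            this.store = Objects.requireNonNull(store, "store");
        }

        /** Mutator usable by a container's bind method or by a plain-Java caller. */
        void setStore(GraphStore store) {
            this.store = Objects.requireNonNull(store, "store");
        }

        String describe() {
            if (store == null) {
                throw new IllegalStateException("no store configured");
            }
            return "service backed by " + store.name();
        }
    }

    public static void main(String[] args) {
        GraphStore inMemory = new GraphStore() {
            @Override
            public String name() {
                return "in-memory";
            }
        };
        System.out.println(new GraphService(inMemory).describe());
    }
}
```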

> * finish and release the SingleTdbDatasetTcProvider.java
> (CLEREZZA-691) as this is important for the Stanbol Ontology Manager
> component
> * move the Indexed in-memory graph (CLEREZZA-683) from the Stanbol
> code base to Clerezza and release it so that we can use it from there
> in Stanbol
> * provide a Clerezza JsonLD parser/serializer. This is critical for
> Stanbol as several CMS use this as preferred RDF serialization.

I would focus on getting a single Java implementation of JSON-LD
working here, and I know that Reto has been working on this by
contributing a Clerezza serialiser/callback implementation to Tristan
King's JSONLD-Java library at GitHub. When I get a chance I am going
to suggest that the dependencies for Sesame/Jena/Clerezza are put in
separate modules in the JSONLD-Java project, to make Maven dependency
chaining simpler.

Given that Stanbol is already quite large, it is not viable to
transfer the RDF libraries there, but it also does not look viable to
leave them combined with the other Clerezza modules, as they are
unrelated and will have a different release cycle, if and when the
CMS components are maintained. It would be useful IMO to split the
Clerezza project to make it simpler to maintain the reusable
libraries. In particular, if Clerezza split its RDF libraries it may
eventually gain a similar level of developer support as either Sesame
or Jena. Currently the only project I can name that uses the Clerezza
RDF libraries at its core is Stanbol, which reduces the user base, and
hence reduces the developer support base. That doesn't mean that there
aren't other projects out there that use Clerezza, just that I have
not come across them. Almost all of the comments in this thread are
about fixing issues in the Clerezza RDF libraries based on experience
from Stanbol.

One issue that would be easier to solve if the project were split is
the disparate version numbers between modules, which make it
difficult to identify when to update dependencies. I have suggested on
the Stanbol list having a single version for each release, so
that people have a single figure in their head when they describe a
version for the Stanbol components they are relying on. It would be
much easier to depend on Clerezza IMO, if there was a single current
release version number for all of the library components. Maven
properties make it insanely easy to migrate to new versions of
multi-module libraries. If there is a single version number for each
new release, and if the trunk is consistently stable, then releases
can be made at any time and users can bump a single version property
in Maven to upgrade their systems.

I know that in Subversion it is difficult to guarantee that the trunk
is stable at each point in time--because people are afraid to branch
due to the difficulty involved and will instead develop new features
on the trunk--but the good news is that there are a number of
distributed version control systems around now that make it easy to
develop features on lightweight branches separate from the trunk and
painlessly merge them back in when they are stable. The suggestion to
Stanbol to switch to a single version number for releases was knocked
back based on the premise that not all modules would be stable at the
same time. If people were able to easily use branches for new features
and they test them well before integrating them, then the trunk would
be fairly consistently stable. The resulting stability would remove
the difficulty that Stanbol has continued to have, since my
suggestion, in identifying all of the dependencies that need updating
when releasing a new version of a particular module. All modules, even
those without significant changes since their last release, would be
released together, on the basis of their trunk/master tests
consistently passing in Jenkins.

Cheers,

Peter
