On 9 November 2012 19:56, Rupert Westenthaler wrote:
> Hi all,
>
> let me share my throughs. Because this mail is rather long I tried to
> split it up in three separate section (1) RDF (2) RESTful/ Web
> Interface and (3) other related topics
>
>
> RDF libs:
> ====
>
> Out of the viewpoint of Apache Stanbol one needs to ask the Question
> if it makes sense to manage an own RDF API. I expect the Semantic Web
> Standards to evolve quite a bit in the coming years and I do have
> concern that the Clerezza RDF modules will be updated/extended to
> provide implementations of those. One example of such an situation is
> SPARQL 1.1 that is around for quite some time and is still not
> supported by Clerezza. While I do like the small API, the flexibility
> to use different TripleStores and that Clerezza comes with OSGI
> support I think given the current situation we would need to discuss
> all options and those do also include a switch to Apache Jena or
> Sesame. Especially Sesame would be an attractive option as their RDF
> Graph API [1] is very similar to what Clerezza uses.

Sesame has three different APIs for RDF graph manipulation and querying.
In my experience, the main API that users target is the Repository API,
and repository implementors are encouraged to target the SAIL API. Very
few users or implementors actually use the Graph API for significant
purposes.

> Apache Jena's
> counterparts (Model [2] and Graph [3]) are considerable different and
> more complex interfaces. In addition Jena will only change to
> org.apache packages with the next major release so a switch before
> that release would mean two incompatible API changes.
>
> My personal opinion is that we should keep using Clerezza for now.
> Invest some effort to improve the Clerezza RDF modules and than see
> how it further develops. Such an Effort should include
>
> * to implement SPQRAL fast lane (as already discussed with Reto
> during ApacheCon). Fast lane would allow Clerezza to use the native
> SPARQL engine of the used Triplestore. Meaning that Clerezza only
> parses those parts of the SPARQL query to understand the RDF graph to
> execute the Query on. This information is than used to parse the query
> to the native SPARQL engine via an extended Interface of the
> TcProvide. The Clerezza SPARQL implementation would only be used in
> case the TcProvider does not provide a native SPARQL implementation of
> if the Query spans RDF graphs managed by different TcProvider
> instances. By that Clerezza users would be able to use any SPARQL
> feature provided by the used TripleStore.

The SPARQL 1.1 specification is now a Proposed Recommendation, so this
would be a good time to implement it without fearing more of the large
changes that happened between each of the Working Drafts so far.
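
To make the fast lane idea quoted above concrete, here is a rough sketch
of the dispatch decision only. It is not Clerezza code: Provider and
NativeSparqlProvider are hypothetical stand-ins for TcProvider and the
proposed extended interface, the set of graphs is passed in rather than
parsed out of the query, and results are plain strings.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class FastLaneSketch {

    /** Hypothetical stand-in for a TcProvider; not the real Clerezza interface. */
    interface Provider {
        Set<String> graphNames();
    }

    /** Hypothetical extension for providers whose store has its own SPARQL engine. */
    interface NativeSparqlProvider extends Provider {
        String executeNative(String query);
    }

    /** Finds a provider that manages every graph the query touches, if one exists. */
    static Optional<Provider> soleOwner(List<Provider> providers, Set<String> graphs) {
        return providers.stream()
                .filter(p -> p.graphNames().containsAll(graphs))
                .findFirst();
    }

    /**
     * Hands the query to the native engine when a single provider owns all
     * of the graphs and supports native SPARQL; otherwise falls back.
     */
    static String execute(List<Provider> providers, Set<String> graphs, String query) {
        Optional<Provider> owner = soleOwner(providers, graphs);
        if (owner.isPresent() && owner.get() instanceof NativeSparqlProvider) {
            return ((NativeSparqlProvider) owner.get()).executeNative(query);
        }
        // Placeholder for the generic, store-independent SPARQL implementation.
        return "generic:" + query;
    }

    public static void main(String[] args) {
        Provider memory = () -> Set.of("urn:example:a");
        NativeSparqlProvider store = new NativeSparqlProvider() {
            @Override
            public Set<String> graphNames() {
                return Set.of("urn:example:b", "urn:example:c");
            }

            @Override
            public String executeNative(String query) {
                return "native:" + query;
            }
        };
        List<Provider> providers = List.of(memory, store);

        // One native-capable provider owns the graph: fast lane.
        System.out.println(execute(providers, Set.of("urn:example:b"), "ASK {}"));
        // The graphs span two providers: generic fallback.
        System.out.println(
                execute(providers, Set.of("urn:example:a", "urn:example:b"), "ASK {}"));
    }
}
```

The point of keeping the check this small is that only the graph names
need to be understood before delegating, so any SPARQL 1.1 feature the
underlying store supports would pass straight through.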

> * update to the newest Jena versions (see also STANBOL-621; Peter
> Ansell's Clerezza fork on github [5] as well as Sebastian Schaffert's
> Jena bundle used for the Stanbol/LMF integration [5])

I made changes to Clerezza to experiment with a few things that I saw as
issues when I was experimenting with Stanbol. The biggest design issue
for me was that every graph was loaded into memory in bulk. The
underlying reason seemed to be that java.util.Iterator does not have a
close method, so there is no way of knowing when to release resources if
an iterator is not used to completion.

The other issue for me was that I wanted to use an underlying Sesame
repository, but the Sesame module had not been maintained and had been
left off the parent reactor, so it was no longer compatible with the
other modules when I was experimenting with it.

Given those goals, I removed all of the Clerezza CMS modules from my Git
fork and focused on the underlying libraries, which would be much easier
to maintain if they were separate. In addition, I did not want to use
OSGI, so in many cases I had to make changes to allow a completely
programmatic instantiation of components, as some fields were left
private with no mutator method and in some cases no public constructor
that could be used to populate the field programmatically. For all of
the good that OSGI may provide for otherwise complex systems, it is not
good Java software engineering to leave private fields with no
programmatic way to populate them.

> * finish and release the SingleTdbDatasetTcProvider.java
> (CLEREZZA-691) as this is important for the Stanbol Ontology Manager
> component
> * move the Indexed in-memory graph (CLEREZZA-683) from the Stanbol
> code base to Clerezza and release it so that we can use it from their
> in Stanbol
> * provide an Clerezza JsonLD parser/serializer. This is critical for
> Stanbol as several CMS use this as preferred RDF serialization.

I would focus on getting a single Java implementation of JSON-LD working
here. I know that Reto has been working on this by contributing a
Clerezza serialiser/callback implementation to Tristan King's
JSONLD-Java library on GitHub. When I get a chance I am going to suggest
that the dependencies for Sesame/Jena/Clerezza go in separate modules in
the JSONLD-Java project, to make Maven dependency chaining simpler.

Given that Stanbol is already quite large, it is not viable to transfer
the RDF libraries there. It also does not look viable to leave them
combined with the other Clerezza modules, as they are unrelated and will
have a different release cycle, when and if the CMS components are
maintained. It would be useful IMO to split the Clerezza project to make
it simpler to maintain the reusable libraries. In particular, if
Clerezza split out its RDF libraries, they might eventually gain a
similar level of developer support to Sesame or Jena.

Currently the only project I can name that uses the Clerezza RDF
libraries at its core is Stanbol, which reduces the user base, and hence
reduces the developer support base. That doesn't mean that there aren't
other projects out there that use Clerezza, just that I have not come
across them. Almost all of the comments in this thread are about fixing
issues in the Clerezza RDF libraries based on experience from Stanbol.

One issue that would be easier to solve if the project split is the
disparate version numbers between modules, which make it difficult to
identify when to update dependencies. I have suggested on the Stanbol
list having a single version for each release, so that people have a
single figure in their head when they describe the version of the
Stanbol components they are relying on. It would be much easier to
depend on Clerezza IMO if there was a single current release version
number for all of the library components.
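
As a sketch of what a single version number would look like from a
downstream user's side, a POM fragment could be as simple as the
following. The clerezza.version property name and the X.Y.Z value are
placeholders, the two artifacts are only illustrative, and no such
unified Clerezza release exists today.

```xml
<properties>
  <!-- Hypothetical single release version shared by all library modules -->
  <clerezza.version>X.Y.Z</clerezza.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>rdf.core</artifactId>
    <version>${clerezza.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>rdf.jena.parser</artifactId>
    <version>${clerezza.version}</version>
  </dependency>
</dependencies>
```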

Maven properties make it insanely easy to migrate to new versions of
multi-module libraries. If there is a single version number for each new
release, and if the trunk is consistently stable, then releases can be
made at any time and users can bump a single version property in Maven
to upgrade their systems.

I know that in Subversion it is difficult to guarantee that the trunk is
stable at each point in time, because people are afraid to branch due to
the difficulty involved and will instead develop new features on the
trunk. The good news is that there are now a number of distributed
version control systems that make it easy to develop features on
lightweight branches separate from the trunk and painlessly merge them
back in when they are stable.

The suggestion to Stanbol to switch to a single version number for
releases was knocked back on the premise that not all modules would be
stable at the same time. If people were able to easily use branches for
new features and tested them well before integrating them, then the
trunk would be fairly consistently stable. That stability would remove
the difficulty that Stanbol has still had, since my suggestion, in
identifying all of the dependencies that need updating when they want to
release a new version of a particular module: every module, even those
without significant changes since their last release, would be released
at each stage on the basis of their trunk/master tests consistently
passing in Jenkins.

Cheers,

Peter
