I have some suggestions
This sounds neat, but I keep running into bugs and poor code support
and documentation in rdflib, and it's getting to the point that I'm
wondering whether I should be using it in future production scenarios
-- just a few, er, highlights:
1. ntriples serialization is broken, as I reported last week
The NTriples test document (rather small) could easily be ported into
a unittest:
http://www.w3.org/2000/10/rdf-tests/rdfcore/ntriples/test.nt
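As a sketch of that port (not rdflib's actual test harness — the class name and the coarse grammar regex here are hypothetical), each non-comment line of test.nt could be checked against an NTriples line pattern inside a unittest:

```python
import re
import unittest

# Coarse NTriples line grammar: <uri> or _:bnode subject, <uri> predicate,
# <uri>, _:bnode, or "literal" (with optional @lang or ^^<datatype>) object,
# terminated by " ."  This is a syntax smoke test, not a full parser.
NT_LINE = re.compile(
    r'^(?:<[^>]*>|_:\w+)\s+'                                          # subject
    r'<[^>]*>\s+'                                                     # predicate
    r'(?:<[^>]*>|_:\w+|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]*>)?)'    # object
    r'\s*\.\s*$'
)

class NTriplesSyntaxTest(unittest.TestCase):
    """Checks every non-blank, non-comment line of an .nt document."""

    # In the real test this would be the W3C test.nt document above.
    DOC = '''
# comment line
<http://example.org/s> <http://example.org/p> <http://example.org/o> .
<http://example.org/s> <http://example.org/p> "hello"@en .
_:b1 <http://example.org/p> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
'''

    def test_lines(self):
        for line in self.DOC.splitlines():
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            self.assertIsNotNone(NT_LINE.match(line), line)
```

Run with `python -m unittest`; a round-trip variant (parse, serialize, re-parse, compare graphs) would catch the serialization bug directly.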
3. the support for datatyped literals is remarkably poor. I've been
complaining about this for years, have sent design ideas and even a
patch or two, as I recall, and it's still really bad... (Every time I
teach someone rdflib, or use it in a project, I have to start by
showing them how to do real, *minimally* decent datatyped-literal
support by writing a wrapper around Literal, which maps basic Python
types to RDF datatypes. This gets *so* old!)
I guess I wouldn't characterize this as poor so much as inconvenient,
because it's a straightforward mapping. I don't have any of your ideas
/ patches (they might predate my switch to rdflib from 4Suite RDF),
but (ironically - see below), such a binding is in the Sparta code
base:
import base64  # needed for the base64Binary entries below

SchemaToPython = {  # (schema->python, python->schema)  Does not validate.
    'http://www.w3.org/2001/XMLSchema#string': (unicode, unicode),
    'http://www.w3.org/2001/XMLSchema#normalizedString': (unicode, unicode),
    'http://www.w3.org/2001/XMLSchema#token': (unicode, unicode),
    'http://www.w3.org/2001/XMLSchema#language': (unicode, unicode),
    'http://www.w3.org/2001/XMLSchema#boolean':
        (bool, lambda i: unicode(i).lower()),
    'http://www.w3.org/2001/XMLSchema#decimal': (float, unicode),
    'http://www.w3.org/2001/XMLSchema#integer': (long, unicode),
    'http://www.w3.org/2001/XMLSchema#nonPositiveInteger': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#long': (long, unicode),
    'http://www.w3.org/2001/XMLSchema#nonNegativeInteger': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#negativeInteger': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#int': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#unsignedLong': (long, unicode),
    'http://www.w3.org/2001/XMLSchema#positiveInteger': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#short': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#unsignedInt': (long, unicode),
    'http://www.w3.org/2001/XMLSchema#byte': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#unsignedShort': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#unsignedByte': (int, unicode),
    'http://www.w3.org/2001/XMLSchema#float': (float, unicode),
    'http://www.w3.org/2001/XMLSchema#double': (float, unicode),
    'http://www.w3.org/2001/XMLSchema#base64Binary':
        (base64.decodestring, lambda i: base64.encodestring(i)[:-1]),
    'http://www.w3.org/2001/XMLSchema#anyURI': (str, str),
}
Probably makes better design sense to move it into a method on Literal.
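As a rough sketch of what such a Literal method could wrap — standalone functions here, Python 3, and all names hypothetical rather than rdflib or Sparta API:

```python
import base64

XSD = 'http://www.w3.org/2001/XMLSchema#'

# Lexical form -> Python value, keyed by datatype URI (coarse, non-validating).
_TO_PYTHON = {
    XSD + 'string': str,
    XSD + 'boolean': lambda v: v.strip().lower() in ('true', '1'),
    XSD + 'integer': int,
    XSD + 'int': int,
    XSD + 'long': int,
    XSD + 'decimal': float,
    XSD + 'float': float,
    XSD + 'double': float,
    XSD + 'base64Binary': base64.b64decode,
    XSD + 'anyURI': str,
}

# Python type -> (datatype URI, Python value -> lexical form).
_FROM_PYTHON = {
    bool: (XSD + 'boolean', lambda v: str(v).lower()),  # bool before int:
    int: (XSD + 'integer', str),                        # bool subclasses int
    float: (XSD + 'double', str),
    bytes: (XSD + 'base64Binary',
            lambda v: base64.b64encode(v).decode('ascii')),
    str: (XSD + 'string', str),
}

def to_python(lexical, datatype=None):
    """Map a literal's lexical form to a Python value via its datatype."""
    return _TO_PYTHON.get(datatype, str)(lexical)

def from_python(value):
    """Map a Python value to a (lexical form, datatype URI) pair."""
    for typ, (dt, conv) in _FROM_PYTHON.items():
        if isinstance(value, typ):
            return conv(value), dt
    raise TypeError('no XSD mapping for %r' % type(value))
```

On Literal itself, to_python would become the method body of something like Literal.toPython(), with from_python driving the constructor when handed a plain Python value.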
4. no *real* query language support, though it seems Chimezie has
fixed this to some extent by allowing Versa querying... Would be
*really* nice to have that rolled into the next major release. SPARQL
would be nice too, but *any* query language is better than none.
This is a tough one, because IMHO a decent, portable, *standalone* (I'm
a big fan of minimal software dependencies) SPARQL parser is all that's
needed to tie in the very effective sparql-q library, which
unfortunately has to be used programmatically. I've been fooling
around with BisonGen (the Python C extension parser generator that
4Suite uses for XPath/XPointer/XSLT parsing) to that end.
Mostly, however, I've been using rdflib via the 4Suite RDF API in order
to dispatch Versa queries, as well as to develop web applications using
the 4Suite repository while persisting RDF in rdflib - for production
projects. So the querying support has been sufficient for me, and it
will be vastly improved with SPARQL support. That will also introduce
some interesting opportunities (given that SPARQL result sets bind
values to variables and Versa has built-in support for variable
binding).
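The variable-binding model both SPARQL and Versa share is independent of the parser question and easy to illustrate; a toy basic-graph-pattern matcher over in-memory triples (all names hypothetical):

```python
# Variables are strings starting with '?'; a result is a dict of bindings,
# one per way the pattern(s) can be matched against the triple set.

def is_var(term):
    return isinstance(term, str) and term.startswith('?')

def match_pattern(triples, pattern, binding=None):
    """Yield one {variable: value} dict per triple matching the pattern."""
    binding = binding or {}
    for triple in triples:
        b = dict(binding)
        for p, t in zip(pattern, triple):
            p = b.get(p, p)            # already-bound variable? use its value
            if is_var(p):
                b[p] = t               # fresh variable: bind it
            elif p != t:
                break                  # constant (or bound value) mismatch
        else:
            yield b

def query(triples, patterns):
    """Join several patterns, SPARQL WHERE-clause style."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings
                    for b2 in match_pattern(triples, pattern, b)]
    return bindings
```

The join step is exactly where a shared result-set representation would let SPARQL bindings feed Versa's variable binding, and vice versa.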
5. Is there any inference support at all, even for just simple RDFS?
It's not clear what the state of play here is, and that's a problem
in addition to what appears to be the lack of support...
Inference support should be third-party or clearly separated, given how
open-ended RDF/RDFS entailment can be (i.e. which entailment rules do
you support, how do you implement them - forward or backward chaining -
when do you infer, do you persist what you infer, etc.). If you mix
and match inference behavior with a 'vanilla' RDF programming API, you
risk bloating your API in such a way that it's no longer portable or
agile.
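For instance, a single RDFS entailment rule (rdfs9: type propagation along subClassOf) can live as a standalone forward-chaining pass over plain triples, entirely outside any graph API (function and constant names are hypothetical):

```python
RDF_TYPE = 'rdf:type'
RDFS_SUBCLASSOF = 'rdfs:subClassOf'

def rdfs9_closure(triples):
    """Forward-chain rule rdfs9: (x type A) + (A subClassOf B) => (x type B).

    Returns a new set of triples; the input graph is left untouched, so
    the caller decides whether (and where) to persist what was inferred.
    """
    inferred = set(triples)
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == RDFS_SUBCLASSOF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for klass, superklass in subclass:
                if o == klass and (s, RDF_TYPE, superklass) not in inferred:
                    inferred.add((s, RDF_TYPE, superklass))
                    changed = True
    return inferred
```

Keeping the rule as a pure function on triples is one way to honor that separation: the 'vanilla' API never changes, and which rules run, and when, stays an application decision.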
6. A shocking amount of API instability for a 2.x versioned library.
I think coming to consensus on the 'core' API and freezing it would
help. Especially since you can then bang out a thorough regression
test suite as a foundation for further release cycles.
7. Sparta -- which I hear may be rolled into a future release -- is
really unhelpful in many cases; for anything but a very simple RDF
graph, it's a lot more trouble than it's worth to create a bunch of
in-mem Py objects and then just interact with *them*... That's not my
idea at all of a Python-RDF databinding tool.
Well, it's the same principle as in any data language / format. The
idea is relying on a host language (a rather agile one in this case) as
an idiom for interacting with raw data where the alternative is too
cumbersome - ElementTree versus DOM, E4X versus DOM, etc. It makes
sense in some use cases but not in others. However, I'd much rather
interact with nodes as objects than use Graph.triples,
Graph.subject_predicates, Graph.add, Graph.remove, etc.
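That object idiom can be sketched in a few lines, with attribute access standing in for triple lookups (purely illustrative - not Sparta's or rdflib's actual classes):

```python
class Node:
    """Proxy that turns attribute access into lookups over a triple set."""

    def __init__(self, graph, uri):
        self._graph = graph            # a set of (s, p, o) tuples
        self._uri = uri

    def __getattr__(self, name):
        # node.name -> all objects o such that (node, name, o) is in the graph
        return [o for s, p, o in self._graph
                if s == self._uri and p == name]
```

So instead of iterating Graph.triples((bob, knows, None)), one writes bob.knows and gets a list back - the host language carries the idiom.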
(FWIW: Sparta's use of OWL cardinality constraints is *completely*
broken. OWL cardinality constraints are *not* database constraints,
at all, but that's how Sparta uses and describes them. Which just
*spreads confusion*. That's just broken by design. Nothing prevents
anyone from doing database constraints, but those properties should
*not* be in the OWL namespace. Database integrity constraints are a
*very* good thing, but that's *not* what OWL max cardinality is about.)
I agree. Such checking should be done by a 'formal', external
reasoner (FaCT++, Pellet, etc.) or, at the very least, by an
application that has a specific set of OWL/RDFS constraints in mind.
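An application-level check of that sort is small enough to live outside the binding entirely. A sketch over plain triples, with an explicitly non-OWL limits table (names hypothetical):

```python
from collections import Counter

def check_max_values(triples, limits):
    """Report (subject, property, count) for pairs exceeding an
    application-imposed maximum number of values.

    These are database-style integrity checks, deliberately NOT OWL
    cardinality semantics: OWL maxCardinality constrains interpretations
    of the data, not how many triples you may store.
    """
    counts = Counter((s, p) for s, p, o in triples if p in limits)
    return [(s, p, n) for (s, p), n in counts.items() if n > limits[p]]
```

Since the limits table carries its own vocabulary, nothing here pretends to be OWL - which is exactly the separation being argued for above.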
8. Graph.value() is also completely broken with respect to RDF
semantics, and the explanation in the error message that you get when
you call Graph.value() is misleading and unhelpful. In fact, the
docstring for this method is *flatly* wrong; it says "Useful if one
knows that there may only be one value"... Hmm, actually, value() is
*totally* broken if there is more than one value, regardless of what
one knows or whether one knows that there *may* be more than one
value (what does that even mean?)... And the any keyword arg really
makes it worse... the docstring says if any=True, then value() will
"return any value in the case there is more than one" -- huh? Is
there any guarantee as to which one it returns? The first? A random
one? Is it deterministic? How is this useful? Why not return a Python
set of the values? Or a list? Or a tuple? (And if the value is an
*RDF* list, make a class to distinguish that case and return an
instance of that class...)
I think support for 'formal' query languages makes such a function
less useful in an API. In general I'm of the opinion that there is a
very clear boundary within which a 'vanilla' RDF API should be kept.
It should be bound as closely to the host language as possible
(__len__, __contains__, etc.) and clearly separated from the 'other'
APIs (querying, inference, persistence, etc.).
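Kendall's suggested fix - return a Python set and let the caller decide - fits that host-language boundary. A sketch over plain triples (hypothetical helpers, not the actual Graph API):

```python
def values(triples, subject, predicate):
    """All objects for (subject, predicate), as a set: zero, one, or many.

    Unlike a value() that silently picks one, the caller sees exactly
    what is in the graph and decides how to handle multiplicity.
    """
    return {o for s, p, o in triples if s == subject and p == predicate}

def one_value(triples, subject, predicate):
    """Strict variant: raise unless exactly one value exists."""
    found = values(triples, subject, predicate)
    if len(found) != 1:
        raise ValueError('expected exactly 1 value, got %d' % len(found))
    return found.pop()
```

The strict variant gives "I know there is only one value" a checkable meaning instead of an unstated assumption.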
I really shouldn't have to suffer Sparta to get this kind of sane
support from bare rdflib.
Sparta is very lightweight and I think it would be worth the effort to port.
So, this is open source, which means, basically, "put up or shut up".
And I'm prepared to do just that, since I have projects where I need
sane RDF support in Python. If you're interested in working on rdflib
with funding, ping me and we can talk. I don't know if fixing rdflib,
or forking it, or starting over is the right thing. But something
needs to be done, and soon, to improve the state of RDF libraries for
Python.
I have a huge interest in rdflib as well, more from a web-app
perspective than pure RDF processing. A lot of what I look for in a
Python RDF library I have with rdflib and don't with others (Redland)
- which is why I started working with it and pushed to port 4Suite to
use it instead. Having a mailing list helps. It's a large body of
work to manage, and it's made worse by the fact that RDF is so recent
on the programming scene that there aren't yet any well-known best
practices for APIs, programming conventions, etc. My suggestions for
quickly aligning the compass:
1) More test suites for the aspects of RDF that are more formal (N3
test suites, NTriples test suites, RDF/XML test suites, API test
suites).
2) Extend the Python / RDF idiom with built in binding of datatyped
Literals to Python objects and RDF node-level data binding (Sparta).
3) Freeze core Graph / Store API
4) Test suites for Notation 3 persistence (use of contexts and formulae)
5) Documentation (at least for the 'core' Graph API) - this is the
Achilles heel of open source development
Cheers,
Kendall Clark
_______________________________________________
Dev mailing list
[email protected]
http://rdflib.net/mailman/listinfo/dev