While I'm still tweaking and fixing small bugs, the text indexing stuff
that I've been working on is ready for rolling into the trunk. Here's a
quick overview of what's going on:
In order to facilitate text indexing, I added some simple event
functionality that allows handlers (callables) to subscribe to events
that happen to graphs. When a triple is added to a graph, an event is
fired and callables that have subscribed to that event are handed the
event and the triple that triggered it. As complex as it sounds, it's
only a few lines of Python code. A new kind of graph called a TextIndex
subscribes to graph events and indexes any triples that are added to the
data graph. So there are two graphs when text indexing, the data graph
which contains the source of all the triples, and the index graph that
contains the index information of the data graph. The core of events is
here:
http://svn.rdflib.net/branches/michel-events/rdflib/events.py
Graph support is done by a subclass:
http://svn.rdflib.net/branches/michel-events/rdflib/EventGraph.py
although that should probably be rolled into Graph.
Events remove the need to subclass and override Graph.add/remove to add
special behavior. It keeps complex stuff out of Graph. In fact I
imagine some complex stuff that is now in Graph could be refactored into
event handlers. Of note I discovered is that it is absurdly difficult
to subclass Graph these days, you have to subclass Graph, and
ConjunctiveGraph, and override CG.parse at a bare minimum to enhance
Graph. Event handlers remove a lot of the need to subclass as I've
found overriding add/remove to be the most common case.
The index graph will index any triple that has a object literal string.
I won't bore you with the mundane triple calculus, but the basic
operation is that literal text is "split" into terms. The terms are
filtered to remove very common terms (currently english only) and each
term is stored in an RDF graph along with other statements that says
which statements the term occured in in the *data* graph. Here's an
example using the boston.openguides.org data (about 80K triples data
graph size, 86K triples index graph size):
>>> for s,p,o in t.search('coffee'): print s
...
http://boston.openguides.org/?id=Sunrise_Coffee_%26_Sub_Shop_Ii;format=rdf#obj
http://boston.openguides.org/?id=Two_Sisters_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Winston's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Sonny's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Topsfield_Bagel_Co_%26_Coffee;format=rdf#obj
http://boston.openguides.org/?id=Romano's_Bakery_%26_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Co;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Company;format=rdf#obj
Here I'm just printing the subject, but you can print the predicate that
the term occurs in as well (Note the term 'coffee' occurs in the
*objects* of these statements which for brevity here I am only showing
the *subjects*. In this data, the object is the name of the company
which the subject URI also happens to be based on, but the subjects are
not indexed, only the objects). The object is always None, since the
text index does not store the object literal itself (that would make the
index huge) and that data is easily available from the source graph (if
you keep it around). For convienience, text indexes support a 'link_to'
method that lets you link a TI to a data graph so that when you query it
it returns the real object literal value instead of None. A
straightforward doctest and all the gory details are in:
http://svn.rdflib.net/branches/michel-events/rdflib/TextIndex.py
In addition to facilitating text indexing, this events adds some other
bonus functionality, like multiple stores can subscribe to one graph,
and when the one graph changes, all the stores changes as well. Note
that the graph is still only backed by one graph that is actually
queried and considered the backend for the graph, but other stores can
"follow along" the changes made to the primary store. For example, a
graph can have an in memory backend, and also have a database store
subscribe to the graph, so that any changes to the in memory graph are
written through to the db as well. I have also used a subscriber that
keeps a count of all the triples added to the graph, and commits a ZODB
subtransaction or prints a progress message when a threshold is hit.
This could solve the "upkeep" use case Chimzie mentioned earlier.
Comments? I'll probably be rolling this in this week if there is no
conflicting work going on. This work was paid for by the good folks at
Six Feet Up and we are starting to use this code in production.
Searching time is very fast for even very large data sets, and I was
able to index 2MT from a 10MT Swoogle dump and still get subsecond
search speeds for searches. I think this could really differentiate
rdflib, especially given that we are not text indexing using xapian or
any kind of black box, but instead keep all the index data in rdf
itself. This makes it ultimately portable.
_______________________________________________
Dev mailing list
[email protected]
http://rdflib.net/mailman/listinfo/dev