[rdflib-dev] text indexing

Michel Pelletier Mon, 01 May 2006 22:03:16 -0700

While I'm still tweaking and fixing small bugs, the text indexing stuff that I've been working on is ready for rolling into the trunk. Here's a quick overview of what's going on:

In order to facilitate text indexing, I added some simple event functionality that allows handlers (callables) to subscribe to events that happen to graphs. When a triple is added to a graph, an event is fired and callables that have subscribed to that event are handed the event and the triple that triggered it. As complex as it sounds, it's only a few lines of Python code. A new kind of graph called a TextIndex subscribes to graph events and indexes any triples that are added to the data graph. So there are two graphs when text indexing, the data graph which contains the source of all the triples, and the index graph that contains the index information of the data graph. The core of events is here:


http://svn.rdflib.net/branches/michel-events/rdflib/events.py

Graph support is done by a subclass:

http://svn.rdflib.net/branches/michel-events/rdflib/EventGraph.py

although that should probably be rolled into Graph.
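As a rough illustration of the event idea described above, here is a minimal sketch. It is not the code in events.py or EventGraph.py; the names TripleAdded, Dispatcher and EventedGraph are hypothetical, and the toy graph is just a set of tuples rather than an rdflib Graph.

```python
# Hypothetical names throughout; this is not the code in events.py
# or EventGraph.py, only a sketch of the subscribe/dispatch idea.

class TripleAdded:
    """Event object carrying the triple that was just added."""
    def __init__(self, triple):
        self.triple = triple


class Dispatcher:
    """Maps event types to the callables subscribed to them."""
    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event):
        # Hand the event to every handler subscribed to its type.
        for handler in self._handlers.get(type(event), []):
            handler(event)


class EventedGraph(Dispatcher):
    """Toy graph: a set of triples that fires an event on every add."""
    def __init__(self):
        super().__init__()
        self.triples = set()

    def add(self, triple):
        self.triples.add(triple)
        self.dispatch(TripleAdded(triple))


seen = []
g = EventedGraph()
g.subscribe(TripleAdded, lambda event: seen.append(event.triple))
g.add(("ex:shop", "ex:name", "Two Sisters Coffee Shop"))
```

The handler here only records the triple; an indexer or a second store would subscribe in the same way, without subclassing the graph.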

Events remove the need to subclass and override Graph.add/remove to add special behavior. They keep complex stuff out of Graph. In fact I imagine some complex stuff that is now in Graph could be refactored into event handlers. One thing I discovered is that it is absurdly difficult to subclass Graph these days: you have to subclass Graph and ConjunctiveGraph, and override CG.parse at a bare minimum, to enhance Graph. Event handlers remove a lot of the need to subclass, as I've found overriding add/remove to be the most common case.

The index graph will index any triple that has an object literal string. I won't bore you with the mundane triple calculus, but the basic operation is that literal text is "split" into terms. The terms are filtered to remove very common terms (currently English only) and each term is stored in an RDF graph along with other statements that say which statements the term occurred in in the *data* graph. Here's an example using the boston.openguides.org data (about 80K triples data graph size, 86K triples index graph size):


>>> for s, p, o in t.search('coffee'): print(s)
...
http://boston.openguides.org/?id=Sunrise_Coffee_%26_Sub_Shop_Ii;format=rdf#obj
http://boston.openguides.org/?id=Two_Sisters_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Winston's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Sonny's_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Topsfield_Bagel_Co_%26_Coffee;format=rdf#obj
http://boston.openguides.org/?id=Romano's_Bakery_%26_Coffee_Shop;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Co;format=rdf#obj
http://boston.openguides.org/?id=Starbucks_Coffee_Company;format=rdf#obj

Here I'm just printing the subject, but you can print the predicate that the term occurs in as well. (Note that the term 'coffee' occurs in the *objects* of these statements; for brevity I am only showing the *subjects* here. In this data, the object is the name of the company, which the subject URI also happens to be based on, but the subjects are not indexed, only the objects.) The object is always None, since the text index does not store the object literal itself (that would make the index huge) and that data is easily available from the source graph (if you keep it around). For convenience, text indexes support a 'link_to' method that lets you link a TI to a data graph so that when you query it, it returns the real object literal value instead of None. A straightforward doctest and all the gory details are in:


http://svn.rdflib.net/branches/michel-events/rdflib/TextIndex.py
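To make the split, filter and search steps concrete, here is a small sketch under loud assumptions. It keeps postings in a plain dict instead of an RDF graph (the real TextIndex stores its index as triples), the stop-word list is a tiny made-up one, and ToyTextIndex, Literal and terms are hypothetical names. Only the search and link_to method names come from the description above; their bodies are guesses at the behavior, not the actual implementation.

```python
import re

# Tiny illustrative stop-word list; the real filter is not shown in this post.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


class Literal(str):
    """Hypothetical stand-in for an RDF string literal."""


def terms(text):
    """Split literal text into lowercase terms, dropping very common words."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower())
            if t not in STOP_WORDS]


class ToyTextIndex:
    def __init__(self):
        self.postings = {}   # term -> set of (subject, predicate)
        self.data = None     # optional linked data graph (iterable of triples)

    def index(self, triple):
        s, p, o = triple
        if isinstance(o, Literal):   # only literal objects are indexed
            for term in terms(o):
                self.postings.setdefault(term, set()).add((s, p))

    def link_to(self, data):
        self.data = data

    def search(self, term):
        term = term.lower()
        for s, p in sorted(self.postings.get(term, ())):
            yield (s, p, self._lookup(s, p, term))

    def _lookup(self, s, p, term):
        # Without a linked data graph the object is unknown, so return None.
        if self.data is None:
            return None
        for ds, dp, do in self.data:
            if (ds, dp) == (s, p) and isinstance(do, Literal) and term in terms(do):
                return do
        return None


data = {
    ("ex:two_sisters", "ex:name", Literal("Two Sisters Coffee Shop")),
    ("ex:bagel", "ex:name", Literal("Topsfield Bagel Co & Coffee")),
    ("ex:two_sisters", "ex:seeAlso", "ex:bagel"),
}
index = ToyTextIndex()
for triple in data:
    index.index(triple)

unlinked = list(index.search("coffee"))   # objects come back as None
index.link_to(data)
linked = list(index.search("coffee"))     # objects filled in from the data
```

Because the index only records which (subject, predicate) pairs a term occurred in, it stays small; the literal itself is fetched from the data graph only once the two are linked.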

In addition to facilitating text indexing, the event mechanism adds some other bonus functionality: multiple stores can subscribe to one graph, and when the one graph changes, all the stores change as well. Note that the graph is still only backed by one store that is actually queried and considered the backend for the graph, but other stores can "follow along" with the changes made to the primary store. For example, a graph can have an in-memory backend, and also have a database store subscribe to the graph, so that any changes to the in-memory graph are written through to the db as well. I have also used a subscriber that keeps a count of all the triples added to the graph, and commits a ZODB subtransaction or prints a progress message when a threshold is hit. This could solve the "upkeep" use case Chimzie mentioned earlier.
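The two subscriber patterns mentioned here, a write-through follower and a threshold counter, might look roughly like this. Everything in the sketch is hypothetical: the toy Graph, the list standing in for a database store, and the ThresholdCounter class are illustrations, not rdflib or ZODB APIs.

```python
class Graph:
    """Toy graph that calls every subscriber with each added triple."""
    def __init__(self):
        self.triples = set()
        self.subscribers = []

    def add(self, triple):
        self.triples.add(triple)
        for handler in self.subscribers:
            handler(triple)


class ThresholdCounter:
    """Counts added triples and runs a callback every `threshold` adds."""
    def __init__(self, threshold, callback):
        self.threshold = threshold
        self.callback = callback
        self.count = 0

    def __call__(self, triple):
        self.count += 1
        if self.count % self.threshold == 0:
            # In real use this could commit a subtransaction
            # or print a progress message.
            self.callback(self.count)


memory = Graph()   # primary, in-memory graph that is actually queried
mirror = []        # stand-in for a database store that follows along
progress = []      # records the counts at which the threshold was hit

memory.subscribers.append(mirror.append)
memory.subscribers.append(ThresholdCounter(2, progress.append))

for i in range(5):
    memory.add(("ex:s%d" % i, "ex:p", "ex:o"))
```

The primary graph knows nothing about what its subscribers do, which is what lets a second store or a progress reporter be attached without touching add itself.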

Comments? I'll probably be rolling this in this week if there is no conflicting work going on. This work was paid for by the good folks at Six Feet Up and we are starting to use this code in production. Searching is very fast for even very large data sets, and I was able to index 2MT from a 10MT Swoogle dump and still get subsecond search times. I think this could really differentiate rdflib, especially given that we are not text indexing using Xapian or any kind of black box, but instead keep all the index data in RDF itself. This makes it ultimately portable.
