Re: Sphinx integration (was: Working on Lucene)

Jan Lehnardt Fri, 21 Mar 2008 15:27:21 -0700


On Mar 21, 2008, at 17:55 , Chris Anderson wrote:

On Fri, Mar 21, 2008 at 1:34 PM, Jan Lehnardt <[EMAIL PROTECTED]> wrote:

Thanks for the input. This is actually an implementation detail of
the Indexer, but I agree that this should be supported. I also think
we should have some standard way here so other search solutions
can be plugged in without breaking things.


Jan,

Some thoughts about Sphinx integration.

The HTTP API as it currently stands (just the ability to page through
an entire view) is sufficient to implement Sphinx indexing on views as
an external process.
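
A rough sketch of what such an external paging loop could look like. The fetchPage callback is a hypothetical stand-in for the actual HTTP request against the view (it is not a CouchDB function), and the row shape with id, key, and value fields is an assumption about what the view returns. Each request asks for one row more than the page size, and that extra row becomes the starting point of the next page.

```javascript
// Walks every row of a view, one page at a time.
// fetchPage(start, limit) is a hypothetical, caller-supplied function that
// performs the HTTP request and returns up to `limit` rows in view order,
// beginning at `start` ({key, id}) inclusive, or at the first row when
// `start` is null.
function forEachViewRow(fetchPage, pageSize, onRow) {
  var start = null; // null means "from the beginning of the view"
  var seen = 0;
  for (;;) {
    // Ask for one extra row; if it exists, it starts the next page.
    var rows = fetchPage(start, pageSize + 1);
    var page = rows.slice(0, pageSize);
    page.forEach(function (row) { onRow(row); });
    seen += page.length;
    if (rows.length <= pageSize) {
      return seen; // no extra row came back, so this was the last page
    }
    var next = rows[pageSize];
    start = { key: next.key, id: next.id };
  }
}
```

The onRow callback is where an indexer would hand each row to Sphinx, for example by writing it into whatever input format the Sphinx data source expects.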

However, Sphinx has the requirement that the documents it indexes each
have a unique, numerical id. Using the CouchDB document ID would not
be advised in that case. Using a map function that emits once per
document (or using Reduce/Combine when it becomes available) coupled
with a function to deterministically convert CouchDB document ids into
integers should make for views which can be easily indexed by Sphinx.

The map function might look like this:

function(doc) {
  if (doc.title) {
    // docIDtoInteger stands for the ID-to-integer conversion described above
    map(docIDtoInteger(doc._id), doc.title);
  }
}

It's too bad that Sphinx doesn't support arbitrary strings as document
IDs, but I'm sure there are plenty of reversible string-to-integer
mappings that could be used. In that case Sphinx would be queried and
return a list of matching integer IDs, which could be mapped back to
CouchDB document IDs, and then retrieved from the Couch.
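
One way the conversion and the mapping back could be sketched, under some assumptions: arbitrary strings cannot be packed losslessly into a fixed-width integer, so this version hashes the document ID (an FNV-1a style hash over the string's character codes, producing an unsigned 32-bit value) and has the external indexer keep a lookup table for the reverse direction. The names fnv1a32 and createIdMap are made up for illustration, and the choice of hash is an assumption, not something Sphinx or CouchDB prescribes.

```javascript
// Deterministic string-to-integer conversion: FNV-1a style, 32-bit unsigned.
function fnv1a32(str) {
  var hash = 0x811c9dc5; // FNV offset basis
  for (var i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash;
}

// Hypothetical helper kept by the external indexer: remembers which
// document ID produced which integer so search hits can be mapped back.
function createIdMap() {
  var byInteger = new Map();
  return {
    toInteger: function (docId) {
      var n = fnv1a32(docId) || 1; // avoid 0, assuming IDs must be non-zero
      if (byInteger.has(n) && byInteger.get(n) !== docId) {
        // Two different document IDs hashed to the same integer.
        throw new Error("ID collision: " + docId + " vs " + byInteger.get(n));
      }
      byInteger.set(n, docId);
      return n;
    },
    toDocId: function (n) {
      return byInteger.has(n) ? byInteger.get(n) : null;
    }
  };
}
```

Because the hash is not collision-free, the table has to detect clashes rather than silently overwrite. A sequential counter stored alongside each document would avoid collisions entirely, at the cost of no longer being derivable from the document ID alone.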

This thought experiment is encouraging because it shows that even
without integration into CouchDB, some very useful custom full-text
indexes could be created. AFAIK Sphinx's support for updating indexes
is limited to merging new documents into the index, so it would have
little use for an API to find view-rows which have been changed or
removed. Luckily, index rebuild is lightning fast.


This all makes perfect sense to me.

We should come up with some "schema" (heh) that defines how
FT Indexers should behave. I am thinking of a special _design
document that sets various configuration variables for the indexer.

E.g. the views to use for indexing:

{
  "_id": "_design/fulltextsearch",
  "_rev": "123",
  "_fulltext_options": {
    "views": ["names", "cities"]
  }
}

where "names" and "cities" are the names of two views. The Indexer
then could maintain two separate fulltext indexes based on these
views. The HTTP API for querying could look like this:

http://server/database/_fulltext/names?query="+Me?er -Meyer"
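
For illustration, a client could build such a URL as sketched below. The _fulltext path and the query parameter come from the proposal above and are not an existing CouchDB API; the function name fulltextUrl is made up. Characters such as "+", "?" and spaces have a meaning of their own in URLs, so the search expression is percent-encoded before it is appended.

```javascript
// Builds a query URL for the proposed (not yet existing) _fulltext endpoint.
function fulltextUrl(server, database, view, query) {
  return server.replace(/\/+$/, "") + // drop any trailing slash
    "/" + encodeURIComponent(database) +
    "/_fulltext/" + encodeURIComponent(view) +
    "?query=" + encodeURIComponent(query);
}

var url = fulltextUrl("http://server", "database", "names", "+Me?er -Meyer");
// url is "http://server/database/_fulltext/names?query=%2BMe%3Fer%20-Meyer"
```

Whether the search expression should also carry literal double quotes, as in the example URL, would be up to the indexer's query syntax.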

This is not meant as a definitive RFC, but a starting point for
discussions. Please chime in :)

Cheers
Jan
--
