Chris Anderson wrote:
Sphinx is not the best contender for integration, because of its
limited support for incremental updates. It is, however, a good
boundary condition on how to design the Indexer API so that a wide
range of search engines can work with CouchDB.
Sphinx is going to support real-time updates in one of the next few releases, so that won't be a problem much longer.

However, there's a different problem with using Sphinx to search CouchDB: Sphinx is not designed to index documents with differing structures. All documents in an index have to follow the same structure. You can still use Sphinx with CouchDB very well if you only index views. You have to know the exact structure of all view results; then you can tell Sphinx about that structure and it will be able to index the results.

But if you want to search an arbitrary CouchDB database, it gets a lot more complicated. Sphinx only supports a fixed number of fulltext-searchable text fields per document (32). That number is definitely high enough for most documents, but it does not reflect CouchDB's flexibility. In order to use Sphinx on a dynamic schema, you would have to go through all documents to create a mapping of the hierarchically stored values into a one-dimensional associative array (two-dimensional for the multi-value attributes) and then store this mapping with each document. Then you can go through the documents and extend the static schema for every document that requires an additional field. You can either reuse fields, which makes grouping and sorting useless because each field then has a different meaning for each document, or leave a lot of fields empty, which creates a huge overhead.
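To make the flattening step concrete, here is a minimal Python sketch of that mapping. It is an illustration only, not code from any Sphinx or CouchDB integration: the function names (`flatten`, `build_schema`, `to_row`) and the dotted-path naming are assumptions, and for simplicity it counts every flattened path against the 32-field limit mentioned above.

```python
MAX_FIELDS = 32  # the fulltext field limit mentioned above


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {dotted_path: scalar or list of scalars}."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, path))
    elif isinstance(value, list) and not any(isinstance(v, (dict, list)) for v in value):
        # A list of plain values is kept together, like a multi-value attribute.
        flat[prefix] = list(value)
    elif isinstance(value, list):
        # A list of nested structures is flattened by position.
        for i, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}.{i}"))
    else:
        flat[prefix] = value
    return flat


def build_schema(docs):
    """Extend the static schema with every path that any document needs."""
    schema = []
    for doc in docs:
        for path in flatten(doc):
            if path not in schema:
                schema.append(path)
    if len(schema) > MAX_FIELDS:
        raise ValueError(f"{len(schema)} fields needed, limit is {MAX_FIELDS}")
    return schema


def to_row(doc, schema):
    """Project one document onto the shared schema; unused fields stay empty."""
    flat = flatten(doc)
    return [flat.get(path) for path in schema]
```

Projecting differently shaped documents onto the shared schema shows the overhead: every document carries an empty slot for each field that only other documents use.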

An alternative would be to create many indexes with different schemas, since Sphinx supports searching multiple indexes at a time. But I doubt this idea scales well if you have a different schema on every document.
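A rough way to see how many indexes this would need is to group documents by their set of field names. The Python sketch below does that under simplifying assumptions: it only looks at top-level keys, it skips CouchDB's underscore-prefixed fields such as `_id` and `_rev`, and the names `signature` and `group_by_schema` are made up for this illustration.

```python
from collections import defaultdict


def signature(doc):
    """Sorted tuple of top-level field names, ignoring CouchDB's _-prefixed fields."""
    return tuple(sorted(key for key in doc if not key.startswith("_")))


def group_by_schema(docs):
    """Map each distinct schema signature to the ids of the documents using it."""
    groups = defaultdict(list)
    for doc in docs:
        groups[signature(doc)].append(doc["_id"])
    return dict(groups)
```

The number of keys in the result is the number of separate indexes required. If it grows roughly with the number of documents, this approach is not practical.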

So my approach to integration was rather to allow Sphinx to use CouchDB as a data source. You can then configure Sphinx to index a certain view, and that view has to produce one-dimensional JSON results that work for Sphinx. Searching then does not use CouchDB's REST API at all. This method works fine for applications where many documents have the same structure (like the demo forum, or an article/comments site like a blog) or for applications where the number of structures that documents can have is limited (you can then create a mapping to one larger common structure). However, this will not be useful to any application that really makes use of CouchDB's flexible structure, so I certainly hope there'll be other systems available for searching.
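For the case with a limited number of structures, the mapping to one larger common structure could look roughly like the following Python sketch. A real CouchDB view would normally be written as a JavaScript map function; this only mimics the idea. The document types (`post`, `comment`), all field names, and the `to_common` helper are hypothetical.

```python
# The one common, flat structure that every indexed row must follow.
COMMON_FIELDS = ("type", "title", "body", "author")

# Per document type: common field -> field name in the source document.
MAPPINGS = {
    "post": {"title": "title", "body": "text", "author": "author"},
    "comment": {"body": "comment", "author": "name"},
}


def to_common(doc):
    """Return a flat row in the common structure, or None for unknown types."""
    mapping = MAPPINGS.get(doc.get("type"))
    if mapping is None:
        return None  # structure not covered by the mapping, so not indexed
    row = {field: "" for field in COMMON_FIELDS}
    row["type"] = doc["type"]
    for target, source in mapping.items():
        row[target] = str(doc.get(source, ""))
    return row
```

Every row has the same keys regardless of the source document, which is what a fixed index schema needs. Documents of a type without a mapping are simply left out of the index.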

Cheers!
Nils
