Heya Søren,
thanks for picking this up. Any help is greatly appreciated :)

On Mar 21, 2008, at 11:02, Søren Hilmer wrote:
I have hacked a little on the LuceneIndexer and fixed some bugs to get it
compiling and running, though I also had to patch couchdb4j (the patch was
from the couchdb4j issues page). I also found that the readme needs some
tuning: LuceneIndexer in couch.ini is now FullTextSearchQueryServer, right?

Nope, that is the LuceneSearcher. The LuceneIndexer is now
DbUpdateNotificationProcess.


But all is still not well. Here are a few questions that I hope someone can
supply answers to:

1) couchdb4j uses "_all_docs_by_update_seq" to get a specific revision of a document. The trunk version of couchdb does not support this. Has it been
discontinued in favor of "_all_docs_by_seq" ?

Yes.


2) What was actually the intention of the LuceneIndexer? I guess that it
should traverse all the databases and all the documents within these and
store the result in the database "couchdbfulltext", right? Some work to
achieve this seems necessary.

LuceneIndexer is supposed to create the fulltext index that LuceneSearcher
can then query. It is responsible for building and maintaining that index,
that is, for updating and deleting entries as needed. See below.


3) When a database has a changed document, the indexer should re-index it,
right?

Correct. LuceneIndexer is launched along with and from CouchDB if you supply
the ini option I mentioned above. CouchDB opens a stdio connection with
LuceneIndexer. LuceneIndexer has to listen on stdin for messages CouchDB
sends. Every time a database changes, CouchDB sends the database name
followed by a newline to LuceneIndexer. CouchDB expects LuceneIndexer
NOT to answer.

The first time a change notification is sent, that is, when no index has
been written yet, LuceneIndexer fetches all documents from CouchDB and
integrates their contents into the search index. Along with that,
LuceneIndexer keeps track of the update sequence number of that database.
On all subsequent notifications, LuceneIndexer can use that sequence number
to ask only for the documents that changed since the last time and can then
update the fulltext index accordingly. In practice you would not fetch each
doc individually, but make sure you only query every N seconds
or only once for each M notifications.
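
To make the flow above a bit more concrete, here is a rough sketch in Python
(chosen only for brevity). It is not the actual LuceneIndexer code. The names
ChangeTracker, fetch_changes, apply_change and run are made up for
illustration: fetch_changes stands in for whatever asks CouchDB for the
documents changed since a given sequence number (for example via
"_all_docs_by_seq"), and apply_change stands in for the Lucene update or
delete. Only the "every N seconds" throttle is shown.

```python
import sys
import time


class ChangeTracker:
    """Remembers the last indexed update sequence per database and
    throttles index updates to at most one per min_interval seconds."""

    def __init__(self, fetch_changes, apply_change, min_interval=5.0,
                 clock=time.monotonic):
        # Hypothetical callables, supplied by the caller:
        #   fetch_changes(db, since_seq) -> iterable of (seq, doc_id, deleted)
        #   apply_change(db, doc_id, deleted) -> None
        self.fetch_changes = fetch_changes
        self.apply_change = apply_change
        self.min_interval = min_interval
        self.clock = clock
        self.last_seq = {}   # db name -> last update sequence indexed
        self.last_run = {}   # db name -> time of the last index update
        self.dirty = set()   # dbs with notifications not yet processed

    def notify(self, db):
        """Handle one change notification for db."""
        self.dirty.add(db)
        last = self.last_run.get(db)
        if last is None or self.clock() - last >= self.min_interval:
            self.flush(db)

    def flush(self, db):
        """Bring the index for db up to date."""
        # A missing entry means no index exists yet: start from 0,
        # which fetches every document.
        since = self.last_seq.get(db, 0)
        for seq, doc_id, deleted in self.fetch_changes(db, since):
            self.apply_change(db, doc_id, deleted)
            since = max(since, seq)
        self.last_seq[db] = since
        self.last_run[db] = self.clock()
        self.dirty.discard(db)


def run(tracker, stream=sys.stdin):
    """Read one database name per line and never write a reply."""
    for line in stream:
        db = line.strip()
        if db:
            tracker.notify(db)
    # At end of input, catch up on anything the throttle deferred.
    # A long-running indexer would need a timer for this instead.
    for db in list(tracker.dirty):
        tracker.flush(db)
```

You would call run(ChangeTracker(my_fetch, my_apply)) with your own two
functions plugged in. Nothing is ever written to stdout, since CouchDB does
not expect an answer.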

Makes sense? :)

4) I have still not looked at the LuceneSearcher, how is that hooked into
couchdb?

In the same way, with the ini option you mentioned in your mail. When
CouchDB gets a fulltext query, it sends the query string over stdio to the
LuceneSearcher along with the database name. LuceneSearcher returns a list
of all documents that match the query, together with their probabilities.
CouchDB then returns this list.

Note, however, that there is no HTTP API to test that; only the internal
API exposes it. So you'd have to start CouchDB with the Erlang console
(-i flag IIRC) and use
couchdb_ft_query:execute("database", "+ query +string"). to send CouchDB
fulltext queries.

I hope to get it working and supply a patch when it does. Hopefully I am not
treading on anyone's toes here.

By no means! Please go ahead. We are grateful for every helping hand. If
you have any more questions, just send them in. You might want to check
out #couchdb on Freenode if you are into IRC.

Thanks for your help.

Cheers,
Jan