Re: Working on Lucene

Jan Lehnardt Fri, 21 Mar 2008 13:00:58 -0700

Heya Søren,
thanks for picking this up. Any help is greatly appreciated :)


On Mar 21, 2008, at 11:02 , Søren Hilmer wrote:

I have hacked a little on the LuceneIndexer, and fixed some bugs to get
it compiling and running, though I also had to patch couchdb4j (patch
was from the couchdb4j issues page). Also found that the readme needs
some tuning, LuceneIndexer in couch.ini is now FullTextSearchQueryServer,
right?


Nope, that is the LuceneSearcher. The LuceneIndexer is now
DbUpdateNotificationProcess.

But all is still not well, here are a few questions that I hope someone
can supply answers to:
1) couchdb4j uses "_all_docs_by_update_seq" to get a specific revision
of a document. The trunk version of couchdb does not support this. Has
it been discontinued in favor of "_all_docs_by_seq" ?


Yes.

2) What was actually the intention of the LuceneIndexer, I guess that it
should traverse all the databases and all the documents within these and
store the result in the database "couchdbfulltext", right? Some work to
achieve this seems necessary.

LuceneIndexer is supposed to create the fulltext index that
LuceneSearcher then can query. It is responsible for building and
maintaining that index. That is, it updates and deletes entries as
needed. See below.

3) When a database has a changed document, the indexer should re-index it,
right?

Correct. LuceneIndexer is launched along with and from CouchDB if you
supply the ini option I mentioned above. CouchDB opens a stdio
connection with LuceneIndexer. LuceneIndexer has to listen on stdin for
messages CouchDB sends. Now every time a database changes, CouchDB sends
down the database name followed by a newline to LuceneIndexer. CouchDB
expects LuceneIndexer NOT to answer.

The first time a change notification is sent, that is, when no index has
been written, LuceneIndexer fetches all documents from CouchDB and
integrates their contents into the search index. With that,
LuceneIndexer maintains the update sequence number of that database. So
on all subsequent notifications, LuceneIndexer can use that sequence
number to ask only for the documents that changed since the last time
and in turn can then update the fulltext index accordingly. In practice
you would not fetch each doc individually but make sure you only query
every N seconds or only once for each M notifications.
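
To make the notification handling concrete, here is a rough sketch in
Java. It is not the actual LuceneIndexer code: the class name
NotificationLoop, the Reindexer hook and the batch size are all made up
for illustration, and the real fetch-and-index step is left as a
callback. It only shows the shape of the loop: read one database name
per line, never write an answer, remember the last sequence number per
database, and only re-index once per M notifications.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical. */
class NotificationLoop {
    /**
     * Hypothetical hook: index everything in db that changed after
     * sinceSeq (0 means "index everything") and return the new
     * last-seen update sequence number.
     */
    interface Reindexer {
        long reindexSince(String db, long sinceSeq);
    }

    private final Map<String, Long> lastSeq = new LinkedHashMap<>();
    private final Map<String, Integer> pending = new LinkedHashMap<>();
    private final int batchSize; // the "M notifications" from the text
    private final Reindexer reindexer;

    NotificationLoop(int batchSize, Reindexer reindexer) {
        this.batchSize = batchSize;
        this.reindexer = reindexer;
    }

    /** Reads one database name per line; never writes a reply. */
    void run(BufferedReader in) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String db = line.trim();
                if (db.isEmpty()) {
                    continue;
                }
                int count = pending.merge(db, 1, Integer::sum);
                if (count >= batchSize) {
                    flush(db);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // stdin closed: handle whatever is still queued.
        for (String db : new ArrayList<>(pending.keySet())) {
            flush(db);
        }
    }

    private void flush(String db) {
        long since = lastSeq.getOrDefault(db, 0L);
        lastSeq.put(db, reindexer.reindexSince(db, since));
        pending.remove(db);
    }

    long lastSeq(String db) {
        return lastSeq.getOrDefault(db, 0L);
    }
}
```

A time-based variant (every N seconds) would replace the counter with a
timestamp check; the bookkeeping of the sequence number stays the same.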

Makes sense? :)

4) I have still not looked at the LuceneSearcher, how is that hooked
into couchdb?

In the same way with the ini option you mentioned in your mail. When
CouchDB gets a fulltext query, it sends the query string over stdio to
the LuceneSearcher along with the database name. LuceneSearcher returns
a list of documents and probabilities of all documents that match that
query. CouchDB then returns this list.
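
As a rough illustration of the searcher side, here is a small Java
sketch. The exact wire format is not spelled out in this mail, so the
framing below (one "docId<TAB>score" line per hit, terminated by an
empty line) is purely an assumption, as are the names SearcherSketch,
Hit and Index. The Index interface stands in for the real Lucene lookup.

```java
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; names and wire format are assumptions. */
class SearcherSketch {
    /** One match: a document id plus its relevance value. */
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for the actual fulltext index lookup. */
    interface Index {
        List<Hit> search(String db, String query);
    }

    /**
     * Builds the reply for one query. Assumed format: one
     * "docId<TAB>score" line per hit, then an empty line as terminator.
     */
    static String respond(String db, String query, Index index) {
        StringBuilder out = new StringBuilder();
        for (Hit hit : index.search(db, query)) {
            out.append(hit.docId)
               .append('\t')
               .append(String.format(Locale.ROOT, "%.3f", hit.score))
               .append('\n');
        }
        return out.append('\n').toString();
    }
}
```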

Note however, that there is no HTTP API to test that, only the internal
API has that. So you'd have to start CouchDB with the Erlang console
(-i flag IIRC) and use
couchdb_ft_query:execute("database", "+ query +string"). to send CouchDB
fulltext queries.

I hope to get it working and supply a patch when it does, hopefully I am
not treading on anyone's toes here.


By no means! Please go ahead. We are grateful for every helping hand. If
you have any more questions, just send them in. You might want to check
out #couchdb on Freenode if you are into IRC.

Thanks for your help.

Cheers,
Jan
--
