Heya Søren,
thanks for picking this up. Any help is greatly appreciated :)
On Mar 21, 2008, at 11:02, Søren Hilmer wrote:
I have hacked a little on the LuceneIndexer, and fixed some bugs to get it
compiling and running, though I also had to patch couchdb4j (patch was from
the couchdb4j issues page). Also found that the readme needs some tuning,
LuceneIndexer in couch.ini is now FullTextSearchQueryServer, right?
Nope, that is the LuceneSearcher. The LuceneIndexer is now
DbUpdateNotificationProcess.
But all is still not well; here are a few questions that I hope someone can
supply answers to:
1) couchdb4j uses "_all_docs_by_update_seq" to get a specific revision of a
document. The trunk version of couchdb does not support this. Has it been
discontinued in favor of "_all_docs_by_seq"?
Yes.
2) What was actually the intention of the LuceneIndexer? I guess that it
should traverse all the databases and all the documents within these and
store the result in the database "couchdbfulltext", right? Some work to
achieve this seems necessary.
LuceneIndexer is supposed to create the fulltext index that LuceneSearcher
can then query. It is responsible for building and maintaining that index,
that is, updating and deleting entries as needed. See below.
3) When a database has a changed document, the indexer should re-index it,
right?
Correct. LuceneIndexer is launched along with and from CouchDB if you supply
the ini option I mentioned above. CouchDB opens a stdio connection with
LuceneIndexer. LuceneIndexer has to listen on stdin for messages CouchDB
sends. Now every time a database changes, CouchDB sends down the database
name followed by a newline to LuceneIndexer. CouchDB expects LuceneIndexer
NOT to answer.
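To make the notification side concrete, here is a minimal sketch in Python (not the actual Java LuceneIndexer) of a process that reads one database name per line and never writes a reply. The names `read_notifications`, `run` and `on_change` are made up for illustration; only the "database name followed by a newline, no answer" behaviour comes from the description above.

```python
import io
import sys


def read_notifications(stream):
    """Yield one database name per non-empty line read from `stream`."""
    for line in stream:
        name = line.strip()
        if name:
            yield name


def run(stream=sys.stdin, on_change=lambda db: None):
    # `on_change` is a hypothetical hook that would trigger (re)indexing.
    # Nothing is ever written to stdout: CouchDB does not expect an answer.
    for db_name in read_notifications(stream):
        on_change(db_name)


# Simulate CouchDB sending three notifications (plus one stray blank line).
demo = io.StringIO("blog\n\nwiki\nblog\n")
seen = []
run(demo, seen.append)
print(seen)  # ['blog', 'wiki', 'blog']
```

In a real process you would call `run()` with no arguments so that it blocks on stdin until CouchDB closes the pipe.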
The first time a change notification is sent, that is, when no index has
been written, LuceneIndexer fetches all documents from CouchDB and
integrates their contents into the search index. Along with that,
LuceneIndexer records the update sequence number of that database. So on all
subsequent notifications, LuceneIndexer can use that sequence number to ask
only for the documents that changed since the last time and in turn can then
update the fulltext index accordingly. In practice you would not react to
each notification individually but make sure you only query every N seconds
or only once for every M notifications.
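The bookkeeping described above could be sketched roughly like this, again in Python rather than the real Java code. Everything here is an assumption for illustration: `fetch_changes(db, since_seq)` stands in for a query such as "_all_docs_by_seq" and is assumed to return `(seq, doc)` pairs newer than `since_seq`, and `index_doc(db, doc)` stands in for the Lucene update. Handling of deleted documents is left out.

```python
import time


class IncrementalIndexer:
    """Remembers the last indexed update sequence per database and
    batches notifications (every N seconds or every M notifications)."""

    def __init__(self, fetch_changes, index_doc,
                 min_interval=5.0, max_pending=10, clock=time.monotonic):
        self.fetch_changes = fetch_changes  # hypothetical: (db, since) -> [(seq, doc)]
        self.index_doc = index_doc          # hypothetical: (db, doc) -> None
        self.min_interval = min_interval    # "N seconds"
        self.max_pending = max_pending      # "M notifications"
        self.clock = clock
        self.last_seq = {}   # db -> highest sequence number already indexed
        self.pending = {}    # db -> notifications received since last update
        self.last_run = {}   # db -> time of the last update

    def notify(self, db):
        """Call once per change notification for `db`."""
        self.pending[db] = self.pending.get(db, 0) + 1
        now = self.clock()
        first = db not in self.last_run
        due = first or now - self.last_run[db] >= self.min_interval
        if due or self.pending[db] >= self.max_pending:
            self.update(db, now)

    def update(self, db, now):
        # A missing entry means no index exists yet, so start from 0
        # and fetch everything.
        since = self.last_seq.get(db, 0)
        for seq, doc in self.fetch_changes(db, since):
            self.index_doc(db, doc)
            since = max(since, seq)
        self.last_seq[db] = since
        self.pending[db] = 0
        self.last_run[db] = now
```

Note that this sketch only checks its thresholds when a notification arrives, so a final change that is not followed by another notification stays unindexed until the next one comes in. A real indexer would probably add a timer to flush pending work.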
Makes sense? :)
4) I have still not looked at the LuceneSearcher; how is that hooked into
couchdb?
In the same way, with the ini option you mentioned in your mail. When
CouchDB gets a fulltext query, it sends the query string over stdio to the
LuceneSearcher along with the database name. LuceneSearcher returns a list
of all documents that match that query, along with their probabilities.
CouchDB then returns this list.
Note, however, that there is no HTTP API to test that; only the internal
API offers it. So you'd have to start CouchDB with the Erlang console
(-i flag IIRC) and use
couchdb_ft_query:execute("database", "+ query +string").
to send CouchDB fulltext queries.
I hope to get it working and supply a patch when it does; hopefully I am not
treading on anyone's toes here.
By no means! Please go ahead. We are grateful for every helping hand. If
you have any more questions, just send them in. You might want to check
out #couchdb on Freenode if you are into IRC.
Thanks for your help.
Cheers,
Jan
--