Hi All,

If we're discussing a native CouchDB full text search, I'd like to point out a few things about Cloudant's implementation that might guide the design. Our search indexing strategy is discussed here:

http://support.cloudant.com/kb/search/search-indexing

What we've opted to do is borrow certain parts of Lucene (particularly the analyzer, query parser, and searching/scoring) but completely rewrite the inverted index storage format. We store the inverted index as a CouchDB map-reduce view. This is of course inefficient in terms of disk space, but great in terms of taking advantage of CouchDB's robustness and job scheduling capabilities.

The map format for the inverted index is the following:

{"id":doc_id,"key":[lucene_field,term],"value":[[array with list of term positions in document]]}

In the web link above, you can see the inverted index for a single document with id "example_glossary".

Search in Lucene is scored using the tf-idf model. tf = term frequency = the number of times a token appears in a particular document. idf = inverse document frequency = a weight that shrinks as the number of documents containing the token grows, so rare tokens count for more. The map view gives us the information we need for tf (the length of the positions array), and to get the per-token document counts behind idf we use the Erlang built-in _count reduce function.
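
As a rough sketch of how the two pieces combine, the Python below counts rows per key (mimicking a grouped _count reduce) and computes a textbook tf * log(N / df) score. This is an assumption-laden illustration: the helper names are made up, and the formula is the textbook variant, not necessarily the exact weighting Lucene applies.

```python
import math
from collections import Counter


def count_reduce(rows):
    """Mimic a _count reduce grouped by key: number of rows per (field, term).

    Because each document emits one row per (field, term), this is the
    document frequency of the term in that field.
    """
    return Counter(tuple(row["key"]) for row in rows)


def tf_idf(row, doc_freq, total_docs):
    """Textbook tf-idf (an assumption, not Lucene's exact formula)."""
    tf = len(row["value"])            # occurrences of the term in this document
    df = doc_freq[tuple(row["key"])]  # documents containing the term
    return tf * math.log(total_docs / df)


# Hypothetical rows for two documents, "a" and "b".
rows = [
    {"id": "a", "key": ["body", "couchdb"], "value": [0, 4]},
    {"id": "b", "key": ["body", "couchdb"], "value": [2]},
    {"id": "a", "key": ["body", "lucene"], "value": [1]},
]
doc_freq = count_reduce(rows)
common_score = tf_idf(rows[0], doc_freq, total_docs=2)
rare_score = tf_idf(rows[2], doc_freq, total_docs=2)
```

In this toy data "couchdb" appears in every document and so scores zero, while "lucene" appears in only one and scores higher.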

We have separate processing for indexing and search -- the indexing is just a regular view server (we use our Java view server by default, but there is nothing to prevent anyone from using a JavaScript or Erlang map function to create the inverted index).

The "searcher application" knows how to utilize these CouchDB views to do search query logic (boolean queries, phrase queries, range queries, etc.). The searcher application knows nothing about who created the map-reduce views that it utilizes -- only that the format is as specified above (the one caveat is that the searcher application needs to know the analyzer to work properly).
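
To show why the term positions matter, here is a minimal Python sketch of a phrase query evaluated purely from positions. It is not the actual searcher application: phrase_match() and the postings layout (term -> {doc_id: positions}, for a single field) are assumptions made for illustration, standing in for rows fetched from the view by key.

```python
def phrase_match(postings, terms):
    """Return ids of documents in which `terms` occur at consecutive positions.

    postings: dict mapping term -> {doc_id: [positions]} for one field
    (a hypothetical in-memory stand-in for view lookups by [field, term]).
    """
    if not terms or any(t not in postings for t in terms):
        return []
    # Only documents containing every term can match the phrase.
    candidates = set(postings[terms[0]])
    for t in terms[1:]:
        candidates &= set(postings[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = postings[terms[0]][doc_id]
        # The phrase matches if, for some start s, term i sits at position s + i.
        if any(all(s + i in postings[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


postings = {
    "full": {"a": [0], "b": [3]},
    "text": {"a": [1], "b": [0]},
    "search": {"a": [2, 7], "b": [4]},
}
three_word = phrase_match(postings, ["full", "text", "search"])
two_word = phrase_match(postings, ["full", "search"])
```

The query terms would have to pass through the same analyzer as the indexed text, which is why the searcher needs to know the analyzer.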

Just as with clucene vs. lucene, we could standardize the format of the CouchDB inverted indices for any "native" search applications. Writing a simple Erlang-based "searcher application" shouldn't be very difficult (I plan to do it at some point) -- one option would be to extend Norman's multiview. This Erlang "searcher application" would work on mobile devices.

I propose that we have a standard inverted index format, that this inverted index is stored in a CouchDB view, and that all indexing and search applications recognize this standard format. Keep in mind that for many applications external services such as couchdb-lucene and ElasticSearch will be superior to a native CouchDB search solution -- these are optimized for efficiency of inverted index lookups.

Dave




On 03/29/11 08:57, Albin Stigo wrote:
Couchdb + Lucene (Elasticsearch etc.) is a really great combination
and definitely enough on the server side... IMHO what is missing is a
full text engine for Couchdb on mobile - that would be a killer...
Currently the only full text search library on mobile devices is
sqlite fts3 which is great but doesn't have replication. Maybe someone
could implement something based on sqlite fts3 which uses the changes
stream to keep in sync...?

How do you search couchdb on mobile devices?

--Albin

On Tue, Mar 29, 2011 at 4:09 PM, Norman Barker<norman.bar...@gmail.com>  wrote:
Benoit,

interesting post on Lucy, I have been monitoring that as well (and
though nowhere near as good as Robert's work) I have integrated
clucene and couchdb as I was looking for a solution that didn't use
Java.

I see a trend with couchdb and NIFs, what is the official standpoint
here, test and test the c / c++ library so that any chance of bringing
the VM down is reduced? I know with Java and JNI in an app server you
are taking a huge risk (heartbeat works, but an app server takes
several minutes to start up), with Erlang are you relying on the
heartbeat service to restart the VM in case of failure?

I am interested in helping with any NIF on top of Lucy.

thanks,

Norman

On Tue, Mar 29, 2011 at 7:58 AM, Simon Metson
<simonmet...@googlemail.com>  wrote:
Does http://blog.cloudant.com/developer-preview-cloudant-search-for-couchdb/ 
help wrt. the original post? Cloudant's search is built on Lucene.
Cheers
Simon

Sent with Sparrow
On Tuesday, 29 March 2011 at 14:24, Dennis Geurts wrote:
Hi all,
Looking at the number of replies on this topic, it seems there's much 
interest in full text searching.

It's really hard to tell how one would expect this feature to be implemented in 
couchdb in such a way that it would supersede the nice couchdb-lucene combo.

That said, if you want a _really simple_ (and probably poor, performance-wise!) 
fulltext search implementation, have a look at couchdb lists.

You decide which _view is sent to the _list function; within the _list function 
you can implement your full text search by inspecting the document data in 
javascript.

This setup at least allows for replication of the fts functionality and might 
be just enough for the OP.



Cheers, dennis



----- Reply message -----
From: "Zdravko Gligic"<zgli...@gmail.com>
Date: Tue, Mar 29, 2011 13:49
Subject: Full text search - is it coming? If yes, approx when.
To: "user@couchdb.apache.org"<user@couchdb.apache.org>

I have a somewhat tricky use case of super tagging, or rather a somewhat
hierarchical docs categorization. Several CouchDB gurus have suggested
that I should look at Lucene and such. My problem is hosting, because
I would much rather go with a cloud solution such as Cloudant and the
forthcoming (I hope it's still forthcoming) CouchBase. Comparatively,
I have a very small amount of data - a large number of tiny docs that are
indexed every which way possible - such that the size of views dwarfs
the size of docs.

The full-text-searching problem is best illustrated by the
full-text-searching hosting state of affairs at Cloudant and CouchBase
- the only two commercial companies worth mentioning within the
CouchDB community. Neither one uses Lucene out of the box and only
Cloudant has their own solution. This means that I could not use a
redundancy-performance perfect Master-Master replication that is
hosted by both. This is why either full-text-searching needs to
become a first-class citizen inside CouchDB or our hosting friends need to
internalize Lucene and make it their first-class citizen.

P.S. I love both but ...

