We are building an application which will require us to index data for each
of our users so that we can provide full text search on their data. Here are
some notable things about the application:

A) Each user's data is completely unrelated to every other user's. This
gives us a few advantages:

   1. We can keep each index small.
   2. Merging/compacting a fragmented index will take less time.
   3. If some indexes become inaccessible for whatever reason
   (corruption?), only those users are affected. Other users are unaffected
   and the service remains available to them.

B) Each user can have a few different types of data. We want to keep each
type in a separate folder, for the same reasons as above.

So, our index hierarchy will look something like:
/user1/type1/<index files>
/user1/type2/<index files>
/user2/type1/<index files>
/user3/type3/<index files>
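
To make the layout concrete, here is a minimal sketch of how a (user, type)
pair could map to an index directory. The names `INDEX_ROOT` and `index_dir`
are made up for illustration, and the root path is an assumption; it is not
tied to any particular search library.

```python
from pathlib import PurePosixPath

# Assumed root of the index tree; the layout above starts at "/".
INDEX_ROOT = PurePosixPath("/")


def index_dir(user_id: str, data_type: str,
              root: PurePosixPath = INDEX_ROOT) -> PurePosixPath:
    """Return the directory holding the index for one user and one type."""
    for part in (user_id, data_type):
        # Reject values that could escape the per-user directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return root / user_id / data_type
```

Because every index lives in its own directory, dropping or rebuilding one
user's data for one type only touches that directory.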

C) Often, probably with every iteration, we'll add new "types" of data that
can be indexed.
So we want an efficient, programmatic way to add schemas for different
"types". We would like to avoid having a fixed schema for indexing.
I like Lucene's schema-less way of indexing.

D) The users can fire search queries which will search either:
   - within a specific "type" for that user, or
   - across all types for that user. In this case we want to fire a
   parallel query, as Lucene does with ParallelMultiSearcher
   <http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html>.
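
The "across all types" case could be sketched roughly as below: query each
per-type index concurrently and merge the hits by score. This is only an
illustration of the fan-out idea, not Lucene's implementation. `search_one`
is a hypothetical callable standing in for whatever searches a single index,
and the (score, doc_id) hit shape is an assumption.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_all_types(search_one, user_id, types, query, limit=10):
    """Search every type index of one user in parallel and merge the hits.

    search_one(user_id, data_type, query) is assumed to return a list of
    (score, doc_id) tuples for a single index.
    """
    if not types:
        return []
    # One worker per type, so all per-type searches run concurrently.
    with ThreadPoolExecutor(max_workers=len(types)) as pool:
        per_type = list(pool.map(lambda t: search_one(user_id, t, query), types))
    # Flatten the per-type result lists and keep the best-scoring hits.
    merged = [hit for hits in per_type for hit in hits]
    return heapq.nlargest(limit, merged, key=lambda hit: hit[0])
```

Note that merging by raw score only makes sense if scores from different
indexes are comparable, which is not guaranteed when each index has its own
term statistics.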

E) We require real-time updates to the index. *This is a must.*

F) We are planning to shard our index across multiple machines. Here too,
we want failures to be isolated:
if a shard becomes inaccessible, only those users whose data resides on
that shard are affected. Other users get uninterrupted service.
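
One simple way to get this isolation is to pin all of a user's indexes to a
single shard chosen from the user id. The sketch below assumes a fixed number
of shards and uses a stable hash; `shard_for` and `affected_users` are
hypothetical names, and hashing is just one possible assignment scheme.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user to one shard; all of that user's indexes live there."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # sha1 is used only as a stable hash (Python's built-in hash() of a
    # string changes between processes), not for security.
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_users(user_ids, down_shard: int, num_shards: int):
    """Return the users whose data sits on the shard that went down."""
    return [u for u in user_ids if shard_for(u, num_shards) == down_shard]
```

With plain modulo hashing, changing the number of shards moves most users to
a different shard, so an explicit user-to-shard lookup table or consistent
hashing may be a better fit if shards will be added over time.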

We were considering Lucene, Sphinx and Solr to do this. This is what we
found:

   - Sphinx: No efficient way to do A, B, C, F. Or is there?
   - Lucene: Everything looks possible, as it is very low level. But we
   would have to write wrappers to do F and build a communication layer
   between the web server and the search server.
   - Solr: Not sure if we can do A, B, C easily. Can we?

So, my question is: what is the best software for the above requirements? I
am inclined towards Solr first, and then Lucene, provided we can meet all
the requirements.

-- 
Regards
Shrinath.M
