I'm now considering whether Solr (Lucene) is a good choice when we have a huge number of indexed documents and a large number of new documents that need to be indexed every day.

Maybe I'm wrong, but my feeling is that the way the sort caches are handled (recreated after every commit, not shared between Searchers), the solution does not scale. And it is not just a memory issue (memory is cheap), but rather the inability to update an existing cache.

I'm testing whether I can sort on a field that might be faster to cache: any hints on this? Would it make a difference if I used a field with fewer distinct values than a timestamp? I'm looking for some details on how the cache is populated on the first query. Also, for the code insiders ;-), would it be difficult to change this caching mechanism to allow updating and reusing an existing cache?
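For intuition, here is a hand-rolled sketch of the per-reader sort cache behavior; this is NOT Lucene's actual FieldCache code, and all class and method names below are made up. It shows why the cache is keyed by the reader instance (a new reader after a commit starts cold) and why a field with fewer distinct values is cheaper to build than a timestamp: the cache is essentially one ordinal per document, built from the set of distinct values.

```java
import java.util.*;

// Hypothetical sketch of a per-reader sort cache, loosely modeled on
// Lucene's FieldCache. Names are illustrative, not Lucene's API.
public class FieldCacheSketch {
    // Cache is keyed by the reader instance: a new reader after a
    // commit means a cold cache that must be rebuilt from scratch.
    private static final Map<Object, Map<String, int[]>> CACHE = new WeakHashMap<>();

    // Simulated "reader": just per-document field values.
    static class Reader {
        final String[] fieldValues;
        Reader(String[] fieldValues) { this.fieldValues = fieldValues; }
    }

    // Populating the cache on first use: one ordinal per document.
    // Fewer distinct values -> smaller ordinal table to build.
    static int[] getSortOrdinals(Reader reader, String field) {
        Map<String, int[]> perReader =
            CACHE.computeIfAbsent(reader, r -> new HashMap<>());
        return perReader.computeIfAbsent(field, f -> {
            // Assign ordinals by sorted order of the distinct values.
            TreeSet<String> distinct = new TreeSet<>(Arrays.asList(reader.fieldValues));
            Map<String, Integer> ord = new HashMap<>();
            int i = 0;
            for (String v : distinct) ord.put(v, i++);
            int[] ords = new int[reader.fieldValues.length];
            for (int doc = 0; doc < ords.length; doc++)
                ords[doc] = ord.get(reader.fieldValues[doc]);
            return ords;
        });
    }

    public static void main(String[] args) {
        Reader r1 = new Reader(new String[]{"b", "a", "b"});
        System.out.println(Arrays.toString(getSortOrdinals(r1, "category"))); // [1, 0, 1]

        // After a commit, a NEW reader is opened; the old cache entry
        // is unreachable and everything must be recomputed.
        Reader r2 = new Reader(new String[]{"b", "a", "b", "c"});
        System.out.println(CACHE.containsKey(r2)); // false: cold cache
        getSortOrdinals(r2, "category");
        System.out.println(CACHE.containsKey(r2)); // true
    }
}
```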

Thanks for your help
Christophe

christophe wrote:
The problem is that I will have hundreds of users doing queries, and a continuous flow of documents coming in. So a delay in warming up a cache "could" be acceptable if I do it a few times per day, but not on too regular a basis (right now, the first query that loads the cache takes 150s).
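One standard way to keep that 150s away from user queries, assuming a Solr 1.3-style solrconfig.xml: register warming queries that sort on the same field, so the new searcher populates its sort cache before it starts serving traffic. The field name `timestamp` below is an assumption; use your actual sort field.

```xml
<!-- solrconfig.xml: warm the sort cache before the new searcher goes live.
     "timestamp" is a placeholder for your actual sort field. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">timestamp desc</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>
<!-- firstSearcher covers the cold start after a server restart. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
  </arr>
</listener>
```

Note this doesn't make the cache build cheaper; it just moves the cost off the first user query, so it only helps if commits are infrequent enough for warming to keep up.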

However: I'm not sure why it seems to be a bad idea to update the caches when updates are committed? Any centralized cache (memcached is a good one) kept up to date by the update/commit process would be great. Configuration options could then let the user decide whether the cache is shared between servers or not. Creating a new cache and then swapping it in doubles the required memory.

I also have a related question regarding readers: a new reader is opened when documents are committed, and the cache is associated with the reader (if I got it right). Are all user requests served by this reader? How does that scale if I have many concurrent users?

C.

Norberto Meijome wrote:
On Mon, 20 Oct 2008 16:28:23 +0300
christophe <[EMAIL PROTECTED]> wrote:

Hmm... this means I have to wait before I index new documents and avoid indexing them as they are created (I have about 50,000 new documents created each day and I was planning to make those searchable ASAP).

You can always index + optimize out of band on a 'master' / RW server, and then send the updated index to your slave (the one actually serving the requests).
This *will NOT* remove the need to refresh your cache, but it will remove any delay introduced by commit/indexing + optimize.
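For reference, and assuming a Solr 1.3-era collection distribution setup (the path below is illustrative): the master side of this flow is typically wired by running the snapshooter script after each commit or optimize, with snappuller/snapinstaller on the slaves picking up the snapshot on a cron schedule.

```xml
<!-- solrconfig.xml on the master: take an index snapshot after each commit.
     The "exe" path is illustrative; point it at your snapshooter script. -->
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
</listener>
```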

Too bad there is no way to have a centralized cache that can be shared AND updated when new documents are created.

hmm, not sure it makes sense quite like that... but maybe along the lines of having an active cache that is used to serve queries, with new ones being prepared and then swapped in when ready.
Speaking of which (or not :P), has anyone thought about / done any work on using memcached for these internal Solr caches? I guess it would make sense for setups with several slaves (or even a master updating memcached too...), though for a setup with shards it would be slightly more involved (although it *could* be used to support several slaves per 'data shard').
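The "prepare, then swap when ready" pattern above can be sketched in a few lines; this is a generic illustration, not Solr code, and all names are made up. Readers always see a complete cache, and the atomic publish is what costs the extra memory while both copies are resident.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the "active cache + swap when ready" pattern:
// queries read whichever cache is currently published, while the
// commit/warm-up path builds a fresh one off to the side.
public class SwappableCache {
    private final AtomicReference<Map<String, Integer>> active =
        new AtomicReference<>(new HashMap<>());

    // Queries read from the currently published cache.
    public Integer lookup(String key) {
        return active.get().get(key);
    }

    // Build a new cache, then publish it atomically. While the new map
    // is being built, BOTH maps are resident -- the "double memory"
    // cost mentioned earlier in the thread.
    public void rebuildAndSwap(Map<String, Integer> freshData) {
        Map<String, Integer> next = new HashMap<>(freshData);
        active.set(next); // old map becomes garbage once readers move on
    }
}
```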

All the best,
B
_________________________
{Beto|Norberto|Numard} Meijome

RTFM and STFW before anything bad happens.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.

