I'm now considering whether Solr (Lucene) is a good choice when we have a huge number of indexed documents and a large number of new documents that need to be indexed every day.

Maybe I'm wrong, but my feeling is that the way the sort caches are handled (recreated after every commit, not shared between Searchers), the solution does not scale. And it is not just a memory issue (memory is cheap), but rather the inability to update an existing cache.

I'm testing whether I can sort on a field that might be faster to cache: any hints on this? Would it make a difference if I used a field with fewer distinct values than a timestamp? I'm looking for some details on how the cache is populated on the first query. Also, for the code insiders ;-), would it be difficult to change this caching mechanism to allow updating and reusing an existing cache?
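For intuition, here is a hand-rolled sketch of the per-reader sort cache behavior; this is NOT Lucene's actual FieldCache code, and all class and method names below are made up. It shows why the cache is keyed by the reader instance (a new reader after a commit starts cold) and why a field with fewer distinct values is cheaper to build than a timestamp: the cache is essentially one ordinal per document, built from the set of distinct values.

```java
import java.util.*;

// Hypothetical sketch of a per-reader sort cache, loosely modeled on
// Lucene's FieldCache. Names are illustrative, not Lucene's API.
public class FieldCacheSketch {
    // Cache is keyed by the reader instance: a new reader after a
    // commit means a cold cache that must be rebuilt from scratch.
    private static final Map<Object, Map<String, int[]>> CACHE = new WeakHashMap<>();

    // Simulated "reader": just per-document field values.
    static class Reader {
        final String[] fieldValues;
        Reader(String[] fieldValues) { this.fieldValues = fieldValues; }
    }

    // Populating the cache on first use: one ordinal per document.
    // Fewer distinct values -> smaller ordinal table to build.
    static int[] getSortOrdinals(Reader reader, String field) {
        Map<String, int[]> perReader =
            CACHE.computeIfAbsent(reader, r -> new HashMap<>());
        return perReader.computeIfAbsent(field, f -> {
            // Assign ordinals by sorted order of the distinct values.
            TreeSet<String> distinct = new TreeSet<>(Arrays.asList(reader.fieldValues));
            Map<String, Integer> ord = new HashMap<>();
            int i = 0;
            for (String v : distinct) ord.put(v, i++);
            int[] ords = new int[reader.fieldValues.length];
            for (int doc = 0; doc < ords.length; doc++)
                ords[doc] = ord.get(reader.fieldValues[doc]);
            return ords;
        });
    }

    public static void main(String[] args) {
        Reader r1 = new Reader(new String[]{"b", "a", "b"});
        System.out.println(Arrays.toString(getSortOrdinals(r1, "category"))); // [1, 0, 1]

        // After a commit, a NEW reader is opened; the old cache entry
        // is unreachable and everything must be recomputed.
        Reader r2 = new Reader(new String[]{"b", "a", "b", "c"});
        System.out.println(CACHE.containsKey(r2)); // false: cold cache
        getSortOrdinals(r2, "category");
        System.out.println(CACHE.containsKey(r2)); // true
    }
}
```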

Thanks for your help
Christophe

christophe wrote:
The problem is that I will have hundreds of users doing queries, and a continuous flow of documents coming in. So a delay in warming up a cache "could" be acceptable if I do it a few times per day, but not on too regular a basis (right now, the first query that loads the cache takes 150s).
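One standard way to keep that 150s away from user queries, assuming a Solr 1.3-style solrconfig.xml: register warming queries that sort on the same field, so the new searcher populates its sort cache before it starts serving traffic. The field name `timestamp` below is an assumption; use your actual sort field.

```xml
<!-- solrconfig.xml: warm the sort cache before the new searcher goes live.
     "timestamp" is a placeholder for your actual sort field. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">timestamp desc</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>
<!-- firstSearcher covers the cold start after a server restart. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
  </arr>
</listener>
```

Note this doesn't make the cache build cheaper; it just moves the cost off the first user query, so it only helps if commits are infrequent enough for warming to keep up.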

However: I'm not sure why it seems to be a bad idea to update the caches when updates are committed? Any centralized cache (memcached is a good one) kept up to date by the update/commit process would be great. Configuration options could then let the user decide whether the cache is shared between servers or not. Creating a new cache and then swapping it in doubles the required memory.

I also have a related question regarding readers: a new reader is opened when documents are committed, and the cache is associated with the reader (if I got it right). Are all user requests served by this reader? How does that scale if I have many concurrent users?

C.

Norberto Meijome wrote:
On Mon, 20 Oct 2008 16:28:23 +0300
christophe <[EMAIL PROTECTED]> wrote:

Hmm... this means I have to wait before I index new documents and avoid indexing them as they are created (I have about 50,000 new documents created each day and I was planning to make those searchable ASAP).

You can always index + optimize out of band on a 'master' / RW server, and then send the updated index to your slave (the one actually serving the requests).
This *will NOT* remove the need to refresh your cache, but it will remove any delay introduced by commit/indexing + optimize.
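For reference, and assuming a Solr 1.3-era collection distribution setup (the path below is illustrative): the master side of this flow is typically wired by running the snapshooter script after each commit or optimize, with snappuller/snapinstaller on the slaves picking up the snapshot on a cron schedule.

```xml
<!-- solrconfig.xml on the master: take an index snapshot after each commit.
     The "exe" path is illustrative; point it at your snapshooter script. -->
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
</listener>
```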

Too bad there is no way to have a centralized cache that can be shared AND updated when new documents are created.

hmm, not sure it makes sense quite like that... but maybe along the lines of having an active cache that is used to serve queries, with new ones being prepared and then swapped in when ready.
Speaking of which (or not :P), has anyone thought about / done any work on using memcached for these internal Solr caches? I guess it would make sense for setups with several slaves (or even a master updating memcached too...), though for a setup with shards it would be slightly more involved (although it *could* be used to support several slaves per 'data shard').
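The "prepare, then swap when ready" pattern above can be sketched in a few lines; this is a generic illustration, not Solr code, and all names are made up. Readers always see a complete cache, and the atomic publish is what costs the extra memory while both copies are resident.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the "active cache + swap when ready" pattern:
// queries read whichever cache is currently published, while the
// commit/warm-up path builds a fresh one off to the side.
public class SwappableCache {
    private final AtomicReference<Map<String, Integer>> active =
        new AtomicReference<>(new HashMap<>());

    // Queries read from the currently published cache.
    public Integer lookup(String key) {
        return active.get().get(key);
    }

    // Build a new cache, then publish it atomically. While the new map
    // is being built, BOTH maps are resident -- the "double memory"
    // cost mentioned earlier in the thread.
    public void rebuildAndSwap(Map<String, Integer> freshData) {
        Map<String, Integer> next = new HashMap<>(freshData);
        active.set(next); // old map becomes garbage once readers move on
    }
}
```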

All the best,
B
_________________________
{Beto|Norberto|Numard} Meijome

RTFM and STFW before anything bad happens.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.

