I'm now considering whether Solr (Lucene) is a good choice when we have a
huge number of indexed documents and a large number of new documents
that need to be indexed every day.
Maybe I'm wrong, but my feeling is that, given the way the sort caches
are handled (recreated after each commit, not shared between Searchers),
the solution does not scale. And it is not just a memory issue (memory
is cheap), but more the inability to update an existing cache.
I'm testing whether I can sort on a field that might be faster to cache:
any hints on this? Would it make a difference if I used a field with
fewer distinct values than a timestamp? I'm looking for some details on
how the cache is populated on the first query. Also, for the code
insiders ;-), would it be difficult to change this caching mechanism to
allow an existing cache to be updated and reused?
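A small sketch (not Solr code, just a simulation of the idea) may help frame the cardinality question: a Lucene-style string sort cache is roughly one ordinal per document plus a table of the distinct values. A field with fewer distinct values shrinks the value table, but the per-document ordinal array still has one slot per document, so the first-query build cost stays proportional to index size:

```python
# Sketch only: simulate populating a sort cache on the first sorted query.
# A Lucene-style string sort cache is roughly an ordinal per document
# plus a table of distinct values.
def build_sort_cache(docs_field_values):
    distinct = sorted(set(docs_field_values))          # value table
    ord_of = {v: i for i, v in enumerate(distinct)}    # value -> ordinal
    ords = [ord_of[v] for v in docs_field_values]      # one slot per doc
    return ords, distinct

# A timestamp field: almost every document has a distinct value.
ts_ords, ts_vals = build_sort_cache(["t1", "t2", "t3", "t4"])
# A low-cardinality field (e.g. day granularity): smaller value table,
# but the per-document ordinal array is the same length.
day_ords, day_vals = build_sort_cache(["d1", "d1", "d2", "d2"])

print(len(ts_vals), len(day_vals))    # 4 vs 2 distinct values
print(len(ts_ords), len(day_ords))    # both 4: one ordinal per document
```

So a lower-cardinality sort field mostly saves memory on the value table, not the per-document pass over the index that makes the first query slow.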
Thanks for your help
Christophe
christophe wrote:
The problem is that I will have hundreds of users running queries, and a
continuous flow of documents coming in. So a delay in warming up a cache
"could" be acceptable if I do it a few times per day, but not too often
(right now, the first query that loads the cache takes 150s).
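One way to keep that 150s hit off user queries is Solr's searcher warm-up hook: a newSearcher event listener in solrconfig.xml that runs the sorted query against each new searcher before it serves traffic. A sketch, where the sort field name `timestamp` is an assumption (substitute your own field):

```xml
<!-- solrconfig.xml: fire a sorted warming query against every new
     searcher so the sort cache is populated before users hit it.
     The field name "timestamp" is an assumption; use your schema's. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">timestamp desc</str>
    </lst>
  </arr>
</listener>
```

This doesn't make warming any cheaper; it only moves the cost off the first user request (a firstSearcher listener does the same for a cold start).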
However, I'm not sure why updating the caches when changes are committed
is considered a bad idea. A centralized cache (memcached is a good
example) kept up to date by the update/commit process would be great.
Config options could then let the user decide whether the cache is
shared between servers or not. Creating a new cache and then swapping it
in doubles the necessary memory.
I also have a related question regarding readers: a new reader is opened
when documents are committed, and the cache is associated with the
reader (if I got it right). Are all user requests served by this reader?
How does that scale if I have many concurrent users?
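On the scaling question, a sketch of the per-reader cache model being described (a simulation, not Solr internals): all concurrent queries share the current reader, and the cache is keyed by that reader, so it is built once per reader rather than once per user; the cost recurs only when a commit opens a new reader:

```python
import threading

# Sketch only: per-reader caches, as described above. Many concurrent
# users share one reader, so the sort cache is built once per reader.
class Reader:
    def __init__(self, version):
        self.version = version

sort_caches = {}            # reader -> cache (FieldCache-like keying)
cache_lock = threading.Lock()

def get_sort_cache(reader):
    with cache_lock:
        if reader not in sort_caches:
            # The first sorted query against this reader pays the
            # full build cost; later queries reuse the entry.
            sort_caches[reader] = f"cache-for-v{reader.version}"
        return sort_caches[reader]

r1 = Reader(1)
# 100 "concurrent users", one reader: the cache is built exactly once.
results = [get_sort_cache(r1) for _ in range(100)]
assert len(sort_caches) == 1

# A commit opens a new reader; its cache must be rebuilt from scratch.
r2 = Reader(2)
get_sort_cache(r2)
assert len(sort_caches) == 2
```

So concurrency itself is not the problem; the rebuild-on-every-new-reader behavior is what hurts with a continuous document flow.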
C.
Norberto Meijome wrote:
On Mon, 20 Oct 2008 16:28:23 +0300
christophe <[EMAIL PROTECTED]> wrote:
Hum..... this means I have to wait before I index new documents and
avoid indexing them as soon as they are created (I have about 50,000 new
documents created each day and I was planning to make them searchable
ASAP).
you can always index + optimize out of band on a 'master' / RW server,
and then send the updated index to your slave (the one actually serving
the requests).
This *will NOT* remove the need to refresh your cache, but it will
remove any delay introduced by commit/indexing + optimise.
Too bad there is no way to have a centralized cache that can be
shared AND updated when new documents are created.
hmm, not sure it makes sense like that... but maybe along the lines of
having an active cache that is used to serve queries, with new ones
being prepared and then swapped in when ready.
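That prepare-then-swap pattern can be sketched in a few lines (a simulation, not Solr code). Queries keep hitting the active cache while a replacement is warmed, then a single reference swap makes the new one live; note the window where both are resident, which is exactly the memory doubling mentioned earlier in the thread:

```python
# Sketch only: "prepare then swap". The old cache serves queries while
# the new one is warmed; a reference swap makes the new one live.
active_cache = {"doc1": "v1", "doc2": "v1"}

def warm_new_cache(new_docs):
    # Stand-in for the expensive warming pass over the new index.
    return {d: "v2" for d in new_docs}

new_cache = warm_new_cache(["doc1", "doc2", "doc3"])  # old still serving
# Both caches are resident here: the memory-doubling window.
peak_entries = len(active_cache) + len(new_cache)
active_cache = new_cache    # swap; the old cache becomes garbage
print(peak_entries, len(active_cache))   # 5 3
```

The trade-off is availability versus memory: queries never wait on warming, at the price of briefly holding two caches.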
Speaking of which (or not :P), has anyone thought about / done any work
on using memcached for these internal Solr caches? I guess it would make
sense for setups with several slaves (or even a master updating
memcached too...)... though for a setup with shards it would be slightly
more involved (although it *could* be used to support several slaves per
'data shard').
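To make the memcached idea concrete, here is a sketch with a plain dict standing in for memcached (the key scheme and function names are hypothetical, not any existing Solr feature): the master warms once per index version and publishes the result, and each slave on that version reads it instead of rebuilding it locally after replication:

```python
# Sketch only: a dict stands in for memcached; keys and helpers are
# hypothetical, not an existing Solr API.
shared_cache = {}   # stand-in for a memcached instance

def master_commit(index_version, warmed_entries):
    # Master warms once and publishes, keyed by index version, so a
    # slave on an older index never reads entries for a newer one.
    shared_cache[("sort", index_version)] = warmed_entries

def slave_lookup(index_version):
    return shared_cache.get(("sort", index_version))

master_commit(42, {"docA": 0, "docB": 1})
# Every slave on version 42 gets the warmed cache without rebuilding it.
assert slave_lookup(42) == {"docA": 0, "docB": 1}
assert slave_lookup(41) is None   # stale slave: miss, must warm locally
```

Versioned keys are the tricky part: a shared cache only helps if all readers agree on exactly which index snapshot an entry belongs to, which is why the sharded case is more involved.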
All the best,
B
_________________________
{Beto|Norberto|Numard} Meijome
RTFM and STFW before anything bad happens.
I speak for myself, not my employer. Contents may be hot. Slippery when
wet. Reading disclaimers makes you go blind. Writing them is worse.
You have been Warned.