On 11/07/2005, at 10:43 AM, Chris Hostetter wrote:


: > Generally speaking, you only ever need one active Searcher, which
: > all of
: > your threads should be able to use.  (Of course, Nathan says that
: > in his
: > code base, doing this causes his JVM to freeze up, but I've never seen
: > this myself).
: >
: Thanks for your response Chris.  Do you think we are going down a
: deadly path by having "many smaller" IndexSearchers open rather than
: "one very large one"?

I'm sorry ... i think i may have confused you, i forgot that this thread was regarding partioning the index. i ment one searcher *per index* ...
don't try to make a seperate searcher per client, or have a pool of
searchers, or anything like that.  But if you have a need to partition
your data into multiple indexes, then have one searcher per index.

Actually I think I confused you first, and then you confused me back... Let me... uhh, clarify 'ourselves'.. :)

My use of the word 'pool' was an error on my part (and a very silly one). I should really have meant "LRU Cache".

We have recognized that there is a finite # of IndexSearchers that can probably be open at one time. So we'll use an LRU cache to make sure only the 'actively' in use Searchers are open. However there will only be one IndexSearcher for a given physical Index directory open at a time, we're just making sure only the recently used ones are kept open to keep memory limits sane.


now assume you partition your data into two seperate indexes, unless the
way you partition your data lets you cleanly so that each of hte
two indexes contains only half the number of terms as if you had one big index, then sorting on a field in those two indexes will require more RAM
then sorting on the same data in asingle index.


Our data is logically segmented into Projects. Each Project can contain Documents and Mail. So we currently have 2 physical Indexes per Project. 90% of the time our users work within one project at a time, and only work in "document mode" or "mail mode". Every now and then they may need to do a general search across all Entities and/or Projects they are involved in (accomplished with Mulitsearcher). Perhaps we should just put Documents and Mail all in one Index for a project (ie have 1 Index per project)??

Part of the reason in to partition is to make the cost of rebuilding a given project cheaper. Reduces the risk of an Uber-Index being corrupted and screwing all the users up. We can order the reindexing of projects to make sure our more important customers get re-indexed first if there is a serious issue.

I would have thought that partitioning indexes would have performance benefits too: a lot less data to scan (most of the data is already relevant).

Since this isn't in production yet, I'd rather be proven wrong now rather than later! :)

Thanks for your input.

Paul

Reply via email to