Hello, 

We are already using this design in production for an email job-application
system.
Each client (company) has an account and may have multiple users.
When a new client is created, a new Lucene index is automatically created as
soon as job applications start arriving for that account.
Job applications are in principle owned by users, but sometimes they are
shared with other users in the same account, so the search can be
user-independent.
This design works fine for us because the flow of job applications is not the
same for different accounts: some Lucene indices are updated much more often
than others.
It also lets us rebuild one client's index without impacting the others.
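
As a rough illustration of that part of the design (a minimal sketch only,
assuming Lucene 1.x APIs, a hypothetical AccountIndexer class and base path;
not our actual code):

    // Sketch: one Lucene index directory per client account.  The index is
    // created lazily, the first time a job application arrives for that
    // account.  Names and layout are hypothetical.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    import java.io.File;

    public class AccountIndexer {

        private static final String BASE = "/var/indexes";   // assumed layout

        /** Adds one job-application document to the per-account index. */
        public static void index(String accountId, Document jobApplication)
                throws Exception {
            File dir = new File(BASE, accountId);
            boolean create = !dir.exists();        // first document => new index
            IndexWriter writer =
                new IndexWriter(dir.getPath(), new StandardAnalyzer(), create);
            try {
                writer.addDocument(jobApplication);
            } finally {
                writer.close();
            }
        }
    }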

We have only one problem: when the index is updated and searched at the same
time, the index may be corrupted and an exception may be thrown by the indexer
("Read past EOF"; unfortunately I don't have the stack trace at hand right
now). I think this happens because searching and indexing are done in two
different Java processes. We will rework the routines to lock out searches
while an indexation is running, and vice versa.
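
The locking we have in mind would be roughly like the following (only a
sketch, assuming a shared lock file per index and java.nio FileLock; the
actual routines are not written yet and the names are hypothetical):

    // Sketch: both the search process and the indexing process take an
    // exclusive lock on a shared file before touching the index, so a
    // search never runs while an indexation is in progress, and vice versa.
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class IndexGuard {

        public static Object withIndexLock(String indexDir, Callback work)
                throws Exception {
            RandomAccessFile raf =
                new RandomAccessFile(indexDir + "/process.lock", "rw");
            try {
                FileLock lock = raf.getChannel().lock();  // blocks until the
                try {                                     // other process is done
                    return work.run();                    // search or indexation
                } finally {
                    lock.release();
                }
            } finally {
                raf.close();
            }
        }

        /** Minimal callback interface (pre-Java-5 style). */
        public interface Callback {
            Object run() throws Exception;
        }
    }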

--- sven

Monday, 11 July 2005, 03:03:29, you wrote:


PS> On 11/07/2005, at 10:43 AM, Chris Hostetter wrote:

>>
>> : > Generally speaking, you only ever need one active Searcher, which all
>> : > of your threads should be able to use.  (Of course, Nathan says that
>> : > in his code base, doing this causes his JVM to freeze up, but I've
>> : > never seen this myself).
>> : >
>> : Thanks for your response Chris.  Do you think we are going down a
>> : deadly path by having "many smaller" IndexSearchers open rather than
>> : "one very large one"?
>>
>> I'm sorry ... I think I may have confused you; I forgot that this thread
>> was regarding partitioning the index.  I meant one searcher *per index* ...
>> don't try to make a separate searcher per client, or have a pool of
>> searchers, or anything like that.  But if you have a need to partition
>> your data into multiple indexes, then have one searcher per index.
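
A minimal sketch of what "one searcher per index" can look like in practice
(assuming Lucene 1.x and a hypothetical SearcherRegistry keyed by index path;
all threads share the same instance):

    // Sketch: at most one IndexSearcher per physical index directory,
    // shared by all threads; never one per client or per request.
    import org.apache.lucene.search.IndexSearcher;

    import java.util.HashMap;
    import java.util.Map;

    public class SearcherRegistry {

        private static final Map searchers = new HashMap();  // path -> searcher

        public static synchronized IndexSearcher get(String indexPath)
                throws Exception {
            IndexSearcher s = (IndexSearcher) searchers.get(indexPath);
            if (s == null) {
                s = new IndexSearcher(indexPath);   // open once, reuse everywhere
                searchers.put(indexPath, s);
            }
            return s;
        }
    }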

PS> Actually I think I confused you first, and then you confused me  
PS> back... Let me... uhh, clarify 'ourselves'.. :)

PS> My use of the word 'pool' was an error on my part (and a very silly
PS> one).  I should really have meant "LRU Cache".

PS> We have recognized that there is a finite # of IndexSearchers that
PS> can probably be open at one time.  So we'll use an LRU cache to make
PS> sure only the 'actively' in use Searchers are open.  However there
PS> will only be one IndexSearcher for a given physical Index directory
PS> open at a time, we're just making sure only the recently used ones
PS> are kept open to keep memory limits sane.
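
Roughly, such an LRU cache of searchers might look like the sketch below
(assuming java.util.LinkedHashMap's access-order behaviour and a hypothetical
MAX_OPEN limit; not the final code):

    // Sketch: an LRU cache of IndexSearchers keyed by index directory.
    // Only the most recently used MAX_OPEN searchers stay open; evicted
    // ones are closed to keep memory bounded.  There is still only one
    // searcher per physical index directory at a time.
    import org.apache.lucene.search.IndexSearcher;

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SearcherCache {

        private static final int MAX_OPEN = 50;   // assumed limit

        private static final LinkedHashMap cache =
            new LinkedHashMap(16, 0.75f, true) {   // true = access order (LRU)
                protected boolean removeEldestEntry(Map.Entry eldest) {
                    if (size() > MAX_OPEN) {
                        try {
                            ((IndexSearcher) eldest.getValue()).close();
                        } catch (Exception ignored) {
                        }
                        return true;               // evict least recently used
                    }
                    return false;
                }
            };

        public static synchronized IndexSearcher get(String indexPath)
                throws Exception {
            IndexSearcher s = (IndexSearcher) cache.get(indexPath);
            if (s == null) {
                s = new IndexSearcher(indexPath);
                cache.put(indexPath, s);
            }
            return s;
        }
    }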

>>
>> now assume you partition your data into two separate indexes: unless the
>> way you partition your data lets you split it cleanly so that each of the
>> two indexes contains only half the number of terms of one big index, then
>> sorting on a field in those two indexes will require more RAM than
>> sorting on the same data in a single index.
>>

PS> Our data is logically segmented into Projects.  Each Project can  
PS> contain Documents and Mail.  So we currently have 2 physical Indexes
PS> per Project.  90% of the time our users work within one project at a
PS> time, and only work in "document mode" or "mail mode".  Every now and
PS> then they may need to do a general search across all Entities and/or
PS> Projects they are involved in (accomplished with a MultiSearcher).
PS> Perhaps we should just put Documents and Mail all in one Index for a
PS> project (i.e. have one Index per project)?
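
Roughly what that cross-index search looks like (a sketch only, assuming
Lucene's MultiSearcher and hypothetical per-project index paths and field
names):

    // Sketch: the 90% case searches a single per-project index; the
    // occasional "search everything" case combines the Document and Mail
    // indexes of a project with a MultiSearcher.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class ProjectSearch {

        public static Hits searchEverything(String projectDir) throws Exception {
            Searchable[] all = new Searchable[] {
                new IndexSearcher(projectDir + "/documents"),  // assumed layout
                new IndexSearcher(projectDir + "/mail")
            };
            MultiSearcher searcher = new MultiSearcher(all);
            // Searchers left open for brevity in this sketch.
            return searcher.search(new TermQuery(new Term("body", "invoice")));
        }
    }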

PS> Part of the reason to partition is to make the cost of rebuilding
PS> a given project cheaper.  Reduces the risk of an Uber-Index being
PS> corrupted and screwing all the users up.  We can order the reindexing
PS> of projects to make sure our more important customers get re-indexed
PS> first if there is a serious issue.

PS> I would have thought that partitioning indexes would have performance
PS> benefits too:  a lot less data to scan (most of the data is already
PS> relevant).

PS> Since this isn't in production yet, I'd rather be proven wrong now
PS> rather than later! :)

PS> Thanks for your input.

PS> Paul
