Chris,

Yes, disk space is cheap, and with so little overlap you won't gain much by 
putting everything in a single index.  Plus, when each user has a separate 
index, it's easy to split users up and distribute them over multiple machines 
if you ever need to, and it's easy and fast to completely reindex one user's 
data without affecting other users.
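
To make the "split users over machines" point concrete, here is a minimal 
sketch of mapping a user to an index location.  The host names, the root 
directory and the locate_index function are hypothetical names made up for 
illustration, and it assumes user IDs are safe to use as directory names.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical host names and root directory (assumptions for illustration).
HOSTS = ["search1", "search2", "search3"]
INDEX_ROOT = PurePosixPath("/var/indexes")


def locate_index(user_id, hosts=HOSTS, root=INDEX_ROOT):
    """Return (host, index_path) for one user's private index."""
    # Use a stable hash (not Python's per-process salted hash()) so the
    # mapping stays the same across restarts.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    host = hosts[int(digest, 16) % len(hosts)]
    # Two-character fan-out so no single directory holds every user.
    path = root / digest[:2] / user_id
    return host, str(path)
```

Reindexing one user then means rebuilding only the directory that 
locate_index returns for that user.  Modulo hashing moves many users when the 
host list changes, so an explicit user-to-host lookup table may be the better 
choice if you expect to move users one at a time.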

Several years ago I built Simpy at http://www.simpy.com/ that way (but 
pre-Solr, so it uses Lucene directly) and never regretted it.  There are way 
more than 20K users there with many searches per second and with constant 
indexing.  Each user has an index for bookmarks and an index for notes.  Each 
group has its own index, shared by all group members.  The main bookmark search 
is another index.  People search is yet another index.  And so on.  All of 
this runs on a single server.
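
As a rough illustration of that layout (not the actual Simpy code), the set 
of indexes relevant to one user could be enumerated like this.  The index 
names and the indexes_for_user function are hypothetical.

```python
def indexes_for_user(user_id, group_ids):
    """List the index names relevant to one user under this layout."""
    # Two private indexes per user: bookmarks and notes.
    names = [f"user/{user_id}/bookmarks", f"user/{user_id}/notes"]
    # One index per group, shared by all members of that group.
    names += [f"group/{g}" for g in sorted(group_ids)]
    # Site-wide indexes: main bookmark search and people search.
    names += ["main-bookmarks", "people"]
    return names
```

For example, indexes_for_user("ralph", {"solr"}) lists ralph's two private 
indexes, the "solr" group index, and the two site-wide indexes.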


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Chris Cornell <srchn...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Sunday, May 17, 2009 8:37:44 PM
> Subject: Re: multicore for 20k users?
> 
> Thanks for helping Ryan,
> 
> On Sun, May 17, 2009 at 7:17 PM, Ryan McKinley wrote:
> > how much overlap is there with the 20k user documents?
> 
> There are around 20k users but each one has anywhere from zero to
> thousands of documents.  The final overlap is unknown because there is
> a current set of documents but each user will add documents on the fly
> (it's like their own personal search engine in a way).
> 
> >
> > if you create a separate index for each of them will you be indexing 90% of
> > the documents 20K times?
> 
> Probably more like 5-10%
> 
> > How many total documents could an individual user
> > typically see?
> 
> Average is around 100 now but we want them to be able to add more.
> 
> > How many total distinct documents are you talking about?  Is
> > the indexing strategy the same for all users?  (the same analysis etc)
> 
> The indexing strategy is the same for each user.
> 
> >
> > Is it actually possible to limit visibility by "role" rather than user?
> 
> No, it has to be by user since it is a private document set.  We just
> want to save on diskspace when there are big documents that are the
> same across users (based on document checksum).
> 
> >
> > I would start with trying to put everything in one index -- if that is not
> > possible, then look at a multi-core option.
> 
> OK.  Another thing is that we want to allow the user to restrict
> searches based on when the document was added... if we do share an
> indexed item and insert some attribute into each query (like
> "user:ralph") then it couldn't have date-added based search.  Unless a
> field was added like date-added-by-ralph, date-added-by-sally (ugh!).
> 
> Or maybe "diskspace is cheap" and we just should strive for simplicity?
> 
> Thanks,
> Chris
> 
> >
> >
> >
> > On May 17, 2009, at 5:53 PM, Chris Cornell wrote:
> >
> >> Trying to create a search solution for about 20k users at a company.
> >> Each person's documents are private and different (some overlap... it
> >> would be nice to not have to store/index copies).
> >>
> >> Is multicore something that would work or should we auto-insert a
> >> facet into each query generated by the person?
> >>
> >> Thanks for any advice, I am very new to solr.  Any tiny push in the
> >> right direction would be appreciated.
> >>
> >> Thanks,
> >> Chris
> >
> >
