Since there was a debate about using MultiSearcher, what about using ParallelMultiSearcher?
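For concreteness, the MultiReader approach recommended later in this thread (an IndexSearcher on top of a single MultiReader, instead of any flavor of MultiSearcher) looks roughly like this on the Lucene 3.0.x API. This is only a sketch: the field names, shard contents, and query are illustrative, and real shards would live in FSDirectory instances rather than RAMDirectory.

```java
// Sketch only: requires lucene-core 3.0.x on the classpath.
// Field names, shard contents, and the query are illustrative.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MultiReaderSketch {

    // Builds a tiny in-memory "shard" holding one document.
    static Directory makeShard(String text) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        // One reader per shard, wrapped in a single MultiReader:
        IndexReader week1 = IndexReader.open(makeShard("lucene rocks"));
        IndexReader week2 = IndexReader.open(makeShard("lucene scales"));
        IndexSearcher searcher = new IndexSearcher(new MultiReader(week1, week2));

        // One search transparently covers both shards.
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println(hits.totalHits); // 2 - one hit from each shard
        searcher.close();
    }
}
```

The point of the wrapper is that scoring and doc-id handling stay consistent across the sub-indexes, which is exactly the MultiSearcher weakness discussed below.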
I have indexes with 60 million documents, sometimes growing to 100 million. I shard the index by week and use ParallelMultiSearcher to search across the shards. All data is on a single system. So far I haven't faced any issues. I used Lucene 2.9 and recently upgraded to 3.0.2. Do I need to switch to MultiReader?

Regards,
Ganesh

----- Original Message -----
From: "Luca Rondanini" <luca.rondan...@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Monday, November 22, 2010 11:29 PM
Subject: Re: best practice: 1.4 billions documents

> Thank you all, I really got some good hints!
>
> Of course I will distribute my index over many machines: storing everything
> on one computer is just crazy. 1.4B docs is going to be an index of almost
> 2T (in my case).
>
> The best solution for me at the moment (from your suggestions) seems to be
> to identify a criterion that forces a request (search/update) to access
> only a subset of the index. Multi or Parallel Searchers... I'll decide
> later.
>
> Solr is a really good option and I've already planned on "stealing" parts
> of its code, but I have the time and resources to try to build my own
> platform, especially since my data needs heavy processing.
>
> I'll keep you posted
> Luca
>
>
> On Mon, Nov 22, 2010 at 8:54 AM, eks dev <eks...@yahoo.co.uk> wrote:
>
>> Am I the only one who thinks this is not the way to go? MultiReader (or
>> MultiSearcher) is not going to fix your problems. Having 1.4B documents
>> on one machine is a big number, no matter how you partition them (unless
>> you have some really expensive hardware at your disposal). Did I miss the
>> point somewhere with this recommendation, "use MultiReader and you are
>> good for 1.4B documents"?
>>
>> Imo, you must distribute your index across many machines.
>>
>> Your best chance is to look at Solr Cloud and Solr replication (the Solr
>> wiki is your friend). Of course, you can do it yourself, but building a
>> distributed setup with what you call "real time updates" is a huge pain.
>>
>> Alternatively, google for Lucene or Solr on Cassandra (it has some very
>> nice properties regarding update latency and architectural simplicity). I
>> do not know if this is in production anywhere.
>>
>> Good luck,
>> e.
>>
>>
>> On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>> > There is no reason to use MultiSearcher instead of the much more
>> > consistent and effective MultiReader! We (Robert and I) are already
>> > planning to deprecate it. MultiSearcher itself has had no benefit over
>> > a simple IndexSearcher on top of a MultiReader since Lucene 2.9; it has
>> > only problems.
>> >
>> > The only use cases for real MultiSearchers are the subclasses for
>> > "remote search" or (perhaps) multi-threaded search, but I would not
>> > recommend the latter (instead, let the additional CPUs in your machine
>> > stay free for other users doing searches in parallel). Multithreading a
>> > single search should not be done, as it slows down multiple users
>> > accessing the same index at the same time. Spend the additional CPU
>> > power on other things like warming searchers, indexing additional
>> > documents, or filling FieldCache in parallel.
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >
>> > > -----Original Message-----
>> > > From: David Fertig [mailto:dfer...@cymfony.com]
>> > > Sent: Monday, November 22, 2010 4:54 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: RE: best practice: 1.4 billions documents
>> > >
>> > > >> We have a couple billion docs in our archives as well... Breaking
>> > > >> them up by day worked well for us
>> > >
>> > > We do not have 2 billion segments in one index. We have roughly 5-10
>> > > million documents per index. We are currently using a MultiSearcher,
>> > > but unresolved Lucene issues with it will force us to move to a
>> > > MultiReader.
>> > >
>> > > As far as the parallel searcher goes, read back on the thread with
>> > > subject "Search returning documents matching a NOT range".
>> > > There is an acknowledged/proven bug with a small unit test, but there
>> > > is some disagreement about the internal reasons it fails. I have been
>> > > unable to generate further discussion or a resolution. This was
>> > > supposed to be added as a bug to the JIRA for the 3.3 release, but
>> > > has not been. I am not sure which class Solr uses, but if it uses
>> > > MultiSearcher, it will have the same bug.
>> > >
>> > > -----Original Message-----
>> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
>> > > Sent: Monday, November 22, 2010 1:47 AM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: best practice: 1.4 billions documents
>> > >
>> > > Hi David, thanks for your answer, it really helped a lot! So you have
>> > > an index with more than 2 billion segments. This is pretty much the
>> > > answer I was searching for: Lucene alone is able to manage such a big
>> > > index.
>> > >
>> > > Which kind of problems do you have with the parallel searchers? I'm
>> > > going to build my index in the next couple of weeks; if you want, we
>> > > can compare our data.
>> > >
>> > > thanks again
>> > > Luca
>> > >
>> > >
>> > > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfer...@cymfony.com>
>> > wrote:
>> > >
>> > > > Actually, I've been bitten by a still-unresolved issue with the
>> > > > parallel searchers and recommend a MultiReader instead.
>> > > > We have a couple billion docs in our archives as well. Breaking
>> > > > them up by day worked well for us, but you'll need to do something.
>> > > >
>> > > > -----Original Message-----
>> > > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
>> > > > Sent: Sunday, November 21, 2010 8:13 PM
>> > > > To: java-user@lucene.apache.org; yo...@lucidimagination.com
>> > > > Subject: Re: best practice: 1.4 billions documents
>> > > >
>> > > > thank you both!
>> > > >
>> > > > Johannes, Katta seems interesting, but I will need to solve the
>> > > > problem of "hot" updates to the index.
>> > > >
>> > > > Yonik, I see your point - so your suggestion would be to build an
>> > > > architecture based on ParallelMultiSearcher?
>> > > >
>> > > >
>> > > > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
>> > > > <yo...@lucidimagination.com> wrote:
>> > > >
>> > > > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
>> > > > > <luca.rondan...@gmail.com> wrote:
>> > > > > > Hi everybody,
>> > > > > >
>> > > > > > I really need some good advice! I need to index in Lucene
>> > > > > > something like 1.4 billion documents. I have experience with
>> > > > > > Lucene, but I've never worked with such a big number of
>> > > > > > documents. Also, this is just the number of docs at "start-up":
>> > > > > > they are going to grow, and fast.
>> > > > > >
>> > > > > > I don't have to tell you that I need the system to be fast and
>> > > > > > to support real-time updates to the documents.
>> > > > > >
>> > > > > > The first solution that came to my mind was to use
>> > > > > > ParallelMultiSearcher, splitting the index into many
>> > > > > > "sub-indexes" (how many docs per index? 100,000?), but I don't
>> > > > > > have experience with it and I don't know how well it will scale
>> > > > > > as the number of documents grows!
>> > > > > >
>> > > > > > A more solid solution seems to be building some kind of
>> > > > > > integration with Hadoop. But I didn't find much about Lucene
>> > > > > > and Hadoop integration.
>> > > > > >
>> > > > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
>> > > > >
>> > > > > There seems to be a common misconception about Hadoop regarding
>> > > > > search. Map-reduce as implemented in Hadoop is really for
>> > > > > batch-oriented jobs only (or those types of jobs where you don't
>> > > > > need a quick response time). It's definitely not for normal
>> > > > > queries (unless you have unusual requirements).
>> > > > >
>> > > > > -Yonik
>> > > > > http://www.lucidimagination.com
>> > > > >
>> > > > > ---------------------------------------------------------------------
>> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > > > >
>> > > > >
>> > > >
>> >
>> >
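Several posters above shard their time-based indexes by day or by week. The routing logic that decides which shard a document (or a query's date range) belongs to is trivial; a minimal stdlib sketch, where the `index-YYYY-Www` naming scheme is purely illustrative and the modern java.time API stands in for whatever date handling the 2010-era code actually used:

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

public class WeeklyShardRouter {

    // Maps a document's date to the name of its weekly shard,
    // e.g. 2010-11-22 -> "index-2010-W47" (ISO week numbering).
    static String shardFor(LocalDate date) {
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return String.format("index-%d-W%02d", year, week);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(LocalDate.of(2010, 11, 22))); // index-2010-W47
    }
}
```

Using the ISO week-based year (rather than the calendar year) matters at year boundaries: January 1st can fall in the last week of the previous year, and routing it to the wrong shard would make that document unfindable by week-scoped queries.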