Since there was a debate about using MultiSearcher, what about using ParallelMultiSearcher?
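For concreteness, the MultiReader approach recommended later in this thread (an IndexSearcher on top of a single MultiReader, instead of any flavor of MultiSearcher) looks roughly like this on the Lucene 3.0.x API. This is only a sketch: the field names, shard contents, and query are illustrative, and real shards would live in FSDirectory instances rather than RAMDirectory.

```java
// Sketch only: requires lucene-core 3.0.x on the classpath.
// Field names, shard contents, and the query are illustrative.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MultiReaderSketch {

    // Builds a tiny in-memory "shard" holding one document.
    static Directory makeShard(String text) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        // One reader per shard, wrapped in a single MultiReader:
        IndexReader week1 = IndexReader.open(makeShard("lucene rocks"));
        IndexReader week2 = IndexReader.open(makeShard("lucene scales"));
        IndexSearcher searcher = new IndexSearcher(new MultiReader(week1, week2));

        // One search transparently covers both shards.
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println(hits.totalHits); // 2 - one hit from each shard
        searcher.close();
    }
}
```

The point of the wrapper is that scoring and doc-id handling stay consistent across the sub-indexes, which is exactly the MultiSearcher weakness discussed below.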
I have indexes with 60 million documents, sometimes growing to 100 million. I shard the index by week and use ParallelMultiSearcher to search across the shards. All data is on a single system. So far I haven't faced any issues. I used Lucene 2.9 and recently upgraded to 3.0.2. Do I need to switch to MultiReader?

Regards,
Ganesh

----- Original Message -----
From: "Luca Rondanini" <luca.rondan...@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Monday, November 22, 2010 11:29 PM
Subject: Re: best practice: 1.4 billions documents

> Thank you all, I really got some good hints!
>
> Of course I will distribute my index over many machines: storing everything
> on one computer is just crazy. 1.4B docs is going to be an index of almost
> 2T (in my case).
>
> The best solution for me at the moment (from your suggestions) seems to be
> to identify a criterion that forces a request (search/update) to access
> only a subset of the index. Multi or Parallel Searchers... I'll decide
> later.
>
> Solr is a really good option and I've already planned on "stealing" parts
> of its code, but I have the time and resources to try to build my own
> platform, especially since my data needs heavy processing.
>
> I'll keep you posted
> Luca
>
>
> On Mon, Nov 22, 2010 at 8:54 AM, eks dev <eks...@yahoo.co.uk> wrote:
>
>> Am I the only one who thinks this is not the way to go? MultiReader (or
>> MultiSearcher) is not going to fix your problems. Having 1.4B documents
>> on one machine is a big number, no matter how you partition them (unless
>> you have some really expensive hardware at your disposal). Did I miss the
>> point somewhere with this recommendation, "use MultiReader and you are
>> good for 1.4B documents"?
>>
>> Imo, you must distribute your index across many machines.
>>
>> Your best chance is to look at Solr Cloud and Solr replication (the Solr
>> wiki is your friend). Of course, you can do it yourself, but building a
>> distributed setup with what you call "real time updates" is a huge pain.
>>
>> Alternatively, google for Lucene or Solr on Cassandra (it has some very
>> nice properties regarding update latency and architectural simplicity). I
>> do not know if this is in production anywhere.
>>
>> Good luck,
>> e.
>>
>>
>> On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>> > There is no reason to use MultiSearcher instead of the much more
>> > consistent and effective MultiReader! We (Robert and I) are already
>> > planning to deprecate it. MultiSearcher itself has had no benefit over
>> > a simple IndexSearcher on top of a MultiReader since Lucene 2.9; it has
>> > only problems.
>> >
>> > The only use cases for real MultiSearchers are the subclasses for
>> > "remote search" or (perhaps) multi-threaded search, but I would not
>> > recommend the latter (instead, let the additional CPUs in your machine
>> > stay free for other users doing searches in parallel). Multithreading a
>> > single search should not be done, as it slows down multiple users
>> > accessing the same index at the same time. Spend the additional CPU
>> > power on other things like warming searchers, indexing additional
>> > documents, or filling FieldCache in parallel.
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >
>> > > -----Original Message-----
>> > > From: David Fertig [mailto:dfer...@cymfony.com]
>> > > Sent: Monday, November 22, 2010 4:54 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: RE: best practice: 1.4 billions documents
>> > >
>> > > >> We have a couple billion docs in our archives as well... Breaking
>> > > >> them up by day worked well for us
>> > >
>> > > We do not have 2 billion segments in one index. We have roughly 5-10
>> > > million documents per index. We are currently using a MultiSearcher,
>> > > but unresolved Lucene issues with it will force us to move to a
>> > > MultiReader.
>> > >
>> > > As far as the parallel searcher goes, read back on the thread with
>> > > subject "Search returning documents matching a NOT range".
>> > > There is an acknowledged/proven bug with a small unit test, but there
>> > > is some disagreement about the internal reasons it fails. I have been
>> > > unable to generate further discussion or a resolution. This was
>> > > supposed to be added as a bug to the JIRA for the 3.3 release, but
>> > > has not been. I am not sure which class Solr uses, but if it uses
>> > > MultiSearcher, it will have the same bug.
>> > >
>> > > -----Original Message-----
>> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
>> > > Sent: Monday, November 22, 2010 1:47 AM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: best practice: 1.4 billions documents
>> > >
>> > > Hi David, thanks for your answer, it really helped a lot! So you have
>> > > an index with more than 2 billion segments. This is pretty much the
>> > > answer I was searching for: Lucene alone is able to manage such a big
>> > > index.
>> > >
>> > > Which kind of problems do you have with the parallel searchers? I'm
>> > > going to build my index in the next couple of weeks; if you want, we
>> > > can compare our data.
>> > >
>> > > thanks again
>> > > Luca
>> > >
>> > >
>> > > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig <dfer...@cymfony.com>
>> > wrote:
>> > >
>> > > > Actually, I've been bitten by a still-unresolved issue with the
>> > > > parallel searchers and recommend a MultiReader instead.
>> > > > We have a couple billion docs in our archives as well. Breaking
>> > > > them up by day worked well for us, but you'll need to do something.
>> > > >
>> > > > -----Original Message-----
>> > > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
>> > > > Sent: Sunday, November 21, 2010 8:13 PM
>> > > > To: java-user@lucene.apache.org; yo...@lucidimagination.com
>> > > > Subject: Re: best practice: 1.4 billions documents
>> > > >
>> > > > thank you both!
>> > > >
>> > > > Johannes, Katta seems interesting, but I will need to solve the
>> > > > problem of "hot" updates to the index.
>> > > >
>> > > > Yonik, I see your point - so your suggestion would be to build an
>> > > > architecture based on ParallelMultiSearcher?
>> > > >
>> > > >
>> > > > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
>> > > > <yo...@lucidimagination.com> wrote:
>> > > >
>> > > > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
>> > > > > <luca.rondan...@gmail.com> wrote:
>> > > > > > Hi everybody,
>> > > > > >
>> > > > > > I really need some good advice! I need to index in Lucene
>> > > > > > something like 1.4 billion documents. I have experience with
>> > > > > > Lucene, but I've never worked with such a big number of
>> > > > > > documents. Also, this is just the number of docs at "start-up":
>> > > > > > they are going to grow, and fast.
>> > > > > >
>> > > > > > I don't have to tell you that I need the system to be fast and
>> > > > > > to support real-time updates to the documents.
>> > > > > >
>> > > > > > The first solution that came to my mind was to use
>> > > > > > ParallelMultiSearcher, splitting the index into many
>> > > > > > "sub-indexes" (how many docs per index? 100,000?), but I don't
>> > > > > > have experience with it and I don't know how well it will scale
>> > > > > > as the number of documents grows!
>> > > > > >
>> > > > > > A more solid solution seems to be building some kind of
>> > > > > > integration with Hadoop. But I didn't find much about Lucene
>> > > > > > and Hadoop integration.
>> > > > > >
>> > > > > > Any idea? Which direction should I go (pure Lucene or Hadoop)?
>> > > > >
>> > > > > There seems to be a common misconception about Hadoop regarding
>> > > > > search. Map-reduce as implemented in Hadoop is really for
>> > > > > batch-oriented jobs only (or those types of jobs where you don't
>> > > > > need a quick response time). It's definitely not for normal
>> > > > > queries (unless you have unusual requirements).
>> > > > >
>> > > > > -Yonik
>> > > > > http://www.lucidimagination.com
>> > > > >
>> > > > > ---------------------------------------------------------------------
>> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > > > >
>> > > > >
>> > > >
>> >
>> >
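Several posters above shard their time-based indexes by day or by week. The routing logic that decides which shard a document (or a query's date range) belongs to is trivial; a minimal stdlib sketch, where the `index-YYYY-Www` naming scheme is purely illustrative and the modern java.time API stands in for whatever date handling the 2010-era code actually used:

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

public class WeeklyShardRouter {

    // Maps a document's date to the name of its weekly shard,
    // e.g. 2010-11-22 -> "index-2010-W47" (ISO week numbering).
    static String shardFor(LocalDate date) {
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return String.format("index-%d-W%02d", year, week);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(LocalDate.of(2010, 11, 22))); // index-2010-W47
    }
}
```

Using the ISO week-based year (rather than the calendar year) matters at year boundaries: January 1st can fall in the last week of the previous year, and routing it to the wrong shard would make that document unfindable by week-scoped queries.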