It depends on your machines. In our application, we index about 30,000,000 (30M) docs per shard, and response time is about 150ms. Each machine has about 48GB of memory; about 25GB is allocated to Solr and the rest is left to the Linux disk cache. Extrapolating from our numbers, indexing 1.25 trillion docs would take 40,000+ machines at one shard per machine.
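If you run the same numbers yourself, it comes out to roughly 41,667 shards. A quick Java sketch of the arithmetic (one shard per machine is just how we run our setup, not a rule):

    // Back-of-the-envelope shard count, using the figures from our setup.
    // One shard per ~25GB-heap machine is an assumption from our deployment,
    // not a Lucene/Solr requirement.
    public class ShardEstimate {
        public static void main(String[] args) {
            long docsPerShard = 30_000_000L;        // ~30M docs per shard
            long totalDocs = 1_250_000_000_000L;    // 1.25 trillion docs
            long shards = (totalDocs + docsPerShard - 1) / docsPerShard; // ceiling
            System.out.println("shards needed: " + shards); // prints 41667
        }
    }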
On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
peter.mil...@objectconsulting.com.au> wrote:

> Hi,
>
> I have a little bit of an unusual set of requirements, and I am looking
> for advice. I have researched the archives and seen some relevant posts,
> but they are fairly old and not specifically a match, so I thought I
> would give this a try.
>
> We will eventually have about 50TB of raw, non-searchable data and 25TB
> of search attributes to handle in Lucene, across about 1.25 trillion
> documents. The app is write once, read many. There are many document
> types involved that have to be searchable separately or together, with
> some common attributes but also unique ones per type. I plan on using a
> JCR implementation that uses Lucene under the covers. The data itself is
> not searchable, only the attributes. I plan to hook the JCR repo
> (ModeShape) up to OpenStack Object Storage on commodity hardware,
> eventually with 5 machines, each with 24 x 2TB drives. This should allow
> for redundancy (3 copies), although I suppose we would add bigger drives
> as we go on.
>
> Since there is such a lot of data to index (not outrageous amounts these
> days, but a bit chunky), I was assuming that the Lucene indexes would go
> on the object storage solution too, to handle availability and other
> infrastructure issues. Most of the searches would be date-constrained,
> so I thought that the indexes could be sharded by date.
>
> There would be a local disk index built in near real time on the JCR
> hardware that could be regularly merged into the main indexes on the
> object storage, I suppose.
>
> Does that make sense, and would it work? Sorry, but this is just
> theoretical at the moment and I'm not experienced in Lucene, as you can
> no doubt tell.
>
> I came across a piece talking about Hadoop and distributed Solr,
> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
> wondering if that would be a superior approach? Or any other suggestions?
>
> Many Thanks,
> The Captn
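On the merge-the-local-index-into-the-main-index idea: that is what Lucene's IndexWriter.addIndexes(Directory...) is for. A minimal sketch against a recent Lucene (the paths, the per-month shard name, and the analyzer are all assumptions on my part; on object storage the main index would sit behind a suitable Directory implementation rather than FSDirectory):

    // Minimal sketch: fold a locally built near-real-time index into a
    // date-sharded main index. Paths and shard naming are hypothetical.
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeLocalIntoMain {
        public static void main(String[] args) throws Exception {
            Directory local = FSDirectory.open(Paths.get("/data/nrt-index"));
            Directory main = FSDirectory.open(Paths.get("/data/shard-2012-02"));
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(main, cfg)) {
                writer.addIndexes(local); // copies the local segments over
                writer.commit();
            }
        }
    }

One caveat: addIndexes acquires the write lock on the source directory, so the writer building the local near-real-time index has to be closed before the merge runs.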