It also depends on your queries. For example, if you only query data in one-month intervals and you partition by date, you can calculate which shard holds your data and query just that shard.
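A rough sketch of that calculation, in Python purely for illustration. It assumes one shard per calendar month; the `docs-YYYY-MM` naming scheme and both function names are made up for this example and are not anything Lucene or Solr provides.

```python
from datetime import date
from typing import List


def shard_for(day: date) -> str:
    """Name of the monthly shard holding documents dated `day`.

    The "docs-YYYY-MM" scheme is a hypothetical naming convention.
    """
    return f"docs-{day.year:04d}-{day.month:02d}"


def shards_for_range(start: date, end: date) -> List[str]:
    """All monthly shards that a query over [start, end] has to touch."""
    if start > end:
        raise ValueError("start must not be after end")
    shards = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        shards.append(shard_for(date(year, month, 1)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return shards
```

A query constrained to a single month then hits exactly one shard, and a query spanning a few months hits only those months' shards rather than the whole index.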
If you can find a partition key that is always present in the query, you can create a gazillion small shards but route each query to just the relevant shard, which keeps search latency low.

On Wed, Feb 8, 2012 at 09:39, Li Li <fancye...@gmail.com> wrote:
> it's up to your machines. in our application, we index about
> 30,000,000(30M)docs/shard, and the response time is about 150ms. our
> machine has about 48GB memory and about 25GB is allocated to solr and other
> is used for disk cache in Linux.
> if calculated by our application, indexing 1.25T docs will use 40+ machines.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am looking
>> for advice. I have researched the archives, and seen some relevant posts,
>> but they are fairly old and not specifically a match, so I thought I would
>> give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB of
>> search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document types
>> involved that have to be able to be searched separately or together, with
>> some common attributes, but also unique ones per type. I plan on using a
>> JCP implementation that uses Lucene under the covers. The data itself is
>> not searchable, only the attributes. I plan to hook the JCP repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should allow
>> for redundancy (3 copies), although I would suppose we would add bigger
>> drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle availability
>> and other infrastructure issues. Most of the searches would be
>> date-constrained, so I thought that the indexes could be sharded by date.
>>
>> There would be a local disk index being built near real time on the JCP
>> hardware that could be regularly merged in with the main indexes on the
>> object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you can no
>> doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed Solr,
>> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
>> wondering if that would be a superior approach? Or any other suggestions?
>>
>> Many Thanks,
>> The Captn