Whoops! Very poor basic maths, I should have written it down. I was thinking 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of help.
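A quick way to sanity-check the shard counts is to write the division down. The sketch below uses only figures quoted in this thread (100M documents per shard, the 1.25 trillion total, and 15 million records a month kept for seven years); the function and variable names are made up for illustration.

```python
# Assumption: 100M docs/shard is the per-shard ceiling quoted in this thread.
DOCS_PER_SHARD = 100_000_000


def shards_needed(total_docs: int, docs_per_shard: int = DOCS_PER_SHARD) -> int:
    """Smallest number of shards that can hold total_docs (integer ceiling)."""
    return -(-total_docs // docs_per_shard)


# Total as stated in the original post.
quoted_total = 1_250_000_000_000

# Total implied by the stated ingest rate: 15 million/month for seven years.
from_rate = 15_000_000 * 12 * 7

print(shards_needed(quoted_total))  # 12500
print(from_rate)                    # 1260000000
print(shards_needed(from_rate))     # 13
```

The two totals differ by a factor of about a thousand, so it is worth confirming which one applies before sizing hardware.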
How is "easy" - 15 million audit records a month, coming from several
active systems, and a requirement to keep and search across seven years
of data.

<Goes off to do more googling>

Thanks a lot,
The Captn

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 8 February 2012 12:39 AM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm curious what the nature of your data is such that you have 1.25
trillion documents. Even at 100M/shard, you're still talking 12,500
shards. The "laggard" problem will rear its ugly head, not to mention
the administration of that many machines will be, shall we say,
non-trivial...

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller
<peter.mil...@objectconsulting.com.au> wrote:
> Thanks for the response. Actually, I am more concerned with trying to
> use an Object Store for the indexes. The next concern is the use of a
> local index versus the sharded ones, but I'm more relaxed about that
> now after thinking about it. I see that index shards could be up to
> 100 million documents, so that makes the 1.25 trillion number look
> reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -----Original Message-----
> From: ppp c [mailto:peter.c.e...@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not a Lucene issue but a matter of your app's
> logic. If you're afraid of having too many docs in one index, you can
> make multiple indexes, search across them, and then merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am
>> looking for advice.
>> I have researched the archives, and seen some relevant posts, but
>> they are fairly old and not specifically a match, so I thought I
>> would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB
>> of search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document
>> types involved that have to be able to be searched separately or
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCR implementation that uses Lucene under the
>> covers. The data itself is not searchable, only the attributes. I
>> plan to hook the JCR repo (ModeShape) up to the OpenStack Object
>> Storage on commodity hardware, eventually with 5 machines, each with
>> 24 x 2TB drives. This should allow for redundancy (3 copies),
>> although I would suppose we would add bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts
>> for these days, but a bit chunky), I was sort of assuming that the
>> Lucene indexes would go on the object storage solution too, to handle
>> availability and other infrastructure issues. Most of the searches
>> would be date-constrained, so I thought that the indexes could be
>> sharded by date.
>>
>> There would be a local disk index being built near real time on the
>> JCR hardware that could be regularly merged in with the main indexes
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/,
>> and I'm now wondering if that would be a superior approach? Or any
>> other suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org