RE: How best to handle a reasonable amount to data (25TB+)

Peter Miller Tue, 07 Feb 2012 17:20:38 -0800

Whoops! Very poor basic maths, I should have written it down. I was thinking 13 
shards. But yes, 13,000 is a bit different. Now I'm in even more need of help.


The "how" is easy: 15 million audit records a month, coming from several active 
systems, and a requirement to keep and search across seven years of data.
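
As a rough check of the arithmetic, here is a minimal sketch. The 15 million 
records a month, the seven-year retention, the 100 million documents per shard 
and the 1.25 trillion figure all come from this thread; the constant and 
function names are made up for illustration.

```python
# Figures taken from this thread; names are illustrative only.
RECORDS_PER_MONTH = 15_000_000
RETENTION_YEARS = 7
DOCS_PER_SHARD = 100_000_000            # upper bound per shard mentioned in the thread
MISTAKEN_TOTAL = 1_250_000_000_000      # the "1.25 trillion" originally quoted


def total_documents(records_per_month, years):
    """Documents accumulated over the full retention period."""
    return records_per_month * 12 * years


def shards_needed(total_docs, docs_per_shard):
    """Ceiling division: number of shards needed to hold total_docs."""
    return -(-total_docs // docs_per_shard)


total = total_documents(RECORDS_PER_MONTH, RETENTION_YEARS)
print(total)                                         # 1260000000
print(shards_needed(total, DOCS_PER_SHARD))          # 13
print(shards_needed(MISTAKEN_TOTAL, DOCS_PER_SHARD)) # 12500
```

So the real volume is about 1.26 billion documents, which is where the 13 
shards came from; the trillion figure is what produces 12,500.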

<Goes off to do more googling>

Thanks a lot,
The Captn

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, 8 February 2012 12:39 AM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm curious what the nature of your data is such that you have 1.25 trillion 
documents. Even at 100M/shard, you're still talking 12,500 shards. The 
"laggard" problem will rear its ugly head, not to mention that the 
administration of that many machines will be, shall we say, non-trivial...
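
To make the "laggard" point concrete, here is a minimal sketch. A query fanned 
out to every shard finishes only when the slowest shard answers, so if each 
shard is independently slow with some small probability p, the chance that at 
least one of N shards is slow is 1 - (1 - p)^N. The independence assumption, 
the value p = 0.001 and the function name are illustrative assumptions, not 
measurements from any real cluster.

```python
def prob_any_laggard(num_shards, p_slow):
    """Probability that at least one of num_shards shards is slow,
    assuming each shard is slow independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** num_shards


P_SLOW = 0.001  # assumed per-shard chance of a slow response (hypothetical)

for shards in (13, 12_500):
    print(shards, prob_any_laggard(shards, P_SLOW))
```

Under these assumed numbers, 13 shards leaves a little over a 1% chance of 
waiting on a slow shard, while at 12,500 shards a slow shard on every query is 
all but certain.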

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller 
<peter.mil...@objectconsulting.com.au> wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an 
> Object Store for the indexes. The next concern is the use of a local index 
> versus the sharded ones, but I'm more relaxed about that now after thinking 
> about it. I see that index shards could be up to 100 million documents, so 
> that makes the 1.25 trillion number look reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -----Original Message-----
> From: ppp c [mailto:peter.c.e...@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not an issue with Lucene but with the logic of your app.
> If you're worried about too many docs in one index, you can create multiple
> indexes, search across them, and merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller
> <peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am 
>> looking for advice. I have researched the archives, and seen some 
>> relevant posts, but they are fairly old and not specifically a match, 
>> so I thought I would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB 
>> of search attributes to handle in Lucene, across about 1.25 trillion 
>> documents. The app is write once, read many. There are many document 
>> types involved that have to be able to be searched separately or 
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCR implementation that uses Lucene under the 
>> covers. The data itself is not searchable, only the attributes. I 
>> plan to hook the JCR repo (ModeShape) up to the OpenStack Object 
>> Storage on commodity hardware, eventually with 5 machines, each with 
>> 24 x 2TB drives. This should allow for redundancy (3 copies), although 
>> I would suppose we would add bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts 
>> for these days, but a bit chunky), I was sort of assuming that the 
>> Lucene indexes would go on the object storage solution too, to handle 
>> availability and other infrastructure issues. Most of the searches 
>> would be date-constrained, so I thought that the indexes could be sharded by 
>> date.
>>
>> There would be a local disk index being built near real time on the 
>> JCR hardware that could be regularly merged in with the main indexes 
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just 
>> theoretical at the moment and I'm not experienced in Lucene, as you 
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed 
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, 
>> and I'm now wondering if that would be a superior approach? Or any other 
>> suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
