Re: How best to handle a reasonable amount to data (25TB+)

Danil ŢORIN Wed, 08 Feb 2012 01:15:45 -0800

It also depends on your queries.

For example, if you only ever query one-month intervals and you
partition by date, you can calculate which shard holds the data and
query just that shard.
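
As a rough sketch of that calculation, assuming one shard per calendar
month (the index-YYYY-MM naming and both function names are made up for
illustration, not anything Lucene or Solr provides):

```python
from datetime import date
from typing import List


def shard_for(day: date) -> str:
    """Name of the (hypothetical) monthly shard that holds this date."""
    return f"index-{day.year:04d}-{day.month:02d}"


def shards_for_range(start: date, end: date) -> List[str]:
    """All monthly shards that a date-constrained query has to touch."""
    if start > end:
        raise ValueError("start must not be after end")
    shards = []
    year, month = start.year, start.month
    # Walk month by month from the start month to the end month, inclusive.
    while (year, month) <= (end.year, end.month):
        shards.append(f"index-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return shards
```

A query limited to a single month touches exactly one shard; a query
spanning three months touches three, and every other shard is left alone.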


If you can find a partition key that is always present in the query,
you can create a gazillion small shards but route each query to just
the one shard that matters, which keeps search latency low.
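
A minimal sketch of that routing, assuming the key is a string and the
shard count is fixed (NUM_SHARDS and shard_for_key are hypothetical
names, and CRC32 is just one stable hash you could pick):

```python
import zlib

NUM_SHARDS = 1024  # assumed for illustration; size this to your data


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard number in [0, num_shards).

    CRC32 is used because it gives the same value on every machine and
    in every process, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards
```

The same function has to be used when indexing and when searching,
otherwise the query ends up at a shard that never received the document.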


On Wed, Feb 8, 2012 at 09:39, Li Li <[email protected]> wrote:
> It's up to your machines. In our application, we index about
> 30,000,000 (30M) docs/shard, and the response time is about 150ms. Each of
> our machines has about 48GB of memory; about 25GB is allocated to Solr and
> the rest is used for disk cache in Linux.
> Scaled from our application, indexing 1.25T (trillion) docs at 30M
> docs/shard works out to roughly 42,000 shards of that size.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <[email protected]> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am looking
>> for advice. I have researched the archives, and seen some relevant posts,
>> but they are fairly old and not specifically a match, so I thought I would
>> give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB of
>> search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document types
>> involved that have to be able to be searched separately or together, with
>> some common attributes, but also unique ones per type. I plan on using a
>> JCR implementation that uses Lucene under the covers. The data itself is
>> not searchable, only the attributes. I plan to hook the JCR repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should allow
>> for redundancy (3 copies), although I would suppose we would add bigger
>> drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle availability
>> and other infrastructure issues. Most of the searches would be
>> date-constrained, so I thought that the indexes could be sharded by date.
>>
>> There would be a local disk index being built near real time on the JCR
>> hardware that could be regularly merged in with the main indexes on the
>> object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you can no
>> doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed Solr,
>> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
>> wondering if that would be a superior approach? Or any other suggestions?
>>
>> Many Thanks,
>> The Captn
>>

