Any idea how many documents your 5TB of data contains?  Certain features such 
as faceting depend more on the total number of documents than on the raw size 
of the data.

I have tested approx. 1 TB (100 million documents) running on a single machine 
(40 cores, 128 GB RAM), using distributed search across 10 shards (10 million 
docs each), i.e. running 10 Solr processes.  Search performance is good (under 
1 second on average, including faceting).
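In that setup a distributed query simply lists the shards via Solr's `shards` request parameter, roughly like this (host and core names here are made up for illustration):

```
http://host1:8983/solr/news/select?q=*:*&facet=true
    &shards=host1:8983/solr/shard1,host1:8984/solr/shard2
```

Each entry in `shards` is a `host:port/path` pointing at one of the running Solr processes; the node that receives the query merges the per-shard results.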

So based on that, for 5TB (assuming 500 million docs) you could probably shard 
across a few such machines and get decent performance with distributed search.
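Back of the envelope, using the per-shard and per-machine sizes from my setup above (these constants come from my test, not from any hard Solr limit):

```python
# Rough capacity estimate: how many shards and machines for a given doc
# count, assuming ~10M docs per shard and 10 shards (Solr processes) per
# machine, as in the 1 TB test described above.
def estimate_layout(total_docs, docs_per_shard=10_000_000,
                    shards_per_machine=10):
    shards = -(-total_docs // docs_per_shard)      # ceiling division
    machines = -(-shards // shards_per_machine)    # ceiling division
    return shards, machines

print(estimate_layout(500_000_000))  # (50, 5): 50 shards on 5 machines
```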

The indexes were sharded by time.  New documents go into a single index (the 
"current" index); once that index reaches 10 million docs, a new index is 
created to become the "current" index, and the oldest index is dropped from 
search (so the total stays at 10 shards).  It is news data, so older data is 
less important.
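A minimal sketch of that rotation scheme (in-memory lists stand in for Solr indexes here; the class name and thresholds are illustrative, and a real setup would create/drop Solr cores instead):

```python
from collections import deque

class RollingIndex:
    """Time-based sharding: new docs go to the 'current' shard; when it
    fills, start a fresh shard and drop the oldest so the total is fixed."""

    def __init__(self, max_docs_per_shard=10_000_000, max_shards=10):
        self.max_docs_per_shard = max_docs_per_shard
        self.max_shards = max_shards
        self.shards = deque([[]])       # oldest ... newest; last = "current"

    def add(self, doc):
        if len(self.shards[-1]) >= self.max_docs_per_shard:
            self.shards.append([])      # new "current" shard
            if len(self.shards) > self.max_shards:
                self.shards.popleft()   # drop the oldest from search
        self.shards[-1].append(doc)
```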



On Jan 13, 2012, at 10:00 AM, <dar...@ontrenet.com> wrote:

> 
> Maybe also have a look at these links.
> 
> http://www.hathitrust.org/blogs/large-scale-search/performance-5-million-volumes
> http://www.hathitrust.org/blogs/large-scale-search
> 
> On Fri, 13 Jan 2012 15:49:06 +0100, Daniel Brügge <dan...@bruegge.eu>
> wrote:
>> Hi,
>> 
>> it's definitely a problem to store 5TB in Solr without sharding. I try to
>> split the data over several Solr instances, so that each index fits in
>> memory on its server.
>> 
>> I ran into trouble with a single Solr instance using a 50G index.
>> 
>> Daniel
>> 
>> On Jan 13, 2012, at 1:08 PM, mustafozbek wrote:
>> 
>>> I have been an Apache Solr user for about a year. I have used Solr for
>>> simple search tools, but now I want to use it with 5TB of data. I assume
>>> that the 5TB of data will grow to 7TB once Solr indexes it, given the
>>> filters I use. I will then add nearly 50MB of data per hour to the same
>>> index.
>>> 1-  Are there any problems using a single Solr server with 5TB of data
>>> (without shards)?
>>>  a- Can the Solr server answer queries in an acceptable time?
>>>  b- What is the expected time for committing 50MB of data to a 7TB index?
>>>  c- Is there an upper limit on index size?
>>> 2-  What suggestions do you have?
>>>  a- How many shards should I use?
>>>  b- Should I use Solr cores?
>>>  c- What commit frequency do you suggest? (Is 1 hour OK?)
>>> 3-  are there any test results for this kind of large data
>>> 
>>> The 5TB of data is not available yet; I just want to estimate what the
>>> result will be.
>>> Note: You can assume that hardware resources are not a problem.
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p3656484.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
