Re: Solr cloud performance degradation with billions of documents

2014-08-17 Thread Erick Erickson
bq: I am interested in knowing, when you have multiple
collections like this case (60), and you just query one collection,

Yes... and no. You're correct about OS memory swapping in
and out, but the JVM memory is a different matter. There'll
be some low-level caches filled up. Each collection may have
filterCache entries. Or sort entries. Or... There are lots of
Java memory-resident structures that are _not_ swapped
out. Furthermore, each collection may have 1-n warming
queries fired when it's loaded. And the top-level caches
configured in solrconfig.xml may have autowarm counts. And so on.

The long and short of it is that each collection will consume
a varying amount of memory when it's loaded. Memory that
MUST live in the JVM. Other memory will be paged in and
out, see Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
So as you add more and more collections, you'll hit this
kind of problem.

Now, there is the LotsOfCores code, see:
http://wiki.apache.org/solr/LotsOfCores
WARNING: This is NOT supported (yet) for SolrCloud.
That code does load/unload cores as necessary, based
on the configuration parameters that determine
1) whether a core can be unloaded/loaded
2) how many transient cores can be in memory at once

Best,
Erick


On Sat, Aug 16, 2014 at 6:43 PM, shushuai zhu ss...@yahoo.com.invalid wrote:
 Erick,

 ---
 I fear the problem will be this: you won't even be able to do basic searches 
 as the number of shards on a particular machine increases. To test, fire off a 
 simple search for each of your 60 days. I expect it'll blow you out of the 
 water. This assumes that all your shards are hosted in the same JVM on each 
 of your 32 machines. But that's totally a guess.
 ---

 In this case, assuming there are 60 collections and only one collection is 
 queried at a time, should the memory requirements be those for that 
 collection only? My understanding is that when a new collection is queried, 
 the indexes (cores) of the old collection in the OS cache are swapped out and 
 the indexes of the new collection are brought in, but the memory requirements 
 should be roughly the same as long as the two collections have similar sizes.

 I am interested in knowing: when you have multiple collections as in this case 
 (60) and you query just one collection, should the other collections matter from 
 a performance perspective? Since different collections contain different cores, 
 if querying one collection involves cores in other collections, is it a bug?

 Thanks.

 Shushuai


  From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, August 15, 2014 7:30 PM
 Subject: Re: Solr cloud performance degradation with billions of documents


 Toke:

 bq: I would have agreed with you fully an hour ago.

 Well, I now disagree with myself too :) I don't mind
 talking to myself. I don't even mind arguing with myself. I
 really _do_ mind losing the arguments I have with
 myself though.

 Scott:

 OK, that has a much better chance of working, I obviously
 misunderstood. So you'll have 60 different collections and each
 collection will have one shard on each machine.

 When the time comes to roll some of the collections off the
 end due to age, collection aliasing may be helpful. I still think
 you're significantly undersized, but you know your problem
 space better than I do.

 I fear the problem will be this: you won't even be able to do
 basic searches as the number of shards on a particular
 machine increases. To test, fire off a simple search for each of
 your 60 days. I expect it'll blow you out of the water. This
 assumes that all your shards are hosted in the same JVM
 on each of your 32 machines. But that's totally a guess.

 Keep us posted!


 On Fri, Aug 15, 2014 at 2:40 PM, Toke Eskildsen t...@statsbiblioteket.dk 
 wrote:
 Erick Erickson [erickerick...@gmail.com] wrote:
 I guess that my main issue is that from everything I've seen so far,
 this project is doomed. You simply cannot put 7B documents in a single
 shard, period. Lucene has a 2B hard limit.

 I would have agreed with you fully an hour ago and actually planned to ask 
 Wilburn to check if he had corrupted his indexes. However, his latest post 
 suggests that the scenario is more about having a larger number of more 
 reasonably sized shards in play than building gigantic shards.

 For instance, Wilburn is talking about only using 6G of memory. Even
 at 2B docs/shard, I'd be surprised to see it function at all. Don't
 try sorting on a timestamp for instance.

 I haven't understood Wilburn's setup completely, as it seems to me that he 
 will quickly run out of memory for starting new shards. But if we are 
 looking at shards of 30GB and 160M documents, 6GB sounds a lot better.

 Regards,
 Toke Eskildsen


Re: Solr cloud performance degradation with billions of documents

2014-08-16 Thread shushuai zhu
Erick,

---
I fear the problem will be this: you won't even be able to do basic searches as 
the number of shards on a particular machine increases. To test, fire off a 
simple search for each of your 60 days. I expect it'll blow you out of the 
water. This assumes that all your shards are hosted in the same JVM on each of 
your 32 machines. But that's totally a guess.
---

In this case, assuming there are 60 collections and only one collection is 
queried at a time, should the memory requirements be those for that collection 
only? My understanding is that when a new collection is queried, the indexes 
(cores) of the old collection in the OS cache are swapped out and the indexes 
of the new collection are brought in, but the memory requirements should be 
roughly the same as long as the two collections have similar sizes. 

I am interested in knowing: when you have multiple collections as in this case 
(60) and you query just one collection, should the other collections matter from 
a performance perspective? Since different collections contain different cores, 
if querying one collection involves cores in other collections, is it a bug?

Thanks.

Shushuai

 
 From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Friday, August 15, 2014 7:30 PM
Subject: Re: Solr cloud performance degradation with billions of documents
  

Toke:

bq: I would have agreed with you fully an hour ago.

Well, I now disagree with myself too :) I don't mind
talking to myself. I don't even mind arguing with myself. I
really _do_ mind losing the arguments I have with
myself though.

Scott:

OK, that has a much better chance of working, I obviously
misunderstood. So you'll have 60 different collections and each
collection will have one shard on each machine.

When the time comes to roll some of the collections off the
end due to age, collection aliasing may be helpful. I still think
you're significantly undersized, but you know your problem
space better than I do.

I fear the problem will be this: you won't even be able to do
basic searches as the number of shards on a particular
machine increases. To test, fire off a simple search for each of
your 60 days. I expect it'll blow you out of the water. This
assumes that all your shards are hosted in the same JVM
on each of your 32 machines. But that's totally a guess.

Keep us posted!


On Fri, Aug 15, 2014 at 2:40 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Erick Erickson [erickerick...@gmail.com] wrote:
 I guess that my main issue is that from everything I've seen so far,
 this project is doomed. You simply cannot put 7B documents in a single
 shard, period. Lucene has a 2B hard limit.

 I would have agreed with you fully an hour ago and actually planned to ask 
 Wilburn to check if he had corrupted his indexes. However, his latest post 
 suggests that the scenario is more about having a larger number of more 
 reasonably sized shards in play than building gigantic shards.

 For instance, Wilburn is talking about only using 6G of memory. Even
 at 2B docs/shard, I'd be surprised to see it function at all. Don't
 try sorting on a timestamp for instance.

 I haven't understood Wilburn's setup completely, as it seems to me that he 
 will quickly run out of memory for starting new shards. But if we are looking 
 at shards of 30GB and 160M documents, 6GB sounds a lot better.

 Regards,
 Toke Eskildsen

Re: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Erick Erickson
Toke:

You make valid points. You're completely right that my reflexes are
for sub-second responses so I tend to think of lots and lots of memory
being a requirement. I agree that depending on the problem space the
percentage of the index that has to be in memory varies widely, I've
seen a large variance in projects. And I know you've done some _very_
interesting things tuning-wise!

I guess that my main issue is that from everything I've seen so far,
this project is doomed. You simply cannot put 7B documents in a single
shard, period. Lucene has a 2B hard limit. Wilburn is making
assumptions here that are simply wrong. Or my math is off, that's been
known to happen too.

For instance, Wilburn is talking about only using 6G of memory. Even
at 2B docs/shard, I'd be surprised to see it function at all. Don't
try sorting on a timestamp for instance. I've never seen 2B docs fit
on a shard and be OK performance-wise. Or, for that matter, perform at
all. If there are situations like that I'd _love_ to know the
details...

For any chance of success, Wilburn has to go back and do some
reassessment IMO. There's no magic knob to turn to overcome the
fundamental limitations that are going to come creeping out of the
woodwork. Indexing throughput is the least of his problems. I further
suspect (but don't know for sure) that the first time realistic
queries start hitting the system it'll OOM.

All that said, I don't know for sure of course.

On Thu, Aug 14, 2014 at 11:57 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Erick Erickson [erickerick...@gmail.com] wrote:
 Solr requires holding large parts of the index in memory.
 For the entire corpus. At once.

 That requirement is under the assumption that one must have the lowest 
 possible latency at each individual box. You might as well argue for the 
 fastest possible memory or the fastest possible CPU being a requirement. The 
 advice is good in some contexts and a waste of money in others.

 I not-so-humbly point to 
 http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we 
 (for simple searches) handily achieve our goal of sub-second response times 
 for a 10TB index with just 1.4% of the index cached in RAM. Had our goal been 
 sub-50ms, it would be another matter, but it is not. Just as Wilburn's 
 problem is not to minimize latency for each individual box, but to achieve a 
 certain throughput for indexing, while performing searches.

 Wilburn's hardware is currently able to keep up, although barely, with 300B 
 documents. He needs to handle 900B. Tripling (or quadrupling) the amount of 
 machines should do the trick. Increasing the amount of RAM on each current 
 machine might also work (qua the well known effect of RAM with Lucene/Solr). 
 Using local SSDs, if he is not doing so already, might also work (qua the 
 article above).

 - Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Wilburn, Scott
Erick,
You make some very good valid points. Let me clear a few things up, though. We 
are not trying to put 7B docs into one single shard, because we are using 
collections, created daily, which spread the index across the 32 shards that 
make up the cloud/collection. Last I counted, we are putting about 160M docs 
per collection per shard. 
You are very correct about the memory issues. In fact, we cannot do any 
complicated searches or faceting without Solr returning memory errors. Basic 
searching still works fine, fortunately. This limit on search is acceptable in 
our case, though not ideal, to ensure the project succeeds and comes in under 
budget. 

Thanks,
Scott 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, August 15, 2014 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Toke:

You make valid points. You're completely right that my reflexes are for 
sub-second responses so I tend to think of lots and lots of memory being a 
requirement. I agree that depending on the problem space the percentage of the 
index that has to be in memory varies widely, I've seen a large variance in 
projects. And I know you've done some _very_ interesting things tuning-wise!

I guess that my main issue is that from everything I've seen so far, this 
project is doomed. You simply cannot put 7B documents in a single shard, 
period. Lucene has a 2B hard limit. Wilburn is making assumptions here that are 
simply wrong. Or my math is off, that's been known to happen too.

For instance, Wilburn is talking about only using 6G of memory. Even at 2B 
docs/shard, I'd be surprised to see it function at all. Don't try sorting on a 
timestamp for instance. I've never seen 2B docs fit on a shard and be OK 
performance-wise. Or, for that matter, perform at all. If there are situations 
like that I'd _love_ to know the details...

For any chance of success, Wilburn has to go back and do some reassessment IMO. 
There's no magic knob to turn to overcome the fundamental limitations that are 
going to come creeping out of the woodwork. Indexing throughput is the least of his 
problems. I further suspect (but don't know for sure) that the first time 
realistic queries start hitting the system it'll OOM.

All that said, I don't know for sure of course.

On Thu, Aug 14, 2014 at 11:57 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Erick Erickson [erickerick...@gmail.com] wrote:
 Solr requires holding large parts of the index in memory.
 For the entire corpus. At once.

 That requirement is under the assumption that one must have the lowest 
 possible latency at each individual box. You might as well argue for the 
 fastest possible memory or the fastest possible CPU being a requirement. The 
 advice is good in some contexts and a waste of money in others.

 I not-so-humbly point to 
 http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we 
 (for simple searches) handily achieve our goal of sub-second response times 
 for a 10TB index with just 1.4% of the index cached in RAM. Had our goal been 
 sub-50ms, it would be another matter, but it is not. Just as Wilburn's 
 problem is not to minimize latency for each individual box, but to achieve a 
 certain throughput for indexing, while performing searches.

 Wilburn's hardware is currently able to keep up, although barely, with 300B 
 documents. He needs to handle 900B. Tripling (or quadrupling) the amount of 
 machines should do the trick. Increasing the amount of RAM on each current 
 machine might also work (qua the well known effect of RAM with Lucene/Solr). 
 Using local SSDs, if he is not doing so already, might also work (qua the 
 article above).

 - Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote:
 You make some very good valid points. Let me clear a few things up, though.
 We are not trying to put 7B docs into one single shard, because we are using
 collections, created daily, which spread the index across the 32 shards that
 make up the cloud/collection.

Just to be sure I understand: You make a new collection, consisting of 32 
shards, each day? And when you do, the old collection is not updated anymore?

As your primary problem is indexing speed degradation, dividing your machines 
into a dedicated search pool and a dedicated index (plus search in the 
collection being built) pool might work. This would require you to move 
finished collections from the indexers to the searchers, but it would make it 
possible for you to have quite fine-grained control over how much power should 
be given to each of the two jobs, by adjusting the pool sizes. Furthermore, 
having shards that are no longer updated allows for optimization down to a 
single segment, which might also help with performance.
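
As a sketch (Solr 4.x-era SolrJ classes; the ZooKeeper address and collection
name are placeholders), forcing a finished daily collection down to one segment
could look like this. Note that a forced merge is very I/O-intensive, so it
should be scheduled when the machines are otherwise quiet:

import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class OptimizeFinishedCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble; "docs_20140814" stands in for a
        // daily collection that is no longer receiving updates.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("docs_20140814");

        // waitFlush=true, waitSearcher=true, maxSegments=1
        server.optimize(true, true, 1);

        server.shutdown();
    }
}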

 You are very correct about the memory issues. In fact, we cannot do any
 complicated searches or faceting without Solr returning memory errors.

Could you describe the field(s) you would like to facet on? Number/string? 
Single-/multi-value? Have you tried with DocValues? Under the right 
circumstances, faceting can be done surprisingly cheaply.
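
For reference, the kind of facet request being discussed might look like the
SolrJ sketch below (Solr 4.x-era classes; the ZooKeeper address, collection,
and field name are placeholders; whether the field is stored as docValues is a
schema.xml decision, not a query-time option):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("docs_20140814");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);              // only the facet counts are needed
        query.setFacet(true);
        query.addFacetField("status"); // hypothetical field name
        query.setFacetLimit(20);

        QueryResponse rsp = server.query(query);
        for (FacetField.Count count : rsp.getFacetField("status").getValues()) {
            System.out.println(count.getName() + ": " + count.getCount());
        }
        server.shutdown();
    }
}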

- Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Toke Eskildsen
Erick Erickson [erickerick...@gmail.com] wrote:
 I guess that my main issue is that from everything I've seen so far,
 this project is doomed. You simply cannot put 7B documents in a single
 shard, period. Lucene has a 2B hard limit.

I would have agreed with you fully an hour ago and actually planned to ask 
Wilburn to check if he had corrupted his indexes. However, his latest post 
suggests that the scenario is more about having a larger number of more 
reasonably sized shards in play than building gigantic shards.

 For instance, Wilburn is talking about only using 6G of memory. Even
 at 2B docs/shard, I'd be surprised to see it function at all. Don't
 try sorting on a timestamp for instance.

I haven't understood Wilburn's setup completely, as it seems to me that he will 
quickly run out of memory for starting new shards. But if we are looking at 
shards of 30GB and 160M documents, 6GB sounds a lot better. 

Regards,
Toke Eskildsen


Re: Solr cloud performance degradation with billions of documents

2014-08-15 Thread Erick Erickson
Toke:

bq: I would have agreed with you fully an hour ago.

Well, I now disagree with myself too :) I don't mind
talking to myself. I don't even mind arguing with myself. I
really _do_ mind losing the arguments I have with
myself though.

Scott:

OK, that has a much better chance of working, I obviously
misunderstood. So you'll have 60 different collections and each
collection will have one shard on each machine.

When the time comes to roll some of the collections off the
end due to age, collection aliasing may be helpful. I still think
you're significantly undersized, but you know your problem
space better than I do.
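
As a purely illustrative SolrJ sketch of the aliasing idea (Solr 4.x-era class
names; the ZooKeeper address and collection names below are made up), the
Collections API CREATEALIAS action can repoint a stable alias at the newest
daily collections so clients never need to know the physical collection names:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class AliasRollover {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble for illustration only.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");

        // Point a stable alias at the most recent daily collections so queries
        // keep hitting "search" while old collections age out of the window.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATEALIAS");
        params.set("name", "search");
        params.set("collections", "docs_20140814,docs_20140815,docs_20140816");

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);

        server.shutdown();
    }
}

Re-running something like this after each day's collection is created rolls the
alias forward; dropping the oldest collection is then a separate Collections
API call.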

I fear the problem will be this: you won't even be able to do
basic searches as the number of shards on a particular
machine increases. To test, fire off a simple search for each of
your 60 days. I expect it'll blow you out of the water. This
assumes that all your shards are hosted in the same JVM
on each of your 32 machines. But that's totally a guess.
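
A rough SolrJ sketch of that test (Solr 4.x-era classes; the ZooKeeper address
and the daily collection naming scheme are hypothetical), firing one trivial
query per daily collection and timing it:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DailySearchProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble for illustration only.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");

        for (int day = 1; day <= 60; day++) {
            // Hypothetical daily collection naming scheme.
            String collection = String.format("docs_day_%02d", day);
            server.setDefaultCollection(collection);

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(10);

            long start = System.currentTimeMillis();
            QueryResponse rsp = server.query(query);
            long elapsed = System.currentTimeMillis() - start;

            System.out.printf("%s: %d hits in %d ms%n",
                    collection, rsp.getResults().getNumFound(), elapsed);
        }
        server.shutdown();
    }
}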

Keep us posted!

On Fri, Aug 15, 2014 at 2:40 PM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Erick Erickson [erickerick...@gmail.com] wrote:
 I guess that my main issue is that from everything I've seen so far,
 this project is doomed. You simply cannot put 7B documents in a single
 shard, period. Lucene has a 2B hard limit.

 I would have agreed with you fully an hour ago and actually planned to ask 
 Wilburn to check if he had corrupted his indexes. However, his latest post 
 suggests that the scenario is more about having a larger number of more 
 reasonably sized shards in play than building gigantic shards.

 For instance, Wilburn is talking about only using 6G of memory. Even
 at 2B docs/shard, I'd be surprised to see it function at all. Don't
 try sorting on a timestamp for instance.

 I haven't understood Wilburn's setup completely, as it seems to me that he 
 will quickly run out of memory for starting new shards. But if we are looking 
 at shards of 30GB and 160M documents, 6GB sounds a lot better.

 Regards,
 Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Wilburn, Scott
Erick,
Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking into 
that now. I agree what I am trying to do is a tall order, and the more I hear 
from all of your comments, the more I am convinced that lack of memory is my 
biggest problem. I'm going to work on increasing the memory now, but was 
wondering if there are any configuration or other techniques that could also 
increase ingest performance? Does anyone know if a cloud of this size (hundreds 
of billions) with an ingest rate of 5 billion new each day, has ever been 
attempted before? 

Thanks,
Scott 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, August 13, 2014 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Several points:

1) Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread 
your indexing across as many nodes as you have in your cluster. That said, it's 
not entirely clear that you'll gain throughput since you have as many nodes as 
you do.

2) Um, fitting this many documents into 6G of memory is ambitious. Very 
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each so 128 shards in 
aggregate

bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3 
Billion into the third, and 2 Billion into the fourth so we're talking 15B 
docs/day

bq: the plan is to keep up to 60 days...
So we're talking 900B documents.

It just won't work. 900B docs across 128 shards is over 7B documents/shard on average. 
Your two larger collections will have more than that, the two smaller ones 
less. But it doesn't matter because:
1: Lucene has a limit of 2B docs per core(shard), positive signed int.
2: It ain't gonna fit in 6G of memory even without this limit I'm pretty sure.
3: I've rarely heard of a single shard coping with over 300M docs without 
performance issues. I usually start getting nervous around 100M and insist on 
stress testing. Of course it depends lots on your query profile.

So you're going to need a LOT more shards. You might be able to squeeze some 
more from your hardware by hosting multiple shards for each collection on 
each machine, but I'm pretty sure your present setup is inadequate for your 
projected load.

Of course I may be misinterpreting what you're saying hugely, but from what I 
understand this system just won't work.

Best,
Erick




On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - You are running mapred jobs on the same nodes as Solr runs, right? 
 The first thing I would think of is that your OS file buffer cache is abused.
 The mappers read all data, presumably residing on the same node. The 
 mapper output and shuffling part would take place on the same node; 
 only the reducer output is sent to your nodes, which I assume are on 
 the same machines. Those same machines have a large Lucene index. All 
 this data, written to and read from the same disk, competes for a nice 
 spot in the OS buffer cache.

 Forget it if I misread anything, but when you're dealing with data of this 
 size, do not abuse your caches. Have a separate mapred and 
 Solr cluster, because they both eat cache space. I assume you can see 
 serious IO WAIT times.

 Split the stuff and maybe even use smaller hardware, but more of it.

 M

 -Original message-
  From:Wilburn, Scott scott.wilb...@verizonwireless.com.INVALID
  Sent: Wednesday 13th August 2014 23:09
  To: solr-user@lucene.apache.org
  Subject: Solr cloud performance degradation with billions of 
  documents
 
  Hello everyone,
  I am trying to use SolrCloud to index a very large number of simple
 documents and have run into some performance and scalability 
 limitations and was wondering what can be done about it.
 
  Hardware wise, I have a 32-node Hadoop cluster that I use to run all 
  of
 the Solr shards and each node has 128GB of memory. The current 
 SolrCloud setup is split into 4 separate and individual clouds of 32 
 shards each thereby giving four running shards per cloud or one cloud per 
 eight nodes.
 Each shard is currently assigned a 6GB heap size. I’d prefer to avoid 
 increasing heap memory for Solr shards to have enough to run other 
 MapReduce jobs on the cluster.
 
  The rate of documents that I am currently inserting into these 
  clouds
 per day is 5 Billion each in two clouds, 3 Billion into the third, and 
 2 Billion into the fourth ; however to account for capacity, the aim 
 is to scale the solution to support double that amount of documents. 
 To index these documents, there are MapReduce jobs that run that 
 generate the Solr XML documents and will then submit these documents 
 via SolrJ's CloudSolrServer interface. In testing, I have found that 
 limiting the number of active parallel inserts to 80 per cloud gave

Re: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Jack Krupansky
You're using the term cloud again. Maybe that's the cause of your 
misunderstanding - SolrCloud probably should have been named SolrCluster 
since that's what it really is, a cluster rather than a cloud. The term 
cloud conjures up images of vast, unlimited numbers of nodes, thousands, 
tens of thousands of machines, but SolrCloud is much more modest than that.


Again, start with a model of 100 million documents on a fairly commodity box 
(say, 32GB as opposed to expensive 16-core 256GB machines). So, 1 billion 
docs means 10 servers, times replication - I assume you want to serve a 
healthy query load. So, 5 billion docs needs 50 servers, times replication. 
100 billion docs would require 1,000 servers. 500 billion documents would 
require 5,000 servers, times replication. Not quite Google class, but not a 
typical SolrCloud cluster either. You will have to test for yourself 
whether that 100 million number is achievable for your particular hardware 
and data. Maybe you can double it... or maybe only half of that.


And, once again, make sure your index for each node fits in the OS system 
memory available for file caching.


I haven't heard of any specific experiences of SolrCloud beyond dozens of 
nodes, but 64 nodes is probably a reasonable expectation for a SolrCloud 
cluster. How much bigger than that a SolrCloud cluster could grow is 
unknown. Whatever the actual practical limit, based on your own hardware, 
I/O, and network, and your own data schema and data patterns, which you will 
have to test for yourself, you will probably need to use an application 
layer to shard your 100s of billions to specific SolrCloud clusters.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Thursday, August 14, 2014 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents

Erick,
Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking 
into that now. I agree what I am trying to do is a tall order, and the more 
I hear from all of your comments, the more I am convinced that lack of 
memory is my biggest problem. I'm going to work on increasing the memory 
now, but was wondering if there are any configuration or other techniques 
that could also increase ingest performance? Does anyone know if a cloud of 
this size (hundreds of billions) with an ingest rate of 5 billion new each 
day, has ever been attempted before?


Thanks,
Scott


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, August 13, 2014 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Several points:

1) Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread 
your indexing across as many nodes as you have in your cluster. That said, 
it's not entirely clear that you'll gain throughput since you have as many 
nodes as you do.


2) Um, fitting this many documents into 6G of memory is ambitious. Very 
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each so 128 shards in 
aggregate


bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3 
Billion into the third, and 2 Billion into the fourth so we're talking 15B 
docs/day


bq: the plan is to keep up to 60 days...
So we're talking 900B documents.

It just won't work. 900B docs across 128 shards is over 7B documents/shard on 
average. Your two larger collections will have more than that, the two 
smaller ones less. But it doesn't matter because:

1: Lucene has a limit of 2B docs per core(shard), positive signed int.
2: It ain't gonna fit in 6G of memory even without this limit I'm pretty 
sure.
3: I've rarely heard of a single shard coping with over 300M docs without 
performance issues. I usually start getting nervous around 100M and insist 
on stress testing. Of course it depends lots on your query profile.


So you're going to need a LOT more shards. You might be able to squeeze some 
more from your hardware by hosting multiple shards for each collection on 
each machine, but I'm pretty sure your present setup is inadequate for your 
projected load.


Of course I may be misinterpreting what you're saying hugely, but from what 
I understand this system just won't work.


Best,
Erick




On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma markus.jel...@openindex.io
wrote:


Hi - You are running mapred jobs on the same nodes as Solr runs, right?
The first thing I would think of is that your OS file buffer cache is 
abused.

The mappers read all data, presumably residing on the same node. The
mapper output and shuffling part would take place on the same node,
only the reducer output is sent to your nodes, which I assume are on
the same machines. Those same machines have a large Lucene index. All
this data, written to and read from

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote:
 Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking 
 into that now.
 I agree what I am trying to do is a tall order, and the more I hear from all 
 of your
 comments, the more I am convinced that lack of memory is my biggest problem.
 I'm going to work on increasing the memory now, but was wondering if there are
 any configuration or other techniques that could also increase ingest 
 performance?

More RAM basically compensates for slow storage, so the obvious trick is to 
increase your I/O performance. If your index is placed on network storage, then 
put it on local storage. If you are using spinning drives, then change to SSDs. 
If you are using SSDs then RAID them. Way cheaper than trying to match your RAM 
with your projected index size.

 Does anyone know if a cloud of this size (hundreds of billions) with an 
 ingest rate of 5 billion new each day, has ever been attempted before?

Sorry, my experience is primarily with maximizing search performance.

- Toke Eskildsen


Re: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Erick Erickson
You are absolutely on the bleeding edge.

I know of a couple of projects that are at that scale, but

1) they aren't being done on just a few nodes. As Jack
says, this scale for SolrCloud is not common and there
are no OOB templates to follow.

2) AFAIK, the projects I'm talking about aren't in production
yet. And they're significant R&D efforts on the parts of the
companies involved.

3) You are _not_ going to do this on a shoestring
budget. Nor is it going to be something you have
up and running in 3 months. And you're talking
a lot of machines here. Jack and I are both
coming up with thousands of Solr servers, _that's_
the scale we're talking here! You're not going to
get around this by just adding more memory either.

Much as I love Solr, I have to ask whether it's the right
tool for your situation. Unlike some other technologies,
Solr requires holding large parts of the index in memory.
For the entire corpus. At once. At the scale you're talking,
you need compelling reasons to invest in all that. So I'd
carefully look at what your problem is and whether Solr/search
is the right tool for the job or not.

On Thu, Aug 14, 2014 at 9:51 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:
 Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote:
 Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking 
 into that now.
 I agree what I am trying to do is a tall order, and the more I hear from all 
 of your
 comments, the more I am convinced that lack of memory is my biggest problem.
 I'm going to work on increasing the memory now, but was wondering if there 
 are
 any configuration or other techniques that could also increase ingest 
 performance?

 More RAM basically compensates for slow storage, so the obvious trick is to 
 increase your I/O performance. If your index is placed on network storage, 
 then put it on local storage. If you are using spinning drives, then change 
 to SSDs. If you are using SSDs then RAID them. Way cheaper than trying to 
 match your RAM with your projected index size.

 Does anyone know if a cloud of this size (hundreds of billions) with an 
 ingest rate of 5 billion new each day, has ever been attempted before?

 Sorry, my experience is primarily with maximizing search performance.

 - Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Wilburn, Scott
Thanks, Jack. I'd like to stay away from a terminology debate, since it is 
clear you know what I am talking about. But just to give my opinion, I prefer 
the term 'cloud' because it differentiates it from the term 'cluster', which 
refers to the Hadoop environment I am running it on. I would also refrain 
from using the term 'node' when talking about Solr for the same reason. 
I already have this setup and running as described in my original email, with 
over 300 billion total records so far and counting, so changing my hardware 
configuration is really not an option( except adding more memory ). My issue is 
specific to keeping up with the volume of new documents, as my ingest rate is 
barely able to keep up, and I fear it will eventually be perpetually latent as the 
amount of documents in the cloud/cluster continues to grow. 

I now have a few things to try, thanks to all of your comments. I am very 
appreciative. 

Thanks,
Scott 


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, August 14, 2014 8:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

You're using the term cloud again. Maybe that's the cause of your 
misunderstanding - SolrCloud probably should have been named SolrCluster since 
that's what it really is, a cluster rather than a cloud. The term cloud 
conjures up images of vast, unlimited numbers of nodes, thousands, tens of 
thousands of machines, but SolrCloud is much more modest than that.

Again, start with a model of 100 million documents on a fairly commodity box 
(say, 32GB as opposed to expensive 16-core 256GB machines). So, 1 billion docs 
means 10 servers, times replication - I assume you want to serve a healthy 
query load. So, 5 billion docs needs 50 servers, times replication. 
100 billion docs would require 1,000 servers. 500 billion documents would 
require 5,000 servers, times replication. Not quite Google class, but not a 
typical SolrCloud cluster either. You will have to test for yourself whether 
that 100 million number is achievable for your particular hardware and data. 
Maybe you can double it... or maybe only half of that.

And, once again, make sure your index for each node fits in the OS system 
memory available for file caching.

I haven't heard of any specific experiences of SolrCloud beyond dozens of 
nodes, but 64 nodes is probably a reasonable expectation for a SolrCloud 
cluster. How much bigger than that a SolrCloud cluster could grow is unknown. 
Whatever the actual practical limit, based on your own hardware, I/O, and 
network, and your own data schema and data patterns, which you will have to 
test for yourself, you will probably need to use an application layer to 
shard your 100s of billions to specific SolrCloud clusters.

-- Jack Krupansky

-Original Message-
From: Wilburn, Scott
Sent: Thursday, August 14, 2014 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents

Erick,
Thanks for your suggestion to look into MapReduceIndexerTool, I'm looking into 
that now. I agree what I am trying to do is a tall order, and the more I hear 
from all of your comments, the more I am convinced that lack of memory is my 
biggest problem. I'm going to work on increasing the memory now, but was 
wondering if there are any configuration or other techniques that could also 
increase ingest performance? Does anyone know if a cloud of this size (hundreds 
of billions) with an ingest rate of 5 billion new each day, has ever been 
attempted before?

Thanks,
Scott


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, August 13, 2014 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Several points:

1) Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread 
your indexing across as many nodes as you have in your cluster. That said, it's 
not entirely clear that you'll gain throughput since you have as many nodes as 
you do.

2) Um, fitting this many documents into 6G of memory is ambitious. Very 
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each so 128 shards in 
aggregate

bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3 
Billion into the third, and 2 Billion into the fourth so we're talking 15B 
docs/day

bq: the plan is to keep up to 60 days...
So we're talking 900B documents.

It just won't work. 900B docs across 128 shards is over 7B documents/shard on average. 
Your two larger collections will have more than that, the two smaller ones 
less. But it doesn't matter because:
1: Lucene has a limit of 2B docs per core(shard), positive signed int.
2: It ain't gonna fit in 6G of memory even without this limit I'm

RE: Solr cloud performance degradation with billions of documents

2014-08-14 Thread Toke Eskildsen
Erick Erickson [erickerick...@gmail.com] wrote:
 Solr requires holding large parts of the index in memory.
 For the entire corpus. At once.

That requirement is under the assumption that one must have the lowest possible 
latency at each individual box. You might as well argue for the fastest 
possible memory or the fastest possible CPU being a requirement. The advice is 
good in some contexts and a waste of money in others.

I not-so-humbly point to 
http://sbdevel.wordpress.com/2014/08/13/whale-hunting-with-solr/ where we (for 
simple searches) handily achieve our goal of sub-second response times for a 
10TB index with just 1.4% of the index cached in RAM. Had our goal been 
sub-50ms, it would be another matter, but it is not. Just as Wilburn's problem 
is not to minimize latency for each individual box, but to achieve a certain 
throughput for indexing, while performing searches.

Wilburn's hardware is currently able to keep up, although barely, with 300B 
documents. He needs to handle 900B. Tripling (or quadrupling) the amount of 
machines should do the trick. Increasing the amount of RAM on each current 
machine might also work (qua the well known effect of RAM with Lucene/Solr). 
Using local SSDs, if he is not doing so already, might also work (qua the 
article above).

- Toke Eskildsen


Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
Could you clarify what you mean by the term cloud, as in per cloud and 
individual clouds? That's not a proper Solr or SolrCloud concept per se. 
SolrCloud works with a single cluster of nodes. And there is no 
interaction between separate SolrCloud clusters.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple 
documents and have run into some performance and scalability limitations and 
was wondering what can be done about it.


Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup 
is split into 4 separate and individual clouds of 32 shards each thereby 
giving four running shards per cloud or one cloud per eight nodes. Each 
shard is currently assigned a 6GB heap size. I’d prefer to avoid increasing 
heap memory for Solr shards to have enough to run other MapReduce jobs on 
the cluster.


The rate of documents that I am currently inserting into these clouds per 
day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion 
into the fourth ; however to account for capacity, the aim is to scale the 
solution to support double that amount of documents. To index these 
documents, there are MapReduce jobs that run that generate the Solr XML 
documents and will then submit these documents via SolrJ's CloudSolrServer 
interface. In testing, I have found that limiting the number of active 
parallel inserts to 80 per cloud gave the best performance as anything 
higher gave diminishing returns, most likely due to the constant shuffling 
of documents internally to SolrCloud. From an index perspective, dated 
collections are being created to hold an entire day's of documents and 
generally the inserting happens primarily on the current day (the previous 
days are only to allow for searching) and the plan is to keep up to 60 days 
(or collections) in each cloud. A single shard index in one collection in 
the busiest cloud currently takes up 30G disk space or 960G for the entire 
collection. The documents are being auto committed with a hard commit time 
of 4 minutes (opensearcher = false) and soft commit time of 8 minutes.
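
For illustration only, a stripped-down SolrJ sketch of this kind of batched
CloudSolrServer indexing (Solr 4.x-era classes; the ZooKeeper address,
collection name, and field names are invented), sending documents in batches
and leaving commits to the autoCommit settings described above:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble and daily collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("docs_20140816");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("timestamp", System.currentTimeMillis());
            doc.addField("payload", "example value " + i);
            batch.add(doc);

            if (batch.size() == 1000) {  // send in batches, never per document
                server.add(batch);       // no explicit commit; rely on autoCommit
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.shutdown();
    }
}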


From a search perspective, the use case is fairly generic and simple 
searches of the type field:value, so there is no need to tune the system to use any of 
the more advanced querying features. Therefore, the most important thing for 
me is to have the indexing performance be able to keep up with the rate of 
input.


In the initial load testing, I was able to achieve a projected indexing rate 
of 10 Billion documents per cloud per day for a grand total of 40 Billion 
per day. However, the initial load testing was done on fairly empty clouds 
with just a few small collections. Now that there have been several days of 
documents being indexed, I am starting to see a fairly steep drop-off in 
indexing performance once the clouds reached about 15 full collections (or 
about 80-100 Billion documents per cloud) in the two biggest clouds. Based 
on current application logging I’m seeing a 40% drop off in indexing 
performance. Because of this, I have concerns on how performance will hold 
as more collections are added.


My question to the community is if anyone else has had any experience in 
using Solr at this scale (hundreds of Billions) and if anyone has observed 
such a decline in indexing performance as the number of collections 
increases. My understanding is that each collection is a separate index and 
therefore the inserting rate should remain constant. Aside from that, what 
other tweaks or changes can be done in the SolrCloud configuration to 
increase the rate of indexing performance? Am I hitting a hard limitation of 
what Solr can handle?


Thanks,
Scott



RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Wilburn, Scott
Thanks for replying Jack. I have 4 SolrCloud instances (or clusters), each 
consisting of 32 shards. The clusters do not have any interaction with each 
other.  

Thanks,
Scott 


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Could you clarify what you mean by the term cloud, as in per cloud and 
individual clouds? That's not a proper Solr or SolrCloud concept per se. 
SolrCloud works with a single cluster of nodes. And there is no interaction 
between separate SolrCloud clusters.

-- Jack Krupansky

-Original Message-
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple documents 
and have run into some performance and scalability limitations and was 
wondering what can be done about it.

Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards and each node has 128GB of memory. The current SolrCloud setup is 
split into 4 separate and individual clouds of 32 shards each thereby giving 
four running shards per cloud or one cloud per eight nodes. Each shard is 
currently assigned a 6GB heap size. I’d prefer to avoid increasing heap memory 
for Solr shards to have enough to run other MapReduce jobs on the cluster.

The rate of documents that I am currently inserting into these clouds per day 
is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into 
the fourth ; however to account for capacity, the aim is to scale the solution 
to support double that amount of documents. To index these documents, there are 
MapReduce jobs that run that generate the Solr XML documents and will then 
submit these documents via SolrJ's CloudSolrServer interface. In testing, I 
have found that limiting the number of active parallel inserts to 80 per cloud 
gave the best performance as anything higher gave diminishing returns, most 
likely due to the constant shuffling of documents internally to SolrCloud. From 
an index perspective, dated collections are being created to hold an entire 
day's of documents and generally the inserting happens primarily on the current 
day (the previous days are only to allow for searching) and the plan is to keep 
up to 60 days (or collections) in each cloud. A single shard index in one 
collection in the busiest cloud currently takes up 30G disk space or 960G for 
the entire collection. The documents are being auto committed with a hard 
commit time of 4 minutes (opensearcher = false) and soft commit time of 8 
minutes.

From a search perspective, the use case is fairly generic and simple searches 
of the type field:value, so there is no need to tune the system to use any of the more 
advanced querying features. Therefore, the most important thing for me is to 
have the indexing performance be able to keep up with the rate of input.

In the initial load testing, I was able to achieve a projected indexing rate of 
10 Billion documents per cloud per day for a grand total of 40 Billion per day. 
However, the initial load testing was done on fairly empty clouds with just a 
few small collections. Now that there have been several days of documents being 
indexed, I am starting to see a fairly steep drop-off in indexing performance 
once the clouds reached about 15 full collections (or about 80-100 Billion 
documents per cloud) in the two biggest clouds. Based on current application 
logging I’m seeing a 40% drop off in indexing performance. Because of this, I 
have concerns on how performance will hold as more collections are added.

My question to the community is if anyone else has had any experience in using 
Solr at this scale (hundreds of Billions) and if anyone has observed such a 
decline in indexing performance as the number of collections increases. My 
understanding is that each collection is a separate index and therefore the 
inserting rate should remain constant. Aside from that, what other tweaks or 
changes can be done in the SolrCloud configuration to increase the rate of 
indexing performance? Am I hitting a hard limitation of what Solr can handle?

Thanks,
Scott



RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Toke Eskildsen
Wilburn, Scott [scott.wilb...@verizonwireless.com.INVALID] wrote:
 Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
 Solr shards and
 each node has 128GB of memory. The current SolrCloud setup is split into 4  
 separate and
 individual clouds of 32 shards each thereby giving four running shards per 
 cloud or one
 cloud per eight nodes.

You mean 4 running shards per node, right? With 6GB/shard that leaves about 
100GB RAM for everything else on each node.

[Snip: 10 billion insertions/day]

That is nearly 4000 insertions/second per node. Quite a lot.

 A single shard index in one collection in the busiest cloud currently takes 
 up 30G disk space
 or 960G for the entire collection. The documents are being auto committed 
 with a hard
 commit time of 4 minutes (opensearcher = false) and soft commit time of 8 
 minutes.

And you have 4 of these collections, so each node holds about 120GB of index 
with heavy updating?

 In the initial load testing, I was able to achieve a projected indexing rate 
 of 10 Billion
 documents per cloud per day for a grand total of 40 Billion per day. However, 
 the initial load
 testing was done on fairly empty clouds with just a few small collections. 
 Now that there have
 been several days of documents being indexed, I am starting to see a fairly 
 steep drop-off in
 indexing performance once the clouds reached about 15 full collections [...]

If a single collection takes 30GB per node and you have 15 now, that means your 
indexes take up about 450GB on each node, which has less than 100GB of free memory. 
Not everything is disk cached, and since you are doing searches while you index, 
your indexer must compete for the disk cache. It seems natural that this would 
slow down indexing, with the slowdown getting progressively worse as you fill 
the storage with active indexes.

If you could isolate the old collections from the ones being updated, you 
could avoid this cache competition. Or you could of course throw more hardware 
at the problem: Are you stuck on spinning drives or are you using SSDs?

- Toke Eskildsen


RE: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Markus Jelsma
Hi - You are running mapred jobs on the same nodes as Solr runs, right? The 
first thing I would think of is that your OS file buffer cache is abused. The 
mappers read all data, presumably residing on the same node. The mapper output 
and shuffling part would take place on the same node; only the reducer output 
is sent to your nodes, which I assume are on the same machines. Those same 
machines have a large Lucene index. All this data, written to and read from the 
same disk, competes for a nice spot in the OS buffer cache.

Forget it if I misread anything, but when you're dealing with data of this size, 
do not abuse your caches. Have a separate mapred and Solr cluster, because 
they both eat cache space. I assume you can see serious IO WAIT times. 

Split the stuff and maybe even use smaller hardware, but more of it.

M 
 
-Original message-
 From:Wilburn, Scott scott.wilb...@verizonwireless.com.INVALID
 Sent: Wednesday 13th August 2014 23:09
 To: solr-user@lucene.apache.org
 Subject: Solr cloud performance degradation with billions of documents
 
 Hello everyone,
 I am trying to use SolrCloud to index a very large number of simple documents 
 and have run into some performance and scalability limitations and was 
 wondering what can be done about it.
 
 Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
 Solr shards and each node has 128GB of memory. The current SolrCloud setup is 
 split into 4 separate and individual clouds of 32 shards each thereby giving 
 four running shards per cloud or one cloud per eight nodes. Each shard is 
 currently assigned a 6GB heap size. I’d prefer to avoid increasing heap 
 memory for Solr shards to have enough to run other MapReduce jobs on the 
 cluster.
 
 The rate of documents that I am currently inserting into these clouds per day 
 is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion into 
 the fourth ; however to account for capacity, the aim is to scale the 
 solution to support double that amount of documents. To index these 
 documents, there are MapReduce jobs that run that generate the Solr XML 
 documents and will then submit these documents via SolrJ's CloudSolrServer 
 interface. In testing, I have found that limiting the number of active 
 parallel inserts to 80 per cloud gave the best performance as anything higher 
 gave diminishing returns, most likely due to the constant shuffling of 
 documents internally to SolrCloud. From an index perspective, dated 
 collections are being created to hold an entire day's of documents and 
 generally the inserting happens primarily on the current day (the previous 
 days are only to allow for searching) and the plan is to keep up to 60 days 
 (or collections) in each cloud. A single shar
 d index in one collection in the busiest cloud currently takes up 30G disk 
space or 960G for the entire collection. The documents are being auto committed 
with a hard commit time of 4 minutes (opensearcher = false) and soft commit 
time of 8 minutes.
 
 From a search perspective, the use case is fairly generic and simple searches 
 of the type field:value, so there is no need to tune the system to use any of the more 
 advanced querying features. Therefore, the most important thing for me is to 
 have the indexing performance be able to keep up with the rate of input.
 
 In the initial load testing, I was able to achieve a projected indexing rate 
 of 10 Billion documents per cloud per day for a grand total of 40 Billion per 
 day. However, the initial load testing was done on fairly empty clouds with 
 just a few small collections. Now that there have been several days of 
 documents being indexed, I am starting to see a fairly steep drop-off in 
 indexing performance once the clouds reached about 15 full collections (or 
 about 80-100 Billion documents per cloud) in the two biggest clouds. Based on 
 current application logging I’m seeing a 40% drop off in indexing 
 performance. Because of this, I have concerns on how performance will hold as 
 more collections are added.
 
 My question to the community is if anyone else has had any experience in 
 using Solr at this scale (hundreds of Billions) and if anyone has observed 
 such a decline in indexing performance as the number of collections 
 increases. My understanding is that each collection is a separate index and 
 therefore the inserting rate should remain constant. Aside from that, what 
 other tweaks or changes can be done in the SolrCloud configuration to 
 increase the rate of indexing performance? Am I hitting a hard limitation of 
 what Solr can handle?
 
 Thanks,
 Scott 
 
 


Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Jack Krupansky
Be careful when you say instance - that usually refers to a single Solr 
node. Anyway...


32 shards - with a replication factor of 1?

So, given your worst case here, 5 billion documents in a 32-node cluster, 
that's 156 million documents per node. What is the index size on a typical 
node? And how much system memory is available for caching of file reads? 
Generally, you want to have enough system memory to cache the full index. Or 
do you have SSD?


But please clarify what you mean by about 80-100 Billion documents per 
cloud. Is it really 5 billion total, refreshed every day, or 5 billion 
added per day and lots of days stored?


If you start seeing indexing rate drop off, that could be caused by not 
having enough system memory (RAM) to cache the full index. In particular, 
Lucene will occasionally be performing index merges, which would otherwise 
be I/O-intensive.


I would start with a rule of thumb of 100 million documents per node (and 
that is million, not billion.) That could be a lot higher - or a lot lower - 
based on your actual schema and data value distribution.


-- Jack Krupansky

-Original Message- 
From: Wilburn, Scott

Sent: Wednesday, August 13, 2014 5:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr cloud performance degradation with billions of documents

Thanks for replying, Jack. I have 4 SolrCloud instances (or clusters), each 
consisting of 32 shards. The clusters do not have any interaction with each 
other.


Thanks,
Scott


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, August 13, 2014 2:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud performance degradation with billions of documents

Could you clarify what you mean by the term "cloud", as in "per cloud" and 
"individual clouds"? That's not a proper Solr or SolrCloud concept per se.
SolrCloud works with a single cluster of nodes, and there is no 
interaction between separate SolrCloud clusters.


-- Jack Krupansky

-Original Message-
From: Wilburn, Scott
Sent: Wednesday, August 13, 2014 5:08 PM
To: solr-user@lucene.apache.org
Subject: Solr cloud performance degradation with billions of documents

Hello everyone,
I am trying to use SolrCloud to index a very large number of simple 
documents and have run into some performance and scalability limitations and 
was wondering what can be done about it.


Hardware wise, I have a 32-node Hadoop cluster that I use to run all of the 
Solr shards, and each node has 128GB of memory. The current SolrCloud setup 
is split into 4 separate and individual clouds of 32 shards each, thereby 
giving four running shards per node, or one cloud per eight nodes. Each 
shard is currently assigned a 6GB heap. I'd prefer to avoid increasing the 
heap size of the Solr shards so that enough memory is left to run other 
MapReduce jobs on the cluster.


The rate of documents that I am currently inserting into these clouds per 
day is 5 Billion each in two clouds, 3 Billion into the third, and 2 Billion 
into the fourth; however, to account for capacity, the aim is to scale the 
solution to support double that amount of documents. To index these 
documents, there are MapReduce jobs that generate the Solr XML documents and 
then submit them via SolrJ's CloudSolrServer interface. In testing, I have 
found that limiting the number of active parallel inserts to 80 per cloud 
gave the best performance, as anything higher gave diminishing returns, most 
likely due to the constant shuffling of documents internally to SolrCloud. 
From an index perspective, dated collections are being created to hold an 
entire day's worth of documents; inserting happens primarily on the current 
day (the previous days are kept only for searching), and the plan is to keep 
up to 60 days (or collections) in each cloud. A single shard index in one 
collection in the busiest cloud currently takes up 30G of disk space, or 
960G for the entire collection. The documents are being auto committed with 
a hard commit time of 4 minutes (openSearcher=false) and a soft commit time 
of 8 minutes.
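
As an illustration of how the daily dated collections might be created from a scheduler, 
here is a sketch using the Collections API through SolrJ 4.x. The host, collection naming 
scheme, config set name, and maxShardsPerNode value are placeholders and assumptions, not 
details from this thread.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateDailyCollection {
    public static void main(String[] args) throws Exception {
        // Any node in the cloud can receive Collections API calls.
        HttpSolrServer server = new HttpSolrServer("http://solr-node1:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "docs_20140818");            // one collection per day
        params.set("numShards", 32);
        params.set("replicationFactor", 1);
        params.set("maxShardsPerNode", 4);              // adjust to the intended shards-per-node layout
        params.set("collection.configName", "docs_conf");

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);

        server.shutdown();
    }
}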


From a search perspective, the use case is fairly generic and simple 
searches of the type :, so there is no need to tune the system to use any of 
the more advanced querying features. Therefore, the most important thing for 
me is to have the indexing performance be able to keep up with the rate of 
input.


In the initial load testing, I was able to achieve a projected indexing rate 
of 10 Billion documents per cloud per day for a grand total of 40 Billion 
per day. However, the initial load testing was done on fairly empty clouds 
with just a few small collections. Now that there have been several days of 
documents being indexed, I am starting to see a fairly steep drop-off in 
indexing performance once the clouds reached about 15 full collections (or 
about 80-100 Billion documents per cloud) in the two biggest clouds. Based 
on current application logging, I'm seeing a 40% drop-off in indexing 
performance. Because of this, I have concerns about how performance will 
hold up as more collections are added.

Re: Solr cloud performance degradation with billions of documents

2014-08-13 Thread Erick Erickson
Several points:

1 Have you considered using the MapReduceIndexerTool for your ingestion?
Assuming you don't have duplicate IDs, i.e. each doc is new, you can spread
your indexing across as many nodes as you have in your cluster. That said,
it's not entirely clear that you'll gain throughput since you have as many
nodes as you do.

2 Um, fitting this many documents into 6G of memory is ambitious. Very
ambitious. Actually it's impossible. By my calculations:
bq: 4 separate and individual clouds of 32 shards each
so 128 shards in aggregate

bq:  inserting into these clouds per day is 5 Billion each in two clouds, 3
Billion into the third, and 2 Billion into the fourth
so we're talking 15B docs/day

bq: the plan is to keep up to 60 days...
So we're talking 900B documents.

It just won't work. 900B docs across 128 shards is over 7B documents per
shard on average. Your two larger collections will have more than that, the
two smaller ones less. But it doesn't matter, because:
1: Lucene has a hard limit of roughly 2.1B docs per core (shard), the maximum
positive signed int.
2: It ain't gonna fit in 6G of memory even without this limit, I'm pretty
sure.
3: I've rarely heard of a single shard coping with over 300M docs without
performance issues. I usually start getting nervous around 100M and insist
on stress testing. Of course it depends a lot on your query profile.

So you're going to need a LOT more shards. You might be able to squeeze
some more from your hardware by hosting multiple shards of each collection
on each machine, but I'm pretty sure your present setup is inadequate for
your projected load.
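
To make that back-of-the-envelope shard math concrete, a small sketch using the figures 
from this thread; the 300M docs-per-shard budget is the comfort level mentioned above, not 
a hard limit.

public class ShardEstimate {
    public static void main(String[] args) {
        long docsPerDay = 15000000000L;        // 5B + 5B + 3B + 2B per day across the four clouds
        int retentionDays = 60;                // dated collections kept per cloud
        long perShardBudget = 300000000L;      // ~300M docs/shard comfort level; Lucene's hard cap is ~2.1B
        int currentShards = 4 * 32;            // present layout: 4 clouds of 32 shards

        long totalDocs = docsPerDay * retentionDays;                        // 900B documents retained
        long shardsNeeded = (totalDocs + perShardBudget - 1) / perShardBudget;

        System.out.println("documents retained        : " + totalDocs);
        System.out.println("shards needed at 300M each: " + shardsNeeded);               // ~3000
        System.out.println("docs per shard today      : " + totalDocs / currentShards);  // ~7B, past the 2.1B cap
    }
}

Under those assumptions the retained data calls for roughly 3,000 shards, against the 128 
in the current setup.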

Of course I may be misinterpreting what you're saying hugely, but from what
I understand this system just won't work.

Best,
Erick




On Wed, Aug 13, 2014 at 2:39 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - You are running mapred jobs on the same nodes as Solr runs, right? The
 first thing I would think of is that your OS file buffer cache is abused.
 The mappers read all data, presumably residing on the same node. The mapper
 output and shuffling would take place on the same node; only the reducer
 output is sent to your Solr nodes, which I assume are the same machines.
 Those same machines have a large Lucene index. All this data, written to
 and read from the same disk, competes for a nice spot in the OS buffer
 cache.

 Forget it if I misread anything, but when you're working at this kind of
 scale, do not abuse your caches. Have a separate mapred cluster and a
 separate Solr cluster, because they both eat cache space. I assume you can
 see serious IO WAIT times.

 Split the stuff and maybe even use smaller hardware, but more.

 M

 -Original message-
  From:Wilburn, Scott scott.wilb...@verizonwireless.com.INVALID
  Sent: Wednesday 13th August 2014 23:09
  To: solr-user@lucene.apache.org
  Subject: Solr cloud performance degradation with billions of documents
 
  Hello everyone,
  I am trying to use SolrCloud to index a very large number of simple
 documents and have run into some performance and scalability limitations
 and was wondering what can be done about it.
 
  Hardware wise, I have a 32-node Hadoop cluster that I use to run all of
 the Solr shards, and each node has 128GB of memory. The current SolrCloud
 setup is split into 4 separate and individual clouds of 32 shards each,
 thereby giving four running shards per node, or one cloud per eight nodes.
 Each shard is currently assigned a 6GB heap. I'd prefer to avoid increasing
 the heap size of the Solr shards so that enough memory is left to run other
 MapReduce jobs on the cluster.
 
  The rate of documents that I am currently inserting into these clouds
 per day is 5 Billion each in two clouds, 3 Billion into the third, and 2
 Billion into the fourth; however, to account for capacity, the aim is to
 scale the solution to support double that amount of documents. To index
 these documents, there are MapReduce jobs that generate the Solr XML
 documents and then submit them via SolrJ's CloudSolrServer interface. In
 testing, I have found that limiting the number of active parallel inserts
 to 80 per cloud gave the best performance, as anything higher gave
 diminishing returns, most likely due to the constant shuffling of documents
 internally to SolrCloud. From an index perspective, dated collections are
 being created to hold an entire day's worth of documents; inserting happens
 primarily on the current day (the previous days are kept only for
 searching), and the plan is to keep up to 60 days (or collections) in each
 cloud. A single shard index in one collection in the busiest cloud
 currently takes up 30G of disk space, or 960G for the entire collection.
 The documents are being auto committed with a hard commit time of 4 minutes
 (openSearcher=false) and a soft commit time of 8 minutes.
 
  From a search perspective, the use case is fairly generic and simple
 searches of the type :, so there is no need to tune the system to use any
 of the more