RE: Solr maximum Optimal Index Size per Shard

2014-06-18 Thread Toke Eskildsen
Toke Eskildsen [t...@statsbiblioteket.dk] wrote:

[Toke: SSDs with 2.7TB of index on a 256GB machine]

 tl;dr: for small result sets (< 1M hits) on unwarmed searches with
 simple queries, response time is below 100ms. If we enable faceting with
 plain Solr, this jumps to about 1 second.

 I did a top on the machine and it says that 50GB is currently used for
 caching, so an 80GB (and probably less) machine would work fine for our
 2.7TB index.

So we actually tried this: 3.6TB (4 shards) and 80GB of RAM, leaving a little 
less than 40GB for caching: 40GB / 3,600GB ~= 1% of the index size.

This performed quite well, with faceting times still around 1 second and 
non-faceted search a lot lower. There's a writeup at
http://sbdevel.wordpress.com/2014/06/17/terabyte-index-search-and-faceting-with-solr/

- Toke Eskildsen, State and University Library, Denmark




Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Vineet Mishra
Hi Shawn,

Thanks for your response, wanted to clarify a few things.

*Does that mean that for smooth querying we need memory at least equal to
or greater than the size of the index? In my case the index will be very
large (~2TB), and practically speaking that amount of memory is not possible.
Even if it is split across multiple shards, say around 10, then 200GB of RAM
per shard will still not be a feasible option.

*With CloudSolrServer, can we specify which shard a particular document
should go to and reside on? With EmbeddedSolrServer I can do this by indexing
into different directories and moving them to the appropriate shard
directories.
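
For reference, one approach I have seen suggested is SolrCloud's implicit
router, where the client names the target shard on each update request. A
rough sketch of what I have in mind, assuming SolrJ 4.x and a collection
created with router.name=implicit -- the _route_ parameter, ZooKeeper
addresses and shard name here are assumptions on my part, not something
confirmed in this thread:

// Hypothetical sketch: explicit shard targeting via the implicit router.
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class RouteToShard {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("_route_", "shard2"); // send this update to the shard named "shard2"
        req.process(server);

        server.commit();
        server.shutdown();
    }
}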

Thanks!



On Wed, Jun 4, 2014 at 12:43 PM, Shawn Heisey s...@elyograg.org wrote:

 On 6/4/2014 12:45 AM, Vineet Mishra wrote:
  Thanks all for your response.
  I presume this conversation concludes that indexing around 1 billion
  documents per shard won't be a problem. As I have 10 billion docs to
  index, approx 10 shards with 1 billion each should be fine. How about
  memory -- what size of RAM would suit this amount of data?

 Figure out the memory requirements of the operating system and every
 program on the machine (especially Solr's heap).  Then add that number
 to the total size of the index data on the machine.  That is the ideal
 minimum RAM.

 http://wiki.apache.org/solr/SolrPerformanceProblems

 Unfortunately, if you are dealing with a huge index with billions of
 documents, it is likely to be prohibitively expensive to buy that much
 RAM.  If you are running Solr on Amazon's cloud, the cost for that much
 RAM would be astronomical.

 Exactly how much RAM would actually be required is very difficult to
 predict.  If you had only 25% of the ideal, your index might have
 perfectly acceptable performance, or it might not.  It might do fine
 under a light query load, but if you increase to 50 queries per second,
 performance may drop significantly ... or it might be good.  It's
 generally not possible to know how your hardware will perform until you
 actually build and use your index.


 http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

 A general rule of thumb for RAM that I have found to be useful is that
 if you've got less than half of the ideal memory size, you might have
 performance problems.

  Moreover, what should the indexing technique be for this huge data set?
  Currently I am indexing with EmbeddedSolrServer, but it becomes
  pathetically slow after some 20GB of indexing. I had assumed HTTP posting
  would be slower due to network delays and response waits, but after this
  long EmbeddedSolrServer run I am getting a different notion.
  Any good indexing technique for this huge dataset would be highly
  appreciated.

 EmbeddedSolrServer is not recommended.  Run Solr in the traditional way
 with HTTP connectivity.  HTTP overhead on a LAN is usually quite small.
  Solr is fully thread-safe, so you can have several indexing threads all
 going at the same time.

 Indexes at this scale should normally be built with SolrCloud, with
 enough servers so that each machine is only handling one shard replica.
  The ideal indexing program would be written in Java, using
 CloudSolrServer.

 Thanks,
 Shawn




Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Vineet Mishra
Hey Jack,

Well, I have indexed around 10 million documents, consuming 20GB of index
size.
Each document consists of nearly 100 string fields, with up to 10
characters of data per field.
In my case the number of fields per document can grow considerably (from
the current 100 to 500 or even more).

Beyond that exceptional limiting case, I was more interested in a way to
maintain an even, sensible ratio of index size to number of shards.

Thanks!


On Wed, Jun 4, 2014 at 7:47 PM, Jack Krupansky j...@basetechnology.com
wrote:

 How many documents were in that 20GB index?

 I'm skeptical that a 1 billion document shard won't be a problem. I mean
 technically it is possible, but as you are already experiencing, it may
 take a long time and a very powerful machine to do so. 100 million (or 250
 million max) would be a more realistic goal. Even then, it depends on your
 doc size and machine size.

 The main point from the previous discussion is that although the technical
 hard limit for a Solr shard is 2G docs, from a practical perspective it is
 very difficult to get to that limit, not that indexing 1 billion docs on a
 single shard is just fine!

 As a general rule, if you want fast queries for high volume, strive to
 ensure that your per-shard index fits entirely into the system memory
 available for OS caching of file system pages.

 In any case, a proof of concept implementation will tell you everything
 you need to know.


 -- Jack Krupansky

 -Original Message- From: Vineet Mishra
 Sent: Wednesday, June 4, 2014 2:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr maximum Optimal Index Size per Shard


 Thanks all for your response.
 I presume this conversation concludes that indexing around 1 billion
 documents per shard won't be a problem. As I have 10 billion docs to
 index, approx 10 shards with 1 billion each should be fine. How about
 memory -- what size of RAM would suit this amount of data?
 Moreover, what should the indexing technique be for this huge data set?
 Currently I am indexing with EmbeddedSolrServer, but it becomes
 pathetically slow after some 20GB of indexing. I had assumed HTTP posting
 would be slower due to network delays and response waits, but after this
 long EmbeddedSolrServer run I am getting a different notion.
 Any good indexing technique for this huge dataset would be highly
 appreciated.

 Thanks again!


 On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote:

  mark.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-maximum-Optimal-Index-Size-per-Shard-tp4139565p4139698.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Toke Eskildsen
On Fri, 2014-06-06 at 12:32 +0200, Vineet Mishra wrote:
 *Does that mean that for smooth querying we need memory at least equal to
 or greater than the size of the index?

If you absolutely, positively have to reduce latency as much as
possible, then yes. With an estimated index size of 2TB, I would guess
that 10-20 machines with powerful CPUs (1 per shard per expected
concurrent request) would also be advisable. While you're at it, do make
sure that you're using high-speed memory.

That was not a serious suggestion, should you be in doubt. Very few
people need the best latency possible. Most just need the individual
searches to be fast enough and want to scale throughput instead.

 In my case the index will be very large (~2TB), and practically speaking
 that amount of memory is not possible. Even if it is split across
 multiple shards, say around 10, then 200GB of RAM per shard will still
 not be a feasible option.

We're building a projected 24TB index collection and are currently at
2.7TB+, growing by about 1TB every 10 days. Our current plan is to use a
single machine with 256GB of RAM, but we will of course adjust along the
way if it proves to be too small.

Requirements differ with the corpus and the needs, but for us, SSDs as
storage seem to provide quite enough of a punch. I did a little testing
yesterday: https://plus.google.com/u/0/+TokeEskildsen/posts/4yPvzrQo8A7

tl;dr: for small result sets (< 1M hits) on unwarmed searches with
simple queries, response time is below 100ms. If we enable faceting with
plain Solr, this jumps to about 1 second.

I did a top on the machine and it says that 50GB is currently used for
caching, so an 80GB (and probably less) machine would work fine for our
2.7TB index.


- Toke Eskildsen, State and University Library, Denmark




Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Vineet Mishra
Hi Toke,

That was spectacular -- really great to hear that you have already indexed
2.7TB+ of data on your server, and that the query response time is still
under 100ms, or a few seconds, for such a huge dataset.
Could you state which indexing mechanism you are using? I started with
EmbeddedSolrServer, but it was pretty slow after a few GB (~30+) of
indexing. I started indexing 1 week back and it is still at 37GB, although
I assume an HttpPost mechanism will perform lethargically due to network
latency and waiting for responses. Furthermore, I tried CloudSolrServer but
am facing a weird exception -- a ClassCastException ("Cannot cast to
Exception") -- while adding the SolrInputDocument to the server.

CloudSolrServer server1 = new
CloudSolrServer("zkHost:port1,zkHost:port2,zkHost:port3", false);
server1.setDefaultCollection("mycollection");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("ID", 123);
doc.addField("A0_s", 282628854);

server1.add(doc); // Error at this line
server1.commit();

Thanks again, Toke, for sharing those stats.


On Fri, Jun 6, 2014 at 5:04 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 On Fri, 2014-06-06 at 12:32 +0200, Vineet Mishra wrote:
  *Does that mean that for smooth querying we need memory at least equal
  to or greater than the size of the index?

 If you absolutely, positively have to reduce latency as much as
 possible, then yes. With an estimated index size of 2TB, I would guess
 that 10-20 machines with powerful CPUs (1 per shard per expected
 concurrent request) would also be advisable. While you're at it, do make
 sure that you're using high-speed memory.

 That was not a serious suggestion, should you be in doubt. Very few
 people need the best latency possible. Most just need the individual
 searches to be fast enough and want to scale throughput instead.

  In my case the index will be very large (~2TB), and practically
  speaking that amount of memory is not possible. Even if it is split
  across multiple shards, say around 10, then 200GB of RAM per shard
  will still not be a feasible option.

 We're building a projected 24TB index collection and are currently at
 2.7TB+, growing by about 1TB every 10 days. Our current plan is to use a
 single machine with 256GB of RAM, but we will of course adjust along the
 way if it proves to be too small.

 Requirements differ with the corpus and the needs, but for us, SSDs as
 storage seem to provide quite enough of a punch. I did a little testing
 yesterday: https://plus.google.com/u/0/+TokeEskildsen/posts/4yPvzrQo8A7

 tl;dr: for small result sets (< 1M hits) on unwarmed searches with
 simple queries, response time is below 100ms. If we enable faceting with
 plain Solr, this jumps to about 1 second.

 I did a top on the machine and it says that 50GB is currently used for
 caching, so an 80GB (and probably less) machine would work fine for our
 2.7TB index.


 - Toke Eskildsen, State and University Library, Denmark





Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Toke Eskildsen
On Fri, 2014-06-06 at 14:05 +0200, Vineet Mishra wrote:

  Could you state which indexing mechanism you are using? I started
 with EmbeddedSolrServer, but it was pretty slow after a few GB (~30+)
 of indexing.

I suspect that is due to too-frequent commits, too small a heap, or some
third thing unrelated to EmbeddedSolrServer itself. Underneath the
surface it is just the same as a standalone Solr.

We're building our ~1TB indexes individually, using standalone workers
for the heavy part of the analysis (Tika). The delivery from the workers
to the Solr server is over the network, using the Solr binary protocol.
My colleague Thomas Egense just created a small write-up at
https://github.com/netarchivesuite/netsearch

  I started indexing 1 week back and it is still at 37GB, although I
 assume an HttpPost mechanism will perform lethargically due to network
 latency and waiting for responses.

Maybe so, if you send the documents one at a time, but if you bundle them
into larger updates, the POST method should be fine.
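
As a rough illustration of what a bundled update looks like, here is a
minimal SolrJ 4.x sketch using the binary (javabin) request writer -- the
host, collection, field names and batch size are placeholders, not a
description of our actual setup:

// Sketch: bundle documents into batches and ship them as javabin,
// instead of posting one document (or one XML payload) at a time.
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1");
        server.setRequestWriter(new BinaryRequestWriter()); // Solr binary protocol

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            batch.add(doc);
            if (batch.size() == 1000) { // send bundles, not single documents
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
        server.shutdown();
    }
}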

- Toke Eskildsen, State and University Library, Denmark




Re: Solr maximum Optimal Index Size per Shard

2014-06-06 Thread Vineet Mishra
Earlier I used to index with the HttpPost mechanism only, keeping each post
between 2MB and 20MB, and that was going fine. But we suspected that,
instead of indexing through network calls (which of course incur latency
from network delays and the HTTP protocol), it would be much better to
build the index offline by simply writing it out and dumping it onto the
shards.

At the moment I commit after each batch of 25K docs; I will try replacing
that with commitWithin (it seems to work faster), and will probably also
have a look at this binary protocol.
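
For reference, commitWithin in SolrJ is just an extra argument to add() --
a sketch, assuming SolrJ 4.x, where the URL and the 10-second window are
arbitrary examples:

// Sketch: let Solr schedule the commit instead of committing per batch.
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1");
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        docs.add(doc);
        server.add(docs, 10000); // visible within 10 seconds; no explicit commit()
        server.shutdown();
    }
}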

Thanks!




On Fri, Jun 6, 2014 at 5:55 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 On Fri, 2014-06-06 at 14:05 +0200, Vineet Mishra wrote:

   Could you state which indexing mechanism you are using? I started
   with EmbeddedSolrServer, but it was pretty slow after a few GB (~30+)
   of indexing.

 I suspect that is due to too-frequent commits, too small a heap, or some
 third thing unrelated to EmbeddedSolrServer itself. Underneath the
 surface it is just the same as a standalone Solr.

 We're building our ~1TB indexes individually, using standalone workers
 for the heavy part of the analysis (Tika). The delivery from the workers
 to the Solr server is over the network, using the Solr binary protocol.
 My colleague Thomas Egense just created a small write-up at
 https://github.com/netarchivesuite/netsearch

   I started indexing 1 week back and it is still at 37GB, although I
  assume an HttpPost mechanism will perform lethargically due to network
  latency and waiting for responses.

 Maybe so, if you send the documents one at a time, but if you bundle them
 into larger updates, the POST method should be fine.

 - Toke Eskildsen, State and University Library, Denmark





Re: Solr maximum Optimal Index Size per Shard

2014-06-04 Thread Vineet Mishra
Thanks all for your response.
I presume this conversation concludes that indexing around 1 billion
documents per shard won't be a problem. As I have 10 billion docs to
index, approx 10 shards with 1 billion each should be fine. How about
memory -- what size of RAM would suit this amount of data?
Moreover, what should the indexing technique be for this huge data set?
Currently I am indexing with EmbeddedSolrServer, but it becomes
pathetically slow after some 20GB of indexing. I had assumed HTTP posting
would be slower due to network delays and response waits, but after this
long EmbeddedSolrServer run I am getting a different notion.
Any good indexing technique for this huge dataset would be highly
appreciated.

Thanks again!


On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote:

 mark.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-maximum-Optimal-Index-Size-per-Shard-tp4139565p4139698.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr maximum Optimal Index Size per Shard

2014-06-04 Thread Shawn Heisey
On 6/4/2014 12:45 AM, Vineet Mishra wrote:
 Thanks all for your response.
 I presume this conversation concludes that indexing around 1 billion
 documents per shard won't be a problem. As I have 10 billion docs to
 index, approx 10 shards with 1 billion each should be fine. How about
 memory -- what size of RAM would suit this amount of data?

Figure out the memory requirements of the operating system and every
program on the machine (especially Solr's heap).  Then add that number to
the total size of the index data on the machine.  That is the ideal
minimum RAM.  For example, a machine hosting a 200GB shard with a 6GB Solr
heap would ideally have a little over 206GB of RAM.

http://wiki.apache.org/solr/SolrPerformanceProblems

Unfortunately, if you are dealing with a huge index with billions of
documents, it is likely to be prohibitively expensive to buy that much
RAM.  If you are running Solr on Amazon's cloud, the cost for that much
RAM would be astronomical.

Exactly how much RAM would actually be required is very difficult to
predict.  If you had only 25% of the ideal, your index might have
perfectly acceptable performance, or it might not.  It might do fine
under a light query load, but if you increase to 50 queries per second,
performance may drop significantly ... or it might be good.  It's
generally not possible to know how your hardware will perform until you
actually build and use your index.

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

A general rule of thumb for RAM that I have found to be useful is that
if you've got less than half of the ideal memory size, you might have
performance problems.

 Moreover, what should the indexing technique be for this huge data set?
 Currently I am indexing with EmbeddedSolrServer, but it becomes
 pathetically slow after some 20GB of indexing. I had assumed HTTP posting
 would be slower due to network delays and response waits, but after this
 long EmbeddedSolrServer run I am getting a different notion.
 Any good indexing technique for this huge dataset would be highly
 appreciated.

EmbeddedSolrServer is not recommended.  Run Solr in the traditional way
with HTTP connectivity.  HTTP overhead on a LAN is usually quite small.
 Solr is fully thread-safe, so you can have several indexing threads all
going at the same time.

Indexes at this scale should normally be built with SolrCloud, with
enough servers so that each machine is only handling one shard replica.
 The ideal indexing program would be written in Java, using CloudSolrServer.
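
A bare-bones sketch of that shape, assuming SolrJ 4.x -- the ZooKeeper
ensemble, collection name, thread count and batch size are placeholders,
not recommendations:

// Sketch: several indexing threads sharing one CloudSolrServer instance.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelCloudIndexer {
    public static void main(String[] args) throws Exception {
        final CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");

        ExecutorService pool = Executors.newFixedThreadPool(4); // several indexing threads
        for (int t = 0; t < 4; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        for (int i = 0; i < 250000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", threadId + "-" + i);
                            batch.add(doc);
                            if (batch.size() == 1000) {
                                server.add(batch); // safe to call from multiple threads
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) server.add(batch);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
        server.shutdown();
    }
}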

Thanks,
Shawn



Re: Solr maximum Optimal Index Size per Shard

2014-06-04 Thread Jack Krupansky

How many documents were in that 20GB index?

I'm skeptical that a 1 billion document shard won't be a problem. I mean 
technically it is possible, but as you are already experiencing, it may take 
a long time and a very powerful machine to do so. 100 million (or 250 
million max) would be a more realistic goal. Even then, it depends on your 
doc size and machine size.


The main point from the previous discussion is that although the technical 
hard limit for a Solr shard is 2G docs, from a practical perspective it is 
very difficult to get to that limit, not that indexing 1 billion docs on a 
single shard is just fine!


As a general rule, if you want fast queries for high volume, strive to
ensure that your per-shard index fits entirely into the system memory
available for OS caching of file system pages.


In any case, a proof of concept implementation will tell you everything you 
need to know.


-- Jack Krupansky

-Original Message- 
From: Vineet Mishra

Sent: Wednesday, June 4, 2014 2:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr maximum Optimal Index Size per Shard

Thanks all for your response.
I presume this conversation concludes that indexing around 1 billion
documents per shard won't be a problem. As I have 10 billion docs to
index, approx 10 shards with 1 billion each should be fine. How about
memory -- what size of RAM would suit this amount of data?
Moreover, what should the indexing technique be for this huge data set?
Currently I am indexing with EmbeddedSolrServer, but it becomes
pathetically slow after some 20GB of indexing. I had assumed HTTP posting
would be slower due to network delays and response waits, but after this
long EmbeddedSolrServer run I am getting a different notion.
Any good indexing technique for this huge dataset would be highly
appreciated.

Thanks again!


On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote:


mark.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-maximum-Optimal-Index-Size-per-Shard-tp4139565p4139698.html
Sent from the Solr - User mailing list archive at Nabble.com.





Solr maximum Optimal Index Size per Shard

2014-06-03 Thread Vineet Mishra
Hi All,

Has anyone come across a maximum threshold, in document count or size, for
what each Solr core can hold?
I have indexed some 10 million documents (18GB), and when I index another
5 million documents (9GB) on top of those, it responds a little slowly to
stats queries.

Considering I have around 2TB of data to index, what would be an
appropriate, balanced proportion of data vs. number of shards?

It's more a case of indexing big data for NRT.
Looking forward to your response.
Urgent!

Thanks!


Re: Solr maximum Optimal Index Size per Shard

2014-06-03 Thread Jack Krupansky
How much free system memory do you have for the OS to cache file system
data? If your entire index fits in system memory, operations will be fast,
but as your index grows beyond the space the OS can use to cache the data,
performance will decline.


But there's no hard limit in Solr per se.

-- Jack Krupansky

-Original Message- 
From: Vineet Mishra

Sent: Tuesday, June 3, 2014 8:43 AM
To: solr-user@lucene.apache.org
Subject: Solr maximum Optimal Index Size per Shard

Hi All,

Has anyone come across a maximum threshold, in document count or size, for
what each Solr core can hold?
I have indexed some 10 million documents (18GB), and when I index another
5 million documents (9GB) on top of those, it responds a little slowly to
stats queries.

Considering I have around 2TB of data to index, what would be an
appropriate, balanced proportion of data vs. number of shards?

It's more a case of indexing big data for NRT.
Looking forward to your response.
Urgent!

Thanks! 



Re: Solr maximum Optimal Index Size per Shard

2014-06-03 Thread Shawn Heisey
On 6/3/2014 12:54 PM, Jack Krupansky wrote:
 How much free system memory do you have for the OS to cache file
 system data? If your entire index fits in system memory, operations
 will be fast, but as your index grows beyond the space the OS can use
 to cache the data, performance will decline.

 But there's no hard limit in Solr per se.

Vineet,

There is only one hard limit in Solr: You can't put more than about 2
billion documents in one shard.  The exact number is 2147483647 -- the
largest number that you can store in a Java integer.  Because this
number also includes deleted documents, to be absolutely sure that
nothing will have a problem, it would be advisable to stay below 1
billion documents per shard.
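
That ceiling is simply the largest value a Java int can hold, as a
one-line check shows:

public class MaxDocsPerShard {
    public static void main(String[] args) {
        // Lucene addresses documents with Java ints, hence the per-shard cap.
        System.out.println(Integer.MAX_VALUE); // prints 2147483647
    }
}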

Because of Solr's reliance on RAM for the OS disk cache, which Jack has
already mentioned, chances are very good that your shards will have
performance problems long before you reach a billion documents.

Thanks,
Shawn



Re: Solr maximum Optimal Index Size per Shard

2014-06-03 Thread Jack Krupansky
Anybody care to forecast when hardware will catch up with Solr and we can 
routinely look forward to newbies complaining that they indexed some data 
and after only 10 minutes they hit this weird 2G document count limit?


-- Jack Krupansky

-Original Message- 
From: Shawn Heisey

Sent: Tuesday, June 3, 2014 3:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr maximum Optimal Index Size per Shard

On 6/3/2014 12:54 PM, Jack Krupansky wrote:

How much free system memory do you have for the OS to cache file
system data? If your entire index fits in system memory, operations
will be fast, but as your index grows beyond the space the OS can use
to cache the data, performance will decline.

But there's no hard limit in Solr per se.


Vineet,

There is only one hard limit in Solr: You can't put more than about 2
billion documents in one shard.  The exact number is 2147483647 -- the
largest number that you can store in a Java integer.  Because this
number also includes deleted documents, to be absolutely sure that
nothing will have a problem, it would be advisable to stay below 1
billion documents per shard.

Because of Solr's reliance on RAM for the OS disk cache, which Jack has
already mentioned, chances are very good that your shards will have
performance problems long before you reach a billion documents.

Thanks,
Shawn 



Re: Solr maximum Optimal Index Size per Shard

2014-06-03 Thread Shawn Heisey
On 6/3/2014 1:47 PM, Jack Krupansky wrote:
 Anybody care to forecast when hardware will catch up with Solr and we
 can routinely look forward to newbies complaining that they indexed
 some data and after only 10 minutes they hit this weird 2G document
 count limit?

I would speculate that Lucene will update its index format to use 64-bit
(or maybe VInt) document identifiers before a typical (or test) data
source would present a problem like this.  I also seriously doubt that
it would be a complete newbie; it is more likely that it would be an
experienced admin or developer who would instantly know why it happened
as soon as they saw how many documents were successfully indexed.  They
might need someone to point out SolrCloud shards to them as the solution.

I've already heard of a production SolrCloud install with five billion
documents, so even now it's not a theoretical problem.

Thanks,
Shawn



Re: Solr maximum Optimal Index Size per Shard

2014-06-03 Thread rulinma
mark.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-maximum-Optimal-Index-Size-per-Shard-tp4139565p4139698.html
Sent from the Solr - User mailing list archive at Nabble.com.