Re: Limit the documents for each shard in solr cloud

2015-05-08 Thread Jilani Shaik
Hi,

Actually, we are facing a lot of issues with Solr shards in our environment.
Our environment is fully loaded with around 150 million documents, where
each document has 50+ stored fields, which have multiple values.
We also have a lot of custom components in this environment that
use FieldCache and various other Solr features.

The main issue we are facing is shards going down frequently in Solr cloud.

I have noted what you mentioned in this reply, and I have also seen various
other replies about memory issues. I will debug further and post here about
any issues I find in the process.

Thanks,
Jilani




Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Jilani Shaik
Hi Daniel,

Thanks for the detailed explanation.

My understanding is similar to yours: we should not impose a limit on the
number of documents that a shard can index. Usually that depends on the
shard routing provided by Solr, and I am not expecting any change to the
document routing process.

My team needs that option if at all possible. Before telling them that it
is not possible on the Solr side to limit the documents per shard, I wanted
to get confirmation or some details, so I posted the question here.

You mentioned "as long as it has sufficient space to do that":
 - How will Solr know or estimate whether it has sufficient space to
index on a particular shard or on the entire cloud?

My conclusion:
We will not be able to limit the documents per shard in Solr Cloud, as Solr
will accept all documents as long as there is space for it to index them.

Please suggest.

Thanks,
Jilani




Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Daniel Collins
Not sure I understand your problem.  If you have 20m documents and 8
shards, then each shard is (broadly speaking) only going to have 2.5m docs,
so I don't follow the 5m limit.  That is with the default routing/hashing;
obviously you can write your own hash algorithm, or you can shard at your
application level.

In terms of limiting documents in a shard, I'm not sure what purpose that
would serve.  If, for argument's sake, you only had 2 shards and a limit of
5m docs per shard, what happens when you hit that limit?  If you have
indexed 10m docs and now you try to index one more, what would you expect
to happen?  Would the system just reject any documents?  Should it try to
route to shard 1, see that it is full, and then fail over to shard 2 instead
(that's not going to work, as sharding needs to be reproducible and the
document was intended for shard 1)?

Solr's basic premise would be to index what you gave it, as long as it has
sufficient space to do that.  If you want to limit your index to 20m docs,
that is probably better done at the application layer (but I still don't
really see why you would want to do that).
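
A minimal sketch of what such an application-layer cap might look like.
Everything here is an assumption for illustration: `CappedIndexer` is a
hypothetical class, and `send_to_solr` stands for whatever callable the
application already uses to submit one document. Nothing in it is a Solr
API.

```python
from typing import Callable, Dict, Set


class CappedIndexer:
    """Refuses new documents once an application-defined limit is reached."""

    def __init__(self, send_to_solr: Callable[[Dict], None], max_docs: int) -> None:
        self._send = send_to_solr          # hypothetical: submits one doc to Solr
        self._max_docs = max_docs          # application-level cap, not a Solr setting
        self._seen_ids: Set[str] = set()   # ids that have already been indexed

    def index(self, doc: Dict) -> bool:
        """Index the doc if under the cap; return False if it was rejected."""
        doc_id = doc["id"]
        is_new = doc_id not in self._seen_ids
        if is_new and len(self._seen_ids) >= self._max_docs:
            return False                   # cap reached: reject before Solr sees it
        self._send(doc)                    # updates to existing ids are still allowed
        self._seen_ids.add(doc_id)
        return True
```

Holding every id in memory is only workable for a toy example; a real
application would more likely track the count some other way, for instance
by asking Solr how many documents it currently holds.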




Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Erick Erickson
bq: We will not be able to limit the documents per shard in Solr
Cloud. As Solr will accept all the documents as long as space is there
for it to index.

True, end of story ;).

How does Solr know it will run out of space? It hits an exception;
there's really no "this doesn't look like it will fit, so let's not
index it" check. But that's not really a problem, because you need at
least as much free space on your disk as the index size to handle merges,
so you'll run into many, many, many other problems before you fill up
your disk.

The hashing function that's used to distribute the documents across the
shards has not had any reports of significant uneven distribution that
I know of. So simply dividing the number of docs by the number of shards
and assuming that number (+/- a very small number, < 1%) of docs will
land on each shard is usually good enough. If you see something
different, it would be good to know.
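
To see why hash-based routing tends to spread documents evenly, here is a
small simulation. It is a sketch under stated assumptions, not Solr's
router: MD5 plus a modulo stands in for Solr's own hash function and
hash-range assignment, and the `doc-N` ids are synthetic.

```python
import hashlib
from collections import Counter


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard with a stable hash (stand-in for Solr's router)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_counts(num_docs: int, num_shards: int) -> Counter:
    """Count how many synthetic document ids land on each shard."""
    return Counter(shard_for(f"doc-{i}", num_shards) for i in range(num_docs))


if __name__ == "__main__":
    total, shards = 80_000, 8
    expected = total / shards
    for shard, count in sorted(shard_counts(total, shards).items()):
        deviation = (count - expected) / expected
        print(f"shard {shard}: {count} docs ({deviation:+.2%} vs. even split)")
```

The same id always maps to the same shard, which is the reproducibility
that routing depends on. The relative deviation from an even split shrinks
as the number of documents grows.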

Best,
Erick




Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Jack Krupansky
Wait a minute, guys... aren't we in the 21st century, where disk is
ultra-cheap and ultra-plentiful? So... what's the REAL problem here?
Seriously, when multi-terabyte drives are so common on servers and Solr
really doesn't work well with more than 100 to 250 million docs per server
anyway, which is way under needing terabytes, what could possibly be the
problem??!!

Or... is this really an SSD rather than a spinning disk issue? Possibly a
virtualization issue, where a single physical machine with only modest
physical SSD is virtualized into multiple virtual machines, but then each
virtual machine gets only a fairly tiny amount of SSD disk storage space?
Just guessing here.

A little clarification is in order.

In any case, if you really only have such a limited amount of storage per
node, that probably simply means that you need more nodes.


-- Jack Krupansky

 



Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Jack Krupansky
A leader is also a replica - SolrCloud is not a master/slave architecture.
Any replica can be elected to be the leader, but that is only temporary and
can change over time.

You can place multiple shards on a single node, but was that really your
intention?

Generally, the number of nodes equals the number of shards times the
replication factor, divided by the number of shards per node if you do
place more than one shard per node.
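
That rule of thumb can be written out as a short calculation. This is a
sketch only; the function name and the example numbers are illustrative,
and rounding up is an assumption for cases that do not divide evenly.

```python
import math


def nodes_needed(num_shards: int, replication_factor: int, shards_per_node: int = 1) -> int:
    """Nodes = shards * replication factor / shards per node, rounded up."""
    total_replicas = num_shards * replication_factor  # every replica is one core
    return math.ceil(total_replicas / shards_per_node)


if __name__ == "__main__":
    # 4 shards, each with 2 replicas (the leader counts as one), 2 per node.
    print(nodes_needed(num_shards=4, replication_factor=2, shards_per_node=2))
```

With one shard per node the same 4 shards and 2 replicas would need 8
nodes; placing 2 per node halves that to 4.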

-- Jack Krupansky




Re: Limit the documents for each shard in solr cloud

2015-05-07 Thread Daniel Collins
Jilani, you did say "My team needs that option if at all possible"; my
first response would be "why?".  Why do they want to limit the number of
documents per shard, what's the rationale/use case behind that
requirement?  Once we understand that, we can explain why it's a bad idea. :)

I suspect I'm reiterating Jack's comments, but why are you sharding in the
first place? 8 shards split across 4 machines, so 2 shards per machine.
But you have 2 replicas of each shard, so you have 16 Solr cores, and hence
4 Solr cores per machine?  Since you need an instance of all 8 shards to be
up in order to service requests, you can get away with everything on 2
machines, but you still have 8 Solr cores to manage in order to have a
fully functioning system.  What's the benefit of sharding in this
scenario?  Sharding adds complexity, so you normally only add sharding if
your search times are too slow without it.

You need to work out how much disk space the whole 20m docs is going to
take (maybe index 1m or 5m docs and extrapolate if they are all equivalent
in size), then split it across 4 machines.  But as Erick points out you
need to allow for merges to occur, so whatever the space of the static
data set, you need to allow for double that from time to time if background
merges are happening.
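
The extrapolation described above can be sketched as a small calculation.
The function name, the `replication_factor` parameter and the 5 GB sample
size are assumptions for illustration; only the "index a sample,
extrapolate, allow double for merges, split across machines" steps come
from the discussion.

```python
def disk_per_machine(
    sample_docs: int,
    sample_index_bytes: int,
    total_docs: int,
    machines: int,
    merge_headroom: float = 2.0,   # allow double the static size for merges
    replication_factor: int = 1,   # assumption: copies of the index to store
) -> float:
    """Estimate the bytes of disk each machine needs, extrapolated from a sample."""
    bytes_per_doc = sample_index_bytes / sample_docs
    static_size = bytes_per_doc * total_docs * replication_factor
    return static_size * merge_headroom / machines


if __name__ == "__main__":
    # Hypothetical measurement: indexing 1m sample docs produced a 5 GB index.
    needed = disk_per_machine(
        sample_docs=1_000_000,
        sample_index_bytes=5_000_000_000,
        total_docs=20_000_000,
        machines=4,
    )
    print(f"{needed / 1e9:.0f} GB per machine")
```

The estimate only holds if the sample documents are representative in size
of the full data set.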

