[jira] [Created] (SOLR-12588) Solr Autoscaling History doesn't log node added events

2018-07-24 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12588:


 Summary: Solr Autoscaling History doesn't log node added events
 Key: SOLR-12588
 URL: https://issues.apache.org/jira/browse/SOLR-12588
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: AutoScaling
Affects Versions: 7.3.1
Reporter: Jerry Bao


Autoscaling nodeAdded triggers don't log node-added events to the autoscaling history stored in the .system collection.
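For context, the kind of trigger being referred to is a nodeAdded trigger registered through the autoscaling API. A minimal sketch is below; the trigger name, waitFor value, and actions list are illustrative assumptions rather than the configuration from the affected cluster:

{code:java}
// POSTed to /solr/admin/autoscaling (host/port omitted)
{
  "set-trigger": {
    "name": "node_added_trigger",
    "event": "nodeAdded",
    "waitFor": "5s",
    "enabled": true,
    "actions": [
      {"name": "compute_plan", "class": "solr.ComputePlanAction"},
      {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
    ]
  }
}
{code}

Events fired by such a trigger should show up in the autoscaling history, which is backed by the .system collection; this report is that nodeAdded events never appear there.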








[jira] [Created] (SOLR-12563) Unable to delete failed/completed async request statuses

2018-07-18 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12563:


 Summary: Unable to delete failed/completed async request statuses 
 Key: SOLR-12563
 URL: https://issues.apache.org/jira/browse/SOLR-12563
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 7.3.1
Reporter: Jerry Bao


/admin/collections?action=DELETESTATUS&flush=true

{code}
{
  "responseHeader": {
    "status": 500,
    "QTime": 5
  },
  "error": {
    "msg": "KeeperErrorCode = Directory not empty for /overseer/collection-map-completed/mn-node_lost_trigger",
    "trace": "org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /overseer/collection-map-completed/mn-node_lost_trigger
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:128)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:876)
        at org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:244)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
        at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:243)
        at org.apache.solr.cloud.DistributedMap.remove(DistributedMap.java:98)
        at org.apache.solr.handler.admin.CollectionsHandler$CollectionOperation$1.execute(CollectionsHandler.java:753)
        at org.apache.solr.handler.admin.CollectionsHandler$CollectionOperation.execute(CollectionsHandler.java:1114)
        at org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:242)
        at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:230)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)
        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:736)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:530)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)"
  }
}
{code}

[jira] [Commented] (SOLR-12495) Enhance the Autoscaling policy syntax to evenly distribute replicas

2018-06-23 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521323#comment-16521323
 ] 

Jerry Bao commented on SOLR-12495:
--

{quote}
Is there anything that's not already addressed by that? I understand that it 
won't show any violations if you are already in an imbalanced state.
{quote}

That's the main issue: there are no violations if you're already in an imbalanced state. If the autoscaling suggestions also proposed moving replicas toward a more balanced state (based on the preferences) even when there are no violations, that would solve this issue.

We have machines with zero load on them because the collections are distributed amongst the machines but the replicas within each collection aren't. We also see machines with too much load because they host one replica of every collection.

{quote}
This can always lead to violations which are impossible to satisfy
{quote}

I think this can lead to violations that are impossible to satisfy because fixing a violation often takes multiple steps, something like a three-way triangle movement. I understand that the more movement you allow, the more combinations there are to check (exponentially so), but I think we can be smarter here about deciding which machines are definitely viable targets and which don't make sense to move to.

I would say that if we could incorporate the preferences into suggestions (so that the trigger can move things to be more balanced based on our preferences), that would help us a lot here.
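To make "incorporate the preferences" concrete, this is roughly the shape of cluster preferences being referred to. A sketch only, using the standard set-cluster-preferences command; the specific preferences and precision values are illustrative assumptions, not the cluster's actual settings:

{code:java}
// POSTed to /solr/admin/autoscaling; ideally /suggestions would also propose moves
// that improve these sort orders even when no policy rule is violated
{
  "set-cluster-preferences": [
    {"minimize": "cores", "precision": 1},
    {"maximize": "freedisk", "precision": 10}
  ]
}
{code}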

> Enhance the Autoscaling policy syntax to evenly distribute replicas
> ---
>
> Key: SOLR-12495
> URL: https://issues.apache.org/jira/browse/SOLR-12495
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Noble Paul
>Priority: Major
>
> Support a new function value for {{replica= "#MINIMUM"}}
> {{#MINIMUM}} means the minimum computed value for the given configuration
> the value of replica will be calculated as  {{<= 
> Math.ceil(number_of_replicas/number_of_valid_nodes) }}
> *example 1:*
> {code:java}
> {"replica" : "#MINIMUM" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 1* : nodes=3, replicationFactor=4
>  the value of replica will be calculated as {{Math.ceil(4/3) = 2}}
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 2* : 
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *example:2*
> {code}
> {"replica" : "#MINIMUM"  , "node" : "#ANY"}{code}
> case 1: numShards = 2, replicationFactor=3, nodes = 5
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "node" : "#ANY"}
> {code}
> *example:3*
> {code}
> {"replica" : "<2"  , "shard" : "#EACH" , "port" : "8983"}{code}
> case 1: {{replicationFactor=3, nodes with port 8983 = 2}}
> this is equivalent to the hard coded rule
> {code}
> {"replica" : "<3"  , "shard" : "#EACH" , "port" : "8983"}{code}






[jira] [Commented] (SOLR-11985) Allow percentage in replica attribute in policy

2018-06-23 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521322#comment-16521322
 ] 

Jerry Bao commented on SOLR-11985:
--

SOLR-12511 should definitely solve the issue I was speaking of :).

{quote}
But the problem is that once you are already in a badly distributed cluster, it 
won't show any violations.
{quote}

Yep, that's the problem I was hoping we could avoid. Balancing needs an in-between (such as 2-3 replicas per machine) to be well distributed, not just a maximum/minimum.
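For illustration, the "in-between" could conceivably be expressed today as a band of two rules rather than a single bound. This is only a sketch: it assumes the '>' operator is usable on replica the same way '<' is, and (per the discussion in SOLR-12495) that omitting shard makes the count per collection:

{code:java}
// keep between 2 and 3 replicas of a collection on any one node
{"replica": ">1", "node": "#ANY"}
{"replica": "<4", "node": "#ANY"}
{code}

Even then, the bounds have to be recomputed whenever replica or node counts change, which is why a computed value would be preferable.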

> Allow percentage in replica attribute in policy
> ---
>
> Key: SOLR-11985
> URL: https://issues.apache.org/jira/browse/SOLR-11985
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: SOLR-11985.patch, SOLR-11985.patch
>
>
> Today we can only specify an absolute number in the 'replica' attribute in 
> the policy rules. It'd be useful to write a percentage value to make certain 
> use-cases easier. For example:
> {code:java}
> // Keep a third of the the replicas of each shard in east region
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> // Keep two thirds of the the replicas of each shard in west region
> {"replica" : "<67%", "shard" : "#EACH", "sysprop:region": "west"}
> {code}
> Today the above must be represented by different rules for each collection if 
> they have different replication factors. Also if the replication factor 
> changes later, the absolute value has to be changed in tandem. So expressing 
> a percentage removes both of these restrictions.
> This feature means that the value of the attribute {{"replica"}} is only 
> available just in time. We call such values {{"computed values"}} . The 
> computed value for this attribute depends on other attributes as well. 
>  Take the following 2 rules
> {code:java}
> //example 1
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> //example 2
> {"replica" : "<34%",  "sysprop:region": "east"}
> {code}
> assume we have collection {{"A"}} with 2 shards and {{replicationFactor=3}}
> *example 1* would mean that the value of replica is computed as
> {{3 * 34 / 100 = 1.02}}
> Which means *_for each shard_* keep less than 1.02 replica in east 
> availability zone
>  
> *example 2* would mean that the value of replica is computed as 
> {{3 * 2 * 34 / 100 = 2.04}}
>  
> which means _*for each collection*_ keep less than 2.04 replicas on east 
> availability zone






[jira] [Commented] (SOLR-11985) Allow percentage in replica attribute in policy

2018-06-22 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520669#comment-16520669
 ] 

Jerry Bao commented on SOLR-11985:
--

Given the way it was written, the concern I had was the following:

One collection has shards with 3 replicas and another collection has shards 
with 4 replicas. If I had the following set of rules...
{code}
{"replica" : "<33%", "shard" : "#EACH", "sysprop:region": "us-east-1a"}
{"replica" : "<33%", "shard" : "#EACH", "sysprop:region": "us-east-1b"}
{"replica" : "<33%", "shard" : "#EACH", "sysprop:region": "us-east-1c"}
{code}

My concern was that it would turn into
{code}
{"replica" : "<2", "shard" : "#EACH", "sysprop:region": "us-east-1a"}
{"replica" : "<2", "shard" : "#EACH", "sysprop:region": "us-east-1b"}
{"replica" : "<2", "shard" : "#EACH", "sysprop:region": "us-east-1c"}
{code}
for the collection with 3 replicas, and
{code}
{"replica" : "<3", "shard" : "#EACH", "sysprop:region": "us-east-1a"}
{"replica" : "<3", "shard" : "#EACH", "sysprop:region": "us-east-1b"}
{"replica" : "<3", "shard" : "#EACH", "sysprop:region": "us-east-1c"}
{code}
for the collection with 4 replicas. In the collection with 4 replicas, you could then have 2 replicas in us-east-1a and 2 replicas in us-east-1b. What we really want is 1 in each zone before the 4th replica goes to a zone that already has one. Due to the way the rules are set up, each zone rule is treated individually when they should be treated together, evenly balancing the replicas across the number of zones available.

We could make it work by writing different zone rules per collection, but that shouldn't be necessary. Rack awareness (which is what we're trying to achieve here) should be collection-agnostic and apply to each collection. https://issues.apache.org/jira/browse/SOLR-12511 would help here.
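To spell out that per-collection workaround, a sketch of what the rules might look like today. The collection names are hypothetical, and combining the collection attribute with these zone rules is assumed to work the way the collection attribute is used in the policies quoted elsewhere in these issues:

{code:java}
// hypothetical collection "posts" (3 replicas per shard): at most 1 replica per zone
{"replica": "<2", "shard": "#EACH", "sysprop:region": "us-east-1a", "collection": "posts"}
{"replica": "<2", "shard": "#EACH", "sysprop:region": "us-east-1b", "collection": "posts"}
{"replica": "<2", "shard": "#EACH", "sysprop:region": "us-east-1c", "collection": "posts"}
// hypothetical collection "comments" (4 replicas per shard): at most 2 replicas per zone
{"replica": "<3", "shard": "#EACH", "sysprop:region": "us-east-1a", "collection": "comments"}
{"replica": "<3", "shard": "#EACH", "sysprop:region": "us-east-1b", "collection": "comments"}
{"replica": "<3", "shard": "#EACH", "sysprop:region": "us-east-1c", "collection": "comments"}
{code}

This works, but the bounds have to be kept in sync with each collection's replication factor by hand, which is exactly the coupling the percentage syntax is meant to remove.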

> Allow percentage in replica attribute in policy
> ---
>
> Key: SOLR-11985
> URL: https://issues.apache.org/jira/browse/SOLR-11985
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: SOLR-11985.patch, SOLR-11985.patch
>
>
> Today we can only specify an absolute number in the 'replica' attribute in 
> the policy rules. It'd be useful to write a percentage value to make certain 
> use-cases easier. For example:
> {code:java}
> // Keep a third of the the replicas of each shard in east region
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> // Keep two thirds of the the replicas of each shard in west region
> {"replica" : "<67%", "shard" : "#EACH", "sysprop:region": "west"}
> {code}
> Today the above must be represented by different rules for each collection if 
> they have different replication factors. Also if the replication factor 
> changes later, the absolute value has to be changed in tandem. So expressing 
> a percentage removes both of these restrictions.
> This feature means that the value of the attribute {{"replica"}} is only 
> available just in time. We call such values {{"computed values"}} . The 
> computed value for this attribute depends on other attributes as well. 
>  Take the following 2 rules
> {code:java}
> //example 1
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> //example 2
> {"replica" : "<34%",  "sysprop:region": "east"}
> {code}
> assume we have collection {{"A"}} with 2 shards and {{replicationFactor=3}}
> *example 1* would mean that the value of replica is computed as
> {{3 * 34 / 100 = 1.02}}
> Which means *_for each shard_* keep less than 1.02 replica in east 
> availability zone
>  
> *example 2* would mean that the value of replica is computed as 
> {{3 * 2 * 34 / 100 = 2.04}}
>  
> which means _*for each collection*_ keep less than 2.04 replicas on east 
> availability zone






[jira] [Commented] (SOLR-12495) Enhance the Autoscaling policy syntax to evenly distribute replicas

2018-06-22 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520654#comment-16520654
 ] 

Jerry Bao commented on SOLR-12495:
--

{quote}
Actually, the terms replica , shard are always associated with a collection. If 
the attribute shard is present , the replica counts are computed on a per-shard 
basis , if it is absent, it is computed on a per-collection basis

The equivalent term for a replica globally is a core which is not associated 
with a collection or shard
{quote}
I see; could {"core": "#MINIMUM", "node": "#ANY"} be included with this issue? 
Along with per-collection balancing, we'll also need cluster-wide balancing.

{quote}
That means The no:of of replicas will have to be between 1 and 2 (inclusive) . 
Which means , both 1 and 2 are valid but 0 , 3 or >3 are invalid and , the list 
of violations will show that
{quote}
Awesome! No qualms here then :)

Thanks for all your help on this issue! Cluster balancing is a critical issue 
for us @ Reddit.
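For reference, the closest cluster-wide control available today appears to be a hard-coded cores bound of the kind used in SOLR-12358; the computed variant asked for above would remove the need to pick that bound by hand. A sketch, with the hard-coded bound value purely illustrative:

{code:java}
// today: hard-coded upper bound on cores per node; must be retuned as the cluster grows
{"cores": "<4", "node": "#ANY"}
// requested above: a computed, cluster-wide bound irrespective of collection
{"core": "#MINIMUM", "node": "#ANY"}
{code}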

> Enhance the Autoscaling policy syntax to evenly distribute replicas
> ---
>
> Key: SOLR-12495
> URL: https://issues.apache.org/jira/browse/SOLR-12495
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Noble Paul
>Priority: Major
>
> Support a new function value for {{replica= "#MINIMUM"}}
> {{#MINIMUM}} means the minimum computed value for the given configuration
> the value of replica will be calculated as  {{<= 
> Math.ceil(number_of_replicas/number_of_valid_nodes) }}
> *example 1:*
> {code:java}
> {"replica" : "#MINIMUM" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 1* : nodes=3, replicationFactor=4
>  the value of replica will be calculated as {{Math.ceil(4/3) = 2}}
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 2* : 
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *example:2*
> {code}
> {"replica" : "#MINIMUM"  , "node" : "#ANY"}{code}
> case 1: numShards = 2, replicationFactor=3, nodes = 5
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "node" : "#ANY"}
> {code}
> *example:3*
> {code}
> {"replica" : "<2"  , "shard" : "#EACH" , "port" : "8983"}{code}
> case 1: {{replicationFactor=3, nodes with port 8983 = 2}}
> this is equivalent to the hard coded rule
> {code}
> {"replica" : "<3"  , "shard" : "#EACH" , "port" : "8983"}{code}






[jira] [Comment Edited] (SOLR-12495) Make it possible to evenly distribute replicas

2018-06-21 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519668#comment-16519668
 ] 

Jerry Bao edited comment on SOLR-12495 at 6/21/18 6:40 PM:
---

{quote}well
{code:java}
{"replica": "#MINIMUM", "node": "#ANY"}
{code}
means it is applied on a per collection basis
{quote}
That seems confusing to me; the way I read it is: keep a minimum number of replicas on every node. Just to clarify, when you say per-collection basis, do you mean each collection is balanced? If so, will there be a way to keep the entire cluster balanced irrespective of collection? Is that covered by the core preference? My concern is that without a way to keep the entire cluster balanced irrespective of collection, you'll end up with some nodes holding one replica of every collection and other nodes holding 0 replicas. For example, if you had three collections with 30 replicas each and 45 nodes, you could end up with 30 nodes that each hold one replica of each collection and 15 nodes with 0 replicas, which is unbalanced.
{quote}In reality, it works slightly different. The value "<3" is not a 
constant . it keeps varying when every replica is created. for instance , when 
replica # 40 is being created , the value is (40/40 = 1) that is like saying 
{{replica:"<2"}} . whereas , when replica #41 is created, it suddenly becomes 
{{"replica" : "<3"}}. So actually allocations happen evenly
{quote}
I understand that it's not constant, but what I'm saying is that the rule itself can be satisfied while the cluster is still not balanced. If I have 42 replicas and 40 nodes, I would want 1 replica on every node before any node gets 2. ceil(42/40) = 2 yields a "<3" rule, which still allows 2 replicas on 21 nodes (and none on the remaining 19); that satisfies the rule but is not balanced.


was (Author: jerry.bao):
{quote}well
{code:java}
{"replica": "#MINIMUM", "node": "#ANY"}
{code}
means it is applied on a per collection basis
{quote}
That seems confusing to me; the way I read it is: keep a minimum number of 
replicas on every node. Just to clarify, when you say per-collection basis, 
you're meaning each collection is balanced? If that is so will there be a way 
to keep the entire cluster balanced irrespective of collection? Is that covered 
by the core preference? My concern here is that without a way to keep the 
entire cluster balanced irrespective of collection, you'll end up with nodes 
with one replica of every collection and other nodes with 0 replicas. For 
example, if you had three collections with 30 replicas each, and 45 nodes, you 
could end up with 30 nodes, each with one collections replica, and 15 nodes 
with 0 replicas, which is unbalanced.
{quote}In reality, it works slightly different. The value "<3" is not a 
constant . it keeps varying when every replica is created. for instance , when 
replica # 40 is being created , the value is (40/40 = 1) that is like saying 
{{replica:"<2"}} . whereas , when replica #41 is created, it suddenly becomes 
{{"replica" : "<3"}}. So actually allocations happen evenly
{quote}
I understand that it's not constant, but what I'm saying is the rule itself can 
not be violated but the cluster not balanced. If I have 42 replicas and 40 
nodes, I would want 1 replica on every node before getting 2 on other nodes. 
ceil(42/40) -> <3 rule, which has the potential of having 2 replicas on 21 
nodes, which satisfies the rule but is not balanced.

> Make it possible to evenly distribute replicas
> --
>
> Key: SOLR-12495
> URL: https://issues.apache.org/jira/browse/SOLR-12495
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Noble Paul
>Priority: Major
>
> Support a new function value for {{replica= "#MINIMUM"}}
> {{#MINIMUM}} means the minimum computed value for the given configuration
> the value of replica will be calculated as  {{<= 
> Math.ceil(number_of_replicas/number_of_valid_nodes) }}
> *example 1:*
> {code:java}
> {"replica" : "#MINIMUM" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 1* : nodes=3, replicationFactor=4
>  the value of replica will be calculated as {{Math.ceil(4/3) = 2}}
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 2* : 
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *example:2*
> {code}
> {"replica" : "#MINIMUM"  , "node" : "#ANY"}{code}
> case 1: numShards = 2, replicationFactor=3, nodes = 5
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "node" : 

[jira] [Commented] (SOLR-12495) Make it possible to evenly distribute replicas

2018-06-21 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519668#comment-16519668
 ] 

Jerry Bao commented on SOLR-12495:
--

{quote}well
{code:java}
{"replica": "#MINIMUM", "node": "#ANY"}
{code}
means it is applied on a per collection basis
{quote}
That seems confusing to me; the way I read it is: keep a minimum number of replicas on every node. Just to clarify, when you say per-collection basis, do you mean each collection is balanced? If so, will there be a way to keep the entire cluster balanced irrespective of collection? Is that covered by the core preference? My concern is that without a way to keep the entire cluster balanced irrespective of collection, you'll end up with some nodes holding one replica of every collection and other nodes holding 0 replicas. For example, if you had three collections with 30 replicas each and 45 nodes, you could end up with 30 nodes that each hold one replica of each collection and 15 nodes with 0 replicas, which is unbalanced.
{quote}In reality, it works slightly different. The value "<3" is not a 
constant . it keeps varying when every replica is created. for instance , when 
replica # 40 is being created , the value is (40/40 = 1) that is like saying 
{{replica:"<2"}} . whereas , when replica #41 is created, it suddenly becomes 
{{"replica" : "<3"}}. So actually allocations happen evenly
{quote}
I understand that it's not constant, but what I'm saying is that the rule itself can be satisfied while the cluster is still not balanced. If I have 42 replicas and 40 nodes, I would want 1 replica on every node before any node gets 2. ceil(42/40) = 2 yields a "<3" rule, which still allows 2 replicas on 21 nodes (and none on the remaining 19); that satisfies the rule but is not balanced.
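Spelling out the arithmetic in the last paragraph, using the formula from the issue description:

{code}
42 replicas, 40 nodes  ->  Math.ceil(42/40) = 2  ->  effective rule {"replica": "<3", "node": "#ANY"}
"<3" also permits: 21 nodes x 2 replicas + 19 nodes x 0 replicas   (rule satisfied, not balanced)
desired placement: 38 nodes x 1 replica  +  2 nodes x 2 replicas   (1 everywhere before doubling up)
{code}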

> Make it possible to evenly distribute replicas
> --
>
> Key: SOLR-12495
> URL: https://issues.apache.org/jira/browse/SOLR-12495
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Noble Paul
>Priority: Major
>
> Support a new function value for {{replica= "#MINIMUM"}}
> {{#MINIMUM}} means the minimum computed value for the given configuration
> the value of replica will be calculated as  {{<= 
> Math.ceil(number_of_replicas/number_of_valid_nodes) }}
> *example 1:*
> {code:java}
> {"replica" : "#MINIMUM" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 1* : nodes=3, replicationFactor=4
>  the value of replica will be calculated as {{Math.ceil(4/3) = 2}}
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 2* : 
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *example:2*
> {code}
> {"replica" : "#MINIMUM"  , "node" : "#ANY"}{code}
> case 1: numShards = 2, replicationFactor=3, nodes = 5
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "node" : "#ANY"}
> {code}
> *example:3*
> {code}
> {"replica" : "<2"  , "shard" : "#EACH" , "port" : "8983"}{code}
> case 1: {{replicationFactor=3, nodes with port 8983 = 2}}
> this is equivalent to the hard coded rule
> {code}
> {"replica" : "<3"  , "shard" : "#EACH" , "port" : "8983"}{code}






[jira] [Commented] (SOLR-11985) Allow percentage in replica attribute in policy

2018-06-21 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518984#comment-16518984
 ] 

Jerry Bao commented on SOLR-11985:
--

[~noble.paul] that would mean each shard would have to have the same number of replicas, which might not be the case within a collection or across collections. It would be nice if there were a set of policies that evenly distributes a shard's replicas across a property without having to specify different rules per collection based on how many replicas each shard has.

I agree that if all of the shards had the same number of replicas we could just change the numbers, but that isn't always the case.

Does that make sense?

> Allow percentage in replica attribute in policy
> ---
>
> Key: SOLR-11985
> URL: https://issues.apache.org/jira/browse/SOLR-11985
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: SOLR-11985.patch, SOLR-11985.patch
>
>
> Today we can only specify an absolute number in the 'replica' attribute in 
> the policy rules. It'd be useful to write a percentage value to make certain 
> use-cases easier. For example:
> {code:java}
> // Keep a third of the the replicas of each shard in east region
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> // Keep two thirds of the the replicas of each shard in west region
> {"replica" : "<67%", "shard" : "#EACH", "sysprop:region": "west"}
> {code}
> Today the above must be represented by different rules for each collection if 
> they have different replication factors. Also if the replication factor 
> changes later, the absolute value has to be changed in tandem. So expressing 
> a percentage removes both of these restrictions.
> This feature means that the value of the attribute {{"replica"}} is only 
> available just in time. We call such values {{"computed values"}} . The 
> computed value for this attribute depends on other attributes as well. 
>  Take the following 2 rules
> {code:java}
> //example 1
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> //example 2
> {"replica" : "<34%",  "sysprop:region": "east"}
> {code}
> assume we have collection {{"A"}} with 2 shards and {{replicationFactor=3}}
> *example 1* would mean that the value of replica is computed as
> {{3 * 34 / 100 = 1.02}}
> Which means *_for each shard_* keep less than 1.02 replica in east 
> availability zone
>  
> *example 2* would mean that the value of replica is computed as 
> {{3 * 2 * 34 / 100 = 2.04}}
>  
> which means _*for each collection*_ keep less than 2.04 replicas on east 
> availability zone






[jira] [Commented] (SOLR-11985) Allow percentage in replica attribute in policy

2018-06-20 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518480#comment-16518480
 ] 

Jerry Bao commented on SOLR-11985:
--

[~noble.paul] What would happen if I had 5 replicas and 3 zones for a shard? Is 
it possible to make a rule that balances the replicas on a shard as 2 on 
us-east-1a, 2 on us-east-1b, and 1 on us-east-1c?
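Working that case through with the percentage semantics described in this issue (a sketch; 41% is just one value whose computed bound happens to permit the 2/2/1 split):

{code:java}
// 5 replicas per shard, 3 zones
{"replica" : "<41%", "shard" : "#EACH", "sysprop:region" : "us-east-1a"}
{"replica" : "<41%", "shard" : "#EACH", "sysprop:region" : "us-east-1b"}
{"replica" : "<41%", "shard" : "#EACH", "sysprop:region" : "us-east-1c"}
// computed bound per zone: 5 * 41 / 100 = 2.05, i.e. at most 2 replicas of a shard per zone,
// which allows the 2 + 2 + 1 layout; a <34% rule would compute 1.7 and cap every zone at 1,
// which 5 replicas across 3 zones cannot satisfy
{code}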

> Allow percentage in replica attribute in policy
> ---
>
> Key: SOLR-11985
> URL: https://issues.apache.org/jira/browse/SOLR-11985
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: SOLR-11985.patch, SOLR-11985.patch
>
>
> Today we can only specify an absolute number in the 'replica' attribute in 
> the policy rules. It'd be useful to write a percentage value to make certain 
> use-cases easier. For example:
> {code:java}
> // Keep a third of the the replicas of each shard in east region
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> // Keep two thirds of the the replicas of each shard in west region
> {"replica" : "<67%", "shard" : "#EACH", "sysprop:region": "west"}
> {code}
> Today the above must be represented by different rules for each collection if 
> they have different replication factors. Also if the replication factor 
> changes later, the absolute value has to be changed in tandem. So expressing 
> a percentage removes both of these restrictions.
> This feature means that the value of the attribute {{"replica"}} is only 
> available just in time. We call such values {{"computed values"}} . The 
> computed value for this attribute depends on other attributes as well. 
>  Take the following 2 rules
> {code:java}
> //example 1
> {"replica" : "<34%", "shard" : "#EACH", "sysprop:region": "east"}
> //example 2
> {"replica" : "<34%",  "sysprop:region": "east"}
> {code}
> assume we have collection {{"A"}} with 2 shards and {{replicationFactor=3}}
> *example 1* would mean that the value of replica is computed as
> {{3 * 34 / 100 = 1.02}}
> Which means *_for each shard_* keep less than 1.02 replica in east 
> availability zone
>  
> *example 2* would mean that the value of replica is computed as 
> {{3 * 2 * 34 / 100 = 2.04}}
>  
> which means _*for each collection*_ keep less than 2.04 replicas on east 
> availability zone






[jira] [Commented] (SOLR-12495) Make it possible to evenly distribute replicas

2018-06-20 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518476#comment-16518476
 ] 

Jerry Bao commented on SOLR-12495:
--

Wanted to add a couple of comments:

It would be great if this happens per collection. For example, a collection with 42 replicas and 40 nodes should expect to have one replica from that collection on each node, with 2 nodes having 2 replicas: {"replica": "#MINIMUM", "collection": "#EACH", "node": "#ANY"}

Cluster-wide balancing would also go along with this, making sure each node has a similar number of replicas overall: {"replica": "#MINIMUM", "node": "#ANY"}

A warning: "<3", which is ceil(42/40) = 2, works, but only if each node gets one replica first. That rule also allows 2 replicas on 21 nodes, which is not as good as 1 replica on every node and 2 replicas on 2 of them. I think this should be handled by ordering the nodes by preference, but only if the list is re-sorted after each movement.

[~noble.paul] FYI

> Make it possible to evenly distribute replicas
> --
>
> Key: SOLR-12495
> URL: https://issues.apache.org/jira/browse/SOLR-12495
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Noble Paul
>Priority: Major
>
> Support a new function value for {{replica= "#MINIMUM"}}
> {{#MINIMUM}} means the minimum computed value for the given configuration
> the value of replica will be calculated as  {{<= 
> Math.ceil(number_of_replicas/number_of_valid_nodes) }}
> *example 1:*
> {code:java}
> {"replica" : "#MINIMUM" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 1* : nodes=3, replicationFactor=4
>  the value of replica will be calculated as {{Math.ceil(4/3) = 2}}
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *case 2* : 
> current state : nodes=3, replicationFactor=2
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "shard" : "#EACH" , "node" : "#ANY"}
> {code}
> *example:2*
> {code}
> {"replica" : "#MINIMUM"  , "node" : "#ANY"}{code}
> case 1: numShards = 2, replicationFactor=3, nodes = 5
> this is equivalent to the hard coded rule
> {code:java}
> {"replica" : "<3" , "node" : "#ANY"}
> {code}
> *example:3*
> {code}
> {"replica" : "<2"  , "shard" : "#EACH" , "port" : "8983"}{code}
> case 1: {{replicationFactor=3, nodes with port 8983 = 2}}
> this is equivalent to the hard coded rule
> {code}
> {"replica" : "<3"  , "shard" : "#EACH" , "port" : "8983"}{code}






[jira] [Commented] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-05-30 Thread Jerry Bao (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495708#comment-16495708
 ] 

Jerry Bao commented on SOLR-12088:
--

[~caomanhdat] I can't confirm or deny whether this has been fixed, but I'm happy to close this out and reopen it if we see it again.

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 replicas and 3 
> downed replicas, write latency is increased in comparison to a shard that 
> contains only 3 replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause issues to other alive replicas in terms of write latency.






[jira] [Updated] (SOLR-12358) Autoscaling suggestions fail randomly and for certain policies

2018-05-18 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12358:
-
Attachment: diagnostics
nodes

> Autoscaling suggestions fail randomly and for certain policies
> --
>
> Key: SOLR-12358
> URL: https://issues.apache.org/jira/browse/SOLR-12358
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Affects Versions: 7.3.1
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: diagnostics, nodes
>
>
> For the following policy
> {code:java}
> {"cores": "<4","node": "#ANY"}{code}
> the suggestions endpoint fails
> {code:java}
> "error": {"msg": "Comparison method violates its general contract!","trace": 
> "java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!\n\tat java.util.TimSort.mergeHi(TimSort.java:899)\n\tat 
> java.util.TimSort.mergeAt(TimSort.java:516)\n\tat 
> java.util.TimSort.mergeCollapse(TimSort.java:441)\n\tat 
> java.util.TimSort.sort(TimSort.java:245)\n\tat 
> java.util.Arrays.sort(Arrays.java:1512)\n\tat 
> java.util.ArrayList.sort(ArrayList.java:1462)\n\tat 
> java.util.Collections.sort(Collections.java:175)\n\tat 
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.setApproxValuesAndSortNodes(Policy.java:363)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.applyRules(Policy.java:310)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.(Policy.java:272)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:376)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSuggestions(PolicyHelper.java:214)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleSuggestions(AutoScalingHandler.java:158)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleRequestBody(AutoScalingHandler.java:133)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)\n\tat
>  org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:242)\n\tat 
> org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:311)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
>  org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:530)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)\n\tat
>  
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)\n\tat
>  

[jira] [Commented] (SOLR-12358) Autoscaling suggestions fail randomly and for certain policies

2018-05-18 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481000#comment-16481000
 ] 

Jerry Bao commented on SOLR-12358:
--

[~noble.paul] Updated!

> Autoscaling suggestions fail randomly and for certain policies
> --
>
> Key: SOLR-12358
> URL: https://issues.apache.org/jira/browse/SOLR-12358
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Affects Versions: 7.3.1
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: diagnostics, nodes
>
>
> For the following policy
> {code:java}
> {"cores": "<4","node": "#ANY"}{code}
> the suggestions endpoint fails
> {code:java}
> "error": {"msg": "Comparison method violates its general contract!","trace": 
> "java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!\n\tat java.util.TimSort.mergeHi(TimSort.java:899)\n\tat 
> java.util.TimSort.mergeAt(TimSort.java:516)\n\tat 
> java.util.TimSort.mergeCollapse(TimSort.java:441)\n\tat 
> java.util.TimSort.sort(TimSort.java:245)\n\tat 
> java.util.Arrays.sort(Arrays.java:1512)\n\tat 
> java.util.ArrayList.sort(ArrayList.java:1462)\n\tat 
> java.util.Collections.sort(Collections.java:175)\n\tat 
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.setApproxValuesAndSortNodes(Policy.java:363)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.applyRules(Policy.java:310)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.(Policy.java:272)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:376)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSuggestions(PolicyHelper.java:214)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleSuggestions(AutoScalingHandler.java:158)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleRequestBody(AutoScalingHandler.java:133)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)\n\tat
>  org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:242)\n\tat 
> org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:311)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
>  org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:530)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)\n\tat
>  
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)\n\tat
> 

[jira] [Updated] (SOLR-12358) Autoscaling suggestions fail randomly and for certain policies

2018-05-18 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12358:
-
Description: 
For the following policy
{code:java}
{"cores": "<4","node": "#ANY"}{code}
the suggestions endpoint fails
{code:java}
"error": {"msg": "Comparison method violates its general contract!","trace": 
"java.lang.IllegalArgumentException: Comparison method violates its general 
contract!\n\tat java.util.TimSort.mergeHi(TimSort.java:899)\n\tat 
java.util.TimSort.mergeAt(TimSort.java:516)\n\tat 
java.util.TimSort.mergeCollapse(TimSort.java:441)\n\tat 
java.util.TimSort.sort(TimSort.java:245)\n\tat 
java.util.Arrays.sort(Arrays.java:1512)\n\tat 
java.util.ArrayList.sort(ArrayList.java:1462)\n\tat 
java.util.Collections.sort(Collections.java:175)\n\tat 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.setApproxValuesAndSortNodes(Policy.java:363)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.applyRules(Policy.java:310)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.(Policy.java:272)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:376)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSuggestions(PolicyHelper.java:214)\n\tat
 
org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleSuggestions(AutoScalingHandler.java:158)\n\tat
 
org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleRequestBody(AutoScalingHandler.java:133)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)\n\tat
 org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:242)\n\tat 
org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:311)\n\tat 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
 org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:530)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)\n\tat
 org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)\n\tat
 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)\n\tat
 

[jira] [Updated] (SOLR-12358) Autoscaling suggestions fail randomly and for certain policies

2018-05-15 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12358:
-
Priority: Critical  (was: Major)

> Autoscaling suggestions fail randomly and for certain policies
> --
>
> Key: SOLR-12358
> URL: https://issues.apache.org/jira/browse/SOLR-12358
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.3.1
>Reporter: Jerry Bao
>Priority: Critical
>
> For the following policy
> {code:java}
> {"replica": "<3", "node": "#ANY", "collection": "collection"}{code}
> the suggestions endpoint fails
> {code:java}
> "error": {"msg": "Comparison method violates its general contract!","trace": 
> "java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!\n\tat java.util.TimSort.mergeHi(TimSort.java:899)\n\tat 
> java.util.TimSort.mergeAt(TimSort.java:516)\n\tat 
> java.util.TimSort.mergeCollapse(TimSort.java:441)\n\tat 
> java.util.TimSort.sort(TimSort.java:245)\n\tat 
> java.util.Arrays.sort(Arrays.java:1512)\n\tat 
> java.util.ArrayList.sort(ArrayList.java:1462)\n\tat 
> java.util.Collections.sort(Collections.java:175)\n\tat 
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.setApproxValuesAndSortNodes(Policy.java:363)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.applyRules(Policy.java:310)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.(Policy.java:272)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:376)\n\tat
>  
> org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSuggestions(PolicyHelper.java:214)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleSuggestions(AutoScalingHandler.java:158)\n\tat
>  
> org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleRequestBody(AutoScalingHandler.java:133)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)\n\tat
>  org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:242)\n\tat 
> org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:311)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
>  org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:530)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)\n\tat
>  
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)\n\tat
>  org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat 

[jira] [Created] (SOLR-12358) Autoscaling suggestions fail randomly and for certain policies

2018-05-15 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12358:


 Summary: Autoscaling suggestions fail randomly and for certain 
policies
 Key: SOLR-12358
 URL: https://issues.apache.org/jira/browse/SOLR-12358
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 7.3.1
Reporter: Jerry Bao


For the following policy
{code:java}
{"replica": "<3", "node": "#ANY", "collection": "collection"}{code}
the suggestions endpoint fails
{code:java}
"error": {"msg": "Comparison method violates its general contract!","trace": 
"java.lang.IllegalArgumentException: Comparison method violates its general 
contract!\n\tat java.util.TimSort.mergeHi(TimSort.java:899)\n\tat 
java.util.TimSort.mergeAt(TimSort.java:516)\n\tat 
java.util.TimSort.mergeCollapse(TimSort.java:441)\n\tat 
java.util.TimSort.sort(TimSort.java:245)\n\tat 
java.util.Arrays.sort(Arrays.java:1512)\n\tat 
java.util.ArrayList.sort(ArrayList.java:1462)\n\tat 
java.util.Collections.sort(Collections.java:175)\n\tat 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.setApproxValuesAndSortNodes(Policy.java:363)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.applyRules(Policy.java:310)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.(Policy.java:272)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:376)\n\tat
 
org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSuggestions(PolicyHelper.java:214)\n\tat
 
org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleSuggestions(AutoScalingHandler.java:158)\n\tat
 
org.apache.solr.cloud.autoscaling.AutoScalingHandler.handleRequestBody(AutoScalingHandler.java:133)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)\n\tat
 org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:242)\n\tat 
org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:311)\n\tat 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
 org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:530)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)\n\tat
 org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)\n\tat
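
For context, java.util.TimSort raises "Comparison method violates its general contract!" 
when the Comparator it is given does not define a consistent total order, for example 
when it is non-transitive or when the values being compared change while the sort is in 
progress. Below is a minimal standalone sketch that typically reproduces the same 
exception; it is purely illustrative and is not the Solr Policy code.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Illustration only: a comparator whose answers are inconsistent breaks the
// invariants TimSort relies on, and on large inputs Collections.sort typically
// fails with "Comparison method violates its general contract!".
public class BrokenComparatorDemo {
  public static void main(String[] args) {
    Random random = new Random();
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {
      values.add(random.nextInt());
    }
    // Ignores its arguments and answers at random, so it is not a valid ordering
    // (it is neither symmetric nor transitive, and repeated calls disagree).
    Comparator<Integer> inconsistent = (a, b) -> random.nextBoolean() ? -1 : 1;
    Collections.sort(values, inconsistent); // usually throws IllegalArgumentException
  }
}
{code}
One plausible way this could surface in Policy.setApproxValuesAndSortNodes is the 
node-sorting comparator observing metric values that shift between comparisons, which 
would make the ordering inconsistent within a single sort.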
 

[jira] [Commented] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-29 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419464#comment-16419464
 ] 

Jerry Bao commented on SOLR-12087:
--

Can we get this fix backported to 7.3 and have a 7.3.1?

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Assignee: Cao Manh Dat
>Priority: Critical
> Fix For: 7.4
>
> Attachments: SOLR-12087.patch, SOLR-12087.patch, SOLR-12087.patch, 
> SOLR-12087.test.patch, Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-23 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412212#comment-16412212
 ] 

Jerry Bao commented on SOLR-12088:
--

[~caomanhdat] It seems to last forever, though I cannot confirm that 100%. It 
definitely lasts past an hour. Why does the number of LIR threads started decrease 
as time goes on?

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 replicas and 3 
> downed replicas, write latency is increased in comparison to a shard that 
> contains only 3 replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause issues to other alive replicas in terms of write latency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-21 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408177#comment-16408177
 ] 

Jerry Bao commented on SOLR-12087:
--

[~caomanhdat] That sounds exactly like the case I'm running into. I can't verify 
that I saw the logs you said I should see, but I definitely did see the leader logs 
you mentioned.
{quote}You wrote that

Attempting to delete the downed replicas causes failures because the core does 
not exist anymore.
{quote}
Sorry, I should have been clearer here: it causes failures, but not failures that 
block the deletion of the replica; the replica does eventually get deleted.
{quote}Make sure that on the 2nd call of DeleteReplica (for removing zombie 
replica), parameters are correct because the name of the replica may get 
changed, ie: from core_node3 to core_node4.
{quote}
I wrote a small script to find all downed replicas and issue a delete command 
against them, which does take into account the name change.
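
A minimal SolrJ sketch of that kind of cleanup script, assuming SolrJ 7.x and 
placeholder ZooKeeper hosts; the collection walk and DELETEREPLICA call are 
illustrative, not the exact script used.
{code:java}
import java.util.Arrays;
import java.util.Optional;
import java.util.Set;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

// Illustrative sketch (SolrJ 7.x assumed): delete every replica that is marked
// DOWN or whose node is no longer live, addressing it by its current core_node
// name so renames (e.g. core_node3 -> core_node4) are picked up from cluster state.
public class DeleteDownedReplicas {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181", "zk2:2181"), Optional.empty()).build()) { // placeholder ZK hosts
      client.connect();
      ClusterState state = client.getZkStateReader().getClusterState();
      Set<String> liveNodes = state.getLiveNodes();
      for (DocCollection collection : state.getCollectionsMap().values()) {
        for (Slice slice : collection.getSlices()) {
          for (Replica replica : slice.getReplicas()) {
            boolean down = replica.getState() == Replica.State.DOWN
                || !liveNodes.contains(replica.getNodeName());
            if (down) {
              // Issue DELETEREPLICA with the replica's current core_node name.
              CollectionAdminRequest
                  .deleteReplica(collection.getName(), slice.getName(), replica.getName())
                  .process(client);
            }
          }
        }
      }
    }
  }
}
{code}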

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: SOLR-12087.test.patch, Screen Shot 2018-03-16 at 
> 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402548#comment-16402548
 ] 

Jerry Bao edited comment on SOLR-12087 at 3/16/18 10:28 PM:


Adding some more potentially relevant information:

We're constantly updating Solr collections via live streaming updates. I 
noticed that moving shards that don't have live indexing is much easier than 
those that do. Also heavy indexing seems to be a factor in whether or not 
zombie shards exist.

EDIT: It seems that collections with indexing consistently have zombie shards 
vs. those that don't.


was (Author: jerry.bao):
Adding some more potentially relevant information:

We're constantly updating Solr collections via live streaming updates. I 
noticed that moving shards that don't have live indexing is much easier than 
those that do. Also heavy indexing seems to be a factor in whether or not 
zombie shards exist.

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402548#comment-16402548
 ] 

Jerry Bao edited comment on SOLR-12087 at 3/16/18 10:28 PM:


Adding some more potentially relevant information:

We're constantly updating Solr collections via live streaming updates. I 
noticed that moving shards that don't have live indexing is much easier than 
those that do. Also heavy indexing seems to be a factor in whether or not 
zombie shards exist.

EDIT: It seems that collections with indexing/querying consistently have zombie 
shards vs. those that don't.


was (Author: jerry.bao):
Adding some more potentially relevant information:

We're constantly updating Solr collections via live streaming updates. I 
noticed that moving shards that don't have live indexing is much easier than 
those that do. Also heavy indexing seems to be a factor in whether or not 
zombie shards exist.

EDIT: It seems that collections with indexing consistently have zombie shards 
vs. those that don't.

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402548#comment-16402548
 ] 

Jerry Bao edited comment on SOLR-12087 at 3/16/18 10:14 PM:


Adding some more potentially relevant information:

We're constantly updating Solr collections via live streaming updates. I 
noticed that moving shards that don't have live indexing is much easier than 
those that do. Also heavy indexing seems to be a factor in whether or not 
zombie shards exist.


was (Author: jerry.bao):
I've updated the description with more information.

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-12117) Autoscaling suggestions are too few or non existent for clear violations

2018-03-16 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12117:


 Summary: Autoscaling suggestions are too few or non existent for 
clear violations
 Key: SOLR-12117
 URL: https://issues.apache.org/jira/browse/SOLR-12117
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: AutoScaling
Reporter: Jerry Bao
 Attachments: autoscaling.json, diagnostics.json, solr_instances, 
suggestions.json

Attaching suggestions, diagnostics, autoscaling settings, and the 
solr_instances AZ's. One of the operations suggested is impossible:
{code:java}
{"type": "violation","violation": {"node": 
"solr-0a7207d791bd08d4e:8983_solr","tagKey": "null","violation": {"node": 
"4","delta": 1},"clause": {"cores": "<4","node": "#ANY"}},"operation": 
{"method": "POST","path": "/c/r_posts","command": {"move-replica": 
{"targetNode": "solr-0f0e86f34298f7e79:8983_solr","inPlaceMove": 
"true","replica": "2151000"}}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12117) Autoscaling suggestions are too few or non existent for clear violations

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12117:
-
Description: Attaching suggestions, diagnostics, autoscaling settings, and 
the solr_instances AZ's. Some of the suggestions are one too many for a single 
violation, and other suggestions do not appear even though there are clear 
violations in the policy that are easily fixable.  (was: Attaching suggestions, 
diagnostics, autoscaling settings, and the solr_instances AZ's. One of the 
operations suggested is impossible:
{code:java}
{"type": "violation","violation": {"node": 
"solr-0a7207d791bd08d4e:8983_solr","tagKey": "null","violation": {"node": 
"4","delta": 1},"clause": {"cores": "<4","node": "#ANY"}},"operation": 
{"method": "POST","path": "/c/r_posts","command": {"move-replica": 
{"targetNode": "solr-0f0e86f34298f7e79:8983_solr","inPlaceMove": 
"true","replica": "2151000"}}}{code})

> Autoscaling suggestions are too few or non existent for clear violations
> 
>
> Key: SOLR-12117
> URL: https://issues.apache.org/jira/browse/SOLR-12117
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: autoscaling.json, diagnostics.json, solr_instances, 
> suggestions.json
>
>
> Attaching suggestions, diagnostics, autoscaling settings, and the 
> solr_instances AZ's. Some of the suggestions are one too many for one 
> violation, and other suggestions do not appear even though there are clear 
> violations in the policy and easily fixable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12116) Autoscaling suggests to move a replica that does not exist (all numbers)

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12116:
-
Attachment: solr_instances
autoscaling.json
diagnostics.json
suggestions.json

> Autoscaling suggests to move a replica that does not exist (all numbers)
> 
>
> Key: SOLR-12116
> URL: https://issues.apache.org/jira/browse/SOLR-12116
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: autoscaling.json, diagnostics.json, solr_instances, 
> suggestions.json
>
>
> Attaching suggestions, diagnostics, autoscaling settings, and the 
> solr_instances AZ's. One of the operations suggested is impossible:
> {code:java}
> {"type": "violation","violation": {"node": 
> "solr-0a7207d791bd08d4e:8983_solr","tagKey": "null","violation": {"node": 
> "4","delta": 1},"clause": {"cores": "<4","node": "#ANY"}},"operation": 
> {"method": "POST","path": "/c/r_posts","command": {"move-replica": 
> {"targetNode": "solr-0f0e86f34298f7e79:8983_solr","inPlaceMove": 
> "true","replica": "2151000"}}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12116) Autoscaling suggests to move a replica that does not exist (all numbers)

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12116:
-
Priority: Critical  (was: Major)

> Autoscaling suggests to move a replica that does not exist (all numbers)
> 
>
> Key: SOLR-12116
> URL: https://issues.apache.org/jira/browse/SOLR-12116
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: Jerry Bao
>Priority: Critical
>
> Attaching suggestions, diagnostics, autoscaling settings, and the 
> solr_instances AZ's. One of the operations suggested is impossible:
> {code:java}
> {"type": "violation","violation": {"node": 
> "solr-0a7207d791bd08d4e:8983_solr","tagKey": "null","violation": {"node": 
> "4","delta": 1},"clause": {"cores": "<4","node": "#ANY"}},"operation": 
> {"method": "POST","path": "/c/r_posts","command": {"move-replica": 
> {"targetNode": "solr-0f0e86f34298f7e79:8983_solr","inPlaceMove": 
> "true","replica": "2151000"}}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-12116) Autoscaling suggests to move a replica that does not exist (all numbers)

2018-03-16 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12116:


 Summary: Autoscaling suggests to move a replica that does not 
exist (all numbers)
 Key: SOLR-12116
 URL: https://issues.apache.org/jira/browse/SOLR-12116
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: AutoScaling
Reporter: Jerry Bao


Attaching suggestions, diagnostics, autoscaling settings, and the 
solr_instances AZ's. One of the operations suggested is impossible:
{code:java}
{"type": "violation","violation": {"node": 
"solr-0a7207d791bd08d4e:8983_solr","tagKey": "null","violation": {"node": 
"4","delta": 1},"clause": {"cores": "<4","node": "#ANY"}},"operation": 
{"method": "POST","path": "/c/r_posts","command": {"move-replica": 
{"targetNode": "solr-0f0e86f34298f7e79:8983_solr","inPlaceMove": 
"true","replica": "2151000"}}}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Description: 
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue; when the MOVEREPLICA command is 
issued, the new replica is created successfully but the replica to be deleted 
fails to be removed from state.json (the core is deleted though) and we see two 
logs spammed.
 # The node containing the leader replica continually (read every second) 
attempts to initiate recovery on the replica and fails to do so because the 
core does not exist. As a result it continually publishes a down state for the 
replica to zookeeper.
 # The deleted replica node spams that it cannot locate the core because it's 
been deleted.

During this period of time, we see an increase in ZK network connectivity 
overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
shard until its removed from the state)

My guess is there's two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.

  was:
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue; when the MOVEREPLICA command is 
issued, the new replica is created successfully but the replica to be deleted 
fails to be removed from state.json (the core is deleted though) and we see two 
logs spammed.
 # The node containing the leader replica continually (read every second) 
attempts to initiate recovery on the replica and fails to do so because the 
core does not exist. As a result it continually publishes a down state for the 
replica to zookeeper.
 # The replica node spams that it cannot locate the core because it's been 
deleted.

During this period of time, we see an increase in ZK network connectivity 
overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
shard until its removed from the state)

My guess is there's two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.


> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The deleted replica node spams that it cannot locate the core because it's 
> been 

[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Description: 
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue; when the MOVEREPLICA command is 
issued, the new replica is created successfully but the replica to be deleted 
fails to be removed from state.json (the core is deleted though) and we see two 
logs spammed.
 # The node containing the leader replica continually (read every second) 
attempts to initiate recovery on the replica and fails to do so because the 
core does not exist. As a result it continually publishes a down state for the 
replica to zookeeper.
 # The replica node spams that it cannot locate the core because it's been 
deleted.

During this period of time, we see an increase in ZK network connectivity 
overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
shard until its removed from the state)

My guess is there's two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.

  was:
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue; when the MOVEREPLICA command is 
issued, the new replica is created successfully but the replica to be deleted 
fails to be removed from state.json (the core is deleted though) and we see two 
logs spammed.
 # The node containing the leader replica continually attempts to initiate 
recovery on the replica and fails to do so because the core does not exist. As 
a result it continually publishes a down state for the replica to zookeeper.
 # The replica node spams that it cannot locate the core because it's been 
deleted.

During this period of time, we see an increase in ZK network connectivity 
overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
shard until its removed from the state)

My guess is there's two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.


> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually (read every second) 
> attempts to initiate recovery on the replica and fails to do so because the 
> core does not exist. As a result it continually publishes a down state for 
> the replica to zookeeper.
>  # The replica node spams that it cannot locate the core because it's been 
> deleted.
> During this period of time, we 

[jira] [Commented] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402548#comment-16402548
 ] 

Jerry Bao commented on SOLR-12087:
--

I've updated the description with more information.

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually attempts to initiate 
> recovery on the replica and fails to do so because the core does not exist. 
> As a result it continually publishes a down state for the replica to 
> zookeeper.
>  # The replica node spams that it cannot locate the core because it's been 
> deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Priority: Critical  (was: Major)

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Critical
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually attempts to initiate 
> recovery on the replica and fails to do so because the core does not exist. 
> As a result it continually publishes a down state for the replica to 
> zookeeper.
>  # The replica node spams that it cannot locate the core because it's been 
> deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Attachment: Screen Shot 2018-03-16 at 11.50.32 AM.png

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually attempts to initiate 
> recovery on the replica and fails to do so because the core does not exist. 
> As a result it continually publishes a down state for the replica to 
> zookeeper.
>  # The replica node spams that it cannot locate the core because it's been 
> deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Description: 
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue; when the MOVEREPLICA command is 
issued, the new replica is created successfully but the replica to be deleted 
fails to be removed from state.json (the core is deleted though) and we see two 
logs spammed.
 # The node containing the leader replica continually attempts to initiate 
recovery on the replica and fails to do so because the core does not exist. As 
a result it continually publishes a down state for the replica to zookeeper.
 # The replica node spams that it cannot locate the core because it's been 
deleted.

During this period of time, we see an increase in ZK network connectivity 
overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
shard until its removed from the state)

My guess is there's two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.

  was:
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.


> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
> Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.
> Some more information regarding this issue; when the MOVEREPLICA command is 
> issued, the new replica is created successfully but the replica to be deleted 
> fails to be removed from state.json (the core is deleted though) and we see 
> two logs spammed.
>  # The node containing the leader replica continually attempts to initiate 
> recovery on the replica and fails to do so because the core does not exist. 
> As a result it continually publishes a down state for the replica to 
> zookeeper.
>  # The replica node spams that it cannot locate the core because it's been 
> deleted.
> During this period of time, we see an increase in ZK network connectivity 
> overall, until the replica is finally deleted (spamming DELETEREPLICA on the 
> shard until its removed from the state)
> My guess is there's two issues at hand here:
>  # The leader continually attempts to recover a downed replica that is 
> unrecoverable because the core does not exist.
>  # The replica to be deleted is having trouble being deleted from state.json 
> in ZK.
> This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting replicas sometimes fails and causes the replicas to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Summary: Deleting replicas sometimes fails and causes the replicas to exist 
in the down state  (was: Deleting shards sometimes fails and causes the shard 
to exist in the down state)

> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> 
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12087) Deleting shards sometimes fails and causes the shard to exist in the down state

2018-03-16 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-
Description: 
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

This also occurs when trying to move replicas, since that move is an add and 
delete.

  was:
Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en mass; the 
resulting cause is that the data is deleted but the replicas aren't removed 
from the cluster state. Attempting to delete the downed replicas causes 
failures because the core does not exist anymore.

It seems like when deleting replicas, ZK writes are timing out, preventing the 
cluster state from being properly updated.


> Deleting shards sometimes fails and causes the shard to exist in the down 
> state
> ---
>
> Key: SOLR-12087
> URL: https://issues.apache.org/jira/browse/SOLR-12087
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> Sometimes when deleting replicas, the replica fails to be removed from the 
> cluster state. This occurs especially when deleting replicas en mass; the 
> resulting cause is that the data is deleted but the replicas aren't removed 
> from the cluster state. Attempting to delete the downed replicas causes 
> failures because the core does not exist anymore.
> This also occurs when trying to move replicas, since that move is an add and 
> delete.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-14 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399608#comment-16399608
 ] 

Jerry Bao edited comment on SOLR-12088 at 3/14/18 11:35 PM:


We've been running on Solr 7.2.1, so it's all been state.json and not 
clusterstate.json.

Regarding re-issuing the DELETEREPLICA command, sometimes that fails; I filed 
a Jira for that here: SOLR-12087. That is what was causing this second issue.

For example purposes, our indexing latency went from 2s to 1.7s after 
successfully deleting the dead replicas. One thing I did notice is that the 
dead replicas spam the logs with "unable to unload non-existent core" on the 
machine that hosts them. Could that be a side effect?


was (Author: jerry.bao):
We've been running on Solr 7.2.1, so it's all been state.json and not 
clusterstate.json.

Regarding re-issuing the DELETEREPLICA command, sometimes that fails; I filed 
a Jira for that here: SOLR-12087. That is what was causing this second issue.

For example purposes, our indexing latency went from 2s to 1.7s after deleting 
the dead replicas. One thing I did notice is that the dead replicas spam the 
logs with "unable to unload non-existent core" on the machine that hosts them. 
Could that be a side effect?

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 live replicas 
> and 3 downed replicas, write latency is increased in comparison to a shard 
> that contains only 3 live replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause write-latency issues for the other live replicas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-14 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399608#comment-16399608
 ] 

Jerry Bao commented on SOLR-12088:
--

We've been running on Solr 7.2.1, so it's all been state.json and not 
clusterstate.json.

Regarding re-issuing the DELETEREPLICA command, sometimes that fails; I filed 
a Jira for that here: SOLR-12087. That is what was causing this second issue.

For example purposes, our indexing latency went from 2s to 1.7s after deleting 
the dead replicas. One thing I did notice is that the dead replicas spam the 
logs with "unable to unload non-existent core" on the machine that hosts them. 
Could that be a side effect?

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 live replicas 
> and 3 downed replicas, write latency is increased in comparison to a shard 
> that contains only 3 live replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause write-latency issues for the other live replicas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-14 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399184#comment-16399184
 ] 

Jerry Bao commented on SOLR-12088:
--

[~erickerickson] I don't have an answer to your question; this issue occurred 
when moving replicas, where the move did not completely clean up the state of 
the replicas, leaving them as zombie replicas (data gone but state still 
present after the move).

Your thinking definitely could explain why there's higher indexing latency. 
That makes the most sense to me. How long is this timeout?

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 live replicas 
> and 3 downed replicas, write latency is increased in comparison to a shard 
> that contains only 3 live replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause write-latency issues for the other live replicas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-14 Thread Jerry Bao (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399167#comment-16399167
 ] 

Jerry Bao commented on SOLR-12088:
--

Your scenario is what I experienced, so yes :)

1. 30 nodes in the cluster

2. There are no nodes in the cluster that aren't hosting any replicas.

3. Indexing is via Lucidworks Fusion (which I assume uses a SolrJ-based 
client).

4. Latency is measured through our own service's instrumentation of round-trip 
time to index.

> Shards with dead replicas cause increased write latency
> ---
>
> Key: SOLR-12088
> URL: https://issues.apache.org/jira/browse/SOLR-12088
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> If a collection's shard contains dead replicas, write latency to the 
> collection is increased. For example, if a collection has 10 shards with a 
> replication factor of 3, and one of those shards contains 3 live replicas 
> and 3 downed replicas, write latency is increased in comparison to a shard 
> that contains only 3 live replicas.
> My feeling here is that downed replicas should be completely ignored and not 
> cause write-latency issues for the other live replicas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-12088) Shards with dead replicas cause increased write latency

2018-03-13 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12088:


 Summary: Shards with dead replicas cause increased write latency
 Key: SOLR-12088
 URL: https://issues.apache.org/jira/browse/SOLR-12088
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 7.2
Reporter: Jerry Bao


If a collection's shard contains dead replicas, write latency to the collection 
is increased. For example, if a collection has 10 shards with a replication 
factor of 3, and one of those shards contains 3 live replicas and 3 downed 
replicas, write latency is increased in comparison to a shard that contains 
only 3 live replicas.

My feeling here is that downed replicas should be completely ignored and not 
cause write-latency issues for the other live replicas.
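
For illustration, a minimal SolrJ sketch (the ZK address and collection name 
are hypothetical; a 7.2+ SolrJ client is assumed) that lists the replicas a 
collection still advertises even though they are down or sit on non-live 
nodes, i.e. the entries that had to be deleted to recover the write latency 
described above.

{code:java}
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class ListDeadReplicas {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient cloud = new CloudSolrClient.Builder(
            Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      cloud.connect();
      ClusterState state = cloud.getZkStateReader().getClusterState();
      DocCollection coll = state.getCollection("my_collection");

      // A replica is effectively dead if it is not ACTIVE, or if the node it
      // claims to live on is no longer in the live_nodes set.
      for (Replica r : coll.getReplicas()) {
        boolean nodeLive = state.getLiveNodes().contains(r.getNodeName());
        if (r.getState() != Replica.State.ACTIVE || !nodeLive) {
          System.out.println(r.getName() + " on " + r.getNodeName()
              + " state=" + r.getState() + " nodeLive=" + nodeLive);
        }
      }
    }
  }
}
{code}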



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-12087) Deleting shards sometimes fails and causes the shard to exist in the down state

2018-03-13 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12087:


 Summary: Deleting shards sometimes fails and causes the shard to 
exist in the down state
 Key: SOLR-12087
 URL: https://issues.apache.org/jira/browse/SOLR-12087
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 7.2
Reporter: Jerry Bao


Sometimes when deleting replicas, the replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en masse; the 
result is that the data is deleted but the replicas aren't removed from the 
cluster state. Attempting to delete the downed replicas then fails because the 
core does not exist anymore.

It seems like when deleting replicas, ZK writes are timing out, preventing the 
cluster state from being properly updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12014) Cryptic error message when creating a collection with sharding that violates autoscaling policies

2018-02-21 Thread Jerry Bao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12014:
-
Description: 
When creating a collection with sharding and replication factors that are 
impossible because they would violate autoscaling policies, Solr raises a 
cryptic exception that is unrelated to the issue.


{code:java}
{
"responseHeader":{
"status":500,
"QTime":629},
"Operation create caused 
exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
 Error closing CloudSolrClient",
"exception":{
"msg":"Error closing CloudSolrClient",
"rspCode":500},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"Error closing CloudSolrClient",
"trace":"org.apache.solr.common.SolrException: Error closing 
CloudSolrClient\n\tat 
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:309)\n\tat
 
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)\n\tat
 
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)\n\tat
 org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)\n\tat 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)\n\tat
 org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat
 org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat
 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat
 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat
 java.lang.Thread.run(Thread.java:748)\n",
"code":500}}{code}

  was:When creating a collection with sharding and replication factors that are 
impossible because they would violate autoscaling policies, Solr raises a 
cryptic exception that is unrelated to the issue.


> Cryptic error message when creating a collection with sharding that violates 
> autoscaling policies
> -
>
> Key: SOLR-12014
> URL: https://issues.apache.org/jira/browse/SOLR-12014
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Affects Versions: 7.2
>Reporter: Jerry Bao
>Priority: Major
>
> When creating a collection with sharding and replication factors that are 
> impossible because they would violate autoscaling policies, Solr raises a 
> cryptic exception that is unrelated to the issue.

[jira] [Created] (SOLR-12014) Cryptic error message when creating a collection with sharding that violates autoscaling policies

2018-02-21 Thread Jerry Bao (JIRA)
Jerry Bao created SOLR-12014:


 Summary: Cryptic error message when creating a collection with 
sharding that violates autoscaling policies
 Key: SOLR-12014
 URL: https://issues.apache.org/jira/browse/SOLR-12014
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: AutoScaling
Affects Versions: 7.2
Reporter: Jerry Bao


When creating a collection with sharding and replication factors that are 
impossible because they would violate autoscaling policies, Solr raises a 
cryptic exception that is unrelated to the issue.
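
For illustration, a minimal SolrJ sketch of the kind of request that triggers 
this (the node URL, collection name, configset, and the example cluster policy 
are all hypothetical): when the autoscaling policy cannot accommodate the 
requested layout, the create fails with the unrelated "Error closing 
CloudSolrClient" response quoted earlier in this thread instead of a clear 
policy-violation message.

{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateViolatingCollection {
  public static void main(String[] args) throws Exception {
    // Assumes a cluster policy such as {"replica": "<2", "node": "#ANY"} is
    // already in place and the cluster has too few nodes to satisfy it.
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // 10 shards x 3 replicas cannot be placed without breaking the policy;
      // the resulting error does not mention the policy violation at all.
      CollectionAdminRequest
          .createCollection("my_collection", "_default", 10, 3)
          .process(solr);
    }
  }
}
{code}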



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org