[jira] [Resolved] (SOLR-14942) Reduce leader election time on node shutdown

2020-12-01 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14942.
--
Resolution: Fixed

Thanks, David, for the feedback, and to you and Dat for the review.

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: 8.8, master (9.0)
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Reopened] (SOLR-14942) Reduce leader election time on node shutdown

2020-12-01 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reopened SOLR-14942:
--

Reopening to address review feedback from David.

Thanks, David for the feedback. I think this will work. Please review the PR at 
https://github.com/apache/lucene-solr/pull/2112

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: 8.8, master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Assigned] (SOLR-6399) Implement unloadCollection in the Collections API

2020-11-06 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-6399:
---

Assignee: (was: Shalin Shekhar Mangar)

> Implement unloadCollection in the Collections API
> -
>
> Key: SOLR-6399
> URL: https://issues.apache.org/jira/browse/SOLR-6399
> Project: Solr
>  Issue Type: New Feature
>Reporter: dfdeshom
>Priority: Major
> Fix For: 6.0
>
>
> There is currently no way to unload a collection without deleting its 
> contents. There should be a way in the collections API to unload a collection 
> and reload it later, as needed.
> A use case for this is the following: you store logs by day, with each day 
> having its own collection. You are required to store up to 2 years of data, 
which adds up to 730 collections. Most of the time, you'll want to have 3 
days of data loaded for search. Having just 3 collections loaded into memory, 
instead of 730, will make managing Solr easier.






[jira] [Updated] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider

2020-11-05 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14985:
-
Description: 
HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 
seconds.

The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the 
DocCollection is lazy; all collections returned by 
HttpClusterStateProvider are not lazy, which means they are never cached.

The BaseSolrCloudClient has a method for resolving aliases which fetches 
DocCollection for each input collection. This is an HTTP call with no caching 
when using HttpClusterStateProvider. This resolveAliases method is called twice 
for each update.

So overall, at least 3 HTTP calls are made to fetch cluster state for each 
update request when using HttpClusterStateProvider. There may be more if 
aliases are involved or if more than one collection is specified in the 
request. Similar problems exist on the query path as well.

For these reasons, using HttpClusterStateProvider results in very poor latency 
and throughput for update and search requests.

  was:
HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 
seconds.

The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the 
DocCollection is lazy; all collections returned by 
HttpClusterStateProvider are not lazy, which means they are never cached.

The BaseSolrCloudClient has a method for resolving aliases which fetches 
DocCollection for each input collection. This is an HTTP call with no caching 
when using HttpClusterStateProvider. This resolveAliases method is called twice 
for each update.

So overall, at least 3 HTTP calls are made to fetch cluster state for each 
update request when using HttpClusterStateProvider. There may be more if 
aliases are involved or if more than one collection is specified in the 
request. Similar problems exist on the query path as well.

For these reasons, using HttpClusterStateProvider results in very poor latency 
and throughput for update requests.


> Slow indexing and search performance when using HttpClusterStateProvider
> 
>
> Key: SOLR-14985
> URL: https://issues.apache.org/jira/browse/SOLR-14985
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Reporter: Shalin Shekhar Mangar
>Priority: Major
>
> HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 
> seconds.
> The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the 
> DocCollection is lazy; all collections returned by 
> HttpClusterStateProvider are not lazy, which means they are never cached.
> The BaseSolrCloudClient has a method for resolving aliases which fetches 
> DocCollection for each input collection. This is an HTTP call with no caching 
> when using HttpClusterStateProvider. This resolveAliases method is called 
> twice for each update.
> So overall, at least 3 HTTP calls are made to fetch cluster state for each 
> update request when using HttpClusterStateProvider. There may be more if 
> aliases are involved or if more than one collection is specified in the 
> request. Similar problems exist on the query path as well.
> For these reasons, using HttpClusterStateProvider results in very poor 
> latency and throughput for update and search requests.






[jira] [Commented] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider

2020-11-05 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226782#comment-17226782
 ] 

Shalin Shekhar Mangar commented on SOLR-14985:
--

Linking to SOLR-14966 and SOLR-14967

> Slow indexing and search performance when using HttpClusterStateProvider
> 
>
> Key: SOLR-14985
> URL: https://issues.apache.org/jira/browse/SOLR-14985
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Reporter: Shalin Shekhar Mangar
>Priority: Major
>
> HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 
> seconds.
> The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the 
> DocCollection is lazy; all collections returned by 
> HttpClusterStateProvider are not lazy, which means they are never cached.
> The BaseSolrCloudClient has a method for resolving aliases which fetches 
> DocCollection for each input collection. This is an HTTP call with no caching 
> when using HttpClusterStateProvider. This resolveAliases method is called 
> twice for each update.
> So overall, at least 3 HTTP calls are made to fetch cluster state for each 
> update request when using HttpClusterStateProvider. There may be more if 
> aliases are involved or if more than one collection is specified in the 
> request. Similar problems exist on the query path as well.
> For these reasons, using HttpClusterStateProvider results in very poor 
> latency and throughput for update requests.






[jira] [Created] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider

2020-11-05 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14985:


 Summary: Slow indexing and search performance when using 
HttpClusterStateProvider
 Key: SOLR-14985
 URL: https://issues.apache.org/jira/browse/SOLR-14985
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrJ
Reporter: Shalin Shekhar Mangar


HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 
seconds.

The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the 
DocCollection is lazy; all collections returned by 
HttpClusterStateProvider are not lazy, which means they are never cached.

The BaseSolrCloudClient has a method for resolving aliases which fetches 
DocCollection for each input collection. This is an HTTP call with no caching 
when using HttpClusterStateProvider. This resolveAliases method is called twice 
for each update.

So overall, at least 3 HTTP calls are made to fetch cluster state for each 
update request when using HttpClusterStateProvider. There may be more if 
aliases are involved or if more than one collection is specified in the 
request. Similar problems exist on the query path as well.

For these reasons, using HttpClusterStateProvider results in very poor latency 
and throughput for update requests.
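For illustration, the kind of time-bounded caching that this issue says is effectively missing on this path looks roughly like the sketch below. It is a generic stand-in, not SolrJ code; the 60-second TTL and the loader standing in for the HTTP cluster-state fetch are assumptions.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic stand-in (not SolrJ code) for a TTL cache of per-collection state, so that
// repeated lookups within the TTL do not each trigger an HTTP cluster-state fetch.
class TtlCache<K, V> {
  private static final class Entry<V> {
    final V value;
    final long expiresAtNanos;
    Entry(V value, long expiresAtNanos) { this.value = value; this.expiresAtNanos = expiresAtNanos; }
  }

  private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
  private final long ttlNanos;

  TtlCache(long ttlMillis) { this.ttlNanos = ttlMillis * 1_000_000L; }

  V get(K key, Function<K, V> loader) {
    Entry<V> e = cache.get(key);
    if (e != null && System.nanoTime() < e.expiresAtNanos) {
      return e.value;                              // cache hit: no HTTP call
    }
    V value = loader.apply(key);                   // e.g. fetch DocCollection state over HTTP
    cache.put(key, new Entry<>(value, System.nanoTime() + ttlNanos));
    return value;
  }
}
{code}

With something like a 60-second TTL per collection, the three-plus fetches per update described above could collapse to at most one fetch per collection per TTL window.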






[jira] [Updated] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-27 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14942:
-
Fix Version/s: (was: 8.7)
   8.8

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.8
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Resolved] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-27 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14942.
--
Fix Version/s: 8.7
   master (9.0)
   Resolution: Fixed

Thanks Dat, Hoss and Mike!

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.7
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-23 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219733#comment-17219733
 ] 

Shalin Shekhar Mangar commented on SOLR-14942:
--

Thanks Hoss. I have updated the PR with code comments. Mike Drob also gave some 
feedback on the PR which has been incorporated as well. I intend to merge to 
master over the weekend.

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-22 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218973#comment-17218973
 ] 

Shalin Shekhar Mangar commented on SOLR-14942:
--

{code}
final boolean requestIsImportant = 
handler.isRequestImportantEnoughThatItShouldDelayShutdown(solrReq);
if (requestIsImportant && !core.getSolrCoreState().registerInFlightUpdate()) {
{code}

Firstly, the goal is not to delay shutdown but to (slightly) delay leader 
election so that in-flight update requests succeed and we can preserve 
consistency. Jetty already allows a grace period for in-flight requests to 
complete, and our Solr cores, searchers, etc. are reference-counted to allow for 
graceful shutdown.

Secondly, if a request handler chooses to say that its requests are important 
enough to delay leader election, how do we decide the right timeouts in the 
pauseUpdatesAndAwaitInflightRequests() method? For update requests, we can make 
some reasonable assumptions but it is hard to do that in general.

That's why I don't think it makes sense to generalize this part even though I 
agree that the instanceof check is hackish. So unless you or someone else feels 
very strongly about this, I'd like to keep this check as-is.

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores, 
> so the more cores a node has, the worse it gets. This time is spent in 
> IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-20 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217619#comment-17217619
 ] 

Shalin Shekhar Mangar commented on SOLR-14942:
--

bq. every method should have javadocs, especially public ones

I added javadocs for ZkController.tryCancellAllElections

bq. ZkController.electionContexts is a synchronized map – but the new code 
that streams over its `values()` doesn't synchronize on it, which smells like a 
bug?

Yes, thank you! ZkController.close was doing the same thing so I have fixed 
that as well.

bq. if canceling & closing elections is a "slow" enough operation that it makes 
sense to parallelize them, then does it make sense to also check 
zkClient.isClosed() inside the loop, in case the client gets closed out from 
under us? (it's a cheap call, so i don't see any advantage to only checking 
once)

It is not a slow operation. ZkController.close was also using a parallelStream 
on electionContexts, so this was basically copied code, but it doesn't make 
sense. As I noted in a comment on the PR, it deletes a znode and sets a 
volatile member, so I have replaced the parallelStream with a serial forEach.

bq. are there other "inflight" cases where it's risky to shutdown in the middle 
of? replication? peersync? core admin?

When the SolrCoreState.pauseUpdatesAndAwaitInflightRequests() method is 
executed, Jetty has already received a SIGTERM, so it will not allow any new 
connections/requests. Let's talk about ongoing requests:
# All ongoing recoveries/replication (for cores on current node) have stopped 
(ZkController.preClose is called before the 
pauseUpdatesAndAwaitInflightRequests method)
# The election node has not been removed so peersyncs for leader election 
haven't started. (tryCancellAllElections happens after 
pauseUpdatesAndAwaitInflightRequests method)
# If another replica is recovering from this leader and a peersync is 
in-flight, even if we let it complete, subsequent replication requests will fail
# As for core admin requests:
## create, unload and reload are not useful (node is shutting down)
## split shard will eventually fail because it is a multi-step process
## requestrecovery and requestsync are not useful either. After node comes back 
online, all cores will recover again.
## backups and restore operations -- I don't think these should block a 
shutdown operation

bq. instead of hardcoding this instanceof check in HttpSolrCall would it make 
more sense to add a new 'default' method to SolrRequestHandler that 
UpdateRequestHandler (and potentially other handlers) could override to let 
them inspect the request and return true/false if it's "important" enough that 
it must be allowed to block shutdown until complete?

bq. this would also make it easier to bring back the "only block updates if i'm 
the leader" type logic  (living inside thee UpdateRequestHandler impl of the 
new method) at a later date if viable – w/o more API changes

I thought about that optimization (only block updates if I'm the leader) but it 
leads to too many race conditions when leadership is gained and lost. The 
problem is that we must ensure that all registered parties to the phaser 
eventually arrive; if we lose track, it can lead to 
IllegalStateExceptions from the phaser down the line (and even that is best 
effort). That is why I think it is safer to do this inside HttpSolrCall instead 
of giving this choice to plugin writers.
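To make the phaser discussion above concrete, here is a minimal, self-contained sketch of the kind of in-flight tracking being described. It is not the Solr implementation; the method names mirror registerInFlightUpdate/pauseUpdatesAndAwaitInflightRequests from this thread, and everything else is assumed.

{code}
import java.util.concurrent.Phaser;

// Self-contained illustration (not the actual SolrCoreState code) of Phaser-based
// in-flight update tracking: shutdown blocks until registered updates have finished.
class InFlightUpdateTracker {
  // The tracker itself holds one party so the phaser always has at least one registrant.
  private final Phaser inflightUpdates = new Phaser(1);
  private volatile boolean paused = false;

  /** Called at the start of an update request; false means updates are paused (shutting down). */
  boolean registerInFlightUpdate() {
    if (paused) {
      return false;
    }
    inflightUpdates.register();
    return true;
  }

  /** Called (e.g. in a finally block) when an update request completes. */
  void deregisterInFlightUpdate() {
    inflightUpdates.arriveAndDeregister();
  }

  /** Called on shutdown: stop accepting new updates, then wait for the in-flight ones. */
  void pauseUpdatesAndAwaitInflightRequests() {
    paused = true;
    inflightUpdates.arriveAndAwaitAdvance();   // arrives for our own party, waits for the rest
  }
}
{code}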

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change 
> the states of all cores on this node to DOWN. Assuming the current node hosts 
> the leader of a shard, that shard becomes leaderless after this method is 
> called, since the state of the leader is now DOWN. The leader election 
> process is not triggered for the shard because the election node is still 
> held by the current node.
> # Waiting for all cores to be loaded 

[jira] [Created] (SOLR-14942) Reduce leader election time on node shutdown

2020-10-16 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14942:


 Summary: Reduce leader election time on node shutdown
 Key: SOLR-14942
 URL: https://issues.apache.org/jira/browse/SOLR-14942
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 8.6.3, 7.7.3
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar


The credit for this issue and investigation belongs to [~caomanhdat]. I am 
merely reporting the issue and creating PRs based on his work.

The shutdown process waits for all replicas/cores to be closed before removing 
the election node of the leader. This can take some time due to index flush or 
merge activities on the leader cores and delays new leaders from being elected.

This process happens at CoreContainer.shutdown():
# zkController.preClose(): remove the current node from live_nodes and change the 
states of all cores on this node to DOWN. Assuming the current node hosts the leader 
of a shard, that shard becomes leaderless after this method is called, since the 
state of the leader is now DOWN. The leader election process is not triggered for the 
shard because the election node is still held by the current node.
# Waiting for all cores to be loaded (if there are any).
# SolrCores.close(): close all cores.
# zkController.close(): this is where all ephemeral nodes are removed from ZK 
which include election nodes created by this node. Therefore other replicas in 
the shard can take part in the leader election from now.

Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
SIGTERM signal. 

On receiving SIGTERM, Jetty will also stop accepting new connections and new 
requests. This is a very important factor, since even if the leader replica is 
ACTIVE and its node in live_nodes, the shard will be considered as leaderless 
if no-one can index to that shard. Therefore shards become leaderless as soon 
as the node (which contains shard’s leader) receives SIGTERM.

Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards 
remain leaderless. The time needed for step 3 scales with the number of cores, 
so the more cores a node has, the worse it gets. This time is spent in 
IndexWriter.close(), where the system will:
# Flush all pending updates to disk
# Wait for all merges to finish (this is most likely the meaty part)

The shutdown process is proposed to be changed to:
# Wait for all in-flight indexing requests and replication requests to complete
# Remove election nodes
# Close all replicas/cores

This ensures that index flush or merges do not block new leader elections 
anymore.
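A rough Java sketch of the proposed ordering follows. It is illustrative only, not the actual patch; it reuses method names that appear in the review discussion in this thread (preClose, pauseUpdatesAndAwaitInflightRequests, tryCancellAllElections, SolrCores.close), while the surrounding fields, signatures, and error handling are assumptions.

{code}
// Illustrative sketch of the proposed CoreContainer.shutdown() ordering (not the actual code).
public void shutdown() {
  zkController.preClose();            // leave live_nodes, mark local cores DOWN

  // 1. Wait for in-flight indexing and replication requests to complete so that
  //    acknowledged updates are not lost when leadership moves.
  for (SolrCore core : solrCores.getCores()) {
    core.getSolrCoreState().pauseUpdatesAndAwaitInflightRequests();
  }

  // 2. Give up the leader election nodes *before* closing cores, so other replicas
  //    can elect a new leader while this node is still flushing and merging.
  zkController.tryCancellAllElections();

  // 3. Close the cores. IndexWriter.close() may still flush and wait for merges,
  //    but it no longer delays leader election.
  solrCores.close();

  // 4. Remove the remaining ephemeral nodes from ZooKeeper.
  zkController.close();
}
{code}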






[jira] [Resolved] (SOLR-14776) Precompute the fingerprint during PeerSync

2020-10-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14776.
--
Resolution: Fixed

Thanks Dat for the fix and Mike for the reviews!

> Precompute the fingerprint during PeerSync
> --
>
> Key: SOLR-14776
> URL: https://issues.apache.org/jira/browse/SOLR-14776
> Project: Solr
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.7
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Computing a fingerprint can be very costly and take time. But the current 
> implementation sends requests to get fingerprints from multiple 
> replicas, and only on the first response does it compute its own fingerprint 
> for comparison. A very simple but effective improvement here is to compute its 
> own fingerprint right after sending the requests to the other replicas.
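For illustration, the proposed ordering looks roughly like the sketch below. It is not the PeerSync code; the replica URLs, return type, and placeholder methods are assumptions.

{code}
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Illustrative sketch (not the actual PeerSync code): start the remote fingerprint
// requests first, compute the local fingerprint while they are in flight, then compare.
class FingerprintSync {
  int syncWithReplicas(List<String> replicaUrls) {
    // 1. Kick off the remote requests asynchronously.
    List<CompletableFuture<Long>> remote = replicaUrls.stream()
        .map(url -> CompletableFuture.supplyAsync(() -> requestFingerprint(url)))
        .collect(Collectors.toList());

    // 2. Compute our own (expensive) fingerprint concurrently, instead of waiting
    //    for the first response before starting it.
    long localFingerprint = computeLocalFingerprint();

    // 3. Compare as the responses arrive.
    return (int) remote.stream()
        .map(CompletableFuture::join)
        .filter(fp -> fp == localFingerprint)
        .count();
  }

  // Placeholders standing in for the real index fingerprint and the HTTP request.
  long computeLocalFingerprint() { return 42L; }
  long requestFingerprint(String replicaUrl) { return 42L; }
}
{code}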






[jira] [Updated] (SOLR-14776) Precompute the fingerprint during PeerSync

2020-10-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14776:
-
Fix Version/s: 8.7
   master (9.0)

> Precompute the fingerprint during PeerSync
> --
>
> Key: SOLR-14776
> URL: https://issues.apache.org/jira/browse/SOLR-14776
> Project: Solr
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.7
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Computing a fingerprint can be very costly and take time. But the current 
> implementation sends requests to get fingerprints from multiple 
> replicas, and only on the first response does it compute its own fingerprint 
> for comparison. A very simple but effective improvement here is to compute its 
> own fingerprint right after sending the requests to the other replicas.






[jira] [Assigned] (SOLR-14776) Precompute the fingerprint during PeerSync

2020-10-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-14776:


Assignee: Shalin Shekhar Mangar  (was: Cao Manh Dat)

> Precompute the fingerprint during PeerSync
> --
>
> Key: SOLR-14776
> URL: https://issues.apache.org/jira/browse/SOLR-14776
> Project: Solr
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Computing a fingerprint can be very costly and take time. But the current 
> implementation sends requests to get fingerprints from multiple 
> replicas, and only on the first response does it compute its own fingerprint 
> for comparison. A very simple but effective improvement here is to compute its 
> own fingerprint right after sending the requests to the other replicas.






[jira] [Commented] (SOLR-14776) Precompute the fingerprint during PeerSync

2020-10-13 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213015#comment-17213015
 ] 

Shalin Shekhar Mangar commented on SOLR-14776:
--

I have added a comment as Mike suggested. I'll commit this once tests pass 
locally.

> Precompute the fingerprint during PeerSync
> --
>
> Key: SOLR-14776
> URL: https://issues.apache.org/jira/browse/SOLR-14776
> Project: Solr
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Computing a fingerprint can be very costly and take time. But the current 
> implementation sends requests to get fingerprints from multiple 
> replicas, and only on the first response does it compute its own fingerprint 
> for comparison. A very simple but effective improvement here is to compute its 
> own fingerprint right after sending the requests to the other replicas.






[jira] [Created] (SOLR-14640) Improve concurrency of SlowCompositeReaderWrapper.getSortedDocValues

2020-07-09 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14640:


 Summary: Improve concurrency of 
SlowCompositeReaderWrapper.getSortedDocValues
 Key: SOLR-14640
 URL: https://issues.apache.org/jira/browse/SOLR-14640
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Affects Versions: 8.4.1
Reporter: Shalin Shekhar Mangar
 Attachments: Screen Shot 2020-07-09 at 4.46.46 PM.png

Under heavy query load, the synchronized HashMap {{cachedOrdMaps}} inside 
SlowCompositeReaderWrapper.getSortedDocValues blocks search threads.

See attached screenshot of a java flight recording from an affected node. 






[jira] [Commented] (SOLR-14639) Improve concurrency of SlowCompositeReaderWrapper.terms

2020-07-09 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154440#comment-17154440
 ] 

Shalin Shekhar Mangar commented on SOLR-14639:
--

The problem is that ConcurrentHashMap.computeIfAbsent can be costly under 
contention. In JDK8, computeIfAbsent locks the node in which the key should be 
present regardless of whether the key exists or not [1]. This means that 
computeIfAbsent is always blocking as compared to get() which is a non-blocking 
operation. In JDK9, this was slightly ameliorated by adding a fast-return in 
case the key was found in the first node without entering a synchronization 
block. But if there is a hash collision and the key is not in the first node, 
then computeIfAbsent enters into a synchronization block on the node to find 
the key. For a cache, we can expect that the key will exist in most of the 
lookups so it makes sense to avoid the cost of entering a synchronized block 
for retrieval.

Doug Lea wrote on the concurrency mailing list [2]:
{code}
With the current implementation,
if you are implementing a cache, it may be better to code cache.get
to itself do a pre-screen, as in:
   V v = map.get(key);
   return (v != null) ? v : map.computeIfAbsent(key, function);

However, the exact benefit depends on access patterns.
For example, I reran your benchmark cases (urls below) on a
32way x86, and got throughputs (ops/sec) that are dramatically
better with pre-screen for the case of a single key,
but worse with your Zipf-distributed keys.
{code}

I would like to implement this pre-screen approach or switch to Caffeine, which 
has a non-blocking return in case the key already exists [3].

[1] - 
https://concurrency-interest.altair.cs.oswego.narkive.com/0Jfe1waD/computeifabsent-optimized-for-missing-entries
[2] - 
http://cs.oswego.edu/pipermail/concurrency-interest/2014-December/013360.html
[3] - https://github.com/ben-manes/caffeine/wiki/Benchmarks
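A minimal, self-contained version of that pre-screen pattern, generalized as a cache wrapper (not the actual SlowCompositeReaderWrapper code; the loader function is a stand-in for the expensive per-field computation):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Pre-screen pattern: a plain get() first (non-blocking), computeIfAbsent() only on a miss,
// so the common "already cached" case never enters computeIfAbsent's synchronized path.
class PreScreenedCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> loader;   // expensive computation, e.g. merging per-segment terms

  PreScreenedCache(Function<K, V> loader) {
    this.loader = loader;
  }

  V get(K key) {
    V v = cache.get(key);                // non-blocking fast path
    return (v != null) ? v : cache.computeIfAbsent(key, loader);
  }
}
{code}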

> Improve concurrency of SlowCompositeReaderWrapper.terms
> ---
>
> Key: SOLR-14639
> URL: https://issues.apache.org/jira/browse/SOLR-14639
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 8.4.1
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Attachments: Screen Shot 2020-07-09 at 4.38.03 PM.png
>
>
> Under heavy query load, the ConcurrentHashMap.computeIfAbsent method inside 
> the SlowCompositeReaderWrapper.terms(String) method blocks searcher threads 
> (see attached screenshot of a java flight recording).






[jira] [Created] (SOLR-14639) Improve concurrency of SlowCompositeReaderWrapper.terms

2020-07-09 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14639:


 Summary: Improve concurrency of SlowCompositeReaderWrapper.terms
 Key: SOLR-14639
 URL: https://issues.apache.org/jira/browse/SOLR-14639
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Affects Versions: 8.4.1
Reporter: Shalin Shekhar Mangar
 Attachments: Screen Shot 2020-07-09 at 4.38.03 PM.png

Under heavy query load, the ConcurrentHashMap.computeIfAbsent method inside the 
SlowCompositeReaderWrapper.terms(String) method blocks searcher threads (see 
attached screenshot of a java flight recording).






[jira] [Resolved] (SOLR-13325) Add a collection selector to ComputePlanAction

2020-05-22 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-13325.
--
  Assignee: Shalin Shekhar Mangar
Resolution: Fixed

Thanks [~ab] for the review!

> Add a collection selector to ComputePlanAction
> --
>
> Key: SOLR-13325
> URL: https://issues.apache.org/jira/browse/SOLR-13325
> Project: Solr
>  Issue Type: Improvement
>  Components: AutoScaling
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.6
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Similar to SOLR-13273, it'd be nice to have a collection selector that 
> applies to compute plan action. An example use-case would be to selectively 
> add replicas on new nodes for certain collections only.
> Here is a selector that returns collections that match the given collection 
> property/value pair:
> {code}
> "collection": {"property_name": "property_value"}
> {code}
> Here's another selector that returns collections that have the given policy 
> applied
> {code}
> "collection": {"#policy": "policy_name"}
> {code}






[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction

2020-05-11 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13325:
-
Fix Version/s: (was: 8.2)
   8.6

> Add a collection selector to ComputePlanAction
> --
>
> Key: SOLR-13325
> URL: https://issues.apache.org/jira/browse/SOLR-13325
> Project: Solr
>  Issue Type: Improvement
>  Components: AutoScaling
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.6
>
>
> Similar to SOLR-13273, it'd be nice to have a collection selector that 
> applies to compute plan action. An example use-case would be to selectively 
> add replicas on new nodes for certain collections only.
> Here is a selector that returns collections that match the given collection 
> property/value pair:
> {code}
> "collection": {"property_name": "property_value"}
> {code}
> Here's another selector that returns collections that have the given policy 
> applied
> {code}
> "collection": {"#policy": "policy_name"}
> {code}






[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction

2020-05-11 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13325:
-
Summary: Add a collection selector to ComputePlanAction  (was: Add a 
collection selector to triggers)

> Add a collection selector to ComputePlanAction
> --
>
> Key: SOLR-13325
> URL: https://issues.apache.org/jira/browse/SOLR-13325
> Project: Solr
>  Issue Type: Improvement
>  Components: AutoScaling
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.2
>
>
> Similar to SOLR-13273, it'd be nice to have a collection selector that 
> applies to triggers. An example use-case would be to selectively add replicas 
> on new nodes for certain collections only.
> Here is a selector that returns collections that match the given collection 
> property/value pair:
> {code}
> "collection": {"property_name": "property_value"}
> {code}
> Here's another selector that returns collections that have the given policy 
> applied
> {code}
> "collection": {"#policy": "policy_name"}
> {code}






[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction

2020-05-11 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13325:
-
Description: 
Similar to SOLR-13273, it'd be nice to have a collection selector that applies 
to compute plan action. An example use-case would be to selectively add 
replicas on new nodes for certain collections only.

Here is a selector that returns collections that match the given collection 
property/value pair:
{code}
"collection": {"property_name": "property_value"}
{code}
Here's another selector that returns collections that have the given policy 
applied
{code}
"collection": {"#policy": "policy_name"}
{code}

  was:
Similar to SOLR-13273, it'd be nice to have a collection selector that applies 
to triggers. An example use-case would be to selectively add replicas on new 
nodes for certain collections only.

Here is a selector that returns collections that match the given collection 
property/value pair:
{code}
"collection": {"property_name": "property_value"}
{code}
Here's another selector that returns collections that have the given policy 
applied
{code}
"collection": {"#policy": "policy_name"}
{code}


> Add a collection selector to ComputePlanAction
> --
>
> Key: SOLR-13325
> URL: https://issues.apache.org/jira/browse/SOLR-13325
> Project: Solr
>  Issue Type: Improvement
>  Components: AutoScaling
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.2
>
>
> Similar to SOLR-13273, it'd be nice to have a collection selector that 
> applies to compute plan action. An example use-case would be to selectively 
> add replicas on new nodes for certain collections only.
> Here is a selector that returns collections that match the given collection 
> property/value pair:
> {code}
> "collection": {"property_name": "property_value"}
> {code}
> Here's another selector that returns collections that have the given policy 
> applied
> {code}
> "collection": {"#policy": "policy_name"}
> {code}






[jira] [Commented] (SOLR-14472) Autoscaling "cores" preference should count all cores, not just loaded.

2020-05-11 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105054#comment-17105054
 ] 

Shalin Shekhar Mangar commented on SOLR-14472:
--

Transient cores are not supported in SolrCloud today, and autoscaling works 
only in cloud mode. What am I missing here?

> Autoscaling "cores" preference should count all cores, not just loaded.
> ---
>
> Key: SOLR-14472
> URL: https://issues.apache.org/jira/browse/SOLR-14472
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
>
> The AutoScaling "cores" preference works by counting the core names that are 
> retrievable via the metrics API.  99% of the time, that's fine but it does 
> not count unloaded transient cores that are also tracked by the 
> CoreContainer, which I think should be counted as well.  Most users don't 
> have such cores so it won't affect them.
> Furthermore, instead of counting them by asking the metrics API to return 
> each loaded core name, it should use the {{CONTAINER.cores}} prefix set of 
> counters.






[jira] [Commented] (SOLR-13325) Add a collection selector to triggers

2020-04-23 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091090#comment-17091090
 ] 

Shalin Shekhar Mangar commented on SOLR-13325:
--

I'm looking at this again. I think we should change the syntax slightly and get 
rid of the {{#policy}} key name. Instead, this can operate on any collection 
property, such as policy, configName, or autoAddReplicas, that is part of 
the collection state. What slightly complicates things is that there are additional 
collection properties (stored in collectionprops.json); I don't intend to 
support those at the moment. On a related note, collection props have write APIs 
but no read APIs, which severely limits the usefulness of that feature. That's 
something we should fix separately.

Now once we have this working, it reduces the need for a separate 
AutoAddReplicasPlanAction because you can get the same behavior by setting the 
following in ComputePlanAction:
{code}
"collection": {"autoAddReplicas": "true"}
{code}
However, there is a difference between the current implementation of 
"collections" in ComputePlanAction and how AutoAddReplicasPlanAction works: 
the former filters out suggestions for non-matching collections, while the 
latter pushes the collection hint down to the policy engine so that it 
doesn't even compute suggestions for non-matching collections in the first 
place. The latter is obviously more efficient.

The one thing we have to be careful about is that the list of matching
collections should be evaluated lazily, when the action is triggered, rather
than eagerly in the init method, so that it can *see* changes in the cluster
state.
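
To make the proposed syntax concrete, a full trigger configuration using a
property-based selector might look like the following. This is only a sketch:
the {{set-trigger}} payload, event type, and action names are assumptions
based on the existing autoscaling API, and the {{collection}} selector shape
is the proposal discussed above.

{code}
{
  "set-trigger": {
    "name": "node_added_trigger",
    "event": "nodeAdded",
    "waitFor": "5s",
    "actions": [
      {
        "name": "compute_plan",
        "class": "solr.ComputePlanAction",
        "collection": {"autoAddReplicas": "true"}
      },
      {
        "name": "execute_plan",
        "class": "solr.ExecutePlanAction"
      }
    ]
  }
}
{code}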

> Add a collection selector to triggers
> -
>
> Key: SOLR-13325
> URL: https://issues.apache.org/jira/browse/SOLR-13325
> Project: Solr
>  Issue Type: Improvement
>  Components: AutoScaling
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.2
>
>
> Similar to SOLR-13273, it'd be nice to have a collection selector that 
> applies to triggers. An example use-case would be to selectively add replicas 
> on new nodes for certain collections only.
> Here is a selector that returns collections that match the given collection 
> property/value pair:
> {code}
> "collection": {"property_name": "property_value"}
> {code}
> Here's another selector that returns collections that have the given policy
> applied:
> {code}
> "collection": {"#policy": "policy_name"}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-21 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14365.
--
Fix Version/s: 8.6
   master (9.0)
   Resolution: Fixed

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents that reach Collapsing must match
> all filters and queries, so the number of documents Collapsing needs to
> collect/compute scores for is a small fraction of the total number of
> documents in the index. So why do we always need to consume memory (for the
> int[] and float[] arrays) for all unique values of the collapsed field? If
> the number of unique values of the collapsed field found in the documents
> that match the queries and filters is 300, then we only need int[] and
> float[] arrays of size 300, not 1.2 million. However, we don't know in
> advance which values of the collapsed field will show up in the results, so
> we cannot use a smaller array.
> The easy fix for this problem is to allocate only as much as we need by using
> an IntIntMap and an IntFloatMap that hold primitives and are much more space
> efficient than the Java HashMap. These maps can be slower (10x or 20x) than
> plain int[] and float[] arrays if the number of matched documents is large
> (almost all documents match the queries and other filters), but our belief is
> that this does not happen frequently (how often do we run collapsing on the
> entire index?).
> For this issue I propose adding two collapse methods:
> * array: the current implementation
> * hash: the new approach, which will be the default method
> Later we can add another method, {{smart}}, which automatically picks a
> method based on a comparison between the {{number of docs matching queries
> and filters}} and the {{number of unique values of the field}}.
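
As a rough illustration of the difference between the two methods described
above (this is a sketch only, not the actual patch; the real implementation
uses primitive-specialized maps rather than java.util.HashMap, and the names
below are made up):

{code}
import java.util.HashMap;
import java.util.Map;

class CollapseSketch {
  // "array" method: memory scales with the number of unique values in the
  // collapsed field, even if only a handful of documents match.
  static float[] collectWithArray(int numUniqueValues, int[] ords, float[] scores) {
    float[] best = new float[numUniqueValues]; // e.g. 1.2M floats allocated up front
    for (int i = 0; i < ords.length; i++) {
      best[ords[i]] = Math.max(best[ords[i]], scores[i]);
    }
    return best;
  }

  // "hash" method: memory scales with the number of groups actually seen in
  // the matching documents (~300 in the example above).
  static Map<Integer, Float> collectWithHash(int[] ords, float[] scores) {
    Map<Integer, Float> best = new HashMap<>();
    for (int i = 0; i < ords.length; i++) {
      best.merge(ords[i], scores[i], Math::max); // keep the max score per group
    }
    return best;
  }
}
{code}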



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-21 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089177#comment-17089177
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I think this is ready to be cherry picked to branch_8x. I'll do that today 
unless there are any objections.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents that reach Collapsing must match
> all filters and queries, so the number of documents Collapsing needs to
> collect/compute scores for is a small fraction of the total number of
> documents in the index. So why do we always need to consume memory (for the
> int[] and float[] arrays) for all unique values of the collapsed field? If
> the number of unique values of the collapsed field found in the documents
> that match the queries and filters is 300, then we only need int[] and
> float[] arrays of size 300, not 1.2 million. However, we don't know in
> advance which values of the collapsed field will show up in the results, so
> we cannot use a smaller array.
> The easy fix for this problem is to allocate only as much as we need by using
> an IntIntMap and an IntFloatMap that hold primitives and are much more space
> efficient than the Java HashMap. These maps can be slower (10x or 20x) than
> plain int[] and float[] arrays if the number of matched documents is large
> (almost all documents match the queries and other filters), but our belief is
> that this does not happen frequently (how often do we run collapsing on the
> entire index?).
> For this issue I propose adding two collapse methods:
> * array: the current implementation
> * hash: the new approach, which will be the default method
> Later we can add another method, {{smart}}, which automatically picks a
> method based on a comparison between the {{number of docs matching queries
> and filters}} and the {{number of unique values of the field}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14391) Remove getDocSet's manual doc collection logic; remove ScoreFilter

2020-04-18 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086738#comment-17086738
 ] 

Shalin Shekhar Mangar commented on SOLR-14391:
--

bq. Given that this dates back to Lucene 3.2 or so, it was probably the most 
performant way, regardless if there was another way or not.

Is this still the most performant way? [~dsmiley] -- did you compare 
performance before removing the manual doc loop?

> Remove getDocSet's manual doc collection logic; remove ScoreFilter
> --
>
> Key: SOLR-14391
> URL: https://issues.apache.org/jira/browse/SOLR-14391
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{SolrIndexSearcher.getDocSet(List)}} calls getProcessedFilter and 
> then basically loops over doc IDs, passing them through the filter, and 
> passes them to the Collector.  This logic is redundant with what Lucene 
> searcher.search(query,collector) will ultimately do in BulkScorer, and so I 
> propose we remove all that code and delegate to Lucene.
> Also, the top of this method looks to see if any query implements the
> "ScoreFilter" marker interface (only implemented by CollapsingPostFilter) and
> if so delegates to the {{getDocSetScore}} method instead.  That method has an
> implementation close to what I propose getDocSet be changed to, so it can be
> removed along with the ScoreFilter interface.
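
To illustrate the "delegate to Lucene" idea, a collector that gathers matching
doc IDs into a bitset and is handed to searcher.search(query, collector) could
look roughly like this (a sketch under assumed names; Solr's own DocSet
collectors differ in detail):

{code}
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.util.FixedBitSet;

final class BitSetCollector extends SimpleCollector {
  private final FixedBitSet bits;
  private int docBase;

  BitSetCollector(int maxDoc) {
    this.bits = new FixedBitSet(maxDoc);
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    this.docBase = context.docBase; // remember the segment's doc ID offset
  }

  @Override
  public void collect(int doc) {
    bits.set(docBase + doc); // record the global doc ID of each match
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // only matches are needed, not scores
  }

  FixedBitSet getBits() {
    return bits;
  }
}
{code}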



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14406) Use exponential backoff in RecoveryStrategy.pingLeader

2020-04-13 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14406:


 Summary: Use exponential backoff in RecoveryStrategy.pingLeader
 Key: SOLR-14406
 URL: https://issues.apache.org/jira/browse/SOLR-14406
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar


The RecoveryStrategy.pingLeader method tries to connect/ping the known leader 
in a tight loop, waiting only 500ms between attempts. This is wasteful when the 
leader is down and it also litters the logs with messages like the following, 
repeated very frequently (especially when there is more than one replica on 
the node whose leader is down):
{code}
Failed to connect leader http://xyz/solr on recovery, try again
{code}

We should use an exponential back-off between retries here.
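
A minimal sketch of what such a back-off loop could look like (the
pingLeader() stub, bounds, and names below are illustrative assumptions, not
the actual RecoveryStrategy code):

{code}
class LeaderPingBackoff {
  static void waitForLeader() throws InterruptedException {
    long delayMs = 500;             // start where the current tight loop waits
    final long maxDelayMs = 30_000; // cap so a recovered leader is noticed quickly
    while (!pingLeader()) {
      Thread.sleep(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs); // 500ms, 1s, 2s, ... capped
    }
  }

  static boolean pingLeader() {
    return false; // placeholder: the real code would attempt the HTTP ping here
  }
}
{code}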



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-9909:

Fix Version/s: (was: 6.7)
   (was: 7.0)
   8.6
   master (9.0)
 Assignee: Shalin Shekhar Mangar
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks Andras!

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: master (9.0), 8.6
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (SOLR-11960) Add collection level properties

2020-04-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-11960:
-
Comment: was deleted

(was: Get instant response and good care of WiFi range extender for our highly 
talented experts. Available 24/7 365 days to provide you the best assistance 
and support for your wireless range extender. visit :  
[https://www.routerloginnet.tips/])

> Add collection level properties
> ---
>
> Key: SOLR-11960
> URL: https://issues.apache.org/jira/browse/SOLR-11960
> Project: Solr
>  Issue Type: New Feature
>Reporter: Peter Rusko
>Assignee: Tomas Eduardo Fernandez Lobbe
>Priority: Blocker
> Fix For: 7.3, 8.0
>
> Attachments: SOLR-11960.patch, SOLR-11960.patch, SOLR-11960.patch, 
> SOLR-11960.patch, SOLR-11960.patch, SOLR-11960_2.patch
>
>
> Solr has cluster properties, but no easy and extendable way of defining 
> properties that affect a single collection. Collection properties could be 
> stored in a single zookeeper node per collection, making it possible to 
> trigger zookeeper watchers for only those Solr nodes that have cores of that 
> collection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-12 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081999#comment-17081999
 ] 

Shalin Shekhar Mangar commented on SOLR-9909:
-

I accidentally used the wrong issue number in the commit, so the ASF git message 
went to another issue. Here's the commit:

{code}
Commit 13f19f65559290a860df84fa1b5ac2db903b27ec in lucene-solr's branch 
refs/heads/master from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=13f19f6 ]

SOLR-9906: SolrjNamedThreadFactory is deprecated in favor of 
SolrNamedThreadFactory. DefaultSolrThreadFactory is removed from solr-core in 
favor of SolrNamedThreadFactory in solrj package and all solr-core classes now 
use SolrNamedThreadFactory

{code}

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9906) Use better check to validate if node recovered via PeerSync or Replication

2020-04-12 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081998#comment-17081998
 ] 

Shalin Shekhar Mangar commented on SOLR-9906:
-

Please ignore the above comment. It was intended for SOLR-9909.

> Use better check to validate if node recovered via PeerSync or Replication
> --
>
> Key: SOLR-9906
> URL: https://issues.apache.org/jira/browse/SOLR-9906
> Project: Solr
>  Issue Type: Improvement
>Reporter: Pushkar Raste
>Assignee: Noble Paul
>Priority: Minor
> Fix For: 6.4
>
> Attachments: SOLR-9906.patch, SOLR-9906.patch, 
> SOLR-PeerSyncVsReplicationTest.diff
>
>
> Tests {{LeaderFailureAfterFreshStartTest}} and {{PeerSyncReplicationTest}}
> currently rely on the number of requests made to the leader's replication
> handler to check whether a node recovered via PeerSync or replication. This
> check is not very reliable and we have seen failures in the past.
> While tinkering with different ways to write a better test I found
> [SOLR-9859|SOLR-9859]. Now that SOLR-9859 is fixed, here is an idea for a
> better way to distinguish recovery via PeerSync vs replication:
> * For {{PeerSyncReplicationTest}}, if a node successfully recovers via
> PeerSync, then the file {{replication.properties}} should not exist.
> * For {{LeaderFailureAfterFreshStartTest}}, if the freshly replicated node
> does not go into replication recovery after the leader failure, the contents
> of {{replication.properties}} should not change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-12 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081693#comment-17081693
 ] 

Shalin Shekhar Mangar commented on SOLR-9909:
-

Updated patch that adds the ASL to the new class. This will make the RAT check 
pass.

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-12 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-9909:

Attachment: SOLR-9909.patch

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-11 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081256#comment-17081256
 ] 

Shalin Shekhar Mangar commented on SOLR-9909:
-

Updated patch which deprecates SolrjNamedThreadFactory and adds a 
SolrNamedThreadFactory. Once this patch is applied on master and branch_8x, I 
will follow up with a commit on master to delete the deprecated 
SolrjNamedThreadFactory.

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-11 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-9909:

Attachment: SOLR-9909.patch

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-11 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081253#comment-17081253
 ] 

Shalin Shekhar Mangar commented on SOLR-9909:
-

Well, we can deprecate SolrjNamedThreadFactory and add a SolrNamedThreadFactory 
in 8x. The former can be deleted on master.

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081131#comment-17081131
 ] 

Shalin Shekhar Mangar commented on SOLR-9909:
-

Patch updated to master.

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-9909:

Attachment: SOLR-9909.patch

> Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
> 
>
> Key: SOLR-9909
> URL: https://issues.apache.org/jira/browse/SOLR-9909
> Project: Solr
>  Issue Type: Task
>Reporter: Shalin Shekhar Mangar
>Priority: Trivial
> Fix For: 6.7, 7.0
>
> Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, 
> SOLR-9909-03.patch, SOLR-9909.patch
>
>
> DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same 
> code. Let's remove one of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14402:
-
Fix Version/s: 8.6
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Avoid creating new exceptions for every request made to 
> MDCAwareThreadPoolExecutor by distributed search
> 
>
> Key: SOLR-14402
> URL: https://issues.apache.org/jira/browse/SOLR-14402
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.6
>
> Attachments: SOLR-14402.patch
>
>
> SOLR-11880 tried to do the same and it succeeded for update shard handler but 
> the implementation was wrong for http shard handler because the executor 
> created during construction is overwritten in the init() method. The commit 
> for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/
> Thanks [~caomanhdat] for spotting this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14402:
-
Status: Patch Available  (was: Open)

> Avoid creating new exceptions for every request made to 
> MDCAwareThreadPoolExecutor by distributed search
> 
>
> Key: SOLR-14402
> URL: https://issues.apache.org/jira/browse/SOLR-14402
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Attachments: SOLR-14402.patch
>
>
> SOLR-11880 tried to do the same and it succeeded for update shard handler but 
> the implementation was wrong for http shard handler because the executor 
> created during construction is overwritten in the init() method. The commit 
> for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/
> Thanks [~caomanhdat] for spotting this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080406#comment-17080406
 ] 

Shalin Shekhar Mangar commented on SOLR-14402:
--

Here's a simple patch that sets enableSubmitterStackTrace to false while 
creating the executor inside HttpShardHandlerFactory's init method. It removes 
the executor that was initialized in the class attribute because it is 
overwritten in init anyway. The patch also fixes ZkControllerTest and 
OverseerTest which were using HttpShardHandlerFactory without calling init 
first.
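
For background, the cost being avoided is the pattern of capturing a stack
trace at submit time so that task failures can be traced back to the
submitter. A generic sketch of that idea (an illustration only, not Solr's
MDCAwareThreadPoolExecutor):

{code}
class SubmitterTraceSketch {
  private final boolean enableSubmitterStackTrace;

  SubmitterTraceSketch(boolean enableSubmitterStackTrace) {
    this.enableSubmitterStackTrace = enableSubmitterStackTrace;
  }

  Runnable wrap(Runnable task) {
    // Creating (and filling in) an exception on every submit is the expensive
    // part; disabling it avoids that allocation on hot paths like distributed
    // search fan-out.
    final Exception submitterTrace =
        enableSubmitterStackTrace ? new Exception("submitter stack trace") : null;
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        if (submitterTrace != null) {
          e.addSuppressed(submitterTrace); // show where the task was submitted from
        }
        throw e;
      }
    };
  }
}
{code}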

> Avoid creating new exceptions for every request made to 
> MDCAwareThreadPoolExecutor by distributed search
> 
>
> Key: SOLR-14402
> URL: https://issues.apache.org/jira/browse/SOLR-14402
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Attachments: SOLR-14402.patch
>
>
> SOLR-11880 tried to do the same and it succeeded for update shard handler but 
> the implementation was wrong for http shard handler because the executor 
> created during construction is overwritten in the init() method. The commit 
> for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/
> Thanks [~caomanhdat] for spotting this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14402:
-
Attachment: SOLR-14402.patch

> Avoid creating new exceptions for every request made to 
> MDCAwareThreadPoolExecutor by distributed search
> 
>
> Key: SOLR-14402
> URL: https://issues.apache.org/jira/browse/SOLR-14402
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Attachments: SOLR-14402.patch
>
>
> SOLR-11880 tried to do the same and it succeeded for update shard handler but 
> the implementation was wrong for http shard handler because the executor 
> created during construction is overwritten in the init() method. The commit 
> for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/
> Thanks [~caomanhdat] for spotting this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080401#comment-17080401
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I just saw this test failure on master which seems related and is reproducible:

{code}
[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestRandomCollapseQParserPlugin 
-Dtests.method=testRandomCollpaseWithSort -Dtests.seed=20C0F4D7CBA81876 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=lv 
-Dtests.timezone=America/St_Johns -Dtests.asserts=true 
-Dtests.file.encoding=ANSI_X3.4-1968
   [junit4] FAILURE 7.30s J4 | 
TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort <<<
   [junit4]> Throwable #1: java.lang.AssertionError: collapseKey too big -- 
need to grow array?
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([20C0F4D7CBA81876:257D871EE0002B85]:0)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$SortFieldsCompare.setGroupValues(CollapsingQParserPlugin.java:2702)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$IntSortSpecStrategy.collapse(CollapsingQParserPlugin.java:2544)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$IntFieldValueCollector.collect(CollapsingQParserPlugin.java:1223)
   [junit4]>at 
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:254)
   [junit4]>at 
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:205)
   [junit4]>at 
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:739)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:202)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1651)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1469)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399)
   [junit4]>at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)
   [junit4]>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:209)
   [junit4]>at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2565)
   [junit4]>at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:227)
   [junit4]>at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
   [junit4]>at 
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1003)
   [junit4]>at 
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1018)
   [junit4]>at 
org.apache.solr.search.TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort(TestRandomCollapseQParserPlugin.java:158)
   [junit4]>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{code}

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents that reach Collapsing must match
> all filters and queries, so the number of documents Collapsing needs to
> collect/compute scores for is a small fraction of the total number of
> documents in the index. So why do we always need to consume memory (for the
> int[] and float[] arrays) for all unique values of the collapsed field? If
> the number of unique values of the collapsed field found in the documents
> that match the queries and filters is 300, then we only need int[] and
> float[] arrays of size 300, not 1.2 million. However, we don't know in
> advance which values of the collapsed field will show up in the results, so
> we cannot use a smaller array.
> The easy fix for this problem is 

[jira] [Created] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search

2020-04-09 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14402:


 Summary: Avoid creating new exceptions for every request made to 
MDCAwareThreadPoolExecutor by distributed search
 Key: SOLR-14402
 URL: https://issues.apache.org/jira/browse/SOLR-14402
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 7.4
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar


SOLR-11880 tried to do the same and it succeeded for update shard handler but 
the implementation was wrong for http shard handler because the executor 
created during construction is overwritten in the init() method. The commit for 
SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/

Thanks [~caomanhdat] for spotting this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-12720) Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0

2020-04-05 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-12720.
--
  Assignee: Shalin Shekhar Mangar
Resolution: Fixed

Thanks [~marcussorealheis]!

> Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0
> -
>
> Key: SOLR-12720
> URL: https://issues.apache.org/jira/browse/SOLR-12720
> Project: Solr
>  Issue Type: Task
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Blocker
> Fix For: master (9.0), 8.1
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> SOLR-12719 deprecated the autoReplicaFailoverWaitAfterExpiration property in 
> solr.xml. We should remove it entirely in Solr 8.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12720) Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0

2020-04-05 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075999#comment-17075999
 ] 

Shalin Shekhar Mangar commented on SOLR-12720:
--

The commit for this issue used the wrong Jira:

{quote}
Commit 9322a7b37555832e41a25bbc556a34299b90204e in lucene-solr's branch 
refs/heads/master from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9322a7b ]

SOLR-12067: Remove support for autoReplicaFailoverWaitAfterExpiration

This closes #1402.
{quote}

> Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0
> -
>
> Key: SOLR-12720
> URL: https://issues.apache.org/jira/browse/SOLR-12720
> Project: Solr
>  Issue Type: Task
>  Components: AutoScaling, SolrCloud
>Reporter: Shalin Shekhar Mangar
>Priority: Blocker
> Fix For: 8.1, master (9.0)
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> SOLR-12719 deprecated the autoReplicaFailoverWaitAfterExpiration property in 
> solr.xml. We should remove it entirely in Solr 8.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14347) Autoscaling placement wrong when concurrent replica placements are calculated

2020-04-05 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075774#comment-17075774
 ] 

Shalin Shekhar Mangar commented on SOLR-14347:
--

The PR is still open; can you please close that too?

> Autoscaling placement wrong when concurrent replica placements are calculated
> -
>
> Key: SOLR-14347
> URL: https://issues.apache.org/jira/browse/SOLR-14347
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling
>Affects Versions: 8.5
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Major
> Fix For: 8.6
>
> Attachments: SOLR-14347.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
>  * create a cluster of a few nodes (tested with 7 nodes)
>  * define per-collection policies that distribute replicas exclusively on 
> different nodes per policy
>  * concurrently create a few collections, each using a different policy
>  * resulting replica placement will be seriously wrong, causing many policy 
> violations
> Running the same scenario but instead creating collections sequentially 
> results in no violations.
> I suspect this is caused by incorrect locking level for all collection 
> operations (as defined in {{CollectionParams.CollectionAction}}) that create 
> new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, 
> REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations 
> use the policy engine to create new replica placements, and as a result they 
> change the cluster state. However, currently these operations are locked (in 
> {{OverseerCollectionMessageHandler.lockTask}} ) using 
> {{LockLevel.COLLECTION}}. In practice this means that the lock is held only 
> for the particular collection that is being modified.
> A straightforward fix for this issue is to change the locking level to 
> CLUSTER (and I confirm this fixes the scenario described above). However, 
> this effectively serializes all collection operations listed above, which 
> will result in general slow-down of all collection operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14374) Use coreLoadExecutor to load all cores; not just startup

2020-03-30 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071421#comment-17071421
 ] 

Shalin Shekhar Mangar commented on SOLR-14374:
--

[~dsmiley] - yes that makes sense, thank you!

> Use coreLoadExecutor to load all cores; not just startup
> 
>
> Key: SOLR-14374
> URL: https://issues.apache.org/jira/browse/SOLR-14374
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>
> CoreContainer.load() creates coreLoadExecutor (an Executor) to load 
> pre-existing cores concurrently -- defaulting to 8 at a time.  Then it's 
> never used again.  However, cores might be loaded in other circumstances: (a)
> creating new cores, (b) "transient" cores, (c) loadOnStartup=false cores, or
> (d) reloading cores.  By using coreLoadExecutor for all cases, we'll then have
> metrics for core loading that work globally and not just on startup since 
> coreLoadExecutor is instrumented already -- 
> {{CONTAINER.threadPool.coreLoadExecutor}} metrics path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14356) PeerSync with hanging nodes

2020-03-28 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069426#comment-17069426
 ] 

Shalin Shekhar Mangar commented on SOLR-14356:
--

Okay, yes, let's add the connect timeout exception and discuss a better fix in 
SOLR-14368.

> PeerSync with hanging nodes
> ---
>
> Key: SOLR-14356
> URL: https://issues.apache.org/jira/browse/SOLR-14356
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14356.patch
>
>
> Right now in {{PeerSync}} (during leader election), if an exception occurs
> while requesting versions from a node, we will skip that node if the
> exception is one of the following types:
> * ConnectTimeoutException
> * NoHttpResponseException
> * SocketException
> Sometimes the other node basically hangs but still accepts connections. In
> that case a SocketTimeoutException is thrown, we consider the {{PeerSync}}
> process failed, and the whole shard is basically leaderless forever (as long
> as the hung node is still there).
> We can't just blindly add {{SocketTimeoutException}} to the above list, since
> [~shalin] mentioned that a timeout can sometimes happen for genuine reasons
> too, e.g. a temporary GC pause.
> I think the general idea here is that we obey the {{leaderVoteWait}}
> restriction and retry syncing with others when a connection/timeout exception
> happens.
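
A rough sketch of that idea, retrying the sync within the {{leaderVoteWait}}
budget (method names and exception handling below are assumptions for
illustration, not the actual PeerSync code):

{code}
class PeerSyncRetrySketch {
  static boolean syncWithRetries(long leaderVoteWaitMs) throws InterruptedException {
    final long deadline = System.nanoTime() + leaderVoteWaitMs * 1_000_000L;
    while (System.nanoTime() < deadline) {
      try {
        return attemptPeerSync(); // true if versions were fetched and applied
      } catch (java.io.IOException e) {
        Thread.sleep(1_000); // connection/timeout problem: wait and try again
      }
    }
    return false; // give up once leaderVoteWait has elapsed
  }

  static boolean attemptPeerSync() throws java.io.IOException {
    return false; // placeholder for the real request-versions/apply-updates logic
  }
}
{code}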



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-28 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069285#comment-17069285
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I think we should add another method and make it configurable.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
>
> Since Collapsing is a PostFilter, documents that reach Collapsing must match
> all filters and queries, so the number of documents Collapsing needs to
> collect/compute scores for is a small fraction of the total number of
> documents in the index. So why do we always need to consume memory (for the
> int[] and float[] arrays) for all unique values of the collapsed field? If
> the number of unique values of the collapsed field found in the documents
> that match the queries and filters is 300, then we only need int[] and
> float[] arrays of size 300, not 1.2 million. However, we don't know in
> advance which values of the collapsed field will show up in the results, so
> we cannot use a smaller array.
> The easy fix for this problem is to allocate only as much as we need by using
> an IntIntMap and an IntFloatMap that hold primitives and are much more space
> efficient than the Java HashMap. These maps can be slower (10x or 20x) than
> plain int[] and float[] arrays if the number of matched documents is large
> (almost all documents match the queries and other filters), but our belief is
> that this does not happen frequently (how often do we run collapsing on the
> entire index?).
> For this issue I propose adding two collapse methods:
> * array: the current implementation
> * hash: the new approach, which will be the default method
> Later we can add another method, {{smart}}, which automatically picks a
> method based on a comparison between the {{number of docs matching queries
> and filters}} and the {{number of unique values of the field}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-10397) Port 'autoAddReplicas' feature to the autoscaling framework and make it work with non-shared filesystems

2020-03-17 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061338#comment-17061338
 ] 

Shalin Shekhar Mangar commented on SOLR-10397:
--

[~dsmiley] - I agree that both of those paths are bad. It could go to the core 
descriptor.

> Port 'autoAddReplicas' feature to the autoscaling framework and make it work 
> with non-shared filesystems
> 
>
> Key: SOLR-10397
> URL: https://issues.apache.org/jira/browse/SOLR-10397
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Cao Manh Dat
>Priority: Major
>  Labels: autoscaling
> Fix For: 7.1, 8.0
>
> Attachments: SOLR-10397.1.patch, SOLR-10397.2.patch, 
> SOLR-10397.2.patch, SOLR-10397.2.patch, SOLR-10397.patch, 
> SOLR-10397_remove_nocommit.patch
>
>
> Currently 'autoAddReplicas=true' can be specified in the Collection Create 
> API to automatically add replicas when a replica becomes unavailable. I 
> propose to move this feature to the autoscaling cluster policy rules design.
> This will include the following:
> * Trigger support for ‘nodeLost’ event type
> * Modification of existing implementation of ‘autoAddReplicas’ to 
> automatically create the appropriate ‘nodeLost’ trigger.
> * Any such auto-created trigger must be marked internally such that setting 
> ‘autoAddReplicas=false’ via the Modify Collection API should delete or 
> disable corresponding trigger.
> * Support for non-HDFS filesystems while retaining the optimization afforded 
> by HDFS i.e. the replaced replica can point to the existing data dir of the 
> old replica.
> * Deprecate/remove the feature of enabling/disabling ‘autoAddReplicas’ across 
> the entire cluster using cluster properties in favor of using the 
> suspend-trigger/resume-trigger APIs.
> This will retain backward compatibility for the most part and keep a common 
> use-case easy to enable as well as make it available to more people (i.e. 
> people who don't use HDFS).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces

2020-03-08 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-13996.
--
Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed

I have a few more improvements planned but 8.5 has been cut so I will close 
this issue and open another.

> Refactor HttpShardHandler#prepDistributed() into smaller pieces
> ---
>
> Key: SOLR-13996
> URL: https://issues.apache.org/jira/browse/SOLR-13996
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-13996.patch, SOLR-13996.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, it is very hard to understand all the various things being done in 
> HttpShardHandler. I'm starting with refactoring the prepDistributed() method 
> to make it easier to grasp. It has standalone and cloud code intertwined, and
> I wanted to cleanly separate them out. Later, we can even have two separate
> methods (one for standalone and one for cloud).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050677#comment-17050677
 ] 

Shalin Shekhar Mangar commented on SOLR-13942:
--

As someone who runs a managed search service and has to troubleshoot Solr 
issues, I want to add my 2 cents.

There's plenty of information that is required for troubleshooting but is not 
available in clusterstatus or any other documented/public API. Sure there's the 
undocumented /admin/zookeeper which has a weird output format meant for I don't 
know who. But even that does not have a few things that I've found necessary to 
troubleshoot Solr.

Here's a non-exhaustive list of things you need to troubleshoot Solr:
# Length of overseer queues (available in overseerstatus API)
# Contents of overseer queue (mildly useful, available in /admin/zookeeper)
# Overseer election queue and current leader (former is available in 
/admin/zookeeper and latter in overseer status)
# Cluster state (cluster status API)
# Solr.xml (no API regardless of whether it is in ZK or filesystem)
# Leader election queue and current leader for each shard (available in 
/admin/zookeeper)
# Shard terms for each shard/replica (not available in any API)
# Metrics/stats (metrics API)
# Solr Logs (log API? unless it is rolled over)
# GC logs (no API)

The overseerstatus API cannot be hit if there is no overseer so there's that 
too.

We run ZK and Solr inside Kubernetes and we do not expose ZooKeeper publicly. 
So using a tool like zkcli means we have to port-forward directly to the ZK 
node, which needs explicit privileges. Ideally we want to hit everything over 
http and never allow port forward privileges to anyone.

So I see the following options:
# Add missing information that is inside ZK (shard terms) to /admin/zookeeper 
and continue to live with its horrible output
# Immediately change /admin/zookeeper to a better output format and change the 
UI to consume this new format
# Deprecate /admin/zookeeper, introduce a clean API, migrate UI to this new 
endpoint or a better alternative and remove /admin/zookeeper in 9.0
# Not do anything and force people to use zkcli and existing Solr APIs for 
troubleshooting, as we've been doing till now

My vote is to go with #3 and we can debate what we want to call the API and 
whether it should be a public, documented, supported API or an undocumented API 
like /admin/zookeeper. My preference is to keep this undocumented and 
unsupported just like /admin/zookeeper. The other question is how we can secure 
it -- is it enough to be the same as /admin/zookeeper from a security 
perspective?

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces

2020-02-27 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046629#comment-17046629
 ] 

Shalin Shekhar Mangar commented on SOLR-13996:
--

Fair enough, I'll rename the class.

> Refactor HttpShardHandler#prepDistributed() into smaller pieces
> ---
>
> Key: SOLR-13996
> URL: https://issues.apache.org/jira/browse/SOLR-13996
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Attachments: SOLR-13996.patch, SOLR-13996.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, it is very hard to understand all the various things being done in 
> HttpShardHandler. I'm starting with refactoring the prepDistributed() method 
> to make it easier to grasp. It has standalone and cloud code intertwined, and
> I wanted to cleanly separate them out. Later, we can even have two separate
> methods (one for standalone and one for cloud).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2020-02-20 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-12550:
-
Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks Marc and Bérénice!

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Marc Morissette
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.
> A patch is attached in the links section.
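
For context, the fix described here amounts to propagating the builder's
timeouts when the internal client is constructed, roughly along these lines (a
sketch; the builder methods are from SolrJ's HttpSolrClient.Builder, and the
URL and timeout values are placeholders):

{code}
import org.apache.solr.client.solrj.impl.HttpSolrClient;

class TimeoutAwareClientSketch {
  static HttpSolrClient build() {
    return new HttpSolrClient.Builder("http://localhost:8983/solr")
        .withConnectionTimeout(15_000) // connect timeout in ms
        .withSocketTimeout(600_000)    // socket (read) timeout, long enough for optimize
        .build();
  }
}
{code}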



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2020-02-20 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-12550:
-
Component/s: SolrJ

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>  Components: SolrJ
>Reporter: Marc Morissette
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.
> A patch is attached in the links section.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2020-02-20 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-12550:


Assignee: Shalin Shekhar Mangar

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>Reporter: Marc Morissette
>Assignee: Shalin Shekhar Mangar
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.
> A patch is attached in the links section.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2020-02-18 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039017#comment-17039017
 ] 

Shalin Shekhar Mangar commented on SOLR-12550:
--

I added a review comment to #417. Other than that it looks good.

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>Reporter: Marc Morissette
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.
> A patch is attached in the links section.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14248.
--
Resolution: Fixed

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the 
> DocCollection.getActiveSlices method returns an empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032824#comment-17032824
 ] 

Shalin Shekhar Mangar commented on SOLR-14248:
--

The latest patch adds support for replica types and resolves a conflict 
introduced by SOLR-14245. It also adds a test for this class. This is ready to 
go.

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the 
> DocCollection.getActiveSlices method returns an empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14248:
-
Attachment: SOLR-14248.patch

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch, SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the 
> DocCollection.getActiveSlices method returns an empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032431#comment-17032431
 ] 

Shalin Shekhar Mangar commented on SOLR-14248:
--

This patch fixes all the problems except #5. The way it fixes #3 is a hack, but 
that's the best I could do without creating a builder class for DocCollection. 
I've left a TODO comment in there to describe the hack and the eventual fix.

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the 
> DocCollection.getActiveSlices method returns an empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14248:
-
Attachment: SOLR-14248.patch

> Improve ClusterStateMockUtil and make its methods public
> 
>
> Key: SOLR-14248
> URL: https://issues.apache.org/jira/browse/SOLR-14248
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14248.patch
>
>
> While working on SOLR-13996, I had the need to mock the cluster state for 
> various configurations and I used ClusterStateMockUtil.
> However, I ran into a few issues that needed to be fixed:
> 1. The methods in this class are protected, making it usable only within the 
> same package
> 2. A null router was set for DocCollection objects
> 3. The DocCollection object is created before the slices, so the 
> DocCollection.getActiveSlices method returns an empty list because the active 
> slices map is created inside the DocCollection constructor
> 4. It did not set a core name for the replicas it created
> 5. It has no support for replica types, so it only creates NRT replicas
> I will use this Jira to fix these problems and make the methods in that class 
> public (but marked as experimental)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public

2020-02-07 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14248:


 Summary: Improve ClusterStateMockUtil and make its methods public
 Key: SOLR-14248
 URL: https://issues.apache.org/jira/browse/SOLR-14248
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
 Fix For: master (9.0), 8.5


While working on SOLR-13996, I had the need to mock the cluster state for 
various configurations and I used ClusterStateMockUtil.

However, I ran into a few issues that needed to be fixed:
1. The methods in this class are protected, making it usable only within the 
same package
2. A null router was set for DocCollection objects
3. The DocCollection object is created before the slices, so the 
DocCollection.getActiveSlices method returns an empty list because the active 
slices map is created inside the DocCollection constructor (see the sketch below)
4. It did not set a core name for the replicas it created
5. It has no support for replica types, so it only creates NRT replicas

I will use this Jira to fix these problems and make the methods in that class 
public (but marked as experimental)
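
The ordering problem in point 3 is a general Java pitfall: anything computed inside a 
constructor does not reflect entries added to the backing map afterwards. A 
self-contained illustration (generic code; these are not the actual 
ClusterStateMockUtil or DocCollection classes):

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SliceSnapshot {
  private final List<String> activeSlices = new ArrayList<>(); // computed once, in the constructor

  SliceSnapshot(Map<String, String> sliceStates) {
    for (Map.Entry<String, String> e : sliceStates.entrySet()) {
      if ("active".equals(e.getValue())) {
        activeSlices.add(e.getKey());
      }
    }
  }

  List<String> getActiveSlices() {
    return activeSlices;
  }
}

public class ConstructionOrderDemo {
  public static void main(String[] args) {
    Map<String, String> sliceStates = new HashMap<>();
    // Snapshot object built before any slices exist -- mirrors point 3 above.
    SliceSnapshot collection = new SliceSnapshot(sliceStates);
    sliceStates.put("shard1", "active"); // added too late to be seen
    System.out.println(collection.getActiveSlices()); // prints [] instead of [shard1]
  }
}
{code}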



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2020-01-28 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025563#comment-17025563
 ] 

Shalin Shekhar Mangar commented on SOLR-13897:
--

Thanks [~jpountz] for fixing. I forgot that javadoc changes can cause precommit 
to fail.

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, 
> SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.
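
As background for the kind of fix this calls for (the general pattern, not the actual 
SOLR-13897 patch): safe publication requires that readers synchronize on the same 
lock the writers use, or that the shared reference be volatile. A generic sketch:

{code}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Generic safe-publication pattern: reads and writes of the shared reference go
 * through the same ReadWriteLock, so a write is guaranteed to be visible to
 * subsequent readers. Illustration only; not the actual ZkShardTerms code.
 */
class GuardedReference<T> {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private T value; // e.g. the current Terms snapshot

  T get() {
    lock.readLock().lock();
    try {
      return value;
    } finally {
      lock.readLock().unlock();
    }
  }

  void set(T newValue) {
    lock.writeLock().lock();
    try {
      value = newValue;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}

Marking the field volatile achieves the same visibility guarantee when the published 
object itself is immutable.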



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces

2020-01-28 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025122#comment-17025122
 ] 

Shalin Shekhar Mangar commented on SOLR-13996:
--

I've been working on a refactoring of this method and it's my fault that I 
didn't see this issue and the PR earlier. However, my goals are a bit more 
ambitious. This first PR https://github.com/apache/lucene-solr/pull/1220 is 
just a re-organization of the code but I'll be expanding it further by adding 
tests for each individual case and then move on to improve performance. 
Currently this class is quite inefficient: it parses and re-parses shard URLs and 
creates strings out of them even for SolrCloud cases. The goal is to eventually 
have a cloud-focused class that is extremely efficient and avoids unnecessary 
copies of shards/replicas completely. This will require changes in other places 
as well, e.g. the host checker can be made to operate in a streaming mode. I 
haven't quite decided on how the replica list transformer 
should be changed.

I hope you don't mind Ishan but I'll assign this issue and take this forward. 
Reviews welcome!

> Refactor HttpShardHandler#prepDistributed() into smaller pieces
> ---
>
> Key: SOLR-13996
> URL: https://issues.apache.org/jira/browse/SOLR-13996
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Attachments: SOLR-13996.patch, SOLR-13996.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, it is very hard to understand all the various things being done in 
> HttpShardHandler. I'm starting with refactoring the prepDistributed() method 
> to make it easier to grasp. It has standalone and cloud code intertwined, and 
> I wanted to cleanly separate them out. Later, we can even have two separate 
> methods (one for standalone and one for cloud).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces

2020-01-28 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-13996:


Assignee: Shalin Shekhar Mangar

> Refactor HttpShardHandler#prepDistributed() into smaller pieces
> ---
>
> Key: SOLR-13996
> URL: https://issues.apache.org/jira/browse/SOLR-13996
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Attachments: SOLR-13996.patch, SOLR-13996.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, it is very hard to understand all the various things being done in 
> HttpShardHandler. I'm starting with refactoring the prepDistributed() method 
> to make it easier to grasp. It has standalone and cloud code intertwined, and 
> I wanted to cleanly separate them out. Later, we can even have two separate 
> methods (one for standalone and one for cloud).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2020-01-26 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-13897.
--
Fix Version/s: 8.5
   Resolution: Fixed

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, 
> SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2020-01-26 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Status: Open  (was: Patch Available)

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, 
> SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14210) Introduce Node-level status handler for replicas

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022645#comment-17022645
 ] 

Shalin Shekhar Mangar commented on SOLR-14210:
--

Why not extend the same /admin/info/health that we have with another parameter?
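
For context, the existing node-level health check handler referenced above is served 
at /admin/info/health. A sketch of what the calls might look like if it were extended 
with a parameter instead of adding a new handler (the parameter name below is purely 
hypothetical, not an existing option):

{code}
# Existing node-level health check (reports whether the node itself is up and connected to ZooKeeper)
curl "http://localhost:8983/solr/admin/info/health"

# Hypothetical extension: only report healthy once every replica hosted on this node is ACTIVE
curl "http://localhost:8983/solr/admin/info/health?requireHealthyCores=true"
{code}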

> Introduce Node-level status handler for replicas
> 
>
> Key: SOLR-14210
> URL: https://issues.apache.org/jira/browse/SOLR-14210
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (9.0), 8.5
>Reporter: Houston Putman
>Priority: Major
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native 
> way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveness' and 'readiness' probes, explained in 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/n],
>  to determine if a node is live and ready to serve live traffic.
> {quote}
>  
> However, there are issues around Kubernetes managing its own rolling restarts. 
> With the current healthcheck setup, it's easy to envision a scenario in which 
> Solr reports itself as "healthy" when all of its replicas are actually 
> recovering. Kubernetes, seeing a healthy pod, would then go and restart the 
> next Solr node. This can happen until all replicas are "recovering" and none 
> are healthy (maybe the last one restarted will be "down", but still there are 
> no "active" replicas).
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all 
> replicas hosted by that Solr node are healthy and "active". That way we will 
> be able to use the [default kubernetes rolling restart 
> logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies]
>  with Solr.
> To add on to [Jan's point 
> here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559],
>  this handler should be more friendly for other Content-Types and should use 
> better HTTP response statuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14208) Reproducible test failure on TestBulkSchemaConcurrent

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14208.
--
Resolution: Duplicate

Andrzej fixed this in SOLR-14211.

> Reproducible test failure on TestBulkSchemaConcurrent
> -
>
> Key: SOLR-14208
> URL: https://issues.apache.org/jira/browse/SOLR-14208
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Shalin Shekhar Mangar
>Priority: Major
>
> I found the following test failure on master branch while running tests on 
> SOLR-14207. The test failure is reproducible without the SOLR-14207 patch.
> {code}
> ant test  -Dtestcase=TestBulkSchemaConcurrent -Dtests.method=test 
> -Dtests.seed=AE6DC9DB591DAB9E -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=hi-IN -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> {code}
> The logs are full of the following warning repeated over and over:
> {code}
> [junit4]   2> 32396 WARN  (qtp1791658098-110) [n:127.0.0.1:46453_rx_%2Fr 
> c:collection1 s:shard2 r:core_node8 x:collection1_shard2_replica_n5 ] 
> o.a.s.s.SchemaManager Unable to retrieve fresh managed schema managed-schema
>[junit4]   2>   => java.lang.IllegalArgumentException: Path must 
> start with / character
>[junit4]   2>  at 
> org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51)
>[junit4]   2> java.lang.IllegalArgumentException: Path must start with / 
> character
>[junit4]   2>  at 
> org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) 
> ~[zookeeper-3.5.5.jar:3.5.5]
>[junit4]   2>  at 
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2000) 
> ~[zookeeper-3.5.5.jar:3.5.5]
>[junit4]   2>  at 
> org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:314)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:314) 
> ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.schema.SchemaManager.getFreshManagedSchema(SchemaManager.java:427)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.schema.SchemaManager.doOperations(SchemaManager.java:107) 
> ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.schema.SchemaManager.performOperations(SchemaManager.java:92) 
> ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:90)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:208)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2582) ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578) ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
>  ~[java/:?]
>[junit4]   2>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
>  ~[java/:?]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Attachment: SOLR-13897.patch

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, 
> SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021920#comment-17021920
 ] 

Shalin Shekhar Mangar commented on SOLR-13897:
--

Updated patch so that it applies on master.

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, 
> SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14172.
--
Resolution: Fixed

Thanks Andras for the PR and Kevin for his review!

> Collection metadata remains in zookeeper if too many shards requested
> -
>
> Key: SOLR-14172
> URL: https://issues.apache.org/jira/browse/SOLR-14172
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 8.3.1
>Reporter: Andras Salamon
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14172.patch, SOLR-14172.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I try to create a collection and request too many shards, collection 
> creation fails with the expected error message:
> {noformat}
> $ curl -i --retry 5 -s -L -k --negotiate -u : 
> 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1'
> HTTP/1.1 400 Bad Request
> Content-Type: application/json;charset=utf-8
> Content-Length: 1562
> {
>   "responseHeader":{
> "status":400,
> "QTime":122},
>   "Operation create caused 
> exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, 
> and the number of nodes currently live or live and part of your createNodeSet 
> is 3. This allows a maximum of 3 to be created. Value of numShards is 4, 
> value of nrtReplicas is 1, value of tlogReplicas is 0 and value of 
> pullReplicas is 0. This requires 4 shards to be created (higher than the 
> allowed number)",
>   "exception":{
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "rspCode":400},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "code":400}}
> {noformat}
> Although the collection creation was not successful if I list the collections 
> it shows the new collection:
> {noformat}
> $ solr  collection --list                                        
> TooManyShardstest1 (1) 
>  {noformat}
> Looks like metadata remains in Zookeeper:
> {noformat}
> $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr 
> -cmd ls /collections
> INFO  - 2020-01-06 04:54:01.851; 
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
> to ZooKeeper
> INFO  - 2020-01-06 04:54:01.880; 
> org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
> INFO  - 2020-01-06 04:54:01.881; 
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to 
> ZooKeeper
> /collections (1)
>  /collections/TooManyShardstest1 (1)
>  DATA:
>  {"configName":"zk_init_too"}
>   /collections/TooManyShardstest1/state.json (0)
>   DATA:
>   {"TooManyShardstest1":{
>   "pullReplicas":"0",
>   "replicationFactor":"1",
>   "router":{"name":"compositeId"},
>   "maxShardsPerNode":"1",
>   "autoAddReplicas":"false",
>   "nrtReplicas":"1",
>   "tlogReplicas":"0",
>   "shards":{
> "shard1":{
>   "range":"8000-bfff",
>   "state":"active",
>   "replicas":{}},
> "shard2":{
>   "range":"c000-",
>   "state":"active",
>   "replicas":{}},
> "shard3":{
>   "range":"0-3fff",
>   "state":"active",
>   "replicas":{}},
> "shard4":{
>   "range":"4000-7fff",
>   "state":"active",
>   "replicas":{}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021890#comment-17021890
 ] 

Shalin Shekhar Mangar commented on SOLR-14172:
--

I attached a new patch which makes the test fail with a message in case the 
collection creation request unexpectedly succeeds.

> Collection metadata remains in zookeeper if too many shards requested
> -
>
> Key: SOLR-14172
> URL: https://issues.apache.org/jira/browse/SOLR-14172
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 8.3.1
>Reporter: Andras Salamon
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14172.patch, SOLR-14172.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try to create a collection and request too many shards, collection 
> creation fails with the expected error message:
> {noformat}
> $ curl -i --retry 5 -s -L -k --negotiate -u : 
> 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1'
> HTTP/1.1 400 Bad Request
> Content-Type: application/json;charset=utf-8
> Content-Length: 1562
> {
>   "responseHeader":{
> "status":400,
> "QTime":122},
>   "Operation create caused 
> exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, 
> and the number of nodes currently live or live and part of your createNodeSet 
> is 3. This allows a maximum of 3 to be created. Value of numShards is 4, 
> value of nrtReplicas is 1, value of tlogReplicas is 0 and value of 
> pullReplicas is 0. This requires 4 shards to be created (higher than the 
> allowed number)",
>   "exception":{
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "rspCode":400},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "code":400}}
> {noformat}
> Although the collection creation was not successful if I list the collections 
> it shows the new collection:
> {noformat}
> $ solr  collection --list                                        
> TooManyShardstest1 (1) 
>  {noformat}
> Looks like metadata remains in Zookeeper:
> {noformat}
> $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr 
> -cmd ls /collections
> INFO  - 2020-01-06 04:54:01.851; 
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
> to ZooKeeper
> INFO  - 2020-01-06 04:54:01.880; 
> org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
> INFO  - 2020-01-06 04:54:01.881; 
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to 
> ZooKeeper
> /collections (1)
>  /collections/TooManyShardstest1 (1)
>  DATA:
>  {"configName":"zk_init_too"}
>   /collections/TooManyShardstest1/state.json (0)
>   DATA:
>   {"TooManyShardstest1":{
>   "pullReplicas":"0",
>   "replicationFactor":"1",
>   "router":{"name":"compositeId"},
>   "maxShardsPerNode":"1",
>   "autoAddReplicas":"false",
>   "nrtReplicas":"1",
>   "tlogReplicas":"0",
>   "shards":{
> "shard1":{
>   "range":"8000-bfff",
>   "state":"active",
>   "replicas":{}},
> "shard2":{
>   "range":"c000-",
>   "state":"active",
>   "replicas":{}},
> "shard3":{
>   "range":"0-3fff",
>   "state":"active",
>   "replicas":{}},
> "shard4":{
>   "range":"4000-7fff",
>   "state":"active",
>   "replicas":{}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14172:
-
Attachment: SOLR-14172.patch

> Collection metadata remains in zookeeper if too many shards requested
> -
>
> Key: SOLR-14172
> URL: https://issues.apache.org/jira/browse/SOLR-14172
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 8.3.1
>Reporter: Andras Salamon
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14172.patch, SOLR-14172.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try to create a collection and request too many shards, collection 
> creation fails with the expected error message:
> {noformat}
> $ curl -i --retry 5 -s -L -k --negotiate -u : 
> 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1'
> HTTP/1.1 400 Bad Request
> Content-Type: application/json;charset=utf-8
> Content-Length: 1562
> {
>   "responseHeader":{
> "status":400,
> "QTime":122},
>   "Operation create caused 
> exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, 
> and the number of nodes currently live or live and part of your createNodeSet 
> is 3. This allows a maximum of 3 to be created. Value of numShards is 4, 
> value of nrtReplicas is 1, value of tlogReplicas is 0 and value of 
> pullReplicas is 0. This requires 4 shards to be created (higher than the 
> allowed number)",
>   "exception":{
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "rspCode":400},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "code":400}}
> {noformat}
> Although the collection creation was not successful if I list the collections 
> it shows the new collection:
> {noformat}
> $ solr  collection --list                                        
> TooManyShardstest1 (1) 
>  {noformat}
> Looks like metadata remains in Zookeeper:
> {noformat}
> $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr 
> -cmd ls /collections
> INFO  - 2020-01-06 04:54:01.851; 
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
> to ZooKeeper
> INFO  - 2020-01-06 04:54:01.880; 
> org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
> INFO  - 2020-01-06 04:54:01.881; 
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to 
> ZooKeeper
> /collections (1)
>  /collections/TooManyShardstest1 (1)
>  DATA:
>  {"configName":"zk_init_too"}
>   /collections/TooManyShardstest1/state.json (0)
>   DATA:
>   {"TooManyShardstest1":{
>   "pullReplicas":"0",
>   "replicationFactor":"1",
>   "router":{"name":"compositeId"},
>   "maxShardsPerNode":"1",
>   "autoAddReplicas":"false",
>   "nrtReplicas":"1",
>   "tlogReplicas":"0",
>   "shards":{
> "shard1":{
>   "range":"8000-bfff",
>   "state":"active",
>   "replicas":{}},
> "shard2":{
>   "range":"c000-",
>   "state":"active",
>   "replicas":{}},
> "shard3":{
>   "range":"0-3fff",
>   "state":"active",
>   "replicas":{}},
> "shard4":{
>   "range":"4000-7fff",
>   "state":"active",
>   "replicas":{}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14172:
-
   Attachment: SOLR-14172.patch
Fix Version/s: 8.5
   master (9.0)
 Assignee: Shalin Shekhar Mangar
   Status: Open  (was: Open)

This patch incorporates the test added by Andras Salamon in PR #1152 but the 
actual fix is slightly different.

This patch changes the buildReplicaPositions method to throw an 
AssignmentException instead of a SolrException when maxShardsPerNode is 
insufficient. It also changes the Create Collection API to return a BAD_REQUEST 
code instead of SERVER_ERROR in case of an assignment exception. I'll note this 
behavior change in the upgrade notes.
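
A rough sketch of the exception-translation idea described above (class and method 
names here are hypothetical; only the mapping of an assignment failure to a 400 
response, rather than a 500, mirrors the description):

{code}
/** Hypothetical exception type for "cannot place the requested replicas". */
class AssignmentException extends RuntimeException {
  AssignmentException(String msg) {
    super(msg);
  }
}

class CreateCollectionSketch {
  /** Returns the HTTP status code the API would answer with. */
  int createCollection(int requestedReplicas, int maxPlaceableReplicas) {
    try {
      if (requestedReplicas > maxPlaceableReplicas) {
        // Insufficient maxShardsPerNode: signal a client error, not a server error.
        throw new AssignmentException("This requires " + requestedReplicas
            + " replicas to be created (higher than the allowed number)");
      }
      // ... proceed with creating the collection ...
      return 200;
    } catch (AssignmentException e) {
      // Translate to BAD_REQUEST so callers see a 400, and clean up any
      // collection metadata written to ZooKeeper before the failure.
      return 400;
    }
  }
}
{code}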

> Collection metadata remains in zookeeper if too many shards requested
> -
>
> Key: SOLR-14172
> URL: https://issues.apache.org/jira/browse/SOLR-14172
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 8.3.1
>Reporter: Andras Salamon
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14172.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try to create a collection and request too many shards, collection 
> creation fails with the expected error message:
> {noformat}
> $ curl -i --retry 5 -s -L -k --negotiate -u : 
> 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1'
> HTTP/1.1 400 Bad Request
> Content-Type: application/json;charset=utf-8
> Content-Length: 1562
> {
>   "responseHeader":{
> "status":400,
> "QTime":122},
>   "Operation create caused 
> exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>  Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, 
> and the number of nodes currently live or live and part of your createNodeSet 
> is 3. This allows a maximum of 3 to be created. Value of numShards is 4, 
> value of nrtReplicas is 1, value of tlogReplicas is 0 and value of 
> pullReplicas is 0. This requires 4 shards to be created (higher than the 
> allowed number)",
>   "exception":{
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "rspCode":400},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"Cannot create collection TooManyShardstest1. Value of 
> maxShardsPerNode is 1, and the number of nodes currently live or live and 
> part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is 
> 0 and value of pullReplicas is 0. This requires 4 shards to be created 
> (higher than the allowed number)",
> "code":400}}
> {noformat}
> Although the collection creation was not successful if I list the collections 
> it shows the new collection:
> {noformat}
> $ solr  collection --list                                        
> TooManyShardstest1 (1) 
>  {noformat}
> Looks like metadata remains in Zookeeper:
> {noformat}
> $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr 
> -cmd ls /collections
> INFO  - 2020-01-06 04:54:01.851; 
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
> to ZooKeeper
> INFO  - 2020-01-06 04:54:01.880; 
> org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
> INFO  - 2020-01-06 04:54:01.881; 
> org.apache.solr.common.cloud.ConnectionManager; Client is connected to 
> ZooKeeper
> /collections (1)
>  /collections/TooManyShardstest1 (1)
>  DATA:
>  {"configName":"zk_init_too"}
>   /collections/TooManyShardstest1/state.json (0)
>   DATA:
>   {"TooManyShardstest1":{
>   "pullReplicas":"0",
>   "replicationFactor":"1",
>   "router":{"name":"compositeId"},
>   "maxShardsPerNode":"1",
>   "autoAddReplicas":"false",
>   "nrtReplicas":"1",
>   "tlogReplicas":"0",
>   "shards":{
> "shard1":{
>   "range":"8000-bfff",
>   "state":"active",
>   "replicas":{}},
> "shard2":{
>   "range":"c000-",
>   "state":"active",
>   "replicas":{}},
>

[jira] [Resolved] (SOLR-14207) Fix logging statements with less or more arguments than placeholders

2020-01-23 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14207.
--
Resolution: Fixed

> Fix logging statements with less or more arguments than placeholders
> 
>
> Key: SOLR-14207
> URL: https://issues.apache.org/jira/browse/SOLR-14207
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: logging
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14207.patch
>
>
> I found bad logging statements in the solr-exporter which had a different 
> number of arguments than placeholders. I ran an inspection check in IDEA and 
> found many more places with similar problems.
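
For illustration, this is the class of bug being fixed: an SLF4J statement whose 
number of {} placeholders does not match the number of arguments (the messages and 
names below are made up):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PlaceholderMismatchExample {
  private static final Logger log = LoggerFactory.getLogger(PlaceholderMismatchExample.class);

  public static void main(String[] args) {
    String collection = "collection1";
    int numShards = 4;

    // Broken: two placeholders but only one argument; the second "{}" is printed literally.
    log.info("Creating collection {} with {} shards", collection);

    // Fixed: the number of "{}" placeholders matches the number of arguments.
    log.info("Creating collection {} with {} shards", collection, numShards);
  }
}
{code}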



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14208) Reproducible test failure on TestBulkSchemaConcurrent

2020-01-23 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14208:


 Summary: Reproducible test failure on TestBulkSchemaConcurrent
 Key: SOLR-14208
 URL: https://issues.apache.org/jira/browse/SOLR-14208
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Shalin Shekhar Mangar


I found the following test failure on master branch while running tests on 
SOLR-14207. The test failure is reproducible without the SOLR-14207 patch.

{code}
ant test  -Dtestcase=TestBulkSchemaConcurrent -Dtests.method=test 
-Dtests.seed=AE6DC9DB591DAB9E -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=hi-IN -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{code}

The logs are full of the following warning repeated over and over:
{code}
[junit4]   2> 32396 WARN  (qtp1791658098-110) [n:127.0.0.1:46453_rx_%2Fr 
c:collection1 s:shard2 r:core_node8 x:collection1_shard2_replica_n5 ] 
o.a.s.s.SchemaManager Unable to retrieve fresh managed schema managed-schema
   [junit4]   2>   => java.lang.IllegalArgumentException: Path must 
start with / character
   [junit4]   2>at 
org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51)
   [junit4]   2> java.lang.IllegalArgumentException: Path must start with / 
character
   [junit4]   2>at 
org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) 
~[zookeeper-3.5.5.jar:3.5.5]
   [junit4]   2>at 
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2000) 
~[zookeeper-3.5.5.jar:3.5.5]
   [junit4]   2>at 
org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:314)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:314) 
~[java/:?]
   [junit4]   2>at 
org.apache.solr.schema.SchemaManager.getFreshManagedSchema(SchemaManager.java:427)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.schema.SchemaManager.doOperations(SchemaManager.java:107) 
~[java/:?]
   [junit4]   2>at 
org.apache.solr.schema.SchemaManager.performOperations(SchemaManager.java:92) 
~[java/:?]
   [junit4]   2>at 
org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:90) 
~[java/:?]
   [junit4]   2>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:208)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2582) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
 ~[java/:?]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14207) Fix logging statements with less or more arguments than placeholders

2020-01-22 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-14207:
-
Attachment: SOLR-14207.patch

> Fix logging statements with less or more arguments than placeholders
> 
>
> Key: SOLR-14207
> URL: https://issues.apache.org/jira/browse/SOLR-14207
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: logging
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14207.patch
>
>
> I found bad logging statements in the solr-exporter which had a different 
> number of arguments than placeholders. I ran an inspection check in IDEA and 
> found many more places with similar problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14207) Fix logging statements with less or more arguments than placeholders

2020-01-22 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14207:


 Summary: Fix logging statements with less or more arguments than 
placeholders
 Key: SOLR-14207
 URL: https://issues.apache.org/jira/browse/SOLR-14207
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: logging
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
 Fix For: master (9.0), 8.5


I found bad logging statements in the solr-exporter which had a different number 
of arguments than placeholders. I ran an inspection check in IDEA and found many 
more places with similar problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14191) Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456

2020-01-15 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-14191.
--
Resolution: Invalid

Never mind, false alarm. I did in fact incorporate fixes made by SOLR-11456 in 
SOLR-11126 when I committed the code. It's just that the last patch attached on 
SOLR-11126 did not have those fixes.

Sorry for the noise.

> Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by 
> SOLR-11456 
> -
>
> Key: SOLR-14191
> URL: https://issues.apache.org/jira/browse/SOLR-14191
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Affects Versions: 8.0
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: 8.5
>
>
> Chasing a test failure while backporting SOLR-11126 to branch 7x, I 
> discovered that all the test fixes Hoss made in SOLR-11456 were lost 
> when SOLR-11126 was committed, even though Hoss had commented on the issue to 
> remind us about them.
> This issue will restore those lost fixes on the master and 8x branches.






[jira] [Created] (SOLR-14191) Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456

2020-01-15 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-14191:


 Summary: Restore fixes for 
HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456 
 Key: SOLR-14191
 URL: https://issues.apache.org/jira/browse/SOLR-14191
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Affects Versions: 8.0
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
 Fix For: 8.5


Chasing a test failure while backporting SOLR-11126 to branch 7x, I discovered 
that all of the test fixes Hoss made in SOLR-11456 were lost when SOLR-11126 was 
committed, even though Hoss had commented on the issue to remind us about them.

This issue will restore those lost fixes on the master and 8x branches.






[jira] [Assigned] (SOLR-13845) DELETEREPLICA API by "count" and "type"

2020-01-13 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-13845:


Assignee: Shalin Shekhar Mangar

> DELETEREPLICA API by "count" and "type"
> ---
>
> Key: SOLR-13845
> URL: https://issues.apache.org/jira/browse/SOLR-13845
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Amrit Sarkar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Attachments: SOLR-13845.patch
>
>
> SOLR-9319 added support for deleting replicas by count. It would be great to 
> extend that feature so we can also specify the type of replica to delete, just 
> as we can add replicas by count and type.
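
As a rough sketch of what a combined request could look like (the "count" 
parameter exists today via SOLR-9319; the "type" parameter is the proposal here, 
and the host, collection, and shard names are placeholders):

{code:java}
public class DeleteReplicaByCountAndType {
  public static void main(String[] args) {
    // Hypothetical Collections API call: "count" is an existing parameter,
    // "type" is the additional parameter proposed in this issue.
    String url = "http://localhost:8983/solr/admin/collections"
        + "?action=DELETEREPLICA"
        + "&collection=myCollection"
        + "&shard=shard1"
        + "&count=1"        // delete one replica from the shard
        + "&type=PULL";     // proposed: only delete replicas of this type
    System.out.println(url);
  }
}
{code}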






[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests

2019-12-03 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987539#comment-16987539
 ] 

Shalin Shekhar Mangar commented on SOLR-13979:
--

Yes, I have used that method in the past but this is a common use-case and it 
should not be necessary to resort to such clever solutions.

> Expose separate metrics for distributed and non-distributed requests
> 
>
> Key: SOLR-13979
> URL: https://issues.apache.org/jira/browse/SOLR-13979
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
>
> Currently we expose metrics such as count, rate, and latency at a per-handler 
> level; however, for search requests no distinction is made between distrib and 
> non-distrib requests. This means that there is no way to find the count, rate, 
> or latency of only user-sent queries.
> I propose that we expose distrib vs non-distrib metrics separately.






[jira] [Assigned] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-13897:


Assignee: Shalin Shekhar Mangar

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985479#comment-16985479
 ] 

Shalin Shekhar Mangar commented on SOLR-13897:
--

This patch adds registerTerm inside ZkCollectionTerms so that it is called after 
synchronizing on the same terms object as the one used for removal. I couldn't 
quite construct a condition where both could happen concurrently, but it makes me 
sleep better knowing that they absolutely cannot.
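
A simplified sketch of the locking scheme described above; the class, field, and 
method names are illustrative only, and the real ZkCollectionTerms does more than 
this:

{code:java}
import java.util.HashMap;
import java.util.Map;

class CollectionTermsSketch {
  private final Map<String, Object> shardTerms = new HashMap<>();

  void register(String shard, Object terms) {
    synchronized (shardTerms) {          // same monitor as remove()
      shardTerms.putIfAbsent(shard, terms);
    }
  }

  void remove(String shard) {
    synchronized (shardTerms) {          // register and remove can never interleave
      shardTerms.remove(shard);
    }
  }
}
{code}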

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Status: Patch Available  (was: Open)

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Attachment: SOLR-13897.patch

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Created] (SOLR-13989) Move all hadoop related code to a contrib module

2019-11-30 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-13989:


 Summary: Move all hadoop related code to a contrib module
 Key: SOLR-13989
 URL: https://issues.apache.org/jira/browse/SOLR-13989
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Hadoop Integration
Reporter: Shalin Shekhar Mangar
 Fix For: master (9.0)


Spin off from SOLR-13986:

{quote}
It seems really important to move or remove this hadoop shit out of the solr 
core: It is really unreasonable that solr core depends on hadoop. that's gonna 
simply block any progress improving its security, because solr code will get 
dragged down by hadoop's code.
{quote}

We should move all Hadoop-related dependencies to a separate contrib module.






[jira] [Commented] (SOLR-13986) remove "execute" permission from solr-tests.policy

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985477#comment-16985477
 ] 

Shalin Shekhar Mangar commented on SOLR-13986:
--

bq. Unrelated to these specific problems, It seems really important to move or 
remove this hadoop shit out of the solr core: It is really unreasonable that 
solr core depends on hadoop. that's gonna simply block any progress improving 
its security, because solr code will get dragged down by hadoop's code.

I agree that Hadoop-specific code should live in a contrib module. I'll open an 
issue to do that.

> remove "execute" permission from solr-tests.policy
> --
>
> Key: SOLR-13986
> URL: https://issues.apache.org/jira/browse/SOLR-13986
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Robert Muir
>Priority: Major
> Attachments: SOLR-13986-notyet.patch, SOLR-13986.patch, 
> SOLR-13986.patch
>
>
> If we don't really need to execute processes, we can take the permission 
> away. That way any attempt to execute something results in a 
> SecurityException rather than running a process.
> It is necessary to first fix the tests policy before thinking about 
> supporting securitymanager in solr. This way we can ensure functionality does 
> not break via our tests.
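
To make the intended effect concrete, here is a small sketch (not part of the 
patch) of what happens once the execute permission is gone: with a 
SecurityManager installed and a policy that grants no FilePermission with the 
"execute" action, forking a process fails with a SecurityException.

{code:java}
public class ExecDeniedSketch {
  public static void main(String[] args) throws Exception {
    // Run with something like:
    //   java -Djava.security.manager -Djava.security.policy=restricted.policy ExecDeniedSketch
    // where restricted.policy (an illustrative name) grants no "execute" FilePermission.
    try {
      Runtime.getRuntime().exec("ls");   // checked via SecurityManager.checkExec
      System.out.println("process was executed");
    } catch (SecurityException e) {
      System.out.println("execute permission denied: " + e);
    }
  }
}
{code}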






[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Attachment: SOLR-13897.patch

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-30 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985459#comment-16985459
 ] 

Shalin Shekhar Mangar commented on SOLR-13897:
--

The onTermUpdates callback might receive updates out of order (i.e. monotonically 
increasing term versions are not guaranteed inside onTermUpdates), but this is 
not a problem in the default RecoveringCoreTermWatcher implementation because it 
tracks the last term that triggered recovery and returns early if that term is 
greater than (or equal to) the current term. This patch adds javadocs to the 
CoreTermWatcher interface and calls out this invocation behavior.
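
A minimal sketch of that guard, assuming a hypothetical watcher class (the real 
RecoveringCoreTermWatcher has more responsibilities):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

class LastTermGuardSketch {
  private final AtomicLong lastTermDoneRecovery = new AtomicLong(-1);

  /** Returns true only if this (possibly out-of-order) update should trigger recovery. */
  boolean onTermUpdate(long currentTerm) {
    long last = lastTermDoneRecovery.get();
    if (last >= currentTerm) {
      return false;                      // stale or duplicate notification, ignore it
    }
    return lastTermDoneRecovery.compareAndSet(last, currentTerm);
  }
}
{code}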

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch, SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-29 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985256#comment-16985256
 ] 

Shalin Shekhar Mangar commented on SOLR-13897:
--

Here's a patch that changes the Terms field to an AtomicReference. However, I am 
not convinced that it is fully correct: it seems there can be race conditions 
between registerTerm and removeTerm, and onTermUpdates might also receive updates 
out of order (i.e. monotonically increasing term versions are not guaranteed 
inside onTermUpdates).
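
For readers less familiar with the pattern, a simplified sketch of the change 
(the real ZkShardTerms also coordinates with ZooKeeper and is not this small):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

class ShardTermsHolderSketch {
  // Before: a plain field written under a write lock but read without any lock,
  // so readers are not guaranteed to see the latest value.
  // private Terms terms;

  // After: an AtomicReference publishes every new value safely to all readers.
  private final AtomicReference<Object> terms = new AtomicReference<>();

  Object getTerms() {
    return terms.get();
  }

  void setNewTerms(Object newTerms) {
    terms.set(newTerms);
  }
}
{code}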

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms

2019-11-29 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-13897:
-
Attachment: SOLR-13897.patch

> Unsafe publication of Terms object in ZkShardTerms
> --
>
> Key: SOLR-13897
> URL: https://issues.apache.org/jira/browse/SOLR-13897
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 8.2, 8.3
>Reporter: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13897.patch
>
>
> The Terms object in ZkShardTerms is written using a write lock but reading is 
> allowed freely. This is not safe and can cause visibility issues and 
> associated race conditions under contention.






[jira] [Resolved] (SOLR-13805) Solr - generates an NPE when calling /solr/admin/health on standalone solr

2019-11-29 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-13805.
--
Fix Version/s: 8.4
   master (9.0)
   Resolution: Fixed

Thanks Nicholas!

> Solr - generates an NPE when calling /solr/admin/health on standalone solr
> --
>
> Key: SOLR-13805
> URL: https://issues.apache.org/jira/browse/SOLR-13805
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud, SolrJ
>Affects Versions: 8.1, 8.2
>Reporter: Nicholas DiPiazza
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.4
>
>
> steps to reproduce:
> unzip solr and run 
> {code}
> ./bin/solr start
> {code}
> Then nav to: 
> http://localhost:8983/solr/admin/health
> Result will be an NPE: 
> {code}
> {
>   "responseHeader":{
> "status":500,
> "QTime":20},
>   "error":{
> "trace":"java.lang.NullPointerException\n\tat 
> org.apache.solr.handler.admin.HealthCheckHandler.handleRequestBody(HealthCheckHandler.java:68)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
>  
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
>  org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
>  
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
>  org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat 
> org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat
>  java.lang.Thread.run(Thread.java:748)\n",
> "code":500}}
> {code}






[jira] [Created] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests

2019-11-28 Thread Shalin Shekhar Mangar (Jira)
Shalin Shekhar Mangar created SOLR-13979:


 Summary: Expose separate metrics for distributed and 
non-distributed requests
 Key: SOLR-13979
 URL: https://issues.apache.org/jira/browse/SOLR-13979
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: metrics
Reporter: Shalin Shekhar Mangar
 Fix For: master (9.0), 8.4


Currently we expose metrics such as count, rate, and latency at a per-handler 
level; however, for search requests no distinction is made between distrib and 
non-distrib requests. This means that there is no way to find the count, rate, or 
latency of only user-sent queries.

I propose that we expose distrib vs non-distrib metrics separately.
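
A rough sketch of the separation being proposed, using the Dropwizard metrics 
library that Solr's metrics are built on; the metric names and the way the 
distrib flag is obtained are illustrative, not the actual implementation:

{code:java}
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

class RequestTimersSketch {
  private final Timer distribTimer;
  private final Timer localTimer;

  RequestTimersSketch(MetricRegistry registry) {
    distribTimer = registry.timer("QUERY./select.distrib.requestTimes");
    localTimer = registry.timer("QUERY./select.local.requestTimes");
  }

  void timeRequest(boolean isDistrib, Runnable handler) {
    Timer.Context ctx = (isDistrib ? distribTimer : localTimer).time();
    try {
      handler.run();                     // handle the request
    } finally {
      ctx.stop();                        // record latency in the matching timer
    }
  }
}
{code}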






[jira] [Comment Edited] (SOLR-13945) SPLITSHARD data loss due to "rollback"

2019-11-24 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16981287#comment-16981287
 ] 

Shalin Shekhar Mangar edited comment on SOLR-13945 at 11/25/19 4:29 AM:


[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents 
are visible when the sub-shard replicas come up. -It is not necessary if there 
is a single replica.- (note it is necessary to call this commit regardless of 
the replication factor)


was (Author: shalinmangar):
[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents 
are visible when the sub-shard replicas come up. It is not necessary if there 
is a single replica.

> SPLITSHARD data loss due to "rollback"
> --
>
> Key: SOLR-13945
> URL: https://issues.apache.org/jira/browse/SOLR-13945
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
>Priority: Major
> Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state 
> changes* have happened, i.e. from active/construction/construction to 
> inactive/active/active. Please see 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called 
> "cleanupAfterFailure" in the finally block that resets the state to 
> active/construction/construction. Please see: 
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 is entered due to a failure in step 1, we have a situation where 
> any documents that went into the subshards (because they are already active by 
> now) are lost after the parent becomes active again.
> If my above understanding is correct, I am wondering:
> # Why is a commit to the parent shard needed *after* the parent shard is 
> inactive, the subshards are active, and the split operation has completed?
> # This rollback looks very suspicious. If the state of the subshards is already 
> active and the parent is inactive, then what is the need for setting them back 
> to construction? It seems like a crucial check is missing there. Also, why do we 
> reset the subshard status back to construction instead of inactive? It is 
> extremely misleading (and, frankly, ridiculous) for any external clusterstate 
> monitoring tool to see the subshards go from CONSTRUCTION to ACTIVE to 
> CONSTRUCTION and then see the subshard disappear.





