[jira] [Resolved] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-14942.
------------------------------------------
    Resolution: Fixed

Thanks, David, for the feedback, and to you and Dat for the review.

> Reduce leader election time on node shutdown
> --------------------------------------------
>
>                 Key: SOLR-14942
>                 URL: https://issues.apache.org/jira/browse/SOLR-14942
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 7.7.3, 8.6.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>             Fix For: 8.8, master (9.0)
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The credit for this issue and the investigation belongs to [~caomanhdat]. I am merely reporting the issue and creating PRs based on his work.
>
> The shutdown process waits for all replicas/cores to be closed before removing the leader's election node. This can take some time due to index flush or merge activity on the leader cores, and it delays new leaders from being elected.
>
> This process happens in CoreContainer.shutdown():
> # zkController.preClose(): removes the current node from live_nodes and changes the states of all cores on this node to DOWN. Assuming the current node hosts the leader of a shard, the shard becomes leaderless after this method is called, since the leader's state is now DOWN. The leader election process is not triggered for the shard because the election node is still held by the current node.
> # Wait for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK, including election nodes created by this node. Only from this point on can other replicas in the shard take part in leader election.
>
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive a SIGTERM signal. On receiving SIGTERM, Jetty also stops accepting new connections and new requests. This is a very important factor: even if the leader replica is ACTIVE and its node is in live_nodes, the shard is effectively leaderless if no one can index to it. Therefore, shards become leaderless as soon as the node hosting the shard's leader receives SIGTERM.
>
> Consequently, the longer steps 1, 2, and 3 take to finish, the longer shards remain leaderless. The time needed for step 3 scales with the number of cores, so the more cores a node has, the worse it gets. This time is spent in IndexWriter.close(), where the system will:
> # Flush all pending updates to disk
> # Wait for all merges to finish (most likely the meaty part)
>
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to complete
> # Remove election nodes
> # Close all replicas/cores
>
> This ensures that index flushes or merges no longer block new leader elections.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
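The proposed shutdown ordering can be sketched in a few lines of Java. This is a minimal, illustrative sketch using java.util.concurrent.Phaser, not Solr's actual implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Phaser;

// Illustrative sketch of the proposed shutdown ordering: wait for in-flight
// updates, then give up leader election, then close cores. NOT Solr's actual
// code; class and method names are hypothetical.
public class ShutdownSketch {
    // Party 0 of the phaser is the shutdown thread; each in-flight update
    // registers as an extra party while it runs.
    private final Phaser inflightUpdates = new Phaser(1);
    private final List<String> order = new ArrayList<>();

    // An update request registers on arrival and deregisters when done.
    public void handleUpdate(Runnable work) {
        inflightUpdates.register();
        try {
            work.run();
        } finally {
            inflightUpdates.arriveAndDeregister();
        }
    }

    public List<String> shutdown() {
        // Step 1: wait until every registered in-flight update has finished.
        inflightUpdates.arriveAndAwaitAdvance();
        order.add("await-inflight");
        // Step 2: remove election znodes so other replicas can elect a new
        // leader now, before any slow index flushes or merges happen.
        order.add("cancel-elections");
        // Step 3: close cores; IndexWriter.close() can flush and wait for
        // merges without keeping the shard leaderless.
        order.add("close-cores");
        return order;
    }

    public static void main(String[] args) {
        ShutdownSketch s = new ShutdownSketch();
        s.handleUpdate(() -> {});          // a completed in-flight update
        System.out.println(s.shutdown()); // election znodes go before core close
    }
}
```

The key design point is that the election znodes are released before the potentially slow core close, which inverts the old ordering.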
[jira] [Reopened] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reopened SOLR-14942:
------------------------------------------

Reopening to address review feedback from David. Thanks, David, for the feedback. I think this will work. Please review the PR at https://github.com/apache/lucene-solr/pull/2112
[jira] [Assigned] (SOLR-6399) Implement unloadCollection in the Collections API
[ https://issues.apache.org/jira/browse/SOLR-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reassigned SOLR-6399:
-------------------------------------------
    Assignee:     (was: Shalin Shekhar Mangar)

> Implement unloadCollection in the Collections API
> -------------------------------------------------
>
>                 Key: SOLR-6399
>                 URL: https://issues.apache.org/jira/browse/SOLR-6399
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: dfdeshom
>            Priority: Major
>             Fix For: 6.0
>
> There is currently no way to unload a collection without deleting its contents. There should be a way in the collections API to unload a collection and reload it later, as needed.
>
> A use case for this is the following: you store logs by day, with each day having its own collection. You are required to store up to 2 years of data, which adds up to 730 collections. Most of the time, you'll want to have 3 days of data loaded for search. Having just 3 collections loaded into memory, instead of 730, will make managing Solr easier.
[jira] [Updated] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider
[ https://issues.apache.org/jira/browse/SOLR-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-14985:
-----------------------------------------
    Description:

HttpClusterStateProvider fetches and caches Aliases and Live Nodes for 5 seconds.

The BaseSolrCloudClient caches DocCollection for 60 seconds, but only if the DocCollection is not lazy; all collections returned by HttpClusterStateProvider are lazy, which means they are never cached.

The BaseSolrCloudClient has a method for resolving aliases which fetches the DocCollection for each input collection. This is an HTTP call with no caching when using HttpClusterStateProvider. This resolveAliases method is called twice for each update.

So overall, at least 3 HTTP calls are made to fetch cluster state for each update request when using HttpClusterStateProvider. There may be more if aliases are involved or if more than one collection is specified in the request. Similar problems exist on the query path as well.

Due to these reasons, using HttpClusterStateProvider causes horrible latencies and throughput for update and search requests.

    was: The same description, except the final sentence read "... causes horrible latencies and throughput for update requests."

> Slow indexing and search performance when using HttpClusterStateProvider
> ------------------------------------------------------------------------
>
>                 Key: SOLR-14985
>                 URL: https://issues.apache.org/jira/browse/SOLR-14985
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrJ
>            Reporter: Shalin Shekhar Mangar
>            Priority: Major
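The missing caching can be illustrated with a small sketch: a generic TTL cache of the kind that could hold per-collection DocCollection state so that repeated alias resolution does not trigger an HTTP fetch every time. All names below are illustrative; this is not SolrJ's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// A minimal TTL cache sketch -- the kind of per-collection caching the
// description says is missing when HttpClusterStateProvider is used.
// Class and method names are illustrative, not Solr's actual API.
public class StateCache<V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtNanos;
        Entry(V value, long expiresAtNanos) {
            this.value = value;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final Map<String, Entry<V>> cache = new ConcurrentHashMap<>();
    private final long ttlNanos;

    public StateCache(long ttlMillis) {
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    // Return the cached state for a collection; invoke the (expensive)
    // fetch only on a miss or after the TTL has expired.
    public V get(String collection, Supplier<V> fetch) {
        long now = System.nanoTime();
        Entry<V> e = cache.get(collection);
        if (e == null || now >= e.expiresAtNanos) {
            e = new Entry<>(fetch.get(), now + ttlNanos);
            cache.put(collection, e);
        }
        return e.value;
    }
}
```

With a 60-second TTL, the two resolveAliases-style lookups per update for the same collection would hit the cache instead of issuing separate HTTP calls.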
[jira] [Commented] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider
[ https://issues.apache.org/jira/browse/SOLR-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226782#comment-17226782 ]

Shalin Shekhar Mangar commented on SOLR-14985:
----------------------------------------------

Linking to SOLR-14966 and SOLR-14967
[jira] [Created] (SOLR-14985) Slow indexing and search performance when using HttpClusterStateProvider
Shalin Shekhar Mangar created SOLR-14985:
--------------------------------------------

             Summary: Slow indexing and search performance when using HttpClusterStateProvider
                 Key: SOLR-14985
                 URL: https://issues.apache.org/jira/browse/SOLR-14985
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrJ
            Reporter: Shalin Shekhar Mangar
[jira] [Updated] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-14942:
-----------------------------------------
    Fix Version/s:     (was: 8.7)
                   8.8
[jira] [Resolved] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-14942.
------------------------------------------
    Fix Version/s: 8.7
                   master (9.0)
       Resolution: Fixed

Thanks Dat, Hoss and Mike!
[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219733#comment-17219733 ]

Shalin Shekhar Mangar commented on SOLR-14942:
----------------------------------------------

Thanks Hoss. I have updated the PR with code comments. Mike Drob also gave some feedback on the PR, which has been incorporated as well. I intend to merge to master over the weekend.
[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218973#comment-17218973 ]

Shalin Shekhar Mangar commented on SOLR-14942:
----------------------------------------------

{code}
final boolean requestIsImportant = handler.isRequestImportantEnoughThatItShouldDelayShutdown(solrReq);
if (requestIsImportant && !core.getSolrCoreState().registerInFlightUpdate()) {
{code}

Firstly, the goal is not to delay shutdown but to (slightly) delay leader election so that in-flight update requests succeed and we can preserve consistency. Jetty already allows a grace period for in-flight requests to complete, and our Solr cores, searchers, etc. are reference counted to allow for graceful shutdown.

Secondly, if a request handler chooses to say that its requests are important enough to delay leader election, how do we decide the right timeouts in the pauseUpdatesAndAwaitInflightRequests() method? For update requests we can make some reasonable assumptions, but it is hard to do that in general. That's why I don't think it makes sense to generalize this part, even though I agree that the instanceof check is hackish. So unless you or someone else feels very strongly about this, I'd like to keep this check as-is.
[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown
[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217619#comment-17217619 ] Shalin Shekhar Mangar commented on SOLR-14942: -- bq. every method should have javadocs, especially public ones I added javadocs for ZkController.tryCancellAllElections bq. ZkController.exlectionContexts is a syncrhonized map – but the new code that streams over it's `values()` doesn't synchronize on it which smells like a bug? Yes, thank you! ZkController.close was doing the same thing so I have fixed that as well. bq. if canceling & closing elections is a "slow" enough operation that it makes sense to parallelize them, then does it make sense to also check zkClient.isClosed() inside the loop, in case the client gets closed out from under us? (it's a cheap call, so i don't see any advantage to only checking once) It is not a slow operation. ZkController.close was also using a parallelStream on electionContexts so this was basically copied code. But it doesn't make sense. As I noted on a comment in the PR, it deletes a znode and sets a volatile member so I have replaced parallelStream to a serial forEach. bq. are there other "inflight" cases where it's risky to shutdown in the middle of? replication? peersync? core admin? When the SolrCoreState.pauseUpdatesAndAwaitInflightRequests() method is executed, jetty has already received a SIGTERM so it will not allow any new connections/request. Let's talk about ongoing requests: # All ongoing recoveries/replication (for cores on current node) have stopped (ZkController.preClose is called before the pauseUpdatesAndAwaitInflightRequests method) # The election node has not been removed so peersyncs for leader election haven't started. 
(tryCancellAllElections happens after pauseUpdatesAndAwaitInflightRequests method) # If another replica is recovering from this leader and a peersync is in-flight, even if we let it complete, subsequent replication requests will fail # As for core admin requests: ## create, unload and reload are not useful (node is shutting down) ## split shard will eventually fail because it is a multi-step process ## requestrecovery and requestsync are not useful either. After node comes back online, all cores will recover again. ## backups and restore operations -- I don't think these should block a shutdown operation bq. instead of hardcoding this instanceof check in HttpSolrCall would it make more sense to add a new 'default' method to SolrRequestHandler that UpdateRequestHandler (and potentially other handlers) could override to let them inspect the request and return true/false if it's "important" enough that it must be allowed to block shutdown until complete? bq. this would also make it easier to bring back the "only block updates if i'm the leader" type logic (living inside thee UpdateRequestHandler impl of the new method) at a later date if viable – w/o more API changes I thought about that optimization (only block updates if i'm the leader) but it leads to too many race conditions when leadership is gained and lost. The problem is that we must ensure that all registered parties to the phaser eventually arrive and if we lose track then it can lead to IllegalStateExceptions from the phaser down the line (and even that is best effort). That is why I think it is safer to do this inside HttpSolrCall instead of giving this choice to plugin writers. > Reduce leader election time on node shutdown > > > Key: SOLR-14942 > URL: https://issues.apache.org/jira/browse/SOLR-14942 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. 
Issues are Public) > Components: SolrCloud >Affects Versions: 7.7.3, 8.6.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > The credit for this issue and investigation belongs to [~caomanhdat]. I am > merely reporting the issue and creating PRs based on his work. > The shutdown process waits for all replicas/cores to be closed before > removing the election node of the leader. This can take some time due to > index flush or merge activities on the leader cores and delays new leaders > from being elected. > This process happens at CoreContainer.shutdown(): > # zkController.preClose(): remove current node from live_node and change > states of all cores in this node to DOWN state. Assuming that the current > node hosting a leader of a shard, the shard becomes leaderless after calling > this method, since the state of the leader is DOWN now. The leader election > process is not triggered for the shard since the election node is still > on-hold by the current node. > # Waiting for all cores to be loaded
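The iteration fix discussed in the comment above — holding the monitor of a Collections.synchronizedMap while streaming over its values, and cancelling serially rather than with a parallelStream — can be sketched as follows. ElectionContext and the cancel logic here are simplified stand-ins, not Solr's actual classes:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ElectionCancelSketch {
    // Simplified stand-in for Solr's election context (hypothetical class).
    static class ElectionContext {
        volatile boolean cancelled = false;
        void cancelElection() { cancelled = true; } // in real Solr this deletes the election znode
    }

    // Collections.synchronizedMap only guards individual calls; iteration
    // (including streams over values()) must hold the map's own monitor.
    static final Map<String, ElectionContext> electionContexts =
            Collections.synchronizedMap(new HashMap<>());

    static void tryCancelAllElections() {
        synchronized (electionContexts) {              // required for safe iteration
            // Serial forEach: cancelling just deletes a znode and sets a
            // volatile flag, so parallelizing buys nothing.
            electionContexts.values().forEach(ElectionContext::cancelElection);
        }
    }

    public static void main(String[] args) {
        electionContexts.put("shard1", new ElectionContext());
        electionContexts.put("shard2", new ElectionContext());
        tryCancelAllElections();
        boolean all = true;
        synchronized (electionContexts) {
            for (ElectionContext ctx : electionContexts.values()) all &= ctx.cancelled;
        }
        System.out.println("all cancelled: " + all);
    }
}
```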
[jira] [Created] (SOLR-14942) Reduce leader election time on node shutdown
Shalin Shekhar Mangar created SOLR-14942: Summary: Reduce leader election time on node shutdown Key: SOLR-14942 URL: https://issues.apache.org/jira/browse/SOLR-14942 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: SolrCloud Affects Versions: 8.6.3, 7.7.3 Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar The credit for this issue and investigation belongs to [~caomanhdat]. I am merely reporting the issue and creating PRs based on his work. The shutdown process waits for all replicas/cores to be closed before removing the election node of the leader. This can take some time due to index flush or merge activities on the leader cores and delays new leaders from being elected. This process happens at CoreContainer.shutdown(): # zkController.preClose(): remove the current node from live_nodes and change the states of all cores on this node to DOWN. Assuming the current node hosts the leader of a shard, the shard becomes leaderless after calling this method, since the leader's state is now DOWN. The leader election process is not triggered for the shard since the election node is still held by the current node. # Wait for all cores to be loaded (if there are any). # SolrCores.close(): close all cores. # zkController.close(): this is where all ephemeral nodes, including the election nodes created by this node, are removed from ZK. Therefore other replicas in the shard can take part in the leader election from this point on. Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive a SIGTERM signal. On receiving SIGTERM, Jetty will also stop accepting new connections and new requests. This is a very important factor, since even if the leader replica is ACTIVE and its node is in live_nodes, the shard will be considered leaderless if no one can index to it. Therefore shards become leaderless as soon as the node (which hosts the shard's leader) receives SIGTERM. 
Therefore, the longer steps 1, 2 and 3 take to finish, the longer shards remain leaderless. The time needed for step 3 scales with the number of cores, so the more cores a node has, the worse the delay. This time is spent in IndexWriter.close(), where the system will # Flush all pending updates to disk # Wait for all merges to finish (this most likely is the meaty part) The shutdown process is proposed to be changed to: # Wait for all in-flight indexing requests and replication requests to complete # Remove election nodes # Close all replicas/cores This ensures that index flushes or merges no longer block new leader elections. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
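The first step of the proposed ordering — wait for all in-flight requests before releasing election nodes — needs a way to count in-flight requests and block until they drain; the earlier comments mention Solr does this with a java.util.concurrent.Phaser in SolrCoreState.pauseUpdatesAndAwaitInflightRequests(). A minimal sketch of that idea, with hypothetical class and method names:

```java
import java.util.concurrent.Phaser;

public class InflightTracker {
    // Party 0 is the shutdown "controller"; each in-flight request
    // registers itself as an additional party.
    private final Phaser inflight = new Phaser(1);
    private volatile boolean shuttingDown = false;

    /** Returns false if shutdown has started and the request must be rejected. */
    public boolean beginRequest() {
        if (shuttingDown) return false;
        inflight.register();
        if (shuttingDown) {                  // lost the race with shutdown: undo and reject
            inflight.arriveAndDeregister();
            return false;
        }
        return true;
    }

    /** Every successful beginRequest() must be paired with endRequest(). */
    public void endRequest() {
        inflight.arriveAndDeregister();
    }

    /** Blocks until every registered request has arrived/deregistered. */
    public void pauseAndAwaitInflight() {
        shuttingDown = true;
        inflight.arriveAndAwaitAdvance();    // controller arrives, waits for the rest
    }

    public static void main(String[] args) {
        InflightTracker t = new InflightTracker();
        if (t.beginRequest()) t.endRequest(); // a request completes normally
        t.pauseAndAwaitInflight();            // returns immediately: nothing in flight
        System.out.println("accepts new requests after shutdown: " + t.beginRequest());
    }
}
```

The invariant the comment warns about is visible here: every registered party must eventually arrive, or the phaser throws IllegalStateException later — which is why Solr keeps this bookkeeping in one place (HttpSolrCall) rather than delegating it to plugin handlers.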
[jira] [Resolved] (SOLR-14776) Precompute the fingerprint during PeerSync
[ https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14776. -- Resolution: Fixed Thanks Dat for the fix and Mike for the reviews! > Precompute the fingerprint during PeerSync > -- > > Key: SOLR-14776 > URL: https://issues.apache.org/jira/browse/SOLR-14776 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.7 > > Time Spent: 20m > Remaining Estimate: 0h > > Computing a fingerprint can be very costly and take time. But the current > implementation sends requests for the fingerprints of multiple > replicas, then on the first response it computes its own fingerprint > for comparison. A very simple but effective improvement here is to compute its > own fingerprint right after sending requests to other replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
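The improvement described in the issue — start the costly local fingerprint computation immediately after firing off the remote requests, instead of after the first response arrives — is the classic pattern of overlapping local work with network I/O. A hedged sketch; the suppliers here are placeholders standing in for SolrJ requests and index fingerprinting, not actual Solr APIs:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class PeerSyncSketch {
    /**
     * Fire all remote fingerprint requests asynchronously, then compute our
     * own fingerprint while the responses are still in flight, rather than
     * waiting for the first response before starting the local computation.
     */
    static boolean fingerprintsMatch(List<Supplier<Long>> remoteRequests,
                                     Supplier<Long> localFingerprint) {
        // Remote requests run on the common pool while we work locally.
        List<CompletableFuture<Long>> remote = remoteRequests.stream()
                .map(CompletableFuture::supplyAsync)
                .collect(Collectors.toList());
        // Local computation overlaps with the network round-trips.
        long local = localFingerprint.get();
        return remote.stream().allMatch(f -> f.join() == local);
    }

    public static void main(String[] args) {
        boolean ok = fingerprintsMatch(
                List.of(() -> 42L, () -> 42L),   // stand-in remote fingerprints
                () -> 42L);                       // stand-in local fingerprint
        System.out.println("in sync: " + ok);
    }
}
```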
[jira] [Updated] (SOLR-14776) Precompute the fingerprint during PeerSync
[ https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14776: - Fix Version/s: 8.7 master (9.0) > Precompute the fingerprint during PeerSync > -- > > Key: SOLR-14776 > URL: https://issues.apache.org/jira/browse/SOLR-14776 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.7 > > Time Spent: 10m > Remaining Estimate: 0h > > Computing fingerprint can very costly and take time. But the current > implementation will send requests for getting fingerprint for multiple > replicas, then on the first response it will then compute its own fingerprint > for comparison. A very simple but effective improvement here is compute its > own fingerprint right after send requests to other replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-14776) Precompute the fingerprint during PeerSync
[ https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-14776: Assignee: Shalin Shekhar Mangar (was: Cao Manh Dat) > Precompute the fingerprint during PeerSync > -- > > Key: SOLR-14776 > URL: https://issues.apache.org/jira/browse/SOLR-14776 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Shalin Shekhar Mangar >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Computing fingerprint can very costly and take time. But the current > implementation will send requests for getting fingerprint for multiple > replicas, then on the first response it will then compute its own fingerprint > for comparison. A very simple but effective improvement here is compute its > own fingerprint right after send requests to other replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14776) Precompute the fingerprint during PeerSync
[ https://issues.apache.org/jira/browse/SOLR-14776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213015#comment-17213015 ] Shalin Shekhar Mangar commented on SOLR-14776: -- I have added a comment as Mike suggested. I'll commit this once tests pass locally. > Precompute the fingerprint during PeerSync > -- > > Key: SOLR-14776 > URL: https://issues.apache.org/jira/browse/SOLR-14776 > Project: Solr > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Shalin Shekhar Mangar >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Computing a fingerprint can be very costly and take time. But the current > implementation sends requests for the fingerprints of multiple > replicas, then on the first response it computes its own fingerprint > for comparison. A very simple but effective improvement here is to compute its > own fingerprint right after sending requests to other replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14640) Improve concurrency of SlowCompositeReaderWrapper.getSortedDocValues
Shalin Shekhar Mangar created SOLR-14640: Summary: Improve concurrency of SlowCompositeReaderWrapper.getSortedDocValues Key: SOLR-14640 URL: https://issues.apache.org/jira/browse/SOLR-14640 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: search Affects Versions: 8.4.1 Reporter: Shalin Shekhar Mangar Attachments: Screen Shot 2020-07-09 at 4.46.46 PM.png Under heavy query load, the synchronized HashMap {{cachedOrdMaps}} inside SlowCompositeReaderWrapper.getSortedDocValues blocks search threads. See attached screenshot of a java flight recording from an affected node. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14639) Improve concurrency of SlowCompositeReaderWrapper.terms
[ https://issues.apache.org/jira/browse/SOLR-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154440#comment-17154440 ] Shalin Shekhar Mangar commented on SOLR-14639: -- The problem is that ConcurrentHashMap.computeIfAbsent can be costly under contention. In JDK8, computeIfAbsent locks the node in which the key should be present regardless of whether the key exists or not [1]. This means that computeIfAbsent is always blocking as compared to get() which is a non-blocking operation. In JDK9, this was slightly ameliorated by adding a fast-return in case the key was found in the first node without entering a synchronization block. But if there is a hash collision and the key is not in the first node, then computeIfAbsent enters into a synchronization block on the node to find the key. For a cache, we can expect that the key will exist in most of the lookups so it makes sense to avoid the cost of entering a synchronized block for retrieval. Doug Lea wrote on the concurrency mailing list [2]: {code} With the current implementation, if you are implementing a cache, it may be better to code cache.get to itself do a pre-screen, as in: V v = map.get(key); return (v != null) ? v : map.computeIfAbsent(key, function); However, the exact benefit depends on access patterns. For example, I reran your benchmark cases (urls below) on a 32way x86, and got throughputs (ops/sec) that are dramatically better with pre-screen for the case of a single key, but worse with your Zipf-distributed keys. {code} I would like to implement this method or switch to caffeine which has a non-blocking return in case the keys already exist [3]. 
[1] - https://concurrency-interest.altair.cs.oswego.narkive.com/0Jfe1waD/computeifabsent-optimized-for-missing-entries [2] - http://cs.oswego.edu/pipermail/concurrency-interest/2014-December/013360.html [3] - https://github.com/ben-manes/caffeine/wiki/Benchmarks > Improve concurrency of SlowCompositeReaderWrapper.terms > --- > > Key: SOLR-14639 > URL: https://issues.apache.org/jira/browse/SOLR-14639 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: 8.4.1 >Reporter: Shalin Shekhar Mangar >Priority: Major > Attachments: Screen Shot 2020-07-09 at 4.38.03 PM.png > > > Under heavy query load, the ConcurrentHashMap.computeIfAbsent method inside > the SlowCompositeReaderWrapper.terms(String) method blocks searcher threads > (see attached screenshot of a java flight recording). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
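Doug Lea's pre-screen suggestion quoted in the comment above — a plain, non-blocking get() first, falling back to computeIfAbsent only on a miss — looks like this as a self-contained cache wrapper (names are illustrative, not Solr's):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PrescreenCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

    /**
     * Cheap non-blocking get() first; only fall back to computeIfAbsent
     * (which may lock a bin even when the key exists, on JDK 8) on a miss.
     * For caches, most lookups are hits, so the hot path never blocks.
     */
    public V get(K key, Function<K, V> loader) {
        V v = map.get(key);
        return (v != null) ? v : map.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        PrescreenCache<String, Integer> cache = new PrescreenCache<>();
        System.out.println(cache.get("a", String::length)); // miss: computed once
        System.out.println(cache.get("a", k -> 99));        // hit: loader not invoked
    }
}
```

Note the usual caveat with this pattern: unlike a bare computeIfAbsent, the pre-screen adds one extra volatile read on a miss, which is negligible; the loader is still invoked at most once per key.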
[jira] [Created] (SOLR-14639) Improve concurrency of SlowCompositeReaderWrapper.terms
Shalin Shekhar Mangar created SOLR-14639: Summary: Improve concurrency of SlowCompositeReaderWrapper.terms Key: SOLR-14639 URL: https://issues.apache.org/jira/browse/SOLR-14639 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: search Affects Versions: 8.4.1 Reporter: Shalin Shekhar Mangar Attachments: Screen Shot 2020-07-09 at 4.38.03 PM.png Under heavy query load, the ConcurrentHashMap.computeIfAbsent method inside the SlowCompositeReaderWrapper.terms(String) method blocks searcher threads (see attached screenshot of a java flight recording). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13325) Add a collection selector to ComputePlanAction
[ https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-13325. -- Assignee: Shalin Shekhar Mangar Resolution: Fixed Thanks [~ab] for the review! > Add a collection selector to ComputePlanAction > -- > > Key: SOLR-13325 > URL: https://issues.apache.org/jira/browse/SOLR-13325 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.6 > > Time Spent: 1h > Remaining Estimate: 0h > > Similar to SOLR-13273, it'd be nice to have a collection selector that > applies to compute plan action. An example use-case would be to selectively > add replicas on new nodes for certain collections only. > Here is a selector that returns collections that match the given collection > property/value pair: > {code} > "collection": {"property_name": "property_value"} > {code} > Here's another selector that returns collections that have the given policy > applied > {code} > "collection": {"#policy": "policy_name"} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction
[ https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13325: - Fix Version/s: (was: 8.2) 8.6 > Add a collection selector to ComputePlanAction > -- > > Key: SOLR-13325 > URL: https://issues.apache.org/jira/browse/SOLR-13325 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.6 > > > Similar to SOLR-13273, it'd be nice to have a collection selector that > applies to compute plan action. An example use-case would be to selectively > add replicas on new nodes for certain collections only. > Here is a selector that returns collections that match the given collection > property/value pair: > {code} > "collection": {"property_name": "property_value"} > {code} > Here's another selector that returns collections that have the given policy > applied > {code} > "collection": {"#policy": "policy_name"} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction
[ https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13325: - Summary: Add a collection selector to ComputePlanAction (was: Add a collection selector to triggers) > Add a collection selector to ComputePlanAction > -- > > Key: SOLR-13325 > URL: https://issues.apache.org/jira/browse/SOLR-13325 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.2 > > > Similar to SOLR-13273, it'd be nice to have a collection selector that > applies to triggers. An example use-case would be to selectively add replicas > on new nodes for certain collections only. > Here is a selector that returns collections that match the given collection > property/value pair: > {code} > "collection": {"property_name": "property_value"} > {code} > Here's another selector that returns collections that have the given policy > applied > {code} > "collection": {"#policy": "policy_name"} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13325) Add a collection selector to ComputePlanAction
[ https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13325: - Description: Similar to SOLR-13273, it'd be nice to have a collection selector that applies to compute plan action. An example use-case would be to selectively add replicas on new nodes for certain collections only. Here is a selector that returns collections that match the given collection property/value pair: {code} "collection": {"property_name": "property_value"} {code} Here's another selector that returns collections that have the given policy applied {code} "collection": {"#policy": "policy_name"} {code} was: Similar to SOLR-13273, it'd be nice to have a collection selector that applies to triggers. An example use-case would be to selectively add replicas on new nodes for certain collections only. Here is a selector that returns collections that match the given collection property/value pair: {code} "collection": {"property_name": "property_value"} {code} Here's another selector that returns collections that have the given policy applied {code} "collection": {"#policy": "policy_name"} {code} > Add a collection selector to ComputePlanAction > -- > > Key: SOLR-13325 > URL: https://issues.apache.org/jira/browse/SOLR-13325 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.2 > > > Similar to SOLR-13273, it'd be nice to have a collection selector that > applies to compute plan action. An example use-case would be to selectively > add replicas on new nodes for certain collections only. 
> Here is a selector that returns collections that match the given collection > property/value pair: > {code} > "collection": {"property_name": "property_value"} > {code} > Here's another selector that returns collections that have the given policy > applied > {code} > "collection": {"#policy": "policy_name"} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14472) Autoscaling "cores" preference should count all cores, not just loaded.
[ https://issues.apache.org/jira/browse/SOLR-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105054#comment-17105054 ] Shalin Shekhar Mangar commented on SOLR-14472: -- Transient cores are not supported in Solr cloud today and autoscaling works only in cloud mode. What am I missing here? > Autoscaling "cores" preference should count all cores, not just loaded. > --- > > Key: SOLR-14472 > URL: https://issues.apache.org/jira/browse/SOLR-14472 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: AutoScaling >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > > The AutoScaling "cores" preference works by counting the core names that are > retrievable via the metrics API. 99% of the time, that's fine but it does > not count unloaded transient cores that are also tracked by the > CoreContainer, which I think should be counted as well. Most users don't > have such cores so it won't affect them. > Furthermore, instead of counting them by asking the metrics API to return > each loaded core name, it should use the {{CONTAINER.cores}} prefix set of > counters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13325) Add a collection selector to triggers
[ https://issues.apache.org/jira/browse/SOLR-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091090#comment-17091090 ] Shalin Shekhar Mangar commented on SOLR-13325: -- I'm looking at this again. I think we should change the syntax slightly and get rid of the {{#policy}} key name. Instead, this can operate on any collection property such as policy or configName or autoAddReplicas etc that are part of the collection state. What slightly complicates things is that there are additional collection properties (stored in collectionprops.json). I don't intend to support those at the moment. On a related note, collection props have write APIs but no read APIs, which severely limits the usefulness of that feature. That's something we should fix separately. Now once we have this working, it reduces the need for a separate AutoAddReplicasPlanAction because you can get the same behavior by setting the following in ComputePlanAction: {code} "collection": {"autoAddReplicas": "true"} {code} However, there is a difference between the current implementation of "collections" in ComputePlanAction and how AutoAddReplicasPlanAction works, which is that the former filters out suggestions of non-matching collections but the latter pushes down the collection hint to the policy engine so that it doesn't even compute suggestions for non-matching collections in the first place. The latter is obviously more efficient. The one thing we have to be careful about is that the list of matching collections should be evaluated lazily when the action is triggered instead of early in the init method so that it can *see* the changes in the cluster state. 
> Add a collection selector to triggers > - > > Key: SOLR-13325 > URL: https://issues.apache.org/jira/browse/SOLR-13325 > Project: Solr > Issue Type: Improvement > Components: AutoScaling >Reporter: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.2 > > > Similar to SOLR-13273, it'd be nice to have a collection selector that > applies to triggers. An example use-case would be to selectively add replicas > on new nodes for certain collections only. > Here is a selector that returns collections that match the given collection > property/value pair: > {code} > "collection": {"property_name": "property_value"} > {code} > Here's another selector that returns collections that have the given policy > applied > {code} > "collection": {"#policy": "policy_name"} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
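The lazy-evaluation point in the comment above — resolve the set of matching collections when the action fires, not once in init, so changes to the cluster state are visible — can be sketched with a stored property/value pair applied against the *current* state at trigger time. ClusterState here is a deliberately simplified, hypothetical stand-in for Solr's real cluster state:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class CollectionSelectorSketch {
    // Hypothetical, simplified cluster state: collection name -> properties.
    record ClusterState(Map<String, Map<String, String>> collections) {}

    private final String propName;
    private final String propValue;

    CollectionSelectorSketch(String propName, String propValue) {
        this.propName = propName;   // e.g. "autoAddReplicas"
        this.propValue = propValue; // e.g. "true"
    }

    /**
     * Evaluated lazily at trigger time: the supplier fetches the *current*
     * cluster state, so collections created or modified after init are seen.
     */
    List<String> matching(Supplier<ClusterState> currentState) {
        return currentState.get().collections().entrySet().stream()
                .filter(e -> propValue.equals(e.getValue().get(propName)))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```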
[jira] [Resolved] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values
[ https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14365. -- Fix Version/s: 8.6 master (9.0) Resolution: Fixed > CollapsingQParser - Avoiding always allocate int[] and float[] with size > equals to number of unique values > -- > > Key: SOLR-14365 > URL: https://issues.apache.org/jira/browse/SOLR-14365 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.4.1 >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: SOLR-14365.patch > > Time Spent: 8h 10m > Remaining Estimate: 0h > > Since Collapsing is a PostFilter, documents that reach Collapsing must match > all filters and queries, so the number of documents Collapsing needs to > collect/score is a small fraction of the total number of documents in > the index. So why do we need to always consume the memory (for the int[] and > float[] arrays) for all unique values of the collapsed field? If the number of > unique values of the collapsed field found in the documents that match > queries and filters is 300, then we only need int[] and float[] arrays of > size 300, not 1.2 million. However, we don't know which values > of the collapsed field will show up in the results, so we cannot use smaller > arrays. > The easy fix for this problem is to allocate only as much as we need by using > IntIntMap and IntFloatMap, which hold primitives and are much more space > efficient than the Java HashMap. These maps can be slower (10x or 20x) than > plain int[] and float[] if the number of matched documents is large (almost > all documents match the queries and other filters). But our belief is that > this does not happen frequently (how frequently do we run collapsing on the > entire index?). 
> For this issue I propose adding 2 methods for collapsing: > * array : the current implementation > * hash : the new approach, which will be the default method > later we can add another method {{smart}} which automatically picks a method > based on a comparison between {{number of docs matched queries and filters}} > and {{number of unique values of the field}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
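The space argument behind the "hash" method above: instead of allocating float[numUniqueValues] up front, use a primitive int-to-float hash map sized to the documents actually collected. Solr's patch uses primitive collections (HPPC-style IntFloatMap); the following is a minimal open-addressing sketch of the same idea in plain Java, not the actual implementation. Keys must be non-negative (like ordinals or doc IDs):

```java
import java.util.Arrays;

/** Minimal open-addressing int -> float map, a stand-in for primitive
 *  collections like HPPC's IntFloatHashMap. Keys must be >= 0. */
public class IntFloatMapSketch {
    private int[] keys;
    private float[] values;
    private int size;

    public IntFloatMapSketch(int expected) {
        // Power-of-two capacity, kept at most half full.
        int cap = Integer.highestOneBit(Math.max(4, expected * 2) - 1) << 1;
        keys = new int[cap];
        values = new float[cap];
        Arrays.fill(keys, -1);          // -1 marks an empty slot
    }

    public void put(int key, float value) {
        if (size * 2 >= keys.length) grow();
        int i = slot(key);
        if (keys[i] < 0) { keys[i] = key; size++; }
        values[i] = value;
    }

    public float getOrDefault(int key, float dflt) {
        int i = slot(key);
        return keys[i] < 0 ? dflt : values[i];
    }

    public int size() { return size; }

    // Linear probing over a power-of-two table with a multiplicative spread.
    private int slot(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] >= 0 && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    private void grow() {
        int[] ok = keys; float[] ov = values;
        keys = new int[ok.length * 2];
        values = new float[ov.length * 2];
        Arrays.fill(keys, -1);
        size = 0;
        for (int j = 0; j < ok.length; j++) if (ok[j] >= 0) put(ok[j], ov[j]);
    }
}
```

With 300 collected groups this holds roughly a 1024-slot table instead of two 1.2-million-entry arrays, which is the trade the issue describes: far less memory, at the cost of hashing per access.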
[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values
[ https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089177#comment-17089177 ] Shalin Shekhar Mangar commented on SOLR-14365: -- I think this is ready to be cherry picked to branch_8x. I'll do that today unless there are any objections. > CollapsingQParser - Avoiding always allocate int[] and float[] with size > equals to number of unique values > -- > > Key: SOLR-14365 > URL: https://issues.apache.org/jira/browse/SOLR-14365 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.4.1 >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Major > Attachments: SOLR-14365.patch > > Time Spent: 8h 10m > Remaining Estimate: 0h > > Since Collapsing is a PostFilter, documents reach Collapsing must match with > all filters and queries, so the number of documents Collapsing need to > collect/compute score is a small fraction of the total number documents in > the index. So why do we need to always consume the memory (for int[] and > float[] array) for all unique values of the collapsed field? If the number of > unique values of the collapsed field found in the documents that match > queries and filters is 300 then we only need int[] and float[] array with > size of 300 and not 1.2 million in size. However, we don't know which value > of the collapsed field will show up in the results so we cannot use a smaller > array. > The easy fix for this problem is using as much as we need by using IntIntMap > and IntFloatMap that hold primitives and are much more space efficient than > the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and > float[] if matched documents is large (almost all documents matched queries > and other filters). But our belief is that does not happen that frequently > (how frequently do we run collapsing on the entire index?). 
> For this issue I propose adding 2 methods for collapsing which is > * array : which is current implementation > * hash : which is new approach and will be default method > later we can add another method {{smart}} which is automatically pick method > based on comparision between {{number of docs matched queries and filters}} > and {{number of unique values of the field}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14391) Remove getDocSet's manual doc collection logic; remove ScoreFilter
[ https://issues.apache.org/jira/browse/SOLR-14391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086738#comment-17086738 ] Shalin Shekhar Mangar commented on SOLR-14391: -- bq. Given that this dates back to Lucene 3.2 or so, it was probably the most performant way, regardless if there was another way or not. Is this still the most performant way? [~dsmiley] -- did you compare performance before removing the manual doc loop? > Remove getDocSet's manual doc collection logic; remove ScoreFilter > -- > > Key: SOLR-14391 > URL: https://issues.apache.org/jira/browse/SOLR-14391 > Project: Solr > Issue Type: Task > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > Fix For: 8.6 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {{SolrIndexSearcher.getDocSet(List)}} calls getProcessedFilter and > then basically loops over doc IDs, passing them through the filter, and > passes them to the Collector. This logic is redundant with what Lucene > searcher.search(query,collector) will ultimately do in BulkScorer, and so I > propose we remove all that code and delegate to Lucene. > Also, the top of this method looks to see if any query implements the > "ScoreFilter" marker interface (only implemented by CollapsingPostFilter) and > if so delegates to {{getDocSetScore}} method instead. That method has an > implementation close to what I propose getDocSet be changed to; so it can be > removed along with this ScoreFilter interface > searcher.search(query,collector). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14406) Use exponential backoff in RecoveryStrategy.pingLeader
Shalin Shekhar Mangar created SOLR-14406: Summary: Use exponential backoff in RecoveryStrategy.pingLeader Key: SOLR-14406 URL: https://issues.apache.org/jira/browse/SOLR-14406 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: SolrCloud Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar The RecoveryStrategy.pingLeader method tries to connect/ping the known leader in a tight loop, waiting only 500ms between attempts. This is wasteful when the leader is down and also litters the logs with messages like the following, repeated very frequently (especially when there is more than one replica on the node whose leader is down): {code} Failed to connect leader http://xyz/solr on recovery, try again {code} We should use an exponential back-off between retries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
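A capped exponential back-off replacing the fixed 500ms wait might look like the sketch below. The base delay, cap, and retry limit are illustrative values, not the ones Solr ended up using:

```java
import java.util.function.IntPredicate;

public class LeaderPingBackoff {
    /** Capped exponential delay: 500ms, 1s, 2s, 4s, ... up to maxMs. */
    static long delayMs(int attempt, long baseMs, long maxMs) {
        long d = baseMs << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(d, maxMs);
    }

    /**
     * Retry pinging the leader with growing waits instead of a tight
     * 500ms loop; tryPing is a stand-in for the real leader ping.
     */
    static boolean pingLeaderWithBackoff(IntPredicate tryPing, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryPing.test(attempt)) return true;          // leader reachable
            Thread.sleep(delayMs(attempt, 500, 30_000));     // back off before retrying
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in ping that succeeds on the third attempt (attempt index 2).
        boolean ok = pingLeaderWithBackoff(a -> a == 2, 5);
        System.out.println("leader reachable: " + ok);
    }
}
```

Capping the delay matters: without it, a node whose leader stays down for minutes would eventually wait so long that it reacts sluggishly when the leader returns.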
[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-9909: Fix Version/s: (was: 6.7) (was: 7.0) 8.6 master (9.0) Assignee: Shalin Shekhar Mangar Resolution: Fixed Status: Resolved (was: Patch Available) Thanks Andras! > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Trivial > Fix For: master (9.0), 8.6 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Issue Comment Deleted] (SOLR-11960) Add collection level properties
[ https://issues.apache.org/jira/browse/SOLR-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-11960: - Comment: was deleted (was: Get instant response and good care of WiFi range extender for our highly talented experts. Available 24/7 365 days to provide you the best assistance and support for your wireless range extender. visit : [https://www.routerloginnet.tips/]) > Add collection level properties > --- > > Key: SOLR-11960 > URL: https://issues.apache.org/jira/browse/SOLR-11960 > Project: Solr > Issue Type: New Feature >Reporter: Peter Rusko >Assignee: Tomas Eduardo Fernandez Lobbe >Priority: Blocker > Fix For: 7.3, 8.0 > > Attachments: SOLR-11960.patch, SOLR-11960.patch, SOLR-11960.patch, > SOLR-11960.patch, SOLR-11960.patch, SOLR-11960_2.patch > > > Solr has cluster properties, but no easy and extendable way of defining > properties that affect a single collection. Collection properties could be > stored in a single zookeeper node per collection, making it possible to > trigger zookeeper watchers for only those Solr nodes that have cores of that > collection. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081999#comment-17081999 ] Shalin Shekhar Mangar commented on SOLR-9909: - I accidentally used the wrong issue number in the commit so the asf git message went to another issue. Here's the commit: {code} Commit 13f19f65559290a860df84fa1b5ac2db903b27ec in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=13f19f6 ] SOLR-9906: SolrjNamedThreadFactory is deprecated in favor of SolrNamedThreadFactory. DefaultSolrThreadFactory is removed from solr-core in favor of SolrNamedThreadFactory in solrj package and all solr-core classes now use SolrNamedThreadFactory {code} > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-9906) Use better check to validate if node recovered via PeerSync or Replication
[ https://issues.apache.org/jira/browse/SOLR-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081998#comment-17081998 ] Shalin Shekhar Mangar commented on SOLR-9906: - Please ignore the above comment. It was intended for SOLR-9909. > Use better check to validate if node recovered via PeerSync or Replication > -- > > Key: SOLR-9906 > URL: https://issues.apache.org/jira/browse/SOLR-9906 > Project: Solr > Issue Type: Improvement >Reporter: Pushkar Raste >Assignee: Noble Paul >Priority: Minor > Fix For: 6.4 > > Attachments: SOLR-9906.patch, SOLR-9906.patch, > SOLR-PeerSyncVsReplicationTest.diff > > > Tests {{LeaderFailureAfterFreshStartTest}} and {{PeerSyncReplicationTest}} > currently rely on the number of requests made to the leader's replication handler > to check whether a node recovered via PeerSync or replication. This check is not > very reliable and we have seen failures in the past. > While tinkering with different ways to write a better test I found > [SOLR-9859|SOLR-9859]. Now that SOLR-9859 is fixed, here is an idea for a better > way to distinguish recovery via PeerSync vs Replication: > * For {{PeerSyncReplicationTest}}, if a node successfully recovers via > PeerSync, then the file {{replication.properties}} should not exist > * For {{LeaderFailureAfterFreshStartTest}}, if the freshly replicated node does > not go into replication recovery after the leader failure, the contents of > {{replication.properties}} should not change
[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081693#comment-17081693 ] Shalin Shekhar Mangar commented on SOLR-9909: - Updated patch that adds the ASL to the new class. This will make the RAT check pass. > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-9909: Attachment: SOLR-9909.patch > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081256#comment-17081256 ] Shalin Shekhar Mangar commented on SOLR-9909: - Updated patch which deprecates SolrjNamedThreadFactory and adds a SolrNamedThreadFactory. Once this patch is applied on master and branch_8x, I will follow up with a commit on master to delete the deprecated SolrjNamedThreadFactory. > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-9909: Attachment: SOLR-9909.patch > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081253#comment-17081253 ] Shalin Shekhar Mangar commented on SOLR-9909: - Well, we can deprecate SolrjNamedThreadFactory and add a SolrNamedThreadFactory in 8x. The former can be deleted on master. > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081131#comment-17081131 ] Shalin Shekhar Mangar commented on SOLR-9909: - Patch updated to master. > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-9909) Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory
[ https://issues.apache.org/jira/browse/SOLR-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-9909: Attachment: SOLR-9909.patch > Nuke one of DefaultSolrThreadFactory and SolrjNamedThreadFactory > > > Key: SOLR-9909 > URL: https://issues.apache.org/jira/browse/SOLR-9909 > Project: Solr > Issue Type: Task >Reporter: Shalin Shekhar Mangar >Priority: Trivial > Fix For: 6.7, 7.0 > > Attachments: SOLR-9909-01.patch, SOLR-9909-02.patch, > SOLR-9909-03.patch, SOLR-9909.patch > > > DefaultSolrThreadFactory and SolrjNamedThreadFactory have exactly the same > code. Let's remove one of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search
[ https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14402: - Fix Version/s: 8.6 master (9.0) Resolution: Fixed Status: Resolved (was: Patch Available) > Avoid creating new exceptions for every request made to > MDCAwareThreadPoolExecutor by distributed search > > > Key: SOLR-14402 > URL: https://issues.apache.org/jira/browse/SOLR-14402 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 7.4 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.6 > > Attachments: SOLR-14402.patch > > > SOLR-11880 tried to do the same and it succeeded for update shard handler but > the implementation was wrong for http shard handler because the executor > created during construction is overwritten in the init() method. The commit > for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/ > Thanks [~caomanhdat] for spotting this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search
[ https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14402: - Status: Patch Available (was: Open) > Avoid creating new exceptions for every request made to > MDCAwareThreadPoolExecutor by distributed search > > > Key: SOLR-14402 > URL: https://issues.apache.org/jira/browse/SOLR-14402 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 7.4 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Attachments: SOLR-14402.patch > > > SOLR-11880 tried to do the same and it succeeded for update shard handler but > the implementation was wrong for http shard handler because the executor > created during construction is overwritten in the init() method. The commit > for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/ > Thanks [~caomanhdat] for spotting this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search
[ https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080406#comment-17080406 ] Shalin Shekhar Mangar commented on SOLR-14402: -- Here's a simple patch that sets enableSubmitterStackTrace to false while creating the executor inside HttpShardHandlerFactory's init method. It removes the executor that was initialized in the class attribute because it is overwritten in init anyway. The patch also fixes ZkControllerTest and OverseerTest which were using HttpShardHandlerFactory without calling init first. > Avoid creating new exceptions for every request made to > MDCAwareThreadPoolExecutor by distributed search > > > Key: SOLR-14402 > URL: https://issues.apache.org/jira/browse/SOLR-14402 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 7.4 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Attachments: SOLR-14402.patch > > > SOLR-11880 tried to do the same and it succeeded for update shard handler but > the implementation was wrong for http shard handler because the executor > created during construction is overwritten in the init() method. The commit > for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/ > Thanks [~caomanhdat] for spotting this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
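The cost the patch avoids can be illustrated with a small decorator: capturing a Throwable at submit() time performs a full stack walk per task, which is wasteful on a hot path like distributed search. This is a sketch of the pattern only, not Solr's actual MDCAwareThreadPoolExecutor; the class and field names are assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the submitter-stack-trace pattern that enableSubmitterStackTrace
// controls. When enabled, an Exception is allocated on every submit purely to
// record where the task came from; disabling it skips that per-request cost.
// Illustrative only, not Solr's MDCAwareThreadPoolExecutor.
public class SubmitterTraceSketch {
    private final ExecutorService delegate = Executors.newFixedThreadPool(2);
    private final boolean enableSubmitterStackTrace;

    public SubmitterTraceSketch(boolean enableSubmitterStackTrace) {
        this.enableSubmitterStackTrace = enableSubmitterStackTrace;
    }

    public void execute(Runnable command) {
        // Only pay for the stack walk when tracing is enabled.
        final Exception submitter = enableSubmitterStackTrace ? new Exception("submitter trace") : null;
        delegate.execute(() -> {
            try {
                command.run();
            } catch (RuntimeException e) {
                if (submitter != null) {
                    e.addSuppressed(submitter); // point back at the submitting thread
                }
                throw e;
            }
        });
    }

    public void shutdown() {
        delegate.shutdown();
    }
}
```

With tracing disabled (the patch's choice for the shard-handler pool), failed tasks lose the submitter-side stack trace but every request saves an exception allocation and stack capture.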
[jira] [Updated] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search
[ https://issues.apache.org/jira/browse/SOLR-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14402: - Attachment: SOLR-14402.patch > Avoid creating new exceptions for every request made to > MDCAwareThreadPoolExecutor by distributed search > > > Key: SOLR-14402 > URL: https://issues.apache.org/jira/browse/SOLR-14402 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 7.4 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Attachments: SOLR-14402.patch > > > SOLR-11880 tried to do the same and it succeeded for update shard handler but > the implementation was wrong for http shard handler because the executor > created during construction is overwritten in the init() method. The commit > for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/ > Thanks [~caomanhdat] for spotting this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values
[ https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080401#comment-17080401 ] Shalin Shekhar Mangar commented on SOLR-14365: -- I just saw this test failure on master which seems related and is reproducible:
{code}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomCollapseQParserPlugin -Dtests.method=testRandomCollpaseWithSort -Dtests.seed=20C0F4D7CBA81876 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=lv -Dtests.timezone=America/St_Johns -Dtests.asserts=true -Dtests.file.encoding=ANSI_X3.4-1968
[junit4] FAILURE 7.30s J4 | TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort <<<
[junit4]  > Throwable #1: java.lang.AssertionError: collapseKey too big -- need to grow array?
[junit4]  >   at __randomizedtesting.SeedInfo.seed([20C0F4D7CBA81876:257D871EE0002B85]:0)
[junit4]  >   at org.apache.solr.search.CollapsingQParserPlugin$SortFieldsCompare.setGroupValues(CollapsingQParserPlugin.java:2702)
[junit4]  >   at org.apache.solr.search.CollapsingQParserPlugin$IntSortSpecStrategy.collapse(CollapsingQParserPlugin.java:2544)
[junit4]  >   at org.apache.solr.search.CollapsingQParserPlugin$IntFieldValueCollector.collect(CollapsingQParserPlugin.java:1223)
[junit4]  >   at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:254)
[junit4]  >   at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:205)
[junit4]  >   at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
[junit4]  >   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:739)
[junit4]  >   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526)
[junit4]  >   at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:202)
[junit4]  >   at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1651)
[junit4]  >   at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1469)
[junit4]  >   at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
[junit4]  >   at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487)
[junit4]  >   at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399)
[junit4]  >   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)
[junit4]  >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:209)
[junit4]  >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2565)
[junit4]  >   at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:227)
[junit4]  >   at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
[junit4]  >   at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1003)
[junit4]  >   at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1018)
[junit4]  >   at org.apache.solr.search.TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort(TestRandomCollapseQParserPlugin.java:158)
[junit4]  >   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{code}
> CollapsingQParser - Avoiding always allocate int[] and float[] with size > equals to number of unique values > -- > > Key: SOLR-14365 > URL: https://issues.apache.org/jira/browse/SOLR-14365 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.4.1 >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Major > Attachments: SOLR-14365.patch > > Time Spent: 8h 10m > Remaining Estimate: 0h > > Since Collapsing is a PostFilter, documents reach Collapsing must match with > all filters and queries, so the number of documents Collapsing need to > collect/compute score is a small fraction of the total number documents in > the index. So why do we need to always consume the memory (for int[] and > float[] array) for all unique values of the collapsed field?
If the number of > unique values of the collapsed field found in the documents that match > queries and filters is 300 then we only need int[] and float[] array with > size of 300 and not 1.2 million in size. However, we don't know which value > of the collapsed field will show up in the results so we cannot use a smaller > array. > The easy fix for this problem is using as much as we need by using IntIntMap > and IntFloatMap that hold primitives and are much more space efficient than > the Java HashMap.
[jira] [Created] (SOLR-14402) Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search
Shalin Shekhar Mangar created SOLR-14402: Summary: Avoid creating new exceptions for every request made to MDCAwareThreadPoolExecutor by distributed search Key: SOLR-14402 URL: https://issues.apache.org/jira/browse/SOLR-14402 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: SolrCloud Affects Versions: 7.4 Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar SOLR-11880 tried to do the same and it succeeded for update shard handler but the implementation was wrong for http shard handler because the executor created during construction is overwritten in the init() method. The commit for SOLR-11880 is at https://github.com/apache/lucene-solr/commit/5a47ed4/ Thanks [~caomanhdat] for spotting this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-12720) Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0
[ https://issues.apache.org/jira/browse/SOLR-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-12720. -- Assignee: Shalin Shekhar Mangar Resolution: Fixed Thanks [~marcussorealheis]! > Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0 > - > > Key: SOLR-12720 > URL: https://issues.apache.org/jira/browse/SOLR-12720 > Project: Solr > Issue Type: Task > Components: AutoScaling, SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Blocker > Fix For: master (9.0), 8.1 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > SOLR-12719 deprecated the autoReplicaFailoverWaitAfterExpiration property in > solr.xml. We should remove it entirely in Solr 8.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-12720) Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0
[ https://issues.apache.org/jira/browse/SOLR-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075999#comment-17075999 ] Shalin Shekhar Mangar commented on SOLR-12720: -- The commit for this issue used the wrong Jira: {quote} Commit 9322a7b37555832e41a25bbc556a34299b90204e in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9322a7b ] SOLR-12067: Remove support for autoReplicaFailoverWaitAfterExpiration This closes #1402. {quote} > Remove autoReplicaFailoverWaitAfterExpiration in Solr 8.0 > - > > Key: SOLR-12720 > URL: https://issues.apache.org/jira/browse/SOLR-12720 > Project: Solr > Issue Type: Task > Components: AutoScaling, SolrCloud >Reporter: Shalin Shekhar Mangar >Priority: Blocker > Fix For: 8.1, master (9.0) > > Time Spent: 2h 20m > Remaining Estimate: 0h > > SOLR-12719 deprecated the autoReplicaFailoverWaitAfterExpiration property in > solr.xml. We should remove it entirely in Solr 8.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14347) Autoscaling placement wrong when concurrent replica placements are calculated
[ https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075774#comment-17075774 ] Shalin Shekhar Mangar commented on SOLR-14347: -- The PR is still open, can you please close that too? > Autoscaling placement wrong when concurrent replica placements are calculated > - > > Key: SOLR-14347 > URL: https://issues.apache.org/jira/browse/SOLR-14347 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: AutoScaling >Affects Versions: 8.5 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Fix For: 8.6 > > Attachments: SOLR-14347.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Steps to reproduce: > * create a cluster of a few nodes (tested with 7 nodes) > * define per-collection policies that distribute replicas exclusively on > different nodes per policy > * concurrently create a few collections, each using a different policy > * resulting replica placement will be seriously wrong, causing many policy > violations > Running the same scenario but instead creating collections sequentially > results in no violations. > I suspect this is caused by incorrect locking level for all collection > operations (as defined in {{CollectionParams.CollectionAction}}) that create > new replica placements - i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, > REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations > use the policy engine to create new replica placements, and as a result they > change the cluster state. However, currently these operations are locked (in > {{OverseerCollectionMessageHandler.lockTask}} ) using > {{LockLevel.COLLECTION}}. In practice this means that the lock is held only > for the particular collection that is being modified. > A straightforward fix for this issue is to change the locking level to > CLUSTER (and I confirm this fixes the scenario described above). 
However, > this effectively serializes all collection operations listed above, which > will result in general slow-down of all collection operations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14374) Use coreLoadExecutor to load all cores; not just startup
[ https://issues.apache.org/jira/browse/SOLR-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071421#comment-17071421 ] Shalin Shekhar Mangar commented on SOLR-14374: -- [~dsmiley] - yes that makes sense, thank you! > Use coreLoadExecutor to load all cores; not just startup > > > Key: SOLR-14374 > URL: https://issues.apache.org/jira/browse/SOLR-14374 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > > CoreContainer.load() creates coreLoadExecutor (an Executor) to load > pre-existing cores concurrently -- defaulting to 8 at a time. Then it's > never used again. However, cores might be loaded in other circumstances: (a) > creating new cores, (b) "transient" cores, or (c) loadOnStartup=false cores, > (d) reload cores. By using coreLoadExecutor for all cases, we'll then have > metrics for core loading that work globally and not just on startup since > coreLoadExecutor is instrumented already -- > {{CONTAINER.threadPool.coreLoadExecutor}} metrics path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14356) PeerSync with hanging nodes
[ https://issues.apache.org/jira/browse/SOLR-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069426#comment-17069426 ] Shalin Shekhar Mangar commented on SOLR-14356: -- Okay, yes let's add the connect timeout exception and discuss a better fix in SOLR-14368 > PeerSync with hanging nodes > --- > > Key: SOLR-14356 > URL: https://issues.apache.org/jira/browse/SOLR-14356 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Cao Manh Dat >Priority: Major > Attachments: SOLR-14356.patch > > > Right now in {{PeerSync}} (during leader election), in case of exception on > requesting versions to a node, we will skip that node if exception is one the > following type > * ConnectTimeoutException > * NoHttpResponseException > * SocketException > Sometime the other node basically hang but still accept connection. In that > case SocketTimeoutException is thrown and we consider the {{PeerSync}} > process as failed and the whole shard just basically leaderless forever (as > long as the hang node still there). > We can't just blindly adding {{SocketTimeoutException}} to above list, since > [~shalin] mentioned that sometimes timeout can happen because of genuine > reasons too e.g. temporary GC pause. > I think the general idea here is we obey {{leaderVoteWait}} restriction and > retry doing sync with others in case of connection/timeout exception happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
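The exception classification discussed above can be sketched as follows. This is an illustration of the idea, not Solr's actual PeerSync code; the method names are assumptions:

```java
import java.net.ConnectException;
import java.net.SocketException;
import java.net.SocketTimeoutException;

// Sketch of the "skip vs retry" classification for PeerSync failures: some
// network errors mean the peer is simply unreachable and can be skipped, while
// a read timeout may be a transient pause (e.g. GC) on a hung-but-connected
// node and deserves a retry within the leaderVoteWait budget.
// Illustrative only, not Solr's PeerSync code.
public class PeerSyncExceptionSketch {
    // Connection-level failures: the peer never accepted the request.
    // Note ConnectException is a subclass of SocketException.
    static boolean isPeerUnreachable(Throwable t) {
        return t instanceof SocketException;
    }

    // The peer accepted the connection but did not respond in time:
    // possibly transient, so retry rather than failing the whole sync.
    static boolean isPossiblyTransient(Throwable t) {
        return t instanceof SocketTimeoutException;
    }
}
```

Because SocketTimeoutException extends InterruptedIOException (not SocketException), the two predicates are disjoint, which matches the concern in the issue: a timeout should not be treated the same as a dead peer.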
[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values
[ https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069285#comment-17069285 ] Shalin Shekhar Mangar commented on SOLR-14365: -- I think we should add another method and make it configurable. > CollapsingQParser - Avoiding always allocate int[] and float[] with size > equals to number of unique values > -- > > Key: SOLR-14365 > URL: https://issues.apache.org/jira/browse/SOLR-14365 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 8.4.1 >Reporter: Cao Manh Dat >Assignee: Cao Manh Dat >Priority: Major > > Since Collapsing is a PostFilter, documents reach Collapsing must match with > all filters and queries, so the number of documents Collapsing need to > collect/compute score is a small fraction of the total number documents in > the index. So why do we need to always consume the memory (for int[] and > float[] array) for all unique values of the collapsed field? If the number of > unique values of the collapsed field found in the documents that match > queries and filters is 300 then we only need int[] and float[] array with > size of 300 and not 1.2 million in size. However, we don't know which value > of the collapsed field will show up in the results so we cannot use a smaller > array. > The easy fix for this problem is using as much as we need by using IntIntMap > and IntFloatMap that hold primitives and are much more space efficient than > the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and > float[] if matched documents is large (almost all documents matched queries > and other filters). But our belief is that does not happen that frequently > (how frequently do we run collapsing on the entire index?). 
> For this issue I propose adding two methods for collapsing: > * array: the current implementation > * hash: the new approach, which will be the default method > Later we can add another method, {{smart}}, which automatically picks a method > based on a comparison between the {{number of docs matching queries and filters}} > and the {{number of unique values of the field}}
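The array-vs-hash trade-off described above can be shown with a self-contained sketch. It is illustrative only: the real patch would use primitive-specialized maps (e.g. HPPC's {{IntIntHashMap}}) to avoid boxing, while plain {{java.util.HashMap}} stands in here:

```java
import java.util.HashMap;
import java.util.Map;

public class CollapseSketch {
    // "array" method: one slot per unique value of the collapsed field,
    // regardless of how many documents survive the earlier filters.
    static int[] denseGroupHead(int numUniqueValues) {
        return new int[numUniqueValues]; // e.g. 1.2M slots even if only 300 groups match
    }

    // "hash" method: grow only as distinct group ordinals are observed in the
    // matched documents. Solr would use a primitive map to avoid boxing.
    static Map<Integer, Integer> sparseGroupHead(int[] matchedDocGroupOrds) {
        Map<Integer, Integer> head = new HashMap<>();
        for (int doc = 0; doc < matchedDocGroupOrds.length; doc++) {
            head.put(matchedDocGroupOrds[doc], doc); // keep last doc seen per group
        }
        return head; // size == number of distinct groups actually seen
    }
}
```

The map's footprint scales with the number of groups actually collected, which is the whole point of the proposed {{hash}} method.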
[jira] [Commented] (SOLR-10397) Port 'autoAddReplicas' feature to the autoscaling framework and make it work with non-shared filesystems
[ https://issues.apache.org/jira/browse/SOLR-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061338#comment-17061338 ] Shalin Shekhar Mangar commented on SOLR-10397: -- [~dsmiley] - I agree that both of those paths are bad. It could go to the core descriptor. > Port 'autoAddReplicas' feature to the autoscaling framework and make it work > with non-shared filesystems > > > Key: SOLR-10397 > URL: https://issues.apache.org/jira/browse/SOLR-10397 > Project: Solr > Issue Type: Sub-task > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Cao Manh Dat >Priority: Major > Labels: autoscaling > Fix For: 7.1, 8.0 > > Attachments: SOLR-10397.1.patch, SOLR-10397.2.patch, > SOLR-10397.2.patch, SOLR-10397.2.patch, SOLR-10397.patch, > SOLR-10397_remove_nocommit.patch > > > Currently 'autoAddReplicas=true' can be specified in the Collection Create > API to automatically add replicas when a replica becomes unavailable. I > propose to move this feature to the autoscaling cluster policy rules design. > This will include the following: > * Trigger support for ‘nodeLost’ event type > * Modification of existing implementation of ‘autoAddReplicas’ to > automatically create the appropriate ‘nodeLost’ trigger. > * Any such auto-created trigger must be marked internally such that setting > ‘autoAddReplicas=false’ via the Modify Collection API should delete or > disable corresponding trigger. > * Support for non-HDFS filesystems while retaining the optimization afforded > by HDFS i.e. the replaced replica can point to the existing data dir of the > old replica. > * Deprecate/remove the feature of enabling/disabling ‘autoAddReplicas’ across > the entire cluster using cluster properties in favor of using the > suspend-trigger/resume-trigger APIs. > This will retain backward compatibility for the most part and keep a common > use-case easy to enable as well as make it available to more people (i.e. > people who don't use HDFS). 
[jira] [Resolved] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces
[ https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-13996. -- Fix Version/s: 8.5 master (9.0) Resolution: Fixed I have a few more improvements planned but 8.5 has been cut so I will close this issue and open another. > Refactor HttpShardHandler#prepDistributed() into smaller pieces > --- > > Key: SOLR-13996 > URL: https://issues.apache.org/jira/browse/SOLR-13996 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-13996.patch, SOLR-13996.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Currently, it is very hard to understand all the various things being done in > HttpShardHandler. I'm starting with refactoring the prepDistributed() method > to make it easier to grasp. It has standalone and cloud code intertwined, and > I wanted to cleanly separate them out. Later, we can even have two separate > methods (one for standalone and one for cloud).
[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data
[ https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050677#comment-17050677 ] Shalin Shekhar Mangar commented on SOLR-13942: -- As someone who runs a managed search service and has to troubleshoot Solr issues, I want to add my 2 cents. There's plenty of information that is required for troubleshooting but is not available in clusterstatus or any other documented/public API. Sure there's the undocumented /admin/zookeeper which has a weird output format meant for I don't know who. But even that does not have a few things that I've found necessary to troubleshoot Solr. Here's a non-exhaustive list of things you need to troubleshoot Solr: # Length of overseer queues (available in overseerstatus API) # Contents of overseer queue (mildly useful, available in /admin/zookeeper) # Overseer election queue and current leader (former is available in /admin/zookeeper and latter in overseer status) # Cluster state (cluster status API) # Solr.xml (no API regardless of whether it is in ZK or filesystem) # Leader election queue and current leader for each shard (available in /admin/zookeeper) # Shard terms for each shard/replica (not available in any API) # Metrics/stats (metrics API) # Solr Logs (log API? unless it is rolled over) # GC logs (no API) The overseerstatus API cannot be hit if there is no overseer so there's that too. We run ZK and Solr inside kubernetes and we do not expose zookeeper publicly. So, to use a tool like zkcli means we have to port forward directly to the zk node which needs explicit privileges. Ideally we want to hit everything over http and never allow port forward privileges to anyone. 
So I see the following options: # Add missing information that is inside ZK (shard terms) to /admin/zookeeper and continue to live with its horrible output # Immediately change /admin/zookeeper to a better output format and change the UI to consume this new format # Deprecate /admin/zookeeper, introduce a clean API, migrate the UI to this new endpoint or a better alternative, and remove /admin/zookeeper in 9.0 # Not do anything and force people to use zkcli and existing Solr APIs for troubleshooting, as we've been doing till now My vote is to go with #3 and we can debate what we want to call the API and whether it should be a public, documented, supported API or an undocumented API like /admin/zookeeper. My preference is to keep this undocumented and unsupported just like /admin/zookeeper. The other question is how we can secure it -- is it enough to be the same as /admin/zookeeper from a security perspective? > /api/cluster/zk/* to fetch raw ZK data > -- > > Key: SOLR-13942 > URL: https://issues.apache.org/jira/browse/SOLR-13942 > Project: Solr > Issue Type: Bug >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Fix For: 8.5 > > Time Spent: 10m > Remaining Estimate: 0h > > example > download the {{state.json}} of > {code} > GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json > {code} > get a list of all children under {{/live_nodes}} > {code} > GET http://localhost:8983/api/cluster/zk/live_nodes > {code} > If the requested path is a node with children, show the list of child nodes > and their meta data
[jira] [Commented] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces
[ https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046629#comment-17046629 ] Shalin Shekhar Mangar commented on SOLR-13996: -- Fair enough, I'll rename the class. > Refactor HttpShardHandler#prepDistributed() into smaller pieces > --- > > Key: SOLR-13996 > URL: https://issues.apache.org/jira/browse/SOLR-13996 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: Shalin Shekhar Mangar >Priority: Major > Attachments: SOLR-13996.patch, SOLR-13996.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, it is very hard to understand all the various things being done in > HttpShardHandler. I'm starting with refactoring the prepDistributed() method > to make it easier to grasp. It has standalone and cloud code intertwined, and > wanted to cleanly separate them out. Later, we can even have two separate > method (for standalone and cloud, each). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
[ https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-12550: - Fix Version/s: 8.5 master (9.0) Resolution: Fixed Status: Resolved (was: Patch Available) Thanks Marc and Bérénice! > ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize > > > Key: SOLR-12550 > URL: https://issues.apache.org/jira/browse/SOLR-12550 > Project: Solr > Issue Type: Bug > Components: SolrJ >Reporter: Marc Morissette >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Time Spent: 1.5h > Remaining Estimate: 0h > > We're in a situation where we need to optimize some of our collections. These > optimizations are done with waitSearcher=true as a simple throttling > mechanism to prevent too many collections from being optimized at once. > We're seeing these optimize commands return without error after 10 minutes > but well before the end of the operation. Our Solr logs show errors with > socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value > has no effect. > See the links section for my patch. > It turns out that ConcurrentUpdateSolrClient delegates commit and optimize > commands to a private HttpSolrClient but fails to pass along its builder's > timeouts to that client. > A patch is attached in the links section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
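The shape of the bug fixed above can be shown with a simplified, self-contained sketch. These are not SolrJ's real classes; {{OuterBuilder}} and {{InnerClient}} are hypothetical stand-ins for {{ConcurrentUpdateSolrClient.Builder}} and its private {{HttpSolrClient}} delegate. A wrapper client must copy its builder's timeouts into the delegate it creates, otherwise the delegate silently falls back to defaults:

```java
public class TimeoutPropagationSketch {
    // Stand-in for the private HttpSolrClient that handles commit/optimize.
    static class InnerClient {
        final int connectTimeoutMs;
        final int socketTimeoutMs;
        InnerClient(int connectTimeoutMs, int socketTimeoutMs) {
            this.connectTimeoutMs = connectTimeoutMs;
            this.socketTimeoutMs = socketTimeoutMs;
        }
    }

    // Stand-in for ConcurrentUpdateSolrClient's builder.
    static class OuterBuilder {
        int connectTimeoutMs = 15_000;  // illustrative defaults
        int socketTimeoutMs = 120_000;

        OuterBuilder withSocketTimeout(int ms) { this.socketTimeoutMs = ms; return this; }

        // Buggy version: builds the delegate with hard defaults, ignoring
        // whatever the caller configured on this builder.
        InnerClient buildDelegateBuggy() { return new InnerClient(15_000, 120_000); }

        // Fixed version: propagates the builder's timeouts to the delegate.
        InnerClient buildDelegateFixed() {
            return new InnerClient(connectTimeoutMs, socketTimeoutMs);
        }
    }
}
```

With the buggy build path, a long-running optimize still times out at the default socket timeout no matter what the caller configured, which matches the reported symptom.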
[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
[ https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-12550: - Component/s: SolrJ > ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize > > > Key: SOLR-12550 > URL: https://issues.apache.org/jira/browse/SOLR-12550 > Project: Solr > Issue Type: Bug > Components: SolrJ >Reporter: Marc Morissette >Assignee: Shalin Shekhar Mangar >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > We're in a situation where we need to optimize some of our collections. These > optimizations are done with waitSearcher=true as a simple throttling > mechanism to prevent too many collections from being optimized at once. > We're seeing these optimize commands return without error after 10 minutes > but well before the end of the operation. Our Solr logs show errors with > socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value > has no effect. > See the links section for my patch. > It turns out that ConcurrentUpdateSolrClient delegates commit and optimize > commands to a private HttpSolrClient but fails to pass along its builder's > timeouts to that client. > A patch is attached in the links section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
[ https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-12550: Assignee: Shalin Shekhar Mangar > ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize > > > Key: SOLR-12550 > URL: https://issues.apache.org/jira/browse/SOLR-12550 > Project: Solr > Issue Type: Bug >Reporter: Marc Morissette >Assignee: Shalin Shekhar Mangar >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > We're in a situation where we need to optimize some of our collections. These > optimizations are done with waitSearcher=true as a simple throttling > mechanism to prevent too many collections from being optimized at once. > We're seeing these optimize commands return without error after 10 minutes > but well before the end of the operation. Our Solr logs show errors with > socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value > has no effect. > See the links section for my patch. > It turns out that ConcurrentUpdateSolrClient delegates commit and optimize > commands to a private HttpSolrClient but fails to pass along its builder's > timeouts to that client. > A patch is attached in the links section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
[ https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039017#comment-17039017 ] Shalin Shekhar Mangar commented on SOLR-12550: -- I added a review comment to #417. Other than that it looks good. > ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize > > > Key: SOLR-12550 > URL: https://issues.apache.org/jira/browse/SOLR-12550 > Project: Solr > Issue Type: Bug >Reporter: Marc Morissette >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > We're in a situation where we need to optimize some of our collections. These > optimizations are done with waitSearcher=true as a simple throttling > mechanism to prevent too many collections from being optimized at once. > We're seeing these optimize commands return without error after 10 minutes > but well before the end of the operation. Our Solr logs show errors with > socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value > has no effect. > See the links section for my patch. > It turns out that ConcurrentUpdateSolrClient delegates commit and optimize > commands to a private HttpSolrClient but fails to pass along its builder's > timeouts to that client. > A patch is attached in the links section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14248. -- Resolution: Fixed > Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch, SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032824#comment-17032824 ] Shalin Shekhar Mangar commented on SOLR-14248: -- The latest patch adds support for replica types and resolves a conflict introduced by SOLR-14245. It also adds a test for this class. This is ready to go. > Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch, SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14248: - Attachment: SOLR-14248.patch > Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch, SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032431#comment-17032431 ] Shalin Shekhar Mangar commented on SOLR-14248: -- This patch fixes all the problems except for #5. The way it fixes #3 is a hack but that's the best I could do without creating a builder class for DocCollection. I've left a todo comment in there to describe the hack and eventual fix. > Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
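Problem 3 above can be illustrated with a stripped-down sketch (not the real {{DocCollection}} API; the class and method names are hypothetical): derived views like the active-slice list are captured at construction time, so the slices must be fully built before the collection object is created:

```java
import java.util.ArrayList;
import java.util.List;

public class MockDocCollectionSketch {
    enum SliceState { ACTIVE, INACTIVE }

    static final class Slice {
        final String name;
        final SliceState state;
        Slice(String name, SliceState state) { this.name = name; this.state = state; }
    }

    // Like DocCollection, the active-slice view is derived once, in the
    // constructor; slices created after this point would never appear in it.
    // This is why a mock that builds the collection before its slices sees
    // an empty getActiveSlices() result.
    static final class MockDocCollection {
        private final List<Slice> activeSlices = new ArrayList<>();
        MockDocCollection(List<Slice> slices) {
            for (Slice s : slices) {
                if (s.state == SliceState.ACTIVE) activeSlices.add(s);
            }
        }
        List<Slice> getActiveSlices() { return activeSlices; }
    }
}
```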
[jira] [Updated] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
[ https://issues.apache.org/jira/browse/SOLR-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14248: - Attachment: SOLR-14248.patch > Improve ClusterStateMockUtil and make its methods public > > > Key: SOLR-14248 > URL: https://issues.apache.org/jira/browse/SOLR-14248 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14248.patch > > > While working on SOLR-13996, I had the need to mock the cluster state for > various configurations and I used ClusterStateMockUtil. > However, I ran into a few issues that needed to be fixed: > 1. The methods in this class are protected making it useful only within the > same package > 2. A null router was set for DocCollection objects > 3. The DocCollection object is created before the slices so the > DocCollection.getActiveSlices method returns empty list because the active > slices map is created inside the DocCollection constructor > 4. It did not set core name for the replicas it created > 5. It has no support for replica types so it only creates nrt replicas > I will use this Jira to fix these problems and make the methods in that class > public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14248) Improve ClusterStateMockUtil and make its methods public
Shalin Shekhar Mangar created SOLR-14248: Summary: Improve ClusterStateMockUtil and make its methods public Key: SOLR-14248 URL: https://issues.apache.org/jira/browse/SOLR-14248 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: Tests Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Fix For: master (9.0), 8.5 While working on SOLR-13996, I had the need to mock the cluster state for various configurations and I used ClusterStateMockUtil. However, I ran into a few issues that needed to be fixed: 1. The methods in this class are protected making it useful only within the same package 2. A null router was set for DocCollection objects 3. The DocCollection object is created before the slices so the DocCollection.getActiveSlices method returns empty list because the active slices map is created inside the DocCollection constructor 4. It did not set core name for the replicas it created 5. It has no support for replica types so it only creates nrt replicas I will use this Jira to fix these problems and make the methods in that class public (but marked as experimental) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025563#comment-17025563 ] Shalin Shekhar Mangar commented on SOLR-13897: -- Thanks [~jpountz] for fixing. I forgot that javadoc changes can cause precommit to fail. > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, > SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
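The hazard described in the quoted issue is a classic safe-publication problem. Here is a minimal, self-contained illustration (not the actual {{ZkShardTerms}} code; the committed fix may differ): writers swap in an immutable snapshot through an {{AtomicReference}}, which guarantees that lock-free readers always see a fully constructed object:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class ShardTermsSketch {
    // Immutable snapshot of per-replica terms; safe to share once published.
    static final class Terms {
        final Map<String, Long> values;
        Terms(Map<String, Long> values) {
            this.values = Collections.unmodifiableMap(new HashMap<>(values));
        }
    }

    // AtomicReference provides safe publication: a plain (non-volatile) field
    // written under a lock but read without one gives no visibility guarantee.
    private final AtomicReference<Terms> terms =
        new AtomicReference<>(new Terms(new HashMap<>()));

    // Lock-free read; always observes a complete Terms snapshot.
    Terms get() { return terms.get(); }

    // Writers publish a new immutable snapshot (a CAS loop stands in for the
    // write lock used in the real class).
    void increment(String replica) {
        terms.updateAndGet(old -> {
            Map<String, Long> copy = new HashMap<>(old.values);
            copy.merge(replica, 1L, Long::sum);
            return new Terms(copy);
        });
    }
}
```

The essential property is that the reference, not just the write, is what carries the happens-before edge to readers.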
[jira] [Commented] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces
[ https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025122#comment-17025122 ] Shalin Shekhar Mangar commented on SOLR-13996: -- I've been working on a refactoring of this method and it's my fault that I didn't see this issue and the PR earlier. However, my goals are a bit more ambitious. This first PR https://github.com/apache/lucene-solr/pull/1220 is just a re-organization of the code but I'll be expanding it further by adding tests for each individual case and then move on to improve performance. Currently this class is quite inefficient as it parses and re-parses and creates strings out of shard urls even for solr cloud cases. The goal is to eventually have a cloud focused class that is extremely efficient and avoids unnecessary copies of shards/replicas completely. This will require changes in other places as well e.g. the host checker can be made to operate in a streaming mode etc. I haven't quite decided on how the replica list transformer should be changed. I hope you don't mind Ishan but I'll assign this issue and take this forward. Reviews welcome! > Refactor HttpShardHandler#prepDistributed() into smaller pieces > --- > > Key: SOLR-13996 > URL: https://issues.apache.org/jira/browse/SOLR-13996 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: Shalin Shekhar Mangar >Priority: Major > Attachments: SOLR-13996.patch, SOLR-13996.patch > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, it is very hard to understand all the various things being done in > HttpShardHandler. I'm starting with refactoring the prepDistributed() method > to make it easier to grasp. It has standalone and cloud code intertwined, and > wanted to cleanly separate them out. Later, we can even have two separate > method (for standalone and cloud, each). 
[jira] [Assigned] (SOLR-13996) Refactor HttpShardHandler#prepDistributed() into smaller pieces
[ https://issues.apache.org/jira/browse/SOLR-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-13996: Assignee: Shalin Shekhar Mangar > Refactor HttpShardHandler#prepDistributed() into smaller pieces > --- > > Key: SOLR-13996 > URL: https://issues.apache.org/jira/browse/SOLR-13996 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya >Assignee: Shalin Shekhar Mangar >Priority: Major > Attachments: SOLR-13996.patch, SOLR-13996.patch > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, it is very hard to understand all the various things being done in > HttpShardHandler. I'm starting with refactoring the prepDistributed() method > to make it easier to grasp. It has standalone and cloud code intertwined, and > wanted to cleanly separate them out. Later, we can even have two separate > method (for standalone and cloud, each). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-13897. -- Fix Version/s: 8.5 Resolution: Fixed > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, > SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Status: Open (was: Patch Available) > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0) > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, > SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14210) Introduce Node-level status handler for replicas
[ https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022645#comment-17022645 ] Shalin Shekhar Mangar commented on SOLR-14210: -- Why not extend the same /admin/info/health that we have with another parameter? > Introduce Node-level status handler for replicas > > > Key: SOLR-14210 > URL: https://issues.apache.org/jira/browse/SOLR-14210 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: master (9.0), 8.5 >Reporter: Houston Putman >Priority: Major > > h2. Background > As was brought up in SOLR-13055, in order to run Solr in a more cloud-native > way, we need some additional features around node-level healthchecks. > {quote}Like in Kubernetes we need 'liveness' and 'readiness' probes > explained in > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/] > to determine if a node is live and ready to serve live traffic. > {quote} > > However there are issues around kubernetes managing its own rolling > restarts. With the current healthcheck setup, it's easy to envision a > scenario in which Solr reports itself as "healthy" when all of its replicas > are actually recovering. Therefore kubernetes, seeing a healthy pod, would > then go and restart the next Solr node. This can happen until all replicas > are "recovering" and none are healthy. (Maybe the last one restarted will be > "down", but still there are no "active" replicas.) > h2. Proposal > I propose we make an additional healthcheck handler that returns whether all > replicas hosted by that Solr node are healthy and "active". That way we will > be able to use the [default kubernetes rolling restart > logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies] > with Solr. 
> To add on to [Jan's point > here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559], > this handler should be more friendly to other Content-Types and should use > better HTTP response statuses.
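The readiness check the proposal asks for can be sketched as a simple aggregation over the replicas a node hosts: the node answers "ready" only when every one of them is active. This is an illustrative sketch only; the class and state names below are hypothetical and not part of the actual Solr handler API:

```java
import java.util.Map;

// Illustrative sketch -- names are hypothetical, not Solr's handler API.
// A node is safe to include in a Kubernetes rolling restart only when every
// replica it hosts is "active", which is the stricter per-replica check the
// proposal adds on top of the node-level liveness answer.
class NodeReadiness {
    // replicaStates: core name -> replica state ("active", "recovering", "down", ...)
    static boolean allReplicasActive(Map<String, String> replicaStates) {
        return replicaStates.values().stream().allMatch("active"::equals);
    }
}
```

With this shape, the scenario from the description (all replicas "recovering" while the node reports healthy) would correctly return not-ready and pause the rolling restart.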
[jira] [Resolved] (SOLR-14208) Reproducible test failure on TestBulkSchemaConcurrent
[ https://issues.apache.org/jira/browse/SOLR-14208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14208. -- Resolution: Duplicate Andrzej fixed this in SOLR-14211 > Reproducible test failure on TestBulkSchemaConcurrent > - > > Key: SOLR-14208 > URL: https://issues.apache.org/jira/browse/SOLR-14208 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Reporter: Shalin Shekhar Mangar >Priority: Major > > I found the following test failure on master branch while running tests on > SOLR-14207. The test failure is reproducible without the SOLR-14207 patch. > {code} > ant test -Dtestcase=TestBulkSchemaConcurrent -Dtests.method=test > -Dtests.seed=AE6DC9DB591DAB9E -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=hi-IN -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > {code} > The logs are full of the following warning repeated over and over: > {code} > [junit4] 2> 32396 WARN (qtp1791658098-110) [n:127.0.0.1:46453_rx_%2Fr > c:collection1 s:shard2 r:core_node8 x:collection1_shard2_replica_n5 ] > o.a.s.s.SchemaManager Unable to retrieve fresh managed schema managed-schema >[junit4] 2> => java.lang.IllegalArgumentException: Path must > start with / character >[junit4] 2> at > org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) >[junit4] 2> java.lang.IllegalArgumentException: Path must start with / > character >[junit4] 2> at > org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) > ~[zookeeper-3.5.5.jar:3.5.5] >[junit4] 2> at > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2000) > ~[zookeeper-3.5.5.jar:3.5.5] >[junit4] 2> at > org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:314) > ~[java/:?] >[junit4] 2> at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71) > ~[java/:?] 
>[junit4] 2> at > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:314) > ~[java/:?] >[junit4] 2> at > org.apache.solr.schema.SchemaManager.getFreshManagedSchema(SchemaManager.java:427) > ~[java/:?] >[junit4] 2> at > org.apache.solr.schema.SchemaManager.doOperations(SchemaManager.java:107) > ~[java/:?] >[junit4] 2> at > org.apache.solr.schema.SchemaManager.performOperations(SchemaManager.java:92) > ~[java/:?] >[junit4] 2> at > org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:90) > ~[java/:?] >[junit4] 2> at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:208) > ~[java/:?] >[junit4] 2> at > org.apache.solr.core.SolrCore.execute(SolrCore.java:2582) ~[java/:?] >[junit4] 2> at > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[java/:?] >[junit4] 2> at > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578) ~[java/:?] >[junit4] 2> at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419) > ~[java/:?] >[junit4] 2> at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351) > ~[java/:?] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
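The repeated warning in the log above comes from ZooKeeper's path validation rejecting a name without a leading slash: SchemaManager apparently passed a relative name ("managed-schema") where an absolute ZK path was expected. A minimal sketch of that first validation rule (not ZooKeeper's full PathUtils, which also checks for empty nodes, trailing slashes, and invalid characters):

```java
// Minimal sketch of the first rule ZooKeeper's PathUtils.validatePath enforces:
// every ZK path must be non-empty and start with '/'. Passing a bare core or
// schema name instead of an absolute path trips exactly this check.
class ZkPathCheck {
    static void validatePath(String path) {
        if (path == null || path.isEmpty() || path.charAt(0) != '/') {
            throw new IllegalArgumentException("Path must start with / character");
        }
    }
}
```

That matches the exception text in the test log, which is why the fix in SOLR-14211 centered on building the schema znode path correctly rather than on ZooKeeper itself.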
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Attachment: SOLR-13897.patch > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0) > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, > SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021920#comment-17021920 ] Shalin Shekhar Mangar commented on SOLR-13897: -- Updated patch so that it applies on master. > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0) > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch, > SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested
[ https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14172. -- Resolution: Fixed Thanks Andras for the PR and Kevin for his review! > Collection metadata remains in zookeeper if too many shards requested > - > > Key: SOLR-14172 > URL: https://issues.apache.org/jira/browse/SOLR-14172 > Project: Solr > Issue Type: Bug >Affects Versions: 8.3.1 >Reporter: Andras Salamon >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14172.patch, SOLR-14172.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When I try to create a collection and request too many shards, collection > creation fails with the expected error message: > {noformat} > $ curl -i --retry 5 -s -L -k --negotiate -u : > 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1' > HTTP/1.1 400 Bad Request > Content-Type: application/json;charset=utf-8 > Content-Length: 1562 > { > "responseHeader":{ > "status":400, > "QTime":122}, > "Operation create caused > exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, > and the number of nodes currently live or live and part of your createNodeSet > is 3. This allows a maximum of 3 to be created. Value of numShards is 4, > value of nrtReplicas is 1, value of tlogReplicas is 0 and value of > pullReplicas is 0. This requires 4 shards to be created (higher than the > allowed number)", > "exception":{ > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. 
This requires 4 shards to be created > (higher than the allowed number)", > "rspCode":400}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","org.apache.solr.common.SolrException"], > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. This requires 4 shards to be created > (higher than the allowed number)", > "code":400}} > {noformat} > Although the collection creation was not successful if I list the collections > it shows the new collection: > {noformat} > $ solr collection --list > TooManyShardstest1 (1) > {noformat} > Looks like metadata remains in Zookeeper: > {noformat} > $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr > -cmd ls /collections > INFO - 2020-01-06 04:54:01.851; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > INFO - 2020-01-06 04:54:01.880; > org.apache.solr.common.cloud.ConnectionManager; zkClient has connected > INFO - 2020-01-06 04:54:01.881; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > /collections (1) > /collections/TooManyShardstest1 (1) > DATA: > {"configName":"zk_init_too"} > /collections/TooManyShardstest1/state.json (0) > DATA: > {"TooManyShardstest1":{ > "pullReplicas":"0", > "replicationFactor":"1", > "router":{"name":"compositeId"}, > "maxShardsPerNode":"1", > "autoAddReplicas":"false", > "nrtReplicas":"1", > "tlogReplicas":"0", > "shards":{ > "shard1":{ > "range":"8000-bfff", > "state":"active", > "replicas":{}}, > "shard2":{ > "range":"c000-", > "state":"active", > "replicas":{}}, > "shard3":{ > "range":"0-3fff", > "state":"active", > "replicas":{}}, > "shard4":{ > "range":"4000-7fff", > 
"state":"active", > "replicas":{} > {noformat} >
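The capacity arithmetic in the error message quoted above is straightforward to restate: the cluster allows liveNodes × maxShardsPerNode cores, while the request demands numShards × (nrtReplicas + tlogReplicas + pullReplicas). A sketch of that check, using the numbers from the report (3 nodes × 1 = 3 allowed, but 4 × (1 + 0 + 0) = 4 requested):

```java
// Sketch of the capacity check behind the quoted error message. With 3 live
// nodes and maxShardsPerNode=1 the cluster allows 3 cores, while numShards=4
// with nrtReplicas=1 demands 4 cores, so creation must be rejected.
class ShardCapacity {
    static boolean fits(int liveNodes, int maxShardsPerNode,
                        int numShards, int nrt, int tlog, int pull) {
        int allowed = liveNodes * maxShardsPerNode;
        int requested = numShards * (nrt + tlog + pull);
        return requested <= allowed;
    }
}
```

The bug is that this rejection happened only after the collection znodes had already been written, which is why the metadata lingered in ZooKeeper.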
[jira] [Commented] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested
[ https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021890#comment-17021890 ] Shalin Shekhar Mangar commented on SOLR-14172: -- I attached a new patch which adds a failure message in case the collection creation request is successful. > Collection metadata remains in zookeeper if too many shards requested > - > > Key: SOLR-14172 > URL: https://issues.apache.org/jira/browse/SOLR-14172 > Project: Solr > Issue Type: Bug >Affects Versions: 8.3.1 >Reporter: Andras Salamon >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14172.patch, SOLR-14172.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When I try to create a collection and request too many shards, collection > creation fails with the expected error message: > {noformat} > $ curl -i --retry 5 -s -L -k --negotiate -u : > 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1' > HTTP/1.1 400 Bad Request > Content-Type: application/json;charset=utf-8 > Content-Length: 1562 > { > "responseHeader":{ > "status":400, > "QTime":122}, > "Operation create caused > exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, > and the number of nodes currently live or live and part of your createNodeSet > is 3. This allows a maximum of 3 to be created. Value of numShards is 4, > value of nrtReplicas is 1, value of tlogReplicas is 0 and value of > pullReplicas is 0. This requires 4 shards to be created (higher than the > allowed number)", > "exception":{ > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
> Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. This requires 4 shards to be created > (higher than the allowed number)", > "rspCode":400}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","org.apache.solr.common.SolrException"], > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. This requires 4 shards to be created > (higher than the allowed number)", > "code":400}} > {noformat} > Although the collection creation was not successful if I list the collections > it shows the new collection: > {noformat} > $ solr collection --list > TooManyShardstest1 (1) > {noformat} > Looks like metadata remains in Zookeeper: > {noformat} > $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr > -cmd ls /collections > INFO - 2020-01-06 04:54:01.851; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > INFO - 2020-01-06 04:54:01.880; > org.apache.solr.common.cloud.ConnectionManager; zkClient has connected > INFO - 2020-01-06 04:54:01.881; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > /collections (1) > /collections/TooManyShardstest1 (1) > DATA: > {"configName":"zk_init_too"} > /collections/TooManyShardstest1/state.json (0) > DATA: > {"TooManyShardstest1":{ > "pullReplicas":"0", > "replicationFactor":"1", > "router":{"name":"compositeId"}, > "maxShardsPerNode":"1", > "autoAddReplicas":"false", > "nrtReplicas":"1", > "tlogReplicas":"0", > "shards":{ > "shard1":{ > "range":"8000-bfff", > "state":"active", > "replicas":{}}, > "shard2":{ > "range":"c000-", > "state":"active", > 
"replicas":{}}, > "shard3":{ > "range":"0-3fff", > "state":"active", > "replicas":{}}, > "shard4":{ > "range":"4000-7fff", > "state":"active", > "replicas":{} > {noformat} >
[jira] [Updated] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested
[ https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14172: - Attachment: SOLR-14172.patch > Collection metadata remains in zookeeper if too many shards requested > - > > Key: SOLR-14172 > URL: https://issues.apache.org/jira/browse/SOLR-14172 > Project: Solr > Issue Type: Bug >Affects Versions: 8.3.1 >Reporter: Andras Salamon >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14172.patch, SOLR-14172.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When I try to create a collection and request too many shards, collection > creation fails with the expected error message: > {noformat} > $ curl -i --retry 5 -s -L -k --negotiate -u : > 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1' > HTTP/1.1 400 Bad Request > Content-Type: application/json;charset=utf-8 > Content-Length: 1562 > { > "responseHeader":{ > "status":400, > "QTime":122}, > "Operation create caused > exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, > and the number of nodes currently live or live and part of your createNodeSet > is 3. This allows a maximum of 3 to be created. Value of numShards is 4, > value of nrtReplicas is 1, value of tlogReplicas is 0 and value of > pullReplicas is 0. This requires 4 shards to be created (higher than the > allowed number)", > "exception":{ > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. 
This requires 4 shards to be created > (higher than the allowed number)", > "rspCode":400}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","org.apache.solr.common.SolrException"], > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. This requires 4 shards to be created > (higher than the allowed number)", > "code":400}} > {noformat} > Although the collection creation was not successful if I list the collections > it shows the new collection: > {noformat} > $ solr collection --list > TooManyShardstest1 (1) > {noformat} > Looks like metadata remains in Zookeeper: > {noformat} > $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr > -cmd ls /collections > INFO - 2020-01-06 04:54:01.851; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > INFO - 2020-01-06 04:54:01.880; > org.apache.solr.common.cloud.ConnectionManager; zkClient has connected > INFO - 2020-01-06 04:54:01.881; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > /collections (1) > /collections/TooManyShardstest1 (1) > DATA: > {"configName":"zk_init_too"} > /collections/TooManyShardstest1/state.json (0) > DATA: > {"TooManyShardstest1":{ > "pullReplicas":"0", > "replicationFactor":"1", > "router":{"name":"compositeId"}, > "maxShardsPerNode":"1", > "autoAddReplicas":"false", > "nrtReplicas":"1", > "tlogReplicas":"0", > "shards":{ > "shard1":{ > "range":"8000-bfff", > "state":"active", > "replicas":{}}, > "shard2":{ > "range":"c000-", > "state":"active", > "replicas":{}}, > "shard3":{ > "range":"0-3fff", > "state":"active", > "replicas":{}}, > "shard4":{ > "range":"4000-7fff", > 
"state":"active", > "replicas":{} > {noformat} >
[jira] [Updated] (SOLR-14172) Collection metadata remains in zookeeper if too many shards requested
[ https://issues.apache.org/jira/browse/SOLR-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14172: - Attachment: SOLR-14172.patch Fix Version/s: 8.5 master (9.0) Assignee: Shalin Shekhar Mangar Status: Open (was: Open) This patch incorporates the test added by Andras Salamon in PR #1152 but the actual fix is slightly different. This patch changes the buildReplicaPositions method to throw an AssignmentException instead of SolrException in case the maxShardsPerNode is insufficient. It also changes the Create Collection API to return a BAD_REQUEST code instead of SERVER_ERROR in case of assignment exception. I'll note this behavior change in the upgrade notes. > Collection metadata remains in zookeeper if too many shards requested > - > > Key: SOLR-14172 > URL: https://issues.apache.org/jira/browse/SOLR-14172 > Project: Solr > Issue Type: Bug >Affects Versions: 8.3.1 >Reporter: Andras Salamon >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14172.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When I try to create a collection and request too many shards, collection > creation fails with the expected error message: > {noformat} > $ curl -i --retry 5 -s -L -k --negotiate -u : > 'http://asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:8983/solr/admin/collections?action=CREATE=TooManyShardstest1=4=zk_init_too=1' > HTTP/1.1 400 Bad Request > Content-Type: application/json;charset=utf-8 > Content-Length: 1562 > { > "responseHeader":{ > "status":400, > "QTime":122}, > "Operation create caused > exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > Cannot create collection TooManyShardstest1. Value of maxShardsPerNode is 1, > and the number of nodes currently live or live and part of your createNodeSet > is 3. This allows a maximum of 3 to be created. 
Value of numShards is 4, > value of nrtReplicas is 1, value of tlogReplicas is 0 and value of > pullReplicas is 0. This requires 4 shards to be created (higher than the > allowed number)", > "exception":{ > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. This requires 4 shards to be created > (higher than the allowed number)", > "rspCode":400}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","org.apache.solr.common.SolrException"], > "msg":"Cannot create collection TooManyShardstest1. Value of > maxShardsPerNode is 1, and the number of nodes currently live or live and > part of your createNodeSet is 3. This allows a maximum of 3 to be created. > Value of numShards is 4, value of nrtReplicas is 1, value of tlogReplicas is > 0 and value of pullReplicas is 0. 
This requires 4 shards to be created > (higher than the allowed number)", > "code":400}} > {noformat} > Although the collection creation was not successful if I list the collections > it shows the new collection: > {noformat} > $ solr collection --list > TooManyShardstest1 (1) > {noformat} > Looks like metadata remains in Zookeeper: > {noformat} > $ zkcli.sh -zkhost asalamon-cdpd-rebase831-a-1.vpc.cloudera.com:2181/solr > -cmd ls /collections > INFO - 2020-01-06 04:54:01.851; > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect > to ZooKeeper > INFO - 2020-01-06 04:54:01.880; > org.apache.solr.common.cloud.ConnectionManager; zkClient has connected > INFO - 2020-01-06 04:54:01.881; > org.apache.solr.common.cloud.ConnectionManager; Client is connected to > ZooKeeper > /collections (1) > /collections/TooManyShardstest1 (1) > DATA: > {"configName":"zk_init_too"} > /collections/TooManyShardstest1/state.json (0) > DATA: > {"TooManyShardstest1":{ > "pullReplicas":"0", > "replicationFactor":"1", > "router":{"name":"compositeId"}, > "maxShardsPerNode":"1", > "autoAddReplicas":"false", > "nrtReplicas":"1", > "tlogReplicas":"0", > "shards":{ > "shard1":{ > "range":"8000-bfff", > "state":"active", > "replicas":{}}, > "shard2":{ > "range":"c000-", > "state":"active", > "replicas":{}}, >
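The patch description above says assignment failures now surface as a distinct exception type that the Create Collection API maps to BAD_REQUEST instead of SERVER_ERROR, since asking for more shards than the cluster allows is a caller mistake, not a server fault. A hedged sketch of that mapping; the class and method names here are illustrative, not Solr's exact classes:

```java
// Hedged sketch of the behavior change -- names are illustrative, not Solr's
// actual exception hierarchy. Distinguishing the assignment failure by type
// lets the API layer report 400 (client error) instead of 500.
class ErrorMapping {
    static class AssignmentException extends RuntimeException {
        AssignmentException(String msg) { super(msg); }
    }

    static int statusFor(RuntimeException e) {
        if (e instanceof AssignmentException) {
            return 400; // BAD_REQUEST: caller requested more shards than allowed
        }
        return 500; // SERVER_ERROR for genuinely unexpected failures
    }
}
```

Throwing the specific type from the position-building step also lets the caller clean up the partially written collection state before answering, which is the actual fix for the lingering ZooKeeper metadata.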
[jira] [Resolved] (SOLR-14207) Fix logging statements with less or more arguments than placeholders
[ https://issues.apache.org/jira/browse/SOLR-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14207. -- Resolution: Fixed > Fix logging statements with less or more arguments than placeholders > > > Key: SOLR-14207 > URL: https://issues.apache.org/jira/browse/SOLR-14207 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: logging >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14207.patch > > > I found bad logging statements in the solr-exporter which had a different > number of arguments than placeholders. Ran an inspection check in IDEA and > found many more places with similar problems.
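The bug class this issue fixes: an SLF4J call such as log.warn("found {} cores on {}", count) silently leaves the second placeholder unfilled (and an extra argument is silently dropped). A small sketch of the check the IDE inspection performs, counting "{}" placeholders against supplied arguments:

```java
// Sketch of the IDE inspection that caught these bugs: count the "{}"
// placeholders in an SLF4J-style format string and compare against the
// number of arguments actually supplied to the logging call.
class PlaceholderCheck {
    static int countPlaceholders(String format) {
        int count = 0;
        for (int i = 0; i + 1 < format.length(); i++) {
            if (format.charAt(i) == '{' && format.charAt(i + 1) == '}') {
                count++;
                i++; // skip past this placeholder pair
            }
        }
        return count;
    }

    static boolean matches(String format, Object... args) {
        return countPlaceholders(format) == args.length;
    }
}
```

(SLF4J also honors escaped placeholders like \{} and treats a trailing Throwable argument specially; the real inspection accounts for those cases, which this sketch omits.)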
[jira] [Created] (SOLR-14208) Reproducible test failure on TestBulkSchemaConcurrent
Shalin Shekhar Mangar created SOLR-14208: Summary: Reproducible test failure on TestBulkSchemaConcurrent Key: SOLR-14208 URL: https://issues.apache.org/jira/browse/SOLR-14208 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: Tests Reporter: Shalin Shekhar Mangar I found the following test failure on master branch while running tests on SOLR-14207. The test failure is reproducible without the SOLR-14207 patch. {code} ant test -Dtestcase=TestBulkSchemaConcurrent -Dtests.method=test -Dtests.seed=AE6DC9DB591DAB9E -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=hi-IN -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true -Dtests.file.encoding=UTF-8 {code} The logs are full of the following warning repeated over and over: {code} [junit4] 2> 32396 WARN (qtp1791658098-110) [n:127.0.0.1:46453_rx_%2Fr c:collection1 s:shard2 r:core_node8 x:collection1_shard2_replica_n5 ] o.a.s.s.SchemaManager Unable to retrieve fresh managed schema managed-schema [junit4] 2> => java.lang.IllegalArgumentException: Path must start with / character [junit4] 2>at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) [junit4] 2> java.lang.IllegalArgumentException: Path must start with / character [junit4] 2>at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:51) ~[zookeeper-3.5.5.jar:3.5.5] [junit4] 2>at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2000) ~[zookeeper-3.5.5.jar:3.5.5] [junit4] 2>at org.apache.solr.common.cloud.SolrZkClient.lambda$exists$3(SolrZkClient.java:314) ~[java/:?] [junit4] 2>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71) ~[java/:?] [junit4] 2>at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:314) ~[java/:?] [junit4] 2>at org.apache.solr.schema.SchemaManager.getFreshManagedSchema(SchemaManager.java:427) ~[java/:?] [junit4] 2>at org.apache.solr.schema.SchemaManager.doOperations(SchemaManager.java:107) ~[java/:?] 
[junit4] 2>at org.apache.solr.schema.SchemaManager.performOperations(SchemaManager.java:92) ~[java/:?] [junit4] 2>at org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:90) ~[java/:?] [junit4] 2>at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:208) ~[java/:?] [junit4] 2>at org.apache.solr.core.SolrCore.execute(SolrCore.java:2582) ~[java/:?] [junit4] 2>at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[java/:?] [junit4] 2>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578) ~[java/:?] [junit4] 2>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419) ~[java/:?] [junit4] 2>at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351) ~[java/:?] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14207) Fix logging statements with less or more arguments than placeholders
[ https://issues.apache.org/jira/browse/SOLR-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-14207: - Attachment: SOLR-14207.patch > Fix logging statements with less or more arguments than placeholders > > > Key: SOLR-14207 > URL: https://issues.apache.org/jira/browse/SOLR-14207 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: logging >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14207.patch > > > I found bad logging statements in the solr-exporter which had different > number of arguments than placeholders. Ran an inspection check in Idea and > found many more places with similar problems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14207) Fix logging statements with less or more arguments than placeholders
Shalin Shekhar Mangar created SOLR-14207: Summary: Fix logging statements with less or more arguments than placeholders Key: SOLR-14207 URL: https://issues.apache.org/jira/browse/SOLR-14207 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: logging Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Fix For: master (9.0), 8.5 I found bad logging statements in the solr-exporter which had a different number of arguments than placeholders. Ran an inspection check in IDEA and found many more places with similar problems.
[jira] [Resolved] (SOLR-14191) Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456
[ https://issues.apache.org/jira/browse/SOLR-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-14191. -- Resolution: Invalid Never mind, false alarm. I did in fact incorporate fixes made by SOLR-11456 in SOLR-11126 when I committed the code. It's just that the last patch attached on SOLR-11126 did not have those fixes. Sorry for the noise. > Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by > SOLR-11456 > - > > Key: SOLR-14191 > URL: https://issues.apache.org/jira/browse/SOLR-14191 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Affects Versions: 8.0 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: 8.5 > > > Chasing a test failure while backporting SOLR-11126 to branch 7x, I > discovered that all the test failures fixed by Hoss in SOLR-11456 were lost > when SOLR-11126 was committed even though Hoss had commented on the issue to > remind us about them. > This issue will restore those lost fixes on the master and 8x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14191) Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456
Shalin Shekhar Mangar created SOLR-14191: Summary: Restore fixes for HealthCheckHandlerTest.testHealthCheckHandler() made by SOLR-11456 Key: SOLR-14191 URL: https://issues.apache.org/jira/browse/SOLR-14191 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: Tests Affects Versions: 8.0 Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Fix For: 8.5 Chasing a test failure while backporting SOLR-11126 to branch 7x, I discovered that all the test failures fixed by Hoss in SOLR-11456 were lost when SOLR-11126 was committed even though Hoss had commented on the issue to remind us about them. This issue will restore those lost fixes on the master and 8x branches.
[jira] [Assigned] (SOLR-13845) DELETEREPLICA API by "count" and "type"
[ https://issues.apache.org/jira/browse/SOLR-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-13845: Assignee: Shalin Shekhar Mangar > DELETEREPLICA API by "count" and "type" > --- > > Key: SOLR-13845 > URL: https://issues.apache.org/jira/browse/SOLR-13845 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Amrit Sarkar >Assignee: Shalin Shekhar Mangar >Priority: Major > Attachments: SOLR-13845.patch > > > SOLR-9319 added support for deleting replicas by count. It would be great to > extend this feature so the type of replica to delete can also be specified, > just as we can add replicas by count and type.
[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests
[ https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987539#comment-16987539 ] Shalin Shekhar Mangar commented on SOLR-13979: -- Yes, I have used that method in the past but this is a common use-case and it should not be necessary to resort to such clever solutions. > Expose separate metrics for distributed and non-distributed requests > > > Key: SOLR-13979 > URL: https://issues.apache.org/jira/browse/SOLR-13979 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: metrics >Reporter: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.4 > > > Currently we expose metrics such as count, rate and latency on a per handler > level however for search requests there is no distinction made for distrib vs > non-distrib requests. This means that there is no way to find the count, rate > or latency of only user-sent queries. > I propose that we expose distrib vs non-distrib metrics separately.
[jira] [Assigned] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-13897: Assignee: Shalin Shekhar Mangar > Unsafe publication of Terms object in ZkShardTerms > -- > > Key: SOLR-13897 > URL: https://issues.apache.org/jira/browse/SOLR-13897 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Affects Versions: 8.2, 8.3 >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.4 > > Attachments: SOLR-13897.patch, SOLR-13897.patch, SOLR-13897.patch > > > The Terms object in ZkShardTerms is written using a write lock but reading is > allowed freely. This is not safe and can cause visibility issues and > associated race conditions under contention.
[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985479#comment-16985479 ] Shalin Shekhar Mangar commented on SOLR-13897: -- This patch adds registerTerm inside ZkCollectionTerms so that it is called after synchronizing on the same terms object as that used for the remove. I couldn't quite make out a condition where both could happen concurrently but it makes me sleep better knowing that they absolutely cannot.
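The fix described in the comment above relies on a simple invariant: registration and removal of per-shard terms must contend on the same lock. A hypothetical sketch of that shape (class and method names are invented for illustration, this is not the actual ZkCollectionTerms code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: register and remove synchronize on the same map, so a
// register can never interleave with a concurrent remove of the same shard.
class CollectionTerms {
    private final Map<String, ShardTerms> terms = new HashMap<>();

    ShardTerms register(String shard, String core) {
        synchronized (terms) {
            ShardTerms t = terms.computeIfAbsent(shard, s -> new ShardTerms());
            t.addListener(core); // runs under the same lock as remove()
            return t;
        }
    }

    void remove(String shard, String core) {
        synchronized (terms) {
            ShardTerms t = terms.get(shard);
            if (t != null && t.removeListener(core)) {
                terms.remove(shard); // last listener gone: drop the shard entry
            }
        }
    }

    boolean hasShard(String shard) {
        synchronized (terms) {
            return terms.containsKey(shard);
        }
    }
}

class ShardTerms {
    private final Set<String> listeners = new HashSet<>();

    void addListener(String core) {
        listeners.add(core);
    }

    /** Removes the listener; returns true if no listeners remain. */
    boolean removeListener(String core) {
        listeners.remove(core);
        return listeners.isEmpty();
    }
}
```

With both paths funneled through one monitor, the "could they run concurrently?" question from the comment becomes moot by construction.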
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Status: Patch Available (was: Open)
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Attachment: SOLR-13897.patch
[jira] [Created] (SOLR-13989) Move all hadoop related code to a contrib module
Shalin Shekhar Mangar created SOLR-13989: Summary: Move all hadoop related code to a contrib module Key: SOLR-13989 URL: https://issues.apache.org/jira/browse/SOLR-13989 Project: Solr Issue Type: Task Security Level: Public (Default Security Level. Issues are Public) Components: Hadoop Integration Reporter: Shalin Shekhar Mangar Fix For: master (9.0) Spin off from SOLR-13986: {quote} It seems really important to move or remove this hadoop shit out of the solr core: It is really unreasonable that solr core depends on hadoop. that's gonna simply block any progress improving its security, because solr code will get dragged down by hadoop's code. {quote} We should move all hadoop related dependencies to a separate contrib module.
[jira] [Commented] (SOLR-13986) remove "execute" permission from solr-tests.policy
[ https://issues.apache.org/jira/browse/SOLR-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985477#comment-16985477 ] Shalin Shekhar Mangar commented on SOLR-13986: -- bq. Unrelated to these specific problems, It seems really important to move or remove this hadoop shit out of the solr core: It is really unreasonable that solr core depends on hadoop. that's gonna simply block any progress improving its security, because solr code will get dragged down by hadoop's code. I agree that hadoop specific code should live in a contrib. I'll open an issue to do that. > remove "execute" permission from solr-tests.policy > -- > > Key: SOLR-13986 > URL: https://issues.apache.org/jira/browse/SOLR-13986 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Robert Muir >Priority: Major > Attachments: SOLR-13986-notyet.patch, SOLR-13986.patch, > SOLR-13986.patch > > > If we don't really need to execute processes, we can take the permission > away. That way any attempt to execute something results in a > SecurityException rather than running a process. > It is necessary to first fix the tests policy before thinking about > supporting securitymanager in solr. This way we can ensure functionality does > not break via our tests.
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Attachment: SOLR-13897.patch
[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985459#comment-16985459 ] Shalin Shekhar Mangar commented on SOLR-13897: -- The onTermUpdates might receive updates out of order (i.e. monotonic term versions are not guaranteed inside onTermUpdates) but it is not a problem in the default RecoveringCoreTermWatcher implementation because it tracks the last term that triggered recovery and returns if it is greater (or equal) to the current term. This patch adds javadocs to the CoreTermWatcher interface and calls out the behavior of these invocations.
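The guard described above for RecoveringCoreTermWatcher can be sketched as follows (illustrative class, not the actual Solr implementation): recovery fires only for a strictly newer term, so out-of-order callbacks are harmless.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a monotonic guard: term update callbacks may arrive
// out of order, but recovery is only triggered for a strictly newer term.
class RecoveringWatcher {
    private final AtomicLong lastTermDoneRecovery = new AtomicLong(-1);

    /** Returns true if this update should trigger recovery. */
    boolean onTermUpdate(long term) {
        while (true) {
            long last = lastTermDoneRecovery.get();
            if (term <= last) {
                return false; // stale or duplicate update: ignore it
            }
            if (lastTermDoneRecovery.compareAndSet(last, term)) {
                return true; // strictly newer term: start recovery
            }
            // CAS lost a race with a concurrent update; re-read and retry.
        }
    }
}
```

The compare-and-set loop means two concurrent callbacks for the same term cannot both trigger recovery, which is the property the comment relies on.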
[jira] [Commented] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985256#comment-16985256 ] Shalin Shekhar Mangar commented on SOLR-13897: -- Here's a patch that changes the Terms to an AtomicReference. However, I am not convinced that it is still correct. It seems there can be race conditions between registerTerm and removeTerm, and onTermUpdates might receive updates out of order (i.e. monotonic term versions are not guaranteed inside onTermUpdates).
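The AtomicReference approach mentioned above can be sketched as a copy-on-write holder (this mirrors the idea, with invented names, not the actual patch): writers install a fresh immutable snapshot, so lock-free readers always observe a fully constructed map instead of a partially published one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of safe publication via AtomicReference: readers take
// no lock, writers swap in a new snapshot with compare-and-set.
class ShardTermsHolder {
    private final AtomicReference<Map<String, Long>> terms =
            new AtomicReference<>(new HashMap<>());

    // Reader: get() returns a safely published, never-mutated snapshot.
    Long getTerm(String core) {
        return terms.get().get(core);
    }

    // Writer: copy-on-write plus CAS, retried if another writer raced us.
    void setTerm(String core, long term) {
        while (true) {
            Map<String, Long> old = terms.get();
            Map<String, Long> next = new HashMap<>(old);
            next.put(core, term);
            if (terms.compareAndSet(old, next)) {
                return;
            }
        }
    }
}
```

The snapshot is never mutated after publication, which is what makes the unlocked read path safe; it does not by itself order the register/remove callbacks, which is the residual concern raised in the comment.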
[jira] [Updated] (SOLR-13897) Unsafe publication of Terms object in ZkShardTerms
[ https://issues.apache.org/jira/browse/SOLR-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-13897: - Attachment: SOLR-13897.patch
[jira] [Resolved] (SOLR-13805) Solr - generates an NPE when calling /solr/admin/health on standalone solr
[ https://issues.apache.org/jira/browse/SOLR-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-13805. -- Fix Version/s: 8.4 master (9.0) Resolution: Fixed Thanks Nicholas! > Solr - generates an NPE when calling /solr/admin/health on standalone solr > -- > > Key: SOLR-13805 > URL: https://issues.apache.org/jira/browse/SOLR-13805 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud, SolrJ >Affects Versions: 8.1, 8.2 >Reporter: Nicholas DiPiazza >Assignee: Shalin Shekhar Mangar >Priority: Major > Fix For: master (9.0), 8.4 > > > steps to reproduce: > unzip solr and run > {code} > ./bin/solr start > {code} > Then nav to: > http://localhost:8983/solr/admin/health > Result will be an NPE: > {code} > { > "responseHeader":{ > "status":500, > "QTime":20}, > "error":{ > "trace":"java.lang.NullPointerException\n\tat > org.apache.solr.handler.admin.HealthCheckHandler.handleRequestBody(HealthCheckHandler.java:68)\n\tat > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat > > org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat > org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat > > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat > > 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat > > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat > > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat > org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat > > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat > org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat > 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat > > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat > java.lang.Thread.run(Thread.java:748)\n", "code":500}} > {code}
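The failure mode above is a cloud-only code path being reached in standalone mode. A hypothetical sketch of the guard, with the handler reduced to booleans (the real HealthCheckHandler inspects CoreContainer and ZooKeeper state; these names and messages are illustrative):

```java
// Illustrative sketch: answer standalone requests with a clean error instead
// of dereferencing ZooKeeper state that only exists in SolrCloud mode.
class HealthCheck {
    static String handle(boolean zookeeperAware, boolean connected) {
        if (!zookeeperAware) {
            // Standalone Solr: previously this case fell through to cloud-only
            // code and threw a NullPointerException.
            return "error: health check requires SolrCloud mode";
        }
        return connected ? "OK" : "error: not connected to zookeeper";
    }
}
```

The point is simply that the standalone case is decided before any cloud state is touched, turning a 500 with a stack trace into a deliberate response.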
[jira] [Created] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests
Shalin Shekhar Mangar created SOLR-13979: Summary: Expose separate metrics for distributed and non-distributed requests Key: SOLR-13979 URL: https://issues.apache.org/jira/browse/SOLR-13979 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: metrics Reporter: Shalin Shekhar Mangar Fix For: master (9.0), 8.4 Currently we expose metrics such as count, rate and latency on a per-handler level; however, for search requests no distinction is made between distrib and non-distrib requests. This means that there is no way to find the count, rate or latency of only user-sent queries. I propose that we expose distrib vs non-distrib metrics separately.
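The proposal can be sketched with two counters per handler, keyed by whether the request is an internal distributed sub-request (in Solr this is signalled by request parameters such as distrib; here it is reduced to a boolean, and the class names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: keep separate counts for user-facing requests and
// internal distributed sub-requests, so user query rate can be read directly.
class RequestMetrics {
    private final Map<String, Long> counts = new HashMap<>();

    void record(String handler, boolean distrib) {
        String key = handler + (distrib ? ".distrib" : ".local");
        counts.merge(key, 1L, Long::sum);
    }

    long count(String handler, boolean distrib) {
        String key = handler + (distrib ? ".distrib" : ".local");
        return counts.getOrDefault(key, 0L);
    }
}
```

A real implementation would split timers and rates the same way; the key point is that routing happens at record time, so no post-hoc subtraction of shard traffic is needed.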
[jira] [Comment Edited] (SOLR-13945) SPLITSHARD data loss due to "rollback"
[ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16981287#comment-16981287 ] Shalin Shekhar Mangar edited comment on SOLR-13945 at 11/25/19 4:29 AM: [~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. -It is not necessary if there is a single replica.- (note it is necessary to call this commit regardless of the replication factor) was (Author: shalinmangar): [~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. It is not necessary if there is a single replica. > SPLITSHARD data loss due to "rollback" > -- > > Key: SOLR-13945 > URL: https://issues.apache.org/jira/browse/SOLR-13945 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Ishan Chattopadhyaya >Priority: Major > Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch > > > # As per SOLR-7673, there is a commit on the parent shard *after state > changes* have happened, i.e. from active/construction/construction to > inactive/active/active. Please see > https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588 > # Due to SOLR-12509, there's now a cleanup/rollback method called > "cleanupAfterFailure" in the finally block that resets the state to > active/construction/construction. Please see: > https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657 > # When 2 is entered into due to a failure in 1, we have a situation where any > documents that went into the subshards (because they are already active by > now) are now lost after the parent becomes active. 
> If my above understanding is correct, I am wondering: > # Why is a commit to parent shard needed *after* the parent shard is > inactive, subshards are now active and the split operation has completed? > # This rollback looks very suspicious. If state of subshards is already > active and parent is inactive, then what is the need for setting them back to > construction? Seems like a crucial check is missing there. Also, why do we > reset the subshard status back to construction instead of inactive? It is > extremely misleading (and, frankly, ridiculous) for any external clusterstate > monitoring tools to see the subshards to go from CONSTRUCTION to ACTIVE to > CONSTRUCTION and then the subshard disappearing.