[ https://issues.apache.org/jira/browse/SOLR-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628739#comment-16628739 ]
Mano Kovacs commented on SOLR-12708:
------------------------------------

Hello [~varunthacker], thank you for the review!

bq. I'm curious about the 10 minute latch countdown timeout. Shouldn't we wait forever here?

If we waited forever, any downstream command that got stuck or never returned a result would keep this job hanging as well, so I would worry about robustness. This part of the code creates a bunch of empty cores (one per shard) in parallel. On a larger cluster with 200-300 shards, this might take longer than 10 minutes if the overseer queue is already behind, so 10 minutes might in fact be problematic. However, if the Overseer falls behind by much more than that, it would seriously hurt the stability of the cluster anyway. If you agree, I will increase this wait to one hour, which leaves plenty of time for the Overseer to process the core creation on a relatively large collection but still ensures that the job gets cancelled if one task gets stuck.

bq. So here we're doing something different wrt success and failure. If the add replica call has a failure we're adding it back to the main response but if it's a success then we will end up skipping it (at this point results.get("success") will always be null).

I have to be honest and admit that I copied the full block from {{CreateShardCmd.java}}. I think the code is doing the right thing there. In both branches of the {{if}}, the code checks whether the main {{results}} already has a success/failure node and creates one if necessary, then adds the corresponding {{addResult}} field to the main one. The only difference is that the failure is retrieved before the {{if}} block.

bq. Can't we do this instead which will append the results directly to the main object? We do this for the remaining add replicas as the last step of the restore

Then we might let the downstream call override other fields that are already populated. I think isolation makes it less error-prone.
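To illustrate the isolation argument, here is a minimal sketch of the merge pattern described above. It is not the actual {{CreateShardCmd}} code: it assumes plain {{java.util.Map}} objects as stand-ins for Solr's {{NamedList}} (which, unlike a map, allows repeated names), and the class, method, and core names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: Map stands in for Solr's NamedList.
class ResultMergeSketch {

    /** Merges the isolated result of one addReplica call into the main results. */
    @SuppressWarnings("unchecked")
    static void mergeInto(Map<String, Object> results, Map<String, Object> addResult) {
        // The failure node is looked up before the branch.
        Object addResultFailure = addResult.get("failure");
        if (addResultFailure != null) {
            // Create the main "failure" node on first use, then append to it.
            Map<String, Object> failure = (Map<String, Object>) results.get("failure");
            if (failure == null) {
                failure = new LinkedHashMap<>();
                results.put("failure", failure);
            }
            failure.putAll((Map<String, Object>) addResultFailure);
        } else {
            Object addResultSuccess = addResult.get("success");
            if (addResultSuccess == null) {
                return; // nothing to merge
            }
            // Create the main "success" node on first use, then append to it.
            Map<String, Object> success = (Map<String, Object>) results.get("success");
            if (success == null) {
                success = new LinkedHashMap<>();
                results.put("success", success);
            }
            success.putAll((Map<String, Object>) addResultSuccess);
        }
    }

    /** Builds an isolated result of the form {node: {key: value}}. */
    static Map<String, Object> resultOf(String node, String key, Object value) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put(key, value);
        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put(node, inner);
        return outer;
    }

    public static void main(String[] args) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("responseHeader", "untouched by the merge");
        // Each downstream call writes into its own result object...
        mergeInto(results, resultOf("success", "core_a", "ok"));
        mergeInto(results, resultOf("failure", "core_b", "Unable to create core"));
        // ...so only the success/failure nodes of the main object change.
        System.out.println(results);
    }
}
```

Because each call fills its own {{addResult}}, a downstream call can only contribute success/failure entries and cannot overwrite other fields of the main response.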
I think this was Dat's original intent as well in {{CreateShardCmd}}, but not sure.

> Async collection actions should not hide failures
> -------------------------------------------------
>
>                 Key: SOLR-12708
>                 URL: https://issues.apache.org/jira/browse/SOLR-12708
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Admin UI, Backup/Restore
>    Affects Versions: 7.4
>            Reporter: Mano Kovacs
>            Assignee: Varun Thacker
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Async collection API may hide failures compared to sync version.
> [OverseerCollectionMessageHandler::processResponses|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java#L744]
> structures errors differently in the response, that hides failures from most
> evaluators. RestoreCmd did not receive, nor handle async addReplica issues.
> Sample create collection sync and async result with invalid solrconfig.xml:
> {noformat}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":32104},
>   "failure":{
>     "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'name4_shard1_replica_n1': Unable to create core [name4_shard1_replica_n1] Caused by: The content of elements must consist of well-formed character data or markup.",
>     "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'name4_shard2_replica_n2': Unable to create core [name4_shard2_replica_n2] Caused by: The content of elements must consist of well-formed character data or markup.",
>     "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'name4_shard1_replica_n2': Unable to create core [name4_shard1_replica_n2] Caused by: The content of elements must consist of well-formed character data or markup.",
>     "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'name4_shard2_replica_n1': Unable to create core [name4_shard2_replica_n1] Caused by: The content of elements must consist of well-formed character data or markup."}
> }
> {noformat}
> vs async:
> {noformat}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":3},
>   "success":{
>     "localhost:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":12}},
>     "localhost:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":3}},
>     "localhost:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":11}},
>     "localhost:8983_solr":{
>       "responseHeader":{
>         "status":0,
>         "QTime":12}}},
>   "myTaskId2709146382836":{
>     "responseHeader":{
>       "status":0,
>       "QTime":1},
>     "STATUS":"failed",
>     "Response":"Error CREATEing SolrCore 'name_shard2_replica_n2': Unable to create core [name_shard2_replica_n2] Caused by: The content of elements must consist of well-formed character data or markup."},
>   "status":{
>     "state":"completed",
>     "msg":"found [myTaskId] in completed tasks"}}
> {noformat}
> Proposing adding failure node to the results, keeping backward compatible but
> correct result.
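As a rough illustration of why the async structure hides the failure from typical evaluators, the sketch below compares a naive check (look only for a top-level "failure" node) with one that also inspects nested task nodes. The class and method names are hypothetical, the responses are modelled as plain maps, and the sample data is abbreviated from the two responses quoted above; it is not code from Solr or SolrJ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: responses modelled as nested maps.
class FailureCheckSketch {

    /** Naive check: the response failed only if it has a top-level "failure" node. */
    static boolean hasTopLevelFailure(Map<String, Object> response) {
        return response.containsKey("failure");
    }

    /** Stricter check: also treats a nested node with STATUS=failed as a failure. */
    static boolean hasAnyFailure(Map<String, Object> response) {
        if (hasTopLevelFailure(response)) {
            return true;
        }
        for (Object value : response.values()) {
            if (value instanceof Map) {
                Object status = ((Map<?, ?>) value).get("STATUS");
                if ("failed".equals(status)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Abbreviated form of the sync sample: errors are reported under "failure". */
    static Map<String, Object> syncSample() {
        Map<String, Object> failure = new LinkedHashMap<>();
        failure.put("localhost:8983_solr", "Error CREATEing SolrCore 'name4_shard1_replica_n1'");
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("failure", failure);
        return response;
    }

    /** Abbreviated form of the async sample: "success" at top level, error only in the task node. */
    static Map<String, Object> asyncSample() {
        Map<String, Object> success = new LinkedHashMap<>();
        success.put("localhost:8983_solr", "status 0");
        Map<String, Object> task = new LinkedHashMap<>();
        task.put("STATUS", "failed");
        task.put("Response", "Error CREATEing SolrCore 'name_shard2_replica_n2'");
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("success", success);
        response.put("myTaskId2709146382836", task);
        return response;
    }

    public static void main(String[] args) {
        System.out.println("sync, naive check:   " + hasTopLevelFailure(syncSample()));
        System.out.println("async, naive check:  " + hasTopLevelFailure(asyncSample()));
        System.out.println("async, strict check: " + hasAnyFailure(asyncSample()));
    }
}
```

With these samples the naive check flags the sync response but not the async one, while the stricter check flags both. Adding a failure node to the async results, as proposed, would let the naive check catch it as well.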