[ 
https://issues.apache.org/jira/browse/SOLR-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624052#comment-16624052
 ] 

Varun Thacker commented on SOLR-12708:
--------------------------------------

Hi Mano,

Thanks for the patch!

I'm curious about the 10-minute timeout on the latch countdown. Shouldn't we
wait forever here?
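
For reference, a minimal sketch of the two waiting styles in question, using 
only java.util.concurrent.CountDownLatch. The class and method names 
(LatchWaitSketch, awaitBounded, awaitForever) are hypothetical and are not 
taken from the patch; the point is only that the bounded await reports a 
timeout through its return value, which the caller then has to handle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchWaitSketch {

  /**
   * Bounded wait: returns true if the latch reached zero within the timeout,
   * false if the timeout elapsed first (or the thread was interrupted).
   */
  static boolean awaitBounded(CountDownLatch latch, long timeout, TimeUnit unit) {
    try {
      return latch.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }

  /** Unbounded wait: blocks until the latch reaches zero or the thread is interrupted. */
  static void awaitForever(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
  }
}
```

With the bounded form, a false return means some callbacks never counted 
down, so the caller needs an explicit failure path for that case.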

Here we're handling success and failure differently. If the add replica call 
fails, we add the failure back to the main response, but if it succeeds we 
end up skipping it (at this point results.get("success") will always be 
null).
{code:java}
ocmh.addReplica(clusterState, new ZkNodeProps(propMap), addResult, ()-> {
  countDownLatch.countDown();
  Object addResultFailure = addResult.get("failure");
  if (addResultFailure != null) {
    SimpleOrderedMap failure = (SimpleOrderedMap) results.get("failure");
    if (failure == null) {
      failure = new SimpleOrderedMap();
      results.add("failure", failure);
    }
    failure.addAll((NamedList) addResultFailure);
  } else {
    SimpleOrderedMap success = (SimpleOrderedMap) results.get("success");
    if (success == null) {
      success = new SimpleOrderedMap();
      results.add("success", success);
    }
    success.addAll((NamedList) addResult.get("success"));
  }
});{code}
Can't we do this instead, which will append the results directly to the main 
object? We already do this for the remaining add replicas as the last step 
of the restore:
{code:java}
ocmh.addReplica(clusterState, new ZkNodeProps(propMap), results, ()-> {
  countDownLatch.countDown();
});{code}
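
To illustrate the idea of letting the callee write straight into the shared 
response, here is a small self-contained sketch. It uses plain java.util maps 
as a stand-in for NamedList/SimpleOrderedMap, and recordOutcome is a 
hypothetical helper, not the real ocmh.addReplica. It assumes the callee 
files each outcome under "success" or "failure" itself; whether addReplica 
actually behaves that way is the question above, not something this sketch 
confirms. Note also that a Map keeps one entry per key, unlike NamedList.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SharedResultsSketch {

  /**
   * Hypothetical stand-in for a call that records its own outcome in the
   * shared results object, creating the "success" or "failure" section on
   * first use.
   */
  static void recordOutcome(Map<String, Map<String, Object>> results,
                            String node, Object response, boolean ok) {
    String section = ok ? "success" : "failure";
    results.computeIfAbsent(section, k -> new LinkedHashMap<>()).put(node, response);
  }

  public static void main(String[] args) {
    Map<String, Map<String, Object>> results = new LinkedHashMap<>();
    recordOutcome(results, "node1", "created", true);
    recordOutcome(results, "node2", "Error CREATEing SolrCore", false);
    // Both sections live in the one shared object; no per-call merge step.
    System.out.println(results);
  }
}
```

Because every call writes into the same object, the caller's callback only 
has to count down the latch.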
 

> Async collection actions should not hide failures
> -------------------------------------------------
>
>                 Key: SOLR-12708
>                 URL: https://issues.apache.org/jira/browse/SOLR-12708
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Admin UI, Backup/Restore
>    Affects Versions: 7.4
>            Reporter: Mano Kovacs
>            Assignee: Varun Thacker
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Async collection API may hide failures compared to sync version. 
> [OverseerCollectionMessageHandler::processResponses|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java#L744]
>  structures errors differently in the response, that hides failures from most 
> evaluators. RestoreCmd did not receive, nor handle async addReplica issues.
> Sample create collection sync and async result with invalid solrconfig.xml:
> {noformat}
> {
> "responseHeader":{
> "status":0,
> "QTime":32104},
> "failure":{
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard1_replica_n1': Unable to create core [name4_shard1_replica_n1] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard2_replica_n2': Unable to create core [name4_shard2_replica_n2] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard1_replica_n2': Unable to create core [name4_shard1_replica_n2] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard2_replica_n1': Unable to create core [name4_shard2_replica_n1] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup."}
> }
> {noformat}
> vs async:
> {noformat}
> {
> "responseHeader":{
> "status":0,
> "QTime":3},
> "success":{
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":12}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":3}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":11}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":12}}},
> "myTaskId2709146382836":{
> "responseHeader":{
> "status":0,
> "QTime":1},
> "STATUS":"failed",
> "Response":"Error CREATEing SolrCore 'name_shard2_replica_n2': Unable to 
> create core [name_shard2_replica_n2] Caused by: The content of elements must 
> consist of well-formed character data or markup."},
> "status":{
> "state":"completed",
> "msg":"found [myTaskId] in completed tasks"}}
> {noformat}
> Proposing adding failure node to the results, keeping backward compatible but 
> correct result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

