[ 
https://issues.apache.org/jira/browse/SOLR-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628739#comment-16628739
 ] 

Mano Kovacs commented on SOLR-12708:
------------------------------------

Hello [~varunthacker], thank you for the review!

bq. I'm curious about the 10 minute latch countdown timeout. Shouldn't we wait 
forever here? 
I think that if we waited forever, any downstream command that got stuck or 
never returned a result would keep this job hanging as well, so I would worry 
about robustness here. This part of the code creates a bunch of empty cores 
(one per shard) in parallel. On a larger cluster with 200-300 shards, this 
might take longer than 10 minutes if the overseer queue is already behind, so 
10 minutes might in fact be problematic. However, if the Overseer falls much 
further behind than that, it would seriously hurt the stability of the cluster 
anyway. If you agree, I will increase this wait to an hour, which would leave 
plenty of time for the Overseer to process the core creation on a relatively 
large collection, but still ensures that the job gets cancelled if one task 
gets stuck.
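
To illustrate the trade-off, here is a minimal sketch of a bounded latch wait. 
It is not the code from the patch: the class name {{BoundedLatchWait}}, the 
method {{runAndAwait}} and the use of a plain thread pool are assumptions made 
for the example. Only {{CountDownLatch.await(timeout, unit)}} is the actual 
JDK call under discussion.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Solr: runs one task per shard in parallel
// and waits for all of them, but never longer than the given timeout.
class BoundedLatchWait {

    /** Returns true if every task finished in time, false if the wait timed out. */
    static boolean runAndAwait(int shardCount, Runnable perShardTask,
                               long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(shardCount);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int i = 0; i < shardCount; i++) {
                pool.execute(() -> {
                    try {
                        perShardTask.run();
                    } finally {
                        // Count down even if the task throws, so a failing task
                        // is not mistaken for a stuck one.
                        latch.countDown();
                    }
                });
            }
            // Bounded wait: a task that never returns cannot hang the caller forever.
            return latch.await(timeout, unit);
        } finally {
            // Interrupt whatever is still running once we stop waiting.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean finished = runAndAwait(4, () -> { }, 1, TimeUnit.HOURS);
        System.out.println("all tasks finished in time: " + finished);
    }
}
```

With a one-hour timeout the caller still gets control back if a single task 
never completes, which is the property I would like to keep.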

bq. So here we're doing something different wrt success and failure . If the 
add replica call has a failure we're adding it back to the main response but if 
it's a success then we will end up skipping it ( at this point 
results.get("success") will always be null ) . 
I have to be honest and admit that I copied the full block from 
{{CreateShardCmd.java}}. I think the code is doing the right thing there. In 
both branches of the {{if}}, the code checks whether the main {{results}} 
already has a success/failure node and creates it if necessary. It then adds 
the corresponding {{addResult}} field to the main one. The only difference is 
that the failure is retrieved before the {{if}} block.
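
A simplified sketch of that merge logic follows. It uses {{java.util.Map}} as 
a stand-in for Solr's {{NamedList}} so that it is self-contained; the class 
{{ResultMerger}} and the method {{merge}} are hypothetical names, and unlike 
{{NamedList}} a {{Map}} cannot hold repeated keys, so this only shows the 
shape of the logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the block copied from CreateShardCmd.java.
class ResultMerger {

    /** Merges the isolated addResult of one downstream call into the main results. */
    @SuppressWarnings("unchecked")
    static void merge(Map<String, Object> results, Map<String, Object> addResult) {
        // The failure node is looked up before the if block.
        Object addResultFailure = addResult.get("failure");
        if (addResultFailure != null) {
            // Create the main "failure" node if necessary, then append to it.
            Map<String, Object> failure = (Map<String, Object>)
                    results.computeIfAbsent("failure", k -> new LinkedHashMap<String, Object>());
            failure.putAll((Map<String, Object>) addResultFailure);
        } else {
            Object addResultSuccess = addResult.get("success");
            if (addResultSuccess != null) {
                // Same pattern for the "success" node.
                Map<String, Object> success = (Map<String, Object>)
                        results.computeIfAbsent("success", k -> new LinkedHashMap<String, Object>());
                success.putAll((Map<String, Object>) addResultSuccess);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> results = new LinkedHashMap<>();

        Map<String, Object> okNodes = new LinkedHashMap<>();
        okNodes.put("node1:8983_solr", "created");
        Map<String, Object> okResult = new LinkedHashMap<>();
        okResult.put("success", okNodes);

        Map<String, Object> failedNodes = new LinkedHashMap<>();
        failedNodes.put("node2:8983_solr", "Error CREATEing SolrCore");
        Map<String, Object> failedResult = new LinkedHashMap<>();
        failedResult.put("failure", failedNodes);

        merge(results, okResult);
        merge(results, failedResult);
        System.out.println(results.keySet());
    }
}
```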

bq. Can't we do this instead which will append the results directly to the main 
object? We do this for the remaining add replicas as the last step of the 
restore
Then we may let the downstream call override certain other fields that might 
already be populated. I think isolation makes it less error-prone. I think 
this was Dat's original intent as well in {{CreateShardCmd}}, but I am not sure.
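
The concern can be shown with a small, purely hypothetical sketch. The 
{{downstream}} method and the {{status}} field are invented for the example 
and do not correspond to real Solr code; plain maps again stand in for 
{{NamedList}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of shared versus isolated result objects.
class IsolatedResults {

    /** Invented downstream command: records its outcome and also writes a "status" field. */
    static void downstream(Map<String, Object> out) {
        out.put("status", "downstream-status");
        out.put("success", "core created");
    }

    /** Passes the main results object directly to the downstream call. */
    static Map<String, Object> sharedCall() {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("status", "set-by-caller");
        downstream(results); // may overwrite fields the caller already populated
        return results;
    }

    /** Gives the downstream call its own object and copies back only what is wanted. */
    static Map<String, Object> isolatedCall() {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("status", "set-by-caller");
        Map<String, Object> addResult = new LinkedHashMap<>();
        downstream(addResult);
        results.put("success", addResult.get("success"));
        return results;
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + sharedCall().get("status"));
        System.out.println("isolated: " + isolatedCall().get("status"));
    }
}
```

In the shared variant the caller's field is replaced by the downstream value, 
while the isolated variant keeps it and still picks up the success entry.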


> Async collection actions should not hide failures
> -------------------------------------------------
>
>                 Key: SOLR-12708
>                 URL: https://issues.apache.org/jira/browse/SOLR-12708
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Admin UI, Backup/Restore
>    Affects Versions: 7.4
>            Reporter: Mano Kovacs
>            Assignee: Varun Thacker
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Async collection API may hide failures compared to sync version. 
> [OverseerCollectionMessageHandler::processResponses|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java#L744]
>  structures errors differently in the response, which hides failures from most 
> evaluators. RestoreCmd neither received nor handled async addReplica issues.
> Sample create collection sync and async results with an invalid solrconfig.xml:
> {noformat}
> {
> "responseHeader":{
> "status":0,
> "QTime":32104},
> "failure":{
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard1_replica_n1': Unable to create core [name4_shard1_replica_n1] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard2_replica_n2': Unable to create core [name4_shard2_replica_n2] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard1_replica_n2': Unable to create core [name4_shard1_replica_n2] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup.",
> "localhost:8983_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
>  from server at http://localhost:8983/solr: Error CREATEing SolrCore 
> 'name4_shard2_replica_n1': Unable to create core [name4_shard2_replica_n1] 
> Caused by: The content of elements must consist of well-formed character data 
> or markup."}
> }
> {noformat}
> vs async:
> {noformat}
> {
> "responseHeader":{
> "status":0,
> "QTime":3},
> "success":{
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":12}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":3}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":11}},
> "localhost:8983_solr":{
> "responseHeader":{
> "status":0,
> "QTime":12}}},
> "myTaskId2709146382836":{
> "responseHeader":{
> "status":0,
> "QTime":1},
> "STATUS":"failed",
> "Response":"Error CREATEing SolrCore 'name_shard2_replica_n2': Unable to 
> create core [name_shard2_replica_n2] Caused by: The content of elements must 
> consist of well-formed character data or markup."},
> "status":{
> "state":"completed",
> "msg":"found [myTaskId] in completed tasks"}}
> {noformat}
> I propose adding a failure node to the results, which keeps the response 
> backward compatible while making the result correct.


