Not sure if you saw my write-up from the BigCouch era; it is still valid for CouchDB 2.0:
https://stackoverflow.com/questions/6676972/moving-a-shard-from-one-bigcouch-server-to-another-for-balancing

Shard moving / database rebalancing is definitely a bit tricky and we could use better tools for it.
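For 2.0 the mechanics boil down to editing the shard-map document for the database in the node-local `_dbs` database (served on the "backdoor" port, 5986 by default). Below is a rough, untested sketch of the "add a copy" edit; the host, port, node name, range and missing credentials are placeholders:

```python
# Rough sketch: add a copy of one shard range to a node by editing the shard
# map document in the node-local _dbs database (CouchDB 2.x "backdoor" port).
# Untested; host, port, node name, range and the lack of credentials are all
# placeholders for illustration.
import json
import urllib.request

BASE = "http://127.0.0.1:5986"   # node-local port on any cluster node
DB = "my_db"
RANGE = "0fffffff-15555553"      # shard range to move
NODE = "couchdb@couch-3"         # node that should receive a copy

def call(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The shard map is a regular document keyed by the database name.
shard_map = call("GET", "/_dbs/" + DB)

# Register the new copy in both directions of the map ...
shard_map.setdefault("by_node", {}).setdefault(NODE, [])
if RANGE not in shard_map["by_node"][NODE]:
    shard_map["by_node"][NODE].append(RANGE)
shard_map.setdefault("by_range", {}).setdefault(RANGE, [])
if NODE not in shard_map["by_range"][RANGE]:
    shard_map["by_range"][RANGE].append(NODE)

# ... and record the change in the changelog.
shard_map.setdefault("changelog", []).append(["add", RANGE, NODE])

print(call("PUT", "/_dbs/" + DB, shard_map))
```

Removing a copy again (as in step 5 of the report below) is the mirror edit to `by_node` and `by_range` on the same document.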
> On 26 Jul 2017, at 18:10, Carlos Alonso <[email protected]> wrote:
>
> Hi!
>
> I have had a few log errors when moving a shard under particular
> circumstances and I'd like to share them here and get your input on
> whether this should be reported or not.
>
> So let me describe the steps I took:
>
> 1. 3-node cluster (couch-0, couch-1 and couch-2), 1 database (my_db) with
>    48 shards and 1 replica.
> 2. A 4th node (couch-3) is added to the cluster.
> 3. Change the shards map so that the new node gets one of the shards from
>    couch-0 (at this moment both couch-0 and couch-3 hold the shard).
> 4. Synchronisation happens and the new node gets its shard.
> 5. Change the shards map again so that couch-0 is no longer an owner of
>    that shard.
> 6. I go into the couch-0 node and manually delete the .couch file of the
>    shard, to reclaim disk space.
> 7. All fine here.
>
> 8. Now I want to put the shard back onto the original node, where it was
>    before.
> 9. I put couch-0 into maintenance mode (I always do this before adding a
>    shard to a node, to avoid it responding to reads before it is synced).
> 10. Modify the shards map, adding the shard to couch-0.
> 11. All nodes' logs fill up with errors (details below).
> 12. I remove couch-0's maintenance mode and things seem to flow again.
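As an aside, the maintenance toggle in steps 9 and 12 corresponds to the `[couchdb] maintenance_mode` setting, which can be flipped per node through the config API. A minimal sketch, assuming the node-local port is reachable; host, port and the missing credentials are assumptions:

```python
# Sketch: flip [couchdb] maintenance_mode on one node (steps 9 and 12) through
# the node-local config API. Host, port and missing credentials are assumptions.
import json
import urllib.request

def set_maintenance(base_url, enabled):
    """PUT /_config/couchdb/maintenance_mode with "true" or "false"."""
    body = json.dumps("true" if enabled else "false").encode()
    req = urllib.request.Request(
        base_url + "/_config/couchdb/maintenance_mode",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)      # the config API returns the previous value

# Enable before adding the shard copy, disable again once it has synced.
set_maintenance("http://couch-0:5986", True)
# ... edit the shards map, wait for internal replication to catch up ...
set_maintenance("http://couch-0:5986", False)
```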
> So this is the process; now let me describe what I spotted in the logs.
>
> Couch-0 seems to go through a few stages:
>
> 1. Tries to create the shard and somehow detects that it existed before
>    (probably something I forgot to delete in step 6):
>
> `mem3_shards tried to create shards/0fffffff-15555553/my_db.1500994155, got file_exists`
>
> 2. gen_server crashes:
>
> ```
> CRASH REPORT Process (<0.27288.4>) with 0 neighbors exited with reason: no match of right hand value {error,enoent} at couch_file:sync/1(line:211) <= couch_db_updater:sync_header/2(line:987) <= couch_db_updater:update_docs_int/5(line:906) <= couch_db_updater:handle_info/2(line:289) <= gen_server:handle_msg/5(line:599) <= proc_lib:wake_up/3(line:247) at gen_server:terminate/6(line:737) <= proc_lib:wake_up/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.27267.4>], messages: [], links: [<0.210.0>], dictionary: [{io_priority,{db_update,<<"shards/0fffffff-15555553/my_db...">>}},...], trap_exit: false, status: running, heap_size: 6772, stack_size: 27, reductions: 300961927
> ```
>
> 3. Seems to somehow recover and tries to open the file again:
>
> ```
> Could not open file ./data/shards/0fffffff-15555553/my_db.1500994155.couch: no such file or directory
>
> open_result error {not_found,no_db_file} for shards/0fffffff-15555553/my_db.1500994155
> ```
>
> 4. Tries to create the file:
>
> `creating missing database: shards/0fffffff-15555553/my_db.1500994155`
>
> 5. Continuously fails because it cannot load the validation funs, possibly
>    because of the maintenance mode?
>
> ```
> Error in process <0.2126.141> on node 'couchdb@couch-0' with exit value: {{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[...
>
> Error in process <0.1970.141> on node 'couchdb@couch-0' with exit value: {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener...
>
> could not load validation funs {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[{file,"src/ddoc_cache_opener.erl"},{line,240}]}]}}},[{ddoc_cache_opener,handle_open_response,1,[{file,"src/ddoc_cache_opener.erl"},{line,282}]},{couch_db,'-load_validation_funs/1-fun-0-',1,[{file,"src/couch_db.erl"},{line,659}]}]}
> ```
>
> Couch-3 shows a warning and an error:
>
> ```
> [warning] ... -------- mem3_sync shards/0fffffff-15555553/my_db.1500994155 couchdb@couch-0 {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,267}]},{mem3_rep,save_on_target,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,286}]},{mem3_rep,replicate_batch,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,256}]},{mem3_rep,repl,2,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,178}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,81}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,208}]}]}
> ```
>
> ```
> [error] ... -------- Error in process <0.13658.13> on node 'couchdb@couch-3' with exit value: {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,"src/mem3_rpc.erl"},{line,267}]},{mem3_rep,save_on_target,3,[{file,"src/mem3_rep.erl"},{line,286}]},{mem3_rep,replicate_batch,1,[{file,"src/mem3_rep.erl"},{line,256}]},{mem3_rep...
> ```
>
> Which, to me, means that couch-0 is responding with internal server errors to its requests.
>
> Couch-1, which is, by the way, the owner of the task that replicates my_db from a remote server, seems to go through two stages.
>
> First it seems unable to continue with the replication process because it receives a 500 error (maybe from couch-0?):
>
> ```
> [error] ... req_err(4096501418) unknown_error : badarg [<<"dict:fetch/2 L130">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2 L424">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2 L424">>,<<"fabric_doc_update:go/3 L41">>,<<"fabric:update_docs/3 L259">>,<<"chttpd_db:db_req/2 L445">>,<<"chttpd:process_request/1 L293">>,<<"chttpd:handle_request_int/1 L229">>]
>
> [notice] ... 127.0.0.1:5984 127.0.0.1 undefined POST /my_db/_bulk_docs 500 ok 114
>
> [notice] ... Retrying POST request to http://127.0.0.1:5984/my_db/_bulk_docs in 0.25 seconds due to error {code,500}
> ```
>
> After disabling maintenance mode on couch-0 the replication process seems to work again, but a few seconds later lots of new errors appear:
>
> ```
> [error] -------- rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> ```
>
> And a few seconds later (I have not been able to correlate it with anything so far) they stop.
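On the `file_exists` in the first stage above: a manual delete can easily leave pieces behind, since besides the `.couch` file under `data/shards/` the view indexes for that shard live under the hidden `data/.shards/` tree (and the node may still have the shard open). A small sketch to list what is still on disk for a given range; the data directory path and names are assumptions for illustration:

```python
# Sketch: list leftover on-disk artifacts for one shard range of a database.
# The data directory and names below are assumptions; adjust to your setup.
from pathlib import Path

DATA_DIR = Path("/opt/couchdb/data")   # couchdb data_dir
RANGE = "0fffffff-15555553"
DB = "my_db"

def leftovers(data_dir: Path, rng: str, db: str):
    """Return shard files (.couch) and view index files still on disk."""
    found = []
    for subdir in ("shards", ".shards"):        # .shards holds the view indexes
        base = data_dir / subdir / rng
        if base.exists():
            found += [p for p in base.rglob("*") if p.is_file() and db in str(p)]
    return found

for path in leftovers(DATA_DIR, RANGE, DB):
    print(path)
```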
> Finally, couch-2 just shows one error, the same as the last one from couch-1:
>
> ```
> [error] -------- rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> ```
>
> *In conclusion:*
>
> To me it looks like two things are involved here:
>
> 1. The fact that I deleted the file from disk while something else still knows that it should be there.
> 2. The fact that the node is in maintenance mode, which seems to prevent new shards from being created.
>
> Sorry for such a wall of text. I hope it is detailed enough to get someone's input that can help me confirm or refute my theories and decide whether it makes sense to open a GitHub issue to make this part of the process more robust.
>
> Regards
>
> --
> Carlos Alonso
> Data Engineer
> Madrid, Spain
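On your first conclusion: before deleting a shard file (your step 6) it is worth confirming that the copy on the new node has actually caught up. The node-local port exposes each shard as a regular database, so comparing `doc_count` (and `doc_del_count`) across nodes is a quick sanity check. A sketch, with hosts and the shard name as placeholders:

```python
# Sketch: compare doc counts of one shard copy across nodes before deleting a
# file. Hosts, the shard name and reachable node-local ports are assumptions.
import json
import urllib.request
from urllib.parse import quote

SHARD = "shards/0fffffff-15555553/my_db.1500994155"   # full shard db name
NODES = {                      # node-local ("backdoor") endpoints, port 5986
    "couch-0": "http://couch-0:5986",
    "couch-3": "http://couch-3:5986",
}

def shard_info(base, shard):
    # Shard names contain slashes, so they must be URL-encoded.
    url = base + "/" + quote(shard, safe="")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

counts = {name: shard_info(base, SHARD)["doc_count"] for name, base in NODES.items()}
print(counts)   # only delete the old copy once the counts agree
```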
