Not sure if you saw my write-up from the BigCouch era; it is still valid for CouchDB 2.0:
https://stackoverflow.com/questions/6676972/moving-a-shard-from-one-bigcouch-server-to-another-for-balancing

Shard moving / database rebalancing is definitely a bit tricky and we could use better tools for it.
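For 2.0 the mechanics boil down to editing the shard-map document for the database in the node-local `_dbs` database (served on the "backdoor" port, 5986 by default). Below is a rough, untested sketch of the "add a copy" edit; the host, port, node name, range and missing credentials are placeholders:

```python
# Rough sketch: add a copy of one shard range to a node by editing the shard
# map document in the node-local _dbs database (CouchDB 2.x "backdoor" port).
# Untested; host, port, node name, range and the lack of credentials are all
# placeholders for illustration.
import json
import urllib.request

BASE = "http://127.0.0.1:5986"   # node-local port on any cluster node
DB = "my_db"
RANGE = "0fffffff-15555553"      # shard range to move
NODE = "couchdb@couch-3"         # node that should receive a copy

def call(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The shard map is a regular document keyed by the database name.
shard_map = call("GET", "/_dbs/" + DB)

# Register the new copy in both directions of the map ...
shard_map.setdefault("by_node", {}).setdefault(NODE, [])
if RANGE not in shard_map["by_node"][NODE]:
    shard_map["by_node"][NODE].append(RANGE)
shard_map.setdefault("by_range", {}).setdefault(RANGE, [])
if NODE not in shard_map["by_range"][RANGE]:
    shard_map["by_range"][RANGE].append(NODE)

# ... and record the change in the changelog.
shard_map.setdefault("changelog", []).append(["add", RANGE, NODE])

print(call("PUT", "/_dbs/" + DB, shard_map))
```

Removing a copy again (as in step 5 of the report below) is the mirror edit to `by_node` and `by_range` on the same document.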
> On 26 Jul 2017, at 18:10, Carlos Alonso <[email protected]> wrote:
>
> Hi!
>
> I have had a few log errors when moving a shard under particular
> circumstances and I'd like to share them here and get your input on
> whether this should be reported or not.
>
> So let me describe the steps I took:
>
> 1. 3-node cluster (couch-0, couch-1 and couch-2), 1 database (my_db) with
>    48 shards and 1 replica.
> 2. A 4th node (couch-3) is added to the cluster.
> 3. Change the shards map so that the new node gets one of the shards from
>    couch-0 (at this moment both couch-0 and couch-3 hold the shard).
> 4. Synchronisation happens and the new node gets its shard.
> 5. Change the shards map again so that couch-0 is no longer an owner of
>    that shard.
> 6. I go into the couch-0 node and manually delete the .couch file of the
>    shard, to reclaim disk space.
> 7. All fine here.
>
> 8. Now I want to put the shard back onto the original node, where it was
>    before.
> 9. I put couch-0 into maintenance mode (I always do this before adding a
>    shard to a node, to avoid it responding to reads before it is synced).
> 10. Modify the shards map, adding the shard to couch-0.
> 11. All nodes' logs fill up with errors (details below).
> 12. I remove couch-0's maintenance mode and things seem to flow again.
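As an aside, the maintenance toggle in steps 9 and 12 corresponds to the `[couchdb] maintenance_mode` setting, which can be flipped per node through the config API. A minimal sketch, assuming the node-local port is reachable; host, port and the missing credentials are assumptions:

```python
# Sketch: flip [couchdb] maintenance_mode on one node (steps 9 and 12) through
# the node-local config API. Host, port and missing credentials are assumptions.
import json
import urllib.request

def set_maintenance(base_url, enabled):
    """PUT /_config/couchdb/maintenance_mode with "true" or "false"."""
    body = json.dumps("true" if enabled else "false").encode()
    req = urllib.request.Request(
        base_url + "/_config/couchdb/maintenance_mode",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)      # the config API returns the previous value

# Enable before adding the shard copy, disable again once it has synced.
set_maintenance("http://couch-0:5986", True)
# ... edit the shards map, wait for internal replication to catch up ...
set_maintenance("http://couch-0:5986", False)
```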
> So this is the process; now let me describe what I spotted in the logs.
>
> Couch-0 seems to go through a few stages:
>
> 1. Tries to create the shard and somehow detects that it existed before
>    (probably something I forgot to delete in step 6):
>
> `mem3_shards tried to create shards/0fffffff-15555553/my_db.1500994155, got file_exists`
>
> 2. gen_server crashes:
>
> ```
> CRASH REPORT Process (<0.27288.4>) with 0 neighbors exited with reason: no match of right hand value {error,enoent} at couch_file:sync/1(line:211) <= couch_db_updater:sync_header/2(line:987) <= couch_db_updater:update_docs_int/5(line:906) <= couch_db_updater:handle_info/2(line:289) <= gen_server:handle_msg/5(line:599) <= proc_lib:wake_up/3(line:247) at gen_server:terminate/6(line:737) <= proc_lib:wake_up/3(line:247); initial_call: {couch_db_updater,init,['Argument__1']}, ancestors: [<0.27267.4>], messages: [], links: [<0.210.0>], dictionary: [{io_priority,{db_update,<<"shards/0fffffff-15555553/my_db...">>}},...], trap_exit: false, status: running, heap_size: 6772, stack_size: 27, reductions: 300961927
> ```
>
> 3. Seems to somehow recover and tries to open the file again:
>
> ```
> Could not open file ./data/shards/0fffffff-15555553/my_db.1500994155.couch: no such file or directory
>
> open_result error {not_found,no_db_file} for shards/0fffffff-15555553/my_db.1500994155
> ```
>
> 4. Tries to create the file:
>
> `creating missing database: shards/0fffffff-15555553/my_db.1500994155`
>
> 5. Continuously fails because it cannot load the validation funs, possibly
>    because of the maintenance mode?
>
> ```
> Error in process <0.2126.141> on node 'couchdb@couch-0' with exit value: {{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[...
>
> Error in process <0.1970.141> on node 'couchdb@couch-0' with exit value: {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener...
>
> could not load validation funs {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0'}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[{file,"src/ddoc_cache_opener.erl"},{line,240}]}]}}},[{ddoc_cache_opener,handle_open_response,1,[{file,"src/ddoc_cache_opener.erl"},{line,282}]},{couch_db,'-load_validation_funs/1-fun-0-',1,[{file,"src/couch_db.erl"},{line,659}]}]}
> ```
>
> Couch-3 shows a warning and an error:
>
> ```
> [warning] ... -------- mem3_sync shards/0fffffff-15555553/my_db.1500994155 couchdb@couch-0 {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,267}]},{mem3_rep,save_on_target,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,286}]},{mem3_rep,replicate_batch,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,256}]},{mem3_rep,repl,2,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,178}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,81}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,208}]}]}
> ```
>
> ```
> [error] ... -------- Error in process <0.13658.13> on node 'couchdb@couch-3' with exit value: {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,"src/mem3_rpc.erl"},{line,267}]},{mem3_rep,save_on_target,3,[{file,"src/mem3_rep.erl"},{line,286}]},{mem3_rep,replicate_batch,1,[{file,"src/mem3_rep.erl"},{line,256}]},{mem3_rep...
> ```
>
> Which, to me, means that couch-0 is responding with internal server errors to its requests.
>
> Couch-1, which is, by the way, the owner of the task that replicates my_db from a remote server, seems to go through two stages.
>
> First it seems unable to continue with the replication process because it receives a 500 error (maybe from couch-0?):
>
> ```
> [error] ... req_err(4096501418) unknown_error : badarg [<<"dict:fetch/2 L130">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2 L424">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2 L424">>,<<"fabric_doc_update:go/3 L41">>,<<"fabric:update_docs/3 L259">>,<<"chttpd_db:db_req/2 L445">>,<<"chttpd:process_request/1 L293">>,<<"chttpd:handle_request_int/1 L229">>]
>
> [notice] ... 127.0.0.1:5984 127.0.0.1 undefined POST /my_db/_bulk_docs 500 ok 114
>
> [notice] ... Retrying POST request to http://127.0.0.1:5984/my_db/_bulk_docs in 0.25 seconds due to error {code,500}
> ```
>
> After disabling maintenance mode on couch-0 the replication process seems to work again, but a few seconds later lots of new errors appear:
>
> ```
> [error] -------- rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> ```
>
> And a few seconds later (I have not been able to correlate it with anything so far) they stop.
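On the `file_exists` in the first stage above: a manual delete can easily leave pieces behind, since besides the `.couch` file under `data/shards/` the view indexes for that shard live under the hidden `data/.shards/` tree (and the node may still have the shard open). A small sketch to list what is still on disk for a given range; the data directory path and names are assumptions for illustration:

```python
# Sketch: list leftover on-disk artifacts for one shard range of a database.
# The data directory and names below are assumptions; adjust to your setup.
from pathlib import Path

DATA_DIR = Path("/opt/couchdb/data")   # couchdb data_dir
RANGE = "0fffffff-15555553"
DB = "my_db"

def leftovers(data_dir: Path, rng: str, db: str):
    """Return shard files (.couch) and view index files still on disk."""
    found = []
    for subdir in ("shards", ".shards"):        # .shards holds the view indexes
        base = data_dir / subdir / rng
        if base.exists():
            found += [p for p in base.rglob("*") if p.is_file() and db in str(p)]
    return found

for path in leftovers(DATA_DIR, RANGE, DB):
    print(path)
```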
> Finally, couch-2 just shows one error, the same as the last one from couch-1:
>
> ```
> [error] -------- rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> ```
>
> *In conclusion:*
>
> To me it looks like two things are involved here:
>
> 1. The fact that I deleted the file from disk while something else still knows that it should be there.
> 2. The fact that the node is in maintenance mode, which seems to prevent new shards from being created.
>
> Sorry for such a wall of text. I hope it is detailed enough to get someone's input that can help me confirm or refute my theories and decide whether it makes sense to open a GitHub issue to make this part of the process more robust.
>
> Regards
>
> --
> Carlos Alonso
> Data Engineer
> Madrid, Spain
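On your first conclusion: before deleting a shard file (your step 6) it is worth confirming that the copy on the new node has actually caught up. The node-local port exposes each shard as a regular database, so comparing `doc_count` (and `doc_del_count`) across nodes is a quick sanity check. A sketch, with hosts and the shard name as placeholders:

```python
# Sketch: compare doc counts of one shard copy across nodes before deleting a
# file. Hosts, the shard name and reachable node-local ports are assumptions.
import json
import urllib.request
from urllib.parse import quote

SHARD = "shards/0fffffff-15555553/my_db.1500994155"   # full shard db name
NODES = {                      # node-local ("backdoor") endpoints, port 5986
    "couch-0": "http://couch-0:5986",
    "couch-3": "http://couch-3:5986",
}

def shard_info(base, shard):
    # Shard names contain slashes, so they must be URL-encoded.
    url = base + "/" + quote(shard, safe="")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

counts = {name: shard_info(base, SHARD)["doc_count"] for name, base in NODES.items()}
print(counts)   # only delete the old copy once the counts agree
```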
