Hello, we are developing a new application based on a CouchDB database,
and we have run into some strange problems.
We are using CouchDB 2.3.0 in a Docker installation based on the official
images.
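For context, the containers are started roughly like this (container name,
host data path and port mapping are illustrative, not necessarily our
exact setup):

    docker run -d --name couchdb \
      -p 5984:5984 \
      -v /srv/couchdb/data:/opt/couchdb/data \
      couchdb:2.3.0
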
During the development cycle everything worked fine, with simulated
workloads comparable to the final usage.
When we installed the application on two early production machines, with a
workload much lighter than the development one and more powerful dedicated
servers, both CouchDB instances started behaving erratically during the
night, while no user was active; one of them froze completely. In its logs
nothing has been written since the moment it froze, and the process
appears to be running, but any curl call to CouchDB returns, after a
lengthy timeout, an error like the following:

*{"error":"case_clause","reason":"{timeout,{[{{shard,<<\"shards/00000000-1fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[0,536870911],\n
#Ref<0.0.2097153.224211>,[]},\n            nil},\n
{{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[536870912,1073741823],\n
#Ref<0.0.2097153.224212>,[]},\n            27988},\n
{{shard,<<\"shards/40000000-5fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[1073741824,1610612735],\n
#Ref<0.0.2097153.224213>,[]},\n            nil},\n
{{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[1610612736,2147483647],\n
#Ref<0.0.2097153.224214>,[]},\n            28553},\n
{{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[2147483648,2684354559],\n
#Ref<0.0.2097153.224215>,[]},\n            27969},\n
{{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[2684354560,3221225471],\n
#Ref<0.0.2097153.224216>,[]},\n            28502},\n
{{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[3221225472,3758096383],\n
#Ref<0.0.2097153.224217>,[]},\n            28633},\n
{{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[3758096384,4294967295],\n
#Ref<0.0.2097153.224218>,[]},\n            28010}],\n
[{{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[3758096384,4294967295],\n
#Ref<0.0.2097153.224218>,[]},\n            0},\n
{{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[3221225472,3758096383],\n
#Ref<0.0.2097153.224217>,[]},\n            0},\n
{{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[2684354560,3221225471],\n
#Ref<0.0.2097153.224216>,[]},\n            0},\n
{{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[2147483648,2684354559],\n
#Ref<0.0.2097153.224215>,[]},\n            0},\n
{{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[1610612736,2147483647],\n
#Ref<0.0.2097153.224214>,[]},\n            0},\n
{{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n
nonode@nohost,<<\"mastercom\">>,\n
[536870912,1073741823],\n
#Ref<0.0.2097153.224212>,[]},\n            0}],\n
[[{db_name,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,371},\n
{doc_del_count,0},\n            {update_seq,28010},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1282636},{external,109712},{file,1298647}]}},\n
{disk_size,1298647},\n            {data_size,1282636},\n
{other,{[{data_size,109712}]}},\n
{instance_start_time,<<\"1555521000894744\">>},\n
{disk_format_version,7},\n
{committed_update_seq,28010},\n
{compacted_seq,28010},\n
{uuid,<<\"c3f60d2791bf5c3de01063af95f255b1\">>}],\n  *
*[{db_name,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,379},\n
{doc_del_count,1},\n            {update_seq,28633},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1314436},{external,117751},{file,1331415}]}},\n
{disk_size,1331415},\n            {data_size,1314436},\n
{other,{[{data_size,117751}]}},\n
{instance_start_time,<<\"1555521000894769\">>},\n
{disk_format_version,7},\n
{committed_update_seq,28633},\n
{compacted_seq,28633},\n
{uuid,<<\"740de6d1e5535ba64e43bc3ead349e84\">>}],\n
[{db_name,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,378},\n
{doc_del_count,0},\n            {update_seq,28502},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1308584},{external,116630},{file,1413335}]}},\n
{disk_size,1413335},\n            {data_size,1308584},\n
{other,{[{data_size,116630}]}},\n
{instance_start_time,<<\"1555521000894704\">>},\n
{disk_format_version,7},\n
{committed_update_seq,28502},\n
{compacted_seq,28502},\n
{uuid,<<\"d6acfc8cdf0a172df3d8db22df1a4b68\">>}],\n*
*[{db_name,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,373},\n
{doc_del_count,0},\n            {update_seq,27969},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1290174},{external,125829},{file,1335511}]}},\n
{disk_size,1335511},\n            {data_size,1290174},\n
{other,{[{data_size,125829}]}},\n
{instance_start_time,<<\"1555521000894811\">>},\n
{disk_format_version,7},\n
{committed_update_seq,27969},\n
{compacted_seq,27969},\n
{uuid,<<\"6735a0b95e5ebdc3c5dfe08b4c81a108\">>}],\n
[{db_name,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,379},\n
{doc_del_count,0},\n            {update_seq,28553},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1312728},{external,123640},{file,1327319}]}},\n
{disk_size,1327319},\n            {data_size,1312728},\n
{other,{[{data_size,123640}]}},\n
{instance_start_time,<<\"1555521000894765\">>},\n
{disk_format_version,7},\n
{committed_update_seq,28553},\n
{compacted_seq,28553},\n
{uuid,<<\"4086cc00f08dea80fd962808cf9e37bf\">>}],\n*
*[{db_name,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>},\n
{engine,couch_bt_engine},\n            {doc_count,372},\n
{doc_del_count,0},\n            {update_seq,27988},\n
{purge_seq,0},\n            {compact_running,false},\n
{sizes,{[{active,1287520},{external,121646},{file,1302743}]}},\n
{disk_size,1302743},\n            {data_size,1287520},\n
{other,{[{data_size,121646}]}},\n
{instance_start_time,<<\"1555521000894678\">>},\n
{disk_format_version,7},\n
{committed_update_seq,27988},\n
{compacted_seq,27988},\n
{uuid,<<\"c1327baaacb6ac82b71d2c540648e82d\">>}],\n
{cluster,[{q,8},{n,1},{w,1},{r,1}]}]}}","ref":3845389673}
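
For reference, the calls that hit this timeout are plain HTTP requests
against the database; a minimal sketch of the kind of check we make (host
and port are the defaults, and "mastercom" is the database name visible in
the shard names above):

    curl http://127.0.0.1:5984/mastercom
    curl http://127.0.0.1:5984/mastercom/_all_docs?limit=1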

On the other server, one of the databases kept answering requests
normally, while the other hung with messages like the one above. Logs were
normal for the responsive database and non-existent for the hung one.
While we were trying to access some general-level stats, the still-working
one froze too.
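The stats we were looking at are the standard node-level endpoints, along
these lines (they may not match exactly what our monitoring was calling):

    curl http://127.0.0.1:5984/_up
    curl http://127.0.0.1:5984/_active_tasks
    curl http://127.0.0.1:5984/_node/_local/_stats
    curl http://127.0.0.1:5984/_node/_local/_system
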
Only a hard restart of the Docker containers got them working again. No
data was apparently corrupted or problematic.
The day after, it happened again, although this time both instances were
completely frozen when we found them.
The entire database held only about 2,000 fairly small documents, no
views, and only a couple of indexes for the Mango queries.
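The indexes are plain Mango JSON indexes created through the standard
_index endpoint, roughly like this (the field and index names here are
illustrative, not our real ones):

    curl -X POST http://127.0.0.1:5984/mastercom/_index \
         -H 'Content-Type: application/json' \
         -d '{"index": {"fields": ["type"]}, "name": "type-idx", "type": "json"}'
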
The database that hung on its own the first time, on the second server,
was completely empty.
The main difference between the development environment and the crashing
one is that the latter also has an unfiltered bidirectional replication
set up, in case that is related to the problem; before the crash the
replication worked perfectly, without errors or any strange messages.
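Schematically, the replication consists of two replications, one per
direction, defined as documents in the _replicator database; something
like the sketch below (hostnames are placeholders and the real setup may
differ in detail):

    curl -X POST http://server-a:5984/_replicator \
         -H 'Content-Type: application/json' \
         -d '{"source": "http://server-a:5984/mastercom",
              "target": "http://server-b:5984/mastercom",
              "continuous": true}'

with the mirror-image document created on the other server.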


I realize there may not be enough information here to understand the
cause, but the problem is that it happens unpredictably: it never happened
in months in the development environment, the parameters we checked are
all the same, and I have no idea where to look for signs of what the
trouble might be.

What can I check? Which parameters might be responsible for such a
problem?

Thanks in advance to everybody...
