Hi! I would try a couple of things:
(1) Increase connection_timeout to e.g. 180000 (3 minutes).
(2) Do not use filtered replication on the server sending the data; just do a full replication and, if needed, filter the data on the receiving side with a script.

Br,
Sinan

On 6 October 2015 at 13:18, Mike <[email protected]> wrote:

> What sort of connection is there between the two CouchDBs?
>
> I was having some problems with an ADSL connection and replication when
> the line was saturated - I had a play with the replicator settings; below
> is what sorted my issues:
>
> [replicator]
> db = _replicator
> ; Maximum replication retry count can be a non-negative integer or "infinity".
> max_replication_retry_count = 10
> ; More worker processes can give higher network throughput but can also
> ; imply more disk and network IO.
> ;worker_processes = 4
> worker_processes = 1
> ; With lower batch sizes checkpoints are done more frequently. Lower batch
> ; sizes also reduce the total amount of used RAM memory.
> ;worker_batch_size = 500
> worker_batch_size = 50
> ; Maximum number of HTTP connections per replication.
> ;http_connections = 20
> http_connections = 2
> ; HTTP connection timeout per replication.
> ; Even for very fast/reliable networks it might need to be increased if a
> ; remote database is too busy.
> connection_timeout = 30000
> ; If a request fails, the replicator will retry it up to N times.
> retries_per_request = 10
>
> On 06/10/2015 12:08, Francesco Zamboni wrote:
>
>> Ok, right now I'm getting more and more persuaded (not yet 100% sure, but
>> at least 80% right now) that all my CouchDB problems come down to CouchDB
>> being unable to process records that are too big or too complex,
>> essentially starting a cascade of timeouts while the whole db hangs
>> indefinitely.
>>
>> Given that, and considering that it would be really, really inconvenient
>> to have to somehow trim those records, how can I manage this?
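[Editor's note: Sinan's suggestion (2) above - replicate everything, then filter on the receiving side with a script - could be sketched as below. The `owner` field, the database layout, and the `keep` predicate are purely illustrative assumptions, not something from this thread; the bulk-delete via `_bulk_docs` is standard CouchDB.]

```python
def docs_to_purge(rows, keep):
    """Given the rows of _all_docs?include_docs=true on the receiving
    database, return deletion stubs for every document that fails the
    keep() predicate. Design documents are never touched."""
    doomed = []
    for row in rows:
        doc = row["doc"]
        if doc["_id"].startswith("_design/"):
            continue  # keep design documents regardless of the predicate
        if not keep(doc):
            doomed.append({"_id": doc["_id"], "_rev": doc["_rev"],
                           "_deleted": True})
    return doomed

# Usage sketch (URL and field names are hypothetical), after a full
# replication into the receiving database:
#
#   import json, urllib.request
#   base = "http://localhost:5984/bozze_replica"
#   rows = json.load(urllib.request.urlopen(
#       base + "/_all_docs?include_docs=true"))["rows"]
#   doomed = docs_to_purge(rows, keep=lambda d: d.get("owner") == "site-a")
#   req = urllib.request.Request(
#       base + "/_bulk_docs",
#       data=json.dumps({"docs": doomed}).encode(),
#       headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```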
>> I've already tried playing with parameters like os_process_timeout and
>> os_process_limit without noticing any change... Are there other parameters
>> I could try, or some common pitfall I'm not considering?
>>
>> Thanks to everybody
>>
>> 2015-10-05 12:32 GMT+02:00 Francesco Zamboni <[email protected]>:
>>
>>> Just to add some more information, this is the crash report when I try
>>> to start the replication:
>>>
>>>> [info] [<0.8712.8>] X.X.X.X - - POST /_replicate 500
>>>> [error] [<0.8712.8>] httpd 500 error response:
>>>>  {"error":"timeout"}
>>>>
>>>> [error] [<0.21596.15>] ** Generic server <0.21596.15> terminating
>>>> ** Last message in was {'EXIT',<0.21595.15>,killed}
>>>> ** When Server state == {state,"http://www.xxx.xxx:4984/bozze/",20,[],
>>>>                                [<0.21597.15>],
>>>>                                {[],[]}}
>>>> ** Reason for termination ==
>>>> ** killed
>>>>
>>>> =ERROR REPORT==== 5-Oct-2015::10:27:02 ===
>>>> ** Generic server <0.21596.15> terminating
>>>> ** Last message in was {'EXIT',<0.21595.15>,killed}
>>>> ** When Server state == {state,"http://www.xxx.xxx:4984/bozze/",20,[],
>>>>                                [<0.21597.15>],
>>>>                                {[],[]}}
>>>> ** Reason for termination ==
>>>> ** killed
>>>> [error] [<0.21596.15>] {error_report,<0.31.0>,
>>>>     {<0.21596.15>,crash_report,
>>>>      [[{initial_call,
>>>>            {couch_replicator_httpc_pool,init,['Argument__1']}},
>>>>        {pid,<0.21596.15>},
>>>>        {registered_name,[]},
>>>>        {error_info,
>>>>            {exit,killed,
>>>>                [{gen_server,terminate,7,
>>>>                     [{file,"gen_server.erl"},{line,804}]},
>>>>                 {proc_lib,init_p_do_apply,3,
>>>>                     [{file,"proc_lib.erl"},{line,237}]}]}},
>>>>        {ancestors,
>>>>            [<0.21595.15>,couch_replicator_job_sup,
>>>>             couch_primary_services,couch_server_sup,<0.32.0>]},
>>>>        {messages,[]},
>>>>        {links,[<0.21597.15>]},
>>>>        {dictionary,[]},
>>>>        {trap_exit,true},
>>>>        {status,running},
>>>>        {heap_size,376},
>>>>        {stack_size,27},
>>>>        {reductions,178}],
>>>>       []]}}
>>>>
>>>> =CRASH REPORT==== 5-Oct-2015::10:27:02 ===
>>>>   crasher:
>>>>     initial call: couch_replicator_httpc_pool:init/1
>>>>     pid: <0.21596.15>
>>>>     registered_name: []
>>>>     exception exit: killed
>>>>       in function gen_server:terminate/7 (gen_server.erl, line 804)
>>>>     ancestors: [<0.21595.15>,couch_replicator_job_sup,
>>>>                 couch_primary_services,couch_server_sup,<0.32.0>]
>>>>     messages: []
>>>>     links: [<0.21597.15>]
>>>>     dictionary: []
>>>>     trap_exit: true
>>>>     status: running
>>>>     heap_size: 376
>>>>     stack_size: 27
>>>>     reductions: 178
>>>>     neighbours:
>>>
>>> 2015-10-02 19:16 GMT+02:00 Francesco Zamboni <[email protected]>:
>>>
>>>> I did some tests... with a "test" recordset I reproduced the behaviour
>>>> consistently, by creating huge single objects and then trying to index
>>>> even a single view.
>>>>
>>>> The result is usually a long list of
>>>> [error] [<0.25586.2>] OS Process Error <0.28141.2> :: {os_process_error,
>>>>     "OS process timed out."}
>>>> followed by
>>>> [info] [<0.17278.2>] 127.0.0.1 - - GET /test/_design/docs/_view/list_1 500
>>>> [error] [emulator] Error in process <0.25586.2> with exit value:
>>>>     {{nocatch,{os_process_error,"OS process timed out."}},
>>>>      [{couch_os_process,prompt,2,[{file,"couch_os_process.erl"},{line,57}]},
>>>>       {couch_query_servers,map_doc_raw,2,[{file,"couch_query_servers.erl"},{line,88}]},
>>>>       {couch_mrview_updater...
>>>>
>>>> With the "real" db it remains non-deterministic: sometimes it runs
>>>> easily and quickly, sometimes it locks up completely while doing exactly
>>>> the same operations over exactly the same data.
>>>>
>>>> Some of our records are in fact not small, but if attachments do not
>>>> count, they're also not really that big ... the biggest are around
>>>> 100-200k.
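[Editor's note: size estimates like the 100-200k figure above can be checked mechanically by ranking documents by their serialized JSON size, excluding attachments. A minimal sketch; fetching rows via `_all_docs?include_docs=true` is standard CouchDB, but the URL in the comments is an assumption.]

```python
import json

def json_size(doc):
    """Bytes of the document body as compact JSON, ignoring attachment
    data (the thread suggests attachments do not count here)."""
    body = {k: v for k, v in doc.items() if k != "_attachments"}
    return len(json.dumps(body, separators=(",", ":")).encode("utf-8"))

def largest_docs(rows, top=10):
    """rows: the 'rows' list from _all_docs?include_docs=true.
    Returns the top (size, id) pairs, biggest first."""
    sized = [(json_size(r["doc"]), r["doc"]["_id"]) for r in rows]
    return sorted(sized, reverse=True)[:top]

# Usage sketch (the base URL is hypothetical):
#   import urllib.request
#   base = "http://localhost:5984/bozze"
#   rows = json.load(urllib.request.urlopen(
#       base + "/_all_docs?include_docs=true"))["rows"]
#   for size, doc_id in largest_docs(rows):
#       print(size, doc_id)
```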
>>>> We also had some hang-ups while creating new filters, but those too
>>>> seem non-deterministic: you run it, CouchDB freezes and never recovers,
>>>> then you drop everything, re-create everything exactly the same, and it
>>>> runs smoothly...
>>>> I'm trying to obtain some more information from a "real db" crash, but
>>>> the fact that it happens so randomly, and with an application that,
>>>> being actively used, needs to be restored ASAP, is frustrating my
>>>> attempts.
>>>>
>>>> One thing we've excluded is the machine/installation: we've moved the
>>>> database over different machines, with different network configurations,
>>>> and the behaviour re-appears.
>>>>
>>>> We're using CouchDB 1.6.1 as a klaemo docker image over Ubuntu 14.04
>>>> VMs, but we tried even a physical machine with a packaged installation
>>>> from scratch.
>>>>
>>>> I'll write again if (hopefully when!) I find more... in the meantime
>>>> thanks to everybody!
>>>>
>>>> 2015-09-29 22:36 GMT+02:00 Sebastian Rothbucher <[email protected]>:
>>>>
>>>>> Hi Francesco,
>>>>>
>>>>> maybe these two things will help you:
>>>>> 1.) as Harald pointed out: filtered replication could be a problem. An
>>>>> initial thought: make sure only one runs at a time. Surely not the
>>>>> solution in the long run, but it could help figure out where the
>>>>> problem is.
>>>>> 2.) Try intercepting the couchjs process to find out more. Maybe it's
>>>>> always the same (typically huge) document where it hangs (see e.g. here:
>>>>> https://gist.github.com/sebastianrothbucher/01afe929095a55ab233e).
>>>>> Generally, looking for huge documents (huge content; attachments don't
>>>>> count here) might be worthwhile. When you exclude / delete these
>>>>> temporarily, it might be another lead. Again: not the final solution,
>>>>> but it helps pinning it down.
>>>>>
>>>>> Good luck, pls.
>>>>> share what you found - and also let us all know when we might be able
>>>>> to help.
>>>>>
>>>>> Best
>>>>> Sebastian
>>>>>
>>>>> On Tue, Sep 29, 2015 at 10:59 AM, Francesco Zamboni <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>> we're having some problems with replication and CouchDB, but as we're
>>>>>> still quite green with CouchDB I need to ask people with more
>>>>>> experience even what to check, as the problem seems to be quite
>>>>>> random and we've not been able even to pinpoint a way to consistently
>>>>>> reproduce it.
>>>>>> Essentially, using CouchDB 1.6.1, we've uploaded some thousands of
>>>>>> documents occupying about 10 megabytes of space, more or less, so
>>>>>> nothing especially big...
>>>>>> Over these documents we've created a structure of views, lists, shows
>>>>>> and other functions.
>>>>>> The problems seem to start when we try to launch a series of one-shot
>>>>>> filtered replications of these data over several sub-databases.
>>>>>> After creating a variable number of replication documents, the system
>>>>>> seems to completely hang.
>>>>>> When the system is hung, any attempt to access a view causes a crash.
>>>>>> The only messages are of the "OS process timed out" kind, but we've
>>>>>> tried to increase the os_process_timeout and os_process_limit
>>>>>> parameters without any appreciable change.
>>>>>>
>>>>>> Obviously this is not enough information to ask where the problem is,
>>>>>> but as we're new to CouchDB, I'd like to ask for some pointers on what
>>>>>> to check, some common pitfalls that could lead to this kind of problem
>>>>>> and so on... we're having serious trouble understanding what happened
>>>>>> when something goes wrong...
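[Editor's note: for reference, the kind of one-shot filtered replication described above is triggered by POSTing a JSON body to the server's /_replicate endpoint. A minimal sketch of building that body; the database and filter names used in the test and usage comment are hypothetical.]

```python
import json

def replicate_body(source, target, filter_name=None, query_params=None):
    """Build the JSON body for a one-shot replication, i.e. the payload
    of POST /_replicate. filter_name is 'designdoc/filtername' for a
    filtered replication; omitting 'continuous' makes it one-shot."""
    body = {"source": source, "target": target}
    if filter_name:
        body["filter"] = filter_name
    if query_params:
        body["query_params"] = query_params
    return json.dumps(body)

# Usage sketch (server URL and names are assumptions):
#   curl -X POST http://localhost:5984/_replicate \
#        -H 'Content-Type: application/json' \
#        -d '{"source": "bozze", "target": "bozze_sub1",
#             "filter": "docs/by_owner", "query_params": {"owner": "sub1"}}'
```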
>>>>>>
>>>>>> --
>>>>>> Francesco Zamboni
>>>>>>
>>>>>> tel: +39 0522 1590100
>>>>>> fax: +39 0522 331673
>>>>>> mob: +39 335 7548422
>>>>>> e-mail: [email protected]
>>>>>> web: www.mastertraining.it
>>>>>>
>>>>>> Sede Legale: via Timolini, 18 - Correggio (RE) - Italy
>>>>>> Sede Operativa: via Sani, 15 - Reggio Emilia - Italy
>>>>>> Sede Commerciale: via Sani, 9 - Reggio Emilia - Italy
>>>>>> This e-mail and any file transmitted with it is intended only for the
>>>>>> person or entity to which it is addressed and may contain information
>>>>>> that is privileged, confidential or otherwise protected from
>>>>>> disclosure. Copying, dissemination or use of this e-mail or the
>>>>>> information herein by anyone other than the intended recipient is
>>>>>> prohibited. If you have received this e-mail by mistake, please
>>>>>> notify us immediately by telephone or fax.
>>>>
>>>
>>
>
