PS (3) If documents are small, I would also set worker_batch_size = 5000 or 10000, either in the configuration or, better, per individual replication call.
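Sinan's PS about setting worker_batch_size per replication call could look like the sketch below: a small Python helper that builds the JSON body for a one-shot POST to /_replicate. The host names are placeholders, and exactly which replicator options your CouchDB version accepts per call should be checked against its documentation.

```python
import json

def build_replicate_body(source, target, **options):
    """Build the JSON body for a one-shot POST to /_replicate.

    Replicator options passed here (e.g. worker_batch_size) are meant to
    override the [replicator] configuration defaults for this call only.
    """
    body = {"source": source, "target": target}
    body.update(options)
    return json.dumps(body)

# Larger batches for small documents, longer timeout for a slow link
# (placeholder URLs):
body = build_replicate_body(
    "http://source.example.com:5984/mydb",
    "http://target.example.com:5984/mydb",
    worker_batch_size=5000,
    connection_timeout=180000,
)
```

You would then send it with e.g. `curl -X POST http://localhost:5984/_replicate -H 'Content-Type: application/json' -d "$body"`.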
On 6 October 2015 at 14:37, Sinan Gabel <[email protected]> wrote:

> Hi!
>
> I would try a couple of things:
>
> (1) Increase connection_timeout to e.g. 180000 (3 minutes).
>
> (2) Do not use filtered replication on the server sending the data; just
> do a full replication and, if needed, filter the data on the receiving
> side with a script.
>
> Br,
> Sinan
>
> On 6 October 2015 at 13:18, Mike <[email protected]> wrote:
>
>> What sort of connection is there between the two CouchDBs?
>>
>> I was having some problems with an ADSL connection and replication when
>> the line was saturated. I had a play with the replicator settings; below
>> is what sorted my issues:
>>
>> [replicator]
>> db = _replicator
>> ; Maximum replication retry count can be a non-negative integer or "infinity".
>> max_replication_retry_count = 10
>> ; More worker processes can give higher network throughput but can also
>> ; imply more disk and network IO.
>> ;worker_processes = 4
>> worker_processes = 1
>> ; With lower batch sizes checkpoints are done more frequently. Lower batch
>> ; sizes also reduce the total amount of RAM used.
>> ;worker_batch_size = 500
>> worker_batch_size = 50
>> ; Maximum number of HTTP connections per replication.
>> ;http_connections = 20
>> http_connections = 2
>> ; HTTP connection timeout per replication.
>> ; Even for very fast/reliable networks it might need to be increased if a
>> ; remote database is too busy.
>> connection_timeout = 30000
>> ; If a request fails, the replicator will retry it up to N times.
>> retries_per_request = 10
>>
>> On 06/10/2015 12:08, Francesco Zamboni wrote:
>>
>>> Ok, right now I'm getting more and more persuaded (not yet 100% sure,
>>> but at least 80% by now) that all my CouchDB problems come down to
>>> CouchDB being unable to process records that are too big or too
>>> complex, essentially starting a cascade of timeouts while the whole db
>>> hangs indefinitely.
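Sinan's point (2) above, replicate everything and then filter on the receiving side with a script, might look like this sketch. The `wanted` predicate is a purely hypothetical stand-in for whatever the original replication filter checked; the "type" field and its value are invented for illustration.

```python
def wanted(doc):
    # Hypothetical stand-in for the original replication filter:
    # keep only documents of one particular type.
    return doc.get("type") == "bozza"

def docs_to_purge(docs):
    """Given documents pulled in by a full (unfiltered) replication,
    return the ids of those the filter would have excluded, so that a
    clean-up script can delete them on the receiving side."""
    return [d["_id"] for d in docs if not wanted(d)]

sample = [
    {"_id": "a", "type": "bozza"},
    {"_id": "b", "type": "other"},
]
print(docs_to_purge(sample))  # ['b']
```

In practice the documents would come from the target database (e.g. `_all_docs?include_docs=true`), and the returned ids would then be deleted there.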
>>>
>>> Given that, and considering that it would be really inconvenient to
>>> have to somehow trim those records, how can I manage this?
>>> I've already tried playing with parameters like os_process_timeout and
>>> os_process_limit without noticing any change... Are there other
>>> parameters I could try, or some common pitfall I'm not considering?
>>>
>>> Thanks to everybody
>>>
>>> 2015-10-05 12:32 GMT+02:00 Francesco Zamboni <[email protected]>:
>>>
>>>> Just to add some more information, this is the crash report when I try
>>>> to start the replication:
>>>>
>>>>> [info] [<0.8712.8>] X.X.X.X - - POST /_replicate 500
>>>>> [error] [<0.8712.8>] httpd 500 error response:
>>>>>  {"error":"timeout"}
>>>>>
>>>>> [error] [<0.21596.15>] ** Generic server <0.21596.15> terminating
>>>>> ** Last message in was {'EXIT',<0.21595.15>,killed}
>>>>> ** When Server state == {state,"http://www.xxx.xxx:4984/bozze/",20,[],
>>>>>                                [<0.21597.15>],
>>>>>                                {[],[]}}
>>>>> ** Reason for termination ==
>>>>> ** killed
>>>>>
>>>>> =ERROR REPORT==== 5-Oct-2015::10:27:02 ===
>>>>> ** Generic server <0.21596.15> terminating
>>>>> ** Last message in was {'EXIT',<0.21595.15>,killed}
>>>>> ** When Server state == {state,"http://www.xxx.xxx:4984/bozze/",20,[],
>>>>>                                [<0.21597.15>],
>>>>>                                {[],[]}}
>>>>> ** Reason for termination ==
>>>>> ** killed
>>>>> [error] [<0.21596.15>] {error_report,<0.31.0>,
>>>>>     {<0.21596.15>,crash_report,
>>>>>      [[{initial_call,
>>>>>            {couch_replicator_httpc_pool,init,['Argument__1']}},
>>>>>        {pid,<0.21596.15>},
>>>>>        {registered_name,[]},
>>>>>        {error_info,
>>>>>            {exit,killed,
>>>>>                [{gen_server,terminate,7,
>>>>>                     [{file,"gen_server.erl"},{line,804}]},
>>>>>                 {proc_lib,init_p_do_apply,3,
>>>>>                     [{file,"proc_lib.erl"},{line,237}]}]}},
>>>>>        {ancestors,
>>>>>            [<0.21595.15>,couch_replicator_job_sup,
>>>>>             couch_primary_services,couch_server_sup,<0.32.0>]},
>>>>>        {messages,[]},
>>>>>        {links,[<0.21597.15>]},
>>>>>        {dictionary,[]},
>>>>>        {trap_exit,true},
>>>>>        {status,running},
>>>>>        {heap_size,376},
>>>>>        {stack_size,27},
>>>>>        {reductions,178}],
>>>>>       []]}}
>>>>>
>>>>> =CRASH REPORT==== 5-Oct-2015::10:27:02 ===
>>>>>   crasher:
>>>>>     initial call: couch_replicator_httpc_pool:init/1
>>>>>     pid: <0.21596.15>
>>>>>     registered_name: []
>>>>>     exception exit: killed
>>>>>       in function gen_server:terminate/7 (gen_server.erl, line 804)
>>>>>     ancestors: [<0.21595.15>,couch_replicator_job_sup,
>>>>>                 couch_primary_services,couch_server_sup,<0.32.0>]
>>>>>     messages: []
>>>>>     links: [<0.21597.15>]
>>>>>     dictionary: []
>>>>>     trap_exit: true
>>>>>     status: running
>>>>>     heap_size: 376
>>>>>     stack_size: 27
>>>>>     reductions: 178
>>>>>   neighbours:
>>>>
>>>> 2015-10-02 19:16 GMT+02:00 Francesco Zamboni <[email protected]>:
>>>>
>>>>> I did some tests... with a "test" recordset I reproduced the behaviour
>>>>> consistently, by creating huge single objects and then trying to index
>>>>> even a single view.
>>>>>
>>>>> The result is usually a long list of
>>>>>
>>>>> [error] [<0.25586.2>] OS Process Error <0.28141.2> ::
>>>>>  {os_process_error, "OS process timed out."}
>>>>>
>>>>> followed by
>>>>>
>>>>> [info] [<0.17278.2>] 127.0.0.1 - - GET /test/_design/docs/_view/list_1 500
>>>>> [error] [emulator] Error in process <0.25586.2> with exit value:
>>>>>  {{nocatch,{os_process_error,"OS process timed out."}},
>>>>>   [{couch_os_process,prompt,2,[{file,"couch_os_process.erl"},{line,57}]},
>>>>>    {couch_query_servers,map_doc_raw,2,[{file,"couch_query_servers.erl"},{line,88}]},
>>>>>    {couch_mrview_updater...
>>>>>
>>>>> With the "real" db it remains non-deterministic: sometimes it runs
>>>>> easily and quickly, sometimes it locks up completely while doing
>>>>> exactly the same operations over exactly the same data.
>>>>>
>>>>> Some of our records are in fact not small, but if attachments do not
>>>>> count, they are also not really that big ... the biggest are around
>>>>> 100-200k.
>>>>>
>>>>> We also had some hang-ups while creating new filters, but those too
>>>>> seem non-deterministic: you run it, CouchDB freezes and never
>>>>> recovers, then you drop everything, re-create everything exactly the
>>>>> same, and it runs smoothly...
>>>>> I'm trying to obtain some more information from a "real db" crash, but
>>>>> the fact that it happens so randomly, and with an application that,
>>>>> being actively used, needs to be restored ASAP, is frustrating my
>>>>> attempts.
>>>>>
>>>>> One thing we've excluded is the machine/installation: we've moved the
>>>>> database to different machines, with different network configurations,
>>>>> and the behaviour reappears.
>>>>>
>>>>> We're using CouchDB 1.6.1 as a klaemo Docker image on Ubuntu 14.04
>>>>> VMs, but we tried even a physical machine with a packaged installation
>>>>> from scratch.
>>>>>
>>>>> I'll write again if (hopefully when!) I find more... in the meantime,
>>>>> thanks to everybody!
>>>>>
>>>>> 2015-09-29 22:36 GMT+02:00 Sebastian Rothbucher <[email protected]>:
>>>>>
>>>>>> Hi Francesco,
>>>>>>
>>>>>> maybe these two things will help you:
>>>>>> 1.) as Harald pointed out: filtered replication could be a problem.
>>>>>> An initial thought: make sure only one runs at a time. Surely not the
>>>>>> solution in the long run, but it could help figure out where the
>>>>>> problem is.
>>>>>> 2.) Try intercepting the couchjs process to find out more. Maybe it's
>>>>>> always the same (typically huge) document where it hangs (see e.g.
>>>>>> here: https://gist.github.com/sebastianrothbucher/01afe929095a55ab233e).
>>>>>> Generally, looking for huge documents (huge content; attachments
>>>>>> don't count here) might be worthwhile. When you exclude / delete
>>>>>> these temporarily, it might be another lead. Again: not the final
>>>>>> solution, but it helps pin the problem down.
>>>>>>
>>>>>> Good luck, please share what you find - and also let us all know when
>>>>>> we might be able to help.
>>>>>>
>>>>>> Best
>>>>>> Sebastian
>>>>>>
>>>>>> On Tue, Sep 29, 2015 at 10:59 AM, Francesco Zamboni <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> we're having some problems with replication and CouchDB, but as
>>>>>>> we're still quite green with CouchDB I need to ask people with more
>>>>>>> experience even what I should check, as the problem seems to be
>>>>>>> quite random and we haven't even been able to pinpoint a way to
>>>>>>> reproduce it consistently.
>>>>>>> Essentially, using CouchDB 1.6.1, we've uploaded some thousands of
>>>>>>> documents occupying about 10 megabytes of space, more or less, so
>>>>>>> nothing especially big...
>>>>>>> Over these documents we've created a structure of views, lists,
>>>>>>> shows and other functions.
>>>>>>> The problems seem to start when we launch a series of one-shot
>>>>>>> filtered replications of these data to several sub-databases.
>>>>>>> After creating a variable number of replication documents, the
>>>>>>> system seems to hang completely.
>>>>>>> When the system is hung, any attempt to access a view causes a
>>>>>>> crash.
>>>>>>> The only messages are of the "OS process timed out" kind, but we've
>>>>>>> tried increasing the os_process_timeout and os_process_limit
>>>>>>> parameters without any appreciable change.
>>>>>>>
>>>>>>> Obviously this is not enough information to ask where the problem
>>>>>>> is, but as we're new to CouchDB I'd like to ask for some pointers on
>>>>>>> what to check, some common pitfalls that could lead to this kind of
>>>>>>> problem, and so on... we're having serious trouble understanding
>>>>>>> what happened when something goes wrong...
>>>>>>>
>>>>>>> --
>>>>>>> Francesco Zamboni
>>>>>>>
>>>>>>> tel: +39 0522 1590100
>>>>>>> fax: +39 0522 331673
>>>>>>> mob: +39 335 7548422
>>>>>>> e-mail: [email protected]
>>>>>>> web: www.mastertraining.it
>>>>>>>
>>>>>>> Sede Legale: via Timolini, 18 - Correggio (RE) - Italy
>>>>>>> Sede Operativa: via Sani, 15 - Reggio Emilia - Italy
>>>>>>> Sede Commerciale: via Sani, 9 - Reggio Emilia - Italy
>>>>>>> This e-mail and any file transmitted with it is intended only for
>>>>>>> the person or entity to which it is addressed and may contain
>>>>>>> information that is privileged, confidential or otherwise protected
>>>>>>> from disclosure. Copying, dissemination or use of this e-mail or the
>>>>>>> information herein by anyone other than the intended recipient is
>>>>>>> prohibited. If you have received this e-mail by mistake, please
>>>>>>> notify us immediately by telephone or fax.
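Sebastian's advice above about hunting for huge documents (huge content, attachments excluded) can be sketched as a size scan. This assumes the documents have already been fetched, e.g. via `_all_docs?include_docs=true`; the 200 KB threshold is an arbitrary starting point taken from the document sizes mentioned earlier in the thread.

```python
import json

def doc_size(doc):
    """JSON-encoded size in bytes of a document body, excluding
    attachment stubs (so attachments don't count, as Sebastian notes)."""
    body = {k: v for k, v in doc.items() if k != "_attachments"}
    return len(json.dumps(body).encode("utf-8"))

def oversized(docs, threshold=200 * 1024):
    """Return (id, size) pairs for documents above the threshold,
    largest first, as candidates to exclude or delete temporarily."""
    sizes = [(d["_id"], doc_size(d)) for d in docs]
    return sorted((s for s in sizes if s[1] > threshold),
                  key=lambda s: -s[1])
```

Running the suspects it finds through a view one at a time can show whether a particular document is the one making couchjs hang.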
