[jira] [Updated] (OAK-2683) the "hitting the observation queue limit" problem

Davide Giannella (JIRA) Wed, 01 Jul 2015 06:07:04 -0700

     [ 
https://issues.apache.org/jira/browse/OAK-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davide Giannella updated OAK-2683:
----------------------------------
    Fix Version/s:     (was: 1.3.2)
                   1.3.3

Bulk move to 1.3.3.

> the "hitting the observation queue limit" problem
> -------------------------------------------------
>
>                 Key: OAK-2683
>                 URL: https://issues.apache.org/jira/browse/OAK-2683
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core, mongomk, segmentmk
>            Reporter: Stefan Egli
>              Labels: observation, resilience
>             Fix For: 1.3.3
>
>
> There are several tickets in this area:
> * OAK-2587: threading with observation being too eager, causing the 
> observation queue to grow
> * OAK-2669: avoiding diffing from mongo by using the persistent cache instead
> * OAK-2349: which might be a duplicate of, or at least similar to, OAK-2669
> * OAK-2562: diffcache is inefficient
> Yet I think it makes sense to create this summarizing ticket, describing 
> again what happens when the observation queue hits the limit - and 
> eventually how this can be improved.
> Consider the following scenario (also compare with OAK-2587 - but that one 
> focused more on eagerness of threading):
> * the rate of incoming commits is high and starts to push many changes into 
> the observation queues, hence those queues become somewhat filled/loaded
> * depending on the underlying nodestore used the calculation of diffs is more 
> or less expensive - but at least for mongomk it is important that the diff 
> can be served from the cache
> ** in the case of mongomk it can happen that diffs are no longer found in 
> the cache and thus require a round-trip to mongo - which is of course orders 
> of magnitude slower than the cache. This would cause the queue to grow even 
> faster, as dequeuing now becomes slower.
> ** not sure about tarmk - I believe it should always be fast there
> * so based on the above, there can be a situation where the queue grows and 
> hits the configured limit
> * if this limit is reached, the current mechanism is to collapse any 
> subsequent change into one big change marked as an external event; let's 
> call this a collapsed-change.
> * this collapsed-change now becomes part of the normal queue and eventually 
> would 'walk down the queue' and be processed normally - hence there is a high 
> chance that yet another collapsed-change is created should the queue hit 
> the limit again. This game can now be played for a while, resulting in 
> the queue containing many, or mostly, such collapsed-changes.
> * there is now an additional assumption that the diffing of such collapsed 
> changes is more expensive than normal diffing - plus it is almost guaranteed 
> that the diff cannot, for example, be shared between observation listeners, 
> since the exact 'collapse borders' depend on the timing of each listener's 
> queue - ie the collapse diffs are unique and thus not cacheable.
> * so as a result: once you have those collapse-diffs you can hardly get 
> rid of them - they are heavy to process - hence dequeuing is very slow
> * at the same time, there are likely always some commits happening in a 
> typical system, eg with sling on top you have sling discovery which does 
> heartbeats every now and then. So there are always new commits that add to 
> the load.
> * this will hence create a situation where quite a small additional commit 
> rate can keep all the queues filled - due to the fact that the queue is full 
> with 'heavy collapse diffs' that have to be calculated for each and every 
> listener (of which you could have eg 150-200) individually.
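
The collapsing behaviour described in the scenario above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names (CollapsingQueue, Change, added, poll) are hypothetical and are not Oak's actual classes, and folding into the tail entry is a simplification of "collapse any subsequent change".

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one listener's bounded observation queue (hypothetical, not Oak code). */
final class CollapsingQueue {

    /** A queued change covering the commits fromRev..toRev (inclusive). */
    static final class Change {
        final long fromRev;
        long toRev;
        boolean collapsed; // true once further commits were folded into this entry

        Change(long rev) {
            this.fromRev = rev;
            this.toRev = rev;
        }

        /** Number of commits this entry covers; a larger span means a heavier diff. */
        long span() {
            return toRev - fromRev + 1;
        }
    }

    private final int limit;
    private final Deque<Change> queue = new ArrayDeque<>();

    CollapsingQueue(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
    }

    /** Enqueue a commit; once the queue is full, fold it into the tail entry instead. */
    void added(long rev) {
        if (queue.size() < limit) {
            queue.addLast(new Change(rev));
        } else {
            Change tail = queue.peekLast();
            tail.toRev = rev;
            tail.collapsed = true;
        }
    }

    /** Dequeue the oldest change, or return null if the queue is empty. */
    Change poll() {
        return queue.pollFirst();
    }

    int size() {
        return queue.size();
    }

    /** Number of queued entries that are collapsed-changes. */
    long collapsedCount() {
        return queue.stream().filter(c -> c.collapsed).count();
    }
}
```

With a limit of 3, adding commits 1..5 leaves one collapsed entry covering 3..5. Dequeuing a single entry and adding three more commits produces a second collapsed entry covering 6..8. The borders of each collapsed entry depend on when this particular queue was drained, which is why two listeners draining at different times would end up with different ranges and could not share the resulting diffs.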
> So again, possible solutions for this:
> * OAK-2669: tune diffing via persistent cache
> * OAK-2587: have more threads to remain longer 'in the cache zone'
> * tune your input speed explicitly to avoid filling the observation queues 
> (this would be specific to your use-case of course, but can be seen as 
> explicitly throttling on the input side)
> * increase the relevant caches to the max
> * but I think we will come up with a yet broader improvement of this 
> observation queue limit problem by either
> ** doing flow control - eg via the commit rate limiter (also see OAK-1659)
> ** moving out handling of observation changes to a messaging subsystem - be 
> it to handle local events only (since handling external events makes the 
> system problematic wrt scalability if not done right) - also see 
> [corresponding suggestion on dev 
> list|http://markmail.org/message/b5trr6csyn4zzuj7]
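
The idea of throttling on the input side, or doing flow control, could look roughly like the following. This is a minimal sketch, assuming the committing code can see the current observation queue size; CommitThrottle and its parameters are hypothetical names, and the sketch does not use the API of Oak's commit rate limiter.

```java
/** Hypothetical input-side throttle: delays commits as the observation queue fills up. */
final class CommitThrottle {
    private final int queueLimit;
    private final double threshold;    // fill ratio at which throttling starts, e.g. 0.5
    private final long maxDelayMillis; // delay applied when the queue is at its limit

    CommitThrottle(int queueLimit, double threshold, long maxDelayMillis) {
        if (queueLimit < 1 || threshold < 0.0 || threshold >= 1.0 || maxDelayMillis < 0) {
            throw new IllegalArgumentException("invalid throttle configuration");
        }
        this.queueLimit = queueLimit;
        this.threshold = threshold;
        this.maxDelayMillis = maxDelayMillis;
    }

    /**
     * Delay before the next commit: 0 up to the threshold, then growing
     * linearly until it reaches maxDelayMillis at the queue limit.
     */
    long delayMillis(int queueSize) {
        double fill = Math.min(1.0, (double) queueSize / queueLimit);
        if (fill <= threshold) {
            return 0L;
        }
        double excess = (fill - threshold) / (1.0 - threshold);
        return Math.round(excess * maxDelayMillis);
    }

    /** Block the committing thread for the computed delay, if any. */
    void beforeCommit(int queueSize) throws InterruptedException {
        long delay = delayMillis(queueSize);
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }
}
```

A caller would invoke beforeCommit(currentQueueSize) ahead of each commit. The linear ramp is just one possible choice; the point is to slow the input down before the queue hits the limit, so that collapsed-changes are not created in the first place.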



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
