[jira] [Resolved] (SLING-3432) pseudo network partition causes job deserialization issue in a cluster (when reading while job is being reassigned)

Stefan Egli (JIRA) Thu, 04 Feb 2016 01:24:54 -0800

     [ https://issues.apache.org/jira/browse/SLING-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stefan Egli resolved SLING-3432.
--------------------------------
       Resolution: Fixed
         Assignee: Stefan Egli
    Fix Version/s:     (was: Discovery Impl 1.2.8)
                   Discovery Impl 1.2.2
                   Discovery Oak 1.2.2

Marking as fixed. Resolution is as follows:
* discovery.oak addresses network partitioning by relying on a deterministic 
storage (the DocumentStore), a reliable lease mechanism and lease timeouts. If 
an instance doesn't update its lease in time, the other instances remove it 
from the cluster and the local instance shuts itself down. Under no 
circumstance would discovery.oak, together with the lease check in oak, allow 
a pseudo network partitioning situation. So for discovery.oak this is handled 
fine.
* discovery.impl: all issues except the mentioned SLING-4640 are addressed. As 
mentioned, SLING-4640 will not be fixed for discovery.impl as it's not feasible. 
However, SLING-5195 and SLING-5280 add safety checks that try to help with 
large repository delays too. Still, if the repository delays are very 
asymmetric (i.e. reads are very slow for one instance while writes are fast), 
then SLING-4640 can still happen. To address those issues, the recommendation 
is to switch to discovery.oak.
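
To illustrate the lease rule described for discovery.oak, here is a minimal sketch. It is not the actual discovery.oak or Oak code: the class name, method names and the timeout value used in main are all hypothetical, and a real implementation would read the lease timestamps from the shared store rather than take them as parameters.

```java
// Hypothetical sketch of a lease-timeout rule; not taken from discovery.oak or Oak.
final class LeaseSketch {

    // What an instance decides after checking a lease.
    enum Decision { KEEP, REMOVE_FROM_CLUSTER, SHUT_DOWN }

    // A lease counts as expired once more than timeoutMillis has passed since its last renewal.
    static boolean isExpired(long lastRenewalMillis, long nowMillis, long timeoutMillis) {
        return nowMillis - lastRenewalMillis > timeoutMillis;
    }

    // An expired lease of another instance leads to removing that instance from the
    // cluster view; an expired lease of the local instance leads to shutting down.
    static Decision decide(boolean leaseIsLocal, long lastRenewalMillis,
                           long nowMillis, long timeoutMillis) {
        if (!isExpired(lastRenewalMillis, nowMillis, timeoutMillis)) {
            return Decision.KEEP;
        }
        return leaseIsLocal ? Decision.SHUT_DOWN : Decision.REMOVE_FROM_CLUSTER;
    }

    public static void main(String[] args) {
        long timeoutMillis = 120_000L; // assumed value, for illustration only
        System.out.println(decide(false, 0L, 60_000L, timeoutMillis));
        System.out.println(decide(false, 0L, 130_000L, timeoutMillis));
        System.out.println(decide(true, 0L, 130_000L, timeoutMillis));
    }
}
```

Because both sides apply the same rule to the same stored lease, they should not end up disagreeing about whether an instance is still part of the cluster, which is the property the resolution relies on.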

> pseudo network partition causes job deserialization issue in a cluster (when 
> reading while job is being reassigned)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-3432
>                 URL: https://issues.apache.org/jira/browse/SLING-3432
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.0.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Discovery Oak 1.2.2, Discovery Impl 1.2.2
>
>
> There is a race condition between two instances in a cluster (e.g. oak or crx): 
> Instance 1 is writing a job with a binary property, instance 2 is reading the 
> job (likely triggered by discovery sending it a topologychangedevent). It 
> looks like instance 2 is reading the job while instance 1 is still in the 
> process of writing the job, or at least the binary. 
> This results in the following exception:
> 04.03.2014 06:55:39.667 *WARN* [Apache Sling Job Background Loader] org.apache.sling.event.impl.jobs.JobManagerImpl Unable to read job from /var/eventing/jobs/assigned/e4337f8f-47d2-41df-b3ab-0d40b1b2acd4/slingevent:eventadmin/2014/3/3/8/45/cq.wcm.msm.job.pageEvent_9718d7db-85b4-4930-a2ba-11a80d772970_172
> java.lang.Exception: Unable to deserialize property 'pageEvent'
>         at org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:213)
>         at org.apache.sling.event.impl.jobs.JobManagerImpl.readJob(JobManagerImpl.java:538)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobInTheBackground(BackgroundLoader.java:318)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobsInTheBackground(BackgroundLoader.java:294)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.run(BackgroundLoader.java:203)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException: null
>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2280)
>         at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2749)
>         at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:779)
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:279)
>         at org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:208)
>         ... 5 common frames omitted
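
The EOFException in the trace above is raised in the ObjectInputStream constructor while it reads the stream header. That is what the JDK does when the reader sees fewer bytes than the header needs, for example a binary whose content is not yet visible. The following is a minimal, standalone sketch of that JDK behaviour only. It does not use Sling or the repository, the class and method names are hypothetical, and the empty byte array merely stands in for a not-yet-written binary property.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch: opening an ObjectInputStream on incomplete data.
final class PartialReadSketch {

    // Serializes a value the way a writer would once it has finished.
    static byte[] serialize(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if merely opening the stream fails with EOFException,
    // i.e. the bytes visible to the reader do not even contain the stream header.
    static boolean failsWithEofOnOpen(byte[] visibleBytes) {
        try {
            new ObjectInputStream(new ByteArrayInputStream(visibleBytes)).close();
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] complete = serialize("pageEvent");
        byte[] notYetWritten = new byte[0]; // stands in for a binary read too early

        System.out.println("complete data fails on open: " + failsWithEofOnOpen(complete));
        System.out.println("empty data fails on open: " + failsWithEofOnOpen(notYetWritten));
    }
}
```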



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
