Hi Carlos,

I think I found a solution. Can you please check my latest comment on
SLING-9118 [1]?

Thanks!
Robert

[1]: https://issues.apache.org/jira/browse/SLING-9118?focusedCommentId=17073183&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17073183

On Thu, 2020-03-26 at 17:46 -0400, Carlos Munoz wrote:
> Hi Robert,
> 
> I've found that it's not as simple. There is still some factor of
> randomness attached to this issue. After doing the bisect more times,
> I've found that commit 0a13d3467aa78b46ec33ae5687418685f90a9e12 seems
> to work *most* of the time. There are still times where I get the
> error, but it is recoverable on the next run.
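
Since the failure is intermittent, a single passing run can mislabel a bad commit as good during the bisect. A minimal sketch of one way around that, in Python: repeat the check several times per commit and report "bad" as soon as any run fails. The function name and the number of runs are assumptions for illustration; the only fixed part is the exit-code convention of `git bisect run` (0 means good, 1 means bad).

```python
from typing import Callable


def bisect_exit_code(check: Callable[[], bool], runs: int = 5) -> int:
    """Return an exit code suitable for `git bisect run`.

    `check` is a caller-supplied (hypothetical) function that starts the
    scenario once and returns True when startup succeeded.
    """
    for _ in range(runs):
        if not check():
            # One failure is enough to call this commit bad.
            return 1
    # Every run passed, so treat the commit as good.
    return 0
```

A wrapper script that ends with `sys.exit(bisect_exit_code(my_check))` could then be handed to `git bisect run`, where `my_check` stands for whatever reproduces the SLING-9118 scenario locally. More runs lower the chance of a false "good" but make each bisect step slower.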
> 
> Carlos
> 
> On Thu, Mar 19, 2020 at 6:21 AM Robert Munteanu <[email protected]>
> wrote:
> 
> > That's good info, thank you! I've added some details to the Jira
> > issue.
> > I tried reverting the commits I suspect are at fault
> > 
> > - https://github.com/apache/sling-org-apache-sling-jcr-base/commit/6f5771a
> > - https://github.com/apache/sling-org-apache-sling-jcr-base/commit/3de2b9f
> > 
> > But that failed due to conflicts. I will try and manually remove
> > the
> > changes and see what that does.
> > Robert
> > 
> > On Wed, 2020-03-18 at 21:24 -0400, Carlos Munoz wrote:
> > > I went through the bisect process and I got the first bad commit:
> > > 
> > > commit bb1e10d97f3c163fb87917ea782afff674050891
> > > Author: Eric Norman <[email protected]>
> > > Date:   Sun Dec 16 12:33:08 2018 -0800
> > > 
> > >     switch to released JCR Base 3.0.6
> > > 
> > > (I tried it a couple of times just to be sure)
> > > 
> > > I tried running our app with the commit before that and I got it
> > > to run. (There are other unrelated problems.)
> > > 
> > > 
> > > On Mon, Mar 16, 2020 at 6:12 PM Robert Munteanu <
> > > [email protected]>
> > > wrote:
> > > 
> > > > Hi Carlos,
> > > > 
> > > > Apologies for the delay ...
> > > > 
> > > > What I was thinking of doing myself, but did not have the time
> > > > for, is the following:
> > > >
> > > > 1. Find a version of Sling for which the scenario in SLING-9118
> > > >    works. Perhaps Sling Starter 11 is a good start.
> > > > 2. Run a `git bisect` check between Sling Starter 11 and the
> > > >    current master branch.
> > > > 
> > > > Assuming my guess is correct, git would say
> > > > 
> > > > Bisecting: 36 revisions left to test after this (roughly 5 steps)
> > > > [c1aedf7b292f7835ceb4e2f56fedcb3294c60756] Update to Tika 1.21
> > > > 
> > > > So not that many steps to test.
> > > > 
> > > > If you manage to isolate the change to the starter that broke
> > > > this, it would make it much easier to understand where the
> > > > problem is coming from.
> > > > 
> > > > Thanks!
> > > > Robert
> > > > 
> > > > On Mon, 2020-03-16 at 16:27 -0400, Carlos Munoz wrote:
> > > > > Hi Robert,
> > > > > 
> > > > > Just a friendly ping about this issue :)
> > > > > 
> > > > > We could try to submit a fix with some potential guidance
> > > > > from you. For example, which of the many Sling bundles
> > > > > should we start looking at?
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Carlos
> > > > > 
> > > > > 
> > > > > On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <
> > > > > [email protected]>
> > > > > wrote:
> > > > > 
> > > > > > Thanks Robert. As always your help is appreciated.
> > > > > > 
> > > > > > On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <
> > > > > > [email protected]
> > > > > > wrote:
> > > > > > 
> > > > > > > Thanks, Ben,
> > > > > > > 
> > > > > > > I added a bit more detail, based on our mailing list
> > > > > > > conversations.
> > > > > > > I'll have limited access in the next two weeks, but if no
> > > > > > > one picks it up I'll look into it when I get back.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Robert
> > > > > > > 
> > > > > > > On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
> > > > > > > > I went ahead and created
> > > > > > > > https://issues.apache.org/jira/browse/SLING-9118
> > > > > > > > for this. Although the ultimate goal here is
> > > > > > > > containerization, I neglected to include any details to
> > > > > > > > that effect in the ticket, since the behavior is
> > > > > > > > reproducible without that being a complicating factor.
> > > > > > > > 
> > > > > > > > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <
> > > > > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> > > > > > > > > > I am following along conceptually - I want to make sure I
> > > > > > > > > > understand what's being described.
> > > > > > > > > > 
> > > > > > > > > > Let's say Sling Instance A starts successfully the first
> > > > > > > > > > time. If we restart Sling Instance A, we expect subsequent
> > > > > > > > > > restarts to also succeed, without removing the sling
> > > > > > > > > > directory.
> > > > > > > > > > Now let's say Sling Instance B does NOT start successfully
> > > > > > > > > > the first time. Despite that, we expect subsequent restarts
> > > > > > > > > > to succeed without removing the sling directory.
> > > > > > > > > > 
> > > > > > > > > > Correct so far?
> > > > > > > > > 
> > > > > > > > > Yes, correct.
> > > > > > > > > 
> > > > > > > > > > Assuming yes... what if this is running in k8s, and k8s
> > > > > > > > > > sees that Sling Instance B did not start successfully, and
> > > > > > > > > > kills the pod (removing all pod resources, including that
> > > > > > > > > > pod's sling directory) in response? Presumably, k8s would
> > > > > > > > > > then start Sling Instance C, which is a fresh instance with
> > > > > > > > > > no sling directory. Are we saying we expect C to have a
> > > > > > > > > > 50/50 chance of starting successfully? Or have we observed
> > > > > > > > > > different behavior?
> > > > > > > > > 
> > > > > > > > > I think that only the first instance starts successfully.
> > > > > > > > > Additional instances will not start unless they have a Sling
> > > > > > > > > directory set up.
> > > > > > > > > 
> > > > > > > > > I've tested with a third instance, once two instances are up,
> > > > > > > > > and it has the exact same behaviour.
> > > > > > > > > 
> > > > > > > > > One workaround that I can suggest for a containerized
> > > > > > > > > environment is to use a supervisor script that detects the
> > > > > > > > > abnormal startup problem and restarts Sling, so that it
> > > > > > > > > starts up successfully.
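
A minimal sketch of the supervisor idea, in Python. Everything in it is an assumption for illustration: the launch command, the `started_ok` health check (for example, probing an HTTP endpoint), and the fixed settle time are hypothetical and not something Sling ships. It relies only on the behaviour reported in this thread, namely that a restart with the 'sling' directory left in place succeeds.

```python
import subprocess
import time
from typing import Callable, Sequence


def supervise(
    command: Sequence[str],
    started_ok: Callable[[], bool],
    max_attempts: int = 3,
    settle_seconds: float = 60.0,
    launch: Callable[[Sequence[str]], "subprocess.Popen"] = subprocess.Popen,
):
    """Launch `command`; if startup looks abnormal, stop it and try again.

    Returns (process, attempts_used). Raises RuntimeError if every attempt
    fails. `started_ok` is a caller-supplied, hypothetical health check.
    """
    for attempt in range(1, max_attempts + 1):
        process = launch(command)
        time.sleep(settle_seconds)  # crude wait; a real script would poll
        if process.poll() is None and started_ok():
            return process, attempt
        # Abnormal startup: stop the instance but keep the 'sling'
        # directory, since the next start is reported to succeed with it.
        if process.poll() is None:
            process.terminate()
            process.wait()
    raise RuntimeError(f"startup still abnormal after {max_attempts} attempts")
```

In a container this would run as the entrypoint, for example `supervise(["java", "-jar", "sling.jar"], started_ok=my_check)`, where both the jar name and `my_check` are placeholders.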
> > > > > > > > > 
> > > > > > > > > Another would be to persist the 'sling' directory as a
> > > > > > > > > per-container volume. Not sure how easy that is with k8s, but
> > > > > > > > > maybe you can use a single ReadWriteMany volume at /sling,
> > > > > > > > > and each pod gets its own ${sling.home} at
> > > > > > > > > /sling/${containerId} (assuming that is exposed through the
> > > > > > > > > downward API).
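
A minimal sketch of the per-pod directory idea, in Python, assuming the pod's identity reaches the container as an environment variable. The variable name `POD_NAME` and the helper itself are hypothetical; the variable would have to be populated by the deployment, for instance from the pod's metadata via the downward API. Note that the directory is only found again after a restart if the identifier is stable, as it is for workloads that keep pod identity such as a StatefulSet.

```python
import os
from pathlib import Path
from typing import Mapping


def sling_home_for_pod(
    shared_root: str = "/sling",
    env: Mapping[str, str] = os.environ,
    id_variable: str = "POD_NAME",  # assumed name, set by the deployment
) -> Path:
    """Return a per-pod directory under the shared volume, creating it if needed."""
    pod_id = env.get(id_variable, "").strip()
    if not pod_id:
        # Falling back to the shared root would make instances share state.
        raise RuntimeError(f"{id_variable} is not set; refusing to share {shared_root}")
    # Reject separators so one pod can never resolve into another pod's directory.
    if "/" in pod_id or pod_id in {".", ".."}:
        raise ValueError(f"unsafe pod identifier: {pod_id!r}")
    home = Path(shared_root) / pod_id
    home.mkdir(parents=True, exist_ok=True)
    return home
```

The returned path would then be used as that instance's Sling home, so no two instances ever write to the same directory.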
> > > > > > > > > 
> > > > > > > > > As these are workarounds, I would still very much like to see
> > > > > > > > > this fixed properly, so please file a bug to track this.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Robert
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Ben
> > > > > > > > > > 
> > > > > > > > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <
> > > > > > > > > > [email protected]
> > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > > Thanks for the information Robert.
> > > > > > > > > > > 
> > > > > > > > > > > To replicate the issue all I needed was a mongodb (I used a
> > > > > > > > > > > full replica set, see my instructions in a previous email
> > > > > > > > > > > about how to get one going using podman) and a single
> > > > > > > > > > > process running sling.
> > > > > > > > > > > 
> > > > > > > > > > > The problem does happen when I do the following:
> > > > > > > > > > > 
> > > > > > > > > > > 2. Start Sling instance A, wait for it to start
> > > > > > > > > > > 3. Stop Sling instance A, wait for it to stop
> > > > > > > > > > > 4. Start Sling instance B - Error
> > > > > > > > > > > 
> > > > > > > > > > > but let me add more
> > > > > > > > > > > 
> > > > > > > > > > > 5. Start Sling Instance A again - Success (note I didn't
> > > > > > > > > > >    remove the sling dir)
> > > > > > > > > > > 6. Start Sling instance B again - Success (note I didn't
> > > > > > > > > > >    remove the sling dir)
> > > > > > > > > > > 
> > > > > > > > > > > This means that even if Sling recreates the sling directory
> > > > > > > > > > > and fails the startup, next time it will succeed.
> > > > > > > > > > > Unfortunately we don't have that luxury in containers
> > > > > > > > > > > because the sling directory is not persisted.
> > > > > > > > > > > 
> > > > > > > > > > > I think this is a bug, but I'll keep playing with it a bit
> > > > > > > > > > > to see if I can find out more.
> > > > > > > > > > > 
> > > > > > > > > > > Carlos
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <
> > > > > > > > > > > [email protected]
> > > > > > > > > > > wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Robert I managed to replicate the issue in a local,
> > > > > > > > > > > > > non-containerized environment (!!!).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The problem seems to be when the database is kept but the
> > > > > > > > > > > > > 'sling' directory is cleared out across restarts (as it is
> > > > > > > > > > > > > for us when the container goes away). As I said before this
> > > > > > > > > > > > > doesn't seem to be a problem with the Sling 11 bundles.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The first basic solution will be to persist the 'sling'
> > > > > > > > > > > > > directory across restarts, and I was wondering if this is a
> > > > > > > > > > > > > bug, or as designed.
> > > > > > > > > > > > 
> > > > > > > > > > > > I think this should work.
> > > > > > > > > > > > 
> > > > > > > > > > > > > I also wonder if once persisted, multiple containers could
> > > > > > > > > > > > > share this directory.
> > > > > > > > > > > > 
> > > > > > > > > > > > This directory can't be shared, as it holds runtime data
> > > > > > > > > > > > related to Sling. For instance, a bundle that is started in
> > > > > > > > > > > > instance A could be starting on instance B.
> > > > > > > > > > > > 
> > > > > > > > > > > > There is at least one file (sling.id) that holds data that
> > > > > > > > > > > > must not be the same between instances.
> > > > > > > > > > > > 
> > > > > > > > > > > > So I would advise marking the directory as container-private
> > > > > > > > > > > > as a first step.
> > > > > > > > > > > > 
> > > > > > > > > > > > Robert
> > > > > > > > > > > > 
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Carlos
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz
> > > > > > > > > > > > > <
> > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks Robert (and once again I can't stress enough how
> > > > > > > > > > > > > > grateful I am for all your help).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Right now we deploy our container with the expectation that
> > > > > > > > > > > > > > the mongo db is the only necessary state we need to keep;
> > > > > > > > > > > > > > everything else is throwaway. This means that a totally new
> > > > > > > > > > > > > > container connected to the mongodb should pick up the state
> > > > > > > > > > > > > > and run the same as the first time it was fired up. Do you
> > > > > > > > > > > > > > think this is an incorrect assumption? If so, what are
> > > > > > > > > > > > > > other pieces of state we should be keeping for subsequent
> > > > > > > > > > > > > > restarts?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This assumption has worked well for us with the current
> > > > > > > > > > > > > > sling 11 release, but it seems to break with the more
> > > > > > > > > > > > > > up-to-date bundles. Perhaps running Sling in a container is
> > > > > > > > > > > > > > just not meant to be.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Carlos
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert
> > > > > > > > > > > > > > Munteanu
> > > > > > > > > > > > > > <
> > > > > > > > > > > > > > [email protected]
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Carlos,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos
> > > > > > > > > > > > > > > Munoz
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level logs
> > > > > > > > > > > > > > > > for every bundle? I tried passing a few configuration
> > > > > > > > > > > > > > > > arguments from the command line but nothing seemed to work.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Try configuring the LogManager to debug at
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Robert
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Carlos
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM
> > > > > > > > > > > > > > > > Bertrand
> > > > > > > > > > > > > > > > Delacretaz <
> > > > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM
> > > > > > > > > > > > > > > > > Carlos
> > > > > > > > > > > > > > > > > Munoz
> > > > > > > > > > > > > > > > > <
> > > > > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > ...Is there a reason why the Jcr repository could be
> > > > > > > > > > > > > > > > > > restarting? And what class could we start looking into to
> > > > > > > > > > > > > > > > > > debug if this is the case?...
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > It's not uncommon to see extra restarts of OSGi components
> > > > > > > > > > > > > > > > > at startup, for various reasons.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The simplest way to detect and log multiple repository
> > > > > > > > > > > > > > > > > startups might be to implement a SlingRepositoryInitializer
> > > > > > > > > > > > > > > > > service [1] that's called at every startup, or use the logs
> > > > > > > > > > > > > > > > > of an existing one like the JCR RepositoryInitializer [2] if
> > > > > > > > > > > > > > > > > that has anything to process in your system.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > -Bertrand
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > [1] https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > > > > > > > > > > [2] https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
> > 
> > 
