Hi Carlos,

Apologies for the delay ...

What I was thinking of doing myself, but did not have the time for, is
the following:

1. Find a version of Sling for which the scenario in SLING-9118 works.
Perhaps Sling Starter 11 is a good start.
2. Run a `git bisect` check between Sling Starter 11 and the current
master branch.

Assuming my guess is correct, git would say

Bisecting: 36 revisions left to test after this (roughly 5 steps)
[c1aedf7b292f7835ceb4e2f56fedcb3294c60756] Update to Tika 1.21

So not that many steps to test.
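
The bisect can also be automated with `git bisect run`, which runs a
command at each step and reads its exit code: 0 means good, 125 means
"skip this revision", and any other code from 1 to 127 means bad. Below
is a rough sketch of such a check script. The Maven command line and the
`reproduce_sling_9118` helper are assumptions on my side; the helper is
hypothetical and would have to be written to run the scenario from the
ticket.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def build_ok(cmd=("mvn", "-q", "clean", "package", "-DskipTests")):
    """Return True if the build command exits with 0 (command is an assumption)."""
    try:
        return subprocess.run(cmd).returncode == 0
    except OSError:
        return False  # build tool missing or not executable


def reproduce_sling_9118():
    """Hypothetical helper: return True if the SLING-9118 scenario fails."""
    raise NotImplementedError("run the scenario described in SLING-9118")


def verdict(built, broken):
    """Map the outcome to an exit code: unbuildable -> skip, broken -> bad."""
    if not built:
        return SKIP
    return BAD if broken else GOOD


def main(build=build_ok, reproduce=reproduce_sling_9118):
    """Build the current revision, then run the scenario only if it built."""
    built = build()
    return verdict(built, built and reproduce())
```

Saved as `check.py` with a final line `raise SystemExit(main())`, it
could be driven with `git bisect start`, `git bisect bad master`,
`git bisect good <sling-11-tag>` (placeholder for whatever the Sling
Starter 11 tag is called) and then `git bisect run python3 check.py`.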

If you manage to isolate the change to the starter that broke this, it
would be much easier to understand where the problem is coming from.

Thanks!
Robert

On Mon, 2020-03-16 at 16:27 -0400, Carlos Munoz wrote:
> Hi Robert,
> 
> Just a friendly ping about this issue :)
> 
> We could try to submit a fix with some potential guidance from you.
> For
> example, which of the many Sling bundles should we start looking at?
> 
> Regards,
> 
> Carlos
> 
> 
> On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <camu...@redhat.com>
> wrote:
> 
> > Thanks Robert. As always your help is appreciated.
> > 
> > On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <romb...@apache.org
> > >
> > wrote:
> > 
> > > Thanks, Ben,
> > > 
> > > I added a bit more detail, based on our mailing list
> > > conversations.
> > > I'll have limited access in the next two weeks, but if no one
> > > picks it
> > > up I'll look into it when I get back.
> > > 
> > > Thanks,
> > > Robert
> > > 
> > > On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
> > > > I went ahead and created
> > > > https://issues.apache.org/jira/browse/SLING-9118
> > > > for this. Although the ultimate goal here is containerization,
> > > > I
> > > > neglected
> > > > to include any details to that effect in the ticket, since the
> > > > behavior is
> > > > reproducible without that being a complicating factor.
> > > > 
> > > > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <
> > > > romb...@apache.org>
> > > > wrote:
> > > > 
> > > > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> > > > > > I am following along conceptually - I want to make sure I
> > > > > > understand
> > > > > > what's
> > > > > > being described.
> > > > > > 
> > > > > > Let's say Sling Instance A starts successfully the first
> > > > > > time. If
> > > > > > we
> > > > > > restart Sling Instance A, we expect subsequent restarts to
> > > > > > also
> > > > > > succeed,
> > > > > > without removing the sling directory.
> > > > > > Now let's say Sling Instance B does NOT start successfully
> > > > > > the
> > > > > > first
> > > > > > time.
> > > > > > Despite that, we expect subsequent restarts to succeed
> > > > > > without
> > > > > > removing the
> > > > > > sling directory.
> > > > > > 
> > > > > > Correct so far?
> > > > > 
> > > > > Yes, correct.
> > > > > 
> > > > > > Assuming yes... what if this is running in k8s, and k8s
> > > > > > sees that
> > > > > > Sling
> > > > > > Instance B did not start successfully, and kills the pod
> > > > > > (removing
> > > > > > all pod
> > > > > > resources, including that pod's sling directory) in
> > > > > > response?
> > > > > > Presumably,
> > > > > > k8s would then start Sling Instance C, which is a fresh
> > > > > > instance
> > > > > > with
> > > > > > no
> > > > > > sling directory. Are we saying we expect C to have a 50/50
> > > > > > chance
> > > > > > of
> > > > > > starting successfully? Or have we observed different
> > > > > > behavior?
> > > > > 
> > > > > I think that only the first instance starts successfully.
> > > > > Additional
> > > > > instances will not start unless they have a Sling directory
> > > > > set up.
> > > > > 
> > > > > I've tested with a third instance, once two instances are up,
> > > > > and
> > > > > it
> > > > > has the exact same behaviour.
> > > > > 
> > > > > One workaround that I can suggest for a containerized
> > > > > environment
> > > > > is to
> > > > > use a supervisor script that detects the abnormal startup
> > > > > problem
> > > > > and
> > > > > restarts Sling, so that it starts up successfully.
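
A minimal sketch of the supervisor idea above, assuming the caller
supplies three callables: `start`, `is_healthy` and `stop`. All three are
hypothetical, since how Sling is launched and how the abnormal startup is
detected depends on the deployment.

```python
import time


def supervise(start, is_healthy, stop, max_attempts=3, wait_seconds=0.0):
    """Start Sling and restart it if the startup is not healthy.

    Returns the number of the attempt that succeeded, or 0 if every
    attempt failed. The sling directory is left in place between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        time.sleep(wait_seconds)  # give Sling time to come up before checking
        if is_healthy():
            return attempt
        stop()  # abnormal startup: stop and try again
    return 0
```

Because the retry happens inside the same container, the sling directory
created by the failed first start is still there for the second one,
which is the case reported in this thread as succeeding.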
> > > > > 
> > > > > Another would be to persist the 'sling' directory as a per-
> > > > > container
> > > > > volume. Not sure how easy that is with k8s, but maybe you can
> > > > > use a
> > > > > single ReadWriteMany volume at /sling, and each pod gets
> > > > > their own
> > > > > ${sling.home} at /sling/${containerId} ( assuming that is
> > > > > exposed
> > > > > through the downward API).
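
A small sketch of the per-pod directory idea above, assuming the pod name
is exposed to the container through the downward API (a `fieldRef` on
`metadata.name`) as an environment variable. The variable name `POD_NAME`
and the `/sling` mount point are assumptions.

```python
import os
from pathlib import PurePosixPath


def sling_home(base="/sling", env=os.environ, var="POD_NAME"):
    """Return a per-pod Sling home below a shared volume mount."""
    pod = env.get(var, "")
    if not pod or "/" in pod:
        # refuse empty or nested values so pods cannot share a directory
        raise ValueError(f"{var} must be set to a single path segment")
    return str(PurePosixPath(base) / pod)
```

The result would then be used as ${sling.home}. This only helps if the
pod name is stable across restarts, as it is with a StatefulSet; with a
plain Deployment a replacement pod gets a new name and therefore an empty
directory.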
> > > > > 
> > > > > > As these are workarounds, I would still very much like to
> > > > > see this
> > > > > fixed properly, so please file a bug to track this.
> > > > > 
> > > > > Thanks,
> > > > > Robert
> > > > > 
> > > > > > Thanks,
> > > > > > Ben
> > > > > > 
> > > > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <
> > > > > > > camu...@redhat.com>
> > > > > > wrote:
> > > > > > 
> > > > > > > Thanks for the information Robert.
> > > > > > > 
> > > > > > > To replicate the issue all I needed was a mongodb (I used
> > > > > > > a
> > > > > > > full
> > > > > > > replica
> > > > > > > set, see my instructions in a previous email about how to
> > > > > > > get
> > > > > > > one
> > > > > > > going
> > > > > > > using podman) and a single process running sling.
> > > > > > > 
> > > > > > > The problem does happen when I do the following:
> > > > > > > 
> > > > > > > 2. Start Sling instance A, wait for it to start
> > > > > > > 3. Stop Sling instance A, wait for it to stop
> > > > > > > 4. Start Sling instance B - Error
> > > > > > > 
> > > > > > > but let me add more
> > > > > > > 
> > > > > > > 5. Start Sling Instance A again - Success (note I didn't
> > > > > > > remove
> > > > > > > the
> > > > > > > sling
> > > > > > > dir)
> > > > > > > 6. Start Sling instance B again - Success (note I didn't
> > > > > > > remove
> > > > > > > the
> > > > > > > sling
> > > > > > > dir)
> > > > > > > 
> > > > > > > this means that even if Sling recreates the sling
> > > > > > > directory and
> > > > > > > fails the
> > > > > > > startup, next time it will succeed. Unfortunately we
> > > > > > > don't have
> > > > > > > that luxury
> > > > > > > in containers because the sling directory is not
> > > > > > > persisted.
> > > > > > > 
> > > > > > > I think this is a bug, but I'll keep playing with it a
> > > > > > > bit to
> > > > > > > see
> > > > > > > if I can
> > > > > > > find out more.
> > > > > > > 
> > > > > > > Carlos
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <
> > > > > > > > romb...@apache.org>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > > > > > > > Robert I managed to replicate the issue in a local,
> > > > > > > > > non-
> > > > > > > > > containerized
> > > > > > > > > environment (!!!).
> > > > > > > > > 
> > > > > > > > > The problem seems to be when the database is kept but
> > > > > > > > > the
> > > > > > > > > 'sling'
> > > > > > > > > directory
> > > > > > > > > is cleared out across restarts (as it is for us when
> > > > > > > > > the
> > > > > > > > > container
> > > > > > > > > goes
> > > > > > > > > away). As I said before this doesn't seem to be a
> > > > > > > > > problem
> > > > > > > > > with
> > > > > > > > > the
> > > > > > > > > Sling 11
> > > > > > > > > bundles.
> > > > > > > > > 
> > > > > > > > > The first basic solution will be to persist the
> > > > > > > > > 'sling'
> > > > > > > > > directory
> > > > > > > > > across
> > > > > > > > > restarts, and I was wondering if this is a bug, or as
> > > > > > > > > designed.
> > > > > > > > 
> > > > > > > > I think this should work.
> > > > > > > > 
> > > > > > > > > I also wonder if once persisted, multiple containers
> > > > > > > > > could
> > > > > > > > > share this
> > > > > > > > > directory.
> > > > > > > > 
> > > > > > > > This directory can't be shared, as it holds runtime
> > > > > > > > data
> > > > > > > > related
> > > > > > > > to
> > > > > > > > Sling. For instance, a bundle that is started in
> > > > > > > > instance A
> > > > > > > > could
> > > > > > > > be
> > > > > > > > starting on instance B.
> > > > > > > > 
> > > > > > > > There is at least one file ( sling.id ) that holds data
> > > > > > > > that
> > > > > > > > must
> > > > > > > > not
> > > > > > > > be the same between instances.
> > > > > > > > 
> > > > > > > > > So I would advise marking the directory as
> > > > > > > > > container-private as a first step.
> > > > > > > > 
> > > > > > > > Robert
> > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > 
> > > > > > > > > Carlos
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <
> > > > > > > > > camu...@redhat.com>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > > Thanks Robert (and once again I can't stress enough
> > > > > > > > > > how
> > > > > > > > > > grateful I
> > > > > > > > > > am for
> > > > > > > > > > all your help).
> > > > > > > > > > 
> > > > > > > > > > Right now we deploy our container with the
> > > > > > > > > > expectation
> > > > > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > mongo db
> > > > > > > > > > is the only necessary state we need to keep;
> > > > > > > > > > everything
> > > > > > > > > > else
> > > > > > > > > > is
> > > > > > > > > > throwaway.
> > > > > > > > > > This means that a totally new container connected
> > > > > > > > > > to the
> > > > > > > > > > mongodb
> > > > > > > > > > should
> > > > > > > > > > pick up the state and run the same as the first
> > > > > > > > > > time it
> > > > > > > > > > was
> > > > > > > > > > fired
> > > > > > > > > > up. Do
> > > > > > > > > > you think this is an incorrect assumption? If so,
> > > > > > > > > > what
> > > > > > > > > > are
> > > > > > > > > > other
> > > > > > > > > > pieces of
> > > > > > > > > > state we should be keeping for subsequent restarts?
> > > > > > > > > > 
> > > > > > > > > > This assumption has worked well for us with the
> > > > > > > > > > current
> > > > > > > > > > sling
> > > > > > > > > > 11
> > > > > > > > > > release,
> > > > > > > > > > but it seems to break with the more up-to-date
> > > > > > > > > > bundles.
> > > > > > > > > > Perhaps
> > > > > > > > > > running
> > > > > > > > > > Sling in a container is just not meant to be.
> > > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > 
> > > > > > > > > > Carlos
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <
> > > > > > > > > > > romb...@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > > Hi Carlos,
> > > > > > > > > > > 
> > > > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Thanks Bertrand. How can I run Sling with
> > > > > > > > > > > > DEBUG-level
> > > > > > > > > > > > logs for
> > > > > > > > > > > > every
> > > > > > > > > > > > bundle? I tried passing a few configuration
> > > > > > > > > > > > arguments
> > > > > > > > > > > > from the
> > > > > > > > > > > > command line
> > > > > > > > > > > > but nothing seemed to work.
> > > > > > > > > > > 
> > > > > > > > > > > Try configuring the LogManager to debug at
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Robert
> > > > > > > > > > > 
> > > > > > > > > > > > Carlos
> > > > > > > > > > > > 
> > > > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand
> > > > > > > > > > > > Delacretaz <
> > > > > > > > > > > > bdelacre...@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz
> > > > > > > > > > > > > <
> > > > > > > > > > > > > camu...@redhat.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > ...Is there a reason why the Jcr repository
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > restarting?
> > > > > > > > > > > > > > And what
> > > > > > > > > > > > > > class could we start looking into to debug
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > this is
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > case?...
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It's not uncommon to see extra restarts of
> > > > > > > > > > > > > OSGi
> > > > > > > > > > > > > components at
> > > > > > > > > > > > > startup,
> > > > > > > > > > > > > for various reasons.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The simplest way to detect and log multiple
> > > > > > > > > > > > > repository
> > > > > > > > > > > > > startups
> > > > > > > > > > > > > might
> > > > > > > > > > > > > be to implement a SlingRepositoryInitializer
> > > > > > > > > > > > > service
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > > that's
> > > > > > > > > > > > > called
> > > > > > > > > > > > > at every startup, or use the logs of an
> > > > > > > > > > > > > existing
> > > > > > > > > > > > > one
> > > > > > > > > > > > > like the
> > > > > > > > > > > > > JCR
> > > > > > > > > > > > > RepositoryInitializer [2] if that has
> > > > > > > > > > > > > anything to
> > > > > > > > > > > > > process in
> > > > > > > > > > > > > your
> > > > > > > > > > > > > system.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > -Bertrand
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > > 
> > > https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > > > > > > [2]
> > > > > > > > > > > > > 
> > > https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
> > > > > 
