On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> I am following along conceptually - I want to make sure I understand
> what's
> being described.
> 
> Let's say Sling Instance A starts successfully the first time. If we
> restart Sling Instance A, we expect subsequent restarts to also
> succeed,
> without removing the sling directory.
> Now let's say Sling Instance B does NOT start successfully the first
> time.
> Despite that, we expect subsequent restarts to succeed without
> removing the
> sling directory.
> 
> Correct so far?

Yes, correct.

> 
> Assuming yes... what if this is running in k8s, and k8s sees that
> Sling
> Instance B did not start successfully, and kills the pod (removing
> all pod
> resources, including that pod's sling directory) in response?
> Presumably,
> k8s would then start Sling Instance C, which is a fresh instance with
> no
> sling directory. Are we saying we expect C to have a 50/50 chance of
> starting successfully? Or have we observed different behavior?

I think that only the first instance starts successfully. Additional
instances will not start unless they have a Sling directory set up.

I've tested with a third instance, once two instances are up, and it
has the exact same behaviour.

One workaround that I can suggest for a containerized environment is to
use a supervisor script that detects the abnormal startup problem and
restarts Sling, so that it starts up successfully.
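
A minimal sketch of such a supervisor loop, in Python. How to detect
the abnormal startup is not settled in this thread, so the check is
passed in as a callable; start, stop and started_ok are hypothetical
hooks that you would wire to your own launch command and health probe.

```python
import time
from typing import Callable


def supervise(start: Callable[[], None],
              stop: Callable[[], None],
              started_ok: Callable[[], bool],
              max_attempts: int = 3,
              settle_seconds: float = 0.0) -> int:
    """Start Sling and restart it until started_ok() reports success.

    All three callables are placeholders (assumptions, not Sling APIs).
    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        time.sleep(settle_seconds)  # give the instance time to come up
        if started_ok():
            return attempt
        # Abnormal startup: stop the instance but keep the sling
        # directory, so the next attempt reuses what was just created.
        stop()
    raise RuntimeError(f"no successful startup after {max_attempts} attempts")
```

The important part is that the sling directory is left in place
between attempts, which matches the behaviour Carlos observed: the
second start against the same directory succeeds.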

Another would be to persist the 'sling' directory as a per-container
volume. Not sure how easy that is with k8s, but maybe you can use a
single ReadWriteMany volume at /sling, and each pod gets its own
${sling.home} at /sling/${containerId} (assuming that is exposed
through the downward API).
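
A rough sketch of deriving that per-pod directory, in Python. It
assumes the pod name is injected into the container as an environment
variable called POD_NAME (a name chosen here for illustration) via the
downward API field metadata.name, and uses it in place of
${containerId}. Passing the result to Sling as sling.home is left to
the launch script.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def sling_home_for_pod(base: str = "/sling",
                       env: Optional[Mapping[str, str]] = None,
                       var: str = "POD_NAME") -> PurePosixPath:
    """Return a per-pod Sling home below the shared volume mount.

    POD_NAME is an assumed variable name, not something Sling or k8s
    defines by default; it must be set in the pod spec.
    """
    env = os.environ if env is None else env
    pod = env.get(var, "").strip()
    # Refuse empty or path-like values so two pods can never end up
    # sharing a directory (sling.id must differ between instances).
    if not pod or "/" in pod or pod in (".", ".."):
        raise ValueError(f"{var} is not set to a usable pod name")
    return PurePosixPath(base) / pod
```

For example, with POD_NAME=sling-0 the function returns /sling/sling-0,
so each pod keeps a private directory on the shared volume.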

As these are workarounds, I would still very much like to see this
fixed properly, so please file a bug to track this.

Thanks,
Robert

> 
> Thanks,
> Ben
> 
> On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <camu...@redhat.com>
> wrote:
> 
> > Thanks for the information Robert.
> > 
> > To replicate the issue all I needed was a mongodb (I used a full
> > replica
> > set, see my instructions in a previous email about how to get one
> > going
> > using podman) and a single process running sling.
> > 
> > The problem does happen when I do the following:
> > 
> > 2. Start Sling instance A, wait for it to start
> > 3. Stop Sling instance A, wait for it to stop
> > 4. Start Sling instance B - Error
> > 
> > but let me add more
> > 
> > 5. Start Sling Instance A again - Success (note I didn't remove the
> > sling
> > dir)
> > 6. Start Sling instance B again - Success (note I didn't remove the
> > sling
> > dir)
> > 
> > this means that even if Sling recreates the sling directory and
> > fails the
> > startup, next time it will succeed. Unfortunately we don't have
> > that luxury
> > in containers because the sling directory is not persisted.
> > 
> > I think this is a bug, but I'll keep playing with it a bit to see
> > if I can
> > find out more.
> > 
> > Carlos
> > 
> > 
> > 
> > 
> > 
> > 
> > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <romb...@apache.org
> > >
> > wrote:
> > 
> > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > > Robert I managed to replicate the issue in a local, non-
> > > > containerized
> > > > environment (!!!).
> > > > 
> > > > The problem seems to be when the database is kept but the
> > > > 'sling'
> > > > directory
> > > > is cleared out across restarts (as it is for us when the
> > > > container
> > > > goes
> > > > away). As I said before this doesn't seem to be a problem with
> > > > the
> > > > Sling 11
> > > > bundles.
> > > > 
> > > > The first basic solution will be to persist the 'sling'
> > > > directory
> > > > across
> > > > restarts, and I was wondering if this is a bug, or as designed.
> > > 
> > > I think this should work.
> > > 
> > > > I also wonder if once persisted, multiple containers could
> > > > share this
> > > > directory.
> > > 
> > > This directory can't be shared, as it holds runtime data related
> > > to
> > > Sling. For instance, a bundle that is started in instance A could
> > > be
> > > starting on instance B.
> > > 
> > > There is at least one file ( sling.id ) that holds data that must
> > > not
> > > be the same between instances.
> > > 
> > > So I would advise marking the directory as container-private
> > > as a
> > > first step.
> > > 
> > > Robert
> > > 
> > > > Regards,
> > > > 
> > > > Carlos
> > > > 
> > > > 
> > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <
> > > > camu...@redhat.com>
> > > > wrote:
> > > > 
> > > > > Thanks Robert (and once again I can't stress enough how
> > > > > grateful I
> > > > > am for
> > > > > all your help).
> > > > > 
> > > > > Right now we deploy our container with the expectation that
> > > > > the
> > > > > mongo db
> > > > > is the only necessary state we need to keep; everything else
> > > > > is
> > > > > throwaway.
> > > > > This means that a totally new container connected to the
> > > > > mongodb
> > > > > should
> > > > > pick up the state and run the same as the first time it was
> > > > > fired
> > > > > up. Do
> > > > > you think this is an incorrect assumption? If so, what are
> > > > > other
> > > > > pieces of
> > > > > state we should be keeping for subsequent restarts?
> > > > > 
> > > > > This assumption has worked well for us with the current sling
> > > > > 11
> > > > > release,
> > > > > but it seems to break with the more up-to-date bundles.
> > > > > Perhaps
> > > > > running
> > > > > Sling in a container is just not meant to be.
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Carlos
> > > > > 
> > > > > 
> > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <
> > > > > romb...@apache.org
> > > > > wrote:
> > > > > 
> > > > > > Hi Carlos,
> > > > > > 
> > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
> > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level
> > > > > > > logs for
> > > > > > > every
> > > > > > > bundle? I tried passing a few configuration arguments
> > > > > > > from the
> > > > > > > command line
> > > > > > > but nothing seemed to work.
> > > > > > 
> > > > > > Try configuring the LogManager to debug at
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > Thanks,
> > > > > > Robert
> > > > > > 
> > > > > > > Carlos
> > > > > > > 
> > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz <
> > > > > > > bdelacre...@apache.org>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz <
> > > > > > > > camu...@redhat.com>
> > > > > > > > wrote:
> > > > > > > > > ...Is there a reason why the Jcr repository could be
> > > > > > > > > restarting?
> > > > > > > > > And what
> > > > > > > > > class could we start looking into to debug if this is
> > > > > > > > > the
> > > > > > > > > case?...
> > > > > > > > 
> > > > > > > > It's not uncommon to see extra restarts of OSGi
> > > > > > > > components at
> > > > > > > > startup,
> > > > > > > > for various reasons.
> > > > > > > > 
> > > > > > > > The simplest way to detect and log multiple repository
> > > > > > > > startups
> > > > > > > > might
> > > > > > > > be to implement a SlingRepositoryInitializer service
> > > > > > > > [1]
> > > > > > > > that's
> > > > > > > > called
> > > > > > > > at every startup, or use the logs of an existing one
> > > > > > > > like the
> > > > > > > > JCR
> > > > > > > > RepositoryInitializer [2] if that has anything to
> > > > > > > > process in
> > > > > > > > your
> > > > > > > > system.
> > > > > > > > 
> > > > > > > > -Bertrand
> > > > > > > > 
> > > > > > > > [1]
> > > > > > > > 
> > > > > > > > https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > [2]
> > > > > > > > 
> > > > > > > > https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
