On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> I am following along conceptually - I want to make sure I understand
> what's being described.
>
> Let's say Sling Instance A starts successfully the first time. If we
> restart Sling Instance A, we expect subsequent restarts to also
> succeed, without removing the sling directory.
> Now let's say Sling Instance B does NOT start successfully the first
> time. Despite that, we expect subsequent restarts to succeed without
> removing the sling directory.
>
> Correct so far?

Yes, correct.

> Assuming yes... what if this is running in k8s, and k8s sees that
> Sling Instance B did not start successfully, and kills the pod
> (removing all pod resources, including that pod's sling directory) in
> response? Presumably, k8s would then start Sling Instance C, which is
> a fresh instance with no sling directory. Are we saying we expect C to
> have a 50/50 chance of starting successfully? Or have we observed
> different behavior?

I think that only the first instance starts successfully. Additional
instances will not start unless they have a Sling directory set up.
I've tested with a third instance, once two instances are up, and it
has the exact same behaviour.

One workaround that I can suggest for a containerized environment is to
use a supervisor script that detects the abnormal startup problem and
restarts Sling, so that it starts up successfully.

Another would be to persist the 'sling' directory as a per-container
volume. Not sure how easy that is with k8s, but maybe you can use a
single ReadWriteMany volume at /sling, and each pod gets its own
${sling.home} at /sling/${containerId} (assuming that is exposed
through the downward API).

As these are workarounds, I would still very much like to see this
fixed properly, so please file a bug to track this.

Thanks,
Robert

> Thanks,
> Ben
>
> On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <camu...@redhat.com>
> wrote:
>
> > Thanks for the information Robert.
> >
> > To replicate the issue all I needed was a mongodb (I used a full
> > replica set, see my instructions in a previous email about how to
> > get one going using podman) and a single process running sling.
> >
> > The problem does happen when I do the following:
> >
> > 2. Start Sling instance A, wait for it to start
> > 3. Stop Sling instance A, wait for it to stop
> > 4. Start Sling instance B - Error
> >
> > but let me add more
> >
> > 5. Start Sling Instance A again - Success (note I didn't remove the
> >    sling dir)
> > 6. Start Sling instance B again - Success (note I didn't remove the
> >    sling dir)
> >
> > This means that even if Sling recreates the sling directory and
> > fails the startup, next time it will succeed. Unfortunately we don't
> > have that luxury in containers because the sling directory is not
> > persisted.
> >
> > I think this is a bug, but I'll keep playing with it a bit to see if
> > I can find out more.
> >
> > Carlos
> >
> > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <romb...@apache.org>
> > wrote:
> >
> > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > > Robert I managed to replicate the issue in a local,
> > > > non-containerized environment (!!!).
> > > >
> > > > The problem seems to be when the database is kept but the
> > > > 'sling' directory is cleared out across restarts (as it is for
> > > > us when the container goes away). As I said before this doesn't
> > > > seem to be a problem with the Sling 11 bundles.
> > > >
> > > > The first basic solution will be to persist the 'sling'
> > > > directory across restarts, and I was wondering if this is a bug,
> > > > or as designed.
> > >
> > > I think this should work.
> > >
> > > > I also wonder if once persisted, multiple containers could share
> > > > this directory.
> > >
> > > This directory can't be shared, as it holds runtime data related
> > > to Sling. For instance, a bundle that is started in instance A
> > > could be starting on instance B.
> > >
> > > There is at least one file ( sling.id ) that holds data that must
> > > not be the same between instances.
> > >
> > > So I would advise marking the directory as container-private as a
> > > first step.
> > >
> > > Robert
> > >
> > > > Regards,
> > > >
> > > > Carlos
> > > >
> > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <camu...@redhat.com>
> > > > wrote:
> > > >
> > > > > Thanks Robert (and once again I can't stress enough how
> > > > > grateful I am for all your help).
> > > > >
> > > > > Right now we deploy our container with the expectation that
> > > > > the mongo db is the only necessary state we need to keep;
> > > > > everything else is throwaway. This means that a totally new
> > > > > container connected to the mongodb should pick up the state
> > > > > and run the same as the first time it was fired up. Do you
> > > > > think this is an incorrect assumption? If so, what are other
> > > > > pieces of state we should be keeping for subsequent restarts?
> > > > >
> > > > > This assumption has worked well for us with the current sling
> > > > > 11 release, but it seems to break with the more up-to-date
> > > > > bundles. Perhaps running Sling in a container is just not
> > > > > meant to be.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Carlos
> > > > >
> > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu
> > > > > <romb...@apache.org> wrote:
> > > > >
> > > > > > Hi Carlos,
> > > > > >
> > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
> > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level logs
> > > > > > > for every bundle? I tried passing a few configuration
> > > > > > > arguments from the command line but nothing seemed to
> > > > > > > work.
> > > > > >
> > > > > > Try configuring the LogManager to debug at
> > > > > >
> > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > > Carlos
> > > > > > >
> > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz
> > > > > > > <bdelacre...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz
> > > > > > > > <camu...@redhat.com> wrote:
> > > > > > > > > ...Is there a reason why the Jcr repository could be
> > > > > > > > > restarting? And what class could we start looking into
> > > > > > > > > to debug if this is the case?...
> > > > > > > >
> > > > > > > > It's not uncommon to see extra restarts of OSGi
> > > > > > > > components at startup, for various reasons.
> > > > > > > >
> > > > > > > > The simplest way to detect and log multiple repository
> > > > > > > > startups might be to implement a
> > > > > > > > SlingRepositoryInitializer service [1] that's called at
> > > > > > > > every startup, or use the logs of an existing one like
> > > > > > > > the JCR RepositoryInitializer [2] if that has anything
> > > > > > > > to process in your system.
> > > > > > > >
> > > > > > > > -Bertrand
> > > > > > > >
> > > > > > > > [1] https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > [2] https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
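
The two workarounds from the most recent reply in this thread (a
supervisor that restarts Sling after a failed first startup, and a
per-pod ${sling.home} under a shared /sling volume) could be sketched
roughly as below. This is only an illustration under assumptions that
the thread does not confirm: the function names are hypothetical, the
thread does not say how the abnormal startup can be detected (an HTTP
200 check is used here as a stand-in), and the container id is assumed
to arrive from the environment, for example via the downward API.

```python
import time
import urllib.request
from pathlib import PurePosixPath


def sling_home_for(base_dir, container_id):
    """Build a per-container Sling home such as /sling/<container_id>.

    Hypothetical helper: container_id is assumed to be a pod or container
    name injected into the environment (e.g. via the downward API).
    """
    if not container_id or "/" in container_id or container_id in (".", ".."):
        raise ValueError("container id must be a single, non-empty path segment")
    return PurePosixPath(base_dir) / container_id


def http_ok(url, timeout=5.0):
    """Assumed health probe: True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def supervise(start, is_healthy, stop, max_attempts=3, startup_timeout=120.0,
              poll_interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Start Sling and restart it if it does not become healthy in time.

    start() launches the process and returns a handle, is_healthy(handle)
    reports whether startup succeeded, and stop(handle) shuts it down.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        handle = start()
        deadline = clock() + startup_timeout
        while clock() < deadline:
            if is_healthy(handle):
                return attempt
            sleep(poll_interval)
        # Startup did not complete in time: stop this process and retry.
        # The sling directory is left in place, as in steps 5 and 6 above.
        stop(handle)
    raise RuntimeError("Sling did not start after %d attempts" % max_attempts)
```

In a real container, start would wrap whatever command launches Sling
with sling.home pointing at the directory returned by sling_home_for,
stop would terminate that process, and is_healthy could call http_ok
against a URL that only responds after a good startup. Passing these in
as callables keeps the retry logic independent of how Sling is actually
launched, which this thread does not specify.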