Thanks, Ben. I added a bit more detail to the ticket, based on our mailing list conversations. I'll have limited access over the next two weeks, but if no one picks it up I'll look into it when I get back.

Thanks,
Robert

On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
> I went ahead and created
> https://issues.apache.org/jira/browse/SLING-9118 for this. Although the
> ultimate goal here is containerization, I neglected to include any
> details to that effect in the ticket, since the behavior is reproducible
> without that being a complicating factor.
>
> On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <[email protected]>
> wrote:
>
> > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> > > I am following along conceptually - I want to make sure I understand
> > > what's being described.
> > >
> > > Let's say Sling Instance A starts successfully the first time. If we
> > > restart Sling Instance A, we expect subsequent restarts to also
> > > succeed, without removing the sling directory.
> > > Now let's say Sling Instance B does NOT start successfully the first
> > > time. Despite that, we expect subsequent restarts to succeed without
> > > removing the sling directory.
> > >
> > > Correct so far?
> >
> > Yes, correct.
> >
> > > Assuming yes... what if this is running in k8s, and k8s sees that
> > > Sling Instance B did not start successfully, and kills the pod
> > > (removing all pod resources, including that pod's sling directory) in
> > > response? Presumably, k8s would then start Sling Instance C, which is
> > > a fresh instance with no sling directory. Are we saying we expect C to
> > > have a 50/50 chance of starting successfully? Or have we observed
> > > different behavior?
> >
> > I think that only the first instance starts successfully. Additional
> > instances will not start unless they have a Sling directory set up.
> >
> > I've tested with a third instance, once two instances are up, and it
> > has the exact same behaviour.
> >
> > One workaround that I can suggest for a containerized environment is to
> > use a supervisor script that detects the abnormal startup problem and
> > restarts Sling, so that it starts up successfully.
> >
> > Another would be to persist the 'sling' directory as a per-container
> > volume. Not sure how easy that is with k8s, but maybe you can use a
> > single ReadWriteMany volume at /sling, and each pod gets their own
> > ${sling.home} at /sling/${containerId} (assuming that is exposed
> > through the downward API).
> >
> > As these are workarounds, I would still very much like to see this
> > fixed properly, so please file a bug to track this.
> >
> > Thanks,
> > Robert
> >
> > > Thanks,
> > > Ben
> > >
> > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <[email protected]>
> > > wrote:
> > >
> > > > Thanks for the information Robert.
> > > >
> > > > To replicate the issue all I needed was a mongodb (I used a full
> > > > replica set, see my instructions in a previous email about how to
> > > > get one going using podman) and a single process running sling.
> > > >
> > > > The problem does happen when I do the following:
> > > >
> > > > 2. Start Sling instance A, wait for it to start
> > > > 3. Stop Sling instance A, wait for it to stop
> > > > 4. Start Sling instance B - Error
> > > >
> > > > but let me add more
> > > >
> > > > 5. Start Sling Instance A again - Success (note I didn't remove the
> > > >    sling dir)
> > > > 6. Start Sling instance B again - Success (note I didn't remove the
> > > >    sling dir)
> > > >
> > > > this means that even if Sling recreates the sling directory and
> > > > fails the startup, next time it will succeed. Unfortunately we don't
> > > > have that luxury in containers because the sling directory is not
> > > > persisted.
> > > >
> > > > I think this is a bug, but I'll keep playing with it a bit to see if
> > > > I can find out more.
> > > >
> > > > Carlos
> > > >
> > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <[email protected]>
> > > > wrote:
> > > >
> > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > > > > Robert I managed to replicate the issue in a local,
> > > > > > non-containerized environment (!!!).
> > > > > >
> > > > > > The problem seems to be when the database is kept but the 'sling'
> > > > > > directory is cleared out across restarts (as it is for us when
> > > > > > the container goes away). As I said before this doesn't seem to
> > > > > > be a problem with the Sling 11 bundles.
> > > > > >
> > > > > > The first basic solution will be to persist the 'sling' directory
> > > > > > across restarts, and I was wondering if this is a bug, or as
> > > > > > designed.
> > > > >
> > > > > I think this should work.
> > > > >
> > > > > > I also wonder if once persisted, multiple containers could share
> > > > > > this directory.
> > > > >
> > > > > This directory can't be shared, as it holds runtime data related to
> > > > > Sling. For instance, a bundle that is started in instance A could
> > > > > be starting on instance B.
> > > > >
> > > > > There is at least one file ( sling.id ) that holds data that must
> > > > > not be the same between instances.
> > > > >
> > > > > So I would advise marking the directory as container-private as a
> > > > > first step.
> > > > >
> > > > > Robert
> > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Carlos
> > > > > >
> > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Robert (and once again I can't stress enough how
> > > > > > > grateful I am for all your help).
> > > > > > >
> > > > > > > Right now we deploy our container with the expectation that the
> > > > > > > mongo db is the only necessary state we need to keep; everything
> > > > > > > else is throwaway. This means that a totally new container
> > > > > > > connected to the mongodb should pick up the state and run the
> > > > > > > same as the first time it was fired up. Do you think this is an
> > > > > > > incorrect assumption? If so, what are other pieces of state we
> > > > > > > should be keeping for subsequent restarts?
> > > > > > >
> > > > > > > This assumption has worked well for us with the current sling 11
> > > > > > > release, but it seems to break with the more up-to-date bundles.
> > > > > > > Perhaps running Sling in a container is just not meant to be.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Carlos
> > > > > > >
> > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Carlos,
> > > > > > > >
> > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
> > > > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level logs
> > > > > > > > > for every bundle?
> > > > > > > > > I tried passing a few configuration arguments from the
> > > > > > > > > command line but nothing seemed to work.
> > > > > > > >
> > > > > > > > Try configuring the LogManager to debug at
> > > > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Robert
> > > > > > > >
> > > > > > > > > Carlos
> > > > > > > > >
> > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > ...Is there a reason why the Jcr repository could be
> > > > > > > > > > > restarting? And what class could we start looking into
> > > > > > > > > > > to debug if this is the case?...
> > > > > > > > > >
> > > > > > > > > > It's not uncommon to see extra restarts of OSGi components
> > > > > > > > > > at startup, for various reasons.
> > > > > > > > > >
> > > > > > > > > > The simplest way to detect and log multiple repository
> > > > > > > > > > startups might be to implement a SlingRepositoryInitializer
> > > > > > > > > > service [1] that's called at every startup, or use the logs
> > > > > > > > > > of an existing one like the JCR RepositoryInitializer [2] if
> > > > > > > > > > that has anything to process in your system.
> > > > > > > > > >
> > > > > > > > > > -Bertrand
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > > https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > > > [2]
> > > > > > > > > > https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
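
For reference, here is a rough sketch of the per-container 'sling' directory workaround suggested in the thread above. It is only an illustration under assumptions: the shared volume is mounted at some base path (for example /sling), and the pod name reaches the container as an environment variable, here called POD_NAME, which is a hypothetical name (it could be filled from the downward API's metadata.name field). The helper sling_home_for_pod is made up for this example and is not part of Sling.

```python
import os
from pathlib import Path


def sling_home_for_pod(base_dir, env=None, var="POD_NAME"):
    """Return a per-pod sling home under base_dir, creating it if missing.

    Assumption: the variable named by `var` holds an identifier that is
    unique per instance, so no two running instances share a directory.
    """
    env = os.environ if env is None else env
    pod_name = env.get(var)
    if not pod_name:
        # Refuse to fall back to a shared default: the thread notes that
        # the sling directory must not be shared between instances.
        raise RuntimeError(f"{var} is not set; cannot pick a private sling home")
    home = Path(base_dir) / pod_name
    home.mkdir(parents=True, exist_ok=True)
    return home


# Example (not executed here): sling_home_for_pod("/sling") could be called
# by a launch wrapper, which then passes the result to Sling as ${sling.home}.
```

This only helps if the identifier is stable across restarts (as pod names are in a StatefulSet, for example); with a name that changes on every restart, each new pod would still start with an empty directory.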
