I went through the bisect process and I got the first bad commit:

commit bb1e10d97f3c163fb87917ea782afff674050891
Author: Eric Norman <enor...@apache.org>
Date: Sun Dec 16 12:33:08 2018 -0800

    switch to released JCR Base 3.0.6

(I tried it a couple of times just to be sure)

I tried running our app with the commit before that and I get it to run. (There are other unrelated problems).

On Mon, Mar 16, 2020 at 6:12 PM Robert Munteanu <romb...@apache.org> wrote:

> Hi Carlos,
>
> Apologies for the delay ...
>
> What I was thinking of doing myself, but did not have the time is the following
>
> 1. Find a version of Sling for which the scenario in SLING-9118 works. Perhaps Sling Starter 11 is a good start.
> 2. Run a `git bisect` check between sling starter 11 and the current master branch
>
> Assuming my guess is correct, git would say
>
> Bisecting: 36 revisions left to test after this (roughly 5 steps)
> [c1aedf7b292f7835ceb4e2f56fedcb3294c60756] Update to Tika 1.21
>
> So not that many steps to test.
>
> If you would manage to isolate the change to the starter that broke this, it would make it much easier to understand where the problem is coming from.
>
> Thanks!
> Robert
>
> On Mon, 2020-03-16 at 16:27 -0400, Carlos Munoz wrote:
> > Hi Robert,
> >
> > Just a friendly ping about this issue :)
> >
> > We could try to submit a fix with some potential guidance from you. For example, which of the many Sling bundles should we start looking at?
> >
> > Regards,
> >
> > Carlos
> >
> > On Wed, Feb 26, 2020 at 7:24 AM Carlos Munoz <camu...@redhat.com> wrote:
> >
> > > Thanks Robert. As always your help is appreciated.
> > >
> > > On Fri, Feb 21, 2020 at 6:28 PM Robert Munteanu <romb...@apache.org> wrote:
> > >
> > > > Thanks, Ben,
> > > >
> > > > I added a bit more detail, based on our mailing list conversations. I'll have limited access in the next two weeks, but if no one picks it up I'll look into it when I get back.
> > > >
> > > > Thanks,
> > > > Robert
> > > >
> > > > On Fri, 2020-02-21 at 11:01 -0500, Ben Radey wrote:
> > > > > I went ahead and created https://issues.apache.org/jira/browse/SLING-9118 for this. Although the ultimate goal here is containerization, I neglected to include any details to that effect in the ticket, since the behavior is reproducible without that being a complicating factor.
> > > > >
> > > > > On Thu, Feb 20, 2020 at 7:25 AM Robert Munteanu <romb...@apache.org> wrote:
> > > > >
> > > > > > On Mon, 2020-02-17 at 13:45 -0500, Ben Radey wrote:
> > > > > > > I am following along conceptually - I want to make sure I understand what's being described.
> > > > > > >
> > > > > > > Let's say Sling Instance A starts successfully the first time. If we restart Sling Instance A, we expect subsequent restarts to also succeed, without removing the sling directory.
> > > > > > > Now let's say Sling Instance B does NOT start successfully the first time. Despite that, we expect subsequent restarts to succeed without removing the sling directory.
> > > > > > >
> > > > > > > Correct so far?
> > > > > >
> > > > > > Yes, correct.
> > > > > >
> > > > > > > Assuming yes... what if this is running in k8s, and k8s sees that Sling Instance B did not start successfully, and kills the pod (removing all pod resources, including that pod's sling directory) in response? Presumably, k8s would then start Sling Instance C, which is a fresh instance with no sling directory. Are we saying we expect C to have a 50/50 chance of starting successfully? Or have we observed different behavior?
> > > > > >
> > > > > > I think that only the first instance starts successfully. Additional instances will not start unless they have a Sling directory set up.
> > > > > >
> > > > > > I've tested with a third instance, once two instances are up, and it has the exact same behaviour.
> > > > > >
> > > > > > One workaround that I can suggest for a containerized environment is to use a supervisor script that detects the abnormal startup problem and restarts Sling, so that it starts up successfully.
> > > > > >
> > > > > > Another would be to persist the 'sling' directory as a per-container volume. Not sure how easy that is with k8s, but maybe you can use a single ReadWriteMany volume at /sling, and each pod gets their own ${sling.home} at /sling/${containerId} (assuming that is exposed through the downward API).
> > > > > >
> > > > > > As these are workarounds, I would still very much like to see this fixed properly, so please file a bug to track this.
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > > Thanks,
> > > > > > > Ben
> > > > > > >
> > > > > > > On Mon, Feb 17, 2020 at 11:33 AM Carlos Munoz <camu...@redhat.com> wrote:
> > > > > > >
> > > > > > > > Thanks for the information Robert.
> > > > > > > >
> > > > > > > > To replicate the issue all I needed was a mongodb (I used a full replica set, see my instructions in a previous email about how to get one going using podman) and a single process running sling.
> > > > > > > >
> > > > > > > > The problem does happen when I do the following:
> > > > > > > >
> > > > > > > > 2. Start Sling instance A, wait for it to start
> > > > > > > > 3. Stop Sling instance A, wait for it to stop
> > > > > > > > 4. Start Sling instance B - Error
> > > > > > > >
> > > > > > > > but let me add more
> > > > > > > >
> > > > > > > > 5. Start Sling Instance A again - Success (note I didn't remove the sling dir)
> > > > > > > > 6. Start Sling instance B again - Success (note I didn't remove the sling dir)
> > > > > > > >
> > > > > > > > this means that even if Sling recreates the sling directory and fails the startup, next time it will succeed. Unfortunately we don't have that luxury in containers because the sling directory is not persisted.
> > > > > > > >
> > > > > > > > I think this is a bug, but I'll keep playing with it a bit to see if I can find out more.
> > > > > > > >
> > > > > > > > Carlos
> > > > > > > >
> > > > > > > > On Mon, Feb 17, 2020 at 5:23 AM Robert Munteanu <romb...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > On Fri, 2020-02-14 at 15:41 -0500, Carlos Munoz wrote:
> > > > > > > > > > Robert I managed to replicate the issue in a local, non-containerized environment (!!!).
> > > > > > > > > >
> > > > > > > > > > The problem seems to be when the database is kept but the 'sling' directory is cleared out across restarts (as it is for us when the container goes away). As I said before this doesn't seem to be a problem with the Sling 11 bundles.
> > > > > > > > > >
> > > > > > > > > > The first basic solution will be to persist the 'sling' directory across restarts, and I was wondering if this is a bug, or as designed.
> > > > > > > > >
> > > > > > > > > I think this should work.
> > > > > > > > >
> > > > > > > > > > I also wonder if once persisted, multiple containers could share this directory.
> > > > > > > > >
> > > > > > > > > This directory can't be shared, as it holds runtime data related to Sling. For instance, a bundle that is started in instance A could be starting on instance B.
> > > > > > > > >
> > > > > > > > > There is at least one file (sling.id) that holds data that must not be the same between instances.
> > > > > > > > >
> > > > > > > > > So I would advise marking the directory as container-private as a first step.
> > > > > > > > >
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Carlos
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 14, 2020 at 3:17 PM Carlos Munoz <camu...@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Robert (and once again I can't stress enough how grateful I am for all your help).
> > > > > > > > > > >
> > > > > > > > > > > Right now we deploy our container with the expectation that the mongo db is the only necessary state we need to keep; everything else is throwaway. This means that a totally new container connected to the mongodb should pick up the state and run the same as the first time it was fired up. Do you think this is an incorrect assumption? If so, what are other pieces of state we should be keeping for subsequent restarts?
> > > > > > > > > > >
> > > > > > > > > > > This assumption has worked well for us with the current sling 11 release, but it seems to break with the more up-to-date bundles. Perhaps running Sling in a container is just not meant to be.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Carlos
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 14, 2020 at 2:21 PM Robert Munteanu <romb...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Carlos,
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 2020-02-14 at 11:50 -0500, Carlos Munoz wrote:
> > > > > > > > > > > > > Thanks Bertrand. How can I run Sling with DEBUG-level logs for every bundle? I tried passing a few configuration arguments from the command line but nothing seemed to work.
> > > > > > > > > > > >
> > > > > > > > > > > > Try configuring the LogManager to debug at
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/sling-org-apache-sling-starter/blob/8ba34e28fbea2feb4c61767dde510aa94d86fa0a/src/main/provisioning/sling.txt#L138
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Robert
> > > > > > > > > > > >
> > > > > > > > > > > > > Carlos
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 14, 2020 at 4:32 AM Bertrand Delacretaz <bdelacre...@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Feb 13, 2020 at 8:47 PM Carlos Munoz <camu...@redhat.com> wrote:
> > > > > > > > > > > > > > > ...Is there a reason why the Jcr repository could be restarting? And what class could we start looking into to debug if this is the case?...
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's not uncommon to see extra restarts of OSGi components at startup, for various reasons.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The simplest way to detect and log multiple repository startups might be to implement a SlingRepositoryInitializer service [1] that's called at every startup, or use the logs of an existing one like the JCR RepositoryInitializer [2] if that has anything to process in your system.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Bertrand
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://sling.apache.org/documentation/bundles/repository-initialization.html#slingrepositoryinitializer
> > > > > > > > > > > > > > [2] https://github.com/apache/sling-org-apache-sling-jcr-repoinit/blob/41dfe606f99ca71baee8d9054d3ec6e9b896b12e/src/main/java/org/apache/sling/jcr/repoinit/impl/RepositoryInitializer.java#L98
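
For reference, Bertrand's suggestion at the bottom of the thread (an initializer that is called on every repository startup and logs it) could look roughly like the sketch below. It is only a sketch: the Repo and RepoInitializer interfaces are hypothetical stand-ins defined here so the example runs without Sling on the classpath. In a real bundle the class would presumably implement org.apache.sling.jcr.api.SlingRepositoryInitializer and be registered as an OSGi service as described in [1]; that wiring is left out. The counter is held in memory, so it only counts startups within one JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

class StartupCounterDemo {

    /** Hypothetical stand-in for the repository handle; not the real Sling/JCR type. */
    interface Repo {
        String getDescriptor(String key);
    }

    /** Hypothetical stand-in with the same one-method shape as SlingRepositoryInitializer. */
    interface RepoInitializer {
        void processRepository(Repo repo) throws Exception;
    }

    /** Counts and logs every call, so a second call in one JVM points at a repository restart. */
    static final class StartupCounter implements RepoInitializer {
        private final AtomicInteger startups = new AtomicInteger();

        @Override
        public void processRepository(Repo repo) {
            int n = startups.incrementAndGet();
            // "jcr.repository.name" is the standard JCR descriptor key for the repository name
            String name = repo.getDescriptor("jcr.repository.name");
            String line = "repository startup #" + n + " (" + name + ")";
            if (n > 1) {
                System.err.println("WARN  " + line + " - repository was restarted");
            } else {
                System.out.println("INFO  " + line);
            }
        }

        int count() {
            return startups.get();
        }
    }

    public static void main(String[] args) {
        StartupCounter counter = new StartupCounter();
        Repo fake = key -> "demo-repository"; // fake repository, for the demo only
        counter.processRepository(fake);      // first startup
        counter.processRepository(fake);      // simulated restart
        System.out.println("startups seen: " + counter.count());
    }
}
```

If the warning shows up during a single Sling start, the repository was started more than once, which is the situation Carlos was asking how to detect.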