Hi Alexander,

Please see my answers below.


   Best Regards,

   Adam.



Adam Hamšík
Co-founder & CEO
Mobile: +421-904-937-495
www.lablabs.io


On 23 Jan 2022, 09:09 +0100, Alexandru Pătrănescu <dreal...@gmail.com>, wrote:
>
> On Sat, Jan 22, 2022 at 10:00 PM Adam Hamsik <adam.ham...@lablabs.io> wrote:
> > Hello,
> >
> > We are using PHP for our application backends. This works very well, as we 
> > have developed a simple way to clone them with minimal effort (they can be 
> > very similar). For orchestration we are using Kubernetes (>= 1.21). Our 
> > application pod generally contains NGINX + php-fpm and fluentbit for log 
> > shipping. We generally want to have a LivenessProbe (put simply, this is a 
> > check run against our pod to verify that it's alive; if it fails, the 
> > particular container is restarted).
> >
> > This works very well (we are also using Swoole, which is roughly 70-80% 
> > better), but in certain unstable situations we see higher application 
> > latency (a DB problem or a bug in our application). We then often 
> > experience problems because pods are falsely marked as dead (failed 
> > liveness probe, so they are restarted by kubelet). This happens when all 
> > processes in our static pool are allocated to application requests. For our 
> > livenessProbe we tried to use both the fpm.ping and fpm.status endpoints, 
> > but both behave in the same way, as they are served by worker processes.
> >
> > I had a look at the php-src repo to see whether we could, for example, use 
> > signals to verify that the application server is running as a way to work 
> > around our issue. While looking at this I saw fpm-systemd.c, which is a 
> > systemd-specific check: it reports fpm status to systemd every couple of 
> > seconds (configurable). Would you be willing to integrate a similar feature 
> > for Kubernetes? This would be based on a pull model, probably with a REST 
> > interface.
> >
> > My idea is following:
> >
> > 1) During startup, if this is enabled, the php-fpm master will open a 
> > secondary port, pm.health_port (9001), and listen on pm.health_path 
> > (/healthz) [2].
> > 2) When it receives a GET request, the fpm master process will respond with 
> > HTTP code 200 and the string "ok" (we can later add some checks/metrics to 
> > make sure fpm is in a good state). If it does not respond, or fpm is not 
> > ok, our LivenessProbe will fail; based on configuration this will trigger a 
> > container restart.
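To make the intent of the two points above concrete: on the Kubernetes side the 
probe would be configured roughly like the sketch below. Note that 
pm.health_port and pm.health_path are only the names proposed here (nothing 
like this exists in php-fpm today), and 9001 / /healthz are just suggested 
defaults.

    livenessProbe:
      httpGet:
        # Served directly by the php-fpm master on the proposed secondary
        # port, independent of the worker pool handling application traffic.
        path: /healthz
        port: 9001
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3

Because the master process would answer this itself, the probe would keep 
passing even when every worker in the static pool is busy, which is exactly the 
case that currently gets our pods restarted.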
> >
> > Would you be interested in integrating a feature like this? Or is there any 
> > other way we could achieve similar results?
> >
> > [1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-liveness-probe
> > [2] https://kubernetes.io/docs/reference/using-api/health-checks/
> >
> >    Best Regards,
> >
> >    Adam.
> >
> >
> >
> > Adam Hamšík
> > Co-founder & CEO
> > Mobile: +421-904-937-495
> > www.lablabs.io
>
> Hi Adam,
>
> While I believe that improvements for health checking and other metrics can 
> be added to php-fpm to expose internal status and statistics,
> I want to say that I don't know too much about that, and I want to first 
> discuss the problem you mentioned and the approach.
>
> Based on my experience, it is best to have the health check always go 
> through the application.
> You mentioned "certain unstable situations when we see higher application 
> latency (db problem or a bug in our application)".
> Taking these two examples:
>
> - "db problems". I'm guessing you mean, higher latency from the database.
> In case of the health check, you should not connect to the database, of 
> course so the actual execution of the healthcheck should not be impacted.
> But probably you mean that more requests are piling up as php-fpm is not able 
> to handle them as fast as they are coming due to limited child processes.
> One solution here would be to configure a second listening pool for health 
> endpoint on php-fpm with 1 or 2 child processes and configure nginx to use it 
> for the specific path.
>
> - "a bug in our application".I'm guessing you mean a bug that causes high CPU 
> usage.
> If the issue is visible immediately once the pod starts, it's good to have 
> the health check failed so the deployment rollout fails and avoid bringing 
> bugs in production.
> If the issue is visible later, some time after the pod starts, I'm thinking 
> this could happen due to a memory leak. A pod restart due to a failed health 
> check would also make sure the production stays healthy.
Neither of these problems is usually big enough by itself to cause an outage. 
They just make the application behave slightly worse; this, however, can 
sometimes lead to failed liveness probes -> pod restarts.
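On the dedicated-pool suggestion: if I read it correctly, it would look roughly 
like the sketch below (pool name, socket path and the /ping path are 
placeholders, not our actual config).

    ; /etc/php-fpm.d/health.conf -- tiny pool reserved for the probe
    [health]
    user = www-data
    group = www-data
    listen = /run/php-fpm/health.sock
    pm = static
    pm.max_children = 2
    ping.path = /ping
    ping.response = pong

    # nginx: route only the probe path to the dedicated pool
    location = /ping {
        access_log off;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php-fpm/health.sock;
    }

This keeps a worker or two free for the probe even when the main pool is 
saturated, although it only proves that fpm can still dispatch to that small 
pool, not that the main pool is in a good state.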
>
> Having the health check pass through the application makes sure it's 
> actually working.
Sure, but in our case we go to either fpm.ping or fpm.status, as initializing 
the whole Symfony application is quite expensive. I'm not sure if this counts 
as going through the application.
> Based on my experience, it's good to include in the health check all the 
> application bootstrapping that is local, and to avoid any I/O such as the 
> database, memcache and others.
> That way, a missed new production configuration dependency that prevents the 
> application from starting up properly would block the deployment rollout and 
> keep uptime high.
> A health check that does not use the actual application would report it 
> healthy even though it cannot handle requests.
>

I agree with this. We initially tried to do a lot in our healthchecks and 
gradually reduced their footprint/scope to just the required minimum, because 
they were too fragile.
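For what it's worth, the shape we converged on is roughly the minimal endpoint 
sketched below (simplified, not our actual code; the .env.local.php path is 
just an example of a local Symfony artifact). It exercises autoloading and 
local configuration but deliberately performs no database or network I/O.

    <?php
    // healthcheck.php -- minimal liveness endpoint (sketch)
    declare(strict_types=1);

    // Autoloader must be present and loadable.
    require __DIR__ . '/../vendor/autoload.php';

    $ok = true;

    // Local configuration must exist and be readable; no remote calls.
    $envFile = __DIR__ . '/../.env.local.php';
    if (!is_readable($envFile) || !is_array(@include $envFile)) {
        $ok = false;
    }

    // Writable scratch space for caches/sessions.
    if (!is_writable(sys_get_temp_dir())) {
        $ok = false;
    }

    http_response_code($ok ? 200 : 503);
    header('Content-Type: text/plain');
    echo $ok ? 'ok' : 'fail';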
> If I have misunderstood things, or there are other cases that you encountered 
> where you think a health check not going through the app is helping, please 
> share so we can learn about it.

>
> Regards,
> Alex
>
