Nothing in the syslog or error log to suggest that MariaDB crashed or
was restarted in any way. Just what I saw in the graphs.
There were some memory pressure events in the error log around that time
too, reporting a number of pages being released, which could be related.
LimitNOFILE was set to 200000 (well below the table open cache) and
the mysql user's ulimit for open files was set to 999999. I've raised the
first to 2097152 and the second to unlimited.
Will apply these to the servers in turn and see if this makes a difference.
Thanks again for all your help!
Derick
On 27/03/2025 21:43, Gordan Bobic via discuss wrote:
On Thu, 27 Mar 2025 at 20:14, Derick Turner wrote:
We had another event today.
The cache metrics went from healthy (99.9% open table cache hit rate,
InnoDB buffer pool at 22GB) to a 15% open table cache hit rate, with
0 file opens and 3.11 misses, and an InnoDB buffer pool size of 475MB.
The graphs on SSM, which is where I got these figures, were interesting.
Are you saying that your buffer pool dropped from 22GB to 475MB?
The only thing that can cause that is if mysqld/mariadbd crashed and
was restarted.
Do you have enough file handles? The defaults in the MariaDB systemd
service aren't particularly generous, so it is possible that your increase
of table_open_cache didn't fully take effect because you are
maxed out on file handles.
Do:
systemctl edit mariadb
and add:
[Service]
LimitNOFILE=1048576
then:
systemctl daemon-reload
systemctl restart mariadb
and see if that makes a difference.
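After the restart it may be worth confirming that the running process
actually picked up the new limit. Below is a minimal sketch that reads the
soft "Max open files" value from a /proc/<pid>/limits style file and
compares it with the number of descriptors you want. The function names
soft_nofile_limit and nofile_limit_ok are hypothetical helpers made up for
this example, and it assumes a Linux host with the usual /proc layout.

```shell
# Print the soft "Max open files" limit from a /proc/<pid>/limits style file.
# $1: path to the limits file, e.g. /proc/12345/limits
soft_nofile_limit() {
    # Columns are: "Max" "open" "files" <soft> <hard> <units>
    awk '/^Max open files/ { print $4 }' "$1"
}

# Succeed if the soft limit in file $1 is at least $2 descriptors.
# A soft limit of "unlimited" always passes; a missing line fails.
nofile_limit_ok() {
    nofile_soft=$(soft_nofile_limit "$1")
    nofile_wanted=$2
    [ "$nofile_soft" = "unlimited" ] && return 0
    [ -n "$nofile_soft" ] && [ "$nofile_soft" -ge "$nofile_wanted" ]
}
```

Typical use would be something like
nofile_limit_ok "/proc/$(pidof mariadbd)/limits" 1048576 && echo ok
(the binary may be called mysqld on older installs, so adjust the name to
match). You can also cross-check with "systemctl show mariadb -p LimitNOFILE"
and with SHOW GLOBAL VARIABLES LIKE 'open_files_limit' from a client session.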
Unfortunately it is rather difficult to guess what's going on based
purely on the data points you mentioned thus far.
The only unusual entry in the error log was:
2025-03-27 17:37:56 3194063 [Warning] InnoDB: A long wait (152 seconds) was observed for dict_sys.latch
(17:35 was when SSM was showing everything nose-diving)
This wait time kept growing over the next few minutes until:
2025-03-27 17:41:17 3193777 [Warning] InnoDB: A long wait (354 seconds) was observed for dict_sys.latch
I'd already switched our webservers off of the stricken DB server but
everything came unstuck after that last error log entry.
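To line the growing wait times up against the graphs, the warnings can be
pulled out of the error log as timestamp and duration pairs. This is only a
sketch: latch_waits is a hypothetical helper name, the log path is whatever
your log_error setting points at, and it assumes each warning sits on a
single line in the log file, in the format shown above.

```shell
# Print "<date> <time> <seconds>" for every dict_sys.latch long-wait warning.
# $1: path to the MariaDB error log
latch_waits() {
    sed -n 's/^\([0-9-]* [0-9:]*\) .*InnoDB: A long wait (\([0-9]*\) seconds) was observed for dict_sys\.latch.*/\1 \2/p' "$1"
}
```

Each output line has the date, the time and the wait in seconds, so a
steadily climbing third column means the same latch wait has been
outstanding the whole time rather than many short stalls.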
What would be causing the dict_sys.latch issue? What can be done to fix it?
There seem to be at least 13 still-open bugs (plus probably some more
that have been merged for the next release) that could be causing this:
https://jira.mariadb.org/browse/MDEV-34988?jql=status%20%3D%20Open%20AND%20text%20~%20%22dict_sys.latch%22
--
Derick Turner - He/Him