Hello people, we have been experiencing an issue with lvm2-thin on
_some_ of our production servers where, out of nowhere,
lvm2/device-mapper starts spamming error logs, and I can't trace
down the root cause.
This is what the logs look like:
Oct 9 06:25:02 U5bW8JT7 lvm[8020]: device-mapper: waitevent ioctl on
LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
failed: Inappropriate ioctl for device
Oct 9 06:25:02 U5bW8JT7 lvm[8020]: waitevent: dm_task_run failed:
Inappropriate ioctl for device
It writes this to rsyslog so fast that not even tail can keep up
over ssh. I'm really lost as to the cause, or how to trace this back
to a process. We use lvm-thin to host virtual machines via libvirt:
several thin volumes under a single thin pool, one per virtual
machine. The thin pool is created on top of a VG that sits on top of
an md-raid1 device.
/dev/md127:
Version : 1.2
Creation Time : Wed May 5 16:56:09 2021
Raid Level : raid1
Array Size : 1953382464 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953382464 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Oct 9 07:31:44 2024
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : U5bW8JT7:1 (local to host U5bW8JT7)
UUID : 5072120c:b114a0aa:51438cf8:8ebe58b8
Events : 70525
Number Major Minor RaidDevice State
0 259 0 0 active sync /dev/nvme0n1
1 259 1 1 active sync /dev/nvme1n1
I do not know which device this
`CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool`
name refers to in device mapper; I'm unable to find it directly.
However, the -tpool suffix suggests it's the thin pool, so here is
what lvdisplay says about it:
--- Logical volume ---
LV Name lvol1
VG Name lightning-nvme
LV UUID mOowip-TAFu-M7hq-zHz6-pRVv-UaNO-9FGzeq
LV Write Access read/write
LV Creation host, time U5bW8JT7, 2021-05-05 16:57:13 +0000
LV Pool metadata lvol1_tmeta
LV Pool data lvol1_tdata
LV Status available
# open 72
LV Size <1.80 TiB
Allocated pool data 81.30%
Allocated metadata 49.28%
Current LE 471040
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2
The start of the device name suggests it belongs to the VG, so here
is also what vgdisplay says:
--- Volume group ---
VG Name lightning-nvme
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 52521
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 72
Open LV 70
Max PV 0
Cur PV 1
Act PV 1
VG Size <1.82 TiB
PE Size 4.00 MiB
Total PE 476899
Alloc PE / Size 471098 / <1.80 TiB
Free PE / Size 5801 / 22.66 GiB
VG UUID CP5Gw8-QrWL-qwhB-cJL8-7R1m-c9Q9-KTBtQQ
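For what it's worth, I can reconstruct the name from the UUIDs above,
which seems to confirm my guess that it's the thin pool: the dm name
for an LVM volume appears to be "LVM-" plus the VG UUID plus the LV
UUID with the dashes stripped, and the hidden thin-pool device gets a
-tpool suffix. A quick sanity check in plain POSIX shell, with the
UUIDs copied from the vgdisplay/lvdisplay output above:

```shell
#!/bin/sh
# UUIDs copied verbatim from the vgdisplay and lvdisplay output above.
vg_uuid="CP5Gw8-QrWL-qwhB-cJL8-7R1m-c9Q9-KTBtQQ"
lv_uuid="mOowip-TAFu-M7hq-zHz6-pRVv-UaNO-9FGzeq"

# dm name = "LVM-" + VG UUID + LV UUID (dashes removed) + "-tpool"
dm_name="LVM-$(printf '%s%s' "$vg_uuid" "$lv_uuid" | tr -d '-')-tpool"
echo "$dm_name"
# LVM-CP5Gw8QrWLqwhBcJL87R1mc9Q9KTBtQQmOowipTAFuM7hqzHz6pRVvUaNO9FGzeq-tpool
```

That matches the name from the log lines character for character, so
the errors are definitely about this pool.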
Any tips on how to trace this back to a process, a cause, or a bug
are highly appreciated. I literally only have these two log lines to
go by; everything else works as expected until the server crashes
after these logs fill the disk. Rebooting the node stops the log spam
until it starts again, seemingly at random, after several days.
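As a stopgap against the disk filling up, I'm considering
rate-limiting local syslog input in rsyslog. A sketch (directive
names are for the imuxsock module; the exact values are guesses and
may need adapting):

```
# /etc/rsyslog.conf -- rate-limit messages arriving via /dev/log (imuxsock)
# Allow at most 200 messages per 5-second interval per process;
# excess messages within the interval are dropped.
$SystemLogRateLimitInterval 5
$SystemLogRateLimitBurst 200
```

That obviously doesn't fix the underlying waitevent failure, it just
keeps the node alive long enough to debug it.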
Running Ubuntu 18.04.5 LTS - lvm2 2.02.176-4.1ubuntu3.18.04.3
--
Fabricio Winter