Re: [ceph-users] cephfs automatic data pool cleanup
On Thu, Dec 14, 2017 at 12:52 AM, Jens-U. Mozdzen wrote:
> Hi Yan,
>
> Zitat von "Yan, Zheng":
>>
>> [...]
>>
>> It's likely some clients had caps on unlinked inodes, which prevents the
>> MDS from purging objects. When a file gets deleted, the MDS notifies all
>> clients, and clients are supposed to drop the corresponding caps if
>> possible. You may have hit a bug in this area, where some clients failed
>> to drop caps for unlinked inodes.
>> [...]
>> There is a reconnect stage while the MDS recovers. To reduce reconnect
>> message size, clients trim unused inodes from their cache aggressively.
>> In your case, most unlinked inodes also got trimmed, so the MDS could
>> purge the corresponding objects after it recovered.
>
> thank you for that detailed explanation. While I've already included the
> recent code fix for this issue on a test node, all other mount points
> (including the NFS server machine) still run the non-fixed kernel Ceph
> client. So your description makes me believe we've hit exactly what you
> describe.
>
> Seems we'll have to fix the clients :)
>
> Is there a command I can use to see what caps a client holds, to verify
> the proposed patch actually works?
>

No easy way. 'ceph daemon mds.x session ls' can show how many caps each
client holds. 'ceph daemon mds.x dump cache' dumps the whole MDS cache;
that information can be extracted from the cache dump.

Regards
Yan, Zheng

> Regards,
> Jens
>
> PS: Is there a command I can use to see what caps a client holds, to
> verify the proposed patch actually works?

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
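[Editor's note for the archive: since `ceph daemon mds.<name> session ls` emits JSON, ranking clients by held caps is a short script away. A minimal sketch in Python, assuming the Luminous-era field names `id`, `num_caps` and `client_metadata.hostname` (check the output of your own release), with a made-up two-session sample standing in for real daemon output:]

```python
import json

def top_cap_holders(sessions, n=5):
    """Return (client id, hostname, num_caps) tuples, largest cap count first."""
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [(s.get("id"),
             s.get("client_metadata", {}).get("hostname", "?"),
             s.get("num_caps", 0))
            for s in ranked[:n]]

# Hypothetical sample; in practice capture real output with e.g.
#   ceph daemon mds.node01 session ls > sessions.json
# and read it back with json.load().
sample = json.loads("""
[
  {"id": 4321, "num_caps": 105000, "client_metadata": {"hostname": "nfs-gw"}},
  {"id": 4711, "num_caps": 230,    "client_metadata": {"hostname": "test-node"}}
]
""")

for cid, host, caps in top_cap_holders(sample):
    print(cid, host, caps)
```

[A client that keeps tens of thousands of caps while its workload is idle would be a natural candidate for holding caps on unlinked inodes.]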
Re: [ceph-users] cephfs automatic data pool cleanup
Hi Yan,

Zitat von "Yan, Zheng":
> [...]
> It's likely some clients had caps on unlinked inodes, which prevents the
> MDS from purging objects. When a file gets deleted, the MDS notifies all
> clients, and clients are supposed to drop the corresponding caps if
> possible. You may have hit a bug in this area, where some clients failed
> to drop caps for unlinked inodes.
> [...]
> There is a reconnect stage while the MDS recovers. To reduce reconnect
> message size, clients trim unused inodes from their cache aggressively.
> In your case, most unlinked inodes also got trimmed, so the MDS could
> purge the corresponding objects after it recovered.

thank you for that detailed explanation. While I've already included the
recent code fix for this issue on a test node, all other mount points
(including the NFS server machine) still run the non-fixed kernel Ceph
client. So your description makes me believe we've hit exactly what you
describe.

Seems we'll have to fix the clients :)

Is there a command I can use to see what caps a client holds, to verify
the proposed patch actually works?

Regards,
Jens

PS: Is there a command I can use to see what caps a client holds, to
verify the proposed patch actually works?
Re: [ceph-users] cephfs automatic data pool cleanup
On Wed, Dec 13, 2017 at 10:11 PM, Jens-U. Mozdzen wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS
> data pool (not metadata). As things have worked out over time, I'm just
> asking here so that I can better understand what to look out for in the
> future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active
> MDS and one standby MDS. We have a range of machines mounting that
> single CephFS via kernel mounts, using different versions of Linux
> kernels (all at least 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation on
> the (HDD-based, replicated) CephFS data pool, although the actual file
> system usage didn't grow over time and actually decreased significantly
> during that time period. The pool allocation went above all warn and
> crit levels, forcing us to add new OSDs (our first three BlueStore OSDs
> - all others are file-based) to relieve pressure, if only for some time.
>
> Part of the growth seems to be related to a large nightly compile job
> that was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: once we stopped that job, pool
> allocation growth slowed significantly (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects,
> that is, objects for inodes we could not locate in the live CephFS.

It's likely some clients had caps on unlinked inodes, which prevents the
MDS from purging objects. When a file gets deleted, the MDS notifies all
clients, and clients are supposed to drop the corresponding caps if
possible. You may have hit a bug in this area, where some clients failed
to drop caps for unlinked inodes.

> All the time, we did not notice any significant growth of the metadata
> pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs).
> Except for the fill levels, the cluster was healthy. Restarting MDSs did
> not help.
>
> Then we had one of the nodes crash for a lack of memory (MDS was > 12
> GB, plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory
> leak).
>
> We brought the node back online and at first had MDS report an
> inconsistent file system, though no other errors were reported. Once we
> restarted the other MDS (by then the active MDS on another node), that
> problem went away, too, and we were back online. We did not restart
> clients, neither CephFS mounts nor rbd clients.
>
> The following day we noticed an ongoing significant decrease in the
> number of objects in the CephFS data pool. As we couldn't spot any
> actual problems with the content of the CephFS (which was rather stable
> at the time), we sat back and watched - after some hours, the pool
> stabilized in size at a total a bit closer to the actual CephFS content
> than before the mass deletion (FS size around 630 GB per "df" output,
> current data pool size about 1100 GB, peak size around 1.3 TB before the
> mass deletion).

There is a reconnect stage while the MDS recovers. To reduce reconnect
message size, clients trim unused inodes from their cache aggressively.
In your case, most unlinked inodes also got trimmed, so the MDS could
purge the corresponding objects after it recovered.

Regards
Yan, Zheng

> What may it have been that we were watching - some form of garbage
> collection that was triggered by the node outage? Is this something we
> could have triggered manually before, to avoid the free space problems
> we faced? Or is this something unexpected, that should have happened
> auto-magically and much more often, but for some reason didn't occur in
> our environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
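[Editor's note: Yan's explanation suggests a way to watch this from the outside. Deleted-but-unpurged files show up as "strays" in the MDS, so comparing two snapshots of `ceph daemon mds.<name> perf dump` shows whether deletions are actually being purged. The `mds_cache` counter names below (`num_strays`, `strays_created`, `strays_enqueued`) are what I'd expect on Luminous but may differ between releases, and the sample numbers are made up:]

```python
def stray_progress(before, after):
    """Compare two `perf dump` snapshots: if strays are created much faster
    than they are enqueued for purging, deleted files are piling up."""
    b, a = before["mds_cache"], after["mds_cache"]
    return {
        "strays_outstanding": a["num_strays"],
        "created_delta": a["strays_created"] - b["strays_created"],
        "enqueued_delta": a["strays_enqueued"] - b["strays_enqueued"],
    }

# Hypothetical snapshots taken a few minutes apart.
snap1 = {"mds_cache": {"num_strays": 180000,
                       "strays_created": 500000,
                       "strays_enqueued": 320000}}
snap2 = {"mds_cache": {"num_strays": 180950,
                       "strays_created": 501000,
                       "strays_enqueued": 320050}}

print(stray_progress(snap1, snap2))
```

[In this made-up sample, 1000 strays were created but only 50 enqueued for purging over the interval, i.e. the purge backlog is growing - the symptom described in this thread.]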
Re: [ceph-users] cephfs automatic data pool cleanup
Hi Webert,

Zitat von Webert de Souza Lima:
> I have experienced delayed freeing of used space before, in Jewel, but
> that just stopped happening with no intervention.

thank you for letting me know. If none of the developers remembers fixing
this issue, it might be a still-pending problem.

> Back then, unmounting all clients' filesystems would make it free the
> space rapidly.

In our case, while one of the Ceph cluster members rebooted, the CephFS
clients remained active continuously. So my hope is high that a complete
umount is no hard requirement ;)

Regards,
Jens
Re: [ceph-users] cephfs automatic data pool cleanup
Hi John,

Zitat von John Spray:
> On Wed, Dec 13, 2017 at 2:11 PM, Jens-U. Mozdzen wrote:
>> [...]
>> Then we had one of the nodes crash for a lack of memory (MDS was > 12
>> GB, plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory
>> leak).
>>
>> We brought the node back online and at first had MDS report an
>> inconsistent file system, though no other errors were reported. Once we
>> restarted the other MDS (by then the active MDS on another node), that
>> problem went away, too, and we were back online. We did not restart
>> clients, neither CephFS mounts nor rbd clients.
>
> I'm curious about the "MDS report an inconsistent file system" part --
> what exactly was the error you were seeing?

my apologies, being off-site I mixed up messages. It wasn't about
inconsistencies, but FS_DEGRADED. When the failed node came back online
(and Ceph had then recovered all object problems after bringing the OSDs
online), "ceph -s" reported "1 filesystem is degraded" and "ceph health
detail" also showed just this error. At that time, both MDS were up and
the MDS on the surviving node was the active MDS.

Once I restarted the MDS on the surviving node, FS_DEGRADED was cleared:

--- cut here ---
2017-12-07 19:05:33.113619 mon.node01 mon.0 192.168.160.15:6789/0 243 : cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; noout flag(s) set; 1 nearfull osd(s)
2017-12-07 19:06:33.113826 mon.node01 mon.0 192.168.160.15:6789/0 298 : cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:06:33.113923 mon.node01 mon.0 192.168.160.15:6789/0 299 : cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:11:16.997308 mon.node01 mon.0 192.168.160.15:6789/0 541 : cluster [INF] Standby daemon mds.node01 assigned to filesystem cephfs as rank 0
2017-12-07 19:11:16.997446 mon.node01 mon.0 192.168.160.15:6789/0 542 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2017-12-07 19:11:20.968933 mon.node01 mon.0 192.168.160.15:6789/0 553 : cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)
2017-12-07 19:11:33.113816 mon.node01 mon.0 192.168.160.15:6789/0 565 : cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:11:33.114958 mon.node01 mon.0 192.168.160.15:6789/0 566 : cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:12:09.889106 mon.node01 mon.0 192.168.160.15:6789/0 598 : cluster [INF] daemon mds.node01 is now active in filesystem cephfs as rank 0
2017-12-07 19:12:09.983442 mon.node01 mon.0 192.168.160.15:6789/0 599 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
--- cut here ---

No other errors/warnings were obvious. The "insufficient standby" message
at 19:11:16.997308 is likely caused by the restart of the MDS at node2.

Regards,
Jens
Re: [ceph-users] cephfs automatic data pool cleanup
I have experienced delayed freeing of used space before, in Jewel, but
that just stopped happening with no intervention. Back then, unmounting
all clients' filesystems would make it free the space rapidly. I don't
know if that's related.

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
Re: [ceph-users] cephfs automatic data pool cleanup
On Wed, Dec 13, 2017 at 2:11 PM, Jens-U. Mozdzen wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS
> data pool (not metadata). As things have worked out over time, I'm just
> asking here so that I can better understand what to look out for in the
> future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active
> MDS and one standby MDS. We have a range of machines mounting that
> single CephFS via kernel mounts, using different versions of Linux
> kernels (all at least 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation on
> the (HDD-based, replicated) CephFS data pool, although the actual file
> system usage didn't grow over time and actually decreased significantly
> during that time period. The pool allocation went above all warn and
> crit levels, forcing us to add new OSDs (our first three BlueStore OSDs
> - all others are file-based) to relieve pressure, if only for some time.
>
> Part of the growth seems to be related to a large nightly compile job
> that was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: once we stopped that job, pool
> allocation growth slowed significantly (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects,
> that is, objects for inodes we could not locate in the live CephFS.
>
> All the time, we did not notice any significant growth of the metadata
> pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs).
> Except for the fill levels, the cluster was healthy. Restarting MDSs did
> not help.
>
> Then we had one of the nodes crash for a lack of memory (MDS was > 12
> GB, plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory
> leak).
>
> We brought the node back online and at first had MDS report an
> inconsistent file system, though no other errors were reported. Once we
> restarted the other MDS (by then the active MDS on another node), that
> problem went away, too, and we were back online. We did not restart
> clients, neither CephFS mounts nor rbd clients.

I'm curious about the "MDS report an inconsistent file system" part --
what exactly was the error you were seeing?

John

> The following day we noticed an ongoing significant decrease in the
> number of objects in the CephFS data pool. As we couldn't spot any
> actual problems with the content of the CephFS (which was rather stable
> at the time), we sat back and watched - after some hours, the pool
> stabilized in size at a total a bit closer to the actual CephFS content
> than before the mass deletion (FS size around 630 GB per "df" output,
> current data pool size about 1100 GB, peak size around 1.3 TB before the
> mass deletion).
>
> What may it have been that we were watching - some form of garbage
> collection that was triggered by the node outage? Is this something we
> could have triggered manually before, to avoid the free space problems
> we faced? Or is this something unexpected, that should have happened
> auto-magically and much more often, but for some reason didn't occur in
> our environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
[ceph-users] cephfs automatic data pool cleanup
Hi *,

during the last weeks, we noticed some strange behavior of our CephFS
data pool (not metadata). As things have worked out over time, I'm just
asking here so that I can better understand what to look out for in the
future.

This is on a three-node Ceph Luminous (12.2.1) cluster with one active
MDS and one standby MDS. We have a range of machines mounting that
single CephFS via kernel mounts, using different versions of Linux
kernels (all at least 4.4, with vendor backports).

We observed an ever-increasing number of objects and space allocation on
the (HDD-based, replicated) CephFS data pool, although the actual file
system usage didn't grow over time and actually decreased significantly
during that time period. The pool allocation went above all warn and
crit levels, forcing us to add new OSDs (our first three BlueStore OSDs
- all others are file-based) to relieve pressure, if only for some time.

Part of the growth seems to be related to a large nightly compile job
that was using CephFS via an NFS server (kernel-based) exposing the
kernel-mounted CephFS to many nodes: once we stopped that job, pool
allocation growth slowed significantly (but didn't stop).

Further diagnosis hinted that the data pool had many orphan objects,
that is, objects for inodes we could not locate in the live CephFS.

All the time, we did not notice any significant growth of the metadata
pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs).
Except for the fill levels, the cluster was healthy. Restarting MDSs did
not help.

Then we had one of the nodes crash for a lack of memory (MDS was > 12
GB, plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory
leak).

We brought the node back online and at first had MDS report an
inconsistent file system, though no other errors were reported. Once we
restarted the other MDS (by then the active MDS on another node), that
problem went away, too, and we were back online. We did not restart
clients, neither CephFS mounts nor rbd clients.

The following day we noticed an ongoing significant decrease in the
number of objects in the CephFS data pool. As we couldn't spot any
actual problems with the content of the CephFS (which was rather stable
at the time), we sat back and watched - after some hours, the pool
stabilized in size at a total a bit closer to the actual CephFS content
than before the mass deletion (FS size around 630 GB per "df" output,
current data pool size about 1100 GB, peak size around 1.3 TB before the
mass deletion).

What may it have been that we were watching - some form of garbage
collection that was triggered by the node outage? Is this something we
could have triggered manually before, to avoid the free space problems
we faced? Or is this something unexpected, that should have happened
auto-magically and much more often, but for some reason didn't occur in
our environment?

Thank you for any ideas and/or pointers you may share.

Regards,
J
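[Editor's note for anyone wanting to repeat the orphan-object diagnosis described above: CephFS data objects are conventionally named `<inode-number-in-hex>.<block-index>`, so one can list the data pool with `rados -p <datapool> ls`, reduce the object names to unique inode numbers, and check each against the live file system, e.g. with `find <mountpoint> -inum <decimal-inode>`. A minimal sketch of the name-to-inode step, with hypothetical object and pool names:]

```python
def inodes_from_objects(names):
    """Reduce CephFS data object names (<hex-inode>.<block>) to unique
    decimal inode numbers."""
    inodes = set()
    for name in names:
        prefix, _, _ = name.partition(".")
        try:
            inodes.add(int(prefix, 16))
        except ValueError:
            pass  # skip objects that don't follow the <ino>.<block> scheme
    return sorted(inodes)

# Hypothetical `rados -p cephfs_data ls` output: three objects, two inodes.
sample = ["10000000001.00000000",
          "10000000001.00000001",
          "10000000abc.00000000"]

print(inodes_from_objects(sample))
```

[An inode that `find <mountpoint> -inum <n>` cannot locate is either a stray awaiting purge or a genuine orphan like the ones described above, so the results need interpreting with the purge backlog in mind.]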