Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-14 Thread Yan, Zheng
On Thu, Dec 14, 2017 at 12:52 AM, Jens-U. Mozdzen  wrote:
> Hi Yan,
>
> Quoting "Yan, Zheng" :
>>
>> [...]
>>
>> It's likely that some clients held caps on unlinked inodes, which
>> prevents the MDS from purging the objects. When a file gets deleted,
>> the MDS notifies all clients, and the clients are supposed to drop the
>> corresponding caps if possible. You may have hit a bug in this area:
>> some clients failed to drop their caps for unlinked inodes.
>> [...]
>> There is a reconnect stage while the MDS recovers. To reduce the
>> reconnect message size, clients aggressively trim unused inodes from
>> their cache. In your case, most unlinked inodes also got trimmed, so
>> the MDS could purge the corresponding objects after it recovered.
>
>
> thank you for that detailed explanation. While I've already included the
> recent code fix for this issue on a test node, all other mount points
> (including the NFS server machine) still run the non-fixed kernel Ceph
> client. So your description makes me believe we've hit exactly what you
> describe.
>
> Seems we'll have to fix the clients :)
>
> Is there a command I can use to see what caps a client holds, to verify the
> proposed patch actually works?
>

No easy way.

'ceph daemon mds.x session ls' shows how many caps each client holds.
'ceph daemon mds.x dump cache' dumps the whole MDS cache; the caps
information can be extracted from that cache dump.
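
A rough sketch of both (the daemon name "mds.x", the dump file path and
the exact field names / dump format are assumptions and differ a bit
between releases):

--- cut here ---
# per-client session list; num_caps is the number of caps a client holds
ceph daemon mds.x session ls | grep -E '"(id|num_caps)"'

# dump the whole MDS cache to a file on the MDS host, then look at the
# caps held on one particular inode (hex inode number, just as an example)
ceph daemon mds.x dump cache /tmp/mdscache.txt
grep '0x10000000000' /tmp/mdscache.txt | grep 'caps='
--- cut here ---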

Regards
Yan, Zheng
> Regards,
> Jens
>
> PS: Is there a command I can use to see what caps a client holds, to verify
> the proposed patch actually works?
>
>


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi Yan,

Quoting "Yan, Zheng" :

[...]

It's likely that some clients held caps on unlinked inodes, which
prevents the MDS from purging the objects. When a file gets deleted,
the MDS notifies all clients, and the clients are supposed to drop the
corresponding caps if possible. You may have hit a bug in this area:
some clients failed to drop their caps for unlinked inodes.
[...]
There is a reconnect stage while the MDS recovers. To reduce the
reconnect message size, clients aggressively trim unused inodes from
their cache. In your case, most unlinked inodes also got trimmed, so
the MDS could purge the corresponding objects after it recovered.


thank you for that detailed explanation. While I've already included  
the recent code fix for this issue on a test node, all other mount  
points (including the NFS server machine) still run the non-fixed
kernel Ceph client. So your description makes me believe we've hit  
exactly what you describe.


Seems we'll have to fix the clients :)

Is there a command I can use to see what caps a client holds, to  
verify the proposed patch actually works?


Regards,
Jens

PS: Is there a command I can use to see what caps a client holds, to  
verify the proposed patch actually works?





Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Yan, Zheng
On Wed, Dec 13, 2017 at 10:11 PM, Jens-U. Mozdzen  wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS data
> pool (not metadata). As things have worked out over time, I'm just asking
> here so that I can better understand what to look out for in the future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active MDS
> and one standby MDS. We have a range of machines mounting that single CephFS
> via kernel mounts, using different versions of Linux kernels (all at least
> 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation on the
> (HDD-based, replicated) CephFS data pool, although the actual file system
> usage didn't grow over time and actually decreased significantly during that
> time period. The pool allocation went above all warn and crit levels,
> forcing us to add new OSDs (our first three Bluestore OSDs - all others are
> file-based) to relieve pressure, if only for some time.
>
> Part of the growth seems to be related to a large nightly compile job that
> was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: Once we stopped that job, pool
> allocation growth significantly slowed (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects, that
> is, objects for inodes we could not locate in the live CephFS.
>

It's likely that some clients held caps on unlinked inodes, which
prevents the MDS from purging the objects. When a file gets deleted,
the MDS notifies all clients, and the clients are supposed to drop the
corresponding caps if possible. You may have hit a bug in this area:
some clients failed to drop their caps for unlinked inodes.

> All the time, we did not notice any significant growth of the metadata pool
> (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs). Except
> for the fill levels, the cluster was healthy. Restarting MDSs did not help.
>
> Then we had one of the nodes crash for a lack of memory (MDS was > 12 GB,
> plus the new Bluestore OSD and probably the 12.2.1 BlueStore memory leak).
>
> We brought the node back online and at first had MDS report an inconsistent
> file system, though no other errors were reported. Once we restarted the
> other MDS (by then active MDS on another node), that problem went away, too,
> and we were back online. We did not restart clients, neither CephFS mounts
> nor rbd clients.
>
> The following day we noticed an ongoing significant decrease in the number
> of objects in the CephFS data pool. As we couldn't spot any actual problems
> with the content of the CephFS (which was rather stable at the time), we sat
> back and watched - after some hours, the pool stabilized at a total size
> a bit closer to the actual CephFS content than before the mass
> deletion (FS size around 630 GB per "df" output, current data pool size
> about 1100 GB, peak size was around 1.3 TB before the mass deletion).
>

There is a reconnect stage while the MDS recovers. To reduce the
reconnect message size, clients aggressively trim unused inodes from
their cache. In your case, most unlinked inodes also got trimmed, so
the MDS could purge the corresponding objects after it recovered.
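
As a rough way to see whether a backlog of such deleted-but-unpurged
(stray) inodes is building up, the MDS perf counters can be watched;
the counter names below are assumptions and vary between releases:

--- cut here ---
# stray / purge related counters of the active MDS
ceph daemon mds.x perf dump | grep -E '"(num_strays|strays_created|strays_enqueued)"'
--- cut here ---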

Regards
Yan, Zheng

> What might we have been watching - some form of garbage
> collection that was triggered by the node outage? Is this something we could
> have triggered manually before, to avoid the free space problems we faced?
> Or is this something unexpected that should have happened auto-magically
> and much more often, but that for some reason didn't occur in our
> environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
>


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi Webert,

Quoting Webert de Souza Lima :

I have experienced delayed freeing of used space before, in Jewel, but that
just stopped happening with no intervention.


thank you for letting me know.

If none of the developers remembers fixing this issue, it might be a
still-pending problem.



Back then, unmounting all clients' filesystems would make it free the space rapidly.


In our case, while one of the Ceph cluster members rebooted, the
CephFS clients remained active throughout. So my hope is high that a
complete umount is not a hard requirement ;)
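
(Untested thought: on kernel clients it might already be enough to drop
the VFS dentry/inode caches, which should make the client release caps
for unused - including unlinked - inodes without a full umount, e.g.:)

--- cut here ---
# on each kernel CephFS client, as root: drop cached dentries/inodes so
# that caps for inodes no longer in use get released back to the MDS
sync
echo 2 > /proc/sys/vm/drop_caches
--- cut here ---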


Regards,
Jens



Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi John,

Quoting John Spray :

On Wed, Dec 13, 2017 at 2:11 PM, Jens-U. Mozdzen  wrote:
[...]

Then we had one of the nodes crash for a lack of memory (MDS was > 12 GB,
plus the new Bluestore OSD and probably the 12.2.1 BlueStore memory leak).

We brought the node back online and at first had MDS report an inconsistent
file system, though no other errors were reported. Once we restarted the
other MDS (by then active MDS on another node), that problem went away, too,
and we were back online. We did not restart clients, neither CephFS mounts
nor rbd clients.


I'm curious about the "MDS report an inconsistent file system" part --
what exactly was the error you were seeing?


My apologies - being off-site, I mixed up messages. It wasn't about
inconsistencies, but about FS_DEGRADED.


When the failed node came back online (and Ceph had then recovered all
object problems after bringing the OSDs back online), "ceph -s" reported
"1 filesystem is degraded" and "ceph health detail" also showed just
this error. At that time, both MDSs were up and the MDS on the
surviving node was the active MDS.


Once I restarted the MDS on the surviving node, FS_DEGRADED was cleared:

--- cut here ---
2017-12-07 19:05:33.113619 mon.node01 mon.0 192.168.160.15:6789/0 243  
: cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; noout  
flag(s) set; 1 nearfull osd(s)
2017-12-07 19:06:33.113826 mon.node01 mon.0 192.168.160.15:6789/0 298  
: cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:06:33.113923 mon.node01 mon.0 192.168.160.15:6789/0 299  
: cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:11:16.997308 mon.node01 mon.0 192.168.160.15:6789/0 541  
: cluster [INF] Standby daemon mds.node01 assigned to filesystem  
cephfs as rank 0
2017-12-07 19:11:16.997446 mon.node01 mon.0 192.168.160.15:6789/0 542  
: cluster [WRN] Health check failed: insufficient standby MDS daemons  
available (MDS_INSUFFICIENT_STANDBY)
2017-12-07 19:11:20.968933 mon.node01 mon.0 192.168.160.15:6789/0 553  
: cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was:  
insufficient standby MDS daemons available)
2017-12-07 19:11:33.113816 mon.node01 mon.0 192.168.160.15:6789/0 565  
: cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:11:33.114958 mon.node01 mon.0 192.168.160.15:6789/0 566  
: cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:12:09.889106 mon.node01 mon.0 192.168.160.15:6789/0 598  
: cluster [INF] daemon mds.node01 is now active in filesystem cephfs  
as rank 0
2017-12-07 19:12:09.983442 mon.node01 mon.0 192.168.160.15:6789/0 599  
: cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem  
is degraded)
--- cut here ---


No other errors/warnings were obvious. The "insufficient standby" at
19:11:16.997308 was likely caused by the restart of the MDS at node2.
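
For future reference, a quick way to see the per-rank MDS state while a
filesystem is flagged as degraded (a rough sketch; "ceph fs status"
needs the status mgr module on Luminous, and the output layout differs
between releases):

--- cut here ---
# overall health plus the reason for FS_DEGRADED
ceph health detail
# per-rank MDS states (active / replay / reconnect / rejoin / ...)
ceph fs status cephfs
--- cut here ---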


Regards,
Jens



Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Webert de Souza Lima
I have experienced delayed freeing of used space before, in Jewel, but that
just stopped happening with no intervention.
Back then, unmounting all clients' filesystems would make it free the space rapidly.
I don't know if that's related.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread John Spray
On Wed, Dec 13, 2017 at 2:11 PM, Jens-U. Mozdzen  wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS data
> pool (not metadata). As things have worked out over time, I'm just asking
> here so that I can better understand what to look out for in the future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active MDS
> and one standby MDS. We have a range of machines mounting that single CephFS
> via kernel mounts, using different versions of Linux kernels (all at least
> 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation on the
> (HDD-based, replicated) CephFS data pool, although the actual file system
> usage didn't grow over time and actually decreased significantly during that
> time period. The pool allocation went above all warn and crit levels,
> forcing us to add new OSDs (our first three Bluestore OSDs - all others are
> file-based) to relieve pressure, if only for some time.
>
> Part of the growth seems to be related to a large nightly compile job that
> was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: Once we stopped that job, pool
> allocation growth significantly slowed (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects, that
> is, objects for inodes we could not locate in the live CephFS.
>
> All the time, we did not notice any significant growth of the metadata pool
> (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs). Except
> for the fill levels, the cluster was healthy. Restarting MDSs did not help.
>
> Then we had one of the nodes crash for a lack of memory (MDS was > 12 GB,
> plus the new Bluestore OSD and probably the 12.2.1 BlueStore memory leak).
>
> We brought the node back online and at first had MDS report an inconsistent
> file system, though no other errors were reported. Once we restarted the
> other MDS (by then active MDS on another node), that problem went away, too,
> and we were back online. We did not restart clients, neither CephFS mounts
> nor rbd clients.

I'm curious about the "MDS report an inconsistent file system" part --
what exactly was the error you were seeing?

John

> The following day we noticed an ongoing significant decrease in the number
> of objects in the CephFS data pool. As we couldn't spot any actual problems
> with the content of the CephFS (which was rather stable at the time), we sat
> back and watched - after some hours, the pool stabilized at a total size
> a bit closer to the actual CephFS content than before the mass
> deletion (FS size around 630 GB per "df" output, current data pool size
> about 1100 GB, peak size was around 1.3 TB before the mass deletion).
>
> What might we have been watching - some form of garbage
> collection that was triggered by the node outage? Is this something we could
> have triggered manually before, to avoid the free space problems we faced?
> Or is this something unexpected that should have happened auto-magically
> and much more often, but that for some reason didn't occur in our
> environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
>


[ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi *,

during the last weeks, we noticed some strange behavior of our CephFS  
data pool (not metadata). As things have worked out over time, I'm  
just asking here so that I can better understand what to look out for  
in the future.


This is on a three-node Ceph Luminous (12.2.1) cluster with one active  
MDS and one standby MDS. We have a range of machines mounting that  
single CephFS via kernel mounts, using different versions of Linux  
kernels (all at least 4.4, with vendor backports).


We observed an ever-increasing number of objects and space allocation  
on the (HDD-based, replicated) CephFS data pool, although the actual  
file system usage didn't grow over time and actually decreased  
significantly during that time period. The pool allocation went above  
all warn and crit levels, forcing us to add new OSDs (our first three  
Bluestore OSDs - all others are file-based) to relieve pressure, if
only for some time.


Part of the growth seems to be related to a large nightly compile job
that was using CephFS via an NFS server (kernel-based) exposing the  
kernel-mounted CephFS to many nodes: Once we stopped that job, pool  
allocation growth significantly slowed (but didn't stop).


Further diagnosis hinted that the data pool had many orphan objects,  
that is, objects for inodes we could not locate in the live CephFS.
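
For anyone wanting to redo that kind of diagnosis: CephFS data objects
are named "<inode number in hex>.<block number>", so something along
these lines gives a first impression (pool name and mount point are
assumptions, and a "find" per inode is very slow - it is only a sketch):

--- cut here ---
# collect the distinct inode numbers (hex) that have objects in the data pool
rados -p cephfs_data ls | cut -d. -f1 | sort -u > /tmp/pool-inodes.txt

# check each of them against the live (kernel-mounted) CephFS
while read hexino; do
    find /mnt/cephfs -xdev -inum "$((16#$hexino))" -print -quit | grep -q . \
        || echo "no file found for inode 0x$hexino - possible orphan"
done < /tmp/pool-inodes.txt
--- cut here ---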


All the time, we did not notice any significant growth of the metadata  
pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS,  
OSDs). Except for the fill levels, the cluster was healthy. Restarting  
MDSs did not help.


Then we had one of the nodes crash for a lack of memory (MDS was > 12  
GB, plus the new Bluestore OSD and probably the 12.2.1 BlueStore  
memory leak).


We brought the node back online and at first had MDS report an  
inconsistent file system, though no other errors were reported. Once  
we restarted the other MDS (by then active MDS on another node), that  
problem went away, too, and we were back online. We did not restart  
clients, neither CephFS mounts nor rbd clients.


The following day we noticed an ongoing significant decrease in the  
number of objects in the CephFS data pool. As we couldn't spot any  
actual problems with the content of the CephFS (which was rather  
stable at the time), we sat back and watched - after some hours, the  
pool stabilized at a total size a bit closer to the
actual CephFS content than before the mass deletion (FS size around  
630 GB per "df" output, current data pool size about 1100 GB, peak  
size was around 1.3 TB before the mass deletion).
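
A simple way to watch such a purge progressing is the object count of
the data pool, roughly like this (pool name is an assumption):

--- cut here ---
# object count and usage of the CephFS data pool, refreshed every minute
watch -n 60 'ceph df detail | grep -E "OBJECTS|cephfs_data"'
--- cut here ---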


What might we have been watching - some form of garbage
collection that was triggered by the node outage? Is this something we  
could have triggered manually before, to avoid the free space problems  
we faced? Or is this something unexpected that should have happened
auto-magically and much more often, but that for some reason didn't  
occur in our environment?


Thank you for any ideas and/or pointers you may share.

Regards,
J
