We had some network problems (high packet drops) on some CephFS client nodes running ceph-fuse (14.2.13) against a Nautilus cluster (on version 14.2.8).  As a result, a couple of clients got evicted (as one would expect).  What was really odd is that the evicted clients kept trying to flush the data they had in cache and kept getting rejected by the OSDs for almost an hour, and then the flush magically worked.  When queried afterwards, the client reported that it was no longer blacklisted.  How would that happen?  I certainly didn't run any commands to un-blacklist a client, and the docs say that otherwise the client will stay blacklisted until the file system gets remounted.
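
For reference, the only cluster-side commands I know of for inspecting or clearing the blacklist by hand are the ones below (the address is the client's inst_str from the status output further down; I did not actually run the rm command, it's only shown for completeness):

ceph osd blacklist ls
ceph osd blacklist rm 10.254.65.33:0/410851087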

Here is the status of the client while it was blacklisted:
[root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717896,
    "osd_epoch_barrier": 1717893,
    "blacklisted": true
}

This corresponds to the following server-side log messages:
2020-11-09 15:56:31.578 7fffe59a4700  1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.578 7fffe59a4700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700  1 mds.0.3376160 Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
2020-11-09 15:56:31.706 7fffe59a4700  0 log_channel(cluster) log [INF] : Evicting (and blacklisting) client session 111995680 (10.254.65.33:0/410851087)
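
(If it's useful, I believe something along these lines would confirm on the MDS side that the session really was dropped after the eviction; mds.0 here is just the active rank from the log above:

ceph tell mds.0 session ls | grep 111995680
)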

and then some time later (perhaps half an hour or so) I got this from the client:

[root@worker2033 ceph]# ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok status
{
    "metadata": {
        "ceph_sha1": "1778d63e55dbff6cedb071ab7d367f8f52a8699f",
        "ceph_version": "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)",
        "entity_id": "cephfs2",
        "hostname": "worker2033",
        "mount_point": "/mnt/ceph",
        "pid": "7698",
        "root": "/"
    },
    "dentry_count": 252,
    "dentry_pinned_count": 9,
    "id": 111995680,
    "inst": {
        "name": {
            "type": "client",
            "num": 111995680
        },
        "addr": {
            "type": "v1",
            "addr": "10.254.65.33:0",
            "nonce": 410851087
        }
    },
    "addr": {
        "type": "v1",
        "addr": "10.254.65.33:0",
        "nonce": 410851087
    },
    "inst_str": "client.111995680 10.254.65.33:0/410851087",
    "addr_str": "10.254.65.33:0/410851087",
    "inode_count": 251,
    "mds_epoch": 3376260,
    "osd_epoch": 1717897,
    "osd_epoch_barrier": 1717893,
    "blacklisted": false
}

The cluster was otherwise healthy - nothing wrong with the MDSs, or any placement groups, etc.  I also don't see any further log messages regarding eviction/blacklisting in the MDS logs.  I didn't run any ceph commands that would change the state of the cluster - I was just looking around and increasing log levels.
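
For what it's worth, one way to narrow this down (assuming the mons haven't trimmed those epochs yet) would be to compare the blacklist section of the OSD map at the two epochs the client reported, to see whether the entry disappeared between them:

ceph osd dump 1717896 | grep blacklist
ceph osd dump 1717897 | grep blacklist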

Any ideas how that could have happened?

A separate problem (which perhaps needs a ticket filed) is that while the ceph-fuse client was in a blacklisted state, it kept retrying in an infinite loop to flush data to the OSDs and got rejected every time.  I have logs with the details of this too.
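
For anyone who wants to look into that, the retry loop shows up once the client-side debug levels are raised through the same admin socket; roughly these knobs (the exact levels are just what I would pick, adjust as needed):

ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok config set debug_client 20
ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok config set debug_objecter 20
ceph daemon /var/run/ceph/ceph-client.cephfs2.7698.93825141588944.asok config set debug_ms 1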

Andras