This afternoon we ran into a problem on our test cluster, which is running 
Nautilus (14.2.2).  Any time PGs move on the cluster (from marking an OSD 
down, setting the primary-affinity to 0, or using the balancer), a large 
number of the OSDs peg the CPU cores they're running on for a while, which 
causes slow requests.  From what I can tell, this appears to be related to 
slow peering caused by osd_pg_create() taking a long time.

This was seen on quite a few OSDs while waiting for peering to complete:

# ceph daemon osd.3 ops
{
    "ops": [
        {
            "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
            "initiated_at": "2019-08-27 14:34:46.556413",
            "age": 318.25234538000001,
            "duration": 318.25241895300002,
            "type_data": {
                "flag_point": "started",
                "events": [
                    {
                        "time": "2019-08-27 14:34:46.556413",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556413",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556299",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-08-27 14:34:46.556456",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456901",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456903",
                        "event": "wait for new map"
                    },
                    {
                        "time": "2019-08-27 14:40:01.292346",
                        "event": "started"
                    }
                ]
            }
        },
...snip...
        {
            "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
            "initiated_at": "2019-08-27 14:35:09.908567",
            "age": 294.900191001,
            "duration": 294.90068416899999,
            "type_data": {
                "flag_point": "delayed",
                "events": [
                    {
                        "time": "2019-08-27 14:35:09.908567",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908567",
                        "event": "header_read"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908520",
                        "event": "throttled"
                    },
                    {
                        "time": "2019-08-27 14:35:09.908617",
                        "event": "all_read"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456921",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-08-27 14:35:12.456923",
                        "event": "wait for new map"
                    }
                ]
            }
        }
    ],
    "num_ops": 6
}


That "wait for new map" message made us think something was getting hung up on 
the monitors, so we restarted them all without any luck.
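
To check where the time goes in a dump like the one above, here is a minimal Python sketch that reports, for each op, the event it sat in the longest. It assumes only the JSON fields visible in the output above ("ops", "description", "initiated_at", "age", "type_data"/"events"); the function name longest_waits is made up for illustration and is not part of Ceph. Events are sorted by timestamp first because "throttled" is listed after events that carry a later time. For ops that are still pending, the dump time is estimated as initiated_at + age, which is an assumption about what "age" means.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the "ceph daemon osd.N ops" output above.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def longest_waits(ops_dump):
    """Return one (description, event, seconds) tuple per op.

    "event" is the event the op spent the most time in, measured as the
    gap until the next event (or until the estimated dump time for the
    last event of an op).
    """
    results = []
    for op in ops_dump.get("ops", []):
        # Sort by time; the sort is stable, so equal timestamps keep
        # the order in which they were listed.
        events = sorted(
            ((datetime.strptime(e["time"], TIME_FORMAT), e["event"])
             for e in op["type_data"]["events"]),
            key=lambda pair: pair[0],
        )
        # Assumption: initiated_at + age approximates when the dump was
        # taken, which lets us measure the still-open last event.
        if "initiated_at" in op and "age" in op:
            dump_time = (datetime.strptime(op["initiated_at"], TIME_FORMAT)
                         + timedelta(seconds=op["age"]))
            events.append((dump_time, "<dump taken>"))

        longest = None
        for (start, name), (end, _next_name) in zip(events, events[1:]):
            gap = (end - start).total_seconds()
            if longest is None or gap > longest[1]:
                longest = (name, gap)
        if longest is not None:
            results.append((op["description"], longest[0], longest[1]))
    return results
```

Run against json.loads() of the output above, this should attribute roughly 289 seconds of the first op to "wait for new map" (14:35:12 to 14:40:01), with the second op still sitting in the same state when the dump was taken.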

I'll keep investigating, but so far my Google searches aren't pulling anything 
up, so I wanted to see if anyone else is running into this.

Thanks,
Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
