On Fri, Apr 12, 2019 at 9:52 AM Magnus Grönlund <mag...@gronlund.se> wrote:
>
> Hi Jason,
>
> I tried to follow the instructions; setting the debug level to 15 worked OK, 
> but the daemon appeared to silently ignore the restart command (nothing 
> indicating a restart was seen in the log).
> So I set the log level to 15 in the config file and restarted the rbd-mirror 
> daemon instead. The output surprised me though; my previous perception of the 
> issue might be completely wrong...
> Lots of "image_replayer::BootstrapRequest:.... failed to create local image: 
> (2) No such file or directory" and ":ImageReplayer: ....  replay encountered 
> an error: (42) No message of desired type"

What is the result of running "rbd mirror pool status --verbose nova"
against your DR cluster now? Are the images in up+error now? The ENOENT
errors are most likely related to a parent image that hasn't been
mirrored yet. The ENOMSG error seems to indicate that there might be
some corruption in a journal and that it's missing expected records
(e.g. because a production client crashed), but the daemon should be
able to recover from that.
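
For example (a minimal sketch; the cluster name and log file path below
are assumptions, adjust them to match your deployment):

# per-image states and error descriptions for the pool
rbd --cluster backup mirror pool status --verbose nova

# tally the two error types in the rbd-mirror daemon log
grep -c 'No such file or directory' /var/log/ceph/backup-client.backup.log
grep -c 'No message of desired type' /var/log/ceph/backup-client.backup.log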

> https://pastebin.com/1bTETNGs
>
> Best regards
> /Magnus
>
> Den tis 9 apr. 2019 kl 18:35 skrev Jason Dillaman <jdill...@redhat.com>:
>>
>> Can you pastebin the results from running the following on your backup
>> site rbd-mirror daemon node?
>>
>> ceph --admin-daemon /path/to/asok config set debug_rbd_mirror 15
>> ceph --admin-daemon /path/to/asok rbd mirror restart nova
>> .... wait a minute to let some logs accumulate ...
>> ceph --admin-daemon /path/to/asok config set debug_rbd_mirror 0/5
>>
>> ... and collect the rbd-mirror log from /var/log/ceph/ (it should have
>> lots of "rbd::mirror"-like log entries).
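>>
>> For example, to find the socket and run the sequence (a sketch;
>> /var/run/ceph is only the default socket directory, and the .asok name
>> below is taken from the "rbd mirror status" output you posted):
>>
>> ls /var/run/ceph/*.asok
>> ASOK=/var/run/ceph/client.backup.1936211.backup.94225678548048.asok
>> ceph --admin-daemon $ASOK config set debug_rbd_mirror 15
>> ceph --admin-daemon $ASOK rbd mirror restart nova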
>>
>>
>> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund <mag...@gronlund.se> wrote:
>> >
>> >
>> >
>> > Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman <jdill...@redhat.com>:
>> >>
>> >> Any chance your rbd-mirror daemon has the admin sockets available
>> >> (defaults to /var/run/ceph/cephdr-client.<id>.<pid>.<random>.asok)? If
>> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
>> >
>> >
>> > {
>> >     "pool_replayers": [
>> >         {
>> >             "pool": "glance",
>> >             "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster: 
>> > production client: client.productionbackup",
>> >             "instance_id": "869081",
>> >             "leader_instance_id": "869081",
>> >             "leader": true,
>> >             "instances": [],
>> >             "local_cluster_admin_socket": 
>> > "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
>> >             "remote_cluster_admin_socket": 
>> > "/var/run/ceph/client.productionbackup.1936211.production.94225675210000.asok",
>> >             "sync_throttler": {
>> >                 "max_parallel_syncs": 5,
>> >                 "running_syncs": 0,
>> >                 "waiting_syncs": 0
>> >             },
>> >             "image_replayers": [
>> >                 {
>> >                     "name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
>> >                     "state": "Replaying"
>> >                 },
>> >                 {
>> >                     "name": "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
>> >                     "state": "Replaying"
>> >                 },
>> > -------------------cut----------
>> >                 {
>> >                     "name": 
>> > "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
>> >                     "state": "Replaying"
>> >                 }
>> >             ]
>> >         },
>> >          {
>> >             "pool": "nova",
>> >             "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster: 
>> > production client: client.productionbackup",
>> >             "instance_id": "889074",
>> >             "leader_instance_id": "889074",
>> >             "leader": true,
>> >             "instances": [],
>> >             "local_cluster_admin_socket": 
>> > "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
>> >             "remote_cluster_admin_socket": 
>> > "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
>> >             "sync_throttler": {
>> >                 "max_parallel_syncs": 5,
>> >                 "running_syncs": 0,
>> >                 "waiting_syncs": 0
>> >             },
>> >             "image_replayers": []
>> >         }
>> >     ],
>> >     "image_deleter": {
>> >         "image_deleter_status": {
>> >             "delete_images_queue": [
>> >                 {
>> >                     "local_pool_id": 3,
>> >                     "global_image_id": 
>> > "ff531159-de6f-4324-a022-50c079dedd45"
>> >                 }
>> >             ],
>> >             "failed_deletes_queue": []
>> >         }
>> >>
>> >>
>> >> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund <mag...@gronlund.se> 
>> >> wrote:
>> >> >
>> >> >
>> >> >
>> >> > Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman <jdill...@redhat.com>:
>> >> >>
>> >> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund <mag...@gronlund.se> 
>> >> >> wrote:
>> >> >> >
>> >> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund 
>> >> >> > ><mag...@gronlund.se> wrote:
>> >> >> > >>
>> >> >> > >> Hi,
>> >> >> > >> We have configured one-way replication of pools between a 
>> >> >> > >> production cluster and a backup cluster. But unfortunately the 
>> >> >> > >> rbd-mirror daemon, or the backup cluster itself, is unable to 
>> >> >> > >> keep up with the production cluster, so the replication fails 
>> >> >> > >> to reach the replaying state.
>> >> >> > >
>> >> >> > >Hmm, it's odd that they don't at least reach the replaying state. 
>> >> >> > >Are
>> >> >> > >they still performing the initial sync?
>> >> >> >
>> >> >> > There are three pools we try to mirror (glance, cinder, and nova; 
>> >> >> > no points for guessing what the cluster is used for :) ).
>> >> >> > The glance and cinder pools are smaller and see limited write 
>> >> >> > activity, and their mirroring works. The nova pool, which is the 
>> >> >> > largest and has 90% of the write activity, never leaves the 
>> >> >> > "unknown" state.
>> >> >> >
>> >> >> > # rbd mirror pool status cinder
>> >> >> > health: OK
>> >> >> > images: 892 total
>> >> >> >     890 replaying
>> >> >> >     2 stopped
>> >> >> > #
>> >> >> > # rbd mirror pool status nova
>> >> >> > health: WARNING
>> >> >> > images: 2479 total
>> >> >> >     2479 unknown
>> >> >> > #
>> >> >> > The production cluster has 5k writes/s on average and the backup 
>> >> >> > cluster has 1-2k writes/s on average. The production cluster is 
>> >> >> > bigger and has better specs. I thought that the backup cluster would 
>> >> >> > be able to keep up, but it looks like I was wrong.
>> >> >>
>> >> >> The fact that they are in the unknown state just means that the remote
>> >> >> "rbd-mirror" daemon hasn't started any journal replayers against the
>> >> >> images. If it couldn't keep up, it would still report a status of
>> >> >> "up+replaying". What Ceph release are you running on your backup
>> >> >> cluster?
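>> >> >>
>> >> >> As a quick sanity check that the pool's mirroring mode and peer are
>> >> >> wired up, something like this should work (the image name is a
>> >> >> placeholder):
>> >> >>
>> >> >> rbd mirror pool info nova
>> >> >> rbd mirror image status nova/<image>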
>> >> >>
>> >> > The backup cluster is running Luminous 12.2.11 (the production cluster 
>> >> > 12.2.10)
>> >> >
>> >> >>
>> >> >> > >> And the journals on the rbd volumes keep growing...
>> >> >> > >>
>> >> >> > >> Is it enough to simply disable the mirroring of the pool (rbd 
>> >> >> > >> mirror pool disable <pool>), which would remove the lagging 
>> >> >> > >> reader from the journals and let them shrink, or is there 
>> >> >> > >> anything else that has to be done?
>> >> >> > >
>> >> >> > >You can either disable the journaling feature on the image(s), 
>> >> >> > >since there is no point in leaving it on if you aren't using 
>> >> >> > >mirroring, or run "rbd mirror pool disable <pool>" to purge the 
>> >> >> > >journals.
>> >> >> >
>> >> >> > Thanks for the confirmation.
>> >> >> > I will stop the mirroring of the nova pool and try to figure out 
>> >> >> > if there is anything we can do to get the backup cluster to keep up.
>> >> >> >
>> >> >> > >> Best regards
>> >> >> > >> /Magnus



-- 
Jason