On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman <jdill...@redhat.com> wrote:
>
> On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
> <freyerm...@physik.uni-bonn.de> wrote:
> >
> > Dear Jason,
> >
> > thanks for the very detailed explanation! This was very instructive.
> > Sadly, the watchers look correct - see details inline.
> >
> > Am 13.09.19 um 15:02 schrieb Jason Dillaman:
> > > On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
> > > <freyerm...@physik.uni-bonn.de> wrote:
> > >>
> > >> Dear Jason,
> > >>
> > >> thanks for taking care and developing a patch so quickly!
> > >>
> > >> I have another strange observation to share. In our test setup, only a 
> > >> single RBD mirroring daemon is running for 51 images.
> > >> It works fine with a constant stream of 1-2 MB/s, but at some point 
> > >> after roughly 20 hours, _all_ images go to this interesting state:
> > >> -----------------------------------------
> > >> # rbd mirror image status test-vm.XXXXX-disk2
> > >> test-vm.XXXXX-disk2:
> > >>    global_id:   XXXXXXXXXXXXXXX
> > >>    state:       down+replaying
> > >>    description: replaying, master_position=[object_number=14, tag_tid=6, 
> > >> entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, 
> > >> entry_tid=6338], entries_behind_master=0
> > >>    last_update: 2019-09-13 03:45:43
> > >> -----------------------------------------
> > >> Running this command several times, I see entry_tid increasing at both 
> > >> ends, so mirroring seems to be working just fine.
> > >>
> > >> However:
> > >> -----------------------------------------
> > >> # rbd mirror pool status
> > >> health: WARNING
> > >> images: 51 total
> > >>      51 unknown
> > >> -----------------------------------------
> > >> The health warning is not visible in the dashboard (also not in the 
> > >> mirroring menu); the daemon still seems to be running, has logged 
> > >> nothing, and claims to be "ok" in the dashboard - it's only that all 
> > >> images show up in an unknown state even though everything seems to be 
> > >> working fine.
> > >>
> > >> Any idea on how to debug this?
> > >> When I restart the rbd-mirror service, all images come back as green. I 
> > >> already encountered this twice in 3 days.
> > >
> > > The dashboard relies on the rbd-mirror daemon to provide it errors and
> > > warnings. You can see the status reported by rbd-mirror by running
> > > "ceph service status":
> > >
> > > $ ceph service status
> > > {
> > >      "rbd-mirror": {
> > >          "4152": {
> > >              "status_stamp": "2019-09-13T08:58:41.937491-0400",
> > >              "last_beacon": "2019-09-13T08:58:41.937491-0400",
> > >              "status": {
> > >                  "json":
> > > "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
> > >              }
> > >          }
> > >      }
> > > }
> > >
> > > In your case, rbd-mirror most likely thinks all is good with the world,
> > > so it's not reporting any errors.
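> > > If you want to pull the daemon status out of that blob programmatically,
> > > something along these lines should work (a rough sketch, assuming jq is
> > > available and the output is plain JSON as shown above):
> > >
> > > $ ceph service status | jq -r '.["rbd-mirror"] | to_entries[] | .value.status.json | fromjson'
> > >
> > > That unpacks the escaped inner "json" field so the per-pool image counts
> > > and callouts are readable directly.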
> >
> > This is indeed the case:
> >
> > # ceph service status
> > {
> >      "rbd-mirror": {
> >          "84243": {
> >              "status_stamp": "2019-09-13 15:40:01.149815",
> >              "last_beacon": "2019-09-13 15:40:26.151381",
> >              "status": {
> >                  "json": 
> > "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
> >              }
> >          }
> >      },
> >      "rgw": {
> > ...
> >      }
> > }
> >
> > > The "down" state indicates that the rbd-mirror daemon isn't correctly
> > > watching the "rbd_mirroring" object in the pool. You can see who is
> > > watching that object by running the "rados" "listwatchers" command:
> > >
> > > $ rados -p <pool name> listwatchers rbd_mirroring
> > > watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
> > > watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424
> > >
> > > In my case, the "4154" from "client.4154" is the unique global id for
> > > my connection to the cluster; it relates back to the "ceph service
> > > status" dump, which also shows status by daemon using the unique global
> > > id.
> >
> > Sadly(?), this looks as expected:
> >
> > # rados -p rbd listwatchers rbd_mirroring
> > watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
> > watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560
>
> Hmm, the unique id is different (84243 vs 84247). I wouldn't have
> expected the global id to have changed. Did you restart the Ceph
> cluster or MONs? Do you see any "peer assigned me a different
> global_id" errors in your rbd-mirror logs?
>
> I'll open a tracker ticket to fix the "ceph service status", though,
> since clearly your global id changed but it wasn't noticed by the
> service daemon status updater.

... also, can you please provide the output from the following via a
pastebin link?

# rados -p rbd listomapvals rbd_mirroring
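
(The values in that object are binary-encoded, so it's easiest to just
redirect the full output to a file and paste that - the file name below is
just an example:

# rados -p rbd listomapvals rbd_mirroring > rbd_mirroring_omap.txt

The hexdump-style formatting it produces is fine as-is.)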

> > However, the dashboard still shows those images as "unknown", and this also 
> > shows up via the command line:
> >
> > # rbd mirror pool status
> > health: WARNING
> > images: 51 total
> >      51 unknown
> > # rbd mirror image status test-vm.physik.uni-bonn.de-disk1
> > test-vm.physik.uni-bonn.de-disk2:
> >    global_id:   1a53fafa-37ef-4edf-9633-c2ba3323ed93
> >    state:       down+replaying
> >    description: replaying, master_position=[object_number=18, tag_tid=6, 
> > entry_tid=25202], mirror_position=[object_number=18, tag_tid=6, 
> > entry_tid=25202], entries_behind_master=0
> >    last_update: 2019-09-13 15:55:15
> >
> > Any ideas on what else could cause this?
> >
> > Cheers and thanks,
> >         Oliver
> >
> > >
> > >> Any idea on this (or how I can extract more information)?
> > >> I fear keeping high-level debug logs active for ~24h is not feasible.
> > >>
> > >> Cheers,
> > >>          Oliver
> > >>
> > >>
> > >> On 2019-09-11 19:14, Jason Dillaman wrote:
> > >>> On Wed, Sep 11, 2019 at 12:57 PM Oliver Freyermuth
> > >>> <freyerm...@physik.uni-bonn.de> wrote:
> > >>>>
> > >>>> Dear Jason,
> > >>>>
> > >>>> I played a bit more with rbd mirroring and learned that deleting an 
> > >>>> image at the source (or disabling journaling on it) immediately moves 
> > >>>> the image to the trash at the target -
> > >>>> but setting rbd_mirroring_delete_delay gives some extra grace time to 
> > >>>> catch human mistakes.
> > >>>>
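> > >>>> In case it is useful to others, a minimal sketch of how the delay can 
> > >>>> be set (assuming the centralized config store; the option is read by 
> > >>>> the rbd-mirror daemon on the target cluster, and the value is in 
> > >>>> seconds):
> > >>>>
> > >>>> # ceph config set global rbd_mirroring_delete_delay 86400
> > >>>>
> > >>>> Putting rbd_mirroring_delete_delay into the ceph.conf used by the 
> > >>>> rbd-mirror daemon should work as well.
> > >>>>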
> > >>>> However, as a user I have issues restoring such an image once it has 
> > >>>> been moved to the trash by the rbd-mirror daemon:
> > >>>> -----------------------------------
> > >>>> [root@mon001 ~]# rbd trash ls -la
> > >>>> ID           NAME                             SOURCE    DELETED_AT                STATUS                                    PARENT
> > >>>> d4fbe8f63905 test-vm-XXXXXXXXXXXXXXXXXX-disk2 MIRRORING Wed Sep 11 18:43:14 2019  protected until Thu Sep 12 18:43:14 2019
> > >>>> [root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905
> > >>>> rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 
> > >>>> librbd::api::Trash: restore: Current trash source: mirroring does not 
> > >>>> match expected: user
> > >>>> (22) Invalid argument
> > >>>> -----------------------------------
> > >>>> This is issued on the mon, which has the client.admin key, so it 
> > >>>> should not be a permission issue.
> > >>>> It also fails when I try that in the Dashboard.
> > >>>>
> > >>>> Sadly, the error message is not clear enough for me to figure out what 
> > >>>> could be the problem - do you see what I did wrong?
> > >>>
> > >>> Good catch, it looks like we accidentally broke this in Nautilus when
> > >>> image live-migration support was added. I've opened a new tracker
> > >>> ticket to fix this [1].
> > >>>
> > >>>> Cheers and thanks again,
> > >>>>          Oliver
> > >>>>
> > >>>> On 2019-09-10 23:17, Oliver Freyermuth wrote:
> > >>>>> Dear Jason,
> > >>>>>
> > >>>>> On 2019-09-10 23:04, Jason Dillaman wrote:
> > >>>>>> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
> > >>>>>> <freyerm...@physik.uni-bonn.de> wrote:
> > >>>>>>>
> > >>>>>>> Dear Jason,
> > >>>>>>>
> > >>>>>>> On 2019-09-10 18:50, Jason Dillaman wrote:
> > >>>>>>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
> > >>>>>>>> <freyerm...@physik.uni-bonn.de> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Dear Cephalopodians,
> > >>>>>>>>>
> > >>>>>>>>> I have two questions about RBD mirroring.
> > >>>>>>>>>
> > >>>>>>>>> 1) I can not get it to work - my setup is:
> > >>>>>>>>>
> > >>>>>>>>>       - One cluster holding the live RBD volumes and snapshots, 
> > >>>>>>>>> in pool "rbd", cluster name "ceph",
> > >>>>>>>>>         running latest Mimic.
> > >>>>>>>>>         I ran "rbd mirror pool enable rbd pool" on that cluster 
> > >>>>>>>>> and created a cephx user "rbd_mirror" with (is there a better 
> > >>>>>>>>> way?):
> > >>>>>>>>>         ceph auth get-or-create client.rbd_mirror mon 'allow r' 
> > >>>>>>>>> osd 'allow class-read object_prefix rbd_children, allow pool rbd 
> > >>>>>>>>> r' -o ceph.client.rbd_mirror.keyring --cluster ceph
> > >>>>>>>>>         In that pool, two images have the journaling feature 
> > >>>>>>>>> activated, all others have it disabled still (so I would expect 
> > >>>>>>>>> these two to be mirrored).
> > >>>>>>>>
> > >>>>>>>> You can just use "mon 'profile rbd' osd 'profile rbd'" for the 
> > >>>>>>>> caps --
> > >>>>>>>> but you definitely need more than read-only permissions to the 
> > >>>>>>>> remote
> > >>>>>>>> cluster since it needs to be able to create snapshots of remote 
> > >>>>>>>> images
> > >>>>>>>> and update/trim the image journals.
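> > >>>>>>>> In other words, on the primary cluster something along these lines
> > >>>>>>>> should be enough (a sketch mirroring your command above; the
> > >>>>>>>> keyring file name is just an example):
> > >>>>>>>>
> > >>>>>>>> ceph auth get-or-create client.rbd_mirror \
> > >>>>>>>>   mon 'profile rbd' osd 'profile rbd' \
> > >>>>>>>>   -o ceph.client.rbd_mirror.keyring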
> > >>>>>>>
> > >>>>>>> these profiles really make life a lot easier. I should have thought 
> > >>>>>>> of them rather than "guessing" a potentially good configuration...
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>>       - Another (empty) cluster running latest Nautilus, cluster 
> > >>>>>>>>> name "ceph", pool "rbd".
> > >>>>>>>>>         I've used the dashboard to activate mirroring for the RBD 
> > >>>>>>>>> pool, and then added a peer with cluster name "ceph-virt", 
> > >>>>>>>>> cephx-ID "rbd_mirror", filled in the mons and key created above.
> > >>>>>>>>>         I've then run:
> > >>>>>>>>>         ceph auth get-or-create client.rbd_mirror_backup mon 
> > >>>>>>>>> 'allow r' osd 'allow class-read object_prefix rbd_children, allow 
> > >>>>>>>>> pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
> > >>>>>>>>>         and deployed that key on the rbd-mirror machine, and 
> > >>>>>>>>> started the service with:
> > >>>>>>>>
> > >>>>>>>> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your 
> > >>>>>>>> caps [1].
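> > >>>>>>>> I.e. for the backup cluster, something like (again just a sketch
> > >>>>>>>> based on your command above):
> > >>>>>>>>
> > >>>>>>>> ceph auth get-or-create client.rbd_mirror_backup \
> > >>>>>>>>   mon 'profile rbd-mirror' osd 'profile rbd' \
> > >>>>>>>>   -o client.rbd_mirror_backup.keyring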
> > >>>>>>>
> > >>>>>>> That did the trick (in combination with the above)!
> > >>>>>>> Again a case of PEBKAC: I should have read the documentation until 
> > >>>>>>> the end, clearly my fault.
> > >>>>>>>
> > >>>>>>> It works well now, even though it seems to run a bit slow (~35 MB/s 
> > >>>>>>> for the initial sync when everything is 1 GBit/s),
> > >>>>>>> but that may also be caused by a combination of some very limited 
> > >>>>>>> hardware on the receiving end (which will be scaled up in the 
> > >>>>>>> future).
> > >>>>>>> A single host with 6 disks, replica 3, and a RAID controller which 
> > >>>>>>> can only do RAID0 and not JBOD is certainly not ideal, so commit 
> > >>>>>>> latency may explain the slow bandwidth.
> > >>>>>>
> > >>>>>> You could try increasing "rbd_concurrent_management_ops" from the
> > >>>>>> default of 10 ops to something higher to attempt to account for the
> > >>>>>> latency. However, I wouldn't expect near line-rate speeds w/ RBD mirroring.
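> > >>>>>> E.g. something like this (a sketch, assuming the centralized config
> > >>>>>> store and that your rbd-mirror daemon runs as
> > >>>>>> client.rbd_mirror_backup; 20 is just an illustrative value):
> > >>>>>>
> > >>>>>> ceph config set client.rbd_mirror_backup rbd_concurrent_management_ops 20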
> > >>>>>
> > >>>>> Thanks - I will play with this option once we have more storage 
> > >>>>> available in the target pool ;-).
> > >>>>>
> > >>>>>>
> > >>>>>>>>
> > >>>>>>>>>         systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
> > >>>>>>>>>
> > >>>>>>>>>      After this, everything looks fine:
> > >>>>>>>>>       # rbd mirror pool info
> > >>>>>>>>>         Mode: pool
> > >>>>>>>>>         Peers:
> > >>>>>>>>>          UUID                                 NAME      CLIENT
> > >>>>>>>>>          XXXXXXXXXXX                          ceph-virt 
> > >>>>>>>>> client.rbd_mirror
> > >>>>>>>>>
> > >>>>>>>>>      The service also seems to start fine, but logs show (debug 
> > >>>>>>>>> rbd_mirror=20):
> > >>>>>>>>>
> > >>>>>>>>>      rbd::mirror::ClusterWatcher:0x5575e2a7d390 
> > >>>>>>>>> resolve_peer_config_keys: retrieving config-key: pool_id=2, 
> > >>>>>>>>> pool_name=rbd, peer_uuid=XXXXXXXXXXX
> > >>>>>>>>>      rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: 
> > >>>>>>>>> enter
> > >>>>>>>>>      rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: 
> > >>>>>>>>> restarting failed pool replayer for uuid: XXXXXXXXXXX cluster: 
> > >>>>>>>>> ceph-virt client: client.rbd_mirror
> > >>>>>>>>>      rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying 
> > >>>>>>>>> for uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror
> > >>>>>>>>>      rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error 
> > >>>>>>>>> connecting to remote peer uuid: XXXXXXXXXXX cluster: ceph-virt 
> > >>>>>>>>> client: client.rbd_mirror: (95) Operation not supported
> > >>>>>>>>>      rbd::mirror::ServiceDaemon: 0x5575e29c8d70 
> > >>>>>>>>> add_or_update_callout: pool_id=2, callout_id=2, 
> > >>>>>>>>> callout_level=error, text=unable to connect to remote cluster
> > >>>>>>>>
> > >>>>>>>> If it's still broken after fixing your caps above, perhaps increase
> > >>>>>>>> debugging for "rados", "monc", "auth", and "ms" to see if you can
> > >>>>>>>> determine the source of the op not supported error.
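> > >>>>>>>> For example, something along these lines in the ceph.conf section
> > >>>>>>>> used by your rbd-mirror daemon (a sketch; adjust the levels to taste):
> > >>>>>>>>
> > >>>>>>>> [client.rbd_mirror_backup]
> > >>>>>>>>     debug rados = 20
> > >>>>>>>>     debug monc = 20
> > >>>>>>>>     debug auth = 20
> > >>>>>>>>     debug ms = 1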
> > >>>>>>>>
> > >>>>>>>>> I already tried storing the ceph.client.rbd_mirror.keyring (i.e. 
> > >>>>>>>>> from the cluster with the live images) on the rbd-mirror machine 
> > >>>>>>>>> explicitly (i.e. not only in mon config storage),
> > >>>>>>>>> and after doing that:
> > >>>>>>>>>     rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
> > >>>>>>>>> works fine. So it's not a connectivity issue. Maybe a permission 
> > >>>>>>>>> issue? Or did I miss something?
> > >>>>>>>>>
> > >>>>>>>>> Any idea what "operation not supported" means?
> > >>>>>>>>> It's unclear to me whether things should work well using Mimic 
> > >>>>>>>>> with Nautilus, and whether enabling pool mirroring while only 
> > >>>>>>>>> having journaling on for two images is a supported case.
> > >>>>>>>>
> > >>>>>>>> Yes and yes.
> > >>>>>>>>
> > >>>>>>>>> 2) Since there is a performance drawback (about 2x) for 
> > >>>>>>>>> journaling, is it also possible to only mirror snapshots, and 
> > >>>>>>>>> leave the live volumes alone?
> > >>>>>>>>>       This would cover the common backup use case before deferred 
> > >>>>>>>>> mirroring is implemented (or is it there already?).
> > >>>>>>>>
> > >>>>>>>> This is in-development right now and will hopefully land for the
> > >>>>>>>> Octopus release.
> > >>>>>>>
> > >>>>>>> That would be very cool. Just to clarify: you mean the "real" 
> > >>>>>>> deferred mirroring, not "snapshot only" mirroring?
> > >>>>>>> Is it already clear whether this will require Octopus (or a later 
> > >>>>>>> release) on both ends, or only on the receiving side?
> > >>>>>>
> > >>>>>> I'm not sure what you mean by deferred mirroring. You can delay
> > >>>>>> the replay of the journal via the "rbd_mirroring_replay_delay"
> > >>>>>> configuration option so that your DR site is always at least X
> > >>>>>> seconds behind the primary.
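> > >>>>>> For example, to keep the non-primary images at least ten minutes
> > >>>>>> behind (a sketch, assuming the daemon runs as
> > >>>>>> client.rbd_mirror_backup; the value is in seconds):
> > >>>>>>
> > >>>>>> ceph config set client.rbd_mirror_backup rbd_mirroring_replay_delay 600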
> > >>>>>
> > >>>>> This is indeed what I was thinking of...
> > >>>>>
> > >>>>>> For Octopus we are working on on-demand and
> > >>>>>> scheduled snapshot mirroring between sites -- no journal is involved.
> > >>>>>
> > >>>>> ... and this is what I was dreaming of. We keep snapshots of VMs to 
> > >>>>> be able to roll them back.
> > >>>>> We'd like to also keep those snapshots in a separate Ceph instance as 
> > >>>>> an additional safety-net (in addition to an offline backup of those 
> > >>>>> snapshots with Benji backup).
> > >>>>> It is not (yet) clear to me whether we can pay the "2 x" price for 
> > >>>>> journaling in the long run, so this would be the way to go in case we 
> > >>>>> can't.
> > >>>>>
> > >>>>>>
> > >>>>>>> Since I've got you here, I have two bonus questions.
> > >>>>>>>
> > >>>>>>> 1) Your talk:
> > >>>>>>>      
> > >>>>>>> https://events.static.linuxfound.org/sites/events/files/slides/Disaster%20Recovery%20and%20Ceph%20Block%20Storage-%20Introducing%20Multi-Site%20Mirroring.pdf
> > >>>>>>>      mentions "rbd journal object flush age", which I'd liken to 
> > >>>>>>> something like the "commit" mount option on a classical file 
> > >>>>>>> system - correct?
> > >>>>>>>      I can't find this switch documented anywhere, though - is 
> > >>>>>>> there any experience with it / what's the default?
> > >>>>>>
> > >>>>>> It's a low-level knob that by default causes the journal to flush its
> > >>>>>> pending IO events before it allows the corresponding IO to be issued
> > >>>>>> against the backing image. Setting it to a value greater than zero
> > >>>>>> allows that many seconds of IO events to be batched together into a
> > >>>>>> journal append operation, which is helpful for high-throughput, small
> > >>>>>> IO operations. Of course it turned out that a bug had broken that
> > >>>>>> option a while ago such that events would never batch, so a fix is
> > >>>>>> currently scheduled for backport to all active releases [1] w/ the
> > >>>>>> goal that no one should need to tweak it.
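> > >>>>>> (For completeness, if you did want to experiment with it, it can be
> > >>>>>> set like any other librbd option, e.g. a rough sketch:
> > >>>>>>
> > >>>>>> ceph config set global rbd_journal_object_flush_age 1
> > >>>>>>
> > >>>>>> to batch up to one second of events per journal append - but again,
> > >>>>>> the goal is that nobody should need to touch it once the fix lands.)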
> > >>>>>
> > >>>>> That's even better - since our setup is growing and we will keep 
> > >>>>> upgrading, I'll just keep things as they are now (no manual tweaking) 
> > >>>>> and follow along with development. Thanks!
> > >>>>>
> > >>>>>>
> > >>>>>>> 2) I read I can run more than one rbd-mirror with Mimic/Nautilus. 
> > >>>>>>> Do they load-balance the images, or "only" failover in case one of 
> > >>>>>>> them dies?
> > >>>>>>
> > >>>>>> Starting with Nautilus, the default configuration for rbd-mirror is
> > >>>>>> to evenly divide the number of mirrored images between all running
> > >>>>>> daemons. This does not evenly split the total load, since some images
> > >>>>>> might be hotter than others, but it at least spreads the work.
> > >>>>>
> > >>>>> That's fine enough for our use case. Spreading by "hotness" is a task 
> > >>>>> without a clear answer
> > >>>>> and "temperature" may change quickly, so that's all I hoped for.
> > >>>>>
> > >>>>> Many thanks again for the very helpful explanations!
> > >>>>>
> > >>>>> Cheers,
> > >>>>>        Oliver
> > >>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Cheers and many thanks for the quick and perfect help!
> > >>>>>>>           Oliver
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>> Cheers and thanks in advance,
> > >>>>>>>>>           Oliver
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> [1] 
> > >>>>>>>> https://docs.ceph.com/docs/master/rbd/rbd-mirroring/#rbd-mirror-daemon
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Jason
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>> [1] https://github.com/ceph/ceph/pull/28539
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>> [1] https://tracker.ceph.com/issues/41780
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
>
>
> --
> Jason



-- 
Jason