On Wed, 23 Oct 2019 at 3:12 am, <ceph-users-requ...@ceph.io> wrote:

> Send ceph-users mailing list submissions to
>         ceph-users@ceph.io
>
> To subscribe or unsubscribe via email, send a message with subject or
> body 'help' to
>         ceph-users-requ...@ceph.io
>
> You can reach the person managing the list at
>         ceph-users-ow...@ceph.io
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
> Today's Topics:
>
>    1. Re: Replace ceph osd in a container (Sasha Litvak)
>    2. Re: Fwd: large concurrent rbd operations block for over 15 mins!
>       (Mark Nelson)
>    3. Re: rgw multisite failover (Ed Fisher)
>
>
> ----------------------------------------------------------------------
>
> Date: Tue, 22 Oct 2019 08:52:54 -0500
> From: Sasha Litvak <alexander.v.lit...@gmail.com>
> Subject: [ceph-users] Re: Replace ceph osd in a container
> To: Frank Schilder <fr...@dtu.dk>
> Cc: ceph-users <ceph-users@ceph.io>
> Message-ID: <cali_l49rxwcbx_zivrhhwgyg8ea_urh-0ykgmy4+b20khxu...@mail.gmail.com>
>
> Frank,
>
> Thank you for your suggestion.  It sounds very promising.  I will
> definitely try it.
>
> Best,
>
> On Tue, Oct 22, 2019, 2:44 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> > > I am suspecting that mon or mgr have no access to /dev or /var/lib while
> > > osd containers do.
> > > Cluster configured originally by ceph-ansible (nautilus 14.2.2)
> >
> > They don't, because they don't need to.
> >
> > > The question is: if I want to replace all disks on a single node, and I
> > > have 6 nodes with pools with replication 3, is it safe to restart mgr
> > > mounting /dev and /var/lib/ceph volumes (not configured right now)?
> >
> > Restarting mons is safe in the sense that data will not get lost. However,
> > access might get lost temporarily.
> >
> > The question is, how many mons do you have? If you have only 1 or 2, it
> > will mean downtime. If you can bear the downtime, it doesn't matter. If you
> > have at least 3, you can restart one after the other.
> >
> > However, I would not do that. Having to restart a mon container every time
> > some minor container config changes for reasons that have nothing to do
> > with the mon sounds like asking for trouble.
> >
> > I also use containers and would recommend a different approach. I created
> > an additional type of container (ceph-adm) that I use for all admin tasks.
> > It's the same image, and the entry point simply executes "sleep infinity". In
> > this container I make all relevant hardware visible. You might also want to
> > expose /var/run/ceph to be able to use the admin sockets without hassle. This
> > way, I have separated admin operations from the actual storage daemons and
> > can modify and restart the admin container as I like.
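> >
> > For illustration only, such an admin container could be started along these
> > lines (the image name, container name, and exact set of mounts here are
> > examples, not necessarily the setup described above):
> >
> >     podman run -d --name ceph-adm \
> >         --privileged \
> >         -v /dev:/dev \
> >         -v /etc/ceph:/etc/ceph \
> >         -v /var/lib/ceph:/var/lib/ceph \
> >         -v /var/run/ceph:/var/run/ceph \
> >         --entrypoint sleep \
> >         docker.io/ceph/daemon:latest-nautilus infinity
> >
> >     # admin tasks then run inside it, e.g.:
> >     podman exec -it ceph-adm ceph-volume lvm zap /dev/sdh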
> >
> > Best regards,
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Alex
> > Litvak <alexander.v.lit...@gmail.com>
> > Sent: 22 October 2019 08:04
> > To: ceph-us...@lists.ceph.com
> > Subject: [ceph-users] Replace ceph osd in a container
> >
> > Hello cephers,
> >
> > So I am having trouble with new hardware systems showing strange OSD
> > behavior, and I want to replace a disk with a brand new one to test the
> > theory.
> >
> > I run all daemons in containers and on one of the nodes I have mon, mgr,
> > and 6 osds.  So following
> > https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
> >
> > I stopped the container with osd.23, waited until it was down and out, ran a
> > safe-to-destroy loop, and then destroyed the OSD, all using the monitor from
> > the container on this node.  All good.
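> >
> > (For reference, the loop and destroy step from the procedure linked above
> > look roughly like this, with the osd id from my case:)
> >
> >     while ! ceph osd safe-to-destroy osd.23 ; do sleep 10 ; done
> >     ceph osd destroy 23 --yes-i-really-mean-it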
> >
> > Then I swapped the SSDs and started running the additional steps (from step
> > 3) using the same mon container.  I have no ceph packages installed on the
> > bare metal box. It looks like the mon container doesn't see the disk.
> >
> >      podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
> >   stderr: lsblk: /dev/sdh: not a block device
> >   stderr: error: /dev/sdh: No such file or directory
> >   stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> >                             [--osd-fsid OSD_FSID]
> >                             [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
> > Error: exit status 2
> > root@storage2n2-la:~# ls -l /dev/sd
> > sda   sdc   sdd   sde   sdf   sdg   sdg1  sdg2  sdg5  sdh
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume
> > lvm zap sdh
> >   stderr: lsblk: sdh: not a block device
> >   stderr: error: sdh: No such file or directory
> >   stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> >                             [--osd-fsid OSD_FSID]
> >                             [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
> > Error: exit status 2
> >
> > I execute lsblk in the mon container and it does see device sdh:
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-3: failed to get device path
> > NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > sdf      8:80   0   1.8T  0 disk
> > sdd      8:48   0   1.8T  0 disk
> > sdg      8:96   0 223.5G  0 disk
> > |-sdg5   8:101  0   223G  0 part
> > |-sdg1   8:97       487M  0 part
> > `-sdg2   8:98         1K  0 part
> > sde      8:64   0   1.8T  0 disk
> > sdc      8:32   0   3.5T  0 disk
> > sda      8:0    0   3.5T  0 disk
> > sdh      8:112  0   3.5T  0 disk
> >
> > So I used a fellow osd container (osd.5) on the same node and ran all of
> > the operations (zap and prepare) successfully.
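> >
> > (For illustration, the kind of commands that worked from the other OSD
> > container; the container name here is a placeholder, and osd id 23 and
> > /dev/sdh are from my case:)
> >
> >     podman exec -it ceph-osd-5 ceph-volume lvm zap /dev/sdh
> >     podman exec -it ceph-osd-5 ceph-volume lvm prepare --osd-id 23 --data /dev/sdh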
> >
> > I am suspecting that mon or mgr have no access to /dev or /var/lib while
> > osd containers do.  Cluster configured originally by ceph-ansible (nautilus
> > 14.2.2).
> >
> > The question is: if I want to replace all disks on a single node, and I
> > have 6 nodes with pools with replication 3, is it safe to restart mgr
> > mounting /dev and /var/lib/ceph volumes (not configured right now)?
> >
> > I cannot use other osd containers on the same box because my controller
> > reverts from RAID to non-RAID mode with all disks lost, not just a single
> > one.  So I need to replace all 6 OSDs to bring them back up in containers,
> > and the only things that will remain operational on the node are the mon
> > and mgr containers.
> >
> > I prefer not to install a full cluster or client on the bare metal node if
> > possible.
> >
> > Thank you for your help,
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-us...@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 08:59:21 -0500
> From: Mark Nelson <mnel...@redhat.com>
> Subject: [ceph-users] Re: Fwd: large concurrent rbd operations block
>         for over 15 mins!
> To: ceph-users@ceph.io
> Message-ID: <362e3930-8c30-d3e0-d0b0-30187c855...@redhat.com>
>
> Out of curiosity, when you chose EC over replication how did you weigh
> IOPS vs space amplification in your decision making process?  I'm
> wondering if we should prioritize EC latency vs other tasks in future
> tuning efforts (it's always a tradeoff deciding what to focus on).
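>
> (For context on the space side of that tradeoff: replication 3 stores 3.0x
> the user data, while EC profiles such as 4+2 and 8+2 store 1.5x and 1.25x
> respectively, which is usually what motivates EC despite the extra latency.)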
>
>
> Thanks,
>
> Mark
>
> On 10/22/19 2:35 AM, Frank Schilder wrote:
> > Getting decent RBD performance is not a trivial exercise. While at first
> > glance 61 SSDs for 245 clients sounds more or less OK, it does come down to
> > a bit more than that.
> >
> > The first thing is how to get SSD performance out of SSDs with ceph. This
> > post will provide very good clues and might already point out the
> > bottleneck: https://yourcmc.ru/wiki/index.php?title=Ceph_performance . Do
> > you have good enterprise SSDs?
> >
> > The next thing to look at: what kind of data pool, replicated or erasure
> > coded? If erasure coded, has the profile been benchmarked? There are some
> > very poor choices. Good ones are 4+m and 8+m: 4+m gives better IOPs, 8+m
> > better throughput, with m>=2.
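> >
> > (As an illustrative example, not a recommendation for this cluster: a 4+2
> > profile and an RBD data pool on it could be created roughly like this; the
> > profile/pool names and PG count are placeholders:)
> >
> >     ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
> >     ceph osd pool create rbd-ec-data 128 128 erasure ec-4-2
> >     ceph osd pool set rbd-ec-data allow_ec_overwrites true
> >     # RBD images then use it via: rbd create ... --data-pool rbd-ec-data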
> >
> > More complications: do you need to deploy more than one OSD per SSD to
> > boost performance? This is indicated by the iodepth required in an fio
> > benchmark to get full IOPs. Good SSDs already deliver spec performance with
> > 1 OSD. More common ones require 2-4 OSDs per disk. Are you using
> > ceph-volume already? Its default is 2 OSDs per SSD (batch mode).
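> >
> > (A rough sketch of the kind of fio test meant here; it writes directly to
> > the raw device and destroys its data, and the device path is a placeholder:)
> >
> >     fio --name=iops-test --filename=/dev/sdX --direct=1 --ioengine=libaio \
> >         --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 \
> >         --time_based --group_reporting
> >     # If a low iodepth already reaches the drive's rated IOPs, 1 OSD per SSD
> >     # is usually enough; if it takes a much higher iodepth, consider 2-4.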
> >
> > To give a baseline, after extensive testing and working through all the
> > required tuning steps, I could run about 250 VMs on a 6+2 EC data pool on
> > 33 enterprise SAS SSDs with 1 OSD per disk, each VM getting 50 IOPs write
> > performance. This is probably what you would like to see as well.
> >
> > If you use a replicated data pool, this should be relatively easy. With an
> > EC data pool, it is a bit of a battle.
> >
> > Good luck,
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Void Star Nill <void.star.n...@gmail.com>
> > Sent: 22 October 2019 03:00
> > To: ceph-users
> > Subject: [ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!
> >
> > Apparently the graph is too big, so my last post is stuck. Resending
> > without the graph.
> >
> > Thanks
> >
> >
> > ---------- Forwarded message ---------
> > From: Void Star Nill <void.star.n...@gmail.com>
> > Date: Mon, Oct 21, 2019 at 4:41 PM
> > Subject: large concurrent rbd operations block for over 15 mins!
> > To: ceph-users <ceph-us...@lists.ceph.com>
> >
> >
> > Hello,
> >
> > I have been running some benchmark tests with a mid-size cluster and I am
> > seeing some issues. Wanted to know if this is a bug or something that can
> > be tuned. Appreciate any help on this.
> >
> > - I have a 15 node Ceph cluster, with 3 monitors and 12 data nodes with a
> > total of 61 OSDs on SSDs, running 14.2.4 nautilus (stable). Each node has a
> > 100G link.
> > - I have 245 client machines from which I am triggering rbd operations.
> > Each client has a 25G link.
> > - The rbd operations include creating an RBD image of 50G size with the
> > layering feature, mapping the image to the client machine, formatting the
> > device as ext4, mounting it, running dd to write the full disk, and cleaning
> > up (unmount, unmap and remove).
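> >
> > (A rough per-client sketch of that sequence; pool/image names and the mount
> > point are placeholders:)
> >
> >     rbd create bench/img-$(hostname) --size 50G --image-feature layering
> >     dev=$(rbd map bench/img-$(hostname))
> >     mkfs.ext4 "$dev"
> >     mkdir -p /mnt/bench && mount "$dev" /mnt/bench
> >     dd if=/dev/zero of=/mnt/bench/fill bs=1M oflag=direct   # runs until the 50G image is full
> >     umount /mnt/bench
> >     rbd unmap "$dev"
> >     rbd rm bench/img-$(hostname)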
> >
> > If I run these RBD operations concurrently on a small number of machines
> > (say 16-20), they run very well and I see good throughput. All image
> > operations (except for dd) take less than 2 seconds.
> >
> > However, when I scale it up to 245 clients, each running these operations
> > concurrently, I see a lot of operations getting hung for a long time and
> > the overall throughput reduces drastically.
> >
> > For example, some of the format operations take over 10-15 mins!!!
> >
> > Note that all operations do complete - so it's most likely not a deadlock
> > kind of situation.
> >
> > I don't see any errors in ceph.log on the monitor nodes. However, the
> > clients do report "hung_task_timeout" in dmesg logs.
> >
> > As shown in the graph (omitted from this resend), half the format
> > operations complete in less than a second, while the other half take over
> > 10 mins (y axis is in seconds).
> >
> >
> >
> > [11117.113618] INFO: task umount:9902 blocked for more than 120 seconds.
> > [11117.113677]       Tainted: G           OE    4.15.0-51-generic #55~16.04.1-Ubuntu
> > [11117.113787] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [11117.113787] umount          D    0  9902   9901 0x00000000
> > [11117.113793] Call Trace:
> > [11117.113804]  __schedule+0x3d6/0x8b0
> > [11117.113810]  ? _raw_spin_unlock_bh+0x1e/0x20
> > [11117.113814]  schedule+0x36/0x80
> > [11117.113821]  wb_wait_for_completion+0x64/0x90
> > [11117.113828]  ? wait_woken+0x80/0x80
> > [11117.113831]  __writeback_inodes_sb_nr+0x8e/0xb0
> > [11117.113835]  writeback_inodes_sb+0x27/0x30
> > [11117.113840]  __sync_filesystem+0x51/0x60
> > [11117.113844]  sync_filesystem+0x26/0x40
> > [11117.113850]  generic_shutdown_super+0x27/0x120
> > [11117.113854]  kill_block_super+0x2c/0x80
> > [11117.113858]  deactivate_locked_super+0x48/0x80
> > [11117.113862]  deactivate_super+0x5a/0x60
> > [11117.113866]  cleanup_mnt+0x3f/0x80
> > [11117.113868]  __cleanup_mnt+0x12/0x20
> > [11117.113874]  task_work_run+0x8a/0xb0
> > [11117.113881]  exit_to_usermode_loop+0xc4/0xd0
> > [11117.113885]  do_syscall_64+0x100/0x130
> > [11117.113887]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [11117.113891] RIP: 0033:0x7f0094384487
> > [11117.113893] RSP: 002b:00007fff4199efc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > [11117.113897] RAX: 0000000000000000 RBX: 0000000000944030 RCX: 00007f0094384487
> > [11117.113899] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000944210
> > [11117.113900] RBP: 0000000000944210 R08: 0000000000000000 R09: 0000000000000014
> > [11117.113902] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f009488d83c
> > [11117.113903] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fff4199f250
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 11:10:38 -0500
> From: Ed Fisher <e...@debacle.org>
> Subject: [ceph-users] Re: rgw multisite failover
> To: Frank R <frankaritc...@gmail.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Message-ID: <084f4293-88cc-456a-b8a4-2e36aca24...@debacle.org>
>
> > On Oct 18, 2019, at 10:40 PM, Frank R <frankaritc...@gmail.com> wrote:
> >
> > I am looking to change an RGW multisite deployment so that the secondary
> > will become master. This is meant to be a permanent change.
> >
> > Per:
> > https://docs.ceph.com/docs/mimic/radosgw/multisite/
> >
> > I need to:
> >
> > 1. Stop RGW daemons on the current master end.
> >
> > On a secondary RGW node:
> > 2. radosgw-admin zone modify --rgw-zone={zone-name} --master --default
> > 3. radosgw-admin period update --commit
> > 4. systemctl restart ceph-radosgw@rgw.`hostname -s`
> >
> > Since I want the former master to be secondary permanently, do I need to do
> > anything after restarting the RGW daemons on the old master end?
>
>
> Before you restart the RGW daemons on the old master you want to make sure
> you pull the current realm from the new master. Beyond that there should be
> no changes needed.
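>
> (For illustration, pulling the realm on the old master side would look
> roughly like this; the URL and keys are placeholders for the new master
> zone's endpoint and system user credentials:)
>
>     radosgw-admin realm pull --url=http://{new-master-rgw}:8080 \
>         --access-key={system-access-key} --secret={system-secret}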
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 81, Issue 56
> ******************************************
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
