Re: [ceph-users] ceph-users Digest, Vol 60, Issue 26
Hi Orlando and Haodong,

Is there any response to this thread? I'm interested in this too.

Best regards,

> Date: Fri, 26 Jan 2018 21:53:59 +0000
> From: "Moreno, Orlando"
> To: "ceph-users@lists.ceph.com", Ceph Development
> Cc: "Tang, Haodong"
> Subject: [ceph-users] Ceph OSDs fail to start with RDMA
> Message-ID: <034aad465c6cbe4f96d9fb98573a79a63719e...@fmsmsx108.amr.corp.intel.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi all,
>
> I am trying to bring up a Ceph cluster where the private network
> communicates via RoCEv2. The storage nodes have 2 dual-port 25Gb Mellanox
> ConnectX-4 NICs, with each NIC's ports bonded (2x25Gb, bonding mode 4). I
> have set memory limits to unlimited, can rping to each node, and have
> ms_async_rdma_device_name set to the ibdev (mlx5_bond_1). Everything goes
> smoothly until I start bringing up OSDs. Nothing appears in stderr, but on
> closer inspection of the OSD log I see the following error:
>
> RDMAConnectedSocketImpl activate failed to transition to RTR state: (19) No such device
> /build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void RDMAConnectedSocketImpl::handle_connection()' thread 7f908633c700 time 2018-01-26 10:47:51.607573
> /build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 221: FAILED assert(!r)
>
> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x564a2ccf7892]
> 2: (RDMAConnectedSocketImpl::handle_connection()+0xb4a) [0x564a2d007fba]
> 3: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 10l> >*)+0xa08) [0x564a2cd9a418]
> 4: (()+0xb4f3a8) [0x564a2cd9e3a8]
> 5: (()+0xb8c80) [0x7f9088c04c80]
> 6: (()+0x76ba) [0x7f90892f36ba]
> 7: (clone()+0x6d) [0x7f908836a41d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>`, is needed to interpret this.
>
> Has anyone seen this before, or have any suggestions?
>
> Thanks,
> Orlando
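The thread records no resolution. For anyone retracing Orlando's setup, below is a minimal sketch of the two configuration steps he mentions: unlimited memory limits for the OSD daemons and pointing the async messenger at the RDMA device. The option names follow the Luminous-era documentation, but the values are illustrative and this is not a verified fix for the RTR-state error.

  # Sketch: RDMA needs pinned memory, so lift the memlock cap for the OSD
  # daemons via a systemd drop-in (the "memory limits to unlimited" step).
  mkdir -p /etc/systemd/system/ceph-osd@.service.d
  printf '[Service]\nLimitMEMLOCK=infinity\n' \
      > /etc/systemd/system/ceph-osd@.service.d/rdma.conf
  systemctl daemon-reload

  # Illustrative ceph.conf fragment selecting the RDMA-capable async
  # messenger on the cluster (private) network, with this thread's device:
  printf '%s\n' '[global]' \
      'ms_cluster_type = async+rdma' \
      'ms_async_rdma_device_name = mlx5_bond_1' >> /etc/ceph/ceph.conf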
Re: [ceph-users] Major ceph disaster
On Sat, May 25, 2019 at 7:45 PM Paul Emmerich wrote:
>
> On Fri, May 24, 2019 at 5:22 PM Kevin Flöh wrote:
>> ok, this just gives me:
>>
>> error getting xattr ec31/10004dfce92.00000000/parent: (2) No such file or directory
>
> Try to run it on the replicated main data pool, which contains an empty
> object for each file; I am not sure where the xattr is stored in a
> multi-pool setup.
>
> Also, you probably didn't lose all the chunks of the erasure-coded data.
> Check the list_missing output to see which chunks are still there and
> where they are. You can export the chunks that you still have using
> ceph-objectstore-tool. The first 3 chunks will be the data of the object,
> so you might be able to tell whether that file is important to you.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>> Does this mean that the lost object isn't even a file that appears in
>> the ceph directory? Maybe it is a leftover of a file that was not
>> deleted properly? It wouldn't be an issue to mark the object as lost in
>> that case.
>>
>> On 24.05.19 5:08 pm, Robert LeBlanc wrote:
>>
>> You need to use the first stripe of the object, as that is the only one
>> with the metadata.
>>
>> Try "rados -p ec31 getxattr 10004dfce92.00000000 parent" instead.
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device, please excuse any typos.
>>
>> On Fri, May 24, 2019, 4:42 AM Kevin Flöh wrote:
>>
>>> Hi,
>>>
>>> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent",
>>> but this just hangs forever for unfound objects; it works fine for all
>>> other objects.
>>>
>>> We also tried scanning the ceph directory with find -inum 1099593404050
>>> (the decimal of 10004dfce92) and found nothing. This also works for
>>> objects that are not unfound.
>>>
>>> Is there another way to find the corresponding file?
>>>
>>> On 24.05.19 11:12 am, Burkhard Linke wrote:
>>>
>>> Hi,
>>>
>>> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>>>
>>> We got the object ids of the missing objects with ceph pg 1.24c list_missing:
>>>
>>> {
>>>     "offset": {
>>>         "oid": "",
>>>         "key": "",
>>>         "snapid": 0,
>>>         "hash": 0,
>>>         "max": 0,
>>>         "pool": -9223372036854775808,
>>>         "namespace": ""
>>>     },
>>>     "num_missing": 1,
>>>     "num_unfound": 1,
>>>     "objects": [
>>>         {
>>>             "oid": {
>>>                 "oid": "10004dfce92.003d",
>>>                 "key": "",
>>>                 "snapid": -2,
>>>                 "hash": 90219084,
>>>                 "max": 0,
>>>                 "pool": 1,
>>>                 "namespace": ""
>>>             },
>>>             "need": "46950'195355",
>>>             "have": "0'0",
>>>             "flags": "none",
>>>             "locations": [
>>>                 "36(3)",
>>>                 "61(2)"
>>>             ]
>>>         }
>>>     ],
>>>     "more": false
>>> }
>>>
>>> We want to give up on those objects with:
>>>
>>> ceph pg 1.24c mark_unfound_lost revert
>>>
>>> But first we would like to know which file(s) are affected. Is there a
>>> way to map the object id to the corresponding file?
>>>
>>>
>>> The object name is composed of the file inode id and the chunk number
>>> within the file. The first chunk carries some metadata you can use to
>>> retrieve the filename. See the 'CephFS object mapping' thread on the
>>> mailing list for more information.
>>>
>>> Regards,
>>>
>>> Burkhard
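To make Paul's export suggestion concrete, here is a rough sketch. The OSD id and shard come from the "locations" field in the list_missing output above ("36(3)" meaning OSD 36 holds shard 3); the data path is the usual default, and the OSD must be stopped while ceph-objectstore-tool runs. Paths and file names are illustrative.

  # Sketch: export the surviving chunk held by OSD 36 (shard 3 of pg 1.24c).
  systemctl stop ceph-osd@36

  # Dump the object's bytes from that shard to a local file:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
      --pgid 1.24cs3 '10004dfce92.003d' get-bytes chunk-shard3.bin

  systemctl start ceph-osd@36

Repeating this for each surviving shard (per Paul, the first 3 shards hold the object's data) should be enough to inspect what the object contained.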
Re: [ceph-users] Major ceph disaster
On Fri, May 24, 2019 at 5:22 PM Kevin Flöh wrote:
> ok, this just gives me:
>
> error getting xattr ec31/10004dfce92.00000000/parent: (2) No such file or directory
>
> Does this mean that the lost object isn't even a file that appears in the
> ceph directory? Maybe it is a leftover of a file that was not deleted
> properly? It wouldn't be an issue to mark the object as lost in that case.

Try to run it on the replicated main data pool, which contains an empty
object for each file; I am not sure where the xattr is stored in a
multi-pool setup.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

(The rest of the quoted history — Robert LeBlanc's first-stripe hint, Kevin's
earlier attempts, Burkhard Linke's explanation and the list_missing output —
is identical to the message above.)
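Pulling the pieces of this thread together, the sketch below shows both lookup routes discussed: reading the 'parent' xattr from the first stripe, and searching the mounted filesystem by inode number. The pool name is this thread's; the mountpoint is a placeholder. The parent xattr is a binary backtrace, so it is piped through strings here just to eyeball the path components.

  # Sketch: map a RADOS object name back to a CephFS file.
  OID="10004dfce92.003d"
  INO=${OID%%.*}            # object name = <inode id in hex>.<stripe number>

  # The backtrace metadata lives only in the 'parent' xattr of stripe 0:
  rados -p ec31 getxattr "${INO}.00000000" parent | strings

  # Alternatively, search the mounted filesystem by inode (hex -> decimal):
  find /mnt/cephfs -inum "$((16#${INO}))"

In this thread both routes came up empty for the damaged object, which is what prompted the question of whether the file still exists at all.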
Re: [ceph-users] performance in a small cluster
Maybe my data is useful as a comparison? I have the Samsung SM863. Below is what I get from fio directly on the SSD [0] and from an RBD image on an SSD pool with 3x replication [1]. I also included a comparison with CephFS [3]; it would be nice if there were some sort of manual page describing the Ceph overhead one should generally expect.

[0] direct

randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=522903: Thu Sep 6 21:04:12 2018
  write: IOPS=17.9k, BW=69.8MiB/s (73.2MB/s)(12.3GiB/180001msec)
    slat (usec): min=4, max=333, avg= 9.94, stdev= 5.00
    clat (nsec): min=1141, max=1131.2k, avg=42560.69, stdev=9074.14
     lat (usec): min=35, max=1137, avg=52.80, stdev= 9.42
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   35], 10.00th=[   35], 20.00th=[   35],
     | 30.00th=[   36], 40.00th=[   36], 50.00th=[   41], 60.00th=[   43],
     | 70.00th=[   49], 80.00th=[   54], 90.00th=[   57], 95.00th=[   58],
     | 99.00th=[   60], 99.50th=[   62], 99.90th=[   67], 99.95th=[   70],
     | 99.99th=[  174]
   bw (KiB/s): min=34338, max=92268, per=84.26%, avg=60268.13, stdev=12283.36, samples=359
   iops      : min= 8584, max=23067, avg=15066.67, stdev=3070.87, samples=359
  lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=71.73%, 100=28.24%
  lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%
  lat (msec) : 2=0.01%
  cpu        : usr=12.96%, sys=26.87%, ctx=3218988, majf=0, minf=10962
  IO depths  : 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,3218724,0, short=0,0,0, dropped=0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

randread-4k-seq: (groupid=3, jobs=1): err= 0: pid=523297: Thu Sep 6 21:04:12 2018
  read: IOPS=10.2k, BW=39.7MiB/s (41.6MB/s)(7146MiB/180001msec)
    slat (usec): min=4, max=328, avg=15.39, stdev= 8.62
    clat (nsec): min=1600, max=948792, avg=78946.53, stdev=36246.91
     lat (usec): min=39, max=969, avg=94.75, stdev=37.43
    clat percentiles (usec):
     |  1.00th=[   38],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
     | 30.00th=[   41], 40.00th=[   52], 50.00th=[   70], 60.00th=[  110],
     | 70.00th=[  112], 80.00th=[  115], 90.00th=[  125], 95.00th=[  127],
     | 99.00th=[  133], 99.50th=[  135], 99.90th=[  141], 99.95th=[  147],
     | 99.99th=[  243]
   bw (KiB/s): min=19918, max=49336, per=84.40%, avg=34308.52, stdev=6891.67, samples=359
   iops      : min= 4979, max=12334, avg=8576.75, stdev=1722.92, samples=359
  lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=38.06%, 100=19.88%
  lat (usec) : 250=42.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu        : usr=8.07%, sys=21.59%, ctx=1829588, majf=0, minf=10954
  IO depths  : 1=116.7%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=1829296,0,0, short=0,0,0, dropped=0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

[1] rbd ssd 3x

randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=1448032: Fri May 24 19:41:48 2019
  write: IOPS=655, BW=2620KiB/s (2683kB/s)(461MiB/180001msec)
    slat (usec): min=7, max=120, avg=10.79, stdev= 6.22
    clat (usec): min=897, max=77251, avg=1512.76, stdev=368.36
     lat (usec): min=906, max=77262, avg=1523.77, stdev=368.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1205], 10.00th=[ 1254], 20.00th=[ 1319],
     | 30.00th=[ 1369], 40.00th=[ 1418], 50.00th=[ 1483], 60.00th=[ 1532],
     | 70.00th=[ 1598], 80.00th=[ 1663], 90.00th=[ 1778], 95.00th=[ 1893],
     | 99.00th=[ 2540], 99.50th=[ 2933], 99.90th=[ 3392], 99.95th=[ 4080],
     | 99.99th=[ 6194]
   bw (KiB/s): min= 1543, max= 2830, per=79.66%, avg=2087.02, stdev=396.14, samples=359
   iops      : min=  385, max=  707, avg=521.39, stdev=99.06, samples=359
  lat (usec) : 1000=0.06%
  lat (msec) : 2=97.19%, 4=2.70%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec) : 100=0.01%
  cpu        : usr=0.39%, sys=1.13%, ctx=118477, majf=0, minf=50
  IO depths  : 1=116.6%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,117905,0, short=0,0,0, dropped=0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

randread-4k-seq: (groupid=3, jobs=1): err= 0: pid=1450173: Fri May 24 19:41:48 2019
  read: IOPS=1812, BW=7251KiB/s (7425kB/s)(1275MiB/180001msec)
    slat (usec): min=6, max=161, avg=10.25, stdev= 6.37
    clat (usec): min=182, max=23748, avg=538.35, stdev=136.71
     lat (usec): min=189, max=23758, avg=548.86, stdev=137.19
    clat percentiles (usec):
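Not part of the original mail, but for anyone who wants to reproduce these numbers: the parameters printed in the output (4k blocks, queue depth 1, 180-second time-based runs) suggest invocations along the lines below. Device, pool and image names are placeholders.

  # Raw-device baseline, matching the "[0] direct" jobs (bs=4k, iodepth=1):
  fio --name=randwrite-4k-seq --ioengine=libaio --direct=1 \
      --filename=/dev/sdX --rw=randwrite --bs=4k --iodepth=1 \
      --time_based --runtime=180

  # The same job against an image in the 3x-replicated SSD pool, via
  # fio's rbd engine:
  fio --name=randwrite-4k-seq --ioengine=rbd --direct=1 \
      --pool=ssd --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=1 \
      --time_based --runtime=180

Comparing the two runs gives the per-IO overhead Ceph adds on this hardware: roughly 53 us average latency for a single-threaded 4k write on the raw SSD versus roughly 1.5 ms through RBD.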
Re: [ceph-users] performance in a small cluster
Hello Robert,

perhaps the following tool provides deeper insight into what is happening on your OSDs:

https://github.com/scoopex/ceph/blob/master/src/tools/histogram_dump.py
https://github.com/ceph/ceph/pull/28244
https://user-images.githubusercontent.com/288876/58368661-410afa00-7ef0-11e9-9aca-b09d974024a7.png

Monitoring virtual machine/client behavior in a comparable way would also be a good thing.

@All: Do you know suitable tools for the following?

* kernel rbd
* rbd-nbd
* linux native (i.e. if you want to analyze from inside a KVM or Xen VM)

(the output of "iostat -N -d -x -t -m 10" does not seem detailed enough for this kind of analysis)

Regards
Marc

On 24.05.19 at 13:22, Robert Sander wrote:
> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4 SSD OSDs each.
> Connected with 10G, the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB objects,
> but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little disappointing, as the 4K performance is also what we see in
> KVM VMs using RBD.
>
> Is there anything we can do to improve performance with small objects / block sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run:         30.218930
> Total writes made:      3391
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     448.858
> Stddev Bandwidth:       63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:           112
> Stddev IOPS:            15
> Max IOPS:               138
> Min IOPS:               80
> Average Latency(s):     0.142475
> Stddev Latency(s):      0.0990132
> Max latency(s):         0.814715
> Min latency(s):         0.0308732
>
> 4MB objects rand read:
>
> Total time run:       30.169312
> Total reads made:     7223
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS:         239
> Stddev IOPS:          23
> Max IOPS:             272
> Min IOPS:             175
> Average Latency(s):   0.0653696
> Max latency(s):       0.517275
> Min latency(s):       0.00201978
>
> 4K objects write:
>
> Total time run:         30.002628
> Total writes made:      165404
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     21.5351
> Stddev Bandwidth:       2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:           5512
> Stddev IOPS:            526
> Max IOPS:               5753
> Min IOPS:               2829
> Average Latency(s):     0.00290095
> Stddev Latency(s):      0.0015036
> Max latency(s):         0.0778454
> Min latency(s):         0.00174262
>
> 4K objects read:
>
> Total time run:       30.000538
> Total reads made:     1064610
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   138.619
> Average IOPS:         35486
> Stddev IOPS:          3776
> Max IOPS:             42208
> Min IOPS:             26264
> Average Latency(s):   0.000443905
> Max latency(s):       0.0123462
> Min latency(s):       0.000123081
>
> Regards
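For reference, Robert's four result blocks correspond to rados bench invocations along these lines (the pool name is a placeholder, and the concurrency is a guess; rados bench defaults to 16 concurrent operations, which the output does not show):

  # 4MB objects: 30 s write, then random read against the objects kept
  # around by --no-cleanup:
  rados bench -p rbd 30 write -b 4M --no-cleanup
  rados bench -p rbd 30 rand

  # 4KB objects: the same pattern with a small block size:
  rados bench -p rbd 30 write -b 4K --no-cleanup
  rados bench -p rbd 30 rand

  # Remove the benchmark objects afterwards:
  rados -p rbd cleanup

As a sanity check on the numbers: 21.5 MB/s of 4K writes is about 5500 IOPS, and with 16 in-flight operations that works out to roughly 3 ms per replicated 4K write (matching the reported 0.0029 s average latency), the same order of magnitude as the fio-over-RBD latencies posted elsewhere in this thread.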