Re: [ceph-users] ceph-users Digest, Vol 60, Issue 26

2019-05-25 Thread Lazuardi Nasution
Hi Orlando and Haodong,

Is there any response to this thread? I'm interested in this too.

Best regards,

Date: Fri, 26 Jan 2018 21:53:59 +
> From: "Moreno, Orlando" 
> To: "ceph-users@lists.ceph.com" , Ceph
> Development 
> Cc: "Tang, Haodong" 
> Subject: [ceph-users] Ceph OSDs fail to start with RDMA
> Message-ID:
> <034aad465c6cbe4f96d9fb98573a79a63719e...@fmsmsx108.amr.corp.intel.com>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi all,
>
> I am trying to bring up a Ceph cluster where the private network is
> communicating via RoCEv2. The storage nodes have 2 dual-port 25Gb Mellanox
> ConnectX-4 NICs, with each NIC's ports bonded (2x25Gb mode 4). I have set
> memory limits to unlimited, can rping to each node, and have
> ms_async_rdma_device_name set to the ibdev (mlx5_bond_1). Everything goes
> smoothly until I start bringing up OSDs. Nothing appears in stderr, but
> upon further inspection of the OSD log, I see the following error:
>
> RDMAConnectedSocketImpl activate failed to transition to RTR state: (19)
> No such device
> /build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In
> function 'void RDMAConnectedSocketImpl::handle_connection()' thread
> 7f908633c700 time 2018-01-26 10:47:51.607573
> /build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 221:
> FAILED assert(!r)
>
> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x564a2ccf7892]
> 2: (RDMAConnectedSocketImpl::handle_connection()+0xb4a) [0x564a2d007fba]
> 3: (EventCenter::process_events(int, std::chrono::duration std::ratio<1l, 10l> >*)+0xa08) [0x564a2cd9a418]
> 4: (()+0xb4f3a8) [0x564a2cd9e3a8]
> 5: (()+0xb8c80) [0x7f9088c04c80]
> 6: (()+0x76ba) [0x7f90892f36ba]
> 7: (clone()+0x6d) [0x7f908836a41d]
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> Anyone see this before or have any suggestions?
>
> Thanks,
> Orlando
>
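For anyone comparing notes on a similar setup, these are roughly the knobs
involved in running the async messenger over RoCEv2 on Luminous. This is a
sketch only: the device name, GID and unit override are placeholders for your
own environment, not a confirmed fix for the RTR error above.

    # /etc/ceph/ceph.conf
    [global]
    ms_cluster_type = async+rdma                  # RDMA on the cluster network only
    ms_async_rdma_device_name = mlx5_bond_1       # ibdev backing the bonded ports
    ms_async_rdma_local_gid = <RoCEv2 GID of that device, see show_gids>

    # systemd drop-in for ceph-osd@.service (e.g. .../ceph-osd@.service.d/rdma.conf),
    # so the OSDs can pin memory and open /dev/infiniband
    [Service]
    LimitMEMLOCK=infinity
    PrivateDevices=no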
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-25 Thread Paul Emmerich
On Sat, May 25, 2019 at 7:45 PM Paul Emmerich 
wrote:

>
>
> On Fri, May 24, 2019 at 5:22 PM Kevin Flöh  wrote:
>
>> ok this just gives me:
>>
>> error getting xattr ec31/10004dfce92./parent: (2) No such file or
>> directory
>>
> Try to run it on the replicated main data pool, which contains an empty
> object for each file; I'm not sure where the xattr is stored in a multi-pool
> setup.
>

Also, you probably didn't lose all the chunks of the erasure coded data.
Check the list_missing output to see which chunks are still there and where
they are.
You can export the chunks that you still have using ceph-objectstore-tool.
The first 3 chunks will be the data of the object, so you might be able to
tell whether that file is important to you.
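Roughly like this, as a sketch only: the OSD id, the shard suffix on the pgid
and the object name all come from your list_missing output (which shows shard 3
on OSD 36 and shard 2 on OSD 61), and the OSD has to be stopped while
ceph-objectstore-tool runs against it:

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --pgid 1.24cs3 '<object name from list_missing>' get-bytes /tmp/shard3.bin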


Paul


>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
>> Does this mean that the lost object isn't even a file that appears in the
>> ceph directory? Maybe it is a leftover of a file that has not been deleted
>> properly? It wouldn't be an issue to mark the object as lost in that case.
>> On 24.05.19 5:08 p.m., Robert LeBlanc wrote:
>>
>> You need to use the first stripe of the object as that is the only one
>> with the metadata.
>>
>> Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device, please excuse any typos.
>>
>> On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:
>>
>>> Hi,
>>>
>>> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent",
>>> but this just hangs forever when we query unfound objects. It
>>> works fine for all other objects.
>>>
>>> We also tried scanning the ceph directory with find -inum 1099593404050
>>> (the decimal of 10004dfce92) and found nothing. This also works for
>>> non-unfound objects.
>>>
>>> Is there another way to find the corresponding file?
>>> On 24.05.19 11:12 a.m., Burkhard Linke wrote:
>>>
>>> Hi,
>>> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>>>
>>> We got the object ids of the missing objects with ceph pg 1.24c
>>> list_missing:
>>>
>>> {
>>>     "offset": {
>>>         "oid": "",
>>>         "key": "",
>>>         "snapid": 0,
>>>         "hash": 0,
>>>         "max": 0,
>>>         "pool": -9223372036854775808,
>>>         "namespace": ""
>>>     },
>>>     "num_missing": 1,
>>>     "num_unfound": 1,
>>>     "objects": [
>>>         {
>>>             "oid": {
>>>                 "oid": "10004dfce92.003d",
>>>                 "key": "",
>>>                 "snapid": -2,
>>>                 "hash": 90219084,
>>>                 "max": 0,
>>>                 "pool": 1,
>>>                 "namespace": ""
>>>             },
>>>             "need": "46950'195355",
>>>             "have": "0'0",
>>>             "flags": "none",
>>>             "locations": [
>>>                 "36(3)",
>>>                 "61(2)"
>>>             ]
>>>         }
>>>     ],
>>>     "more": false
>>> }
>>>
>>> we want to give up those objects with:
>>>
>>> ceph pg 1.24c mark_unfound_lost revert
>>>
>>> But first we would like to know which file(s) are affected. Is there a way
>>> to map the object id to the corresponding file?
>>>
>>>
>>> The object name is composed of the file inode id and the chunk within
>>> the file. The first chunk has some metadata you can use to retrieve the
>>> filename. See the 'CephFS object mapping' thread on the mailing list for
>>> more information.
>>>
>>>
>>> Regards,
>>>
>>> Burkhard
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-25 Thread Paul Emmerich
On Fri, May 24, 2019 at 5:22 PM Kevin Flöh  wrote:

> ok this just gives me:
>
> error getting xattr ec31/10004dfce92./parent: (2) No such file or
> directory
>
Try to run it on the replicated main data pool, which contains an empty
object for each file; I'm not sure where the xattr is stored in a multi-pool
setup.
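Something along these lines, i.e. the approach from the 'CephFS object mapping'
thread applied to the primary data pool (the pool name is a placeholder; the
first stripe of an inode is <inode>.00000000):

    rados -p cephfs_data getxattr 10004dfce92.00000000 parent > /tmp/parent
    ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json

The decoded backtrace lists the ancestor dentries, which gives you the path of
the affected file.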



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> Does this mean that the lost object isn't even a file that appears in the
> ceph directory? Maybe it is a leftover of a file that has not been deleted
> properly? It wouldn't be an issue to mark the object as lost in that case.
> On 24.05.19 5:08 p.m., Robert LeBlanc wrote:
>
> You need to use the first stripe of the object as that is the only one
> with the metadata.
>
> Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>
> Robert LeBlanc
>
> Sent from a mobile device, please excuse any typos.
>
> On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:
>
>> Hi,
>>
>> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent", but
>> this just hangs forever when we query unfound objects. It
>> works fine for all other objects.
>>
>> We also tried scanning the ceph directory with find -inum 1099593404050
>> (the decimal of 10004dfce92) and found nothing. This also works for
>> non-unfound objects.
>>
>> Is there another way to find the corresponding file?
>> On 24.05.19 11:12 a.m., Burkhard Linke wrote:
>>
>> Hi,
>> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>>
>> We got the object ids of the missing objects with ceph pg 1.24c
>> list_missing:
>>
>> {
>>     "offset": {
>>         "oid": "",
>>         "key": "",
>>         "snapid": 0,
>>         "hash": 0,
>>         "max": 0,
>>         "pool": -9223372036854775808,
>>         "namespace": ""
>>     },
>>     "num_missing": 1,
>>     "num_unfound": 1,
>>     "objects": [
>>         {
>>             "oid": {
>>                 "oid": "10004dfce92.003d",
>>                 "key": "",
>>                 "snapid": -2,
>>                 "hash": 90219084,
>>                 "max": 0,
>>                 "pool": 1,
>>                 "namespace": ""
>>             },
>>             "need": "46950'195355",
>>             "have": "0'0",
>>             "flags": "none",
>>             "locations": [
>>                 "36(3)",
>>                 "61(2)"
>>             ]
>>         }
>>     ],
>>     "more": false
>> }
>>
>> we want to give up those objects with:
>>
>> ceph pg 1.24c mark_unfound_lost revert
>>
>> But first we would like to know which file(s) are affected. Is there a way to
>> map the object id to the corresponding file?
>>
>>
>> The object name is composed of the file inode id and the chunk within the
>> file. The first chunk has some metadata you can use to retrieve the
>> filename. See the 'CephFS object mapping' thread on the mailing list for
>> more information.
>>
>>
>> Regards,
>>
>> Burkhard
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-25 Thread Marc Roos
 
Maybe my data can be useful to compare with? I have the Samsung SM863.

This[0] is what I get from fio directly on the SSD, and from an rbd SSD
pool with 3x replication[1].
I have also included a comparison with cephfs[3]; it would be nice if
there were some sort of manual page describing the generally expected
Ceph overhead.
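The exact job file is not part of this post; a minimal sketch that produces
this kind of 4k QD1 run could look like this (device, pool and image names are
placeholders, and the sequential jobs from the other group ids are omitted):

    [global]
    ioengine=libaio
    direct=1
    bs=4k
    iodepth=1
    runtime=180
    time_based

    [randwrite-4k-seq]
    stonewall
    rw=randwrite
    filename=/dev/sdX    # the raw SM863 for the direct numbers (destroys data)

    [randread-4k-seq]
    stonewall
    rw=randread
    filename=/dev/sdX

    # for the rbd numbers, equivalent jobs can use ioengine=rbd with
    # pool=<ssd pool> and rbdname=<test image>; for cephfs, point
    # filename at a file on a cephfs mount instead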


[0] direct
randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=522903: Thu Sep  6 21:04:12 2018
  write: IOPS=17.9k, BW=69.8MiB/s (73.2MB/s)(12.3GiB/180001msec)
    slat (usec): min=4, max=333, avg= 9.94, stdev= 5.00
    clat (nsec): min=1141, max=1131.2k, avg=42560.69, stdev=9074.14
     lat (usec): min=35, max=1137, avg=52.80, stdev= 9.42
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   35], 10.00th=[   35], 20.00th=[   35],
     | 30.00th=[   36], 40.00th=[   36], 50.00th=[   41], 60.00th=[   43],
     | 70.00th=[   49], 80.00th=[   54], 90.00th=[   57], 95.00th=[   58],
     | 99.00th=[   60], 99.50th=[   62], 99.90th=[   67], 99.95th=[   70],
     | 99.99th=[  174]
   bw (  KiB/s): min=34338, max=92268, per=84.26%, avg=60268.13, stdev=12283.36, samples=359
   iops: min= 8584, max=23067, avg=15066.67, stdev=3070.87, samples=359
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=71.73%, 100=28.24%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%
  cpu  : usr=12.96%, sys=26.87%, ctx=3218988, majf=0, minf=10962
  IO depths: 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,3218724,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
randread-4k-seq: (groupid=3, jobs=1): err= 0: pid=523297: Thu Sep  6 21:04:12 2018
   read: IOPS=10.2k, BW=39.7MiB/s (41.6MB/s)(7146MiB/180001msec)
    slat (usec): min=4, max=328, avg=15.39, stdev= 8.62
    clat (nsec): min=1600, max=948792, avg=78946.53, stdev=36246.91
     lat (usec): min=39, max=969, avg=94.75, stdev=37.43
    clat percentiles (usec):
     |  1.00th=[   38],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
     | 30.00th=[   41], 40.00th=[   52], 50.00th=[   70], 60.00th=[  110],
     | 70.00th=[  112], 80.00th=[  115], 90.00th=[  125], 95.00th=[  127],
     | 99.00th=[  133], 99.50th=[  135], 99.90th=[  141], 99.95th=[  147],
     | 99.99th=[  243]
   bw (  KiB/s): min=19918, max=49336, per=84.40%, avg=34308.52, stdev=6891.67, samples=359
   iops: min= 4979, max=12334, avg=8576.75, stdev=1722.92, samples=359
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=38.06%, 100=19.88%
  lat (usec)   : 250=42.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu  : usr=8.07%, sys=21.59%, ctx=1829588, majf=0, minf=10954
  IO depths: 1=116.7%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=1829296,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

[1] rbd ssd 3x
randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=1448032: Fri May 24 19:41:48 2019
  write: IOPS=655, BW=2620KiB/s (2683kB/s)(461MiB/180001msec)
    slat (usec): min=7, max=120, avg=10.79, stdev= 6.22
    clat (usec): min=897, max=77251, avg=1512.76, stdev=368.36
     lat (usec): min=906, max=77262, avg=1523.77, stdev=368.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1205], 10.00th=[ 1254], 20.00th=[ 1319],
     | 30.00th=[ 1369], 40.00th=[ 1418], 50.00th=[ 1483], 60.00th=[ 1532],
     | 70.00th=[ 1598], 80.00th=[ 1663], 90.00th=[ 1778], 95.00th=[ 1893],
     | 99.00th=[ 2540], 99.50th=[ 2933], 99.90th=[ 3392], 99.95th=[ 4080],
     | 99.99th=[ 6194]
   bw (  KiB/s): min= 1543, max= 2830, per=79.66%, avg=2087.02, stdev=396.14, samples=359
   iops: min=  385, max=  707, avg=521.39, stdev=99.06, samples=359
  lat (usec)   : 1000=0.06%
  lat (msec)   : 2=97.19%, 4=2.70%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu  : usr=0.39%, sys=1.13%, ctx=118477, majf=0, minf=50
  IO depths: 1=116.6%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,117905,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
randread-4k-seq: (groupid=3, jobs=1): err= 0: pid=1450173: Fri May 24 19:41:48 2019
   read: IOPS=1812, BW=7251KiB/s (7425kB/s)(1275MiB/180001msec)
    slat (usec): min=6, max=161, avg=10.25, stdev= 6.37
    clat (usec): min=182, max=23748, avg=538.35, stdev=136.71
     lat (usec): min=189, max=23758, avg=548.86, stdev=137.19
    clat percen

Re: [ceph-users] performance in a small cluster

2019-05-25 Thread Marc Schöchlin
Hello Robert,

The following tool probably provides deeper insight into what is happening on
your OSDs:

https://github.com/scoopex/ceph/blob/master/src/tools/histogram_dump.py
https://github.com/ceph/ceph/pull/28244
https://user-images.githubusercontent.com/288876/58368661-410afa00-7ef0-11e9-9aca-b09d974024a7.png
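The script visualizes the latency histograms the OSDs already expose; the raw
data behind it can also be pulled straight from the admin socket on the OSD
host (osd.0 is a placeholder):

    ceph daemon osd.0 perf histogram dump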

Monitoring virtual machine/client behavior in a comparable way would also be a 
good thing.

@All: Do you know suitable tools?

  * kernel rbd
  * rbd-nbd
  * linux native (i.e. if you want to analyze from inside a KVM or Xen VM)

(the output of "iostat -N -d -x -t -m 10" does not seem to be enough for
detailed analysis)

Regards
Marc

On 24.05.19 at 13:22, Robert Sander wrote:
> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4 SSD-OSDs 
> each.
> Connected with 10G the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB objects 
> but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little bit disappointing as the 4K performance is also seen in KVM 
> VMs using RBD.
>
> Is there anything we can do to improve performance with small objects / block 
> sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run: 30.218930
> Total writes made:  3391
> Write size: 4194304
> Object size:    4194304
> Bandwidth (MB/sec): 448.858
> Stddev Bandwidth:   63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:   112
> Stddev IOPS:    15
> Max IOPS:   138
> Min IOPS:   80
> Average Latency(s): 0.142475
> Stddev Latency(s):  0.0990132
> Max latency(s): 0.814715
> Min latency(s): 0.0308732
>
> 4MB objects rand read:
>
> Total time run:   30.169312
> Total reads made: 7223
> Read size:    4194304
> Object size:  4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS: 239
> Stddev IOPS:  23
> Max IOPS: 272
> Min IOPS: 175
> Average Latency(s):   0.0653696
> Max latency(s):   0.517275
> Min latency(s):   0.00201978
>
> 4K objects write:
>
> Total time run: 30.002628
> Total writes made:  165404
> Write size: 4096
> Object size:    4096
> Bandwidth (MB/sec): 21.5351
> Stddev Bandwidth:   2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:   5512
> Stddev IOPS:    526
> Max IOPS:   5753
> Min IOPS:   2829
> Average Latency(s): 0.00290095
> Stddev Latency(s):  0.0015036
> Max latency(s): 0.0778454
> Min latency(s): 0.00174262
>
> 4K objects read:
>
> Total time run:   30.000538
> Total reads made: 1064610
> Read size:    4096
> Object size:  4096
> Bandwidth (MB/sec):   138.619
> Average IOPS: 35486
> Stddev IOPS:  3776
> Max IOPS: 42208
> Min IOPS: 26264
> Average Latency(s):   0.000443905
> Max latency(s):   0.0123462
> Min latency(s):   0.000123081
>
>
> Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com