Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread Marc Roos
 

Try comparing results from something like this test


[global]
# POSIX AIO against a file on the mounted filesystem (here a CephFS mount)
ioengine=posixaio
invalidate=1
# 30s warm-up excluded from the results, then 180s measured per job
ramp_time=30
iodepth=1
runtime=180
time_based
# O_DIRECT, bypassing the page cache
direct=1
filename=/mnt/cephfs/ssd/fio-bench.img

[write-4k-seq]
stonewall
bs=4k
rw=write
#write_bw_log=sdx-4k-write-seq.results
#write_iops_log=sdx-4k-write-seq.results

[randwrite-4k-seq]
stonewall
bs=4k
rw=randwrite
#write_bw_log=sdx-4k-randwrite-seq.results
#write_iops_log=sdx-4k-randwrite-seq.results

[read-4k-seq]
stonewall
bs=4k
rw=read
#write_bw_log=sdx-4k-read-seq.results
#write_iops_log=sdx-4k-read-seq.results

[randread-4k-seq]
stonewall
bs=4k
rw=randread
#write_bw_log=sdx-4k-randread-seq.results
#write_iops_log=sdx-4k-randread-seq.results

[rw-4k-seq]
stonewall
bs=4k
rw=rw
#write_bw_log=sdx-4k-rw-seq.results
#write_iops_log=sdx-4k-rw-seq.results

[randrw-4k-seq]
stonewall
bs=4k
rw=randrw
#write_bw_log=sdx-4k-randrw-seq.results
#write_iops_log=sdx-4k-randrw-seq.results

[write-128k-seq]
stonewall
bs=128k
rw=write
#write_bw_log=sdx-128k-write-seq.results
#write_iops_log=sdx-128k-write-seq.results

[randwrite-128k-seq]
stonewall
bs=128k
rw=randwrite
#write_bw_log=sdx-128k-randwrite-seq.results
#write_iops_log=sdx-128k-randwrite-seq.results

[read-128k-seq]
stonewall
bs=128k
rw=read
#write_bw_log=sdx-128k-read-seq.results
#write_iops_log=sdx-128k-read-seq.results

[randread-128k-seq]
stonewall
bs=128k
rw=randread
#write_bw_log=sdx-128k-randread-seq.results
#write_iops_log=sdx-128k-randread-seq.results

[rw-128k-seq]
stonewall
bs=128k
rw=rw
#write_bw_log=sdx-128k-rw-seq.results
#write_iops_log=sdx-128k-rw-seq.results

[randrw-128k-seq]
stonewall
bs=128k
rw=randrw
#write_bw_log=sdx-128k-randrw-seq.results
#write_iops_log=sdx-128k-randrw-seq.results

[write-1024k-seq]
stonewall
bs=1024k
rw=write
#write_bw_log=sdx-1024k-write-seq.results
#write_iops_log=sdx-1024k-write-seq.results

[randwrite-1024k-seq]
stonewall
bs=1024k
rw=randwrite
#write_bw_log=sdx-1024k-randwrite-seq.results
#write_iops_log=sdx-1024k-randwrite-seq.results

[read-1024k-seq]
stonewall
bs=1024k
rw=read
#write_bw_log=sdx-1024k-read-seq.results
#write_iops_log=sdx-1024k-read-seq.results

[randread-1024k-seq]
stonewall
bs=1024k
rw=randread
#write_bw_log=sdx-1024k-randread-seq.results
#write_iops_log=sdx-1024k-randread-seq.results

[rw-1024k-seq]
stonewall
bs=1024k
rw=rw
#write_bw_log=sdx-1024k-rw-seq.results
#write_iops_log=sdx-1024k-rw-seq.results

[randrw-1024k-seq]
stonewall
bs=1024k
rw=randrw
#write_bw_log=sdx-1024k-randrw-seq.results
#write_iops_log=sdx-1024k-randrw-seq.results

[write-4096k-seq]
stonewall
bs=4096k
rw=write
#write_bw_log=sdx-4096k-write-seq.results
#write_iops_log=sdx-4096k-write-seq.results

[randwrite-4096k-seq]
stonewall
bs=4096k
rw=randwrite
#write_bw_log=sdx-4096k-randwrite-seq.results
#write_iops_log=sdx-4096k-randwrite-seq.results

[read-4096k-seq]
stonewall
bs=4096k
rw=read
#write_bw_log=sdx-4096k-read-seq.results
#write_iops_log=sdx-4096k-read-seq.results

[randread-4096k-seq]
stonewall
bs=4096k
rw=randread
#write_bw_log=sdx-4096k-randread-seq.results
#write_iops_log=sdx-4096k-randread-seq.results

[rw-4096k-seq]
stonewall
bs=4096k
rw=rw
#write_bw_log=sdx-4096k-rw-seq.results
#write_iops_log=sdx-4096k-rw-seq.results

[randrw-4096k-seq]
stonewall
bs=4096k
rw=randrw
#write_bw_log=sdx-4096k-randrw-seq.results
#write_iops_log=sdx-4096k-randrw-seq.results



-Original Message-
From: dave.c...@dell.com [mailto:dave.c...@dell.com] 
Sent: woensdag 14 november 2018 5:21
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Benchmark performance when using SSD as the 
journal

Hi all,

 

We want to compare the performance of an HDD partition as the journal 
(inline on the OSD disk) against an SSD partition as the journal. Here is what we 
did: we have 3 nodes used as Ceph OSD hosts, each with 3 OSDs on it. 
First, we created the OSDs with the journal on a partition of the OSD disk and ran 
the "rados bench" utility to test the performance; then we migrated the 
journal from the HDD to an SSD (Intel S4500) and ran "rados bench" again. The 
expected result was that the SSD journal would be much better than the HDD, but 
the results show nearly no change.

 

The configuration of Ceph is as below,

pool size: 3

osd size: 3*3

pg (pgp) num: 300

osd nodes are separated across three different nodes

rbd image size: 10G (10240M)

 

The utility I used is,

rados bench -p rbd $duration write

rados bench -p rbd $duration seq

rados bench -p rbd $duration rand

 

Is there anything wrong from what I did?  Could anyone give me some 
suggestion?

 

 

Best Regards,

Dave Chen

 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread Dave.Chen
Thanks Mokhtar! This is what I am looking for, thanks for your explanation!


Best Regards,
Dave Chen

From: Maged Mokhtar 
Sent: Wednesday, November 14, 2018 3:36 PM
To: Chen2, Dave; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Benchmark performance when using SSD as the journal



Hi Dave,

The SSD journal will help boost IOPS and latency, which will be more apparent at 
small block sizes. The rados bench default block size is 4M; use the -b 
option to specify the size. Try 4k, 32k, 64k, and so on.
As a side note, this is a RADOS-level test, so the rbd image size is not relevant 
here.
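
For the block-size comparison, something along these lines works; the pool name, the 60-second duration and the 16 threads are just placeholders, and --no-cleanup keeps the written objects around so the seq/rand phases have data to read:

rados bench -p rbd 60 write -b 4096 -t 16 --no-cleanup
rados bench -p rbd 60 seq -t 16
rados bench -p rbd 60 rand -t 16
rados -p rbd cleanup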

Maged.
On 14/11/18 06:21, dave.c...@dell.com wrote:
Hi all,

We want to compare the performance of an HDD partition as the journal (inline 
on the OSD disk) against an SSD partition as the journal. Here is what we did: we 
have 3 nodes used as Ceph OSD hosts, each with 3 OSDs on it. First, we created the 
OSDs with the journal on a partition of the OSD disk and ran the "rados bench" 
utility to test the performance; then we migrated the journal from the HDD to an 
SSD (Intel S4500) and ran "rados bench" again. The expected result was that the SSD 
journal would be much better than the HDD, but the results show nearly no change.

The configuration of Ceph is as below,
pool size: 3
osd size: 3*3
pg (pgp) num: 300
osd nodes are separated across three different nodes
rbd image size: 10G (10240M)

The utility I used is,
rados bench -p rbd $duration write
rados bench -p rbd $duration seq
rados bench -p rbd $duration rand

Is there anything wrong from what I did?  Could anyone give me some suggestion?


Best Regards,
Dave Chen





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread Dave.Chen
Hi Roos,

I will try with the configuration, thank you very much!

Best Regards,
Dave Chen

-Original Message-
From: Marc Roos  
Sent: Wednesday, November 14, 2018 4:37 PM
To: ceph-users; Chen2, Dave
Subject: RE: [ceph-users] Benchmark performance when using SSD as the journal


___

Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread vitalif

Hi Dave,

The main line in the SSD spec sheet you should look at is:

Enhanced Power Loss Data Protection: Yes

This makes the SSD's cache effectively non-volatile, so the SSD can safely ignore 
fsync()s and transactional performance becomes equal to non-transactional. So your 
SSDs should be OK for the journal.


rados bench is a poor tool for this kind of testing because of its 4M default 
block size and the very small number of objects it creates. It is better to test 
with fio -ioengine=rbd -bs=4k -rw=randwrite, using -sync=1 -iodepth=1 for 
latency or -iodepth=128 for maximum random load.
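
A rough sketch of such an invocation; the pool and image names are placeholders for whatever exists in your cluster:

# latency-oriented: queue depth 1, synchronous writes
fio --name=lat-test --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
    --direct=1 --bs=4k --rw=randwrite --sync=1 --iodepth=1 --runtime=60 --time_based

# maximum random load: queue depth 128
fio --name=iops-test --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test \
    --direct=1 --bs=4k --rw=randwrite --iodepth=128 --runtime=60 --time_based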


Another thing I recently discovered is that turning off the write 
cache on all drives (for i in /dev/sd*; do hdparm -W 0 $i; done) 
increased write IOPS by an order of magnitude.



Hi all,

We want to compare the performance of an HDD partition as the
journal (inline on the OSD disk) against an SSD partition as the journal. Here
is what we did: we have 3 nodes used as Ceph OSD hosts, each with 3
OSDs on it. First, we created the OSDs with the journal on a partition of the
OSD disk and ran the "rados bench" utility to test the performance; then we
migrated the journal from the HDD to an SSD (Intel S4500) and ran "rados
bench" again. The expected result was that the SSD journal would be much
better than the HDD, but the results show nearly no change.

The configuration of Ceph is as below,

pool size: 3

osd size: 3*3

pg (pgp) num: 300

osd nodes are separated across three different nodes

rbd image size: 10G (10240M)

The utility I used is,

rados bench -p rbd $duration write

rados bench -p rbd $duration seq

rados bench -p rbd $duration rand

Is there anything wrong from what I did?  Could anyone give me some
suggestion?

Best Regards,

Dave Chen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph luminous custom plugin

2018-11-14 Thread Amit Ghadge
Hi,
I copied my custom module into /usr/lib64/ceph/mgr and ran "ceph mgr module
enable <module> --force" to enable the plugin. The plugin is enabled and it
prints some messages, but nothing shows up in the ceph-mgr log file.


Thanks,
Amit G
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph luminous custom plugin

2018-11-14 Thread Amit Ghadge
On Wed, Nov 14, 2018 at 5:11 PM Amit Ghadge  wrote:

> Hi,
> I copied my custom module in /usr/lib64/ceph/mgr and run "ceph mgr module
> enable  --force" to enable plugin. It's plug and print some
> message in plugin but it's not print any log in ceph-mgr log file.
>
>
> Thanks,
> Amit G
>

Yes, it started working; I needed to restart the ceph-mgr service.
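
For reference, roughly the sequence that ends up working, with "mymodule" standing in for the real module name and the mgr path as it is on my system:

cp -r mymodule /usr/lib64/ceph/mgr/
ceph mgr module enable mymodule --force
systemctl restart ceph-mgr.target
ceph mgr module ls                      # module should appear under enabled_modules
tail -f /var/log/ceph/ceph-mgr.*.log    # plugin log output lands here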

Thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unhelpful behaviour of ceph-volume lvm batch with >1 NVME card for block.db

2018-11-14 Thread Matthew Vernon
Hi,

We currently deploy our filestore OSDs with ceph-disk (via
ceph-ansible), and I was looking at using ceph-volume as we migrate to
bluestore.

Our servers have 60 OSDs and 2 NVME cards; each OSD is made up of a
single hdd, and an NVME partition for journal.

If, however, I do:
ceph-volume lvm batch /dev/sda /dev/sdb [...] /dev/nvme0n1 /dev/nvme1n1
then I get (inter alia):

Solid State VG:
  Targets:   block.db  Total size: 1.82 TB
  Total LVs: 2 Size per LV: 931.51 GB

  Devices:   /dev/nvme0n1, /dev/nvme1n1

i.e. ceph-volume is going to make a single VG containing both NVME
devices, and split that up into LVs to use for block.db

It seems to me that this is straightforwardly the wrong answer - either
NVME failing will now take out *every* OSD on the host, whereas the
obvious alternative (one VG per NVME, divide those into LVs) would give
you just as good performance, but you'd only lose 1/2 the OSDs if an
NVME card failed.

Am I missing something obvious here?

I appreciate I /could/ do it all myself, but even using ceph-ansible
that's going to be very tiresome...

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Placement Groups undersized after adding OSDs

2018-11-14 Thread Wido den Hollander
Hi,

I'm in the middle of expanding a Ceph cluster and while having 'ceph -s'
open I suddenly saw a bunch of Placement Groups go undersized.

My first hint was that one or more OSDs have failed, but none did.

So I checked and I saw these Placement Groups undersized:

11.3b54 active+undersized+degraded+remapped+backfill_wait
[1795,639,1422]   1795   [1795,639]   1795
11.362f active+undersized+degraded+remapped+backfill_wait
[1431,1134,2217]   1431  [1134,1468]   1134
11.3e31 active+undersized+degraded+remapped+backfill_wait
[1451,1391,1906]   1451  [1906,2053]   1906
11.50c  active+undersized+degraded+remapped+backfill_wait
[1867,1455,1348]   1867  [1867,2036]   1867
11.421e   active+undersized+degraded+remapped+backfilling
[280,117,1421]280[280,117]280
11.700  active+undersized+degraded+remapped+backfill_wait
[2212,1422,2087]   2212  [2055,2087]   2055
11.735active+undersized+degraded+remapped+backfilling
[772,1832,1433]772   [772,1832]772
11.d5a  active+undersized+degraded+remapped+backfill_wait
[423,1709,1441]423   [423,1709]423
11.a95  active+undersized+degraded+remapped+backfill_wait
[1433,1180,978]   1433   [978,1180]978
11.a67  active+undersized+degraded+remapped+backfill_wait
[1154,1463,2151]   1154  [1154,2151]   1154
11.10ca active+undersized+degraded+remapped+backfill_wait
[2012,486,1457]   2012   [2012,486]   2012
11.2439 active+undersized+degraded+remapped+backfill_wait
[910,1457,1193]910   [910,1193]910
11.2f7e active+undersized+degraded+remapped+backfill_wait
[1423,1356,2098]   1423  [1356,2098]   1356

After searching I found that OSDs
1422,1431,1451,1455,1421,1422,1433,1441,1433,1463,1457,1457 and 1423 are
all running on the same (newly) added host.

I checked:
- The host did not reboot
- The OSDs did not restart

The OSDs are up_thru since map 646724 which is from 11:05 this morning
(4,5 hours ago), which is about the same time when these were added.

So these PGs are currently running on *2* replicas while they should be
running on *3*.

We just added 8 nodes with 24 disks each to the cluster, but none of the
existing OSDs were touched.

When looking at PG 11.3b54 I see that 1422 is a backfill target:

$ ceph pg 11.3b54 query|jq '.recovery_state'

The 'enter time' for this is about 30 minutes ago and that's about the
same time this has happened.

'might_have_unfound' tells me OSD 1982 which is in the same rack as 1422
(CRUSH replicates over racks), but that OSD is also online.

It's up_thru = 647122 and that's from about 30 minutes ago. That
ceph-osd process is however running since September and seems to be
functioning fine.

This confuses me as during such an expansion I know that normally a PG
would map to size+1 until the backfill finishes.

The cluster is running Luminous 12.2.8 on CentOS 7.5.

Any ideas on what this could be?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unhelpful behaviour of ceph-volume lvm batch with >1 NVME card for block.db

2018-11-14 Thread Alfredo Deza
On Wed, Nov 14, 2018 at 9:10 AM Matthew Vernon  wrote:
>
> Hi,
>
> We currently deploy our filestore OSDs with ceph-disk (via
> ceph-ansible), and I was looking at using ceph-volume as we migrate to
> bluestore.
>
> Our servers have 60 OSDs and 2 NVME cards; each OSD is made up of a
> single hdd, and an NVME partition for journal.
>
> If, however, I do:
> ceph-volume lvm batch /dev/sda /dev/sdb [...] /dev/nvme0n1 /dev/nvme1n1
> then I get (inter alia):
>
> Solid State VG:
>   Targets:   block.db  Total size: 1.82 TB
>   Total LVs: 2 Size per LV: 931.51 GB
>
>   Devices:   /dev/nvme0n1, /dev/nvme1n1
>
> i.e. ceph-volume is going to make a single VG containing both NVME
> devices, and split that up into LVs to use for block.db
>
> It seems to me that this is straightforwardly the wrong answer - either
> NVME failing will now take out *every* OSD on the host, whereas the
> obvious alternative (one VG per NVME, divide those into LVs) would give
> you just as good performance, but you'd only lose 1/2 the OSDs if an
> NVME card failed.
>
> Am I missing something obvious here?

This is exactly the intended behavior. The `lvm batch` sub-command is
meant to simplify LV management, and by doing so, it has to adhere to
some constraints.

These constraints (making a single VG out of both NVMe devices) make
the implementation far easier and more robust, and allow us to
accommodate a lot of different scenarios, but I do see how this
might be unexpected.

>
> I appreciate I /could/ do it all myself, but even using ceph-ansible
> that's going to be very tiresome...
>

Right, so you are able to chop the devices up in any way you find more
acceptable (creating LVs and then passing them to `lvm create`)
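
A rough sketch of that manual route, with arbitrary VG/LV names and an arbitrary 60G db size, one VG per NVMe and one block.db LV per OSD:

vgcreate ceph-db-0 /dev/nvme0n1
lvcreate -L 60G -n db-sda ceph-db-0
ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db-0/db-sda
# ...repeat per HDD, and use a second VG (ceph-db-1) on /dev/nvme1n1 for the other half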

There is a bit of wiggle room here though, you could deploy half of it
first which would force `lvm batch` to use just one NVMe:

ceph-volume lvm batch /dev/sda [...] /dev/nvme0n1

And then the rest of devices

ceph-volume lvm batch /dev/sdb [...] /dev/nvme1n1

> Regards,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph mgr Prometheus plugin: error when osd is down

2018-11-14 Thread Gökhan Kocak
Hello everyone,

we encountered an error with the Prometheus plugin for Ceph mgr:
One osd was down and (therefore) it had no class:
```
sudo ceph osd tree
ID  CLASS WEIGHT    TYPE NAME  STATUS REWEIGHT PRI-AFF
 28   hdd   7.27539 osd.28 up  1.0 1.0
  6   0 osd.6    down    0 1.0

```

When we tried to curl the metrics, there was an error because the osd
had no class (see below "KeyError: 'class' ").

Anybody experience the same?

Isn't this a bug in the Prometheus plugin? When an OSD is down, the 
plugin should not stop working, in my opinion.

```
~> curl -v 127.0.0.1:9283/metrics
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9283 (#0)
> GET /metrics HTTP/1.1
> Host: 127.0.0.1:9283
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Wed, 14 Nov 2018 13:59:59 GMT
< Content-Length: 1663
< Content-Type: text/html;charset=utf-8
< Server: CherryPy/3.5.0
<
    500 Internal Server Error
    The server encountered an unexpected condition which
prevented it from fulfilling the request.
    Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line
61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
414, in metrics
    metrics = global_instance().collect()
  File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
351, in collect
    self.get_metadata_and_osd_status()
  File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
310, in get_metadata_and_osd_status
    dev_class['class'],
KeyError: 'class'

    
  

* Connection #0 to host 127.0.0.1 left intact
```

Kind regards,

Gökhan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New open-source foundation

2018-11-14 Thread Mike Perez
Hi Eric,

Please take a look at the new Foundation site's FAQ for answers to
these questions:

https://ceph.com/foundation/
On Tue, Nov 13, 2018 at 11:51 AM Smith, Eric  wrote:
>
> https://techcrunch.com/2018/11/12/the-ceph-storage-project-gets-a-dedicated-open-source-foundation/
>
>
>
> What does this mean for:
>
> Governance
> Development
> Community
>
>
>
> Forgive me if I’ve missed the discussion previously on this list.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to repair active+clean+inconsistent?

2018-11-14 Thread K.C. Wong
So, I’ve issued the deep-scrub command (and the repair command)
and nothing seems to happen.
Unrelated to this issue, I have to take down some OSD to prepare
a host for RMA. One of them happens to be in the replication
group for this PG. So, a scrub happened indirectly. I now have
this from “ceph -s”:

cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
 health HEALTH_ERR
1 pgs inconsistent
18446 scrub errors
 monmap e1: 3 mons at 
{mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
  fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
 osdmap e40248: 120 osds: 119 up, 119 in
flags sortbitwise,require_jewel_osds
  pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
59473 GB used, 287 TB / 345 TB avail
3120 active+clean
  15 active+clean+scrubbing+deep
   1 active+clean+inconsistent

That’s a lot of scrub errors:

HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
pg 1.65 is active+clean+inconsistent, acting [62,67,33]
18446 scrub errors

Now, “rados list-inconsistent-obj 1.65” returns a *very* long JSON
output. Here’s a very small snippet, the errors look the same across:

{
  "object":{
    "name":"10ea8bb.0045",
    "nspace":"",
    "locator":"",
    "snap":"head",
    "version":59538
  },
  "errors":["attr_name_mismatch"],
  "union_shard_errors":["oi_attr_missing"],
  "selected_object_info":"1:a70dc1cc:::10ea8bb.0045:head(2897'59538 
client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd 
f437a612 od  alloc_hint [0 0])",
  "shards":[
    {
      "osd":33,
      "errors":[],
      "size":4194304,
      "omap_digest":"0x",
      "data_digest":"0xf437a612",
      "attrs":[
        {"name":"_",
         "value":"EAgNAQAABAM1AA...",
         "Base64":true},
        {"name":"snapset",
         "value":"AgIZAQ...",
         "Base64":true}
      ]
    },
    {
      "osd":62,
      "errors":[],
      "size":4194304,
      "omap_digest":"0x",
      "data_digest":"0xf437a612",
      "attrs":[
        {"name":"_",
         "value":"EAgNAQAABAM1AA...",
         "Base64":true},
        {"name":"snapset",
         "value":"AgIZAQ...",
         "Base64":true}
      ]
    },
    {
      "osd":67,
      "errors":["oi_attr_missing"],
      "size":4194304,
      "omap_digest":"0x",
      "data_digest":"0xf437a612",
      "attrs":[]
    }
  ]
}

Clearly, on osd.67, the “attrs” array is empty. The question is,
how do I fix this?

Many thanks in advance,

-kc

K.C. Wong
kcw...@verseon.com 
M: +1 (408) 769-8235

-
Confidentiality Notice:
This message contains confidential information. If you are not the
intended recipient and received this message in error, any use or
distribution is strictly prohibited. Please also notify us
immediately by return e-mail, and delete this message from your
computer system. Thank you.
-
4096R/B8995EDE  E527 CBE8 023E 79EA 8BBB  5C77 23A6 92E9 B899 5EDE
hkps://hkps.pool.sks-keyservers.net

> On Nov 11, 2018, at 10:58 PM, Brad Hubbard  wrote:
> 
> On Mon, Nov 12, 2018 at 4:21 PM Ashley Merrick  > wrote:
>> 
>> Your need to run "ceph pg deep-scrub 1.65" first
> 
> Right, thanks Ashley. That's what the "Note that you may have to do a
> deep scrub to populate the output." part of my answer meant but
> perhaps I needed to go further?
> 
> The system has a record of a scrub error on a previous scan but
> subsequent activity in the cluster has invalidated the specifics. You
> need to run another scrub to get the specific information for this pg
> at this point in time (the information does not remain valid
> indefinitely and therefore may need to be renewed depending on
> circumstances).
> 
>> 
>> On Mon, Nov 12, 2018 at 2:20 PM K.C. Wong  wrote:
>>> 
>>> Hi Brad,
>>> 
>>> I got the following:
>>> 
>>> [root@mgmt01 ~]# ceph health detail
>>> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>>> pg 1.65 is active+clean+inconsistent, acting [62,67,47]
>>> 1 scrub errors
>>> [root@mgmt01 ~]# rados list-inconsistent-obj 1.65
>>> No scrub information available for pg 1.65
>>> error 2: (2) No such file or directory
>>> [root@mgmt01 ~]# rados list-inconsistent-snapset 1.65
>>> No scrub information available for pg 1.65
>>> error 2: (2) No such file or directory
>>> 
>>> Rather odd output, I’d say; not that I understand what
>>> that means. I also tried ceph list-inconsistent-pg:
>>> 
>>> [root@mgmt01 ~]# rados lspools
>>> rbd
>>> cephfs_data
>>> cephfs_metadata
>>> .rgw.root
>>> default.rgw.control
>>> default.rgw.data.root
>>> default.rgw.gc
>>> default.rgw.log
>>> ctrl-p
>>> prod
>>> c

[ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Vladimir Brik

Hello

I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and 
4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400 
PGs each (a lot more pools use SSDs than HDDs). Servers are fairly 
powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.


The impression I got from the docs is that having more than 200 PGs per 
OSD is not a good thing, but justifications were vague (no concrete 
numbers), like increased peering time, increased resource consumption, 
and possibly decreased recovery performance. None of these appeared to 
be a significant problem in my testing, but the tests were very basic 
and done on a pretty empty cluster under minimal load, so I worry I'll 
run into trouble down the road.


Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly 
better if I went through the trouble of re-creating pools so that no OSD 
would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues due 
to having too many PGs?


Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mgr Prometheus plugin: error when osd is down

2018-11-14 Thread John Spray
On Wed, Nov 14, 2018 at 3:32 PM Gökhan Kocak
 wrote:
>
> Hello everyone,
>
> we encountered an error with the Prometheus plugin for Ceph mgr:
> One osd was down and (therefore) it had no class:
> ```
> sudo ceph osd tree
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  28   hdd   7.27539 osd.28 up  1.0 1.0
>   6   0 osd.6down0 1.0
>
> ```
>
> When we tried to curl the metrics, there was an error because the osd
> had no class (see below "KeyError: 'class' ").

I suspect you're running an old release?  This bug
(https://tracker.ceph.com/issues/23300) was fixed in 12.2.5.
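
To confirm what you are running, and as a possible stopgap until the upgrade (assuming manually re-tagging the down OSD is acceptable in your setup), something like:

ceph versions
ceph osd crush set-device-class hdd osd.6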

John

> Anybody experience the same?
>
> Isn't this an error on the Prometheus plugin's behalf? When an osd is down, 
> the plugin should not stop working imo.
>
> ```
> ~> curl -v 127.0.0.1:9283/metrics
> *   Trying 127.0.0.1...
> * Connected to 127.0.0.1 (127.0.0.1) port 9283 (#0)
> > GET /metrics HTTP/1.1
> > Host: 127.0.0.1:9283
> > User-Agent: curl/7.47.0
> > Accept: */*
> >
> < HTTP/1.1 500 Internal Server Error
> < Date: Wed, 14 Nov 2018 13:59:59 GMT
> < Content-Length: 1663
> < Content-Type: text/html;charset=utf-8
> < Server: CherryPy/3.5.0
> <
> 500 Internal Server Error
> The server encountered an unexpected condition which
> prevented it from fulfilling the request.
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
> 670, in respond
> response.body = self.handler()
>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
> 217, in __call__
> self.body = self.oldhandler(*args, **kwargs)
>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line
> 61, in __call__
> return self.callable(*self.args, **self.kwargs)
>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
> 414, in metrics
> metrics = global_instance().collect()
>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
> 351, in collect
> self.get_metadata_and_osd_status()
>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
> 310, in get_metadata_and_osd_status
> dev_class['class'],
> KeyError: 'class'
> 
> 
>   
>   
> 
> 
> 
> * Connection #0 to host 127.0.0.1 left intact
> ```
>
> Kind regards,
>
> Gökhan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark performance when using SSD as the journal

2018-11-14 Thread Joe Comeau
Hi Dave
 
Have you looked at the Intel P4600 vs. the P4500?
 
I believe the P4600 has better random write performance and a higher
drive-writes-per-day rating.
 
Thanks Joe

>>>  11/13/2018 8:45 PM >>>

Thanks Merrick!
 
I checked the Intel spec [1]; the performance Intel quotes is:
 
·  Sequential Read (up to) 500 MB/s 
·  Sequential Write (up to) 330 MB/s 
·  Random Read (100% Span) 72000 IOPS 
·  Random Write (100% Span) 2 IOPS
 
I think these figures should be much better than a typical HDD, and since I
ran the read and write commands with "rados bench" separately, there
should be some difference.
 
And is there any kind of configuration that could give us a
performance gain with this SSD (Intel S4500)?
 
[1]
https://ark.intel.com/products/120521/Intel-SSD-DC-S4500-Series-480GB-2-5in-SATA-6Gb-s-3D1-TLC-
 
Best Regards,
Dave Chen
 
From: Ashley Merrick  
Sent: Wednesday, November 14, 2018 12:30 PM
To: Chen2, Dave
Cc: ceph-users
Subject: Re: [ceph-users] Benchmark performance when using SSD as the
journal
 


Only certain SSDs are good for Ceph journals, as can be seen at
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The SSD you're using isn't listed, but from a quick search online it
appears to be an SSD designed for read workloads as an "upgrade" from an HDD,
so it probably is not designed for the high write requirements a journal
demands.

Therefore, when it's being hit by the workload of 3 OSDs, you're not going to
get much more performance out of it than you would just using the disk,
as you're seeing.
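
If I remember right, the single-job sync write test from that post looks roughly like the following; replace /dev/sdX with the journal device, and note that it writes directly to the raw device:

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test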

 

On Wed, Nov 14, 2018 at 12:21 PM  wrote:



Hi all,
 
We want to compare the performance of an HDD partition as the journal
(inline on the OSD disk) against an SSD partition as the journal. Here is what we
did: we have 3 nodes used as Ceph OSD hosts, each with 3 OSDs on it.
First, we created the OSDs with the journal on a partition of the OSD disk and
ran the "rados bench" utility to test the performance; then we migrated the
journal from the HDD to an SSD (Intel S4500) and ran "rados bench" again. The
expected result was that the SSD journal would be much better than the HDD, but
the results show nearly no change.
 
The configuration of Ceph is as below,
pool size: 3
osd size: 3*3
pg (pgp) num: 300
osd nodes are separated across three different nodes
rbd image size: 10G (10240M)
 
The utility I used is,
rados bench -p rbd $duration write
rados bench -p rbd $duration seq
rados bench -p rbd $duration rand
 
Is there anything wrong from what I did?  Could anyone give me some
suggestion?
 
 
Best Regards,
Dave Chen
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Kjetil Joergensen
This may be less of an issue now. The most traumatic experience for us,
back around Hammer, was memory usage under recovery plus load ending in OOM kills
of OSDs, which triggered more recovery: a pretty vicious cycle.

-KJ

On Wed, Nov 14, 2018 at 11:45 AM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Hello
>
> I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and
> 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400
> PGs each (a lot more pools use SSDs than HDDs). Servers are fairly
> powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.
>
> The impression I got from the docs is that having more than 200 PGs per
> OSD is not a good thing, but justifications were vague (no concrete
> numbers), like increased peering time, increased resource consumption,
> and possibly decreased recovery performance. None of these appeared to
> be a significant problem in my testing, but the tests were very basic
> and done on a pretty empty cluster under minimal load, so I worry I'll
> run into trouble down the road.
>
> Here are the questions I have:
> - In practice, is it a big deal that some OSDs have ~400 PGs?
> - In what situations would our cluster most likely fare significantly
> better if I went through the trouble of re-creating pools so that no OSD
> would have more than, say, ~100 PGs?
> - What performance metrics could I monitor to detect possible issues due
> to having too many PGs?
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Kjetil Joergensen 
SRE, Medallia Inc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to repair active+clean+inconsistent?

2018-11-14 Thread Brad Hubbard
You could try a 'rados get' and then a 'rados put' on the object to start with.
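
A sketch of that, with placeholders for the pool and for the full object name taken from the list-inconsistent-obj output:

rados -p <pool> get <object-name> /tmp/object.bak
rados -p <pool> put <object-name> /tmp/object.bak
ceph pg deep-scrub 1.65    # then re-check rados list-inconsistent-obj 1.65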
On Thu, Nov 15, 2018 at 4:07 AM K.C. Wong  wrote:
>
> So, I’ve issued the deep-scrub command (and the repair command)
> and nothing seems to happen.
> Unrelated to this issue, I have to take down some OSD to prepare
> a host for RMA. One of them happens to be in the replication
> group for this PG. So, a scrub happened indirectly. I now have
> this from “ceph -s”:
>
> cluster 374aed9e-5fc1-47e1-8d29-4416f7425e76
>  health HEALTH_ERR
> 1 pgs inconsistent
> 18446 scrub errors
>  monmap e1: 3 mons at 
> {mgmt01=10.0.1.1:6789/0,mgmt02=10.1.1.1:6789/0,mgmt03=10.2.1.1:6789/0}
> election epoch 252, quorum 0,1,2 mgmt01,mgmt02,mgmt03
>   fsmap e346: 1/1/1 up {0=mgmt01=up:active}, 2 up:standby
>  osdmap e40248: 120 osds: 119 up, 119 in
> flags sortbitwise,require_jewel_osds
>   pgmap v22025963: 3136 pgs, 18 pools, 18975 GB data, 214 Mobjects
> 59473 GB used, 287 TB / 345 TB avail
> 3120 active+clean
>   15 active+clean+scrubbing+deep
>1 active+clean+inconsistent
>
> That’s a lot of scrub errors:
>
> HEALTH_ERR 1 pgs inconsistent; 18446 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,33]
> 18446 scrub errors
>
> Now, “rados list-inconsistent-obj 1.65” returns a *very* long JSON
> output. Here’s a very small snippet, the errors look the same across:
>
> {
>   “object”:{
> "name":”10ea8bb.0045”,
> "nspace":”",
> "locator":”",
> "snap":"head”,
> "version”:59538
>   },
>   "errors":["attr_name_mismatch”],
>   "union_shard_errors":["oi_attr_missing”],
>   "selected_object_info":"1:a70dc1cc:::10ea8bb.0045:head(2897'59538 
> client.4895965.0:462007 dirty|data_digest|omap_digest s 4194304 uv 59538 dd 
> f437a612 od  alloc_hint [0 0])”,
>   "shards”:[
> {
>   "osd":33,
>   "errors":[],
>   "size":4194304,
>   "omap_digest”:"0x”,
>   "data_digest”:"0xf437a612”,
>   "attrs":[
> {"name":"_”,
>  "value":”EAgNAQAABAM1AA...“,
>  "Base64":true},
> {"name":"snapset”,
>  "value":”AgIZAQ...“,
>  "Base64":true}
>   ]
> },
> {
>   "osd":62,
>   "errors":[],
>   "size":4194304,
>   "omap_digest":"0x”,
>   "data_digest":"0xf437a612”,
>   "attrs”:[
> {"name":"_”,
>  "value":”EAgNAQAABAM1AA...",
>  "Base64":true},
> {"name":"snapset”,
>  "value":”AgIZAQ…",
>  "Base64":true}
>   ]
> },
> {
>   "osd":67,
>   "errors":["oi_attr_missing”],
>   "size":4194304,
>   "omap_digest":"0x”,
>   "data_digest":"0xf437a612”,
>   "attrs":[]
> }
>   ]
> }
>
> Clearly, on osd.67, the “attrs” array is empty. The question is,
> how do I fix this?
>
> Many thanks in advance,
>
> -kc
>
> K.C. Wong
> kcw...@verseon.com
> M: +1 (408) 769-8235
>
> -
> Confidentiality Notice:
> This message contains confidential information. If you are not the
> intended recipient and received this message in error, any use or
> distribution is strictly prohibited. Please also notify us
> immediately by return e-mail, and delete this message from your
> computer system. Thank you.
> -
>
> 4096R/B8995EDE  E527 CBE8 023E 79EA 8BBB  5C77 23A6 92E9 B899 5EDE
>
> hkps://hkps.pool.sks-keyservers.net
>
> On Nov 11, 2018, at 10:58 PM, Brad Hubbard  wrote:
>
> On Mon, Nov 12, 2018 at 4:21 PM Ashley Merrick  
> wrote:
>
>
> Your need to run "ceph pg deep-scrub 1.65" first
>
>
> Right, thanks Ashley. That's what the "Note that you may have to do a
> deep scrub to populate the output." part of my answer meant but
> perhaps I needed to go further?
>
> The system has a record of a scrub error on a previous scan but
> subsequent activity in the cluster has invalidated the specifics. You
> need to run another scrub to get the specific information for this pg
> at this point in time (the information does not remain valid
> indefinitely and therefore may need to be renewed depending on
> circumstances).
>
>
> On Mon, Nov 12, 2018 at 2:20 PM K.C. Wong  wrote:
>
>
> Hi Brad,
>
> I got the following:
>
> [root@mgmt01 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 1.65 is active+clean+inconsistent, acting [62,67,47]
> 1 scrub errors
> [root@mgmt01 ~]# rados list-inconsistent-obj 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
> [root@mgmt01 ~]# rados list-inconsistent-snapset 1.65
> No scrub information available for pg 1.65
> error 2: (2) No such file or directory
>
> Rather odd output, I’d say; not that I understand what
> that means. I also tried ceph list-inconsistent-pg:
>
> [root@mgmt01 ~]# rados lspools
> rbd
> cephf

Re: [ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Mark Nelson


On 11/14/18 1:45 PM, Vladimir Brik wrote:

Hello

I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs 
and 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 
400 PGs each (a lot more pools use SSDs than HDDs). Servers are fairly 
powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.


The impression I got from the docs is that having more than 200 PGs 
per OSD is not a good thing, but justifications were vague (no 
concrete numbers), like increased peering time, increased resource 
consumption, and possibly decreased recovery performance. None of 
these appeared to be a significant problem in my testing, but the 
tests were very basic and done on a pretty empty cluster under minimal 
load, so I worry I'll run into trouble down the road.


Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly 
better if I went through the trouble of re-creating pools so that no 
OSD would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues 
due to having too many PGs?



It's a fuzzy sort of thing.  During normal operation: With more PGs 
you'll store more pglog info in memory, so you'll have a more bloated 
OSD process.  If you use the new bluestore option for setting an osd 
memory target, that will mean less memory for caches.  It will also 
likely mean that there's a greater chance that pglog entries won't be 
invalidated before memtable flushes in rocksdb, so you might end up with 
higher write amp and slower DB performance as those entries get 
compacted into L0+.  That could matter with RGW or if you are doing lots 
of small 4k writes with RBD.
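
If you want to play with the memory target, something like the following; the 4GiB value is arbitrary, not a recommendation:

ceph config set osd osd_memory_target 4294967296
ceph daemon osd.0 config get osd_memory_target    # on the OSD host, to verify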


I'd see what Neha/Josh think about the impact on recovery, though I 
suppose one upside is that more PGs means you get a longer log based 
recovery window.  You could accomplish the same effect by increasing the 
number of pglog entries per PG (or keeping the same overall number of 
entries by having more PGs and lowering the number of entries per PG). An 
upside to having more PGs is better data distribution quality, though we 
can now get much better distributions with the new balancer code, even 
with fewer PGs. One bad thing about having too few PGs is that you can 
have increased lock contention.  The balancer can make the data 
distribution better but you still can't shrink the number of PGs per 
pool too low.
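
To see where a cluster actually stands and to try the balancer, something along these lines (upmap mode assumes all clients are Luminous or newer):

ceph osd df tree                   # the PGS column shows PGs per OSD
ceph mgr module enable balancer    # if not already on
ceph balancer mode upmap
ceph balancer on
ceph balancer status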


The gist of it is that if you decide to look into this yourself you are 
probably going to find some contradictory evidence and trade-offs.  
There are pitfalls if you go too high and pitfalls if you go too low.  
I'm not sure we can easily define the exact PG counts/OSD where they 
happen since it's sort of dependent on how much memory you have, how 
fast your hardware is, whether you are using the balancer, and what your 
expectations are.


How's that for a non-answer? ;)

Mark




Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-14 Thread Vlad Kopylov
Thanks Konstantin. I already tried accessing it in different ways, and the best
I got was bulk-renamed files and other non-presentable data.

Maybe to solve this I can create overlapping OSD pools?
For example, one pool that includes all 3 OSDs for replication, and 3 more pools
that each include just the one OSD at a given site, holding the same blocks?

v

On Wed, Nov 14, 2018 at 12:11 AM Konstantin Shalygin  wrote:

> Or is it possible to mount one OSD directly for read file access?
>
> In Ceph is impossible to io directly to OSD, only to PG.
>
>
>
> k
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Librbd performance VS KRBD performance

2018-11-14 Thread 赵赵贺东
Hi cephers,


All our cluster OSDs are deployed on armhf.
Could someone say something about what a reasonable performance ratio for
librbd vs. KRBD is, or a reasonable range of performance loss when we use librbd
compared to KRBD?
I googled a lot, but I could not find a solid criterion.
In fact, it has confused me for a long time.

About our tests:
In a small cluster (12 OSDs), 4M sequential write performance for librbd vs. KRBD is
about 0.89 : 1 (177MB/s : 198MB/s).
In a big cluster (72 OSDs), 4M sequential write performance for librbd vs. KRBD is
about 0.38 : 1 (420MB/s : 1080MB/s).

We expect that even as we increase the number of OSDs, librbd performance can stay
close to KRBD.

PS: librbd performance was tested both with the fio rbd engine and with iSCSI
(tcmu+librbd).

Thanks.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-14 Thread Konstantin Shalygin

On 11/15/18 9:31 AM, Vlad Kopylov wrote:
Thanks Konstantin, I already tried accessing it in different ways and 
best I got is bulk renamed files and other non presentable data.


Maybe to solve this I can create overlapping osd pools?
Like one pool includes all 3 osd for replication, and 3 more include 
one osd at each site with same blocks?




As far as I understand, you need something like this:


vm1 io -> building1 osds only

vm2 io -> building2 osds only

vm3 io -> buildgin3 osds only


Right?
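
If that's the case, the usual way is a CRUSH rule per building plus a pool per rule, rather than touching OSDs directly. A rough sketch, assuming the CRUSH map already has buckets named building1/building2/building3 (rule and pool names and the PG counts are placeholders):

ceph osd crush rule create-replicated building1-rule building1 host
ceph osd pool create building1-pool 128 128 replicated building1-rule
# ...and likewise for building2 and building3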



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placement Groups undersized after adding OSDs

2018-11-14 Thread Gregory Farnum
This is weird. Can you capture the pg query for one of them and narrow down
in which epoch it “lost” the previous replica and see if there’s any
evidence of why?
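
Something like the following should narrow it down; jq is only there for readability and the pg id is taken from your list:

ceph pg 11.3b54 query > /tmp/pg-11.3b54.json
jq '.recovery_state' /tmp/pg-11.3b54.json
jq '.info.history' /tmp/pg-11.3b54.json     # same_interval_since etc.
ceph osd dump | grep -w 'osd.1422'          # compare up_from/up_thru epochs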
On Wed, Nov 14, 2018 at 8:09 PM Wido den Hollander  wrote:

> Hi,
>
> I'm in the middle of expanding a Ceph cluster and while having 'ceph -s'
> open I suddenly saw a bunch of Placement Groups go undersized.
>
> My first hint was that one or more OSDs have failed, but none did.
>
> So I checked and I saw these Placement Groups undersized:
>
> 11.3b54 active+undersized+degraded+remapped+backfill_wait
> [1795,639,1422]   1795   [1795,639]   1795
> 11.362f active+undersized+degraded+remapped+backfill_wait
> [1431,1134,2217]   1431  [1134,1468]   1134
> 11.3e31 active+undersized+degraded+remapped+backfill_wait
> [1451,1391,1906]   1451  [1906,2053]   1906
> 11.50c  active+undersized+degraded+remapped+backfill_wait
> [1867,1455,1348]   1867  [1867,2036]   1867
> 11.421e   active+undersized+degraded+remapped+backfilling
> [280,117,1421]280[280,117]280
> 11.700  active+undersized+degraded+remapped+backfill_wait
> [2212,1422,2087]   2212  [2055,2087]   2055
> 11.735active+undersized+degraded+remapped+backfilling
> [772,1832,1433]772   [772,1832]772
> 11.d5a  active+undersized+degraded+remapped+backfill_wait
> [423,1709,1441]423   [423,1709]423
> 11.a95  active+undersized+degraded+remapped+backfill_wait
> [1433,1180,978]   1433   [978,1180]978
> 11.a67  active+undersized+degraded+remapped+backfill_wait
> [1154,1463,2151]   1154  [1154,2151]   1154
> 11.10ca active+undersized+degraded+remapped+backfill_wait
> [2012,486,1457]   2012   [2012,486]   2012
> 11.2439 active+undersized+degraded+remapped+backfill_wait
> [910,1457,1193]910   [910,1193]910
> 11.2f7e active+undersized+degraded+remapped+backfill_wait
> [1423,1356,2098]   1423  [1356,2098]   1356
>
> After searching I found that OSDs
> 1422,1431,1451,1455,1421,1422,1433,1441,1433,1463,1457,1457 and 1423 are
> all running on the same (newly) added host.
>
> I checked:
> - The host did not reboot
> - The OSDs did not restart
>
> The OSDs are up_thru since map 646724 which is from 11:05 this morning
> (4,5 hours ago), which is about the same time when these were added.
>
> So these PGs are currently running on *2* replicas while they should be
> running on *3*.
>
> We just added 8 nodes with 24 disks each to the cluster, but none of the
> existing OSDs were touched.
>
> When looking at PG 11.3b54 I see that 1422 is a backfill target:
>
> $ ceph pg 11.3b54 query|jq '.recovery_state'
>
> The 'enter time' for this is about 30 minutes ago and that's about the
> same time this has happened.
>
> 'might_have_unfound' tells me OSD 1982 which is in the same rack as 1422
> (CRUSH replicates over racks), but that OSD is also online.
>
> It's up_thru = 647122 and that's from about 30 minutes ago. That
> ceph-osd process is however running since September and seems to be
> functioning fine.
>
> This confuses me as during such an expansion I know that normally a PG
> would map to size+1 until the backfill finishes.
>
> The cluster is running Luminous 12.2.8 on CentOS 7.5.
>
> Any ideas on what this could be?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librbd performance VS KRBD performance

2018-11-14 Thread Gregory Farnum
You'll need to provide more data about how your test is configured and run
for us to have a good idea. IIRC librbd is often faster than krbd because
it can support newer features and things, but krbd may have less overhead
and is not dependent on the VM's driver configuration in QEMU...

On Thu, Nov 15, 2018 at 8:22 AM 赵赵贺东  wrote:

> Hi cephers,
>
>
> All our cluster osds are deployed in armhf.
> Could someone say something about what is the rational performance rates
> for librbd VS KRBD ?
> Or rational performance loss range when we use librbd compare to KRBD.
> I googled a lot, but I could not find a solid criterion.
> In fact , it confused me for a long time.
>
> About our tests:
> In a small cluster(12 osds), 4m seq write performance for Librbd VS KRBD
> is about 0.89 : 1 (177MB/s : 198MB/s ).
> In a big cluster (72 osds), 4m seq write performance for Librbd VS KRBD is
> about  0.38: 1 (420MB/s : 1080MB/s).
>
> We expect even increase  osd numbers, Librbd performance can keep being
> close to KRBD.
>
> PS: Librbd performance are tested both in  fio rbd engine & iscsi
> (tcmu+librbd).
>
> Thanks.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Effects of restoring a cluster's mon from an older backup

2018-11-14 Thread Gregory Farnum
On Mon, Nov 12, 2018 at 2:46 PM Hector Martin  wrote:

> On 10/11/2018 06:35, Gregory Farnum wrote:
> > Yes, do that, don't try and back up your monitor. If you restore a
> > monitor from backup then the monitor — your authoritative data source —
> > will warp back in time on what the OSD peering intervals look like,
> > which snapshots have been deleted and created, etc. It would be a huge
> > disaster and probably every running daemon or client would have to pause
> > IO until the monitor generated enough map epochs to "catch up" — and
> > then the rest of the cluster would start applying those changes and
> > nothing would work right.
>
> Thanks, I suspected this might be the case. Is there any reasonable safe
> "backwards warp" time window (that would permit asynchronous replication
> of mon storage to be good enough for disaster recovery), e.g. on the
> order of seconds? I assume synchronous replication is fine (e.g. RAID or
> DRBD configured correctly) since that's largely equivalent to local
> storage. I'll probably go with something like that for mon durability.
>

Unfortunately there really isn't. Any situation in which a monitor goes
back in time opens up the possibility (even likelihood!) that updates which
directly impact data services can disappear and cause issues. Synchronous
replication is fine, although I'm not sure there's much advantage to doing
that over simply running another monitor in that disk location.
-Greg


>
> > Unlike the OSDMap, the MDSMap doesn't really keep track of any
> > persistent data so it's much safer to rebuild or reset from scratch.
> > -Greg
>
> Good to know. I'll see if I can do some DR tests when I set this up, to
> prove to myself that it all works out :-)
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://marcan.st/marcan.asc
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librbd performance VS KRBD performance

2018-11-14 Thread 赵赵贺东
Thank you for your attention.

Our tests are run in a physical machine environment.

Fio for KRBD:
[seq-write]
description="seq-write"
direct=1
ioengine=libaio
filename=/dev/rbd0
numjobs=1
iodepth=256
group_reporting
rw=write
bs=4M
size=10T
runtime=180

* /dev/rbd0 is mapped from rbd_pool/image2, so the KRBD and librbd fio tests use the
same image.

Fio for librbd:
[global]
direct=1
numjobs=1
ioengine=rbd
clientname=admin
pool=rbd_pool
rbdname=image2
invalidate=0# mandatory
rw=write
bs=4M
size=10T
runtime=180

[rbd_iodepth32]
iodepth=256


Image info:
rbd image 'image2':
size 50TiB in 13107200 objects
order 22 (4MiB objects)
data_pool: ec_rbd_pool
block_name_prefix: rbd_data.8.148bb6b8b4567
format: 2
features: layering, data-pool
flags: 
create_timestamp: Wed Nov 14 09:21:18 2018

* data_pool is an EC pool

Pool info:
pool 8 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 82627 flags hashpspool stripe_width 
0 application rbd
pool 9 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 4 object_hash 
rjenkins pg_num 256 pgp_num 256 last_change 82649 flags 
hashpspool,ec_overwrites stripe_width 16384 application rbd
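
As a side note, a hedged sketch of how pools and an image laid out like this
are usually created (PG counts are copied from the output above; the EC
profile itself is not shown and would have to match k+m=6):

  ceph osd pool create ec_rbd_pool 256 256 erasure
  ceph osd pool set ec_rbd_pool allow_ec_overwrites true
  ceph osd pool application enable ec_rbd_pool rbd
  rbd create rbd_pool/image2 --size 50T --data-pool ec_rbd_pool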


RBD cache: off (because with tcmu the rbd cache is forced off anyway, and our
cluster will export disks over iSCSI in the future).
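
If anyone wants to pin that down explicitly rather than rely on tcmu's
behaviour, a hedged ceph.conf override on the client side would be:

  [client]
  rbd cache = false   # make the no-cache test condition explicit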


Thanks!


> On Nov 15, 2018, at 1:22 PM, Gregory Farnum wrote:
> 
> You'll need to provide more data about how your test is configured and run 
> for us to have a good idea. IIRC librbd is often faster than krbd because it 
> can support newer features and things, but krbd may have less overhead and is 
> not dependent on the VM's driver configuration in QEMU...
> 
> On Thu, Nov 15, 2018 at 8:22 AM 赵贺东 wrote:
> Hi cephers,
> 
> 
> All our cluster OSDs are deployed on armhf.
> Could someone say what a reasonable performance ratio is for librbd vs
> KRBD, or what a reasonable performance loss range is when we use librbd
> compared to KRBD?
> I googled a lot, but I could not find a solid criterion.
> In fact, it has confused me for a long time.
> 
> About our tests:
> In a small cluster (12 OSDs), 4M seq write performance for librbd vs KRBD is
> about 0.89 : 1 (177 MB/s : 198 MB/s).
> In a big cluster (72 OSDs), 4M seq write performance for librbd vs KRBD is
> about 0.38 : 1 (420 MB/s : 1080 MB/s).
> 
> We expect that even as OSD counts increase, librbd performance can stay
> close to KRBD.
> 
> PS: librbd performance was tested both with the fio rbd engine and over
> iSCSI (tcmu + librbd).
> 
> Thanks.
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librbd performance VS KRBD performance

2018-11-14 Thread Jason Dillaman
Attempting to send 256 concurrent 4MiB writes via librbd will pretty
quickly hit the default "objecter_inflight_op_bytes = 100 MiB" limit,
which will drastically slow (stall) librados. I would recommend
re-testing librbd w/ a much higher throttle override.
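
A hedged example of such an override on the fio client; the option names are
the stock Ceph ones mentioned above, and the values are just illustrative,
sized for 256 in-flight 4 MiB writes:

  # ceph.conf on the client running fio's rbd engine
  [client]
  objecter_inflight_op_bytes = 1073741824   # 1 GiB instead of the 100 MiB default
  objecter_inflight_ops = 2048              # default is 1024
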
On Thu, Nov 15, 2018 at 11:34 AM 赵贺东 wrote:
>
> Thank you for your attention.
>
> Our tests are run on physical machines.
>
> Fio for KRBD:
> [seq-write]
> description="seq-write"
> direct=1
> ioengine=libaio
> filename=/dev/rbd0
> numjobs=1
> iodepth=256
> group_reporting
> rw=write
> bs=4M
> size=10T
> runtime=180
>
> * /dev/rbd0 is mapped from rbd_pool/image2, so the KRBD and librbd fio tests
> use the same image.
>
> Fio for librbd:
> [global]
> direct=1
> numjobs=1
> ioengine=rbd
> clientname=admin
> pool=rbd_pool
> rbdname=image2
> invalidate=0    # mandatory
> rw=write
> bs=4M
> size=10T
> runtime=180
>
> [rbd_iodepth32]
> iodepth=256
>
>
> Image info:
> rbd image 'image2':
> size 50TiB in 13107200 objects
> order 22 (4MiB objects)
> data_pool: ec_rbd_pool
> block_name_prefix: rbd_data.8.148bb6b8b4567
> format: 2
> features: layering, data-pool
> flags:
> create_timestamp: Wed Nov 14 09:21:18 2018
>
> * data_pool is an EC pool
>
> Pool info:
> pool 8 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
> rjenkins pg_num 256 pgp_num 256 last_change 82627 flags hashpspool 
> stripe_width 0 application rbd
> pool 9 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 4 object_hash 
> rjenkins pg_num 256 pgp_num 256 last_change 82649 flags 
> hashpspool,ec_overwrites stripe_width 16384 application rbd
>
>
> RBD cache: off (because with tcmu the rbd cache is forced off anyway, and
> our cluster will export disks over iSCSI in the future).
>
>
> Thanks!
>
>
> On Nov 15, 2018, at 1:22 PM, Gregory Farnum wrote:
>
> You'll need to provide more data about how your test is configured and run 
> for us to have a good idea. IIRC librbd is often faster than krbd because it 
> can support newer features and things, but krbd may have less overhead and is 
> not dependent on the VM's driver configuration in QEMU...
>
> On Thu, Nov 15, 2018 at 8:22 AM 赵贺东 wrote:
>>
>> Hi cephers,
>>
>>
>> All our cluster OSDs are deployed on armhf.
>> Could someone say what a reasonable performance ratio is for librbd vs
>> KRBD, or what a reasonable performance loss range is when we use librbd
>> compared to KRBD?
>> I googled a lot, but I could not find a solid criterion.
>> In fact, it has confused me for a long time.
>>
>> About our tests:
>> In a small cluster (12 OSDs), 4M seq write performance for librbd vs KRBD is
>> about 0.89 : 1 (177 MB/s : 198 MB/s).
>> In a big cluster (72 OSDs), 4M seq write performance for librbd vs KRBD is
>> about 0.38 : 1 (420 MB/s : 1080 MB/s).
>>
>> We expect that even as OSD counts increase, librbd performance can stay
>> close to KRBD.
>>
>> PS: librbd performance was tested both with the fio rbd engine and over
>> iSCSI (tcmu + librbd).
>>
>> Thanks.
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librbd performance VS KRBD performance

2018-11-14 Thread 赵贺东
Thank you for your suggestion.
It really gives me a lot of inspiration.


I will test as you suggested, and browse through src/common/config_opts.h to
see if I can find some performance-related configs.
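
(A quick, hedged way to scan for the options Jason mentioned; note that on
recent releases most option definitions have moved to src/common/options.cc:)

  grep -nE 'objecter_inflight|rbd_cache|bluestore_cache' src/common/config_opts.h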

But our OSD nodes' hardware itself is very poor; that is the truth, and we
have to face it.
There are two OSDs per ARM board, with 2 GB of memory and 2 x 10 TB HDDs on
board, so each OSD has 1 GB of memory to serve a 10 TB HDD. We must try to
make the cluster work as well as we can.
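
For OSDs squeezed into roughly 1 GB of RAM each, a hedged sketch of the kind
of settings that are usually reduced (values are illustrative, not tested
recommendations for this hardware):

  [osd]
  bluestore cache size = 268435456   # cap the BlueStore cache at 256 MiB per OSD
  osd map cache size = 50            # keep fewer OSDMaps in memory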


Thanks.

> On Nov 15, 2018, at 2:08 PM, Jason Dillaman wrote:
> 
> Attempting to send 256 concurrent 4MiB writes via librbd will pretty
> quickly hit the default "objecter_inflight_op_bytes = 100 MiB" limit,
> which will drastically slow (stall) librados. I would recommend
> re-testing librbd w/ a much higher throttle override.
> On Thu, Nov 15, 2018 at 11:34 AM 赵贺东 wrote:
>> 
>> Thank you for your attention.
>> 
>> Our tests are run on physical machines.
>> 
>> Fio for KRBD:
>> [seq-write]
>> description="seq-write"
>> direct=1
>> ioengine=libaio
>> filename=/dev/rbd0
>> numjobs=1
>> iodepth=256
>> group_reporting
>> rw=write
>> bs=4M
>> size=10T
>> runtime=180
>> 
>> * /dev/rbd0 is mapped from rbd_pool/image2, so the KRBD and librbd fio tests
>> use the same image.
>> 
>> Fio for librbd:
>> [global]
>> direct=1
>> numjobs=1
>> ioengine=rbd
>> clientname=admin
>> pool=rbd_pool
>> rbdname=image2
>> invalidate=0    # mandatory
>> rw=write
>> bs=4M
>> size=10T
>> runtime=180
>> 
>> [rbd_iodepth32]
>> iodepth=256
>> 
>> 
>> Image info:
>> rbd image 'image2':
>> size 50TiB in 13107200 objects
>> order 22 (4MiB objects)
>> data_pool: ec_rbd_pool
>> block_name_prefix: rbd_data.8.148bb6b8b4567
>> format: 2
>> features: layering, data-pool
>> flags:
>> create_timestamp: Wed Nov 14 09:21:18 2018
>> 
>> * data_pool is an EC pool
>> 
>> Pool info:
>> pool 8 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
>> rjenkins pg_num 256 pgp_num 256 last_change 82627 flags hashpspool 
>> stripe_width 0 application rbd
>> pool 9 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 4 object_hash 
>> rjenkins pg_num 256 pgp_num 256 last_change 82649 flags 
>> hashpspool,ec_overwrites stripe_width 16384 application rbd
>> 
>> 
>> RBD cache: off (because with tcmu the rbd cache is forced off anyway, and
>> our cluster will export disks over iSCSI in the future).
>> 
>> 
>> Thanks!
>> 
>> 
>> On Nov 15, 2018, at 1:22 PM, Gregory Farnum wrote:
>> 
>> You'll need to provide more data about how your test is configured and run 
>> for us to have a good idea. IIRC librbd is often faster than krbd because it 
>> can support newer features and things, but krbd may have less overhead and 
>> is not dependent on the VM's driver configuration in QEMU...
>> 
>> On Thu, Nov 15, 2018 at 8:22 AM 赵贺东 wrote:
>>> 
>>> Hi cephers,
>>> 
>>> 
>>> All our cluster OSDs are deployed on armhf.
>>> Could someone say what a reasonable performance ratio is for librbd vs
>>> KRBD, or what a reasonable performance loss range is when we use librbd
>>> compared to KRBD?
>>> I googled a lot, but I could not find a solid criterion.
>>> In fact, it has confused me for a long time.
>>> 
>>> About our tests:
>>> In a small cluster (12 OSDs), 4M seq write performance for librbd vs KRBD is
>>> about 0.89 : 1 (177 MB/s : 198 MB/s).
>>> In a big cluster (72 OSDs), 4M seq write performance for librbd vs KRBD is
>>> about 0.38 : 1 (420 MB/s : 1080 MB/s).
>>>
>>> We expect that even as OSD counts increase, librbd performance can stay
>>> close to KRBD.
>>>
>>> PS: librbd performance was tested both with the fio rbd engine and over
>>> iSCSI (tcmu + librbd).
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mgr Prometheus plugin: error when osd is down

2018-11-14 Thread Gökhan Kocak
True, sorry and many thanks!

Gökhan
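
(For anyone hitting this later: as John points out below, the KeyError was
fixed in 12.2.5, so a hedged first check is simply the running versions, and
re-enabling the module after upgrading:)

  ceph versions                       # all daemons should report >= 12.2.5
  ceph mgr module disable prometheus
  ceph mgr module enable prometheus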

On 14.11.18 21:03, John Spray wrote:
> On Wed, Nov 14, 2018 at 3:32 PM Gökhan Kocak
>  wrote:
>> Hello everyone,
>>
>> we encountered an error with the Prometheus plugin for Ceph mgr:
>> One osd was down and (therefore) it had no class:
>> ```
>> sudo ceph osd tree
>> ID  CLASS  WEIGHT   TYPE NAME  STATUS  REWEIGHT  PRI-AFF
>> 28  hdd    7.27539  osd.28     up      1.0       1.0
>>  6         0        osd.6      down    0         1.0
>>
>> ```
>>
>> When we tried to curl the metrics, there was an error because the osd
>> had no class (see below "KeyError: 'class' ").
> I suspect you're running an old release?  This bug
> (https://tracker.ceph.com/issues/23300) was fixed in 12.2.5.
>
> John
>
>> Anybody experience the same?
>>
>> Isn't this a bug in the Prometheus plugin? When an OSD is down, the plugin
>> should not stop working, in my opinion.
>>
>> ```
>> ~> curl -v 127.0.0.1:9283/metrics
>> *   Trying 127.0.0.1...
>> * Connected to 127.0.0.1 (127.0.0.1) port 9283 (#0)
>>> GET /metrics HTTP/1.1
>>> Host: 127.0.0.1:9283
>>> User-Agent: curl/7.47.0
>>> Accept: */*
>>>
>> < HTTP/1.1 500 Internal Server Error
>> < Date: Wed, 14 Nov 2018 13:59:59 GMT
>> < Content-Length: 1663
>> < Content-Type: text/html;charset=utf-8
>> < Server: CherryPy/3.5.0
>> <
>> [HTML preamble and inline CSS of the 500 error page omitted]
>> 500 Internal Server Error
>> The server encountered an unexpected condition which
>> prevented it from fulfilling the request.
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line
>> 670, in respond
>> response.body = self.handler()
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line
>> 217, in __call__
>> self.body = self.oldhandler(*args, **kwargs)
>>   File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line
>> 61, in __call__
>> return self.callable(*self.args, **self.kwargs)
>>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
>> 414, in metrics
>> metrics = global_instance().collect()
>>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
>> 351, in collect
>> self.get_metadata_and_osd_status()
>>   File "/usr/lib/x86_64-linux-gnu/ceph/mgr/prometheus/module.py", line
>> 310, in get_metadata_and_osd_status
>> dev_class['class'],
>> KeyError: 'class'
>> Powered by CherryPy 3.5.0
>> * Connection #0 to host 127.0.0.1 left intact
>> ```
>>
>> Kind regards,
>>
>> Gökhan
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com