[ceph-users] Slow Performance - Sequential IO

2020-01-13 Thread Anthony Brandelli (abrandel)
I have a newly setup test cluster that is giving some surprising numbers when 
running fio against an RBD. The end goal here is to see how viable a Ceph based 
iSCSI SAN of sorts is for VMware clusters, which require a bunch of random IO.

Hardware:
2x E5-2630L v2 (2.4GHz, 6 core)
256GB RAM
2x 10gbps bonded network, Intel X520
LSI 9271-8i, SSDs used for OSDs in JBOD mode
Mons: 2x 1.2TB 10K SAS in RAID1
OSDs: 12x Samsung MZ6ER800HAGL-3 800GB SAS SSDs, super cap/power loss 
protection

Cluster setup:
Three mon nodes, four OSD nodes
Two OSDs per SSD
Replica 3 pool
Ceph 14.2.5

Ceph status:
  cluster:
id: e3d93b4a-520c-4d82-a135-97d0bda3e69d
health: HEALTH_WARN
application not enabled on 1 pool(s)
  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 6d)
mgr: mon2(active, since 6d), standbys: mon3, mon1
osd: 96 osds: 96 up (since 3d), 96 in (since 3d)
  data:
pools:   1 pools, 3072 pgs
objects: 857.00k objects, 1.8 TiB
usage:   432 GiB used, 34 TiB / 35 TiB avail
pgs: 3072 active+clean

Network between nodes tests at 9.88 Gbps. Direct testing of the SSDs with a 4K 
block size in fio shows 127k sequential read, 86k random read, 107k sequential 
write, and 52k random write IOPS. No high CPU load or interface saturation is 
noted when running tests against the RBD.

When testing with a 4K block size against an RBD from a dedicated bare-metal test 
host (same specs as the other cluster nodes noted above) I get the following 
(command similar to fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw= 
-pool=scbench -runtime=60 -rbdname=datatest):

10k sequential read iops
69k random read iops
13k sequential write iops
22k random write iops
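
For completeness, the four runs look roughly like this, one per -rw mode (the -rw
value was left out of the command above; read/randread/write/randwrite are the
standard fio mode names):

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw=read -pool=scbench -runtime=60 -rbdname=datatest
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw=randread -pool=scbench -runtime=60 -rbdname=datatest
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw=write -pool=scbench -runtime=60 -rbdname=datatest
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=32 -rw=randwrite -pool=scbench -runtime=60 -rbdname=datatest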

I’m not clear why the random ops, especially reads, would be so much faster than 
the sequential ops.

Any pointers appreciated.

Thanks,
Anthony


Re: [ceph-users] Acting sets sometimes may violate crush rule ?

2020-01-13 Thread Dan van der Ster
Hi,

One way this can happen is if you change the crush rule of a pool after the
balancer has been running for a while.
This is because the balancer's upmap entries are only validated against the
crush rule when they are initially created.

ceph osd dump | grep upmap

Does it explain your issue?
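
If that shows a pg_upmap_items entry for the affected PG, removing it should let
the PG fall back to the plain crush mapping, e.g. (the PG id here is just an
example, use your own):

ceph osd rm-pg-upmap-items 1.7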

.. Dan


On Tue, 14 Jan 2020, 04:17 Yi-Cian Pu,  wrote:

> Hi all,
>
> We can sometimes observe that the acting set seems to violate the crush rule.
> For example, we had the following environment:
>
> [root@Ann-per-R7-3 /]# ceph -s
>   cluster:
> id: 248ce880-f57b-4a4c-a53a-3fc2b3eb142a
> health: HEALTH_WARN
> 34/8019 objects misplaced (0.424%)
>
>   services:
> mon: 3 daemons, quorum Ann-per-R7-3,Ann-per-R7-7,Ann-per-R7-1
> mgr: Ann-per-R7-3(active), standbys: Ann-per-R7-7, Ann-per-R7-1
> mds: cephfs-1/1/1 up  {0=qceph-mds-Ann-per-R7-1=up:active}, 2 up:standby
> osd: 7 osds: 7 up, 7 in; 1 remapped pgs
>
>   data:
> pools:   7 pools, 128 pgs
> objects: 2.67 k objects, 10 GiB
> usage:   107 GiB used, 3.1 TiB / 3.2 TiB avail
> pgs: 34/8019 objects misplaced (0.424%)
>  127 active+clean
>  1   active+clean+remapped
>
> [root@Ann-per-R7-3 /]# ceph pg ls remapped
> PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES     LOG STATE                 STATE_STAMP                VERSION REPORTED UP      ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP
> 1.7 34      0        34        0       134217728 42  active+clean+remapped 2019-11-05 10:39:58.639533 144'42  229:407  [6,1]p6 [6,1,2]p6 2019-11-04 10:36:19.519820 2019-11-04 10:36:19.519820
>
>
> [root@Ann-per-R7-3 /]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
> -2 0 root perf_osd
> -1   3.10864 root default
> -7   0.44409 host Ann-per-R7-1
>  5   hdd 0.44409 osd.5 up  1.0 1.0
> -3   1.33228 host Ann-per-R7-3
>  0   hdd 0.44409 osd.0 up  1.0 1.0
>  1   hdd 0.44409 osd.1 up  1.0 1.0
>  2   hdd 0.44409 osd.2 up  1.0 1.0
> -9   1.33228 host Ann-per-R7-7
>  6   hdd 0.44409 osd.6 up  1.0 1.0
>  7   hdd 0.44409 osd.7 up  1.0 1.0
>  8   hdd 0.44409 osd.8 up  1.0 1.0
>
>
> [root@Ann-per-R7-3 /]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS
>  5   hdd 0.44409  1.0 465 GiB  21 GiB 444 GiB 4.49 1.36 127
>  0   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.16 0.96  44
>  1   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.14 0.95  52
>  2   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.98 0.91  33
>  6   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.97 0.90  43
>  7   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.19 0.97  41
>  8   hdd 0.44409  1.0 465 GiB  14 GiB 450 GiB 3.09 0.94  44
> TOTAL 3.2 TiB 107 GiB 3.1 TiB 3.29
> MIN/MAX VAR: 0.90/1.36  STDDEV: 0.49
>
>
> Based on our crush map, the crush rule should select 1 OSD from each host.
> However, from the above output we can see that the acting set is [6,1,2], and osd.1
> and osd.2 are in the same host, which seems to violate the crush rule. So my
> question is: how does this happen? Any enlightenment is much appreciated.
>
> Best
> Cian


[ceph-users] Acting sets sometimes may violate crush rule ?

2020-01-13 Thread Yi-Cian Pu
Hi all,

We can sometimes observe that the acting set seems to violate the crush rule.
For example, we had the following environment:

[root@Ann-per-R7-3 /]# ceph -s
  cluster:
id: 248ce880-f57b-4a4c-a53a-3fc2b3eb142a
health: HEALTH_WARN
34/8019 objects misplaced (0.424%)

  services:
mon: 3 daemons, quorum Ann-per-R7-3,Ann-per-R7-7,Ann-per-R7-1
mgr: Ann-per-R7-3(active), standbys: Ann-per-R7-7, Ann-per-R7-1
mds: cephfs-1/1/1 up  {0=qceph-mds-Ann-per-R7-1=up:active}, 2 up:standby
osd: 7 osds: 7 up, 7 in; 1 remapped pgs

  data:
pools:   7 pools, 128 pgs
objects: 2.67 k objects, 10 GiB
usage:   107 GiB used, 3.1 TiB / 3.2 TiB avail
pgs: 34/8019 objects misplaced (0.424%)
 127 active+clean
 1   active+clean+remapped

[root@Ann-per-R7-3 /]# ceph pg ls remapped
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES     LOG STATE                 STATE_STAMP                VERSION REPORTED UP      ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP
1.7 34      0        34        0       134217728 42  active+clean+remapped 2019-11-05 10:39:58.639533 144'42  229:407  [6,1]p6 [6,1,2]p6 2019-11-04 10:36:19.519820 2019-11-04 10:36:19.519820


[root@Ann-per-R7-3 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
-2 0 root perf_osd
-1   3.10864 root default
-7   0.44409 host Ann-per-R7-1
 5   hdd 0.44409 osd.5 up  1.0 1.0
-3   1.33228 host Ann-per-R7-3
 0   hdd 0.44409 osd.0 up  1.0 1.0
 1   hdd 0.44409 osd.1 up  1.0 1.0
 2   hdd 0.44409 osd.2 up  1.0 1.0
-9   1.33228 host Ann-per-R7-7
 6   hdd 0.44409 osd.6 up  1.0 1.0
 7   hdd 0.44409 osd.7 up  1.0 1.0
 8   hdd 0.44409 osd.8 up  1.0 1.0


[root@Ann-per-R7-3 /]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS
 5   hdd 0.44409  1.0 465 GiB  21 GiB 444 GiB 4.49 1.36 127
 0   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.16 0.96  44
 1   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.14 0.95  52
 2   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.98 0.91  33
 6   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.97 0.90  43
 7   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.19 0.97  41
 8   hdd 0.44409  1.0 465 GiB  14 GiB 450 GiB 3.09 0.94  44
TOTAL 3.2 TiB 107 GiB 3.1 TiB 3.29
MIN/MAX VAR: 0.90/1.36  STDDEV: 0.49


Based on our crush map, the crush rule should select 1 OSD from each host.
However, from the above output we can see that the acting set is [6,1,2], and osd.1
and osd.2 are in the same host, which seems to violate the crush rule. So my
question is: how does this happen? Any enlightenment is much appreciated.
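
For reference, this is how the rule can be checked, and what a one-OSD-per-host
replicated rule typically looks like when decompiled (the rule body below is a
generic sketch, not necessarily our exact map):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# a typical replicated rule in crushmap.txt:
#   rule replicated_rule {
#       id 0
#       type replicated
#       min_size 1
#       max_size 10
#       step take default
#       step chooseleaf firstn 0 type host
#       step emit
#   }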

Best
Cian


Re: [ceph-users] units of metrics

2020-01-13 Thread Robert LeBlanc
The link that you referenced above is no longer available; do you have a
new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
changed, so I'm trying to map the old values to the new values. Might just
have to look in the code. :(
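
In the meantime the raw counters (and their schema) are still visible on the
admin socket, e.g. (the daemon name below is just an example):

ceph daemon mds.a perf dump
# latency counters are objects like {"avgcount": N, "sum": S, "avgtime": T};
# an average over an interval is delta(sum) / delta(avgcount)
ceph daemon mds.a perf schema
# lists each counter's type and description, which helps when mapping old names to new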

Thanks!

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 12, 2019 at 8:02 AM Paul Emmerich 
wrote:

> We use a custom script to collect these metrics in croit
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Sep 12, 2019 at 5:00 PM Stefan Kooman  wrote:
> >
> > Hi Paul,
> >
> > Quoting Paul Emmerich (paul.emmer...@croit.io):
> > >
> https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
> >
> > Thanks for the link. So, what tool do you use to gather the metrics? We
> > are using telegraf module of the Ceph manager. However, this module only
> > provides "sum" and not "avgtime" so I can't do the calculations. The
> > influx and zabbix mgr modules also only provide "sum". The only metrics
> > module that *does* send "avgtime" is the prometheus module:
> >
> > ceph_mds_reply_latency_sum
> > ceph_mds_reply_latency_count
> >
> > All modules use "self.get_all_perf_counters()" though:
> >
> > ~/git/ceph/src/pybind/mgr/ > grep -Ri get_all_perf_counters *
> > dashboard/controllers/perf_counters.py: return mgr.get_all_perf_counters()
> > diskprediction_cloud/agent/metrics/ceph_mon_osd.py: perf_data = obj_api.module.get_all_perf_counters(services=('mon', 'osd'))
> > influx/module.py: for daemon, counters in six.iteritems(self.get_all_perf_counters()):
> > mgr_module.py: def get_all_perf_counters(self, prio_limit=PRIO_USEFUL,
> > prometheus/module.py: for daemon, counters in self.get_all_perf_counters().items():
> > restful/api/perf.py: counters = context.instance.get_all_perf_counters()
> > telegraf/module.py: for daemon, counters in six.iteritems(self.get_all_perf_counters())
> >
> > Besides the *ceph* telegraf module we also use the ceph plugin for
> > telegraf ... but that plugin does not (yet?) provide mds metrics though.
> > Ideally we would *only* use the ceph mgr telegraf module to collect *all
> > the things*.
> >
> > Not sure what difference in the python code between the modules could
> explain this.
> >
> > Gr. Stefan
> >
> > --
> > | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> > | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] January Ceph Science Group Virtual Meeting

2020-01-13 Thread Kevin Hrpcek
Hello,

We will be having a Ceph science/research/big cluster call on Wednesday January 
22nd. If anyone wants to discuss something specific they can add it to the pad 
linked below. If you have questions or comments you can contact me.

This is an informal open call of community members mostly from hpc/htc/research 
environments where we discuss whatever is on our minds regarding ceph. Updates, 
outages, features, maintenance, etc. There is no set presenter, but I do 
attempt to keep the conversation lively.

https://pad.ceph.com/p/Ceph_Science_User_Group_20200122

Ceph calendar event details:

January 22, 2020
9am US Central
4pm Central European

We try to keep it to an hour or less.

Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
    See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111


Kevin


--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread vitalif

Hi,

we're playing around with ceph but are not quite happy with the IOs.
on average 5000 iops / write
on average 13000 iops / read

We're expecting more. :( any ideas or is that all we can expect?


With server SSDs you can expect up to ~1 write / ~25000 read iops from 
a single client.


https://yourcmc.ru/wiki/Ceph_performance


money is NOT a problem for this test-bed, any ideas how to gain more
IOPS are greatly appreciated.


Grab some server NVMes and best possible CPUs :)


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread Stefan Priebe - Profihost AG
Hi Stefan,

Am 13.01.20 um 17:09 schrieb Stefan Bauer:
> Hi,
> 
> 
> we're playing around with ceph but are not quite happy with the IOs.
> 
> 
> 3 node ceph / proxmox cluster with each:
> 
> 
> LSI HBA 3008 controller
> 
> 4 x MZILT960HAHQ/007 Samsung SSD
> 
> Transport protocol:   SAS (SPL-3)
> 
> 40G fibre Intel 520 Network controller on Unifi Switch
> 
> Ping roundtrip to partner node is 0.040ms average.
> 
> 
> Transport protocol:   SAS (SPL-3)
> 
> 
> fio reports on a virtual machine with
> 
> 
> --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test
> --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw
> --rwmixread=75
> 
> 
> on average 5000 iops / write
> 
> on average 13000 iops / read
> 
> 
> 
> We're expecting more. :( any ideas or is that all we can expect?
> 
> 
> money is *not* a problem for this test-bed, any ideas how to gain more
> IOPS are greatly appreciated.

this has something to do with the firmware and how the manufacturer
handles syncs / flushes.

Intel just ignores sync / flush commands for drives which have a
capacitor. Samsung does not.

The problem is that Ceph sends a lot of flush commands, which slows down
drives without a capacitor.

You can make Linux ignore those userspace flush requests with the following
command:

echo "temporary write through" > /sys/block/sdX/device/scsi_disk/*/cache_type
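
If you want to apply that to all data SSDs at once and verify it took effect,
something along these lines works (the device range is just an example; the
setting does not survive a reboot, so persist it via rc.local or a udev rule
if you keep it):

for f in /sys/block/sd[b-m]/device/scsi_disk/*/cache_type; do
    echo "temporary write through" > "$f"
done
grep . /sys/block/sd*/device/scsi_disk/*/cache_type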

Greets,
Stefan Priebe
Profihost AG


> Thank you.
> 
> 
> Stefan
> 
> 


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread John Petrini
Do those SSDs have capacitors (aka power loss protection)? I took a
look at the spec sheet on Samsung's site and I don't see it mentioned.
If that's the case it could certainly explain the performance you're
seeing. Not all enterprise SSDs have it, and it's a must-have for Ceph
since Ceph syncs every write directly to disk.

You may also want to look for something with a higher DWPD so you can
get more life out of them.
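
A quick way to check whether a drive absorbs flushes well is a single-job,
queue-depth-1 O_DSYNC fio run directly against the device, which is roughly
the write pattern Ceph generates; drives without power loss protection
typically collapse to a few hundred IOPS here. Rough sketch (this writes to
and destroys data on /dev/sdX, only run it against an empty disk):

fio --name=synctest --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based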


[ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-13 Thread Stefan Bauer
Hi,

we're playing around with ceph but are not quite happy with the IOs.

3 node ceph / proxmox cluster with each:

LSI HBA 3008 controller
4 x MZILT960HAHQ/007 Samsung SSD
Transport protocol:   SAS (SPL-3)
40G fibre Intel 520 Network controller on Unifi Switch
Ping roundtrip to partner node is 0.040ms average.

fio reports on a virtual machine with

--randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

on average 5000 iops / write
on average 13000 iops / read

We're expecting more. :( any ideas or is that all we can expect?

money is not a problem for this test-bed, any ideas how to gain more IOPS are 
greatly appreciated.

Thank you.

Stefan