nde32: 2 x Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (4 cores each)
ndesan03: 1 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (4 cores)
ceph@ndesan01:~/ceph-deploy> ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.14980 root default
-2 1.19995     host ndesan01
 0 0.23999         osd.0           up  1.00000          1.00000
 1 0.23999         osd.1           up  1.00000          1.00000
 2 0.23999         osd.2           up  1.00000          1.00000
13 0.23999         osd.13          up  1.00000          1.00000
19 0.23999         osd.19          up  1.00000          1.00000
-3 1.19995     host ndesan02
 3 0.23999         osd.3           up  1.00000          1.00000
 4 0.23999         osd.4           up  1.00000          1.00000
 5 0.23999         osd.5           up  1.00000          1.00000
 9 0.23999         osd.9           up  1.00000          1.00000
10 0.23999         osd.10          up  1.00000          1.00000
-4 1.19995     host nde32
 6 0.23999         osd.6           up  1.00000          1.00000
 7 0.23999         osd.7           up  1.00000          1.00000
 8 0.23999         osd.8           up  1.00000          1.00000
12 0.23999         osd.12          up  1.00000          1.00000
11 0.23999         osd.11          up  1.00000          1.00000
-5 4.54994     host ndesan03
14 0.90999         osd.14          up  1.00000          1.00000
15 0.90999         osd.15          up  1.00000          1.00000
16 0.90999         osd.16          up  1.00000          1.00000
17 0.90999         osd.17          up  1.00000          1.00000
18 0.90999         osd.18          up  1.00000          1.00000
ceph@ndesan01:~/ceph-deploy> ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR
0 0.23999 1.00000 249G 210G 39851M 84.43 1.59
1 0.23999 1.00000 249G 206G 44424M 82.64 1.55
2 0.23999 1.00000 249G 200G 50084M 80.43 1.51
13 0.23999 1.00000 249G 200G 50520M 80.26 1.51
19 0.23999 1.00000 249G 187G 63837M 75.05 1.41
3 0.23999 1.00000 249G 205G 45945M 82.04 1.54
4 0.23999 1.00000 249G 205G 45841M 82.08 1.54
5 0.23999 1.00000 249G 206G 44026M 82.79 1.56
9 0.23999 1.00000 249G 198G 52348M 79.54 1.50
10 0.23999 1.00000 249G 189G 62027M 75.76 1.42
6 0.23999 1.00000 249G 191G 59968M 76.56 1.44
7 0.23999 1.00000 249G 197G 53410M 79.13 1.49
8 0.23999 1.00000 249G 207G 43856M 82.86 1.56
12 0.23999 1.00000 249G 199G 51695M 79.80 1.50
11 0.23999 1.00000 249G 216G 33766M 86.80 1.63
14 0.90999 1.00000 931G 285G 645G 30.70 0.58
15 0.90999 1.00000 931G 287G 643G 30.88 0.58
16 0.90999 1.00000 931G 297G 633G 32.01 0.60
17 0.90999 1.00000 931G 288G 643G 30.93 0.58
18 0.90999 1.00000 931G 284G 646G 30.60 0.58
TOTAL 8403G 4468G 3935G 53.17
MIN/MAX VAR: 0.58/1.63 STDDEV: 26.41
I know that all OSDs should be the same size, but that wasn't my
decision ;-) It will be changed in the near future.
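In the meantime the utilization spread (VAR 0.58 to 1.63) could probably be
evened out a bit by reweighting. Just as a rough sketch of what I mean, not
something we have actually run yet (the 110 threshold is only an example value):
---cut here---
# sort the 'ceph osd df' output by %USE to spot the fullest OSDs
ceph osd df | sort -nk7
# let Ceph lower the override reweight (REWEIGHT column) of all OSDs
# that are more than 10% above the average utilization
ceph osd reweight-by-utilization 110
---cut here---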
Some information on RAM:
---cut here---
ceph@ndesan01:~/ceph-deploy> free -g
total used free shared buffers cached
Mem: 62 61 0 0 3 50
-/+ buffers/cache: 8 53
Swap: 0 0 0
ndesan02:~ # free -g
total used free shared buffers cached
Mem: 62 61 0 3 2 50
-/+ buffers/cache: 9 53
Swap: 0 0 0
nde32:~ # free -g
total used free shared buffers cached
Mem: 39 38 0 0 0 29
-/+ buffers/cache: 8 30
Swap: 1 1 0
ndesan03:~ # free -g
total used free shared buffers cached
Mem: 62 62 0 0 0 48
-/+ buffers/cache: 13 49
Swap: 0 0 0
---cut here---
> Since when does Xen support Ceph?
I'm not sure exactly since when. We started testing Ceph as our
OpenStack storage backend last year in April or May; according to my
email history, it has been supported since libvirt 1.3.2 (we had a bug
filed for this issue).
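Just for illustration, a rough sketch of how a libvirt (libxl) domain can
reference an RBD image as a disk; the pool/image name, domain name and
secret UUID below are placeholders, not our actual configuration:
---cut here---
# hypothetical example, all values are placeholders
cat > rbd-disk.xml <<'EOF'
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='images/vm1-disk'>
    <host name='192.168.160.15' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <target dev='xvda' bus='xen'/>
</disk>
EOF
virsh attach-device vm1 rbd-disk.xml --config
---cut here---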
ceph@ndesan01:~/ceph-deploy> cat ceph.conf
[global]
fsid = 655cb05a-435a-41ba-83d9-8549f7c36167
mon_initial_members = ndesan01, ndesan02, nde32
mon_host = 192.168.160.15,192.168.160.16,192.168.160.17
public_network = 192.168.160.0/24
osd_pool_default_size = 3
osd pool default min size = 2
osd_crush_chooseleaf_type = 1
mon_pg_warn_max_per_osd = 0
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
filestore max inline xattr size = 512
filestore max inline xattrs = 2
rbd_default_features = 15
mon data avail warn = 15
mon data avail crit = 5
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7
osd scrub begin hour = 0
osd scrub end hour = 7
osd_scrub_sleep = 0.1
osd_deep_scrub_interval = 2419200
[mon]
mon osd allow primary affinity = true
We increased the deep-scrub interval because we use a cron job to trigger
deep scrubs in the evening, to avoid a performance impact during working
hours.
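For completeness, a simplified sketch of how such a deep-scrub cron job could
look (not our exact script, just to illustrate the idea):
---cut here---
#!/bin/bash
# ask every PG to deep-scrub, one after the other, pacing the requests
for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
    ceph pg deep-scrub "$pg"
    sleep 5    # spread the load so clients are not impacted too much
done
---cut here---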
Please let me know if I missed anything. I really appreciate you
looking into this.
Regards,
Eugen
Quoting Christian Balzer <ch...@gol.com>:
> Hello,
>
> On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:
>
>> Hi,
>>
>> I haven't tracked the slow requests yet, but I ran some performance
>> tests. Although I'm not an expert, I believe the results are quite
>> unsatisfying.
>>
> You've told us absolutely nothing about your cluster that would be
> relevant to determine if your results are actually to be expected or not.
>
> Anything, and I mean anything, is relevant, from OS, kernel, Ceph versions
> to a detailed description of ALL HW in there, CPU, RAM, SSD, HDD, network,
> topology, etc.
>
>> I ran a couple of rados bench tests in a pool, with different
>> replication sizes (1 to 3). The tests were executed on one of the ceph
>> nodes, which also is a monitor.
>>
> Benchmarks, including rados bench run from a cluster node that also hosts
> OSDs will give you skewed results.
>
>> Results for size 1:
>>
>> Total time run: 20.428947
>> Total writes made: 639
>> Write size: 4194304
>> Bandwidth (MB/sec): 125.117
>>
> This and the following results would be on par with a smallish, 1Gb/s
> network HDD based cluster.
> But we don't know if that's what you're running or not...
>
>> Results for size 2:
>>
>> Total time run: 20.741990
>> Total writes made: 454
>> Write size: 4194304
>> Bandwidth (MB/sec): 87.552
>>
>> Still looks nice, at least we would be totally fine with this throughput.
>>
> As mentioned countless times here, throughput is most likely your least
> worry with Ceph; IOPS are more likely to be your bottleneck.
>
>> Results for size 3:
>>
>> Total time run: 22.934567
>> Total writes made: 252
>> Write size: 4194304
>> Bandwidth (MB/sec): 43.951
>>
>> I guess this is still ok for a replication size of 3. But the real
>> problem seems to be the client throughput.
>>
>> Running fio on a Xen-VM inside rbd:
>>
> Since when does Xen support Ceph?
> As in, how are you running a Xen VM that uses Ceph?
>
>> Run status group 0 (all jobs):
>> WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
>> maxb=21776KB/s, mint=48151msec, maxt=48151msec
>>
>> And this is one of the better results, we often have values of 12MB/s
>> or even lower, e.g. on a KVM-VM:
>>
>> Run status group 0 (all jobs):
>> WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
>> mint=79683msec, maxt=79683msec
>>
>
> Fio results w/o the full command line parameters are basically worthless,
> I'm afraid.
>
>> This is unacceptable!
>>
>> I use almost all defaults in our ceph.conf,
>
> What did you change that could be relevant then?
>
>> I've been reading
>> everything I could find on Google, but I don't understand why the
>> clients are so slow. Is it a valid assumption that the slow requests
>> could be caused by our slow clients?
>>
>
> Slow requests typically occur when your cluster is not capable of
> keeping up.
>
>> One of the things I tried was
>>
>> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
>>
> Only affects reads.
>
>> but this had no effect. I tried it with different numbers of PGs, but
>> that also didn't have any effect.
>>
> Again, w/o knowing your cluster and these numbers, that information is of
> little value.
>
>> What else can I do to increase the performance of my instances or to
>> analyse the bottleneck?
>>
> Understand your HW/cluster.
>
> Monitor the nodes with atop etc, use collectd and graphite, etc, to keep
> records of what happened when and to visualize things.
>
> Christian
>
>> Regards,
>> Eugen
>>
>>
>> Quoting Brad Hubbard <bhubb...@redhat.com>:
>>
>> > Check the latency figures in a "perf dump". High numbers in a
>> > particular area may help you nail it.
>> >
>> > I suspect, though, that it may come down to enabling debug logging and
>> > tracking a slow request through the logs.
>> >
>> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <ebl...@nde.ag> wrote:
>> >> Hi,
>> >>
>> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight
>> >>
>> >>
>> >> I waited for new slow request messages and dumped the historic_ops into a
>> >> file. The reporting OSD shows lots of "waiting for rw locks" messages and a
>> >> duration of more than 30 secs:
>> >>
>> >> "age": 366.044746,
>> >> "duration": 32.491506,
>> >> "type_data": [
>> >> "commit sent; apply or cleanup",
>> >> {
>> >> "client": "client.9664429",
>> >> "tid": 130439910
>> >> },
>> >> [
>> >> {
>> >> "time": "2017-01-12 10:38:15.227649",
>> >> "event": "initiated"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:15.232310",
>> >> "event": "reached_pg"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:15.232341",
>> >> "event": "waiting for rw locks"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:15.268819",
>> >> "event": "reached_pg"
>> >> },
>> >> [
>> >> .
>> >> .
>> >> .
>> >> ]
>> >> {
>> >> "time": "2017-01-12 10:38:45.515055",
>> >> "event": "waiting for rw locks"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.921095",
>> >> "event": "reached_pg"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.921157",
>> >> "event": "started"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.921342",
>> >> "event": "waiting for subops from 9,15"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.921724",
>> >> "event": "commit_queued_for_journal_write"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.922186",
>> >> "event": "write_thread_in_journal_buffer"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.931103",
>> >> "event": "sub_op_commit_rec"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:46.968730",
>> >> "event": "sub_op_commit_rec"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:47.717770",
>> >> "event": "journaled_completion_queued"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:47.718280",
>> >> "event": "op_commit"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:47.718359",
>> >> "event": "commit_sent"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:47.718890",
>> >> "event": "op_applied"
>> >> },
>> >> {
>> >> "time": "2017-01-12 10:38:47.719154",
>> >> "event": "done"
>> >> }
>> >>
>> >>
>> >> There were about 70 events "waiting for rw locks"; I truncated the output.
>> >> Based on the message "waiting for subops from 9,15" I also dumped the
>> >> historic_ops for these two OSDs.
>> >>
>> >> Duration on OSD.9
>> >>
>> >> "initiated_at": "2017-01-12 10:38:29.258221",
>> >> "age": 54.069919,
>> >> "duration": 20.831568,
>> >>
>> >> Duration on OSD.15
>> >>
>> >> "initiated_at": "2017-01-12 10:38:23.695098",
>> >> "age": 112.118210,
>> >> "duration": 26.452526,
>> >>
>> >> They also contain many "waiting for rw locks" messages, but not as many as
>> >> the dump from the reporting OSD.
>> >> To me it seems that because two OSDs take a lot of time to process their
>> >> requests (only slightly less than 30 secs), it sums up to more than 30 secs
>> >> on the reporting (primary?) OSD. Is the reporting OSD always the primary?
>> >>
>> >> How can I debug this further? I searched the web for "waiting for rw locks",
>> >> I also found Wido's blog [1] about my exact problem, but I'm not sure how to
>> >> continue. Our admin says our network should be fine, but what can I do to
>> >> rule that out?
>> >>
>> >> I don't think I have provided information about our cluster yet:
>> >>
>> >> 4 nodes, 3 mons, 20 OSDs on
>> >> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
>> >>
>> >> [1]
>> >> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
>> >>
>> >> Thanks!
>> >> Eugen
>> >>
>> >>
>> >> Quoting Brad Hubbard <bhubb...@redhat.com>:
>> >>
>> >>
>> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <ebl...@nde.ag> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I simply grepped for "slow request" in ceph.log. What exactly do you
>> >>>> mean by "effective OSD"?
>> >>>>
>> >>>> If I have this log line:
>> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
>> >>>> seconds
>> >>>> old, received at 2017-01-11 [...] ack+ondisk+write+known_if_redirected
>> >>>> e12440) currently waiting for subops from 0,12
>> >>>>
>> >>>> I assumed that osd.16 is the one causing problems.
>> >>>
>> >>>
>> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
>> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
>> >>> help to narrow down what the issue is.
>> >>>
>> >>>> But now that you mention the subops, I only noticed them yesterday, but
>> >>>> didn't have the time yet to investigate further. I'll have a look into
>> >>>> the subops messages and report back.
>> >>>>
>> >>>> Thanks!
>> >>>> Eugen
>> >>>>
>> >>>> Quoting Burkhard Linke
>> >>>> <burkhard.li...@computational.bio.uni-giessen.de>:
>> >>>>
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>>
>> >>>>> just for clarity:
>> >>>>>
>> >>>>>
>> >>>>> Did you parse the slow request messages and use the effective OSD in
>> >>>>> the statistics? Some messages may refer to other OSDs, e.g. "waiting
>> >>>>> for sub op on OSD X,Y". The reporting OSD is not the root cause in
>> >>>>> that case, but one of the mentioned OSDs (and I'm currently not aware
>> >>>>> of a method to determine which of the two OSDs is the cause in the
>> >>>>> case of 3 replicas...).
>> >>>>>
>> >>>>>
>> >>>>> Regards,
>> >>>>>
>> >>>>> Burkhard
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Eugen Block voice : +49-40-559 51 75
>> >>>> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77
>> >>>> Postfach 61 03 15
>> >>>> D-22423 Hamburg e-mail : ebl...@nde.ag
>> >>>>
>> >>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>> >>>> Sitz und Registergericht: Hamburg, HRB 90934
>> >>>> Vorstand: Jens-U. Mozdzen
>> >>>> USt-IdNr. DE 814 013 983
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Cheers,
>> >>> Brad
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Eugen Block voice : +49-40-559 51 75
>> >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77
>> >> Postfach 61 03 15
>> >> D-22423 Hamburg e-mail : ebl...@nde.ag
>> >>
>> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>> >> Sitz und Registergericht: Hamburg, HRB 90934
>> >> Vorstand: Jens-U. Mozdzen
>> >> USt-IdNr. DE 814 013 983
>> >>
>> >
>> >
>> >
>> > --
>> > Cheers,
>> > Brad
>>
>>
>>
>
>
> --
> Christian Balzer Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/