Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Sasha Litvak
All,

Thank you for your suggestions.  During last night's test, I had at least
one drive on one node doing a power-on reset triggered by the controller.  It
caused a couple of OSDs on that node to assert / time out.  I am testing and
updating the usual suspects on this node and after that on the whole cluster,
i.e. kernel, controller firmware, SSD firmware; all of these have updates
available.  Dell mentioned a possible crash on Bionic during high
throughput, but none of it is clear and simple.  I would like to eliminate
firmware/drivers, especially if there is a bug causing a crash under
load.  I will then proceed with Mokhtar's and Robert's suggestions.

If anyone has more suggestions, please share them on this thread, as it may
help someone else later on.

Best,

On Tue, Oct 1, 2019 at 2:56 PM Maged Mokhtar  wrote:

> Some suggestions:
>
> monitor raw resources such as CPU %util, raw disk %util/busy, and raw disk IOPS.
>
> instead of running a mix of workloads at this stage, narrow it down first,
> for example using rbd random writes with a 4k block size, then change one
> parameter at a time, for example the block size. See how your cluster performs
> and what resource loads you get step by step. Latency at 4M will not be the
> same as at 4k.
>
> I would also run fio tests on the raw Nytro 1551 devices, including sync
> writes.
>
> I would not recommend you increase readahead for random IO.
>
> I do not recommend making RAID0.
>
> /Maged
>
>
> On 01/10/2019 02:12, Sasha Litvak wrote:
>
> At this point, I ran out of ideas.  I changed nr_requests from 128 to 1024
> and readahead from 128 to 4096, and tuned the nodes to
> performance-throughput.  However, I still get high latency during benchmark
> testing.  I attempted to disable the cache on the SSDs
>
> for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done
>
> and I think it made things no better at all.  I have H740 and H730
> controllers with drives in HBA mode.
>
> Other than converting them one by one to RAID0, I am not sure what else I
> can try.
>
> Any suggestions?
>
>
> On Mon, Sep 30, 2019 at 2:45 PM Paul Emmerich 
> wrote:
>
>> BTW: commit and apply latency are the exact same thing since
>> BlueStore, so don't bother looking at both.
>>
>> In fact you should mostly be looking at the op_*_latency counters
>>
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
>>  wrote:
>> >
>> > In my case, I am using premade Prometheus-sourced dashboards in Grafana.
>> >
>> > For individual latency, the query looks like this:
>> >
>> >  irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
>> (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
>> > irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
>> (ceph_daemon) irate(ceph_osd_op_w_latency_count[1m])
>> >
>> > The other ones use
>> >
>> > ceph_osd_commit_latency_ms
>> > ceph_osd_apply_latency_ms
>> >
>> > and graph the distribution of it over time
>> >
>> > Also, average OSD op latency
>> >
>> > avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) /
>> rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
>> > avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) /
>> rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
>> >
>> > Average OSD apply + commit latency
>> > avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
>> > avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
>> >
>> >
>> > On Mon, Sep 30, 2019 at 11:13 AM Marc Roos 
>> wrote:
>> >>
>> >>
>> >> What parameters are you exactly using? I want to do a similar test on
>> >> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
>> >>
>> >> type_instance=Osd.opBeforeDequeueOpLat
>> >> type_instance=Osd.opBeforeQueueOpLat
>> >> type_instance=Osd.opLatency
>> >> type_instance=Osd.opPrepareLatency
>> >> type_instance=Osd.opProcessLatency
>> >> type_instance=Osd.opRLatency
>> >> type_instance=Osd.opRPrepareLatency
>> >> type_instance=Osd.opRProcessLatency
>> >> type_instance=Osd.opRwLatency
>> >> type_instance=Osd.opRwPrepareLatency
>

Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Maged Mokhtar

Some suggestions:

monitor raw resources such as CPU %util, raw disk %util/busy, and raw disk IOPS.

instead of running a mix of workloads at this stage, narrow it down 
first, for example using rbd random writes with a 4k block size, then change 
one parameter at a time, for example the block size. See how your 
cluster performs and what resource loads you get step by step. Latency 
at 4M will not be the same as at 4k.


I would also run fio tests on the raw Nytro 1551 devices, including sync 
writes.


I would not recommend you increase readahead for random IO.

I do not recommend making RAID0.

/Maged


On 01/10/2019 02:12, Sasha Litvak wrote:
At this point, I ran out of ideas.  I changed nr_requests from 128 to 
1024 and readahead from 128 to 4096, and tuned the nodes to 
performance-throughput.  However, I still get high latency during 
benchmark testing.  I attempted to disable the cache on the SSDs


for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done

and I think it made things no better at all.  I have H740 and H730 
controllers with drives in HBA mode.


Other than converting them one by one to RAID0, I am not sure what else 
I can try.


Any suggestions?


On Mon, Sep 30, 2019 at 2:45 PM Paul Emmerich  wrote:


BTW: commit and apply latency are the exact same thing since
BlueStore, so don't bother looking at both.

In fact you should mostly be looking at the op_*_latency counters


Paul

-- 
Paul Emmerich


Looking for help with your Ceph cluster? Contact us at
https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
 wrote:
>
> In my case, I am using premade Prometheus-sourced dashboards in Grafana.
>
> For individual latency, the query looks like this:
>
> irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
(ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
> irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
(ceph_daemon) irate(ceph_osd_op_w_latency_count[1m])
>
> The other ones use
>
> ceph_osd_commit_latency_ms
> ceph_osd_apply_latency_ms
>
> and graph the distribution of it over time
>
> Also, average OSD op latency
>
> avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) /
rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
> avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) /
rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
>
> Average OSD apply + commit latency
> avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
> avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
>
>
> On Mon, Sep 30, 2019 at 11:13 AM Marc Roos
 wrote:
>>
>>
>> What parameters are you exactly using? I want to do a similar test on
>> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
>>
>> type_instance=Osd.opBeforeDequeueOpLat
>> type_instance=Osd.opBeforeQueueOpLat
>> type_instance=Osd.opLatency
>> type_instance=Osd.opPrepareLatency
>> type_instance=Osd.opProcessLatency
>> type_instance=Osd.opRLatency
>> type_instance=Osd.opRPrepareLatency
>> type_instance=Osd.opRProcessLatency
>> type_instance=Osd.opRwLatency
>> type_instance=Osd.opRwPrepareLatency
>> type_instance=Osd.opRwProcessLatency
>> type_instance=Osd.opWLatency
>> type_instance=Osd.opWPrepareLatency
>> type_instance=Osd.opWProcessLatency
>> type_instance=Osd.subopLatency
>> type_instance=Osd.subopWLatency
>> ...
>> ...
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Alex Litvak [mailto:alexander.v.lit...@gmail.com]
>> Sent: zondag 29 september 2019 13:06
>> To: ceph-users@lists.ceph.com
>> Cc: ceph-de...@vger.kernel.org
>> Subject: [ceph-users] Commit and Apply latency on nautilus
>>
>> Hello everyone,
>>
>> I am running a number of parallel benchmark tests against the cluster
>> that should be ready to go to production.
>> I enabled Prometheus to monitor various metrics, and while the cluster
>> stays healthy through the tests with no errors or slow requests,

Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Robert LeBlanc
On Tue, Oct 1, 2019 at 7:54 AM Robert LeBlanc  wrote:
>
> On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak
>  wrote:
> >
> > At this point, I ran out of ideas.  I changed nr_requests from 128 to 1024
> > and readahead from 128 to 4096, and tuned the nodes to
> > performance-throughput.  However, I still get high latency during benchmark
> > testing.  I attempted to disable the cache on the SSDs
> >
> > for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done
> >
> > and I think it made things no better at all.  I have H740 and H730
> > controllers with drives in HBA mode.
> >
> > Other than converting them one by one to RAID0, I am not sure what else I
> > can try.
> >
> > Any suggestions?
>
> If you haven't already tried this, add this to your ceph.conf and
> restart your OSDs, this should help bring down the variance in latency
> (It will be the default in Octopus):
>
> osd op queue = wpq
> osd op queue cut off = high

I should clarify: this will reduce the variance in latency for client
OPs. If this counter also includes recovery/backfill/deep-scrub
OPs, then the latency can still be high, as these settings make
recovery/backfill/deep-scrub less impactful to client I/O at the cost
of those OPs possibly being delayed a bit.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Robert LeBlanc
On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak
 wrote:
>
> At this point, I ran out of ideas.  I changed nr_requests from 128 to 1024 
> and readahead from 128 to 4096, and tuned the nodes to performance-throughput. 
> However, I still get high latency during benchmark testing.  I attempted to 
> disable the cache on the SSDs
>
> for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done
>
> and I think it made things no better at all.  I have H740 and H730 
> controllers with drives in HBA mode.
>
> Other than converting them one by one to RAID0, I am not sure what else I can 
> try.
>
> Any suggestions?

If you haven't already tried this, add this to your ceph.conf and
restart your OSDs, this should help bring down the variance in latency
(It will be the default in Octopus):

osd op queue = wpq
osd op queue cut off = high
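In ceph.conf form, the two options above would sit under the [osd] section (a sketch of the fragment; the OSDs need a restart to pick it up):

```ini
[osd]
osd op queue = wpq
osd op queue cut off = high
```

On a running OSD, the active values can be checked via the admin socket with `ceph daemon osd.<id> config get osd_op_queue` and `... osd_op_queue_cut_off`.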


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Commit and Apply latency on nautilus

2019-09-30 Thread Sasha Litvak
At this point, I ran out of ideas.  I changed nr_requests from 128 to 1024
and readahead from 128 to 4096, and tuned the nodes to
performance-throughput.  However, I still get high latency during benchmark
testing.  I attempted to disable the cache on the SSDs

for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done

and I think it made things no better at all.  I have H740 and H730
controllers with drives in HBA mode.

Other than converting them one by one to RAID0, I am not sure what else I
can try.

Any suggestions?


On Mon, Sep 30, 2019 at 2:45 PM Paul Emmerich 
wrote:

> BTW: commit and apply latency are the exact same thing since
> BlueStore, so don't bother looking at both.
>
> In fact you should mostly be looking at the op_*_latency counters
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
>  wrote:
> >
> > In my case, I am using premade Prometheus-sourced dashboards in Grafana.
> >
> > For individual latency, the query looks like this:
> >
> >  irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
> (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
> > irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
> (ceph_daemon) irate(ceph_osd_op_w_latency_count[1m])
> >
> > The other ones use
> >
> > ceph_osd_commit_latency_ms
> > ceph_osd_apply_latency_ms
> >
> > and graph the distribution of it over time
> >
> > Also, average OSD op latency
> >
> > avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) /
> rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
> > avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) /
> rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
> >
> > Average OSD apply + commit latency
> > avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
> > avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
> >
> >
> > On Mon, Sep 30, 2019 at 11:13 AM Marc Roos 
> wrote:
> >>
> >>
> >> What parameters are you exactly using? I want to do a similar test on
> >> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
> >>
> >> type_instance=Osd.opBeforeDequeueOpLat
> >> type_instance=Osd.opBeforeQueueOpLat
> >> type_instance=Osd.opLatency
> >> type_instance=Osd.opPrepareLatency
> >> type_instance=Osd.opProcessLatency
> >> type_instance=Osd.opRLatency
> >> type_instance=Osd.opRPrepareLatency
> >> type_instance=Osd.opRProcessLatency
> >> type_instance=Osd.opRwLatency
> >> type_instance=Osd.opRwPrepareLatency
> >> type_instance=Osd.opRwProcessLatency
> >> type_instance=Osd.opWLatency
> >> type_instance=Osd.opWPrepareLatency
> >> type_instance=Osd.opWProcessLatency
> >> type_instance=Osd.subopLatency
> >> type_instance=Osd.subopWLatency
> >> ...
> >> ...
> >>
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Alex Litvak [mailto:alexander.v.lit...@gmail.com]
> >> Sent: zondag 29 september 2019 13:06
> >> To: ceph-users@lists.ceph.com
> >> Cc: ceph-de...@vger.kernel.org
> >> Subject: [ceph-users] Commit and Apply latency on nautilus
> >>
> >> Hello everyone,
> >>
> >> I am running a number of parallel benchmark tests against the cluster
> >> that should be ready to go to production.
>> I enabled Prometheus to monitor various metrics, and while the cluster
>> stays healthy through the tests with no errors or slow requests,
>> I noticed apply / commit latency jumping between 40 and 600 ms on
>> multiple SSDs.  At the same time, op_read and op_write are on average
>> below 0.25 ms in the worst-case scenario.
> >>
> >> I am running nautilus 14.2.2, all bluestore, no separate NVME devices
> >> for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate
> >> Nytro 1551, osd spread across 6 nodes, running in
> >> containers.  Each node has plenty of RAM with utilization ~ 25 GB during
> >> the benchmark runs.
> >>
> >> Here are benchmarks being run from 6 client systems in parallel,
> >> repeating the test for each block size in <4k,16k,128k,4M>.
> >>
> >> On rbd mapped partition local to each client:
> >>

Re: [ceph-users] Commit and Apply latency on nautilus

2019-09-30 Thread Paul Emmerich
BTW: commit and apply latency are the exact same thing since
BlueStore, so don't bother looking at both.

In fact you should mostly be looking at the op_*_latency counters


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
 wrote:
>
> In my case, I am using premade Prometheus-sourced dashboards in Grafana.
>
> For individual latency, the query looks like this:
>
>  irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) 
> irate(ceph_osd_op_r_latency_count[1m])
> irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) 
> irate(ceph_osd_op_w_latency_count[1m])
>
> The other ones use
>
> ceph_osd_commit_latency_ms
> ceph_osd_apply_latency_ms
>
> and graph the distribution of it over time
>
> Also, average OSD op latency
>
> avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) / 
> rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
> avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) / 
> rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
>
> Average OSD apply + commit latency
> avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
> avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
>
>
> On Mon, Sep 30, 2019 at 11:13 AM Marc Roos  wrote:
>>
>>
>> What parameters are you exactly using? I want to do a similar test on
>> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
>>
>> type_instance=Osd.opBeforeDequeueOpLat
>> type_instance=Osd.opBeforeQueueOpLat
>> type_instance=Osd.opLatency
>> type_instance=Osd.opPrepareLatency
>> type_instance=Osd.opProcessLatency
>> type_instance=Osd.opRLatency
>> type_instance=Osd.opRPrepareLatency
>> type_instance=Osd.opRProcessLatency
>> type_instance=Osd.opRwLatency
>> type_instance=Osd.opRwPrepareLatency
>> type_instance=Osd.opRwProcessLatency
>> type_instance=Osd.opWLatency
>> type_instance=Osd.opWPrepareLatency
>> type_instance=Osd.opWProcessLatency
>> type_instance=Osd.subopLatency
>> type_instance=Osd.subopWLatency
>> ...
>> ...
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Alex Litvak [mailto:alexander.v.lit...@gmail.com]
>> Sent: zondag 29 september 2019 13:06
>> To: ceph-users@lists.ceph.com
>> Cc: ceph-de...@vger.kernel.org
>> Subject: [ceph-users] Commit and Apply latency on nautilus
>>
>> Hello everyone,
>>
>> I am running a number of parallel benchmark tests against the cluster
>> that should be ready to go to production.
>> I enabled Prometheus to monitor various metrics, and while the cluster
>> stays healthy through the tests with no errors or slow requests,
>> I noticed apply / commit latency jumping between 40 and 600 ms on
>> multiple SSDs.  At the same time, op_read and op_write are on average
>> below 0.25 ms in the worst-case scenario.
>>
>> I am running nautilus 14.2.2, all bluestore, no separate NVME devices
>> for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate
>> Nytro 1551, osd spread across 6 nodes, running in
>> containers.  Each node has plenty of RAM with utilization ~ 25 GB during
>> the benchmark runs.
>>
>> Here are benchmarks being run from 6 client systems in parallel,
>> repeating the test for each block size in <4k,16k,128k,4M>.
>>
>> On rbd mapped partition local to each client:
>>
>> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
>> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
>> --group_reporting --time_based --rwmixread=70
>>
>> On mounted cephfs volume with each client storing test file(s) in own
>> sub-directory:
>>
>> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
>> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
>> --group_reporting --time_based --rwmixread=70
>>
>> dbench -t 30 30
>>
>> Could you please let me know whether the huge jump in apply and commit
>> latency is justified in my case and whether I can do anything to improve
>> / fix it.  Below is some additional cluster info.
>>
>> Thank you,
>>
>> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
>> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL
>>   %USE VAR  PGS STATUS
>>   6   ssd 1.74609  1.0 1.7 TiB  9

Re: [ceph-users] Commit and Apply latency on nautilus

2019-09-30 Thread Sasha Litvak
In my case, I am using premade Prometheus-sourced dashboards in Grafana.

For individual latency, the query looks like this:

 irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
(ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on
(ceph_daemon) irate(ceph_osd_op_w_latency_count[1m])

The other ones use

ceph_osd_commit_latency_ms
ceph_osd_apply_latency_ms

and graph the distribution of it over time

Also, average OSD op latency

avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) /
rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) /
rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)

Average OSD apply + commit latency
avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
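The arithmetic behind the avg(rate(sum)/rate(count)) panels can be sketched in a few lines, using hypothetical counter samples (the values and the 300 s window are made up for illustration, not real cluster data):

```python
# Two scrapes of per-OSD cumulative counters, 300 s apart (hypothetical values).
# Each entry is (op_latency_sum in seconds, op count).
samples_t0 = {"osd.0": (120.0, 1000), "osd.1": (90.0, 800)}
samples_t1 = {"osd.0": (126.0, 1200), "osd.1": (93.0, 950)}
window = 300.0  # seconds, matching the [5m] range in the queries above

per_osd = {}
for osd, (s0, c0) in samples_t0.items():
    s1, c1 = samples_t1[osd]
    rate_sum = (s1 - s0) / window         # latency-seconds accumulated per second
    rate_count = (c1 - c0) / window       # ops completed per second
    per_osd[osd] = rate_sum / rate_count  # mean latency per op over the window

avg_latency = sum(per_osd.values()) / len(per_osd)  # what the outer avg() does
```

Because the window cancels out, rate(sum)/rate(count) is simply Δsum/Δcount: the mean per-op latency over the interval, here 30 ms for osd.0 and 20 ms for osd.1.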


On Mon, Sep 30, 2019 at 11:13 AM Marc Roos  wrote:

>
> What parameters are you exactly using? I want to do a similar test on
> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
>
> type_instance=Osd.opBeforeDequeueOpLat
> type_instance=Osd.opBeforeQueueOpLat
> type_instance=Osd.opLatency
> type_instance=Osd.opPrepareLatency
> type_instance=Osd.opProcessLatency
> type_instance=Osd.opRLatency
> type_instance=Osd.opRPrepareLatency
> type_instance=Osd.opRProcessLatency
> type_instance=Osd.opRwLatency
> type_instance=Osd.opRwPrepareLatency
> type_instance=Osd.opRwProcessLatency
> type_instance=Osd.opWLatency
> type_instance=Osd.opWPrepareLatency
> type_instance=Osd.opWProcessLatency
> type_instance=Osd.subopLatency
> type_instance=Osd.subopWLatency
> ...
> ...
>
>
>
>
>
> -Original Message-
> From: Alex Litvak [mailto:alexander.v.lit...@gmail.com]
> Sent: zondag 29 september 2019 13:06
> To: ceph-users@lists.ceph.com
> Cc: ceph-de...@vger.kernel.org
> Subject: [ceph-users] Commit and Apply latency on nautilus
>
> Hello everyone,
>
> I am running a number of parallel benchmark tests against the cluster
> that should be ready to go to production.
> I enabled Prometheus to monitor various metrics, and while the cluster
> stays healthy through the tests with no errors or slow requests,
> I noticed apply / commit latency jumping between 40 and 600 ms on
> multiple SSDs.  At the same time, op_read and op_write are on average
> below 0.25 ms in the worst-case scenario.
>
> I am running nautilus 14.2.2, all bluestore, no separate NVME devices
> for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate
> Nytro 1551, osd spread across 6 nodes, running in
> containers.  Each node has plenty of RAM with utilization ~ 25 GB during
> the benchmark runs.
>
> Here are benchmarks being run from 6 client systems in parallel,
> repeating the test for each block size in <4k,16k,128k,4M>.
>
> On rbd mapped partition local to each client:
>
> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
> --group_reporting --time_based --rwmixread=70
>
> On mounted cephfs volume with each client storing test file(s) in own
> sub-directory:
>
> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
> --group_reporting --time_based --rwmixread=70
>
> dbench -t 30 30
>
> Could you please let me know whether the huge jump in apply and commit
> latency is justified in my case and whether I can do anything to improve
> / fix it.  Below is some additional cluster info.
>
> Thank you,
>
> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL
>   %USE VAR  PGS STATUS
>   6   ssd 1.74609  1.0 1.7 TiB  93 GiB  92 GiB 240 MiB  784 MiB 1.7
> TiB 5.21 0.90  44 up
> 12   ssd 1.74609  1.0 1.7 TiB  98 GiB  97 GiB 118 MiB  906 MiB 1.7
> TiB 5.47 0.95  40 up
> 18   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 123 MiB  901 MiB 1.6
> TiB 5.73 0.99  47 up
> 24   ssd 3.49219  1.0 3.5 TiB 222 GiB 221 GiB 134 MiB  890 MiB 3.3
> TiB 6.20 1.07  96 up
> 30   ssd 3.49219  1.0 3.5 TiB 213 GiB 212 GiB 151 MiB  873 MiB 3.3
> TiB 5.95 1.03  93 up
> 35   ssd 3.49219  1.0 3.5 TiB 203 GiB 202 GiB 301 MiB  723 MiB 3.3
> TiB 5.67 0.98 100 up
>   5   ssd 1.74609  1.0 1.7 TiB 103 GiB 102 GiB 123 MiB  901 MiB 1.6
> TiB 5.78 1.00  49 up
> 11   ssd 1.74609  1.0 1.7 TiB 109 GiB 108 GiB  63 MiB  961 MiB 1.6
> TiB 6.09 1.05  46 up
> 17   ssd 1.74609  1.0 1.7 TiB 104 GiB 103 GiB 205 Mi

Re: [ceph-users] Commit and Apply latency on nautilus

2019-09-30 Thread Marc Roos


What parameters are you exactly using? I want to do a similar test on 
luminous, before I upgrade to Nautilus. I have quite a lot (74+)

type_instance=Osd.opBeforeDequeueOpLat
type_instance=Osd.opBeforeQueueOpLat
type_instance=Osd.opLatency
type_instance=Osd.opPrepareLatency
type_instance=Osd.opProcessLatency
type_instance=Osd.opRLatency
type_instance=Osd.opRPrepareLatency
type_instance=Osd.opRProcessLatency
type_instance=Osd.opRwLatency
type_instance=Osd.opRwPrepareLatency
type_instance=Osd.opRwProcessLatency
type_instance=Osd.opWLatency
type_instance=Osd.opWPrepareLatency
type_instance=Osd.opWProcessLatency
type_instance=Osd.subopLatency
type_instance=Osd.subopWLatency
...
...





-Original Message-
From: Alex Litvak [mailto:alexander.v.lit...@gmail.com] 
Sent: zondag 29 september 2019 13:06
To: ceph-users@lists.ceph.com
Cc: ceph-de...@vger.kernel.org
Subject: [ceph-users] Commit and Apply latency on nautilus

Hello everyone,

I am running a number of parallel benchmark tests against the cluster 
that should be ready to go to production.
I enabled Prometheus to monitor various metrics, and while the cluster 
stays healthy through the tests with no errors or slow requests,
I noticed apply / commit latency jumping between 40 and 600 ms on 
multiple SSDs.  At the same time, op_read and op_write are on average 
below 0.25 ms in the worst-case scenario.

I am running nautilus 14.2.2, all bluestore, no separate NVME devices 
for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate 
Nytro 1551, osd spread across 6 nodes, running in 
containers.  Each node has plenty of RAM with utilization ~ 25 GB during 
the benchmark runs.

Here are benchmarks being run from 6 client systems in parallel, 
repeating the test for each block size in <4k,16k,128k,4M>.

On rbd mapped partition local to each client:

fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw 
--bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300 
--group_reporting --time_based --rwmixread=70

On mounted cephfs volume with each client storing test file(s) in own 
sub-directory:

fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw 
--bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300 
--group_reporting --time_based --rwmixread=70

dbench -t 30 30

Could you please let me know whether the huge jump in apply and commit 
latency is justified in my case and whether I can do anything to improve 
/ fix it.  Below is some additional cluster info.

Thank you,

root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL 
  %USE VAR  PGS STATUS
  6   ssd 1.74609  1.0 1.7 TiB  93 GiB  92 GiB 240 MiB  784 MiB 1.7 
TiB 5.21 0.90  44 up
12   ssd 1.74609  1.0 1.7 TiB  98 GiB  97 GiB 118 MiB  906 MiB 1.7 
TiB 5.47 0.95  40 up
18   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 123 MiB  901 MiB 1.6 
TiB 5.73 0.99  47 up
24   ssd 3.49219  1.0 3.5 TiB 222 GiB 221 GiB 134 MiB  890 MiB 3.3 
TiB 6.20 1.07  96 up
30   ssd 3.49219  1.0 3.5 TiB 213 GiB 212 GiB 151 MiB  873 MiB 3.3 
TiB 5.95 1.03  93 up
35   ssd 3.49219  1.0 3.5 TiB 203 GiB 202 GiB 301 MiB  723 MiB 3.3 
TiB 5.67 0.98 100 up
  5   ssd 1.74609  1.0 1.7 TiB 103 GiB 102 GiB 123 MiB  901 MiB 1.6 
TiB 5.78 1.00  49 up
11   ssd 1.74609  1.0 1.7 TiB 109 GiB 108 GiB  63 MiB  961 MiB 1.6 
TiB 6.09 1.05  46 up
17   ssd 1.74609  1.0 1.7 TiB 104 GiB 103 GiB 205 MiB  819 MiB 1.6 
TiB 5.81 1.01  50 up
23   ssd 3.49219  1.0 3.5 TiB 210 GiB 209 GiB 168 MiB  856 MiB 3.3 
TiB 5.86 1.01  86 up
29   ssd 3.49219  1.0 3.5 TiB 204 GiB 203 GiB 272 MiB  752 MiB 3.3 
TiB 5.69 0.98  92 up
34   ssd 3.49219  1.0 3.5 TiB 198 GiB 197 GiB 295 MiB  729 MiB 3.3 
TiB 5.54 0.96  85 up
  4   ssd 1.74609  1.0 1.7 TiB 119 GiB 118 GiB  16 KiB 1024 MiB 1.6 
TiB 6.67 1.15  50 up
10   ssd 1.74609  1.0 1.7 TiB  95 GiB  94 GiB 183 MiB  841 MiB 1.7 
TiB 5.31 0.92  46 up
16   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 122 MiB  902 MiB 1.6 
TiB 5.72 0.99  50 up
22   ssd 3.49219  1.0 3.5 TiB 218 GiB 217 GiB 109 MiB  915 MiB 3.3 
TiB 6.11 1.06  91 up
28   ssd 3.49219  1.0 3.5 TiB 198 GiB 197 GiB 343 MiB  681 MiB 3.3 
TiB 5.54 0.96  95 up
33   ssd 3.49219  1.0 3.5 TiB 198 GiB 196 GiB 297 MiB 1019 MiB 3.3 
TiB 5.53 0.96  85 up
  1   ssd 1.74609  1.0 1.7 TiB 101 GiB 100 GiB 222 MiB  802 MiB 1.6 
TiB 5.63 0.97  49 up
  7   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 153 MiB  871 MiB 1.6 
TiB 5.69 0.99  46 up
13   ssd 1.74609  1.0 1.7 TiB 106 GiB 105 GiB  67 MiB  957 MiB 1.6 
TiB 5.96 1.03  42 up
19   ssd 3.49219  1.0 3.5 TiB 206 GiB 205 GiB 179 MiB  845 MiB 3.3 
TiB 5.77 1.00  83 up
25   ssd 3.49219  1.0 3.5 TiB 195 GiB 194 GiB 352 MiB  672 MiB 3.3 
TiB 5.45 0.94  97 up
31   ssd 3.49219  1.0 3.5 TiB 201 GiB 200 GiB 305 MiB

[ceph-users] Commit and Apply latency on nautilus

2019-09-29 Thread Alex Litvak

Hello everyone,

I am running a number of parallel benchmark tests against the cluster that 
should be ready to go to production.
I enabled Prometheus to monitor various metrics, and while the cluster stays 
healthy through the tests with no errors or slow requests,
I noticed apply / commit latency jumping between 40 and 600 ms on multiple 
SSDs.  At the same time, op_read and op_write are on average below 0.25 ms in 
the worst-case scenario.

I am running nautilus 14.2.2, all bluestore, no separate NVME devices for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate Nytro 1551, osd spread across 6 nodes, running in 
containers.  Each node has plenty of RAM with utilization ~ 25 GB during the benchmark runs.


Here are benchmarks being run from 6 client systems in parallel, repeating the test 
for each block size in <4k,16k,128k,4M>.

On rbd mapped partition local to each client:

fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw 
--bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300 
--group_reporting --time_based --rwmixread=70

On mounted cephfs volume with each client storing test file(s) in own 
sub-directory:

fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw 
--bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300 
--group_reporting --time_based --rwmixread=70

dbench -t 30 30
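As a rough sanity check on load (an estimate that assumes all six clients run concurrently and fio keeps every queue full), the aggregate queue depth implied by the fio runs above is:

```python
# Aggregate in-flight I/Os implied by the fio invocation above:
# 6 client systems x 8 jobs (--numjobs=8) x queue depth 4 (--iodepth=4).
clients = 6
jobs_per_client = 8
iodepth = 4

outstanding_ios = clients * jobs_per_client * iodepth
print(outstanding_ios)  # 192 in-flight I/Os cluster-wide
```

Spread over the 36 OSDs described below (6 nodes x 6 SSDs), that is only about five concurrent ops per OSD on average, so raw queue depth alone seems unlikely to explain 40-600 ms commit latency.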

Could you please let me know whether the huge jump in apply and commit latency 
is justified in my case, and whether I can do anything to improve / fix it.  
Below is some additional cluster info.

Thank you,

root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL   %USE 
VAR  PGS STATUS
 6   ssd 1.74609  1.0 1.7 TiB  93 GiB  92 GiB 240 MiB  784 MiB 1.7 TiB 5.21 
0.90  44 up
12   ssd 1.74609  1.0 1.7 TiB  98 GiB  97 GiB 118 MiB  906 MiB 1.7 TiB 5.47 
0.95  40 up
18   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 123 MiB  901 MiB 1.6 TiB 5.73 
0.99  47 up
24   ssd 3.49219  1.0 3.5 TiB 222 GiB 221 GiB 134 MiB  890 MiB 3.3 TiB 6.20 
1.07  96 up
30   ssd 3.49219  1.0 3.5 TiB 213 GiB 212 GiB 151 MiB  873 MiB 3.3 TiB 5.95 
1.03  93 up
35   ssd 3.49219  1.0 3.5 TiB 203 GiB 202 GiB 301 MiB  723 MiB 3.3 TiB 5.67 
0.98 100 up
 5   ssd 1.74609  1.0 1.7 TiB 103 GiB 102 GiB 123 MiB  901 MiB 1.6 TiB 5.78 
1.00  49 up
11   ssd 1.74609  1.0 1.7 TiB 109 GiB 108 GiB  63 MiB  961 MiB 1.6 TiB 6.09 
1.05  46 up
17   ssd 1.74609  1.0 1.7 TiB 104 GiB 103 GiB 205 MiB  819 MiB 1.6 TiB 5.81 
1.01  50 up
23   ssd 3.49219  1.0 3.5 TiB 210 GiB 209 GiB 168 MiB  856 MiB 3.3 TiB 5.86 
1.01  86 up
29   ssd 3.49219  1.0 3.5 TiB 204 GiB 203 GiB 272 MiB  752 MiB 3.3 TiB 5.69 
0.98  92 up
34   ssd 3.49219  1.0 3.5 TiB 198 GiB 197 GiB 295 MiB  729 MiB 3.3 TiB 5.54 
0.96  85 up
 4   ssd 1.74609  1.0 1.7 TiB 119 GiB 118 GiB  16 KiB 1024 MiB 1.6 TiB 6.67 
1.15  50 up
10   ssd 1.74609  1.0 1.7 TiB  95 GiB  94 GiB 183 MiB  841 MiB 1.7 TiB 5.31 
0.92  46 up
16   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 122 MiB  902 MiB 1.6 TiB 5.72 
0.99  50 up
22   ssd 3.49219  1.0 3.5 TiB 218 GiB 217 GiB 109 MiB  915 MiB 3.3 TiB 6.11 
1.06  91 up
28   ssd 3.49219  1.0 3.5 TiB 198 GiB 197 GiB 343 MiB  681 MiB 3.3 TiB 5.54 
0.96  95 up
33   ssd 3.49219  1.0 3.5 TiB 198 GiB 196 GiB 297 MiB 1019 MiB 3.3 TiB 5.53 
0.96  85 up
 1   ssd 1.74609  1.0 1.7 TiB 101 GiB 100 GiB 222 MiB  802 MiB 1.6 TiB 5.63 
0.97  49 up
 7   ssd 1.74609  1.0 1.7 TiB 102 GiB 101 GiB 153 MiB  871 MiB 1.6 TiB 5.69 
0.99  46 up
13   ssd 1.74609  1.0 1.7 TiB 106 GiB 105 GiB  67 MiB  957 MiB 1.6 TiB 5.96 
1.03  42 up
19   ssd 3.49219  1.0 3.5 TiB 206 GiB 205 GiB 179 MiB  845 MiB 3.3 TiB 5.77 
1.00  83 up
25   ssd 3.49219  1.0 3.5 TiB 195 GiB 194 GiB 352 MiB  672 MiB 3.3 TiB 5.45 
0.94  97 up
31   ssd 3.49219  1.0 3.5 TiB 201 GiB 200 GiB 305 MiB  719 MiB 3.3 TiB 5.62 
0.97  90 up
 0   ssd 1.74609  1.0 1.7 TiB 110 GiB 109 GiB  29 MiB  995 MiB 1.6 TiB 6.14 
1.06  43 up
 3   ssd 1.74609  1.0 1.7 TiB 109 GiB 108 GiB  28 MiB  996 MiB 1.6 TiB 6.07 
1.05  41 up
 9   ssd 1.74609  1.0 1.7 TiB 103 GiB 102 GiB 149 MiB  875 MiB 1.6 TiB 5.76 
1.00  52 up
15   ssd 3.49219  1.0 3.5 TiB 209 GiB 208 GiB 253 MiB  771 MiB 3.3 TiB 5.83 
1.01  98 up
21   ssd 3.49219  1.0 3.5 TiB 199 GiB 198 GiB 302 MiB  722 MiB 3.3 TiB 5.56 
0.96  90 up
27   ssd 3.49219  1.0 3.5 TiB 208 GiB 207 GiB 226 MiB  798 MiB 3.3 TiB 5.81 
1.00  95 up
 2   ssd 1.74609  1.0 1.7 TiB  96 GiB  95 GiB 158 MiB  866 MiB 1.7 TiB 5.35 
0.93  45 up
 8   ssd 1.74609  1.0 1.7 TiB 106 GiB 105 GiB 132 MiB  892 MiB 1.6 TiB 5.91 
1.02  50 up
14   ssd 1.74609  1.0 1.7 TiB  96 GiB  95 GiB 180 MiB  844 MiB 1.7 TiB 5.35 
0.92  46 up
20   ssd 3.49219  1.0 3.5 TiB 221 GiB 220 GiB