Re: [ceph-users] CEPH pool statistics MAX AVAIL

2019-06-25 Thread Mohamad Gebai
MAX AVAIL is the amount of data you can still write to the cluster
before *any one of your OSDs* becomes near full. If MAX AVAIL is not
what you expect it to be, look at the data distribution using ceph osd
df and make sure you have a uniform distribution.
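For example, something like:

$> ceph osd df tree    # per-OSD SIZE/USE/AVAIL/%USE and VAR, grouped by host

will show the spread; roughly speaking, the OSD with the highest %USE
relative to its weight is the one that caps MAX AVAIL.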

Mohamad

On 6/25/19 11:46 AM, Davis Mendoza Paco wrote:
> Hi all,
> I have installed ceph luminous, with 43 OSD(3TB)
>
> Checking pool statistics
>
> ceph df detail
> GLOBAL:
>     SIZE       AVAIL       RAW USED     %RAW USED     OBJECTS
>     117TiB     69.3TiB      48.0TiB         40.91       4.20M
> POOLS:
>     NAME       ID     QUOTA OBJECTS     QUOTA BYTES     USED        %USED     MAX AVAIL     OBJECTS     DIRTY       READ        WRITE       RAW USED
>     images     9      N/A               N/A             144GiB      1.36      10.2TiB       22379       22.38k      70.0MiB     354KiB      432GiB
>     vms        10     N/A               N/A             3.36TiB     24.69     10.2TiB       889606      889.61k     3.36GiB     4.61GiB     10.1TiB
>     backups    12     N/A               N/A             1.00GiB     0         10.2TiB       261         261         103KiB      525B        3.00GiB
>     volumes    13     N/A               N/A             12.5TiB     55.02     10.2TiB       3289892     3.29M       754MiB      616MiB      37.6TiB
>
> I cannot understand what the column "MAX AVAIL" refers to. According
> to the column "%USED", only 55% of the pool "volumes" is used, that is
> 12.5TiB:
>     NAME                    ID    USED        %USED     MAX AVAIL
>     volumes                 13    12.5TiB     55.02       10.2TiB
>
> -- 
> *Davis Mendoza P.*
>


Re: [ceph-users] rbd cache limiting IOPS

2019-03-07 Thread Mohamad Gebai
Hi Florian,

On 3/7/19 10:27 AM, Florian Engelmann wrote:
>
> So the settings are recognized and used by qemu. But any value higher
> than the default cache size (32MB) leads to strange IOPS
> results. IOPS are very stable with 32MB (~20,000-23,000), but if we
> define a bigger cache size (we tested from 64MB up to 256MB) the IOPS
> become very erratic (from 0 IOPS up to 23,000).
>
> Setting "rbd cache max dirty" to 0 changes the behaviour to
> write-through, as far as I understand. I expected the latency to
> increase to at least 0.6 ms, which was the case, but I also expected the
> IOPS to increase up to 60,000, which was not the case. IOPS stayed
> constant at ~14,000 (4 jobs, QD=64).
>
>
>
> On 3/7/19 at 11:41 AM, Florian Engelmann wrote:
>> Hi,
>>
>> we are running an Openstack environment with Ceph block storage.
>> There are six nodes in the current Ceph cluster (12.2.10) with NVMe
>> SSDs and a P4800X Optane for rocksdb and WAL.
>> The decision was made to use the rbd writeback cache with KVM/QEMU. The
>> write latency is incredibly good (~85 µs) and the read latency is
>> still good (~0.6ms). But we are limited to ~23,000 IOPS in a KVM
>> machine. So we ran the same FIO benchmark after disabling the rbd
>> cache and got 65,000 IOPS, but of course the write latency (QD1)
>> increased to ~0.6ms.

How does fio with rbd cache enabled compare to QEMU/KVM?

Can you try making sure the queue depth used with QEMU is similar to
what you're using with fio? You'll probably have to play with the
configuration of the disks in libvirt. There are a couple of settings you
can try (num-queues, iothreads [1], and maybe others), though I'm
unfortunately not very familiar with them. I think I'd start there.
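Something along these lines would go through librbd with the same rbd
cache settings, so the results are directly comparable (a sketch; the
pool/image names and the exact fio options are placeholders):

$> fio --name=librbd-test --ioengine=rbd --clientname=admin --pool=<pool> \
       --rbdname=<test-image> --rw=randwrite --bs=4k --iodepth=64 --numjobs=4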

Mohamad

[1] https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation



Re: [ceph-users] rbd space usage

2019-02-28 Thread Mohamad Gebai
On 2/27/19 4:57 PM, Marc Roos wrote:
> They are 'thin provisioned' meaning if you create a 10GB rbd, it does 
> not use 10GB at the start. (afaik)

You can use 'rbd -p rbd du' to see how much of these devices is
provisioned and actually used, and check that it is consistent with what
you expect.
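For example:

$> rbd -p rbd du

lists PROVISIONED (the --size you gave the image) next to USED (what has
actually been allocated). Also keep in mind that the RAW USED figure in
'ceph df' counts replicas, so a 3x pool contributes roughly three times
its USED to the global number.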

Mohamad

>
>
> -Original Message-
> From: solarflow99 [mailto:solarflo...@gmail.com] 
> Sent: 27 February 2019 22:55
> To: Ceph Users
> Subject: [ceph-users] rbd space usage
>
> using ceph df it looks as if RBD images can use the total free space 
> available of the pool they belong to (8.54% here), yet I know they are created 
> with a --size parameter and that's what determines the actual space. I 
> can't understand the difference I'm seeing: only 5T is being used, but 
> ceph df shows 51T:
>
>
> /dev/rbd0   8.0T  4.8T  3.3T  60% /mnt/nfsroot/rbd0
> /dev/rbd1   9.8T   34M  9.8T   1% /mnt/nfsroot/rbd1
>
>
>
> # ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 180T  130T   51157G 27.75
> POOLS:
> NAMEID USED   %USED MAX AVAIL 
> OBJECTS
> rbd 0  15745G  8.543G  
> 4043495
> cephfs_data 1   0 03G
> 0
> cephfs_metadata 21962 03G
>20
> spider_stage 9   1595M 03G47835
> spider   10   955G  0.523G 
> 42541237
>
>
>
>


Re: [ceph-users] Mimic Bluestore memory optimization

2019-02-25 Thread Mohamad Gebai
Hi Glen,

On 2/24/19 9:21 PM, Glen Baars wrote:
> I am tracking down a performance issue with some of our Mimic 13.2.4 OSDs. It 
> feels like a lack of memory but I have no real proof of the issue. I have 
> used memory profiling (the pprof tool) and the OSDs are staying within their 
> 4GB allocated limit.

What are the symptoms? Does performance drop at a certain point? Did it
drop compared to a previous configuration? You're saying that only
*some* OSDs have a performance issue?

> My questions are:
>
> 1. How do you know if the allocated memory is enough for the OSD? My 1TB disks 
> and 12TB disks take the same memory, and I wonder if the OSDs should have 
> memory allocated based on the size of the disks?
> 2. In the past, SSD disks needed 3 times the memory and now they don't; why is 
> that? (1GB RAM per HDD and 3GB RAM per SSD both went to 4GB)

I think you're talking about the BlueStore caching settings for SSDs and
HDDs. You should take a look at the memory autotuning (notably
osd_memory_target):

http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#automatic-cache-sizing
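A minimal sketch, assuming a 13.2.x release where the memory autotuner is
available (4 GiB is just the default, shown here for illustration):

[osd]
osd_memory_target = 4294967296    # ~4 GiB per OSD daemon; the caches are sized to fit within this

or, at runtime: ceph config set osd osd_memory_target 4294967296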

> 3. I have read that the number of placement groups per OSD is a significant 
> factor in the memory usage. Generally I have ~200 placement groups per OSD; 
> this is at the higher end of the recommended values and I wonder if it's 
> causing high memory usage?
>
> For reference the hosts are 1 x 6 core CPU, 72GB ram, 14 OSDs, 2 x 10Gbit. 
> LSI cachecade / writeback cache for the HDD and LSI JBOD for SSDs. 9 hosts in 
> this cluster.
>
> Kind regards,
> Glen Baars


Re: [ceph-users] Hardware difference in the same Rack

2019-02-21 Thread Mohamad Gebai
On 2/21/19 1:22 PM, Fabio Abreu wrote:
> Hi Everybody,
>
> Is it recommended to mix different hardware types in the same rack?
>
> For example I have a sata rack with Apollo 4200 storage and I will get
> another hardware type to expand this rack, Hp 380 Gen10.
>
> I made a lot of tests to understand the performance, and these new
> disks are at 100% utilization in my environment, while cluster
> recovery is worse than on the other hardware.
>
> Can someone recommend a best practice or configuration for this
> scenario? I raise this issue because if these disks do not perform as
> hoped, I will have to configure separate pools for my OpenStack, and that
> may not make sense, because I would have to split the Nova processes across
> the compute nodes if I have two pools.
>

It's usually better to have homogeneous hardware across your cluster.
Mixing hardware makes your requests subject to the
"weakest link in the chain". For instance, write request latency will
be bound by the latency of your slowest device. In practice there might
be other issues as well that have been pointed out on this list before
(feel free to search).

Having separate pools on different kinds of hardware sounds like a good
approach. Otherwise, depending on your workload, it might be worth
thinking about tweaking the primary affinity of OSDs so that your fast
OSDs are more likely to be primaries (reads are served from the primary
OSD only). Depending on your new disks (throughput and size), you could
also look at tweaking the weights, but that's just the beginning of a
real hassle in terms of management.
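A minimal sketch of the primary-affinity tweak (the OSD IDs are
illustrative; pre-Luminous clusters may need 'mon osd allow primary
affinity = true' first):

$> ceph osd primary-affinity osd.12 0.5    # slower OSD, half as likely to be chosen as primary
$> ceph osd primary-affinity osd.3 1.0     # faster OSD keeps full primary affinity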

Mohamad


Re: [ceph-users] BlueStore / OpenStack Rocky performance issues

2019-02-21 Thread Mohamad Gebai
I didn't mean that the fact they are consumer SSDs is the reason for
this performance impact. I was just pointing it out, unrelated to your
problem.

40% is a lot more than one would expect to see. How are you measuring
the performance? What is the workload and what numbers are you getting?
What numbers were you getting with Filestore?

One of the biggest differences is that Filestore can make use of the
page cache, whereas Bluestore manages its own cache. You can try
increasing the Bluestore cache and see if it helps. Depending on the
data set size and pattern, it might make a significant difference.
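A sketch of what I mean by increasing the cache (Luminous-style static
cache sizing; 8 GiB is only an example, size it to the RAM you actually
have per OSD):

[osd]
bluestore_cache_size_ssd = 8589934592    # 8 GiB per OSD instead of the 3 GiB default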

Mohamad

On 2/21/19 11:36 AM, Smith, Eric wrote:
>
> Yes stand-alone OSDs (WAL/DB/Data all on the same disk), this is the
> same as it was for Jewel / filestore. Even if they are consumer SSDs
> why would they be 40% faster with an older version of Ceph?
>
>  
>
> *From: *Mohamad Gebai 
> *Date: *Thursday, February 21, 2019 at 9:44 AM
> *To: *"Smith, Eric" , Sinan Polat
> , "ceph-users@lists.ceph.com" 
> *Subject: *Re: [ceph-users] BlueStore / OpenStack Rocky performance issues
>
>  
>
> What is your setup with Bluestore? Standalone OSDs? Or do they have
> their WAL/DB partitions on another device? How does it compare to your
> Filestore setup for the journal?
>
> On a separate note, these look like they're consumer SSDs, which makes
> them not a great fit for Ceph.
>
> Mohamad
>
> On 2/21/19 9:29 AM, Smith, Eric wrote:
>
> 40% slower performance compared to Ceph Jewel / OpenStack Mitaka
> backed by the same SSDs ☹ I have 30 OSDs on SSDs (Samsung 860 EVO
> 1TB each)
>
>  
>
> *From:* Sinan Polat <si...@turka.nl>
> *Sent:* Thursday, February 21, 2019 8:43 AM
> *To:* ceph-users@lists.ceph.com; Smith, Eric <eric.sm...@ccur.com>
> *Subject:* Re: [ceph-users] BlueStore / OpenStack Rocky performance issues
>
>  
>
> Hi Eric,
>
> 40% slower performance compared to ..? Could you please share the
> current performance. How many OSD nodes do you have?
>
> Regards,
> Sinan
>
> On 21 February 2019 at 14:19, "Smith, Eric" <eric.sm...@ccur.com> wrote:
>
> Hey folks – I recently deployed Luminous / BlueStore on SSDs
> to back an OpenStack cluster that supports our build /
> deployment infrastructure and I’m getting 40% slower build
> times. Any thoughts on what I may need to do with Ceph to
> speed things up? I have 30 SSDs backing an 11 compute node
> cluster.
>
>  
>
> Eric
>


Re: [ceph-users] BlueStore / OpenStack Rocky performance issues

2019-02-21 Thread Mohamad Gebai
What is your setup with Bluestore? Standalone OSDs? Or do they have
their WAL/DB partitions on another device? How does it compare to your
Filestore setup for the journal?

On a separate note, these look like they're consumer SSDs, which makes
them not a great fit for Ceph.

Mohamad


On 2/21/19 9:29 AM, Smith, Eric wrote:
>
> 40% slower performance compared to Ceph Jewel / OpenStack Mitaka
> backed by the same SSDs ☹ I have 30 OSDs on SSDs (Samsung 860 EVO 1TB
> each)
>
>  
>
> *From:* Sinan Polat 
> *Sent:* Thursday, February 21, 2019 8:43 AM
> *To:* ceph-users@lists.ceph.com; Smith, Eric 
> *Subject:* Re: [ceph-users] BlueStore / OpenStack Rocky performance issues
>
>  
>
> Hi Eric,
>
> 40% slower performance compared to ..? Could you please share the
> current performance. How many OSD nodes do you have?
>
> Regards,
> Sinan
>
> On 21 February 2019 at 14:19, "Smith, Eric" <eric.sm...@ccur.com> wrote:
>
> Hey folks – I recently deployed Luminous / BlueStore on SSDs to
> back an OpenStack cluster that supports our build / deployment
> infrastructure and I’m getting 40% slower build times. Any
> thoughts on what I may need to do with Ceph to speed things up? I
> have 30 SSDs backing an 11 compute node cluster.
>
>  
>
> Eric
>
>
>  
>


[ceph-users] Performance issue due to tuned

2019-01-24 Thread Mohamad Gebai
Hi all,

I want to share a performance issue I just encountered on a test cluster
of mine, specifically related to tuned. I started by setting the
"throughput-performance" tuned profile on my OSD nodes and ran some
benchmarks. I then applied that same profile to my client node, which
intuitively sounds like a reasonable thing to do (I do want to tweak my
client to maximize throughput if that's possible). Long story short, I
found out that one of the tweaks made by the "throughput-performance"
profile is to increase

kernel.sched_wakeup_granularity_ns = 15000000

which reduces the maximum throughput I'm able to get from 1080 MB/s to
1060 MB/s (-2.8%). The default value for sched_wakeup_granularity_ns
depends on the distro, on my system the default is 7.5ms. More info
about the benchmark:

- The benchmark tool is 'rados bench'
- The cluster has about 10 nodes with older hardware
- The client node has only 4 CPUs, the OSD nodes have 16 CPUs and 5 OSDs
each
- The throughput difference is always reproducible
- This was a read workload so that there is less volatility in the results
- I had all the data in BlueStore's cache on the OSD nodes so that
accessing the HDDs wouldn't skew the results
- I was looking at the difference of throughput once the benchmark
reaches its permanent regime, during which the throughput is very stable
(not surprising for a sequential read workload served from memory)

I have a theory which explains the reason for this reduced throughput.
The sched_wakeup_granularity_ns setting sets the minimum time a process
runs on a CPU before it can get preempted, so it looks like there might
be too much of a delay for rados bench's threads to get scheduled on-cpu
(higher latency from the moment a thread is woken up and goes in the CPU
runqueue to the time it is scheduled in and starts running) which
effectively results in a lower throughput overall.

We can measure that latency using 'perf sched timehist':

           time    cpu  task name                      wait time  sch delay  run time
                        [tid/pid]                         (msec)     (msec)    (msec)
--------------- ------  -----------------------------  ---------  ---------  --------
 3279952.180957 [0002]  msgr-worker-1[50098/50094]         0.154      0.021     0.135

it is shown in the 5th column (sch delay). If we look at the average of
'sch delay' for a lower throughput run, we get:

$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{
total += $5; count++ } END { print total/count }'
0.0243015

And for a higher throughput run:

$> perf sched timehist -i perf.data.fast | egrep 'msgr|rados' | awk '{
total += $5; count++ } END { print total/count }'
0.00401659

There is on average a ~20µs (0.02ms) longer delay for "wakeup-to-sched-in" with
the throughput-performance profile enabled on the client due to the
sched_wakeup_granularity_ns setting. The fact that there are few CPUs on
that node doesn't help. If I set the number of concurrent IOs to 1, I
get the same throughput for both values of sched_wakeup_granularity,
because there is (almost) always an available CPU, which means that
rados bench's threads don't have to wait as long to get scheduled in and
start consuming data.

On the other hand, increasing sched_wakeup_granularity_ns on the OSD
nodes doesn't reduce the throughput because there are more CPUs than
there are OSDs, and the wakeup-to-sched delay is "diluted" by the
latency of reading/writing/moving data around.
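For anyone who wants to check this on their own client nodes, something
like (a sketch; the 7.5ms value is just my system's default):

$> tuned-adm active                                        # which profile is applied
$> sysctl kernel.sched_wakeup_granularity_ns               # current value
$> sysctl -w kernel.sched_wakeup_granularity_ns=7500000    # temporarily revert for an A/B run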

I'm curious to know if this theory makes sense, and if other people have
encountered similar situations (with tuned or otherwise).

Mohamad



Re: [ceph-users] monitor cephfs mount io's

2019-01-22 Thread Mohamad Gebai
Hi Marc,

My point was that there was no way to do that for a kernel mount except
from the client that consumes the mounted RBDs.

Mohamad

On 1/21/19 4:29 AM, Marc Roos wrote:
>
> Hi Mohamad, How do you do that client side, I am having currently two 
> kernel mounts? 
>
>
>
>
>
> -Original Message-
> From: Mohamad Gebai [mailto:mge...@suse.de] 
> Sent: 17 January 2019 15:57
> To: Marc Roos; ceph-users
> Subject: Re: [ceph-users] monitor cephfs mount io's
>
> You can do that either straight from your client, or by querying the 
> perf dump if you're using ceph-fuse.
>
> Mohamad
>
> On 1/17/19 6:19 AM, Marc Roos wrote:
>> How / where can I monitor the ios on cephfs mount / client?
>>


Re: [ceph-users] monitor cephfs mount io's

2019-01-17 Thread Mohamad Gebai
You can do that either straight from your client, or by querying the
perf dump if you're using ceph-fuse.
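For the ceph-fuse case, a minimal sketch (the socket name below is
illustrative; check /var/run/ceph/ for the actual one):

$> ls /var/run/ceph/
$> ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok perf dump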

Mohamad

On 1/17/19 6:19 AM, Marc Roos wrote:
>
> How / where can I monitor the ios on cephfs mount / client?
>


Re: [ceph-users] EC pools grinding to a screeching halt on Luminous

2018-12-31 Thread Mohamad Gebai
On 12/31/18 4:51 AM, Marcus Murwall wrote:
> What you say does make sense though as I also get the feeling that the
> osds are just waiting for something. Something that never happens and
> the request finally timeout...

So the OSDs are just completely idle? If not, try using strace and/or
perf to get some insights into what they're doing.

Maybe someone with better knowledge of EC internals will suggest
something. In the mean time, you might want to look at the client side.
Could the client be somehow saturated or blocked on something? (If the
clients aren't blocked you can use 'perf' or Mark's profiler [1] to
profile them).
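For example, something like (the PID is a placeholder, and this works for
either an OSD or the client process):

$> perf top -p <pid>        # where the process spends CPU cycles, if it spends any at all
$> strace -c -f -p <pid>    # syscall summary; stop with Ctrl-C after a few seconds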

Try benchmarking with an iodepth of 1 and slowly increase it until you
run into the issue, all while monitoring your resources. You might find
what causes the tipping point. Are you able to reproduce this
using fio? Maybe this is just a client issue.

Sorry for suggesting a bunch of things that are all over the place, I'm
just trying to understand the state of the cluster (and clients). Are
both the OSDs and the clients completely blocked and make no progress?

Let us know what you find.

Mohamad

[1] https://github.com/markhpc/gdbpmp/

>
> I will have one of our network guys to take a look and get a second
> pair of eyes on it as well, just to make sure I'm not missing anything.
>
> Thanks for your help so far Mohamad, I really appreciate it. If you
> have some more ideas/suggestions on where to look please let us know.
>
> I wish you all a happy new year.
>
> Regards
> Marcus
>
>> Mohamad Gebai <mailto:mge...@suse.de>
>> 28 December 2018 at 16:10
>> Hi Marcus,
>>
>> On 12/27/18 4:21 PM, Marcus Murwall wrote:
>>> Hey Mohamad
>>>
>>> I work with Florian on this issue.
>>> Just reinstalled the ceph cluster and triggered the error again.
>>> Looking at iostat -x 1 there is basically no activity at all against
>>> any of the osds.
>>> We get blocked ops all over the place but here are some output from
>>> one of the osds that had blocked requests:
>>> http://paste.openstack.org/show/738721/
>>
>> Looking at the historic_slow_ops, the step in the pipeline that takes
>> the most time is sub_op_applied -> commit_sent. I couldn't say
>> exactly what these steps are from a high level view, but looking at
>> the code, commit_sent indicates that a message has been sent to the
>> OSD's client over the network. Can you look for network congestion
>> (the fact that there's nothing happening on the disks points in that
>> direction too)? Something like iftop might help. Is there anything
>> suspicious in the logs?
>>
>> Also, do you get the same throughput when benchmarking the replicated
>> compared to the EC pool?
>>
>> Mohamad
>>
>>>
>>>
>>> Regards
>>> Marcus
>>>
>>>> Mohamad Gebai <mailto:mge...@suse.de>
>>>> 26 December 2018 at 18:27
>>>> What is happening on the individual nodes when you reach that point
>>>> (iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
>>>> benchmarking the replicated pool?
>>>>
>>>> I guess one way to start would be by looking at ongoing operations at
>>>> the OSD level:
>>>>
>>>> ceph daemon osd.X dump_blocked_ops
>>>> ceph daemon osd.X dump_ops_in_flight
>>>> ceph daemon osd.X dump_historic_slow_ops
>>>>
>>>> (see ceph daemon osd.X help) for more commands.
>>>>
>>>> The first command will show currently blocked operations. The last
>>>> command shows recent slow operations. You can follow the flow of
>>>> individual operations, and you might find that the slow operations are
>>>> all associated with the same few PGs, or that they're spending too much
>>>> time waiting on something.
>>>>
>>>> Hope that helps.
>>>>
>>>> Mohamad
>>>>
>>>>
>>>> Florian Haas <mailto:flor...@citynetwork.eu>
>>>> 26 December 2018 at 11:20
>>>> Hi everyone,
>>>>
>>>> We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
>>>> also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
>>>> doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
>>>> distro build).
>>>>
>>>> The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
>>>> per node, of which 20 are SAS spinners and 2 are NVMe devices. Cluster
>>>> has been deployed with ceph-ansible sta

Re: [ceph-users] EC pools grinding to a screeching halt on Luminous

2018-12-28 Thread Mohamad Gebai
Hi Marcus,

On 12/27/18 4:21 PM, Marcus Murwall wrote:
> Hey Mohamad
>
> I work with Florian on this issue.
> Just reinstalled the ceph cluster and triggered the error again.
> Looking at iostat -x 1 there is basically no activity at all against
> any of the osds.
> We get blocked ops all over the place but here are some output from
> one of the osds that had blocked requests:
> http://paste.openstack.org/show/738721/

Looking at the historic_slow_ops, the step in the pipeline that takes
the most time is sub_op_applied -> commit_sent. I couldn't say exactly
what these steps are from a high level view, but looking at the code,
commit_sent indicates that a message has been sent to the OSD's client
over the network. Can you look for network congestion (the fact that
there's nothing happening on the disks points in that direction too)?
Something like iftop might help. Is there anything suspicious in the logs?
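A few quick things to run on the OSD nodes while the benchmark is stuck
(the interface name is a placeholder):

$> iftop -nNP -i <cluster-facing-interface>    # live per-connection bandwidth
$> sar -n DEV 1 10                             # per-NIC throughput and errors over 10 seconds
$> netstat -s | grep -i retrans                # growing retransmit counters hint at a flaky link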

Also, do you get the same throughput when benchmarking the replicated
compared to the EC pool?

Mohamad

>
>
> Regards
> Marcus
>
>> Mohamad Gebai <mailto:mge...@suse.de>
>> 26 December 2018 at 18:27
>> What is happening on the individual nodes when you reach that point
>> (iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
>> benchmarking the replicated pool?
>>
>> I guess one way to start would be by looking at ongoing operations at
>> the OSD level:
>>
>> ceph daemon osd.X dump_blocked_ops
>> ceph daemon osd.X dump_ops_in_flight
>> ceph daemon osd.X dump_historic_slow_ops
>>
>> (see ceph daemon osd.X help) for more commands.
>>
>> The first command will show currently blocked operations. The last
>> command shows recent slow operations. You can follow the flow of
>> individual operations, and you might find that the slow operations are
>> all associated with the same few PGs, or that they're spending too much
>> time waiting on something.
>>
>> Hope that helps.
>>
>> Mohamad
>>
>>
>> Florian Haas <mailto:flor...@citynetwork.eu>
>> 26 December 2018 at 11:20
>> Hi everyone,
>>
>> We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
>> also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
>> doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
>> distro build).
>>
>> The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
>> per node, of which 20 are SAS spinners and 2 are NVMe devices. Cluster
>> has been deployed with ceph-ansible stable-3.1, we're using
>> "objectstore: bluestore" and "osd_scenario: collocated".
>>
>> We're using a "class hdd" replicated CRUSH ruleset for all our pools,
>> except:
>>
>> - the bucket index pool, which uses a replicated "class nvme" rule, and
>> - the bucket data pool, which uses an EC (crush-device-class=hdd,
>> crush-failure-domain=host, k=3, m=2).
>>
>> We also have 3 pools that we have created in order to be able to do
>> benchmark runs while leaving the other pools untouched, so we have
>>
>> - bench-repl-hdd, replicated, size 3, using a CRUSH rule with "step take
>> default class hdd"
>> - bench-repl-nvme, replicated, size 3, using a CRUSH rule with "step
>> take default class nvme"
>> - bench-ec-hdd, EC, crush-device-class=hdd, crush-failure-domain=host,
>> k=3, m=2.
>>
>> Baseline benchmarks with "ceph tell osd.* bench" at the default block
>> size of 4M yield pretty exactly the throughput you'd expect from the
>> devices: approx. 185 MB/s from the SAS drives; the NVMe devices
>> currently pull only 650 MB/s on writes but that may well be due to
>> pending conditioning — this is new hardware.
>>
>> Now when we run "rados bench" against the replicated pools, we again get
>> exactly what we expect for a nominally performing but largely untuned
>> system.
>>
>> It's when we try running benchmarks against the EC pool that everything
>> appears to grind to a halt:
>>
>> http://paste.openstack.org/show/738187/
>>
>> After 19 seconds, that pool does not accept a single further object. We
>> simultaneously see slow request warnings creep up in the cluster, and
>> the only thing we can then do is kill the benchmark, and wait for the
>> slow requests to clear out.
>>
>> We've also seen the log messages discussed in
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
>> and they seem to correlate with the slow requests popping up, but from
>> Greg's 

Re: [ceph-users] EC pools grinding to a screeching halt on Luminous

2018-12-26 Thread Mohamad Gebai
What is happening on the individual nodes when you reach that point
(iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
benchmarking the replicated pool?

I guess one way to start would be by looking at ongoing operations at
the OSD level:

ceph daemon osd.X dump_blocked_ops
ceph daemon osd.X dump_ops_in_flight
ceph daemon osd.X dump_historic_slow_ops

(see ceph daemon osd.X help) for more commands.

The first command will show currently blocked operations. The last
command shows recent slow operations. You can follow the flow of
individual operations, and you might find that the slow operations are
all associated with the same few PGs, or that they're spending too much
time waiting on something.
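If jq is available, a quick way to skim them (osd.3 is illustrative, and
the field names can differ slightly between releases):

$> ceph daemon osd.3 dump_blocked_ops | jq '.ops[] | {description, age}'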

Hope that helps.

Mohamad


On 12/26/18 5:20 AM, Florian Haas wrote:
> Hi everyone,
>
> We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
> also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
> doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
> distro build).
>
> The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
> per node, of which 20 are SAS spinners and 2 are NVMe devices. Cluster
> has been deployed with ceph-ansible stable-3.1, we're using
> "objectstore: bluestore" and "osd_scenario: collocated".
>
> We're using a "class hdd" replicated CRUSH ruleset for all our pools,
> except:
>
> - the bucket index pool, which uses a replicated "class nvme" rule, and
> - the bucket data pool, which uses an EC (crush-device-class=hdd,
> crush-failure-domain=host, k=3, m=2).
>
> We also have 3 pools that we have created in order to be able to do
> benchmark runs while leaving the other pools untouched, so we have
>
> - bench-repl-hdd, replicated, size 3, using a CRUSH rule with "step take
> default class hdd"
> - bench-repl-nvme, replicated, size 3, using a CRUSH rule with "step
> take default class nvme"
> - bench-ec-hdd, EC, crush-device-class=hdd, crush-failure-domain=host,
> k=3, m=2.
>
> Baseline benchmarks with "ceph tell osd.* bench" at the default block
> size of 4M yield pretty exactly the throughput you'd expect from the
> devices: approx. 185 MB/s from the SAS drives; the NVMe devices
> currently pull only 650 MB/s on writes but that may well be due to
> pending conditioning — this is new hardware.
>
> Now when we run "rados bench" against the replicated pools, we again get
> exactly what we expect for a nominally performing but largely untuned
> system.
>
> It's when we try running benchmarks against the EC pool that everything
> appears to grind to a halt:
>
> http://paste.openstack.org/show/738187/
>
> After 19 seconds, that pool does not accept a single further object. We
> simultaneously see slow request warnings creep up in the cluster, and
> the only thing we can then do is kill the benchmark, and wait for the
> slow requests to clear out.
>
> We've also seen the log messages discussed in
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
> and they seem to correlate with the slow requests popping up, but from
> Greg's reply in
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028974.html
> I'm assuming that that's benign and doesn't warrant further investigation.
>
> Here's a few things we've tried, to no avail:
>
> - Make sure we use the latest Luminous release (we started out on Bionic
> and 12.2.7, then reinstalled systems with Xenial so we could use 12.2.10).
> - Enable Bluestore buffered writes (bluestore_default_buffered_write =
> true); buffered reads are on by default.
> - Extend the BlueStore cache from 1G to 4G (bluestore_cache_size_hdd =
> 4294967296; each OSD box has 128G RAM so should not run into memory
> starvation issues with that).
>
> But those were basically "let's give this a shot and see if it makes a
> difference" attempts (it didn't).
>
> I'm basically looking for ideas where even to start looking. So if
> anyone can guide us into the right direction, that would be excellent.
> Thanks in advance for any help you can offer; it is much appreciated!
>
> Cheers,
> Florian


Re: [ceph-users] RDMA/RoCE enablement failed with (113) No route to host

2018-12-18 Thread Mohamad Gebai
Last I heard (read) was that the RDMA implementation is somewhat
experimental. Search for "troubleshooting ceph rdma performance" on this
mailing list for more info.

(Adding Roman in CC who has been working on this recently.)

Mohamad

On 12/18/18 11:42 AM, Michael Green wrote:
> I don't know. 
> Ceph documentation on Mimic doesn't appear to go into too much details
> on RDMA in general, but still it's mentioned in the Ceph docs here and
> there.  Some examples:
> Change log - http://docs.ceph.com/docs/master/releases/mimic/
> Async messenger options
> - http://docs.ceph.com/docs/master/rados/configuration/ms-ref/
>
> I want to believe that the official docs wouldn't mention something
> that's completely broken?
>
> There are multiple posts in this very mailing list from people trying
> to make it work. 
> *--
> Michael Green
> *Customer Support & Integration
> Tel. +1 (518) 9862385
> gr...@e8storage.com 
>
>> On Dec 18, 2018, at 6:55 AM, Виталий Филиппов wrote:
>>
>> Is RDMA officially supported? I'm asking because I recently tried to
>> use DPDK and it seems it's broken... i.e the code is there, but does
>> not compile until I fix cmake scripts, and after fixing the build
>> OSDs just get segfaults and die after processing something like 40-50
>> incoming packets.
>>
>> Maybe RDMA is in the same state?
>>
>> On 13 December 2018, 2:42:23 GMT+03:00, Michael Green <gr...@e8storage.com> wrote:
>>
>> Sorry for bumping the thread. I refuse to believe there are no
>> people on this list who have successfully enabled and run RDMA
>> with Mimic. :)
>>
>> Mike
>>
>>> Hello collective wisdom,
>>>
>>> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126)
>>> mimic (stable) here.
>>>
>>> I have a working cluster here consisting of 3 monitor hosts,  64
>>> OSD processes across 4 osd hosts, plus 2 MDSs, plus 2 MGRs. All
>>> of that is consumed by 10 client nodes.
>>>
>>> Every host in the cluster, including clients is 
>>> RHEL 7.5
>>> Mellanox OFED 4.4-2.0.7.0
>>> RoCE NICs are either MCX416A-CCAT or MCX414A-CCAT @ 50Gbit/sec
>>> The NICs are all mlx5_0 port 1
>>>
>>> ring and ib_send_bw work fine both ways on any two nodes in the
>>> cluster.
>>>
>>> Full configuration of the cluster is pasted below, but RDMA
>>> related parameters are configured as following:
>>>
>>>
>>> ms_public_type = async+rdma
>>> ms_cluster = async+rdma
>>> # Exclude clients for now 
>>> ms_type = async+posix
>>>
>>> ms_async_rdma_device_name = mlx5_0
>>> ms_async_rdma_polling_us = 0
>>> ms_async_rdma_port_num=1
>>>
>>> When I try to start MON, it immediately fails as below. Anybody
>>> has seen this or could give any pointers what to/where to look next?
>>>
>>>
>>> --ceph-mon.rio.log--begin--
>>> 2018-12-12 22:35:30.011 7f515dc39140  0 set uid:gid to 167:167
>>> (ceph:ceph)
>>> 2018-12-12 22:35:30.011 7f515dc39140  0 ceph version 13.2.2
>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable),
>>> process ceph-mon, pid 2129843
>>> 2018-12-12 22:35:30.011 7f515dc39140  0 pidfile_write: ignore
>>> empty --pid-file
>>> 2018-12-12 22:35:30.036 7f515dc39140  0 load: jerasure load: lrc
>>> load: isa
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> compression = kNoCompression
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> level_compaction_dynamic_level_bytes = true
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> write_buffer_size = 33554432
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> compression = kNoCompression
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> level_compaction_dynamic_level_bytes = true
>>> 2018-12-12 22:35:30.036 7f515dc39140  0  set rocksdb option
>>> write_buffer_size = 33554432
>>> 2018-12-12 22:35:30.147 7f51442ed700  2 Event(0x55d927e95700
>>> nevent=5000 time_id=1).set_owner idx=1 owner=139987012998912
>>> 2018-12-12 22:35:30.147 7f51442ed700 10 stack operator() starting
>>> 2018-12-12 22:35:30.147 7f5143aec700  2 Event(0x55d927e95200
>>> nevent=5000 time_id=1).set_owner idx=0 owner=139987004606208
>>> 2018-12-12 22:35:30.147 7f5144aee700  2 Event(0x55d927e95c00
>>> nevent=5000 time_id=1).set_owner idx=2 owner=139987021391616
>>> 2018-12-12 22:35:30.147 7f5143aec700 10 stack operator() starting
>>> 2018-12-12 22:35:30.147 7f5144aee700 10 stack operator() starting
>>> 2018-12-12 22:35:30.147 7f515dc39140  0 starting mon.rio rank 0
>>> at public addr 192.168.1.58:6789/0 at bind addr
>>> 192.168.1.58:6789/0 mon_data /var/lib/ceph/mon/ceph-rio fsid
>>> 

[ceph-users] How are you using tuned

2018-07-12 Thread Mohamad Gebai
Hi all,

I was wondering how people are using tuned with Ceph, if at all. I
think it makes sense to enable the throughput-performance profile on OSD
nodes, and maybe the network-latency profile on mon and mgr nodes. Is
anyone using a similar configuration, and do you have any thoughts on
this approach?
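Concretely, that would be something like:

$> tuned-adm profile throughput-performance    # on the OSD nodes
$> tuned-adm profile network-latency           # on the mon/mgr nodes
$> tuned-adm active                            # confirm what ended up applied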

Thanks,
Mohamad



Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai

On 05/16/2018 07:18 AM, Uwe Sauter wrote:
> Hi Mohamad,
>
>>
>> I think this is what you're looking for:
>>
>> $> ceph daemon osd.X dump_historic_slow_ops
>>
>> which gives you recent slow operations, as opposed to
>>
>> $> ceph daemon osd.X dump_blocked_ops
>>
>> which returns current blocked operations. You can also add a filter to
>> those commands.
> Thanks for these commands. I'll have a look into those. If I understand these 
> correctly it means that I need to run these at each
> server for each OSD instead of at a central location, is that correct?
>

That's the case, as it uses the admin socket.
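A small sketch of how to collect it per host (default socket path
assumed), which could then be wrapped in ssh/ansible for a central view:

$> for sock in /var/run/ceph/ceph-osd.*.asok; do
       echo "== $sock =="
       ceph daemon "$sock" dump_historic_slow_ops
   done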

Mohamad



Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai
Hi,

On 05/16/2018 04:16 AM, Uwe Sauter wrote:
> Hi folks,
>
> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
> like to identify the OSD that is causing those events
> once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
> get this info in realtime).
>
> Collecting this information could help identify aging disks if you were able 
> to accumulate and analyze which OSD had blocking
> requests in the past and how often those events occur.
>
> My research so far lets me think that this information is only available as 
> long as the requests are actually blocked. Is this
> correct?

I think this is what you're looking for:

$> ceph daemon osd.X dump_historic_slow_ops

which gives you recent slow operations, as opposed to

$> ceph daemon osd.X dump_blocked_ops

which returns current blocked operations. You can also add a filter to
those commands.

Mohamad



Re: [ceph-users] Questions regarding hardware design of an SSD only cluster

2018-04-23 Thread Mohamad Gebai


On 04/23/2018 09:24 PM, Christian Balzer wrote:
> 
>> If anyone has some ideas/thoughts/pointers, I would be glad to hear them.
>>
> RAM, you'll need a lot of it, even more with Bluestore given the current
> caching.
> I'd say 1GB per TB storage as usual and 1-2GB extra per OSD.

Does that still stand? I was under the impression that with Bluestore,
the required RAM is mostly a function of the Bluestore cache size rather
than raw storage size (we're currently in the process of confirming this).

Mohamad

>> Regards,
>>
>> Florian


Re: [ceph-users] Ceph luminous - troubleshooting performance issues overall DSK 100%, busy 1%

2018-04-10 Thread Mohamad Gebai
Just to be clear about the issue:

You have a 3-server setup and performance is good. You add a server (with
1 OSD?) and performance goes down, is that right?

Can you give us more details? What's your complete setup? How many OSDs
per node, bluestore/filestore, WAL/DB setup, etc. You're talking about
sdb, sde, etc.. are those supposed to be OSD disks? What performance do
you see before adding the last server? And how does it compare to the
performance after? Are your OSD weights set correctly after the move
(and after data settles)?
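A couple of quick checks that might help narrow it down:

$> ceph osd perf       # per-OSD commit/apply latency; an outlier on the new server would stand out
$> ceph osd df tree    # confirm weights and utilization after the data settles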

Mohamad


On 04/05/2018 11:23 AM, Steven Vacaroaia wrote:
> Hi,
>
> I have a strange issue - OSDs from a specific server are introducing
> huge performance issue
>
> This is a brand new installation on 3 identical servers -
>  DELL R620 with PERC H710 , bluestore  DB and WAL on SSD, 10GB
> dedicated private/public networks 
>
>
> When I add the OSD I see gaps like below and huge latency  
>
> atop provides no  clear culprit EXCEPT very low network and specific
> disk utilization BUT 100% DSK for ceph-osd process  which stay like
> that ( 100%) for the duration of the test
> ( see below)
>
> Not sure why ceph-osd process  DSK stays at 100% while all the
> specific DSK ( for sdb, sde ..etc) are 1% busy ?
>
> Any help/ instructions for how to troubleshooting this will be
> appreciated 
>
> (apologies if the format is not being kept)
>
>
> CPU | sys       4%  | user      1%  |               | irq       1%  | 
>              | idle    794%  | wait      0%  |              |         
>      |  steal     0% |  guest     0% |  curf 2.20GHz |             
>  |  curscal   ?% |
> CPL | avg1    0.00  |               | avg5    0.00  | avg15   0.00  | 
>              |               |               | csw    547/s |         
>      |  intr   832/s |               |               |  numcpu     8
> |               |
> MEM | tot    62.9G  | free   61.4G  | cache 520.6M  | dirty   0.0M  |
> buff    7.5M  | slab   98.9M  | slrec  64.8M  | shmem   8.8M |  shrss 
>  0.0M |  shswp   0.0M |  vmbal   0.0M |               |  hptot   0.0M
> |  hpuse   0.0M |
> SWP | tot     6.0G  | free    6.0G  |               |               | 
>              |               |               |              |         
>      |               |               |  vmcom   1.5G |             
>  |  vmlim  37.4G |
> LVM |         dm-0  | busy      1%  |               | read     0/s  |
> write   54/s  |               | KiB/r      0  | KiB/w    455 |  MBr/s 
>   0.0 |               |  MBw/s   24.0 |  avq     3.69 |             
>  |  avio 0.14 ms |
> DSK |          sdb  | busy      1%  |               | read     0/s  |
> write  102/s  |               | KiB/r      0  | KiB/w    240 |  MBr/s 
>   0.0 |               |  MBw/s   24.0 |  avq     6.69 |             
>  |  avio 0.08 ms |
> DSK |          sda  | busy      0%  |               | read     0/s  |
> write   12/s  |               | KiB/r      0  | KiB/w      4 |  MBr/s 
>   0.0 |               |  MBw/s    0.1 |  avq     1.00 |             
>  |  avio 0.05 ms |
> DSK |          sde  | busy      0%  |               | read     0/s  |
> write    0/s  |               | KiB/r      0  | KiB/w      0 |  MBr/s 
>   0.0 |               |  MBw/s    0.0 |  avq     1.00 |             
>  |  avio 2.50 ms |
> NET | transport     | tcpi   718/s  | tcpo   972/s  | udpi     0/s  | 
>              | udpo     0/s  | tcpao    0/s  | tcppo    0/s |  tcprs 
>  21/s |  tcpie    0/s |  tcpor    0/s |               |  udpnp    0/s
> |  udpie    0/s |
> NET | network       | ipi    719/s  |               | ipo    399/s  |
> ipfrw    0/s  |               | deliv  719/s  |              |       
>        |               |               |  icmpi    0/s |             
>  |  icmpo    0/s |
> NET | eth5      1%  | pcki  2214/s  | pcko   939/s  |               |
> sp   10 Gbps  | si  154 Mbps  | so   52 Mbps  |              |  coll 
>    0/s |  mlti     0/s |  erri     0/s |  erro     0/s |  drpi     0/s
> |  drpo     0/s |
> NET | eth4      0%  | pcki   712/s  | pcko    54/s  |               |
> sp   10 Gbps  | si   50 Mbps  | so   90 Kbps  |              |  coll 
>    0/s |  mlti     0/s |  erri     0/s |  erro     0/s |  drpi     0/s
> |  drpo     0/s |
>
>     PID                                 TID                           
>    RDDSK                               WRDSK                         
>    WCANCL                               DSK                           
>   CMD       1/21
>    2067                                   -                           
>     0K/s                              0.0G/s                         
>      0K/s                              100%                           
>   ceph-osd
>
>
>
>   
>
> 2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg
> lat: 0.496822
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
>    40      16      1096      1080   107.988         0           -   
> 0.496822
>    41      16      1096      1080   

Re: [ceph-users] What do you use to benchmark your rgw?

2018-04-03 Thread Mohamad Gebai

On 03/28/2018 11:11 AM, Mark Nelson wrote:
> Personally I usually use a modified version of Mark Seger's getput
> tool here:
>
> https://github.com/markhpc/getput/tree/wip-fix-timing
>
> The difference between this version and upstream is primarily to make
> getput more accurate/useful when using something like CBT for
> orchestration instead of the included orchestration wrapper (gpsuite).
>
> CBT can use this version of getput and run relatively accurate
> mutli-client tests without requiring quite as much setup as cosbench. 
> Having said that, many folks have used cosbench effectively and I
> suspect that might be a good option for many people.  I'm not sure how
> much development is happening these days, I think the primary author
> may no longer be working on the project.
>

AFAIK the project is still alive. Adding Mark.

Mohamad


> Mark
>
> On 03/28/2018 09:21 AM, David Byte wrote:
>> I use cosbench (the last rc works well enough). I can get multiple
>> GB/s from my 6 node cluster with 2 RGWs.
>>
>> David Byte
>> Sr. Technical Strategist
>> IHV Alliances and Embedded
>> SUSE
>>
>> Sent from my iPhone. Typos are Apple's fault.
>>
>> On Mar 28, 2018, at 5:26 AM, Janne Johansson > > wrote:
>>
>>> s3cmd and cli version of cyberduck to test it end-to-end using
>>> parallelism if possible.
>>>
>>> Getting some 100MB/s at most, from 500km distance over https against
>>> 5*radosgw behind HAProxy.
>>>
>>>
>>> 2018-03-28 11:17 GMT+02:00 Matthew Vernon >> >:
>>>
>>>     Hi,
>>>
>>>     What are people here using to benchmark their S3 service (i.e.
>>>     the rgw)?
>>>     rados bench is great for some things, but doesn't tell me about
>>> what
>>>     performance I can get from my rgws.
>>>
>>>     It seems that there used to be rest-bench, but that isn't in Jewel
>>>     AFAICT; I had a bit of a look at cosbench but it looks fiddly to
>>>     set up
>>>     and a bit under-maintained (the most recent version doesn't work
>>>     out of
>>>     the box, and the PR to fix that has been languishing for a while).
>>>
>>>     This doesn't seem like an unusual thing to want to do, so I'd
>>> like to
>>>     know what other ceph folk are using (and, if you like, the
>>>     numbers you
>>>     get from the benchmarkers)...?
>>>
>>>     Thanks,
>>>
>>>     Matthew
>>>
>>>
>>> -- 
>>> May the most significant bit of your life be positive.


Re: [ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai

On 10/17/2017 09:57 AM, Sage Weil wrote:
> On Tue, 17 Oct 2017, Mohamad Gebai wrote:
>>
>> Thanks Sage. I assume that's the card you're referring to:
>> https://trello.com/c/SAtGPq0N/65-use-time-span-monotonic-for-durations
>>
>> I can take of that one if no one else has started working on it.
> That would be wonderful!  I'm pretty sure nobody else is looking at it so 
> you win today.  :)
>

Great :) Anything I should do, like add my face to the card or something?

Mohamad



Re: [ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai

On 10/17/2017 09:27 AM, Sage Weil wrote:
> On Tue, 17 Oct 2017, Mohamad Gebai wrote:
>
>> It would be good to know if there are any, and maybe prepare for them?
> Adam added a new set of clock primitives that include a monotonic clock 
> option that should be used in all cases where we're measuring the passage 
> of time instead of the wall clock time.  There is a longstanding trello 
> card to go through and change the latency calculations to use the 
> monotonic clock.  There are probably dozens of places where an ill-timed 
> clock jump is liable to trigger some random assert.  It's just a matter of 
> going through and auditing calls to the legacy ceph_clock_now() method.
>

Thanks Sage. I assume that's the card you're referring to:
https://trello.com/c/SAtGPq0N/65-use-time-span-monotonic-for-durations

I can take of that one if no one else has started working on it.

Mohamad


[ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai
Hi,

I am looking at the following issue: http://tracker.ceph.com/issues/21375

In summary, during a 'rados bench', impossible latency values (e.g.
9.00648e+07) are suddenly reported. I looked briefly at the code, it
seems CLOCK_REALTIME is used, which means that wall clock changes would
affect this output. This is a VM cluster, so the hypothesis was that the
system's clock was falling behind for some reason, then getting
readjusted (that's the only way I could reproduce the issue), which I
think is quite possible in a virtual environment.

A concern was raised: are there more critical parts of Ceph where a
clock jumping around might interfere with the behavior of the cluster?
It would be good to know if there are any, and maybe prepare for them?

Mohamad


Re: [ceph-users] Backup VM (Base image + snapshot)

2017-10-15 Thread Mohamad Gebai
Hi,

I'm not answering your questions, but I just want to point out that you
might be using the documentation for an older version of Ceph:

On 10/14/2017 12:25 PM, Oscar Segarra wrote:
>
> http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/
>

If you're not using the 'giant' version of Ceph (which has reached EOL),
you should replace that with your current version.

Mohamad


Re: [ceph-users] BlueStore Cache Ratios

2017-10-11 Thread Mohamad Gebai
Hi Jorge,

On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:
> Are .99 KV, .01 MetaData and .0 Data ratios right? they seem a little
> too disproporcionate.

Yes, this is correct.

> Also, .99 KV and a cache of 3GB for SSD means that almost the whole 3GB would
> be used for KV, but there is also another attribute called
> bluestore_cache_kv_max which is by default 512MB. So what is the rest
> of the cache used for? Nothing? Shouldn't it be a higher kv_max value or
> a lower KV ratio?

Anything over the *cache_kv_max value goes to the metadata cache. You
can look in your logs to see the final values of kv, metadata and data
cache ratios. To get data cache, you need to lower the ratios of
metadata and kv caches.
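For example, to carve out some data cache you could lower the kv/meta
shares, something like (the ratios are purely illustrative, not a
recommendation):

[osd]
bluestore_cache_size_ssd = 3221225472    # the 3GB SSD cache discussed above
bluestore_cache_kv_ratio = 0.50
bluestore_cache_meta_ratio = 0.30
# whatever is left (~0.20 here) is then used as data cache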

Mohamad


Re: [ceph-users] Luminous BlueStore EC performance

2017-09-12 Thread Mohamad Gebai
Sorry for the delay. We used the default k=2 and m=1.

Mohamad


On 09/07/2017 06:22 PM, Christian Wuerdig wrote:
> What type of EC config (k+m) was used if I may ask?
>
> On Fri, Sep 8, 2017 at 1:34 AM, Mohamad Gebai <mge...@suse.de> wrote:
>> Hi,
>>
>> These numbers are probably not as detailed as you'd like, but it's
>> something. They show the overhead of reading and/or writing to EC pools as
>> compared to 3x replicated pools using 1, 2, 8 and 16 threads (single
>> client):
>>
>>            Rep       EC
>> Threads    IOPS      IOPS      Diff      Slowdown
>> Read
>> 1          23,325    22,052    -5.46%    1.06
>> 2          27,261    27,147    -0.42%    1.00
>> 8          27,151    27,127    -0.09%    1.00
>> 16         26,793    26,728    -0.24%    1.00
>> Write
>> 1          19,444     5,708    -70.64%   3.41
>> 2          23,902     5,395    -77.43%   4.43
>> 8          23,912     5,641    -76.41%   4.24
>> 16         24,587     5,643    -77.05%   4.36
>> RW
>> 1          20,379    11,166    -45.21%   1.83
>> 2          34,246     9,525    -72.19%   3.60
>> 8          33,195     9,300    -71.98%   3.57
>> 16         31,641     9,762    -69.15%   3.24
>>
>> This is on an all-SSD cluster, with 3 OSD nodes and Bluestore. Ceph version
>> 12.1.0-671-g2c11b88d14 (2c11b88d14e64bf60c0556c6a4ec8c9eda36ff6a) luminous
>> (rc).
>>
>> Mohamad
>>
>>
>> On 09/06/2017 01:28 AM, Blair Bethwaite wrote:
>>
>> Hi all,
>>
>> (Sorry if this shows up twice - I got auto-unsubscribed and so first attempt
>> was blocked)
>>
>> I'm keen to read up on some performance comparisons for replication versus
>> EC on HDD+SSD based setups. So far the only recent thing I've found is
>> Sage's Vault17 slides [1], which have a single slide showing 3X / EC42 /
>> EC51 for Kraken. I guess there is probably some of this data to be found in
>> the performance meeting threads, but it's hard to know the currency of those
>> (typically master or wip branch tests) with respect to releases. Can anyone
>> point out any other references or highlight something that's coming?
>>
>> I'm sure there are piles of operators and architects out there at the moment
>> wondering how they could and should reconfigure their clusters once upgraded
>> to Luminous. A couple of things going around in my head at the moment:
>>
>> * We want to get to having the bulk of our online storage in CephFS on EC
>> pool/s...
>> *-- is overwrite performance on EC acceptable for near-line NAS use-cases?
>> *-- recovery implications (currently recovery on our Jewel RGW EC83 pool is
>> _way_ slower that 3X pools, what does this do to reliability? maybe split
>> capacity into multiple pools if it helps to contain failure?)
>>
>> [1]
>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in/37
>>
>> --
>> Cheers,
>> ~Blairo
>>
>>


Re: [ceph-users] Luminous BlueStore EC performance

2017-09-07 Thread Mohamad Gebai
Hi,

These numbers are probably not as detailed as you'd like, but it's
something. They show the overhead of reading and/or writing to EC pools
as compared to 3x replicated pools using 1, 2, 8 and 16 threads (single
client):

           Rep       EC
Threads    IOPS      IOPS      Diff      Slowdown
Read
1          23,325    22,052    -5.46%    1.06
2          27,261    27,147    -0.42%    1.00
8          27,151    27,127    -0.09%    1.00
16         26,793    26,728    -0.24%    1.00
Write
1          19,444     5,708    -70.64%   3.41
2          23,902     5,395    -77.43%   4.43
8          23,912     5,641    -76.41%   4.24
16         24,587     5,643    -77.05%   4.36
RW
1          20,379    11,166    -45.21%   1.83
2          34,246     9,525    -72.19%   3.60
8          33,195     9,300    -71.98%   3.57
16         31,641     9,762    -69.15%   3.24

This is on an all-SSD cluster, with 3 OSD nodes and Bluestore. Ceph
version 12.1.0-671-g2c11b88d14
(2c11b88d14e64bf60c0556c6a4ec8c9eda36ff6a) luminous (rc).

Mohamad

On 09/06/2017 01:28 AM, Blair Bethwaite wrote:
> Hi all,
>
> (Sorry if this shows up twice - I got auto-unsubscribed and so first
> attempt was blocked)
>
> I'm keen to read up on some performance comparisons for replication
> versus EC on HDD+SSD based setups. So far the only recent thing I've
> found is Sage's Vault17 slides [1], which have a single slide showing
> 3X / EC42 / EC51 for Kraken. I guess there is probably some of this
> data to be found in the performance meeting threads, but it's hard to
> know the currency of those (typically master or wip branch tests) with
> respect to releases. Can anyone point out any other references or
> highlight something that's coming?
>
> I'm sure there are piles of operators and architects out there at the
> moment wondering how they could and should reconfigure their clusters
> once upgraded to Luminous. A couple of things going around in my head
> at the moment:
>
> * We want to get to having the bulk of our online storage in CephFS on
> EC pool/s...
> *-- is overwrite performance on EC acceptable for near-line NAS use-cases?
> *-- recovery implications (currently recovery on our Jewel RGW EC83
> pool is _way_ slower that 3X pools, what does this do to reliability?
> maybe split capacity into multiple pools if it helps to contain failure?)
>
> [1] 
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in/37
>
> -- 
> Cheers,
> ~Blairo
>
>


Re: [ceph-users] RBD journaling benchmarks

2017-07-10 Thread Mohamad Gebai

On 07/10/2017 01:51 PM, Jason Dillaman wrote:
> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar  wrote:
>> These are significant differences, to the point where it may not make sense
>> to use rbd journaling / mirroring unless there is only 1 active client.
> I interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.
>

That is correct. The single job runs are more representative of the
overhead of journaling only, and it is worth noting the (expected)
inefficiency of multiple clients for the same RBD image, as explained by
Jason.

Mohamad



[ceph-users] RBD journaling benchmarks

2017-07-10 Thread Mohamad Gebai
Resending as my first try seems to have disappeared.

Hi,

We ran some benchmarks to assess the overhead caused by enabling
client-side RBD journaling in Luminous. The test consists of:
- Create an image with journaling enabled  (--image-feature journaling)
- Run randread, randwrite and randrw workloads sequentially from a
single client using fio
- Collect IOPS

More info:
- Feature exclusive-lock is enabled with journaling (required)
- Queue depth of 128 for fio
- With 1 and 2 threads
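Roughly, the commands look like this (the image name and sizes are
illustrative):

$> rbd create bench01 --size 102400 \
       --image-feature exclusive-lock,journaling --journal-splay-width 32
$> fio --name=journal-bench --ioengine=rbd --clientname=admin --pool=rbd \
       --rbdname=bench01 --rw=randwrite --bs=4k --iodepth=128 --numjobs=1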


Cluster 1


- 5 OSD nodes
- 6 OSDs per node
- 3 monitors
- All SSD
- Bluestore + WAL
- 10GbE NIC
- Ceph version 12.0.3-1380-g6984d41b5d
(6984d41b5d142ce157216b6e757bcb547da2c7d2) luminous (dev)


Results:

           Default     Journaling             Journal width 32
Jobs       IOPS        IOPS      Slowdown     IOPS      Slowdown
RW
1          19521       9104      2.1x         16067     1.2x
2          30575       726       42.1x        488       62.6x
Read
1          22775       22946     0.9x         23601     0.9x
2          35955       1078      33.3x        446       80.2x
Write
1          18515       6054      3.0x         9765      1.9x
2          29586       1188      24.9x        534       55.4x

- "Default" is the baseline (with journaling disabled)
- "Journaling" is with journaling enabled
- "Jour width 32" is with a journal data width of 32 objects
(--journal-splay-width 32)
- The major slowdown for two jobs is due to locking
- With a journal width of 32, the 0.9x slowdown (which is actually a
speedup) is due to the read-only workload, which doesn't exercise the
journaling code.
- The randwrite workload exercises the journaling code the most, and is
expected to have the highest slowdown, which is 1.9x in this case.


Cluster 2


- 3 OSD nodes
- 10 OSDs per node
- 1 monitor
- All HDD
- Filestore
- 10GbE NIC
- Ceph version 12.1.0-289-g117b171715
(117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)


Results:

           Default     Journaling             Journal width 32
Jobs       IOPS        IOPS      Slowdown     IOPS      Slowdown
RW
1          11869       3674      3.2x         4914      2.4x
2          13127       736       17.8x        432       30.4x
Read
1          14500       14700     1.0x         14703     1.0x
2          16673       3893      4.3x         307       54.3x
Write
1          8267        1925      4.3x         2591      3.2x
2          8283        1012      8.2x         417       19.9x

- The number of IOPS for the write workload is quite low, which is due
to HDDs and filestore

Mohamad
