Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Mark Nelson

On 04/06/2017 01:54 PM, Adam Carheden wrote:

60-80MB/s for what sort of setup? Is that 1GbE rather than 10GbE?


60-80MB/s per disk, assuming fairly standard 7200RPM disks, before any
replication takes place, and assuming journals are on SSDs with fast
O_DSYNC write performance.  Any network limitations may decrease that
further.  The gist of it is that you take a fairly standard
~140-150MB/s per disk and assume you get half of that due to metadata
writes, flushes, inode seeks, etc.
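
(As a side note, a quick way to check the "fast O_DSYNC write" part of a
candidate journal SSD is a small synchronous dd run. This writes straight
to the raw device, so only do it on a blank disk, and the device name
below is just a placeholder:

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

An SSD that can't sustain a healthy rate of single-threaded 4k O_DSYNC
writes will hold the whole OSD back, no matter how fast the data disk is.)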




I consistently get 80-90MB/s bandwidth as measured by `rados bench -p
rbd 10 write` run from a Ceph node on a cluster with:
* 3 nodes
* 4 OSDs/node, 600GB 15k RPM SAS disks
* 1GB disk controller write cache shared by all disks in each node
* No SSDs
* 2x1GbE LACP bond for redundancy, no jumbo frames
* 512 PGs for a cluster of 12 OSDs
* All disks in one pool of size=3, min_size=2

IOzone run on a VM using an RBD as its disk confirms that this setup maxes
out at just under 100 MB/s in best-case scenarios, so I assumed the
1Gb network was the bottleneck.
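
(A quick way to confirm that the wire really is the ceiling, if you haven't
already: run iperf between the client and one of the OSD nodes; the
hostname here is a placeholder.

  iperf -s                      # on an OSD node
  iperf -c osd-node-01 -t 30    # on the client

If that tops out around 940 Mbit/s, i.e. roughly 110-117MB/s of payload,
the rados bench numbers above are already close to line rate.)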


The network is a good guess.  With 3 1GbE nodes and 3X replication you 
aren't going to do any better than ~110MB/s.  You are a little below 
that but it's in the right ballpark.
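
(Rough arithmetic behind that number: 1GbE is 125MB/s raw, or roughly
110-117MB/s of usable TCP payload. Every byte a client writes has to cross
at least one 1GbE hop on its way to the primary OSD, and with size=3 the
two replica copies then travel over the same class of links, so a single
client on this topology can't realistically beat about one link's worth of
bandwidth.)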




I'm in the process of planning a hardware purchase for a larger cluster:
more nodes, more drives, SSD journals and 10GbE. I'm assuming I'll get
better performance.


You should, but it can be tricky to balance everything out.  Figure that
80MB/s per disk (with 7200RPM disks and SSD journals) is the typical
upper limit of what to expect with filestore on XFS, and any additional
bottlenecks may bring that down.  Some folks have started playing with
things like Intel's CAS software to improve those numbers through SSD
caching, but it's not a typical setup.
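
(As a very rough sizing sketch, with made-up numbers purely for
illustration:

  aggregate client writes ~= min( OSD count x 80MB/s / replication factor,
                                  network limit )
  e.g. 36 OSDs x 80MB/s / 3 ~= 960MB/s, vs ~1.1-1.2GB/s for one 10GbE link

so with SSD journals and 10GbE, a few dozen 7200RPM OSDs and a single
10GbE client end up reasonably well matched.)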




What's the upper bound on Ceph performance for large sequential writes
from a single client with all the recommended bells and whistles (SSD
journals, 10GbE)? I assume it depends on both the total number of OSDs
and possibly on OSDs per node, if one had enough to saturate the network,
correct?


Yep, and that's sort of tough to answer.  The fastest single-client
performance I've seen was a little over 4GB/s doing 4MB writes to an RBD
volume on 16 NVMe OSDs over 40GbE (i.e. maxing out the client link).  If
I'd had enough switch ports to bond them, I could probably have gotten
closer to 8GB/s, since the cluster was capable of it.
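
(If you want to measure that single-client ceiling on your own hardware, a
rados bench run with big objects and more parallelism is a reasonable
sketch; the pool name is a placeholder, and --no-cleanup just keeps the
benchmark objects around until you remove them:

  rados bench -p rbd 60 write -b 4194304 -t 32 --no-cleanup
  rados -p rbd cleanup

fio's rbd engine is another option if you'd rather hit an actual RBD image
than raw RADOS objects.)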


Having said that, there are a *lot* of ways to hurt performance.  Red Hat
has a reference architecture team that tests various hardware and might be
able to give you a better idea of what works well these days.


Mark


Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Adam Carheden
60-80MB/s for what sort of setup? Is that 1GbE rather than 10GbE?

I consistently get 80-90MB/s bandwidth as measured by `rados bench -p
rbd 10 write` run from a Ceph node on a cluster with:
* 3 nodes
* 4 OSDs/node, 600GB 15k RPM SAS disks
* 1GB disk controller write cache shared by all disks in each node
* No SSDs
* 2x1GbE LACP bond for redundancy, no jumbo frames
* 512 PGs for a cluster of 12 OSDs
* All disks in one pool of size=3, min_size=2

IOzone run on a VM using an RBD as its disk confirms that this setup maxes
out at just under 100 MB/s in best-case scenarios, so I assumed the
1Gb network was the bottleneck.

I'm in the process of planning a hardware purchase for a larger cluster:
more nodes, more drives, SSD journals and 10GbE. I'm assuming I'll get
better performance.

What's the upper bound on Ceph performance for large sequential writes
from a single client with all the recommended bells and whistles (SSD
journals, 10GbE)? I assume it depends on both the total number of OSDs
and possibly on OSDs per node, if one had enough to saturate the network,
correct?


-- 
Adam Carheden

On 04/06/2017 12:29 PM, Mark Nelson wrote:
> With filestore on XFS using SSD journals that have good O_DSYNC write
> performance, we typically see between 60-80MB/s per disk before
> replication for large object writes.  This is assuming there are no
> other bottlenecks or things going on though (pg splitting, recovery,
> network issues, etc).  Probably the best case scenario would be large
> writes to an RBD volume with 4MB objects and enough PGs in the pool that
> splits never need to happen.
> 
> Having said that, on setups where some of the drives are slow, the
> network is misconfigured, there are too few PGs, there are too many
> drives on one controller, or other issues, 25-30MB/s per disk is
> certainly possible.
> 
> Mark
> 
> On 04/06/2017 10:05 AM, Stanislav Kopp wrote:
>> I've reduced the OSDs to 12 and moved the journals to SSD drives, and
>> now writes have been "boosted" to ~33-35MB/s. Is that the maximum
>> without full SSD pools?
>>
>> Best,
>> Stan
>>
>> 2017-04-06 9:34 GMT+02:00 Stanislav Kopp :
>>> Hello,
>>>
>>> I'm evaluating a Ceph cluster to see if we can use it for our
>>> virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
>>> 16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
>>> drive (XFS), the MONITORs are installed on the same nodes, and all
>>> nodes are connected via a 10G switch.
>>>
>>> The problem is that on the client I get only ~25-30 MB/s for sequential
>>> writes (dd with "oflag=direct"). Proxmox uses Firefly, which is old, I
>>> know. But I get the same performance on my desktop running the same
>>> version as the Ceph nodes with an rbd mount, and iperf shows full line
>>> speed (1Gb or 10Gb, depending on the client).
>>> I know that this setup is not optimal and that for production I will
>>> use separate MON nodes and SSDs for the OSDs, but I was wondering
>>> whether this performance is still normal. This is my cluster status:
>>>
>>>  cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
>>>  health HEALTH_OK
>>>  monmap e5: 3 mons at
>>> {ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
>>>
>>> election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
>>>  osdmap e570: 42 osds: 42 up, 42 in
>>> flags sortbitwise,require_jewel_osds
>>>   pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
>>> 74743 MB used, 305 TB / 305 TB avail
>>> 1024 active+clean
>>>
>>> BTW, a bench run on the nodes themselves looks good as far as I can see.
>>>
>>> ceph01:~# rados bench -p rbd 10 write
>>> 
>>> Total time run: 10.159667
>>> Total writes made:  1018
>>> Write size: 4194304
>>> Object size:4194304
>>> Bandwidth (MB/sec): 400.801
>>> Stddev Bandwidth:   38.2018
>>> Max bandwidth (MB/sec): 472
>>> Min bandwidth (MB/sec): 344
>>> Average IOPS:   100
>>> Stddev IOPS:9
>>> Max IOPS:   118
>>> Min IOPS:   86
>>> Average Latency(s): 0.159395
>>> Stddev Latency(s):  0.110994
>>> Max latency(s): 1.1069
>>> Min latency(s): 0.0432668
>>>
>>>
>>> Thanks,
>>> Stan


Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Mark Nelson
With filestore on XFS using SSD journals that have good O_DSYNC write 
performance, we typically see between 60-80MB/s per disk before 
replication for large object writes.  This is assuming there are no 
other bottlenecks or things going on though (pg splitting, recovery, 
network issues, etc).  Probably the best case scenario would be large 
writes to an RBD volume with 4MB objects and enough PGs in the pool that 
splits never need to happen.
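
(To reproduce that best case against an RBD volume, something along these
lines should work; the image name is a placeholder and exact option names
can vary a little between releases, so treat it as a sketch:

  rbd create benchtest --size 10240
  rbd bench-write benchtest --io-size 4194304 --io-threads 16 \
      --io-total 1073741824 --io-pattern seq
  rbd rm benchtest

4MB sequential writes with a modest thread count is about the friendliest
workload filestore ever sees.)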


Having said that, on setups where some of the drives are slow, the 
network is misconfigured, there are too few PGs, there are too many 
drives on one controller, or other issues, 25-30MB/s per disk is 
certainly possible.


Mark

On 04/06/2017 10:05 AM, Stanislav Kopp wrote:

I've reduced the OSDs to 12 and moved the journals to SSD drives, and
now writes have been "boosted" to ~33-35MB/s. Is that the maximum
without full SSD pools?

Best,
Stan

2017-04-06 9:34 GMT+02:00 Stanislav Kopp :

Hello,

I'm evaluating a Ceph cluster to see if we can use it for our
virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
drive (XFS), the MONITORs are installed on the same nodes, and all
nodes are connected via a 10G switch.

The problem is that on the client I get only ~25-30 MB/s for sequential
writes (dd with "oflag=direct"). Proxmox uses Firefly, which is old, I
know. But I get the same performance on my desktop running the same
version as the Ceph nodes with an rbd mount, and iperf shows full line
speed (1Gb or 10Gb, depending on the client).
I know that this setup is not optimal and that for production I will use
separate MON nodes and SSDs for the OSDs, but I was wondering whether
this performance is still normal. This is my cluster status:

 cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
 health HEALTH_OK
 monmap e5: 3 mons at
{ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
 osdmap e570: 42 osds: 42 up, 42 in
flags sortbitwise,require_jewel_osds
  pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
74743 MB used, 305 TB / 305 TB avail
1024 active+clean

BTW, a bench run on the nodes themselves looks good as far as I can see.

ceph01:~# rados bench -p rbd 10 write

Total time run: 10.159667
Total writes made:  1018
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 400.801
Stddev Bandwidth:   38.2018
Max bandwidth (MB/sec): 472
Min bandwidth (MB/sec): 344
Average IOPS:   100
Stddev IOPS:9
Max IOPS:   118
Min IOPS:   86
Average Latency(s): 0.159395
Stddev Latency(s):  0.110994
Max latency(s): 1.1069
Min latency(s): 0.0432668


Thanks,
Stan



Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Pasha
Also make sure your PG counts per pool and across the entire cluster are
correct: you want roughly 50-100 PGs per OSD in total, otherwise
performance can be impacted.  Also, if the cluster is new, it might take a
little while to rebalance and become 100% available; until then, speed can
be affected too.
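
(Checking is quick; the rule of thumb below is the usual guidance rather
than anything exact:

  ceph osd pool ls detail        # shows pg_num for every pool
  ceph osd pool get rbd pg_num

A common starting point is total PGs across all pools ~= (OSD count x 100)
/ replica count, rounded up to the next power of two.)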


Those are a couple of issues I ran into recently myself, so I thought I'd
share them with you too.




On 2017-04-06 12:40 AM, Piotr Dałek wrote:

On 04/06/2017 09:34 AM, Stanislav Kopp wrote:

Hello,

I'm evaluating a Ceph cluster to see if we can use it for our
virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
drive (XFS), the MONITORs are installed on the same nodes, and all
nodes are connected via a 10G switch.

The problem is that on the client I get only ~25-30 MB/s for sequential
writes (dd with "oflag=direct"). [..]


The 8TB size suggests these are some kind of "archive" (SMR) drives.
Is that correct? If so, you may want to use non-SMR drives, because 
Ceph is not optimized for those.






Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Stanislav Kopp
I've reduced the OSDs to 12 and moved the journals to SSD drives, and
now writes have been "boosted" to ~33-35MB/s. Is that the maximum
without full SSD pools?

Best,
Stan

2017-04-06 9:34 GMT+02:00 Stanislav Kopp :
> Hello,
>
> I'm evaluating a Ceph cluster to see if we can use it for our
> virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
> 16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
> drive (XFS), the MONITORs are installed on the same nodes, and all
> nodes are connected via a 10G switch.
>
> The problem is that on the client I get only ~25-30 MB/s for sequential
> writes (dd with "oflag=direct"). Proxmox uses Firefly, which is old, I
> know. But I get the same performance on my desktop running the same
> version as the Ceph nodes with an rbd mount, and iperf shows full line
> speed (1Gb or 10Gb, depending on the client).
> I know that this setup is not optimal and that for production I will use
> separate MON nodes and SSDs for the OSDs, but I was wondering whether
> this performance is still normal. This is my cluster status:
>
>  cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
>  health HEALTH_OK
>  monmap e5: 3 mons at
> {ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
> election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
>  osdmap e570: 42 osds: 42 up, 42 in
> flags sortbitwise,require_jewel_osds
>   pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
> 74743 MB used, 305 TB / 305 TB avail
> 1024 active+clean
>
> BTW, a bench run on the nodes themselves looks good as far as I can see.
>
> ceph01:~# rados bench -p rbd 10 write
> 
> Total time run: 10.159667
> Total writes made:  1018
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 400.801
> Stddev Bandwidth:   38.2018
> Max bandwidth (MB/sec): 472
> Min bandwidth (MB/sec): 344
> Average IOPS:   100
> Stddev IOPS:9
> Max IOPS:   118
> Min IOPS:   86
> Average Latency(s): 0.159395
> Stddev Latency(s):  0.110994
> Max latency(s): 1.1069
> Min latency(s): 0.0432668
>
>
> Thanks,
> Stan


Re: [ceph-users] slow performance: sanity check

2017-04-06 Thread Piotr Dałek

On 04/06/2017 09:34 AM, Stanislav Kopp wrote:

Hello,

I'm evaluating a Ceph cluster to see if we can use it for our
virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
drive (XFS), the MONITORs are installed on the same nodes, and all
nodes are connected via a 10G switch.

The problem is that on the client I get only ~25-30 MB/s for sequential
writes (dd with "oflag=direct"). [..]


The 8TB size suggests these are some kind of "archive" (SMR) drives. Is
that correct? If so, you may want to use non-SMR drives, because Ceph is not 
optimized for those.
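
(One way to check, assuming the drives show up as ordinary SATA devices;
the device names below are placeholders:

  lsblk -d -o NAME,MODEL,SIZE,ROTA
  smartctl -i /dev/sda

Drive-managed SMR usually isn't flagged explicitly, so the model string is
the most reliable clue; look it up against the vendor's spec sheet.)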


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/


[ceph-users] slow performance: sanity check

2017-04-06 Thread Stanislav Kopp
Hello,

I'm evaluating a Ceph cluster to see if we can use it for our
virtualization solution (Proxmox). I'm using 3 nodes running Ubuntu
16.04 with stock Ceph (10.2.6), every OSD uses a separate 8 TB spinning
drive (XFS), the MONITORs are installed on the same nodes, and all
nodes are connected via a 10G switch.

The problem is that on the client I get only ~25-30 MB/s for sequential
writes (dd with "oflag=direct"). Proxmox uses Firefly, which is old, I
know. But I get the same performance on my desktop running the same
version as the Ceph nodes with an rbd mount, and iperf shows full line
speed (1Gb or 10Gb, depending on the client).
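
For reference, a dd run of roughly this shape is what's being described
(the target path below is just a placeholder). With oflag=direct the block
size matters a lot: bs=4M matches the default RBD object size, while a
small bs can by itself explain very low numbers.

  dd if=/dev/zero of=/mnt/rbd-test/ddfile bs=4M count=1024 oflag=direct
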
I know that this setup is not optimal and that for production I will use
separate MON nodes and SSDs for the OSDs, but I was wondering whether this
performance is still normal. This is my cluster status:

 cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
 health HEALTH_OK
 monmap e5: 3 mons at
{ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
 osdmap e570: 42 osds: 42 up, 42 in
flags sortbitwise,require_jewel_osds
  pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
74743 MB used, 305 TB / 305 TB avail
1024 active+clean

BTW, a bench run on the nodes themselves looks good as far as I can see.

ceph01:~# rados bench -p rbd 10 write

Total time run: 10.159667
Total writes made:  1018
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 400.801
Stddev Bandwidth:   38.2018
Max bandwidth (MB/sec): 472
Min bandwidth (MB/sec): 344
Average IOPS:   100
Stddev IOPS:9
Max IOPS:   118
Min IOPS:   86
Average Latency(s): 0.159395
Stddev Latency(s):  0.110994
Max latency(s): 1.1069
Min latency(s): 0.0432668


Thanks,
Stan