Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Виталий Филиппов
Is that a question for me or for Victor? :-)

I did test my drives; Intel NVMes are capable of something like 95100
single-thread iops.

On 10 March 2019 at 01:31:15 GMT+03:00, Martin Verges wrote:
>Hello,
>
>did you test the performance of your individual drives?
>
>Here is a small snippet:
>-
>DRIVE=/dev/XXX
>smartctl -a $DRIVE
>for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE
>--direct=1
>--sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60
>--time_based --group_reporting --name=journal-test; done
>-
>
>Please share the results so that we know what's possible with your
>hardware.
>
>--
>Martin Verges
>Managing director
>
>Mobile: +49 174 9335695
>E-Mail: martin.ver...@croit.io
>Chat: https://t.me/MartinVerges
>
>croit GmbH, Freseniusstr. 31h, 81247 Munich
>CEO: Martin Verges - VAT-ID: DE310638492
>Com. register: Amtsgericht Munich HRB 231263
>
>Web: https://croit.io
>YouTube: https://goo.gl/PGE1Bx
>
>Vitaliy Filippov wrote on Sat., 9 March 2019,
>21:09:
>
>> There are 2:
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> The first measures your min possible latency - it does not scale with
>the
>> number of OSDs at all, but it's usually what real applications like
>> DBMSes
>> need.
>>
>> The second measures your max possible random write throughput which
>you
>> probably won't be able to utilize if you don't have enough VMs all
>> writing
>> in parallel.
>>
>> --
>> With best regards,
>>Vitaliy Filippov
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

-- 
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-03-09 Thread Kári Bertilsson
Thanks

I did apply https://github.com/ceph/ceph/pull/26179.

Running manual upmap commands works now. I ran "ceph balancer optimize
new" and it did add a few upmaps.

But now there is another issue: distribution is far from perfect, but the balancer
can't find any further optimization.
Specifically, OSD 23 is getting way more PGs than the other 3 TB OSDs.

See https://pastebin.com/f5g5Deak
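
In case it helps, a few read-only commands along these lines can show how
skewed the PG distribution actually is (this assumes the mgr balancer module
is enabled; the OSD id is taken from the message above):

ceph osd df tree               # PG count and %USE per OSD
ceph balancer status           # balancer mode and any queued plans
ceph balancer eval             # score of the current distribution (lower is better)
ceph pg ls-by-osd 23 | wc -l   # rough count of PGs currently mapped to osd.23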

On Fri, Mar 1, 2019 at 10:25 AM  wrote:

> > Backports should be available in v12.2.11.
>
> s/v12.2.11/v12.2.12/
>
> Sorry for the typo.
>
>
>
>
> Original message
> *From:* 谢型果10072465
> *To:* d...@vanderster.com;
> *Cc:* ceph-users@lists.ceph.com;
> *Date:* 2019-03-01 17:09
> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
>
> See https://github.com/ceph/ceph/pull/26179
>
> Backports should be available in v12.2.11.
>
> Or you can manually do it by simply adopting
> https://github.com/ceph/ceph/pull/26127 if you are eager to get out of
> the trap right now.
>
> *From:* Dan van der Ster
> *To:* Kári Bertilsson;
> *Cc:* ceph-users; 谢型果10072465;
> *Date:* 2019-03-01 14:48
> *Subject:* *Re: [ceph-users] ceph osd pg-upmap-items not working*
> It looks like that somewhat unusual crush rule is confusing the new
> upmap cleaning.
> (debug_mon 10 on the active mon should show those cleanups).
>
> I'm copying Xie Xingguo, and probably you should create a tracker for this.
>
> -- dan
>
>
>
>
> On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson  > wrote:
> >
> > This is the pool
>
> > pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
> > rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
> > hashpspool,ec_overwrites stripe_width 32768 application cephfs
> >removed_snaps [1~5]
> >
> > Here is the relevant crush rule:
>
> > rule ec_pool {
> >     id 1
> >     type erasure
> >     min_size 3
> >     max_size 10
> >     step set_chooseleaf_tries 5
> >     step set_choose_tries 100
> >     step take default class hdd
> >     step choose indep 5 type host
> >     step choose indep 2 type osd
> >     step emit
> > }
> >
>
> > Both OSD 23 and 123 are in the same host, so this change should be
> > perfectly acceptable under the rule set.
>
> > Something must be blocking the change, but I can't find anything about it
> > in any logs.
> >
> > - Kári
> >
> > On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster  > wrote:
> >>
> >> Hi,
> >>
> >> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
> >> E.g., it now won't let you put two PGs in the same rack if the crush
> >> rule doesn't allow it.
> >>
>
> >> Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
> >>
> >> -- dan
> >>
> >>
> >> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  > wrote:
> >> >
> >> > Hello
> >> >
>
> >> > I am trying to diagnose why upmap stopped working where it was 
> >> > previously working fine.
> >> >
> >> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
> >> >
> >> > # ceph osd pg-upmap-items 41.1 23 123
> >> > set 41.1 pg_upmap_items mapping to [23->123]
> >> >
>
> >> > No rebalancing happens, and if I run it again it shows the same output
> >> > every time.
> >> >
> >> > I have in config
> >> > debug mgr = 4/5
> >> > debug mon = 4/5
> >> >
> >> > Paste from mon & mgr logs. Also output from "ceph osd dump"
> >> > https://pastebin.com/9VrT4YcU
> >> >
> >> >
>
> >> > I have run "ceph osd set-require-min-compat-client luminous" a long time
> >> > ago, and all servers running Ceph have been rebooted numerous times
> >> > since then.
>
> >> > But somehow I am still seeing "min_compat_client jewel". I believe that
> >> > upmap was previously working anyway with that "jewel" line present.
> >> >
>
> >> > I see no indication in any logs why the upmap commands are being ignored.
> >> >
> >> > Any suggestions on how to debug further or what could be the issue ?
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-03-09 Thread Pavan Rallabhandi
That can happen if you have a lot of objects with Swift object expiry (TTL)
enabled. You can run 'listomapkeys' on these log pool objects and check for the
objects that have been registered for TTL as omap entries. I know this is the case
with at least the Jewel version.

Thanks,
-Pavan.

On 3/7/19, 10:09 PM, "ceph-users on behalf of Brad Hubbard" 
 wrote:

On Fri, Mar 8, 2019 at 4:46 AM Samuel Taylor Liston  
wrote:
>
> Hello All,
> I have recently had 32 large omap objects appear in my
default.rgw.log pool. Running luminous 12.2.8.
>
> Not sure what to think about these. I’ve done a lot of reading
about how, when these normally occur, it is related to a bucket needing
resharding, but it doesn’t look like my default.rgw.log pool has anything in
it, let alone buckets. Here’s some info on the system:
>
> [root@elm-rgw01 ~]# ceph versions
> {
> "mon": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 5
> },
> "mgr": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 1
> },
> "osd": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 192
> },
> "mds": {},
> "rgw": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 1
> },
> "overall": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 199
> }
> }
> [root@elm-rgw01 ~]# ceph osd pool ls
> .rgw.root
> default.rgw.control
> default.rgw.meta
> default.rgw.log
> default.rgw.buckets.index
> default.rgw.buckets.non-ec
> default.rgw.buckets.data
> [root@elm-rgw01 ~]# ceph health detail
> HEALTH_WARN 32 large omap objects
> LARGE_OMAP_OBJECTS 32 large omap objects
> 32 large objects found in pool 'default.rgw.log'
> Search the cluster log for 'Large omap object found' for more 
details.
>
> Looking closer at these objects, they are all of size 0. Also, that pool
shows a capacity usage of 0:

The size here relates to data size. Object map (omap) data is metadata
so an object of size 0 can have considerable omap data associated with
it (the omap data is stored separately from the object in a key/value
database). The large omap warning in health detail output should tell
you " "Search the cluster log for 'Large omap object found' for more
details." If you do that you should get the names of the specific
objects involved. You can then use the rados commands listomapkeys and
listomapvals to see the specifics of the omap data. Someone more
familiar with rgw can then probably help you out on what purpose they
serve.

HTH.
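
As a minimal sketch of that inspection, using one of the object names from the
listing quoted below (the output handling is only illustrative):

rados -p default.rgw.log listomapkeys obj_delete_at_hint.78 | wc -l   # number of omap keys
rados -p default.rgw.log listomapkeys obj_delete_at_hint.78 | head    # sample of the key names
rados -p default.rgw.log listomapvals obj_delete_at_hint.78 | head    # sample key/value pairs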

>
> (just a sampling of the 236 objects at size 0)
>
> [root@elm-mon01 ceph]# for i in `rados ls -p default.rgw.log`; do echo 
${i}; rados stat -p default.rgw.log ${i};done
> obj_delete_at_hint.78
> default.rgw.log/obj_delete_at_hint.78 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.70
> default.rgw.log/obj_delete_at_hint.70 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.000104
> default.rgw.log/obj_delete_at_hint.000104 mtime 2019-03-07 
11:39:20.00, size 0
> obj_delete_at_hint.26
> default.rgw.log/obj_delete_at_hint.26 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.28
> default.rgw.log/obj_delete_at_hint.28 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.40
> default.rgw.log/obj_delete_at_hint.40 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.15
> default.rgw.log/obj_delete_at_hint.15 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.69
> default.rgw.log/obj_delete_at_hint.69 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.95
> default.rgw.log/obj_delete_at_hint.95 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.03
> default.rgw.log/obj_delete_at_hint.03 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.47
> default.rgw.log/obj_delete_at_hint.47 mtime 2019-03-07 
11:39:19.00, size 0
>
>
> [root@elm-mon01 ceph]# rados df
> POOL_NAME  USEDOBJECTS   CLONES COPIES 
MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPSRD  WR_OPSWR
> .rgw.root  1.09KiB 4  0 12
  0   00 14853 9.67MiB 0 0B
> default.rgw.buckets.data444TiB 166829939  0 1000979634
  0   00 362357590  859TiB 909188749 703TiB
  

Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Martin Verges
Hello,

did you test the performance of your individual drives?

Here is a small snippet:
-
DRIVE=/dev/XXX
smartctl -a $DRIVE
for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE --direct=1
--sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60
--time_based --group_reporting --name=journal-test; done
-

Please share the results so that we know what's possible with your hardware.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

Vitaliy Filippov wrote on Sat., 9 March 2019, 21:09:

> There are 2:
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> The first measures your min possible latency - it does not scale with the
> number of OSDs at all, but it's usually what real applications like
> DBMSes
> need.
>
> The second measures your max possible random write throughput which you
> probably won't be able to utilize if you don't have enough VMs all
> writing
> in parallel.
>
> --
> With best regards,
>Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH ISCSI Gateway

2019-03-09 Thread Mike Christie
On 03/07/2019 09:22 AM, Ashley Merrick wrote:
> Been reading into the gateway, and noticed it’s been mentioned a few
> times it can be installed on OSD servers.
> 
> I am guessing therefore there would be no issues like those sometimes mentioned
> when using kRBD on an OSD node, apart from the extra resources required
> from the hardware.
> 

That is correct. You might have a similar issue if you were to run the
iscsi gw/target, OSD and then also run the iscsi initiator that logs
into the iscsi gw/target all on the same node. I don't think any use
case like that has ever come up though.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] prioritize degraded objects over misplaced

2019-03-09 Thread Fabio Abreu
HI Everybody,

I have a question about degraded objects in the Jewel 10.2.7 version: can I
prioritize the recovery of degraded objects over misplaced ones?

I am asking this because I am trying to simulate a disaster recovery scenario.


Thanks and best regards,
Fabio Abreu Reis
http://fajlinux.com.br
*Tel : *+55 21 98244-0161
*Skype : *fabioabreureis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

There are 2:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite  
-pool=bench -rbdname=testimg


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite  
-pool=bench -rbdname=testimg


The first measures your min possible latency - it does not scale with the  
number of OSDs at all, but it's usually what real applications like DBMSes  
need.


The second measures your max possible random write throughput which you  
probably won't be able to utilize if you don't have enough VMs all writing  
in parallel.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Victor Hooi
Hi,

I have retested with 4K blocks - results are below.

I am currently using 4 OSDs per Optane 900P drive. This was based on some
posts I found on Proxmox Forums, and what seems to be "tribal knowledge"
there.

I also saw this presentation, which mentions on page 14:

2-4 OSDs/NVMe SSD and 4-6 NVMe SSDs per node are sweet spots


Has anybody done much testing with pure Optane drives for Ceph? (Paper
above seems to use them mixed with traditional SSDs).

Would increasing the number of OSDs help in this scenario? I am happy to
try that - I assume I will need to blow away all the existing OSDs/Ceph
setup and start again, of course.
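
If it does come to rebuilding, a rough sketch of carving one NVMe into several
OSDs with ceph-volume is shown below (the device path is a placeholder, the
flags should be checked against your Ceph/Proxmox version, and zap is
destructive):

ceph-volume lvm zap /dev/nvme0n1 --destroy              # wipes the old OSD's device
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1  # creates 4 OSDs on it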

Here are the rados bench results with 4K - the write IOPS are still a tad
short of 15,000 - is that what I should be aiming for?

Write result:

# rados bench -p proxmox_vms 60 write -b 4K -t 16 --no-cleanup
Total time run: 60.001016
Total writes made:  726749
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 47.3136
Stddev Bandwidth:   2.16408
Max bandwidth (MB/sec): 48.7344
Min bandwidth (MB/sec): 38.5078
Average IOPS:   12112
Stddev IOPS:554
Max IOPS:   12476
Min IOPS:   9858
Average Latency(s): 0.00132019
Stddev Latency(s):  0.000670617
Max latency(s): 0.065541
Min latency(s): 0.000689406


Sequential read result:

# rados bench -p proxmox_vms  60 seq -t 16
Total time run:   17.098593
Total reads made: 726749
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   166.029
Average IOPS: 42503
Stddev IOPS:  218
Max IOPS: 42978
Min IOPS: 42192
Average Latency(s):   0.000369021
Max latency(s):   0.00543175
Min latency(s):   0.000170024


Random read result:

# rados bench -p proxmox_vms 60 rand -t 16
Total time run:   60.000282
Total reads made: 2708799
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   176.353
Average IOPS: 45146
Stddev IOPS:  310
Max IOPS: 45754
Min IOPS: 44506
Average Latency(s):   0.000347637
Max latency(s):   0.00457886
Min latency(s):   0.000138381


I am happy to try with fio -ioengine=rbd (the reason I use rados bench is
because that is what was used in the Proxmox Ceph benchmark paper);
however, is there a common community-suggested starting command line that
makes it easy to compare results? (fio seems quite complex in terms of
options.)
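
For what it's worth, a reasonable starting point is simply the two rbd-engine
invocations Vitaliy posted elsewhere in this thread, with a fixed runtime added
here so runs are comparable (the pool/image names "bench"/"testimg" are
placeholders):

fio -ioengine=rbd -direct=1 -name=qd1 -bs=4k -iodepth=1 -rw=randwrite -pool=bench -rbdname=testimg -runtime=60 -time_based
fio -ioengine=rbd -direct=1 -name=qd128 -bs=4k -iodepth=128 -rw=randwrite -pool=bench -rbdname=testimg -runtime=60 -time_based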

Thanks,
Victor

On Sun, Mar 10, 2019 at 6:15 AM Vitaliy Filippov  wrote:

> Welcome to our "slow ceph" party :)))
>
> However I have to note that:
>
> 1) 500,000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
> That's kind of an unfair comparison.
>
> 2) fio -ioengine=rbd is better than rados bench for testing.
>
> 3) You can't "compensate" for Ceph's overhead even by having infinitely
> fast disks.
>
> At its simplest, imagine that disk I/O takes X microseconds and Ceph's
> overhead is Y for a single operation.
>
> Suppose there is no parallelism. Then raw disk IOPS = 1,000,000/X and Ceph
> IOPS = 1,000,000/(X+Y). Y is currently quite long, something around 400-800
> microseconds or so. So the best IOPS number you can squeeze out of a
> single client thread (a DBMS, for example) is 1,000,000/400 = only ~2500
> iops.
>
> Parallel iops are of course better, but still you won't get anything close
> to 500,000 iops from a single OSD. The expected number is around 15000.
> Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you
> want better results.
>
> --
> With best regards,
>Vitaliy Filippov
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

Welcome to our "slow ceph" party :)))

However I have to note that:

1) 500,000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
That's kind of an unfair comparison.


2) fio -ioengine=rbd is better than rados bench for testing.

3) You can't "compensate" for Ceph's overhead even by having infinitely  
fast disks.


At its simplest, imagine that disk I/O takes X microseconds and Ceph's  
overhead is Y for a single operation.


Suppose there is no parallelism. Then raw disk IOPS = 1,000,000/X and Ceph
IOPS = 1,000,000/(X+Y). Y is currently quite long, something around 400-800
microseconds or so. So the best IOPS number you can squeeze out of a
single client thread (a DBMS, for example) is 1,000,000/400 = only ~2500
iops.
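
As a tiny worked example of that arithmetic (the 10 microsecond raw latency is
only an assumed figure for illustration):

X=10; Y=400   # microseconds: assumed raw device latency and Ceph per-op overhead
echo "raw disk QD1 iops: $(( 1000000 / X ))"          # 100000
echo "ceph QD1 iops:     $(( 1000000 / (X + Y) ))"    # ~2439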


Parallel iops are of course better, but still you won't get anything close
to 500,000 iops from a single OSD. The expected number is around 15000.
Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you  
want better results.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rocksdb ceph bluestore

2019-03-09 Thread Vasiliy Tolstov
Hi, I'm interested in the implementation: how does Ceph store the WAL on a
raw block device with RocksDB?
As far as I know, RocksDB uses a filesystem to store its files. I found BlobDB
in the RocksDB utilities code - does Ceph use it?
Or how does Ceph use RocksDB to put key/value data on a raw block device?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OpenStack with Ceph RDMA

2019-03-09 Thread Lazuardi Nasution
Hi,

I'm looking for information about where Ceph's RDMA messaging happens: on the
cluster network, the public network, or both (it seems both, CMIIW)?
I'm talking about the configuration of ms_type, ms_cluster_type and
ms_public_type.

In case of OpenStack integration with RBD, which of the above three is
possible? In this case, should I still separate the cluster network and the
public network?
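
For context, a sketch of how those three options might be combined in ceph.conf
(values are illustrative only, not a tested recommendation):

[global]
ms_type = async+posix            # default messenger where not overridden
ms_cluster_type = async+rdma     # RDMA on the cluster (replication/recovery) network
ms_public_type = async+posix     # TCP on the public (client-facing) network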

Best regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Victor Hooi
Hi Ashley,

Right - so the 50% bandwidth is OK, I guess, but it was more the drop in
IOPS that was concerning (hence the subject line about 200 IOPS) *sad face*.

That, and the Optane drives weren't exactly cheap, and I was hoping they
would compensate for the overhead of Ceph.

Each Optane drive is capable of 550,000 IOPS (random read)
and 500,000 IOPS (random write). Yet we're seeing it drop to around 0.04% of
that in testing (200 IOPS). Is that sort of drop in IOPS normal for Ceph?

Each node can take up to 8 x 2.5" drives. If I loaded up, say, 4 cheap SSDs
in each (e.g. Intel S3700 SSD) instead of one Optane drive per node, would
that give better performance with 4 x 3 = 12 drives? (Would I still put 4
OSDs per physical drive?) Or is there some way to supplement the Optanes with
SSDs? (Although I would assume any SSD I get is going to be slower than an
Optane drive.)

Or are there tweaks I can make to either the configuration or our layout that
could eke out more IOPS?

(This is going to be used for VM hosting, so IOPS is definitely a concern).

Thanks,
Victor

On Sat, Mar 9, 2019 at 9:27 PM Ashley Merrick 
wrote:

> What kind of results are you expecting?
>
> Looking at the specs they are "up to" 2000 MB/s write and 2500 MB/s read, so
> you're around 50-60% of the max "up to" speed, which I wouldn't say is too bad
> given that Ceph / BlueStore has an overhead, especially when using a single
> disk for DB, WAL and content.
>
> Remember Ceph scales with the number of physical disks you have. As you
> only have 3 disks, every piece of I/O is hitting all 3 disks; if you had 6
> disks, for example, and still did replication of 3, then only 50% of I/O
> would be hitting each disk, so I'd expect to see performance jump.
>
> On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi  wrote:
>
>> Hi,
>>
>> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
>> based around Intel Optane 900P drives (which are meant to be the bee's
>> knees), and I'm seeing pretty low IOPS/bandwidth.
>>
>>- 3 nodes, each running a Ceph monitor daemon, and OSDs.
>>- Node 1 has 48 GB of RAM and 10 cores (Intel 4114
>>
>> ),
>>and Node 2 and 3 have 32 GB of RAM and 4 cores (Intel E3-1230V6
>>
>> 
>>)
>>- Each node has a Intel Optane 900p (480GB) NVMe
>>
>> 
>>  dedicated
>>for Ceph.
>>- 4 OSDs per node (total of 12 OSDs)
>>- NICs are Intel X520-DA2
>>
>> ,
>>with 10GBASE-LR going to a Unifi US-XG-16
>>.
>>- First 10GB port is for Proxmox VM traffic, second 10GB port is for
>>Ceph traffic.
>>
>> I created a new Ceph pool specifically for benchmarking with 128 PGs.
>>
>> Write results:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16
>> --no-cleanup
>> 
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>> lat(s)
>>60  16 12258 12242   816.055   788   0.0856726
>> 0.0783458
>> Total time run: 60.069008
>> Total writes made:  12258
>> Write size: 4194304
>> Object size:4194304
>> Bandwidth (MB/sec): 816.261
>> Stddev Bandwidth:   17.4584
>> Max bandwidth (MB/sec): 856
>> Min bandwidth (MB/sec): 780
>> Average IOPS:   204
>> Stddev IOPS:4
>> Max IOPS:   214
>> Min IOPS:   195
>> Average Latency(s): 0.0783801
>> Stddev Latency(s):  0.0468404
>> Max latency(s): 0.437235
>> Min latency(s): 0.0177178
>>
>>
>> Sequential read results - I don't know why this only ran for 32 seconds?
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
>> 
>> Total time run:   32.608549
>> Total reads made: 12258
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   1503.65
>> Average IOPS: 375
>> Stddev IOPS:  22
>> Max IOPS: 410
>> Min IOPS: 326
>> Average Latency(s):   0.0412777
>> Max latency(s):   0.498116
>> Min latency(s):   0.00447062
>>
>>
>> Random read result:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
>> 
>> Total time run:   60.066384
>> Total reads made: 22819
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   1519.59
>> Average IOPS: 379
>> Stddev IOPS:  21
>> Max IOPS: 424
>> Min IOPS: 320
>> Average Latency(s):   0.0408697
>> Max latency(s)

Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Konstantin Shalygin

These results (800 MB/s writes, 1500 MB/s reads, and 200 write IOPS, 400
read IOPS) seem incredibly low - particularly considering what the Optane
900p is meant to be capable of.

Is this in line with what you might expect on this hardware with Ceph
though?

Or is there some way to find out the source of bottleneck?


4 MByte * 200 IOPS = 800 MB/s. What bottleneck exactly do you mean?

Try to use 4K instead of 4M for the IOPS load.
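
For example, the rados bench runs from the original post can be repeated with a
4K block size, along the lines of:

rados bench -p benchmarking 60 write -b 4K -t 16 --no-cleanup
rados bench -p benchmarking 60 rand -t 16
rados -p benchmarking cleanup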


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Ashley Merrick
What kind of results are you expecting?

Looking at the specs they are "up to" 2000 MB/s write and 2500 MB/s read, so
you're around 50-60% of the max "up to" speed, which I wouldn't say is too bad
given that Ceph / BlueStore has an overhead, especially when using a single
disk for DB, WAL and content.

Remember Ceph scales with the number of physical disks you have. As you
only have 3 disks, every piece of I/O is hitting all 3 disks; if you had 6
disks, for example, and still did replication of 3, then only 50% of I/O
would be hitting each disk, so I'd expect to see performance jump.
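
A back-of-envelope sketch of that scaling argument (it assumes a replicated
pool with size=3 and client writes spread evenly across the disks):

for disks in 3 6 12; do
  # each client write lands on 3 OSDs, so the share seen by any single disk is:
  echo "$disks disks -> each disk sees $(( 100 * 3 / disks ))% of client writes"
done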

On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi  wrote:

> Hi,
>
> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
> based around Intel Optane 900P drives (which are meant to be the bee's
> knees), and I'm seeing pretty low IOPS/bandwidth.
>
>- 3 nodes, each running a Ceph monitor daemon, and OSDs.
>- Node 1 has 48 GB of RAM and 10 cores (Intel 4114
>
> ),
>and Node 2 and 3 have 32 GB of RAM and 4 cores (Intel E3-1230V6
>
> 
>)
>- Each node has a Intel Optane 900p (480GB) NVMe
>
> 
>  dedicated
>for Ceph.
>- 4 OSDs per node (total of 12 OSDs)
>- NICs are Intel X520-DA2
>
> ,
>with 10GBASE-LR going to a Unifi US-XG-16
>.
>- First 10GB port is for Proxmox VM traffic, second 10GB port is for
>Ceph traffic.
>
> I created a new Ceph pool specifically for benchmarking with 128 PGs.
>
> Write results:
>
> root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16
> --no-cleanup
> 
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
>60  16 12258 12242   816.055   788   0.0856726
> 0.0783458
> Total time run: 60.069008
> Total writes made:  12258
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 816.261
> Stddev Bandwidth:   17.4584
> Max bandwidth (MB/sec): 856
> Min bandwidth (MB/sec): 780
> Average IOPS:   204
> Stddev IOPS:4
> Max IOPS:   214
> Min IOPS:   195
> Average Latency(s): 0.0783801
> Stddev Latency(s):  0.0468404
> Max latency(s): 0.437235
> Min latency(s): 0.0177178
>
>
> Sequential read results - I don't know why this only ran for 32 seconds?
>
> root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
> 
> Total time run:   32.608549
> Total reads made: 12258
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1503.65
> Average IOPS: 375
> Stddev IOPS:  22
> Max IOPS: 410
> Min IOPS: 326
> Average Latency(s):   0.0412777
> Max latency(s):   0.498116
> Min latency(s):   0.00447062
>
>
> Random read result:
>
> root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
> 
> Total time run:   60.066384
> Total reads made: 22819
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1519.59
> Average IOPS: 379
> Stddev IOPS:  21
> Max IOPS: 424
> Min IOPS: 320
> Average Latency(s):   0.0408697
> Max latency(s):   0.662955
> Min latency(s):   0.00172077
>
>
> I then cleaned-up with:
>
> root@vwnode1:~# rados -p benchmarking cleanup
> Removed 12258 objects
>
>
> I then tested with another Ceph pool, with 512 PGs (originally created for
> Proxmox VMs) - results seem quite similar:
>
> root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16
> --no-cleanup
> 
> Total time run: 60.041712
> Total writes made:  12132
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 808.238
> Stddev Bandwidth:   20.7444
> Max bandwidth (MB/sec): 860
> Min bandwidth (MB/sec): 744
> Average IOPS:   202
> Stddev IOPS:5
> Max IOPS:   215
> Min IOPS:   186
> Average Latency(s): 0.0791746
> Stddev Latency(s):  0.0432707
> Max latency(s): 0.42535
> Min latency(s): 0.0200791
>
>
> Sequential read result - once again, only ran for 32 seconds:
>
> root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
> 
> Total time run:   31.249274
> Total reads made: 12132
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1552.93
> Average IOPS: 388
> Stddev IOPS:  30
> Max

[ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Victor Hooi
Hi,

I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
based around Intel Optane 900P drives (which are meant to be the bee's
knees), and I'm seeing pretty low IOPS/bandwidth.

   - 3 nodes, each running a Ceph monitor daemon, and OSDs.
   - Node 1 has 48 GB of RAM and 10 cores (Intel 4114
   
),
   and Node 2 and 3 have 32 GB of RAM and 4 cores (Intel E3-1230V6
   

   )
   - Each node has a Intel Optane 900p (480GB) NVMe
   

dedicated
   for Ceph.
   - 4 OSDs per node (total of 12 OSDs)
   - NICs are Intel X520-DA2
   
,
   with 10GBASE-LR going to a Unifi US-XG-16
   .
   - First 10GB port is for Proxmox VM traffic, second 10GB port is for
   Ceph traffic.

I created a new Ceph pool specifically for benchmarking with 128 PGs.
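
For reference, a pool like that might be created along these lines (the
application tag is an assumption, not necessarily the exact command used):

ceph osd pool create benchmarking 128 128 replicated
ceph osd pool application enable benchmarking rbd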

Write results:

root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16
--no-cleanup

  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
   60  16 12258 12242   816.055   788   0.0856726
0.0783458
Total time run: 60.069008
Total writes made:  12258
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 816.261
Stddev Bandwidth:   17.4584
Max bandwidth (MB/sec): 856
Min bandwidth (MB/sec): 780
Average IOPS:   204
Stddev IOPS:4
Max IOPS:   214
Min IOPS:   195
Average Latency(s): 0.0783801
Stddev Latency(s):  0.0468404
Max latency(s): 0.437235
Min latency(s): 0.0177178


Sequential read results - I don't know why this only ran for 32 seconds?

root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16

Total time run:   32.608549
Total reads made: 12258
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1503.65
Average IOPS: 375
Stddev IOPS:  22
Max IOPS: 410
Min IOPS: 326
Average Latency(s):   0.0412777
Max latency(s):   0.498116
Min latency(s):   0.00447062


Random read result:

root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16

Total time run:   60.066384
Total reads made: 22819
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1519.59
Average IOPS: 379
Stddev IOPS:  21
Max IOPS: 424
Min IOPS: 320
Average Latency(s):   0.0408697
Max latency(s):   0.662955
Min latency(s):   0.00172077


I then cleaned-up with:

root@vwnode1:~# rados -p benchmarking cleanup
Removed 12258 objects


I then tested with another Ceph pool, with 512 PGs (originally created for
Proxmox VMs) - results seem quite similar:

root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16 --no-cleanup

Total time run: 60.041712
Total writes made:  12132
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 808.238
Stddev Bandwidth:   20.7444
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 744
Average IOPS:   202
Stddev IOPS:5
Max IOPS:   215
Min IOPS:   186
Average Latency(s): 0.0791746
Stddev Latency(s):  0.0432707
Max latency(s): 0.42535
Min latency(s): 0.0200791


Sequential read result - once again, only ran for 32 seconds:

root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16

Total time run:   31.249274
Total reads made: 12132
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1552.93
Average IOPS: 388
Stddev IOPS:  30
Max IOPS: 460
Min IOPS: 320
Average Latency(s):   0.0398702
Max latency(s):   0.481106
Min latency(s):   0.00461585


Random read result:

root@vwnode1:~# rados bench -p proxmox_vms 60 rand -t 16

Total time run:   60.088822
Total reads made: 23626
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   1572.74
Average IOPS: 393
Stddev IOPS:  25
Max IOPS: 432
Min IOPS: 322
Average Latency(s):   0.0392854
Max latency(s):   0.693123
Min latency(s):   0.00178545


Cleanup:

root@vwnode1:~# rados -p proxmox_vms cleanup
Removed 12132 objects
root@vwnode1:~# rados df
POOL_NAME   USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND
DEGRADED RD_OPS RD WR_OPS WR
proxmox_vms 169GiB   43396  0 130188  0   0
 0 909519 298GiB 619697 272GiB

to