Re: [ceph-users] bluestore write iops calculation
> I can add RAM. Is there a way to increase RocksDB caching? Can I
> increase bluestore_cache_size_hdd to a higher value to cache RocksDB?

In recent releases it's governed by the osd_memory_target parameter. In previous releases it's bluestore_cache_size_hdd. Check the release notes to know for sure.

> We have planned to add some SSDs for this. How many OSDs' RocksDBs can
> we put on each SSD? I guess that if one SSD goes down, all the related
> OSDs have to be re-installed.

Yes. At least you'd better not put all 24 block.db's on a single SSD :) 4-8 HDDs per SSD is usually fine.

Also check db_used_bytes in `ceph daemon osd.0 perf dump` (replace 0 with actual OSD numbers) to figure out how much space your DBs use. If it's below 30 GB you're lucky, because in that case the DBs will fit on 30 GB SSD partitions.

https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing

--
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
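The db_used_bytes check above can be scripted. The following is a minimal sketch, not an official tool: it assumes the JSON printed by `ceph daemon osd.N perf dump` carries the counter under a "bluefs" section, and it reads "30 GB" as 30 * 10^9 bytes. The function names and the sample values are made up for illustration.

```python
import json

# Assumption: "30 GB" from the message above, taken as decimal gigabytes.
DB_PARTITION_BYTES = 30 * 10**9


def db_used_bytes(perf_dump_json: str) -> int:
    """Extract bluefs db_used_bytes from the JSON text of a perf dump."""
    perf = json.loads(perf_dump_json)
    # Assumption: the counter sits in the "bluefs" section of the dump.
    return int(perf["bluefs"]["db_used_bytes"])


def fits_on_partition(used_bytes: int, partition_bytes: int = DB_PARTITION_BYTES) -> bool:
    """True if the DB currently uses less space than the partition offers."""
    return used_bytes < partition_bytes


# Hypothetical sample; real dumps contain many more sections and counters.
sample = '{"bluefs": {"db_total_bytes": 64424509440, "db_used_bytes": 12884901888}}'
used = db_used_bytes(sample)
print(f"db_used_bytes={used} fits={fits_on_partition(used)}")
```

In practice you would feed it the output of the perf dump command for each OSD and look at the largest value.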
Re: [ceph-users] bluestore write iops calculation
On Mon, Aug 5, 2019 at 6:35 PM wrote:
> > Hi Team,
> > @vita...@yourcmc.ru, thank you for the information. Could you please
> > clarify the queries below as well?
> >
> > 1. The average object size we use will be 256KB to 512KB. Will there
> > be a deferred write queue?
>
> With the default settings, no (bluestore_prefer_deferred_size_hdd =
> 32KB).
>
> Are you sure that 256-512KB operations aren't counted as multiple
> operations in your disk stats?

I think they are not being counted as multiple operations.

> > 2. Share the link to the existing RocksDB ticket about the 2 writes +
> > syncs.
>
> My PR is here: https://github.com/ceph/ceph/pull/26909. You can find the
> issue tracker links inside it.
>
> > 3. Is there any configuration by which we can reduce/optimize the
> > iops?
>
> As already said, part of your I/O may be caused by metadata (RocksDB)
> reads if the metadata doesn't fit into RAM. You can try to add more RAM
> in that case... :)

I can add RAM. Is there a way to increase RocksDB caching? Can I increase bluestore_cache_size_hdd to a higher value to cache RocksDB?

> You can also try to add SSDs for metadata (block.db/block.wal).

We have planned to add some SSDs for this. How many OSDs' RocksDBs can we put on each SSD? I guess that if one SSD goes down, all the related OSDs have to be re-installed.

> Is there something else?... I don't think so.
>
> --
> Vitaliy Filippov
Re: [ceph-users] bluestore write iops calculation
> Hi Team,
> @vita...@yourcmc.ru, thank you for the information. Could you please
> clarify the queries below as well?
>
> 1. The average object size we use will be 256KB to 512KB. Will there be
> a deferred write queue?

With the default settings, no (bluestore_prefer_deferred_size_hdd = 32KB).

Are you sure that 256-512KB operations aren't counted as multiple operations in your disk stats?

> 2. Share the link to the existing RocksDB ticket about the 2 writes +
> syncs.

My PR is here: https://github.com/ceph/ceph/pull/26909. You can find the issue tracker links inside it.

> 3. Is there any configuration by which we can reduce/optimize the iops?

As already said, part of your I/O may be caused by metadata (RocksDB) reads if the metadata doesn't fit into RAM. You can try to add more RAM in that case... :)

You can also try to add SSDs for metadata (block.db/block.wal).

Is there something else?... I don't think so.

--
Vitaliy Filippov
Re: [ceph-users] bluestore write iops calculation
Hi Team,

@vita...@yourcmc.ru, thank you for the information. Could you please clarify the queries below as well?

1. The average object size we use will be 256KB to 512KB. Will there be a deferred write queue?
2. Share the link to the existing RocksDB ticket about the 2 writes + syncs.
3. Is there any configuration by which we can reduce/optimize the iops?

Thanks,
Muthu

On Fri, Aug 2, 2019 at 6:21 PM wrote:
> > 1. For 750 object write requests, data is written directly into the
> > data partition, and since we use EC 4+1 there will be 5 iops across
> > the cluster for each object write. This makes 750 * 5 = 3750 iops.
>
> Don't forget about the metadata and the deferring of small writes.
> It's deferred write queue + metadata, then data, for each OSD. This is
> either 2 or 3 ops per OSD. The deferred write queue is in the same
> RocksDB, so deferred write queue + metadata should be 1 op, although a
> slightly bigger one (8-12 KB for 4 KB writes). So it's either 3*5*750
> or 2*5*750, depending on how your final statistics are collected.
>
> > 2. For 750 attribute requests, each will first be written into
> > rocksdb.WAL and then to rocks.db. So, 2 iops per disk for every
> > attribute request. This makes 750*2*5 = 7500 iops inside the cluster.
>
> RocksDB is an LSM tree, so it doesn't write to the WAL and then to the
> DB. It just writes to the WAL, then compacts it at some point and
> merges it through L0->L1->L2->...
>
> So in theory, without compaction, it should be 1*5*750 iops.
>
> However, there is a bug that makes BlueStore do 2 writes+syncs instead
> of 1 for each journal write (not all the time, though). The first write
> is RocksDB's WAL and the second one is BlueFS's journal. This probably
> adds another 5*750 iops on top of each of (1) and (2).
>
> So 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25,
> 22500/120 = 187.5.
>
> The rest may be compaction, or metadata reads if you update some
> objects. Or maybe I'm missing something else. However, this is already
> closer to your 200 iops :)
>
> --
> Vitaliy Filippov
Re: [ceph-users] bluestore write iops calculation
Any EC pool with m=1 is fragile. By default, min_size = k+1, so with m=1 IO stops the moment you lose a single OSD. min_size can be lowered to k, but that can cause data loss and corruption. You should set m=2 at a minimum. 4+2 doesn't take much more space than 4+1, and it's far safer.

On Fri, Aug 2, 2019 at 11:21 PM wrote:
> > where small means 32kb or smaller going to BlueStore, so <= 128kb
> > writes from the client.
> >
> > Also: please don't do 4+1 erasure coding, see older discussions for
> > details.
>
> Can you point me to the discussion about the problems of 4+1? It's not
> easy to google :)
>
> --
> Vitaliy Filippov
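To put numbers on the 4+1 versus 4+2 comparison, here is a small sketch. It relies only on two facts stated above: an EC k+m pool stores k+m shards per object, and the default min_size is k+1. The function names are made up for illustration.

```python
def raw_space_factor(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data in an EC k+m pool."""
    return (k + m) / k


def default_min_size(k: int) -> int:
    """Default min_size as described in the message above: k + 1."""
    return k + 1


def osd_losses_tolerated_with_io(k: int, m: int) -> int:
    """How many shards can be lost while at least min_size shards remain."""
    return (k + m) - default_min_size(k)


for k, m in [(4, 1), (4, 2)]:
    print(
        f"EC {k}+{m}: space factor {raw_space_factor(k, m):.2f}, "
        f"losses tolerated before IO stops: {osd_losses_tolerated_with_io(k, m)}"
    )
```

Under these assumptions 4+1 costs 1.25x raw space and tolerates no OSD loss without blocking IO, while 4+2 costs 1.5x and tolerates one.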
Re: [ceph-users] bluestore write iops calculation
On 02/08/2019 08:54, nokia ceph wrote:
> Hi Team,
>
> Could you please help us understand the write iops inside a Ceph
> cluster? There seems to be a mismatch between the theoretical iops and
> what we see in the disk stats.
>
> Our platform is a 5-node cluster with 120 OSDs; each node has 24 HDDs
> (data, RocksDB and the RocksDB WAL all reside on the same disk).
> We use EC 4+1.
>
> We do only write operations, on average 1500 write iops in total
> (750 objects/s and 750 attribute requests per second, a single
> key-value entry for each object). In the ceph status we see a
> consistent 1500 write iops from the client.
>
> Please correct us if our assumptions are wrong.
>
> 1. For 750 object write requests, data is written directly into the
> data partition, and since we use EC 4+1 there will be 5 iops across the
> cluster for each object write. This makes 750 * 5 = 3750 iops.
>
> 2. For 750 attribute requests, each will first be written into
> rocksdb.WAL and then to rocks.db. So, 2 iops per disk for every
> attribute request. This makes 750*2*5 = 7500 iops inside the cluster.
>
> Now the total iops inside the cluster would be 11250. We have 120 OSDs,
> hence each OSD should see 11250/120 = ~94 iops.
>
> Currently we see an average of 200 iops per OSD for the same load in
> iostat, whereas the theoretical calculation gives only 94 iops.
>
> Could you please let us know where the remaining iops inside the
> cluster come from, for 1500 write iops from the client?
>
> Does each object write end up writing one metadata entry into RocksDB?
> Then we need to add another 3750 to the total iops, which gives each
> OSD 125 iops; there is still a difference of 75 iops per OSD.
>
> Thanks,
> Muthu

Also, is your iostat reading write iops or total read+write iops (iostat tps)? Note that there could be a metadata read op at the start of the first write op if the metadata is not cached in memory.

/Maged
Re: [ceph-users] bluestore write iops calculation
> where small means 32kb or smaller going to BlueStore, so <= 128kb
> writes from the client.
>
> Also: please don't do 4+1 erasure coding, see older discussions for
> details.

Can you point me to the discussion about the problems of 4+1? It's not easy to google :)

--
Vitaliy Filippov
Re: [ceph-users] bluestore write iops calculation
On Fri, Aug 2, 2019 at 2:51 PM wrote:
> > 1. For 750 object write requests, data is written directly into the
> > data partition, and since we use EC 4+1 there will be 5 iops across
> > the cluster for each object write. This makes 750 * 5 = 3750 iops.
>
> Don't forget about the metadata and the deferring of small writes.
> It's deferred write queue + metadata, then data, for each OSD. This is
> either 2 or 3 ops per OSD. The deferred write queue is in the same
> RocksDB, so deferred write queue + metadata should be 1 op, although a
> slightly bigger one (8-12 KB for 4 KB writes). So it's either 3*5*750
> or 2*5*750, depending on how your final statistics are collected.

Here "small" means 32 KB or smaller going to BlueStore, so <= 128 KB writes from the client.

Also: please don't do 4+1 erasure coding; see older discussions for details.

Paul

> > 2. For 750 attribute requests, each will first be written into
> > rocksdb.WAL and then to rocks.db. So, 2 iops per disk for every
> > attribute request. This makes 750*2*5 = 7500 iops inside the cluster.
>
> RocksDB is an LSM tree, so it doesn't write to the WAL and then to the
> DB. It just writes to the WAL, then compacts it at some point and
> merges it through L0->L1->L2->...
>
> So in theory, without compaction, it should be 1*5*750 iops.
>
> However, there is a bug that makes BlueStore do 2 writes+syncs instead
> of 1 for each journal write (not all the time, though). The first write
> is RocksDB's WAL and the second one is BlueFS's journal. This probably
> adds another 5*750 iops on top of each of (1) and (2).
>
> So 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25,
> 22500/120 = 187.5.
>
> The rest may be compaction, or metadata reads if you update some
> objects. Or maybe I'm missing something else. However, this is already
> closer to your 200 iops :)
>
> --
> Vitaliy Filippov
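The 32 KB / 128 KB relationship above can be sketched as follows. This is a simplification under stated assumptions: the client write is split evenly across the k = 4 data chunks of the EC 4+1 pool, stripe-unit and alignment effects are ignored, and "32 KB or smaller" is taken literally as an inclusive comparison. The function names are made up for illustration.

```python
# Default quoted elsewhere in this thread for bluestore_prefer_deferred_size_hdd.
PREFER_DEFERRED_SIZE_HDD = 32 * 1024


def per_osd_chunk_bytes(client_write_bytes: int, k: int) -> int:
    """Bytes landing on each data OSD, assuming an even split over k chunks."""
    return -(-client_write_bytes // k)  # ceiling division


def is_deferred(client_write_bytes: int, k: int,
                threshold: int = PREFER_DEFERRED_SIZE_HDD) -> bool:
    """True if the per-OSD chunk is small enough to go through the deferred queue."""
    return per_osd_chunk_bytes(client_write_bytes, k) <= threshold


for size_kb in (64, 128, 256, 512):
    deferred = is_deferred(size_kb * 1024, k=4)
    print(f"{size_kb} KB client write -> deferred: {deferred}")
```

With these assumptions, 256-512 KB objects produce 64-128 KB chunks per OSD, which is above the threshold and so not deferred.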
Re: [ceph-users] bluestore write iops calculation
> 1. For 750 object write requests, data is written directly into the
> data partition, and since we use EC 4+1 there will be 5 iops across the
> cluster for each object write. This makes 750 * 5 = 3750 iops.

Don't forget about the metadata and the deferring of small writes. It's deferred write queue + metadata, then data, for each OSD. This is either 2 or 3 ops per OSD. The deferred write queue is in the same RocksDB, so deferred write queue + metadata should be 1 op, although a slightly bigger one (8-12 KB for 4 KB writes). So it's either 3*5*750 or 2*5*750, depending on how your final statistics are collected.

> 2. For 750 attribute requests, each will first be written into
> rocksdb.WAL and then to rocks.db. So, 2 iops per disk for every
> attribute request. This makes 750*2*5 = 7500 iops inside the cluster.

RocksDB is an LSM tree, so it doesn't write to the WAL and then to the DB. It just writes to the WAL, then compacts it at some point and merges it through L0->L1->L2->...

So in theory, without compaction, it should be 1*5*750 iops.

However, there is a bug that makes BlueStore do 2 writes+syncs instead of 1 for each journal write (not all the time, though). The first write is RocksDB's WAL and the second one is BlueFS's journal. This probably adds another 5*750 iops on top of each of (1) and (2).

So 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25, 22500/120 = 187.5.

The rest may be compaction, or metadata reads if you update some objects. Or maybe I'm missing something else. However, this is already closer to your 200 iops :)

--
Vitaliy Filippov
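The estimate above is easy to reproduce. The sketch below only encodes the arithmetic from this message (5 shards for EC 4+1, 2 or 3 ops per object write, 1 op per attribute write, 2 extra ops from the double journal write); the function and parameter names are made up for illustration, and it is not a model of what BlueStore actually does.

```python
def estimated_iops(requests_per_s: int, shards: int, osds: int,
                   object_write_ops: int) -> tuple[int, float]:
    """Return (cluster-wide iops, iops per OSD) for the estimate above.

    object_write_ops is 2 or 3, depending on whether the deferred queue and
    the metadata write are counted as one op or two.
    """
    attribute_ops = 1      # one WAL write per attribute request
    extra_journal_ops = 2  # double write bug: +1 for each of (1) and (2)
    per_shard_ops = object_write_ops + attribute_ops + extra_journal_ops
    cluster = shards * per_shard_ops * requests_per_s
    return cluster, cluster / osds


for ops in (2, 3):
    cluster, per_osd = estimated_iops(750, shards=5, osds=120, object_write_ops=ops)
    print(f"object_write_ops={ops}: cluster={cluster}, per OSD={per_osd}")
```

This prints 18750 and 156.25 for the 2-op case, and 22500 and 187.5 for the 3-op case.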
[ceph-users] bluestore write iops calculation
Hi Team,

Could you please help us understand the write iops inside a Ceph cluster? There seems to be a mismatch between the theoretical iops and what we see in the disk stats.

Our platform is a 5-node cluster with 120 OSDs; each node has 24 HDDs (data, RocksDB and the RocksDB WAL all reside on the same disk). We use EC 4+1.

We do only write operations, on average 1500 write iops in total (750 objects/s and 750 attribute requests per second, a single key-value entry for each object). In the ceph status we see a consistent 1500 write iops from the client.

Please correct us if our assumptions are wrong.

1. For 750 object write requests, data is written directly into the data partition, and since we use EC 4+1 there will be 5 iops across the cluster for each object write. This makes 750 * 5 = 3750 iops.

2. For 750 attribute requests, each will first be written into rocksdb.WAL and then to rocks.db. So, 2 iops per disk for every attribute request. This makes 750*2*5 = 7500 iops inside the cluster.

Now the total iops inside the cluster would be 11250. We have 120 OSDs, hence each OSD should see 11250/120 = ~94 iops.

Currently we see an average of 200 iops per OSD for the same load in iostat, whereas the theoretical calculation gives only 94 iops.

Could you please let us know where the remaining iops inside the cluster come from, for 1500 write iops from the client?

Does each object write end up writing one metadata entry into RocksDB? Then we need to add another 3750 to the total iops, which gives each OSD 125 iops; there is still a difference of 75 iops per OSD.

Thanks,
Muthu
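For reference, the calculation in this question can be written out as a short sketch. It encodes only the assumptions stated in the message (5 iops per object write, 2 iops per shard for each attribute request, optionally one extra metadata write per shard) and compares the result with the 200 iops per OSD seen in iostat. The function and parameter names are made up for illustration.

```python
def naive_iops(requests_per_s: int, shards: int, osds: int,
               extra_metadata_write: bool = False) -> tuple[int, float]:
    """Return (cluster-wide iops, iops per OSD) under the question's assumptions."""
    object_iops = requests_per_s * shards         # 750 * 5
    attribute_iops = requests_per_s * 2 * shards  # 750 * 2 * 5
    total = object_iops + attribute_iops
    if extra_metadata_write:
        total += requests_per_s * shards          # one metadata write per shard
    return total, total / osds


OBSERVED_PER_OSD = 200  # average from iostat, as reported in the message

for extra in (False, True):
    total, per_osd = naive_iops(750, shards=5, osds=120, extra_metadata_write=extra)
    gap = OBSERVED_PER_OSD - per_osd
    print(f"extra_metadata_write={extra}: total={total}, per OSD={per_osd}, gap={gap}")
```

The first case gives 11250 total and 93.75 per OSD (the "~94" above); the second gives 15000 total and 125 per OSD, leaving the 75 iops gap the question asks about.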