Re: [ceph-users] bluestore write iops calculation

2019-08-07 Thread vitalif

I can add RAM. Also, is there a way to increase rocksdb caching? Can I
increase bluestore_cache_size_hdd to a higher value to cache rocksdb?


In recent releases it's governed by the osd_memory_target parameter. In 
previous releases it's bluestore_cache_size_hdd. Check the release notes 
for your version to be sure.
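
For example, here is a minimal sketch of checking which knob your release actually uses and what it is set to, via the OSD admin socket (the OSD ids and the 8 GiB figure are placeholders, adjust to your hosts and RAM):

  import json, subprocess

  def osd_config_get(osd_id, option):
      # query the option through the OSD admin socket
      out = subprocess.check_output(
          ["ceph", "daemon", f"osd.{osd_id}", "config", "get", option])
      return json.loads(out)[option]

  for osd_id in (0, 1, 2):  # placeholder: the OSD ids local to this host
      print(f"osd.{osd_id}:",
            "osd_memory_target =", osd_config_get(osd_id, "osd_memory_target"),
            "| bluestore_cache_size_hdd =", osd_config_get(osd_id, "bluestore_cache_size_hdd"))

  # On releases with osd_memory_target you would then raise it cluster-wide, e.g.
  #   ceph config set osd osd_memory_target 8589934592    # 8 GiB, adjust to your RAM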



We have planned to add some SSDs. How many OSDs' rocksdb (block.db) partitions
can we put per SSD? And I guess that if one SSD fails, all of the related OSDs
have to be re-created.


Yes. At least you'd better not put all 24 block.db's on a single SSD :)
4-8 HDDs per SSD is usually fine. Also check db_used_bytes in `ceph
daemon osd.0 perf dump` (replace 0 with actual OSD numbers) to figure
out how much space your DBs use. If it's below 30 GB you're lucky, because
in that case the DBs will fit on 30 GB SSD partitions.
https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
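
A minimal sketch of that check, assuming you run it on each OSD host with admin socket access (the OSD id range is a placeholder; bluefs/db_used_bytes is the counter in the perf dump):

  import json, subprocess

  GiB = 1024 ** 3

  for osd_id in range(24):  # placeholder: the OSD ids on this host
      try:
          out = subprocess.check_output(
              ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
              stderr=subprocess.DEVNULL)
      except subprocess.CalledProcessError:
          continue  # OSD not on this host or admin socket unavailable
      bluefs = json.loads(out).get("bluefs", {})
      used = bluefs.get("db_used_bytes", 0) / GiB
      total = bluefs.get("db_total_bytes", 0) / GiB
      print(f"osd.{osd_id}: block.db used {used:.1f} GiB of {total:.1f} GiB")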


--
Vitaliy Filippov


Re: [ceph-users] bluestore write iops calculation

2019-08-06 Thread nokia ceph
On Mon, Aug 5, 2019 at 6:35 PM  wrote:

> > Hi Team,
> > @vita...@yourcmc.ru, thank you for the information. Could you please
> > clarify the below queries as well:
> >
> > 1. The average object size we use will be 256KB to 512KB; will there be a
> > deferred write queue?
>
> With the default settings, no (bluestore_prefer_deferred_size_hdd =
> 32KB)
>
> Are you sure that 256-512KB operations aren't counted as multiple
> operations in your disk stats?
>

  I think they are not being counted as multiple operations.

>
> > 2. Please share the link to the existing rocksdb ticket about the 2 writes +
> > syncs.
>
> My PR is here https://github.com/ceph/ceph/pull/26909, you can find the
> issue tracker links inside it.
>
> > 3. Is there any configuration by which we can reduce/optimize the iops?
>
> As already said, part of your I/O may be caused by the metadata (rocksdb)
> reads if it doesn't fit into RAM. You can try to add more RAM in that
> case... :)
>

 I can add RAM. Also, is there a way to increase rocksdb caching? Can I
increase bluestore_cache_size_hdd to a higher value to cache rocksdb?

>
> You can also try to add SSDs for metadata (block.db/block.wal).
>
 We have planned to add some SSDs. How many OSDs' rocksdb (block.db) partitions
can we put per SSD? And I guess that if one SSD fails, all of the related OSDs
have to be re-created.

>
> Is there something else?... I don't think so.
>
> --
> Vitaliy Filippov
>


Re: [ceph-users] bluestore write iops calculation

2019-08-05 Thread vitalif

Hi Team,
@vita...@yourcmc.ru, thank you for the information. Could you please
clarify the below queries as well:

1. The average object size we use will be 256KB to 512KB; will there be a
deferred write queue?


With the default settings, no (bluestore_prefer_deferred_size_hdd = 
32KB)


Are you sure that 256-512KB operations aren't counted as multiple 
operations in your disk stats?
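
One rough way to check this, as a sketch (the counters in /proc/diskstats are cumulative since boot, so sample twice and diff them for an interval figure; device names are placeholders):

  def avg_write_size_bytes(device):
      # /proc/diskstats: field 3 = device name, field 8 = writes completed,
      # field 10 = sectors written (512-byte sectors)
      with open("/proc/diskstats") as f:
          for line in f:
              fields = line.split()
              if fields[2] == device:
                  writes = int(fields[7])
                  sectors = int(fields[9])
                  return sectors * 512 / writes if writes else 0.0
      raise ValueError(f"device {device} not found")

  for dev in ("sdb", "sdc"):  # placeholders: your OSD data disks
      print(dev, f"~{avg_write_size_bytes(dev) / 1024:.0f} KiB per write on average")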


2. Please share the link to the existing rocksdb ticket about the 2 writes +
syncs.


My PR is here https://github.com/ceph/ceph/pull/26909, you can find the 
issue tracker links inside it.



3. Is there any configuration by which we can reduce/optimize the iops?


As already said, part of your I/O may be caused by the metadata (rocksdb)
reads if it doesn't fit into RAM. You can try to add more RAM in that
case... :)


You can also try to add SSDs for metadata (block.db/block.wal).

Is there something else?... I don't think so.

--
Vitaliy Filippov


Re: [ceph-users] bluestore write iops calculation

2019-08-05 Thread nokia ceph
Hi Team,
@vita...@yourcmc.ru, thank you for the information. Could you please
clarify the below queries as well:

1. The average object size we use will be 256KB to 512KB; will there be a
deferred write queue?
2. Please share the link to the existing rocksdb ticket about the 2 writes + syncs.
3. Is there any configuration by which we can reduce/optimize the iops?

Thanks,
Muthu


On Fri, Aug 2, 2019 at 6:21 PM  wrote:

> > 1. For 750 object write requests, data is written directly into the data
> > partition, and since we use EC 4+1 there will be 5 iops across the
> > cluster for each object write. This makes 750 * 5 = 3750 iops
>
> don't forget about the metadata and the deferring of small writes.
> deferred write queue + metadata, then data for each OSD. this is either
> 2 or 3 ops per OSD. the deferred write queue is in the same RocksDB
> so deferred write queue + metadata should be 1 op, although a slightly
> bigger one (8-12 kb for 4 kb writes). so it's either 3*5*750 or 2*5*750,
> depending on how your final statistics are collected
>
> > 2. For 750 attribute requests, first it will be written into
> > rocksdb.WAL and then to rocksdb. So, 2 iops per disk for every
> > attribute request. This makes 750*2*5 = 7500 iops inside the cluster.
>
> rocksdb is LSM so it doesn't write to wal then to DB, it just writes to
> WAL then compacts it at some point and merges with L0->L1->L2->...
>
> so in theory without compaction it should be 1*5*750 iops
>
> however, there is a bug that makes bluestore do 2 writes+syncs instead
> of 1 per each journal write (not all the time though). the first write
> is the rocksdb's WAL and the second one is the bluefs's journal. this
> probably adds another 5*750 iops on top of each of (1) and (2).
>
> so 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25,
> 22500/120 = 187.5
>
> the rest may be compaction or metadata reads if you update some objects.
> or maybe I'm missing something else. however this is already closer to
> your 200 iops :)
>
> --
> Vitaliy Filippov
>


Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread Nathan Fish
Any EC pool with m=1 is fragile. By default, min_size = k+1, so you'd
immediately stop IO the moment you lose a single OSD. min_size can be
lowered to k, but that can cause data loss and corruption. You should
set m=2 at a minimum. 4+2 doesn't take much more space than 4+1, and
it's far safer.
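
As a quick illustration of that trade-off (plain arithmetic only, not a Ceph API; the profiles are just examples):

  def ec_profile(k, m):
      raw_overhead = (k + m) / k   # raw bytes stored per logical byte
      min_size = k + 1             # Ceph's default min_size for EC pools
      return raw_overhead, min_size

  for k, m in ((4, 1), (4, 2)):
      overhead, min_size = ec_profile(k, m)
      print(f"EC {k}+{m}: {overhead:.2f}x raw space, default min_size {min_size}, "
            f"survives {m} lost shards before data loss, "
            f"{k + m - min_size} before IO stops at default min_size")

For 4+1 that last number is 0 (IO stops on the first lost OSD); for 4+2 it is 1, at the cost of 1.5x raw space instead of 1.25x.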

On Fri, Aug 2, 2019 at 11:21 PM  wrote:
>
> > where small means 32kb or smaller going to BlueStore, so <= 128kb writes
> > from the client.
> >
> >
> > Also: please don't do 4+1 erasure coding, see older discussions for
> > details.
>
> Can you point me to the discussion about the problems of 4+1? It's not
> easy to google :)
>
> --
> Vitaliy Filippov


Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread Maged Mokhtar


On 02/08/2019 08:54, nokia ceph wrote:

Hi Team,

Could you please help us understand the write iops inside the ceph
cluster. There seems to be a mismatch between the theoretical iops and
what we see in the disk stats.


Our platform is a 5-node cluster with 120 OSDs, each node having 24 HDDs
(data, rocksdb and rocksdb.WAL all reside on the same disk).


We use EC 4+1

We do only write operations, a total average of 1500 write iops (750
objects/s and 750 attribute requests per second, a single key-value entry
for each object). And in the ceph status we see a consistent 1500 write
iops from the client.


Please correct us if our assumptions are wrong.
1. For 750 object write requests, data is written directly into the data
partition, and since we use EC 4+1 there will be 5 iops across the
cluster for each object write. This makes 750 * 5 = 3750 iops.
2. For 750 attribute requests, first it will be written into
rocksdb.WAL and then to rocksdb. So, 2 iops per disk for every
attribute request. This makes 750*2*5 = 7500 iops inside the cluster.


Now the total iops inside the cluster would be 11250. We have 120
OSDs, hence each OSD should see 11250/120 = ~94 iops.


Currently we see an average of 200 iops per OSD for the same load in
iostat, however the theoretical calculation comes to only ~94 iops.


Could you please let us know where the remaining iops inside the
cluster come from, for 1500 write iops from the client?


Does each object write also end up writing one metadata entry inside
rocksdb? Then we need to add another 3750 to the total iops, and this
makes each OSD ~125 iops; there is still a difference of 75 iops
per OSD.


Thanks,
Muthu



Also, is your iostat reading write iops or total read+write iops (iostat
tps)? Note there could be a metadata read op at the start of the first
write op if the metadata is not cached in memory.


/Maged




Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread vitalif
where small means 32kb or smaller going to BlueStore, so <= 128kb writes
from the client.

Also: please don't do 4+1 erasure coding, see older discussions for 
details.


Can you point me to the discussion about the problems of 4+1? It's not
easy to google :)


--
Vitaliy Filippov


Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread Paul Emmerich
On Fri, Aug 2, 2019 at 2:51 PM  wrote:
>
> > 1. For 750 object write requests, data is written directly into the data
> > partition, and since we use EC 4+1 there will be 5 iops across the
> > cluster for each object write. This makes 750 * 5 = 3750 iops
>
> don't forget about the metadata and the deferring of small writes.
> deferred write queue + metadata, then data for each OSD. this is either
> 2 or 3 ops per OSD. the deferred write queue is in the same RocksDB
> so deferred write queue + metadata should be 1 op, although a slightly
> bigger one (8-12 kb for 4 kb writes). so it's either 3*5*750 or 2*5*750,
> depending on how your final statistics are collected

where small means 32kb or smaller going to BlueStore, so <= 128kb writes
from the client.
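
Spelling out the arithmetic behind those numbers as a small sketch (the 32KB default and k=4 come from this thread; the "<=" follows the "32kb or smaller" wording above):

  K = 4                            # data chunks in an EC 4+1 profile
  DEFERRED = 32 * 1024             # bluestore_prefer_deferred_size_hdd default

  for client_write_kib in (32, 128, 256, 512):
      chunk = client_write_kib * 1024 // K   # each client write is striped into k chunks
      print(f"{client_write_kib} KiB client write -> {chunk // 1024} KiB chunk per OSD, "
            f"deferred: {chunk <= DEFERRED}")

So a 128 KiB client write produces 32 KiB chunks and is deferred, while the 256-512 KiB objects discussed here produce 64-128 KiB chunks and are not.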

Also: please don't do 4+1 erasure coding, see older discussions for details.


Paul

>
> > 2. For 750 attribute requests, first it will be written into
> > rocksdb.WAL and then to rocksdb. So, 2 iops per disk for every
> > attribute request. This makes 750*2*5 = 7500 iops inside the cluster.
>
> rocksdb is LSM so it doesn't write to wal then to DB, it just writes to
> WAL then compacts it at some point and merges with L0->L1->L2->...
>
> so in theory without compaction it should be 1*5*750 iops
>
> however, there is a bug that makes bluestore do 2 writes+syncs instead
> of 1 per each journal write (not all the time though). the first write
> is the rocksdb's WAL and the second one is the bluefs's journal. this
> probably adds another 5*750 iops on top of each of (1) and (2).
>
> so 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25,
> 22500/120 = 187.5
>
> the rest may be compaction or metadata reads if you update some objects.
> or maybe I'm missing something else. however this is already closer to
> your 200 iops :)
>
> --
> Vitaliy Filippov


Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread vitalif

1. For 750 object write requests, data is written directly into the data
partition, and since we use EC 4+1 there will be 5 iops across the
cluster for each object write. This makes 750 * 5 = 3750 iops


don't forget about the metadata and the deferring of small writes. 
deferred write queue + metadata, then data for each OSD. this is either 
2 or 3 ops per OSD. the deferred write queue is in the same RocksDB
so deferred write queue + metadata should be 1 op, although a slightly 
bigger one (8-12 kb for 4 kb writes). so it's either 3*5*750 or 2*5*750, 
depending on how your final statistics are collected



2. For 750 attribute requests, first it will be written into
rocksdb.WAL and then to rocksdb. So, 2 iops per disk for every
attribute request. This makes 750*2*5 = 7500 iops inside the cluster.


rocksdb is LSM so it doesn't write to wal then to DB, it just writes to 
WAL then compacts it at some point and merges with L0->L1->L2->...


so in theory without compaction it should be 1*5*750 iops

however, there is a bug that makes bluestore do 2 writes+syncs instead 
of 1 per each journal write (not all the time though). the first write 
is the rocksdb's WAL and the second one is the bluefs's journal. this 
probably adds another 5*750 iops on top of each of (1) and (2).


so 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25, 
22500/120 = 187.5


the rest may be compaction or metadata reads if you update some objects. 
or maybe I'm missing something else. however this is already closer to 
your 200 iops :)
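
The same estimate written out as a quick sketch (the per-op counts are the ones stated above, nothing here is measured from a cluster):

  OBJECTS = 750      # object writes per second
  ATTRS = 750        # attribute (key-value) writes per second
  SHARDS = 5         # EC 4+1
  OSDS = 120

  for data_ops in (2, 3):            # deferred queue + metadata (+ data) ops per OSD, see above
      extra_sync = 1                 # the extra bluefs write+sync per journal write, see above
      total = SHARDS * OBJECTS * (data_ops + extra_sync) \
            + SHARDS * ATTRS * (1 + extra_sync)
      print(f"{data_ops} data-path ops/OSD -> {total} iops in the cluster, "
            f"{total / OSDS:.2f} per OSD")

which prints 18750 (156.25 per OSD) and 22500 (187.50 per OSD).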


--
Vitaliy Filippov


[ceph-users] bluestore write iops calculation

2019-08-02 Thread nokia ceph
Hi Team,

Could you please help us understand the write iops inside the ceph
cluster. There seems to be a mismatch between the theoretical iops and what
we see in the disk stats.

Our platform is a 5-node cluster with 120 OSDs, each node having 24 HDDs
(data, rocksdb and rocksdb.WAL all reside on the same disk).

We use EC 4+1

We do only write operations, a total average of 1500 write iops (750
objects/s and 750 attribute requests per second, a single key-value entry for
each object). And in the ceph status we see a consistent 1500 write iops from
the client.

Please correct us if our assumptions are wrong.
1. For 750 object write requests, data is written directly into the data
partition, and since we use EC 4+1 there will be 5 iops across the cluster
for each object write. This makes 750 * 5 = 3750 iops.
2. For 750 attribute requests, first it will be written into rocksdb.WAL
and then to rocksdb. So, 2 iops per disk for every attribute request.
This makes 750*2*5 = 7500 iops inside the cluster.

Now the total iops inside the cluster would be 11250. We have 120 OSDs,
hence each OSD should see 11250/120 = ~94 iops.

Currently we see an average of 200 iops per OSD for the same load in iostat,
however the theoretical calculation comes to only ~94 iops.

Could you please let us know where the remaining iops inside the cluster
come from, for 1500 write iops from the client?

Does each object write also end up writing one metadata entry inside rocksdb?
Then we need to add another 3750 to the total iops, and this makes each OSD
~125 iops; there is still a difference of 75 iops per OSD.
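
Written out as a quick sketch (just the arithmetic from this mail; Vitaliy's corrected breakdown is in the replies above):

  SHARDS, OSDS = 5, 120
  object_iops = 750 * SHARDS       # one data write per shard          -> 3750
  attr_iops = 750 * SHARDS * 2     # assumed WAL write + DB write per shard -> 7500
  total = object_iops + attr_iops
  print(total, round(total / OSDS))            # 11250, ~94 per OSD
  with_metadata = total + 750 * SHARDS         # one extra metadata write per object write
  print(with_metadata, with_metadata / OSDS)   # 15000, 125 per OSD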

Thanks,
Muthu