[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Lukasz Borek
>
> My understanding is that omap and EC are incompatible, though.

Is that the reason why multipart uploads use a non-EC pool to store their
metadata in an omap database?
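
For reference, this is roughly how I'm checking which bucket pools exist here
(a sketch; the pool names assume the default zone, and the non-ec pool may
only appear once RGW has actually needed it):

rados lspools | grep rgw.buckets
# illustrative output on a default zone:
# default.rgw.buckets.index
# default.rgw.buckets.non-ec
# default.rgw.buckets.data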




On Mon, 8 Apr 2024 at 20:21, Anthony D'Atri  wrote:

> My understanding is that omap and EC are incompatible, though.
>
> > On Apr 8, 2024, at 09:46, David Orman  wrote:
> >
> > I would suggest that you might consider EC vs. replication for index
> data, and the latency implications. There's more than just the nvme vs.
> rotational discussion to entertain, especially if using the more widely
> spread EC modes like 8+3. It would be worth testing for your particular
> workload.
> >
> > Also make sure to factor in storage utilization if you expect to see
> versioning/object lock in use. This can be the source of a significant
> amount of additional consumption that isn't planned for initially.
> >
> > On Mon, Apr 8, 2024, at 01:42, Daniel Parkes wrote:
> >> Hi Lukasz,
> >>
> >> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
> >> database of each osd, not on the actual index pool, so by putting DB/WALL
> >> on an NVMe as you mentioned, you are already configuring the index pool
> >> on a non-rotational drive, you don't need to do anything else.
> >>
> >> You just need to size your DB/WALL partition accordingly. For RGW/object
> >> storage, a good starting point for the DB/Wall sizing is 4%.
> >>
> >> Example of Omap entries in the index pool using 0 bytes, as they are
> >> stored in Rocksdb:
> >>
> >> # rados -p default.rgw.buckets.index listomapkeys
> >> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >> file1
> >> file2
> >> file4
> >> file10
> >>
> >> rados df -p default.rgw.buckets.index
> >> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
> >> default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B
> >>
> >> # rados -p default.rgw.buckets.index stat
> >> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> >> mtime 2022-12-20T07:32:11.00-0500, size 0
> >>
> >>
> >> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
> >>
> >>> Hi!
> >>>
> >>> I'm working on a POC cluster setup dedicated to backup app writing
> >>> objects via s3 (large objects, up to 1TB transferred via multipart
> >>> upload process).
> >>>
> >>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL)
> >>> + EC pool.  Plan is to use cephadm.
> >>>
> >>> I'd like to follow good practice and put the RGW index pool on a
> >>> no-rotation drive. Question is how to do it?
> >>>
> >>>   - replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
> >>>   - reserve space on NVME drive on each node, create lv based OSD and
> >>>     let rgw index use the same NVME drive as DB/WALL
> >>>
> >>> Thoughts?
> >>>
> >>> --
> >>> Lukasz
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Anthony D'Atri
My understanding is that omap and EC are incompatible, though.
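
A quick way to convince yourself (a sketch; "ecpool" and "testobj" are
hypothetical names, and the exact error text can vary by release):

rados -p ecpool create testobj
rados -p ecpool setomapval testobj somekey someval
# the omap write is expected to be rejected with an operation-not-supported
# error, because EC pools cannot store omap data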

> On Apr 8, 2024, at 09:46, David Orman  wrote:
> 
> I would suggest that you might consider EC vs. replication for index data, 
> and the latency implications. There's more than just the nvme vs. rotational 
> discussion to entertain, especially if using the more widely spread EC modes 
> like 8+3. It would be worth testing for your particular workload.
> 
> Also make sure to factor in storage utilization if you expect to see 
> versioning/object lock in use. This can be the source of a significant amount 
> of additional consumption that isn't planned for initially.
> 
> On Mon, Apr 8, 2024, at 01:42, Daniel Parkes wrote:
>> Hi Lukasz,
>> 
>> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
>> database of each osd, not on the actual index pool, so by putting DB/WALL
>> on an NVMe as you mentioned, you are already configuring the index pool on
>> a non-rotational drive, you don't need to do anything else.
>> 
>> You just need to size your DB/WALL partition accordingly. For RGW/object
>> storage, a good starting point for the DB/Wall sizing is 4%.
>> 
>> Example of Omap entries in the index pool using 0 bytes, as they are stored
>> in Rocksdb:
>> 
>> # rados -p default.rgw.buckets.index listomapkeys
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> file1
>> file2
>> file4
>> file10
>> 
>> rados df -p default.rgw.buckets.index
>> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
>> default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B
>> 
>> # rados -p default.rgw.buckets.index stat
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> mtime 2022-12-20T07:32:11.00-0500, size 0
>> 
>> 
>> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>> 
>>> Hi!
>>> 
>>> I'm working on a POC cluster setup dedicated to backup app writing objects
>>> via s3 (large objects, up to 1TB transferred via multipart upload process).
>>> 
>>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>>> pool.  Plan is to use cephadm.
>>> 
>>> I'd like to follow good practice and put the RGW index pool on a
>>> no-rotation drive. Question is how to do it?
>>> 
>>>   - replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>>   - reserve space on NVME drive on each node, create lv based OSD and let
>>>   rgw index use the same NVME drive as DB/WALL
>>> 
>>> Thoughts?
>>> 
>>> --
>>> Lukasz
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> 
>>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Daniel Parkes
Hi,

Yes, the documentation you are linking to is from Red Hat Ceph Storage 3.x,
which used Filestore. With BlueStore this is no longer the case; the latest
Red Hat doc version is here:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/7/html-single/object_gateway_guide/index#index-pool_rgw

I see they have this block of text there:

"For Red Hat Ceph Storage running Bluestore, Red Hat recommends deploying
an NVMe drive as a block.db device, rather than as a separate pool.
Ceph Object Gateway index data is written only into an object map (OMAP).
OMAP data for BlueStore resides on the block.db device on an OSD. When an
NVMe drive functions as a block.db device for an HDD OSD and when the index
pool is backed by HDD OSDs, the index data will ONLY be written to the
block.db device. As long as the block.db partition/lvm is sized properly at
4% of block, this configuration is all that is needed for BlueStore."
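
If it helps, here is a minimal cephadm OSD spec sketch for that layout
(assumptions: the drivegroup syntax of recent releases, a placeholder
service_id and host_pattern, and 300G as roughly 4% of a 7.4 TiB HDD --
adjust to your actual drive sizes):

cat > osd-spec.yaml <<'EOF'
service_type: osd
service_id: hdd-data-nvme-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1      # HDDs carry the data
  db_devices:
    rotational: 0      # NVMe carries block.db (and WAL)
  block_db_size: 300G  # example value, ~4% of a 7.4 TiB HDD
EOF
# preview which disks cephadm would pick up before creating anything
ceph orch apply -i osd-spec.yaml --dry-run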

On Mon, Apr 8, 2024 at 12:02 PM Lukasz Borek  wrote:

> Thanks for clarifying.
>
> So redhat doc
> 
> is outdated?
>
> 3.6. Selecting SSDs for Bucket Indexes
>
>
> When selecting OSD hardware for use with a Ceph Object
>> Gateway—irrespective of the use case—Red Hat recommends considering an OSD
>> node that has at least one SSD drive used exclusively for the bucket index
>> pool. This is particularly important when buckets will contain a large
>> number of objects.
>
>
> A bucket index entry is approximately 200 bytes of data, stored as an
>> object map (omap) in leveldb. While this is a trivial amount of data, some
>> uses of Ceph Object Gateway can result in tens or hundreds of millions of
>> objects in a single bucket. By mapping the bucket index pool to a CRUSH
>> hierarchy of SSD nodes, the reduced latency provides a dramatic performance
>> improvement when buckets contain very large numbers of objects.
>
>
>> Important
>> In a production cluster, a typical OSD node will have at least one SSD
>> for the bucket index, AND at least on SSD for the journal.
>
>
> Current utilisation is what osd df command shows in OMAP field?:
>
> root@cephbackup:/# ceph osd df
>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>  0  hdd    7.39870  1.0       7.4 TiB  894 GiB  769 GiB  1.5 MiB  3.4 GiB  6.5 TiB  11.80  1.45   40  up
>>  1  hdd    7.39870  1.0       7.4 TiB  703 GiB  578 GiB  6.0 MiB  2.9 GiB  6.7 TiB   9.27  1.14   37  up
>>  2  hdd    7.39870  1.0       7.4 TiB  700 GiB  576 GiB  3.1 MiB  3.1 GiB  6.7 TiB   9.24  1.13   39  up
>
>
>
>
>
> On Mon, 8 Apr 2024 at 08:42, Daniel Parkes  wrote:
>
>> Hi Lukasz,
>>
>> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
>> database of each osd, not on the actual index pool, so by putting DB/WALL
>> on an NVMe as you mentioned, you are already configuring the index pool on
>> a non-rotational drive, you don't need to do anything else.
>>
>> You just need to size your DB/WALL partition accordingly. For RGW/object
>> storage, a good starting point for the DB/Wall sizing is 4%.
>>
>> Example of Omap entries in the index pool using 0 bytes, as they are
>> stored in Rocksdb:
>>
>> # rados -p default.rgw.buckets.index listomapkeys 
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> file1
>> file2
>> file4
>> file10
>>
>> rados df -p default.rgw.buckets.index
>> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
>> default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B
>>
>> # rados -p default.rgw.buckets.index stat 
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>>  mtime 2022-12-20T07:32:11.00-0500, size 0
>>
>>
>> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>>
>>> Hi!
>>>
>>> I'm working on a POC cluster setup dedicated to backup app writing
>>> objects
>>> via s3 (large objects, up to 1TB transferred via multipart upload
>>> process).
>>>
>>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>>> pool.  Plan is to use cephadm.
>>>
>>> I'd like to follow good practice and put the RGW index pool on a
>>> no-rotation drive. Question is how to do it?
>>>
>>>- replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>>- reserve space on NVME drive on each node, create lv based OSD and
>>>    let rgw index use the same NVME drive as DB/WALL
>>>
>>> Thoughts?
>>>
>>> --
>>> Lukasz
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>>
>
> 

[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread David Orman
I would suggest that you might consider EC vs. replication for index data, and 
the latency implications. There's more than just the nvme vs. rotational 
discussion to entertain, especially if using the more widely spread EC modes 
like 8+3. It would be worth testing for your particular workload.

Also make sure to factor in storage utilization if you expect to see 
versioning/object lock in use. This can be the source of a significant amount 
of additional consumption that isn't planned for initially.
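
For what it's worth, a minimal sketch of pinning the index pool to a
replicated, SSD-class CRUSH rule (the rule name is arbitrary; it assumes the
index pool already exists and that the SSD/NVMe OSDs carry the ssd device
class):

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd
ceph osd pool set default.rgw.buckets.index size 3

Since the index pool holds only omap (its data size is essentially 0 B), the
data movement triggered by switching the rule should be small.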

On Mon, Apr 8, 2024, at 01:42, Daniel Parkes wrote:
> Hi Lukasz,
>
> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
> database of each osd, not on the actual index pool, so by putting DB/WALL
> on an NVMe as you mentioned, you are already configuring the index pool on
> a non-rotational drive, you don't need to do anything else.
>
> You just need to size your DB/WALL partition accordingly. For RGW/object
> storage, a good starting point for the DB/Wall sizing is 4%.
>
> Example of Omap entries in the index pool using 0 bytes, as they are stored
> in Rocksdb:
>
> # rados -p default.rgw.buckets.index listomapkeys
> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> file1
> file2
> file4
> file10
>
> rados df -p default.rgw.buckets.index
> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
> default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B
>
> # rados -p default.rgw.buckets.index stat
> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> mtime 2022-12-20T07:32:11.00-0500, size 0
>
>
> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>
>> Hi!
>>
>> I'm working on a POC cluster setup dedicated to backup app writing objects
>> via s3 (large objects, up to 1TB transferred via multipart upload process).
>>
>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>> pool.  Plan is to use cephadm.
>>
>> I'd like to follow good practice and put the RGW index pool on a
>> no-rotation drive. Question is how to do it?
>>
>>- replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>- reserve space on NVME drive on each node, create lv based OSD and let
>>    rgw index use the same NVME drive as DB/WALL
>>
>> Thoughts?
>>
>> --
>> Lukasz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Lukasz Borek
Thanks for clarifying.

So the Red Hat doc

is outdated?

3.6. Selecting SSDs for Bucket Indexes


> When selecting OSD hardware for use with a Ceph Object Gateway—irrespective
> of the use case—Red Hat recommends considering an OSD node that has at
> least one SSD drive used exclusively for the bucket index pool. This is
> particularly important when buckets will contain a large number of objects.


> A bucket index entry is approximately 200 bytes of data, stored as an
> object map (omap) in leveldb. While this is a trivial amount of data, some
> uses of Ceph Object Gateway can result in tens or hundreds of millions of
> objects in a single bucket. By mapping the bucket index pool to a CRUSH
> hierarchy of SSD nodes, the reduced latency provides a dramatic performance
> improvement when buckets contain very large numbers of objects.


> Important
> In a production cluster, a typical OSD node will have at least one SSD for
> the bucket index, AND at least one SSD for the journal.


Is the current utilisation what the osd df command shows in the OMAP field?
(a rough way to total it is sketched after the output below):

root@cephbackup:/# ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>  0  hdd    7.39870  1.0       7.4 TiB  894 GiB  769 GiB  1.5 MiB  3.4 GiB  6.5 TiB  11.80  1.45   40  up
>  1  hdd    7.39870  1.0       7.4 TiB  703 GiB  578 GiB  6.0 MiB  2.9 GiB  6.7 TiB   9.27  1.14   37  up
>  2  hdd    7.39870  1.0       7.4 TiB  700 GiB  576 GiB  3.1 MiB  3.1 GiB  6.7 TiB   9.24  1.13   39  up
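
A rough way to total that up across OSDs, if that is indeed the right field
(a sketch; it assumes the JSON output of ceph osd df exposes a per-OSD
kb_used_omap field, as recent releases do):

ceph osd df -f json | jq '[.nodes[].kb_used_omap] | add'
# prints the total omap usage in KiB across all OSDs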





On Mon, 8 Apr 2024 at 08:42, Daniel Parkes  wrote:

> Hi Lukasz,
>
> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
> database of each osd, not on the actual index pool, so by putting DB/WALL
> on an NVMe as you mentioned, you are already configuring the index pool on
> a non-rotational drive, you don't need to do anything else.
>
> You just need to size your DB/WALL partition accordingly. For RGW/object
> storage, a good starting point for the DB/Wall sizing is 4%.
>
> Example of Omap entries in the index pool using 0 bytes, as they are
> stored in Rocksdb:
>
> # rados -p default.rgw.buckets.index listomapkeys 
> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> file1
> file2
> file4
> file10
>
> rados df -p default.rgw.buckets.index
> POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
> default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B
>
> # rados -p default.rgw.buckets.index stat 
> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2 
> mtime 2022-12-20T07:32:11.00-0500, size 0
>
>
> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>
>> Hi!
>>
>> I'm working on a POC cluster setup dedicated to backup app writing objects
>> via s3 (large objects, up to 1TB transferred via multipart upload
>> process).
>>
>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>> pool.  Plan is to use cephadm.
>>
>> I'd like to follow good practice and put the RGW index pool on a
>> no-rotation drive. Question is how to do it?
>>
>>- replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>- reserve space on NVME drive on each node, create lv based OSD and let
>>    rgw index use the same NVME drive as DB/WALL
>>
>> Thoughts?
>>
>> --
>> Lukasz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>

-- 
Łukasz Borek
luk...@borek.org.pl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Daniel Parkes
Hi Lukasz,

RGW uses omap objects for the index pool; omap data is stored in the RocksDB
database of each OSD, not in the index pool itself. So by putting DB/WAL on
an NVMe as you mentioned, you are already placing the index data on a
non-rotational drive; you don't need to do anything else.

You just need to size your DB/WAL partition accordingly. For RGW/object
storage, a good starting point for DB/WAL sizing is 4% of the OSD's block
device.
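
As a rough worked example with the 7.4 TiB HDDs mentioned elsewhere in this
thread (numbers are only illustrative):

7.4 TiB per HDD x 4% ~ 300 GiB of block.db per OSD
12 HDD OSDs per node x 300 GiB ~ 3.5 TiB of NVMe per node for DB/WAL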

Example of Omap entries in the index pool using 0 bytes, as they are stored
in Rocksdb:

# rados -p default.rgw.buckets.index listomapkeys
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
file1
file2
file4
file10

rados df -p default.rgw.buckets.index
POOL_NAME                  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR      USED COMPR  UNDER COMPR
default.rgw.buckets.index  0 B   11       0       33      0                   0        0         208     207 KiB  41      20 KiB  0 B         0 B

# rados -p default.rgw.buckets.index stat
.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
mtime 2022-12-20T07:32:11.00-0500, size 0


On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:

> Hi!
>
> I'm working on a POC cluster setup dedicated to backup app writing objects
> via s3 (large objects, up to 1TB transferred via multipart upload process).
>
> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
> pool.  Plan is to use cephadm.
>
> I'd like to follow good practice and put the RGW index pool on a
> no-rotation drive. Question is how to do it?
>
>- replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>- reserve space on NVME drive on each node, create lv based OSD and let
>    rgw index use the same NVME drive as DB/WALL
>
> Thoughts?
>
> --
> Lukasz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io