[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread Boris Behrens
Hi Stefan, for a 6:1 or 3:1 ratio we do not have enough slots (I think). There is some read but I don't know if this is a lot: client: 27 MiB/s rd, 289 MiB/s wr, 1.07k op/s rd, 261 op/s wr. Putting them to use for some special rgw pools also came to my mind. But would this make a lot of diff

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread Boris Behrens
> That does not seem like a lot. Having SSD based metadata pools might
> reduce latency though.
So block.db and block.wal don't make sense? I would like to have a consistent cluster. In either case I would need to remove or add SSDs, because we currently have this mixed. It does waste a lot of

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread 胡 玮文
> On November 8, 2021, at 19:08, Boris Behrens wrote:
>
> And does it make a difference to have only a block.db partition or a
> block.db and a block.wal partition?
I think having only a block.db partition is better if you don't have 2 separate disks for them. WAL will be placed in the DB partition if you don
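For illustration only, a minimal sketch of creating such an OSD with ceph-volume, giving it a DB device and no separate WAL (the device paths are hypothetical):

    # Hypothetical devices: /dev/sdb is the data HDD, /dev/nvme0n1p1 the DB partition.
    # With no --block.wal given, BlueStore keeps the WAL inside the DB device.
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1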

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread Christian Wuerdig
In addition to what the others said - generally there is little point in splitting DB and WAL partitions - just stick to one for both. What model are your SSDs and how well do they handle small direct writes? Because that's what you'll be getting on them and the wrong type of SSD can make things
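As a rough, non-authoritative way to see how an SSD handles that write pattern, a fio run along these lines is common (the device path is a placeholder, and the test destroys data on it):

    # Small synchronous direct 4k writes roughly approximate RocksDB WAL traffic.
    # WARNING: this writes raw to the device and destroys whatever is on it.
    fio --name=sync-write-test --filename=/dev/nvme0n1 \
        --ioengine=libaio --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based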

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-09 Thread prosergey07
Not sure how much it would help the performance with OSDs backed by SSD db and wal devices. Even if you go this route with one SSD per 10 HDDs, you might want to set the failure domain to host in the crush rules in case the SSD is out of service. But in practice the SSD will not help too much to bo
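A minimal sketch of pinning the failure domain to host, with hypothetical rule and pool names:

    # Replicated rule whose failure domain is the host bucket, so replicas never
    # share a server (and therefore never share the same DB/WAL SSD).
    ceph osd crush rule create-replicated rule-by-host default host
    # Point a pool at the rule (pool name is only an example):
    ceph osd pool set default.rgw.buckets.index crush_rule rule-by-host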

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris Behrens
Hi, we use enterprise SSDs like the SAMSUNG MZ7KM1T9. They work very well for our block storage. Some NVMe would be a lot nicer, but we have good experience with these. One SSD failure taking down 10 OSDs might sound hard, but this would be an okayish risk. Most of the tunables are default in our setup a

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
No, you cannot do that, because the RocksDB omap key/values and the WAL would be gone, meaning all xattrs and omap data will be gone too. Hence the OSD will become non-operational. But if you notice that the SSD starts throwing errors, you can start migrating the bluefs device to a new partition: ceph-bluestore-tool bl
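The command above is cut off; as a sketch only (OSD id and target device are placeholders), a DB migration with the OSD stopped would look roughly like this:

    # Stop the OSD, then move its BlueFS DB volume onto the new partition.
    systemctl stop ceph-osd@12
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-12 \
        --devs-source /var/lib/ceph/osd/ceph-12/block.db \
        --dev-target /dev/nvme1n1p1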

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris
Oh. How would one recover from that? Sounds like it basically makes no difference if 2, 5 or 10 OSDs are in the blast radius. Can the omap key/values be regenerated? I always thought this data would be stored in the rgw pools. Or am I mixing things up and the bluestore metadata holds the omap k/v?

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Anthony D'Atri
> Oh.
> How would one recover from that? Sounds like it basically makes no difference
> if 2, 5 or 10 OSDs are in the blast radius.
Perhaps. But a larger blast radius means that you lose a larger percentage of your cluster, assuming that you have a CRUSH failure domain of no smaller

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
rgw.meta contains user, bucket and bucket instance metadata. rgw.bucket.index contains the bucket indexes, aka shards. If you have 32 shards you will have 32 objects in that pool: .dir.BUCKET_ID.0-31. Each would have part of your objects listed. They should be using some sort of hash table algorith
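As an illustration (the pool name and bucket id are placeholders for whatever your zone uses), the shard objects and their omap contents can be inspected with rados:

    # List the per-bucket index shard objects, named .dir.<bucket_id>.<shard>
    rados -p default.rgw.buckets.index ls | grep '^\.dir\.'
    # Dump the omap keys (the actual object listing) held by one shard:
    rados -p default.rgw.buckets.index listomapkeys ".dir.${BUCKET_ID}.0" | head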

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-11 Thread Boris Behrens
Now I finally know what kind of data are stored in the RocksDB. Didn't find it in the documentation. This sounds like a horrible SPoF. How can you recover from it? Purge the OSD, wipe the disk and re-add it? An all-flash cluster is sadly not an option for our s3, as it is just too large and we just bo

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-11 Thread Сергей Процун
Yeah. Wipe the disk, but do not remove it from the CRUSH map as that would result in re-balancing. Then recreate the OSD and let it re-join the cluster.

Thu, 11 Nov 2021, 11:05, Boris Behrens wrote:
> Now I finally know what kind of data are stored in the RocksDB. Didn't
> find it in the docum
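A sketch of that procedure, reusing the OSD id so the CRUSH map is untouched (id and devices are placeholders):

    # Mark the OSD destroyed but keep its id and CRUSH position, wipe the disk,
    # then recreate it under the same id so no extra rebalancing is triggered.
    ceph osd destroy 12 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdb --destroy
    ceph-volume lvm create --bluestore --osd-id 12 \
        --data /dev/sdb --block.db /dev/nvme0n1p2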

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-11 Thread Anthony D'Atri
>> it in the documentation.
>> This sounds like a horrible SPoF. How can you recover from it? Purge the
>> OSD, wipe the disk and re-add it?
>> An all-flash cluster is sadly not an option for our s3, as it is just too
>> large and we just bought around 60x 8TB disks (in the last couple of
>> months).

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-11 Thread Mark Nelson
On 11/11/21 1:09 PM, Anthony D'Atri wrote:
> it in the documentation. This sounds like a horrible SPoF. How can you recover from it? Purge the OSD, wipe the disk and re-add it? An all-flash cluster is sadly not an option for our s3, as it is just too large and we just bought around 60x 8TB disks (in t

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-11 Thread Anthony D'Atri
> It's absolutely important to think about the use case. For most RGW cases I
> generally agree with you. For something like HPC scratch storage you might
> have the opposite case where 3DWPD might be at the edge of what's tolerable.
> Many years ago I worked for a supercomputing institu

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-12 Thread Boris Behrens
Oh wow, a lot to read piled up in one night :) First things first: I want to thank you all for your insights and for the really valuable knowledge I pulled from this mail thread. Regarding flash only: we use flash-only clusters for RBD. This is very nice and most of the maintenance is