[ceph-users] Re: RoCE?

2024-02-20 Thread Jan Marek
Hello,

we've found the problem:

The systemd unit for the OSD is missing this line in the
[Service] section:

LimitMEMLOCK=infinity

After I added this line to the systemd unit, the OSD daemon started and
we now have a HEALTH_OK state in the cluster status.
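
For anyone hitting the same issue: /etc/security/limits.conf typically
does not apply to systemd services, so the limit has to be set on the
unit itself. A persistent way to do that is a drop-in; a minimal sketch
(the unit name and drop-in path here are only examples -- adjust them to
whatever unit your OSDs actually run under):

    # /etc/systemd/system/ceph-osd@.service.d/memlock.conf
    [Service]
    LimitMEMLOCK=infinity

    systemctl daemon-reload
    systemctl restart ceph.target

You can check that the limit reached the daemon with something like
"grep 'Max locked memory' /proc/<osd-pid>/limits", where <osd-pid> is a
placeholder for the OSD's PID. The RDMA messenger registers (pins)
memory, which is why the default memlock limit is too low for an OSD
running async+rdma.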

Sincerely
Jan Marek

On Mon, Feb 05, 2024 at 11:10:21 CET, Jan Marek wrote:
> Hello,
> 
> we are configuring a new Ceph cluster with Mellanox 2x100Gbps cards.
> 
> We bonded these two ports into an MLAG bond0 interface.
> 
> In async+posix mode everything is OK and the cluster is in the
> HEALTH_OK state.
> 
> CEPH version is 18.2.1.
> 
> Then we tried to configure RoCE for the cluster part of the network, but
> without success.
> 
> Our ceph config dump (only relevant config):
> 
> global                   advanced  ms_async_rdma_device_name  mlx5_bond_0      *
> global                   advanced  ms_async_rdma_gid_idx      3
> global  host:ceph1-nvme  advanced  ms_async_rdma_local_gid    ::::::a0d9:05d8  *
> global  host:ceph2-nvme  advanced  ms_async_rdma_local_gid    ::::::a0d9:05d7  *
> global  host:ceph3-nvme  advanced  ms_async_rdma_local_gid    ::::::a0d9:05d6  *
> global                   advanced  ms_async_rdma_roce_ver     2
> global                   advanced  ms_async_rdma_type         rdma             *
> global                   advanced  ms_cluster_type            async+rdma       *
> global                   advanced  ms_public_type             async+posix      *
> 
> 
> On ceph1-nvme, show_gids.sh gives this list:
> 
> # ./show_gids.sh
> DEV          PORT  INDEX  GID                           IPv4            VER  DEV
> ---          ----  -----  ---                           ----            ---  ---
> mlx5_bond_0  1     0      fe80::::0e42:a1ff:fe93:b004                   v1   bond0
> mlx5_bond_0  1     1      fe80::::0e42:a1ff:fe93:b004                   v2   bond0
> mlx5_bond_0  1     2      ::::::a0d9:05d8               160.217.5.216   v1   bond0
> mlx5_bond_0  1     3      ::::::a0d9:05d8               160.217.5.216   v2   bond0
> n_gids_found=4
> 
> I have set this line in /etc/security/limits.conf:
> 
> *    hard    memlock    unlimited
> 
> But when I tried to restart ceph.target, the OSD daemons didn't start
> and failed with these errors (see attachment).
> 
> Mellanox drivers are from Debian bookworm kernel.
> 
> Is there something missing in the config, or is there some error?
> 
> When I change ms_cluster_type back to async+posix and restart
> ceph.target, the cluster converges to the HEALTH_OK state...
> 
> Thanks for any advice...
> 
> Sincerely
> Jan Marek
> -- 
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html

> 2024-02-05T09:56:50.249344+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
> 2024-02-05T09:56:50.249362+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
> 2024-02-05T09:56:50.249377+01:00 ceph3-nvme ceph-osd[21139]: asok(0x559592978000) register_command rotate-key hook 0x7fffcc26d398
> 2024-02-05T09:56:50.249391+01:00 ceph3-nvme ceph-osd[21139]: log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility:  prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
> 2024-02-05T09:56:50.249409+01:00 ceph3-nvme ceph-osd[21139]: osd.2 109 log_to_monitors true
> 2024-02-05T09:56:50.249424+01:00 ceph3-nvme ceph-osd[21139]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::init()' thread 7fb3bb042700 time 2024-02-05T08:56:50.142198+#012/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: 1061: FAILED ceph_assert(device)#012#012 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-20 Thread Eugen Block

Hi,

some more details would be helpful, for example what's the pool size  
of the cache pool? Did you issue a PG split before or during the  
upgrade? This thread [1] deals with the same problem; the described
workaround was to set hit_set_count to 0 and disable the cache layer
until the issue is resolved. Afterwards you could enable the cache layer
again. But keep in mind that the cache tier code is entirely
removed in Reef (IIRC).
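
If it helps, the workaround from [1] boils down to something along these
lines (pool names are placeholders, and this is only a rough sketch of
the usual cache-tier disable steps from the docs -- double-check it
against your own setup before running anything):

  ceph osd pool set <cache-pool> hit_set_count 0
  ceph osd tier cache-mode <cache-pool> proxy
  rados -p <cache-pool> cache-flush-evict-all
  ceph osd tier remove-overlay <base-pool>

and re-adding the overlay / writeback mode afterwards if you still need
the tier.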


Regards,
Eugen

[1]  
https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd


Quoting Cedric :


Hello,

Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
encountered an issue with a cache pool becoming completely stuck,
relevant messages below:

pg xx.x has invalid (post-split) stats; must scrub before tier agent
can activate

In the OSD logs, scrubs keep starting in a loop without ever succeeding,
for all PGs of this pool.

What we already tried without luck so far:

- shutdown / restart OSD
- rebalance pg between OSD
- raise the memory on OSD
- repeer PG

Any idea what is causing this? Any help would be greatly appreciated.

Thanks

Cédric
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-20 Thread Eugen Block

Please don't drop the list from your response.

The first question that comes to mind is: why do you have a cache tier if
all your pools are on NVMe devices anyway? I don't see any benefit here.

Did you try the suggested workaround and disable the cache-tier?

Quoting Cedric :


Thanks Eugen, see attached infos.

Some more details:

- commands that actually hang: ceph balancer status ; rbd -p vms ls ;
rados -p vms_cache cache-flush-evict-all
- all scrubs running on vms_cache PGs stall / start in a loop
without actually doing anything
- all I/O is 0, both in ceph status and in iostat on the nodes

On Tue, Feb 20, 2024 at 10:00 AM Eugen Block  wrote:


Hi,

some more details would be helpful, for example what's the pool size
of the cache pool? Did you issue a PG split before or during the
upgrade? This thread [1] deals with the same problem, the described
workaround was to set hit_set_count to 0 and disable the cache layer
until that is resolved. Afterwards you could enable the cache layer
again. But keep in mind that the code for cache tier is entirely
removed in Reef (IIRC).

Regards,
Eugen

[1]
https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd

> Quoting Cedric :

> Hello,
>
> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
> encounter an issue with a cache pool becoming completely stuck,
> relevant messages below:
>
> pg xx.x has invalid (post-split) stats; must scrub before tier agent
> can activate
>
> In OSD logs, scrubs are starting in a loop without succeeding for all
> pg of this pool.
>
> What we already tried without luck so far:
>
> - shutdown / restart OSD
> - rebalance pg between OSD
> - raise the memory on OSD
> - repeer PG
>
> Any idea what is causing this? any help will be greatly appreciated
>
> Thanks
>
> Cédric
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-20 Thread Yuri Weinstein
We have restarted QE validation after fixing issues and merging several PRs.
The new Build 3 (rebase of pacific) tests are summarized in the same
note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1

Seeking approvals:

rados - Radek, Junior, Travis, Ernesto, Adam King
rgw - Casey
fs - Venky
rbd - Ilya
krbd - Ilya

upgrade/octopus-x (pacific) - Adam King, Casey PTL

upgrade/pacific-p2p - Casey PTL

ceph-volume - Guillaume, fixed by
https://github.com/ceph/ceph/pull/55658 retesting

On Thu, Feb 8, 2024 at 8:43 AM Casey Bodley  wrote:
>
> thanks, i've created https://tracker.ceph.com/issues/64360 to track
> these backports to pacific/quincy/reef
>
> On Thu, Feb 8, 2024 at 7:50 AM Stefan Kooman  wrote:
> >
> > Hi,
> >
> > Is this PR: https://github.com/ceph/ceph/pull/54918 included as well?
> >
> > You definitely want to build the Ubuntu / debian packages with the
> > proper CMAKE_CXX_FLAGS. The performance impact on RocksDB is _HUGE_.
> >
> > Thanks,
> >
> > Gr. Stefan
> >
> > P.s. Kudos to Mark Nelson for figuring it out / testing.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-20 Thread Ilya Dryomov
On Tue, Feb 20, 2024 at 4:59 PM Yuri Weinstein  wrote:
>
> We have restarted QE validation after fixing issues and merging several PRs.
> The new Build 3 (rebase of pacific) tests are summarized in the same
> note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
>
> Seeking approvals:
>
> rados - Radek, Junior, Travis, Ernesto, Adam King
> rgw - Casey
> fs - Venky
> rbd - Ilya
> krbd - Ilya

rbd and krbd approved.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread quag...@bol.com.br
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Anthony D'Atri


> Hi Anthony,
>  Did you decide that it's not a feature to be implemented?

That isn't up to me.

>  I'm asking about this so I can offer options here.
> 
>  I'd not be comfortable enabling "mon_allow_pool_size_one" on a specific 
> pool.
> 
> It would be better if this feature could write the replica at a later time on 
> a selected pool.
> Thanks.
> Rafael.
> 
>  
> 
> From: "Anthony D'Atri" 
> Sent: 2024/02/01 15:00:59
> To: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance improvement suggestion
>  
> I'd totally defer to the RADOS folks.
> 
> One issue might be adding a separate code path, which can have all sorts of 
> problems.
> 
> > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> >
> >
> >
> > Ok Anthony,
> >
> > I understood what you said. I also believe in all the professional history 
> > and experience you have.
> >
> > Anyway, could there be a configuration flag to make this happen?
> >
> > As well as those that already exist: "--yes-i-really-mean-it".
> >
> > This way, the storage pattern would remain as it is. However, it would 
> > allow situations like the one I mentioned to be possible.
> >
> > This situation will permit some rules to be relaxed (even if they are not 
> > ok at first).
> > Likewise, there are already situations like lazyio that make some 
> > exceptions to standard procedures.
> > Remembering: it's just a suggestion.
> > If this type of functionality is not interesting, it is ok.
> >
> >
> >
> > Rafael.
> >
> >
> > From: "Anthony D'Atri" 
> > Sent: 2024/02/01 12:10:30
> > To: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Performance improvement suggestion
> >
> >
> >
> > > I didn't say I would accept the risk of losing data.
> >
> > That's implicit in what you suggest, though.
> >
> > > I just said that it would be interesting if the objects were first 
> > > recorded only in the primary OSD.
> >
> > What happens when that host / drive smokes before it can replicate? What 
> > happens if a secondary OSD gets a read op before the primary updates it? 
> > Swift object storage users have to code around this potential. It's a 
> > non-starter for block storage.
> >
> > This is similar to why RoC HBAs (which are a badly outdated thing to begin 
> > with) will only enter writeback mode if they have a BBU / supercap -- and 
> > of course if their firmware and hardware isn't pervasively buggy. Guess how 
> > I know this?
> >
> > > This way it would greatly increase performance (both for IOPS and 
> > > throughput).
> >
> > It might increase low-QD IOPS for a single client on slow media with 
> > certain networking. Depending on media, it wouldn't increase throughput.
> >
> > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x 
> > the network resources between the client and the servers.
> >
> > > Later (in the background), record the replicas. This situation would 
> > > avoid leaving users/software waiting for the recording response from all 
> > > replicas when the storage is overloaded.
> >
> > If one makes the mistake of using HDDs, they're going to be overloaded no 
> > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a 
> > stone. Throughput is going to be limited by the SATA interface and seeking 
> > no matter what.
> >
> > > Where I work, performance is very important and we don't have money to 
> > > make an entire cluster only with NVMe.
> >
> > If there isn't money, then it isn't very important. But as I've written 
> > before, NVMe clusters *do not cost appreciably more than spinners* unless 
> > your procurement processes are bad. In fact they can cost significantly 
> > less. This is especially true with object storage and archival where one 
> > can leverage QLC.
> >
> > * Buy generic drives from a VAR, not channel drives through a chassis 
> > brand. Far less markup, and moreover you get the full 5 year warranty, not 
> > just 3 years. And you can painlessly RMA drives yourself - you don't have 
> > to spend hours going back and forth with $chassisvendor's TAC arguing about 
> > every single RMA. I've found that this is so bad that it is more economical 
> > to just throw away a failed component worth < USD 500 than to RMA it. Do 
> > you pay for extended warranty / support? That's expensive too.
> >
> > * Certain chassis brands who shall remain nameless push RoC HBAs hard with 
> > extreme markups. List prices as high as USD2000. Per server, eschewing 
> > those abominations makes up for a lot of the drive-only unit economics
> >
> > * But this is the part that lots of people don't get: You don't just stack 
> > up the drives on a desk and use them. They go into *servers* that cost 
> > money and *racks* that cost money. They take *power* that costs money.
> >
> > * $ / IOPS are FAR better for ANY SSD than for HDDs
> >
> > * RUs cost money, so do chassis and switches
> >
> > * Drive failures cost money
> >
> > * So does having your p

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Özkan Göksu
Hello.

I didn't test it personally, but what about a rep 1 write cache pool on NVMe
backed by another rep 2 pool?

In theory, it has the potential to do exactly what you are looking for.


On Thu, Feb 1, 2024 at 20:54, quag...@bol.com.br  wrote:

>
>
> Ok Anthony,
>
> I understood what you said. I also believe in all the professional history
> and experience you have.
>
> Anyway, could there be a configuration flag to make this happen?
>
> As well as those that already exist: "--yes-i-really-mean-it".
>
> This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
>
> This situation will permit some rules to be relaxed (even if they are not
> ok at first).
> Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
>
>
> Remembering: it's just a suggestion.
> If this type of functionality is not interesting, it is ok.
>
>
> Rafael.
>
> --
>
> *From: *"Anthony D'Atri" 
> *Sent: *2024/02/01 12:10:30
> *To: *quag...@bol.com.br
> *Cc: * ceph-users@ceph.io
> *Subject: * [ceph-users] Re: Performance improvement suggestion
>
>
>
> > I didn't say I would accept the risk of losing data.
>
> That's implicit in what you suggest, though.
>
> > I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.
>
> What happens when that host / drive smokes before it can replicate? What
> happens if a secondary OSD gets a read op before the primary updates it?
> Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
>
> This is similar to why RoC HBAs (which are a badly outdated thing to begin
> with) will only enter writeback mode if they have a BBU / supercap -- and
> of course if their firmware and hardware isn't pervasively buggy. Guess how
> I know this?
>
> > This way it would greatly increase performance (both for IOPS and
> throughput).
>
> It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
>
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> the network resources between the client and the servers.
>
> > Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.
>
> If one makes the mistake of using HDDs, they're going to be overloaded no
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> stone. Throughput is going to be limited by the SATA interface and seeking
> no matter what.
>
> > Where I work, performance is very important and we don't have money to
> make an entire cluster only with NVMe.
>
> If there isn't money, then it isn't very important. But as I've written
> before, NVMe clusters *do not cost appreciably more than spinners* unless
> your procurement processes are bad. In fact they can cost significantly
> less. This is especially true with object storage and archival where one
> can leverage QLC.
>
> * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA drives yourself - you don't have
> to spend hours going back and forth with $chassisvendor's TAC arguing about
> every single RMA. I've found that this is so bad that it is more economical
> to just throw away a failed component worth < USD 500 than to RMA it. Do
> you pay for extended warranty / support? That's expensive too.
>
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> extreme markups. List prices as high as USD2000. Per server, eschewing
> those abominations makes up for a lot of the drive-only unit economics
>
> * But this is the part that lots of people don't get: You don't just stack
> up the drives on a desk and use them. They go into *servers* that cost
> money and *racks* that cost money. They take *power* that costs money.
>
> * $ / IOPS are FAR better for ANY SSD than for HDDs
>
> * RUs cost money, so do chassis and switches
>
> * Drive failures cost money
>
> * So does having your people and applications twiddle their thumbs waiting
> for stuff to happen. I worked for a supercomputer company who put
> low-memory low-end diskless workstations on engineer's desks. They spent
> lots of time doing nothing waiting for their applications to respond. This
> company no longer exists.
>
> * So does the risk of taking *weeks* to heal from a drive failure
>
> Punch honest numbers into
> https://www.snia.org/forums/cmsi/programs/TCOcalc
>
> I walked through this with a certain global company. QLC SSDs were
> demonstrated to have like 30% lower TCO than spinners. Part of the equation
> is that they were accustomed to limiting HDD size to 8 TB because of the
> bottlenecks, and thus requiri

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Anthony D'Atri
Cache tiering is deprecated.

> On Feb 20, 2024, at 17:03, Özkan Göksu  wrote:
> 
> Hello.
> 
> I didn't test it personally, but what about a rep 1 write cache pool on NVMe
> backed by another rep 2 pool?
> 
> In theory, it has the potential to do exactly what you are looking for.
> 
> 
> On Thu, Feb 1, 2024 at 20:54, quag...@bol.com.br  wrote:
> 
>> 
>> 
>> Ok Anthony,
>> 
>> I understood what you said. I also believe in all the professional history
>> and experience you have.
>> 
>> Anyway, could there be a configuration flag to make this happen?
>> 
>> As well as those that already exist: "--yes-i-really-mean-it".
>> 
>> This way, the storage pattern would remain as it is. However, it would
>> allow situations like the one I mentioned to be possible.
>> 
>> This situation will permit some rules to be relaxed (even if they are not
>> ok at first).
>> Likewise, there are already situations like lazyio that make some
>> exceptions to standard procedures.
>> 
>> 
>> Remembering: it's just a suggestion.
>> If this type of functionality is not interesting, it is ok.
>> 
>> 
>> Rafael.
>> 
>> --
>> 
>> *From: *"Anthony D'Atri" 
>> *Sent: *2024/02/01 12:10:30
>> *To: *quag...@bol.com.br
>> *Cc: * ceph-users@ceph.io
>> *Subject: * [ceph-users] Re: Performance improvement suggestion
>> 
>> 
>> 
>>> I didn't say I would accept the risk of losing data.
>> 
>> That's implicit in what you suggest, though.
>> 
>>> I just said that it would be interesting if the objects were first
>> recorded only in the primary OSD.
>> 
>> What happens when that host / drive smokes before it can replicate? What
>> happens if a secondary OSD gets a read op before the primary updates it?
>> Swift object storage users have to code around this potential. It's a
>> non-starter for block storage.
>> 
>> This is similar to why RoC HBAs (which are a badly outdated thing to begin
>> with) will only enter writeback mode if they have a BBU / supercap -- and
>> of course if their firmware and hardware isn't pervasively buggy. Guess how
>> I know this?
>> 
>>> This way it would greatly increase performance (both for IOPS and
>> throughput).
>> 
>> It might increase low-QD IOPS for a single client on slow media with
>> certain networking. Depending on media, it wouldn't increase throughput.
>> 
>> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
>> the network resources between the client and the servers.
>> 
>>> Later (in the background), record the replicas. This situation would
>> avoid leaving users/software waiting for the recording response from all
>> replicas when the storage is overloaded.
>> 
>> If one makes the mistake of using HDDs, they're going to be overloaded no
>> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
>> stone. Throughput is going to be limited by the SATA interface and seeking
>> no matter what.
>> 
>>> Where I work, performance is very important and we don't have money to
>> make an entire cluster only with NVMe.
>> 
>> If there isn't money, then it isn't very important. But as I've written
>> before, NVMe clusters *do not cost appreciably more than spinners* unless
>> your procurement processes are bad. In fact they can cost significantly
>> less. This is especially true with object storage and archival where one
>> can leverage QLC.
>> 
>> * Buy generic drives from a VAR, not channel drives through a chassis
>> brand. Far less markup, and moreover you get the full 5 year warranty, not
>> just 3 years. And you can painlessly RMA drives yourself - you don't have
>> to spend hours going back and forth with $chassisvendor's TAC arguing about
>> every single RMA. I've found that this is so bad that it is more economical
>> to just throw away a failed component worth < USD 500 than to RMA it. Do
>> you pay for extended warranty / support? That's expensive too.
>> 
>> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
>> extreme markups. List prices as high as USD2000. Per server, eschewing
>> those abominations makes up for a lot of the drive-only unit economics
>> 
>> * But this is the part that lots of people don't get: You don't just stack
>> up the drives on a desk and use them. They go into *servers* that cost
>> money and *racks* that cost money. They take *power* that costs money.
>> 
>> * $ / IOPS are FAR better for ANY SSD than for HDDs
>> 
>> * RUs cost money, so do chassis and switches
>> 
>> * Drive failures cost money
>> 
>> * So does having your people and applications twiddle their thumbs waiting
>> for stuff to happen. I worked for a supercomputer company who put
>> low-memory low-end diskless workstations on engineer's desks. They spent
>> lots of time doing nothing waiting for their applications to respond. This
>> company no longer exists.
>> 
>> * So does the risk of taking *weeks* to heal from a drive failure
>> 
>> Punch honest numbers into
>> https://www.snia.org/forums/cmsi/programs/TCOcalc
>> 
>> I

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Alex Gorbachev
I would be against such an option, because it introduces a significant risk
of data loss.  Ceph has made a name for itself as a very reliable system,
where almost no one lost data, no matter how bad of a decision they made
with architecture and design.  This is what you pay for in commercial
systems, to "not be allowed a bad choice", and this is what everyone gets
with Ceph for free (if they so choose).

Allowing a change like this would likely be the beginning of the end of
Ceph.  It is a bad idea in the extreme.  Ceph reliability should never be
compromised.

There are other options for storage that are robust and do not require as
much investment.  Use ZFS, with NFS if needed.  Use bcache/flashcache, or
something similar on the client side.  Use proper RAM caching in databases
and applications.
--
Alex Gorbachev
Intelligent Systems Services Inc.
STORCIUM



On Tue, Feb 20, 2024 at 3:04 PM Anthony D'Atri 
wrote:

>
>
> > Hi Anthony,
> >  Did you decide that it's not a feature to be implemented?
>
> That isn't up to me.
>
> >  I'm asking about this so I can offer options here.
> >
> >  I'd not be confortable to enable "mon_allow_pool_size_one" at a
> specific pool.
> >
> > It would be better if this feature could write the replica at a later time
> on a selected pool.
> > Thanks.
> > Rafael.
> >
> >
> >
> > From: "Anthony D'Atri" 
> > Sent: 2024/02/01 15:00:59
> > To: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Performance improvement suggestion
> >
> > I'd totally defer to the RADOS folks.
> >
> > One issue might be adding a separate code path, which can have all sorts
> of problems.
> >
> > > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> > >
> > >
> > >
> > > Ok Anthony,
> > >
> > > I understood what you said. I also believe in all the professional
> history and experience you have.
> > >
> > > Anyway, could there be a configuration flag to make this happen?
> > >
> > > As well as those that already exist: "--yes-i-really-mean-it".
> > >
> > > This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
> > >
> > > This situation will permit some rules to be relaxed (even if they are
> not ok at first).
> > > Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
> > > Remembering: it's just a suggestion.
> > > If this type of functionality is not interesting, it is ok.
> > >
> > >
> > >
> > > Rafael.
> > >
> > >
> > > From: "Anthony D'Atri" 
> > > Sent: 2024/02/01 12:10:30
> > > To: quag...@bol.com.br
> > > Cc: ceph-users@ceph.io
> > > Subject: [ceph-users] Re: Performance improvement suggestion
> > >
> > >
> > >
> > > > I didn't say I would accept the risk of losing data.
> > >
> > > That's implicit in what you suggest, though.
> > >
> > > > I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.
> > >
> > > What happens when that host / drive smokes before it can replicate?
> What happens if a secondary OSD gets a read op before the primary updates
> it? Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
> > >
> > > This is similar to why RoC HBAs (which are a badly outdated thing to
> begin with) will only enter writeback mode if they have a BBU / supercap --
> and of course if their firmware and hardware isn't pervasively buggy. Guess
> how I know this?
> > >
> > > > This way it would greatly increase performance (both for IOPS and
> throughput).
> > >
> > > It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
> > >
> > > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use
> 3x the network resources between the client and the servers.
> > >
> > > > Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.
> > >
> > > If one makes the mistake of using HDDs, they're going to be overloaded
> no matter how one slices and dices the ops. Ya just canna squeeze IOPS from
> a stone. Throughput is going to be limited by the SATA interface and
> seeking no matter what.
> > >
> > > > Where I work, performance is very important and we don't have money
> to make an entire cluster only with NVMe.
> > >
> > > If there isn't money, then it isn't very important. But as I've
> written before, NVMe clusters *do not cost appreciably more than spinners*
> unless your procurement processes are bad. In fact they can cost
> significantly less. This is especially true with object storage and
> archival where one can leverage QLC.
> > >
> > > * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA

[ceph-users] User + Dev Meetup February 22 - CephFS Snapshots story!

2024-02-20 Thread Neha Ojha
Hi everyone,

You are invited to join us at the User + Dev meeting this week Thursday,
February 22 at 10:00 AM Eastern Time!

Focus Topic: CephFS Snapshots Evaluation
Presented by: Enrico Bocchi and Abhishek Lekshmanan, Ceph operators from
CERN

From the presenters:

Ceph at CERN provides block, object, and file storage backing the IT
infrastructure of the Organization. CephFS, in particular, is largely used
through the integration with OpenStack Manila by container-based workloads
(Kubernetes, OpenShift), HPC MPI clusters, and as a general-purpose
networked file system for enterprise groupware and open infrastructure
technologies (code/software repositories, monitoring, analytics, etc.).

Our presentation focuses on CephFS snapshots and their implications on
performance and stability. Snapshots would be a valuable addition to our
existing CephFS service, as they allow for storage rollback and disaster
recovery through mirroring. According to our observations, however, they
introduce a non-negligible performance penalty and may jeopardize the
stability of the file system.

In particular, we would like to discuss:
1. Experiences with CephFS snapshots from other operators in the Ceph
community.
2. Tools and strategies one can deploy to pre-empt or mitigate issues.
3. How to effectively collaborate with upstream developers and interested
community users to address the identified limitations.

Feel free to add questions or additional topics under the "Open Discussion"
section on the agenda: https://pad.ceph.com/p/ceph-user-dev-monthly-minutes

If you have an idea for a focus topic you'd like to present at a future
meeting, you are welcome to submit it to this Google Form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4vJDGBrp6d-D3-BlQ/viewform?usp=sf_link
Any Ceph user or developer is eligible to submit!

Thanks,
Neha
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-20 Thread Venky Shankar
Hi Yuri,

On Tue, Feb 20, 2024 at 9:29 PM Yuri Weinstein  wrote:
>
> We have restarted QE validation after fixing issues and merging several PRs.
> The new Build 3 (rebase of pacific) tests are summarized in the same
> note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
>
> Seeking approvals:
>
> rados - Radek, Junior, Travis, Ernesto, Adam King
> rgw - Casey
> fs - Venky

fs approved. failures are -
https://tracker.ceph.com/projects/cephfs/wiki/Pacific#20-Feb-2024

> rbd - Ilya
> krbd - Ilya
>
> upgrade/octopus-x (pacific) - Adam King, Casey PTL
>
> upgrade/pacific-p2p - Casey PTL
>
> ceph-volume - Guillaume, fixed by
> https://github.com/ceph/ceph/pull/55658 retesting
>
> On Thu, Feb 8, 2024 at 8:43 AM Casey Bodley  wrote:
> >
> > thanks, i've created https://tracker.ceph.com/issues/64360 to track
> > these backports to pacific/quincy/reef
> >
> > On Thu, Feb 8, 2024 at 7:50 AM Stefan Kooman  wrote:
> > >
> > > Hi,
> > >
> > > Is this PR: https://github.com/ceph/ceph/pull/54918 included as well?
> > >
> > > You definitely want to build the Ubuntu / debian packages with the
> > > proper CMAKE_CXX_FLAGS. The performance impact on RocksDB is _HUGE_.
> > >
> > > Thanks,
> > >
> > > Gr. Stefan
> > >
> > > P.s. Kudos to Mark Nelson for figuring it out / testing.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-20 Thread Nizamudeen A
Dashboard approved. Our e2e specs are passing, but the suite failed because
of a different error:
"cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm
(CEPHADM_STRAY_DAEMON)" in cluster log


On Tue, Feb 20, 2024 at 9:29 PM Yuri Weinstein  wrote:

> We have restarted QE validation after fixing issues and merging several
> PRs.
> The new Build 3 (rebase of pacific) tests are summarized in the same
> note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
>
> Seeking approvals:
>
> rados - Radek, Junior, Travis, Ernesto, Adam King
> rgw - Casey
> fs - Venky
> rbd - Ilya
> krbd - Ilya
>
> upgrade/octopus-x (pacific) - Adam King, Casey PTL
>
> upgrade/pacific-p2p - Casey PTL
>
> ceph-volume - Guillaume, fixed by
> https://github.com/ceph/ceph/pull/55658 retesting
>
> On Thu, Feb 8, 2024 at 8:43 AM Casey Bodley  wrote:
> >
> > thanks, i've created https://tracker.ceph.com/issues/64360 to track
> > these backports to pacific/quincy/reef
> >
> > On Thu, Feb 8, 2024 at 7:50 AM Stefan Kooman  wrote:
> > >
> > > Hi,
> > >
> > > Is this PR: https://github.com/ceph/ceph/pull/54918 included as well?
> > >
> > > You definitely want to build the Ubuntu / debian packages with the
> > > proper CMAKE_CXX_FLAGS. The performance impact on RocksDB is _HUGE_.
> > >
> > > Thanks,
> > >
> > > Gr. Stefan
> > >
> > > P.s. Kudos to Mark Nelson for figuring it out / testing.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Dan van der Ster
Hi,

I just want to echo what the others are saying.

Keep in mind that RADOS needs to guarantee read-after-write consistency for
the higher level apps to work (RBD, RGW, CephFS). If you corrupt VM block
devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're
going to suffer some long days and nights recovering.

Anyway, I think that what you proposed has at best a similar reliability to
min_size=1. And note that min_size=1 is strongly discouraged because of the
very high likelihood that a device/network/power failure turns into a
visible outage. In short: your idea would turn every OSD into a SPoF.

How would you handle this very common scenario: a power outage followed by
at least one device failing to start afterwards?

1. Write object A from client.
2. Fsync to primary device completes.
3. Ack to client.
4. Writes sent to replicas.
5. Cluster wide power outage (before replicas committed).
6. Power restored, but the primary osd does not start (e.g. permanent hdd
failure).
7. Client tries to read object A.

Today, with min_size=1 such a scenario manifests as data loss: you get
either a down PG (with many many objects offline/IO blocked until you
manually decide which data loss mode to accept) or unfound objects (with
IO blocked until you accept data loss). With min_size=2 the likelihood of
data loss is dramatically reduced.
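
(For reference, min_size is a per-pool setting; e.g. something like
"ceph osd pool get <pool> min_size" / "ceph osd pool set <pool> min_size 2",
where <pool> is a placeholder for the pool name.)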

Another thing about that power loss scenario is that all dirty PGs would
need to be recovered when the cluster reboots. You'd lose all the writes in
transit and have to replay them from the primary's pg_log, or backfill if
the pg_log was too short. Again, any failure during that recovery would
lead to data loss.

So I think that to maintain any semblance of reliability, you'd need to at
least wait for a commit ack from the first replica (i.e. min_size=2). But
since the replica writes are dispatched in parallel, your speedup would
evaporate.

Another thing: I suspect this idea would result in many inconsistencies
from transient issues. You'd need to ramp up the number of parallel
deep-scrubs to look for those inconsistencies quickly, which would also
work against any potential speedup.
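
(Scrub concurrency is mostly governed by osd_max_scrubs; e.g. something
like "ceph config set osd osd_max_scrubs 3", where the value is purely
illustrative.)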

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/


On Wed, Jan 31, 2024, 11:49 quag...@bol.com.br  wrote:

> Hello everybody,
>  I would like to make a suggestion for improving performance in Ceph
> architecture.
>  I don't know if this group would be the best place or if my proposal
> is correct.
>
>  My suggestion would be in the item
> https://docs.ceph.com/en/latest/architecture/, at the end of the topic
> "Smart Daemons Enable Hyperscale".
>
> The client needs to "wait" for the configured number of replicas to
> be written (so that the client receives an ok and continues). This way, if
> there is slowness on any of the disks on which the PG will be updated, the
> client is left waiting.
>
>  It would be possible:
>
>  1-) Only record on the primary OSD
>  2-) Write the other replicas in the background (the same way as when an
> OSD fails: "degraded").
>
>  This way, client has a faster response when writing to storage:
> improving latency and performance (throughput and IOPS).
>
>  I would find it plausible to accept a period of time (seconds) until
> all replicas are ok (written asynchronously) at the expense of improving
> performance.
>
>  Could you evaluate this scenario?
>
>
> Rafael.
>
>  ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io