[ceph-users] Re: pacific 16.2.15 QE validation status
RADOS approved

On Wed, Feb 21, 2024 at 11:27 AM Yuri Weinstein wrote:
> Still seeking approvals:
>
> rados - Radek, Junior, Travis, Adam King
>
> All other product areas have been approved and are ready for the release step.
>
> Pls also review the Release Notes: https://github.com/ceph/ceph/pull/55694
>
> On Tue, Feb 20, 2024 at 7:58 AM Yuri Weinstein wrote:
> >
> > We have restarted QE validation after fixing issues and merging several PRs.
> > The new Build 3 (rebase of pacific) tests are summarized in the same
> > note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
> >
> > Seeking approvals:
> >
> > rados - Radek, Junior, Travis, Ernesto, Adam King
> > rgw - Casey
> > fs - Venky
> > rbd - Ilya
> > krbd - Ilya
> >
> > upgrade/octopus-x (pacific) - Adam King, Casey PTL
> >
> > upgrade/pacific-p2p - Casey PTL
> >
> > ceph-volume - Guillaume, fixed by
> > https://github.com/ceph/ceph/pull/55658 retesting
> >
> > On Thu, Feb 8, 2024 at 8:43 AM Casey Bodley wrote:
> > >
> > > thanks, i've created https://tracker.ceph.com/issues/64360 to track
> > > these backports to pacific/quincy/reef
> > >
> > > On Thu, Feb 8, 2024 at 7:50 AM Stefan Kooman wrote:
> > > >
> > > > Hi,
> > > >
> > > > Is this PR: https://github.com/ceph/ceph/pull/54918 included as well?
> > > >
> > > > You definitely want to build the Ubuntu / Debian packages with the
> > > > proper CMAKE_CXX_FLAGS. The performance impact on RocksDB is _HUGE_.
> > > >
> > > > Thanks,
> > > >
> > > > Gr. Stefan
> > > >
> > > > P.s. Kudos to Mark Nelson for figuring it out / testing.

--
Kamoltat Sirivadhna (HE/HIM)
Software Engineer - Ceph Storage
ksiri...@redhat.com  T: (857) 253-8927
[ceph-users] Re: pacific 16.2.15 QE validation status
details of RADOS run analysis:

yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/#collapseOne>

1. https://tracker.ceph.com/issues/64455 task/test_orch_cli: "Health check failed: cephadm background work is paused (CEPHADM_PAUSED)" in cluster log (whitelist)
2. https://tracker.ceph.com/issues/64454 rados/cephadm/mgr-nfs-upgrade: "Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log (whitelist)
3. https://tracker.ceph.com/issues/63887: Starting alertmanager fails from missing container (happens in Pacific)
4. Failed to reconnect to smithi155 [7566763 <https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566763>]
5. https://tracker.ceph.com/issues/64278 Unable to update caps for client.iscsi.iscsi.a (known failure)
6. https://tracker.ceph.com/issues/64452 Teuthology runs into "TypeError: expected string or bytes-like object" during log scraping (teuthology failure)
7. https://tracker.ceph.com/issues/64343 Expected warnings that need to be whitelisted cause rados/cephadm tests to fail; for 7566717 <https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566717> we need to add (ERR|WRN|SEC)
8. https://tracker.ceph.com/issues/58145 orch/cephadm: nfs tests failing to mount exports ('mount -t nfs 10.0.31.120:/fake /mnt/foo' fails) 7566724 (resolved issue re-opened)
9. https://tracker.ceph.com/issues/63577 cephadm: docker.io/library/haproxy: toomanyrequests: You have reached your pull rate limit.
10. https://tracker.ceph.com/issues/54071 rados/cephadm/osds: Invalid command: missing required parameter hostname() 7566747 <https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566747>

Notes:

1. Although 7566762 seems like a different failure from what is displayed in Pulpito, in the teuthology log it failed because of https://tracker.ceph.com/issues/64278.
2. rados/cephadm/thrash/ … failed a lot because of https://tracker.ceph.com/issues/64452.
3. 7566717 <https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566717> failed because we didn't whitelist (ERR|WRN|SEC) in "tasks.cephadm: Checking cluster log for badness..." (see the sketch after this message).
4. 7566724: https://tracker.ceph.com/issues/58145 (ganesha) seemed resolved a year ago, but popped up again, so the tracker was re-opened and Adam King pinged (resolved).

7566777, 7566781, 7566796 are due to https://tracker.ceph.com/issues/63577

Whitelisted and re-ran: yuriw-2024-02-22_21:39:39-rados-pacific-release-distro-default-smithi/
<https://pulpito.ceph.com/yuriw-2024-02-22_21:39:39-rados-pacific-release-distro-default-smithi/>

rados/cephadm/mds_upgrade_sequence/ —> failed to shut down mon (known failure discussed with A. King)
rados/cephadm/mgr-nfs-upgrade —> failed to shut down mon (known failure discussed with A. King)
rados/cephadm/osds —> zap disk error (known failure)
rados/cephadm/smoke-roleless —> toomanyrequests: You have reached your pull rate limit. https://www.docker.com/increase-rate-limit (known failure)
rados/cephadm/thrash —> just needs to whitelist CACHE_POOL_NEAR_FULL (known failure)
rados/cephadm/upgrade —> CEPHADM_FAILED_DAEMON (WRN) node-exporter (known failure discussed with A. King)
rados/cephadm/workunits —> known failure: https://tracker.ceph.com/issues/63887

On Mon, Feb 26, 2024 at 10:22 AM Kamoltat Sirivadhna wrote:
> RADOS approved
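For context on the whitelisting notes above, below is a rough Python sketch of how a whitelist-based "cluster log badness" scan works. This is illustrative only and not the actual teuthology implementation; the function name, the severity pattern, and the sample log lines are assumptions.

    import re

    # Hypothetical sketch (not the real teuthology code): scan cluster log lines
    # for ERR/WRN/SEC entries and drop any line matched by a whitelist entry.
    SEVERITY = re.compile(r"\[(ERR|WRN|SEC)\]")

    def find_badness(log_lines, whitelist):
        """Return the 'bad' log lines that are not covered by the whitelist."""
        ignore = [re.compile(pattern) for pattern in whitelist]
        return [
            line for line in log_lines
            if SEVERITY.search(line) and not any(p.search(line) for p in ignore)
        ]

    # Example: the CEPHADM_PAUSED health warning from tracker 64455 would fail a
    # run unless a matching whitelist entry exists.
    whitelist = [r"CEPHADM_PAUSED", r"CEPHADM_STRAY_DAEMON", r"CACHE_POOL_NEAR_FULL"]
    log = [
        "cluster [WRN] Health check failed: cephadm background work is paused (CEPHADM_PAUSED)",
        "cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    ]
    print(find_badness(log, whitelist))  # only the MDS_ALL_DOWN line remains

The point of item 7 in the list above is simply that expected health warnings have to be enumerated in such a whitelist before the log scan runs; otherwise an otherwise-successful job is marked as failed.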
[ceph-users] Ceph needs your help with defining availability!
Hi everyone,

One of the features we are looking into implementing for our upcoming Ceph release (Reef) is the ability to track cluster availability over time. However, the biggest *problem* we currently face is basing our measurement on a *definition of availability* that matches user expectations or business objectives. Therefore, we think it is worthwhile to ask for your opinion on what you think defines availability in a Ceph cluster.

*Please help us* by filling in a *survey* that won't take longer than *10 minutes* to complete:

https://forms.gle/aFYvTCUM3s9daTJg8

Feel free to reach out to me if you have any questions.

Thank you and have a great weekend!
--
Kamoltat Sirivadhna (HE/HIM)
Software Engineer - Ceph Storage
ksiri...@redhat.com  T: (857) 253-8927
[ceph-users] Re: Ceph needs your help with defining availability!
Hi John,

Yes, I'm planning to summarize the results after this week. I will definitely share it with the community.

Best,

On Tue, Aug 9, 2022 at 1:19 PM John Bent wrote:
> Hello Kamoltat,
>
> This sounds very interesting. Will you be sharing the results of the
> survey back with the community?
>
> Thanks,
>
> John
>
> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna wrote:
>
>> Hi everyone,
>>
>> One of the features we are looking into implementing for our upcoming
>> Ceph release (Reef) is the ability to track cluster availability over time.

--
Kamoltat Sirivadhna (HE/HIM)
Software Engineer - Ceph Storage
ksiri...@redhat.com  T: (857) 253-8927
[ceph-users] Re: Ceph needs your help with defining availability!
Hi guys,

Thank you so much for filling out the Ceph Cluster Availability survey! We received a total of 59 responses from various groups of people, which is enough to help us understand more profoundly what availability means to everyone.

As promised, here is the link to the results of the survey:
https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics

I've also summarized some of the written responses so that it is easier to make sense of the results. I hope you will find these responses helpful, and please feel free to reach out if you have any questions!

Response summary of the question:
"""
In your own words, please describe what availability means to you in a Ceph cluster. (For example, is it the ability to serve read and write requests even if the cluster is in a degraded state?).
"""
In summary, the majority of people consider availability to be the ability to serve I/O with reasonable performance (some suggest within 10-20% of normal, others say it should be user configurable), plus the ability to provide other services. A couple of people define availability as all PGs being in the active+clean state, but we will see that many people disagree with this in the next question.

Interestingly, a handful of people suggest that cluster availability shouldn't be binary, but rather a scale or tiers. For example, one response suggests that we should have:

1. Fully available - all services can serve I/O at normal performance.
2. Partially available:
   1. some access method, although configured, is not available, e.g., CephFS works and RGW doesn't.
   2. only reads or writes are possible on some storage pools.
   3. some storage pools are completely unavailable while others are completely or partially available.
   4. performance is severely degraded.
   5. some services are stopped/crashed.
3. Unavailable - when "partially available" is not reached.

Moreover, some suggest that we should track availability on a per-pool basis to deal with scenarios where we have different CRUSH rules or where we can afford a pool to be unavailable. Furthermore, some responses care more about the availability of one service than another; e.g., one response states that they wouldn't care about the availability of RADOS if RGW is unavailable.

Response summary of the question:
"""
Do you agree with the following metric in evaluating a cluster's availability: "All placement group (PG) state in a cluster must have 'active' in them, if at least 1 PG does not have 'active' in them, then the cluster as a whole is deemed as unavailable".
"""
35.8% of users answered `No`
35.8% of users answered `Yes`
28.3% of users answered `maybe`

The data clearly shows that we can't use this alone as the criterion for availability. Here are some of the reasons why 64.1% do not fully agree with the statement:

1. If the client does not interact with that particular PG, then it is not important; e.g., if 1 PG is inactive and the S3 endpoint is down but CephFS can still serve I/O, we cannot say that the cluster is unavailable.
2. Some disagree because they believe that a PG relates to a single pool; therefore, that particular pool will be unavailable, not the cluster.
3. Some suggest that there are events that can lead to PGs temporarily not being active, such as provisioning a new OSD, creating a pool, or a PG split; these events don't necessarily indicate unavailability.
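To make concrete why this strict PG-based metric divides opinion, here is a minimal Python sketch of the criterion from the question above, assuming a list of PG state strings has already been collected (for example, parsed out of `ceph pg dump`). The helper function is hypothetical, not a Ceph API.

    # Minimal sketch of the "every PG state must contain 'active'" criterion
    # debated above. The helper is hypothetical; it only assumes the PG state
    # strings are already available.

    def cluster_available(pg_states):
        """Strict binary metric: available only if every PG state contains 'active'."""
        return all("active" in state for state in pg_states)

    # A cluster mid-recovery: a single peering PG makes the strict metric report
    # "unavailable", even though most clients may never touch that PG, which is
    # exactly the objection raised by many survey respondents.
    states = ["active+clean", "active+recovery_wait+degraded", "peering"]
    print(cluster_available(states))  # False

A tiered classification along the lines of the "fully / partially / unavailable" scale above would instead consider which pools and which access methods (RADOS, RGW, CephFS, RBD) the non-active PGs actually affect.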
Response summary of the question:
"""
From your own experience, what are some of the most common events that cause a Ceph cluster to be considered unavailable based on your definition of availability.
"""
Top four responses:
1. Network-related issues, e.g., network failure/instability.
2. OSD-related issues, e.g., failures, slow ops, flapping.
3. Disk-related issues, e.g., dead disks.
4. PG-related issues, e.g., many PGs becoming stale, unknown, or stuck in peering.

Response summary of the question:
"""
Are there any events that you might consider a cluster to be unavailable but you feel like it is not worth tracking and is dismissible?
"""
Top three responses:
1. No, all unavailability events are worth tracking.
2. Network-related issues.
3. Scheduled upgrades or maintenance.

On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna wrote:
> Hi John,
>
> Yes, I'm planning to summarize the results after this week. I will
> definitely share it with the community.
[ceph-users] Slides from today's Ceph User + Dev Monthly Meeting
Hi guys,

Thank you all for attending today's meeting, and apologies for the restricted access. Attached are the slides in PDF format.

Let me know if you have any questions,
--
Kamoltat Sirivadhna (HE/HIM)
Software Engineer - Ceph Storage
ksiri...@redhat.com  T: (857) 253-8927