[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-26 Thread Kamoltat Sirivadhna
RADOS approved

On Wed, Feb 21, 2024 at 11:27 AM Yuri Weinstein  wrote:

> Still seeking approvals:
>
> rados - Radek, Junior, Travis, Adam King
>
> All other product areas have been approved and are ready for the release
> step.
>
> Pls also review the Release Notes: https://github.com/ceph/ceph/pull/55694
>
>
> On Tue, Feb 20, 2024 at 7:58 AM Yuri Weinstein 
> wrote:
> >
> > We have restarted QE validation after fixing issues and merging several
> PRs.
> > The new Build 3 (rebase of pacific) tests are summarized in the same
> > note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
> >
> > Seeking approvals:
> >
> > rados - Radek, Junior, Travis, Ernesto, Adam King
> > rgw - Casey
> > fs - Venky
> > rbd - Ilya
> > krbd - Ilya
> >
> > upgrade/octopus-x (pacific) - Adam King, Casey PTL
> >
> > upgrade/pacific-p2p - Casey PTL
> >
> > ceph-volume - Guillaume, fixed by
> > https://github.com/ceph/ceph/pull/55658 retesting
> >
> > On Thu, Feb 8, 2024 at 8:43 AM Casey Bodley  wrote:
> > >
> > > thanks, i've created https://tracker.ceph.com/issues/64360 to track
> > > these backports to pacific/quincy/reef
> > >
> > > On Thu, Feb 8, 2024 at 7:50 AM Stefan Kooman  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Is this PR: https://github.com/ceph/ceph/pull/54918 included as
> well?
> > > >
> > > > You definitely want to build the Ubuntu / debian packages with the
> > > > proper CMAKE_CXX_FLAGS. The performance impact on RocksDB is _HUGE_.
> > > >
> > > > Thanks,
> > > >
> > > > Gr. Stefan
> > > >
> > > > P.s. Kudos to Mark Nelson for figuring it out / testing.
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >
> > >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

Kamoltat Sirivadhna (HE/HIM)

SoftWare Engineer - Ceph Storage

ksiri...@redhat.com  T: (857) 253-8927
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-26 Thread Kamoltat Sirivadhna
Details of the RADOS run analysis:

yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/#collapseOne>



   1. https://tracker.ceph.com/issues/64455  task/test_orch_cli: "Health
   check failed: cephadm background work is paused (CEPHADM_PAUSED)" in
   cluster log (whitelisted)
   2. https://tracker.ceph.com/issues/64454 rados/cephadm/mgr-nfs-upgrade:
   "Health check failed: 1 stray daemon(s) not managed by cephadm
   (CEPHADM_STRAY_DAEMON)" in cluster log (whitelisted)
   3. https://tracker.ceph.com/issues/63887: Starting alertmanager fails
   due to a missing container (happens in Pacific)
   4. Failed to reconnect to smithi155 [7566763
   
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566763>
   ]
   5. https://tracker.ceph.com/issues/64278 Unable to update caps for
   client.iscsi.iscsi.a (known failures)
   6. https://tracker.ceph.com/issues/64452 Teuthology runs into
   "TypeError: expected string or bytes-like object" during log scraping
   (teuthology failure)
   7. https://tracker.ceph.com/issues/64343 Expected warnings that need to
   be whitelisted cause rados/cephadm tests to fail for 7566717
   
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566717>
   we need to add (ERR|WRN|SEC) to the whitelist
   8. https://tracker.ceph.com/issues/58145 orch/cephadm: nfs tests failing
   to mount exports ('mount -t nfs 10.0.31.120:/fake /mnt/foo' fails)
   7566724 (resolved issue re-opened)
   9. https://tracker.ceph.com/issues/63577 cephadm:
   docker.io/library/haproxy: toomanyrequests: You have reached your pull
   rate limit.
   10. https://tracker.ceph.com/issues/54071 rados/cephadm/osds: Invalid
   command: missing required parameter hostname() 7566747
   
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566747>


Note:

   1. Although 7566762 looks like a different failure in pulpito, the
   teuthology log shows it failed because of
   https://tracker.ceph.com/issues/64278.
   2. rados/cephadm/thrash/ … failed a lot because of
   https://tracker.ceph.com/issues/64452
   3. 7566717
   
<https://pulpito.ceph.com/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566717>.
   failed because we didn’t whitelist (ERR|WRN|SEC) in tasks.cephadm:Checking
   cluster log for badness...
   4. 7566724 https://tracker.ceph.com/issues/58145 the ganesha issue was
   marked resolved a year ago but popped up again, so the tracker was
   re-opened and Adam King was pinged (resolved)

7566777, 7566781, 7566796 are due to https://tracker.ceph.com/issues/63577



Whitelisted and re-ran:

yuriw-2024-02-22_21:39:39-rados-pacific-release-distro-default-smithi/
<https://pulpito.ceph.com/yuriw-2024-02-22_21:39:39-rados-pacific-release-distro-default-smithi/>

rados/cephadm/mds_upgrade_sequence/ —> failed to shut down mon (known
failure discussed with A.King)

rados/cephadm/mgr-nfs-upgrade —> failed to shut down mon (known failure
discussed with A.King)

rados/cephadm/osds —> zap disk error (known failure)

rados/cephadm/smoke-roleless —> toomanyrequests: You have reached your
pull rate limit. https://www.docker.com/increase-rate-limit. (known
failure)

rados/cephadm/thrash —> just needs CACHE_POOL_NEAR_FULL whitelisted (known
failure; a sketch of this kind of ignorelist filtering appears at the end
of this analysis)

rados/cephadm/upgrade —> CEPHADM_FAILED_DAEMON (WRN)  node-exporter  (known
failure discussed with A.King)

rados/cephadm/workunits —> known failure:
https://tracker.ceph.com/issues/63887
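
For context, "whitelisting" here means adding the expected warnings to the
job's log ignorelist so that the cephadm task's "Checking cluster log for
badness" step does not fail the run. A minimal Python sketch of that
filtering idea (illustrative only, not teuthology's actual implementation;
is_badness() and the regexes are hypothetical, with entries taken from the
failures listed above):

    import re

    # Severity tags that make a cluster-log line a candidate failure.
    severity = re.compile(r'\b(ERR|WRN|SEC)\b')

    # Expected warnings from the runs above that should not fail the job.
    ignorelist = [re.compile(p) for p in (
        r'CEPHADM_PAUSED',        # https://tracker.ceph.com/issues/64455
        r'CEPHADM_STRAY_DAEMON',  # https://tracker.ceph.com/issues/64454
        r'CACHE_POOL_NEAR_FULL',  # rados/cephadm/thrash re-run
    )]

    def is_badness(line: str) -> bool:
        """Return True if this cluster-log line should fail the run."""
        if not severity.search(line):
            return False
        return not any(p.search(line) for p in ignorelist)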

On Mon, Feb 26, 2024 at 10:22 AM Kamoltat Sirivadhna 
wrote:

> RADOS approved
>
> On Wed, Feb 21, 2024 at 11:27 AM Yuri Weinstein 
> wrote:
>
>> Still seeking approvals:
>>
>> rados - Radek, Junior, Travis, Adam King
>>
>> All other product areas have been approved and are ready for the release
>> step.
>>
>> Pls also review the Release Notes:
>> https://github.com/ceph/ceph/pull/55694
>>
>>
>> On Tue, Feb 20, 2024 at 7:58 AM Yuri Weinstein 
>> wrote:
>> >
>> > We have restarted QE validation after fixing issues and merging several
>> PRs.
>> > The new Build 3 (rebase of pacific) tests are summarized in the same
>> > note (see Build 3 runs) https://tracker.ceph.com/issues/64151#note-1
>> >
>> > Seeking approvals:
>> >
>> > rados - Radek, Junior, Travis, Ernesto, Adam King
>> > rgw - Casey
>> > fs - Venky
>> > rbd - Ilya
>> > krbd - Ilya
>> >
>> > upgrade/octopus-x (pacific) - Adam King, Casey PTL
>> >
>> > upgrade/pacific-p2p - Casey PTL
>> >
>> > ceph-volume - Guillaume, fixed by
>> > https://github.com/ceph/ceph/pull/55658 

[ceph-users] Ceph needs your help with defining availability!

2022-08-05 Thread Kamoltat Sirivadhna
Hi everyone,

One of the features we are looking into implementing for our upcoming Ceph
release (Reef) is the ability to track cluster availability over time.
However, the biggest *problem* that we are currently facing is basing our
measurement on a *definition of availability* that matches user
expectations or business objectives. Therefore, we think it is worthwhile
to ask for your opinion on what you think defines availability in a Ceph
cluster.

*Please help us* by filling in a *survey* that won't take longer than *10
minutes* to complete:

https://forms.gle/aFYvTCUM3s9daTJg8

Feel free to reach out to me if you have any questions,

Thank you and have a great weekend!
-- 

Kamoltat Sirivadhna (HE/HIM)

SoftWare Engineer - Ceph Storage

ksiri...@redhat.com  T: (857) 253-8927
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph needs your help with defining availability!

2022-08-09 Thread Kamoltat Sirivadhna
Hi John,

Yes, I'm planning to summarize the results after this week. I will
definitely share them with the community.

Best,

On Tue, Aug 9, 2022 at 1:19 PM John Bent  wrote:

> Hello Kamoltat,
>
> This sounds very interesting. Will you be sharing the results of the
> survey back with the community?
>
> Thanks,
>
> John
>
> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna 
> wrote:
>
>> Hi everyone,
>>
>> One of the features we are looking into implementing for our upcoming
>> Ceph release (Reef) is the ability to track cluster availability over time.
>> However, the biggest *problem* that we are currently facing is basing
>> our measurement on the *definition of availability* that matches user
>> expectations or business objectives. Therefore, we think it is worthwhile
>> to ask for your opinion on what you think defines availability in a Ceph
>> cluster.
>>
>> *Please help us* by filling in a *survey* that won't take longer than *10
>> minutes* to complete:
>>
>> https://forms.gle/aFYvTCUM3s9daTJg8
>>
>> Feel free to reach out to me if you have any questions,
>>
>> Thank you and have a great weekend!
>> --
>>
>> Kamoltat Sirivadhna (HE/HIM)
>>
>> SoftWare Engineer - Ceph Storage
>>
>> ksiri...@redhat.com    T: (857) 253-8927
>>
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
>>
>

-- 

Kamoltat Sirivadhna (HE/HIM)

SoftWare Engineer - Ceph Storage

ksiri...@redhat.com  T: (857) 253-8927
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph needs your help with defining availability!

2022-08-15 Thread Kamoltat Sirivadhna
Hi guys,

thank you so much for filling out the Ceph Cluster Availability survey!

We have received a total of 59 responses from various groups of people,
which is enough to help us understand more deeply what availability
means to everyone.

As promised, here is the link to the results of the survey:
https://docs.google.com/forms/d/1J5Ab5KCy6fceXxHI8KDqY2Qx3FzR-V9ivKp_vunEWZ0/viewanalytics

Also, I've summarized some of the written responses such that it is easier
for you to make sense of the results.

I hope you will find these responses helpful and please feel free to reach
out if you have any questions!

Response summary of the question:

“””

In your own words, please describe what availability means to you in a Ceph
cluster. (For example, is it the ability to serve read and write requests
even if the cluster is in a degraded state?).

“””

In summary, the majority of people consider the definition of availability
to be the ability to serve I/O with reasonable performance (some suggest
10-20%, others say it should be user-configurable), plus the ability to
provide other services. A couple of people define availability as all PGs
being in the active+clean state, but, as the next question shows, many
people disagree with this. Interestingly, a handful of people suggest that
cluster availability shouldn’t be binary, but rather a scale or set of
tiers; e.g., one response suggests the tiering below (a rough sketch of it
as a data structure follows the list):

   1. Fully available - all services can serve I/O at normal performance.
   2. Partially available -
      1. some access method, although configured, is not available, e.g.,
      CephFS works and RGW doesn’t.
      2. only reads or writes are possible on some storage pools.
      3. some storage pools are completely unavailable while others are
      completely or partially available.
      4. performance is severely degraded.
      5. some services are stopped/crashed.
   3. Unavailable - when "Partially available" is not reached.
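
A rough sketch of that suggested tiering expressed as a data structure (a
minimal, illustrative Python example only; the tier names come from the
response above and the classify() helper is hypothetical):

    from enum import Enum

    class Availability(Enum):
        FULLY_AVAILABLE = 1      # all services serve I/O at normal performance
        PARTIALLY_AVAILABLE = 2  # some pool, service, or access method degraded/down
        UNAVAILABLE = 3          # the bar for "partially available" is not met

    def classify(all_services_normal: bool, anything_still_serving: bool) -> Availability:
        """Toy classifier following the three-tier proposal above."""
        if all_services_normal:
            return Availability.FULLY_AVAILABLE
        if anything_still_serving:
            return Availability.PARTIALLY_AVAILABLE
        return Availability.UNAVAILABLE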


Moreover, some suggest that we should track availability on a per-pool
basis to deal with scenarios where we have different CRUSH rules or where
we can afford a pool to be unavailable. Furthermore, some responses care
more about the availability of one service than another; e.g., one response
states that they wouldn’t care about the availability of RADOS if RGW is
unavailable.

Response summary of the question:

“””

Do you agree with the following metric in evaluating a cluster's
availability:

"All placement group (PG) state in a cluster must have 'active'  in them,
if at least 1 PG does not have 'active' in them, then the cluster as a
whole is deemed as unavailable".

“””

35.8% of users answered `No`

35.8% of users answered `Yes`

28.3% of users answered `Maybe`

The data clearly shows that we can’t use this criterion alone for
availability. Therefore, here are some of the reasons why 64.1% (the `No`
and `Maybe` answers combined) do not fully agree with the statement.

If the client does not interact with that particular PG, then it is not
important; e.g., if 1 PG is inactive and the S3 endpoint is down but CephFS
can still serve I/O, we cannot say that the cluster is unavailable. Some
disagree because they believe that a PG belongs to a single pool, and
therefore that particular pool will be unavailable, not the cluster.
Furthermore, some suggest that there are events that can leave PGs
temporarily inactive, such as provisioning a new OSD, creating a pool, or a
PG split; however, these events don’t necessarily indicate unavailability.
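
For illustration, here is a minimal sketch of the metric as stated in the
question, i.e., the cluster counts as available only if every PG has
'active' in its state (this assumes the pg_map/pg_stats layout of `ceph pg
dump --format json`; the exact key path may differ between releases, and
the function name is hypothetical):

    import json
    import subprocess

    def cluster_available_by_pg_state() -> bool:
        """True only if every PG currently reports 'active' in its state."""
        out = subprocess.check_output(['ceph', 'pg', 'dump', '--format', 'json'])
        pg_stats = json.loads(out)['pg_map']['pg_stats']
        return all('active' in pg['state'] for pg in pg_stats)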

Response summary of the question:

“””

From your own experience, what are some of the most common events that
cause a Ceph cluster to be considered unavailable based on your definition
of availability.

“””

Top four responses:


   1. Network-related issues, e.g., network failure/instability.
   2. OSD-related issues, e.g., failure, slow ops, flapping.
   3. Disk-related issues, e.g., dead disks.
   4. PG-related issues, e.g., many PGs became stale, unknown, and stuck in
   peering.


Response summary of the question:

“””

Are there any events that you might consider a cluster to be unavailable
but you feel like it is not worth tracking and is dismissible?

“””

Top three responses:


   1. No, all unavailable events are worth tracking.
   2. Network-related issues.
   3. Scheduled upgrades or maintenance.



On Tue, Aug 9, 2022 at 1:51 PM Kamoltat Sirivadhna 
wrote:

> Hi John,
>
> Yes, I'm planning to summarize the results after this week. I will
> definitely share it with the community.
>
> Best,
>
> On Tue, Aug 9, 2022 at 1:19 PM John Bent  wrote:
>
>> Hello Kamoltat,
>>
>> This sounds very interesting. Will you be sharing the results of the
>> survey back with the community?
>>
>> Thanks,
>>
>> John
>>
>> On Sat, Aug 6, 2022 at 4:49 AM Kamoltat Sirivadhna 

[ceph-users] Slides from today's Ceph User + Dev Monthly Meeting

2022-09-15 Thread Kamoltat Sirivadhna
Hi guys,

thank you all for attending today's meeting, and apologies for the
restricted access.

Attached here is the slide deck in PDF format.

Let me know if you have any questions,

-- 

Kamoltat Sirivadhna (HE/HIM)

SoftWare Engineer - Ceph Storage

ksiri...@redhat.com  T: (857) 253-8927
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io