[ceph-users] Re: [RFC][UADK integration][Acceleration of zlib compressor]

2024-07-11 Thread Brad Hubbard
On Thu, Jul 11, 2024 at 10:42 PM Rongqi Sun  wrote:
>
> Hi Ceph community,

Hi Rongqi,

Thanks for proposing this and for attending CDM to discuss it
yesterday. I see we have received some good feedback in the PR and
it's awaiting some suggested changes. I think this will be a useful
and performant addition to the code base and encourage its inclusion.

>
> UADK is an open source accelerator framework. Its kernel support part is
> UACCE, which has been merged into the kernel for several years and aims
> to provide Shared Virtual Addressing (SVA) between accelerators and
> processes. UADK provides users with a unified programming interface to
> efficiently harness the capabilities of hardware accelerators, and it
> furnishes fundamental library and driver support. UADK can currently
> drive hardware accelerator engines (e.g. Kunpeng KAE) as well as Arm SVE
> and the CPU Crypto Extension. UADK is hosted by Linaro.
>
> UADK already supports compression and encryption in other communities
> such as OpenSSL, DPDK and SPDK, so we would now like to bring it to Ceph
> to accelerate the zlib compressor. As a first step, we can see that:
>
>    1. it saves almost 50% CPU usage compared with non-ISA-L compression
>    in an RBD 4M workload
>    2. it saves almost 40% CPU usage compared with non-ISA-L compression
>    in an RGW put op (4M) workload
>
> The PR is under review; any comments or reviews are welcome.
>
> Have a nice day~
>
> Rongqi Sun
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ref v18.2.0 QE Validation status

2023-08-02 Thread Brad Hubbard
On Thu, Aug 3, 2023 at 8:31 AM Yuri Weinstein  wrote:

> Updates:
>
> 1. bookworm distro build support
> We will not build bookworm until Debian bug
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129 is resolved
>
> 2. nfs ganesha
> fixed (Thanks Guillaume and Kaleb)
>
> 3. powercycle failures due to teuthology - fixed (Thanks Brad, Zack,
> Dan).  I am expecting Brad to approve it, as the SELinux denials are a
> known issue.
>

Confirmed, definite pass, much better.


>
> smoke approved.
>
> Unless I hear any objections I see no open issues and will start building
> (jammy focal centos8 centos9 windows)
>
> On Wed, Aug 2, 2023 at 7:56 AM Yuri Weinstein  wrote:
>
>> https://github.com/ceph/ceph/pull/52710 merged
>>
>> smoke:
>> https://tracker.ceph.com/issues/62227 marked resolved
>> https://tracker.ceph.com/issues/62228 in progress
>>
>> Only smoke and powercycle items are remaining from the test approval
>> standpoint.
>>
>> Here is the quote from Neha's status to the clt summarizing the correct
>> outstanding issues:
>> "
>> 1. bookworm distro build support
>> action item - Neha and Josh will have a discussion with Dan to figure out
>> what help he needs and if at all there is a workaround for the first reef
>> release
>>
>> 2. nfs ganesha
>> action item - figure out with Guillaume and Kaleb if a workaround of
>> pinning to an older release is possible
>>
>> 3. powercycle failures due to teuthology -
>> https://pulpito.ceph.com/yuriw-2023-07-29_14:04:17-powercycle-reef-release-distro-default-smithi/
>>
>> action item - open a tracker and get Zack's thoughts on whether this
>> issue can be classified as infra-only and get powercycle approvals from Brad
>> "
>>
>> On Wed, Aug 2, 2023 at 3:03 AM Radoslaw Zarzynski 
>> wrote:
>>
>>> Final ACK for RADOS.
>>>
>>> On Tue, Aug 1, 2023 at 6:28 PM Laura Flores  wrote:
>>>
 Rados failures are summarized here:
 https://tracker.ceph.com/projects/rados/wiki/REEF#Reef-v1820

 All are known. Will let Radek give the final ack.

 On Tue, Aug 1, 2023 at 9:05 AM Nizamudeen A  wrote:

> dashboard approved! failure is unrelated and tracked via
> https://tracker.ceph.com/issues/58946
>
> Regards,
> Nizam
>
> On Sun, Jul 30, 2023 at 9:16 PM Yuri Weinstein 
> wrote:
>
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/62231#note-1
> >
> > Seeking approvals/reviews for:
> >
> > smoke - Laura, Radek
> > rados - Neha, Radek, Travis, Ernesto, Adam King
> > rgw - Casey
> > fs - Venky
> > orch - Adam King
> > rbd - Ilya
> > krbd - Ilya
> > upgrade-clients:client-upgrade* - in progress
> > powercycle - Brad
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > bookworm distro support is an outstanding issue.
> >
> > TIA
> > YuriW
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


 --

 Laura Flores

 She/Her/Hers

 Software Engineer, Ceph Storage 

 Chicago, IL

 lflo...@ibm.com | lflo...@redhat.com 
 M: +17087388804


 ___
 Dev mailing list -- d...@ceph.io
 To unsubscribe send an email to dev-le...@ceph.io

>>> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ref v18.2.0 QE Validation status

2023-08-01 Thread Brad Hubbard
On Mon, Jul 31, 2023 at 1:46 AM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/62231#note-1
>
> Seeking approvals/reviews for:
>
> smoke - Laura, Radek
> rados - Neha, Radek, Travis, Ernesto, Adam King
> rgw - Casey
> fs - Venky
> orch - Adam King
> rbd - Ilya
> krbd - Ilya
> upgrade-clients:client-upgrade* - in progress
> powercycle - Brad

Powercycle failures are predominantly related to a failure of some
teuthology code to detect the login prompt on the IPMI consoles.
We are waiting on feedback about why this code may have suddenly stopped
working and whether it is indicative of some other issue(s).
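
If anyone wants to eyeball one of those consoles by hand, this is roughly
what the detection code ends up watching (the host and credentials below
are placeholders, not values from the test environment):

$ ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate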

>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> bookworm distro support is an outstanding issue.
>
> TIA
> YuriW
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-01 Thread Brad Hubbard
On Fri, Apr 28, 2023 at 7:21 AM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/59542#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> smoke - Radek, Laura
> rados - Radek, Laura
>   rook - Sébastien Han
>   cephadm - Adam K
>   dashboard - Ernesto
>
> rgw - Casey
> rbd - Ilya
> krbd - Ilya
> fs - Venky, Patrick
> upgrade/octopus-x (pacific) - Laura (look the same as in 16.2.8)
> upgrade/pacific-p2p - Laura
> powercycle - Brad (SELinux denials)

Still waiting on https://github.com/ceph/teuthology/pull/1830 to
resolve these - Approved

> ceph-volume - Guillaume, Adam K
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.6 QE Validation status

2023-03-26 Thread Brad Hubbard
On Sat, Mar 25, 2023 at 5:46 AM Yuri Weinstein  wrote:
>
> Details of this release are updated here:
>
> https://tracker.ceph.com/issues/59070#note-1
> Release Notes - TBD
>
> The slowness we experienced seemed to resolve itself.
> Neha, Radek, and Laura please provide any findings if you have them.
>
> Seeking approvals/reviews for:
>
> rados - Neha, Radek, Travis, Ernesto, Adam King (rerun on Build 2 with
> PRs merged on top of quincy-release)
> rgw - Casey (rerun on Build 2 with PRs merged on top of quincy-release)
> fs - Venky
>
> upgrade/octopus-x - Neha, Laura (package issue Adam Kraitman any updates?)
> upgrade/pacific-x - Neha, Laura, Ilya see 
> https://tracker.ceph.com/issues/58914
> upgrade/quincy-p2p - Neha, Laura
> client-upgrade-octopus-quincy-quincy - Neha, Laura (package issue Adam
> Kraitman any updates?)
> powercycle - Brad

These are still the SELinux ping issues I've been trying to pin down. Approved.

>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> Josh, Neha - gibba and LRC upgrades pending major suites approvals.
> RC release - pending major suites approvals.
>
> On Tue, Mar 21, 2023 at 1:04 PM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/59070#note-1
> > Release Notes - TBD
> >
> > The reruns were in the queue for 4 days because of some slowness issues.
> > The core team (Neha, Radek, Laura, and others) are trying to narrow
> > down the root cause.
> >
> > Seeking approvals/reviews for:
> >
> > rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to test
> > and merge at least one PR https://github.com/ceph/ceph/pull/50575 for
> > the core)
> > rgw - Casey
> > fs - Venky (the fs suite has an unusually high amount of failed jobs,
> > any reason to suspect it in the observed slowness?)
> > orch - Adam King
> > rbd - Ilya
> > krbd - Ilya
> > upgrade/octopus-x - Laura is looking into failures
> > upgrade/pacific-x - Laura is looking into failures
> > upgrade/quincy-p2p - Laura is looking into failures
> > client-upgrade-octopus-quincy-quincy - missing packages, Adam Kraitman
> > is looking into it
> > powercycle - Brad
> > ceph-volume - needs a rerun on merged
> > https://github.com/ceph/ceph-ansible/pull/7409
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > Also, share any findings or hypotheses about the slowness in the
> > execution of the suite.
> >
> > Josh, Neha - gibba and LRC upgrades pending major suites approvals.
> > RC release - pending major suites approvals.
> >
> > Thx
> > YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 pacific QE validation status

2023-01-22 Thread Brad Hubbard
On Sat, Jan 21, 2023 at 2:39 AM Yuri Weinstein  wrote:
>
> The overall progress on this release is looking much better and if we
> can approve it we can plan to publish it early next week.
>
> Still seeking approvals
>
> rados - Neha, Laura
> rook - Sébastien Han
> cephadm - Adam
> dashboard - Ernesto
> rgw - Casey
> rbd - Ilya (full rbd run in progress now)
> krbd - Ilya
> fs - Venky, Patrick
> upgrade/nautilus-x (pacific) - passed thx Adam Kraitman!
> upgrade/octopus-x (pacific) - almost passed, still running 1 job
> upgrade/pacific-p2p - Neha (same as in 16.2.8)
> powercycle - Brad (see new SELinux denials)

The SELinux failures are an old problem
(https://tracker.ceph.com/issues/55443). I'll see what we can do on the
infrastructure side to mask these, since they appear harmless.
The other failures occur during log compression after a successful test
and look like https://tracker.ceph.com/issues/56445.

Neither of these should delay the release.

>
> On Tue, Jan 17, 2023 at 10:45 AM Yuri Weinstein  wrote:
> >
> > OK I will rerun failed jobs filtering rhel in
> >
> > Thx!
> >
> > On Tue, Jan 17, 2023 at 10:43 AM Adam Kraitman  wrote:
> > >
> > > Hey the satellite issue was fixed
> > >
> > > Thanks
> > >
> > > On Tue, Jan 17, 2023 at 7:43 PM Laura Flores  wrote:
> > >>
> > >> This was my summary of rados failures. There was nothing new or amiss,
> > >> although it is important to note that runs were done with filtering out
> > >> rhel 8.
> > >>
> > >> I will leave it to Neha for final approval.
> > >>
> > >> Failures:
> > >> 1. https://tracker.ceph.com/issues/58258
> > >> 2. https://tracker.ceph.com/issues/58146
> > >> 3. https://tracker.ceph.com/issues/58458
> > >> 4. https://tracker.ceph.com/issues/57303
> > >> 5. https://tracker.ceph.com/issues/54071
> > >>
> > >> Details:
> > >> 1. rook: kubelet fails from connection refused - Ceph - Orchestrator
> > >> 2. test_cephadm.sh: Error: Error initializing source docker://
> > >> quay.ceph.io/ceph-ci/ceph:master - Ceph - Orchestrator
> > >> 3. qa/workunits/post-file.sh: postf...@drop.ceph.com: Permission 
> > >> denied
> > >> - Ceph
> > >> 4. rados/cephadm: Failed to fetch package version from
> > >> https://shaman.ceph.com/api/search/?status=ready=ceph=default=ubuntu%2F22.04%2Fx86_64=b34ca7d1c2becd6090874ccda56ef4cd8dc64bf7
> > >> - Ceph - Orchestrator
> > >> 5. rados/cephadm/osds: Invalid command: missing required parameter
> > >> hostname() - Ceph - Orchestrator
> > >>
> > >> On Tue, Jan 17, 2023 at 9:48 AM Yuri Weinstein  
> > >> wrote:
> > >>
> > >> > Please see the test results on the rebased RC 6.6 in this comment:
> > >> >
> > >> > https://tracker.ceph.com/issues/58257#note-2
> > >> >
> > >> > We're still having infrastructure issues making testing difficult.
> > >> > Therefore all reruns were done excluding the rhel 8 distro
> > >> > ('--filter-out rhel_8')
> > >> >
> > >> > Also, the upgrades failed and Adam is looking into this.
> > >> >
> > >> > Seeking new approvals
> > >> >
> > >> > rados - Neha, Laura
> > >> > rook - Sébastien Han
> > >> > cephadm - Adam
> > >> > dashboard - Ernesto
> > >> > rgw - Casey
> > >> > rbd - Ilya
> > >> > krbd - Ilya
> > >> > fs - Venky, Patrick
> > >> > upgrade/nautilus-x (pacific) - Adam Kraitman
> > >> > upgrade/octopus-x (pacific) - Adam Kraitman
> > >> > upgrade/pacific-p2p - Neha - Adam Kraitman
> > >> > powercycle - Brad
> > >> >
> > >> > Thx
> > >> >
> > >> > On Fri, Jan 6, 2023 at 8:37 AM Yuri Weinstein  
> > >> > wrote:
> > >> > >
> > >> > > Happy New Year all!
> > >> > >
> > >> > > This release remains to be in "progress"/"on hold" status as we are
> > >> > > sorting all infrastructure-related issues.
> > >> > >
> > >> > > Unless I hear objections, I suggest doing a full rebase/retest QE
> > >> > > cycle (adding PRs merged lately) since it's taking much longer than
> > >> > > anticipated when sepia is back online.
> > >> > >
> > >> > > Objections?
> > >> > >
> > >> > > Thx
> > >> > > YuriW
> > >> > >
> > >> > > On Thu, Dec 15, 2022 at 9:14 AM Yuri Weinstein 
> > >> > wrote:
> > >> > > >
> > >> > > > Details of this release are summarized here:
> > >> > > >
> > >> > > > https://tracker.ceph.com/issues/58257#note-1
> > >> > > > Release Notes - TBD
> > >> > > >
> > >> > > > Seeking approvals for:
> > >> > > >
> > >> > > > rados - Neha (https://github.com/ceph/ceph/pull/49431 is still 
> > >> > > > being
> > >> > > > tested and will be merged soon)
> > >> > > > rook - Sébastien Han
> > >> > > > cephadm - Adam
> > >> > > > dashboard - Ernesto
> > >> > > > rgw - Casey (rwg will be rerun on the latest SHA1)
> > >> > > > rbd - Ilya, Deepika
> > >> > > > krbd - Ilya, Deepika
> > >> > > > fs - Venky, Patrick
> > >> > > > upgrade/nautilus-x (pacific) - Neha, Laura
> > >> > > > upgrade/octopus-x (pacific) - Neha, Laura
> > >> > > > upgrade/pacific-p2p - Neha - Neha, Laura
> > >> > > > powercycle - Brad
> > >> > > > ceph-volume - Guillaume, Adam K
> > >> > > >
> > >> > > > Thx
> > >> > 

[ceph-users] Re: 16.2.11 pacific QE validation status

2022-12-22 Thread Brad Hubbard
On Fri, Dec 16, 2022 at 8:33 AM Brad Hubbard  wrote:
>
> On Fri, Dec 16, 2022 at 3:15 AM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/58257#note-1
> > Release Notes - TBD
> >
> > Seeking approvals for:
> >
> > rados - Neha (https://github.com/ceph/ceph/pull/49431 is still being
> > tested and will be merged soon)
> > rook - Sébastien Han
> > cephadm - Adam
> > dashboard - Ernesto
> > rgw - Casey (rwg will be rerun on the latest SHA1)
> > rbd - Ilya, Deepika
> > krbd - Ilya, Deepika
> > fs - Venky, Patrick
> > upgrade/nautilus-x (pacific) - Neha, Laura
> > upgrade/octopus-x (pacific) - Neha, Laura
> > upgrade/pacific-p2p - Neha - Neha, Laura
> > powercycle - Brad
>
> The failure here is due to fallout from the recent lab issues and was
> fixed in main by https://github.com/ceph/ceph/pull/49021. I'm waiting
> to see if there are plans to backport this to pacific and quincy, since
> that will be needed.

Pacific backport here: https://github.com/ceph/ceph/pull/49470, thanks
to Xiubo Li.

>
> > ceph-volume - Guillaume, Adam K
> >
> > Thx
> > YuriW
> >
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
>
>
>
> --
> Cheers,
> Brad



-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 pacific QE validation status

2022-12-15 Thread Brad Hubbard
On Fri, Dec 16, 2022 at 3:15 AM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/58257#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> rados - Neha (https://github.com/ceph/ceph/pull/49431 is still being
> tested and will be merged soon)
> rook - Sébastien Han
> cephadm - Adam
> dashboard - Ernesto
> rgw - Casey (rwg will be rerun on the latest SHA1)
> rbd - Ilya, Deepika
> krbd - Ilya, Deepika
> fs - Venky, Patrick
> upgrade/nautilus-x (pacific) - Neha, Laura
> upgrade/octopus-x (pacific) - Neha, Laura
> upgrade/pacific-p2p - Neha - Neha, Laura
> powercycle - Brad

The failure here is due to fallout from the recent lab issues and was
fixed in main by https://github.com/ceph/ceph/pull/49021. I'm waiting
to see if there are plans to backport this to pacific and quincy, since
that will be needed.

> ceph-volume - Guillaume, Adam K
>
> Thx
> YuriW
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io



-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Updating Git Submodules -- a documentation question

2022-10-17 Thread Brad Hubbard
I think if you are changing commits and/or branches a lot, the
submodules can end up dirty. An alternative approach is to descend
into the submodule root directory and use git commands to work out why
it's dirty and fix it, but I've always found that more trouble than
it's worth. YMMV.
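
If you do want to poke at one directly, a minimal sketch of that
inspection (the submodule path is just an example) might look like this:

$ cd src/jaegertracing/jaeger-client-cpp   # descend into the submodule root
$ git status                               # see what git thinks is dirty
$ git diff                                 # inspect any tracked changes
$ git checkout . && git clean -fdx         # discard them if they are unwanted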

On Mon, Oct 17, 2022 at 10:52 PM John Zachary Dover  wrote:
>
> Here is an example of dealing with untracked files, which Brad discusses at 
> the end of the most recent email in this thread:
>
>> [zdover@fedora src]$ git status
>> On branch main
>> Your branch is up to date with 'origin/main'.
>>
>> Untracked files:
>>   (use "git add ..." to include in what will be committed)
>> jaegertracing/jaeger-client-cpp/
>> jaegertracing/opentracing-cpp/
>> jaegertracing/thrift/
>>
>> nothing added to commit but untracked files present (use "git add" to track)
>> [zdover@fedora src]$ cd jaegertracing/
>> [zdover@fedora jaegertracing]$ rm -rf jaeger-client-cpp/
>> [zdover@fedora jaegertracing]$ git submodule update --init --recursive
>> [zdover@fedora jaegertracing]$ git status
>> On branch main
>> Your branch is up to date with 'origin/main'.
>>
>> Untracked files:
>>   (use "git add ..." to include in what will be committed)
>> opentracing-cpp/
>> thrift/
>>
>> nothing added to commit but untracked files present (use "git add" to track)
>> [zdover@fedora jaegertracing]$
>
>
> You must change the present working directory to the parent of the target
> directory, run "rm -rf target_directory/", and then run "git submodule
> update --init --recursive" to clean out the offending directory. This has
> to be done for each such directory.
>
> I do not know what causes the local working copy to get into this dirty 
> state. (That's what it's called in the git-scm documentation: a "dirty 
> state".)
>
> Zac
>
> On Wed, Oct 12, 2022 at 1:24 PM Brad Hubbard  wrote:
>>
>> For untracked files (eg. src/pybind/cephfs/cephfs.c) all you need is
>> 'git clean -fdx' which you ran last in this case.
>>
>> Just about everything can be solved by a combination of these commands.
>>
>> git submodule update --init --recursive
>> git clean -fdx
>> git submodule foreach git clean -fdx
>>
>> If you have files that show up in diff output that have unwanted
>> changes you can also use 'git checkout .' or 'git checkout
>> ./path/to/filename' to revert the changes.
>>
>> If you still have persistent problems with a submodule directory after
>> that just rm the offending directory and run 'git submodule update
>> --init --recursive' again.
>>
>> Also, rather than doing 'git checkout main; git pull' on main I would
>> do 'git checkout main; git fetch origin; git reset --hard
>> origin/main', as it's easy to get into a state where pull will fail.
>>
>> HTH.
>>
>> On Wed, Oct 12, 2022 at 12:40 PM John Zachary Dover  
>> wrote:
>> >
>> > The following console output, which is far too long to include in
>> > tutorial-style documentation that people are expected to read, shows the
>> > sequence of commands necessary to diagnose and repair submodules that have
>> > fallen out of sync with the submodules in the upstream repository.
>> >
>> > In this example, my local working copy has fallen out of sync. This will be
>> > obvious to adepts, but this procedure does not need to be communicated to
>> > them.
>> >
>> > This procedure was given to me by Brad Hubbard.
>> >
>> > Untracked files:
>> > >   (use "git add ..." to include in what will be committed)
>> > > src/pybind/cephfs/build/
>> > > src/pybind/cephfs/cephfs.c
>> > > src/pybind/cephfs/cephfs.egg-info/
>> > > src/pybind/rados/build/
>> > > src/pybind/rados/rados.c
>> > > src/pybind/rados/rados.egg-info/
>> > > src/pybind/rbd/build/
>> > > src/pybind/rbd/rbd.c
>> > > src/pybind/rbd/rbd.egg-info/
>> > > src/pybind/rgw/build/
>> > > src/pybind/rgw/rgw.c
>> > > src/pybind/rgw/rgw.egg-info/
>> > >
>> > > nothing added to commit but untracked files present (use "git add" to
>> > > track)
>> > > [zdover@fedora ceph]$ cd src/
>> > > [zdover@fedora src]$ ls
>> > > arch  cstart.sh   nasm-wrapper
>> > > auth  dmclock neorados
>> > > 

[ceph-users] Re: Updating Git Submodules -- a documentation question

2022-10-11 Thread Brad Hubbard
For untracked files (eg. src/pybind/cephfs/cephfs.c) all you need is
'git clean -fdx' which you ran last in this case.

Just about everything can be solved by a combination of these commands.

git submodule update --init --recursive
git clean -fdx
git submodule foreach git clean -fdx

If you have files that show up in diff output that have unwanted
changes you can also use 'git checkout .' or 'git checkout
./path/to/filename' to revert the changes.

If you still have persistent problems with a submodule directory after
that just rm the offending directory and run 'git submodule update
--init --recursive' again.

Also, rather than doing 'git checkout main; git pull' on main I would
do 'git checkout main; git fetch origin; git reset --hard
origin/main', as it's easy to get into a state where pull will fail.

HTH.
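
Putting that together, a typical cleanup pass over a checkout (just a
sketch of the commands above in the order I'd run them; the path is an
example) looks like:

$ cd ~/ceph
$ git checkout main
$ git fetch origin
$ git reset --hard origin/main
$ git submodule update --init --recursive
$ git clean -fdx
$ git submodule foreach git clean -fdx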

On Wed, Oct 12, 2022 at 12:40 PM John Zachary Dover  wrote:
>
> The following console output, which is far too long to include in
> tutorial-style documentation that people are expected to read, shows the
> sequence of commands necessary to diagnose and repair submodules that have
> fallen out of sync with the submodules in the upstream repository.
>
> In this example, my local working copy has fallen out of sync. This will be
> obvious to adepts, but this procedure does not need to be communicated to
> them.
>
> This procedure was given to me by Brad Hubbard.
>
> Untracked files:
> >   (use "git add ..." to include in what will be committed)
> > src/pybind/cephfs/build/
> > src/pybind/cephfs/cephfs.c
> > src/pybind/cephfs/cephfs.egg-info/
> > src/pybind/rados/build/
> > src/pybind/rados/rados.c
> > src/pybind/rados/rados.egg-info/
> > src/pybind/rbd/build/
> > src/pybind/rbd/rbd.c
> > src/pybind/rbd/rbd.egg-info/
> > src/pybind/rgw/build/
> > src/pybind/rgw/rgw.c
> > src/pybind/rgw/rgw.egg-info/
> >
> > nothing added to commit but untracked files present (use "git add" to
> > track)
> > [zdover@fedora ceph]$ cd src/
> > [zdover@fedora src]$ ls
> > arch  cstart.sh   nasm-wrapper
> > auth  dmclock neorados
> > bash_completion   doc objclass
> > blk   dokan   objsync
> > blkin erasure-codeocf
> > btrfs_ioc_test.c  etc-rbdmap  os
> > c-aresfmt osd
> > cephadm   global  osdc
> > ceph-clsinfo  googletest  perfglue
> > ceph_common.shinclude perf_histogram.h
> > ceph.conf.twoosds init-ceph.inpowerdns
> > ceph-coverage.in  init-radosgwps-ceph.pl
> > ceph-crash.in isa-l   push_to_qemu.pl
> > ceph-create-keys  jaegertracing   pybind
> > ceph-debugpack.in javapython-common
> > ceph_fuse.cc  journal rapidjson
> > ceph.in   json_spirit rbd_fuse
> > ceph_mds.cc   key_value_store rbdmap
> > ceph_mgr.cc   krbd.cc rbd_replay
> > ceph_mon.cc   kv  rbd-replay-many
> > ceph_osd.cc   libcephfs.ccREADME
> > ceph-osd-prestart.sh  libcephsqlite.ccrgw
> > ceph-post-file.in libkmip rocksdb
> > ceph-rbdnamer libradoss3select
> > ceph_release  librados-config.cc  sample.ceph.conf
> > ceph-run  libradosstriper script
> > ceph_syn.cc   librbd  seastar
> > ceph_ver.cloadclass.shSimpleRADOSStriper.cc
> > ceph_ver.h.in.cmake   log SimpleRADOSStriper.h
> > ceph-volume   logrotate.conf  spawn
> > civetweb  mds spdk
> > ckill.sh  messagesstop.sh
> > clientmgr telemetry
> > cls   mon test
> > cls_acl.ccmount   TODO
> > cls_crypto.cc mount.fuse.ceph tools
> > CMakeLists.txtmrgw.sh tracing
> > cmonctl   mrunvnewosd.sh
> > commonmsg vstart.sh
> > compressormstart.sh   xxHash
> > crimson   mstop.shzstd
> > crush multi-dump.sh
> > cryptomypy.ini
> > [zdover@fedora src]$ git checkout main
> > Switched to branch 'main'
> > Your branch is up to date with 'origin/main'.
> > [zdover@fedora src]$ git pull
> > Already up to date.
> > [zdover@fed

[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-27 Thread Brad Hubbard
On Wed, Jul 27, 2022 at 12:40 AM Yuri Weinstein  wrote:
>
> Ack
>
> We need to get all approvals and resolve ceph-ansbile issue.

The primary cause of the issues with ceph-ansible (ca) is that octopus was
pinned to its stable_6.0 branch; octopus should be using stable_5.0
according to https://docs.ceph.com/projects/ceph-ansible/en/latest/#releases
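
For reference, re-pinning a ceph-ansible checkout would be something like
the sketch below (confirm the exact branch name against the releases page
above; it may be spelled with a hyphen rather than an underscore):

$ cd ceph-ansible
$ git fetch origin
$ git checkout stable-5.0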

I don't believe this should hold up the release.

>
> On Tue, Jul 26, 2022 at 7:10 AM Josh Durgin  wrote:
> >
> >
> >
> > On Sun, Jul 24, 2022 at 8:33 AM Yuri Weinstein  wrote:
> >>
> >> Still seeking approvals for:
> >>
> >> rados - Travis, Ernesto, Adam
> >> rgw - Casey
> >> fs, kcephfs, multimds - Venky, Patrick
> >> ceph-ansible - Brad pls take a look
> >>
> >> Josh, upgrade/client-upgrade-nautilus-octopus failed, do we need to fix 
> >> it, pls take a look/approve.
> >
> >
> > Looks like a python2/3 issue, not worth the overhead at this point. Let's 
> > proceed without this one.
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>


-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.8 pacific QE validation status, RC2 available for testing

2022-05-09 Thread Brad Hubbard
It's the current HEAD of the pacific branch or, alternatively,
https://github.com/ceph/ceph-ci/tree/pacific-16.2.8_RC2.

$ git branch -r --contains 73636a1b00037ff974bcdc969b009c5ecec626cc
 ceph-ci/pacific-16.2.8_RC2
 upstream/pacific

HTH.
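
If you want it locally before it appears on the main GitHub repo,
something along these lines against the ceph-ci mirror should work (the
remote name is arbitrary):

$ git remote add ceph-ci https://github.com/ceph/ceph-ci.git
$ git fetch ceph-ci pacific-16.2.8_RC2
$ git checkout -b pacific-16.2.8_RC2 ceph-ci/pacific-16.2.8_RC2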

On Mon, May 9, 2022 at 7:05 PM Benoît Knecht  wrote:
>
> Hi Yuri,
>
> On Fri, May 06, 2022 at 07:00:56AM -0700, Yuri Weinstein wrote:
> > The branch name is pacific-16.2.8_RC2
> > (https://shaman.ceph.com/builds/ceph/pacific-16.2.8_RC2/
> > 73636a1b00037ff974bcdc969b009c5ecec626cc/)
>
> I don't see that branch on GitHub, would it be possible to push it there? 
> Maybe
> even tag it?
>
> Cheers,
>
> --
> Ben
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: we're living in 2005.

2021-07-26 Thread Brad Hubbard
On Tue, Jul 27, 2021 at 3:49 AM Marc  wrote:
>
>
> > I feel like ceph is living in 2005.
>
> No it is just you. Why don't you start reading 
> https://docs.ceph.com/en/latest/
>
> >It's quite hard to find help on
> > issues related to ceph and it's almost impossible to get involved into
> > helping others.
>
> ???, Just click the reply button, you must be able to find that, not?
>
> > There's a BBS aka Mailman maillist, which is from 1980 era and there's
> > an irc channel that's dead.

Can you clarify which IRC channel specifically you are referring to here?

> > Why not set a Q board up or a google group at least?
>
> Because that is shit, you can not sign up via email, google is putting up 
> these cookie walls, and last but not least you are forcing people to share 
> their data with google. Maybe you do not care what data google is
> grabbing from you, but others might.
>
> > Why not open a
> > "Discussions" tab on github so people would be able to get connected?
>
> Because it would spread the knowledge over different media, and therefore you 
> are likely to create a situation where on each medium your response times go 
> down. Everyone has email, not everyone has a github account.
>
> > Why do I have to end up on random reddit boards and servethehome and
>
> Wtf reddit, servethehome?? There is nothing you can find there. Every time 
> there is a reddit link in my search results, the information there is shit. I 
> am not even opening reddit links anymore.
>
> > proxmox forums trying to nit pick pieces of knowledge from random
> > persons?
> >
>
> Yes if I want to get more info on linux, I am always asking at microsoft 
> technet.
>
> > I'm always losing hope each time I have to deal with ceph issues.
>
> Issues?? If you have a default setup, you hardly have (one even can say no) 
> issues.
>
> > But
> > when it works, it's majestic of course. Documentation (both on redhat
> > side and main docs) is pretty thorough, though.
> >
>
> So if you read it all, you should not have any problems. I did not even read 
> all, and do not have any issues (knock on wood, of course). But I have the 
> impression you (like many others here) are not taking enough time to educate 
> yourself.
> If you aspire to become a brain surgeon, you also do not get educated via 
> reddit, not? Educate yourself, so when the shit hits the fan, you can fix the 
> majority yourself. Ceph is not a wordpress project.
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: #ceph in Matrix [was: Re: we're living in 2005.]

2021-07-26 Thread Brad Hubbard
On Tue, Jul 27, 2021 at 5:53 AM Nico Schottelius
 wrote:
>
>
> Good evening dear mailing list,
>
> while I do think we have a great mailing list (this is one of the most
> helpful open source mailing lists I'm subscribed to), I do agree with
> the ceph IRC channel not being so helpful. The join/leave messages on
> most days significantly exceeds the number of real messages.

Can you clarify which IRC channel specifically you are referring to?

>
> I am not sure what is the reason for it, but maybe IRC is not for
> everyone. As some time ago we opened a Matrix channel on
> #ceph:ungleich.ch, I wanted to take the opportunity to invite you, in
> case you like real time discussion, but you are not into IRC.
>
> In case you don't have a matrix account yet, you can find more
> information about it on https://ungleich.ch/u/projects/open-chat/.
>
> HTH and best regards,
>
> Nico
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-12 Thread Brad Hubbard
On Tue, Apr 13, 2021 at 8:40 AM Robert LeBlanc  wrote:
>
> Do you think it would be possible to build Nautilus FUSE or newer on
> 14.04, or do you think the toolchain has evolved too much since then?
>

An interesting question.

# cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.6 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.6 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/;
SUPPORT_URL="http://help.ubuntu.com/;
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/;

Had to tell cmake not to look for lz4 because the version on Trusty is too old.

# ./do_cmake.sh -DWITH_LZ4=off
# cd build/
# make -j8 ceph-fuse
# make -j8 rbd-fuse
# ./bin/rbd-fuse --version
ceph version 14.2.19-83-g53aefaa
(53aefaa1443c3a9bbd4e6448aa69e3d88b58cd51) nautilus (stable)
# ./bin/ceph-fuse --version
ceph version 14.2.19-83-g53aefaa
(53aefaa1443c3a9bbd4e6448aa69e3d88b58cd51) nautilus (stable)

I don't think Octopus would build on 14.04.

--
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-12 Thread Brad Hubbard
On Mon, Apr 12, 2021 at 11:35 AM Robert LeBlanc  wrote:
>
> On Sun, Apr 11, 2021 at 4:19 PM Brad Hubbard  wrote:
> >
> > PSA.
> >
> > https://docs.ceph.com/en/latest/releases/general/#lifetime-of-stable-releases
> >
> > https://docs.ceph.com/en/latest/releases/#ceph-releases-index
>
> I'm very well aware that we are living on the dying edge (well, past
> dead), but a good chunk of machines are Ubuntu 14.04 not by choice.
> Getting this upgrade done was sorely needed, but very risky at the
> same time.

Sure Robert,

I understand the realities of maintaining large installations, which
may have many reasons holding them back from upgrading any of the
interdependent software they run. The other side of the coin, however,
is that we cannot support releases indefinitely, as each additional
supported release places a huge burden on limited dev, support, and QA
resources. We try to strike a balance, but it's not "one size fits all",
unfortunately.

-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-11 Thread Brad Hubbard
PSA.

https://docs.ceph.com/en/latest/releases/general/#lifetime-of-stable-releases

https://docs.ceph.com/en/latest/releases/#ceph-releases-index

On Sat, Apr 10, 2021 at 10:11 AM Robert LeBlanc  wrote:
>
> On Fri, Apr 9, 2021 at 4:04 PM Dan van der Ster  wrote:
> >
> > Here's what you should look for, with debug_mon=10. It shows clearly
> > that it takes the mon 23 seconds to run through
> > get_removed_snaps_range.
> > So if this is happening every 30s, it explains at least part of why
> > this mon is busy.
> >
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader)
> > e45 handle_subscribe
> > mon_subscribe({mdsmap=3914079+,monmap=0+,osdmap=1170448})
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 check_osdmap_sub
> > 0x55e2e2133de0 next 1170448 (onetime)
> > 2021-04-09 17:07:27.238 7f9fc83e4700  5
> > mon.sun-storemon01@0(leader).osd e1987355 send_incremental
> > [1170448..1987355] to client.131831153
> > 2021-04-09 17:07:28.590 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 0
> > [1~3]
> > 2021-04-09 17:07:29.898 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 5 []
> > 2021-04-09 17:07:31.258 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 6 []
> > 2021-04-09 17:07:32.562 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 20
> > []
> > 2021-04-09 17:07:33.866 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 21
> > []
> > 2021-04-09 17:07:35.162 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 22
> > []
> > 2021-04-09 17:07:36.470 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 23
> > []
> > 2021-04-09 17:07:37.778 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 24
> > []
> > 2021-04-09 17:07:39.090 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 25
> > []
> > 2021-04-09 17:07:40.398 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 26
> > []
> > 2021-04-09 17:07:41.706 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 27
> > []
> > 2021-04-09 17:07:43.006 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 28
> > []
> > 2021-04-09 17:07:44.322 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 29
> > []
> > 2021-04-09 17:07:45.630 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 30
> > []
> > 2021-04-09 17:07:46.938 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 31
> > []
> > 2021-04-09 17:07:48.246 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 32
> > []
> > 2021-04-09 17:07:49.562 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 34
> > []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 35
> > []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 20
> > mon.sun-storemon01@0(leader).osd e1987355 send_incremental starting
> > with base full 1986745 664086 bytes
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10
> > mon.sun-storemon01@0(leader).osd e1987355 build_incremental
> > [1986746..1986785] with features 107b84a842aca
> >
> > So have a look for that client again or other similar traces.
>
> So, even though I blacklisted the client and we remounted the file
> system on it, it wasn't enough for it to keep performing the same bad
> requests. We found another node that had two sessions to the same
> mount point. We rebooted both nodes and the CPU is now back at a
> reasonable 4-6% and the cluster is running at full performance again.
> I've added in back both MONs to have all 3 mons in the system and
> there are no more elections. Thank you for helping us track down the
> bad clients out of over 2,000 clients.
>
> > > Maybe if that code path isn't needed in Nautilus it can be removed in
> > > the next point release?
> >
> > I think there were other major changes in this area that might make
> > such a backport difficult. And we should expect nautilus to be nearing
> > its end...
>
> But ... we just got to Nautilus... :)
>
> Thank you,
> Robert LeBlanc
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw process crashes multiple times an hour

2021-02-01 Thread Brad Hubbard
On Tue, Feb 2, 2021 at 9:20 AM Andrei Mikhailovsky  wrote:
>
> bump

Can you create a tracker for this?

I'd suggest the first step would be working out what "NOTICE: invalid
dest placement: default-placement/REDUCED_REDUNDANCY" is trying to
tell you. Someone more familiar with rgw than I am should be able to
tell you, so open the tracker against rgw.

>
>
> - Original Message -
> > From: "andrei" 
> > To: "Daniel Gryniewicz" 
> > Cc: "ceph-users" 
> > Sent: Thursday, 28 January, 2021 17:07:00
> > Subject: [ceph-users] Re: radosgw process crashes multiple times an hour
>
> > Hi Daniel,
> >
> > Thanks for you're reply. I've checked the package versions on that server 
> > and
> > all ceph related packages on that server are from 15.2.8 version:
> >
> > ii  librados215.2.8-1focal amd64RADOS distributed object 
> > store
> > client library
> > ii  libradosstriper1 15.2.8-1focal amd64RADOS striping interface
> > ii  python3-rados15.2.8-1focal amd64Python 3 libraries for the 
> > Ceph
> > librados library
> > ii  radosgw  15.2.8-1focal amd64REST gateway for RADOS
> > distributed object store
> > ii  librbd115.2.8-1focal amd64RADOS block device client 
> > library
> > ii  python3-rbd15.2.8-1focal amd64Python 3 libraries for the 
> > Ceph
> > librbd library
> > ii  ceph  15.2.8-1focal amd64distributed 
> > storage
> > and file system
> > ii  ceph-base 15.2.8-1focal amd64common ceph 
> > daemon
> > libraries and management tools
> > ii  ceph-common   15.2.8-1focal amd64common 
> > utilities to
> > mount and interact with a ceph storage cluster
> > ii  ceph-fuse 15.2.8-1focal amd64FUSE-based 
> > client
> > for the Ceph distributed file system
> > ii  ceph-mds  15.2.8-1focal amd64metadata 
> > server for
> > the ceph distributed file system
> > ii  ceph-mgr  15.2.8-1focal amd64manager for the
> > ceph distributed storage system
> > ii  ceph-mgr-cephadm  15.2.8-1focal all  cephadm
> > orchestrator module for ceph-mgr
> > ii  ceph-mgr-dashboard15.2.8-1focal all  dashboard 
> > module
> > for ceph-mgr
> > ii  ceph-mgr-diskprediction-cloud 15.2.8-1focal all
> > diskprediction-cloud module for ceph-mgr
> > ii  ceph-mgr-diskprediction-local 15.2.8-1focal all
> > diskprediction-local module for ceph-mgr
> > ii  ceph-mgr-k8sevents15.2.8-1focal all  kubernetes 
> > events
> > module for ceph-mgr
> > ii  ceph-mgr-modules-core 15.2.8-1focal all  ceph manager
> > modules which are always enabled
> > ii  ceph-mgr-rook 15.2.8-1focal all  rook module for
> > ceph-mgr
> > ii  ceph-mon  15.2.8-1focal amd64monitor server 
> > for
> > the ceph storage system
> > ii  ceph-osd  15.2.8-1focal amd64OSD server for 
> > the
> > ceph storage system
> > ii  cephadm   15.2.8-1focal amd64cephadm 
> > utility to
> > bootstrap ceph daemons with systemd and containers
> > ii  libcephfs215.2.8-1focal amd64Ceph 
> > distributed
> > file system client library
> > ii  python3-ceph  15.2.8-1focal amd64Meta-package 
> > for
> > python libraries for the Ceph libraries
> > ii  python3-ceph-argparse 15.2.8-1focal all  Python 3 
> > utility
> > libraries for Ceph CLI
> > ii  python3-ceph-common   15.2.8-1focal all  Python 3 
> > utility
> > libraries for Ceph
> > ii  python3-cephfs15.2.8-1focal amd64Python 3 
> > libraries
> > for the Ceph libcephfs library
> >
> > As this is a brand new 20.04 server I do not see how the older version could
> > have got onto it.
> >
> > Andrei
> >
> >
> > - Original Message -
> >> From: "Daniel Gryniewicz" 
> >> To: "ceph-users" 
> >> Sent: Thursday, 28 January, 2021 14:06:16
> >> Subject: [ceph-users] Re: radosgw process crashes multiple times an hour
> >
> >> It looks like your radosgw is using a different version of librados.  In
> >> the backtrace, the top useful line begins:
> >>
> >> librados::v14_2_0
> >>
> >> when it should be v15.2.0, like the ceph::buffer in the same line.
> >>
> >> Is there an old librados lying around that didn't get cleaned up somehow?
> >>
> >> Daniel
> >>
> >>
> >>
> >> On 1/28/21 7:27 AM, Andrei Mikhailovsky wrote:
> >>> Hello,
> >>>
> >>> I am experiencing very frequent crashes of the radosgw service. It happens
> >>> multiple times every hour. As an example, over the last 12 hours we've 
> >>> had 35
> >>> crashes. Has anyone experienced similar behaviour of the radosgw octopus
> >>> release service? More info below:
> >>>
> >>> Radosgw service is running on two Ubuntu servers. I have tried upgrading 
> >>> OS on
> >>> one of the servers to 

[ceph-users] Re: Unable to clarify error using vfs_ceph (Samba gateway for CephFS)

2020-11-12 Thread Brad Hubbard
I don't know much about the vfs plugin (nor cephfs, for that matter),
but I would suggest enabling client debug logging on the machine so
you can see what the libcephfs code is doing, since that's likely where
the ENOENT is coming from.

https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/
https://docs.ceph.com/en/latest/cephfs/client-config-ref/
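
As a starting point, something like this in the [client] section of
ceph.conf on the Samba gateway host should do it (the debug values are
only suggestions and the log path is an example; dial them back once
you've caught the failure):

[client]
    debug client = 20
    debug ms = 1
    log file = /var/log/ceph/client.$name.$pid.log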

On Fri, Nov 13, 2020 at 3:39 AM Frank Schilder  wrote:
>
> You might need to give read permissions to the ceph config and key file for 
> the user that runs the SAMBA service (samba?). Either add the SAMBA user to 
> the group ceph, or change the group of the file.
>
> The statement "/" file not found could just be an obfuscating message on an 
> actual security/permission issue.
>
> Other than that I don't really know what to look for. As I said, I gave up as 
> well. Ceph kernel client does a good job for us with an ordinary SAMBA share 
> defined on it.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Matt Larson 
> Sent: 12 November 2020 18:18:32
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Unable to clarify error using vfs_ceph (Samba 
> gateway for CephFS)
>
> Thank you Frank,
>
>  That was a good suggestion to make sure the mount wasn't the issue. I
> tried changing the `client.samba.upload` to have read access directly
> to '/' rather than '/upload' and to also change smb.conf to directly
> use 'path = /'. Still getting the same issue (log level 10 content
> below).
>
>  It appears that it is correctly reading `/etc/ceph/ceph.conf`. It
> does appear to be the ceph_mount where the failure occurs.
>
>  It would be great to have vfs_ceph working, but if I cannot I'll try
> to find other approaches.
>
> [2020/11/12 10:47:39.360943, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/smbd/vfs.c:65(vfs_find_backend_entry)
>
>   vfs_find_backend_entry called for ceph
>   Successfully loaded vfs module [ceph] with the new modules system
> [2020/11/12 10:47:39.360966, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/modules/vfs_ceph.c:103(cephwrap_connect)
>   cephwrap_connect: [CEPH] calling: ceph_create
> [2020/11/12 10:47:39.365668, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/modules/vfs_ceph.c:110(cephwrap_connect)
>   cephwrap_connect: [CEPH] calling: ceph_conf_read_file with 
> /etc/ceph/ceph.conf
> [2020/11/12 10:47:39.368842, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/modules/vfs_ceph.c:116(cephwrap_connect)
>   cephwrap_connect: [CEPH] calling: ceph_conf_get
> [2020/11/12 10:47:39.368895, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/modules/vfs_ceph.c:133(cephwrap_connect)
>   cephwrap_connect: [CEPH] calling: ceph_mount
> [2020/11/12 10:47:39.373319, 10, pid=2723021, effective(0, 0), real(0,
> 0), class=vfs] ../../source3/modules/vfs_ceph.c:160(cephwrap_connect)
>   cephwrap_connect: [CEPH] Error return: No such file or directory
> [2020/11/12 10:47:39.373357,  1, pid=2723021, effective(0, 0), real(0,
> 0)] ../../source3/smbd/service.c:668(make_connection_snum)
>   make_connection_snum: SMB_VFS_CONNECT for service 'cryofs_upload' at
> '/' failed: No such file or directory
>
> On Thu, Nov 12, 2020 at 2:29 AM Frank Schilder  wrote:
> >
> > You might face the same issue I had. vfs_ceph wants to have a key for the 
> > root of the cephfs, it is cutrently not possible to restrict access to a 
> > sub-directory mount. For this reason, I decided to go for a re-export of a 
> > kernel client mount.
> >
> > I consider this a serious security issue in vfs_ceph and will not use it 
> > until it is possible to do sub-directory mounts.
> >
> > I don't think its difficult to patch the vfs_ceph source code, if you need 
> > to use vfs_ceph and cannot afford to give access to "/" of the cephfs.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Matt Larson 
> > Sent: 12 November 2020 00:40:21
> > To: ceph-users
> > Subject: [ceph-users] Unable to clarify error using vfs_ceph (Samba gateway 
> > for CephFS)
> >
> > I am getting an error in the log.smbd from the Samba gateway that I
> > don’t understand and looking for help from anyone who has gotten the
> > vfs_ceph working.
> >
> > Background:
> >
> > I am trying to get a Samba gateway with CephFS working with the
> > vfs_ceph module. I observed that the default Samba package on CentOS
> > 7.7 did not come with the ceph.so vfs_ceph module, so I tried to
> > compile a working Samba version with vfs_ceph.
> >
> > Newer Samba versions have a requirement for GnuTLS >= 3.4.7, which is
> > not an available package on CentOS 7.7 without a custom repository. I
> > opted to build an earlier version of Samba.
> >
> > On CentOS 7.7, I built Samba 4.11.16 

[ceph-users] Re: virtual machines crashes after upgrade to octopus

2020-05-14 Thread Brad Hubbard
On Wed, May 13, 2020 at 6:00 PM Lomayani S. Laizer  wrote:
>
> Hello,
>
> Below is full debug log of 2 minutes before crash of virtual machine. 
> Download from below url
>
> https://storage.habari.co.tz/index.php/s/31eCwZbOoRTMpcU

This log has rbd debug output, but not rados :(

I guess you'll need to try and capture a coredump if you can't get a backtrace.

I'd also suggest opening a tracker in case one of the rbd devs has any
ideas on this, or has seen something similar. Without a backtrace or
core it will be impossible to definitively identify the issue though.
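
A rough sketch of making sure a core actually gets written for the qemu
process (the limit and pattern below are only examples; on Ubuntu, apport
may already be managing core_pattern, as the apport.log entry below
suggests):

$ ulimit -c unlimited
$ echo '/var/crash/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern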

>
>
> apport.log
>
> Wed May 13 09:35:30 2020: host pid 4440 crashed in a separate mount 
> namespace, ignoring
>
> kernel.log
> May 13 09:35:30 compute5 kernel: [123071.373217] fn-radosclient[4485]: 
> segfault at 0 ip 7f4c8c85d7ed sp 7f4c66ffc470 error 4 in 
> librbd.so.1.12.0[7f4c8c65a000+5cb000]
> May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 
> d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 89 02 48 8b 03 48 89 
> 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 8b 44 24 08 48 8b 0b 
> 48 21 f8 48 39 0c
> May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 
> 1(tap33511c4d-2c) entered disabled state
> May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left 
> promiscuous mode
> May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 
> 1(tap33511c4d-2c) entered disabled state
>
> syslog
> compute5 kernel: [123071.373217] fn-radosclient[4485]: segfault at 0 ip 
> 7f4c8c85d7ed sp 7f4c66ffc470 error 4 i
> n librbd.so.1.12.0[7f4c8c65a000+5cb000]
> May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 
> d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 8
> 9 02 48 8b 03 48 89 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 
> 8b 44 24 08 48 8b 0b 48 21 f8 48 39 0c
> May 13 09:35:30 compute5 libvirtd[1844]: internal error: End of file from 
> qemu monitor
> May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Link DOWN
> May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Lost carrier
> May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 
> 1(tap33511c4d-2c) entered disabled state
> May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left 
> promiscuous mode
> May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 
> 1(tap33511c4d-2c) entered disabled state
> May 13 09:35:33 compute5 networkd-dispatcher[1614]: Failed to request link: 
> No such device
>
> On Fri, May 8, 2020 at 5:40 AM Brad Hubbard  wrote:
>>
>> On Fri, May 8, 2020 at 12:10 PM Lomayani S. Laizer  
>> wrote:
>> >
>> > Hello,
>> > On my side at point of vm crash these are logs below. At the moment my 
>> > debug is at 10 value. I will rise to 20 for full debug. these crashes are 
>> > random and so far happens on very busy vms. Downgrading clients in host to 
>> > Nautilus these crashes disappear
>>
>> You could try adding debug_rados as well but you may get a very large
>> log so keep an eye on things.
>>
>> >
>> > Qemu is not shutting down in general because other vms on the same host 
>> > continues working
>>
>> A process cannot reliably continue after encountering a segfault, so
>> the qemu-kvm process must be ending and therefore it should be
>> possible to capture a coredump with the right configuration.
>>
>> In the following example, if you were to search for pid 6060 you would
>> find it is no longer running.
>> >> > [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 
>> >> > 7f8165cc0a50 sp 7f81397f6490 error 4 in 
>> >> > librbd.so.1.12.0[7f8165ab4000+537000]
>>
>> Without a backtrace at a minimum it may be very difficult to work out
>> what's going on with certainty. If you open a tracker for the issue
>> though maybe one of the devs specialising in rbd may have some
>> feedback.
>>
>> >
>> > 2020-05-07T13:02:12.121+0300 7f88d57fa700 10 librbd::io::ReadResult: 
>> > 0x7f88c80bfbf0 finish:  got {} for [0,24576] bl 24576
>> > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 
>> > 0x7f88c80f9330 finish: C_ObjectReadRequest: r=0
>> > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 
>> > 0x7f88c80f9330 finish:  got {} for [0,16384] bl 16384
>> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageState: 
>> > 0x5569b5da9bb0 0x5569b5da9bb0 send_close_unlock
>> > 2020-05-07T13:02:28.694+030

[ceph-users] Re: Cluster rename procedure

2020-05-08 Thread Brad Hubbard
Are they LVM based?

The keyring files should be just the filenames, yes.

Here's a recent list I saw which was missing the keyring step but is
reported to be complete otherwise.

- Stop RGW services
- Set the flags (noout,norecover,norebalance,nobackfill,nodown,pause)
- Stop OSD/MGR/MON services
- Moves the folder under /var/lib/ceph/mon/ into the one named for the
new cluster
- Moves the folder under /var/lib/ceph/mgr/ into the one named for the
new cluster
- Copies the .conf and keyrings with the new cluster name
- Edits systemd unit files for MON and MGR to reflect the new cluster
name at CLUSTER env variable
- Edits /usr/share/ceph-osd-run.sh file to reflect the new cluster
name at CLUSTER env variable
- Changes the lvm tag to the new cluster name for all OSD LVs (see the
sketch after this list)
- Reloads the systemd daemon
- Starts the MON/MGR/OSD services
- Unset the flags
- Starts the RGW service
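
A rough sketch of that LVM re-tagging step, assuming ceph-volume's usual
ceph.cluster_name tag and purely illustrative old/new cluster names
(verify with 'lvs -o lv_tags' before changing anything):

for lv in $(lvs --noheadings -o lv_path,lv_tags | awk '/ceph.cluster_name=oldname/ {print $1}'); do
    lvchange --deltag "ceph.cluster_name=oldname" --addtag "ceph.cluster_name=newname" "$lv"
done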

On Sat, May 9, 2020 at 4:52 AM Anthony D'Atri  wrote:
>
>  I’ve inherited a couple of clusters with non-default (ie, not “ceph”) 
> internal names, and I want to rename them for the usual reasons.
>
> I had previously developed a full list of steps - which I no longer have 
> access to.
>
> Anyone done this recently?  Want to be sure I’m not missing something.
>
> * Nautilus, CentOS 7, RGW and RBD
> * Rename OSD mountpoints with   mount —move
> * Rename systemd resources / mounts?
> * Rename /var/lib/ceph/{mon,osd} directories
> * Rename ceph*conf files on backend and client systems
> * Rename keyrings — just the filenames?
> * Rename log files
> * Adjust `ceph config` paths for admin socket, keyring, logs, mgr/mds/mon 
> data, osd journal, rgw_data
> * Restart daemons
> * Ensure /var/run/ceph sockets are appropriately named
>
>
>
> Thanks
>
> — aad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: virtual machines crashes after upgrade to octopus

2020-05-07 Thread Brad Hubbard
00 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 
> handle_invalidate_cache: r=0
> 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_flush_notifies:
> 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 
> handle_flush_notifies:
> 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 
> send_close_object_map:
> 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 
> librbd::object_map::UnlockRequest: 0x7f88c807a450 send_unlock: 
> oid=rbd_object_map.2f18f2a67fad72
> 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 
> librbd::object_map::UnlockRequest: 0x7f88c807a450 handle_unlock: r=0
> 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 
> handle_close_object_map: r=0
> 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 
> librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_unlock:
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 
> 0x7f88c4011bb8 handle_shutdown_pre_release: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 
> librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 send_unlock: 
> entity=client.58292796, cookie=auto 140225447738256
> 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 
> librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 handle_unlock: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 
> 0x7f88c4011ba0 post_release_lock_handler: r=0 shutting_down=1
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700  5 librbd::io::ImageRequestWQ: 
> 0x7f88e8001570 unblock_writes: 0x5569b5e1ffd0, num=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 
> 0x7f88c400dfe0 notify released lock
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 
> 0x7f88c400dfe0 current lock owner: [0,0]
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 
> 0x7f88c4011bb8 handle_shutdown_post_release: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 
> 0x7f88c4011bb8 wait_for_tracked_ops: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 
> 0x7f88c4011bb8 complete_shutdown: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_shut_down_exclusive_lock: r=0
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 send_unregister_image_watcher
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 
> 0x7f88c400dfe0 unregistering image watcher
> 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::Watcher: 0x7f88c400dfe0 
> unregister_watch:
> 2020-05-07T13:02:28.702+0300 7f88d57fa700  5 librbd::Watcher: 0x7f88c400dfe0 
> notifications_blocked: blocked=1
> 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_unregister_image_watcher: r=0
> 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 send_flush_readahead
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_flush_readahead: r=0
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 send_shut_down_object_dispatcher
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700  5 librbd::io::ObjectDispatcher: 
> 0x5569b5dab700 shut_down:
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700  5 librbd::io::ObjectDispatch: 
> 0x5569b5ee8360 shut_down:
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700  5 
> librbd::io::SimpleSchedulerObjectDispatch: 0x7f88c4013ce0 shut_down:
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700  5 
> librbd::cache::WriteAroundObjectDispatch: 0x7f88c8003780 shut_down:
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_shut_down_object_dispatcher: r=0
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 send_flush_op_work_queue
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_flush_op_work_queue: r=0
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 
> 0x7f88c8175fd0 handle_flush_image_watcher: r=0
> 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::ImageState: 
> 0x5569b5da9bb0 0x5569b5da9bb0 handle_close: r=0
>
> On Fri, May 8, 2020 at 12:40 AM Brad Hubbard  wrote:
>>
>> On Fri, May 8, 2020 at 3:42 AM Erwin Lubbers  wrote:
>> >
>> > Hi,
>> >
>> > Did anyone find a way to resolve the problem? I'm seeing the same on a 
>> > clean Octopus Ceph installation on Ubuntu 18 with an Octopus compiled 

[ceph-users] Re: ceph-mgr high CPU utilization

2020-05-07 Thread Brad Hubbard
Could you create a tracker for this and attach an osdmap as well as
some recent balancer output (perhaps at a higher debug level if
possible)?

There are some improvements awaiting backport to nautilus for the
C++/python interface just FYI [0]

You might also look at gathering output using something like [1] to
try to narrow down further what is causing the high CPU consumption.

[0] https://github.com/ceph/ceph/pull/34356
[1] https://github.com/markhpc/gdbpmp
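
If it helps, a hedged example of the sort of thing I mean (the option name
is the one from your mail and the scope is my assumption, so verify with
"ceph config help mgr_stats_period" first; the plain-gdb line is just a
crude fallback if gdbpmp is not convenient):

    # raise the stats reporting interval cluster-wide
    ceph config set mgr mgr_stats_period 15

    # crude wallclock snapshot of what the ceph-mgr threads are doing
    gdb -p $(pidof ceph-mgr) -batch \
        -ex 'set pagination off' -ex 'thread apply all bt' > mgr_threads.txt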

On Fri, May 8, 2020 at 1:10 AM Andras Pataki
 wrote:
>
> Hi everyone,
>
> After some investigation, it looks like on our large cluster, ceph-mgr
> is not able to keep up with the status updates from about 3500 OSDs.  By
> default OSDs send updates to ceph-mgr every 5 seconds, which, in our
> case, turns to about 700 messages/s to ceph-mgr.  It looks from gdb
> traces that ceph-mgr runs some python code for each of them - so 700
> python snippets/s might be too much. Increasing mgr_stats_period to 15
> seconds reduces the load and brings ceph-mgr back to responsive again.
> Unfortunately this isn't sustainable since if we were to expand the
> cluster, we'd need to further reduce the update frequency from OSDs.
>
> I also checked our other clusters and they have about proportionately
> lower load on ceph-mgr based on their OSD counts.
>
> Any thoughts about the scalability of ceph-mgr to a large number of
> OSDs?  We recently upgraded this cluster from Mimic, where we didn't see
> this issue.
>
> Andras
>
> On 5/1/20 8:48 AM, Andras Pataki wrote:
> > Also just a follow-up on the misbehavior of ceph-mgr.  It looks like
> > the upmap balancer is not acting reasonably either.  It is trying to
> > create upmap entries every minute or so - and claims to be successful,
> > but they never show up in the OSD map.  Setting the logging to
> > 'debug', I see upmap entries created such as:
> >
> > 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd
> > pg-upmap-items 9.60c4 mappings [{'to': 3313L, 'from': 3371L}]
> > 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd
> > pg-upmap-items 9.632b mappings [{'to': 2187L, 'from': 1477L}]
> > 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd
> > pg-upmap-items 9.6b9c mappings [{'to': 3315L, 'from': 3371L}]
> > 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd
> > pg-upmap-items 9.6bf6 mappings [{'to': 1581L, 'from': 1477L}]
> > 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd
> > pg-upmap-items 9.7da4 mappings [{'to': 2419L, 'from': 2537L}]
> > ...
> > 2020-05-01 08:43:07.909 7fffca074700 20 mgr[balancer] commands
> > [<CommandResult object at 0x7fffcc990650>, <CommandResult object at
> > 0x7fffcc990610>, <CommandResult object at 0x7fffcc990f50>, ... (the
> > remaining CommandResult object reprs are garbled in the archive)]
> > ...
> > 2020-05-01 08:43:16.733 7fffca074700 20 mgr[balancer] done
> > ...
> >
> > but these mappings do not show up in the osd dump.  And a minute
> > later, the balancer tries again and comes up with a set of very
> > similar mappings (same from and to OSDs, slightly different PG
> > numbers) - and keeps going like that every minute without any progress
> > (the set of upmap entries stays the same, does not increase).
> >
> > Andras
> >
> >
> > On 5/1/20 8:12 AM, Andras Pataki wrote:
> >> I'm wondering if anyone still sees issues with ceph-mgr using CPU and
> >> being unresponsive even in recent Nautilus releases.  We upgraded our
> >> largest cluster from Mimic to Nautilus (14.2.8) recently - it has
> >> about 3500 OSDs.  Now ceph-mgr is constantly at 100-200% CPU (1-2
> >> cores), and becomes unresponsive after a few minutes.  The
> >> finisher-Mgr queue length grows (I've seen it at over 100k) - similar
> >> symptoms as seen with earlier Nautilus releases by many. This is what
> >> it looks like after an hour of running:
> >>
> >> "finisher-Mgr": {
> >> "queue_len": 66078,
> >> "complete_latency": {
> >> "avgcount": 21,
> >> "sum": 2098.408767721,
> >> "avgtime": 99.924227034
> >> }
> >> },
> >>
> >> We have a pretty vanilla manager config, only the balancer is enabled
> >> in upmap mode.  Here are the enabled modules:
> >>
> >> "always_on_modules": [
> >> "balancer",
> >> "crash",
> >> "devicehealth",
> >> "orchestrator_cli",
> >> "progress",
> >> "rbd_support",
> >> "status",
> >> "volumes"
> >> ],
> >> "enabled_modules": [
> >> "restful"
> >> ],
> >>
> >> Any ideas or outstanding issues in this area?
> >>
> >> Andras
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: virtual machines crashes after upgrade to octopus

2020-05-07 Thread Brad Hubbard
On Fri, May 8, 2020 at 3:42 AM Erwin Lubbers  wrote:
>
> Hi,
>
> Did anyone find a way to resolve the problem? I'm seeing the same on a clean 
> Octopus Ceph installation on Ubuntu 18 with an Octopus compiled KVM server 
> running on CentOS 7.8. The KVM machine shows:
>
> [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 7f8165cc0a50 sp 
> 7f81397f6490 error 4 in librbd.so.1.12.0[7f8165ab4000+537000]

Are you able to either capture a backtrace from a coredump or set up
logging and hopefully capture a backtrace that way?
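
For example, something along these lines on the hypervisor (binary, package
and core file names are only illustrative, adjust to how your KVM processes
are actually run):

    # if systemd-coredump is collecting cores
    coredumpctl list | grep -i qemu
    coredumpctl gdb <PID>
    # or point gdb at the core file directly
    gdb /usr/bin/qemu-kvm /path/to/core
    (gdb) set pagination off
    (gdb) thread apply all bt

    # installing debuginfo first makes the librbd frames readable (EL7)
    debuginfo-install librbd1 ceph-common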

>
> Ceph is healthy and stable for a few weeks and I did not get these messages 
> while running on KVM compiled with Luminous libraries.
>
> Regards,
> Erwin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Sporadic mgr segmentation fault

2020-04-26 Thread Brad Hubbard
Turns out those patches are not in 14.2.9, sorry.

On Sun, Apr 26, 2020 at 10:53 AM XuYun  wrote:
>
> Hi Brad,
>
> We got the same crash even after upgrading to 14.2.9, the crash log is:
>
>-13> 2020-04-26 05:56:02.642 7f8cea975700  4 mgr send_beacon active
>-12> 2020-04-26 05:56:02.643 7f8cea975700 10 monclient: _send_mon_message 
> to mon.111.111.121.2 at v2:111.111.121.2:3300/0
>-11> 2020-04-26 05:56:03.238 7f8cdd690700  0 log_channel(cluster) log 
> [DBG] : pgmap v5548: 256 pgs: 1 active+clean+scrubbing, 255 active+clean; 162 
> GiB data, 921 GiB used, 32 TiB / 33 TiB avail; 194 KiB/s rd, 840 KiB/s wr, 
> 375 op/s
>-10> 2020-04-26 05:56:03.238 7f8cdd690700 10 monclient: _send_mon_message 
> to mon.111.111.121.2 at v2:111.111.121.2:3300/0
> -9> 2020-04-26 05:56:03.296 7f8cee17c700  4 mgr ms_dispatch active 
> service_map(e4936 1 svc) v1
> -8> 2020-04-26 05:56:03.296 7f8cee17c700  4 mgr ms_dispatch 
> service_map(e4936 1 svc) v1
> -7> 2020-04-26 05:56:03.391 7f8cde692700  4 mgr.server handle_open from 
> mon,111.111.121.1 0x55d5d96c2000
> -6> 2020-04-26 05:56:03.391 7f8cde692700  4 mgr.server handle_report from 
> 0x55d5d96c2000 mon,111.111.121.1
> -5> 2020-04-26 05:56:03.393 7f8cde692700  4 mgr.server handle_open from 
> mgr,control 0x55d5d96c2400
> -4> 2020-04-26 05:56:03.393 7f8cde692700  4 mgr.server handle_report from 
> 0x55d5d96c2400 mgr,control
> -3> 2020-04-26 05:56:03.397 7f8cde692700  4 mgr.server handle_open from 
> mon,111.111.121.3 0x55d5d96c2800
> -2> 2020-04-26 05:56:03.398 7f8cde692700  4 mgr.server handle_open from 
> mgr,computer01 0x55d5d96c2c00
> -1> 2020-04-26 05:56:03.399 7f8cde692700  4 mgr.server handle_report from 
> 0x55d5d96c2c00 mgr,computer01
>  0> 2020-04-26 05:56:03.400 7f8cf3186700 -1 *** Caught signal 
> (Segmentation fault) **
>  in thread 7f8cf3186700 thread_name:msgr-worker-1
>
>  ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus 
> (stable)
>  1: (()+0xf5f0) [0x7f8cf74d45f0]
>  2: (bool 
> ProtocolV2::append_frame(ceph::msgr::v2::MessageFrame&)+0x1fc)
>  [0x7f8cf9ef785c]
>  3: (ProtocolV2::write_message(Message*, bool)+0x4d9) [0x7f8cf9edb929]
>  4: (ProtocolV2::write_event()+0x37d) [0x7f8cf9ef025d]
>  5: (AsyncConnection::handle_write()+0x40) [0x7f8cf9eb27e0]
>  6: (EventCenter::process_events(unsigned int, std::chrono::duration long, std::ratio<1l, 10l> >*)+0x1397) [0x7f8cf9f02ab7]
>  7: (()+0x57fa97) [0x7f8cf9f08a97]
>  8: (()+0x80f12f) [0x7f8cfa19812f]
>  9: (()+0x7e65) [0x7f8cf74cce65]
>  10: (clone()+0x6d) [0x7f8cf617a88d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> Is there an issue opened for it?
>
> BR,
> Xu Yun
>
> On Apr 23, 2020, at 10:28 AM, XuYun  wrote:
>
> Thank you, Brad. We’ll try to upgrade 14.2.9 today.
>
> On Apr 23, 2020, at 7:21 AM, Brad Hubbard  wrote:
>
> On Tue, Apr 21, 2020 at 11:39 PM XuYun  wrote:
>
>
> Dear ceph users,
>
> We are experiencing sporadic mgr crash in all three ceph clusters (version 
> 14.2.6 and version 14.2.8), the crash log is:
>
> 2020-04-17 23:10:08.986 7fed7fe07700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>  In function 'const char* ceph::buffer::v14_2_0::ptr::c_str() const' thread 
> 7fed7fe07700 time 2020-04-17 23:10:08.984887
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>  578: FAILED ceph_assert(_raw)
>
> ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
> (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x14a) [0x7fed8605c325]
> 2: (()+0x2534ed) [0x7fed8605c4ed]
> 3: (()+0x5a21ed) [0x7fed863ab1ed]
> 4: (PosixConnectedSocketImpl::send(ceph::buffer::v14_2_0::list&, bool)+0xbd) 
> [0x7fed863840ed]
> 5: (AsyncConnection::_try_send(bool)+0xb6) [0x7fed8632fc76]
> 6: (ProtocolV2::write_message(Message*, bool)+0x832) [0x7fed8635bf52]
> 7: (ProtocolV2::write_event()+0x175) [0x7fed863718c5]
> 8: (AsyncConnection::handle_write()+0x40) [0x7fed86332600]
> 9: (EventCenter::process_events(unsigned int, std::chrono::duration long, std::ratio<1l, 10l> >*)+0x1397) [0x7fed8637f997]
> 10: (()+0x57c977) [0x7fed86385977]
> 11: (()+0x80bdaf) [0x7fed86614daf]
> 12: (()+0x7e65) [0x7fed8394ce65]
> 13: (clone()+0x6d) [0x7fed825fa88d]
>
> 2020-04-17 23:10:08.990 7fed7ee05700 -1 *** Caught si

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Brad Hubbard
On Tue, Apr 21, 2020 at 6:35 PM Paul Emmerich  wrote:
>
> On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
> >
> > Wait for recovery to finish so you know whether any data from the down
> > OSDs is required. If not just reprovision them.
>
> Recovery will not finish from this state as several PGs are down and/or stale.

What I meant was let recovery get as far as it can.

>
>
> Paul
>
> >
> > If data is required from the down OSDs you will need to run a query on
> > the pg(s) to find out which OSDs have the required copies of the
> > pg/object. You can then export the pg from the down osd using
> > the ceph-objectstore-tool, back it up, then import it back into the
> > cluster.
> >
> > On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
> >  wrote:
> > >
> > > Hi,
> > >
> > > one of our customers had his Ceph cluster crashed due to a power or 
> > > network outage (they still try to figure out what happened).
> > >
> > > The cluster is very unhealthy but recovering:
> > >
> > > # ceph -s
> > >   cluster:
> > > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > > health: HEALTH_ERR
> > > 1 filesystem is degraded
> > > 1 mds daemon damaged
> > > 1 osds down
> > > 1 pools have many more objects per pg than average
> > > 1/115117480 objects unfound (0.000%)
> > > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 
> > > pgs peering, 27 pgs stale
> > > Possible data damage: 1 pg recovery_unfound
> > > Degraded data redundancy: 7303464/230234960 objects degraded 
> > > (3.172%), 693 pgs degraded, 945 pgs undersized
> > > 14 daemons have recently crashed
> > >
> > >   services:
> > > mon: 3 daemons, quorum 
> > > maslxlabstore01,maslxlabstore02,maslxlabstore04 (age 64m)
> > > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > > maslxlabstore02, maslxlabstore04
> > > mds: cephfs:2/3 
> > > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 up:standby, 
> > > 1 damaged
> > > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped pgs
> > > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> > >
> > >   data:
> > > pools:   6 pools, 8328 pgs
> > > objects: 115.12M objects, 218 TiB
> > > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > > pgs: 0.853% pgs not active
> > >  7303464/230234960 objects degraded (3.172%)
> > >  13486/230234960 objects misplaced (0.006%)
> > >  1/115117480 objects unfound (0.000%)
> > >  7311 active+clean
> > >  338  active+undersized+degraded+remapped+backfill_wait
> > >  255  active+undersized+degraded+remapped+backfilling
> > >  215  active+undersized+remapped+backfilling
> > >  99   active+undersized+degraded
> > >  44   down
> > >  37   active+undersized+remapped+backfill_wait
> > >  13   stale+peering
> > >  9stale+down
> > >  5stale+remapped+peering
> > >  1active+recovery_unfound+undersized+degraded+remapped
> > >  1active+clean+remapped
> > >
> > >   io:
> > > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> > >
> > >
> > > The MDS cluster is unable to start because one of them is damaged.
> > >
> > > 10 of the OSDs do not start. They crash very early in the boot process:
> > >
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > > (ceph:ceph)
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> > > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> > > ceph-osd, pid 69463
> > > 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty 
> > > --pid-file
> > > 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> > > /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> > > 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: isa
> > > 20

[ceph-users] Re: PG deep-scrub does not finish

2020-04-20 Thread Brad Hubbard
On Mon, Apr 20, 2020 at 11:01 PM Andras Pataki
 wrote:
>
> On a cluster running Nautilus (14.2.8), we are getting a complaint about
> a PG not being deep-scrubbed on time.  Looking at the primary OSD's
> logs, it looks like it tries to deep-scrub the PG every hour or so,
> emits some complaints that I don't understand, but the deep scrub does
> not finish (either with or without a scrub error).
>
> Here is the PG from pg dump:
>
> 1.43f 31794  00 0   0
> 66930087214   0  0 3004 3004
> active+clean+scrubbing+deep 2020-04-20 04:48:13.055481 46286'483734
> 46286:563439 [354,694,851]354 [354,694,851]354
> 39594'348643 2020-04-10 12:39:16.26108838482'314644 2020-04-04
> 10:37:03.121638 0
>
> Here is a section of the primary OSD logs (osd.354):
>
> 2020-04-18 08:21:08.322 7fffd2e2b700  0 log_channel(cluster) log [DBG] :
> 1.43f deep-scrub starts

Scrubbing starts.

> 2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910
> pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
> local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
> 45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
> pi=[45908,45910)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
> active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
> [354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
> up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
> upacting 4611087854031667199
> 2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910
> pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
> local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f
> 45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910
> pi=[45908,45910)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
> mbc={}] state: transitioning to Primary
> 2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912
> pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
> local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
> 45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
> pi=[45908,45912)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0
> active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
> acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
> 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
> 4611087854031667199

The epoch (and therefore the map) changed from 45910 to 45912. You lost osd 694
from the acting set, so peering has to restart and the scrub is
aborted.

> 2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912
> pg[1.43f( v 45909'449615 (45801'446525,45909'449615]
> local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f
> 45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912
> pi=[45908,45912)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown
> mbc={}] state: transitioning to Primary
> 2020-04-18 08:40:04.219 7fffd2e2b700  0 log_channel(cluster) log [DBG] :
> 1.43f deep-scrub starts

Scrubbing starts again.

> 2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914
> pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
> local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
> 45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
> pi=[45912,45914)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
> active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] ->
> [354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354,
> up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199
> upacting 4611087854031667199
> 2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914
> pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
> local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f
> 45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914
> pi=[45912,45914)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown
> mbc={}] state: transitioning to Primary
> 2020-04-18 08:56:55.627 7fffd2e2b700  1 osd.354 pg_epoch: 45916
> pg[1.43f( v 45913'449735 (45812'446725,45913'449735]
> local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f
> 45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916
> pi=[45912,45916)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0
> active mbc={}] start_peering_interval up [354,851] -> [354,694,851],
> acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary
> 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting
> 4611087854031667199

Same again: you lost osd 694 from the acting set, and the epoch and map
change requires re-peering, which aborts the scrub.

You need to identify why osd 694 is flapping, or at least appears to
be flapping to the monitor.

Start by having a careful look at the logs for osd 694. If they
provide no insight look at the logs on the active monitor 
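
Some places to start looking (log paths below are the packaged defaults,
adjust as needed):

    ceph osd find 694                      # which host osd.694 lives on
    # on a mon host: when and why it was reported or marked down
    grep -E 'osd\.694.*(down|failed|boot|wrongly)' /var/log/ceph/ceph.log
    # on the osd host itself
    journalctl -u ceph-osd@694 --since "-2 hours"
    grep -iE 'heartbeat|no reply' /var/log/ceph/ceph-osd.694.log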

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-20 Thread Brad Hubbard
Wait for recovery to finish so you know whether any data from the down
OSDs is required. If not just reprovision them.

If data is required from the down OSDs you will need to run a query on
the pg(s) to find out which OSDs have the required copies of the
pg/object. You can then export the pg from the down osd using
the ceph-objectstore-tool, back it up, then import it back into the
cluster.
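
A rough sketch of that export/import cycle (the pg id, osd ids, and paths
below are examples only; any OSD you run ceph-objectstore-tool against must
be stopped first):

    ceph pg 2.1a query | less       # see which OSDs hold copies of the pg

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
        --pgid 2.1a --op export --file /root/pg-2.1a.export

    systemctl stop ceph-osd@57      # a surviving osd to import into
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
        --pgid 2.1a --op import --file /root/pg-2.1a.export
    systemctl start ceph-osd@57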

On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
 wrote:
>
> Hi,
>
> one of our customers had his Ceph cluster crashed due to a power or network 
> outage (they still try to figure out what happened).
>
> The cluster is very unhealthy but recovering:
>
> # ceph -s
>   cluster:
> id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 mds daemon damaged
> 1 osds down
> 1 pools have many more objects per pg than average
> 1/115117480 objects unfound (0.000%)
> Reduced data availability: 71 pgs inactive, 53 pgs down, 18 pgs 
> peering, 27 pgs stale
> Possible data damage: 1 pg recovery_unfound
> Degraded data redundancy: 7303464/230234960 objects degraded 
> (3.172%), 693 pgs degraded, 945 pgs undersized
> 14 daemons have recently crashed
>
>   services:
> mon: 3 daemons, quorum maslxlabstore01,maslxlabstore02,maslxlabstore04 
> (age 64m)
> mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> maslxlabstore02, maslxlabstore04
> mds: cephfs:2/3 
> {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 up:standby, 1 
> damaged
> osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped pgs
> rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> maslxlabstore03.rgw0, maslxlabstore04.rgw0)
>
>   data:
> pools:   6 pools, 8328 pgs
> objects: 115.12M objects, 218 TiB
> usage:   425 TiB used, 290 TiB / 715 TiB avail
> pgs: 0.853% pgs not active
>  7303464/230234960 objects degraded (3.172%)
>  13486/230234960 objects misplaced (0.006%)
>  1/115117480 objects unfound (0.000%)
>  7311 active+clean
>  338  active+undersized+degraded+remapped+backfill_wait
>  255  active+undersized+degraded+remapped+backfilling
>  215  active+undersized+remapped+backfilling
>  99   active+undersized+degraded
>  44   down
>  37   active+undersized+remapped+backfill_wait
>  13   stale+peering
>  9stale+down
>  5stale+remapped+peering
>  1active+recovery_unfound+undersized+degraded+remapped
>  1active+clean+remapped
>
>   io:
> client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
>
>
> The MDS cluster is unable to start because one of them is damaged.
>
> 10 of the OSDs do not start. They crash very early in the boot process:
>
> 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 (ceph:ceph)
> 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> ceph-osd, pid 69463
> 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty --pid-file
> 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: isa
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> compaction_readahead_size = 2MB
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option compaction_style 
> = kCompactionStyleLevel
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> compaction_threads = 32
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option compression = 
> kNoCompression
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option flusher_threads = 
> 8
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> level0_file_num_compaction_trigger = 8
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> level0_slowdown_writes_trigger = 32
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> level0_stop_writes_trigger = 64
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> max_background_compactions = 31
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> max_bytes_for_level_base = 536870912
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> max_bytes_for_level_multiplier = 8
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> max_write_buffer_number = 32
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> min_write_buffer_number_to_merge = 2
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> recycle_log_file_num = 32
> 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> 

[ceph-users] Re: RGW jaegerTracing

2020-03-08 Thread Brad Hubbard
+d...@ceph.io

On Sun, Mar 8, 2020 at 5:16 PM Abhinav Singh
 wrote:
>
> I am trying to implement jaeger tracing in RGW, and I need some advice
> on which functions I should actually trace to get a good picture of the
> actual performance of the cluster.
>
> Till now I have been able to deduce the following:
> 1. I think we need to add tracing where rgw communicates with
> librados (particularly in librgw, where the communication actually
> happens); the HTTP request and response should not be traced because
> they depend on the client's internet speed.
> 2. In librgw, functions like this one
> 
> and
> its corresponding overloaded methods, and also this function here
> 
> and
> its corresponding overloaded functions.
> 3. I see that pools are ultimately what feed the CRUSH algorithm for
> writing data, so I think the creation of pools should also be taken into
> account while tracing (creation of a pool should be the main span and these
> functions
> 
> should
> be its child spans).
>
>
> Functionality of buckets like that of this
> 
> does not require tracing because these are just HTTP requests.
>
> Any kind of guidance will be of great help.
>
> Thank You.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: continued warnings: Large omap object found

2020-02-27 Thread Brad Hubbard
Check the thread titled "[ceph-users] Frequest LARGE_OMAP_OBJECTS in
cephfs metadata pool" from a few days ago.
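
In the meantime, a few commands that may help you see what is actually in
that object (the pool name below is a guess, use whatever your metadata
pool is really called; the threshold value is only an example):

    # how many omap keys the flagged object has right now
    rados -p cephfs_metadata listomapkeys mds0_openfiles.0 | wc -l

    # adjust the threshold centrally and re-check the pg
    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000
    ceph pg deep-scrub 2.26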

On Fri, Feb 28, 2020 at 9:03 AM Seth Galitzer  wrote:
>
> I do not have a large ceph cluster, only 4 nodes plus a mon/mgr with 48
> OSDs. I have one data pool and one metadata pool with a total of about
> 140TB of usable storage. I have maybe 30 or so clients. The rest of my
> systems connect via a host that is a ceph client and then reshares
> through samba and nfs-ganesha. I'm not using rgw anywhere. I'm running
> the latest stable release of nautilus (14.2.7) and have had it in
> production since August 2019. All ceph nodes and the smb/nfs host are
> running centos7 with latest patches. Other clients are a mix of debian
> and ubuntu.
>
> For the last several weeks, I have been getting the warning "Large omap
> object found" off and on. I've been resolving it by gradually increasing
> the value of osd_deep_scrub_large_omap_object_key_threshold and then
> running a deep scrub on the affected pg. I have now increased this
> threshold to 100 and am wondering if I should keep doing this or if
> there is another problem that needs to be addressed.
>
> The affected pg has been different most times, but they are all on the
> same osd and with the same mds object. Here's an excerpt from my current
> set of logs to show what I'm seeing:
>
> # zgrep -i "large omap object found" /var/log/ceph/ceph.log*
> /var/log/ceph/ceph.log:2020-02-27 06:02:01.761641 osd.40 (osd.40) 1578 :
> cluster [WRN] Large omap object found. Object:
> 2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count:
> 1048576 Size (bytes): 46403355
> /var/log/ceph/ceph.log:2020-02-27 16:18:00.328869 osd.40 (osd.40) 1585 :
> cluster [WRN] Large omap object found. Object:
> 2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count:
> 1048559 Size (bytes): 46407183
> /var/log/ceph/ceph.log-20200227.gz:2020-02-26 19:56:24.972431 osd.40
> (osd.40) 1450 : cluster [WRN] Large omap object found. Object:
> 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
> 939236 Size (bytes): 40179994
> /var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:14:16.497161 osd.40
> (osd.40) 1460 : cluster [WRN] Large omap object found. Object:
> 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
> 939232 Size (bytes): 40179796
> /var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:15:06.399267 osd.40
> (osd.40) 1464 : cluster [WRN] Large omap object found. Object:
> 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
> 939231 Size (bytes): 40179756
>
> Unfortunately, older logs have already been rotated out, but if memory
> serves correctly, they had similar messages. As you can see, the key
> count continues to increase. Last week, I bumped the threshold to 75
> to clear the warning. Before that, I had bumped to 50. It looks to
> me like something isn't getting cleaned up like it's supposed to. I
> haven't been using ceph long enough to figure out what that might be.
>
> Do I continue to bump the key threshold and not worry about the
> warnings, or is there something going on that needs to be corrected? At
> what point is the threshold too high? If the problem is due to a
> specific client not closing files, is it possible to identify that
> client and attempt to reset it?
>
> Any advice is welcome. I'm happy to provide additional data if needed.
>
> Thanks.
> Seth
>
> --
> Seth Galitzer
> Systems Coordinator
> Computer Science Department
> Kansas State University
> http://www.cs.ksu.edu/~sgsax
> sg...@ksu.edu
> 785-532-7790
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluestore rocksdb behavior

2019-12-05 Thread Brad Hubbard
There's some good information here which may assist in your understanding.

https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw/search?query=bluestore
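
If you want a quick look at how much of the DB currently lives where on a
given OSD, the bluefs perf counters are one way (osd id is an example; the
counters to look at are db_used_bytes/db_total_bytes and slow_used_bytes):

    ceph daemon osd.0 perf dump | grep -A 30 '"bluefs"'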

On Thu, Dec 5, 2019 at 10:36 PM Igor Fedotov  wrote:
>
> Unfortunately can't recall any
>
> On 12/4/2019 11:07 PM, Frank R wrote:
>
> Thanks.
>
> Can you recommend any docs for understanding the BlueStore on disk 
> format/behavior when there is no separate device for the WAL/DB?
>
> On Wed, Dec 4, 2019 at 10:19 AM Igor Fedotov  wrote:
>>
>> Hi Frank,
>>
>> no spillover happens/applies for the main device, hence data beyond 30G is
>> written to the main device as well.
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 12/4/2019 6:13 PM, Frank R wrote:
>>
>> Hi all,
>>
>> How is the following situation handled with bluestore:
>>
>> 1. You have a 200GB OSD (no separate DB/WAL devices)
>> 2. The metadata grows past 30G for some reason and wants to create a 300GB 
>> level but can't?
>>
>> Where is the metadata over 30G stored?
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io