[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-11-18 Thread Sasha Litvak
Perhaps I missed something, but does the survey conclude that users don't
value reliability improvements at all?  That would explain why the developer
team wants to concentrate on performance and ease of management.

On Thu, Nov 18, 2021, 07:23 Stefan Kooman  wrote:

> On 11/18/21 14:09, Maged Mokhtar wrote:
> > Hello Cephers,
> >
> > I too am for LTS releases, or for some kind of middle ground like a longer
> > release cycle and/or having even-numbered releases designated for
> > production like before. We all use LTS releases for the base OS when
> > running Ceph, yet in reality we depend much more on the Ceph code than
> > on the base OS.
> >
> > Another thing we hear our users want, after stability, is performance.
> > It ultimately determines the cost of the storage solution. I think this
> > should be high on the priority list. I know there has been a lot of
> > effort with Crimson development for a while, but in my opinion, if Ceph
> > were run by a purely commercial company, getting this out the door as
> > quickly as possible would take priority.
>
> That is in line with the results from the last Ceph User Survey (2021):
>
>
>
> https://ceph.io/en/news/blog/2021/2021-ceph-user-survey-results/#based-on-weighted-category-prioritization
>
> So there is a dedicated group of people involved in the "next gen" OSD
> storage subsystem, which is a big endeavor. In the meantime there are
> several developers improving the current implementation incrementally.
> Zac is doing a great job improving the documentation. The Cephadm team is
> working on improving management. And if I have read correctly, they will
> have access to a large cluster to improve ... the next thing on the prio
> list: scalability, in this case scalability of the management system.
>
> Is there a separate "quality" team for the No. 1 priority,
> reliability? I don't know. Maybe that is just implicit in the project,
> to make things reliable by default? That might be an interesting thing
> to ask in the upcoming user+dev meeting ...
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-17 Thread Sasha Litvak
Gregory,
Thank you for your reply; I do understand that a number of serialized
lookups may take time.  However, if 3.25 s is OK, 11.2 seconds sounds
long, and I once removed a large subdirectory that took over 20
minutes to complete.  I tried the nowsync mount option with kernel
5.15, and it seems to hide the latency (i.e., it returns the prompt almost
immediately after a recursive directory removal).  However, I am not sure
whether nowsync is safe to use with kernels >= 5.8.  I also have kernel 5.3 on
one of the client clusters, where nowsync is not supported, yet all rm
operations there happen reasonably fast.  So the second question is: does
5.3's libceph behave differently on recursive rm compared to 5.4 or 5.8?
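
For reference, the mount that hides the latency looks roughly like this (the
monitor address, client name, and secret file are placeholders, not my real
values):

    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
      -o name=client1,secretfile=/etc/ceph/client1.secret,noatime,acl,nowsync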


On Wed, Nov 17, 2021 at 9:52 AM Gregory Farnum  wrote:

> On Sat, Nov 13, 2021 at 5:25 PM Sasha Litvak
>  wrote:
> >
> > I continued looking into the issue and have no idea what hinders the
> > performance yet. However:
> >
> > 1. A client operating with kernel 5.3.0-42 (Ubuntu 18.04) has no such
> > problems.  I delete a directory with hashed subdirs (00 - ff), with a total
> > of ~707 MB of files spread across those 256 subdirs, in 3.25 s.
>
> Recursive rm first requires the client to get capabilities on the
> files in question, and the MDS to read that data off disk.
> Newly-created directories will be cached, but old ones might not be.
>
> So this might just be the consequence of having to do 256 serialized
> disk lookups on hard drives. 3.25 seconds seems plausible to me.
>
> The number of bytes isn't going to have any impact on how long it
> takes to delete from the client side — that deletion is just marking
> it in the MDS, and then the MDS does the object removals in the
> background.
> -Greg
>
> >
> > 2. A client operating with kernel 5.8.0-53 (Ubuntu 20.04) processes a
> > similar directory, with less space taken (~530 MB) spread across 256
> > subdirs, in 11.2 s.
> >
> > 3. Yet another client with kernel 5.4.156 has similar latency removing
> > directories as in line 2.
> >
> > In all scenarios, mounts are set with the same options, i.e.
> > noatime,secret-file,acl.
> >
> > Client 1 has luminous, client 2 has octopus, client 3 has nautilus.  While
> > they are all on the same LAN, ceph -s on clients 2 and 3 returns in ~800 ms
> > and on client 1 in ~300 ms.
> >
> > Any ideas are appreciated,
> >
> > On Fri, Nov 12, 2021 at 8:44 PM Sasha Litvak <
> alexander.v.lit...@gmail.com>
> > wrote:
> >
> > > The metadata pool is on the same type of drives as other pools; every
> > > node uses SATA SSDs.  They are all read / write mix DC types.  Intel and
> > > Seagate.
> > >
> > > On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri <
> anthony.da...@gmail.com>
> > > wrote:
> > >
> > >> MDS RAM cache vs going to the metadata pool?  What type of drives is
> > >> your metadata pool on?
> > >>
> > >> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak <
> alexander.v.lit...@gmail.com>
> > >> wrote:
> > >> >
> > >> > I am running a Pacific 16.2.4 cluster and recently noticed that rm -rf
> > >> > <dir> visibly hangs on old directories.  The cluster is healthy, has a
> > >> > light load, and any newly created directories are deleted immediately
> > >> > (well, rm returns the command prompt immediately).  The directories in
> > >> > question have 10 - 20 small text files, so nothing should be slow when
> > >> > removing them.
> > >> >
> > >> > I wonder if someone can please give me a hint on where to start
> > >> > troubleshooting as I see no "big bad bear" yet.
> > >> > ___
> > >> > ceph-users mailing list -- ceph-users@ceph.io
> > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >>
> > >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-13 Thread Sasha Litvak
I continued looking into the issue and have no idea what hinders the
performance yet. However:

1. A client operating with kernel 5.3.0-42 (Ubuntu 18.04) has no such
problems.  I delete a directory with hashed subdirs (00 - ff), with a total
of ~707 MB of files spread across those 256 subdirs, in 3.25 s.

2. A client operating with kernel 5.8.0-53 (Ubuntu 20.04) processes a
similar directory, with less space taken (~530 MB) spread across 256 subdirs,
in 11.2 s.

3. Yet another client with kernel 5.4.156 has similar latency removing
directories as in line 2.

In all scenarios, mounts are set with the same options, i.e.
noatime,secret-file,acl.

Client 1 has luminous, client 2 has octopus, client 3 has nautilus.  While
they are all on the same LAN, ceph -s on clients 2 and 3 returns in ~800 ms
and on client 1 in ~300 ms.

Any ideas are appreciated,
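
For anyone trying to reproduce this, the numbers above are wall-clock times
for a recursive removal of one hashed tree, i.e. something like (the path is
just an example):

    time rm -rf /mnt/cephfs/scratch/hashed-dir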

On Fri, Nov 12, 2021 at 8:44 PM Sasha Litvak 
wrote:

> The metadata pool is on the same type of drives as other pools; every node
> uses SATA SSDs.  They are all read / write mix DC types.  Intel and Seagate.
>
> On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri 
> wrote:
>
>> MDS RAM cache vs going to the metadata pool?  What type of drives is your
>> metadata pool on?
>>
>> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak 
>> wrote:
>> >
>> > I am running a Pacific 16.2.4 cluster and recently noticed that rm -rf
>> > <dir> visibly hangs on old directories.  The cluster is healthy, has a
>> > light load, and any newly created directories are deleted immediately
>> > (well, rm returns the command prompt immediately).  The directories in
>> > question have 10 - 20 small text files, so nothing should be slow when
>> > removing them.
>> >
>> > I wonder if someone can please give me a hint on where to start
>> > troubleshooting as I see no "big bad bear" yet.
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-12 Thread Sasha Litvak
The metadata pool is on the same type of drives as other pools; every node
uses SATA SSDs.  They are all read / write mix DC types.  Intel and Seagate.

On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri 
wrote:

> MDS RAM cache vs going to the metadata pool?  What type of drives is your
> metadata pool on?
>
> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak 
> wrote:
> >
> > I am running a Pacific 16.2.4 cluster and recently noticed that rm -rf
> > <dir> visibly hangs on old directories.  The cluster is healthy, has a
> > light load, and any newly created directories are deleted immediately
> > (well, rm returns the command prompt immediately).  The directories in
> > question have 10 - 20 small text files, so nothing should be slow when
> > removing them.
> >
> > I wonder if someone can please give me a hint on where to start
> > troubleshooting as I see no "big bad bear" yet.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recursive delete hangs on cephfs

2021-11-12 Thread Sasha Litvak
I am running a Pacific 16.2.4 cluster and recently noticed that rm -rf
<dir> visibly hangs on old directories.  The cluster is healthy, has a
light load, and any newly created directories are deleted immediately (well,
rm returns the command prompt immediately).  The directories in question have
10 - 20 small text files, so nothing should be slow when removing them.

I wonder if someone can please give me a hint on where to start
troubleshooting as I see no "big bad bear" yet.
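
So far the only checks I know to run are the standard ones (the MDS daemon
name below is a placeholder):

    ceph fs status
    ceph daemon mds.<name> ops        # dump in-flight MDS operations
    ceph daemon mds.<name> perf dump  # inspect mds_cache counters such as num_strays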
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-03 Thread Sasha Litvak
Podman containers will not restart due to a restart or failure of a
centralized podman daemon, because podman has no such daemon.  Container is
not synonymous with Docker.  This thread reminds me more and more of the
systemd-haters threads, but I guess that is fine.
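
For context, a daemonless podman deployment typically runs each Ceph daemon
under its own systemd unit, roughly like this simplified sketch (the unit
name, image tag, and flags are illustrative, not the exact file a tool like
cephadm generates):

    # /etc/systemd/system/ceph-osd-podman@.service (simplified sketch)
    [Unit]
    Description=Ceph OSD %i in a podman container
    After=network-online.target

    [Service]
    ExecStart=/usr/bin/podman run --rm --name ceph-osd-%i --net=host \
        --privileged -v /var/lib/ceph:/var/lib/ceph ceph/ceph:v15 ceph-osd -f -i %i
    ExecStop=/usr/bin/podman stop ceph-osd-%i
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

If one container dies, systemd restarts just that one; there is no central
daemon whose failure takes all of them down.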

On Thu, Jun 3, 2021, 2:16 AM Marc  wrote:

> Not using cephadm, I would also question other things, like:
>
> - If it uses docker and the docker daemon fails, what happens to your
> containers?
> - I assume the ceph-osd containers need the Linux sysadmin capability. So if
> you have to allow this via your OC, all your tasks potentially have access
> to this permission. (That is why I chose not to allow the OC access to it.)
> - Does cephadm only run with docker?
>
>
>
>
> > -Original Message-
> > From: Martin Verges 
> > Sent: 02 June 2021 13:29
> > To: Matthew Vernon 
> > Cc: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Why you might want packages not containers for
> > Ceph deployments
> >
> > Hello,
> >
> > I agree with Matthew; here at croit we work a lot with containers all day
> > long. No problem with that, and enough knowledge to say for sure it's not
> > about getting used to it.
> > For us and our decisions here, storage is the most valuable piece of IT
> > equipment in a company. If you have problems with your storage, you most
> > likely have huge pain, costs, problems, downtime, whatever. Therefore,
> > your storage solution must be damn simple: you switch it on, it has to
> > work.
> >
> > Take a short look at the Ceph documentation on how to deploy a cephadm
> > cluster vs. croit. We strongly believe ours is much easier, as we take
> > away all the pain from the OS up to Ceph while keeping it simple behind
> > the scenes. You can still always log in to a node, kill a process, attach
> > strace or whatever you like, as you know from years of Linux
> > administration, without any complexity layers like docker/podman/... It's
> > just frictionless. In the end, what do you need? A kernel, an initramfs,
> > some systemd, a bit of libs and tooling, and the Ceph packages.
> >
> > In addition, we help lots of Ceph users on a regular basis with their
> > hand-made setups, but we don't really want to touch the cephadm ones, as
> > they are often harder to debug. But of course we do it anyway :).
> >
> > To have perfect storage, strip away anything unnecessary. Avoid any
> > complexity, avoid anything that might affect your system. Keep it simple,
> > stupid.
> >
> > --
> > Martin Verges
> > Managing director
> >
> > Mobile: +49 174 9335695
> > E-Mail: martin.ver...@croit.io
> > Chat: https://t.me/MartinVerges
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> >
> > Web: https://croit.io
> > YouTube: https://goo.gl/PGE1Bx
> >
> >
> > On Wed, 2 Jun 2021 at 11:38, Matthew Vernon  wrote:
> >
> > > Hi,
> > >
> > > In the discussion after the Ceph Month talks yesterday, there was a bit
> > > of chat about cephadm / containers / packages. IIRC, Sage observed that
> > > a common reason in the recent user survey for not using cephadm was that
> > > it only worked on containerised deployments. I think he then went on to
> > > say that he hadn't heard any compelling reasons why not to use
> > > containers, and suggested that resistance was essentially a user
> > > education question[0].
> > >
> > > I'd like to suggest, briefly, that:
> > >
> > > * containerised deployments are more complex to manage, and this is not
> > > simply a matter of familiarity
> > > * reducing the complexity of systems makes admins' lives easier
> > > * the trade-off of the pros and cons of containers vs packages is not
> > > obvious, and will depend on deployment needs
> > > * Ceph users will benefit from both approaches being supported into the
> > > future
> > >
> > > We make extensive use of containers at Sanger, particularly for
> > > scientific workflows, and also for bundling some web apps (e.g.
> > > Grafana). We've also looked at a number of container runtimes (Docker,
> > > singularity, charliecloud). They do have advantages - it's easy to
> > > distribute a complex userland in a way that will run on (almost) any
> > > target distribution; rapid "cloud" deployment; some separation (via
> > > namespaces) of network/users/processes.
> > >
> > > For what I think of as a 'boring' Ceph deploy (i.e. install on a set of
> > > dedicated hardware and then run for a long time), I'm not sure any of
> > > these benefits are particularly relevant and/or compelling - Ceph
> > > upstream produce Ubuntu .debs and Canonical (via their Ubuntu Cloud
> > > Archive) provide .debs of a couple of different Ceph releases per Ubuntu
> > > LTS - meaning we can easily separate out OS upgrade from Ceph upgrade.
> > > And upgrading the Ceph packages _doesn't_ restart the daemons[1],
> > > meaning that we maintain control over restart order during an upgrade.
> > > And while we might briefly 

[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-02 Thread Sasha Litvak
Is there a link to the talk I can use as a reference?  I would like to
look at the pro-container points, as this post is getting a little bit
one-sided.  I understand that most people prefer things to be stable,
especially with underlying storage systems.  To me personally, the use of
containers in general adds great flexibility because it decouples the
underlying OS from the running software.  All the points about adding
complexity to an already complex system are fair, but one thing is missing:
every time developers decide to introduce some new, more efficient libraries
or frameworks, we hit distribution dependency hell.  Because of that, Ceph
sometimes abandons entire OS versions before their actual lifetime is
over.  My resources are limited and I don't want to debug / troubleshoot
/ upgrade the OS in addition to Ceph itself, hence the containers.  Yes, it
took me a while to warm up to the idea in general, but now I don't even
think much about it.  I went from Nautilus to Pacific (CentOS 7 to
CentOS 8) within a few hours without needing to upgrade my Ubuntu bare
metal nodes.
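
A handy sanity check during that kind of rolling upgrade is the built-in
version report, which shows how many daemons of each type still run the old
release:

    ceph versions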

That said, I am for giving people a choice to use packages + ansible /
manual install, and also allowing manual installation of containers.  Forcing
users' hands too much may make people avoid upgrading their Ceph clusters.


On Wed, Jun 2, 2021 at 9:27 AM Dave Hall  wrote:

> I'd like to pick up on something that Matthew alluded to, although what I'm
> saying may not be popular.  I agree that containers are compelling for
> cloud deployment and application scaling, and we can all be glad that
> container technology has progressed from the days of docker privilege
> escalation.  I also agree that for Ceph users switching from native Ceph
> processes to containers carries a learning curve that could be as
> intimidating as learning Ceph itself.
>
> However, here is where I disagree with containerized Ceph:  I worked for 19
> years as a software developer for a major world-wide company.  In that
> time, I noticed that most of the usability issues experienced by customers
> were due to the natural and understandable tendency for software developers
> to program in a way that's easy for the programmer, and in the process to
> lose sight of what's easy for the user.
>
> In the case of Ceph, containers make it easier for the developers to
> produce and ship releases.  It reduces dependency complexities and testing
> time.  But the developers aren't out in the field with their deployments
> when something weird impacts a cluster and the standard approaches don't
> resolve it.  And let's face it:  Ceph is a marvelously robust solution for
> large scale storage, but it is also an amazingly intricate matrix of
> layered interdependent processes, and you haven't got all of the bugs
> worked out yet.
>
> Just note that the beauty of software (or really of anything that is
> 'designed') is that a few people (developers) can produce something that a
> large number of people (storage administrators, or 'users') will want to
> use.
>
> Please remember the ratio of users (cluster administrators) to developers
> and don't lose sight of the users in working to ease and simplify
> development.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
>
> On Wed, Jun 2, 2021 at 5:37 AM Matthew Vernon  wrote:
>
> > Hi,
> >
> > In the discussion after the Ceph Month talks yesterday, there was a bit
> > of chat about cephadm / containers / packages. IIRC, Sage observed that
> > a common reason in the recent user survey for not using cephadm was that
> > it only worked on containerised deployments. I think he then went on to
> > say that he hadn't heard any compelling reasons why not to use
> > containers, and suggested that resistance was essentially a user
> > education question[0].
> >
> > I'd like to suggest, briefly, that:
> >
> > * containerised deployments are more complex to manage, and this is not
> > simply a matter of familiarity
> > * reducing the complexity of systems makes admins' lives easier
> > * the trade-off of the pros and cons of containers vs packages is not
> > obvious, and will depend on deployment needs
> > * Ceph users will benefit from both approaches being supported into the
> > future
> >
> > We make extensive use of containers at Sanger, particularly for
> > scientific workflows, and also for bundling some web apps (e.g.
> > Grafana). We've also looked at a number of container runtimes (Docker,
> > singularity, charliecloud). They do have advantages - it's easy to
> > distribute a complex userland in a way that will run on (almost) any
> > target distribution; rapid "cloud" deployment; some separation (via
> > namespaces) of network/users/processes.
> >
> > For what I think of as a 'boring' Ceph deploy (i.e. install on a set of
> > dedicated hardware and then run for a long time), I'm not sure any of
> > these benefits are particularly relevant and/or compelling - Ceph
> > upstream produce Ubuntu .debs and Canonical (via their 

[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)

2021-03-30 Thread Sasha Litvak
Any time frame on 14.2.19?

On Fri, Mar 26, 2021, 1:43 AM Konstantin Shalygin  wrote:

> Finally master is merged now
>
>
> k
>
> Sent from my iPhone
>
> > On 25 Mar 2021, at 23:09, Simon Oosthoek 
> wrote:
> >
> > I'll wait a bit before upgrading the remaining nodes. I hope 14.2.19
> will be available quickly.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: Centos 8 2021 with ceph, how to move forward?

2021-01-14 Thread Sasha Litvak
What is not clear to me is what will be available as container images.  I run
Ceph in containers.  I also wonder if IBM will decide to close Ceph and other
projects, similar to what happened with CentOS.

On Thu, Jan 14, 2021, 7:01 AM Szabo, Istvan (Agoda) 
wrote:

> Thank you guys, so we might use Ubuntu-based then, as it has good
> driver support and the LTS sounds like a working solution.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2021. Jan 14., at 19:50, Martin Verges  wrote:
>
>
> Hello,
>
> we at croit use Ceph on Debian and deploy all our clusters with it.
> It works like a charm, and I personally have had quite good experience with
> it for ~20 years. It is a fantastic, solid OS for servers.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
> Am Do., 14. Jan. 2021 um 11:12 Uhr schrieb David Majchrzak, ODERLAND
> Webbhotell AB :
>
> One of our providers (CloudLinux) released a 1:1 binary-compatible
> Red Hat fork due to the changes with CentOS 8.
>
> Could be worth looking at.
>
> https://almalinux.org/
>
> In our case we're using ceph on debian 10.
>
> --
>
> David Majchrzak
> CTO
> Oderland Webbhotell AB
> Östra Hamngatan 50B, 411 09 Göteborg, SWEDEN
>
> Den 2021-01-14 kl. 09:04, skrev Szabo, Istvan (Agoda):
> Hi,
>
> Just curious how you guys are moving forward with this CentOS 8 change.
>
> We just finished installing our full multisite cluster, and it looks like we
> need to change the operating system.
>
> So I'm curious, for those of you using CentOS 8 with Ceph, where are you
> going to move?
>
> Thank you
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-31 Thread Sasha Litvak
Hello everyone,

Assuming that the backport has been merged for a few days now, is there a
chance that 14.2.13 will be released?


On Fri, Oct 23, 2020, 6:03 AM Van Alstyne, Kenneth <
kenneth.vanalst...@perspecta.com> wrote:

> Jason/Wido, et al:
>  I was hitting this exact problem when attempting to update from
> 14.2.11 to 14.2.12.  I reverted the two commits associated with that pull
> request and was able to successfully upgrade to 14.2.12.  Everything seems
> normal, now.
>
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> M: 804.240.2327
> 14291 Park Meadow Drive, Chantilly, VA 20151
> perspecta
>
> 
> From: Jason Dillaman 
> Sent: Thursday, October 22, 2020 12:54 PM
> To: Wido den Hollander 
> Cc: ceph-users@ceph.io 
> Subject: [EXTERNAL] [ceph-users] Re: 14.2.12 breaks mon_host pointing to
> Round Robin DNS entry
>
> This backport [1] looks suspicious as it was introduced in v14.2.12
> and directly changes the initial MonMap code. If you revert it in a
> dev build does it solve your problem?
>
> [1] https://github.com/ceph/ceph/pull/36704
>
> On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander  wrote:
> >
> > Hi,
> >
> > I already submitted a ticket: https://tracker.ceph.com/issues/47951
> >
> > Maybe other people noticed this as well.
> >
> > Situation:
> > - Cluster is running IPv6
> > - mon_host is set to a DNS entry
> > - DNS entry is a Round Robin with three AAAA-records
> >
> > root@wido-standard-benchmark:~# ceph -s
> > unable to parse addrs in 'mon.objects.xx.xxx.net'
> > [errno 22] error connecting to the cluster
> > root@wido-standard-benchmark:~#
> >
> > The relevant part of the ceph.conf:
> >
> > [global]
> > auth_client_required = cephx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > mon_host = mon.objects.xxx.xxx.xxx
> > ms_bind_ipv6 = true
> >
> > This works fine with 14.2.11 and breaks under 14.2.12
> >
> > Anybody else seeing this as well?
> >
> > Wido
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> --
> Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Migration to ceph.readthedocs.io underway

2020-09-16 Thread Sasha Litvak
I wonder if this new system allows me to choose between Ceph versions.  I see
the v:latest selector in the bottom right corner, but it seems to be the only
choice so far.

On Wed, Sep 16, 2020 at 12:31 PM Marc Roos  wrote:

>
>
> - In the future you will not be able to read the docs if you have an
> adblocker(?)
>
>
>
> -Original Message-
> To: dev; ceph-users
> Cc: Kefu Chai
> Subject: [ceph-users] Migration to ceph.readthedocs.io underway
>
> Hi everyone,
>
> We are in the process of migrating from docs.ceph.com to
> ceph.readthedocs.io. We enabled it in
> https://github.com/ceph/ceph/pull/34499 and will now be using it by
> default.
>
> Why?
>
> - The search feature in ceph.readthedocs.io is much better than
> docs.ceph.com and allows you to search multiple strings.
> - RTD provides an in-built version switching feature which we plan to
> use in future.
>
> What does it mean to you?
>
> - Some broken links are expected during this migration. Things like ceph
> API documentation need special handling (example:
> https://docs.ceph.com/en/latest/rados/api/) and are expected to be
> broken temporarily.
>
> - Much better Ceph documentation experience once the migration is done.
>
> Thanks for your patience!
>
> Cheers,
> Neha
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: YUM doesn't find older release version of nautilus

2020-07-14 Thread Sasha Litvak
The Nautilus RPM repository is still broken.  I cannot build containers with
versions <= 14.2.9, but 14.2.10 builds fine.  Please fix it.
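
As a concrete illustration, a pinned container build along these lines is the
kind of thing that now fails for anything older than 14.2.10, because yum only
sees what the repodata index lists (the base image and pin syntax here are
just an example, not my exact Dockerfile):

    FROM centos:7
    RUN yum install -y epel-release \
     && yum install -y https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm \
     && yum install -y ceph-common-14.2.9  # fails while repodata only lists 14.2.10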


On Thu, Jul 2, 2020 at 9:51 AM Lee, H. (Hurng-Chun) 
wrote:

> Hi,
>
> On Thu, 2020-07-02 at 16:15 +0200, Janne Johansson wrote:
> Den tors 2 juli 2020 kl 14:42 skrev Lee, H. (Hurng-Chun) <
> h@donders.ru.nl>:
> Hi,
>
> We use the official Ceph RPM repository
> (http://download.ceph.com/rpm-nautilus/el7) for installing packages on
> the client nodes running CentOS 7.
>
> But we noticed today that the repo only provides the latest version
> (2:14.2.10-0.el7) of nautilus so that we couldn't install an older
> (2:14.2.7-0.el7) version of the ceph-common package.
>
>
> The link you showed lists this:
>
>     ../
>     repodata/                        26-Jun-2020 14:47       -
>     ceph-14.1.0-0.el7.x86_64.rpm     01-Mar-2019 18:34    3024
>     ceph-14.1.1-0.el7.x86_64.rpm     12-Mar-2019 18:45    3024
>     ceph-14.2.0-0.el7.x86_64.rpm     20-Mar-2019 05:47    3024
>     ceph-14.2.1-0.el7.x86_64.rpm     26-Apr-2019 16:06    3024
>     ceph-14.2.10-0.el7.x86_64.rpm    26-Jun-2020 14:46    3116
>     ceph-14.2.2-0.el7.x86_64.rpm     17-Jul-2019 23:10    3024
>     ceph-14.2.3-0.el7.x86_64.rpm     03-Sep-2019 23:52    3024
>     ceph-14.2.4-0.el7.x86_64.rpm     16-Sep-2019 19:30    3024
>     ceph-14.2.5-0.el7.x86_64.rpm     12-Dec-2019 20:44    3024
>     ceph-14.2.6-0.el7.x86_64.rpm     14-Jan-2020 23:11    3024
>     ceph-14.2.7-0.el7.x86_64.rpm     04-Mar-2020 14:19    3024
>     ceph-14.2.8-0.el7.x86_64.rpm     04-Mar-2020 14:07    3024
>     ceph-14.2.9-0.el7.x86_64.rpm     10-Apr-2020 13:46    3104
>
> Does it mean that the official repo no longer provides RPM packages for
> older versions?  Thanks!
>
>
> No, it probably only lists the latest in the index file.
>
> Indeed, the XML file specified in
> http://download.ceph.com/rpm-nautilus/el7/x86_64/repodata/repomd.xml only
> lists 14.2.10 ...
>
>
> http://download.ceph.com/rpm-nautilus/el7/x86_64/repodata/aca9315a50c3b0f9925cd2adb9fae1f7d1ede20eff452a3c2f7cb14edcebe3a4-other.xml.gz
>
> Good to know that those older versions are not skipped intentionally.  I
> guess it will be fixed later?
>
> Hong
>
>
>
> --
>
> Hurng-Chun (Hong) Lee, PhD
> ICT manager
>
> Donders Institute for Brain, Cognition and Behaviour,
> Centre for Cognitive Neuroimaging
> Radboud University Nijmegen
>
> e-mail: h@donders.ru.nl
> tel: +31(0) 243610977
> web: http://www.ru.nl/donders/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Podman 2 + cephadm bootstrap == mon won't start

2020-07-10 Thread Sasha Litvak
David,

I see an issue on GitHub:
https://github.com/containers/podman/issues/6933

I could send you a 1.9.3 deb source + dsc file so you can rebuild and
downgrade.  Or you can sign in to the SUSE Open Build Service yourself
and get it from there.  You still have to build it, but it is not hard.


https://build.opensuse.org/package/show/devel:kubic:libcontainers:stable/podman?rev=285

If you sign in, the download links will be active for you.
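
Once you have the rebuilt 1.9.3 package locally, the downgrade-and-pin is the
usual apt dance (the exact version string depends on the build):

    apt-get install ./podman_1.9.3*_amd64.deb
    apt-mark hold podman   # keep apt from pulling 2.x back in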

On Fri, Jul 10, 2020 at 5:36 PM David Orman  wrote:

> Hi,
>
> Using the repo suggested for Ubuntu 18 (
>
> https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_18.04/
> ) podman 2.0.2~1 is installed. However, when attempting to use cephadm to
> bootstrap a cluster, we see an error when attempting to start the mon
> container:
>
> "Error: invalid config provided: AppArmorProfile and privileged are
> mutually exclusive options"
>
> From the bit of reading we've done, this looks to be an issue with Podman
> v2 compatibility, and it appears to break with Ceph 15.2.4.
>
> Has anybody else run into this/been able to workaround it? We'll have to
> downgrade podman, but unfortunately, that repo does not keep previous
> versions.
>
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v15.2.4 Octopus released

2020-06-30 Thread Sasha Litvak
David,

Download link points to 14.2.10 tarball.

On Tue, Jun 30, 2020, 3:38 PM David Galloway  wrote:

> We're happy to announce the fourth bugfix release in the Octopus series.
> In addition to a security fix in RGW, this release brings a range of fixes
> across all components. We recommend that all Octopus users upgrade to this
> release. For detailed release notes with links & changelog, please
> refer to the official blog entry at
> https://ceph.io/releases/v15-2-4-octopus-released
>
> Notable Changes
> ---
> * CVE-2020-10753: rgw: sanitize newlines in s3 CORSConfiguration's
> ExposeHeader
>   (William Bowling, Adam Mohammed, Casey Bodley)
>
> * Cephadm: There were a lot of small usability improvements and bug fixes:
>   * Grafana when deployed by Cephadm now binds to all network interfaces.
>   * `cephadm check-host` now prints all detected problems at once.
>   * Cephadm now calls `ceph dashboard set-grafana-api-ssl-verify false`
> when generating an SSL certificate for Grafana.
>   * The Alertmanager is now correctly pointed to the Ceph Dashboard
>   * `cephadm adopt` now supports adopting an Alertmanager
>   * `ceph orch ps` now supports filtering by service name
>   * `ceph orch host ls` now marks hosts as offline, if they are not
> accessible.
>
> * Cephadm can now deploy NFS Ganesha services. For example, to deploy NFS
> with
>   a service id of mynfs, that will use the RADOS pool nfs-ganesha and
> namespace
>   nfs-ns::
>
> ceph orch apply nfs mynfs nfs-ganesha nfs-ns
>
> * Cephadm: `ceph orch ls --export` now returns all service specifications
> in
>   yaml representation that is consumable by `ceph orch apply`. In addition,
>   the commands `orch ps` and `orch ls` now support `--format yaml` and
>   `--format json-pretty`.
>
> * Cephadm: `ceph orch apply osd` supports a `--preview` flag that prints a
> preview of
>   the OSD specification before deploying OSDs. This makes it possible to
>   verify that the specification is correct, before applying it.
>
> * RGW: The `radosgw-admin` sub-commands dealing with orphans --
>   `radosgw-admin orphans find`, `radosgw-admin orphans finish`, and
>   `radosgw-admin orphans list-jobs` -- have been deprecated. They have
>   not been actively maintained and they store intermediate results on
>   the cluster, which could fill a nearly-full cluster.  They have been
>   replaced by a tool, currently considered experimental,
>   `rgw-orphan-list`.
>
> * RBD: The name of the rbd pool object that is used to store
>   rbd trash purge schedule is changed from "rbd_trash_trash_purge_schedule"
>   to "rbd_trash_purge_schedule". Users that have already started using
>   `rbd trash purge schedule` functionality and have per pool or namespace
>   schedules configured should copy "rbd_trash_trash_purge_schedule"
>   object to "rbd_trash_purge_schedule" before the upgrade and remove
>   "rbd_trash_purge_schedule" using the following commands in every RBD
>   pool and namespace where a trash purge schedule was previously
>   configured::
>
> rados -p <pool> [-N namespace] cp rbd_trash_trash_purge_schedule
> rbd_trash_purge_schedule
> rados -p <pool> [-N namespace] rm rbd_trash_trash_purge_schedule
>
>   or use any other convenient way to restore the schedule after the
>   upgrade.
>
> Getting Ceph
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-14.2.10.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: 7447c15c6ff58d7fce91843b705a268a1917325c
>
> --
> David Galloway
> Systems Administrator, RDU
> Ceph Engineering
> IRC: dgalloway
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 14.2.9 MDS Failing

2020-05-03 Thread Sasha Litvak
Marco,

Could you please share what was done to make your cluster stable again?

On Fri, May 1, 2020 at 4:47 PM Marco Pizzolo  wrote:
>
> Thanks Everyone,
>
> I was able to address the issue, at least temporarily.  The filesystem and
> MDSes are staying online for now, and the PGs are being remapped.
>
> What I'm not sure about is the best tuning for the MDS given our use case,
> nor am I sure exactly what caused the OSDs to flap as they did, so I don't
> yet know how to avoid a recurrence.
>
> I do very much like Ceph though
>
> Best wishes,
>
> Marco
>
> On Fri, May 1, 2020 at 3:49 PM Marco Pizzolo  wrote:
>
> > Understood Paul, thanks.
> >
> > In case this helps to shed any further light...Digging through logs I'm
> > also seeing this:
> >
> > 2020-05-01 10:06:55.984 7eff10cc3700  1 mds.prdceph01 Updating MDS map to
> > version 1487236 from mon.2
> > 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 1 included below; oldest blocked for > 254.203584 secs
> > 2020-05-01 10:06:56.398 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.552277 seconds old, received at 2020-05-01 10:05:55.846466:
> > client_request(client.2525280:277916371 mkdir #0x10014e76974/1f 2020-05-01
> > 10:05:55.844490 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 2 included below; oldest blocked for > 255.205489 secs
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.564545 seconds old, received at 2020-05-01 10:05:56.836104:
> > client_request(client.2525280:277921203 create
> > #0x10014f12b86/9254b3f0-1d5a-4e88-8d41-d36f244bcb12.zip 2020-05-01
> > 10:05:56.834494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:57.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > slow request 60.550874 seconds old, received at 2020-05-01 10:05:56.849775:
> > client_request(client.2525280:277921267 mkdir #0x10014e78bec/e0 2020-05-01
> > 10:05:56.848494 caller_uid=1010, caller_gid=1015{}) currently submit entry:
> > journal_and_reply
> > 2020-05-01 10:06:58.400 7eff0e4be700  0 log_channel(cluster) log [WRN] :
> > 17 slow requests, 0 included below; oldest blocked for > 256.205519 secs
> > 2020-05-01 10:07:15.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:15.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 3.9s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:15.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:15.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 4.4s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:16.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:16.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 4.8s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:16.750 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:16.750 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 5.49998s ago); MDS internal
> > heartbeat is not healthy!
> > 2020-05-01 10:07:17.250 7eff0dcbd700  1 heartbeat_map is_healthy 'MDSRank'
> > had timed out after 15
> > 2020-05-01 10:07:17.250 7eff0dcbd700  0 mds.beacon.prdceph01 Skipping
> > beacon heartbeat to monitors (last acked 5.8s ago); MDS internal
> > heartbeat is not healthy!
> >
> >
> > THEN about 5 minutes later...
> >
> >
> >
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  9: 'ceph'
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01 respawning with exe
> > /usr/bin/ceph-mds
> > 2020-05-01 10:07:35.559 7eff10cc3700  1 mds.prdceph01  exe_path
> > /proc/self/exe
> > 2020-05-01 10:07:50.785 7fbff66291c0  0 ceph version 14.2.9
> > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process
> > ceph-mds, pid 9710
> > 2020-05-01 10:07:50.787 7fbff66291c0  0 pidfile_write: ignore empty
> > --pid-file
> > 2020-05-01 10:07:50.817 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487238 from mon.2
> > 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487239 from mon.2
> > 2020-05-01 10:07:55.820 7fbfe4408700  1 mds.prdceph01 Map has assigned me
> > to become a standby
> > 2020-05-01 10:11:07.369 7fbfe4408700  1 mds.prdceph01 Updating MDS map to
> > version 1487282 from mon.2
> > 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map i am
> > now mds.0.1487282
> > 2020-05-01 10:11:07.373 7fbfe4408700  1 mds.0.1487282 handle_mds_map state
> > change up:boot --> 

[ceph-users] Re: v15.2.0 Octopus released

2020-03-25 Thread Sasha Litvak
I assume upgrading a cluster running in docker / podman containers should be a
non-issue, shouldn't it?  Just making sure.  I also wonder if anything is
different in this case from the normal container upgrade scenario,
i.e. monitors -> mgrs -> osds -> mdss -> clients.
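
For a plain (non-cephadm) docker deployment, the per-daemon step of that
order is just a pull-and-recreate, something like this sketch (names, mounts,
and flags are abbreviated placeholders):

    docker pull ceph/ceph:v15.2.0
    docker stop ceph-mon-node1 && docker rm ceph-mon-node1
    docker run -d --name ceph-mon-node1 --net=host \
        -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph \
        ceph/ceph:v15.2.0 ceph-mon -f -i node1

repeated monitor by monitor, then for the mgrs, osds, and mdss.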

Thank you,

On Wed, Mar 25, 2020, 5:32 AM Wido den Hollander  wrote:

>
>
> On 3/25/20 10:24 AM, Simon Oosthoek wrote:
> > On 25/03/2020 10:10, konstantin.ilya...@mediascope.net wrote:
> >> That is why I am asking that question about the upgrade instructions.
> >> I really don't understand how to upgrade/reinstall CentOS 7 to 8
> >> without affecting the work of the cluster.
> >> As I know, this process is easier on Debian, but we deployed our
> >> Nautilus cluster on CentOS because there weren't any packages for 14.x
> >> for Debian Stretch (9) or Buster (10).
> >> P.s.: if this is even possible, I would like to know how to upgrade
> >> servers with CentOS 7 + Ceph 14.2.8 to Debian 10 with Ceph 15.2.0 (we
> >> have servers with OSDs only and 3 servers with Mon/Mgr/MDS)
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> > I guess you could upgrade each node one by one. So upgrade/reinstall the
> > OS, install Ceph 15 and re-initialise the OSDs if necessary. Though it
> > would be nice if there was a way to re-integrate the OSDs from the
> > previous installation...
> >
>
> That works just fine. You can re-install the host OS and have
> ceph-volume scan all the volumes. The OSDs should then just come back.
>
> Or you can take the even safer route by removing OSDs completely from
> the cluster and wiping a box.
>
> Did this recently with a customer. In the meantime they took the
> oppertunity to also flash the firmware of all the components and the
> machines came back again with a complete fresh installation.
>
> > Personally, I'm planning to wait for a while to upgrade to Ceph 15, not
> > in the least because it's not convenient to do stuff like OS upgrades
> > from home ;-)
> >
> > Currently we're running ubuntu 18.04 on the ceph nodes, I'd like to
> > upgrade to ubuntu 20.04 and then to ceph 15.
> >
>
> I think many people will do this. I wouldn't run 15.2.0 on my production
> environment right away.
>
> Wido
>
> > Cheers
> >
> > /Simon
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Module 'telemetry' has experienced an error

2020-02-21 Thread Sasha Litvak
Thore,

Thank you for your reply.
Unless the issue was specifically with a Ceph telemetry server or its
subnet, we had no network issues at that time; at least none were reported by
monitoring or customers.  It is very weird, unless the telemetry module has
a bug of some kind and hangs on its own.  But this is the first time
it has happened, and the cluster has been up for a year.
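
For the record, the disable/enable cycle from the report quoted below was
just the standard mgr module commands:

    ceph mgr module disable telemetry
    ceph mgr module enable telemetry
    ceph health detail   # confirm the cluster returns to HEALTH_OK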

On Fri, Feb 21, 2020 at 3:49 AM Thore Krüss  wrote:

> On Fri, Feb 21, 2020 at 05:28:12AM -, alexander.v.lit...@gmail.com
> wrote:
> > This evening I was awakened by an error message
> >
> >  cluster:
> > id: 9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
> > health: HEALTH_ERR
> > Module 'telemetry' has failed: ('Connection aborted.',
> error(101, 'Network is unreachable'))
> >
> >   services:
> >
> > I have not seen any other problems with anything else on the cluster.  I
> disabled and enabled the telemetry module and health returned to OK
> status.  Any ideas on what could cause the issue?  As far as I understand,
> telemetry is a module that sends messages to an external ceph server
> outside of the network.
>
> Maybe an uplink issue? We had similar behaviour when we had some trouble
> with a core router.
>
> You were able to disable and enable the module? This failed for me, with
> the reason that the module had failed (Nautilus). Restarting all mgrs did
> help.
>
> Still - I'm not sure why this is considered to be a health_err.
>
> Best regards
> Thore
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] lists and gmail

2020-01-20 Thread Sasha Litvak
It seems that people are now split between the new and old list servers.
Regardless of which one, I am missing a number of messages that appear on the
archive pages but never seem to make it to my inbox.  And no, they are not in
my junk folder.  I wonder if some of my questions are not getting a response
because people don't receive them.  Any other reason, like people choosing
not to answer, is perfectly acceptable.  Does anyone else have difficulty
communicating with the user list using a gmail account?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Uneven Node utilization

2020-01-16 Thread Sasha Litvak
Hello, Cephers,

I have a small 6-node cluster with 36 OSDs.  When running
benchmark/torture tests, I noticed that some nodes, usually storage2n6-la
and sometimes others, are utilized much more.  I see some OSDs at 100%
utilization with a load average up to 21, while on the other nodes the load
average is 5 - 6 and OSDs sit at 40 - 60% utilization.  I cannot use upmap
mode for the balancer because I still have some client machines running
hammer.  I wonder if my issue is caused by the compat balancing mode, as the
compat weights show nodes with the same number and size of disks but
different compat weights.  If so, what can I do to improve the load/disk
usage distribution in the cluster?  Also, my legacy client machines only need
to access cephfs on the new cluster, so I wonder whether keeping hammer as
the oldest client version makes sense, or whether I should change it to jewel
and set crush tunables to optimal.
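
For reference, the change contemplated above would be the standard sequence
below; note that the upmap balancer itself would additionally require raising
the minimum client version to luminous:

    ceph osd set-require-min-compat-client jewel
    ceph osd crush tunables optimal   # triggers data movement, so schedule it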

Help is greatly appreciated,


ceph df
RAW STORAGE:
    CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
    ssd    94 TiB  88 TiB  5.9 TiB   5.9 TiB       6.29
    TOTAL  94 TiB  88 TiB  5.9 TiB   5.9 TiB       6.29

POOLS:
    POOL             ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    cephfs_data       1  1.6 TiB    3.77M  4.9 TiB   5.57     28 TiB
    cephfs_metadata   2  3.9 GiB  367.34k  4.3 GiB      0     28 TiB
    one               5  344 GiB   90.94k  1.0 TiB   1.20     28 TiB
 ceph -s
  cluster:
id: 9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
health: HEALTH_OK

  services:
mon: 3 daemons, quorum storage2n1-la,storage2n2-la,storage2n3-la (age
39h)
mgr: storage2n1-la(active, since 39h), standbys: storage2n2-la,
storage2n3-la
mds: cephfs:1 {0=storage2n4-la=up:active} 1 up:standby-replay 1
up:standby
osd: 36 osds: 36 up (since 37h), 36 in (since 10w)

  data:
pools:   3 pools, 1664 pgs
objects: 4.23M objects, 1.9 TiB
usage:   5.9 TiB used, 88 TiB / 94 TiB avail
pgs: 1664 active+clean

  io:
client:   1.2 KiB/s rd, 46 KiB/s wr, 5 op/s rd, 2 op/s wr

Ceph osd df looks like this

ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS
 6   ssd 1.74609  1.0 1.7 TiB 115 GiB 114 GiB 186 MiB  838 MiB 1.6 TiB 6.45 1.02  92 up
12   ssd 1.74609  1.0 1.7 TiB 122 GiB 121 GiB  90 MiB  934 MiB 1.6 TiB 6.81 1.08  92 up
18   ssd 1.74609  1.0 1.7 TiB 112 GiB 111 GiB 107 MiB  917 MiB 1.6 TiB 6.24 0.99  91 up
24   ssd 3.49219  1.0 3.5 TiB 233 GiB 232 GiB 206 MiB  818 MiB 3.3 TiB 6.53 1.04 185 up
30   ssd 3.49219  1.0 3.5 TiB 224 GiB 223 GiB 246 MiB  778 MiB 3.3 TiB 6.25 0.99 187 up
35   ssd 3.49219  1.0 3.5 TiB 216 GiB 215 GiB 252 MiB  772 MiB 3.3 TiB 6.04 0.96 184 up
 5   ssd 1.74609  1.0 1.7 TiB 112 GiB 111 GiB  88 MiB  936 MiB 1.6 TiB 6.28 1.00  92 up
11   ssd 1.74609  1.0 1.7 TiB 112 GiB 111 GiB 112 MiB  912 MiB 1.6 TiB 6.26 0.99  92 up
17   ssd 1.74609  1.0 1.7 TiB 112 GiB 111 GiB 274 MiB  750 MiB 1.6 TiB 6.25 0.99  94 up
23   ssd 3.49219  1.0 3.5 TiB 234 GiB 233 GiB 192 MiB  832 MiB 3.3 TiB 6.54 1.04 183 up
29   ssd 3.49219  1.0 3.5 TiB 216 GiB 215 GiB 356 MiB  668 MiB 3.3 TiB 6.03 0.96 184 up
34   ssd 3.49219  1.0 3.5 TiB 227 GiB 226 GiB 267 MiB  757 MiB 3.3 TiB 6.34 1.01 184 up
 4   ssd 1.74609  1.0 1.7 TiB 125 GiB 124 GiB  16 MiB 1008 MiB 1.6 TiB 7.00 1.11  94 up
10   ssd 1.74609  1.0 1.7 TiB 108 GiB 107 GiB 163 MiB  861 MiB 1.6 TiB 6.01 0.96  93 up
16   ssd 1.74609  1.0 1.7 TiB 107 GiB 106 GiB 163 MiB  861 MiB 1.6 TiB 6.00 0.95  94 up
22   ssd 3.49219  1.0 3.5 TiB 221 GiB 220 GiB 385 MiB  700 MiB 3.3 TiB 6.18 0.98 187 up
28   ssd 3.49219  1.0 3.5 TiB 223 GiB 222 GiB 257 MiB  767 MiB 3.3 TiB 6.23 0.99 186 up
33   ssd 3.49219  1.0 3.5 TiB 241 GiB 240 GiB 233 MiB  791 MiB 3.3 TiB 6.74 1.07 185 up
 1   ssd 1.74609  1.0 1.7 TiB 103 GiB 102 GiB 240 MiB  784 MiB 1.6 TiB 5.76 0.92  93 up
 7   ssd 1.74609  1.0 1.7 TiB 117 GiB 116 GiB  70 MiB  954 MiB 1.6 TiB 6.56 1.04  91 up
13   ssd 1.74609  1.0 1.7 TiB 126 GiB 125 GiB  76 MiB  948 MiB 1.6 TiB 7.03 1.12  95 up
19   ssd 3.49219  1.0 3.5 TiB 230 GiB 229 GiB 307 MiB  717 MiB 3.3 TiB 6.44 1.02 186 up
25   ssd 3.49219  1.0 3.5 TiB 220 GiB 219 GiB 309 MiB  715 MiB 3.3 TiB 6.15 0.98 185 up
31   ssd 3.49219  1.0 3.5 TiB 223 GiB 222 GiB 205 MiB  819 MiB 3.3 TiB 6.23 0.99 186 up
 0   ssd 1.74609  1.0 1.7 TiB 116 GiB 115 GiB 151 MiB  873 MiB 1.6 TiB 6.49 1.03  93 up
 3   ssd 1.74609  1.0 1.7 TiB 121 GiB 120 GiB  89 MiB  935 MiB 1.6 TiB 6.77 1.08  91 up
 9   ssd 1.74609  1.0 1.7 TiB 104 GiB 103 GiB 183 MiB  841 MiB 1.6 TiB 5.81 0.92  93 up
15   ssd 3.49219  1.0 3.5 TiB 222 GiB 221 GiB 205 MiB  819 MiB 3.3 TiB 6.20 0.98 185 up
21   ssd 3.49219  1.0 3.5 TiB 213 GiB 

[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5

2019-12-20 Thread Sasha Litvak
Was the root cause found and fixed?  If so, will the fix be available in
14.2.6 or sooner?

On Thu, Dec 19, 2019 at 5:48 PM Mark Nelson  wrote:

> Hi Paul,
>
>
> Thanks for gathering this!  It looks to me like at the very least we
> should redo the fixed_u_to_string and fixed_to_string functions in
> common/Formatter.cc.  That alone looks like it's having a pretty
> significant impact.
>
>
> Mark
>
>
> On 12/19/19 2:09 PM, Paul Mezzanini wrote:
> > Based on what we've seen with perf, we think this is the relevant
> > section.  (attached is also the whole file)
> >
> > Thread: 73 (mgr-fin) - 1000 samples
> >
> > + 100.00% clone
> >   + 100.00% start_thread
> >     + 100.00% Finisher::finisher_thread_entry()
> >       + 99.40% Context::complete(int)
> >         + 99.40% FunctionContext::finish(int)
> >           + 99.40% ActivePyModule::notify(std::string const&, std::string const&)
> >             + 91.30% PyObject_CallMethod
> >               + 91.30% call_function_tail
> >                 + 91.30% PyObject_Call
> >                   + 91.30% instancemethod_call
> >                     + 91.30% PyObject_Call
> >                       + 91.30% function_call
> >                         + 91.30% PyEval_EvalCodeEx
> >                           + 88.40% PyEval_EvalFrameEx
> >                             + 88.40% PyEval_EvalFrameEx
> >                               + 88.40% ceph_state_get(BaseMgrModule*, _object*)
> >                                 + 88.40% ActivePyModules::get_python(std::string const&)
> >                                   + 51.10% PGMap::dump_osd_stats(ceph::Formatter*) const
> >                                     + 51.10% osd_stat_t::dump(ceph::Formatter*) const
> >                                       + 22.50% ceph::fixed_u_to_string(unsigned long, int)
> >                                         + 10.50% std::basic_ostringstream<char, ...>::basic_ostringstream(std::_Ios_Openmode)
> >                                           + 9.30% std::basic_ios<char, ...>::init(std::basic_streambuf<char, ...>*)
> >                                             + 7.00% std::basic_ios<char, ...>::_M_cache_locale(std::locale const&)
> >                                               + 1.60% std::use_facet<std::ctype<char> >(std::locale const&)
> >                                                 + 1.50% __dynamic_cast
> >                                                   + 0.80% __cxxabiv1::__vmi_class_type_info::__do_dyncast(...) const
> >                                               + 1.40% std::has_facet<std::ctype<char> >(std::locale const&)
> >                                                 + 1.30% __dynamic_cast
> >                                                   + 0.90% __cxxabiv1::__vmi_class_type_info::__do_dyncast(...) const
> >                                               + 1.10% std::has_facet<std::num_put<char, ...> >(std::locale const&)
> >                                                 + 0.90% __dynamic_cast
> >                                               + 1.00% std::has_facet<std::num_get<char, ...> >(std::locale const&)
> >                                                 + 0.70% __dynamic_cast
> >                                               + 0.80% std::use_facet<std::num_put<char, ...> >(std::locale const&)
> >                                               + 0.70% std::use_facet<std::num_get<char, ...> >(std::locale const&)
> >                                             + 2.00% std::ios_base::_M_init()
> >                                               (std::locale ctor/dtor/assignment)
> >                                           + 0.90% std::locale::locale()
> >                                           + 0.10% std::ios_base::ios_base()
> >                                         [...]

[ceph-users] Re: ceph-mon using 100% CPU after upgrade to 14.2.5

2019-12-16 Thread Sasha Litvak
Bryan, thank you.  We are about to start testing the 14.2.2 -> 14.2.5 upgrade,
so folks here are a bit cautious :-)  We don't need to convert, but we may have
to rebuild a few disks after the upgrade.

On Mon, Dec 16, 2019 at 3:57 PM Bryan Stillwell 
wrote:

> Sasha,
>
> I was able to get past it by restarting the ceph-mon processes every time
> it got stuck, but that's not a very good solution for a production cluster.
>
> Right now I'm trying to narrow down what is causing the problem.
> Rebuilding the OSDs with BlueStore doesn't seem to be enough.  I believe it
> could be related to us using the extra space on the journal device as an
> SSD-based OSD.  During the conversion process I'm removing this SSD-based
> OSD (since with BlueStore the omap data is ending up on the SSD anyways),
> and I'm suspecting it might be causing this problem.
>
> Bryan
>
> On Dec 14, 2019, at 10:27 AM, Sasha Litvak 
> wrote:
>
> Bryan,
>
> Were you able to resolve this?  If yes, can you please share with the list?
>
> On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell 
> wrote:
>
>> Adding the dev list since it seems like a bug in 14.2.5.
>>
>> I was able to capture the output from perf top:
>>
>>   21.58%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>   20.90%  libstdc++.so.6.0.19   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>>   13.25%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>>   10.11%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry
>>    8.94%  libstdc++.so.6.0.19   [.] std::basic_ios<char, std::char_traits<char> >::clear
>>    3.24%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>>    1.69%  libceph-common.so.0   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>>    1.63%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry@plt
>>    1.21%  [kernel]              [k] __do_softirq
>>    0.77%  libpython2.7.so.1.0   [.] PyEval_EvalFrameEx
>>    0.55%  [kernel]              [k] _raw_spin_unlock_irqrestore
>>
>> I increased mon debugging to 20 and nothing stuck out to me.
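>>
>> (Roughly, those two steps would be along these lines -- the exact
>> invocations are assumed, not part of the original report:
>>     perf top -p "$(pidof ceph-mon)"
>>     ceph tell mon.\* injectargs '--debug_mon 20'
>> )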
>>
>> Bryan
>>
>> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell 
>> wrote:
>> >
>> > On our test cluster after upgrading to 14.2.5 I'm having problems with
>> the mons pegging a CPU core while moving data around.  I'm currently
>> converting the OSDs from FileStore to BlueStore by marking the OSDs out in
>> multiple nodes, destroying the OSDs, and then recreating them with
>> ceph-volume lvm batch.  This seems to get the ceph-mon process into a
>> state where it pegs a CPU core on one of the mons:
>> >
>> > 1764450 ceph  20   0 4802412   2.1g  16980 S 100.0 28.1   4:54.72
>> ceph-mon
>> >
>> > Has anyone else run into this with 14.2.5 yet?  I didn't see this
>> problem while the cluster was running 14.2.4.
>> >
>> > Thanks,
>> > Bryan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mon using 100% CPU after upgrade to 14.2.5

2019-12-14 Thread Sasha Litvak
Bryan,

Were you able to resolve this?  If yes, can you please share with the list?

On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell 
wrote:

> Adding the dev list since it seems like a bug in 14.2.5.
>
> I was able to capture the output from perf top:
>
>   21.58%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>   20.90%  libstdc++.so.6.0.19   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >
>   13.25%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::list::append
>   10.11%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry
>    8.94%  libstdc++.so.6.0.19   [.] std::basic_ios<char, std::char_traits<char> >::clear
>    3.24%  libceph-common.so.0   [.] ceph::buffer::v14_2_0::ptr::unused_tail_length
>    1.69%  libceph-common.so.0   [.] std::getline<char, std::char_traits<char>, std::allocator<char> >@plt
>    1.63%  libstdc++.so.6.0.19   [.] std::istream::sentry::sentry@plt
>    1.21%  [kernel]              [k] __do_softirq
>    0.77%  libpython2.7.so.1.0   [.] PyEval_EvalFrameEx
>    0.55%  [kernel]              [k] _raw_spin_unlock_irqrestore
>
> I increased mon debugging to 20 and nothing stuck out to me.
>
> Bryan
>
> > On Dec 12, 2019, at 4:46 PM, Bryan Stillwell 
> wrote:
> >
> > On our test cluster after upgrading to 14.2.5 I'm having problems with
> the mons pegging a CPU core while moving data around.  I'm currently
> converting the OSDs from FileStore to BlueStore by marking the OSDs out in
> multiple nodes, destroying the OSDs, and then recreating them with
> ceph-volume lvm batch.  This seems to get the ceph-mon process into a
> state where it pegs a CPU core on one of the mons:
> >
> > 1764450 ceph  20   0 4802412   2.1g  16980 S 100.0 28.1   4:54.72
> ceph-mon
> >
> > Has anyone else run into this with 14.2.5 yet?  I didn't see this
> problem while the cluster was running 14.2.4.
> >
> > Thanks,
> > Bryan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Replace ceph osd in a container

2019-10-22 Thread Sasha Litvak
Frank,

Thank you for your suggestion.  It sounds very promising.  I will
definitely try it.

Best,

On Tue, Oct 22, 2019, 2:44 AM Frank Schilder  wrote:

> > I am suspecting that mon or mgr have no access to /dev or /var/lib while
> > osd containers do.
> > Cluster configured originally by ceph-ansible (nautilus 14.2.2)
>
> They don't, because they don't need to.
>
> > The question is: if I want to replace all disks on a single node, and I
> > have 6 nodes with pools at replication 3, is it safe to restart the mgr
> > with /dev and /var/lib/ceph volumes mounted (not configured right now)?
>
> Restarting mons is safe in the sense that data will not get lost. However,
> access might get lost temporarily.
>
> The question is, how many mons do you have? If you have only 1 or 2, it
> will mean downtime. If you can bear the downtime, it doesn't matter. If you
> have at least 3, you can restart one after the other.
>
> However, I would not do that. Having to restart a mon container every time
> some minor container config changes, for reasons that have nothing to do
> with a mon, sounds like asking for trouble.
>
> I also use containers and would recommend a different approach. I created
> an additional type of container (ceph-adm) that I use for all admin tasks.
> It's the same image, and the entry point simply executes a sleep infinity. In
> this container I make all relevant hardware visible. You might also want to
> expose /var/run/ceph to be able to use admin sockets without hassle. This
> way, I separated admin operations from actual storage daemons and can
> modify and restart the admin container as I like.
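>
> A minimal sketch of such an admin container (image name, tag, and mount
> list are illustrative -- adjust to whatever your cluster actually runs):
>
>     podman run -d --name ceph-adm --privileged \
>         -v /dev:/dev -v /etc/ceph:/etc/ceph \
>         -v /var/lib/ceph:/var/lib/ceph -v /var/run/ceph:/var/run/ceph \
>         --entrypoint sleep ceph/daemon:latest-nautilus infinity
>
>     # then run admin tasks inside it, e.g.
>     podman exec -it ceph-adm ceph-volume lvm zap /dev/sdh --destroy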
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: ceph-users  on behalf of Alex
> Litvak 
> Sent: 22 October 2019 08:04
> To: ceph-us...@lists.ceph.com
> Subject: [ceph-users] Replace ceph osd in a container
>
> Hello cephers,
>
> So I am having trouble with a new hardware system with strange OSD
> behavior, and I want to replace a disk with a brand new one to test the
> theory.
>
> I run all daemons in containers, and on one of the nodes I have a mon, a
> mgr, and 6 OSDs.  So, following
> https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>
> I stopped the container with osd.23, waited until it was down and out, ran
> the safe-to-destroy loop, and then destroyed the OSD, all using the monitor
> container on this node.  All good.
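>
> For reference, the safe-to-destroy loop is essentially the one from that
> doc page (osd id 23 in this case; the exact script is assumed):
>
>     while ! ceph osd safe-to-destroy osd.23 ; do sleep 10 ; done
>     ceph osd destroy 23 --yes-i-really-mean-it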
>
> Then I swapped the SSDs and started running the additional steps (from step
> 3) using the same mon container.  I have no ceph packages installed on the
> bare-metal box.  It looks like the mon container doesn't see the disk.
>
>  podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
>   stderr: lsblk: /dev/sdh: not a block device
>   stderr: error: /dev/sdh: No such file or directory
>   stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or
> /sys expected.
> usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> [--osd-fsid OSD_FSID]
> [DEVICES [DEVICES ...]]
> ceph-volume lvm zap: error: Unable to proceed with non-existing device:
> /dev/sdh
> Error: exit status 2
> root@storage2n2-la:~# ls -l /dev/sd
> sda   sdc   sdd   sde   sdf   sdg   sdg1  sdg2  sdg5  sdh
> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume
> lvm zap sdh
>   stderr: lsblk: sdh: not a block device
>   stderr: error: sdh: No such file or directory
>   stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or
> /sys expected.
> usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> [--osd-fsid OSD_FSID]
> [DEVICES [DEVICES ...]]
> ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
> Error: exit status 2
>
> I execute lsblk and it sees device sdh
> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
> lsblk: dm-1: failed to get device path
> lsblk: dm-2: failed to get device path
> lsblk: dm-4: failed to get device path
> lsblk: dm-6: failed to get device path
> lsblk: dm-4: failed to get device path
> lsblk: dm-2: failed to get device path
> lsblk: dm-1: failed to get device path
> lsblk: dm-0: failed to get device path
> lsblk: dm-0: failed to get device path
> lsblk: dm-7: failed to get device path
> lsblk: dm-5: failed to get device path
> lsblk: dm-7: failed to get device path
> lsblk: dm-6: failed to get device path
> lsblk: dm-5: failed to get device path
> lsblk: dm-3: failed to get device path
> NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sdf  8:80   0   1.8T  0 disk
> sdd  8:48   0   1.8T  0 disk
> sdg  8:96   0 223.5G  0 disk
> |-sdg5   8:101  0   223G  0 part
> |-sdg1   8:97   0   487M  0 part
> `-sdg2   8:98   0     1K  0 part
> sde  8:64   0   1.8T  0 disk
> sdc  8:32   0   3.5T  0 disk
> sda  8:00   3.5T  0 disk
> sdh 

[ceph-users] Re: ceph mdss keep on crashing after update to 14.2.3

2019-09-19 Thread Sasha Litvak
Any chance for a fix soon?  In 14.2.5?

On Thu, Sep 19, 2019 at 8:44 PM Yan, Zheng  wrote:

> On Thu, Sep 19, 2019 at 11:37 PM Dan van der Ster 
> wrote:
> >
> > You were running v14.2.2 before?
> >
> > It seems that that  ceph_assert you're hitting was indeed added
> > between v14.2.2. and v14.2.3 in this commit
> >
> https://github.com/ceph/ceph/commit/12f8b813b0118b13e0cdac15b19ba8a7e127730b
> >
> > There's a comment in the tracker for that commit which says the
> > original fix was incomplete
> > (https://tracker.ceph.com/issues/39987#note-5)
> >
> > So perhaps nautilus needs
> >
> https://github.com/ceph/ceph/pull/28459/commits/0a1e92abf1cfc8bddf526cbf5bceea7b854dcfe8
> > ??
> >
>
> You are right. Sorry for the bug. For now, please go back to 14.2.2
> (just the mds) or compile ceph-mds from source.
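>
> (On Debian/Ubuntu, pinning just the MDS package back would look roughly
> like "apt install ceph-mds=14.2.2-1bionic" -- the exact version string is
> illustrative.)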
>
> Yan, Zheng
>
> > Did you already try going back to v14.2.2 (on the MDS's only) ??
> >
> > -- dan
> >
> > On Thu, Sep 19, 2019 at 4:59 PM Kenneth Waegeman
> >  wrote:
> > >
> > > Hi all,
> > >
> > > I updated our ceph cluster to 14.2.3 yesterday, and today the mds are
> crashing one after another. I'm using two active mds.
> > >
> > > I've made a tracker ticket, but I was wondering if someone else has
> > > seen this issue too?
> > >
> > >-27> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8887 lookup
> #0x100166004d4/WindowsPhone-MSVC-CXX.cmake 2019-09-19 15:42:00.203132
> caller_uid=0, caller_gid=0{0,}) v4
> > >-26> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865372:5815 lookup
> #0x20005a6eb3a/selectable.cpython-37.pyc 2019-09-19 15:42:00.204970
> caller_uid=0, caller_gid=0{0,}) v4
> > >-25> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333: lookup
> #0x100166004d4/WindowsPhone.cmake 2019-09-19 15:42:00.206381 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-24> 2019-09-19 15:42:00.206 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8889 lookup
> #0x100166004d4/WindowsStore-MSVC-C.cmake 2019-09-19 15:42:00.209703
> caller_uid=0, caller_gid=0{0,}) v4
> > >-23> 2019-09-19 15:42:00.206 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8890 lookup
> #0x100166004d4/WindowsStore-MSVC-CXX.cmake 2019-09-19 15:42:00.213200
> caller_uid=0, caller_gid=0{0,}) v4
> > >-22> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8891 lookup
> #0x100166004d4/WindowsStore.cmake 2019-09-19 15:42:00.216577 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-21> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8892 lookup
> #0x100166004d4/Xenix.cmake 2019-09-19 15:42:00.220230 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-20> 2019-09-19 15:42:00.216 7f0369aeb700  2 mds.1.cache Memory
> usage:  total 4603496, rss 4167920, heap 323836, baseline 323836, 501 /
> 1162471 inodes have caps, 506 caps, 0.00043528 caps per inode
> > >-19> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log
> _submit_thread 30520209420029~9062 : EUpdate scatter_writebehind [metablob
> 0x1000bd8ac7b, 2 dirs]
> > >-18> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log
> _submit_thread 30520209429111~10579 : EUpdate scatter_writebehind [metablob
> 0x1000bf26309, 9 dirs]
> > >-17> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log
> _submit_thread 30520209439710~2305 : EUpdate scatter_writebehind [metablob
> 0x1000bf2745b.001*, 2 dirs]
> > >-16> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log
> _submit_thread 30520209442035~1845 : EUpdate scatter_writebehind [metablob
> 0x1000c233753, 2 dirs]
> > >-15> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8893 lookup
> #0x100166004d4/eCos.cmake 2019-09-19 15:42:00.223360 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-14> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865319:2381 lookup
> #0x1001172f39d/microsoft-cp1251 2019-09-19 15:42:00.224940 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-13> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8894 lookup
> #0x100166004d4/gas.cmake 2019-09-19 15:42:00.226624 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-12> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865319:2382 readdir
> #0x1001172f3d7 2019-09-19 15:42:00.228673 caller_uid=0, caller_gid=0{0,}) v4
> > >-11> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server
> handle_client_request client_request(client.37865333:8895 lookup
> #0x100166004d4/kFreeBSD.cmake 2019-09-19 15:42:00.229668 caller_uid=0,
> caller_gid=0{0,}) v4
> > >-10> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server
> 

[ceph-users] Re: download.ceph.com repository changes

2019-09-17 Thread Sasha Litvak
I have been affected by a few of the issues mentioned by Alfredo.

* Version Pinning:  I had to install several debs of a specific version to
be able to pull in dependencies of the correct version (a sketch follows
this list).  I believe other projects resolve this by creating a virtual
package that pulls all of the proper dependencies in.  Not sure if the same
is done by RPM / Yum.

* Unannounced releases.  I believe this is more of a procedural issue, and
unfortunately something will need to be done to enforce compliance once the
rules for package releases are finalized.

* I am bothered by the quality of the releases of a very complex system
that can bring down a whole house and keep it down for a while.  While I
wish the QA were perfect, I wonder if it would be practical to release
new packages to a testing repo before moving them to the main one.  There
is then a chance someone will detect a problem before it becomes a
production issue.  Let a release sit for a couple of days or weeks in
testing.  People who need the new update right away or just want to test
will install it and report problems.  Others will not be affected.
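
As a sketch of the pinning pain point above: installing a specific version
on Debian/Ubuntu today means spelling out every package (package set and
version string illustrative):

    VER=14.2.2-1bionic
    apt-get install ceph=$VER ceph-base=$VER ceph-common=$VER \
        ceph-mon=$VER ceph-mgr=$VER ceph-osd=$VER ceph-mds=$VER \
        librados2=$VER librbd1=$VER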

Just my 2c,

On Tue, Sep 17, 2019, 8:15 AM Alfredo Deza  wrote:

> Reviving this old thread.
>
> I still think this is something we should consider as users still
> experience problems:
>
> * Impossible to 'pin' to a version. User installs 14.2.0 and 4 months
> later they add other nodes but version moved to 14.2.2
> * Impossible to use a version that is not what the latest is (e.g. if
> someone doesn't need the release from Monday, but wants the one from 6
> months ago), similar to the above
> * When a release is underway, the repository breaks because syncing
> packages takes hours. The operation is not atomic.
> * It is not currently possible to "remove" a bad release; in the past,
> this has meant cutting a new release as soon as possible, which can take
> days
>
> The latest issue (my fault!) was cutting a release and getting the packages
> out without communicating with the release manager, which caused users
> to notice the new version *as soon as it was up*, versus a
> process that wouldn't have touched the 'latest' url until the
> announcement went out.
>
> If you have been affected by any of these issues (or others I didn't
> come up with), please let us know in this thread so that we can find
> some common ground and try to improve the process.
>
> Thanks!
>
> On Tue, Jul 24, 2018 at 10:38 AM Alfredo Deza  wrote:
> >
> > Hi all,
> >
> > After the 12.2.6 release went out, we've been thinking about better ways
> > to remove a version from our repositories to prevent users from
> > upgrading/installing a known bad release.
> >
> > The way our repos are structured today means every single version of
> > the release is included in the repository. That is, for Luminous,
> > every 12.x.x version of the binaries is in the same repo. This is true
> > for both RPM and DEB repositories.
> >
> > However, the DEB repos don't allow pinning to a given version because
> > our tooling (namely reprepro) doesn't construct the repositories in a
> > way that this is allowed. For RPM repos this is fine, and version
> > pinning works.
> >
> > To remove a bad version we have two proposals (and would like to hear
> > ideas on other possibilities): one that would involve symlinks, and the
> > other one which purges the known bad version from our repos.
> >
> > *Symlinking*
> > When releasing we would have a "previous" and "latest" symlink that
> > would get updated as versions move forward. It would require
> > separation of versions at the URL level (all versions would no longer
> > be available in one repo).
> >
> > The URL structure would then look like:
> >
> > debian/luminous/12.2.3/
> > debian/luminous/previous/  (points to 12.2.5)
> > debian/luminous/latest/   (points to 12.2.7)
> >
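> > Under that layout, pinning would just mean pointing the package manager
> > at the versioned path; a hypothetical sources.list line (path and suite
> > illustrative):
> >
> >     deb https://download.ceph.com/debian/luminous/12.2.3/ xenial main
> >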
> > Caveats: the url structure would change (from debian-luminous/ to
> > debian/luminous/) to prevent breakage, and the versions would be split.
> > For RPMs it would mean a regression if someone is used to pinning; for
> > example, pinning to 12.2.2 wouldn't be possible using the same url.
> >
> > Pros: Faster release times, less need to move packages around, and
> > easier to remove a bad version
> >
> >
> > *Single version removal*
> > Our tooling would need to go and remove the known bad version from the
> > repository, which would require rebuilding the repository, so
> > that the metadata is updated with the difference in the binaries.
> >
> > Caveats: a time-intensive process, almost like cutting a new release,
> > which takes about a day (and sometimes longer). Error-prone, since the
> > process wouldn't be the same every time (a one-off, just when a version
> > needs to be removed)
> >
> > Pros: all urls for download.ceph.com and its structure are kept the
> same.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing 

[ceph-users] Re: CEPH 14.2.3

2019-09-04 Thread Sasha Litvak
I wonder if it is possible to send an announcement at least one business
day before you run the sync.  You can state the date of package
availability in the announcement.  This is just a suggestion.



On Wed, Sep 4, 2019, 9:23 AM Abhishek Lekshmanan  wrote:

> Fyodor Ustinov  writes:
>
> > Hi!
> >
> > The problem is not so much that you or I cannot update.
> >
> > The problem is that now you cannot install nautilus in at least one of
> > the standard ways.
> This should be fixed now, and the official release announcements are out
> just now. There were some packages missing in the sync we ran yesterday.
> >
> > - Original Message -
> >> From: "EDH - Manuel Rios Fernandez" 
> >> To: "Fyodor Ustinov" , "ceph-users" 
> >> Sent: Wednesday, 4 September, 2019 15:25:40
> >> Subject: RE: [ceph-users] Re: CEPH 14.2.3
> >
> >> There are no patch notes at ceph.com; I suggest not updating until the
> >> changelog is updated.
> >>
> >>
> >> -Mensaje original-
> >> De: Fyodor Ustinov 
> >> Enviado el: miércoles, 4 de septiembre de 2019 14:16
> >> Para: ceph-users 
> >> Asunto: [ceph-users] Re: CEPH 14.2.3
> >>
> >> Hi!
> >>
> >> And by the way, I confirm - the installation of nautilus is broken:
> >>
> >> $ ceph-deploy install --release nautilus S-26-6-2-3
> >>
> >> [S-26-6-2-3][WARNIN] Error downloading packages:
> >> [S-26-6-2-3][WARNIN]   2:librados2-14.2.3-0.el7.x86_64: [Errno 256] No
> more
> >> mirrors to try.
> >>
> >>
> >> - Original Message -
> >>> From: "Fyodor Ustinov" 
> >>> To: "ceph-users" 
> >>> Sent: Wednesday, 4 September, 2019 13:18:55
> >>> Subject: [ceph-users] CEPH 14.2.3
> >>
> >>> Dear CEPH Developers!
> >>>
> >>> We all respect you for your work.
> >>>
> >>> But I have one small request.
> >>>
> >>> Please, make an announcement about the new version and prepare the
> >>> documentation before posting the new version to the repository.
> >>>
> >>> It is very, very, very necessary.
> >>>
> >>> WBR,
> >>>Fyodor.
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> >>> email to ceph-users-le...@ceph.io
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email
> >> to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Abhishek Lekshmanan
> SUSE Software Solutions Germany GmbH
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Nautilus 14.2.3 packages appearing on the mirrors

2019-09-03 Thread Sasha Litvak
Is this an actual release or an accident?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io