Re: [ceph-users] CephFS client hanging and cache issues

2019-10-30 Thread Bob Farrell
Thanks a lot, and sorry for the spam; I should have checked! We are on
18.04 and the kernel is currently upgrading, so if you don't hear back from
me then it is fixed.

Thanks for the amazing support!
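
For anyone else hitting this, a quick way to check whether a host is on one of
the affected builds Lars listed (a minimal sketch; the exact package suffix,
e.g. -generic, may vary):

```
# Print the running kernel release, e.g. 4.15.0-66-generic
uname -r

# Flag the two known-bad Canonical builds named in this thread
case "$(uname -r)" in
  4.15.0-66-*|5.0.0-32-*) echo "affected kernel: upgrade (or downgrade) before mounting CephFS" ;;
  *) echo "not one of the known-bad builds" ;;
esac
```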

On Wed, 30 Oct 2019, 09:54 Lars Täuber,  wrote:

> Hi.
>
> It sounds like you use kernel clients with kernels from Canonical/Ubuntu.
> Two kernels have a bug:
> 4.15.0-66
> and
> 5.0.0-32
>
> Updated kernels are said to have fixes.
> Older kernels also work:
> 4.15.0-65
> and
> 5.0.0-31
>
>
> Lars
>
> Wed, 30 Oct 2019 09:42:16 +
> Bob Farrell ==> ceph-users:
> > Hi. We are experiencing a CephFS client issue on one of our servers.
> >
> > ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus
> > (stable)
> >
> > Trying to access, `umount`, or `umount -f` a mounted CephFS volume
> > causes my shell to hang indefinitely.
> >
> > After a reboot I can remount the volumes cleanly, but they drop out
> > after < 1 hour of use.
> >
> > I see this log entry multiple times when I reboot the server:
> > ```
> > cache_from_obj: Wrong slab cache. inode_cache but object is from
> > ceph_inode_info
> > ```
> > The machine then reboots after approx. 30 minutes.
> >
> > All other Ceph/CephFS clients and servers seem perfectly happy. CephFS
> > cluster is HEALTH_OK.
> >
> > Any help appreciated. If I can provide any further details, please let me
> > know.
> >
> > Thanks in advance,


[ceph-users] CephFS client hanging and cache issues

2019-10-30 Thread Bob Farrell
Hi. We are experiencing a CephFS client issue on one of our servers.

ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus
(stable)

Trying to access, `umount`, or `umount -f` a mounted CephFS volume causes
my shell to hang indefinitely.

After a reboot I can remount the volumes cleanly but they drop out after <
1 hour of use.

I see this log entry multiple times when I reboot the server:
```
cache_from_obj: Wrong slab cache. inode_cache but object is from
ceph_inode_info
```
The machine then reboots after approx. 30 minutes.

All other Ceph/CephFS clients and servers seem perfectly happy. CephFS
cluster is HEALTH_OK.
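
For anyone debugging something similar, a few generic diagnostics (a sketch;
assumes the kernel CephFS client is in use and debugfs is mounted):

```
# Processes stuck in uninterruptible sleep (D state), typically blocked on the mount
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# Recent kernel messages from the Ceph client, including the slab warning above
dmesg -T | grep -i ceph | tail -n 20

# In-flight MDS/OSD requests held by the kernel client
cat /sys/kernel/debug/ceph/*/mdsc /sys/kernel/debug/ceph/*/osdc
```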

Any help appreciated. If I can provide any further details, please let me
know.

Thanks in advance,


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Bob Farrell
March seems sensible to me for the reasons you stated. If a release gets
delayed, I'd prefer it to be on the spring side of Christmas (again for the
reasons already mentioned).

That aside, I'm now very impatient to install Octopus on my 8-node cluster.
: )

On Wed, 26 Jun 2019 at 15:46, Sage Weil  wrote:

> Hi everyone,
>
> We talked a bit about this during the CLT meeting this morning.  How about
> the following proposal:
>
> - Target release date of Mar 1 each year.
> - Target freeze in Dec.  That will allow us to use the holidays to do a
>   lot of testing when the lab infrastructure tends to be somewhat idle.
>
> If we get an early build out at the point of the freeze (or even earlier),
> perhaps this captures some of the time that the retailers have during their
> lockdown to identify structural issues with the release.  It is probably
> better to do more of this testing at this point in the cycle so that we
> have time to properly fix any big issues (like performance or scaling
> regressions).  It is of course a challenge to motivate testing on
> something that is too far from the final release, but we can try.
>
> This avoids an abbreviated Octopus cycle, and avoids placing August (which
> also often has people out for vacations) right in the middle of the
> lead-up to the freeze.
>
> Thoughts?
> sage
>
>
>
> On Wed, 26 Jun 2019, Sage Weil wrote:
>
> > On Wed, 26 Jun 2019, Alfonso Martinez Hidalgo wrote:
> > > I think March is a good idea.
> >
> > Spring had a slight edge over fall in the twitter poll (for whatever
> > that's worth).  I see the appeal for fall when it comes to down time for
> > retailers, but as a practical matter for Octopus specifically, a target of
> > say October means freezing in August, which means we only have 2
> > more months of development time.  I'm worried that will turn Octopus
> > into another weak (aka lightly adopted) release.
> >
> > March would mean freezing in January again, which would give us July to
> > Dec... 6 more months.
> >
> > sage
> >
> >
> >
> > >
> > > On Tue, Jun 25, 2019 at 4:32 PM Alfredo Deza  wrote:
> > >
> > > > On Mon, Jun 17, 2019 at 4:09 PM David Turner 
> > > > wrote:
> > > > >
> > > > > This was a little long to respond with on Twitter, so I thought I'd
> > > > > share my thoughts here. I love the idea of a 12 month cadence. I like
> > > > > October because admins aren't upgrading production within the first few
> > > > > months of a new release. It gives it plenty of time to be stable for
> > > > > the OS distros, as well as giving admins something low-key to work on
> > > > > over the holidays by testing the new releases in stage/QA.
> > > >
> > > > October sounds ideal, but in reality, we haven't been able to release
> > > > right on time as long as I can remember. Realistically, if we set
> > > > October, we are probably going to get into November/December.
> > > >
> > > > For example, Nautilus was set to release in February and we got it
> > > > out in late March (almost April).
> > > >
> > > > Would love to see more of a discussion around solving the problem of
> > > > releasing when we say we are going to - so that we can then choose
> > > > what the cadence is.
> > > >
> > > > >
> > > > > On Mon, Jun 17, 2019 at 12:22 PM Sage Weil 
> wrote:
> > > > >>
> > > > >> On Wed, 5 Jun 2019, Sage Weil wrote:
> > > > >> > That brings us to an important decision: what time of year should
> > > > >> > we release?  Once we pick the timing, we'll be releasing at that
> > > > >> > time *every year* for each release (barring another schedule shift,
> > > > >> > which we want to avoid), so let's choose carefully!
> > > > >>
> > > > >> I've put up a twitter poll:
> > > > >>
> > > > >> https://twitter.com/liewegas/status/1140655233430970369
> > > > >>
> > > > >> Thanks!
> > > > >> sage
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Alfonso Martínez
> > >
> > > Senior Software Engineer, Ceph Storage
> > >
> > > Red Hat 


Re: [ceph-users] Nautilus HEALTH_WARN for msgr2 protocol

2019-06-19 Thread Bob Farrell
Aha, yes, that does help! I tried a lot of variations but couldn't quite
get it to work, so I used the simpler alternative instead.

Thanks!
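
For the archives, a concrete form of Dominik's command (a sketch; the fsid is
this cluster's, but the mon name and address are illustrative):

```
# Build a monmap whose first mon listens on both msgr2 (3300) and msgr1 (6789)
monmaptool --create --fsid 7273720d-04d7-480f-a77c-f0207ae35852 \
  --addv node01 [v2:172.30.0.144:3300,v1:172.30.0.144:6789] /tmp/monmap

# Verify both addresses were recorded
monmaptool --print /tmp/monmap
```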

On Wed, 19 Jun 2019 at 09:21, Dominik Csapak  wrote:

> On 6/14/19 6:10 PM, Bob Farrell wrote:
> > Hi. Firstly, thanks to all involved in this great mailing list; I learn
> > lots from it every day.
> >
>
> Hi,
>
> >
> > I never figured out the correct syntax to set up the first monitor to
> > use both 6789 and 3300. The other monitors that join the cluster set
> > this config automatically but I couldn't work out how to apply it to the
> > first monitor node.
> >
>
> I struggled with this myself yesterday and found that the relevant
> argument is not really documented:
>
> monmaptool --create --addv ID [v1:ip:6789,v2:ip:3300] /path/to/monmap
>
>
> hope this helps :)
>


Re: [ceph-users] Nautilus HEALTH_WARN for msgr2 protocol

2019-06-18 Thread Bob Farrell
All, sorry for the slow response, but thank you all for the quick and very
helpful replies!

As you can see below, your various advice has fixed the problem - much
appreciated and noted for future reference.

Have a great week!

root@swarmctl:~# ceph mon enable-msgr2
root@swarmctl:~# ceph -s
  cluster:
id: **
health: HEALTH_OK

root@swarmctl:~# ceph mon dump
dumped monmap epoch 8
epoch 8
fsid 7273720d-04d7-480f-a77c-f0207ae35852
last_changed 2019-06-18 16:08:17.837379
created 2019-04-02 18:21:09.925941
min_mon_release 14 (nautilus)
0: [v2:172.30.0.144:3300/0,v1:172.30.0.144:6789/0] mon.node01.homeflow.co.uk
1: [v2:172.30.0.146:3300/0,v1:172.30.0.146:6789/0] mon.node03.homeflow.co.uk
2: [v2:172.30.0.147:3300/0,v1:172.30.0.147:6789/0] mon.node04.homeflow.co.uk
3: [v2:172.30.0.148:3300/0,v1:172.30.0.148:6789/0] mon.node05.homeflow.co.uk
4: [v2:172.30.0.145:3300/0,v1:172.30.0.145:6789/0] mon.node02.homeflow.co.uk
5: [v2:172.30.0.149:3300/0,v1:172.30.0.149:6789/0] mon.node06.homeflow.co.uk
6: [v2:172.30.0.150:3300/0,v1:172.30.0.150:6789/0] mon.node07.homeflow.co.uk
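
For completeness, a quick way to verify the v2 port is open and reachable (a
sketch with an illustrative address; relevant to Dominic's firewall note
below):

```
# On the mon host: confirm the daemon is listening on the msgr2 port
ss -tlnp | grep 3300

# From a client host: confirm the port is reachable through any firewalls
nc -zv 172.30.0.144 3300
```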


On Fri, 14 Jun 2019 at 23:24,  wrote:

> Bob;
>
> Have you verified that port 3300 is open for TCP on that host?
>
> The extra host firewall rules for the v2 protocol caused me all kinds of grief
> when I was setting up my MONs.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Paul Emmerich
> Sent: Friday, June 14, 2019 10:23 AM
> To: Brett Chancellor
> Cc: ceph-users
> Subject: Re: [ceph-users] Nautilus HEALTH_WARN for msgr2 protocol
>
>
>
> On Fri, Jun 14, 2019 at 6:23 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
> If you don't figure out how to enable it on your monitor, you can always
> disable it to squash the warnings:
> ceph config set mon.node01 ms_bind_msgr2 false
>
> No, that just disables msgr2 on that mon.
>
> Use this option if you want to disable the warning
>
> mon_warn_on_msgr2_not_enabled false
>
>
> But that's probably not a good idea since there's clearly something wrong
> with that mon.
>
> Paul
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
>
> On Fri, Jun 14, 2019 at 12:11 PM Bob Farrell  wrote:
> Hi. Firstly, thanks to all involved in this great mailing list; I learn
> lots from it every day.
>
> We are running Ceph with a huge amount of success to store website
> themes/templates across a large collection of websites. We are very pleased
> with the solution in every way.
>
> The only issue we have, which we have had since day 1, is we always see
> HEALTH_WARN:
>
> health: HEALTH_WARN
> 1 monitors have not enabled msgr2
>
> And this is reflected in the monmap:
>
> monmaptool: monmap file /tmp/monmap
> epoch 7
> fsid 7273720d-04d7-480f-a77c-f0207ae35852
> last_changed 2019-04-02 17:21:56.935381
> created 2019-04-02 17:21:09.925941
> min_mon_release 14 (nautilus)
> 0: v1:172.30.0.144:6789/0 mon.node01.homeflow.co.uk
> 1: [v2:172.30.0.146:3300/0,v1:172.30.0.146:6789/0]
> mon.node03.homeflow.co.uk
> 2: [v2:172.30.0.147:3300/0,v1:172.30.0.147:6789/0]
> mon.node04.homeflow.co.uk
> 3: [v2:172.30.0.148:3300/0,v1:172.30.0.148:6789/0]
> mon.node05.homeflow.co.uk
> 4: [v2:172.30.0.145:3300/0,v1:172.30.0.145:6789/0]
> mon.node02.homeflow.co.uk
> 5: [v2:172.30.0.149:3300/0,v1:172.30.0.149:6789/0]
> mon.node06.homeflow.co.uk
> 6: [v2:172.30.0.150:3300/0,v1:172.30.0.150:6789/0]
> mon.node07.homeflow.co.uk
>
> I never figured out the correct syntax to set up the first monitor to use
> both 6789 and 3300. The other monitors that join the cluster set this
> config automatically but I couldn't work out how to apply it to the first
> monitor node.
>
> The cluster has been operating in production for at least a month now with
> no issues at all, so it would be nice to remove this warning as, at the
> moment, it's not really very useful as a monitoring metric.
>
> Could somebody advise me on the safest/most sensible way to update the
> monmap so that node01 listens on v2 and v1?
>
> Thanks for any help!


[ceph-users] Nautilus HEALTH_WARN for msgr2 protocol

2019-06-14 Thread Bob Farrell
Hi. Firstly, thanks to all involved in this great mailing list; I learn lots
from it every day.

We are running Ceph with a huge amount of success to store website
themes/templates across a large collection of websites. We are very pleased
with the solution in every way.

The only issue we have, which we have had since day 1, is we always see
HEALTH_WARN:

health: HEALTH_WARN
1 monitors have not enabled msgr2

And this is reflected in the monmap:

monmaptool: monmap file /tmp/monmap
epoch 7
fsid 7273720d-04d7-480f-a77c-f0207ae35852
last_changed 2019-04-02 17:21:56.935381
created 2019-04-02 17:21:09.925941
min_mon_release 14 (nautilus)
0: v1:172.30.0.144:6789/0 mon.node01.homeflow.co.uk
1: [v2:172.30.0.146:3300/0,v1:172.30.0.146:6789/0] mon.node03.homeflow.co.uk
2: [v2:172.30.0.147:3300/0,v1:172.30.0.147:6789/0] mon.node04.homeflow.co.uk
3: [v2:172.30.0.148:3300/0,v1:172.30.0.148:6789/0] mon.node05.homeflow.co.uk
4: [v2:172.30.0.145:3300/0,v1:172.30.0.145:6789/0] mon.node02.homeflow.co.uk
5: [v2:172.30.0.149:3300/0,v1:172.30.0.149:6789/0] mon.node06.homeflow.co.uk
6: [v2:172.30.0.150:3300/0,v1:172.30.0.150:6789/0] mon.node07.homeflow.co.uk

I never figured out the correct syntax to set up the first monitor to use
both 6789 and 3300. The other monitors that join the cluster set this
config automatically but I couldn't work out how to apply it to the first
monitor node.

The cluster has been operating in production for at least a month now with
no issues at all, so it would be nice to remove this warning as, at the
moment, it's not really very useful as a monitoring metric.

Could somebody advise me on the safest/most sensible way to update the
monmap so that node01 listens on v2 and v1?

Thanks for any help!
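
(As the follow-ups above show, `ceph mon enable-msgr2` resolved this. For
reference, a manual monmap edit would look roughly like the sketch below; the
mon name, path, and addresses are illustrative, and the mon must be stopped
while its map is edited.)

```
# Stop the monitor whose entry needs fixing
systemctl stop ceph-mon@node01

# Extract its current monmap, then replace the v1-only entry with a v2+v1 entry
ceph-mon -i node01 --extract-monmap /tmp/monmap
monmaptool --rm node01 /tmp/monmap
monmaptool --addv node01 [v2:172.30.0.144:3300,v1:172.30.0.144:6789] /tmp/monmap

# Inject the edited map and restart the monitor
ceph-mon -i node01 --inject-monmap /tmp/monmap
systemctl start ceph-mon@node01
```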


Re: [ceph-users] Topology query

2019-04-11 Thread Bob Farrell
Thanks a lot, Marc - this looks similar to the post I found:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003369.html

It seems to suggest that this wouldn't be an issue in more recent kernels,
but it would be great to get confirmation on that. I'll keep researching.
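
In the meantime, one way to sidestep the kernel-client deadlock question
entirely is the userspace client; a minimal sketch, assuming the standard
keyring location and an illustrative monitor address and mount point:

```
# Mount via ceph-fuse instead of the kernel client; the userspace client
# avoids the memory-writeback deadlock discussed for kernel mounts on OSD nodes
apt install ceph-fuse
ceph-fuse --id admin -m 172.30.0.144:6789 /mnt/cephfs

# Unmount when done
fusermount -u /mnt/cephfs
```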

On Thu, 11 Apr 2019 at 19:50, Marc Roos  wrote:

>
>
> AFAIK, with CephFS on OSD nodes you at least risk this 'kernel deadlock'.
> I have it also, but with enough memory. Search the mailing list for this.
> I am looking at a similar setup, but with Mesos, and struggling with a
> CNI plugin we have to develop.
>
>
> -----Original Message-----
> From: Bob Farrell [mailto:b...@homeflow.co.uk]
> Sent: donderdag 11 april 2019 20:45
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Topology query
>
> Hello. I am running Ceph Nautilus v14.2.0 on Ubuntu Bionic 18.04 LTS.
>
> I would like to ask if anybody could advise if there will be any
> potential problems with my setup, as I am running a lot of services on
> each node.
>
> I have 8 large dedicated servers, each with two physical disks. All
> servers run Docker Swarm and host numerous web applications.
>
> I have also installed Ceph on each node (not in Docker). The secondary
> disk on each server hosts an LVM volume which is dedicated to Ceph. Each
> node runs one of each: osd, mon, mgr, mds. I use CephFS to mount the
> data into each node's filesystem, which is then accessed by numerous
> containers via Docker bindmounts.
>
> So far everything is working great but we haven't put anything under
> heavy load. I googled around to see if there are any potential problems
> with what I'm doing but couldn't find too much. There was one forum post
> I read [but can't find now] which warned against this unless using the very
> latest glibc due to kernel fsync issues (IIRC), but this post was from
> 2014, so I hope I'm safe?
>
> Thanks for the great project - I got this far just from reading the docs
> and writing my own Ansible script (I wanted to learn Ceph properly). It's
> really good stuff. : )
>
> Cheers,
>
>
>


[ceph-users] Topology query

2019-04-11 Thread Bob Farrell
Hello. I am running Ceph Nautilus v14.2.0 on Ubuntu Bionic 18.04 LTS.

I would like to ask if anybody could advise if there will be any potential
problems with my setup, as I am running a lot of services on each node.

I have 8 large dedicated servers, each with two physical disks. All servers
run Docker Swarm and host numerous web applications.

I have also installed Ceph on each node (not in Docker). The secondary disk
on each server hosts an LVM volume which is dedicated to Ceph. Each node
runs one of each: osd, mon, mgr, mds. I use CephFS to mount the data into
each node's filesystem, which is then accessed by numerous containers via
Docker bindmounts.
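
For reference, a typical kernel-client mount for this kind of setup (a minimal
sketch; addresses, paths, and the secretfile location are illustrative):

```
# One-off kernel-client mount against a monitor
mount -t ceph 172.30.0.144:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret

# Equivalent /etc/fstab entry for mounting at boot
# 172.30.0.144:6789:/ /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev 0 0
```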

So far everything is working great but we haven't put anything under heavy
load. I googled around to see if there are any potential problems with what
I'm doing but couldn't find too much. There was one forum post I read [but
can't find now] which warned against this unless using the very latest glibc
due to kernel fsync issues (IIRC), but this post was from 2014, so I hope I'm
safe?

Thanks for the great project - I got this far just from reading the docs
and writing my own Ansible script (I wanted to learn Ceph properly). It's
really good stuff. : )

Cheers,