Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Adrian Saul
We see the same messages and are similarly on a 4.4 KRBD version that is affected by this. I have seen no impact from it so far that I know of.

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Joao Eduardo Luis
On 10/04/2017 09:19 PM, Gregory Farnum wrote: Oh, hmm, you're right. I see synchronization starts but it seems to progress very slowly, and it certainly doesn't complete in that 2.5 minute logging window. I don't see any clear reason why it's so slow; it might be more clear if you could

Re: [ceph-users] Mimic timeline

2017-10-04 Thread Sage Weil
On Wed, 4 Oct 2017, Sage Weil wrote: > Hi everyone, > After further discussion we are targeting 9 months for Mimic 13.2.0: > - Mar 16, 2018 feature freeze > - May 1, 2018 release > Upgrades for Mimic will be from Luminous only (we've already made that a required stop), but we plan

[ceph-users] Mimic timeline

2017-10-04 Thread Sage Weil
Hi everyone, After further discussion we are targeting 9 months for Mimic 13.2.0: - Mar 16, 2018 feature freeze - May 1, 2018 release. Upgrades for Mimic will be from Luminous only (we've already made that a required stop), but we plan to allow Luminous -> Nautilus too (and Mimic -> O).

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Gregory Farnum
Oh, hmm, you're right. I see synchronization starts but it seems to progress very slowly, and it certainly doesn't complete in that 2.5 minute logging window. I don't see any clear reason why it's so slow; it might be more clear if you could provide logs from the other monitors at the same time

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius
Some more detail: when restarting the monitor on server1, it stays in synchronizing state forever. However, the other two monitors change into electing state. I have double-checked that there are no (host) firewalls active and that the clocks of the hosts are within 1 second of each other (they all
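
A quick way to watch that sync from the stuck monitor itself is its admin socket, which answers even when the daemon is out of quorum. A minimal sketch, assuming the monitor is named server1 as in this thread and the socket is at its default path:

    # query the stuck monitor directly, bypassing quorum
    ceph daemon mon.server1 mon_status

The "state" field in the output shows whether the daemon is synchronizing, electing, or in quorum as a peon/leader.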

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius
Hello Gregory, the logfile I produced already has debug mon = 20 set:

[21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
debug mon = 20

It is clear that server1 is out of quorum, however how do we make it part of the quorum again? I expected that the quorum finding process is

Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Jason Dillaman
Perhaps this is related to a known issue on some 4.4 and later kernels [1] where the stable write flag was not preserved by the kernel? [1] http://tracker.ceph.com/issues/19275 On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum wrote: > That message indicates that the checksums

Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Gregory Farnum
That message indicates that the checksums of messages between your kernel client and OSD are incorrect. It could be actual physical transmission errors, but if you don't see other issues then this isn't fatal; they can recover from it. On Wed, Oct 4, 2017 at 8:52 AM Josy
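
For reference, these complaints come from the kernel client, so they show up in the client's kernel log. A minimal sketch for spotting them (the exact message text varies by kernel version):

    # look for checksum complaints from the kernel ceph client
    dmesg -T | grep -i 'bad crc'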

Re: [ceph-users] inconsistent pg on erasure coded pool

2017-10-04 Thread Gregory Farnum
This says it's actually missing one object, and a repair won't fix that (if it could, the object wouldn't be missing!). There should be more details somewhere in the logs about which object. On Wed, Oct 4, 2017 at 5:03 AM Kenneth Waegeman wrote: > Hi, > > We have some
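
To track down which object it is, a sketch of the usual starting points (pg 5.144 and osd.81, the first OSD in its acting set, are taken from this thread):

    # list what deep scrub flagged in this PG
    rados list-inconsistent-obj 5.144 --format=json-pretty
    # scrub errors are also logged by the primary OSD
    grep ERR /var/log/ceph/ceph-osd.81.log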

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Gregory Farnum
You'll need to change the config so that it's running "debug mon = 20" for the log to be very useful here. It does say that it's dropping client connections because it's been out of quorum for too long, which is the correct behavior in general. I'd imagine that you've got clients trying to connect
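
For reference, a sketch of both ways to get there (the mon id server1 is taken from this thread; paths are defaults):

    # persistent, in /etc/ceph/ceph.conf on the monitor host
    [mon]
        debug mon = 20

    # or injected at runtime over the admin socket, no restart needed
    ceph daemon mon.server1 config set debug_mon 20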

Re: [ceph-users] Ceph Developers Monthly - October

2017-10-04 Thread Leonardo Vaz
On Wed, Oct 04, 2017 at 03:02:09AM -0300, Leonardo Vaz wrote: > On Thu, Sep 28, 2017 at 12:08:00AM -0300, Leonardo Vaz wrote: >> Hey Cephers, >> This is just a friendly reminder that the next Ceph Developer Monthly meeting is coming up: >> http://wiki.ceph.com/Planning >> If

[ceph-users] Ceph-mgr summarize recovery counters

2017-10-04 Thread Benjeman Meekhof
Wondering if anyone can tell me how to summarize recovery bytes/ops/objects from the counters available in the ceph-mgr python interface? To put it another way: how does the ceph -s command put together that information, and can I access that information from a counter queryable by the ceph-mgr python
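
Not an authoritative answer, but a sketch of where those numbers live; the counter and key names below are assumptions to verify against your version with "perf schema", and jq is assumed to be installed:

    # ceph -s builds its recovery line from the pgmap; the JSON form
    # exposes per-second rates while recovery is running
    ceph status --format json-pretty | jq .pgmap
    # raw per-OSD counters (list the real names first, then read one)
    ceph daemon osd.0 perf schema | jq '.osd | keys'
    ceph daemon osd.0 perf dump | jq '.osd.recovery_ops'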

[ceph-users] bad crc/signature errors

2017-10-04 Thread Josy
Hi, We have set up a cluster with 8 OSD servers (31 disks). Ceph health is OK:

[root@las1-1-44 ~]# ceph -s
  cluster:
    id: de296604-d85c-46ab-a3af-add3367f0e6d
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3

[ceph-users] inconsistent pg on erasure coded pool

2017-10-04 Thread Kenneth Waegeman
Hi, We have some inconsistency / scrub errors on an erasure coded pool that I can't seem to solve.

[root@osd008 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 5.144 is active+clean+inconsistent, acting [81,119,148,115,142,100,25,63,48,11,43]
1 scrub errors

In the

[ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius
Good morning, we have recently upgraded our kraken cluster to luminous and since then noticed an odd behaviour: we cannot add a monitor anymore. As soon as we start a new monitor (server2), ceph -s and ceph -w start to hang. The situation became worse, since one of our staff stopped an

Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-10-04 Thread Micha Krause
Hi, Did you edit the code before trying Luminous? Yes, I'm still on jewel. I also noticed from your original mail that it appears you're using multiple active metadata servers? If so, that's not stable in Jewel. You may have tripped on one of many bugs fixed in Luminous for that
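
To confirm how many MDS daemons are actually active, a quick sketch (the filesystem name cephfs is a placeholder, not from this thread):

    # summary of active vs standby MDS daemons
    ceph mds stat
    # per-filesystem settings, including max_mds
    ceph fs get cephfs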

Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread lists
ok, thanks for the feedback Piotr and Dan! MJ On 4-10-2017 9:38, Dan van der Ster wrote: Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never contacted", resulting in "pgs are stuck inactive for more than 300 seconds" being reported until osds regain connections between
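
For anyone wanting to confirm the transient error is only this reporting artifact, a sketch of what to look at while the OSDs come back (mon.a is a placeholder id):

    # list the PGs currently counted as stuck inactive
    ceph pg dump_stuck inactive
    # the 300-second figure comes from this option
    ceph daemon mon.a config get mon_pg_stuck_threshold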

Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread Dan van der Ster
On Wed, Oct 4, 2017 at 9:08 AM, Piotr Dałek wrote: > On 17-10-04 08:51 AM, lists wrote: >> Hi, >> Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our jewel migration, and noticed something interesting. >> After I brought back up the OSDs I

Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread Piotr Dałek
On 17-10-04 08:51 AM, lists wrote: Hi, Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our jewel migration, and noticed something interesting. After I brought back up the OSDs I just chowned, the system had some recovery to do. During that recovery, the system went to

[ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread lists
Hi, Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our jewel migration, and noticed something interesting. After I brought back up the OSDs I just chowned, the system had some recovery to do. During that recovery, the system went to HEALTH_ERR for a short moment: See
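
For reference, a common way to do this ownership change without alarming the cluster is one OSD at a time. A sketch; the osd id 3 and the use of noout are illustrative, not from this thread:

    # keep data in place while daemons bounce
    ceph osd set noout
    # per OSD: stop, fix ownership, restart, wait for HEALTH_OK
    systemctl stop ceph-osd@3
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
    systemctl start ceph-osd@3
    # when all OSDs are done
    ceph osd unset noout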