Re: [ceph-users] Ceph health error (was: Prioritize recovery over backfilling)

2018-06-07 Thread Caspar Smit
Well, I let it run with the nodown flag set and it looked like it would
finish, but it all went wrong somewhere:

This is now the state:

health: HEALTH_ERR
nodown flag(s) set
5602396/94833780 objects misplaced (5.908%)
Reduced data availability: 143 pgs inactive, 142 pgs peering, 7
pgs stale
Degraded data redundancy: 248859/94833780 objects degraded
(0.262%), 194 pgs unclean, 21 pgs degraded, 12 pgs undersized
11 stuck requests are blocked > 4096 sec

pgs: 13.965% pgs not active
 248859/94833780 objects degraded (0.262%)
 5602396/94833780 objects misplaced (5.908%)
 830 active+clean
 75  remapped+peering
 66  peering
 26  active+remapped+backfill_wait
 6   active+undersized+degraded+remapped+backfill_wait
 6   active+recovery_wait+degraded+remapped
 3   active+undersized+degraded+remapped+backfilling
 3   stale+active+undersized+degraded+remapped+backfill_wait
 3   stale+active+remapped+backfill_wait
 2   active+recovery_wait+degraded
 2   active+remapped+backfilling
 1   activating+degraded+remapped
 1   stale+remapped+peering


#ceph health detail shows:

REQUEST_STUCK 11 stuck requests are blocked > 4096 sec
11 ops are blocked > 16777.2 sec
osds 4,7,23,24 have stuck requests > 16777.2 sec


So what happened, and what should I do now?

Thank you very much for any help

Kind regards,
Caspar


2018-06-07 13:33 GMT+02:00 Sage Weil :

> On Wed, 6 Jun 2018, Caspar Smit wrote:
> > Hi all,
> >
> > We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a
> node
> > to it.
> >
> > osd-max-backfills is at the default 1 so backfilling didn't go very fast
> > but that doesn't matter.
> >
> > Once it started backfilling everything looked ok:
> >
> > ~300 pgs in backfill_wait
> > ~10 pgs backfilling (~number of new osd's)
> >
> > But i noticed the degraded objects increasing a lot. I presume a pg that
> is
> > in backfill_wait state doesn't accept any new writes anymore? Hence
> > increasing the degraded objects?
> >
> > So far so good, but once a while i noticed a random OSD flapping (they
> come
> > back up automatically). This isn't because the disk is saturated but a
> > driver/controller/kernel incompatibility which 'hangs' the disk for a
> short
> > time (scsi abort_task error in syslog). Investigating further i noticed
> > this was already the case before the node expansion.
> >
> > These OSD's flapping results in lots of pg states which are a bit
> worrying:
> >
> >  109 active+remapped+backfill_wait
> >  80  active+undersized+degraded+remapped+backfill_wait
> >  51  active+recovery_wait+degraded+remapped
> >  41  active+recovery_wait+degraded
> >  27  active+recovery_wait+undersized+degraded+remapped
> >  14  active+undersized+remapped+backfill_wait
> >  4   active+undersized+degraded+remapped+backfilling
> >
> > I think the recovery_wait is more important then the backfill_wait, so i
> > like to prioritize these because the recovery_wait was triggered by the
> > flapping OSD's
>
> Just a note: this is fixed in mimic.  Previously, we would choose the
> highest-priority PG to start recovery on at the time, but once recovery
> had started, the appearance of a new PG with a higher priority (e.g.,
> because it finished peering after the others) wouldn't preempt/cancel the
> other PG's recovery, so you would get behavior like the above.
>
> Mimic implements that preemption, so you should not see behavior like
> this.  (If you do, then the function that assigns a priority score to a
> PG needs to be tweaked.)
>
> sage
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 09:30:23AM PDT, Jason Dillaman spake thusly:
> I think what Ilya is saying is that it's a very old RHEL 7-based
> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
> numerous improvements that have been backported from the current
> upstream kernel.

Ah, I understand now. My VM servers tend not to get upgraded often, as
restarting all of the VMs is a hassle. I'll fix that. Do we think that
is related to my issues? It has worked reliably for ages as far as
mapping rbd goes.

I still have the following in-flight requests. I set osd.73 out as
suggested and even restarted the osd process on the node. It doesn't
seem to have had any effect, and I still have unkillable processes
blocking on mapped rbd devices. I could patch and reboot this box, which
would likely clear this up, but that will have to wait a week or so and
involves downtime for 21 VMs, which is less than ideal. I would love to
get this fixed and finish transferring images from iSCSI storage to Ceph
RBD; then I can retire the iSCSI storage, have some surplus amps, and
bring more VM servers online so I can live migrate these VMs in the
future, allowing easier reboots/upgrades, as the lack of live migration
is the real limiting factor here.

# find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
/sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
epoch 232501
flags
pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
pool 2 pg_num 512 (511) read_tier -1 write_tier -1
pool 3 pg_num 128 (127) read_tier -1 write_tier -1
pool 4 pg_num 100 (127) read_tier -1 write_tier -1
osd0    10.0.5.3:6801    54%   (exists, up)      100%
osd1    10.0.5.3:6812    57%   (exists, up)      100%
osd2    (unknown sockaddr family 0)    0%    (doesn't exist)   100%
osd3    10.0.5.4:6812    50%   (exists, up)      100%
osd4    (unknown sockaddr family 0)    0%    (doesn't exist)   100%
osd5    (unknown sockaddr family 0)    0%    (doesn't exist)   100%
osd6    10.0.5.9:6861    37%   (exists, up)      100%
osd7    10.0.5.9:6876    28%   (exists, up)      100%
osd8    10.0.5.9:6864    43%   (exists, up)      100%
osd9    10.0.5.9:6836    30%   (exists, up)      100%
osd10   10.0.5.9:6820    22%   (exists, up)      100%
osd11   10.0.5.9:6844    54%   (exists, up)      100%
osd12   10.0.5.9:6803    43%   (exists, up)      100%
osd13   10.0.5.9:6826    41%   (exists, up)      100%
osd14   10.0.5.9:6853    37%   (exists, up)      100%
osd15   10.0.5.9:6872    36%   (exists, up)      100%
osd16   (unknown sockaddr family 0)    0%    (doesn't exist)   100%
osd17   10.0.5.9:6812    44%   (exists, up)      100%
osd18   10.0.5.9:6817    48%   (exists, up)      100%
osd19   10.0.5.9:6856    33%   (exists, up)      100%
osd20   10.0.5.9:6808    46%   (exists, up)      100%
osd21   10.0.5.9:6871    41%   (exists, up)      100%
osd22   10.0.5.9:6816    49%   (exists, up)      100%
osd23   10.0.5.9:6823    56%   (exists, up)      100%
osd24   10.0.5.9:6800    54%   (exists, up)      100%
osd25   10.0.5.9:6848    54%   (exists, up)      100%
osd26   10.0.5.9:6840    37%   (exists, up)      100%
osd27   10.0.5.9:6883    69%   (exists, up)      100%
osd28   10.0.5.9:6833    39%   (exists, up)      100%
osd29   10.0.5.9:6809    38%   (exists, up)      100%
osd30   10.0.5.9:6829    51%   (exists, up)      100%
osd31   10.0.5.11:6828   47%   (exists, up)      100%
osd32   10.0.5.11:6848   25%   (exists, up)      100%
osd33   10.0.5.11:6802   56%   (exists, up)      100%
osd34   10.0.5.11:6840   35%   (exists, up)      100%
osd35   10.0.5.11:6856   32%   (exists, up)      100%
osd36   10.0.5.11:6832   26%   (exists, up)      100%
osd37   10.0.5.11:6868   42%   (exists, up)      100%
osd38   (unknown sockaddr family 0)    0%    (doesn't exist)   100%
osd39   10.0.5.11:6812   52%   (exists, up)      100%
osd40   10.0.5.11:6864   44%   (exists, up)      100%
osd41   10.0.5.11:6801   25%   (exists, up)      100%
osd42   10.0.5.11:6872   39%   (exists, up)      100%
osd43   10.0.5.13:6809   38%   (exists, up)      100%
osd44   10.0.5.11:6844   47%   (exists, up)      100%
osd45   10.0.5.11:6816   20%   (exists, up)      100%
osd46   10.0.5.3:6800    58%   (exists, up)      100%
osd47   10.0.5.2:6808    43%   (exists, up)      100%
osd48   10.0.5.2:6804    44%   (exists, up)      100%
osd49   10.0.5.2:6812    44%   (exists, up)      100%
osd50   10.0.5.2:6800    47%   (exists, up)      100%
osd51   10.0.5.4:6808    43%   (exists, up)      100%
osd52   10.0.5.12:6815   41%   (exists, up)      100%
osd53   10.0.5.11:6820   24%   (up)              100%
osd54   10.0.5.11:6876   34%   (exists, up)      100%
osd55   10.0.5.11:6836   48%   (exists, up)      100%
osd56   10.0.5.11:6824   31%   (exists, up)      100%
osd57   10.0.5.11:6860   48%   (exists, up)      100%
osd58   10.0.5.11:6852   35%   (exists, up)      100%
osd59   10.0.5.11:6800   42%   (exists, up)      100%
osd60   10.0.5.11:6880   58%   (exists, up)      100%
osd61   10.0.5.3:6803    52%   (exists, up)

Re: [ceph-users] Adding additional disks to the production cluster without performance impacts on the existing

2018-06-07 Thread Pardhiv Karri
Hi John,

We recently added a lot of nodes to our Ceph clusters. To mitigate a lot of
problems (we are using the tree algorithm), we first added an empty node to
the crushmap, then added the OSDs with zero weight, made sure the cluster
health was OK, and only then started ramping up each OSD. I created a script
to do this dynamically: it checks the CPU of the new host whose OSDs are
being added, the max-backfills setting, and the degradation values reported
by ceph -s, and depending on those values it ramps more OSDs up to their
full weight (a rough sketch of such a loop is below). This made sure the
cluster's performance wasn't impacted too much.

The other settings I kept adjusting so as not to cause any issues for
client I/O are:
ceph tell osd.* injectargs '--osd-max-backfills 4'
ceph tell osd.* injectargs '--osd-recovery-max-active 6'
ceph tell osd.* injectargs '--osd-recovery-threads 5'
ceph tell osd.* injectargs '--osd-recovery-op-priority 20'
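
A rough sketch of that kind of ramp-up loop is below. It is only an
illustration, not the actual script: the OSD name, target weight, step size
and the "wait until no degraded objects" check are assumptions you would
replace with your own logic.

#!/bin/bash
# Sketch only: raise one OSD's crush weight in small steps, pausing while
# 'ceph -s' still reports degraded objects. Names and values are hypothetical.
OSD=osd.120        # hypothetical new OSD
TARGET=1.82        # final crush weight for the disk (TiB)
STEP=0.05

CUR=0
while awk -v c="$CUR" -v t="$TARGET" 'BEGIN{exit (c >= t)}'; do
    # wait until the cluster no longer reports degraded objects
    while ceph -s | grep -q 'objects degraded'; do
        sleep 30
    done
    # next weight step, clamped to the target
    CUR=$(awk -v c="$CUR" -v s="$STEP" -v t="$TARGET" \
        'BEGIN{v = c + s; if (v > t) v = t; printf "%.2f", v}')
    ceph osd crush reweight "$OSD" "$CUR"
done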


--Pardhiv Karri


On Thu, Jun 7, 2018 at 2:23 PM, Paul Emmerich 
wrote:

> Hi,
>
> the "osd_recovery_sleep_hdd/ssd" options are way better to fine-tune the
> impact of a backfill operation in this case.
>
> Paul
>
> 2018-06-07 20:55 GMT+02:00 David Turner :
>
>> A recommendation for adding disks with minimal impact is to add them with
>> a crush weight of 0 (configurable in the ceph.conf file and then increasing
>> their weight in small increments until you get to the desired OSD weight.
>> That way you're never moving too much data at once and can stop at any time.
>>
>> If you don't want to be quite this paranoid, you can just manage the
>> osd_max_backfill settings and call it a day while letting the OSDs add to
>> their full weight from the start.  It all depends on your client IO needs,
>> how much data you have, speed of disks/network, etc.
>>
>> On Wed, Jun 6, 2018 at 3:09 AM John Molefe  wrote:
>>
>>> Hi everyone
>>>
>>> We have completed all phases and the only remaining part is just adding
>>> the disks to the current cluster but i am afraid of impacting performance
>>> as it is on production.
>>> Any guides and advices on how this can be achieved with least impact on
>>> production??
>>>
>>> Thanks in advance
>>> John
>>>
>>> Vrywaringsklousule / Disclaimer: *
>>> http://www.nwu.ac.za/it/gov-man/disclaimer.html
>>> *
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding additional disks to the production cluster without performance impacts on the existing

2018-06-07 Thread Paul Emmerich
Hi,

the "osd_recovery_sleep_hdd/ssd" options are way better to fine-tune the
impact of a backfill operation in this case.
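
As a rough illustration (the numbers are arbitrary examples, not
recommendations), they can be adjusted at runtime with injectargs:

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.2'
ceph tell osd.* injectargs '--osd_recovery_sleep_ssd 0.0'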

Paul

2018-06-07 20:55 GMT+02:00 David Turner :

> A recommendation for adding disks with minimal impact is to add them with
> a crush weight of 0 (configurable in the ceph.conf file and then increasing
> their weight in small increments until you get to the desired OSD weight.
> That way you're never moving too much data at once and can stop at any time.
>
> If you don't want to be quite this paranoid, you can just manage the
> osd_max_backfill settings and call it a day while letting the OSDs add to
> their full weight from the start.  It all depends on your client IO needs,
> how much data you have, speed of disks/network, etc.
>
> On Wed, Jun 6, 2018 at 3:09 AM John Molefe  wrote:
>
>> Hi everyone
>>
>> We have completed all phases and the only remaining part is just adding
>> the disks to the current cluster but i am afraid of impacting performance
>> as it is on production.
>> Any guides and advices on how this can be achieved with least impact on
>> production??
>>
>> Thanks in advance
>> John
>>
>> Vrywaringsklousule / Disclaimer: *
>> http://www.nwu.ac.za/it/gov-man/disclaimer.html
>> *
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Michael Kuriger
Yes, I followed the procedure. Also, I'm not able to create new OSDs at all in
mimic, even on a newly deployed OSD server. Same error. Even if I pass the --id
{id} parameter to the ceph-volume command, it still uses the first available ID
and not the one I specify.


Mike Kuriger 
Sr. Unix Systems Engineer 
T: 818-649-7235 M: 818-434-6195 



-Original Message-
From: Vasu Kulkarni [mailto:vakul...@redhat.com] 
Sent: Thursday, June 07, 2018 1:53 PM
To: Michael Kuriger
Cc: ceph-users
Subject: Re: [ceph-users] cannot add new OSDs in mimic

It is actually documented in replacing osd case,
https://urldefense.proofpoint.com/v2/url?u=http-3A__docs.ceph.com_docs_master_rados_operations_add-2Dor-2Drm-2Dosds_-23replacing-2Dan-2Dosd=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=aq6X3Wv3kt3ORFoya83IqZqUQY0UzkWP_E09S0RuOk8=8qLqOnvmldGsBQFSfdTyP9q4tPrD5oViYvgvybXJDm8=,
I hope you followed that procedure?

On Thu, Jun 7, 2018 at 1:11 PM, Michael Kuriger  wrote:
> Do you mean:
> ceph osd destroy {ID}  --yes-i-really-mean-it
>
> Mike Kuriger
>
>
>
> -Original Message-
> From: Vasu Kulkarni [mailto:vakul...@redhat.com]
> Sent: Thursday, June 07, 2018 12:28 PM
> To: Michael Kuriger
> Cc: ceph-users
> Subject: Re: [ceph-users] cannot add new OSDs in mimic
>
> There is a osd destroy command but not documented, did you run that as well?
>
> On Thu, Jun 7, 2018 at 12:21 PM, Michael Kuriger  wrote:
>> CEPH team,
>> Is there a solution yet for adding OSDs in mimic - specifically re-using old 
>> IDs?  I was looking over this BUG report - 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__tracker.ceph.com_issues_24423=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=hfI2uudTfY0lGtBI6iIXvZWvNpme4xwBJe2SWx0_N3I=
>>  and my issue is similar.  I removed a bunch of OSD's after upgrading to 
>> mimic and I'm not able to re-add them using the new volume format.  I 
>> haven't tried manually adding them using 'never used' IDs.  I'll try that 
>> now but was hoping there would be a fix.
>>
>> Thanks!
>>
>> Mike Kuriger
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_listinfo.cgi_ceph-2Dusers-2Dceph.com=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=2aoWc5hTz041_26Stz6zPtLiB5zGFw2GbX3TPjsvieI=
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Vasu Kulkarni
It is actually documented in replacing osd case,
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd,
I hope you followed that procedure?
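
In condensed form, that documented flow looks roughly like the following
(the id 48 and /dev/sdX are placeholders, and per this thread the --osd-id
reuse step may be affected by the tracker issue referenced below):

# keep the OSD id and its crush/auth entries so they can be reused
# (per later messages in this thread, mimic may ask for
#  --yes-i-really-really-mean-it instead)
ceph osd destroy 48 --yes-i-really-mean-it

# wipe the old device, then recreate the OSD with the same id
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --osd-id 48 --data /dev/sdX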

On Thu, Jun 7, 2018 at 1:11 PM, Michael Kuriger  wrote:
> Do you mean:
> ceph osd destroy {ID}  --yes-i-really-mean-it
>
> Mike Kuriger
>
>
>
> -Original Message-
> From: Vasu Kulkarni [mailto:vakul...@redhat.com]
> Sent: Thursday, June 07, 2018 12:28 PM
> To: Michael Kuriger
> Cc: ceph-users
> Subject: Re: [ceph-users] cannot add new OSDs in mimic
>
> There is a osd destroy command but not documented, did you run that as well?
>
> On Thu, Jun 7, 2018 at 12:21 PM, Michael Kuriger  wrote:
>> CEPH team,
>> Is there a solution yet for adding OSDs in mimic - specifically re-using old 
>> IDs?  I was looking over this BUG report - 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__tracker.ceph.com_issues_24423=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=hfI2uudTfY0lGtBI6iIXvZWvNpme4xwBJe2SWx0_N3I=
>>  and my issue is similar.  I removed a bunch of OSD's after upgrading to 
>> mimic and I'm not able to re-add them using the new volume format.  I 
>> haven't tried manually adding them using 'never used' IDs.  I'll try that 
>> now but was hoping there would be a fix.
>>
>> Thanks!
>>
>> Mike Kuriger
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_listinfo.cgi_ceph-2Dusers-2Dceph.com=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=2aoWc5hTz041_26Stz6zPtLiB5zGFw2GbX3TPjsvieI=
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Gerhard W. Recher

Michael,


this apparently changed in mimic to "--yes-i-really-really-mean-it" :(

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 171 4802507
Am 07.06.2018 um 22:11 schrieb Michael Kuriger:

Do you mean:
ceph osd destroy {ID}  --yes-i-really-mean-it

Mike Kuriger



-Original Message-
From: Vasu Kulkarni [mailto:vakul...@redhat.com]
Sent: Thursday, June 07, 2018 12:28 PM
To: Michael Kuriger
Cc: ceph-users
Subject: Re: [ceph-users] cannot add new OSDs in mimic

There is a osd destroy command but not documented, did you run that as well?

On Thu, Jun 7, 2018 at 12:21 PM, Michael Kuriger  wrote:

CEPH team,
Is there a solution yet for adding OSDs in mimic - specifically re-using old IDs?  I was looking over 
this BUG report - 
https://urldefense.proofpoint.com/v2/url?u=https-3A__tracker.ceph.com_issues_24423=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=hfI2uudTfY0lGtBI6iIXvZWvNpme4xwBJe2SWx0_N3I=
 and my issue is similar.  I removed a bunch of OSD's after upgrading to mimic and I'm not able to 
re-add them using the new volume format.  I haven't tried manually adding them using 'never used' IDs.  
I'll try that now but was hoping there would be a fix.

Thanks!

Mike Kuriger

___
ceph-users mailing list
ceph-users@lists.ceph.com
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_listinfo.cgi_ceph-2Dusers-2Dceph.com=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=2aoWc5hTz041_26Stz6zPtLiB5zGFw2GbX3TPjsvieI=

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Michael Kuriger
Do you mean:
ceph osd destroy {ID}  --yes-i-really-mean-it

Mike Kuriger 



-Original Message-
From: Vasu Kulkarni [mailto:vakul...@redhat.com] 
Sent: Thursday, June 07, 2018 12:28 PM
To: Michael Kuriger
Cc: ceph-users
Subject: Re: [ceph-users] cannot add new OSDs in mimic

There is a osd destroy command but not documented, did you run that as well?

On Thu, Jun 7, 2018 at 12:21 PM, Michael Kuriger  wrote:
> CEPH team,
> Is there a solution yet for adding OSDs in mimic - specifically re-using old 
> IDs?  I was looking over this BUG report - 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__tracker.ceph.com_issues_24423=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=hfI2uudTfY0lGtBI6iIXvZWvNpme4xwBJe2SWx0_N3I=
>  and my issue is similar.  I removed a bunch of OSD's after upgrading to 
> mimic and I'm not able to re-add them using the new volume format.  I haven't 
> tried manually adding them using 'never used' IDs.  I'll try that now but was 
> hoping there would be a fix.
>
> Thanks!
>
> Mike Kuriger
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_listinfo.cgi_ceph-2Dusers-2Dceph.com=DwIFaQ=5m9CfXHY6NXqkS7nN5n23w=5r9bhr1JAPRaUcJcU-FfGg=0PCKiecm216R95S_krqboYMskCBoolGysrvgHZo8LEM=2aoWc5hTz041_26Stz6zPtLiB5zGFw2GbX3TPjsvieI=
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Vasu Kulkarni
There is a osd destroy command but not documented, did you run that as well?

On Thu, Jun 7, 2018 at 12:21 PM, Michael Kuriger  wrote:
> CEPH team,
> Is there a solution yet for adding OSDs in mimic - specifically re-using old 
> IDs?  I was looking over this BUG report - 
> https://tracker.ceph.com/issues/24423 and my issue is similar.  I removed a 
> bunch of OSD's after upgrading to mimic and I'm not able to re-add them using 
> the new volume format.  I haven't tried manually adding them using 'never 
> used' IDs.  I'll try that now but was hoping there would be a fix.
>
> Thanks!
>
> Mike Kuriger
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cannot add new OSDs in mimic

2018-06-07 Thread Michael Kuriger
CEPH team,
Is there a solution yet for adding OSDs in mimic - specifically re-using old 
IDs?  I was looking over this BUG report - 
https://tracker.ceph.com/issues/24423 and my issue is similar.  I removed a 
bunch of OSDs after upgrading to mimic and I'm not able to re-add them using 
the new volume format.  I haven't tried manually adding them using 'never used' 
IDs.  I'll try that now but was hoping there would be a fix.

Thanks!

Mike Kuriger 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Alfredo Deza
On Thu, Jun 7, 2018 at 3:04 PM, Dan van der Ster  wrote:
> On Thu, Jun 7, 2018 at 8:58 PM Alfredo Deza  wrote:
>>
>> On Thu, Jun 7, 2018 at 2:45 PM, Dan van der Ster  wrote:
>> > On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza  wrote:
>> >>
>> >> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil  wrote:
>> >> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  
>> >> >> wrote:
>> >> >> >
>> >> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
>> >> >> > >
>> >> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  
>> >> >> > > > wrote:
>> >> >> > > > >
>> >> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  
>> >> >> > > > > > wrote:
>> >> >> > > > > > >
>> >> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > > > > > Hi all,
>> >> >> > > > > > > >
>> >> >> > > > > > > > We have an intermittent issue where bluestore osds 
>> >> >> > > > > > > > sometimes fail to
>> >> >> > > > > > > > start after a reboot.
>> >> >> > > > > > > > The osds all fail the same way [see 2], failing to open 
>> >> >> > > > > > > > the superblock.
>> >> >> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
>> >> >> > > > > > > > partitioned for
>> >> >> > > > > > > > the block.db's. The affected non-starting OSDs all have 
>> >> >> > > > > > > > block.db on
>> >> >> > > > > > > > the same ssd (/dev/sdaa).
>> >> >> > > > > > > >
>> >> >> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and 
>> >> >> > > > > > > > were created
>> >> >> > > > > > > > by ceph-volume lvm, e.g. see [1].
>> >> >> > > > > > > >
>> >> >> > > > > > > > This seems like a permissions or similar issue related 
>> >> >> > > > > > > > to the
>> >> >> > > > > > > > ceph-volume tooling.
>> >> >> > > > > > > > Any clues how to debug this further?
>> >> >> > > > > > >
>> >> >> > > > > > > I take it the OSDs start up if you try again?
>> >> >> > > > > >
>> >> >> > > > > > Hey.
>> >> >> > > > > > No, they don't. For example, we do this `ceph-volume lvm 
>> >> >> > > > > > activate 48
>> >> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its 
>> >> >> > > > > > the same
>> >> >> > > > > > mount failure every time.
>> >> >> > > > >
>> >> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  
>> >> >> > > > > Can you
>> >> >> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
>> >> >> > > > > debug bluestore = 20)
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > Here: https://pastebin.com/TJXZhfcY
>> >> >> > > >
>> >> >> > > > Is it supposed to print something about the block.db at some 
>> >> >> > > > point
>> >> >> > >
>> >> >> > > Can you dump the bluefs superblock for me?
>> >> >> > >
>> >> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> >> > > hexdump -C /tmp/foo
>> >> >> > >
>> >> >> >
>> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
>> >> >> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> >> > 1+0 records in
>> >> >> > 1+0 records out
>> >> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
>> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
>> >> >> > /tmp/foo
>> >> >> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
>> >> >> > |..]..MC1J...|
>> >> >> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
>> >> >> > |r6.MK...|
>> >> >> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
>> >> >> > ||
>> >> >> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
>> >> >> > |+..@|
>> >> >> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
>> >> >> > ||
>> >> >> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
>> >> >> > ||
>> >> >> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
>> >> >> > .am...|
>> >> >> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
>> >> >> > ||
>> >> >> > *
>> >> >> > 1000
>> >> >> >
>> >> >> >
>> >> >>
>> >> >> Wait, we found something!!!
>> >> >>
>> >> >> In the 1st 4k on the block we found the block.db pointing at the wrong
>> >> >> device (/dev/sdc1 instead of /dev/sdaa1)
>> >> >>
>> >> >> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
>> >> >> |k5y+g==path_|
>> >> >> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
>> >> >> |block.db/dev|
>> >> >> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
>> >> >> |/sdc1ready..|
>> >> >> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
>> >> >> |..readywhoam|
>> >> >> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
>> >> >> |i48.|
>> >> >> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
>> >> >> ||
>> >> >>
>> >> >> It 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 8:58 PM Alfredo Deza  wrote:
>
> On Thu, Jun 7, 2018 at 2:45 PM, Dan van der Ster  wrote:
> > On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza  wrote:
> >>
> >> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil  wrote:
> >> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  
> >> >> wrote:
> >> >> >
> >> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> >> >> > >
> >> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> >> >> > > > >
> >> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  
> >> >> > > > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> >> > > > > > > > Hi all,
> >> >> > > > > > > >
> >> >> > > > > > > > We have an intermittent issue where bluestore osds 
> >> >> > > > > > > > sometimes fail to
> >> >> > > > > > > > start after a reboot.
> >> >> > > > > > > > The osds all fail the same way [see 2], failing to open 
> >> >> > > > > > > > the superblock.
> >> >> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> >> >> > > > > > > > partitioned for
> >> >> > > > > > > > the block.db's. The affected non-starting OSDs all have 
> >> >> > > > > > > > block.db on
> >> >> > > > > > > > the same ssd (/dev/sdaa).
> >> >> > > > > > > >
> >> >> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and 
> >> >> > > > > > > > were created
> >> >> > > > > > > > by ceph-volume lvm, e.g. see [1].
> >> >> > > > > > > >
> >> >> > > > > > > > This seems like a permissions or similar issue related to 
> >> >> > > > > > > > the
> >> >> > > > > > > > ceph-volume tooling.
> >> >> > > > > > > > Any clues how to debug this further?
> >> >> > > > > > >
> >> >> > > > > > > I take it the OSDs start up if you try again?
> >> >> > > > > >
> >> >> > > > > > Hey.
> >> >> > > > > > No, they don't. For example, we do this `ceph-volume lvm 
> >> >> > > > > > activate 48
> >> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its 
> >> >> > > > > > the same
> >> >> > > > > > mount failure every time.
> >> >> > > > >
> >> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  
> >> >> > > > > Can you
> >> >> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> >> >> > > > > debug bluestore = 20)
> >> >> > > > >
> >> >> > > >
> >> >> > > > Here: https://pastebin.com/TJXZhfcY
> >> >> > > >
> >> >> > > > Is it supposed to print something about the block.db at some 
> >> >> > > > point
> >> >> > >
> >> >> > > Can you dump the bluefs superblock for me?
> >> >> > >
> >> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> >> > > hexdump -C /tmp/foo
> >> >> > >
> >> >> >
> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> >> >> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> >> > 1+0 records in
> >> >> > 1+0 records out
> >> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
> >> >> > /tmp/foo
> >> >> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> >> >> > |..]..MC1J...|
> >> >> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> >> >> > |r6.MK...|
> >> >> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> >> >> > ||
> >> >> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> >> >> > |+..@|
> >> >> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> >> >> > ||
> >> >> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> >> >> > ||
> >> >> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> >> >> > .am...|
> >> >> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> >> >> > ||
> >> >> > *
> >> >> > 1000
> >> >> >
> >> >> >
> >> >>
> >> >> Wait, we found something!!!
> >> >>
> >> >> In the 1st 4k on the block we found the block.db pointing at the wrong
> >> >> device (/dev/sdc1 instead of /dev/sdaa1)
> >> >>
> >> >> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> >> >> |k5y+g==path_|
> >> >> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> >> >> |block.db/dev|
> >> >> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> >> >> |/sdc1ready..|
> >> >> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> >> >> |..readywhoam|
> >> >> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> >> >> |i48.|
> >> >> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> >> >> ||
> >> >>
> >> >> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> >> >> instead of /dev/sdaa2).
> >> >> And for the osds that are running, that block.db is correct!
> >> >>
> >> >> 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Alfredo Deza
On Thu, Jun 7, 2018 at 2:45 PM, Dan van der Ster  wrote:
> On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza  wrote:
>>
>> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil  wrote:
>> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  
>> >> wrote:
>> >> >
>> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
>> >> > >
>> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
>> >> > > > >
>> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  
>> >> > > > > > wrote:
>> >> > > > > > >
>> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > > > > > Hi all,
>> >> > > > > > > >
>> >> > > > > > > > We have an intermittent issue where bluestore osds 
>> >> > > > > > > > sometimes fail to
>> >> > > > > > > > start after a reboot.
>> >> > > > > > > > The osds all fail the same way [see 2], failing to open the 
>> >> > > > > > > > superblock.
>> >> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
>> >> > > > > > > > partitioned for
>> >> > > > > > > > the block.db's. The affected non-starting OSDs all have 
>> >> > > > > > > > block.db on
>> >> > > > > > > > the same ssd (/dev/sdaa).
>> >> > > > > > > >
>> >> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and 
>> >> > > > > > > > were created
>> >> > > > > > > > by ceph-volume lvm, e.g. see [1].
>> >> > > > > > > >
>> >> > > > > > > > This seems like a permissions or similar issue related to 
>> >> > > > > > > > the
>> >> > > > > > > > ceph-volume tooling.
>> >> > > > > > > > Any clues how to debug this further?
>> >> > > > > > >
>> >> > > > > > > I take it the OSDs start up if you try again?
>> >> > > > > >
>> >> > > > > > Hey.
>> >> > > > > > No, they don't. For example, we do this `ceph-volume lvm 
>> >> > > > > > activate 48
>> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the 
>> >> > > > > > same
>> >> > > > > > mount failure every time.
>> >> > > > >
>> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can 
>> >> > > > > you
>> >> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
>> >> > > > > debug bluestore = 20)
>> >> > > > >
>> >> > > >
>> >> > > > Here: https://pastebin.com/TJXZhfcY
>> >> > > >
>> >> > > > Is it supposed to print something about the block.db at some 
>> >> > > > point
>> >> > >
>> >> > > Can you dump the bluefs superblock for me?
>> >> > >
>> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> > > hexdump -C /tmp/foo
>> >> > >
>> >> >
>> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
>> >> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> > 1+0 records in
>> >> > 1+0 records out
>> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
>> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
>> >> > /tmp/foo
>> >> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
>> >> > |..]..MC1J...|
>> >> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
>> >> > |r6.MK...|
>> >> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
>> >> > ||
>> >> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
>> >> > |+..@|
>> >> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
>> >> > ||
>> >> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
>> >> > ||
>> >> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
>> >> > .am...|
>> >> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
>> >> > ||
>> >> > *
>> >> > 1000
>> >> >
>> >> >
>> >>
>> >> Wait, we found something!!!
>> >>
>> >> In the 1st 4k on the block we found the block.db pointing at the wrong
>> >> device (/dev/sdc1 instead of /dev/sdaa1)
>> >>
>> >> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
>> >> |k5y+g==path_|
>> >> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
>> >> |block.db/dev|
>> >> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
>> >> |/sdc1ready..|
>> >> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
>> >> |..readywhoam|
>> >> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
>> >> |i48.|
>> >> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
>> >> ||
>> >>
>> >> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
>> >> instead of /dev/sdaa2).
>> >> And for the osds that are running, that block.db is correct!
>> >>
>> >> So the block.db device is persisted in the block header? But after
>> >> a reboot it gets a new name. (sd* naming is famously chaotic).
>> >> ceph-volume creates a softlink to the correct db dev, but it seems not 
>> >> used?
>> >
>> > Aha, yes.. the bluestore startup code looks for the 

Re: [ceph-users] Adding additional disks to the production cluster without performance impacts on the existing

2018-06-07 Thread David Turner
A recommendation for adding disks with minimal impact is to add them with a
crush weight of 0 (configurable in the ceph.conf file) and then increase
their weight in small increments until you get to the desired OSD weight.
That way you're never moving too much data at once and can stop at any time.

If you don't want to be quite this paranoid, you can just manage the
osd_max_backfills setting and call it a day while letting the OSDs come in
at their full weight from the start.  It all depends on your client IO needs,
how much data you have, speed of disks/network, etc.
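
As a minimal sketch of the first approach (the OSD name and weights are
placeholders; 1.82 would be the final crush weight of the disk):

# ceph.conf on the new OSD hosts, so freshly created OSDs come in at weight 0
[osd]
osd crush initial weight = 0

# then raise each new OSD in small steps, waiting for the cluster to settle
# between steps, until the final weight is reached
ceph osd crush reweight osd.120 0.2
ceph osd crush reweight osd.120 0.4
...
ceph osd crush reweight osd.120 1.82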

On Wed, Jun 6, 2018 at 3:09 AM John Molefe  wrote:

> Hi everyone
>
> We have completed all phases and the only remaining part is just adding
> the disks to the current cluster but i am afraid of impacting performance
> as it is on production.
> Any guides and advices on how this can be achieved with least impact on
> production??
>
> Thanks in advance
> John
>
> Vrywaringsklousule / Disclaimer: *
> http://www.nwu.ac.za/it/gov-man/disclaimer.html
> *
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza  wrote:
>
> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil  wrote:
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  
> >> wrote:
> >> >
> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> >> > >
> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> >> > > > >
> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > > > > > Hi all,
> >> > > > > > > >
> >> > > > > > > > We have an intermittent issue where bluestore osds sometimes 
> >> > > > > > > > fail to
> >> > > > > > > > start after a reboot.
> >> > > > > > > > The osds all fail the same way [see 2], failing to open the 
> >> > > > > > > > superblock.
> >> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> >> > > > > > > > partitioned for
> >> > > > > > > > the block.db's. The affected non-starting OSDs all have 
> >> > > > > > > > block.db on
> >> > > > > > > > the same ssd (/dev/sdaa).
> >> > > > > > > >
> >> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and 
> >> > > > > > > > were created
> >> > > > > > > > by ceph-volume lvm, e.g. see [1].
> >> > > > > > > >
> >> > > > > > > > This seems like a permissions or similar issue related to the
> >> > > > > > > > ceph-volume tooling.
> >> > > > > > > > Any clues how to debug this further?
> >> > > > > > >
> >> > > > > > > I take it the OSDs start up if you try again?
> >> > > > > >
> >> > > > > > Hey.
> >> > > > > > No, they don't. For example, we do this `ceph-volume lvm 
> >> > > > > > activate 48
> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the 
> >> > > > > > same
> >> > > > > > mount failure every time.
> >> > > > >
> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can 
> >> > > > > you
> >> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> >> > > > > debug bluestore = 20)
> >> > > > >
> >> > > >
> >> > > > Here: https://pastebin.com/TJXZhfcY
> >> > > >
> >> > > > Is it supposed to print something about the block.db at some 
> >> > > > point
> >> > >
> >> > > Can you dump the bluefs superblock for me?
> >> > >
> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> > > hexdump -C /tmp/foo
> >> > >
> >> >
> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> >> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> > 1+0 records in
> >> > 1+0 records out
> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
> >> > /tmp/foo
> >> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> >> > |..]..MC1J...|
> >> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> >> > |r6.MK...|
> >> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> >> > ||
> >> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> >> > |+..@|
> >> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> >> > ||
> >> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> >> > ||
> >> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> >> > .am...|
> >> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> >> > ||
> >> > *
> >> > 1000
> >> >
> >> >
> >>
> >> Wait, we found something!!!
> >>
> >> In the 1st 4k on the block we found the block.db pointing at the wrong
> >> device (/dev/sdc1 instead of /dev/sdaa1)
> >>
> >> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> >> |k5y+g==path_|
> >> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> >> |block.db/dev|
> >> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> >> |/sdc1ready..|
> >> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> >> |..readywhoam|
> >> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> >> |i48.|
> >> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> >> ||
> >>
> >> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> >> instead of /dev/sdaa2).
> >> And for the osds that are running, that block.db is correct!
> >>
> >> So the block.db device is persisted in the block header? But after
> >> a reboot it gets a new name. (sd* naming is famously chaotic).
> >> ceph-volume creates a softlink to the correct db dev, but it seems not 
> >> used?
> >
> > Aha, yes.. the bluestore startup code looks for the value in the
> > superblock before the on in the directory.
> >
> > We can either (1) reverse that order, (and/)or (2) make ceph-volume use a
> > stable path for the device name when creating the 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 6:33 PM Sage Weil  wrote:
>
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > Wait, we found something!!!
> > > >
> > > > In the 1st 4k on the block we found the block.db pointing at the wrong
> > > > device (/dev/sdc1 instead of /dev/sdaa1)
> > > >
> > > > 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> > > > |k5y+g==path_|
> > > > 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> > > > |block.db/dev|
> > > > 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> > > > |/sdc1ready..|
> > > > 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> > > > |..readywhoam|
> > > > 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> > > > |i48.|
> > > > 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > > > ||
> > > >
> > > > It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> > > > instead of /dev/sdaa2).
> > > > And for the osds that are running, that block.db is correct!
>
> Also, note that you can fix your OSDs by changing the path to a stable
> name for the same device (/dev/disk/by-partuuid/something?) with
> 'ceph-bluestore-tool set-label-key ...'.

Good to know, thanks!
I understand your (3) earlier now...  Yes, ceph-volume should call
that to fix the OSD if the device changes.

Cheers, Dan

>
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 6:30 PM, Jason Dillaman  wrote:
> On Thu, Jun 7, 2018 at 12:13 PM, Tracy Reed  wrote:
>> On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
>>> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
>>> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> This is a *very* old kernel.
>>
>> It's what's shipping with CentOS/RHEL 7 and probably what the vast
>> majority of people are using aside from perhaps the Ubuntu LTS people.
>
> I think what Ilya is saying is that it's a very old RHEL 7-based
> kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
> numerous improvements that have been backported from the current
> upstream kernel.

Correct.  RHEL 7.1 isn't supported anymore -- even the EUS (Extended
Update Support) from Red Hat ended more than a year ago.

I would recommend an upgrade to 7.5 or a recent upstream kernel from
ELRepo.
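
If you go the ELRepo route, it is roughly (repo setup details are on
elrepo.org; kernel-ml is the mainline branch, kernel-lt the long-term one):

# with the elrepo-release package installed from elrepo.org:
yum --enablerepo=elrepo-kernel install kernel-ml
# then make the new kernel the grub default and reboot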

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack VMs with Ceph EC pools

2018-06-07 Thread Jason Dillaman
On Thu, Jun 7, 2018 at 12:54 PM, Pardhiv Karri  wrote:
> Thank you, Andrew and Jason for replying.
>
> Jason,
> Do you have a sample ceph config file that you can share which works with
> RBD and EC pools?

Yup -- see below from my previous email.
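
As a rough sketch of the whole setup (pool names, PG counts and the default
EC profile below are made-up examples; EC overwrites require BlueStore OSDs):

# image headers/OMAP live in a replicated pool, data objects in the EC pool
ceph osd pool create volumes 64 64 replicated
ceph osd pool create volumes-ec-data 64 64 erasure
ceph osd pool set volumes-ec-data allow_ec_overwrites true
ceph osd pool application enable volumes rbd
ceph osd pool application enable volumes-ec-data rbd

# ceph.conf on the Cinder controller nodes:
[client.cinder]
rbd default data pool = volumes-ec-data

Cinder's rbd_pool then points at the replicated pool ("volumes" above), while
the data objects of new images land in the EC pool.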

> Thanks,
> Pardhiv Karri
>
> On Thu, Jun 7, 2018 at 9:08 AM, Jason Dillaman  wrote:
>>
>> On Thu, Jun 7, 2018 at 11:54 AM, Andrew Denton 
>> wrote:
>> > On Wed, 2018-06-06 at 17:02 -0700, Pardhiv Karri wrote:
>> >> Hi,
>> >>
>> >> Is anyone using Openstack with Ceph  Erasure Coding pools as it now
>> >> supports RBD in Luminous. If so, hows the performance?
>> >
>> > I attempted it, but couldn't figure out how to get Cinder to specify
>> > the data pool. You can't just point Cinder at the erasure-coded pool
>> > since the ec pool doesn't support OMAP and the rbd creation will fail.
>> > Cinder will need to learn how to create the rbd differently, or there
>> > needs to be some override in ceph.conf.
>>
>> Correct, you can put an override in the "ceph.conf" file on your
>> Cinder controller nodes:
>>
>> [client.cinder]
>> rbd default data pool = XYZ
>>
>> You can also use Cinder multi-backend, each using a different CephX
>> user id, to support different overrides for different device classes.
>>
>>
>> > Thanks,
>> > Andrew
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>
>
>
>
> --
> Pardhiv Karri
> "Rise and Rise again until LAMBS become LIONS"
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic (13.2.0) Release Notes Bug on CephFS Snapshot Upgrades

2018-06-07 Thread Patrick Donnelly
There was a bug [1] in the release notes [2] which had incorrect
commands for upgrading the snapshot format of an existing CephFS file
system which has had snapshots enabled at some point. The correction
is here [3]:

diff --git a/doc/releases/mimic.rst b/doc/releases/mimic.rst
index 137d56311c..3a3345bbc0 100644
--- a/doc/releases/mimic.rst
+++ b/doc/releases/mimic.rst
@@ -346,8 +346,8 @@ These changes occurred between the Luminous and
Mimic releases.
 previous max_mds" step in above URL to fail. To re-enable the feature,
 either delete all old snapshots or scrub the whole filesystem:

-  - ``ceph daemon  scrub_path /``
-  - ``ceph daemon  scrub_path '~mdsdir'``
+  - ``ceph daemon  scrub_path / force recursive repair``
+  - ``ceph daemon  scrub_path '~mdsdir' force
recursive repair``

   - Support has been added in Mimic for quotas in the Linux kernel
client as of v4.17.


The release notes on the blog have already been updated.

If you executed the wrong commands already, it should be sufficient to
run the correct commands once more to fix the file system.
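
For example, if the daemon is named mds.a (a placeholder; use your own MDS
name, and run this on the host where that MDS runs, since "ceph daemon" talks
to the local admin socket), the corrected invocations look like:

ceph daemon mds.a scrub_path / force recursive repair
ceph daemon mds.a scrub_path '~mdsdir' force recursive repair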

[1] https://tracker.ceph.com/issues/24435
[2] https://ceph.com/releases/v13-2-0-mimic-released/
[3] https://github.com/ceph/ceph/pull/22445/files

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack VMs with Ceph EC pools

2018-06-07 Thread Pardhiv Karri
Thank you, Andrew and Jason for replying.

Jason,
Do you have a sample ceph config file that you can share which works with
RBD and EC pools?

Thanks,
Pardhiv Karri

On Thu, Jun 7, 2018 at 9:08 AM, Jason Dillaman  wrote:

> On Thu, Jun 7, 2018 at 11:54 AM, Andrew Denton 
> wrote:
> > On Wed, 2018-06-06 at 17:02 -0700, Pardhiv Karri wrote:
> >> Hi,
> >>
> >> Is anyone using Openstack with Ceph  Erasure Coding pools as it now
> >> supports RBD in Luminous. If so, hows the performance?
> >
> > I attempted it, but couldn't figure out how to get Cinder to specify
> > the data pool. You can't just point Cinder at the erasure-coded pool
> > since the ec pool doesn't support OMAP and the rbd creation will fail.
> > Cinder will need to learn how to create the rbd differently, or there
> > needs to be some override in ceph.conf.
>
> Correct, you can put an override in the "ceph.conf" file on your
> Cinder controller nodes:
>
> [client.cinder]
> rbd default data pool = XYZ
>
> You can also use Cinder multi-backend, each using a different CephX
> user id, to support different overrides for different device classes.
>
>
> > Thanks,
> > Andrew
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>



-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool has many more objects per pg than average

2018-06-07 Thread Brett Chancellor
The error will go away once you start storing data in the other pools. Or,
you could simply silence the message with mon_pg_warn_max_object_skew = 0
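
For example (a sketch; on some releases this health check is evaluated by the
mgr rather than the mon, so the option may need to be set there, or in
ceph.conf, followed by a daemon restart to take effect):

# ceph.conf
[global]
mon_pg_warn_max_object_skew = 0

# or at runtime:
ceph tell mon.* injectargs '--mon_pg_warn_max_object_skew 0'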


On Thu, Jun 7, 2018 at 10:48 AM, Torin Woltjer 
wrote:

> I have a ceph cluster and status shows this error: pool libvirt-pool has
> many more objects per pg than average (too few pgs?) This pool has the most
> stored in it currently, by a large margin. The other pools are
> underutilized currently, but are purposed to take a role much greater than
> libvirt-pool. Once the other pools begin storing more objects, will this
> error go away, or am I misunderstanding the message?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > Wait, we found something!!!
> > >
> > > In the 1st 4k on the block we found the block.db pointing at the wrong
> > > device (/dev/sdc1 instead of /dev/sdaa1)
> > >
> > > 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> > > |k5y+g==path_|
> > > 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> > > |block.db/dev|
> > > 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> > > |/sdc1ready..|
> > > 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> > > |..readywhoam|
> > > 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> > > |i48.|
> > > 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > > ||
> > >
> > > It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> > > instead of /dev/sdaa2).
> > > And for the osds that are running, that block.db is correct!

Also, note that you can fix your OSDs by changing the path to a stable 
name for the same device (/dev/disk/by-partuuid/something?) with 
'ceph-bluestore-tool set-label-key ...'.
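
A sketch of what that could look like (the device path and partuuid are
placeholders; "show-label" prints the current keys, including the
path_block.db entry seen in the hexdump above):

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-48/block
ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-48/block \
    -k path_block.db -v /dev/disk/by-partuuid/<uuid-of-the-db-partition>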

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Sergey Malinin
http://elrepo.org/tiki/kernel-ml provides 4.17

> On 7.06.2018, at 19:13, Tracy Reed  wrote:
> 
> It's what's shipping with CentOS/RHEL 7 and probably what the vast
> majority of people are using aside from perhaps the Ubuntu LTS people.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> On Thu, Jun 7, 2018 at 6:09 PM Sage Weil  wrote:
> >
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> > > > >
> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > > > > > >
> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > We have an intermittent issue where bluestore osds 
> > > > > > > > > > sometimes fail to
> > > > > > > > > > start after a reboot.
> > > > > > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > > > > > superblock.
> > > > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> > > > > > > > > > partitioned for
> > > > > > > > > > the block.db's. The affected non-starting OSDs all have 
> > > > > > > > > > block.db on
> > > > > > > > > > the same ssd (/dev/sdaa).
> > > > > > > > > >
> > > > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and 
> > > > > > > > > > were created
> > > > > > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > > > > > >
> > > > > > > > > > This seems like a permissions or similar issue related to 
> > > > > > > > > > the
> > > > > > > > > > ceph-volume tooling.
> > > > > > > > > > Any clues how to debug this further?
> > > > > > > > >
> > > > > > > > > I take it the OSDs start up if you try again?
> > > > > > > >
> > > > > > > > Hey.
> > > > > > > > No, they don't. For example, we do this `ceph-volume lvm 
> > > > > > > > activate 48
> > > > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the 
> > > > > > > > same
> > > > > > > > mount failure every time.
> > > > > > >
> > > > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can 
> > > > > > > you
> > > > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > > > > > debug bluestore = 20)
> > > > > > >
> > > > > >
> > > > > > Here: https://pastebin.com/TJXZhfcY
> > > > > >
> > > > > > Is it supposed to print something about the block.db at some 
> > > > > > point
> > > > >
> > > > > Can you dump the bluefs superblock for me?
> > > > >
> > > > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > > > hexdump -C /tmp/foo
> > > > >
> > > >
> > > > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> > > > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > > 1+0 records in
> > > > 1+0 records out
> > > > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> > > > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
> > > > /tmp/foo
> > > >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> > > > |..]..MC1J...|
> > > > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> > > > |r6.MK...|
> > > > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> > > > ||
> > > > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> > > > |+..@|
> > > > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> > > > ||
> > > > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> > > > ||
> > > > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> > > > .am...|
> > > > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > > > ||
> > > > *
> > > > 1000
> > > >
> > > >
> > >
> > > Wait, we found something!!!
> > >
> > > In the 1st 4k on the block we found the block.db pointing at the wrong
> > > device (/dev/sdc1 instead of /dev/sdaa1)
> > >
> > > 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> > > |k5y+g==path_|
> > > 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> > > |block.db/dev|
> > > 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> > > |/sdc1ready..|
> > > 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> > > |..readywhoam|
> > > 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> > > |i48.|
> > > 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > > ||
> > >
> > > It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> > > instead of /dev/sdaa2).
> > > And for the osds that are running, that block.db is correct!
> > >
> > > So the block.db device is persisted in the block header? But after
> > > a reboot it gets a new name. (sd* naming is famously chaotic).
> > > ceph-volume creates a softlink to the correct db dev, but it seems not 
> > > used?
> >
> > Aha, yes.. the bluestore startup code looks for the value in the
> > superblock before the one in the directory.
> >
> > We 

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Jason Dillaman
On Thu, Jun 7, 2018 at 12:13 PM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
>> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
>> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>> This is a *very* old kernel.
>
> It's what's shipping with CentOS/RHEL 7 and probably what the vast
> majority of people are using aside from perhaps the Ubuntu LTS people.

I think what Ilya is saying is that it's a very old RHEL 7-based
kernel (RHEL 7.1?). For example, the current RHEL 7.5 kernel includes
numerous improvements that have been backported from the current
upstream kernel.

> Does anyone really still compile their own latest kernels? Back in the
> mid-90's I'd compile a new kernel at the drop of a hat. But now it has
> gotten so complicated with so many options and drivers etc. that it's
> actually pretty hard to get it right.
>
>> These lines indicate in-flight requests.  Looks like there may have
>> been a problem with osd1 in the past, as some of these are much older
>> than others.  Try bouncing osd1 with "ceph osd down 1" (it should
>> come back up automatically) and see if that clears up this batch.
>
> Thanks!
>
> --
> Tracy Reed
> http://tracyreed.org
> Digital signature attached for your safety.
>



-- 
Jason


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 6:09 PM Sage Weil  wrote:
>
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  wrote:
> > >
> > > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> > > >
> > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > > > > >
> > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > > > > > >
> > > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > We have an intermittent issue where bluestore osds sometimes 
> > > > > > > > > fail to
> > > > > > > > > start after a reboot.
> > > > > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > > > > superblock.
> > > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> > > > > > > > > partitioned for
> > > > > > > > > the block.db's. The affected non-starting OSDs all have 
> > > > > > > > > block.db on
> > > > > > > > > the same ssd (/dev/sdaa).
> > > > > > > > >
> > > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were 
> > > > > > > > > created
> > > > > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > > > > >
> > > > > > > > > This seems like a permissions or similar issue related to the
> > > > > > > > > ceph-volume tooling.
> > > > > > > > > Any clues how to debug this further?
> > > > > > > >
> > > > > > > > I take it the OSDs start up if you try again?
> > > > > > >
> > > > > > > Hey.
> > > > > > > No, they don't. For example, we do this `ceph-volume lvm activate 
> > > > > > > 48
> > > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the 
> > > > > > > same
> > > > > > > mount failure every time.
> > > > > >
> > > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can 
> > > > > > you
> > > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > > > > debug bluestore = 20)
> > > > > >
> > > > >
> > > > > Here: https://pastebin.com/TJXZhfcY
> > > > >
> > > > > Is it supposed to print something about the block.db at some point
> > > >
> > > > Can you dump the bluefs superblock for me?
> > > >
> > > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > > hexdump -C /tmp/foo
> > > >
> > >
> > > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> > > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > 1+0 records in
> > > 1+0 records out
> > > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> > > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C 
> > > /tmp/foo
> > >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> > > |..]..MC1J...|
> > > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> > > |r6.MK...|
> > > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> > > ||
> > > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> > > |+..@|
> > > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> > > ||
> > > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> > > ||
> > > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> > > .am...|
> > > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > > ||
> > > *
> > > 1000
> > >
> > >
> >
> > Wait, we found something!!!
> >
> > In the 1st 4k on the block we found the block.db pointing at the wrong
> > device (/dev/sdc1 instead of /dev/sdaa1)
> >
> > 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  
> > |k5y+g==path_|
> > 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  
> > |block.db/dev|
> > 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  
> > |/sdc1ready..|
> > 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  
> > |..readywhoam|
> > 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  
> > |i48.|
> > 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > ||
> >
> > It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> > instead of /dev/sdaa2).
> > And for the osds that are running, that block.db is correct!
> >
> > So the block.db device is persisted in the block header? But after
> > a reboot it gets a new name. (sd* naming is famously chaotic).
> > ceph-volume creates a softlink to the correct db dev, but it seems not used?
>
> Aha, yes.. the bluestore startup code looks for the value in the
> superblock before the one in the directory.
>
> We can either (1) reverse that order, (and/)or (2) make ceph-volume use a
> stable path for the device name when creating the bluestore.  And/or (3)
> use ceph-bluestore-tool set-label-key to fix it if it doesn't match (this
> would repair old superblocks... permanently if we use the stable path
> name).

ceph-volume has its detection magic to 

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 08:40:50AM PDT, Ilya Dryomov spake thusly:
> > Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue 
> > Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> 
> This is a *very* old kernel.

It's what's shipping with CentOS/RHEL 7 and probably what the vast
majority of people are using aside from perhaps the Ubuntu LTS people.
Does anyone really still compile their own latest kernels? Back in the
mid-90's I'd compile a new kernel at the drop of a hat. But now it has
gotten so complicated with so many options and drivers etc. that it's
actually pretty hard to get it right.

> These lines indicate in-flight requests.  Looks like there may have
> been a problem with osd1 in the past, as some of these are much older
> than others.  Try bouncing osd1 with "ceph osd down 1" (it should
> come back up automatically) and see if that clears up this batch.

Thanks!

-- 
Tracy Reed
http://tracyreed.org
Digital signature attached for your safety.




Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 6:01 PM Dan van der Ster  wrote:
>
> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  wrote:
> >
> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > > > >
> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > > > > >
> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > We have an intermittent issue where bluestore osds sometimes 
> > > > > > > > fail to
> > > > > > > > start after a reboot.
> > > > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > > > superblock.
> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> > > > > > > > partitioned for
> > > > > > > > the block.db's. The affected non-starting OSDs all have 
> > > > > > > > block.db on
> > > > > > > > the same ssd (/dev/sdaa).
> > > > > > > >
> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were 
> > > > > > > > created
> > > > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > > > >
> > > > > > > > This seems like a permissions or similar issue related to the
> > > > > > > > ceph-volume tooling.
> > > > > > > > Any clues how to debug this further?
> > > > > > >
> > > > > > > I take it the OSDs start up if you try again?
> > > > > >
> > > > > > Hey.
> > > > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > > > > > mount failure every time.
> > > > >
> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > > > debug bluestore = 20)
> > > > >
> > > >
> > > > Here: https://pastebin.com/TJXZhfcY
> > > >
> > > > Is it supposed to print something about the block.db at some point
> > >
> > > Can you dump the bluefs superblock for me?
> > >
> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > hexdump -C /tmp/foo
> > >
> >
> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> > |..]..MC1J...|
> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> > |r6.MK...|
> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> > ||
> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> > |+..@|
> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> > ||
> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> > ||
> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> > .am...|
> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > ||
> > *
> > 1000
> >
> >
>
> Wait, we found something!!!
>
> In the 1st 4k on the block we found the block.db pointing at the wrong
> device (/dev/sdc1 instead of /dev/sdaa1)
>
> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==path_|
> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db/dev|
> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1ready..|
> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..readywhoam|
> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i48.|
> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
>
> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> instead of /dev/sdaa2).
> And for the osds that are running, that block.db is correct!
>
> So the block.db device is persisted in the block header? But after
> a reboot it gets a new name. (sd* naming is famously chaotic).
> ceph-volume creates a softlink to the correct db dev, but it seems not used?
>

Yup, we did this crazy hack:

# ln -s /dev/sdaa1 /dev/sdc1

lrwxrwxrwx 1 root root  10 Jun  7 18:07 /dev/sdc1 -> /dev/sdaa1

systemctl start ceph-osd@48

and it's up :-/

-- dan






> ...
> Dan & Teo
>
>
>
>
>
> >
> > -- dan
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > > Thanks!
> > > sage
> > >
> > > >
> > > > Here's the osd dir:
> > > >
> > > > # ls -l /var/lib/ceph/osd/ceph-48/
> > > > total 24
> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> > > > -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> > > > -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> > > > -rw---. 1 ceph ceph 56 Jun  7 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  wrote:
> >
> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > > > >
> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > > > > >
> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > We have an intermittent issue where bluestore osds sometimes 
> > > > > > > > fail to
> > > > > > > > start after a reboot.
> > > > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > > > superblock.
> > > > > > > > One one particular host, there are 24 osds and 4 SSDs 
> > > > > > > > partitioned for
> > > > > > > > the block.db's. The affected non-starting OSDs all have 
> > > > > > > > block.db on
> > > > > > > > the same ssd (/dev/sdaa).
> > > > > > > >
> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were 
> > > > > > > > created
> > > > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > > > >
> > > > > > > > This seems like a permissions or similar issue related to the
> > > > > > > > ceph-volume tooling.
> > > > > > > > Any clues how to debug this further?
> > > > > > >
> > > > > > > I take it the OSDs start up if you try again?
> > > > > >
> > > > > > Hey.
> > > > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > > > > > mount failure every time.
> > > > >
> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> > > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > > > debug bluestore = 20)
> > > > >
> > > >
> > > > Here: https://pastebin.com/TJXZhfcY
> > > >
> > > > Is it supposed to print something about the block.db at some point
> > >
> > > Can you dump the bluefs superblock for me?
> > >
> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > > hexdump -C /tmp/foo
> > >
> >
> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > 1+0 records in
> > 1+0 records out
> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
> >   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  
> > |..]..MC1J...|
> > 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  
> > |r6.MK...|
> > 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  
> > ||
> > 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  
> > |+..@|
> > 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  
> > ||
> > 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  
> > ||
> > 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. 
> > .am...|
> > 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> > ||
> > *
> > 1000
> >
> >
> 
> Wait, we found something!!!
> 
> In the 1st 4k on the block we found the block.db pointing at the wrong
> device (/dev/sdc1 instead of /dev/sdaa1)
> 
> 0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==path_|
> 0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db/dev|
> 0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1ready..|
> 0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..readywhoam|
> 0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i48.|
> 0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
> 
> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
> instead of /dev/sdaa2).
> And for the osds that are running, that block.db is correct!
> 
> So the block.db device is persisted in the block header? But after
> a reboot it gets a new name. (sd* naming is famously chaotic).
> ceph-volume creates a softlink to the correct db dev, but it seems not used?

Aha, yes.. the bluestore startup code looks for the value in the 
superblock before the one in the directory.

We can either (1) reverse that order, (and/)or (2) make ceph-volume use a 
stable path for the device name when creating the bluestore.  And/or (3) 
use ceph-bluestore-tool set-label-key to fix it if it doesn't match (this 
would repair old superblocks... permanently if we use the stable path 
name).
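
As a rough illustration of option (3), and only a sketch: using the label key
name (path_block.db) and the db partition uuid that appear earlier in this
thread, something like the following could repoint the persisted path at a
name that survives reboots:

  # inspect the current label (paths and uuid are the ones from this thread, illustrative only)
  ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-48/block

  # rewrite the persisted block.db path to a stable by-partuuid name
  ceph-bluestore-tool set-label-key \
      --dev /var/lib/ceph/osd/ceph-48/block \
      -k path_block.db \
      -v /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc

If that works, a later reshuffle of the sd* names should no longer matter,
since the by-partuuid symlink always resolves to the same partition.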

sage


> 
> ...
> Dan & Teo
> 
> 
> 
> 
> 
> >
> > -- dan
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > > Thanks!
> > > sage
> > >
> > > >
> > > > Here's the osd dir:
> > > >
> > > > # ls -l /var/lib/ceph/osd/ceph-48/
> > > > total 24
> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> > > > 

Re: [ceph-users] Openstack VMs with Ceph EC pools

2018-06-07 Thread Jason Dillaman
On Thu, Jun 7, 2018 at 11:54 AM, Andrew Denton  wrote:
> On Wed, 2018-06-06 at 17:02 -0700, Pardhiv Karri wrote:
>> Hi,
>>
>> Is anyone using Openstack with Ceph  Erasure Coding pools as it now
>> supports RBD in Luminous. If so, hows the performance?
>
> I attempted it, but couldn't figure out how to get Cinder to specify
> the data pool. You can't just point Cinder at the erasure-coded pool
> since the ec pool doesn't support OMAP and the rbd creation will fail.
> Cinder will need to learn how to create the rbd differently, or there
> needs to be some override in ceph.conf.

Correct, you can put an override in the "ceph.conf" file on your
Cinder controller nodes:

[client.cinder]
rbd default data pool = XYZ

You can also use Cinder multi-backend, each using a different CephX
user id, to support different overrides for different device classes.
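
For illustration (pool and client names below are placeholders, not from this
thread): with that override in place, librbd ends up doing the equivalent of

  rbd create --size 10G --data-pool ecpool volumes/volume-test

so the image metadata and OMAP stay in the replicated pool while the data
objects land in the erasure-coded pool. A multi-backend setup could then use
one ceph.conf client section per backend, for example:

  [client.cinder-replicated]
  # no override: plain replicated images

  [client.cinder-ec]
  rbd default data pool = ecpool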


> Thanks,
> Andrew



-- 
Jason


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster  wrote:
>
> On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
> >
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > > >
> > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > > > >
> > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We have an intermittent issue where bluestore osds sometimes fail 
> > > > > > > to
> > > > > > > start after a reboot.
> > > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > > superblock.
> > > > > > > One one particular host, there are 24 osds and 4 SSDs partitioned 
> > > > > > > for
> > > > > > > the block.db's. The affected non-starting OSDs all have block.db 
> > > > > > > on
> > > > > > > the same ssd (/dev/sdaa).
> > > > > > >
> > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were 
> > > > > > > created
> > > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > > >
> > > > > > > This seems like a permissions or similar issue related to the
> > > > > > > ceph-volume tooling.
> > > > > > > Any clues how to debug this further?
> > > > > >
> > > > > > I take it the OSDs start up if you try again?
> > > > >
> > > > > Hey.
> > > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > > > > mount failure every time.
> > > >
> > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> > > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > > debug bluestore = 20)
> > > >
> > >
> > > Here: https://pastebin.com/TJXZhfcY
> > >
> > > Is it supposed to print something about the block.db at some point
> >
> > Can you dump the bluefs superblock for me?
> >
> > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> > hexdump -C /tmp/foo
> >
>
> [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
> if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
>   01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  |..]..MC1J...|
> 0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  |r6.MK...|
> 0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  ||
> 0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  |+..@|
> 0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  ||
> 0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  ||
> 0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. .am...|
> 0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
> *
> 1000
>
>

Wait, we found something!!!

In the first 4 KB of the block device we found the block.db path pointing at
the wrong device (/dev/sdc1 instead of /dev/sdaa1):

0130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==path_|
0140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db/dev|
0150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1ready..|
0160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..readywhoam|
0170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i48.|
0180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||

It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
instead of /dev/sdaa2).
And for the osds that are running, that block.db is correct!

So the block.db device path is persisted in the block header? But after
a reboot the device gets a new name (sd* naming is famously chaotic).
ceph-volume creates a symlink to the correct db device, but it seems it is not used?

...
Dan & Teo





>
> -- dan
>
>
>
>
>
>
>
>
>
> > Thanks!
> > sage
> >
> > >
> > > Here's the osd dir:
> > >
> > > # ls -l /var/lib/ceph/osd/ceph-48/
> > > total 24
> > > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> > > -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> > > -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> > > -rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
> > > -rw---. 1 ceph ceph  6 Jun  7 16:46 ready
> > > -rw---. 1 ceph ceph 10 Jun  7 16:46 type
> > > -rw---. 1 ceph ceph  3 Jun  7 16:46 whoami
> > >
> > > # ls -l 
> > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > lrwxrwxrwx. 1 root root 7 Jun  7 16:46
> > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > -> ../dm-4
> > >
> > > # ls -l /dev/dm-4
> > > brw-rw. 1 ceph ceph 253, 4 Jun  

Re: [ceph-users] Openstack VMs with Ceph EC pools

2018-06-07 Thread Andrew Denton
On Wed, 2018-06-06 at 17:02 -0700, Pardhiv Karri wrote:
> Hi,
> 
> Is anyone using Openstack with Ceph  Erasure Coding pools as it now
> supports RBD in Luminous. If so, hows the performance? 

I attempted it, but couldn't figure out how to get Cinder to specify
the data pool. You can't just point Cinder at the erasure-coded pool
since the ec pool doesn't support OMAP and the rbd creation will fail.
Cinder will need to learn how to create the rbd differently, or there
needs to be some override in ceph.conf.

Thanks,
Andrew


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 4:33 PM, Tracy Reed  wrote:
> On Thu, Jun 07, 2018 at 02:05:31AM PDT, Ilya Dryomov spake thusly:
>> > find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>>
>> Can you paste the entire output of that command?
>>
>> Which kernel are you running on the client box?
>
> Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 
> 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

This is a *very* old kernel.

>
> output is:
>
> # find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> /sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
> epoch 232455
> flags
> pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
> pool 2 pg_num 512 (511) read_tier -1 write_tier -1
> pool 3 pg_num 128 (127) read_tier -1 write_tier -1
> pool 4 pg_num 100 (127) read_tier -1 write_tier -1
> osd0    10.0.5.3:6801    54%   (exists, up)      100%
> osd1    10.0.5.3:6812    57%   (exists, up)      100%
> osd2    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd3    10.0.5.4:6812    50%   (exists, up)      100%
> osd4    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd5    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd6    10.0.5.9:6861    37%   (exists, up)      100%
> osd7    10.0.5.9:6876    28%   (exists, up)      100%
> osd8    10.0.5.9:6864    43%   (exists, up)      100%
> osd9    10.0.5.9:6836    30%   (exists, up)      100%
> osd10   10.0.5.9:6820    22%   (exists, up)      100%
> osd11   10.0.5.9:6844    54%   (exists, up)      100%
> osd12   10.0.5.9:6803    43%   (exists, up)      100%
> osd13   10.0.5.9:6826    41%   (exists, up)      100%
> osd14   10.0.5.9:6853    37%   (exists, up)      100%
> osd15   10.0.5.9:6872    36%   (exists, up)      100%
> osd16   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd17   10.0.5.9:6812    44%   (exists, up)      100%
> osd18   10.0.5.9:6817    48%   (exists, up)      100%
> osd19   10.0.5.9:6856    33%   (exists, up)      100%
> osd20   10.0.5.9:6808    46%   (exists, up)      100%
> osd21   10.0.5.9:6871    41%   (exists, up)      100%
> osd22   10.0.5.9:6816    49%   (exists, up)      100%
> osd23   10.0.5.9:6823    56%   (exists, up)      100%
> osd24   10.0.5.9:6800    54%   (exists, up)      100%
> osd25   10.0.5.9:6848    54%   (exists, up)      100%
> osd26   10.0.5.9:6840    37%   (exists, up)      100%
> osd27   10.0.5.9:6883    69%   (exists, up)      100%
> osd28   10.0.5.9:6833    39%   (exists, up)      100%
> osd29   10.0.5.9:6809    38%   (exists, up)      100%
> osd30   10.0.5.9:6829    51%   (exists, up)      100%
> osd31   10.0.5.11:6828   47%   (exists, up)      100%
> osd32   10.0.5.11:6848   25%   (exists, up)      100%
> osd33   10.0.5.11:6802   56%   (exists, up)      100%
> osd34   10.0.5.11:6840   35%   (exists, up)      100%
> osd35   10.0.5.11:6856   32%   (exists, up)      100%
> osd36   10.0.5.11:6832   26%   (exists, up)      100%
> osd37   10.0.5.11:6868   42%   (exists, up)      100%
> osd38   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd39   10.0.5.11:6812   52%   (exists, up)      100%
> osd40   10.0.5.11:6864   44%   (exists, up)      100%
> osd41   10.0.5.11:6801   25%   (exists, up)      100%
> osd42   10.0.5.11:6872   39%   (exists, up)      100%
> osd43   10.0.5.13:6809   38%   (exists, up)      100%
> osd44   10.0.5.11:6844   47%   (exists, up)      100%
> osd45   10.0.5.11:6816   20%   (exists, up)      100%
> osd46   10.0.5.3:6800    58%   (exists, up)      100%
> osd47   10.0.5.2:6808    43%   (exists, up)      100%
> osd48   10.0.5.2:6804    44%   (exists, up)      100%
> osd49   10.0.5.2:6812    44%   (exists, up)      100%
> osd50   10.0.5.2:6800    47%   (exists, up)      100%
> osd51   10.0.5.4:6808    43%   (exists, up)      100%
> osd52   10.0.5.12:6815   41%   (exists, up)      100%
> osd53   10.0.5.11:6820   24%   (up)              100%
> osd54   10.0.5.11:6876   34%   (exists, up)      100%
> osd55   10.0.5.11:6836   48%   (exists, up)      100%
> osd56   10.0.5.11:6824   31%   (exists, up)      100%
> osd57   10.0.5.11:6860   48%   (exists, up)      100%
> osd58   10.0.5.11:6852   35%   (exists, up)      100%
> osd59   10.0.5.11:6800   42%   (exists, up)      100%
> osd60   10.0.5.11:6880   58%   (exists, up)      100%
> osd61   10.0.5.3:6803    52%   (exists, up)      100%
> osd62   10.0.5.12:6800   42%   (exists, up)      100%
> osd63   10.0.5.12:6819   46%   (exists, up)      100%
> osd64   10.0.5.12:6809   44%   (exists, up)      100%
> osd65   10.0.5.13:6800   44%   (exists, up)      100%
> osd66   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd67   10.0.5.13:6808   50%   (exists, up)      100%
> osd68   10.0.5.4:6804    41%   (exists, up)      100%
> osd69   10.0.5.4:6800    39%   (exists, up)      100%
> osd70   10.0.5.13:6804   42%   (exists, up)      100%
> osd71   (unknown sockaddr family 0)    0%   (doesn't exist)   100%
> osd72   (unknown sockaddr family 0)    0%   (doesn't 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 5:34 PM Sage Weil  wrote:
>
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > > >
> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We have an intermittent issue where bluestore osds sometimes fail to
> > > > > > start after a reboot.
> > > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > > superblock.
> > > > > > One one particular host, there are 24 osds and 4 SSDs partitioned 
> > > > > > for
> > > > > > the block.db's. The affected non-starting OSDs all have block.db on
> > > > > > the same ssd (/dev/sdaa).
> > > > > >
> > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were 
> > > > > > created
> > > > > > by ceph-volume lvm, e.g. see [1].
> > > > > >
> > > > > > This seems like a permissions or similar issue related to the
> > > > > > ceph-volume tooling.
> > > > > > Any clues how to debug this further?
> > > > >
> > > > > I take it the OSDs start up if you try again?
> > > >
> > > > Hey.
> > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > > > mount failure every time.
> > >
> > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> > > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > > debug bluestore = 20)
> > >
> >
> > Here: https://pastebin.com/TJXZhfcY
> >
> > Is it supposed to print something about the block.db at some point
>
> Can you dump the bluefs superblock for me?
>
> dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> hexdump -C /tmp/foo
>

[17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
[17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
  01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  |..]..MC1J...|
0010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  |r6.MK...|
0020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  ||
0030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  |+..@|
0040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  ||
0050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  ||
0060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. .am...|
0070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
1000



-- dan









> Thanks!
> sage
>
> >
> > Here's the osd dir:
> >
> > # ls -l /var/lib/ceph/osd/ceph-48/
> > total 24
> > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> > -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> > -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> > -rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
> > -rw---. 1 ceph ceph  6 Jun  7 16:46 ready
> > -rw---. 1 ceph ceph 10 Jun  7 16:46 type
> > -rw---. 1 ceph ceph  3 Jun  7 16:46 whoami
> >
> > # ls -l 
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > lrwxrwxrwx. 1 root root 7 Jun  7 16:46
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > -> ../dm-4
> >
> > # ls -l /dev/dm-4
> > brw-rw. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
> >
> >
> >   --- Logical volume ---
> >   LV Path
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   LV Nameosd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   VG Nameceph-34f24306-d90c-49ff-bafb-2657a6a18010
> >   LV UUIDFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >   LV Write Accessread/write
> >   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
> >   LV Status  available
> >   # open 0
> >   LV Size<5.46 TiB
> >   Current LE 1430791
> >   Segments   1
> >   Allocation inherit
> >   Read ahead sectors auto
> >   - currently set to 256
> >   Block device   253:4
> >
> >   --- Physical volume ---
> >   PV Name   /dev/sda
> >   VG Name   ceph-34f24306-d90c-49ff-bafb-2657a6a18010
> >   PV Size   <5.46 TiB / not usable <2.59 MiB
> >   Allocatable   yes (but full)
> >   PE Size   4.00 MiB
> >   Total PE  1430791
> >   Free PE   0
> >   Allocated PE  1430791
> >   PV UUID   WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
> >
> > (sorry for 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> >
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > > >
> > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > > Hi all,
> > > > >
> > > > > We have an intermittent issue where bluestore osds sometimes fail to
> > > > > start after a reboot.
> > > > > The osds all fail the same way [see 2], failing to open the 
> > > > > superblock.
> > > > > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > > > > the block.db's. The affected non-starting OSDs all have block.db on
> > > > > the same ssd (/dev/sdaa).
> > > > >
> > > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > > > > by ceph-volume lvm, e.g. see [1].
> > > > >
> > > > > This seems like a permissions or similar issue related to the
> > > > > ceph-volume tooling.
> > > > > Any clues how to debug this further?
> > > >
> > > > I take it the OSDs start up if you try again?
> > >
> > > Hey.
> > > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > > mount failure every time.
> >
> > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> > try to start the OSD will logging enabled?  (debug bluefs = 20,
> > debug bluestore = 20)
> >
> 
> Here: https://pastebin.com/TJXZhfcY
> 
> Is it supposed to print something about the block.db at some point

Can you dump the bluefs superblock for me?

dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
hexdump -C /tmp/foo

Thanks!
sage

> 
> Here's the osd dir:
> 
> # ls -l /var/lib/ceph/osd/ceph-48/
> total 24
> lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> -rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
> -rw---. 1 ceph ceph  6 Jun  7 16:46 ready
> -rw---. 1 ceph ceph 10 Jun  7 16:46 type
> -rw---. 1 ceph ceph  3 Jun  7 16:46 whoami
> 
> # ls -l 
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> lrwxrwxrwx. 1 root root 7 Jun  7 16:46
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> -> ../dm-4
> 
> # ls -l /dev/dm-4
> brw-rw. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
> 
> 
>   --- Logical volume ---
>   LV Path
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   LV Nameosd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   VG Nameceph-34f24306-d90c-49ff-bafb-2657a6a18010
>   LV UUIDFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>   LV Write Accessread/write
>   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
>   LV Status  available
>   # open 0
>   LV Size<5.46 TiB
>   Current LE 1430791
>   Segments   1
>   Allocation inherit
>   Read ahead sectors auto
>   - currently set to 256
>   Block device   253:4
> 
>   --- Physical volume ---
>   PV Name   /dev/sda
>   VG Name   ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>   PV Size   <5.46 TiB / not usable <2.59 MiB
>   Allocatable   yes (but full)
>   PE Size   4.00 MiB
>   Total PE  1430791
>   Free PE   0
>   Allocated PE  1430791
>   PV UUID   WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
> 
> (sorry for wall o' lvm)
> 
> -- dan
> 
> 
> 
> 
> 
> 
> 
> 
> > Thanks!
> > sage
> >
> >
> > > -- dan
> > >
> > >
> > > >
> > > > sage
> > > >
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Dan
> > > > >
> > > > > [1]
> > > > >
> > > > > == osd.48 ==
> > > > >
> > > > >   [block]
> > > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > > >
> > > > >   type  block
> > > > >   osd id48
> > > > >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
> > > > >   cluster name  ceph
> > > > >   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > > >   db device /dev/sdaa1
> > > > >   encrypted 0
> > > > >   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
> > > > >   cephx lockbox secret
> > > > >   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> > > > >   block device
> > > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > > >   crush device classNone
> > > > >
> > > > >   [  

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 5:16 PM Alfredo Deza  wrote:
>
> On Thu, Jun 7, 2018 at 10:54 AM, Dan van der Ster  wrote:
> > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
> >>
> >> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> >> > >
> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > Hi all,
> >> > > >
> >> > > > We have an intermittent issue where bluestore osds sometimes fail to
> >> > > > start after a reboot.
> >> > > > The osds all fail the same way [see 2], failing to open the 
> >> > > > superblock.
> >> > > > One one particular host, there are 24 osds and 4 SSDs partitioned for
> >> > > > the block.db's. The affected non-starting OSDs all have block.db on
> >> > > > the same ssd (/dev/sdaa).
> >> > > >
> >> > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
> >> > > > by ceph-volume lvm, e.g. see [1].
> >> > > >
> >> > > > This seems like a permissions or similar issue related to the
> >> > > > ceph-volume tooling.
> >> > > > Any clues how to debug this further?
> >> > >
> >> > > I take it the OSDs start up if you try again?
> >> >
> >> > Hey.
> >> > No, they don't. For example, we do this `ceph-volume lvm activate 48
> >> > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> >> > mount failure every time.
> >>
> >> That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> >> try to start the OSD will logging enabled?  (debug bluefs = 20,
> >> debug bluestore = 20)
> >>
> >
> > Here: https://pastebin.com/TJXZhfcY
> >
> > Is it supposed to print something about the block.db at some point
>
> This has to be some logging mistake because it is block.db, never just 
> 'block' :
>
> bdev(0x5653ffdadc00 /var/lib/ceph/osd/ceph-48/block) open path
> /var/lib/ceph/osd/ceph-48/block
>
> That is what you are referring to here right?

What I mean by my question is "I didn't see block.db or sdaa1
mentioned in the log. Maybe that is the problem."

>
> Now, re-reading the thread, you say that it sometimes does boot
> normally?

On this host right now we are unable to start the 6 OSDs with block.db
on sdaa[1-6].

On another two hosts earlier today, we had this same problem, but
after rebooting a couple times the OSDs started normally.
We've rebooted this host with osd.48 and others a few times now but no
such luck.

> ceph-volume tries (in different ways) to ensure that the
> devices
> used are the correct ones. In the case of /dev/sdaa1 it has persisted
> the partuuid (3381a121-1c1b-4e45-a986-c1871c363edc) which is later
> queried using blkid to find the right device name (/dev/sdaa1 in your case).
>
> Is it possible that you are seeing somewhere where ceph-volume is
> *not* matching this correctly? If osd.48 comes up online, how does the
> /var/lib/osd/ceph-48 looks? the same?

As far as we can tell that partuuid is only applied to sdaa1:

# blkid  | grep 3381a121-1c1b-4e45-a986-c1871c363edc
/dev/sdaa1: PARTLABEL="primary" PARTUUID="3381a121-1c1b-4e45-a986-c1871c363edc"

We don't manage to bring osd.48 online, but comparing it to another
osd that *is* up... nothing stands out as different.

In case it helps, here's how we created these osds (nothing unusual...):

  dd if=/dev/zero of=/dev/sdaa bs=512 count=1
  parted -s /dev/sdaa mklabel gpt
  parted -s -a optimal /dev/sdaa mkpart primary "0%" "16%"
  ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/sdaa1
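
A side note prompted by the sd* renaming discussed above, and only a sketch:
assuming ceph-volume accepts the symlinked path, pointing --block.db at the
partition's stable by-partuuid name instead of /dev/sdaa1 would avoid baking a
reboot-unstable device name into the OSD label, e.g.

  ceph-volume lvm create --bluestore --data /dev/sda \
      --block.db /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc

where the uuid is the one blkid reported for the db partition.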

Thanks for all your help!

dan


>
>
> >
> > Here's the osd dir:
> >
> > # ls -l /var/lib/ceph/osd/ceph-48/
> > total 24
> > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> > -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> > -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> > -rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
> > -rw---. 1 ceph ceph  6 Jun  7 16:46 ready
> > -rw---. 1 ceph ceph 10 Jun  7 16:46 type
> > -rw---. 1 ceph ceph  3 Jun  7 16:46 whoami
> >
> > # ls -l 
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > lrwxrwxrwx. 1 root root 7 Jun  7 16:46
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > -> ../dm-4
> >
> > # ls -l /dev/dm-4
> > brw-rw. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
> >
> >
> >   --- Logical volume ---
> >   LV Path
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   LV Nameosd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   VG Nameceph-34f24306-d90c-49ff-bafb-2657a6a18010
> >   LV UUIDFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >   LV Write Accessread/write
> >   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
> >   LV Status  available
> >   # open  

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Alfredo Deza
On Thu, Jun 7, 2018 at 10:54 AM, Dan van der Ster  wrote:
> On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
>>
>> On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
>> > >
>> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> > > > Hi all,
>> > > >
>> > > > We have an intermittent issue where bluestore osds sometimes fail to
>> > > > start after a reboot.
>> > > > The osds all fail the same way [see 2], failing to open the superblock.
>> > > > One one particular host, there are 24 osds and 4 SSDs partitioned for
>> > > > the block.db's. The affected non-starting OSDs all have block.db on
>> > > > the same ssd (/dev/sdaa).
>> > > >
>> > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
>> > > > by ceph-volume lvm, e.g. see [1].
>> > > >
>> > > > This seems like a permissions or similar issue related to the
>> > > > ceph-volume tooling.
>> > > > Any clues how to debug this further?
>> > >
>> > > I take it the OSDs start up if you try again?
>> >
>> > Hey.
>> > No, they don't. For example, we do this `ceph-volume lvm activate 48
>> > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
>> > mount failure every time.
>>
>> That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
>> try to start the OSD will logging enabled?  (debug bluefs = 20,
>> debug bluestore = 20)
>>
>
> Here: https://pastebin.com/TJXZhfcY
>
> Is it supposed to print something about the block.db at some point

This has to be some logging mistake because it is block.db, never just 'block' :

bdev(0x5653ffdadc00 /var/lib/ceph/osd/ceph-48/block) open path
/var/lib/ceph/osd/ceph-48/block

That is what you are referring to here right?

Now, re-reading the thread, you say that it sometimes does boot
normally? ceph-volume tries (in different ways) to ensure that the
devices used are the correct ones. In the case of /dev/sdaa1 it has
persisted the partuuid (3381a121-1c1b-4e45-a986-c1871c363edc), which is
later queried using blkid to find the right device name (/dev/sdaa1 in
your case).

Is it possible that you are seeing somewhere where ceph-volume is
*not* matching this correctly? If osd.48 comes up online, how does
/var/lib/ceph/osd/ceph-48 look? The same?


>
> Here's the osd dir:
>
> # ls -l /var/lib/ceph/osd/ceph-48/
> total 24
> lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> -rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> -rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
> -rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
> -rw---. 1 ceph ceph  6 Jun  7 16:46 ready
> -rw---. 1 ceph ceph 10 Jun  7 16:46 type
> -rw---. 1 ceph ceph  3 Jun  7 16:46 whoami
>
> # ls -l 
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> lrwxrwxrwx. 1 root root 7 Jun  7 16:46
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> -> ../dm-4
>
> # ls -l /dev/dm-4
> brw-rw. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
>
>
>   --- Logical volume ---
>   LV Path
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   LV Nameosd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   VG Nameceph-34f24306-d90c-49ff-bafb-2657a6a18010
>   LV UUIDFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>   LV Write Accessread/write
>   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
>   LV Status  available
>   # open 0
>   LV Size<5.46 TiB
>   Current LE 1430791
>   Segments   1
>   Allocation inherit
>   Read ahead sectors auto
>   - currently set to 256
>   Block device   253:4
>
>   --- Physical volume ---
>   PV Name   /dev/sda
>   VG Name   ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>   PV Size   <5.46 TiB / not usable <2.59 MiB
>   Allocatable   yes (but full)
>   PE Size   4.00 MiB
>   Total PE  1430791
>   Free PE   0
>   Allocated PE  1430791
>   PV UUID   WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
>
> (sorry for wall o' lvm)
>
> -- dan
>
>
>
>
>
>
>
>
>> Thanks!
>> sage
>>
>>
>> > -- dan
>> >
>> >
>> > >
>> > > sage
>> > >
>> > >
>> > > >
>> > > > Thanks!
>> > > >
>> > > > Dan
>> > > >
>> > > > [1]
>> > > >
>> > > > == osd.48 ==
>> > > >
>> > > >   [block]
>> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > >
>> > > >   type  block
>> > > >   osd id48
>> > > >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
>> > > >   cluster name  ceph
>> > > >   osd fsid

[ceph-users] slow MDS requests [Solved]

2018-06-07 Thread Alfredo Daniel Rezinovsky
I had a lot of "slow MDS request" warnings (MDS, not OSD) when writing a lot of
small files to CephFS.


The problem was I/O getting stuck when flushing XFS buffers. I had the same
problem on other (non-Ceph) XFS systems when issuing a lot of inode
operations (rm -rf, for example).


The solution was to destroy the filestore (XFS) OSDs and recreate them
as bluestore. It is a slow process: you stop one OSD, wait for
rebalancing, destroy it, recreate it, and then move on to the next one. It took
10 days to recreate 8 OSDs of 3Gb each (in production, with the RBD
and CephFS services in use).
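
For what it's worth, a rough per-OSD sketch of that loop with Luminous-era
commands (the OSD id and device name below are placeholders; adapt to your own
procedure):

  ceph osd out 12                           # let data drain off this OSD
  # wait for rebalancing to finish (ceph -s)
  systemctl stop ceph-osd@12
  ceph osd purge 12 --yes-i-really-mean-it  # remove it from crush, auth and the osdmap
  ceph-volume lvm zap /dev/sdX              # wipe the old filestore disk
  ceph-volume lvm create --bluestore --data /dev/sdX
  # wait for backfill to finish before moving on to the next OSD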


Hope this helps.

--
Alfredo Daniel Rezinovsky
Director de Tecnologías de Información y Comunicaciones
Facultad de Ingeniería - Universidad Nacional de Cuyo



Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 4:41 PM Sage Weil  wrote:
>
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > > Hi all,
> > > >
> > > > We have an intermittent issue where bluestore osds sometimes fail to
> > > > start after a reboot.
> > > > The osds all fail the same way [see 2], failing to open the superblock.
> > > > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > > > the block.db's. The affected non-starting OSDs all have block.db on
> > > > the same ssd (/dev/sdaa).
> > > >
> > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > > > by ceph-volume lvm, e.g. see [1].
> > > >
> > > > This seems like a permissions or similar issue related to the
> > > > ceph-volume tooling.
> > > > Any clues how to debug this further?
> > >
> > > I take it the OSDs start up if you try again?
> >
> > Hey.
> > No, they don't. For example, we do this `ceph-volume lvm activate 48
> > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> > mount failure every time.
>
> That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
> try to start the OSD will logging enabled?  (debug bluefs = 20,
> debug bluestore = 20)
>

Here: https://pastebin.com/TJXZhfcY

Is it supposed to print something about the block.db at some point?

Here's the osd dir:

# ls -l /var/lib/ceph/osd/ceph-48/
total 24
lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
-rw---. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
-rw---. 1 ceph ceph 37 Jun  7 16:46 fsid
-rw---. 1 ceph ceph 56 Jun  7 16:46 keyring
-rw---. 1 ceph ceph  6 Jun  7 16:46 ready
-rw---. 1 ceph ceph 10 Jun  7 16:46 type
-rw---. 1 ceph ceph  3 Jun  7 16:46 whoami

# ls -l 
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
lrwxrwxrwx. 1 root root 7 Jun  7 16:46
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
-> ../dm-4

# ls -l /dev/dm-4
brw-rw. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4


  --- Logical volume ---
  LV Path
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
  LV Nameosd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
  VG Nameceph-34f24306-d90c-49ff-bafb-2657a6a18010
  LV UUIDFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
  LV Write Accessread/write
  LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
  LV Status  available
  # open 0
  LV Size<5.46 TiB
  Current LE 1430791
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:4

  --- Physical volume ---
  PV Name   /dev/sda
  VG Name   ceph-34f24306-d90c-49ff-bafb-2657a6a18010
  PV Size   <5.46 TiB / not usable <2.59 MiB
  Allocatable   yes (but full)
  PE Size   4.00 MiB
  Total PE  1430791
  Free PE   0
  Allocated PE  1430791
  PV UUID   WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI

(sorry for wall o' lvm)

-- dan








> Thanks!
> sage
>
>
> > -- dan
> >
> >
> > >
> > > sage
> > >
> > >
> > > >
> > > > Thanks!
> > > >
> > > > Dan
> > > >
> > > > [1]
> > > >
> > > > == osd.48 ==
> > > >
> > > >   [block]
> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > >
> > > >   type  block
> > > >   osd id48
> > > >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
> > > >   cluster name  ceph
> > > >   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > >   db device /dev/sdaa1
> > > >   encrypted 0
> > > >   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
> > > >   cephx lockbox secret
> > > >   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> > > >   block device
> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > > >   crush device classNone
> > > >
> > > >   [  db]/dev/sdaa1
> > > >
> > > >   PARTUUID  3381a121-1c1b-4e45-a986-c1871c363edc
> > > >
> > > >
> > > >
> > > > [2]
> > > >-11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
> > > >-10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> > > > bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> > > > d/ceph-48
> > > > -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> > > > 

[ceph-users] pool has many more objects per pg than average

2018-06-07 Thread Torin Woltjer
I have a Ceph cluster and its status shows this warning: "pool libvirt-pool has
many more objects per pg than average (too few pgs?)". This pool currently has
by far the most data stored in it. The other pools are underutilized for now,
but they are intended to take on a much greater role than libvirt-pool. Once
the other pools begin storing more objects, will this warning go away, or am I
misunderstanding the message?
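
For context, and as a sketch only: the warning compares each pool's
objects-per-PG ratio with the cluster-wide average and fires when a pool
exceeds it by the mon_pg_warn_max_object_skew factor (default 10). So it can
indeed fade as the other pools fill up, but the usual fix is to give the heavy
pool more PGs. Something along these lines shows the numbers involved (the
pg_num values are placeholders):

  ceph df                                   # per-pool object counts
  ceph osd pool get libvirt-pool pg_num     # current PG count
  ceph osd pool set libvirt-pool pg_num 256
  ceph osd pool set libvirt-pool pgp_num 256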



Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Alfredo Deza
On Thu, Jun 7, 2018 at 10:40 AM, Dan van der Ster  wrote:
> On Thu, Jun 7, 2018 at 4:31 PM Alfredo Deza  wrote:
>>
>> On Thu, Jun 7, 2018 at 10:23 AM, Dan van der Ster  
>> wrote:
>> > Hi all,
>> >
>> > We have an intermittent issue where bluestore osds sometimes fail to
>> > start after a reboot.
>> > The osds all fail the same way [see 2], failing to open the superblock.
>> > One one particular host, there are 24 osds and 4 SSDs partitioned for
>> > the block.db's. The affected non-starting OSDs all have block.db on
>> > the same ssd (/dev/sdaa).
>> >
>> > The osds are all running 12.2.5 on latest centos 7.5 and were created
>> > by ceph-volume lvm, e.g. see [1].
>> >
>> > This seems like a permissions or similar issue related to the
>> > ceph-volume tooling.
>> > Any clues how to debug this further?
>>
>> There are useful logs in both /var/log/ceph/ceph-volume.log and
>> /var/log/ceph-volume-systemd.log
>>
>> This is odd because the OSD is attempting to start, so the logs will
>> just mention everything was done accordingly and then the OSD was
>> started at the end of the (successful) setup.
>>
>
> ceph-volume.log:
> [2018-06-07 16:32:58,265][ceph_volume.devices.lvm.activate][DEBUG ]
> Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
> with encryption: False
> [2018-06-07 16:32:58,266][ceph_volume.process][INFO  ] Running
> command: /usr/sbin/blkid -t
> PARTUUID="3381a121-1c1b-4e45-a986-c1871c363edc" -o device
> [2018-06-07 16:32:58,386][ceph_volume.process][INFO  ] stdout /dev/sdaa1
> [2018-06-07 16:32:58,387][ceph_volume.devices.lvm.activate][DEBUG ]
> Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
> with encryption: False
> [2018-06-07 16:32:58,387][ceph_volume.process][INFO  ] Running
> command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> --path /var/lib/ceph/osd/ceph-48
> [2018-06-07 16:32:58,441][ceph_volume.process][INFO  ] Running
> command: ln -snf
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> /var/lib/ceph/osd/ceph-48/block
> [2018-06-07 16:32:58,445][ceph_volume.process][INFO  ] Running
> command: chown -R ceph:ceph /dev/dm-4
> [2018-06-07 16:32:58,448][ceph_volume.process][INFO  ] Running
> command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
> [2018-06-07 16:32:58,451][ceph_volume.process][INFO  ] Running
> command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
> [2018-06-07 16:32:58,453][ceph_volume.process][INFO  ] Running
> command: chown -R ceph:ceph /dev/sdaa1
> [2018-06-07 16:32:58,456][ceph_volume.process][INFO  ] Running
> command: systemctl enable
> ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> [2018-06-07 16:32:58,580][ceph_volume.process][INFO  ] Running
> command: systemctl start ceph-osd@48
>
>
> ceph-volume-systemd.log reports nothing for attempts at ceph-volume
> lvm activate. But if we start the ceph-volume unit manually we get:
>
> [2018-06-07 16:38:16,952][systemd][INFO  ] raw systemd input received:
> lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> [2018-06-07 16:38:16,952][systemd][INFO  ] parsed sub-command: lvm,
> extra data: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> [2018-06-07 16:38:16,977][ceph_volume.process][INFO  ] Running
> command: ceph-volume lvm trigger
> 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> [2018-06-07 16:38:17,961][ceph_volume.process][INFO  ] stdout Running
> command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> --path /var/lib/ceph/osd/ceph-48
> Running command: ln -snf
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> /var/lib/ceph/osd/ceph-48/block
> Running command: chown -R ceph:ceph /dev/dm-4
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
> Running command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
> Running command: chown -R ceph:ceph /dev/sdaa1
> Running command: systemctl enable
> ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> Running command: systemctl start ceph-osd@48
> --> ceph-volume lvm activate successful for osd ID: 48
> [2018-06-07 16:38:17,968][systemd][INFO  ] successfully trggered
> activation for: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>
>
> Everything *looks* correct... but the OSDs always fail
> `bluefs_types.h: 54: FAILED assert(pos <= end)`

Right, I was somewhat expecting that. The "activate" process just
makes sure everything is in place for the OSD to start. This means
that
you can just go ahead and try to start the OSD directly without
calling activate again (although it doesn't matter, it is idempotent).

One thing you could do is poke around /var/lib/ceph/osd/ceph-48/ and
see if there are any issues there. You mentioned permissions, and
ceph-volume
does try to have correct permissions before starting the OSD, 
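
For example, a quick way to check the ownership and symlinks by hand (osd.48
and the device names are taken from the listing earlier in the thread):

ls -ln /var/lib/ceph/osd/ceph-48/
ls -ln /var/lib/ceph/osd/ceph-48/block /var/lib/ceph/osd/ceph-48/block.db
ls -ln /dev/dm-4 /dev/sdaa1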

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
> >
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > > Hi all,
> > >
> > > We have an intermittent issue where bluestore osds sometimes fail to
> > > start after a reboot.
> > > The osds all fail the same way [see 2], failing to open the superblock.
> > > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > > the block.db's. The affected non-starting OSDs all have block.db on
> > > the same ssd (/dev/sdaa).
> > >
> > > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > > by ceph-volume lvm, e.g. see [1].
> > >
> > > This seems like a permissions or similar issue related to the
> > > ceph-volume tooling.
> > > Any clues how to debug this further?
> >
> > I take it the OSDs start up if you try again?
> 
> Hey.
> No, they don't. For example, we do this `ceph-volume lvm activate 48
> 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and its the same
> mount failure every time.

That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
try to start the OSD with logging enabled?  (debug bluefs = 20,
debug bluestore = 20)
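
For example, something along these lines should capture the failing bluefs
mount with full logging (the osd id is taken from the thread; the user/group
and log path are just assumptions, adjust as needed):

ceph-osd -f --id 48 --setuser ceph --setgroup ceph \
    --debug-bluefs 20 --debug-bluestore 20 \
    --log-file /var/log/ceph/ceph-osd.48.bluefs-debug.log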

Thanks!
sage


> -- dan
> 
> 
> >
> > sage
> >
> >
> > >
> > > Thanks!
> > >
> > > Dan
> > >
> > > [1]
> > >
> > > == osd.48 ==
> > >
> > >   [block]
> > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > >
> > >   type  block
> > >   osd id48
> > >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
> > >   cluster name  ceph
> > >   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > >   db device /dev/sdaa1
> > >   encrypted 0
> > >   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
> > >   cephx lockbox secret
> > >   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> > >   block device
> > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> > >   crush device classNone
> > >
> > >   [  db]/dev/sdaa1
> > >
> > >   PARTUUID  3381a121-1c1b-4e45-a986-c1871c363edc
> > >
> > >
> > >
> > > [2]
> > >-11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
> > >-10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> > > bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> > > d/ceph-48
> > > -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> > > /var/lib/ceph/osd/ceph-48/block type kernel
> > > -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
> > > /var/lib/ceph/osd/ceph-48/block) open path /v
> > > ar/lib/ceph/osd/ceph-48/block
> > > -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
> > > /var/lib/ceph/osd/ceph-48/block) open size 60
> > > 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> > > -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
> > > bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
> > > 134217728 meta 0.01 kv 0.99 data 0
> > > -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
> > > /var/lib/ceph/osd/ceph-48/block type kernel
> > > -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
> > > /var/lib/ceph/osd/ceph-48/block) open path /v
> > > ar/lib/ceph/osd/ceph-48/block
> > > -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
> > > /var/lib/ceph/osd/ceph-48/block) open size 60
> > > 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> > > -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
> > > add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
> > > ck size 5589 GB
> > > -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
> > >  0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
> > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
> > > BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
> > > s/bluestore/bluefs_types.h: In function 'static void
> > > bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
> > > *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
> > > 16:12:16.139666
> > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
> > > 54: FAILED assert(pos <= end)
> > >
> > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > > luminous (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x110) [0x55eb3b597780]
> > >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
> > > [0x55eb3b52db36]
> > >  3: (BlueFS::_open_super()+0xfe) 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 4:31 PM Alfredo Deza  wrote:
>
> On Thu, Jun 7, 2018 at 10:23 AM, Dan van der Ster  wrote:
> > Hi all,
> >
> > We have an intermittent issue where bluestore osds sometimes fail to
> > start after a reboot.
> > The osds all fail the same way [see 2], failing to open the superblock.
> > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > the block.db's. The affected non-starting OSDs all have block.db on
> > the same ssd (/dev/sdaa).
> >
> > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > by ceph-volume lvm, e.g. see [1].
> >
> > This seems like a permissions or similar issue related to the
> > ceph-volume tooling.
> > Any clues how to debug this further?
>
> There are useful logs in both /var/log/ceph/ceph-volume.log and
> /var/log/ceph-volume-systemd.log
>
> This is odd because the OSD is attempting to start, so the logs will
> just mention everything was done accordingly and then the OSD was
> started at the end of the (successful) setup.
>

ceph-volume.log:
[2018-06-07 16:32:58,265][ceph_volume.devices.lvm.activate][DEBUG ]
Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
with encryption: False
[2018-06-07 16:32:58,266][ceph_volume.process][INFO  ] Running
command: /usr/sbin/blkid -t
PARTUUID="3381a121-1c1b-4e45-a986-c1871c363edc" -o device
[2018-06-07 16:32:58,386][ceph_volume.process][INFO  ] stdout /dev/sdaa1
[2018-06-07 16:32:58,387][ceph_volume.devices.lvm.activate][DEBUG ]
Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
with encryption: False
[2018-06-07 16:32:58,387][ceph_volume.process][INFO  ] Running
command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
--path /var/lib/ceph/osd/ceph-48
[2018-06-07 16:32:58,441][ceph_volume.process][INFO  ] Running
command: ln -snf
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
/var/lib/ceph/osd/ceph-48/block
[2018-06-07 16:32:58,445][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /dev/dm-4
[2018-06-07 16:32:58,448][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
[2018-06-07 16:32:58,451][ceph_volume.process][INFO  ] Running
command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
[2018-06-07 16:32:58,453][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /dev/sdaa1
[2018-06-07 16:32:58,456][ceph_volume.process][INFO  ] Running
command: systemctl enable
ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:32:58,580][ceph_volume.process][INFO  ] Running
command: systemctl start ceph-osd@48


ceph-volume-systemd.log reports nothing for attempts at ceph-volume
lvm activate. But if we start the ceph-volume unit manually we get:

[2018-06-07 16:38:16,952][systemd][INFO  ] raw systemd input received:
lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:16,952][systemd][INFO  ] parsed sub-command: lvm,
extra data: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:16,977][ceph_volume.process][INFO  ] Running
command: ceph-volume lvm trigger
48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:17,961][ceph_volume.process][INFO  ] stdout Running
command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
--path /var/lib/ceph/osd/ceph-48
Running command: ln -snf
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
/var/lib/ceph/osd/ceph-48/block
Running command: chown -R ceph:ceph /dev/dm-4
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
Running command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
Running command: chown -R ceph:ceph /dev/sdaa1
Running command: systemctl enable
ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
Running command: systemctl start ceph-osd@48
--> ceph-volume lvm activate successful for osd ID: 48
[2018-06-07 16:38:17,968][systemd][INFO  ] successfully trggered
activation for: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5


Everything *looks* correct... but the OSDs always fail
`bluefs_types.h: 54: FAILED assert(pos <= end)`

-- dan






>
> >
> > Thanks!
> >
> > Dan
> >
> > [1]
> >
> > == osd.48 ==
> >
> >   [block]
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >
> >   type  block
> >   osd id48
> >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
> >   cluster name  ceph
> >   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   db device /dev/sdaa1
> >   encrypted 0
> >   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
> >   cephx lockbox secret
> >   block uuid 

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
On Thu, Jun 7, 2018 at 4:33 PM Sage Weil  wrote:
>
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
> > Hi all,
> >
> > We have an intermittent issue where bluestore osds sometimes fail to
> > start after a reboot.
> > The osds all fail the same way [see 2], failing to open the superblock.
> > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > the block.db's. The affected non-starting OSDs all have block.db on
> > the same ssd (/dev/sdaa).
> >
> > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > by ceph-volume lvm, e.g. see [1].
> >
> > This seems like a permissions or similar issue related to the
> > ceph-volume tooling.
> > Any clues how to debug this further?
>
> I take it the OSDs start up if you try again?

Hey.
No, they don't. For example, we do this `ceph-volume lvm activate 48
99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and it's the same
mount failure every time.

-- dan


>
> sage
>
>
> >
> > Thanks!
> >
> > Dan
> >
> > [1]
> >
> > == osd.48 ==
> >
> >   [block]
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >
> >   type  block
> >   osd id48
> >   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
> >   cluster name  ceph
> >   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   db device /dev/sdaa1
> >   encrypted 0
> >   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
> >   cephx lockbox secret
> >   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >   block device
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >   crush device classNone
> >
> >   [  db]/dev/sdaa1
> >
> >   PARTUUID  3381a121-1c1b-4e45-a986-c1871c363edc
> >
> >
> >
> > [2]
> >-11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
> >-10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> > bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> > d/ceph-48
> > -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> > /var/lib/ceph/osd/ceph-48/block type kernel
> > -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
> > /var/lib/ceph/osd/ceph-48/block) open path /v
> > ar/lib/ceph/osd/ceph-48/block
> > -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
> > /var/lib/ceph/osd/ceph-48/block) open size 60
> > 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> > -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
> > bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
> > 134217728 meta 0.01 kv 0.99 data 0
> > -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
> > /var/lib/ceph/osd/ceph-48/block type kernel
> > -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
> > /var/lib/ceph/osd/ceph-48/block) open path /v
> > ar/lib/ceph/osd/ceph-48/block
> > -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
> > /var/lib/ceph/osd/ceph-48/block) open size 60
> > 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> > -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
> > add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
> > ck size 5589 GB
> > -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
> >  0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
> > BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
> > s/bluestore/bluefs_types.h: In function 'static void
> > bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
> > *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
> > 16:12:16.139666
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
> > 54: FAILED assert(pos <= end)
> >
> >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > luminous (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x110) [0x55eb3b597780]
> >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
> > [0x55eb3b52db36]
> >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
> >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
> >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
> >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
> >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
> >  8: (main()+0x2d07) [0x55eb3af2f977]
> >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
> >  10: (()+0x4b7033) [0x55eb3afce033]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to 

Re: [ceph-users] rbd map hangs

2018-06-07 Thread Tracy Reed
On Thu, Jun 07, 2018 at 02:05:31AM PDT, Ilya Dryomov spake thusly:
> > find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> 
> Can you paste the entire output of that command?
> 
> Which kernel are you running on the client box?

Kernel is Linux cpu04.mydomain.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 
19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

output is:

# find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
/sys/kernel/debug/ceph/b2b00aae-f00d-41b4-a29b-58859aa41375.client31276017/osdmap
epoch 232455
flags
pool 0 pg_num 2500 (4095) read_tier -1 write_tier -1
pool 2 pg_num 512 (511) read_tier -1 write_tier -1
pool 3 pg_num 128 (127) read_tier -1 write_tier -1
pool 4 pg_num 100 (127) read_tier -1 write_tier -1
osd0     10.0.5.3:6801     54%   (exists, up)      100%
osd1     10.0.5.3:6812     57%   (exists, up)      100%
osd2     (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd3     10.0.5.4:6812     50%   (exists, up)      100%
osd4     (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd5     (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd6     10.0.5.9:6861     37%   (exists, up)      100%
osd7     10.0.5.9:6876     28%   (exists, up)      100%
osd8     10.0.5.9:6864     43%   (exists, up)      100%
osd9     10.0.5.9:6836     30%   (exists, up)      100%
osd10    10.0.5.9:6820     22%   (exists, up)      100%
osd11    10.0.5.9:6844     54%   (exists, up)      100%
osd12    10.0.5.9:6803     43%   (exists, up)      100%
osd13    10.0.5.9:6826     41%   (exists, up)      100%
osd14    10.0.5.9:6853     37%   (exists, up)      100%
osd15    10.0.5.9:6872     36%   (exists, up)      100%
osd16    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd17    10.0.5.9:6812     44%   (exists, up)      100%
osd18    10.0.5.9:6817     48%   (exists, up)      100%
osd19    10.0.5.9:6856     33%   (exists, up)      100%
osd20    10.0.5.9:6808     46%   (exists, up)      100%
osd21    10.0.5.9:6871     41%   (exists, up)      100%
osd22    10.0.5.9:6816     49%   (exists, up)      100%
osd23    10.0.5.9:6823     56%   (exists, up)      100%
osd24    10.0.5.9:6800     54%   (exists, up)      100%
osd25    10.0.5.9:6848     54%   (exists, up)      100%
osd26    10.0.5.9:6840     37%   (exists, up)      100%
osd27    10.0.5.9:6883     69%   (exists, up)      100%
osd28    10.0.5.9:6833     39%   (exists, up)      100%
osd29    10.0.5.9:6809     38%   (exists, up)      100%
osd30    10.0.5.9:6829     51%   (exists, up)      100%
osd31    10.0.5.11:6828    47%   (exists, up)      100%
osd32    10.0.5.11:6848    25%   (exists, up)      100%
osd33    10.0.5.11:6802    56%   (exists, up)      100%
osd34    10.0.5.11:6840    35%   (exists, up)      100%
osd35    10.0.5.11:6856    32%   (exists, up)      100%
osd36    10.0.5.11:6832    26%   (exists, up)      100%
osd37    10.0.5.11:6868    42%   (exists, up)      100%
osd38    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd39    10.0.5.11:6812    52%   (exists, up)      100%
osd40    10.0.5.11:6864    44%   (exists, up)      100%
osd41    10.0.5.11:6801    25%   (exists, up)      100%
osd42    10.0.5.11:6872    39%   (exists, up)      100%
osd43    10.0.5.13:6809    38%   (exists, up)      100%
osd44    10.0.5.11:6844    47%   (exists, up)      100%
osd45    10.0.5.11:6816    20%   (exists, up)      100%
osd46    10.0.5.3:6800     58%   (exists, up)      100%
osd47    10.0.5.2:6808     43%   (exists, up)      100%
osd48    10.0.5.2:6804     44%   (exists, up)      100%
osd49    10.0.5.2:6812     44%   (exists, up)      100%
osd50    10.0.5.2:6800     47%   (exists, up)      100%
osd51    10.0.5.4:6808     43%   (exists, up)      100%
osd52    10.0.5.12:6815    41%   (exists, up)      100%
osd53    10.0.5.11:6820    24%   (up)              100%
osd54    10.0.5.11:6876    34%   (exists, up)      100%
osd55    10.0.5.11:6836    48%   (exists, up)      100%
osd56    10.0.5.11:6824    31%   (exists, up)      100%
osd57    10.0.5.11:6860    48%   (exists, up)      100%
osd58    10.0.5.11:6852    35%   (exists, up)      100%
osd59    10.0.5.11:6800    42%   (exists, up)      100%
osd60    10.0.5.11:6880    58%   (exists, up)      100%
osd61    10.0.5.3:6803     52%   (exists, up)      100%
osd62    10.0.5.12:6800    42%   (exists, up)      100%
osd63    10.0.5.12:6819    46%   (exists, up)      100%
osd64    10.0.5.12:6809    44%   (exists, up)      100%
osd65    10.0.5.13:6800    44%   (exists, up)      100%
osd66    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd67    10.0.5.13:6808    50%   (exists, up)      100%
osd68    10.0.5.4:6804     41%   (exists, up)      100%
osd69    10.0.5.4:6800     39%   (exists, up)      100%
osd70    10.0.5.13:6804    42%   (exists, up)      100%
osd71    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd72    (unknown sockaddr family 0)    0%   (doesn't exist)   100%
osd73    10.0.5.16:6825    92%   (exists, up)      100%
osd74    10.0.5.16:6846   100%   (exists, up)      100%
osd75    10.0.5.16:6811    98%   (exists, up)      100%
osd76    10.0.5.16:6815   100%   (exists, up)      100%
osd77    10.0.5.16:6835    93%

Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Sage Weil
On Thu, 7 Jun 2018, Dan van der Ster wrote:
> Hi all,
> 
> We have an intermittent issue where bluestore osds sometimes fail to
> start after a reboot.
> The osds all fail the same way [see 2], failing to open the superblock.
> One one particular host, there are 24 osds and 4 SSDs partitioned for
> the block.db's. The affected non-starting OSDs all have block.db on
> the same ssd (/dev/sdaa).
> 
> The osds are all running 12.2.5 on latest centos 7.5 and were created
> by ceph-volume lvm, e.g. see [1].
> 
> This seems like a permissions or similar issue related to the
> ceph-volume tooling.
> Any clues how to debug this further?

I take it the OSDs start up if you try again?

sage


> 
> Thanks!
> 
> Dan
> 
> [1]
> 
> == osd.48 ==
> 
>   [block]
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> 
>   type  block
>   osd id48
>   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
>   cluster name  ceph
>   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   db device /dev/sdaa1
>   encrypted 0
>   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
>   cephx lockbox secret
>   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>   block device
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   crush device classNone
> 
>   [  db]/dev/sdaa1
> 
>   PARTUUID  3381a121-1c1b-4e45-a986-c1871c363edc
> 
> 
> 
> [2]
>-11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
>-10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> d/ceph-48
> -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> /var/lib/ceph/osd/ceph-48/block type kernel
> -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
> /var/lib/ceph/osd/ceph-48/block) open path /v
> ar/lib/ceph/osd/ceph-48/block
> -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
> /var/lib/ceph/osd/ceph-48/block) open size 60
> 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
> bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
> 134217728 meta 0.01 kv 0.99 data 0
> -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
> /var/lib/ceph/osd/ceph-48/block type kernel
> -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
> /var/lib/ceph/osd/ceph-48/block) open path /v
> ar/lib/ceph/osd/ceph-48/block
> -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
> /var/lib/ceph/osd/ceph-48/block) open size 60
> 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
> add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
> ck size 5589 GB
> -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
>  0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
> BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
> s/bluestore/bluefs_types.h: In function 'static void
> bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
> *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
> 16:12:16.139666
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
> 54: FAILED assert(pos <= end)
> 
>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x55eb3b597780]
>  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
> [0x55eb3b52db36]
>  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
>  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
>  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
>  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
>  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
>  8: (main()+0x2d07) [0x55eb3af2f977]
>  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
>  10: (()+0x4b7033) [0x55eb3afce033]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Alfredo Deza
On Thu, Jun 7, 2018 at 10:23 AM, Dan van der Ster  wrote:
> Hi all,
>
> We have an intermittent issue where bluestore osds sometimes fail to
> start after a reboot.
> The osds all fail the same way [see 2], failing to open the superblock.
> One one particular host, there are 24 osds and 4 SSDs partitioned for
> the block.db's. The affected non-starting OSDs all have block.db on
> the same ssd (/dev/sdaa).
>
> The osds are all running 12.2.5 on latest centos 7.5 and were created
> by ceph-volume lvm, e.g. see [1].
>
> This seems like a permissions or similar issue related to the
> ceph-volume tooling.
> Any clues how to debug this further?

There are useful logs in both /var/log/ceph/ceph-volume.log and
/var/log/ceph-volume-systemd.log

This is odd because the OSD is attempting to start, so the logs will
just mention everything was done accordingly and then the OSD was
started at the end of the (successful) setup.
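
For example, one way to pull out the relevant lines (the osd id and fsid are
taken from the osd.48 listing in this thread):

grep -Ei 'ceph-48|error' /var/log/ceph/ceph-volume.log | tail -n 50
grep 99fd8e36 /var/log/ceph-volume-systemd.log | tail -n 50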


>
> Thanks!
>
> Dan
>
> [1]
>
> == osd.48 ==
>
>   [block]
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>
>   type  block
>   osd id48
>   cluster fsid  dd535a7e-4647-4bee-853d-f34112615f81
>   cluster name  ceph
>   osd fsid  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   db device /dev/sdaa1
>   encrypted 0
>   db uuid   3381a121-1c1b-4e45-a986-c1871c363edc
>   cephx lockbox secret
>   block uuidFQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>   block device
> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>   crush device classNone
>
>   [  db]/dev/sdaa1
>
>   PARTUUID  3381a121-1c1b-4e45-a986-c1871c363edc
>
>
>
> [2]
>-11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
>-10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> d/ceph-48
> -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> /var/lib/ceph/osd/ceph-48/block type kernel
> -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
> /var/lib/ceph/osd/ceph-48/block) open path /v
> ar/lib/ceph/osd/ceph-48/block
> -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
> /var/lib/ceph/osd/ceph-48/block) open size 60
> 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
> bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
> 134217728 meta 0.01 kv 0.99 data 0
> -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
> /var/lib/ceph/osd/ceph-48/block type kernel
> -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
> /var/lib/ceph/osd/ceph-48/block) open path /v
> ar/lib/ceph/osd/ceph-48/block
> -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
> /var/lib/ceph/osd/ceph-48/block) open size 60
> 01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
> -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
> add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
> ck size 5589 GB
> -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
>  0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
> BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
> s/bluestore/bluefs_types.h: In function 'static void
> bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
> *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
> 16:12:16.139666
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
> 54: FAILED assert(pos <= end)
>
>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x55eb3b597780]
>  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
> [0x55eb3b52db36]
>  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
>  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
>  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
>  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
>  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
>  8: (main()+0x2d07) [0x55eb3af2f977]
>  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
>  10: (()+0x4b7033) [0x55eb3afce033]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume: failed to activate some bluestore osds

2018-06-07 Thread Dan van der Ster
Hi all,

We have an intermittent issue where bluestore osds sometimes fail to
start after a reboot.
The osds all fail the same way [see 2], failing to open the superblock.
On one particular host, there are 24 osds and 4 SSDs partitioned for
the block.db's. The affected non-starting OSDs all have block.db on
the same ssd (/dev/sdaa).

The osds are all running 12.2.5 on latest centos 7.5 and were created
by ceph-volume lvm, e.g. see [1].

This seems like a permissions or similar issue related to the
ceph-volume tooling.
Any clues how to debug this further?

Thanks!

Dan

[1]

== osd.48 ==

  [block]       /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5

      type                      block
      osd id                    48
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
      db device                 /dev/sdaa1
      encrypted                 0
      db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
      cephx lockbox secret
      block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
      block device              /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
      crush device class        None

  [  db]        /dev/sdaa1

      PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc



[2]
   -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
   -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
d/ceph-48
-9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
/var/lib/ceph/osd/ceph-48/block type kernel
-8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
/var/lib/ceph/osd/ceph-48/block) open path /v
ar/lib/ceph/osd/ceph-48/block
-7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
/var/lib/ceph/osd/ceph-48/block) open size 60
01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
-6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
134217728 meta 0.01 kv 0.99 data 0
-5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
/var/lib/ceph/osd/ceph-48/block type kernel
-4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
/var/lib/ceph/osd/ceph-48/block) open path /v
ar/lib/ceph/osd/ceph-48/block
-3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
/var/lib/ceph/osd/ceph-48/block) open size 60
01172414464 (0x57541c0, 5589 GB) block_size 4096 (4096 B) rotational
-2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
ck size 5589 GB
-1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
 0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
s/bluestore/bluefs_types.h: In function 'static void
bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
*, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
16:12:16.139666
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
54: FAILED assert(pos <= end)

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x55eb3b597780]
 2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
[0x55eb3b52db36]
 3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
 4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
 5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
 6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
 7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
 8: (main()+0x2d07) [0x55eb3af2f977]
 9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
 10: (()+0x4b7033) [0x55eb3afce033]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs when one of three nodes is down

2018-06-07 Thread Grigori Frolov
Thank you, Burkhard. There are really 3 active MDSs, so this is a 
misconfiguration.
I will try a standby one.

kind regards, Grigori.


From: ceph-users  on behalf of Burkhard Linke
Sent: June 7, 2018 18:59
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] I/O hangs when one of three nodes is down

Hi,


On 06/07/2018 02:52 PM, Фролов Григорий wrote:
> ?Hello. Could you please help me troubleshoot the issue.
>
> I have 3 nodes in a cluster.
*snipsnap*

> root@testk8s2:~# ceph -s
>  cluster 0bcc00ec-731a-4734-8d76-599f70f06209
>   health HEALTH_ERR
>  80 pgs degraded
>  80 pgs stuck degraded
>  80 pgs stuck unclean
>  80 pgs stuck undersized
>  80 pgs undersized
>  recovery 1075/3225 objects degraded (33.333%)
>  mds rank 2 has failed
>  mds cluster is degraded
>  1 mons down, quorum 1,2 testk8s2,testk8s3
>   monmap e1: 3 mons at 
> {testk8s1=10.105.6.116:6789/0,testk8s2=10.105.6.117:6789/0,testk8s3=10.105.6.118:6789/0}
>  election epoch 120, quorum 1,2 testk8s2,testk8s3
>fsmap e14084: 2/3/3 up {0=testk8s2=up:active,1=testk8s3=up:active}, 1 
> failed
>   osdmap e9939: 3 osds: 2 up, 2 in; 80 remapped pgs
>  flags sortbitwise,require_jewel_osds
>pgmap v17491: 80 pgs, 3 pools, 194 MB data, 1075 objects
>  1530 MB used, 16878 MB / 18408 MB avail
>  1075/3225 objects degraded (33.333%)
>80 active+undersized+degraded
I assume all your MDS servers are active MDS. In this setup the
filesystem metadata is shared between the hosts. If one of the MDS is
not available, the part of the filesystem served by that MDS is not
accessible.

You can prevent this kind of lock up by using standby MDS server that
will become active as soon as one of the active MDS server fails.

To keep the failover time as low as possible, you can configure a
standby MDS to be associated with a running active MDS. You would
require one standby MDS for each active MDS, but failover time would be
minimal. An unassociated MDS can replace any failed active MDS, but need
to load its inode cache before becoming active. This may take some time.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs when one of three nodes is down

2018-06-07 Thread Burkhard Linke

Hi,


On 06/07/2018 02:52 PM, Фролов Григорий wrote:

?Hello. Could you please help me troubleshoot the issue.

I have 3 nodes in a cluster.

*snipsnap*


root@testk8s2:~# ceph -s
 cluster 0bcc00ec-731a-4734-8d76-599f70f06209
  health HEALTH_ERR
 80 pgs degraded
 80 pgs stuck degraded
 80 pgs stuck unclean
 80 pgs stuck undersized
 80 pgs undersized
 recovery 1075/3225 objects degraded (33.333%)
 mds rank 2 has failed
 mds cluster is degraded
 1 mons down, quorum 1,2 testk8s2,testk8s3
  monmap e1: 3 mons at 
{testk8s1=10.105.6.116:6789/0,testk8s2=10.105.6.117:6789/0,testk8s3=10.105.6.118:6789/0}
 election epoch 120, quorum 1,2 testk8s2,testk8s3
   fsmap e14084: 2/3/3 up {0=testk8s2=up:active,1=testk8s3=up:active}, 1 
failed
  osdmap e9939: 3 osds: 2 up, 2 in; 80 remapped pgs
 flags sortbitwise,require_jewel_osds
   pgmap v17491: 80 pgs, 3 pools, 194 MB data, 1075 objects
 1530 MB used, 16878 MB / 18408 MB avail
 1075/3225 objects degraded (33.333%)
   80 active+undersized+degraded
I assume all your MDS daemons are active MDSs. In this setup the
filesystem metadata is shared between the hosts. If one of the MDSs is
not available, the part of the filesystem served by that MDS is not
accessible.


You can prevent this kind of lockup by using a standby MDS server that
will become active as soon as one of the active MDS servers fails.


To keep the failover time as low as possible, you can configure a
standby MDS to be associated with a running active MDS. You would
require one standby MDS for each active MDS, but the failover time would
be minimal. An unassociated standby MDS can replace any failed active
MDS, but it needs to load its inode cache before becoming active. This
may take some time.
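
As a rough illustration (Jewel-era ceph.conf syntax; the daemon name and rank
below are placeholders, so check the CephFS documentation for your release),
a standby-replay daemon pinned to one active rank would look something like:

[mds.standby-a]
    mds standby replay = true
    mds standby for rank = 0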


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs when one of three nodes is down

2018-06-07 Thread Grigori Frolov
root@testk8s1:~# ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 12 flags hashpspool 
crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 
object_hash rjenkins pg_num 8 pgp_num 8 last_change 11 flags hashpspool 
stripe_width 0

I haven't changed any crush rule. Here's the dump:

root@testk8s1:~# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]

kind regards, Grigori

From: Paul Emmerich
Sent: June 7, 2018 18:26
To: Grigori Frolov
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] I/O hangs when one of three nodes is down

can you post your pool configuration?

 ceph osd pool ls detail

and the crush rule if you modified it.


Paul

2018-06-07 14:52 GMT+02:00 Фролов Григорий 
mailto:gfro...@naumen.ru>>:

?Hello. Could you please help me troubleshoot the issue.

I have 3 nodes in a cluster.


ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.02637 root default
-2 0.00879 host testk8s3
 0 0.00879 osd.0  up  1.0  1.0
-3 0.00879 host testk8s1
 1 0.00879 osd.1down0  1.0
-4 0.00879 host testk8s2
 2 0.00879 osd.2  up  1.0  1.0

Each node runs ceph-osd, ceph-mon and ceph-mds.
So when all nodes are up, everything is fine.
When any of 3 nodes goes down, no matter if it shuts down gracefully or in a 
hard way, remaining nodes cannot read or write to the catalog where ceph 
storage is mounted. They also cannot unmount the volume. Every process touching 
the catalog just hangs forever, going into uninterruptible sleep. When I try to 
strace that process, strace hangs too. When the failed node goes up, each hung 
process finishes successfully.

So what could cause the issue?

root@testk8s2:~# ps -eo pid,stat,cmd | grep ls
 3700 Dls --color=auto /mnt/db
 3997 S+   grep --color=auto ls
root@testk8s2:~# strace -p 3700&
[1] 4020
root@testk8s2:~# strace: Process 3700 attached

root@testk8s2:~# ps -eo pid,stat,cmd | grep strace
 4020 Sstrace -p 3700

root@testk8s2:~# umount /mnt&
[2] 4084
root@testk8s2:~# ps -eo pid,state,cmd | grep umount
 4084 D umount /mnt

root@testk8s2:~# ceph -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
root@testk8s2:~# ceph -s
cluster 0bcc00ec-731a-4734-8d76-599f70f06209
 health HEALTH_ERR
80 pgs degraded
80 pgs stuck degraded
80 pgs stuck unclean
80 pgs stuck undersized
80 pgs undersized
recovery 1075/3225 objects degraded (33.333%)
mds rank 2 has failed
mds cluster is degraded
1 mons down, quorum 1,2 testk8s2,testk8s3
 monmap e1: 3 mons at 
{testk8s1=10.105.6.116:6789/0,testk8s2=10.105.6.117:6789/0,testk8s3=10.105.6.118:6789/0}
election epoch 120, quorum 1,2 testk8s2,testk8s3
  fsmap e14084: 2/3/3 up {0=testk8s2=up:active,1=testk8s3=up:active}, 1 
failed
 osdmap e9939: 3 osds: 2 up, 2 in; 80 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v17491: 80 pgs, 3 pools, 194 MB data, 1075 objects
1530 MB used, 16878 MB / 18408 MB avail
1075/3225 objects degraded (33.333%)
  80 active+undersized+degraded


Thanks.


kind regards, Grigori

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-07 Thread Nick Fisk
So I've recompiled a 12.2.5 ceph-osd binary with the fix included in 
https://github.com/ceph/ceph/pull/22396

The OSD has restarted as expected and the PG is now active+clean. Success
so far.

What's the best method to clean up the stray snapshot on OSD.46? I'm guessing 
using the object-store-tool, but not sure if I want to
clean the clone metadata or try and remove the actual snapshot object.
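
In case it helps as a starting point, here is a rough, untested sketch of the
object-store-tool route for a Filestore OSD. Assumptions: osd.46 is stopped
first, the journal sits at the default path, and the object name is taken
verbatim from the log above; double-check against the ceph-objectstore-tool
man page and keep a copy of the OSD data before removing anything.

systemctl stop ceph-osd@46

# list the clones of the affected object in pg 1.2ca; the stray snapshot
# should show up with "snapid": 28 (0x1c)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --journal-path /var/lib/ceph/osd/ceph-46/journal \
    --pgid 1.2ca --op list rbd_data.0c4c14238e1f29.000bf479

# then feed the JSON line for that clone back in with "remove"
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --journal-path /var/lib/ceph/osd/ceph-46/journal \
    '<json-object-spec-from-the-list-output>' remove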

-Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 17:22
To: 'ceph-users' 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

So, from what I can see I believe this issue is being caused by one of the 
remaining OSD's acting for this PG containing a snapshot
file of the object

/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__head_F930D2CA_
_1
/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1

Both the OSD which crashed and the other acting OSD don't have this "1c" 
snapshot file. Is the solution to use objectstore tool to
remove this "1c" snapshot object and then allow thigs to backfill?


-Original Message-
From: ceph-users  On Behalf Of Nick Fisk
Sent: 05 June 2018 16:43
To: 'ceph-users' 
Subject: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PG's. I 
can a repair on both and one of them (1.2ca) seems
to have triggered this issue.


Snippet of log from two different OSD's. 1st one is from the original OSD 
holding the PG, 2nd is from where the OSD was marked out
and it was trying to be recovered to:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap _recovered_to:, omap_complete:false,
error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size: 
0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: 
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.0
00bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {}, 
snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:t rue, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:false,
error:false))]) v3  885+0+0 (695790480 0 0) 0x557d4d180b40 con 0x5
57d4b1a9000
 0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
ObjectContextRef, bool , ObjectStore::Transaction*)'
thread 7febb2954700 time 2018-06-05 15:15:46.019130
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())







-4> 2018-06-05 16:28:59.560140 7fcd7b655700  5 -- 
[2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557447510800 :6829 

Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread mj




On 06/07/2018 01:45 PM, Wido den Hollander wrote:


Removing cluster network is enough. After the restart the OSDs will not
publish a cluster network in the OSDMap anymore.

You can keep the public network in ceph.conf and can even remove that
after you removed the 10.10.x.x addresses from the system(s).

Wido


Thanks for the info, Wido. :-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] I/O hangs when one of three nodes is down

2018-06-07 Thread Фролов Григорий
Hello. Could you please help me troubleshoot the issue.

I have 3 nodes in a cluster.


ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.02637 root default
-2 0.00879 host testk8s3
 0 0.00879 osd.0  up  1.0  1.0
-3 0.00879 host testk8s1
 1 0.00879 osd.1down0  1.0
-4 0.00879 host testk8s2
 2 0.00879 osd.2  up  1.0  1.0

Each node runs ceph-osd, ceph-mon and ceph-mds.
So when all nodes are up, everything is fine.
When any of the 3 nodes goes down, no matter whether it shuts down gracefully or
is killed hard, the remaining nodes cannot read or write to the directory where
the ceph storage is mounted. They also cannot unmount the volume. Every process
touching that directory just hangs forever, going into uninterruptible sleep.
When I try to strace such a process, strace hangs too. When the failed node
comes back up, each hung process finishes successfully.

So what could cause the issue?

root@testk8s2:~# ps -eo pid,stat,cmd | grep ls
 3700 Dls --color=auto /mnt/db
 3997 S+   grep --color=auto ls
root@testk8s2:~# strace -p 3700&
[1] 4020
root@testk8s2:~# strace: Process 3700 attached

root@testk8s2:~# ps -eo pid,stat,cmd | grep strace
 4020 Sstrace -p 3700

root@testk8s2:~# umount /mnt&
[2] 4084
root@testk8s2:~# ps -eo pid,state,cmd | grep umount
 4084 D umount /mnt

root@testk8s2:~# ceph -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
root@testk8s2:~# ceph -s
cluster 0bcc00ec-731a-4734-8d76-599f70f06209
 health HEALTH_ERR
80 pgs degraded
80 pgs stuck degraded
80 pgs stuck unclean
80 pgs stuck undersized
80 pgs undersized
recovery 1075/3225 objects degraded (33.333%)
mds rank 2 has failed
mds cluster is degraded
1 mons down, quorum 1,2 testk8s2,testk8s3
 monmap e1: 3 mons at 
{testk8s1=10.105.6.116:6789/0,testk8s2=10.105.6.117:6789/0,testk8s3=10.105.6.118:6789/0}
election epoch 120, quorum 1,2 testk8s2,testk8s3
  fsmap e14084: 2/3/3 up {0=testk8s2=up:active,1=testk8s3=up:active}, 1 
failed
 osdmap e9939: 3 osds: 2 up, 2 in; 80 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v17491: 80 pgs, 3 pools, 194 MB data, 1075 objects
1530 MB used, 16878 MB / 18408 MB avail
1075/3225 objects degraded (33.333%)
  80 active+undersized+degraded


Thanks.


kind regards, Grigori
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Wido den Hollander



On 06/07/2018 01:39 PM, mj wrote:
> Hi,
> 
> Please allow me to ask one more question:
> 
> We currently have a seperated network: cluster on 10.10.x.x and public
> on 192.168.x.x.
> 
> I would like to migrate all network to 192.168.x.x setup, which would
> give us 2*10G.
> 
> Is simply changing the cluster network in ceph.conf and a restart enough?
> 

Removing cluster network is enough. After the restart the OSDs will not
publish a cluster network in the OSDMap anymore.

You can keep the public network in ceph.conf and can even remove that
after you removed the 10.10.x.x addresses from the system(s).
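
For example (a sketch; the networks are the ones from the mail above, and the
restart should be done one node at a time, however you normally roll them):

# /etc/ceph/ceph.conf, [global]: drop the "cluster network = 10.10.x.x/yy"
# line; "public network = 192.168.x.x/yy" can stay for now
systemctl restart ceph-osd.target    # per node
ceph osd dump | grep '^osd\.'        # the cluster/back addresses should now
                                     # match the public ones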

Wido

> (of course provided that the 2*10G bond is configured correctly, etc)
> 
> Or is there more to it?
> 
> MJ
> 
> On 06/07/2018 01:13 PM, Paul Emmerich wrote:
>> We also build almost all of our clusters with a single Ceph network.
>> 2x10 Gbit/s is almost never the bottleneck.
>>
>>
>>
>> Paul
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Sage Weil
On Wed, 6 Jun 2018, Caspar Smit wrote:
> Hi all,
> 
> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node
> to it.
> 
> osd-max-backfills is at the default 1 so backfilling didn't go very fast
> but that doesn't matter.
> 
> Once it started backfilling everything looked ok:
> 
> ~300 pgs in backfill_wait
> ~10 pgs backfilling (~number of new osd's)
> 
> But i noticed the degraded objects increasing a lot. I presume a pg that is
> in backfill_wait state doesn't accept any new writes anymore? Hence
> increasing the degraded objects?
> 
> So far so good, but once a while i noticed a random OSD flapping (they come
> back up automatically). This isn't because the disk is saturated but a
> driver/controller/kernel incompatibility which 'hangs' the disk for a short
> time (scsi abort_task error in syslog). Investigating further i noticed
> this was already the case before the node expansion.
> 
> These OSD's flapping results in lots of pg states which are a bit worrying:
> 
>  109 active+remapped+backfill_wait
>  80  active+undersized+degraded+remapped+backfill_wait
>  51  active+recovery_wait+degraded+remapped
>  41  active+recovery_wait+degraded
>  27  active+recovery_wait+undersized+degraded+remapped
>  14  active+undersized+remapped+backfill_wait
>  4   active+undersized+degraded+remapped+backfilling
> 
> I think the recovery_wait is more important then the backfill_wait, so i
> like to prioritize these because the recovery_wait was triggered by the
> flapping OSD's

Just a note: this is fixed in mimic.  Previously, we would choose the 
highest-priority PG to start recovery on at the time, but once recovery 
had started, the appearance of a new PG with a higher priority (e.g., 
because it finished peering after the others) wouldn't preempt/cancel the 
other PG's recovery, so you would get behavior like the above.

Mimic implements that preemption, so you should not see behavior like 
this.  (If you do, then the function that assigns a priority score to a 
PG needs to be tweaked.)

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Paul Emmerich
We also build almost all of our clusters with a single Ceph network.
2x10 Gbit/s is almost never the bottleneck.



Paul

2018-06-07 11:05 GMT+02:00 Wido den Hollander :

>
>
> On 06/07/2018 10:56 AM, Kevin Olbrich wrote:
> > Realy?
> >
> > I always thought that splitting the replication network is best practice.
> > Keeping everything in the same IPv6 network is much easier.
> >
>
> No, there is no big benefit unless your usecase (which 99% isn't) asks
> for it.
>
> Keep it simple, one network to run the cluster on. Less components which
> can fail or complicate things.
>
> Wido
>
> > Thank you.
> >
> > Kevin
> >
> > 2018-06-07 10:44 GMT+02:00 Wido den Hollander  > >:
> >
> >
> >
> > On 06/07/2018 09:46 AM, Kevin Olbrich wrote:
> > > Hi!
> > >
> > > When we installed our new luminous cluster, we had issues with the
> > > cluster network (setup of mon's failed).
> > > We moved on with a single network setup.
> > >
> > > Now I would like to set the cluster network again but the cluster
> is in
> > > use (4 nodes, 2 pools, VMs).
> >
> > Why? What is the benefit from having the cluster network? Back in the
> > old days when 10Gb was expensive you would run public on 1G and
> cluster
> > on 10G.
> >
> > Now with 2x10Gb going into each machine, why still bother with
> managing
> > two networks?
> >
> > I really do not see the benefit.
> >
> > I manage multiple 1000 ~ 2500 OSD clusters all running with all their
> > nodes on IPv6 and 2x10Gb in a single network. That works just fine.
> >
> > Try to keep the network simple and do not overcomplicate it.
> >
> > Wido
> >
> > > What happens if I set the cluster network on one of the nodes and
> reboot
> > > (maintenance, updates, etc.)?
> > > Will the node use both networks as the other three nodes are not
> > > reachable there?
> > >
> > > Both the MONs and OSDs have IPs in both networks, routing is not
> needed.
> > > This cluster is dualstack but we set ms_bind_ipv6 = true.
> > >
> > > Thank you.
> > >
> > > Kind regards
> > > Kevin
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Piotr Dałek

On 18-06-07 12:43 PM, Caspar Smit wrote:

> Hi Piotr,
>
> Thanks for your answer! I've set nodown and now it doesn't mark any OSDs as
> down anymore :)
>
> Any tips for when everything is recovered/backfilled and I unset the nodown
> flag?

When all pgs are reported as active+clean (any scrubbing/deep scrubbing is
fine).

> Shutdown all activity to the ceph cluster before that moment?

That depends on whether it's actually possible in your case and on what load
your users generate - you have to decide.
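
For example, something along these lines (adjust the check to taste):

  # wait until no pgs are degraded/misplaced/peering any more
  watch -n 30 'ceph -s'
  # then drop the flag again
  ceph osd unset nodown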


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Caspar Smit
Hi Piotr,

Thanks for your answer! I've set nodown and now it doesn't mark any OSDs
as down anymore :)

Any tips for when everything is recovered/backfilled and I unset the nodown
flag? Should I shut down all activity to the Ceph cluster before that moment?

If I unset the nodown flag and a lot of OSDs suddenly get flagged down, it
seems better to have no client activity at all, so that when the OSDs come
back up there is nothing left to do (no pending recovery/backfilling).
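
(If pausing client activity turns out to be necessary, one blunt option -
assuming a maintenance window is acceptable - would be something like:

  ceph osd set pause      # blocks all client reads/writes cluster-wide
  ceph osd unset nodown
  # ... wait for things to settle ...
  ceph osd unset pause
)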

Kind regards,
Caspar

2018-06-07 8:47 GMT+02:00 Piotr Dałek :

> On 18-06-06 09:29 PM, Caspar Smit wrote:
>
>> Hi all,
>>
>> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a
>> node to it.
>>
>> osd-max-backfills is at the default 1 so backfilling didn't go very fast
>> but that doesn't matter.
>>
>> Once it started backfilling everything looked ok:
>>
>> ~300 pgs in backfill_wait
>> ~10 pgs backfilling (~number of new osd's)
>>
>> But i noticed the degraded objects increasing a lot. I presume a pg that
>> is in backfill_wait state doesn't accept any new writes anymore? Hence
>> increasing the degraded objects?
>>
>> So far so good, but once a while i noticed a random OSD flapping (they
>> come back up automatically). This isn't because the disk is saturated but a
>> driver/controller/kernel incompatibility which 'hangs' the disk for a short
>> time (scsi abort_task error in syslog). Investigating further i noticed
>> this was already the case before the node expansion.
>> These OSD's flapping results in lots of pg states which are a bit
>> worrying:
>>
>>   109 active+remapped+backfill_wait
>>   80  active+undersized+degraded+remapped+backfill_wait
>>   51  active+recovery_wait+degraded+remapped
>>   41  active+recovery_wait+degraded
>>   27  active+recovery_wait+undersized+degraded+remapped
>>   14  active+undersized+remapped+backfill_wait
>>   4   active+undersized+degraded+remapped+backfilling
>>
>> I think the recovery_wait is more important then the backfill_wait, so i
>> like to prioritize these because the recovery_wait was triggered by the
>> flapping OSD's
>>
> >
>
>> furthermore the undersized ones should get absolute priority or is that
>> already the case?
>>
>> I was thinking about setting "nobackfill" to prioritize recovery instead
>> of backfilling.
>> Would that help in this situation? Or am i making it even worse then?
>>
>> ps. i tried increasing the heartbeat values for the OSD's to no avail,
>> they still get flagged as down once in a while after a hiccup of the driver.
>>
>
> First of all, use "nodown" flag so osds won't be marked down automatically
> and unset it once everything backfills/recovers and settles for good --
> note that there might be lingering osd down reports, so unsetting nodown
> might cause some of problematic osds to be instantly marked as down.
>
> Second, since Luminous you can use "ceph pg force-recovery" to ask
> particular pgs to recover first, even if there are other pgs to backfill
> and/or recovery.
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovhcloud.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Wido den Hollander


On 06/07/2018 10:56 AM, Kevin Olbrich wrote:
> Realy?
> 
> I always thought that splitting the replication network is best practice.
> Keeping everything in the same IPv6 network is much easier.
> 

No, there is no big benefit unless your use case asks for it (and 99% of
use cases don't).

Keep it simple: one network to run the cluster on. Fewer components that
can fail or complicate things.

Wido

> Thank you.
> 
> Kevin
> 
> 2018-06-07 10:44 GMT+02:00 Wido den Hollander  >:
> 
> 
> 
> On 06/07/2018 09:46 AM, Kevin Olbrich wrote:
> > Hi!
> > 
> > When we installed our new luminous cluster, we had issues with the
> > cluster network (setup of mon's failed).
> > We moved on with a single network setup.
> > 
> > Now I would like to set the cluster network again but the cluster is in
> > use (4 nodes, 2 pools, VMs).
> 
> Why? What is the benefit from having the cluster network? Back in the
> old days when 10Gb was expensive you would run public on 1G and cluster
> on 10G.
> 
> Now with 2x10Gb going into each machine, why still bother with managing
> two networks?
> 
> I really do not see the benefit.
> 
> I manage multiple 1000 ~ 2500 OSD clusters all running with all their
> nodes on IPv6 and 2x10Gb in a single network. That works just fine.
> 
> Try to keep the network simple and do not overcomplicate it.
> 
> Wido
> 
> > What happens if I set the cluster network on one of the nodes and reboot
> > (maintenance, updates, etc.)?
> > Will the node use both networks as the other three nodes are not
> > reachable there?
> > 
> > Both the MONs and OSDs have IPs in both networks, routing is not needed.
> > This cluster is dualstack but we set ms_bind_ipv6 = true.
> > 
> > Thank you.
> > 
> > Kind regards
> > Kevin
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> >
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Ilya Dryomov
On Thu, Jun 7, 2018 at 5:12 AM, Tracy Reed  wrote:
>
> Hello all! I'm running luminous with old style non-bluestore OSDs. ceph
> 10.2.9 clients though, haven't been able to upgrade those yet.
>
> Occasionally I have access to rbds hang on the client such as right now.
> I tried to dd a VM image into a mapped rbd and it just hung.
>
> Then I tried to map a new rbd and that hangs also.
>
> How would I troubleshoot this? /var/log/ceph is empty, nothing in
> /var/log/messages or dmesg etc.
>
> I just discovered:
>
> find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
>
> which produces (among other seemingly innocuous things, let me know if
> anyone wants to see the rest):
>
> osd2(unknown sockaddr family 0) 0%(doesn't exist) 100%
>
> which seems suspicious.

Can you paste the entire output of that command?

Which kernel are you running on the client box?

Thanks,

Ilya


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Kevin Olbrich
Really?

I always thought that splitting the replication network is best practice.
Keeping everything in the same IPv6 network is much easier.

Thank you.

Kevin

2018-06-07 10:44 GMT+02:00 Wido den Hollander :

>
>
> On 06/07/2018 09:46 AM, Kevin Olbrich wrote:
> > Hi!
> >
> > When we installed our new luminous cluster, we had issues with the
> > cluster network (setup of mon's failed).
> > We moved on with a single network setup.
> >
> > Now I would like to set the cluster network again but the cluster is in
> > use (4 nodes, 2 pools, VMs).
>
> Why? What is the benefit from having the cluster network? Back in the
> old days when 10Gb was expensive you would run public on 1G and cluster
> on 10G.
>
> Now with 2x10Gb going into each machine, why still bother with managing
> two networks?
>
> I really do not see the benefit.
>
> I manage multiple 1000 ~ 2500 OSD clusters all running with all their
> nodes on IPv6 and 2x10Gb in a single network. That works just fine.
>
> Try to keep the network simple and do not overcomplicate it.
>
> Wido
>
> > What happens if I set the cluster network on one of the nodes and reboot
> > (maintenance, updates, etc.)?
> > Will the node use both networks as the other three nodes are not
> > reachable there?
> >
> > Both the MONs and OSDs have IPs in both networks, routing is not needed.
> > This cluster is dualstack but we set ms_bind_ipv6 = true.
> >
> > Thank you.
> >
> > Kind regards
> > Kevin
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Wido den Hollander


On 06/07/2018 09:46 AM, Kevin Olbrich wrote:
> Hi!
> 
> When we installed our new luminous cluster, we had issues with the
> cluster network (setup of mon's failed).
> We moved on with a single network setup.
> 
> Now I would like to set the cluster network again but the cluster is in
> use (4 nodes, 2 pools, VMs).

Why? What is the benefit from having the cluster network? Back in the
old days when 10Gb was expensive you would run public on 1G and cluster
on 10G.

Now with 2x10Gb going into each machine, why still bother with managing
two networks?

I really do not see the benefit.

I manage multiple clusters of 1000 to 2500 OSDs, all running their
nodes on IPv6 and 2x10Gb in a single network. That works just fine.

Try to keep the network simple and do not overcomplicate it.

Wido

> What happens if I set the cluster network on one of the nodes and reboot
> (maintenance, updates, etc.)?
> Will the node use both networks as the other three nodes are not
> reachable there?
> 
> Both the MONs and OSDs have IPs in both networks, routing is not needed.
> This cluster is dualstack but we set ms_bind_ipv6 = true.
> 
> Thank you.
> 
> Kind regards
> Kevin
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread Burkhard Linke

Hi,


I may be wrong, but AFAIK the cluster network is only used to bind the
corresponding functionality to the correct network interface. There's no
check for a common CIDR range or something similar in Ceph.



As long as the traffic is routable between the current public network and
the new cluster network, you should be able to transform the network
setup host by host without any downtime. We used the same strategy to
get rid of our IPoIB cluster network by defining one host as a gateway to
the cluster network and adding static routes on the other hosts during
the transition. After all hosts were processed, we simply deleted the
routes.


In the case of the original poster it should simply be a matter of defining
the cluster network in ceph.conf and restarting the OSDs, since the
cluster network interfaces are already present on all hosts.
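
A minimal sketch of what that would look like (the subnets are placeholders,
and I'm assuming systemd-managed OSDs - adapt to your deployment):

  # /etc/ceph/ceph.conf, [global] section, on every host
  public network  = 2001:db8:1::/64
  cluster network = 2001:db8:2::/64

  # then, per host:
  systemctl restart ceph-osd.target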


Regards,
Burkhard


Re: [ceph-users] Adding cluster network to running cluster

2018-06-07 Thread David Turner
Two things about this.  First, Mons do not communicate over the cluster
network.  Only OSD daemons send any traffic over that network.  Mons, MDS,
RGW, MGR, etc. all communicate over the public network.  OSDs communicate
with all clients, Mons, etc. on the public network and with each other on
the cluster network.  That is for sharing extra replicas with other OSDs,
backfilling, recovering, pinging each other to see if things are up and
running, etc.

Second, what will happen if you change the cluster network on one node and
not the others is that the other nodes will report to the Mons that every
OSD on that node is down because they won't respond to pings, and that one
node will report to the Mons that every other OSD in the cluster is down.
Making changes to the cluster network is easiest by stopping all OSDs and
starting them back up with the new settings.
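
Roughly, something like the following (assuming systemd-managed OSDs; adapt
the unit names to your deployment):

  ceph osd set noout                # avoid rebalancing while OSDs are down
  systemctl stop ceph-osd.target    # on every OSD host
  # add/adjust "cluster network = ..." in ceph.conf on every host
  systemctl start ceph-osd.target   # on every OSD host
  ceph osd unset noout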

On Thu, Jun 7, 2018 at 3:46 AM Kevin Olbrich  wrote:

> Hi!
>
> When we installed our new luminous cluster, we had issues with the cluster
> network (setup of mon's failed).
> We moved on with a single network setup.
>
> Now I would like to set the cluster network again but the cluster is in
> use (4 nodes, 2 pools, VMs).
> What happens if I set the cluster network on one of the nodes and reboot
> (maintenance, updates, etc.)?
> Will the node use both networks as the other three nodes are not reachable
> there?
>
> Both the MONs and OSDs have IPs in both networks, routing is not needed.
> This cluster is dualstack but we set ms_bind_ipv6 = true.
>
> Thank you.
>
> Kind regards
> Kevin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Update to Mimic with prior Snapshots leads to MDS damaged metadata

2018-06-07 Thread Yan, Zheng
On Thu, Jun 7, 2018 at 2:44 PM, Tobias Florek  wrote:
> Hi!
>
> Thank you for your help! The cluster is running healthily for a day now.
>
> Regarding the problem, I just checked in the release notes [1] and on
> docs.ceph.com and did not find the right invocation after an upgrade.
> Maybe that ought to be fixed.
>

We are fixing the release note. https://github.com/ceph/ceph/pull/22445

>>> [upgrade from luminous to mimic with prior cephfs snapshots]
>> The correct commands should be:
>>
>> ceph daemon  scrub_path / force recursive repair
>> ceph daemon  scrub_path '~mdsdir' force recursive
>
> [1] https://ceph.com/releases/v13-2-0-mimic-released/
> [2]
> https://www.google.com/search?q=site%3Adocs.ceph.com+scrub_path+inurl%3Amimic=site%3Adocs.ceph.com+scrub_path+inurl%3Amimic
>
> Cheers,
>  Tobias Florek


[ceph-users] Debian GPG key for Luminous

2018-06-07 Thread Steffen Winther Sørensen
Community,

Where would I find the GPG release key for Debian Luminous?

as I’m getting:

W: GPG error: http://download.ceph.com/debian-luminous stretch InRelease: The 
following signatures couldn't be verified because the public key is not 
available: NO_PUBKEY E84AC2C0460F3994

when attempting apt-get update with:

# cat /etc/apt/sources.list.d/ceph.list 
deb http://download.ceph.com/debian-luminous stretch main

TIA

/Steffen
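
PS: I assume this is the standard Ceph release key; if so, importing it
along these lines should probably fix it, but confirmation would be welcome:

  wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
  apt-get update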


Re: [ceph-users] rbd map hangs

2018-06-07 Thread ceph
Just a guess: do you have an inconsistent MTU across your network?

I have already run into your issue when the OSDs and the client were using
jumbo frames but the MON was not (or something like that)
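
A quick way to check (interface name and MON address are placeholders):

  ip link show dev eth0 | grep mtu
  # verify that jumbo frames really pass end-to-end
  # (8972 = 9000 bytes MTU minus 28 bytes of IP/ICMP headers)
  ping -M do -s 8972 <mon-ip>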


On 06/07/2018 05:12 AM, Tracy Reed wrote:
> 
> Hello all! I'm running luminous with old style non-bluestore OSDs. ceph
> 10.2.9 clients though, haven't been able to upgrade those yet. 
> 
> Occasionally I have access to rbds hang on the client such as right now.
> I tried to dd a VM image into a mapped rbd and it just hung.
> 
> Then I tried to map a new rbd and that hangs also.
> 
> How would I troubleshoot this? /var/log/ceph is empty, nothing in
> /var/log/messages or dmesg etc.
> 
> I just discovered:
> 
> find /sys/kernel/debug/ceph -type f -print -exec cat {} \;
> 
> which produces (among other seemingly innocuous things, let me know if
> anyone wants to see the rest):
> 
> osd2(unknown sockaddr family 0) 0%(doesn't exist) 100%
> 
> which seems suspicious.
> 
> rbd ls works reliably. As does create.  Cluster is healthy. 
> 
> But the processes which hung trying to access that mapped rbd appear to
> be completely unkillable. What
> else should I check?
> 
> Thanks!
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] mimic cephfs snapshot in active/standby mds env

2018-06-07 Thread Yan, Zheng
On Thu, Jun 7, 2018 at 10:04 AM, Brady Deetz  wrote:
> I've seen several mentions of stable snapshots in Mimic for cephfs in
> multi-active mds environments. I'm currently running active/standby in
> 12.2.5 with no snapshops. If I upgrade to Mimic, is there any concern with
> snapshots in an active/standby MDS environment. It seems like a silly
> question since it is considered stable for multi-mds, but better safe than
> sorry.
>

Snapshot on active/standby MDS environment is also supported.
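
(If you do upgrade, note that as far as I know snapshots still have to be
enabled explicitly; roughly - the fs name and paths are placeholders:

  ceph fs set cephfs allow_new_snaps true
  # then, on a mounted client, snapshots are just .snap directories:
  mkdir /mnt/cephfs/mydir/.snap/before-change
)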

Yan, Zheng


> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Prioritize recovery over backfilling

2018-06-07 Thread Piotr Dałek

On 18-06-06 09:29 PM, Caspar Smit wrote:

Hi all,

We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node 
to it.


osd-max-backfills is at the default 1 so backfilling didn't go very fast but 
that doesn't matter.


Once it started backfilling everything looked ok:

~300 pgs in backfill_wait
~10 pgs backfilling (~number of new osd's)

But i noticed the degraded objects increasing a lot. I presume a pg that is 
in backfill_wait state doesn't accept any new writes anymore? Hence 
increasing the degraded objects?


So far so good, but once a while i noticed a random OSD flapping (they come 
back up automatically). This isn't because the disk is saturated but a 
driver/controller/kernel incompatibility which 'hangs' the disk for a short 
time (scsi abort_task error in syslog). Investigating further i noticed this 
was already the case before the node expansion.

These OSD's flapping results in lots of pg states which are a bit worrying:

   109 active+remapped+backfill_wait
   80  active+undersized+degraded+remapped+backfill_wait
   51  active+recovery_wait+degraded+remapped
   41  active+recovery_wait+degraded
   27  active+recovery_wait+undersized+degraded+remapped
   14  active+undersized+remapped+backfill_wait
   4   active+undersized+degraded+remapped+backfilling

I think the recovery_wait is more important then the backfill_wait, so i 
like to prioritize these because the recovery_wait was triggered by the 
flapping OSD's

>
furthermore the undersized ones should get absolute priority or is that 
already the case?


I was thinking about setting "nobackfill" to prioritize recovery instead of 
backfilling.

Would that help in this situation? Or am i making it even worse then?

ps. i tried increasing the heartbeat values for the OSD's to no avail, they 
still get flagged as down once in a while after a hiccup of the driver.


First of all, use the "nodown" flag so osds won't be marked down
automatically, and unset it once everything backfills/recovers and settles
for good -- note that there might be lingering osd down reports, so
unsetting nodown might cause some of the problematic osds to be instantly
marked as down.

Second, since Luminous you can use "ceph pg force-recovery" to ask
particular pgs to recover first, even if there are other pgs to backfill
and/or recover.
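
A rough sketch of that sequence (the pg IDs below are examples only):

  ceph osd set nodown
  # find pgs stuck in recovery_wait and push them to the front of the queue
  ceph pg dump pgs_brief | grep recovery_wait
  ceph pg force-recovery 1.2f 1.41
  # later, once everything is back to active+clean:
  ceph osd unset nodown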


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com


Re: [ceph-users] Update to Mimic with prior Snapshots leads to MDS damaged metadata

2018-06-07 Thread Tobias Florek
Hi!

Thank you for your help! The cluster is running healthily for a day now.

Regarding the problem, I just checked in the release notes [1] and on
docs.ceph.com and did not find the right invocation after an upgrade.
Maybe that ought to be fixed.

>> [upgrade from luminous to mimic with prior cephfs snapshots]
> The correct commands should be:
>
> ceph daemon  scrub_path / force recursive repair
> ceph daemon  scrub_path '~mdsdir' force recursive
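
(The daemon name placeholder seems to have been lost above; presumably the
full form is something like the following, with mds.<id> being the active
MDS - treat the exact spelling as an assumption:

  ceph daemon mds.<id> scrub_path / force recursive repair
  ceph daemon mds.<id> scrub_path '~mdsdir' force recursive
)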

[1] https://ceph.com/releases/v13-2-0-mimic-released/
[2]
https://www.google.com/search?q=site%3Adocs.ceph.com+scrub_path+inurl%3Amimic=site%3Adocs.ceph.com+scrub_path+inurl%3Amimic

Cheers,
 Tobias Florek


Re: [ceph-users] Stop scrubbing

2018-06-07 Thread Wido den Hollander



On 06/06/2018 08:32 PM, Joe Comeau wrote:
> When I am upgrading from filestore to bluestore
> or any other server maintenance for a short time
> (ie high I/O while rebuilding)
>  
> ceph osd set noout
> ceph osd set noscrub
> ceph osd set nodeep-scrub
>  
> when finished
>  
> ceph osd unset noscrub
> ceph osd unset nodeep-scrub
> ceph osd unset noout
>  
> again only while working on a server/cluster for a short time
> 

Keep in mind that since Jewel, OSDs involved in recovery will not start a
new (deep-)scrub, so there is no need to set these flags.

OSDs which are not performing recovery will scrub as usual, which is fine.

Wido
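
For completeness, the "brutal" approach mentioned below boils down to
something like this (the osd id is just an example):

  # find pgs that are currently scrubbing and their primary osd
  ceph pg dump pgs_brief | grep scrub
  # briefly mark that primary osd down; it rejoins right away and the
  # running scrub is aborted
  ceph osd down 12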

> 
 Alexandru Cucu  6/6/2018 1:51 AM >>>
> Hi,
> 
> The only way I know is pretty brutal: list all the PGs with a
> scrubbing process, get the primary OSD and mark it as down. The
> scrubbing process will stop.
> Make sure you set the noout, norebalance and norecovery flags so you
> don't add even more load to your cluster.
> 
> On Tue, Jun 5, 2018 at 11:41 PM Marc Roos  wrote:
>>
>>
>> Is it possible to stop the current running scrubs/deep-scrubs?
>>
>> http://tracker.ceph.com/issues/11202
>>
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 