Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Dan Jakubiec
To add a little color here... we started an rsync last night to copy about 4TB 
worth of files to CephFS.  Paused it this morning because CephFS was 
unresponsive on the machine (e.g. can't cat a file from the filesystem).

Been waiting about 3 hours for the log jam to clear.  Slow requests have 
steadily decreased but still can't cat a file.

Seems like there should be something throttling the rsync operation to prevent 
the queues from backing up so far.  Is this a configuration problem or a bug?

From reading the Ceph docs, this seems to be the most telling:

mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 
1 stray files

[Ref: http://docs.ceph.com/docs/master/cephfs/cache-size-limits/]

"Be aware that the cache limit is not a hard limit. Potential bugs in the 
CephFS client or MDS or misbehaving applications might cause the MDS to exceed 
its cache size. The  mds_health_cache_threshold configures the cluster health 
warning message so that operators can investigate why the MDS cannot shrink its 
cache."

Any suggestions?

Thanks,

-- Dan



> On Dec 5, 2017, at 10:07, Reed Dier  wrote:
> 
> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD 
> backed CephFS pool.
> 
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running 
> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and 
> clients.
> 
>> $ ceph versions
>> {
>> "mon": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 3
>> },
>> "mgr": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 3
>> },
>> "osd": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 74
>> },
>> "mds": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 2
>> },
>> "overall": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 82
>> }
>> }
> 
>>  
>> HEALTH_ERR
>> 1 MDSs report oversized cache; 1 MDSs have many clients failing to respond to
>> cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub flag(s) set;
>> application not enabled on 1 pool(s); 242 slow requests are blocked > 32 sec;
>> 769378 stuck requests are blocked > 4096 sec
>> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by 
>> clients, 1 stray files
>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache 
>> pressure
>> mdsdb(mds.0): Many clients (37) failing to respond to cache 
>> pressureclient_count: 37
>> MDS_TRIM 1 MDSs behind on trimming
>> mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30, 
>> num_segments: 36252
>> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
>> REQUEST_SLOW 242 slow requests are blocked > 32 sec
>> 236 ops are blocked > 2097.15 sec
>> 3 ops are blocked > 1048.58 sec
>> 2 ops are blocked > 524.288 sec
>> 1 ops are blocked > 32.768 sec
>> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
>> 91 ops are blocked > 67108.9 sec
>> 121258 ops are blocked > 33554.4 sec
>> 308189 ops are blocked > 16777.2 sec
>> 251586 ops are blocked > 8388.61 sec
>> 88254 ops are blocked > 4194.3 sec
>> osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
>> osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
>> osd.13 has stuck requests > 67108.9 sec
> 
> This is across 8 nodes, holding 3x 8TB HDD’s each, all backed by Intel P3600 
> NVMe drives for journaling.
> Removed SSD OSD’s for brevity.
> 
>> $ ceph osd tree
>> ID  CLASS WEIGHT    TYPE NAME              STATUS REWEIGHT PRI-AFF
>> -13       87.28799  root ssd
>>  -1      174.51500  root default
>> -10      174.51500      rack default.rack2
>> -55       43.62000          chassis node2425
>>  -2       21.81000              host node24
>>   0   hdd   7.26999                  osd.0      up  1.0 1.0
>>   8   hdd   7.26999                  osd.8      up  1.0 1.0
>>  16   hdd   7.26999                  osd.16     up  1.0 1.0
>>  -3       21.81000              host node25
>>   1   hdd   7.26999                  osd.1      up  1.0 1.0
>>   9   hdd   7.26999                  osd.9      up  1.0 1.0
>>  17   hdd   7.26999                  osd.17     up  1.0 1.0
>> -56       43.63499          chassis node2627
>>  -4       21.81999              host node26
>>   2   hdd   7.27499                  osd.2      up  1.0 1.0
>>  10   hdd   7.26999                  osd.10     up  1.0 1.0
>>  18   hdd   7.27499                  osd.18

[ceph-users] How can we repair OSD leveldb?

2016-08-17 Thread Dan Jakubiec
Hello, we have a Ceph cluster with 8 OSDs that recently lost power to all 8 
machines.  We've managed to recover the XFS filesystems on 7 of the machines, 
but the OSD service is only starting on 1 of them.

The other 5 machines all have complaints similar to the following:

2016-08-17 09:32:15.549588 7fa2f4666800 -1 
filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : Corruption: 6 
missing files; e.g.: /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb

How can we repair the leveldb to allow the OSDs to startup?  

Thanks,

-- Dan J
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can we repair OSD leveldb?

2016-08-17 Thread Dan Jakubiec
Hi Wido,

Thank you for the response:

> On Aug 17, 2016, at 16:25, Wido den Hollander  wrote:
> 
> 
>> Op 17 augustus 2016 om 17:44 schreef Dan Jakubiec :
>> 
>> 
>> Hello, we have a Ceph cluster with 8 OSD that recently lost power to all 8 
>> machines.  We've managed to recover the XFS filesystems on 7 of the 
>> machines, but the OSD service is only starting on 1 of them.
>> 
>> The other 5 machines all have complaints similar to the following:
>> 
>>  2016-08-17 09:32:15.549588 7fa2f4666800 -1 
>> filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : Corruption: 
>> 6 missing files; e.g.: /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb
>> 
>> How can we repair the leveldb to allow the OSDs to startup?  
>> 
> 
> My first question would be: How did this happen?
> 
> What hardware are you using underneath? Is there a RAID controller which is 
> not flushing properly? Since this should not happen during a power failure.
> 

Each OSD drive is connected to an onboard hardware RAID controller and 
configured in RAID 0 mode as individual virtual disks.  The RAID controller is 
an LSI 3108.

I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine) did 
not survive the power outage.  

We did have some problems with the stock Ubuntu xfs_repair (3.1.9) seg 
faulting, which eventually we overcame by building a newer version of 
xfs_repair (4.7.0).  But it did finally repair clean.

We actually have some different errors on other OSDs.  A few of them are 
failing with "Missing map in load_pgs" errors.  But generally speaking it 
appears to be missing files of various types causing different kinds of 
failures.

I'm really nervous now about the OSD's inability to start with any 
inconsistencies and no repair utilities (that I can find).  Any advice on how 
to recover?

> I don't know the answer to your question, but lost files are not good.
> 
> You might find them in a lost+found directory if XFS repair worked?
> 

Sadly this directory is empty.

-- Dan

> Wido
> 
>> Thanks,
>> 
>> -- Dan J
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados Java support for rados_lock_exclusive()

2016-08-24 Thread Dan Jakubiec
Hello, 

Is anyone planning to implement support for Rados locks in the Java API anytime 
soon?

Thanks,

-- Dan J

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados Java support for rados_lock_exclusive()

2016-08-25 Thread Dan Jakubiec
Thanks Wido, I will have a look at it late next week.

-- Dan

> On Aug 25, 2016, at 00:23, Wido den Hollander  wrote:
> 
> Hi Dan,
> 
> Not on my list currently. I think it's not that difficult, but I never got 
> around to maintaining rados-java and keep up with librados.
> 
> You are more then welcome to send a Pull Request though! 
> https://github.com/ceph/rados-java/pulls
> 
> Wido
> 
>> Op 24 augustus 2016 om 21:58 schreef Dan Jakubiec :
>> 
>> 
>> Hello, 
>> 
>> Is anyone planning to implement support for Rados locks in the Java API 
>> anytime soon?
>> 
>> Thanks,
>> 
>> -- Dan J
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Dan Jakubiec
Thanks Wido.  Reed and I have been working together to try to restore this 
cluster for about 3 weeks now.  I have been accumulating a number of failure 
modes that I am hoping to share with the Ceph group soon, but have been holding 
off a bit until we see the full picture clearly so that we can provide some 
succinct observations.

We know that losing 6 of 8 OSDs was definitely going to result in data loss, so 
I think we are resigned to that.  What has been difficult for us is that there 
have been many steps in the rebuild process that seem to get stuck and need our 
intervention.  But it is not 100% obvious what interventions we should be applying.

My very over-simplified hope was this:

1. We would remove the corrupted OSDs from the cluster.
2. We would replace them with new OSDs.
3. Ceph would figure out that a lot of PGs were lost.
4. We would "agree and say okay -- lose the objects/files".
5. The cluster would use what remains and return to a working state.

I feel we have done something wrong along the way, and at this point we are 
trying to figure out how to do step #4 completely.  We are about to follow the 
steps to "mark unfound lost", which makes sense to me... but I'm not sure what 
to do about all the other inconsistencies.

What procedure do we need to follow to just tell Ceph "those PGs are lost, 
let's move on"?

===

A very quick history of what we did to get here:

1. 8 OSDs lost power simultaneously.
2. 2 OSDs came back without issues.
3. 1 OSD wouldn't start (various assertion failures), but we were able to copy its
   PGs to a new OSD as follows:
   3.1. ceph-objectstore-tool "export"
   3.2. ceph osd crush rm osd.N
   3.3. ceph auth del osd.N
   3.4. ceph osd rm osd.N
   3.5. Create new OSD from scratch (it got a new OSD ID)
   3.6. ceph-objectstore-tool "import"
4. The remaining 5 OSDs were corrupt beyond repair (could not export, mostly due
   to missing leveldb files after xfs_repair).  We redeployed them as follows:
   4.1. ceph osd crush rm osd.N
   4.2. ceph auth del osd.N
   4.3. ceph osd rm osd.N
   4.4. Create new OSD from scratch (it got the same OSD ID as the old OSD)

All the new OSDs from #4.4 ended up getting the same OSD ID as the original 
OSD.  Don't know if that is part of the problem?  It seems like doing the 
"crush rm" should have advised the cluster correctly, but perhaps not?

Where did we go wrong in the recovery process?

Thank you!

-- Dan

> On Sep 1, 2016, at 00:18, Wido den Hollander  wrote:
> 
> 
>> Op 31 augustus 2016 om 23:21 schreef Reed Dier :
>> 
>> 
>> Multiple XFS corruptions, multiple leveldb issues. Looked to be result of 
>> write cache settings which have been adjusted now.
>> 
> 
> That is bad news, really bad.
> 
>> You’ll see below that there are tons of PG’s in bad states, and it was 
>> slowly but surely bringing the number of bad PGs down, but it seems to have 
>> hit a brick wall with this one slow request operation.
>> 
> 
> No, you have more issues. You can 17 PGs which are incomplete, a few 
> down+incomplete.
> 
> Without those PGs functioning (active+X) your MDS will probably not work.
> 
> Take a look at: 
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> 
> Make sure you go to HEALTH_WARN at first, in HEALTH_ERR the MDS will never 
> come online.
> 
> Wido
> 
>>> ceph -s
>>> cluster []
>>> health HEALTH_ERR
>>>292 pgs are stuck inactive for more than 300 seconds
>>>142 pgs backfill_wait
>>>135 pgs degraded
>>>63 pgs down
>>>80 pgs incomplete
>>>199 pgs inconsistent
>>>2 pgs recovering
>>>5 pgs recovery_wait
>>>1 pgs repair
>>>132 pgs stale
>>>160 pgs stuck inactive
>>>132 pgs stuck stale
>>>71 pgs stuck unclean
>>>128 pgs undersized
>>>1 requests are blocked > 32 sec
>>>recovery 5301381/46255447 objects degraded (11.461%)
>>>recovery 6335505/46255447 objects misplaced (13.697%)
>>>recovery 131/20781800 unfound (0.001%)
>>>14943 scrub errors
>>>mds cluster is degraded
>>> monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
>>>election epoch 262, quorum 0,1,2 core,dev,db
>>>  fsmap e3627: 1/1/1 up {0=core=up:replay}
>>> osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
>>>flags sortbitwise
>>>  pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
>>>8998 GB used, 50598 GB / 59596 GB avail
>>>5301381/46255447 objects degraded (11.461%)
>>>6335505/46255447 objects misplaced (13.697%)
>>>131/20781800 unfound (0.001%)
>>> 209 active+clean
>>> 170 active+clean+inconsistent
>>> 112 stale+active+clean
>>>  74 undersized+degraded+remapped+wait_backfill+peered
>>>  63 down+incomplete
>>>  48 active+undersized+degraded+remapped+wait_backfill
>>>  19 stale+active+clean+inc

Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Dan Jakubiec
Thanks you for all the help Wido:

> On Sep 1, 2016, at 14:03, Wido den Hollander  wrote:
> 
> You have to mark those OSDs as lost and also force create the incomplete PGs.
> 

This might be the root of our problems.  We didn't mark the parent OSD as 
"lost" before we removed it.  Now ceph won't let us mark it as lost (and it is 
no longer in the OSD tree):

djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
osd.8 is not down or doesn't exist


djakubiec@dev:~$ ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2  7.27489     host node24
 1  7.27489         osd.1        up  1.0  1.0
-3  7.27489     host node25
 2  7.27489         osd.2        up  1.0  1.0
-4  7.27489     host node26
 3  7.27489         osd.3        up  1.0  1.0
-5  7.27489     host node27
 4  7.27489         osd.4        up  1.0  1.0
-6  7.27489     host node28
 5  7.27489         osd.5        up  1.0  1.0
-7  7.27489     host node29
 6  7.27489         osd.6        up  1.0  1.0
-8  7.27539     host node30
 9  7.27539         osd.9        up  1.0  1.0
-9  7.27489     host node31
 7  7.27489         osd.7        up  1.0  1.0

BUT, even though OSD 8 no longer exists, I still see lots of references to OSD 8 
in various dumps and queries.

Interestingly, we do still see weird entries in the CRUSH map (should I do 
something about these?):

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 device8
device 9 osd.9

I then tried on all 80 incomplete PGs:

ceph pg force_create_pg 

The 80 PGs moved to "creating" for a few minutes but then all went back to 
"incomplete".

Is there some way to force individual PGs to be marked as "lost"?

Thanks!

-- Dan


> But I think you have lost so many objects that the cluster is beyond a point 
> of repair honestly.
> 
> Wido
> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to abandon PGs that are stuck in "incomplete"?

2016-09-02 Thread Dan Jakubiec
Re-packaging this question which was buried in a larger, less-specific thread 
from a couple of days ago.  Hoping this will be more useful here.

We have been working on restoring our Ceph cluster after losing a large number 
of OSDs.  We have all PGs active now except for 80 PGs that are stuck in the 
"incomplete" state.  These PGs are referencing OSD.8 which we removed 2 weeks 
ago due to corruption.

We would like to abandon the "incomplete" PGs as they are not restorable.  We 
have tried the following:

1. Per the docs, we made sure min_size on the corresponding pools was set to 1.
   This did not clear the condition.
2. Ceph would not let us issue "ceph osd lost N" because OSD.8 had already been
   removed from the cluster.
3. We also tried "ceph pg force_create_pg X" on all the PGs.  The 80 PGs moved to
   "creating" for a few minutes but then all went back to "incomplete".

How do we abandon these PGs to allow recovery to continue?  Is there some way 
to force individual PGs to be marked as "lost"?




Some miscellaneous data below:

djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
osd.8 is not down or doesn't exist


djakubiec@dev:~$ ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2  7.27489     host node24
 1  7.27489         osd.1        up  1.0  1.0
-3  7.27489     host node25
 2  7.27489         osd.2        up  1.0  1.0
-4  7.27489     host node26
 3  7.27489         osd.3        up  1.0  1.0
-5  7.27489     host node27
 4  7.27489         osd.4        up  1.0  1.0
-6  7.27489     host node28
 5  7.27489         osd.5        up  1.0  1.0
-7  7.27489     host node29
 6  7.27489         osd.6        up  1.0  1.0
-8  7.27539     host node30
 9  7.27539         osd.9        up  1.0  1.0
-9  7.27489     host node31
 7  7.27489         osd.7        up  1.0  1.0

BUT, even though OSD 8 no longer exists, I still see lots of references to OSD 8 
in various ceph dumps and queries.

Interestingly, we do still see weird entries in the CRUSH map (should I do 
something about these?):

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 device8
device 9 osd.9



And for what it is worth, here is the ceph -s:

cluster 10d47013-8c2a-40c1-9b4a-214770414234
 health HEALTH_ERR
212 pgs are stuck inactive for more than 300 seconds
93 pgs backfill_wait
1 pgs backfilling
101 pgs degraded
63 pgs down
80 pgs incomplete
89 pgs inconsistent
4 pgs recovery_wait
1 pgs repair
132 pgs stale
80 pgs stuck inactive
132 pgs stuck stale
103 pgs stuck unclean
97 pgs undersized
2 requests are blocked > 32 sec
recovery 4394354/46343776 objects degraded (9.482%)
recovery 4025310/46343776 objects misplaced (8.686%)
2157 scrub errors
mds cluster is degraded
 monmap e1: 3 mons at 
{core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
election epoch 266, quorum 0,1,2 core,dev,db
  fsmap e3627: 1/1/1 up {0=core=up:replay}
 osdmap e4293: 8 osds: 8 up, 8 in; 144 remapped pgs
flags sortbitwise
  pgmap v1866639: 744 pgs, 10 pools, 7668 GB data, 20673 kobjects
8339 GB used, 51257 GB / 59596 GB avail
4394354/46343776 objects degraded (9.482%)
4025310/46343776 objects misplaced (8.686%)
 362 active+clean
 112 stale+active+clean
  89 active+undersized+degraded+remapped+wait_backfill
  66 active+clean+inconsistent
  63 down+incomplete
  19 stale+active+clean+inconsistent
  17 incomplete
   5 active+undersized+degraded+remapped
   4 active+recovery_wait+degraded
   2 
active+undersized+degraded+remapped+inconsistent+wait_backfill
   1 stale+active+clean+scrubbing+deep+inconsistent+repair
   1 active+remapped+inconsistent+wait_backfill
   1 active+clean+scrubbing+deep
   1 active+remapped+wait_backfill
   1 active+undersized+degraded+remapped+backfilling



Thanks,

-- Dan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can someone explain the strange leftover OSD devices in CRUSH map -- renamed from osd.N to deviceN?

2016-09-02 Thread Dan Jakubiec
A while back we removed two damaged OSDs from our cluster, osd.0 and osd.8.  
They are now gone from most Ceph commands, but are still showing up in the 
CRUSH map with weird device names:

...

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 device8
device 9 osd.9

...



Can someone please explain why these are there, if they have any affect on the 
cluster, and whether or not we should remove them?

device0 and device8 do not show up anywhere else in the CRUSH map.

I am mainly asking because we are dealing with some stuck PGs (incomplete) 
which are still referencing id "8" in various places.

Otherwise, "ceph osd tree" looks how I would expect (no osd.8 and no osd.0):

djakubiec@dev:~$ ceph osd tree
ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2  7.27489     host node24
 1  7.27489         osd.1        up  1.0  1.0
-3  7.27489     host node25
 2  7.27489         osd.2        up  1.0  1.0
-4  7.27489     host node26
 3  7.27489         osd.3        up  1.0  1.0
-5  7.27489     host node27
 4  7.27489         osd.4        up  1.0  1.0
-6  7.27489     host node28
 5  7.27489         osd.5        up  1.0  1.0
-7  7.27489     host node29
 6  7.27489         osd.6        up  1.0  1.0
-8  7.27539     host node30
 9  7.27539         osd.9        up  1.0  1.0
-9  7.27489     host node31
 7  7.27489         osd.7        up  1.0  1.0
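In case it is useful, here is a hedged sketch of how one could drop those stale device entries by editing the CRUSH map directly (back up the original map first; whether removing them is actually the right fix is exactly what I'm asking above):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt and delete the "device 0 device0" and "device 8 device8" lines
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new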


Thanks,

-- Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to abandon PGs that are stuck in "incomplete"?

2016-09-03 Thread Dan Jakubiec
I think we are zero'ing in now on root cause for the stuck incomplete.  Looks 
like the common factor for all our stuck PGs is that they are all showing the 
removed OSD 8 in their "down_osds_we_would_probe" list (from "ceph pg  
query").

For reference, I found a few archived threads of other people experiencing 
similar problems in the past:

  https://www.mail-archive.com/ceph-users@lists.ceph.com/msg13985.html
  
http://ceph-users.ceph.narkive.com/jJ2DyVw7/ceph-pgs-stuck-creating-after-running-force-create-pg
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042338.html

The general consensus from those threads is that as long as 
down_osds_we_would_probe is pointing to any OSD that can't be reached, those 
PGs will remain stuck incomplete and can't be cured by force_create_pg or even 
"ceph osd lost".

Question: is there any command we can run to remove the old OSD from 
down_osds_we_would_probe?

I did try to create an new "fake" OSD.8 today (just created the OSD, but didn't 
bring it all the way up), and I was able to finally run "ceph osd lost 8".  Did 
not seem to have any impact.

If there is no command to remove the old OSD, I think our next step will be to 
bring up a new/real/empty OSD.8 and see if that will clear the log jam.  But 
seems like there should be a tool to deal with this kind of thing?
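For reference, a hedged sketch of the ceph-objectstore-tool "mark-complete" approach that is sometimes suggested for PGs stuck incomplete behind a vanished OSD (assumption: the tool in this Jewel build supports the op; it rewrites PG metadata, so take an export first and only run it on the acting primary, with that OSD stopped):

    systemctl stop ceph-osd@9                                   # example: osd.9 as the acting primary of the stuck PG
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
        --journal-path /var/lib/ceph/osd/ceph-9/journal \
        --pgid <pgid> --op export --file /root/<pgid>.export    # safety copy before touching anything
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
        --journal-path /var/lib/ceph/osd/ceph-9/journal \
        --pgid <pgid> --op mark-complete
    systemctl start ceph-osd@9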

Thanks,

-- Dan


> On Sep 2, 2016, at 15:01, Dan Jakubiec  wrote:
> 
> Re-packaging this question which was buried in a larger, less-specific thread 
> from a couple of days ago.  Hoping this will be more useful here.
> 
> We have been working on restoring our Ceph cluster after losing a large 
> number of OSDs.  We have all PGs active now except for 80 PGs that are stuck 
> in the "incomplete" state.  These PGs are referencing OSD.8 which we removed 
> 2 weeks ago due to corruption.
> 
> We would like to abandon the "incomplete" PGs as they are not restorable.  We 
> have tried the following:
> 
> 1. Per the docs, we made sure min_size on the corresponding pools was set to 1.
>    This did not clear the condition.
> 2. Ceph would not let us issue "ceph osd lost N" because OSD.8 had already been
>    removed from the cluster.
> 3. We also tried "ceph pg force_create_pg X" on all the PGs.  The 80 PGs moved
>    to "creating" for a few minutes but then all went back to "incomplete".
> 
> How do we abandon these PGs to allow recovery to continue?  Is there some way 
> to force individual PGs to be marked as "lost"?
> 
> 
> 
> 
> Some miscellaneous data below:
> 
> djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
> osd.8 is not down or doesn't exist
> 
> 
> djakubiec@dev:~$ ceph osd tree
> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 58.19960 root default
> -2  7.27489     host node24
>  1  7.27489         osd.1        up  1.0  1.0
> -3  7.27489     host node25
>  2  7.27489         osd.2        up  1.0  1.0
> -4  7.27489     host node26
>  3  7.27489         osd.3        up  1.0  1.0
> -5  7.27489     host node27
>  4  7.27489         osd.4        up  1.0  1.0
> -6  7.27489     host node28
>  5  7.27489         osd.5        up  1.0  1.0
> -7  7.27489     host node29
>  6  7.27489         osd.6        up  1.0  1.0
> -8  7.27539     host node30
>  9  7.27539         osd.9        up  1.0  1.0
> -9  7.27489     host node31
>  7  7.27489         osd.7        up  1.0  1.0
> 
> BUT, even though OSD 8 no longer exists, I still see lots of references to OSD 
> 8 in various ceph dumps and queries.
> 
> Interestingly, we do still see weird entries in the CRUSH map (should I do 
> something about these?):
> 
> # devices
> device 0 device0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 device8
> device 9 osd.9
> 
> 
> 
> And for what it is worth, here is the ceph -s:
> 
> cluster 10d47013-8c2a-40c1-9b4a-214770414234
>  health HEALTH_ERR
> 212 pgs are stuck inactive for more than 300 seconds
> 93 pgs backfill_wait
> 1 pgs backfilling
> 101 pgs degraded
> 63 pgs down
> 80 pgs incomplete
> 89 pgs inconsistent
> 4 pgs recovery_wait
> 1 pgs repair
> 132 pgs stale
> 80 pgs stuck inactive
> 132 pgs stuck stale
> 103 pgs stuck unclean
> 97 pgs undersized
> 2 requests are bl

Re: [ceph-users] OSD daemon randomly stops

2016-09-03 Thread Dan Jakubiec
Hi Samuel,

Here is another assert, but this time with debug filestore = 20.

Does this reveal anything?

2016-09-03 16:12:44.122451 7fec728c9700 20 list_by_hash_bitwise prefix 08F3
2016-09-03 16:12:44.123046 7fec728c9700 20 list_by_hash_bitwise prefix 08F30042
2016-09-03 16:12:44.123068 7fec728c9700 20 list_by_hash_bitwise prefix 08FB
2016-09-03 16:12:44.123669 7fec728c9700 20 list_by_hash_bitwise prefix 08FB00D8
2016-09-03 16:12:44.123687 7fec728c9700 20 list_by_hash_bitwise prefix 08F708EF
2016-09-03 16:12:44.123738 7fec728c9700 20 filestore(/var/lib/ceph/osd/ceph-4) 
objects: 0x7fec728c6e60
2016-09-03 16:12:44.123753 7fec728c9700 10 osd.4 pg_epoch: 5023 pg[7.80( v 
1096'91073 (727'87762,1096'91073] local-les=5023 n=31613 ec=32 les/c/f 
5023/5023/0 5022/5022/4987) [9,4] r=1 lpr=5022 pi=4984-502
1/17 luod=0'0 crt=1096'91073 lcod 0'0 active] be_scan_list scanning 25 objects 
deeply
2016-09-03 16:12:44.123803 7fec728c9700 10 filestore(/var/lib/ceph/osd/ceph-4) 
stat 7.80_head/#7:0100377b:::119e202.:head# = 0 (size 11644)
2016-09-03 16:12:44.123810 7fec728c9700 15 filestore(/var/lib/ceph/osd/ceph-4) 
getattrs 7.80_head/#7:0100377b:::119e202.:head#
2016-09-03 16:12:44.123865 7fec728c9700 20 filestore(/var/lib/ceph/osd/ceph-4) 
fgetattrs 132 getting '_'
2016-09-03 16:12:44.123876 7fec728c9700 20 filestore(/var/lib/ceph/osd/ceph-4) 
fgetattrs 132 getting '_parent'
2016-09-03 16:12:44.123880 7fec728c9700 20 filestore(/var/lib/ceph/osd/ceph-4) 
fgetattrs 132 getting 'snapset'
2016-09-03 16:12:44.123884 7fec728c9700 20 filestore(/var/lib/ceph/osd/ceph-4) 
fgetattrs 132 getting '_layout'
2016-09-03 16:12:44.123889 7fec728c9700 10 filestore(/var/lib/ceph/osd/ceph-4) 
getattrs no xattr exists in object_map r = 0
2016-09-03 16:12:44.123890 7fec728c9700 10 filestore(/var/lib/ceph/osd/ceph-4) 
getattrs 7.80_head/#7:0100377b:::119e202.:head# = 0
2016-09-03 16:12:44.123894 7fec728c9700 10 osd.4 pg_epoch: 5023 pg[7.80( v 
1096'91073 (727'87762,1096'91073] local-les=5023 n=31613 ec=32 les/c/f 
5023/5023/0 5022/5022/4987) [9,4] r=1 lpr=5022 pi=4984-502
1/17 luod=0'0 crt=1096'91073 lcod 0'0 active] be_deep_scrub 
7:0100377b:::119e202.:head seed 4294967295
2016-09-03 16:12:44.123904 7fec728c9700 15 filestore(/var/lib/ceph/osd/ceph-4) 
read 7.80_head/#7:0100377b:::119e202.:head# 0~524288
2016-09-03 16:12:44.124020 7fec728c9700 10 filestore(/var/lib/ceph/osd/ceph-4) 
FileStore::read 7.80_head/#7:0100377b:::119e202.:head# 
0~11644/524288
2016-09-03 16:12:44.124033 7fec728c9700 15 filestore(/var/lib/ceph/osd/ceph-4) 
read 7.80_head/#7:0100377b:::119e202.:head# 11644~524288
2016-09-03 16:12:44.129766 7fec6e0c0700 -1 *** Caught signal (Aborted) **
 in thread 7fec6e0c0700 thread_name:tp_osd_recov

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x560bbe037b02]
 2: (()+0x10330) [0x7fec9b31d330]
 3: (gsignal()+0x37) [0x7fec9937fc37]
 4: (abort()+0x148) [0x7fec99383028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x265) [0x560bbe12ef85]
 6: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, 
ThreadPool::TPHandle&)+0xad2) [0x560bbdc11482]
 7: (ReplicatedPG::update_range(PG::BackfillInterval*, 
ThreadPool::TPHandle&)+0x614) [0x560bbdc11ac4]
 8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&, bool*)+0x337) 
[0x560bbdc31c87]
 9: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0x8a0) 
[0x560bbdc63160]
 10: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x355) [0x560bbdaf3555]
 11: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0xd) 
[0x560bbdb3c0dd]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x560bbe12018e]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x560bbe121070]
 14: (()+0x8184) [0x7fec9b315184]
 15: (clone()+0x6d) [0x7fec9944337d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---
   -80> 2016-09-03 16:12:44.102928 7fec728c9700 20 list_by_hash_bitwise prefix 
08B702C7
   -79> 2016-09-03 16:12:44.102953 7fec728c9700 20 list_by_hash_bitwise prefix 
08BF
   -78> 2016-09-03 16:12:44.103614 7fec728c9700 20 list_by_hash_bitwise prefix 
08BF0464
   -77> 2016-09-03 16:12:44.103675 7fec728c9700 20 list_by_hash_bitwise prefix 
087
   -76> 2016-09-03 16:12:44.103753 7fec728c9700 20 list_by_hash_bitwise prefix 
0870
   -75> 2016-09-03 16:12:44.104343 7fec728c9700 20 list_by_hash_bitwise prefix 
087B
   -74> 2016-09-03 16:12:44.104363 7fec728c9700 20 list_by_hash_bitwise prefix 
0878
   -73> 2016-09-03 16:12:44.105032 7fec728c9700 20 list_by_hash_bitwise prefix 
0878005D
   -72> 2016-09-03 16:12:44.105054 7fec728c9700 20 list_by_hash_bitwise prefix 
0874
   -71> 2016-09-03 16:12:44.105693 7fec728c9700 20 list_by_hash_bitwise prefix 
087400A0
   -70> 2016-09-03 16:12:44.105714 7fec728c9700 20 list_by_hash_bitwise prefix 
087C
   -69> 2016-09-03 16:12:44.106376 7fec728c9

Re: [ceph-users] OSD daemon randomly stops

2016-09-03 Thread Dan Jakubiec
Hi Brad, thank you very much for the response:

> On Sep 3, 2016, at 17:05, Brad Hubbard  wrote:
> 
> 
> 
> On Sun, Sep 4, 2016 at 6:21 AM, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
> 
>> 2016-09-03 16:12:44.124033 7fec728c9700 15
>> filestore(/var/lib/ceph/osd/ceph-4) read
>> 7.80_head/#7:0100377b:::119e202.:head# 11644~524288
>> 2016-09-03 16:12:44.129766 7fec6e0c0700 -1 *** Caught signal (Aborted) **
>> in thread 7fec6e0c0700 thread_name:tp_osd_recov
> 
> Can you do a comparison of this object on all replicas (md5sum might be good).
> 
> 7.80_head/#7:0100377b:::119e202.:head#
> 
> Its name on disk should be something like 119e202.__head_XXX__7 
> if
> I'm not mistaken.
> 

Looks like this PG has not yet been replicated.  It only exists on one OSD:

djakubiec@dev:~/rados$ ceph pg map 7.80
osdmap e5262 pg 7.80 (7.80) -> up [9] acting [9]

> Have you tried doing a deep scrub on this pg and checking the OSD logs for 
> scrub
> errors?
> 

I just kicked off a deep-scrub on the PG and got this:

2016-09-03 18:08:54.830432 7f7d97dc1700 -1 log_channel(cluster) log [ERR] : 
7.10 deep-scrub stat mismatch, got 28730/31606 objects, 0/0 clones, 28730/31606 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
35508027596/38535007872 bytes, 0/0 hit_set_archive bytes.
2016-09-03 18:08:54.830446 7f7d97dc1700 -1 log_channel(cluster) log [ERR] : 
7.10 deep-scrub 1 errors

Can you explain what it means?  What should I do about this?
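For reference, a hedged sketch of how such an inconsistent PG is usually inspected and repaired (pgid taken from the error above; repair asks the primary to rewrite the bad copy, so it should be used with care when the replicas themselves are suspect, as in our cluster):

    rados list-inconsistent-obj 7.10 --format=json-pretty   # which objects/shards disagree (Jewel and later)
    ceph pg repair 7.10                                     # attempt to fix the scrub errors
    ceph pg deep-scrub 7.10                                 # re-verify once the repair completes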

> 
>>-7> 2016-09-03 16:12:44.123884 7fec728c9700 20
>> filestore(/var/lib/ceph/osd/ceph-4) fgetattrs 132 getting '_layout'
>>-6> 2016-09-03 16:12:44.123889 7fec728c9700 10
>> filestore(/var/lib/ceph/osd/ceph-4) getattrs no xattr exists in object_map r
>> = 0
>>-5> 2016-09-03 16:12:44.123890 7fec728c9700 10
>> filestore(/var/lib/ceph/osd/ceph-4) getattrs
>> 7.80_head/#7:0100377b:::119e202.:head# = 0
>>   -29> 2016-09-03 16:12:44.119228 7fec728c9700 20 list_by_hash_bitwise
>> prefix 08FE
>> 7: (ReplicatedPG::update_range(PG::BackfillInterval*,
>> ThreadPool::TPHandle&)+0x614) [0x560bbdc11ac4]
>> 8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&,
>> bool*)+0x337) [0x560bbdc31c87]
> 
> This looks messed up.
> 
> Is this how it actually looks in the logs?
> 

Yes, I cut and pasted directly.  Which part looks messed up?

Thanks,

-- Dan

> -- 
> Cheers,
> Brad
> 
>> 
>> On Sep 2, 2016, at 12:25, Samuel Just  wrote:
>> 
>> Probably an EIO.  You can reproduce with debug filestore = 20 to confirm.
>> -Sam
>> 
>> On Fri, Sep 2, 2016 at 10:18 AM, Reed Dier  wrote:
>> 
>> OSD has randomly stopped for some reason. Lots of recovery processes
>> currently running on the ceph cluster. OSD log with assert below:
>> 
>> -14> 2016-09-02 11:32:38.672460 7fcf65514700  5 -- op tracker -- seq: 1147,
>> time: 2016-09-02 11:32:38.672460, event: queued_for_pg, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -13> 2016-09-02 11:32:38.672533 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672533, event: reached_pg, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -12> 2016-09-02 11:32:38.672548 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672548, event: started, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -11> 2016-09-02 11:32:38.672548 7fcf7cd58700  1 -- [].28:6800/27735 <==
>> mon.0 [].249:6789/0 60  pg_stats_ack(0 pgs tid 45) v1  4+0+0 (0 0 0)
>> 0x55a4443b1400 con 0x55a4434a4e80
>>  -10> 2016-09-02 11:32:38.672559 7fcf70d40700  1 -- [].28:6801/27735 -->
>> [].31:6801/2070838 -- osd_sub_op(unknown.0.0:0 7.d1 MIN [scrub-unreserve] v
>> 0'0 snapset=0=[]:[]) v12 -- ?+0 0x55a443aec100 con 0x55a443be0600
>>   -9> 2016-09-02 11:32:38.672571 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672571, event: done, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>   -8> 2016-09-02 11:32:38.681929 7fcf7b555700  1 -- [].28:6801/27735 <==
>> osd.2 [].26:6801/9468 148  MBackfillReserve GRANT  pgid: 15.11,
>> query_epoch: 4235 v3  30+0+0 (3067148394 0 0) 0x55a4441f65a0 con
>> 0x55a4434ab200
>>   -7> 2016-09-02 11:32:38.682009 7fcf7b555700  5 -- op tracker -- seq:
>> 1148, time: 2016-09-02 11:32:38.682008, event: done, op: MBackfillReserve
>> GRANT  pgid: 15.11, query_epoch: 4235
>>   -6> 2016-09

[ceph-users] Is rados_write_op_* any more efficient than issuing the commands individually?

2016-09-06 Thread Dan Jakubiec
Hello, I need to issue the following commands on millions of objects:

rados_write_full(oid1, ...)
rados_setxattr(oid1, "attr1", ...)
rados_setxattr(oid1, "attr2", ...)

Would it make it any faster if I combined all 3 of these into a single 
rados_write_op and issued them "together" as a single call?  

My current application doesn't really care much about the atomicity, but 
maximizing our write throughput is quite important.

Does rados_write_op save any roundtrips to the OSD or have any other efficiency 
gains?

Thanks,

-- Dan Jakubiec
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery/Backfill Speedup

2016-10-05 Thread Dan Jakubiec
73/681597074 objects degraded (24.255%)
 298607229/681597074 objects misplaced (43.810%)
  249 active+undersized+degraded+remapped+wait_backfill
  188 active+clean
   85 active+recovery_wait+degraded
   22 active+recovery_wait+degraded+remapped
   22 active+recovery_wait+undersized+degraded+remapped
3 active+remapped+wait_backfill
3 active+undersized+degraded+remapped+backfilling
3 active+degraded+remapped+wait_backfill
1 active+degraded+remapped+backfilling
recovery io 9361 kB/s, 415 objects/s
   client io 597 kB/s rd, 62 op/s rd, 0 op/s wr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



4 pgs backfilling

this sounds incredibly low for your configuration. you do not say 
anything about osd-max-backfill. The default is 10. so with 8 nodes each having 1 osd 
writing and 1 osd reading you should see much more than 4 pgs 
backfilling at any given time. theoretical max being 8*10 = 80


check what your current max backfill value is. and try setting 
osd-max-backfill higher, preferably in smaller increments while 
monitoring how many pg's are backfilling and the load on machines and 
network.
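A hedged sketch of the checks and changes being suggested (values are only examples; injectargs changes are runtime-only and should also be added to ceph.conf to persist):

    ceph daemon osd.0 config get osd_max_backfills            # current value, run on that OSD's host
    ceph tell osd.* injectargs '--osd-max-backfills 3'        # raise in small steps
    ceph tell osd.* injectargs '--osd-recovery-max-active 5'  # often tuned alongside backfills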


kind regards
Ronny Aasen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Dan Jakubiec
VP Development
Focus VQ LLC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-24 Thread Dan Jakubiec
Thanks Kostis, great read.  

We also had a Ceph disaster back in August and a lot of this experience looked 
familiar.  Sadly, in the end we were not able to recover our cluster but glad 
to hear that you were successful.

LevelDB corruptions were one of our big problems.  Your note below about 
running RepairDB from Python is interesting.  At the time we were looking for a 
Ceph tool to run LevelDB repairs in order to get our OSDs back up and couldn't 
find one.  I felt like this is something that should be in the standard toolkit.

Would be great to see this added some day, but in the meantime I will remember 
this option exists.  If you still have the Python script, perhaps you could 
post it as an example?

Thanks!

-- Dan


> On Oct 20, 2016, at 01:42, Kostis Fardelas  wrote:
> 
> We pulled leveldb from upstream and fired leveldb.RepairDB against the
> OSD omap directory using a simple python script. Ultimately, that
> didn't make things forward. We resorted to check every object's
> timestamp/md5sum/attributes on the crashed OSD against the replicas in
> the cluster and at last took the way of discarding the journal, when
> we concluded with as much confidence as possible that we would not
> lose data.
> 
> It would be really useful at that moment if we had a tool to inspect
> the journal's contents of the crashed OSD and limit the scope of the
> verification process.
> 
> On 20 October 2016 at 08:15, Goncalo Borges
>  wrote:
>> Hi Kostis...
>> That is a tale from the dark side. Glad you recover it and that you were 
>> willing to doc it all up, and share it. Thank you for that,
>> Can I also ask which tool did you use to recover the leveldb?
>> Cheers
>> Goncalo
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
>> Fardelas [dante1...@gmail.com]
>> Sent: 20 October 2016 09:09
>> To: ceph-users
>> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>> 
>> Hello cephers,
>> this is the blog post on our Ceph cluster's outage we experienced some
>> weeks ago and about how we managed to revive the cluster and our
>> clients's data.
>> 
>> I hope it will prove useful for anyone who will find himself/herself
>> in a similar position. Thanks for everyone on the ceph-users and
>> ceph-devel lists who contributed to our inquiries during
>> troubleshooting.
>> 
>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>> 
>> Regards,
>> Kostis
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS in existing pool namespace

2016-11-02 Thread Dan Jakubiec
Hi John,

How does one configure namespaces for file/dir layouts?  I'm looking here, but 
am not seeing any mentions of namespaces:

  http://docs.ceph.com/docs/jewel/cephfs/file-layouts/
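A hedged sketch of what I think we are after, based on the pool_namespace layout vxattr mentioned elsewhere (names and mountpoint are examples only, and a directory layout only affects files created after it is set):

    setfattr -n ceph.dir.layout.pool_namespace -v myapp_ns /mnt/cephfs/myapp
    getfattr -n ceph.dir.layout /mnt/cephfs/myapp    # confirm the layout now carries the namespace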

Thanks,

-- Dan

> On Oct 28, 2016, at 04:11, John Spray  wrote:
> 
> On Thu, Oct 27, 2016 at 9:43 PM, Reed Dier wrote:
>> Looking to add CephFS into our Ceph cluster (10.2.3), and trying to plan for 
>> that addition.
>> 
>> Currently only using RADOS on a single replicated, non-EC, pool, no RBD or 
>> RGW, and segmenting logically in namespaces.
>> 
>> No auth scoping at this time, but likely something we will be moving to in 
>> the future as our Ceph cluster grows in size and use.
>> 
>> The main question at hand is bringing CephFS, by way of the kernel driver, 
>> into our cluster. We are trying to be more efficient with our PG 
>> enumeration, and questioning whether there is efficiency or unwanted 
>> complexity by way of creating a namespace in the existing pool, versus a 
>> completely separate pool.
>> 
>> On top of that, how does the cephfs-metadata pool/namespace equate into 
>> that? Is this even feasible?
>> 
>> Barring feasibility, how do others plan their pg_num for separate pools for 
>> cephfs and the metadata pool, compared to a standard object pool?
>> 
>> Hopefully someone has some experience with this and can comment.
>> 
>> TL;DR - is there a way to specify cephfs_data and cephfs_metadata ‘pools’ as 
>> a namespace, rather than entire pools?
>>$ ceph fs new
>> --metadata-namespace  --data-namespace 
>> is the name of the pool where metadata is stored, 
>>  the namespace within the aforementioned pool.
>> and  analogous with the metadata side.
> 
> Currently, you can set a namespace on a file/dir layout which will
> control where the file data goes, but the MDS will still put some
> things outside of that namespace.
> 
> The feature ticket for putting everything into namespaces is this one:
> http://tracker.ceph.com/issues/15066 
> 
> It's one of the simpler things in the backlog so if anyone wants a
> project then it's a good one.
> 
> Cheers,
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multi-tenancy and sharing CephFS data pools with other RADOS users

2016-11-02 Thread Dan Jakubiec
We currently have one master RADOS pool in our cluster that is shared among 
many applications.  All objects stored in the pool are currently stored using 
specific namespaces -- nothing is stored in the default namespace.

We would like to add a CephFS filesystem to our cluster, and would like to use 
the same master RADOS pool as the data pool for the filesystem.

Since there are no other tenants using the default namespace would it be safe 
to share our RADOS pool in this way?  Any reason to NOT do this?

Thanks,

-- Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to pick the number of PGs for a CephFS metadata pool?

2016-11-08 Thread Dan Jakubiec
Hello,

Picking the number of PGs for the CephFS data pool seems straightforward, but 
how does one do this for the metadata pool?

Any rules of thumb or recommendations?

Thanks,

-- Dan Jakubiec
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to pick the number of PGs for a CephFS metadata pool?

2016-11-08 Thread Dan Jakubiec
Thanks Greg, makes sense.

Our ceph cluster currently has 16 OSDs, each with an 8TB disk.

Sounds like 32 PGs at 3x replication might be a reasonable starting point?
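A hedged sketch of what that starting point would look like when we create the filesystem (pool names and the data-pool pg_num are only examples, not recommendations):

    ceph osd pool create cephfs_metadata 32 32
    ceph osd pool create cephfs_data 512 512
    ceph fs new cephfs cephfs_metadata cephfs_data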

Thanks,

-- Dan

> On Nov 8, 2016, at 14:02, Gregory Farnum  wrote:
> 
> On Tue, Nov 8, 2016 at 9:37 AM, Dan Jakubiec  wrote:
>> Hello,
>> 
>> Picking the number of PGs for the CephFS data pool seems straightforward, 
>> but how does one do this for the metadata pool?
>> 
>> Any rules of thumb or recommendations?
> 
> I don't think we have any good ones yet. You've got to worry about the
> log and about the backing directory objects; depending on how your map
> looks I'd just try and get enough for a decent IO distribution across
> the disks you're actually using. Given the much lower amount of
> absolute data you're less worried about balancing the data precisely
> evenly and more concerned about not accidentally driving all IO to one
> of 7 disks because you have 8 PGs, and all your supposedly-parallel
> ops are contending. ;)
> -Greg
> 
>> 
>> Thanks,
>> 
>> -- Dan Jakubiec
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stalls caused by scrub on jewel

2016-12-02 Thread Dan Jakubiec
For what it's worth... this sounds like the condition we hit when we re-enabled 
scrub on our 16 OSDs (after 6 to 8 weeks of noscrub).  They flapped for about 
30 minutes as most of the OSDs randomly hit suicide timeouts here and there.

This settled down after about an hour and the OSDs stopped dying.  We have 
since left scrub enabled for about 4 days and have only seen three small spurts 
of OSD flapping since then (which quickly resolved themselves).

-- Dan

> On Dec 1, 2016, at 14:38, Frédéric Nass  
> wrote:
> 
> Hi Yoann,
> 
> Thank you for your input. I was just told by RH support that it’s gonna make 
> it to RHCS 2.0 (10.2.3). Thank you guys for the fix !
> 
> We thought about increasing the number of PGs just after changing the 
> merge/split threshold values but this would have led to a _lot_ of data 
> movements (1.2 billion of XFS files) over weeks, without any possibility to 
> scrub / deep-scrub to ensure data consistency. Still as soon as we get the 
> fix, we will increase the number of PGs.
> 
> Regards,
> 
> Frederic.
> 
> 
> 
>> On 1 Dec 2016, at 16:47, Yoann Moulin wrote:
>> 
>> Hello,
>> 
>>> We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 
>>> and is no more capable to scrub neither deep-scrub.
>>> 
>>> [1] http://tracker.ceph.com/issues/17859
>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
>>> [3] https://github.com/ceph/ceph/pull/11898
>>> 
>>> I'm worried we'll have to live with a cluster that can't scrub/deep-scrub 
>>> until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
>>> 
>>> Can we have this fix any sooner ?
>> 
>> As far as I know about that bug, it appears if you have big PGs, a 
>> workaround could be increasing the pg_num of the pool that has the biggest 
>> PGs.
>> 
>> -- 
>> Yoann Moulin
>> EPFL IC-IT
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stalls caused by scrub on jewel

2016-12-02 Thread Dan Jakubiec

> On Dec 2, 2016, at 10:48, Sage Weil  wrote:
> 
> On Fri, 2 Dec 2016, Dan Jakubiec wrote:
>> For what it's worth... this sounds like the condition we hit when we 
>> re-enabled scrub on our 16 OSDs (after 6 to 8 weeks of noscrub).  They 
>> flapped for about 30 minutes as most of the OSDs randomly hit suicide 
>> timeouts here and there.
>> 
>> This settled down after about an hour and the OSDs stopped dying.  We 
>> have since left scrub enabled for about 4 days and have only seen three 
>> small spurts of OSD flapping since then (which quickly resolved 
>> themselves).
> 
> Yeah.  I think what's happening is that with a cold cache it is slow 
> enough to suicide, but with a warm cache it manages to complete (although 
> I bet it's still stalling other client IO for perhaps multiple seconds).  
> I would leave noscrub set for now.

Ah... thanks for the suggestion!  We are indeed working through some jerky 
performance issues.  Perhaps this is a layer of that onion, thank you.
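For anyone following along, a hedged sketch of the knobs involved here -- the flags we toggled plus the scrub throttling options that are sometimes adjusted to soften the impact when scrubbing is re-enabled (values are illustrative only):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph tell osd.* injectargs '--osd-scrub-sleep 0.1'      # pause between scrub chunks
    ceph tell osd.* injectargs '--osd-scrub-chunk-max 5'    # scrub fewer objects per chunk
    # later, re-enable gradually:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub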

-- Dan

> 
> sage
> 
> 
> 
> 
>> 
>> -- Dan
>> 
>>> On Dec 1, 2016, at 14:38, Frédéric Nass  
>>> wrote:
>>> 
>>> Hi Yoann,
>>> 
>>> Thank you for your input. I was just told by RH support that it’s gonna 
>>> make it to RHCS 2.0 (10.2.3). Thank you guys for the fix !
>>> 
>>> We thought about increasing the number of PGs just after changing the 
>>> merge/split threshold values but this would have led to a _lot_ of data 
>>> movements (1.2 billion of XFS files) over weeks, without any possibility to 
>>> scrub / deep-scrub to ensure data consistency. Still as soon as we get the 
>>> fix, we will increase the number of PGs.
>>> 
>>> Regards,
>>> 
>>> Frederic.
>>> 
>>> 
>>> 
>>>> On 1 Dec 2016, at 16:47, Yoann Moulin wrote:
>>>> 
>>>> Hello,
>>>> 
>>>>> We're impacted by this bug (case 01725311). Our cluster is running RHCS 
>>>>> 2.0 and is no more capable to scrub neither deep-scrub.
>>>>> 
>>>>> [1] http://tracker.ceph.com/issues/17859
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
>>>>> [3] https://github.com/ceph/ceph/pull/11898
>>>>> 
>>>>> I'm worried we'll have to live with a cluster that can't scrub/deep-scrub 
>>>>> until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
>>>>> 
>>>>> Can we have this fix any sooner ?
>>>> 
>>>> As far as I know about that bug, it appears if you have big PGs, a 
>>>> workaround could be increasing the pg_num of the pool that has the biggest 
>>>> PGs.
>>>> 
>>>> -- 
>>>> Yoann Moulin
>>>> EPFL IC-IT
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-17 Thread Dan Jakubiec
Also worth pointing out something a bit obvious but: this kind of 
faster/destructive migration should only be attempted if all your pools are at 
least 3x replicated.

For example, if you had a 1x replicated pool you would lose data using this 
approach.

-- Dan

> On Jan 11, 2018, at 14:24, Reed Dier  wrote:
> 
> Thank you for documenting your progress and peril on the ML.
> 
> Luckily I only have 24x 8TB HDD and 50x 1.92TB SSDs to migrate over to 
> bluestore.
> 
> 8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m 
> able to do about 3 at a time (1 node) for rip/replace.
> 
> Definitely taking it slow and steady, and the SSDs will move quickly for 
> backfills as well.
> Seeing about 1TB/6hr on backfills, without much performance hit on rest of 
> everything, about 5TB average util on each 8TB disk, so just about 30 
> hours-ish per host *8 hosts will be about 10 days, so a couple weeks is a 
> safe amount of headway.
> This write performance certainly seems better on bluestore than filestore, so 
> that likely helps as well.
> 
> Expect I can probably refill an SSD osd in about an hour or two, and will 
> likely stagger those out.
> But with such a small number of osd’s currently, I’m taking the by-hand 
> approach rather than scripting it so as to avoid similar pitfalls.
> 
> Reed 
> 
>> On Jan 11, 2018, at 12:38 PM, Brady Deetz wrote:
>> 
>> I hear you on time. I have 350 x 6TB drives to convert. I recently posted 
>> about a disaster I created automating my migration. Good luck
>> 
>> On Jan 11, 2018 12:22 PM, "Reed Dier" > > wrote:
>> I am in the process of migrating my OSDs to bluestore finally and thought I 
>> would give you some input on how I am approaching it.
>> Some of saga you can find in another ML thread here: 
>> https://www.spinics.net/lists/ceph-users/msg41802.html 
>> 
>> 
>> My first OSD I was cautious, and I outed the OSD without downing it, 
>> allowing it to move data off.
>> Some background on my cluster, for this OSD, it is an 8TB spinner, with an 
>> NVMe partition previously used for journaling in filestore, intending to be 
>> used for block.db in bluestore.
>> 
>> Then I downed it, flushed the journal, destroyed it, zapped with 
>> ceph-volume, set norecover and norebalance flags, did ceph osd crush remove 
>> osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume 
>> locally to create the new LVM target. Then unset the norecover and 
>> norebalance flags and it backfilled like normal.
>> 
>> I initially ran into issues with specifying --osd-id causing my osd’s to 
>> causing my osd’s to fail to start, but removing that I was able to get it to 
>> fill in the gap of the OSD I just removed.
>> 
>> I’m now doing quicker, more destructive migrations in an attempt to reduce 
>> data movement.
>> This way I don’t read from OSD I’m replacing, write to other OSD 
>> temporarily, read back from temp OSD, write back to ‘new’ OSD.
>> I’m just reading from replica and writing to ‘new’ OSD.
>> 
>> So I’m setting the norecover and norebalance flags, down the OSD (but not 
>> out, it stays in, also have the noout flag set), destroy/zap, recreate using 
>> ceph-volume, unset the flags, and it starts backfilling.
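For readers trying to reproduce this, a hedged sketch of the in-place sequence Reed describes above, as I understand it (the device names, the NVMe block.db partition, and the use of "ceph osd destroy" are assumptions on my part; on 12.2.x passing --osd-id to ceph-volume reportedly caused failures, so letting Ceph hand back the destroyed id may be safer):

    ID=0                                   # example OSD id
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    systemctl stop ceph-osd@$ID            # down, but deliberately not "out"
    ceph osd destroy $ID --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX           # wipe the old filestore device (example device)
    ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1   # example block.db partition
    ceph osd unset norebalance
    ceph osd unset norecover
    ceph osd unset noout                   # once backfill has caught up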
>> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time 
>> to offload it and then backfill back from them. I trust my disks enough to 
>> backfill from the other disks, and its going well. Also seeing very good 
>> write performance backfilling compared to previous drive replacements in 
>> filestore, so thats very promising.
>> 
>> Reed
>> 
>>> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen wrote:
>>> 
>>> Hi Alfredo,
>>> 
>>> thank you for your comments:
>>> 
>>> Quoting Alfredo Deza:
 On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen wrote:
> Dear *,
> 
> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
> keeping the OSD number? There have been a number of messages on the list,
> reporting problems, and my experience is the same. (Removing the existing
> OSD and creating a new one does work for me.)
> 
> I'm working on an Ceph 12.2.2 cluster and tried following
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>  
> 
> - this basically says
> 
> 1. destroy old OSD
> 2. zap the disk
> 3. prepare the new OSD
> 4. activate the new OSD
> 
> I never got step 4 to complete. The closest I got was by doing the 
> following
> steps (assuming OSD ID "999" on /dev/sdzz):
> 
> 1. Stop the old OSD via systemd (osd-node # systemctl stop
> ceph-osd@999.service )