Re: [ceph-users] rbd cache did not help improve performance

2016-02-29 Thread Tom Christensen
If you are mapping the RBD with the kernel driver then you're not using librbd, so these settings will have no effect, I believe. The kernel driver does its own caching, but I don't believe there are any settings to change its default behavior. On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo wrote: …
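For reference, a minimal ceph.conf sketch of the librbd cache settings under discussion; they live in the [client] section and apply only to librbd clients (e.g. QEMU), not to kernel-mapped RBDs. Values are illustrative, not recommendations:

    [client]
        rbd cache = true
        rbd cache size = 33554432                   # 32 MB cache, illustrative
        rbd cache max dirty = 25165824              # 24 MB dirty limit, illustrative
        rbd cache writethrough until flush = true   # safer behavior for guests that don't flush

    # Quick check that an image is kernel-mapped (and therefore bypassing librbd):
    rbd showmapped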

Re: [ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Tom Christensen
We've seen this as well, as early as 0.94.3, and have a bug, http://tracker.ceph.com/issues/13990, which we're working through currently. Nothing fixed yet; still trying to nail down exactly why the osd maps aren't being trimmed as they should be. On Thu, Feb 25, 2016 at 10:16 AM, Stillwell, Bryan <b…
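As a rough way to see whether an OSD is trimming its maps, the oldest and newest osdmap epochs it holds can be compared via the admin socket; a sketch, assuming a Hammer-era daemon and osd.0 as an arbitrary example:

    ceph daemon osd.0 status
    # output includes fields along the lines of:
    #   "oldest_map": 41234,
    #   "newest_map": 55872,
    # a very large and growing gap between the two suggests maps are not being trimmed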

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-14 Thread Tom Christensen
…regardless of the number of pgs/osd. Meaning it started out bad, and stayed bad, but didn't get worse as we added osds. We've had to reweight osds in our crushmap to get anything close to a sane distribution of pgs. -Tom On Sat, Feb 13, 2016 at 10:57 PM, Christian Balzer wrote: > On S…
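For context, a couple of commands commonly used to inspect and adjust PG distribution; the OSD id and weight below are purely illustrative:

    ceph osd df tree                      # per-OSD utilisation and PG counts (available on recent releases)
    ceph osd crush reweight osd.12 0.90   # lower the crush weight of an over-full OSD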

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Tom Christensen
> Next this:
> ---
> 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 1788 load_pgs
> 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 load_pgs opened 564 pgs
> ---
> Another minute to load the PGs.
> Same OSD reboot as above: 8 seconds for this.
Do you really have 564 pgs on a si…
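To sanity-check the per-OSD PG count being questioned here, something along these lines should work (osd.2 matches the log excerpt above; command availability may vary by release):

    ceph pg ls-by-osd osd.2 | wc -l               # roughly the number of PGs hosted by osd.2 (includes a header line)
    grep load_pgs /var/log/ceph/ceph-osd.2.log    # compare the two timestamps to see how long PG load took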

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-17 Thread Tom Christensen
I've just checked 1072 and 872; they both look the same: a single op for the object in question, in retry+read state, that appears to be retrying forever. On Thu, Dec 17, 2015 at 10:05 AM, Tom Christensen wrote: > I had already nuked the previous hang, but we have another one: …
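The per-op state referred to here can be read from the kernel client's debugfs files; a sketch, assuming debugfs is mounted in the usual place and run on the host that has the RBD mapped:

    ls /sys/kernel/debug/ceph/            # one directory per kernel client instance
    cat /sys/kernel/debug/ceph/*/osdc     # in-flight OSD requests: tid, target osd, object name, op flags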

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
…reproduce the output you've requested. > I'd also be willing to run an early 4.5 version in our test environment. > On Tue, Dec 8, 2015 at 3:35 AM, Ilya Dryomov wrote: >> On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen wrote: >>> We aren't running…

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
…willing to run an early 4.5 version in our test environment. On Tue, Dec 8, 2015 at 3:35 AM, Ilya Dryomov wrote: > On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen wrote: > > We aren't running NFS, but regularly use the kernel driver to map RBDs and mount filesystems in sa…

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
…> 2015-12-08 10:34 GMT+01:00 Tom Christensen: > > We didn't go forward to 4.2 as it's a large production cluster, and we just needed the problem fixed. We'll probably test out 4.2 in the next couple… unfortunately we don't have the…

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
We aren't running NFS, but regularly use the kernel driver to map RBDs and mount filesystems on them. We see very similar behavior across nearly all kernel versions we've tried. In my experience only a very few versions of the kernel driver survive any sort of crush map change/update while somethin…
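For readers unfamiliar with the workflow being described, a minimal sketch of mapping an RBD with the kernel driver and mounting a filesystem on it (pool, image, and mountpoint names are made up):

    rbd map rbd/testimage             # creates a /dev/rbdN device
    mkfs.xfs /dev/rbd0                # first use only
    mount /dev/rbd0 /mnt/testimage
    rbd showmapped                    # list kernel-mapped images
    dmesg | grep libceph              # kernel client messages are the first place to look when a map or I/O hangs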

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We didn't go forward to 4.2 as it's a large production cluster, and we just needed the problem fixed. We'll probably test out 4.2 in the next couple months, but this one slipped past us as it didn't occur in our test cluster until after we had upgraded production. In our experience it takes about…

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We have been seeing this same behavior on a cluster that had been perfectly happy until we upgraded to the Ubuntu Vivid 3.19 kernel. We are in the process of "upgrading" back to the 3.16 kernel across our cluster, as we've not seen this behavior on that kernel for over 6 months and we're pretty str…

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-03 Thread Tom Christensen
…Farnum wrote: > On Tue, Dec 1, 2015 at 10:02 AM, Tom Christensen wrote: > > Another thing that we don't quite grasp is that when we see slow requests now, they almost always (probably 95%) have the "known_if_redirected" state set. What does this state mean…

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
…this new message indicate? Can it be disabled or turned off, so that librbd sessions don't cause a new osdmap to be generated? In ceph -w output, whenever we see those entries, we immediately see a new osdmap, hence my suspicion that this message is causing a new osdmap to be generated. On…
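A simple way to watch whether new osdmap epochs are in fact being generated (the suspicion raised above), purely as an observation aid:

    ceph osd stat                 # prints the current osdmap epoch, e.g. "osdmap e55872: ..."
    watch -n 5 'ceph osd stat'    # if the epoch keeps climbing while the cluster is otherwise idle, maps are churning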

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
…the Ceph source code, except that the value is set in some of the test software…
> Paul
> From: ceph-users on behalf of Tom Christensen
> Date: Monday, 30 November 2015 at 23:20
> To: "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] Flapping OSDs…

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-11-30 Thread Tom Christensen
…enough to see what the osd is doing; maybe you need debug_filestore=10 also. If that doesn't show the problem, bump those up to 20. > Good luck, > Dan > On 30 Nov 2015 20:56, "Tom Christensen" wrote: > > We recently upgraded to 0.94.3 from…
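The debug levels mentioned can be raised at runtime without restarting the OSD; a sketch using injectargs (osd.12 is an arbitrary example, and the levels should be put back once the capture is done since they are very chatty):

    ceph tell osd.12 injectargs '--debug_osd 10 --debug_filestore 10'
    # ... reproduce the slow request, then revert to your normal levels, e.g.:
    ceph tell osd.12 injectargs '--debug_osd 0/5 --debug_filestore 1/3'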

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-11-30 Thread Tom Christensen
…many OSDs, and most of them have occurred in the middle of the night during our peak load times. On Mon, Nov 30, 2015 at 1:41 PM, Wido den Hollander wrote: > On 11/30/2015 08:56 PM, Tom Christensen wrote: > > We recently upgraded to 0.94.3 from firefly and now for the last week…

[ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-11-30 Thread Tom Christensen
We recently upgraded to 0.94.3 from firefly and now for the last week have had intermittent slow requests and flapping OSDs. We have been unable to nail down the cause, but it feels like it may be related to our osdmaps not getting deleted properly. Most of our osds are now storing over 100GB…
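A quick way to quantify the osdmap buildup being described, on a FileStore OSD (paths assume the default /var/lib/ceph layout; the osdmap object-name pattern is an assumption and may differ by release):

    du -sh /var/lib/ceph/osd/ceph-*/current/meta                           # size of each OSD's meta directory
    find /var/lib/ceph/osd/ceph-0/current/meta -name 'osdmap*' | wc -l     # rough count of stored map objects on osd.0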