Re: [ceph-users] rbd cache did not help improve performance

2016-02-29 Thread Tom Christensen
If you are mapping the RBD with the kernel driver, then you're not using
librbd, so I believe these settings will have no effect.  The kernel driver
does its own caching, but I don't believe there are any settings to change
its default behavior.
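
If the goal is to exercise the librbd cache specifically, fio can also talk
to librbd directly through its rbd engine (assuming your fio build includes
it).  Something along these lines, where the pool "rbd" and image "testimg"
are just placeholders, should at least show whether the [client] settings
are being picked up:

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=read --bs=4k --iodepth=64 --numjobs=1 --runtime=300 \
    --group_reporting --name=librbd-read

Running fio against the mapped /dev/rbdX device with ioengine=aio only ever
tests the kernel client path.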


On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo  wrote:

> You may want to set "ioengine=rbd", I guess.
>
> Cheers,
>
> - Original Message -
> From: "min fang" 
> To: "ceph-users" 
> Sent: Tuesday, March 1, 2016 1:28:54 PM
> Subject: [ceph-users]  rbd cache did not help improve performance
>
> Hi, I set the following parameters in ceph.conf
>
> [client]
> rbd cache=true
> rbd cache size= 25769803776
> rbd readahead disable after byte=0
>
>
> I map an rbd image to an rbd device, then run fio to test 4k reads with
> this command:
> ./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread -rw=read
> -ioengine=aio -bs=4K -size=500G -numjobs=32 -runtime=300 -group_reporting
> -name=mytest2
>
> Comparing the results between rbd cache=false and cache enabled, I did
> not see any performance improvement from the librbd cache.
>
> Is my setting wrong, or is it true that the ceph librbd cache gives no
> benefit for 4k sequential reads?
>
> thanks.
>
>


Re: [ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Tom Christensen
We've seen this as well, as early as 0.94.3, and have a bug open,
http://tracker.ceph.com/issues/13990, which we're working through
currently.  Nothing is fixed yet; we're still trying to nail down exactly
why the osd maps aren't being trimmed as they should be.
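
If you want to see how far behind the trimming is on a particular OSD,
something like this should work (the osd id and path are examples, and
assume the default filestore layout):

# range of maps the OSD is still holding
ceph daemon osd.12 status | grep -E 'oldest_map|newest_map'

# rough count of osdmap objects on disk for that OSD
find /var/lib/ceph/osd/ceph-12/current/meta -name 'osdmap*' | wc -l

Normally the gap between oldest_map and newest_map stays at a few hundred
epochs; in our case it just keeps growing, which is what the tracker ticket
is about.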


On Thu, Feb 25, 2016 at 10:16 AM, Stillwell, Bryan <
bryan.stillw...@twcable.com> wrote:

> After evacuating all the PGs from a node in hammer 0.94.5, I noticed that
> each of the OSDs was still using ~8GB of storage.  After investigating, it
> appears that all the data is coming from around 13,000 files in
> /var/lib/ceph/osd/ceph-*/current/meta/ with names like:
>
> DIR_4/DIR_0/DIR_0/osdmap.303231__0_C23E4004__none
> DIR_4/DIR_2/DIR_F/osdmap.314431__0_C24ADF24__none
> DIR_4/DIR_0/DIR_A/osdmap.312688__0_C2510A04__none
>
> They're all around 500KB in size.  I'm guessing these are all old OSD
> maps, but I'm wondering why there are so many of them?
>
> Thanks,
> Bryan


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-14 Thread Tom Christensen
To be clear, when you are restarting these osds, how many pgs go into the
peering state?  And do they stay there for the full 3 minutes?  Certainly
I've seen iops drop to zero or near zero when a large number of pgs are
peering.  It would be wonderful if we could keep iops flowing even while
pgs are peering.  In your case, with such a high pg/osd count, my guess is
that peering always takes a long time.  As the OSD goes down it has to peer
those 564 pgs across the remaining 3 osds, then re-peer them once the OSD
comes up again.  Also, because the OSD sits on a RAID6, I'm pretty sure the
IO pattern is going to be bad: all 564 of those threads are going to issue
reads and writes (the peering process updates metadata in each pg directory
on the OSD) nearly simultaneously.  In a RAID6, each non-cached read will
cause a read IO on at least 5 disks and each write will cause a write IO on
all 7 disks.  With that many threads hitting the volume simultaneously,
you're going to have massive disk head contention and seek times, which
will absolutely destroy your iops and make peering take that much longer.
In effect, in the non-cached case the RAID6 almost entirely negates the
distribution of IO load across those 7 disks and makes them perform closer
to a single HDD.  As Lionel said earlier, the HW cache is going to be
nearly useless in any sort of recovery scenario in ceph (which this is).

I hope Robert or someone can come up with a way to continue IO to a pg in
the peering state; that would be wonderful, as I believe this is the
fundamental problem.  I'm not "happy" with the amount of work we had to put
into getting our cluster to behave as well as it does now, and it would
certainly be great if things "Just Worked".  I'm just trying to relate our
experience and point out what I see as the bottleneck in this particular
setup based on that experience.  I believe the ceph pg calculator and the
recommendations about pg counts are too high, and your setup is 2-3x above
even that.  For 2 years now I've been able to easily topple clusters
(mostly through RAM exhaustion/swapping/the OOM killer) simply by causing
recovery in a cluster with the recommended pg/osd counts and recommended
RAM (1GB/OSD + 1GB/TB of storage), and as far as I can tell it hasn't
improved.  The only solution I've seen work reliably is to drop the pg/osd
ratio.  Dropping that ratio also greatly reduced the peering load and time
and made the pain of osd restarts almost negligible.
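
If it isn't obvious how many pgs each OSD is carrying, a quick way to count
them on the OSD host itself is to count the pg directories (this assumes the
filestore layout, with osd.2 as an example):

ls /var/lib/ceph/osd/ceph-2/current | grep -c '_head$'

or you can derive a cluster-wide per-OSD count from the up/acting sets in
"ceph pg dump".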

To your question about our data distribution: it is excellent as far as
per-pg data is concerned, with less than 3% variance between pgs.  We did,
however, see a massive disparity in how many pgs each osd gets.  Originally
we had osds with as few as 100 pgs and some with as many as 250, when on
average they should have had about 175 pgs each; that was with the
recommended pg/osd settings.  Additionally, that variance has stayed the
same regardless of the number of pgs per osd, meaning it started out bad
and stayed bad, but didn't get worse as we added osds.  We've had to
reweight osds in our crushmap to get anything close to a sane distribution
of pgs.
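
For reference, the reweighting we do is plain CRUSH reweighting, along these
lines (the osd id and weight are illustrative):

ceph osd tree                        # current weights and the worst offenders
ceph osd crush reweight osd.42 0.85  # nudge an over-subscribed OSD down

nudging the most over-subscribed OSDs down a little at a time and letting
the resulting data movement settle before the next adjustment.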

-Tom


On Sat, Feb 13, 2016 at 10:57 PM, Christian Balzer <ch...@gol.com> wrote:

> On Sat, 13 Feb 2016 20:51:19 -0700 Tom Christensen wrote:
>
> > > > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2
> > > > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788
> > > > load_pgs opened
> > > 564 pgs > --- > Another minute to load the PGs.
> > > Same OSD reboot as above : 8 seconds for this.
> >
> > Do you really have 564 pgs on a single OSD?
>
> Yes, the reason is simple, more than a year ago it should have been 8 OSDs
> (halving that number) and now it should be 18 OSDs, which would be a
> perfect fit for the 1024 PGs in the rbd pool.
>
> >I've never had anything like
> > decent performance on an OSD with greater than about 150pgs.  In our
> > production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd
> > total (with size set to 3).  When we initially deployed our large cluster
> > with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had
> > no end of trouble getting pgs to peer.  The OSDs ate RAM like nobody's
> > business, took forever to do anything, and in general caused problems.
>
> The cluster performs admirable for the stress it is under, the number of
> PGs per OSD never really was an issue when it came to CPU/RAM/network.
> For example the restart increased the OSD process size from 1.3 to 2.8GB,
> but that left 24GB still "free".
> The main reason to have more OSDs (and thus a lower PG count per OSD) is
> to have more IOPS from the underlying storage.
>
> > If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Tom Christensen
> > Next this : > --- > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2
> > 1788 load_pgs 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788
> > load_pgs opened
> 564 pgs > --- > Another minute to load the PGs.
> Same OSD reboot as above : 8 seconds for this.

Do you really have 564 pgs on a single OSD?  I've never had anything like
decent performance on an OSD with more than about 150 pgs.  In our
production clusters we aim for 25-30 primary pgs per osd, 75-90 pgs/osd
total (with size set to 3).  When we initially deployed our large cluster
with 150-200 pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no
end of trouble getting pgs to peer.  The OSDs ate RAM like nobody's
business, took forever to do anything, and in general caused problems.  If
you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that first as
the potential culprit.  That is a lot of threads inside the OSD process
that all need CPU/network/disk time in order to peer as they come up.
Especially on firefly I would point to this.  We've moved to hammer, and
that did improve a number of our performance bottlenecks, though we've also
grown our cluster without adding pgs, so we are now down in the 25-30
primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs for
us), no longer causes us pain.  In the past, restarting a node could cause
5-10 minutes of peering and pain/slow requests/unhappiness of various sorts
(RAM exhaustion, the OOM killer, flapping OSDs).  This all improved greatly
once we got our pg/osd count under 100, even before we upgraded to hammer.
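
For what it's worth, the rule of thumb we size pools with now is roughly:

total pgs for a pool ~= (number of OSDs x target pgs per OSD) / replica size

For example, 24 OSDs x 100 pgs/OSD / 3 replicas = 800, rounded to a power of
two, gives 1024 (or 512 if you want to stay comfortably under 100 pgs/OSD as
argued above).  The numbers are illustrative; the point is that the
pgs-per-OSD target is what actually drives the sizing.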





On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton <
lionel-subscript...@bouton.name> wrote:

> Hi,
>
> Le 13/02/2016 15:52, Christian Balzer a écrit :
> > [..]
> >
> > Hum that's surprisingly long. How much data (size and nb of files) do
> > you have on this OSD, which FS do you use, what are the mount options,
> > what is the hardware and the kind of access ?
> >
> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> > 7 disk RAID6 per OSD.
> > Nothing aside from noatime for mount options and EXT4.
>
> Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
> and may not be innocent.
>
> >
> > 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> > than 700k files per OSD.
>
> That's nearly 3x more than my example OSD but it doesn't explain the
> more than 10x difference in startup time (especially considering BTRFS
> OSDs are slow to startup and my example was with dropped caches unlike
> your case). Your average file size is similar so it's not that either.
> Unless you have a more general, system-wide performance problem which
> impacts everything including the OSD init, there's 3 main components
> involved here :
> - Ceph OSD init code,
> - ext4 filesystem,
> - HW RAID6 block device.
>
> So either :
> - OSD init code doesn't scale past ~500k objects per OSD.
> - your ext4 filesystem is slow for the kind of access used during init
> (inherently or due to fragmentation, you might want to use filefrag on a
> random sample on PG directories, omap and meta),
> - your RAID6 array is slow for the kind of access used during init.
> - any combination of the above.
>
> I believe it's possible but doubtful that the OSD code wouldn't scale at
> this level (this does not feel like an abnormally high number of objects
> to me). Ceph devs will know better.
> ext4 could be a problem as it's not the most common choice for OSDs
> (from what I read here XFS is usually preferred over it) and it forces
> Ceph to use omap to store data which would be stored in extended
> attributes otherwise (which probably isn't without performance problems).
> RAID5/6 on HW might have performance problems. The usual ones happen on
> writes and OSD init is probably read-intensive (or maybe not, you should
> check the kind of access happening during the OSD init to avoid any
> surprise) but with HW cards it's difficult to know for sure the
> performance limitations they introduce (the only sure way is testing the
> actual access patterns).
>
> So I would probably try to reproduce the problem replacing one OSDs
> based on RAID6 arrays with as many OSDs as you have devices in the arrays.
> Then if it solves the problem and you didn't already do it you might
> want to explore Areca tuning, specifically with RAID6 if you must have it.
>
>
> >
> > And kindly take note that my test cluster has less than 120k objects and
> > thus 15k files per OSD and I still was able to reproduce this behaviour
> (in
> > spirit at least).
>
> I assume the test cluster uses ext4 and RAID6 arrays too: it would be a
> perfect testing environment for defragmentation/switch to XFS/switch to
> single drive OSDs then.
>
> >
> >> The only time I saw OSDs take several minutes to reach the point where
> >> they fully rejoin is with BTRFS with default options/config.
> >>
> > There isn't a pole long enough I would touch BTRFS with for production,
> > 

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-17 Thread Tom Christensen
I've just checked 1072 and 872; they both look the same: a single op for
the object in question, in retry+read state, which appears to be retrying
forever.


On Thu, Dec 17, 2015 at 10:05 AM, Tom Christensen <pav...@gmail.com> wrote:

> I had already nuked the previous hang, but we have another one:
>
> osdc output:
>
> 70385     osd853   5.fb666328  rbd_data.36f804a163d632a.000370ff                   read
> 11024940  osd1072  5.f438406c  rbd_id.volume-44c74bb5-14f8-4279-b44f-8e867248531b  call
> 11241684  osd872   5.175f624d  rbd_id.volume-3e068bc7-75eb-4504-b109-df851a787f89  call
> 11689088  osd685   5.1fc9acd5  rbd_header.36f804a163d632a  442390'5605610926112768  watch
>
> #ceph osd map rbd rbd_data.36f804a163d632a.000370ff
>
> osdmap e1309560 pool 'rbd' (5) object
> 'rbd_data.36f804a163d632a.000370ff' -> pg 5.fb666328 (5.6328) -> up
> ([853,247,265], p853) acting ([853,247,265], p853)
>
>
> As to the output of osd.853 ops, objecter_requests, dump_ops_in_flight,
> dump_historic_ops.. I see a single request in the output of osd.853 ops:
>
> {
>     "ops": [
>         {
>             "description": "osd_op(client.58016244.1:70385 rbd_data.36f804a163d632a.000370ff [read 7274496~4096] 5.fb666328 RETRY=1 retry+read e1309006)",
>             "initiated_at": "2015-12-17 10:03:35.360401",
>             "age": 0.000503,
>             "duration": 0.000233,
>             "type_data": [
>                 "reached pg",
>                 {
>                     "client": "client.58016244",
>                     "tid": 70385
>                 },
>                 [
>                     {
>                         "time": "2015-12-17 10:03:35.360401",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2015-12-17 10:03:35.360635",
>                         "event": "reached_pg"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 1
> }
>
>
> The other commands either return nothing (ops_in_flight,
> objecter_requests) or, in the case of historic ops, return 20 ops (that's
> what it's set to keep), but none of them are this request or reference
> this object.  It seems this read is just retrying forever?
>
>
>
> On Sat, Dec 12, 2015 at 12:10 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>
>> On Sat, Dec 12, 2015 at 6:37 PM, Tom Christensen <pav...@gmail.com>
>> wrote:
>> > We had a kernel map get hung up again last night/this morning.  The rbd
>> is
>> > mapped but unresponsive, if I try to unmap it I get the following error:
>> > rbd: sysfs write failed
>> > rbd: unmap failed: (16) Device or resource busy
>> >
>> > Now that this has happened attempting to map another RBD fails, using
>> lsblk
>> > fails as well, both of these tasks just hang forever.
>> >
>> > We have 1480 OSDs in the cluster so posting the osdmap seems excessive,
>> > however here is the beginning (didn't change in 5 runs):
>> > root@wrk-slc-01-02:~# cat
>> >
>> /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap
>> > epoch 1284256
>> > flags
>> > pool 0 pg_num 2048 (2047) read_tier -1 write_tier -1
>> > pool 1 pg_num 512 (511) read_tier -1 write_tier -1
>> > pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
>> > pool 4 pg_num 512 (511) read_tier -1 write_tier -1
>> > pool 5 pg_num 32768 (32767) read_tier -1 write_tier -1
>> >
>> > Here is osdc output, it is not changed after 5 runs:
>> >
>> > root@wrk-slc-01-02:~# cat
>> >
>> /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
>> > 93835   osd1206 5.6841959c
>> rbd_data.34df3ac703ced61.1dff
>> > read
>> > 9065810 osd1382 5.a50fa0ea  rbd_header.34df3ac703ced61
>> > 474103'5506530325561344 watch
>> > root@wrk-slc-01-02:~# cat
>> >
>> /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
>> > 93835   osd

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We didn't go forward to 4.2 as it's a large production cluster and we just
needed the problem fixed.  We'll probably test out 4.2 in the next couple
of months, but this one slipped past us because it didn't occur in our test
cluster until after we had upgraded production.  In our experience it takes
about 2 weeks to start happening, but once it does it's all hands on deck
because nodes are going to go down regularly.

All that being said, if/when we try 4.2 it's going to need to run rock
solid for 1-2 months in our test cluster before it gets to production.

On Tue, Dec 8, 2015 at 2:30 AM, Benedikt Fraunhofer 
wrote:

> Hi Tom,
>
> > We have been seeing this same behavior on a cluster that has been
> perfectly
> > happy until we upgraded to the ubuntu vivid 3.19 kernel.  We are in the
>
> i can't recall when we gave 3.19 a shot but now that you say it... The
> cluster was happy for >9 months with 3.16.
> Did you try 4.2 or do you think the regression from 3.16 introduced
> somewhere trough 3.19 is still in 4.2?
>
> Thx!
>Benedikt
>


Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
We aren't running NFS, but we regularly use the kernel driver to map RBDs
and mount filesystems on them.  We see very similar behavior across nearly
all kernel versions we've tried.  In my experience only very few versions
of the kernel driver survive any sort of crush map change/update while
something is mapped.  In fact, in the last 2 years I think I've only seen
this work on one kernel version; unfortunately it's badly out of date and
we can't run it in our environment anymore.  I think it was a 3.0 kernel
running on ubuntu 12.04.  We have just recently started trying to find a
kernel that will survive OSD outages or changes to the cluster.  We're on
ubuntu 14.04 and have tried 3.16, 3.19.0-25, 4.3, and 4.2 without success
in the last week.  We only map 1-3 RBDs per client machine at a time, but
we regularly get processes stuck in D state while accessing the filesystem
inside the RBD, and we have to hard reboot the RBD client machine.  This is
always associated with a cluster change of some kind: reweighting OSDs,
rebooting an OSD host, restarting an individual OSD, adding OSDs, and
removing OSDs all cause the kernel client to hang.  If no change is made to
the cluster, the kernel client will be happy for weeks.
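
When one of these hangs happens, the information we try to grab before the
hard reboot looks roughly like this (it assumes debugfs is mounted at
/sys/kernel/debug):

# processes wedged in uninterruptible sleep
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# the kernel client's view of the osdmap and its in-flight requests
cat /sys/kernel/debug/ceph/*/osdmap
cat /sys/kernel/debug/ceph/*/osdc

# anything rbd/libceph has logged
dmesg | grep -iE 'rbd|libceph'

Comparing the osdc output a few minutes apart shows whether requests are
actually stuck rather than just slow.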

On Mon, Dec 7, 2015 at 2:55 PM, Blair Bethwaite 
wrote:

> Hi Matt,
>
> (CC'ing in ceph-users too - similar reports there:
> http://www.spinics.net/lists/ceph-users/msg23037.html)
>
> We've seen something similar for KVM [lib]RBD clients acting as NFS
> gateways within our OpenStack cloud, the NFS services were locking up
> and causing client timeouts whenever we started doing Ceph
> maintenance. We eventually realised we'd somehow set the pool min_size
> == size, so any single OSD outage was blocking client IO - *oops*.
> Your issue sounds like something different, but NFS does seem to be
> very touchy and lacking any graceful recovery from issues with the
> underlying FS.
>
>
> On 8 December 2015 at 07:56, Matt Conner 
> wrote:
> > Hi,
> >
> > We have a Ceph cluster in which we have been having issues with RBD
> > clients hanging when an OSD failure occurs. We are using a NAS gateway
> > server which maps RBD images to filesystems and serves the filesystems
> > out via NFS. The gateway server has close to 180 NFS clients and
> > almost every time even 1 OSD goes down during heavy load, the NFS
> > exports lock up and the clients are unable to access the NAS share via
> > NFS. When the OSD fails, Ceph recovers without issue, but the gateway
> > kernel RBD module appears to get stuck waiting on the now failed OSD.
> > Note that this works correctly when under lighter loads.
> >
> > From what we have been able to determine, the NFS server daemon hangs
> > waiting for I/O from the OSD that went out and never recovers.
> > Similarly, attempting to access files from the exported FS locally on
> > the gateway server will result in a similar hang. We also noticed that
> > Ceph health details will continue to report blocked I/O on the now
> > down OSD until either the OSD is recovered or the gateway server is
> > rebooted.  Based on a few kernel logs from NFS and PVS, we were able
> > to trace the problem to the RBD kernel module.
> >
> > Unfortunately, the only way we have been able to recover our gateway
> > is by hard rebooting the server.
> >
> > Has anyone else encountered this issue and/or have a possible solution?
> > Are there suggestions for getting more detailed debugging information
> > from the RBD kernel module?
> >
> >
> > Few notes on our setup:
> > We are using Kernel RBD on a gateway server that exports filesystems via
> NFS
> > The exported filesystems are XFS on LVMs which are each composed of 16
> > striped images (NFS->LVM->XFS->PVS->RBD)
> > There are currently 176 mapped RBD images on the server (11
> > filesystems, 16 mapped RBD images per FS)
> > Gateway Kernel: 3.18.6
> > Ceph version: 0.80.9
> > Note - We've tried using different kernels all the way up to 4.3.0 but
> > the problem persists.
> >
> > Thanks,
> > Matt Conner
> > Keeper Technology
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Cheers,
> ~Blairo


Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We have been seeing this same behavior on a cluster that had been
perfectly happy until we upgraded to the ubuntu vivid 3.19 kernel.  We are
in the process of "upgrading" back to the 3.16 kernel across our cluster,
as we've not seen this behavior on that kernel for over 6 months, and we're
pretty strongly of the opinion that this is a regression in the kernel.
Please let the list know if upping your threads fixes your issue (though
I'm not optimistic): we have our max threads set to the value recommended
here (4194303) but still see this issue regularly on the 3.19 ubuntu
kernel.  We tried both 3.19.0-25 and 3.19.0-33 before giving up and
reverting to 3.16.
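
For anyone else trying this, a sketch of how we apply it via a sysctl
drop-in (the file name is just our convention, and I'm assuming here that
the 4194303 above goes into kernel.pid_max):

# /etc/sysctl.d/90-ceph-threads.conf
kernel.pid_max = 4194303

# apply without a reboot
sysctl -p /etc/sysctl.d/90-ceph-threads.conf

Either way, as noted above, having this raised did not make the 3.19
problem go away for us.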


On Tue, Dec 8, 2015 at 1:03 AM, Jan Schermer  wrote:

>
> > On 08 Dec 2015, at 08:57, Benedikt Fraunhofer 
> wrote:
> >
> > Hi Jan,
> >
> >> Doesn't look near the limit currently (but I suppose you rebooted it in
> the meantime?).
> >
> > the box this numbers came from has an uptime of 13 days
> > so it's one of the boxes that did survive yesterdays
> half-cluster-wide-reboot.
> >
>
> So this box had no issues? Keep an eye on the number of threadas, but
> maybe others will have a better idea, this is just where I'd start. I have
> seen close to a milion threads from OSDs on my boxes, not sure what the
> number are now.
>
> >> Did iostat say anything about the drives? (btw dm-1 and dm-6 are what?
> Is that your data drives?) - were they overloaded really?
> >
> > no they didn't have any load and or iops.
> > Basically the whole box had nothing to do.
> >
> > If I understand the load correctly, this just reports threads
> > that are ready and willing to work but - in this case -
> > don't get any data to work with.
>
> Different unixes calculate this differently :-) By itself "load" is
> meaningless.
> It should be something like an average number of processes that want to
> run at any given time but can't (because they are waiting for whatever they
> need - disks, CPU, blocking sockets...).
>
> Jan
>
>
> >
> > Thx
> >
> > Benedikt
> >
> >
> > 2015-12-08 8:44 GMT+01:00 Jan Schermer :
> >>
> >> Jan
> >>
> >>
> >>> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer 
> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> we had 65k for pid_max, which made
> >>> kernel.threads-max = 1030520.
> >>> or
> >>> kernel.threads-max = 256832
> >>> (looks like it depends on the number of cpus?)
> >>>
> >>> currently we've
> >>>
> >>> root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
> >>> kernel.cad_pid = 1
> >>> kernel.core_uses_pid = 0
> >>> kernel.ns_last_pid = 60298
> >>> kernel.pid_max = 65535
> >>> kernel.threads-max = 256832
> >>> vm.nr_pdflush_threads = 0
> >>> root@ceph1-store209:~# ps axH |wc -l
> >>> 17548
> >>>
> >>> we'll see how it behaves once puppet has come by and adjusted it.
> >>>
> >>> Thx!
> >>>
> >>> Benedikt
> >>
>


Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
To be clear, we are also using format 2 RBDs, so we didn't really expect
this to work until recently, as it was listed as unsupported.  Our
understanding is that as of 3.19, RBD format 2 should be supported.  Are we
incorrect in that understanding?
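
For the record, checking what an image actually is is straightforward (pool
and image names are illustrative):

rbd info rbd/testimg | grep -E 'format|features'

The lines to look at are "format:" and "features:".  As far as I understand
it, the kernel client at 3.19 copes with format 2 plus the layering feature,
but not some of the newer image features, which is why we're asking.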

On Tue, Dec 8, 2015 at 3:44 AM, Tom Christensen <pav...@gmail.com> wrote:

> We haven't submitted a ticket as we've just avoided using the kernel
> client.  We've periodically tried with various kernels and various versions
> of ceph over the last two years, but have just given up each time and
> reverted to using rbd-fuse, which although not super stable, at least
> doesn't hang the client box.  We find ourselves in the position now where
> for additional functionality we *need* an actual block device, so we have
> to find a kernel client that works.  I will certainly keep you posted and
> can produce the output you've requested.
>
> I'd also be willing to run an early 4.5 version in our test environment.
>
> On Tue, Dec 8, 2015 at 3:35 AM, Ilya Dryomov <idryo...@gmail.com> wrote:
>
>> On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen <pav...@gmail.com>
>> wrote:
>> > We aren't running NFS, but regularly use the kernel driver to map RBDs
>> and
>> > mount filesystems in same.  We see very similar behavior across nearly
>> all
>> > kernel versions we've tried.  In my experience only very few versions
>> of the
>> > kernel driver survive any sort of crush map change/update while
>> something is
>> > mapped.  In fact in the last 2 years I think I've only seen this work
>> on 1
>> > kernel version unfortunately its badly out of date and we can't run it
>> in
>> > our environment anymore, I think it was a 3.0 kernel version running on
>> > ubuntu 12.04.  We have just recently started trying to find a kernel
>> that
>> > will survive OSD outages or changes to the cluster.  We're on ubuntu
>> 14.04,
>> > and have tried 3.16, 3.19.0-25, 4.3, and 4.2 without success in the last
>> > week.  We only map 1-3 RBDs per client machine at a time but we
>> regularly
>> > will get processes stuck in D state which are accessing the filesystem
>> > inside the RBD and will have to hard reboot the RBD client machine.
>> This is
>> > always associated with a cluster change in some way, reweighting OSDs,
>> > rebooting an OSD host, restarting an individual OSD, adding OSDs, and
>> > removing OSDs all cause the kernel client to hang.  If no change is
>> made to
>> > the cluster, the kernel client will be happy for weeks.
>>
>> There are a couple of known bugs in the remap/resubmit area, but those
>> are supposedly corner cases (like *all* the OSDs going down and then
>> back up, etc).  I had no idea it was that severe and goes that back.
>> Apparently triggering it requires a heavier load, as we've never seen
>> anything like that in our tests.
>>
>> For unrelated reasons, remap/resubmit code is getting entirely
>> rewritten for kernel 4.5, so, if you've been dealing with this issue
>> for the last two years (I don't remember seeing any tickets listing
>> that many kernel versions and not mentioning NFS), I'm afraid the best
>> course of action for you would be to wait for 4.5 to come out and try
>> it.  If you'd be willing to test out an early version on one of more of
>> your client boxes, I can ping you when it's ready.
>>
>> I'll take a look at 3.0 vs 3.16 with an eye on remap code.  Did you
>> happen to try 3.10?
>>
>> It sounds like you can reproduce this pretty easily.  Can you get it to
>> lock up and do:
>>
>> # cat /sys/kernel/debug/ceph/*/osdmap
>> # cat /sys/kernel/debug/ceph/*/osdc
>> $ ceph status
>>
>> and bunch of times?  I have a hunch that kernel client simply fails to
>> request enough of new osdmaps after the cluster topology changes under
>> load.
>>
>> Thanks,
>>
>> Ilya
>>
>
>


Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We run deep scrubs via cron with a script, so we know when deep scrubs are
happening, and we've seen nodes fail both during deep scrubbing and while
no deep scrubs are occurring, so I'm pretty sure it's not related.
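
For context, a script like ours boils down to a pacing loop over the pg
list, something like this sketch (the pacing interval is illustrative),
combined with something like the nodeep-scrub flag or a long deep scrub
interval so the OSDs' own scheduler isn't also kicking scrubs off on its
own timetable:

#!/bin/bash
# walk every pg and ask for a deep scrub, slowly
for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
    ceph pg deep-scrub "$pg"
    sleep 30
done

That way we always know the window in which deep scrubs can be running.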


On Tue, Dec 8, 2015 at 2:42 AM, Benedikt Fraunhofer <fraunho...@traced.net>
wrote:

> Hi Tom,
>
> 2015-12-08 10:34 GMT+01:00 Tom Christensen <pav...@gmail.com>:
>
> > We didn't go forward to 4.2 as its a large production cluster, and we
> just
> > needed the problem fixed.  We'll probably test out 4.2 in the next couple
>
> unfortunately we don't have the luxury of a test cluster.
> and to add to that, we couldnt simulate the load, altough it does not
> seem to be load related.
> Did you try running with nodeep-scrub as a short-term workaround?
>
> I'll give ~30% of the nodes 4.2 and see how it goes.
>
> > In our experience it takes about 2 weeks to start happening
>
> we're well below that. Somewhat between 1 and 4 days.
> And yes, once one goes south, it affects the rest of the cluster.
>
> Thx!
>
>  Benedikt
>


Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Tom Christensen
We haven't submitted a ticket, as we've just avoided using the kernel
client.  We've periodically tried with various kernels and various versions
of ceph over the last two years, but have given up each time and reverted
to using rbd-fuse, which, although not super stable, at least doesn't hang
the client box.  We now find ourselves in the position where for additional
functionality we *need* an actual block device, so we have to find a kernel
client that works.  I will certainly keep you posted and can produce the
output you've requested.

I'd also be willing to run an early 4.5 version in our test environment.
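
For anyone curious, the rbd-fuse fallback is just (the pool name is
illustrative):

rbd-fuse -c /etc/ceph/ceph.conf -p volumes /mnt/rbd-images

which exposes every image in the pool as a regular file under the
mountpoint.  It avoids the kernel client entirely, but as noted it has
stability quirks of its own.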

On Tue, Dec 8, 2015 at 3:35 AM, Ilya Dryomov <idryo...@gmail.com> wrote:

> On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen <pav...@gmail.com> wrote:
> > We aren't running NFS, but regularly use the kernel driver to map RBDs
> and
> > mount filesystems in same.  We see very similar behavior across nearly
> all
> > kernel versions we've tried.  In my experience only very few versions of
> the
> > kernel driver survive any sort of crush map change/update while
> something is
> > mapped.  In fact in the last 2 years I think I've only seen this work on
> 1
> > kernel version unfortunately its badly out of date and we can't run it in
> > our environment anymore, I think it was a 3.0 kernel version running on
> > ubuntu 12.04.  We have just recently started trying to find a kernel that
> > will survive OSD outages or changes to the cluster.  We're on ubuntu
> 14.04,
> > and have tried 3.16, 3.19.0-25, 4.3, and 4.2 without success in the last
> > week.  We only map 1-3 RBDs per client machine at a time but we regularly
> > will get processes stuck in D state which are accessing the filesystem
> > inside the RBD and will have to hard reboot the RBD client machine.
> This is
> > always associated with a cluster change in some way, reweighting OSDs,
> > rebooting an OSD host, restarting an individual OSD, adding OSDs, and
> > removing OSDs all cause the kernel client to hang.  If no change is made
> to
> > the cluster, the kernel client will be happy for weeks.
>
> There are a couple of known bugs in the remap/resubmit area, but those
> are supposedly corner cases (like *all* the OSDs going down and then
> back up, etc).  I had no idea it was that severe and goes that back.
> Apparently triggering it requires a heavier load, as we've never seen
> anything like that in our tests.
>
> For unrelated reasons, remap/resubmit code is getting entirely
> rewritten for kernel 4.5, so, if you've been dealing with this issue
> for the last two years (I don't remember seeing any tickets listing
> that many kernel versions and not mentioning NFS), I'm afraid the best
> course of action for you would be to wait for 4.5 to come out and try
> it.  If you'd be willing to test out an early version on one of more of
> your client boxes, I can ping you when it's ready.
>
> I'll take a look at 3.0 vs 3.16 with an eye on remap code.  Did you
> happen to try 3.10?
>
> It sounds like you can reproduce this pretty easily.  Can you get it to
> lock up and do:
>
> # cat /sys/kernel/debug/ceph/*/osdmap
> # cat /sys/kernel/debug/ceph/*/osdc
> $ ceph status
>
> and bunch of times?  I have a hunch that kernel client simply fails to
> request enough of new osdmaps after the cluster topology changes under
> load.
>
> Thanks,
>
> Ilya
>


Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-03 Thread Tom Christensen
We were able to prevent the blacklist operations, and now the cluster is
much happier; however, the OSDs still have not started cleaning up old osd
maps after 48 hours.  Is there anything we can do to poke them into
cleaning up the old maps?
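
In case it helps anyone else watch the same thing, this is roughly how we're
tracking the (non-)trimming (the osd id is illustrative):

# the range the monitors have committed
ceph report 2>/dev/null | grep -E 'osdmap_first_committed|osdmap_last_committed'

# the range an individual OSD is still holding on to
ceph daemon osd.7 status | grep -E 'oldest_map|newest_map'

The oldest_map on our OSDs simply isn't advancing, even with the cluster
healthy.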



On Wed, Dec 2, 2015 at 11:25 AM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Tue, Dec 1, 2015 at 10:02 AM, Tom Christensen <pav...@gmail.com> wrote:
> > Another thing that we don't quite grasp is that when we see slow requests
> > now they almost always, probably 95% have the "known_if_redirected" state
> > set.  What does this state mean?  Does it indicate we have OSD maps that
> are
> > lagging and the cluster isn't really in sync?  Could this be the cause of
> > our growing osdmaps?
>
> This is just a flag set on operations by new clients to let the OSD
> perform more effectively — you don't need to worry about it.
>
> I'm not sure why you're getting a bunch of client blacklist
> operations, but each one will generate a new OSDMap (if nothing else
> prompts one), yes.
> -Greg
>
> >
> > -Tom
> >
> >
> > On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul)
> > <paul.hewl...@alcatel-lucent.com> wrote:
> >>
> >> I believe that ‘filestore xattr use omap’ is no longer used in Ceph –
> can
> >> anybody confirm this?
> >> I could not find any usage in the Ceph source code except that the value
> >> is set in some of the test software…
> >>
> >> Paul
> >>
> >>
> >> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom
> >> Christensen <pav...@gmail.com>
> >> Date: Monday, 30 November 2015 at 23:20
> >> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
> >>
> >> What counts as ancient?  Concurrent to our hammer upgrade we went from
> >> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
> >> we'd been running because we're also seeing an intermittent (its
> happened
> >> twice in 2 weeks) massive load spike that completely hangs the osd node
> >> (we're talking about load averages that hit 20k+ before the box becomes
> >> completely unresponsive).  We saw a similar behavior on a 3.13 kernel,
> which
> >> resolved by moving to the 3.16 kernel we had before.  I'll try to catch
> one
> >> with debug_ms=1 and see if I can see it we're hitting a similar hang.
> >>
> >> To your comment about omap, we do have filestore xattr use omap = true
> in
> >> our conf... which we believe was placed there by ceph-deploy (which we
> used
> >> to deploy this cluster).  We are on xfs, but we do take tons of RBD
> >> snapshots.  If either of these use cases will cause lots of osd map size
> >> then, we may just be exceeding the limits of the number of rbd snapshots
> >> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)
> >>
> >> An interesting note, we had an OSD flap earlier this morning, and when
> it
> >> did, immediately after it came back I checked its meta directory size
> with
> >> du -sh, this returned immediately, and showed a size of 107GB.  The fact
> >> that it returned immediately indicated to me that something had just
> >> recently read through that whole directory and it was all cached in the
> FS
> >> cache.  Normally a du -sh on the meta directory takes a good 5 minutes
> to
> >> return.  Anyway, since it dropped this morning its meta directory size
> >> continues to shrink and is down to 93GB.  So it feels like something
> happens
> >> that makes the OSD read all its historical maps which results in the OSD
> >> hanging cause there are a ton of them, and then it wakes up and
> realizes it
> >> can delete a bunch of them...
> >>
> >> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com>
> >> wrote:
> >>>
> >>> The trick with debugging heartbeat problems is to grep back through the
> >>> log to find the last thing the affected thread was doing, e.g. is
> >>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading
> through the
> >>> omap, etc..
> >>>
> >>> I agree this doesn't look to be network related, but if you want to
> rule
> >>> it out you should use debug_ms=1.
> >>>
> >>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
> >>> simi

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
Another thing that we don't quite grasp: when we see slow requests now,
they almost always (probably 95%) have the "known_if_redirected" state set.
What does this state mean?  Does it indicate we have OSD maps that are
lagging and the cluster isn't really in sync?  Could this be the cause of
our growing osdmaps?

-Tom


On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <
paul.hewl...@alcatel-lucent.com> wrote:

> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
> anybody confirm this?
> I could not find any usage in the Ceph source code except that the value
> is set in some of the test software…
>
> Paul
>
>
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom
> Christensen <pav...@gmail.com>
> Date: Monday, 30 November 2015 at 23:20
> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>
> What counts as ancient?  Concurrent to our hammer upgrade we went from
> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
> we'd been running because we're also seeing an intermittent (its happened
> twice in 2 weeks) massive load spike that completely hangs the osd node
> (we're talking about load averages that hit 20k+ before the box becomes
> completely unresponsive).  We saw a similar behavior on a 3.13 kernel,
> which resolved by moving to the 3.16 kernel we had before.  I'll try to
> catch one with debug_ms=1 and see if I can see it we're hitting a similar
> hang.
>
> To your comment about omap, we do have filestore xattr use omap = true in
> our conf... which we believe was placed there by ceph-deploy (which we used
> to deploy this cluster).  We are on xfs, but we do take tons of RBD
> snapshots.  If either of these use cases will cause lots of osd map size
> then, we may just be exceeding the limits of the number of rbd snapshots
> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)
>
> An interesting note, we had an OSD flap earlier this morning, and when it
> did, immediately after it came back I checked its meta directory size with
> du -sh, this returned immediately, and showed a size of 107GB.  The fact
> that it returned immediately indicated to me that something had just
> recently read through that whole directory and it was all cached in the FS
> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
> return.  Anyway, since it dropped this morning its meta directory size
> continues to shrink and is down to 93GB.  So it feels like something
> happens that makes the OSD read all its historical maps which results in
> the OSD hanging cause there are a ton of them, and then it wakes up and
> realizes it can delete a bunch of them...
>
> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com>
> wrote:
>
>> The trick with debugging heartbeat problems is to grep back through the
>> log to find the last thing the affected thread was doing, e.g. is
>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>> omap, etc..
>>
>> I agree this doesn't look to be network related, but if you want to rule
>> it out you should use debug_ms=1.
>>
>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>> similarly started getting slow requests. To make a long story short, our
>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>> of this was 900s of slow requests, then an ms log showing "initiating
>> reconnect". Until we got the kernel upgraded everywhere, we used a
>> workaround of ms tcp read timeout = 60.
>> So, check your kernels, and upgrade if they're ancient. Latest el6
>> kernels work for us.
>>
>> Otherwise, those huge osd leveldb's don't look right. (Unless you're
>> using tons and tons of omap...) And it kinda reminds me of the other
>> problem we hit after the hammer upgrade, namely the return of the ever
>> growing mon leveldb issue. The solution was to recreate the mons one by
>> one. Perhaps you've hit something similar with the OSDs. debug_osd=10 might
>> be good enough to see what the osd is doing, maybe you need
>> debug_filestore=10 also. If that doesn't show the problem, bump those up to
>> 20.
>>
>> Good luck,
>>
>> Dan
>>
>> On 30 Nov 2015 20:56, "Tom Christensen" <pav...@gmail.com> wrote:
>> >
>> > We recently upgraded to 0.94.3 from firefly and now for the last week
>> have had intermittent slow requests and flapping OSDs.  

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
Another "new" thing we see with hammer is a constant stream of entries
like this in the log and while watching ceph -w:

mon.0 [INF] from='client.52217412 :/0' entity='client.admin'
cmd='[{"prefix": "osd blacklist", "blacklistop": "add", "addr":
":0/3562049007"}]': finished

The cluster appears to generate a new osdmap after every one of these
entries.  They appear to be associated with client connect/disconnect
operations.  Unfortunately, in our use case we use librbd to connect and
disconnect a lot.  What does this new message indicate?  Can it be disabled
or turned off, so that librbd sessions don't cause a new osdmap to be
generated?

In ceph -w output, whenever we see those entries, we immediately see a new
osdmap, hence my suspicion that this message is causing a new osdmap to be
generated.
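
The entries correspond to entries in the OSD blacklist, which you can at
least inspect and prune by hand (the address below is illustrative, not one
of ours):

ceph osd blacklist ls
ceph osd blacklist rm 192.168.0.5:0/3562049007

though that obviously does nothing about whatever is issuing the blacklist
adds in the first place.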




On Tue, Dec 1, 2015 at 11:02 AM, Tom Christensen <pav...@gmail.com> wrote:

> Another thing that we don't quite grasp is that when we see slow requests
> now they almost always, probably 95% have the "known_if_redirected" state
> set.  What does this state mean?  Does it indicate we have OSD maps that
> are lagging and the cluster isn't really in sync?  Could this be the cause
> of our growing osdmaps?
>
> -Tom
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <
> paul.hewl...@alcatel-lucent.com> wrote:
>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
>> anybody confirm this?
>> I could not find any usage in the Ceph source code except that the value
>> is set in some of the test software…
>>
>> Paul
>>
>>
>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom
>> Christensen <pav...@gmail.com>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient?  Concurrent to our hammer upgrade we went from
>> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
>> we'd been running because we're also seeing an intermittent (its happened
>> twice in 2 weeks) massive load spike that completely hangs the osd node
>> (we're talking about load averages that hit 20k+ before the box becomes
>> completely unresponsive).  We saw a similar behavior on a 3.13 kernel,
>> which resolved by moving to the 3.16 kernel we had before.  I'll try to
>> catch one with debug_ms=1 and see if I can see it we're hitting a similar
>> hang.
>>
>> To your comment about omap, we do have filestore xattr use omap = true in
>> our conf... which we believe was placed there by ceph-deploy (which we used
>> to deploy this cluster).  We are on xfs, but we do take tons of RBD
>> snapshots.  If either of these use cases will cause lots of osd map size
>> then, we may just be exceeding the limits of the number of rbd snapshots
>> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)
>>
>> An interesting note, we had an OSD flap earlier this morning, and when it
>> did, immediately after it came back I checked its meta directory size with
>> du -sh, this returned immediately, and showed a size of 107GB.  The fact
>> that it returned immediately indicated to me that something had just
>> recently read through that whole directory and it was all cached in the FS
>> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
>> return.  Anyway, since it dropped this morning its meta directory size
>> continues to shrink and is down to 93GB.  So it feels like something
>> happens that makes the OSD read all its historical maps which results in
>> the OSD hanging cause there are a ton of them, and then it wakes up and
>> realizes it can delete a bunch of them...
>>
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com>
>> wrote:
>>
>>> The trick with debugging heartbeat problems is to grep back through the
>>> log to find the last thing the affected thread was doing, e.g. is
>>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>>> omap, etc..
>>>
>>> I agree this doesn't look to be network related, but if you want to rule
>>> it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>>> similarly started getting slow requests. To make a long story short, our
>>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>>> ancient el6 kernel (these osd servers had ~800 days'

[ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-11-30 Thread Tom Christensen
We recently upgraded from firefly to 0.94.3, and for the last week we have
had intermittent slow requests and flapping OSDs.  We have been unable to
nail down the cause, but it's feeling like it may be related to our osdmaps
not getting deleted properly.  Most of our osds are now storing over 100GB
of data in the meta directory, and almost all of that is historical osd
maps going back more than 7 days.

We did make a small cluster change (we added 35 OSDs to a 1445 OSD
cluster); the rebalance took about 36 hours and completed 10 days ago.
Since that time the cluster has been HEALTH_OK and all pgs have been
active+clean, except for when we have an OSD flap.

When the OSDs flap they do not crash and restart; they just go unresponsive
for 1-3 minutes and then come back alive all on their own.  They get marked
down by peers, cause some peering, and then they just come back, rejoin the
cluster, and continue on their merry way.

We see a bunch of this in the logs while the OSD is catatonic:

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request


I have a chunk of logs at debug 20/5; I'm not sure if I should have done
just 20...  It's pretty hard to catch: we basically have to see the slow
requests and get debug logging set in about a 5-10 second window before the
OSD stops responding to the admin socket...
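
For reference, the way we bump logging in that window is via the admin
socket, or tell if we can't get onto the box fast enough (osd.1191 is the
one from the log above):

ceph daemon osd.1191 config set debug_osd 20
ceph daemon osd.1191 config set debug_ms 1

# or remotely
ceph tell osd.1191 injectargs '--debug-osd 20 --debug-ms 1'

and then drop them back down once the episode has been captured.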

As networking is almost always the cause of flapping OSDs we have tested
the network quite extensively.  It hasn't changed physically since before
the hammer upgrade, and was performing well.  We have done large amounts of
ping tests and have not seen a single dropped packet between osd nodes or
between osd nodes and mons.

I don't see any error packets or drops on switches either.

Ideas?


Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-11-30 Thread Tom Christensen
What counts as ancient?  Concurrent with our hammer upgrade we went from
3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
we'd been running, because we're also seeing an intermittent (it's happened
twice in 2 weeks) massive load spike that completely hangs the osd node
(we're talking about load averages that hit 20k+ before the box becomes
completely unresponsive).  We saw similar behavior on a 3.13 kernel, which
was resolved by moving to the 3.16 kernel we had before.  I'll try to catch
one with debug_ms=1 and see whether we're hitting a similar hang.

To your comment about omap: we do have filestore xattr use omap = true in
our conf, which we believe was placed there by ceph-deploy (which we used
to deploy this cluster).  We are on xfs, but we do take tons of RBD
snapshots.  If either of these use cases causes a lot of osd map growth,
then we may just be exceeding the number of rbd snapshots ceph can handle
(we take about 4-5000/day, 1 per RBD in the cluster).

An interesting note: we had an OSD flap earlier this morning, and
immediately after it came back I checked its meta directory size with
du -sh.  It returned immediately and showed a size of 107GB.  The fact that
it returned immediately indicated to me that something had just recently
read through that whole directory and it was all cached in the FS cache;
normally a du -sh on the meta directory takes a good 5 minutes to return.
Anyway, since it dropped this morning its meta directory size has continued
to shrink and is down to 93GB.  So it feels like something happens that
makes the OSD read all of its historical maps, which results in the OSD
hanging because there are a ton of them, and then it wakes up and realizes
it can delete a bunch of them...

On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com>
wrote:

> The trick with debugging heartbeat problems is to grep back through the
> log to find the last thing the affected thread was doing, e.g. is
> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
> omap, etc..
>
> I agree this doesn't look to be network related, but if you want to rule
> it out you should use debug_ms=1.
>
> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
> similarly started getting slow requests. To make a long story short, our
> issue turned out to be sendmsg blocking (very rarely), probably due to an
> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
> of this was 900s of slow requests, then an ms log showing "initiating
> reconnect". Until we got the kernel upgraded everywhere, we used a
> workaround of ms tcp read timeout = 60.
> So, check your kernels, and upgrade if they're ancient. Latest el6 kernels
> work for us.
>
> Otherwise, those huge osd leveldb's don't look right. (Unless you're using
> tons and tons of omap...) And it kinda reminds me of the other problem we
> hit after the hammer upgrade, namely the return of the ever growing mon
> leveldb issue. The solution was to recreate the mons one by one. Perhaps
> you've hit something similar with the OSDs. debug_osd=10 might be good
> enough to see what the osd is doing, maybe you need debug_filestore=10
> also. If that doesn't show the problem, bump those up to 20.
>
> Good luck,
>
> Dan
>
> On 30 Nov 2015 20:56, "Tom Christensen" <pav...@gmail.com> wrote:
> >
> > We recently upgraded to 0.94.3 from firefly and now for the last week
> have had intermittent slow requests and flapping OSDs.  We have been unable
> to nail down the cause, but its feeling like it may be related to our
> osdmaps not getting deleted properly.  Most of our osds are now storing
> over 100GB of data in the meta directory, almost all of that is historical
> osd maps going back over 7 days old.
> >
> > We did do a small cluster change (We added 35 OSDs to a 1445 OSD
> cluster), the rebalance took about 36 hours, and it completed 10 days ago.
> Since that time the cluster has been HEALTH_OK and all pgs have been
> active+clean except for when we have an OSD flap.
> >
> > When the OSDs flap they do not crash and restart, they just go
> unresponsive for 1-3 minutes, and then come back alive all on their own.
> They get marked down by peers, and cause some peering and then they just
> come back rejoin the cluster and continue on their merry way.
> >
> > We see a bunch of this in the logs while the OSD is catatonic:
> >
> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166
> 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread
> 0x7f5affe72700' had timed out after 15
> >
> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700
> 10 osd.1191 1203850 internal heartbeat not healthy, droppi