Re: rbd map command hangs for 15 minutes during system start up
Unfortunately the hangs still happen with the new set of patches. Here's some
debug info:

https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt

On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder wrote:
> On 11/29/2012 02:37 PM, Alex Elder wrote:
>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>> Here are the ceph log messages (including the libceph kernel debug
>>> stuff you asked for) from a node boot with the rbd command hung for a
>>> couple of minutes:
>
> I'm sorry, but I did something stupid...
>
> Yes, the branch I gave you includes these fixes. However it does *not*
> include the commit that was giving you trouble to begin with.
>
> So...
>
> I have updated that same branch (wip-nick) to contain:
>   - Linux 3.5.5
>   - Plus the first *50* (not 49) patches you listed
>   - Plus the ones I added before.
>
> The new commit id for that branch begins with be3198d6.
>
> I'm really sorry for this mistake. Please try this new branch and report
> back what you find.
>
> -Alex
>
>> Nick, I have put together a branch that includes two fixes that might
>> be helpful. I don't expect these fixes will necessarily *fix* what
>> you're seeing, but one of them pulls a big hunk of processing out of
>> the picture and might help eliminate some potential causes. I had to
>> pull in several other patches as prerequisites in order to get those
>> fixes to apply cleanly.
>>
>> Would you be able to give it a try, and let us know what results you
>> get? The branch contains:
>>   - Linux 3.5.5
>>   - Plus the first 49 patches you listed
>>   - Plus four patches, which are prerequisites...
>>       libceph: define ceph_extract_encoded_string()
>>       rbd: define some new format constants
>>       rbd: define rbd_dev_image_id()
>>       rbd: kill create_snap sysfs entry
>>   - ...for these two bug fixes:
>>       libceph: remove 'osdtimeout' option
>>       ceph: don't reference req after put
>>
>> The branch is available in the ceph-client git repository under the
>> name "wip-nick" and has commit id dd9323aa.
>> https://github.com/ceph/ceph-client/tree/wip-nick
>>
>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>
>> This full debug output is very helpful. Please supply that again as
>> well.
>>
>> Thanks.
>>
>> -Alex
>>
>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos wrote:
It's very easy to reproduce now with my automated install script. The most
I've seen it succeed with that patch is 2 builds in a row, hanging on the
3rd, although it hangs on most builds, so it shouldn't take much to get it
to do it again. I'll try and get to that tomorrow, when I'm a bit more
rested and my brain is working better.

Yes, during this the OSDs are probably all syncing up. All the osd and mon
daemons have started by the time the rbd commands are run, though.

On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense. I've opened
>
> http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce? If it is something you can trigger with
> debugging enabled ('echo module libceph +p >
> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop
> while thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos wrote:
>>> With 8 successful installs already done, I'm reasonably confident that
>>> it's patch #50. I'm making another build which applies all patches
>>> from the 3.5 backport branch, excluding that specific one. I'll let
>>> you know if that turns up any unexpected failures.
>>>
>>> What will the potential fallout be of removing that specific patch?
>>>
>>>
>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos wrote:
It's really looking like it's the
libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches
1-50 (listed below) are applied to 3.5.7, the hang is present. So far I
have gone through 4 successful installs with no hang with only 1-49
applied. I'm still leaving my test running to make sure it's not a fluke,
but since it previously hung within the first couple of builds, it really
looks like this is where the problem originated.

1-libceph_eliminate_connection_state_DEAD.patch
2-libceph_kill_bad_proto_ceph_connection_op.patch
3-libceph_rename_so
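For anyone trying to reproduce this, the dynamic-debug setup Sage refers to
above amounts to roughly the following. This is only a minimal sketch; it
assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs available at
/sys/kernel/debug, and the output path is just an example:

    # mount debugfs if it is not already mounted
    mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
    # enable pr_debug output for the libceph and rbd kernel modules
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
    # reproduce the hanging 'rbd map', then collect the kernel messages
    dmesg > /tmp/libceph-debug.txt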
Re: endless flying slow requests
I pushed another fix that plugs a leak when requests race with peering; it
fixed the 'slow request' warnings for osd_sub_op messages for me. It's in
wip-osd-leak. Sam, when you get a chance, can you take a look and push it
to next if it looks okay?

Thanks!
sage

On Fri, 30 Nov 2012, Samuel Just wrote:
> I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
> Let me know if the problem persists with this patch.
> -Sam
>
> On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov wrote:
> > On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just wrote:
> >> Also, these clusters aren't mixed argonaut and next, are they? (Not
> >> that that shouldn't work, but it would be a useful data point.)
> >> -Sam
> >>
> >> On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just wrote:
> >>> Did you observe hung I/O along with that error? Both sub_op_commit
> >>> and sub_op_applied have happened, so the sub_op_reply should have
> >>> been sent back to the primary. This looks more like a leak. If you
> >>> also observed hung I/O, then it's possible that the problem is
> >>> occurring between the sub_op_applied event and the response.
> >>> -Sam
> >>>
> >
> > It is relatively easy to check whether one of the client VMs has
> > locked one or more cores in iowait or simply hangs, so yes, these ops
> > are related to real commit operations and they are hung.
> > I'm using an all-new 0.54 cluster, without mixing of course. Did
> > everyone who hit this bug readjust the cluster before the bug showed
> > itself (say, within a day or so)?
> >
> >>> On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov wrote:
> On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil wrote:
> > Hi Stefan,
> >
> > On Thu, 15 Nov 2012, Sage Weil wrote:
> >> On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
> >> > On 14.11.2012 15:59, Sage Weil wrote:
> >> > > Hi Stefan,
> >> > >
> >> > > It would be nice to confirm that no clients are waiting on
> >> > > replies for these requests; currently we suspect that the OSD
> >> > > request tracking is the buggy part. If you query the OSD admin
> >> > > socket you should be able to dump requests and see the client
> >> > > IP, and then query the client.
> >> > >
> >> > > Is it librbd? In that case you likely need to change the config
> >> > > so that it is listening on an admin socket ('admin socket =
> >> > > path').
> >> >
> >> > Yes it is. So I have to specify the admin socket on the KVM host?
> >>
> >> Right. IIRC the disk line is a ; (or \;) separated list of key/value
> >> pairs.
> >>
> >> > How do I query the admin socket for requests?
> >>
> >> ceph --admin-daemon /path/to/socket help
> >> ceph --admin-daemon /path/to/socket objecter_dump (I think)
> >
> > Were you able to reproduce this?
> >
> > Thanks!
> > sage
>
> Meanwhile, I did. :)
> Such requests will always be created if you have restarted or marked an
> osd out and then back in and a scrub didn't happen in the meantime
> (after such an operation and before the request arrived).
> What is more interesting, the hangup happens not exactly at the time of
> the operation, but tens of minutes later.
> { "description": "osd_sub_op(client.1292013.0:45422 4.731
>       a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
>       16444'21693 snapset=0=[]:[] snapc=0=[])",
>   "received_at": "2012-11-28 03:54:43.094151",
>   "age": "27812.942680",
>   "duration": "2.676641",
>   "flag_point": "started",
>   "events": [
>     { "time": "2012-11-28 03:54:43.094222",
>       "event": "waiting_for_osdmap"},
>     { "time": "2012-11-28 03:54:43.386890",
>       "event": "reached_pg"},
>     { "time": "2012-11-28 03:54:43.386894",
>       "event": "started"},
>     { "time": "2012-11-28 03:54:43.386973",
>       "event": "commit_queued_for_journal_write"},
>     { "time": "2012-11-28 03:54:45.360049",
>       "event": "write_thread_in_journal_buffer"},
>     { "time": "2012-11-28 03:54:45.586183",
>       "event": "journaled_completion_queued"},
>     { "time": "2012-11-28 03:54:45.586262",
>       "event": "sub_op_commit"},
>     { "time": "2012-11-28 03:54:45.770792",
>       "event": "sub_op_applied"}]}]}
>
>
>
>> sage
>>
>> >
>> > Stefan
>> >
>> >
>> > > On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> > >
>> > > > Hello list,
>> > > >
>> > > > I see this several times: endless flying slow requests. And they
>> > > > never
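The dump above is the kind of output the OSD op tracker produces. For anyone
following along, querying it (and a librbd client) looks roughly like this.
This is a hedged sketch: the socket paths are examples, and the exact command
names vary between ceph versions, so run 'help' first to see what your build
supports:

    # on an OSD host: list supported commands, then dump in-flight ops
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
    # on a librbd client (e.g. the KVM host), after adding something like
    # 'admin socket = /var/run/ceph/$name.asok' to its [client] section
    ceph --admin-daemon /var/run/ceph/client.admin.asok help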
Re: rbd STDIN import does not work / wip-rbd-export-stdout
On 12/1/2012 12:23 PM, Stefan Priebe wrote:
> Hi Dan,
>
> On 27.11.2012 10:16, Stefan Priebe - Profihost AG wrote:
>> On 26.11.2012 21:00, Dan Mick wrote:
>>> It writes zeros; there's no way for it to know how many zeros are
>>> coming. It could make a half-hearted attempt depending on its buffer
>>> size and the amount of data the source is willing to buffer.
>>
>> Yes, but why not at least skip a buffer when the WHOLE BUFFER is zero,
>> and use buffer size = stripe size? Right now a 40GB image with just 3GB
>> used (the rest is zeros thanks to running fstrim before the export)
>> takes a long time to import. We could speed this up a lot if we skipped
>> parts that are only zeros.

Great minds! I'm actually working on that now; as you say it's not hard.

> Greets,
> Stefan
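The zero-skipping idea being discussed comes down to checking whether each
stripe-sized chunk is entirely zero before writing it, and leaving the
corresponding object sparse when it is. A rough shell illustration of that
check for a single chunk (the 4 MB chunk size and the file name are just
examples; the real logic lives in rbd's C++ import code):

    CHUNK_SIZE=$((4 * 1024 * 1024))   # assume a 4 MB stripe/object size
    if head -c "$CHUNK_SIZE" /dev/zero | cmp -s - chunk.bin; then
        echo "chunk is all zeros: skip the write, leave the object sparse"
    else
        echo "chunk contains data: write it"
    fi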
Re: rbd STDIN import does not work / wip-rbd-export-stdout
On 11/28/2012 6:53 AM, Stefan Priebe - Profihost AG wrote:
> Hi Dan,
>
> thanks for your update / new import-export branch. The import seems to
> work now. What I'm wondering about is that it is really slow. I see
> around 10% CPU usage on the importing host (the host where rbd import is
> started) and data on ceph growing by just 4MB every 2s. This means 40GB
> took around 5-6 hours. Might this be due to the size update?

4MB every 2s looks like the stripe size. I did change the resize to double
each time to help with that before pushing to master, but that still seems
awfully slow. You should definitely use the version in mainline, though;
it's better in several ways.

Are these particularly sparse images? I'm testing a "preserve sparseness"
tweak that should help a lot if so (and save space in the cluster,
obviously).
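Since the question of sparseness comes up: a quick way to see how sparse a
source image actually is before importing is to compare its apparent size
with the blocks actually allocated (GNU du; the path is just an example):

    du -h --apparent-size /var/lib/images/disk.img   # logical size, e.g. 40G
    du -h /var/lib/images/disk.img                   # allocated size, e.g. 3G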
Re: [PATCH V2] mds: fix CDir::_commit_partial() bug
On Sat, 1 Dec 2012, Yan, Zheng wrote:
> Hi Sage,
>
> This patch is buggy, please drop it from wip-mds.

Done.

> On 11/23/2012 12:52 AM, Yan, Zheng wrote:
> > From: "Yan, Zheng"
> >
> > When a null dentry is encountered, CDir::_commit_partial() adds
> > a OSD_TMAP_RM command to delete the dentry. But if the dentry is
> > new, the osd will not find the dentry when handling the command
> > and the tmap update operation will fail totally.
> >
> > This patch also makes sure dentries are properly marked as new
> > when preparing new dentries and exporting dentries.
> >
> > Signed-off-by: Yan, Zheng
> > ---
> SNIP
> > diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> > index ec0d5d5..228fede 100644
> > --- a/src/mds/Server.cc
> > +++ b/src/mds/Server.cc
> > @@ -1685,6 +1685,9 @@ CDentry* Server::prepare_null_dentry(MDRequest *mdr, CDir *dir, const string& dn
> >    }
> >  }
> >
> > +  if (!dn->is_dirty())
> > +    dn->mark_new();
> > +
>
> This should be something like 'else if (!dn->is_dirty())'. This issue
> has been fixed on the osd side, so this patch is no longer needed.

Yeah. It would still be nice to avoid including those ops in the osd
request, as an optimization, but that can happen later.

Repushed wip-mds, and scheduled the fs suite to run against it. That only
covers single-mds configurations still, but as long as there aren't
regressions there I'm happy relying on your testing for now.

Real Soon Now we need to flesh out that suite to include clustered mds
tests and export thrashing, and build an mds_thrasher teuthology task that
restarts, adds, and stops mds's to exercise the failure paths. Eventually
we'll want to hook that into the failure injection sites in Migration.cc
as well so we deliberately exercise each failure point...

sage
wip-msgr-delay-queue
Is ready to merge, after a final look. There is one additional patch that
handles the connection fault and reconnect cases that hasn't been reviewed
yet.

Passed the rados suite, although when we add it to the nightly suite we'll
want lower probabilities, because most of the runs took hours to complete
with the artificial delays.

Thanks!
sage
Re: rbd STDIN import does not work / wip-rbd-export-stdout
Hi Dan,

On 27.11.2012 10:16, Stefan Priebe - Profihost AG wrote:
> On 26.11.2012 21:00, Dan Mick wrote:
>> It writes zeros; there's no way for it to know how many zeros are
>> coming. It could make a half-hearted attempt depending on its buffer
>> size and the amount of data the source is willing to buffer.
>
> Yes, but why not at least skip a buffer when the WHOLE BUFFER is zero,
> and use buffer size = stripe size? Right now a 40GB image with just 3GB
> used (the rest is zeros thanks to running fstrim before the export)
> takes a long time to import. We could speed this up a lot if we skipped
> parts that are only zeros.

Greets,
Stefan
slow backfilling / remapping
Hello list,

today I was trying to see what happens when I remove an osd. I executed:

    ceph osd 21 out

Cluster details:
- XFS filesystem for the OSDs
- Software version: latest next from today, 12/01/2012
- 6 nodes with 4 OSDs each, all SSDs

I have just 8 GB of data, so I thought removing an OSD should be VERY fast.
Here are parts of the logs; the whole process took 15 minutes.

2012-12-01 20:06:02.607191 mon.0 [INF] pgmap v59693: 7632 pgs: 7348
active+clean, 266 active+remapped+wait_backfill, 12
active+remapped+backfilling, 6 active+recovering; 7880 MB data, 18812 MB
used, 4428 GB / 4446 GB avail; 180/4303 degraded (4.183%)

2012-12-01 20:16:11.284705 mon.0 [INF] pgmap v6: 7632 pgs: 7562
active+clean, 65 active+remapped+wait_backfill, 5
active+remapped+backfilling; 7880 MB data, 18981 MB used, 4428 GB / 4446 GB
avail; 44/4178 degraded (1.053%)

2012-12-01 20:22:35.829481 mon.0 [INF] pgmap v60182: 7632 pgs: 7628
active+clean, 4 active+remapped+backfilling; 7880 MB data, 19029 MB used,
4428 GB / 4446 GB avail; 3/4137 degraded (0.073%)

2012-12-01 20:22:37.052136 mon.0 [INF] pgmap v60183: 7632 pgs: 7632
active+clean; 7880 MB data, 19033 MB used, 4428 GB / 4446 GB avail

Greets,
Stefan
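For watching a remap/backfill like this, the stock ceph CLI is usually
enough. A minimal sketch (the osd id is an example; recent releases spell
the out command as 'ceph osd out <id>'):

    ceph osd out 21                      # take the osd out of the data placement
    ceph -w                              # stream pgmap updates like the ones above
    ceph health detail                   # summarize degraded/backfilling pgs
    ceph pg dump | grep -c backfill      # rough count of pgs still backfilling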
Re: Hangup during scrubbing - possible solutions
On Sat, Dec 1, 2012 at 9:07 AM, Samuel Just wrote:
> Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
> Let me know if it persists. Thanks for the logs!
> -Sam
>

Very nice, thanks! There is one corner case: the "on-the-fly" upgrade works
well only if your patch is applied to "generic" 0.54 by cherry-picking. An
online upgrade to next-dccf6ee from tagged 0.54 causes osd processes on the
upgraded nodes to fail shortly after restart, with the backtrace you can
see below. An offline upgrade, i.e. shutting down the entire cluster first,
works fine, so the only problem is preserving the running state of the
cluster across the upgrade, which may confuse some users (at least the ones
who run production suites).

http://xdel.ru/downloads/ceph-log/bt-recovery-sj-patch.out.gz

> On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just wrote:
>> Hah! Thanks for the log, it's our handling of active_pushes. I'll
>> have a patch shortly.
>>
>> Thanks!
>> -Sam
>>
>> On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov wrote:
>>> http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
>>> http://xdel.ru/downloads/ceph-log/cluster-w.log.gz
>>>
>>> Here, please.
>>>
>>> I have initiated a deep-scrub of osd.1, which led to forever-stuck
>>> I/O requests in a short time (a plain scrub will do the same). The
>>> second log may be useful for proper timestamps, as seeks in the
>>> original may take a long time. The osd processes on that node were
>>> restarted twice: at the beginning, to be sure all config options were
>>> applied, and at the end, to do the same plus get rid of the stuck
>>> requests.
>>>
>>>
>>> On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just wrote:
If you can reproduce it again, what we really need are the osd logs from
the acting set of a pg stuck in scrub, with
    debug osd = 20
    debug ms = 1
    debug filestore = 20
Thanks,
-Sam

On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov wrote:
> On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil wrote:
>> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>>> Hi,
>>>
>>> In recent versions Ceph shows some unexpected behavior for permanent
>>> connections (VM or kernel clients): after crash recovery, I/O will
>>> hang on the next planned scrub in the following scenario:
>>>
>>> - launch a bunch of clients doing non-intensive writes,
>>> - lose one or more osds, mark them down, wait for recovery completion,
>>> - do a slow scrub, e.g. scrubbing one osd per 5m inside a bash script,
>>>   or wait for ceph to do the same,
>>> - observe a rising number of pgs stuck in the active+clean+scrubbing
>>>   state (they took the master role from ones that were on the killed
>>>   osd and almost surely were being written to at the time of the
>>>   crash),
>>> - some time later, clients will hang hard and the ceph log reports
>>>   stuck (old) I/O requests.
>>>
>>> The only way to bring the clients back without losing their I/O state
>>> is a per-osd restart, which also helps to get rid of the
>>> active+clean+scrubbing pgs.
>>>
>>> First of all, I'll be happy to help solve this problem by providing
>>> logs.
>>
>> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms
>> = 1' logging on the OSD, that would be wonderful!
>>
>
> I have tested a slightly different recovery flow, please see below.
> This time there was no real harm like frozen I/O, but placement groups
> were still stuck forever in the active+clean+scrubbing state until I
> restarted all osds (end of the log):
>
> http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz
>
> - start the healthy cluster
> - start persistent clients
> - add another host with a pair of OSDs and let them join the data
>   placement
> - wait for the data to rearrange
> - [22:06 timestamp] mark the OSDs out, or simply kill them and wait
>   (since I have a 1/2 hour delay on readjustment in that case, I did
>   "ceph osd out" manually)
> - watch the data rearrange again
> - [22:51 timestamp] when it ends, start a manual rescrub; at the end of
>   the process a non-zero number of placement groups are in the
>   active+clean+scrubbing state, and they will stay in that state forever
>   until something happens
>
> After that, I can restart the osds one by one if I want to get rid of
> the scrubbing states immediately and then do a deep-scrub (if I don't,
> those states will return at the next ceph self-scrub), or do a per-osd
> deep-scrub if I have a lot of time. The case I described in the previous
> message took place when I removed an osd from the data placement that
> existed at the moment the client(s) started, and indeed it is more
> harmful than the current one (frozen I/O leads to the entire guest
> hanging, for example). Since testing this flow took a lot of time, I'll
> send logs related to that case tomorrow.
>
>>> Second question is not directly relate
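For reference, the reproduction setup Sam asks for above comes down to
raising the OSD debug levels and then kicking off the scrub by hand. A rough
sketch (the osd id is an example, and the exact tell/injectargs spelling
varies a bit between ceph versions):

    # either put these under [osd] in ceph.conf and restart, or inject them
    # at runtime:
    #   debug osd = 20
    #   debug ms = 1
    #   debug filestore = 20
    ceph osd tell 1 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
    # then trigger the scrub that reproduces the hang:
    ceph osd scrub 1
    ceph osd deep-scrub 1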
Re: [PATCH V2] mds: fix CDir::_commit_partial() bug
Hi Sage,

This patch is buggy, please drop it from wip-mds.

On 11/23/2012 12:52 AM, Yan, Zheng wrote:
> From: "Yan, Zheng"
>
> When a null dentry is encountered, CDir::_commit_partial() adds
> a OSD_TMAP_RM command to delete the dentry. But if the dentry is
> new, the osd will not find the dentry when handling the command
> and the tmap update operation will fail totally.
>
> This patch also makes sure dentries are properly marked as new
> when preparing new dentries and exporting dentries.
>
> Signed-off-by: Yan, Zheng
> ---
SNIP
> diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> index ec0d5d5..228fede 100644
> --- a/src/mds/Server.cc
> +++ b/src/mds/Server.cc
> @@ -1685,6 +1685,9 @@ CDentry* Server::prepare_null_dentry(MDRequest *mdr, CDir *dir, const string& dn
>    }
>  }
>
> +  if (!dn->is_dirty())
> +    dn->mark_new();
> +

This should be something like 'else if (!dn->is_dirty())'. This issue has
been fixed on the osd side, so this patch is no longer needed.

Regards
Yan, Zheng

> return dn;
> }
>