Re: rbd map command hangs for 15 minutes during system start up

2012-12-01 Thread Nick Bartos
Unfortunately the hangs still happen with the new set of patches.  Here's
some debug info:

https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt


On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder  wrote:
> On 11/29/2012 02:37 PM, Alex Elder wrote:
>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>> Here are the ceph log messages (including the libceph kernel debug
>>> stuff you asked for) from a node boot with the rbd command hung for a
>>> couple of minutes:
>
> I'm sorry, but I did something stupid...
>
> Yes, the branch I gave you includes these fixes.  However
> it does *not* include the commit that was giving you trouble
> to begin with.
>
> So...
>
> I have updated that same branch (wip-nick) to contain:
> - Linux 3.5.5
> - Plus the first *50* (not 49) patches you listed
> - Plus the ones I added before.
>
> The new commit id for that branch begins with be3198d6.
>
> I'm really sorry for this mistake.  Please try this new
> branch and report back what you find.
>
> -Alex
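
A minimal way to pick up the updated branch and confirm it is the re-pushed one (repo URL, branch name, and commit prefix as given in this thread):

    git clone https://github.com/ceph/ceph-client.git
    cd ceph-client
    git checkout wip-nick
    git log --oneline -1    # the updated branch should start with be3198d6
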
>
>
>> Nick, I have put together a branch that includes two fixes
>> that might be helpful.  I don't expect these fixes will
>> necessarily *fix* what you're seeing, but one of them
>> pulls a big hunk of processing out of the picture and
>> might help eliminate some potential causes.  I had to
>> pull in several other patches as prerequisites in order
>> to get those fixes to apply cleanly.
>>
>> Would you be able to give it a try, and let us know what
>> results you get?  The branch contains:
>> - Linux 3.5.5
>> - Plus the first 49 patches you listed
>> - Plus four patches, which are prerequisites...
>> libceph: define ceph_extract_encoded_string()
>> rbd: define some new format constants
>> rbd: define rbd_dev_image_id()
>> rbd: kill create_snap sysfs entry
>> - ...for these two bug fixes:
>> libceph: remove 'osdtimeout' option
>> ceph: don't reference req after put
>>
>> The branch is available in the ceph-client git repository
>> under the name "wip-nick" and has commit id dd9323aa.
>> https://github.com/ceph/ceph-client/tree/wip-nick
>>
>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>
>> This full debug output is very helpful.  Please supply
>> that again as well.
>>
>> Thanks.
>>
>>   -Alex
>>
>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
 It's very easy to reproduce now with my automated install script; the
 most I've seen it succeed with that patch is 2 installs in a row before
 hanging on the 3rd, and it hangs on most builds.  So it shouldn't take
 much to get it to do it again.  I'll try and get to that tomorrow,
 when I'm a bit more rested and my brain is working better.

 Yes during this the OSDs are probably all syncing up.  All the osd and
 mon daemons have started by the time the rbd commands are run, though.

 On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense.  I've opened
>
> http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce?  If it is something you can trigger with
> debugging enabled ('echo module libceph +p >
> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
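
A minimal sketch of capturing that output, assuming debugfs is mounted at /sys/kernel/debug and the libceph/rbd modules are loaded:

    # enable the libceph (and, if useful, rbd) kernel debug messages
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control
    # ... reproduce the hang, then capture the kernel log ...
    dmesg > libceph-hang-debug.txt
    # and switch the debugging back off afterwards
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module rbd -p' > /sys/kernel/debug/dynamic_debug/control
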
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  
>> wrote:
>>> With 8 successful installs already done, I'm reasonably confident that
>>> it's patch #50.  I'm making another build which applies all patches
>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>> you know if that turns up any unexpected failures.
>>>
>>> What will the potential fall out be for removing that specific patch?
>>>
>>>
>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  
>>> wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's
 not a fluke, but since previously it hangs within the first couple of
 builds, it really looks like this is where the problem originated.

 1-libceph_eliminate_connection_state_DEAD.patch
 2-libceph_kill_bad_proto_ceph_connection_op.patch
 3-libceph_rename_so

Re: endless flying slow requests

2012-12-01 Thread Sage Weil
I pushed another fix that plugs a leak when requests race with peering 
that fixed the 'slow request' for osd_sub_op messages for me.  It's in 
wip-osd-leak.

Sam, when you get a chance, can you take a look and push it to next if it 
looks okay?

Thanks!
sage


On Fri, 30 Nov 2012, Samuel Just wrote:

> I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
> Let me know if the problem persists with this patch.
> -Sam
> 
> On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov  wrote:
> > On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just  wrote:
> >> Also, these clusters aren't mixed argonaut and next, are they?  (Not
> >> that that shouldn't work, but it would be a useful data point.)
> >> -Sam
> >>
> >> On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just  wrote:
> >>> Did you observe hung io along with that error?  Both sub_op_commit and
> >>> sub_op_applied have happened, so the sub_op_reply should have been
> >>> sent back to the primary.  This looks more like a leak.  If you also
> >>> observed hung io, then it's possible that the problem is occurring
> >>> between the sub_op_applied event and the response.
> >>> -Sam
> >>>
> >
> > It is relatively easy to check whether one of the client VMs has locked
> > one or more cores in iowait or simply hangs, so yes, these ops correspond
> > to real commit operations and they are hung.
> > I`m using an all-new 0.54 cluster, with no mixing of course. Has everyone
> > who hit this bug readjusted the cluster before the bug showed itself
> > (say, within about a day)?
> >
> >>> On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov  wrote:
>  On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil  wrote:
> > Hi Stefan,
> >
> > On Thu, 15 Nov 2012, Sage Weil wrote:
> >> On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
> >> > > On 14.11.2012 15:59, Sage Weil wrote:
> >> > > Hi Stefan,
> >> > >
> >> > > It would be nice to confirm that no clients are waiting on replies for
> >> > > these requests; currently we suspect that the OSD request tracking is
> >> > > the buggy part.  If you query the OSD admin socket you should be able
> >> > > to dump requests and see the client IP, and then query the client.
> >> > >
> >> > > Is it librbd?  In that case you likely need to change the config so
> >> > > that it is listening on an admin socket ('admin socket = path').
> >> >
> >> > Yes it is. So i have to specify admin socket at the KVM host?
> >>
> >> Right.  IIRC the disk line is a ; (or \;) separated list of key/value
> >> pairs.
> >>
> >> > How do i query the admin socket for requests?
> >>
> >> ceph --admin-daemon /path/to/socket help
> >> ceph --admin-daemon /path/to/socket objecter_dump (i think)
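
A minimal sketch of what that looks like on the KVM host (the socket path and client name below are just examples):

    # in ceph.conf on the KVM host, so librbd-backed guests expose an admin socket:
    #   [client]
    #       admin socket = /var/run/ceph/$name.$pid.asok
    # then, after restarting the guest's qemu process, query the socket:
    ceph --admin-daemon /var/run/ceph/client.admin.12345.asok help
    # and run whichever request-dump command 'help' lists
    # (objecter_requests in recent releases)
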
> >
> > Were you able to reproduce this?
> >
> > Thanks!
> > sage
> 
>  Meanwhile, I did. :)
>  Such requests will always be created if you have restarted an osd, or
>  marked one out and then back in, and a scrub didn`t happen in the meantime
>  (after that operation and before the request arrived).
>  What is more interesting, the hangup happens not at the time of the
>  operation, but tens of minutes later.
> 
>  { "description": "osd_sub_op(client.1292013.0:45422 4.731
>  a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
>  16444'21693 snapset=0=[]:[] snapc=0=[])",
>    "received_at": "2012-11-28 03:54:43.094151",
>    "age": "27812.942680",
>    "duration": "2.676641",
>    "flag_point": "started",
>    "events": [
>  { "time": "2012-11-28 03:54:43.094222",
>    "event": "waiting_for_osdmap"},
>  { "time": "2012-11-28 03:54:43.386890",
>    "event": "reached_pg"},
>  { "time": "2012-11-28 03:54:43.386894",
>    "event": "started"},
>  { "time": "2012-11-28 03:54:43.386973",
>    "event": "commit_queued_for_journal_write"},
>  { "time": "2012-11-28 03:54:45.360049",
>    "event": "write_thread_in_journal_buffer"},
>  { "time": "2012-11-28 03:54:45.586183",
>    "event": "journaled_completion_queued"},
>  { "time": "2012-11-28 03:54:45.586262",
>    "event": "sub_op_commit"},
>  { "time": "2012-11-28 03:54:45.770792",
>    "event": "sub_op_applied"}]}]}
> 
> 
> >
> >
> >>
> >> sage
> >>
> >> >
> >> > Stefan
> >> >
> >> >
> >> > > On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
> >> > >
> >> > > > Hello list,
> >> > > >
> >> > > > i see this several times. Endless flying slow requests. And they 
> >> > > > never
> >

Re: rbd STDIN import does not work / wip-rbd-export-stdout

2012-12-01 Thread Dan Mick

On 12/1/2012 12:23 PM, Stefan Priebe wrote:

Hi Dan,

On 27.11.2012 10:16, Stefan Priebe - Profihost AG wrote:

On 26.11.2012 21:00, Dan Mick wrote:

It writes zeros; there's no way for it to know how many zeros are
coming.  It could make a half-hearted attempt depending on its buffer
size and the amount of data the source is willing to buffer.


Yes, but why not at least skip when the WHOLE BUFFER is zero, and use a
buffer size equal to the stripe size?


Right now a 40GB image which has just 3GB used (the rest is zeros, thanks
to running fstrim before the export) takes a long time to import. We could
speed this up a lot if we skipped the parts that are only zeros.


Great minds!  I'm actually working on that now; as you say it's not hard.


Greets,
Stefan
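
For what it's worth, the zero-skip idea can be sketched at the shell level (purely an illustration of the principle, not what rbd import actually does; the 4MB chunk size and file name are assumptions):

    CHUNK=$((4 * 1024 * 1024))    # assume a 4MB object/stripe size
    SIZE=$(stat -c %s disk.img)
    for ((off = 0; off < SIZE; off += CHUNK)); do
        # compare one chunk against zeros; only non-zero chunks need writing
        # (a short final chunk is always treated as data here)
        if ! cmp -s -n "$CHUNK" \
                 <(dd if=disk.img bs="$CHUNK" skip=$((off / CHUNK)) count=1 2>/dev/null) \
                 /dev/zero; then
            echo "chunk at offset $off contains data"
        fi
    done
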




Re: rbd STDIN import does not work / wip-rbd-export-stdout

2012-12-01 Thread Dan Mick

On 11/28/2012 6:53 AM, Stefan Priebe - Profihost AG wrote:

Hi Dan,

thanks for your update / new import-export branch.

The import seems to work now. What I'm wondering about is why it is so
slow.


I see around 10% CPU usage on the importing host (the host where rbd
import is started) and the data on ceph growing by just 4MB every 2s. At
that rate a 40GB import takes around 5-6 hours.


Might this be due to the size update? 4MB every 2s looks like the stripe
size.


I did change the resize to double each time to help with that before
pushing to master, but that still seems awfully slow.  You should
definitely use the version in the mainline, though; it's better in
several ways.

Are these particularly sparse images?  I'm testing a "preserve
sparseness" tweak that should help a lot if so (and save space in the
cluster, obviously).
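
As a usage note, the sort of plumbing this thread is about looks roughly like this once both the export-to-stdout and import-from-stdin pieces are in place (pool and image names are made up):

    # stream an export straight into an import on another host/cluster,
    # with no intermediate file
    rbd export rbd/source-image - | ssh backup-host rbd import - rbd/dest-image
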



Re: [PATCH V2] mds: fix CDir::_commit_partial() bug

2012-12-01 Thread Sage Weil
On Sat, 1 Dec 2012, Yan, Zheng wrote:
> Hi Sage,
> 
> This patch is buggy, please drop it from wip-mds.

Done.
 
> On 11/23/2012 12:52 AM, Yan, Zheng wrote:
> > From: "Yan, Zheng" 
> > 
> > When a null dentry is encountered, CDir::_commit_partial() adds
> > a OSD_TMAP_RM command to delete the dentry. But if the dentry is
> > new, the osd will not find the dentry when handling the command
> > and the tmap update operation will fail totally.
> > 
> > This patch also makes sure dentries are properly marked as new
> > when preparing new dentries and exporting dentries.
> > 
> > Signed-off-by: Yan, Zheng 
> > ---
> SNIP
> > diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> > index ec0d5d5..228fede 100644
> > --- a/src/mds/Server.cc
> > +++ b/src/mds/Server.cc
> > @@ -1685,6 +1685,9 @@ CDentry* Server::prepare_null_dentry(MDRequest *mdr, 
> > CDir *dir, const string& dn
> >}
> >  }
> >  
> > +if (!dn->is_dirty())
> > +  dn->mark_new();
> > +
> This should be something like 'else if (!dn->is_dirty())'. This issue
> has been fixed on the osd side, this patch is no longer needed.

Yeah.  It would still be nice to avoid including those ops in the osd 
request, as an optimization, but that can happen later.

Repushed wip-mds, and scheduled the fs suite to run against it.  That 
only covers single mds configurations still, but as long as there aren't 
regressions there I'm happy relying on your testing for now.

Real Soon Now we need to flesh out that suite to include clustered mds 
tests, export thrashing, and build a mds_thrasher teuthology task that 
restarts, adds, and stops mds's to exercise the failure paths.  Eventually 
we'll want to hook that into the failure injection sites in Migration.cc 
as well so we deliberately exercise each failure point...

sage


wip-msgr-delay-queue

2012-12-01 Thread Sage Weil
Is ready to merge, after a final look.  There is one additional patch that 
handles the connection fault and reconnect cases that hasn't been reviewed 
yet.  Passed the rados suite, although when we add it to the nightly suite
we'll want lower probabilities because most of the runs took hours to 
complete with the artificial delays.

Thanks!
sage


Re: rbd STDIN import does not work / wip-rbd-export-stdout

2012-12-01 Thread Stefan Priebe

Hi Dan,

On 27.11.2012 10:16, Stefan Priebe - Profihost AG wrote:

On 26.11.2012 21:00, Dan Mick wrote:

It writes zeros; there's no way for it to know how many zeros are
coming.  It could make a half-hearted attempt depending on its buffer
size and the amount of data the source is willing to buffer.


Yes, but why not at least skip when the WHOLE BUFFER is zero, and use a
buffer size equal to the stripe size?


Right now a 40GB image which has just 3GB used (the rest is zeros, thanks
to running fstrim before the export) takes a long time to import. We could
speed this up a lot if we skipped the parts that are only zeros.


Greets,
Stefan


slow backfilling / remapping

2012-12-01 Thread Stefan Priebe

Hello list,

today I was trying to see what happens if I remove an OSD.

I've executed:
ceph osd out 21

Cluster Details:
- XFS FS for OSDs
- Software version: latest next from today 12/01/2012
- 6 Nodes with 4 OSDs only SSDs each

I have just 8 GB of data, so I thought removing an OSD should be VERY fast.

Here are parts of the logs; the whole process took 15 minutes.

2012-12-01 20:06:02.607191 mon.0 [INF] pgmap v59693: 7632 pgs: 7348 
active+clean, 266 active+remapped+wait_backfill, 12 
active+remapped+backfilling, 6 active+recovering; 7880 MB data, 18812 MB 
used, 4428 GB / 4446 GB avail; 180/4303 degraded (4.183%)


2012-12-01 20:16:11.284705 mon.0 [INF] pgmap v6: 7632 pgs: 7562 
active+clean, 65 active+remapped+wait_backfill, 5 
active+remapped+backfilling; 7880 MB data, 18981 MB used, 4428 GB / 4446 
GB avail; 44/4178 degraded (1.053%)


2012-12-01 20:22:35.829481 mon.0 [INF] pgmap v60182: 7632 pgs: 7628 
active+clean, 4 active+remapped+backfilling; 7880 MB data, 19029 MB 
used, 4428 GB / 4446 GB avail; 3/4137 degraded (0.073%)
2012-12-01 20:22:37.052136 mon.0 [INF] pgmap v60183: 7632 pgs: 7632 
active+clean; 7880 MB data, 19033 MB used, 4428 GB / 4446 GB avail
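
A rough sketch of driving and watching such a drain from the CLI (osd 21 as above; ceph -w and ceph pg dump are the standard commands):

    ceph osd out 21                    # mark the osd out, as above
    ceph -w                            # stream pgmap updates like the ones quoted
    ceph pg dump | grep -c backfill    # count pgs still waiting for or doing backfill
    ceph osd in 21                     # undo the experiment afterwards
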


Greets,
Stefan


Re: Hangup during scrubbing - possible solutions

2012-12-01 Thread Andrey Korolyov
On Sat, Dec 1, 2012 at 9:07 AM, Samuel Just  wrote:
> Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
> Let me know if it persists.  Thanks for the logs!
> -Sam
>

Very nice, thanks!

There is one corner case - the ``on-the-fly'' upgrade works well only if
your patch is applied to the ``generic'' 0.54 by cherry-picking; an online
upgrade from tagged 0.54 to next-dccf6ee causes the osd processes on the
upgraded nodes to crash shortly after restart, with the backtrace you can
see below. An offline upgrade, i.e. shutting down the entire cluster
first, works fine, so the only problem is preserving the running state of
the cluster across the upgrade, which may confuse some users (at least
the ones who run production workloads).

http://xdel.ru/downloads/ceph-log/bt-recovery-sj-patch.out.gz



> On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just  wrote:
>> Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
>> have a patch shortly.
>>
>> Thanks!
>> -Sam
>>
>> On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov  wrote:
>>> http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
>>> http://xdel.ru/downloads/ceph-log/cluster-w.log.gz
>>>
>>> Here, please.
>>>
>>> I initiated a deep-scrub of osd.1, which led to forever-stuck I/O
>>> requests within a short time (a plain scrub will do the same). The second
>>> log may be useful for proper timestamps, as seeks in the original may
>>> take a long time. The osd processes on that node were restarted twice -
>>> at the beginning, to be sure all config options were applied, and at the
>>> end, to do the same plus get rid of the stuck requests.
>>>
>>>
>>> On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just  wrote:
 If you can reproduce it again, what we really need are the osd logs
 from the acting set of a pg stuck in scrub with
 debug osd = 20
 debug ms = 1
 debug filestore = 20.

 Thanks,
 -Sam
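
One way to get that logging on the OSDs in the acting set, either via ceph.conf or injected at runtime without a restart (osd.1 is just an example id, and injectargs is assumed to be available on your version):

    # in ceph.conf on the relevant osd hosts:
    #   [osd]
    #       debug osd = 20
    #       debug ms = 1
    #       debug filestore = 20
    # or injected on a running osd:
    ceph tell osd.1 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'
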

 On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov  wrote:
> On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil  wrote:
>> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>>> Hi,
>>>
> >>> In recent versions, Ceph introduces some unexpected behavior for
> >>> permanent connections (VM or kernel clients) - after crash recovery,
> >>> I/O will hang at the next planned scrub in the following scenario:
> >>>
> >>> - launch a bunch of clients doing non-intensive writes,
> >>> - lose one or more osds, mark them down, wait for recovery to complete,
> >>> - do a slow scrub, e.g. scrubbing one osd per 5m inside a bash script,
> >>> or wait for ceph to do the same,
> >>> - observe a rising number of pgs stuck in the active+clean+scrubbing
> >>> state (they took over the primary role from the ones on the killed osd
> >>> and were almost surely being written to at the time of the crash),
> >>> - some time later, clients hang hard and the ceph log reports stuck
> >>> (old) I/O requests.
> >>>
> >>> The only way to bring the clients back without losing their I/O state
> >>> is a per-osd restart, which also helps get rid of the
> >>> active+clean+scrubbing pgs.
>>>
>>> First of all, I`ll be happy to help to solve this problem by providing
>>> logs.
>>
>> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
>> 1' logging on the OSD, that would be wonderful!
>>
>
> I have tested a slightly different recovery flow, please see below.
> In this case there is no real harm like frozen I/O, but placement groups
> still got stuck forever in the active+clean+scrubbing state, until I
> restarted all osds (end of the log):
>
> http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz
>
> - start the healthy cluster
> - start persistent clients
> - add another host with a pair of OSDs and let them join the data placement
> - wait for the data to rearrange
> - [22:06 timestamp] mark the OSDs out or simply kill them and wait (since I
> have a 1/2 hour delay before readjustment in that case, I did ``ceph osd
> out'' manually)
> - watch the data rearrange again
> - [22:51 timestamp] when it ends, start a manual rescrub; at the end of the
> process a non-zero number of placement groups are left in the
> active+clean+scrubbing state and stay there forever until something happens
>
> After that, I can restart the osds one by one if I want to get rid of the
> scrubbing states immediately and then deep-scrub (if I don`t, those states
> will return at the next ceph self-scrub), or do a per-osd deep-scrub if I
> have a lot of time. The case I described in the previous message occurred
> when I removed an osd that had been in the data placement since the moment
> the client(s) started, and it is indeed more harmful than the current one
> (the frozen I/O hangs the entire guest, for example). Since testing this
> flow took a lot of time, I`ll send the logs related to that case tomorrow.
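
The per-osd scrub/deep-scrub steps mentioned above map onto the standard CLI roughly as follows (osd.1 as an example):

    ceph osd scrub 1                 # ask osd.1 to scrub its placement groups
    ceph osd deep-scrub 1            # same, but a full deep scrub
    ceph pg dump | grep scrubbing    # watch for pgs stuck in active+clean+scrubbing
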
>
>>> Second question is not directly relate

Re: [PATCH V2] mds: fix CDir::_commit_partial() bug

2012-12-01 Thread Yan, Zheng
Hi Sage,

This patch is buggy, please drop it from wip-mds.

On 11/23/2012 12:52 AM, Yan, Zheng wrote:
> From: "Yan, Zheng" 
> 
> When a null dentry is encountered, CDir::_commit_partial() adds
> a OSD_TMAP_RM command to delete the dentry. But if the dentry is
> new, the osd will not find the dentry when handling the command
> and the tmap update operation will fail totally.
> 
> This patch also makes sure dentries are properly marked as new
> when preparing new dentries and exporting dentries.
> 
> Signed-off-by: Yan, Zheng 
> ---
SNIP
> diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> index ec0d5d5..228fede 100644
> --- a/src/mds/Server.cc
> +++ b/src/mds/Server.cc
> @@ -1685,6 +1685,9 @@ CDentry* Server::prepare_null_dentry(MDRequest *mdr, 
> CDir *dir, const string& dn
>}
>  }
>  
> +if (!dn->is_dirty())
> +  dn->mark_new();
> +
This should be something like 'else if (!dn->is_dirty())'. This issue
has been fixed on the osd side, this patch is no longer needed.

Regards
Yan, Zheng

>  return dn;
>}
>  
> 
