A bug in rolling back to a degraded object

2015-06-01 Thread Wang, Zhiqiang
Hi Sage and all,

Another bug discovered during the proxy write teuthology testing occurs when 
rolling back to a degraded object. This doesn't seem specific to proxy write. I 
see a scenario like the one below in the log file:

- A rollback op comes in, and is enqueued.
- Several other ops on the same object come in, and are enqueued.
- The rollback op dispatches, and finds that the object it rolls back to is 
degraded, so this op is pushed back onto a list to wait for the degraded 
object to recover.
- The later ops are handled and responses are sent back to the client.
- The degraded object recovers. The rollback op is enqueued again and a 
response is finally sent to the client.

This breaks the op order. A fix for this is to maintain a map to track the 
(source, destination) pair. When an op on the source dispatches, if such a 
pair exists, queue the op in the destination's degraded waiting list. A 
drawback of this approach is that some entries in the destination object's 
'waiting_for_degraded_object' list may not actually be accessing the 
destination, but the source. Does this make sense?
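
To make the idea concrete, here is a rough sketch of the bookkeeping I have in
mind (a toy model with made-up names, not the actual ReplicatedPG members):

#include <list>
#include <map>
#include <string>

struct Op { std::string desc; };

// source object -> degraded destination (clone) it rolls back to
std::map<std::string, std::string> rollback_waiting_dst;
// per-object lists of ops parked until that object recovers
std::map<std::string, std::list<Op> > waiting_for_degraded_object;

void dispatch(const std::string &soid, const Op &op)
{
  std::map<std::string, std::string>::iterator p = rollback_waiting_dst.find(soid);
  if (p != rollback_waiting_dst.end()) {
    // an earlier rollback on soid is still waiting for p->second to recover;
    // park this op on the destination's list so it stays behind the rollback
    waiting_for_degraded_object[p->second].push_back(op);
    return;
  }
  // ... normal dispatch path ...
}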


Re: Nightly Kernel branch not found errors

2015-06-01 Thread Ilya Dryomov
On Mon, Jun 1, 2015 at 1:19 AM, Sage Weil s...@newdream.net wrote:
 On Sun, 31 May 2015, Gregory Farnum wrote:
 On Sun, May 31, 2015 at 1:40 PM, Ilya Dryomov idryo...@gmail.com wrote:
  On Sun, May 31, 2015 at 10:54 PM, Gregory Farnum g...@gregs42.com wrote:
  We are getting this error in what looks like everything that specifies
  the testing kernel. (That turns out to be almost all of the FS tests
  and a surprising number of the non-rados runs; e.g. rgw.) I've checked
  that the testing branch of ceph-client.git still exists and when
  looking at the teuthology git logs the recent ping-pong of commits on
  kernel flavors et al stand out. Any ideas? :)
 
  Which lab is this in?  Latest rgw and fs in sepia look fine.
 
  Kernel task was broken the entire last week but in a different way -
  when scheduled with teuthology-suite it wouldn't install anything even
  if you told it to install e.g. testing.  I fixed that on Friday.
 
  Looks like this is coming from create_initial_config().  Could be an
  environment issue, like down gitbuilders or a problem with requests
  module?  Kernel branch is checked before the others so it may not have
  anything to do with it at all.

 It's happening across labs for tests that were supposed to be
 scheduled starting on the 29 (at least, that I've noticed). I think
 this issue is before they get into pulpito, which is why the latest fs
 suite run there was scheduled on May 27. :(
 Looking at the gitbuilders I do see that the CentOS6 testing branch is
 red and rhel7 appears to be down...maybe we're checking on more of
 them now and then failing when those don't appear? :/

 We're half-way through creating the centos7 kernel builder to replace the
 rhel ones so I expect things are broken on the rpm side.  Not sure if
 that is the root cause here, but we can probably wait for that to get
 fixed first before looking further

I think it's definitely related to this.  Commit f2ce5e1ed3d4 (Treat
RHEL as CentOS when scheduling) by Zack makes it so the centos7
gitbuilders are poked when the distro isn't specified explicitly, which
is how I assume all those runs are scheduled.  The kernel centos6
gitbuilder is down, and AFAICT the centos7 one doesn't exist.

Thanks,

Ilya


Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Lionel Bouton
On 06/01/15 09:43, Jan Schermer wrote:
 We had to disable deep scrub or the cluster would be unusable - we need to 
 turn it back on sooner or later, though.
 With minimal scrubbing and recovery settings, everything is mostly good. 
 Turned out many issues we had were due to too few PGs - once we increased 
 them from 4K to 16K everything sped up nicely (because the chunks are 
 smaller), but during heavy activity we are still getting some “slow IOs”.
 I believe there is an ionice knob in newer versions (we still run Dumpling), 
 and that should do the trick no matter how much additional “load” is put on 
 the OSDs.
 Everybody’s bottleneck will be different - we run all flash so disk IO is not 
 a problem but an OSD daemon is - no ionice setting will help with that, it 
 just needs to be faster ;-)

If you are interested, I'm currently testing a ruby script which
schedules the deep scrubs one at a time, trying to simultaneously make
them fit in a given time window, avoid successive scrubs on the same OSD,
and space the deep scrubs according to the amount of data scrubbed.  I
use it because Ceph by itself can't prevent multiple scrubs from happening
simultaneously on the network, and they can severely impact our VM performance.
I can clean it up and post it on GitHub.
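
The logic is roughly the following - a minimal C++ sketch of the same idea
rather than the actual ruby script (PG discovery, e.g. from 'ceph pg dump',
is left out; the only command it drives is 'ceph pg deep-scrub'):

#include <cstdlib>
#include <string>
#include <unistd.h>
#include <utility>
#include <vector>

struct PgInfo { std::string pgid; std::string primary_osd; long bytes; };

// One deep scrub at a time, spread over 'window' seconds in proportion to
// the data scrubbed, and no two consecutive scrubs on the same primary OSD.
void schedule(std::vector<PgInfo> pgs, double window, long total_bytes)
{
  std::string last_osd;
  for (size_t i = 0; i < pgs.size(); ++i) {
    if (pgs[i].primary_osd == last_osd && i + 1 < pgs.size())
      std::swap(pgs[i], pgs[i + 1]);        // avoid back-to-back scrubs on one OSD
    std::string cmd = "ceph pg deep-scrub " + pgs[i].pgid;
    std::system(cmd.c_str());
    last_osd = pgs[i].primary_osd;
    sleep((unsigned)(window * pgs[i].bytes / total_bytes));  // space scrubs out
  }
}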

Best regards,

Lionel


Re: Nightly Kernel branch not found errors

2015-06-01 Thread Ilya Dryomov
On Mon, Jun 1, 2015 at 1:28 AM, Gregory Farnum g...@gregs42.com wrote:
 On Sun, May 31, 2015 at 3:19 PM, Sage Weil s...@newdream.net wrote:
 On Sun, 31 May 2015, Gregory Farnum wrote:
 On Sun, May 31, 2015 at 1:40 PM, Ilya Dryomov idryo...@gmail.com wrote:
  On Sun, May 31, 2015 at 10:54 PM, Gregory Farnum g...@gregs42.com wrote:
  We are getting this error in what looks like everything that specifies
  the testing kernel. (That turns out to be almost all of the FS tests
  and a surprising number of the non-rados runs; e.g. rgw.) I've checked
  that the testing branch of ceph-client.git still exists and when
  looking at the teuthology git logs the recent ping-pong of commits on
  kernel flavors et al stand out. Any ideas? :)
 
  Which lab is this in?  Latest rgw and fs in sepia look fine.
 
  Kernel task was broken the entire last week but in a different way -
  when scheduled with teuthology-suite it wouldn't install anything even
  if you told it to install e.g. testing.  I fixed that on Friday.
 
  Looks like this is coming from create_initial_config().  Could be an
  environment issue, like down gitbuilders or a problem with requests
  module?  Kernel branch is checked before the others so it may not have
  anything to do with it at all.

 It's happening across labs for tests that were supposed to be
 scheduled starting on the 29 (at least, that I've noticed). I think
 this issue is before they get into pulpito, which is why the latest fs
 suite run there was scheduled on May 27. :(
 Looking at the gitbuilders I do see that the CentOS6 testing branch is
 red and rhel7 appears to be down...maybe we're checking on more of
 them now and then failing when those don't appear? :/

 We're half-way through creating the centos7 kernel builder to replace the
 rhel ones so I expect things are broken on the rpm side.  Not sure if
 that is the root cause here, but we can probably wait for that to get
 fixed first before looking further

 *None* of our FS tests are running while this problem persists (and
 they're not alone). That's not the sort of thing we can wait on...
 Maybe we have some gratuitous use of the testing kernel we can
 remove (I'm not sure), but that's the sort of thing that needs to be
 discussed across teams so we can deal with it proactively instead of
 just finding out when the nightlies start failing.

The kernel isn't to blame here, so unless there is a specific reason
let's not remove use of the testing kernel.

What I think we need to do is make sure that the teuthology test suite
includes a teuthology-suite --dry-run ... with --kernel testing at
the very least.  It would also be good to verify that the testing kernel is
actually going to be installed by the kernel task, to avoid last week's
breakage, but that's a matter of adding unit tests and is a different
issue.
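
For example, something along these lines (the exact invocation is from memory
and may need more arguments, e.g. the ceph branch and machine type):

    teuthology-suite --dry-run --suite fs --kernel testing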

Unless I'm missing something, the only problem here is really that the
commit I mentioned in another mail got pushed before the gitbuilder was
set up.

Thanks,

Ilya


Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Jan Schermer
Slow requests are not exactly tied to the PG number, but we were getting slow 
requests whenever backfills or recoveries fired up - increasing the number of 
PGs helped with this, as the “blocks” of work are much smaller than before.

We have roughly the same number of OSDs as you but only one really important 
pool (“volumes”); we ended up with 16384 PGs for this one.
The number of threads increased exponentially, some latencies went down, some 
went up; in the end it works just as well as before, with the added benefit of 
better data distribution and a better behaving cluster.
But YMMV - once you go up you can’t go down.
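
For reference, the bump itself is just the pool's pg_num followed by pgp_num
(and there is no corresponding way to decrease them afterwards, hence the
one-way warning):

    ceph osd pool set volumes pg_num 16384
    ceph osd pool set volumes pgp_num 16384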

Jan


 On 01 Jun 2015, at 10:57, huang jun hjwsm1...@gmail.com wrote:
 
 hi,jan
 
 2015-06-01 15:43 GMT+08:00 Jan Schermer j...@schermer.cz:
 We had to disable deep scrub or the cluster would be unusable - we need to 
 turn it back on sooner or later, though.
 With minimal scrubbing and recovery settings, everything is mostly good. 
 Turned out many issues we had were due to too few PGs - once we increased 
 them from 4K to 16K everything sped up nicely (because the chunks are 
 smaller), but during heavy activity we are still getting some “slow IOs”.
 
 How many PGs did you set?  We get slow requests many times, but
 didn't relate them to the PG number.
 And we follow the equation below for every pool:
 
 Total PGs = (OSDs * 100) / pool size
 
 Our cluster has 157 OSDs and 3 pools, and we set pg_num to 8192 for every pool,
 but OSD CPU utilization goes up to 300% after a restart; we think it's
 loading PGs during that period. We will try a different PG number when we
 get slow requests.
 
 thanks!
 
 I believe there is an ionice knob in newer versions (we still run Dumpling), 
 and that should do the trick no matter how much additional “load” is put on 
 the OSDs.
 Everybody’s bottleneck will be different - we run all flash so disk IO is 
 not a problem but an OSD daemon is - no ionice setting will help with that, 
 it just needs to be faster ;-)
 
 Jan
 
 
 On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:
 
 On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
 Many people have reported that they need to lower the osd recovery config 
 options to minimize the impact of recovery on client io.  We are talking 
 about changing the defaults as follows:
 
 osd_max_backfills to 1 (from 10)
 osd_recovery_max_active to 3 (from 15)
 osd_recovery_op_priority to 1 (from 10)
 osd_recovery_max_single_start to 1 (from 5)
 
 I'm under the (possibly erroneous) impression that reducing the number
 of max backfills doesn't actually reduce recovery speed much (but will
 reduce memory use), but that dropping the op priority can. I'd rather
 we make users manually adjust values which can have a material impact
 on their data safety, even if most of them choose to do so.
 
 After all, even under our worst behavior we're still doing a lot
 better than a resilvering RAID array. ;)
 -Greg
 
 
 
 -- 
 thanks
 huangjun



Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Jan Schermer
We had to disable deep scrub or the cluster would be unusable - we need to turn 
it back on sooner or later, though.
With minimal scrubbing and recovery settings, everything is mostly good. Turned 
out many issues we had were due to too few PGs - once we increased them from 4K 
to 16K everything sped up nicely (because the chunks are smaller), but during 
heavy activity we are still getting some “slow IOs”.
I believe there is an ionice knob in newer versions (we still run Dumpling), 
and that should do the trick no matter how much additional “load” is put on the 
OSDs.
Everybody’s bottleneck will be different - we run all flash so disk IO is not a 
problem but an OSD daemon is - no ionice setting will help with that, it just 
needs to be faster ;-)

Jan


 On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:
 
 On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
 Many people have reported that they need to lower the osd recovery config 
 options to minimize the impact of recovery on client io.  We are talking 
 about changing the defaults as follows:
 
 osd_max_backfills to 1 (from 10)
 osd_recovery_max_active to 3 (from 15)
 osd_recovery_op_priority to 1 (from 10)
 osd_recovery_max_single_start to 1 (from 5)
 
 I'm under the (possibly erroneous) impression that reducing the number
 of max backfills doesn't actually reduce recovery speed much (but will
 reduce memory use), but that dropping the op priority can. I'd rather
 we make users manually adjust values which can have a material impact
 on their data safety, even if most of them choose to do so.
 
 After all, even under our worst behavior we're still doing a lot
 better than a resilvering RAID array. ;)
 -Greg



Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread huang jun
Hi Jan,

2015-06-01 15:43 GMT+08:00 Jan Schermer j...@schermer.cz:
 We had to disable deep scrub or the cluster would be unusable - we need to 
 turn it back on sooner or later, though.
 With minimal scrubbing and recovery settings, everything is mostly good. 
 Turned out many issues we had were due to too few PGs - once we increased 
 them from 4K to 16K everything sped up nicely (because the chunks are 
 smaller), but during heavy activity we are still getting some “slow IOs”.

How many PGs did you set?  We get slow requests many times, but
didn't relate them to the PG number.
And we follow the equation below for every pool:

Total PGs = (OSDs * 100) / pool size

Our cluster has 157 OSDs and 3 pools, and we set pg_num to 8192 for every pool,
but OSD CPU utilization goes up to 300% after a restart; we think it's
loading PGs during that period. We will try a different PG number when we
get slow requests.
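
Plugging our numbers into that as a quick sanity check (assuming a pool size,
i.e. replica count, of 3):

    Total PGs = (157 * 100) / 3 ~= 5233, rounded up to the next power of two = 8192

which matches the 8192 we picked per pool.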

thanks!

 I believe there is an ionice knob in newer versions (we still run Dumpling), 
 and that should do the trick no matter how much additional “load” is put on 
 the OSDs.
 Everybody’s bottleneck will be different - we run all flash so disk IO is not 
 a problem but an OSD daemon is - no ionice setting will help with that, it 
 just needs to be faster ;-)

 Jan


 On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:

 On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
 Many people have reported that they need to lower the osd recovery config 
 options to minimize the impact of recovery on client io.  We are talking 
 about changing the defaults as follows:

 osd_max_backfills to 1 (from 10)
 osd_recovery_max_active to 3 (from 15)
 osd_recovery_op_priority to 1 (from 10)
 osd_recovery_max_single_start to 1 (from 5)

 I'm under the (possibly erroneous) impression that reducing the number
 of max backfills doesn't actually reduce recovery speed much (but will
 reduce memory use), but that dropping the op priority can. I'd rather
 we make users manually adjust values which can have a material impact
 on their data safety, even if most of them choose to do so.

 After all, even under our worst behavior we're still doing a lot
 better than a resilvering RAID array. ;)
 -Greg



-- 
thanks
huangjun


Re: [PATCH] libceph: use kvfree() in ceph_put_page_vector()

2015-06-01 Thread Ilya Dryomov
On Mon, Jun 1, 2015 at 5:36 PM, Geliang Tang geliangt...@163.com wrote:
 Use kvfree() instead of open-coding it.

 Signed-off-by: Geliang Tang geliangt...@163.com
 ---
  net/ceph/pagevec.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

 diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
 index 096d914..d4f5f22 100644
 --- a/net/ceph/pagevec.c
 +++ b/net/ceph/pagevec.c
 @@ -51,10 +51,7 @@ void ceph_put_page_vector(struct page **pages, int 
 num_pages, bool dirty)
 set_page_dirty_lock(pages[i]);
 put_page(pages[i]);
 }
 -   if (is_vmalloc_addr(pages))
 -   vfree(pages);
 -   else
 -   kfree(pages);
 +   kvfree(pages);
  }
  EXPORT_SYMBOL(ceph_put_page_vector);

Already fixed in testing, wasn't pushed to linux-next though, sorry!

Thanks,

Ilya


Re: Adding chance_test_backfill_full thrasher in the ec tasks

2015-06-01 Thread Loic Dachary
I filed http://tracker.ceph.com/issues/11831 and will work on it

On 01/06/2015 17:18, Samuel Just wrote:
 Yep, that should be included in the ec thrashing tests.
 -Sam
 
 - Original Message -
 From: shylesh kumar shylesh.mo...@gmail.com
 To: Samuel Just sj...@redhat.com, l...@dachary.org
 Cc: ceph-devel@vger.kernel.org
 Sent: Saturday, May 30, 2015 8:27:20 AM
 Subject: Adding chance_test_backfill_full thrasher in the ec tasks
 
 Hi,
 
 As per discussion with Loic, none of the EC tests have the
 chance_test_backfill_full thrasher in their tasks. Do you think it
 would be a good idea to add it, so we can simulate disk space
 scenarios?
 
 
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: Adding chance_test_backfill_full thrasher in the ec tasks

2015-06-01 Thread Samuel Just
Yep, that should be included in the ec thrashing tests.
-Sam

- Original Message -
From: shylesh kumar shylesh.mo...@gmail.com
To: Samuel Just sj...@redhat.com, l...@dachary.org
Cc: ceph-devel@vger.kernel.org
Sent: Saturday, May 30, 2015 8:27:20 AM
Subject: Adding chance_test_backfill_full thrasher in the ec tasks

Hi,

As per discussion with Loic, none of the EC tests have the
chance_test_backfill_full thrasher in their tasks. Do you think it
would be a good idea to add it, so we can simulate disk space
scenarios?
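
For reference, I was thinking of a task fragment along these lines (untested;
the probability value is only for illustration):

    tasks:
    - thrashosds:
        chance_test_backfill_full: 0.5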



-- 
Thanks
Shylesh


[PATCH] libceph: use kvfree() in ceph_put_page_vector()

2015-06-01 Thread Geliang Tang
Use kvfree() instead of open-coding it.

Signed-off-by: Geliang Tang geliangt...@163.com
---
 net/ceph/pagevec.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 096d914..d4f5f22 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -51,10 +51,7 @@ void ceph_put_page_vector(struct page **pages, int 
num_pages, bool dirty)
set_page_dirty_lock(pages[i]);
put_page(pages[i]);
}
-   if (is_vmalloc_addr(pages))
-   vfree(pages);
-   else
-   kfree(pages);
+   kvfree(pages);
 }
 EXPORT_SYMBOL(ceph_put_page_vector);
 
-- 
2.3.4




ceph branch status

2015-06-01 Thread ceph branch robot
-- All Branches --

Adam Crume adamcr...@gmail.com
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza ad...@redhat.com
2015-03-23 16:39:48 -0400   wip-11212
2015-03-25 10:10:43 -0400   wip-11065

Alfredo Deza alfredo.d...@inktank.com
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Boris Ranto bra...@redhat.com
2015-04-13 13:51:32 +0200   wip-fix-ceph-dencoder-build
2015-04-14 13:51:49 +0200   wip-fix-ceph-dencoder-build-master
2015-05-15 15:26:05 +0200   wip-selinux-policy

Chi Xinze xmdx...@gmail.com
2015-05-15 21:47:44 +   XinzeChi-wip-ec-read

Dan Mick dan.m...@inktank.com
2013-07-16 23:00:06 -0700   wip-5634

Danny Al-Gaaf danny.al-g...@bisect.de
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-04-30 12:34:08 +0200   wip-user
2015-06-01 06:59:26 +0200   wip-da-SCA-20150427

David Zafman dzaf...@redhat.com
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2014-11-26 09:41:50 -0800   wip-9403
2014-12-02 21:20:17 -0800   wip-zafman-docfix
2015-01-08 15:07:45 -0800   wip-vstart-kvs
2015-02-20 16:13:43 -0800   wip-10883-firefly
2015-02-20 16:14:57 -0800   wip-10883-dumpling
2015-04-23 10:22:09 -0700   wip-11454-80.8
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-05-21 15:29:48 -0700   wip-11511
2015-05-28 13:41:30 -0700   wip-10794

Dongmao Zhang deanracc...@gmail.com
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum gfar...@redhat.com
2015-04-19 18:03:41 -0700   greg-testing-quota-full
2015-04-29 21:44:11 -0700   wip-init-names
2015-05-31 16:30:59 -0700   greg-fs-testing

Greg Farnum g...@inktank.com
2014-10-23 13:33:44 -0700   wip-forward-scrub

Gregory Meno gm...@redhat.com
2015-02-25 17:30:33 -0800   wip-fix-typo-troubleshooting

Guang Yang ygu...@yahoo-inc.com
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang haomaiw...@gmail.com
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-04-20 00:48:42 +0800   update-organization-1

Ilya Dryomov ilya.dryo...@inktank.com
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Jason Dillaman dilla...@redhat.com
2015-04-14 14:55:50 -0400   wip-11056-firefly
2015-05-12 11:04:36 -0400   wip-librbd-helgrind
2015-05-15 11:19:33 -0400   wip-11537
2015-05-21 16:27:18 -0400   wip-11579
2015-05-22 00:52:20 -0400   wip-11625
2015-05-27 00:41:42 -0400   wip-librbd-perf-counters
2015-05-28 15:23:33 -0400   wip-11405-next

Jenkins jenk...@inktank.com
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-04-13 13:24:40 -0700   rhcs-v0.80.8
2015-05-27 16:44:15 -0700   rhcs-v0.94.1-ubuntu

Joao Eduardo Luis jec...@gmail.com
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis joao.l...@gmail.com
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis joao.l...@inktank.com
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling

Joao Eduardo Luis j...@redhat.com
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-26 13:00:09 +   wip-mon-datahealth-fix
2015-02-04 22:36:14 +   wip-10643
2015-02-26 14:54:01 +   wip-10507

Joao Eduardo Luis j...@suse.de
2015-05-27 23:48:45 +0100   wip-mon-scrub
2015-05-28 08:12:48 +0100   wip-11786
2015-05-29 12:21:43 +0100   wip-11545
2015-06-01 12:18:01 +0100   wip-joao-testing

John Spray john.sp...@redhat.com
2015-02-18 14:04:18 +   wip10649
2015-04-06 17:25:02 +0100   wip-progress-events
2015-05-05 14:29:16 +0100   wip-live-query
2015-05-06 16:21:25 +0100   wip-11541-hammer-workaround
2015-05-22 10:48:32 +0100   wip-offline-backward
2015-05-22 14:17:20 +0100   wip-damaged-fixes
2015-05-28 13:31:32 +0100   wip-9963
2015-05-29 13:59:03 +0100   wip-9964-intrapg
2015-05-29 

Re: MDS auth caps for cephfs

2015-06-01 Thread Sage Weil
I have a pull request posted at

https://github.com/ceph/ceph/pull/4809

that updates the mds cap parser and defines a check method.  Please take 
a look and see if this makes sense.

For the path restrictions, I think the next steps are something like

 - generate a path string and pass it in to that method
 - write some simple tests
 - make sure the hook is called from everywhere it needs to be (all of the 
other request handlers in Server.cc)
 - call the hook from the cap writeback path (tricky)
 - figure out how to handle files in the stray dir (tricky)

For the user-based restrictions,

 - I think we need to expand the allows() method so that it has a couple of 
output arguments (uid and gid list) that subsequent permissions should be 
validated against.  Then we can change the function in Server.cc so 
that when those are populated it does an actual unix permissions check.  
This seems like the simplest thing to me, although it does have the 
slightly-odd property that if you are doing an operation that requires 
permission on two inodes (directory and file, say) you might have two 
different 'allow ...' lines granting access to each.  (I think this is 
both painful to avoid and also harmless?)
 - same items above to call into the check method in the appropriate 
places.
 - extend client/mds protocol to pass credential struct.  This should 
piggyback on Goldwyn's work to fix up these structures for namespaces.
 - mark caps with credentials on clients, fix writeback order, etc.

In any case, the first item on both those lists seems like the place 
to start.
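
To illustrate the first user-based item, this is the rough shape I have in
mind for the expanded check (a sketch with invented names, not what the pull
request actually defines):

#include <string>
#include <sys/types.h>
#include <vector>

// Sketch: the cap check, besides yes/no, hands back the uid/gid lists that
// any subsequent unix permission check in Server.cc must be validated against.
struct MDSCapCheckSketch {
  bool allows(const std::string &path, bool may_write,
              std::vector<uid_t> *out_uids, std::vector<gid_t> *out_gids) const
  {
    // a real implementation would match 'path' against the grant's path
    // restriction and copy any uid/gid restrictions out of the grant
    (void)path; (void)may_write;
    out_uids->clear();   // empty lists == no user-based restriction
    out_gids->clear();
    return true;         // placeholder for an "allow *" grant
  }
};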

sage


Re: A bug in rolling back to a degraded object

2015-06-01 Thread Sage Weil
Hi Zhiqiang,

On Mon, 1 Jun 2015, Wang, Zhiqiang wrote:
 Hi Sage and all,
 
 Another bug discovered during the proxy write teuthology testing is when 
 rolling back to a degraded object. This doesn't seem specific to proxy 
 write. I see a scenario like below from the log file:
 
  - A rollback op comes in, and is enqueued.
  - Several other ops on the same object come in, and are enqueued.
  - The rollback op dispatches, and finds the object which it rollbacks 
 to is degraded, then this op is pushbacked into a list to wait for the 
 degraded object to recover.
  - The later ops are handled and responded back to client.
  - The degraded object recovers. The rollback op is enqueued again and 
 finally responded to client.

Yep!
 
 This breaks the op order. A fix for this is to maintain a map to track 
 the source, destination pair. And when an op on the source dispatches, 
 if such a pair exists, queue the op in the destination's degraded 
 waiting list. A drawback of this approach is that some entries in the ' 
 waiting_for_degraded_object' list of the destination object may not be 
 actually accessing the destination, but the source. Does this make 
 sense?

Yeah, and I think it's fine for the op to appear in the other object's 
list.  In fact there is already a mechanism in place that does something 
similar: obc->blocked_by.  It was added for the clone operation, which 
unfortunately I don't think is exercised in any of our tests... but 
I think it does exactly what you need.  If you set the head's blocked_by 
to the clone (and the clone's blocks set to include the head) then anybody 
trying to write to the head will queue up on the clone's degraded 
list (see the check for this in ReplicatedPG::do_op()).

I think this mostly amounts to making the _rollback_to() method get the 
clone's obc, set up the blocked_by/blocks relationship, start recovery of 
that object immediately, and queue itself on the waiting list.
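
As a toy model of that flow (the real code is ReplicatedPG::do_op() and
_rollback_to(); the names below are simplified stand-ins):

#include <list>
#include <map>
#include <memory>
#include <set>
#include <string>

struct Obc;
typedef std::shared_ptr<Obc> ObcRef;
struct Obc {
  std::string oid;
  bool degraded = false;
  ObcRef blocked_by;         // head is blocked by its degraded clone
  std::set<ObcRef> blocks;   // clone remembers which heads it blocks
};

struct Op { std::string desc; };
std::map<std::string, std::list<Op> > waiting_for_degraded_object;

// do_op(): any op touching a head that is blocked_by a degraded clone queues
// on the *clone's* degraded list, which preserves op ordering on the head.
bool maybe_wait(const ObcRef &head, const Op &op)
{
  if (head->blocked_by && head->blocked_by->degraded) {
    waiting_for_degraded_object[head->blocked_by->oid].push_back(op);
    return true;
  }
  return false;
}

// _rollback_to(): record the relationship, (kick recovery of the clone), and
// park the rollback op itself on the clone's degraded list.
void start_rollback(const ObcRef &head, const ObcRef &clone, const Op &op)
{
  head->blocked_by = clone;
  clone->blocks.insert(head);
  waiting_for_degraded_object[clone->oid].push_back(op);
}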

Does that make sense?
sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash 
with that branch with 'debug newstore = 20' and send us the log?  
(You can just do 'ceph-post-file filename'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,
 
 The assertion failed at line 1639, here is the log message
 
 
 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
 
 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)
 
 
 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial):
 
 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0
 
 
 Regards
 Srikanth
 
 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage
 
 


RE: A bug in rolling back to a degraded object

2015-06-01 Thread Wang, Zhiqiang
That's great that we already have such a field. I'll make use of it to fix 
this.  Thanks for pointing it out.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Tuesday, June 2, 2015 7:42 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: A bug in rolling back to a degraded object

Hi Zhiqiang,

On Mon, 1 Jun 2015, Wang, Zhiqiang wrote:
 Hi Sage and all,
 
 Another bug discovered during the proxy write teuthology testing is 
 when rolling back to a degraded object. This doesn't seem specific to 
 proxy write. I see a scenario like below from the log file:
 
  - A rollback op comes in, and is enqueued.
  - Several other ops on the same object come in, and are enqueued.
  - The rollback op dispatches, and finds the object which it rollbacks 
 to is degraded, then this op is pushbacked into a list to wait for the 
 degraded object to recover.
  - The later ops are handled and responded back to client.
  - The degraded object recovers. The rollback op is enqueued again and 
 finally responded to client.

Yep!
 
 This breaks the op order. A fix for this is to maintain a map to track 
 the source, destination pair. And when an op on the source 
 dispatches, if such a pair exists, queue the op in the destination's 
 degraded waiting list. A drawback of this approach is that some entries in 
 the '
 waiting_for_degraded_object' list of the destination object may not be 
 actually accessing the destination, but the source. Does this make 
 sense?

Yeah, and I think it's fine for the op to appear in the other object's list.  
In fact there is already a mechanism in place that does something
similar: obc->blocked_by.  It was added for the clone operation, which 
unfortunately I don't think is exercised in any of our tests... but I think 
it does exactly what you need.  If you set the head's blocked_by to the clone 
(and the clone's blocks set to include the head) then anybody trying to write 
to the head will queue up on the clone's degraded list (see the check for this 
in ReplicatedPG::do_op()).

I think this mostly amounts to making the _rollback_to() method get the clone's 
obc, set up the blocked_by/blocks relationship, start recovery of that object 
immediately, and queue itself on the waiting list.

Does that make sense?
sage


osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage and all,

I built the Ceph code from wip-newstore on RHEL7 and am running performance
tests to compare with filestore. After a few hours of running the tests
the osd daemons started to crash. Here is the stack trace; the osd
crashes immediately after the restart, so I could not get the osd up
and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118
(eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
[0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

Please let me know the cause of this crash. When this crash happens I
noticed that two osds on separate machines are down. I can bring one
osd up, but restarting the other osd causes both OSDs to crash. My
understanding is that the crash happens when the two OSDs try to
communicate and replicate a particular PG.

Regards
Srikanth


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

The assertion failed at line 1639, here is the log message


2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
function 'virtual int NewStore::collection_list_partial(coll_t,
ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


Just before the crash, here are the debug statements printed by the
method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
--.7fb4.. to --.7fb4.0800. and
--.804b.. to --.804b.0800. start
-1/0//0/0


Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,

 I build ceph code from wip-newstore on RHEL7 and running performance
 tests to compare with filestore. After few hours of running the tests
 the osd daemons started to crash. Here is the stack trace, the osd
 crashes immediately after the restart. So I could not get the osd up
 and running.

 ceph version b8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const, int, int,
 snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
 [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
 ThreadPool::TPHandle)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptrPG,
 std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
 [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 Please let me know the cause of this crash, when this crash happens I
 noticed that two osds on separate machines are down. I can bring one
 osd up but restarting the other osd causes both OSDs to crash. My
 understanding is the crash seems to happen when two OSDs try to
 communicate and replicate a particular PG.

 Can you include the log lines that preceed the dump above?  In particular,
 there should be a line that tells you what assertion failed in what
 function and at what line number.  I haven't seen this crash so I'm not
 sure offhand what it is.

 Thanks!
 sage


RE: Discuss: New default recovery config settings

2015-06-01 Thread Paul Von-Stamwitz
On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
  Many people have reported that they need to lower the osd recovery config 
  options to minimize the impact of recovery on client io.  We are talking 
  about changing the defaults as follows:
 
  osd_max_backfills to 1 (from 10)
  osd_recovery_max_active to 3 (from 15)
  osd_recovery_op_priority to 1 (from 10)
  osd_recovery_max_single_start to 1 (from 5)
 
 I'm under the (possibly erroneous) impression that reducing the number of max 
 backfills doesn't actually reduce recovery speed much (but will reduce memory 
 use), but that dropping the op priority can. I'd rather we make users 
 manually adjust values which can have a material impact on their data safety, 
 even if most of them choose to do so.
 
 After all, even under our worst behavior we're still doing a lot better than 
 a resilvering RAID array. ;) -Greg
 --


Greg,
When we set...

osd recovery max active = 1
osd max backfills = 1

We see rebalance times go down by more than half and client write performance 
increase significantly while rebalancing. We initially played with these 
settings to improve client IO expecting recovery time to get worse, but we got 
a 2-for-1. 
This was with firefly using replication, downing an entire node with lots of 
SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and 
osd_recovery_max_single_start default.
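
(For reference, the same values can usually be injected at runtime without a
restart, e.g.:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
)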

We dropped osd_recovery_max_active and osd_max_backfills together. If you're 
right, do you think osd_recovery_max_active=1 is the primary reason for the 
improvement? (Higher osd_max_backfills helps recovery time with erasure coding.)

-Paul


Re: Discuss: New default recovery config settings

2015-06-01 Thread Gregory Farnum
On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz
pvonstamw...@us.fujitsu.com wrote:
 On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
  Many people have reported that they need to lower the osd recovery config 
  options to minimize the impact of recovery on client io.  We are talking 
  about changing the defaults as follows:
 
  osd_max_backfills to 1 (from 10)
  osd_recovery_max_active to 3 (from 15)
  osd_recovery_op_priority to 1 (from 10)
  osd_recovery_max_single_start to 1 (from 5)

 I'm under the (possibly erroneous) impression that reducing the number of 
 max backfills doesn't actually reduce recovery speed much (but will reduce 
 memory use), but that dropping the op priority can. I'd rather we make users 
 manually adjust values which can have a material impact on their data 
 safety, even if most of them choose to do so.

 After all, even under our worst behavior we're still doing a lot better than 
 a resilvering RAID array. ;) -Greg
 --


 Greg,
 When we set...

 osd recovery max active = 1
 osd max backfills = 1

 We see rebalance times go down by more than half and client write performance 
 increase significantly while rebalancing. We initially played with these 
 settings to improve client IO expecting recovery time to get worse, but we 
 got a 2-for-1.
 This was with firefly using replication, downing an entire node with lots of 
 SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and 
 osd_recovery_max_single_start default.

 We dropped osd_recovery_max_active and osd_max_backfills together. If you're 
 right, do you think osd_recovery_max_active=1 is primary reason for the 
 improvement? (higher osd_max_backfills helps recovery time with erasure 
 coding.)

Well, recovery max active and max backfills are similar in many ways.
Both are about moving data into a new or outdated copy of the PG — the
difference is that recovery refers to our log-based recovery (where we
compare the PG logs and move over the objects which have changed)
whereas backfill requires us to incrementally move through the entire
PG's hash space and compare.
I suspect dropping down max backfills is more important than reducing
max recovery (gathering recovery metadata happens largely in memory)
but I don't really know either way.

My comment was meant to convey that I'd prefer we not reduce the
recovery op priority levels. :)
-Greg


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,
 
 I build ceph code from wip-newstore on RHEL7 and running performance
 tests to compare with filestore. After few hours of running the tests
 the osd daemons started to crash. Here is the stack trace, the osd
 crashes immediately after the restart. So I could not get the osd up
 and running.
 
 ceph version b8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const, int, int,
 snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
 [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
 ThreadPool::TPHandle)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptrPG,
 std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
 [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.
 
 Please let me know the cause of this crash, when this crash happens I
 noticed that two osds on separate machines are down. I can bring one
 osd up but restarting the other osd causes both OSDs to crash. My
 understanding is the crash seems to happen when two OSDs try to
 communicate and replicate a particular PG.

Can you include the log lines that preceed the dump above?  In particular, 
there should be a line that tells you what assertion failed in what 
function and at what line number.  I haven't seen this crash so I'm not 
sure offhand what it is.

Thanks!
sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the
backfill tool. I have not seen the osd crash on the cluster yet. I am
monitoring the OSDs and will update you once I see the crash.

With the new backfill run I have reduced the rps by half; not sure if
this is the reason for not seeing the crash yet.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639, here is the log message


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
 
 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash the here are the debug statements printed by the
 method (collection_list_partial)

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage

