re: crush: warn on do_rule failure

2012-10-03 Thread Dan Carpenter
Hello Sage Weil,

The patch 8b3932690084: "crush: warn on do_rule failure" from May 7,
2012, leads to the following warning:
net/ceph/osdmap.c:1117:8: warning: comparison of unsigned expression < 0 is 
always false [-Wtautological-compare]

net/ceph/osdmap.c
  1114          r = crush_do_rule(osdmap->crush, ruleno, pps, osds,
  1115                            min_t(int, pool->v.size, *num),
  1116                            osdmap->osd_weight);
  1117          if (r < 0) {
                    ^
r is unsigned so it's never less than zero.  Also crush_do_rule() never
returns negative numbers.

  1118                  pr_err("error %d from crush rule: pool %d ruleset %d type %d"
  1119                         " size %d\n", r, poolid, pool->v.crush_ruleset,
  1120                         pool->v.type, pool->v.size);
  1121                  return NULL;
  1122          }
  1123          *num = r;
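
For anyone unfamiliar with the warning, here is a small self-contained
illustration of why the check is dead code (do_rule() below is a hypothetical
stand-in for crush_do_rule(); it is not ceph code):

	#include <stdio.h>

	static int do_rule(void)	/* stand-in for crush_do_rule() */
	{
		return -1;		/* pretend an error happened */
	}

	int main(void)
	{
		unsigned int r = do_rule();	/* -1 converts to UINT_MAX */

		if (r < 0)			/* always false for unsigned r */
			printf("error %d\n", (int)r);
		else
			printf("r = %u (the error was silently lost)\n", r);
		return 0;
	}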

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mds: Use stray dir as base inode for stray reintegration

2012-10-03 Thread Yan, Zheng
From: "Yan, Zheng" 

Server::handle_client_rename() only skips the common ancestor check
if the source path's base inode is a stray directory, but the source
path's base inode is the mdsdir in the stray reintegration case.

Signed-off-by: Yan, Zheng 
---
 src/mds/MDCache.cc | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 8b02b8b..32c9e36 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -8288,8 +8288,7 @@ void MDCache::reintegrate_stray(CDentry *straydn, CDentry *rdn)
   dout(10) << "reintegrate_stray " << *straydn << " into " << *rdn << dendl;
   
   // rename it to another mds.
-  filepath src;
-  straydn->make_path(src);
+  filepath src(straydn->get_name(), straydn->get_dir()->get_inode()->ino());
   filepath dst;
   rdn->make_path(dst);
 
-- 
1.7.11.4

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/12] mds: Avoid creating unnecessary snaprealm

2012-10-03 Thread Yan, Zheng
On 10/03/2012 08:12 AM, Sage Weil wrote:
> On Wed, 3 Oct 2012, Yan, Zheng wrote:
>> On 10/03/2012 02:31 AM, Sage Weil wrote:
>>> Hi Yan,
>>>
>>> This whole series looks great!  Sticking it in wip-mds and running it 
>>> through the fs qa suite before merging it.
>>>
>>> How are you testing these?  If you haven't seen it yet, there is an 'mds 
>>> thrash exports' option that will make MDSs random migrate subtrees to each 
>>> other that is great for shaking out bugs.  That and periodic daemon 
>>> restarts (one of the first things we need to do on the clustered mds front 
>>> is to get daemon restarting integrated into teuthology).
>>>
>>
>> The patches are fixes for problems I encountered while exercising MDS shutdown.
>> I set up a cephfs with 2 MDSes, copied some data into it, deleted some directories
>> whose authority is MDS.1, then shut down MDS.1.
>>
>> Most patches in this series are obvious. The two snaprealm-related patches
>> are workarounds for a bug: a replica inode's snaprealm->open is not true.
>> The bug triggers an assertion in CInode::pop_projected_snaprealm() if a
>> snaprealm is involved in a cross-authority rename.
> 
> Do you mind opening a ticket at tracker.newdream.net so we don't lose 
> track of it?

will do
> 
> Fsstress on a single mds turned up this:
> 
> 2012-10-02T17:09:09.359 INFO:teuthology.task.ceph.mds.a.err:*** Caught signal (Segmentation fault) **
> 2012-10-02T17:09:09.359 INFO:teuthology.task.ceph.mds.a.err: in thread 7f8873a41700
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: ceph version 0.52-949-ge8df6a7 (commit:e8df6a74cae66accb6682129c9c5ad33797f458c)
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 1: /tmp/cephtest/binary/usr/local/bin/ceph-mds() [0x812b21]
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 2: (()+0xfcb0) [0x7f88787b3cb0]
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 3: (Server::handle_client_rename(MDRequest*)+0xa28) [0x53dc88]
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 4: (Server::dispatch_client_request(MDRequest*)+0x4fb) [0x54123b]
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 5: (Server::handle_client_request(MClientRequest*)+0x51d) [0x544a6d]
> 2012-10-02T17:09:09.361 INFO:teuthology.task.ceph.mds.a.err: 6: (Server::dispatch(Message*)+0x2d3) [0x5452e3]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 7: (MDS::handle_deferrable_message(Message*)+0x91f) [0x4bc32f]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 8: (MDS::_dispatch(Message*)+0x9b6) [0x4cf8b6]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 9: (MDS::ms_dispatch(Message*)+0x21b) [0x4d0c3b]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 10: (DispatchQueue::entry()+0x711) [0x7eb301]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7713dd]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 12: (()+0x7e9a) [0x7f88787abe9a]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: 13: (clone()+0x6d) [0x7f8876d534bd]
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err:2012-10-02 17:09:09.349272 7f8873a41700 -1 *** Caught signal (Segmentation fault) **
> 2012-10-02T17:09:09.362 INFO:teuthology.task.ceph.mds.a.err: in thread 7f8873a41700
> 
> I don't have time right now to hunt this down, but you should be able to 
> reproduce with qa/workunits/suites/fsstress.sh on top of ceph-fuse with 1 
> mds.
> 

This is an old stray reintegration bug; I just sent a patch to fix it.

Regards
Yan, Zheng
 

> Thanks!
> sage
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow ceph fs performance

2012-10-03 Thread Bryan K. Wright
Hi again,

A few answers to questions from various people on the list
after my last e-mail:

g...@inktank.com said:
> Yes. Bryan, you mentioned that you didn't see a lot of resource usage — was it
> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not
> going to get much past one cpu core of time... 

The CPU usage on the MDSs hovered around a few percent.
They're quad-core machines, and I didn't see it ever get as high
as 25% usage on any of the cores while watching with atop.

g...@inktank.com said:
> The rados bench results do indicate some pretty bad small-file write
> performance as well though, so I guess it's possible your testing is running
> long enough that the page cache isn't absorbing that hit. Did performance
> start out higher or has it been flat? 

Looking at the details of the rados benchmark output, it does 
look like performance starts out better for the first few iterations,
and then goes bad.  Here's the beginning of a typical small-file run:

 Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
     2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
     3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
     4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
     5     256     11256     11000    8.5928         0         - 0.0846414
     6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
     7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
     8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
     9     256     13737     13481   5.85051      1.75  0.120657  0.158865
    10     256     14341     14085   5.50138   2.35938  0.022544  0.178298

I see the same behavior every time I repeat the small-file 
rados benchmark.  Here's a graph showing the first 100 "cur MB/s" values
for a short-file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf

On the other hand, with 4MB files, I see results that start out like 
this:

 Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      49        49         0         0         0         -         0
     2      76        76         0         0         0         -         0
     3     105       105         0         0         0         -         0
     4     133       133         0         0         0         -         0
     5     159       159         0         0         0         -         0
     6     188       188         0         0         0         -         0
     7     218       218         0         0         0         -         0
     8     246       246         0         0         0         -         0
     9     256       274        18   7.99904         8   8.97759   8.66218
    10     255       301        46   18.3978       112    9.1456   8.94095
    11     255       330        75   27.2695       116   9.06968     9.013
    12     255       358       103   34.3292       112   9.12486   9.04374

Here's a graph showing the first 100 "cur MB/s" values for a typical
4MB file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf
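
For reference, output of the shape shown above would typically come from
invocations along these lines (the pool name is a placeholder; the exact
commands were not included in the thread):

 rados -p <pool> bench 900 write -t 256 -b 4096      # small-object run
 rados -p <pool> bench 900 write -t 256               # 4 MB run (4194304 bytes is the default object size)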

mark.nel...@inktank.com said:
> When you were doing this, what kind of results did collectl give you for
> average write sizes to the underlying OSD disks? 

The average "rwsize" reported by collectl hovered around 
6 +/- a few (in whatever units collectl reports) for the RAID
array, and around 15 for the journal SSD, while doing the small-file
rados benchmark.  Here's a screenshot showing atop running on
each of the MDS hosts, and collectl running on each of the OSD
hosts, while the benchmark was running:

http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png

Here's the same, but with collectl running on the MDSs instead of atop:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png

Looking at the last screenshot again, it does look like the disks on
the MDSs are getting some exercise, with ~40% utilization (if I'm
interpreting the collectl output correctly).

Here's a similar snapshot for the 4MB test:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png

It looks like similar "pct util" on the MDS disks, but much higher
average rwsize values on the OSDs.

mark.nel...@inktank.com said:
> There's multiple issues potentially here.  Part of it might be how  writes are
> coalesced by XFS in each sce

Re: [ceph-commit] teuthology lock_server error

2012-10-03 Thread Sage Weil
If you are not actually running the lock server (which you probably don't 
need anyway), add

check-locks: false

to your yaml.  I think that will get you going...

sage
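
For example (a sketch, assuming the flag is read from the job description
rather than from ~/.teuthology.yaml), the top of the job yaml quoted below
would become:

check-locks: false
roles:
- [mon.a, osd.0, osd.1]

with the rest of the file unchanged.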

On Wed, 3 Oct 2012, Lokesh Krishnappa wrote:

> Hi all,
> 
> I am new to git and am playing around with the qa suite from git, trying to test rbd on a local
> machine. The local test bed is set up as follows:
> 
> OS :  Ubuntu 12.04LTS
> teuthology version: 0.0.1
> ceph : 0.48.1 argonaut
> 
> We are running the server and the client locally on a single machine.
> yaml file format :
> roles:
> - [mon.a, osd.0, osd.1]
> 
> targets:
>    lokesh@lokesh: 
> ssh-rsaB3NzaC1yc2EDAQABAAABAQDmSb/VpYpdSQLp4unsKUNV/DKV2f55M1QSXHu10Qvco33
> rYopOF/l5+eYlINCHF1v/SA+bLOMT/4OHGkZR67TajAoXFpyolVSxZKRoJHLV2iJ+MrRiaxShct
> MYWpLOjc4iRA4BG3FTG5TTPKJweDj8dDdUIqZ9PuSpP5VuOriCpQkPWECL2hJqAYwnknK1Uhg3r
> YV0XxL14Iep9KCZf2PcJNw3Eur6XKDAczu/sAUlOiLrBsQpFmOaPY5jFDaM3U7KpJvqiI4Drq33
> 1iMh9n3GuA+JvcTMKuT7CN36GswlvGuakTzaDoR66JaYYKkGzDl977K94XrAdQwa2NKLu1XH
> 
> 
> tasks:
> - ceph:
> - ceph-fuse:
> - workunit:
>     clients:
>   client.0:
>     - rbd/test_cls_rbd.sh
> #-interactive-on-error: true
> 
> and my .teutholgy.yaml file format is:
> 
> lock_server: http://lokesh/lock
> queue_host: lokesh
> queue_port: 4000
> 
> while am trying to execute test suite as:
> lokesh@lokesh:~/Downloads/ceph-teuthology-78b7b02$
> ./virtualenv/bin/teuthology rbd_cls_tests1.yaml
> 
> i met error as:
> INFO:teuthology.run_tasks:Running task internal.save_config...
> INFO:teuthology.task.internal:Saving configuration
> INFO:teuthology.run_tasks:Running task internal.check_lock...
> INFO:teuthology.task.internal:Checking locks...
> INFO:teuthology.lock:GET request to 'http://lokesh/lock/lokesh@lokesh' with
> body 'None' failed with response code 404
> ERROR:teuthology.run_tasks:Saw exception from tasks
> Traceback (most recent call last):
>   File
> "/home/lokesh/Downloads/ceph-teuthology-78b7b02/teuthology/run_tasks.py",
> line 25, in run_tasks
>     manager = _run_one_task(taskname, ctx=ctx, config=config)
>   File
> "/home/lokesh/Downloads/ceph-teuthology-78b7b02/teuthology/run_tasks.py",
> line 14, in _run_one_task
>     return fn(**kwargs)
>   
> File"/home/lokesh/Downloads/ceph-teuthology-78b7b02/teuthology/task/internal.py
> ", line 110, in check_lock
>     'could not read lock status for {name}'.format(name=machine)
> AssertionError: could not read lock status for lokesh@lokesh
> 
> I don't have any idea about it, but the server is working fine. If you have any ideas,
> please reply.
> 
> 
> 
> With Regards,
> 
> Lokesh K
> 
> 
> 
> 
> 

Re: [PATCH] mds: Use stray dir as base inode for stray reintegration

2012-10-03 Thread Sage Weil
On Wed, 3 Oct 2012, Yan, Zheng wrote:
> From: "Yan, Zheng" 
> 
> Server::handle_client_rename() only skips common ancestor check
> if source path's base inode is stray directory, but source path's
> base inode is mdsdir in the stray reintegration case.
> 
> Signed-off-by: Yan, Zheng 

Hmm, I think this is a problem because path_traverse() does not know how 
to get started with a stray dir it doesn't already have.  It might make 
more sense to use the full path here, and make the handle_client_rename() 
check smarter.

I'm testing this:

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index b706b5a..0c3ff96 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5066,7 +5066,8 @@ void Server::handle_client_rename(MDRequest *mdr)
 
   // src+dest traces _must_ share a common ancestor for locking to prevent orphans
   if (destpath.get_ino() != srcpath.get_ino() &&
-  !MDS_INO_IS_STRAY(srcpath.get_ino())) {  // <-- mds 'rename' out of stray dir is ok!
+  !(req->get_source()->is_mds() &&
+   MDS_INO_IS_MDSDIR(srcpath.get_ino()))) {  // <-- mds 'rename' out of stray dir is ok!
 // do traces share a dentry?
 CDentry *common = 0;
 for (unsigned i=0; i < srctrace.size(); i++) {


Thanks!
sage


> ---
>  src/mds/MDCache.cc | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
> index 8b02b8b..32c9e36 100644
> --- a/src/mds/MDCache.cc
> +++ b/src/mds/MDCache.cc
> @@ -8288,8 +8288,7 @@ void MDCache::reintegrate_stray(CDentry *straydn, CDentry *rdn)
>dout(10) << "reintegrate_stray " << *straydn << " into " << *rdn << dendl;
>
>// rename it to another mds.
> -  filepath src;
> -  straydn->make_path(src);
> +  filepath src(straydn->get_name(), straydn->get_dir()->get_inode()->ino());
>filepath dst;
>rdn->make_path(dst);
>  
> -- 
> 1.7.11.4
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mds: Use stray dir as base inode for stray reintegration

2012-10-03 Thread Yan, Zheng
On 10/03/2012 11:17 PM, Sage Weil wrote:
> On Wed, 3 Oct 2012, Yan, Zheng wrote:
>> From: "Yan, Zheng" 
>>
>> Server::handle_client_rename() only skips common ancestor check
>> if source path's base inode is stray directory, but source path's
>> base inode is mdsdir in the stray reintegration case.
>>
>> Signed-off-by: Yan, Zheng 
> 
> Hmm, I think this is a problem because path_traverse() does not know how 
> to get started with a stray dir it doesn't already have.  It might make 
> more sense to use the full path here, and make the handle_client_rename() 
> check smarter.
> 
> I'm testing this:
> 
> diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> index b706b5a..0c3ff96 100644
> --- a/src/mds/Server.cc
> +++ b/src/mds/Server.cc
> @@ -5066,7 +5066,8 @@ void Server::handle_client_rename(MDRequest *mdr)
>  
>    // src+dest traces _must_ share a common ancestor for locking to prevent orphans
>    if (destpath.get_ino() != srcpath.get_ino() &&
> -  !MDS_INO_IS_STRAY(srcpath.get_ino())) {  // <-- mds 'rename' out of stray dir is ok!
> +  !(req->get_source()->is_mds() &&
> +   MDS_INO_IS_MDSDIR(srcpath.get_ino()))) {  // <-- mds 'rename' out of stray dir is ok!
>  // do traces share a dentry?
>  CDentry *common = 0;
>  for (unsigned i=0; i < srctrace.size(); i++) {
> 
>
FYI: MDCache::migrate_stray() also uses the stray directory as its base inode,
so it requires an update as well.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: error: could not read lock status

2012-10-03 Thread Gregory Farnum
Check out the thread titled "[ceph-commit] teuthology lock_server error". :)

On Wed, Oct 3, 2012 at 5:41 AM, Pradeep S  wrote:
> hi, i am getting the following error while executing ceph-qa-suite in
> teuthology.
> INFO:teuthology.run_tasks:Running task internal.save_config...
> INFO:teuthology.task.internal:Saving configuration
> INFO:teuthology.run_tasks:Running task internal.check_lock...
> INFO:teuthology.task.internal:Checking locks...
> INFO:teuthology.lock:GET request to
> 'http://pradeep:4000/lock/prad...@pradeep.com' with body 'None' failed
> with response code 404
> ERROR:teuthology.run_tasks:Saw exception from tasks
> Traceback (most recent call last):
>   File 
> "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/run_tasks.py",
> line 25, in run_tasks
> manager = _run_one_task(taskname, ctx=ctx, config=config)
>   File 
> "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/run_tasks.py",
> line 14, in _run_one_task
> return fn(**kwargs)
>   File 
> "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/task/internal.py",
> line 110, in check_lock
> 'could not read lock status for {name}'.format(name=machine)
> AssertionError: could not read lock status for prad...@pradeep.com
>
> The conf file is: lock_server: http://pradeep:4000/lock.
>
> With thanks
> Pradeep S
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: error: could not read lock status

2012-10-03 Thread Sage Weil
On Wed, 3 Oct 2012, Gregory Farnum wrote:
> Check out the thread titled "[ceph-commit] teuthology lock_server error". :)

In case it didn't hit ceph-devel:

Add 

check-locks: false

to your yaml file and the lock checks will all go away.

sage


> 
> On Wed, Oct 3, 2012 at 5:41 AM, Pradeep S  wrote:
> > hi, i am getting the following error while executing ceph-qa-suite in
> > teuthology.
> > INFO:teuthology.run_tasks:Running task internal.save_config...
> > INFO:teuthology.task.internal:Saving configuration
> > INFO:teuthology.run_tasks:Running task internal.check_lock...
> > INFO:teuthology.task.internal:Checking locks...
> > INFO:teuthology.lock:GET request to
> > 'http://pradeep:4000/lock/prad...@pradeep.com' with body 'None' failed
> > with response code 404
> > ERROR:teuthology.run_tasks:Saw exception from tasks
> > Traceback (most recent call last):
> >   File 
> > "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/run_tasks.py",
> > line 25, in run_tasks
> > manager = _run_one_task(taskname, ctx=ctx, config=config)
> >   File 
> > "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/run_tasks.py",
> > line 14, in _run_one_task
> > return fn(**kwargs)
> >   File 
> > "/home/pradeep/Downloads/ceph-teuthology-0395df3/teuthology/task/internal.py",
> > line 110, in check_lock
> > 'could not read lock status for {name}'.format(name=machine)
> > AssertionError: could not read lock status for prad...@pradeep.com
> >
> > The conf file is: lock_server: http://pradeep:4000/lock.
> >
> > With thanks
> > Pradeep S
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow ceph fs performance

2012-10-03 Thread Gregory Farnum
I think I'm with Mark now — this does indeed look like too much random
IO for the disks to handle. In particular, Ceph requires that each
write be synced to disk before it's considered complete, which rsync
definitely doesn't. In the filesystem this is generally disguised
fairly well by all the caches and such in the way, but this use case
is unfriendly to that arrangement.

However, I am particularly struck by seeing one of your OSDs at 96%
disk utilization while the others remain <50%, and I've just realized
we never saw output from ceph -s. Can you provide that, please?
-Greg

On Wed, Oct 3, 2012 at 7:55 AM, Bryan K. Wright
 wrote:
> Hi again,
>
> A few answers to questions from various people on the list
> after my last e-mail:
>
> g...@inktank.com said:
>> Yes. Bryan, you mentioned that you didn't see a lot of resource usage — was 
>> it
>> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
>> theory, but in practice it has the equivalent of a Big Kernel Lock so it's 
>> not
>> going to get much past one cpu core of time...
>
> The CPU usage on the MDSs hovered around a few percent.
> They're quad-core machines, and I didn't see it ever get as high
> as 25% usage on any of the cores while watching with atop.
>
> g...@inktank.com said:
>> The rados bench results do indicate some pretty bad small-file write
>> performance as well though, so I guess it's possible your testing is running
>> long enough that the page cache isn't absorbing that hit. Did performance
>> start out higher or has it been flat?
>
> Looking at the details of the rados benchmark output, it does
> look like performance starts out better for the first few iterations,
> and then goes bad.  Here's the beginning of a typical small-file run:
>
>  Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
>      2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
>      3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
>      4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
>      5     256     11256     11000    8.5928         0         - 0.0846414
>      6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
>      7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
>      8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
>      9     256     13737     13481   5.85051      1.75  0.120657  0.158865
>     10     256     14341     14085   5.50138   2.35938  0.022544  0.178298
>
> I see the same behavior every time I repeat the small-file
> rados benchmark.  Here's a graph showing the first 100 "cur MB/s" values
> for a short-file benchmark:
>
> http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf
>
> On the other hand, with 4MB files, I see results that start out like
> this:
>
>  Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      49        49         0         0         0         -         0
>      2      76        76         0         0         0         -         0
>      3     105       105         0         0         0         -         0
>      4     133       133         0         0         0         -         0
>      5     159       159         0         0         0         -         0
>      6     188       188         0         0         0         -         0
>      7     218       218         0         0         0         -         0
>      8     246       246         0         0         0         -         0
>      9     256       274        18   7.99904         8   8.97759   8.66218
>     10     255       301        46   18.3978       112    9.1456   8.94095
>     11     255       330        75   27.2695       116   9.06968     9.013
>     12     255       358       103   34.3292       112   9.12486   9.04374
>
> Here's a graph showing the first 100 "cur MB/s" values for a typical
> 4MB file benchmark:
>
> http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf
>
> mark.nel...@inktank.com said:
>> When you were doing this, what kind of results did collectl give you for
>> average write sizes to the underlying OSD disks?
>
> The average "rwsize" reported by collectl hovered around
> 6 +/- a few (in whatever units collectl reports) for the RAID
> array, and around 15 for the journal SSD, while doing the small-file
> rados benchmark.  Here's a screenshot showing atop running on
> each of the MDS hosts, and collectl running on each of the OSD
> hosts, while the ben

Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-03 Thread Noah Watkins
Hi Sage,

I wanted to touch base on this Java bindings patch series to make sure
this can become a solid foundation for the Hadoop shim clean-up. Were
there any specific issues with this, other than not yet having a major
consumer?

-Noah

On Thu, Sep 6, 2012 at 11:02 AM, Noah Watkins  wrote:
> On Thu, Sep 6, 2012 at 8:42 AM, Sage Weil  wrote:
>>
>> Also, I noticed the automake patch removes CephException.java, added in
>> the previous patch... probably an accident?
>
> I left CephException out intentionally because it could be replaced by
> generic exceptions in Java like IOException or FileNotFoundException.
> My belief is that CephException (and specializations of this) will
> need to be re-introduced as more error cases are covered. For example,
> I think we'll want a CephBadFileDescriptor, but currently libcephfs
doesn't appear to return -EBADF (the client is crashing due to fd_map
> assert). So, the Java wrappers should be as stable as the C API, but
> both seem to need some loose ends tied up.
>
>> Can't wait to pull this in!  BTW, Noah, do you know if this is still an
>> issue?
>>
>> http://tracker.newdream.net/issues/2778
>
> It does appear that -ENOENT is now returned. I added a unit test that
> expects FileNotFoundException and that works out nicely.
>
> - Noah
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-03 Thread Sage Weil
On Wed, 3 Oct 2012, Noah Watkins wrote:
> Hi Sage,
> 
> I wanted to touch base on this Java bindings patch series to make sure
> this can become a solid foundation for the Hadoop shim clean-up. Were
> there any specific issues with this, other than not yet having a major
> consumer?

From my perspective it's ready.  I was hoping that Laszlo could look at 
the packaging bits before I merged, but then I lost track of it.

Laszlo, do you have a minute to take a look?

Thanks!
sage

> 
> -Noah
> 
> On Thu, Sep 6, 2012 at 11:02 AM, Noah Watkins  wrote:
> > On Thu, Sep 6, 2012 at 8:42 AM, Sage Weil  wrote:
> >>
> >> Also, I noticed the automake patch removes CephException.java, added in
> >> the previous patch... probably an accident?
> >
> > I left CephException out intentionally because it could be replaced by
> > generic exceptions in Java like IOException or FileNotFoundException.
> > My belief is that CephException (and specializations of this) will
> > need to be re-introduced as more error cases are covered. For example,
> > I think we'll want a CephBadFileDescriptor, but currently libcephfs
> > doesn't appear to return -EBADF (the client is crashing due to fd_map
> > assert). So, the Java wrappers should be as stable as the C API, but
> > both seem to need some loose ends tied up.
> >
> >> Can't wait to pull this in!  BTW, Noah, do you know if this is still an
> >> issue?
> >>
> >> http://tracker.newdream.net/issues/2778
> >
> > It does appear that -ENOENT is now returned. I added a unit test that
> > expects FileNotFoundException and that works out nicely.
> >
> > - Noah
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to write data to rbd pool.

2012-10-03 Thread Dan Mick
Don't understand the question.  rbd images live, by default, in the rbd 
pool; the image namespace is flat, and there is no path.


On 10/02/2012 06:13 AM, ramu eppa wrote:

Hi all,

   How do we find the rbd volume path?

Thanks,
Ramu.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to write data to rbd pool.

2012-10-03 Thread Dan Mick
What tool/library/context are you using the string "nova/volume1/" 
in?  What does  represent?  How was volume1 created?




On 10/02/2012 10:09 AM, ramu eppa wrote:

Hi Tommi Virtanen,

Actually, we know the full rbd path; we write some data directly to the volume.
nova/volume1/.


Thanks,
ramu.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to write data to rbd pool.

2012-10-03 Thread Tommi Virtanen
On Tue, Oct 2, 2012 at 9:55 PM, ramu eppa  wrote:
>When mapping the rbd to /dev it gives an error,
> "rbd map /dev/mypool/mytest "

The syntax is "rbd map mypool/mytest".

http://ceph.com/docs/master/man/8/rbd/
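
A minimal end-to-end sketch (hypothetical names; it assumes a pool called
mypool already exists, the rbd kernel module is loaded, and the ceph udev
rules are installed so the /dev/rbd/<pool>/<image> symlink gets created):

 rbd create mytest --pool mypool --size 1024
 rbd map mypool/mytest
 # the block device appears as /dev/rbd0 (or /dev/rbd/mypool/mytest)
 mkfs.ext4 /dev/rbd0
 mount /dev/rbd0 /mnt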
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-03 Thread Gregory Farnum
Sorry I haven't provided any feedback on this either — it's still in
my queue but I've had a great many things to do since you sent it
along. :)
-Greg

On Wed, Oct 3, 2012 at 12:34 PM, Noah Watkins  wrote:
> Hi Sage,
>
> I wanted to touch base on this Java bindings patch series to make sure
> this can become a solid foundation for the Hadoop shim clean-up. Were
> there any specific issues with this, other than not yet having a major
> consumer?
>
> -Noah
>
> On Thu, Sep 6, 2012 at 11:02 AM, Noah Watkins  wrote:
>> On Thu, Sep 6, 2012 at 8:42 AM, Sage Weil  wrote:
>>>
>>> Also, I noticed the automake patch removes CephException.java, added in
>>> the previous patch... probably an accident?
>>
>> I left CephException out intentionally because it could be replaced by
>> generic exceptions in Java like IOException or FileNotFoundException.
>> My belief is that CephException (and specializations of this) will
>> need to be re-introduced as more error cases are covered. For example,
>> I think we'll want a CephBadFileDescriptor, but currently libcephfs
>> doesn't appear to return -EBADF (the client is crashing due to fd_map
>> assert). So, the Java wrappers should be as stable as the C API, but
>> both seem to need some loose ends tied up.
>>
>>> Can't wait to pull this in!  BTW, Noah, do you know if this is still an
>>> issue?
>>>
>>> http://tracker.newdream.net/issues/2778
>>
>> It does appear that -ENOENT is now returned. I added a unit test that
>> expects FileNotFoundException and that works out nicely.
>>
>> - Noah
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Collection of strange lockups on 0.51

2012-10-03 Thread Andrey Korolyov
On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen  wrote:
> On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov  wrote:
>> Short post mortem - EX3200/12.1R2.9 may begin to drop packets (seems
>> to appear more likely on 0.51 traffic patterns, which is very strange
>> for L2 switching) when a bunch of the 802.3ad pairs, sixteen in my
>> case, exposed to extremely high load - database benchmark over 700+
>> rbd-backed VMs and cluster rebalance at same time. It explains
>> post-reboot lockups in igb driver and all types of lockups above. I
>> would very appreciate any suggestions of switch models which do not
>> expose such behavior in simultaneous conditions both off-list and in
>> this thread.
>
> I don't see how a switch dropping packets would give an ethernet card
> driver any excuse to crash, but I'm simultaneously happy to hear that
> it doesn't seem like Ceph is at fault, and sorry for your troubles.
>
> I don't have an up to date 1GbE card recommendation to share, but I
> would recommend making sure you're using a recent Linux kernel.

I formulated the reason incorrectly - of course drops cannot cause
a lockup by themselves, but the switch may somehow create a long-lasting
`corrupt` state on the trunk ports which leads to such lockups in the
ethernet card. Of course I'll play with the driver versions and
card/port settings, thanks for the suggestion :)

I'm still investigating the issue since it is quite hard to reproduce
at the right time, and I hope I'm able to capture this state using
tcpdump-like (i.e. software) methods - if the card driver locks up on
something, it may prevent the problematic byte sequence from being
processed at the packet sniffer level.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mds cache size configuration option being ignored

2012-10-03 Thread Tren Blackburn
Hi List;

I was advised to use the "mds cache size" option to limit the memory
that the mds process will take. I have it set to "32768". However,
the ceph-mds process is now at 50GB and still growing.

fern ceph # ps wwaux | grep ceph-mds
root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
/usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
/etc/ceph/ceph.conf

Have I specified the limit incorrectly? How far will it go?

Thanks in advance,

Tren
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mds cache size configuration option being ignored

2012-10-03 Thread Gregory Farnum
On Wed, Oct 3, 2012 at 3:22 PM, Tren Blackburn  wrote:
> Hi List;
>
> I was advised to use the "mds cache size" option to limit the memory
> that the mds process will take. I have it set to "32768". However it
> the ceph-mds process is now at 50GB and still growing.
>
> fern ceph # ps wwaux | grep ceph-mds
> root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
> /usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
> /etc/ceph/ceph.conf
>
> Have I specified the limit incorrectly? How far will it go?

Oof. That looks correct; it sounds like we have a leak or some other
kind of bug. I believe you're on Gentoo; did you build with tcmalloc?
If so, can you run "ceph -w" in one window and then "ceph mds tell 0
heap stats" and send back the output?
If you didn't build with tcmalloc, can you do so and try again? We
have noticed fragmentation issues with the default memory allocator,
which is why we switched (though I can't imagine it'd balloon that far
— but tcmalloc will give us some better options to diagnose it). Sorry
I didn't mention this before!
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mds cache size configuration option being ignored

2012-10-03 Thread Tren Blackburn
On Wed, Oct 3, 2012 at 4:15 PM, Gregory Farnum  wrote:
> On Wed, Oct 3, 2012 at 3:22 PM, Tren Blackburn  wrote:
>> Hi List;
>>
>> I was advised to use the "mds cache size" option to limit the memory
>> that the mds process will take. I have it set to "32768". However it
>> the ceph-mds process is now at 50GB and still growing.
>>
>> fern ceph # ps wwaux | grep ceph-mds
>> root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
>> /usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
>> /etc/ceph/ceph.conf
>>
>> Have I specified the limit incorrectly? How far will it go?
>
> Oof. That looks correct; it sounds like we have a leak or some other
> kind of bug. I believe you're on Gentoo; did you build with tcmalloc?
> If so, can you run "ceph -w" in one window and then "ceph mds tell 0
> heap stats" and send back the output?
> If you didn't build with tcmalloc, can you do so and try again? We
> have noticed fragmentation issues with the default memory allocator,
> which is why we switched (though I can't imagine it'd balloon that far
> — but tcmalloc will give us some better options to diagnose it). Sorry
> I didn't mention this before!

Hey Greg! Good recall, I am on Gentoo, and I did build with tcmalloc.
Here is the information you requested:

2012-10-03 16:20:43.979673 mds.0 [INF] mds.ferntcmalloc heap
stats:
2012-10-03 16:20:43.979676 mds.0 [INF] MALLOC:53796808560 (51304.6
MiB) Bytes in use by application
2012-10-03 16:20:43.979679 mds.0 [INF] MALLOC: +   753664 (0.7
MiB) Bytes in page heap freelist
2012-10-03 16:20:43.979681 mds.0 [INF] MALLOC: + 93299048 (   89.0
MiB) Bytes in central cache freelist
2012-10-03 16:20:43.979683 mds.0 [INF] MALLOC: +  6110720 (5.8
MiB) Bytes in transfer cache freelist
2012-10-03 16:20:43.979685 mds.0 [INF] MALLOC: + 84547880 (   80.6
MiB) Bytes in thread cache freelists
2012-10-03 16:20:43.979686 mds.0 [INF] MALLOC: + 84606976 (   80.7
MiB) Bytes in malloc metadata
2012-10-03 16:20:43.979688 mds.0 [INF] MALLOC:   
2012-10-03 16:20:43.979690 mds.0 [INF] MALLOC: =  54066126848 (51561.5
MiB) Actual memory used (physical + swap)
2012-10-03 16:20:43.979691 mds.0 [INF] MALLOC: +0 (0.0
MiB) Bytes released to OS (aka unmapped)
2012-10-03 16:20:43.979693 mds.0 [INF] MALLOC:   
2012-10-03 16:20:43.979694 mds.0 [INF] MALLOC: =  54066126848 (51561.5
MiB) Virtual address space used
2012-10-03 16:20:43.979700 mds.0 [INF] MALLOC:
2012-10-03 16:20:43.979702 mds.0 [INF] MALLOC: 609757
Spans in use
2012-10-03 16:20:43.979703 mds.0 [INF] MALLOC:395
Thread heaps in use
2012-10-03 16:20:43.979705 mds.0 [INF] MALLOC:   8192
Tcmalloc page size
2012-10-03 16:20:43.979710 mds.0 [INF]

2012-10-03 16:20:43.979716 mds.0 [INF] Call ReleaseFreeMemory() to
release freelist memory to the OS (via madvise()).
2012-10-03 16:20:43.979718 mds.0 [INF] Bytes released to the

It didn't print anything past the "Bytes released to the"...

Let me know if you need anything else.

t.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mds cache size configuration option being ignored

2012-10-03 Thread Gregory Farnum
On Wed, Oct 3, 2012 at 4:23 PM, Tren Blackburn  wrote:
> On Wed, Oct 3, 2012 at 4:15 PM, Gregory Farnum  wrote:
>> On Wed, Oct 3, 2012 at 3:22 PM, Tren Blackburn  wrote:
>>> Hi List;
>>>
>>> I was advised to use the "mds cache size" option to limit the memory
>>> that the mds process will take. I have it set to "32768". However it
>>> the ceph-mds process is now at 50GB and still growing.
>>>
>>> fern ceph # ps wwaux | grep ceph-mds
>>> root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
>>> /usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
>>> /etc/ceph/ceph.conf
>>>
>>> Have I specified the limit incorrectly? How far will it go?
>>
>> Oof. That looks correct; it sounds like we have a leak or some other
>> kind of bug. I believe you're on Gentoo; did you build with tcmalloc?
>> If so, can you run "ceph -w" in one window and then "ceph mds tell 0
>> heap stats" and send back the output?
>> If you didn't build with tcmalloc, can you do so and try again? We
>> have noticed fragmentation issues with the default memory allocator,
>> which is why we switched (though I can't imagine it'd balloon that far
>> — but tcmalloc will give us some better options to diagnose it). Sorry
>> I didn't mention this before!
>
> Hey Greg! Good recall, I am on Gentoo, and I did build with tcmalloc.

Search is a wonderful thing. ;)

> Here is the information you requested:
>
> 2012-10-03 16:20:43.979673 mds.0 [INF] mds.ferntcmalloc heap
> stats:
> 2012-10-03 16:20:43.979676 mds.0 [INF] MALLOC:53796808560 (51304.6
> MiB) Bytes in use by application
> 2012-10-03 16:20:43.979679 mds.0 [INF] MALLOC: +   753664 (0.7
> MiB) Bytes in page heap freelist
> 2012-10-03 16:20:43.979681 mds.0 [INF] MALLOC: + 93299048 (   89.0
> MiB) Bytes in central cache freelist
> 2012-10-03 16:20:43.979683 mds.0 [INF] MALLOC: +  6110720 (5.8
> MiB) Bytes in transfer cache freelist
> 2012-10-03 16:20:43.979685 mds.0 [INF] MALLOC: + 84547880 (   80.6
> MiB) Bytes in thread cache freelists
> 2012-10-03 16:20:43.979686 mds.0 [INF] MALLOC: + 84606976 (   80.7
> MiB) Bytes in malloc metadata
> 2012-10-03 16:20:43.979688 mds.0 [INF] MALLOC:   
> 2012-10-03 16:20:43.979690 mds.0 [INF] MALLOC: =  54066126848 (51561.5
> MiB) Actual memory used (physical + swap)
> 2012-10-03 16:20:43.979691 mds.0 [INF] MALLOC: +0 (0.0
> MiB) Bytes released to OS (aka unmapped)
> 2012-10-03 16:20:43.979693 mds.0 [INF] MALLOC:   
> 2012-10-03 16:20:43.979694 mds.0 [INF] MALLOC: =  54066126848 (51561.5
> MiB) Virtual address space used
> 2012-10-03 16:20:43.979700 mds.0 [INF] MALLOC:
> 2012-10-03 16:20:43.979702 mds.0 [INF] MALLOC: 609757
> Spans in use
> 2012-10-03 16:20:43.979703 mds.0 [INF] MALLOC:395
> Thread heaps in use
> 2012-10-03 16:20:43.979705 mds.0 [INF] MALLOC:   8192
> Tcmalloc page size
> 2012-10-03 16:20:43.979710 mds.0 [INF]

So tcmalloc thinks the MDS is actually using >50GB of RAM. ie, we have a leak.

Sage suggests we check out the perfcounters (specifically, how many
log segments are open). "ceph --admin-daemon <path to admin socket>
perfcounters_dump". I believe the default path is
/var/run/ceph/ceph-mds.a.asok.

If this doesn't provide us a clue, I'm afraid we're going to have to
start keeping track of heap usage with tcmalloc or run the daemon
through massif...
-Greg
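
For reference, a rough sketch of those two options (the daemon id and paths
are taken from the messages above; the exact flags should be double-checked
against your build):

 # tcmalloc heap profiling, using the same 'heap' command family as above
 ceph mds tell 0 heap start_profiler
 # ... let it run for a while ...
 ceph mds tell 0 heap dump
 ceph mds tell 0 heap stop_profiler

 # or massif: stop the mds and restart it in the foreground under valgrind
 valgrind --tool=massif /usr/bin/ceph-mds -f -i fern -c /etc/ceph/ceph.conf
 ms_print massif.out.<pid>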
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mds cache size configuration option being ignored

2012-10-03 Thread Tren Blackburn
On Wed, Oct 3, 2012 at 4:56 PM, Gregory Farnum  wrote:
> On Wed, Oct 3, 2012 at 4:23 PM, Tren Blackburn  wrote:
>> On Wed, Oct 3, 2012 at 4:15 PM, Gregory Farnum  wrote:
>>> On Wed, Oct 3, 2012 at 3:22 PM, Tren Blackburn  wrote:
 Hi List;

 I was advised to use the "mds cache size" option to limit the memory
 that the mds process will take. I have it set to "32768". However it
 the ceph-mds process is now at 50GB and still growing.

 fern ceph # ps wwaux | grep ceph-mds
 root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
 /usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
 /etc/ceph/ceph.conf

 Have I specified the limit incorrectly? How far will it go?
>>>
>>> Oof. That looks correct; it sounds like we have a leak or some other
>>> kind of bug. I believe you're on Gentoo; did you build with tcmalloc?
>>> If so, can you run "ceph -w" in one window and then "ceph mds tell 0
>>> heap stats" and send back the output?
>>> If you didn't build with tcmalloc, can you do so and try again? We
>>> have noticed fragmentation issues with the default memory allocator,
>>> which is why we switched (though I can't imagine it'd balloon that far
>>> — but tcmalloc will give us some better options to diagnose it). Sorry
>>> I didn't mention this before!
>>
>> Hey Greg! Good recall, I am on Gentoo, and I did build with tcmalloc.
>
> Search is a wonderful thing. ;)
>
>> Here is the information you requested:
>>
>> 2012-10-03 16:20:43.979673 mds.0 [INF] mds.ferntcmalloc heap
>> stats:
>> 2012-10-03 16:20:43.979676 mds.0 [INF] MALLOC:53796808560 (51304.6
>> MiB) Bytes in use by application
>> 2012-10-03 16:20:43.979679 mds.0 [INF] MALLOC: +   753664 (0.7
>> MiB) Bytes in page heap freelist
>> 2012-10-03 16:20:43.979681 mds.0 [INF] MALLOC: + 93299048 (   89.0
>> MiB) Bytes in central cache freelist
>> 2012-10-03 16:20:43.979683 mds.0 [INF] MALLOC: +  6110720 (5.8
>> MiB) Bytes in transfer cache freelist
>> 2012-10-03 16:20:43.979685 mds.0 [INF] MALLOC: + 84547880 (   80.6
>> MiB) Bytes in thread cache freelists
>> 2012-10-03 16:20:43.979686 mds.0 [INF] MALLOC: + 84606976 (   80.7
>> MiB) Bytes in malloc metadata
>> 2012-10-03 16:20:43.979688 mds.0 [INF] MALLOC:   
>> 2012-10-03 16:20:43.979690 mds.0 [INF] MALLOC: =  54066126848 (51561.5
>> MiB) Actual memory used (physical + swap)
>> 2012-10-03 16:20:43.979691 mds.0 [INF] MALLOC: +0 (0.0
>> MiB) Bytes released to OS (aka unmapped)
>> 2012-10-03 16:20:43.979693 mds.0 [INF] MALLOC:   
>> 2012-10-03 16:20:43.979694 mds.0 [INF] MALLOC: =  54066126848 (51561.5
>> MiB) Virtual address space used
>> 2012-10-03 16:20:43.979700 mds.0 [INF] MALLOC:
>> 2012-10-03 16:20:43.979702 mds.0 [INF] MALLOC: 609757
>> Spans in use
>> 2012-10-03 16:20:43.979703 mds.0 [INF] MALLOC:395
>> Thread heaps in use
>> 2012-10-03 16:20:43.979705 mds.0 [INF] MALLOC:   8192
>> Tcmalloc page size
>> 2012-10-03 16:20:43.979710 mds.0 [INF]
>
> So tcmalloc thinks the MDS is actually using >50GB of RAM. ie, we have a leak.
>
> Sage suggests we check out the perfcounters (specifically, how many
> log segments are open). "ceph --admin-daemon 
> perfcounters_dump" I believe the default path is
> /var/run/ceph/ceph-mds.a.asok.

Got it...

--- Start ---
fern ceph # ceph --admin-daemon /var/run/ceph/ceph-mds.fern.asok
perfcounters_dump
{"mds":{"req":0,"reply":48446606,"replyl":{"avgcount":48446606,"sum":28781.3},"fw":0,"dir_f":1238738,"dir_c":1709578,"dir_sp":0,"dir_ffc":0,"imax":32768,"i":9236006,"itop":421,"ibot":2,"iptail":9235583,"ipin":9236004,"iex":20572348,"icap":9235995,"cap":9235995,"dis":0,"t":60401624,"thit":43843666,"tfw":0,"tdis":0,"tdirf":1235679,"trino":0,"tlock":0,"l":347,"q":0,"popanyd":0,"popnest":0,"sm":2,"ex":0,"iexp":0,"im":0,"iim":0},"mds_log":{"evadd":41768893,"evex":41734641,"evtrm":41734641,"ev":34252,"evexg":0,"evexd":1158,"segadd":44958,"segex":44928,"segtrm":44928,"seg":31,"segexg":0,"segexd":1,"expos":188437496802,"wrpos":188567160172,"rdpos":0,"jlat":0},"mds_mem":{"ino":9236008,"ino+":20540696,"ino-":11304688,"dir":1219715,"dir+":2806911,"dir-":1587196,"dn":9236006,"dn+":29809444,"dn-":20573438,"cap":9235995,"cap+":20077556,"cap-":10841561,"rss":52843824,"heap":10792,"malloc":-1925579,"buf":0},"mds_server":{"hcreq":48446606,"hsreq":0,"hcsess":0,"dcreq":51199273,"dsreq":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":6842412,"op_send_bytes":0,"op_resend":216654,"op_ack":1238738,"op_commit":5387021,"op":6625759,"op_r":1238738,"op_w":5387021,"op_rmw":0,"op_pg":0,"osdop_stat":0,"osdop_create":0,"osdop_read":0,"osdop_write":3542566,"osdop_writefull":43897,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":90980,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":3419156,"osdop_cmpxattr":0,"osdop_rmxattr

Re: mds cache size configuration option being ignored

2012-10-03 Thread Gregory Farnum
On Wed, Oct 3, 2012 at 4:59 PM, Tren Blackburn  wrote:
> On Wed, Oct 3, 2012 at 4:56 PM, Gregory Farnum  wrote:
>> On Wed, Oct 3, 2012 at 4:23 PM, Tren Blackburn  wrote:
>>> On Wed, Oct 3, 2012 at 4:15 PM, Gregory Farnum  wrote:
 On Wed, Oct 3, 2012 at 3:22 PM, Tren Blackburn  
 wrote:
> Hi List;
>
> I was advised to use the "mds cache size" option to limit the memory
> that the mds process will take. I have it set to "32768". However it
> the ceph-mds process is now at 50GB and still growing.
>
> fern ceph # ps wwaux | grep ceph-mds
> root   895  4.3 26.6 53269304 52725820 ?   Ssl  Sep28 312:29
> /usr/bin/ceph-mds -i fern --pid-file /var/run/ceph/mds.fern.pid -c
> /etc/ceph/ceph.conf
>
> Have I specified the limit incorrectly? How far will it go?

 Oof. That looks correct; it sounds like we have a leak or some other
 kind of bug. I believe you're on Gentoo; did you build with tcmalloc?
 If so, can you run "ceph -w" in one window and then "ceph mds tell 0
 heap stats" and send back the output?
 If you didn't build with tcmalloc, can you do so and try again? We
 have noticed fragmentation issues with the default memory allocator,
 which is why we switched (though I can't imagine it'd balloon that far
 — but tcmalloc will give us some better options to diagnose it). Sorry
 I didn't mention this before!
>>>
>>> Hey Greg! Good recall, I am on Gentoo, and I did build with tcmalloc.
>>
>> Search is a wonderful thing. ;)
>>
>>> Here is the information you requested:
>>>
>>> 2012-10-03 16:20:43.979673 mds.0 [INF] mds.ferntcmalloc heap
>>> stats:
>>> 2012-10-03 16:20:43.979676 mds.0 [INF] MALLOC:53796808560 (51304.6
>>> MiB) Bytes in use by application
>>> 2012-10-03 16:20:43.979679 mds.0 [INF] MALLOC: +   753664 (0.7
>>> MiB) Bytes in page heap freelist
>>> 2012-10-03 16:20:43.979681 mds.0 [INF] MALLOC: + 93299048 (   89.0
>>> MiB) Bytes in central cache freelist
>>> 2012-10-03 16:20:43.979683 mds.0 [INF] MALLOC: +  6110720 (5.8
>>> MiB) Bytes in transfer cache freelist
>>> 2012-10-03 16:20:43.979685 mds.0 [INF] MALLOC: + 84547880 (   80.6
>>> MiB) Bytes in thread cache freelists
>>> 2012-10-03 16:20:43.979686 mds.0 [INF] MALLOC: + 84606976 (   80.7
>>> MiB) Bytes in malloc metadata
>>> 2012-10-03 16:20:43.979688 mds.0 [INF] MALLOC:   
>>> 2012-10-03 16:20:43.979690 mds.0 [INF] MALLOC: =  54066126848 (51561.5
>>> MiB) Actual memory used (physical + swap)
>>> 2012-10-03 16:20:43.979691 mds.0 [INF] MALLOC: +0 (0.0
>>> MiB) Bytes released to OS (aka unmapped)
>>> 2012-10-03 16:20:43.979693 mds.0 [INF] MALLOC:   
>>> 2012-10-03 16:20:43.979694 mds.0 [INF] MALLOC: =  54066126848 (51561.5
>>> MiB) Virtual address space used
>>> 2012-10-03 16:20:43.979700 mds.0 [INF] MALLOC:
>>> 2012-10-03 16:20:43.979702 mds.0 [INF] MALLOC: 609757
>>> Spans in use
>>> 2012-10-03 16:20:43.979703 mds.0 [INF] MALLOC:395
>>> Thread heaps in use
>>> 2012-10-03 16:20:43.979705 mds.0 [INF] MALLOC:   8192
>>> Tcmalloc page size
>>> 2012-10-03 16:20:43.979710 mds.0 [INF]
>>
>> So tcmalloc thinks the MDS is actually using >50GB of RAM. ie, we have a 
>> leak.
>>
>> Sage suggests we check out the perfcounters (specifically, how many
>> log segments are open). "ceph --admin-daemon 
>> perfcounters_dump" I believe the default path is
>> /var/run/ceph/ceph-mds.a.asok.
>
> Got it...
>
> --- Start ---
> fern ceph # ceph --admin-daemon /var/run/ceph/ceph-mds.fern.asok
> perfcounters_dump
> {"mds":{"req":0,"reply":48446606,"replyl":{"avgcount":48446606,"sum":28781.3},"fw":0,"dir_f":1238738,"dir_c":1709578,"dir_sp":0,"dir_ffc":0,"imax":32768,"i":9236006,"itop":421,"ibot":2,"iptail":9235583,"ipin":9236004,"iex":20572348,"icap":9235995,"cap":9235995,"dis":0,"t":60401624,"thit":43843666,"tfw":0,"tdis":0,"tdirf":1235679,"trino":0,"tlock":0,"l":347,"q":0,"popanyd":0,"popnest":0,"sm":2,"ex":0,"iexp":0,"im":0,"iim":0},"mds_log":{"evadd":41768893,"evex":41734641,"evtrm":41734641,"ev":34252,"evexg":0,"evexd":1158,"segadd":44958,"segex":44928,"segtrm":44928,"seg":31,"segexg":0,"segexd":1,"expos":188437496802,"wrpos":188567160172,"rdpos":0,"jlat":0},"mds_mem":{"ino":9236008,"ino+":20540696,"ino-":11304688,"dir":1219715,"dir+":2806911,"dir-":1587196,"dn":9236006,"dn+":29809444,"dn-":20573438,"cap":9235995,"cap+":20077556,"cap-":10841561,"rss":52843824,"heap":10792,"malloc":-1925579,"buf":0},"mds_server":{"hcreq":48446606,"hsreq":0,"hcsess":0,"dcreq":51199273,"dsreq":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":6842412,"op_send_bytes":0,"op_resend":216654,"op_ack":1238738,"op_commit":5387021,"op":6625759,"op_r":1238738,"op_w":5387021,"op_rmw":0,"op_pg":0,"osdop_stat":0,"osdop_create":0,"osdop_read":0,"osdop_write":3542566,"osdop_writefull":43897,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_del

Re: mds stuck in clientreplay state after failover

2012-10-03 Thread Gregory Farnum
On Tue, Sep 25, 2012 at 4:55 PM, Tren Blackburn  wrote:
> On Tue, Sep 25, 2012 at 2:15 PM, Gregory Farnum  wrote:
>> Hi Tren,
>> Sorry your last message got dropped — we've all been really busy!
>>
>
> No worries! I know you guys are busy, and I appreciate any assistance
> you're able to provide.
>
>> On Tue, Sep 25, 2012 at 10:22 AM, Tren Blackburn  
>> wrote:
>> 
>>
>>> All ceph servers are running ceph-0.51. Here is the output of ceph -s:
>>>
>>> ocr31-ire ~ # ceph -s
>>>health HEALTH_OK
>>>monmap e1: 3 mons at
>>> {fern=10.87.1.88:6789/0,ocr46=10.87.1.104:6789/0,sap=10.87.1.87:6789/0},
>>> election epoch 92, quorum 0,1,2 fern,ocr46,sap
>>>osdmap e60: 192 osds: 192 up, 192 in
>>> pgmap v47728: 73728 pgs: 73728 active+clean; 290 GB data, 2794 GB
>>> used, 283 TB / 286 TB avail
>>>mdsmap e19: 1/1/1 up {0=fern=up:clientreplay}, 2 up:standby
>>
>> Okay, so all your OSDs are up and all your PGs are active+clean, which
>> means it's definitely a problem in the MDS
>>
>>
>>> Here are the logs from sap, which was the mds master before it was
>>> told to respawn:
>>>
>>> 2012-09-25 04:45:32.588374 7f2a34a40700  0 mds.0.3 ms_handle_reset on
>>> 10.87.1.100:6832/32257
>>> 2012-09-25 04:45:32.589064 7f2a34a40700  0 mds.0.3 ms_handle_connect
>>> on 10.87.1.100:6832/32257
>>> 2012-09-25 04:45:57.787416 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 04:55:21.718357 7f2a34a40700  0 mds.0.3 ms_handle_reset on
>>> 10.87.1.89:6815/8101
>>> 2012-09-25 04:55:21.719044 7f2a34a40700  0 mds.0.3 ms_handle_connect
>>> on 10.87.1.89:6815/8101
>>> 2012-09-25 04:55:26.758359 7f2a34a40700  0 mds.0.3 ms_handle_reset on
>>> 10.87.1.96:6800/6628
>>> 2012-09-25 04:55:26.759415 7f2a34a40700  0 mds.0.3 ms_handle_connect
>>> on 10.87.1.96:6800/6628
>>> 2012-09-25 05:13:16.367476 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:18:45.177585 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:19:44.911831 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:30:38.178449 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:37:26.597832 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:37:34.088781 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:38:37.548132 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:39:21.528884 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:39:57.791457 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:40:13.579926 7f2a34a40700  0 mds.0.3 handle_mds_beacon
>>> no longer laggy
>>> 2012-09-25 05:41:07.598457 7f2a2f709700  0 -- 10.87.1.87:6801/27351 >>
>>> 10.87.1.89:0/30567 pipe(0x1db2fc0 sd=100 pgs=2 cs=1 l=0).fault with
>>> nothing to send, going to standby
>>> 2012-09-25 05:41:07.603802 7f2a30517700  0 -- 10.87.1.87:6801/27351 >>
>>> 10.87.1.94:0/9358 pipe(0x1db2b40 sd=99 pgs=2 cs=1 l=0).fault with
>>> nothing to send, going to standby
>>> 2012-09-25 05:41:07.969148 7f2a34a40700  1 mds.-1.-1 handle_mds_map i
>>> (10.87.1.87:6801/27351) dne in the mdsmap, respawning myself
>>> 2012-09-25 05:41:07.969154 7f2a34a40700  1 mds.-1.-1 respawn
>>> 2012-09-25 05:41:07.969155 7f2a34a40700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
>>> 2012-09-25 05:41:07.969157 7f2a34a40700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
>>> 2012-09-25 05:41:07.969159 7f2a34a40700  1 mds.-1.-1  1: '-i'
>>> 2012-09-25 05:41:07.969160 7f2a34a40700  1 mds.-1.-1  2: 'sap'
>>> 2012-09-25 05:41:07.969161 7f2a34a40700  1 mds.-1.-1  3: '--pid-file'
>>> 2012-09-25 05:41:07.969162 7f2a34a40700  1 mds.-1.-1  4:
>>> '/var/run/ceph/mds.sap.pid'
>>> 2012-09-25 05:41:07.969163 7f2a34a40700  1 mds.-1.-1  5: '-c'
>>> 2012-09-25 05:41:07.969164 7f2a34a40700  1 mds.-1.-1  6: 
>>> '/etc/ceph/ceph.conf'
>>> 2012-09-25 05:41:07.969165 7f2a34a40700  1 mds.-1.-1  cwd /
>>> 2012-09-25 05:41:08.003262 7fae819af780  0 ceph version 0.51
>>> (commit:c03ca95d235c9a072dcd8a77ad5274a52e93ae30), process ceph-mds,
>>> pid 1173
>>> 2012-09-25 05:41:08.005237 7fae7c9bd700  0 mds.-1.0 ms_handle_connect
>>> on 10.87.1.88:6789/0
>>> 2012-09-25 05:41:08.802610 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:09.602141 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:23.772891 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:25.273745 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:27.994344 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:28.588681 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>> 2012-09-25 05:41:42.282588 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
>>>
>>> Why did sap get marked as down in the mdsmap?
>>
>> That's hard to guess about the precise reasons without more logging
>> enabled, but notice all of the "handle_mds_beacon no longer laggy".
>> Those indicat

mds heartbeat information

2012-10-03 Thread Gregory Farnum
I was asked to send this to the list. At some point it will be
properly documented, but for the moment...

The mds heartbeat controls are mds_beacon_interval and
mds_beacon_grace. (So named because the beacon is used for a little
bit more than heartbeating; but it also serves heartbeat purposes and
its other duties aren't harmed by changing these settings.) A
heartbeat is sent every mds_beacon_interval seconds, and the monitor
will mark an MDS laggy if it doesn't receive a beacon for
mds_beacon_grace seconds.

If you change mds_beacon_grace you should also change
mds_session_timeout and mds_reconnect_timeout to match, so that
mds_reconnect_timeout = (mds_session_timeout - mds_beacon_grace).
These tunables specify how long a recovering MDS waits before it
declares clients as having failed, which is why they need to be
adjusted as well.
-Greg
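
As a worked example (the default values quoted here are from the 0.5x era and
should be checked against your own build), doubling the grace period in
ceph.conf might look like:

[mds]
    mds beacon interval = 4      ; unchanged default
    mds beacon grace = 30        ; raised from the default of 15
    mds session timeout = 75     ; raised so that session - grace stays at 45
    mds reconnect timeout = 45   ; = mds session timeout - mds beacon grace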
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-03 Thread Laszlo Boszormenyi (GCS)
On Wed, 2012-10-03 at 12:39 -0700, Sage Weil wrote:
> On Wed, 3 Oct 2012, Noah Watkins wrote:
> > I wanted to touch base on this Java bindings patch series to make sure
> > this can become a solid foundation for the Hadoop shim clean-up. Were
> > there any specific issues with this, other than not yet having a major
> > consumer?
> 
> From my perspective it's ready.  I was hoping that Laszlo could look at 
> the packaging bits before I merged, but then I lost track of it.
> 
> Laszlo, do you have a minute to take a look?
 I was too busy with IRL and stuff. But sounds interesting. Will check
this in the afternoon and report back.

Laszlo/GCS

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html