Re: leveldb compaction overhead

2013-06-05 Thread Jim Schutt
Hi Sage, On 05/31/2013 06:00 PM, Sage Weil wrote: On Fri, 31 May 2013, Jim Schutt wrote: Hi Sage, On 05/29/2013 03:07 PM, Sage Weil wrote: Hi all- I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to make the leveldb compaction on the monitor less expensive by doing

Re: leveldb compaction overhead

2013-06-05 Thread Jim Schutt
On 06/05/2013 01:05 PM, Mark Nelson wrote: FWIW, I've been fighting with some mon/leveldb issues on a 24-node test cluster causing high CPU utilization, constant reads, laggy osdmap updates, and mons dropping out of quorum. Work is going on in wip-mon. Should have some more testing done

Re: pg balancing

2013-06-05 Thread Jim Schutt
Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: Hi Jim- You mentioned the other day your concerns about the uniformity of the PG and data distribution. There are several ways to attack it (including increasing the number of PGs), but one that we haven't tested much yet is the

Re: leveldb compaction overhead

2013-05-31 Thread Jim Schutt
Hi Sage, On 05/29/2013 03:07 PM, Sage Weil wrote: Hi all- I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to make the leveldb compaction on the monitor less expensive by doing it in an async thread and compacting only the trimmed range. If anyone who is

RE: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Jim Schutt
On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: How responsive generally is the machine under load? Is there available CPU? The machine works well, and the affected OSDs are likely the same ones, seemingly because they have relatively slower disks (disk type is the same, but the latency is a bit higher

Re: [ceph-users] OSD state flipping when cluster-network in high utilization

2013-05-15 Thread Jim Schutt
where a mon drops out of quorum, then comes back in on the next election, I've found that to be a sign that my mons are too busy. -- Jim Sent from my iPhone On 2013-5-15, at 23:07, Jim Schutt jasc...@sandia.gov wrote: On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: How responsive generally is the machine

[PATCH v2 0/3] ceph: fix might_sleep while atomic

2013-05-15 Thread Jim Schutt
/include/ceph_fs.h. Jim Schutt (3): ceph: fix up comment for ceph_count_locks() as to which lock to hold ceph: add missing cpu_to_le32() calls when encoding a reconnect capability ceph: ceph_pagelist_append might sleep while atomic fs/ceph/locks.c | 75

[PATCH v2 1/3] ceph: fix up comment for ceph_count_locks() as to which lock to hold

2013-05-15 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- fs/ceph/locks.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..ffc86cb 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -169,7 +169,7 @@ int ceph_flock(struct file

[PATCH v2 2/3] ceph: add missing cpu_to_le32() calls when encoding a reconnect capability

2013-05-15 Thread Jim Schutt
and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Signed-off-by: Jim

[PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic

2013-05-15 Thread Jim Schutt
success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt jasc...@sandia.gov
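
The fix summarized above (encode the lock records into a plain buffer while the lock is held, then append the finished buffer to the ceph_pagelist where sleeping is allowed) can be sketched roughly as follows. This is an illustration of the approach only, not the actual fs/ceph/locks.c change: the lock used and the count_locks()/encode_locks_to_buffer() helpers are hypothetical stand-ins for the real encoding code.

```c
/*
 * Rough sketch only -- not the real patch.  Encode into a kmalloc'd
 * buffer under the (illustrative) spinlock, where sleeping is forbidden,
 * and only then hand the finished buffer to ceph_pagelist_append(),
 * which may kmap() and allocate, i.e. sleep.
 */
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/ceph/ceph_fs.h>
#include <linux/ceph/pagelist.h>

int count_locks(struct inode *inode);                       /* hypothetical */
void encode_locks_to_buffer(struct inode *inode,
			    struct ceph_filelock *flocks);  /* hypothetical */

static int encode_locks(struct inode *inode, struct ceph_pagelist *pagelist)
{
	struct ceph_filelock *flocks;
	int num_locks, err;

retry:
	spin_lock(&inode->i_lock);
	num_locks = count_locks(inode);
	spin_unlock(&inode->i_lock);

	flocks = kmalloc(num_locks * sizeof(*flocks), GFP_NOFS);
	if (!flocks)
		return -ENOMEM;

	spin_lock(&inode->i_lock);
	if (count_locks(inode) != num_locks) {	/* count changed; redo */
		spin_unlock(&inode->i_lock);
		kfree(flocks);
		goto retry;
	}
	encode_locks_to_buffer(inode, flocks);
	spin_unlock(&inode->i_lock);

	/* Safe to sleep from here on. */
	err = ceph_pagelist_append(pagelist, flocks,
				   num_locks * sizeof(*flocks));
	kfree(flocks);
	return err;
}
```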

Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic

2013-05-15 Thread Jim Schutt
On 05/15/2013 10:49 AM, Alex Elder wrote: On 05/15/2013 11:38 AM, Jim Schutt wrote: Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result

Re: pg balancing

2013-05-14 Thread Jim Schutt
[resent to list because I missed that Cc:] Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: Hi Jim- You mentioned the other day your concerns about the uniformity of the PG and data distribution. There are several ways to attack it (including increasing the number of PGs), but one that

Re: [PATCH] libceph: ceph_pagelist_append might sleep while atomic

2013-05-14 Thread Jim Schutt
On 05/14/2013 10:44 AM, Alex Elder wrote: On 05/09/2013 09:42 AM, Jim Schutt wrote: Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: I finally

[PATCH] libceph: ceph_pagelist_append might sleep while atomic

2013-05-09 Thread Jim Schutt
success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt jasc...@sandia.gov

Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values

2013-04-12 Thread Jim Schutt
Hi Greg, On 04/10/2013 06:39 PM, Gregory Farnum wrote: Jim, I took this patch as a base for setting up config options which people can tune manually and have pushed those changes to wip-leveldb-config. I was out of the office unexpectedly for a few days, so I'm just now taking a look.

[PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values

2013-04-05 Thread Jim Schutt
at this block size. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/common/config_opts.h |4 src/os/LevelDBStore.cc |9 + src/os/LevelDBStore.h|3 +++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/src/common/config_opts.h b/src/common/config_opts.h index
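
For readers wondering what "data blocking options" means concretely, the sketch below sets the same kind of knobs (block size, write buffer size, block cache) through LevelDB's C binding (leveldb/c.h) rather than the C++ leveldb::Options that Ceph's LevelDBStore patch actually touches; the numeric values are arbitrary examples, not the values proposed in the patch.

```c
/* Standalone illustration of tuning LevelDB's data blocking options via
 * the C binding.  The values below are examples only. */
#include <stdio.h>
#include <leveldb/c.h>

int main(void)
{
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();

    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_options_set_block_size(opts, 64 * 1024);              /* example */
    leveldb_options_set_write_buffer_size(opts, 32 * 1024 * 1024); /* example */
    leveldb_options_set_cache(opts, leveldb_cache_create_lru(256 * 1024 * 1024));

    leveldb_t *db = leveldb_open(opts, "/tmp/example-leveldb", &err);
    if (err) {
        fprintf(stderr, "open failed: %s\n", err);
        leveldb_free(err);
        return 1;
    }
    leveldb_close(db);
    return 0;
}
```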

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-04 Thread Jim Schutt
On 04/03/2013 04:51 PM, Gregory Farnum wrote: On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt jasc...@sandia.gov wrote: On 04/03/2013 12:25 PM, Sage Weil wrote: Sorry, guess I forgot some of the history since this piece at least is resolved now. I'm surprised if 30-second timeouts are causing

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-04 Thread Jim Schutt
On 04/04/2013 08:15 AM, Jim Schutt wrote: On 04/03/2013 04:51 PM, Gregory Farnum wrote: On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt jasc...@sandia.gov wrote: On 04/03/2013 12:25 PM, Sage Weil wrote: Sorry, guess I forgot some of the history since this piece at least is resolved now. I'm

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-04 Thread Jim Schutt
On 04/03/2013 04:40 PM, Jim Schutt wrote: On 04/03/2013 12:25 PM, Sage Weil wrote: Sorry, guess I forgot some of the history since this piece at least is resolved now. I'm surprised if 30-second timeouts are causing issues without those overloads you were seeing; have you

[PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values

2013-04-04 Thread Jim Schutt
. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/os/LevelDBStore.cc |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc index 3d94096..1b6ae7d 100644 --- a/src/os/LevelDBStore.cc +++ b/src/os/LevelDBStore.cc @@ -16,6 +16,9

Trouble getting a new file system to start, for v0.59 and newer

2013-04-03 Thread Jim Schutt
Hi Joao, I alluded in an earlier thread about an issue I've been recently having with starting a new filesystem, which I thought I had tracked into the paxos subsystem. I believe I started having this trouble when I started testing v0.59, and it's still there in v0.60. The basic configuration

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-03 Thread Jim Schutt
Hi Sage, On 04/03/2013 09:58 AM, Sage Weil wrote: Hi Jim, What happens if you change 'osd mon ack timeout = 300' (from the default of 30)? I suspect part of the problem is that the mons are just slow enough that the osds resend the same thing again and it snowballs into more work for

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-03 Thread Jim Schutt
On 04/03/2013 11:49 AM, Gregory Farnum wrote: On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum g...@inktank.com wrote: On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt jasc...@sandia.gov wrote: Hi Sage, On 04/03/2013 09:58 AM, Sage Weil wrote: Hi Jim, What happens if you change 'osd mon ack

Re: Trouble getting a new file system to start, for v0.59 and newer

2013-04-03 Thread Jim Schutt
On 04/03/2013 12:25 PM, Sage Weil wrote: Sorry, guess I forgot some of the history since this piece at least is resolved now. I'm surprised if 30-second timeouts are causing issues without those overloads you were seeing; have you seen this issue without your high debugging levels and

Re: Trouble with paxos service for large PG count

2013-04-02 Thread Jim Schutt
On 04/02/2013 09:42 AM, Joao Eduardo Luis wrote: On 04/01/2013 10:14 PM, Jim Schutt wrote: Hi, I've been having trouble starting a new file system created using the current next branch (most recently, commit 3b5f663f11). I believe the trouble is related to how long it takes paxos

Re: Trouble with paxos service for large PG count

2013-04-02 Thread Jim Schutt
On 04/02/2013 12:28 PM, Joao Luis wrote: Right. I'll push a patch to bump that sort of output to 30 when I get home. Thanks - but FWIW, I don't think it's the root cause of my issue -- more below If you're willing, try reducing the paxos debug level to 0 and let us know if those delays

Re: Trouble with paxos service for large PG count

2013-04-02 Thread Jim Schutt
On 04/02/2013 01:16 PM, Jim Schutt wrote: On 04/02/2013 12:28 PM, Joao Luis wrote: Right. I'll push a patch to bump that sort of output to 30 when I get home. Thanks - but FWIW, I don't think it's the root cause of my issue -- more below OK, I see now that you're talking about

Trouble with paxos service for large PG count

2013-04-01 Thread Jim Schutt
Hi, I've been having trouble starting a new file system created using the current next branch (most recently, commit 3b5f663f11). I believe the trouble is related to how long it takes paxos to process a pgmap proposal. For a configuration with 1 mon, 1 mds, and 576 osds, using pg_bits = 3 and

Re: CephFS Space Accounting and Quotas

2013-03-18 Thread Jim Schutt
On 03/15/2013 05:17 PM, Greg Farnum wrote: [Putting list back on cc] On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote: On 03/15/2013 04:23 PM, Greg Farnum wrote: As I come back and look at these again, I'm not sure what the context for these logs is. Which test did they come from

Re: CephFS Space Accounting and Quotas

2013-03-12 Thread Jim Schutt
On 03/11/2013 02:40 PM, Jim Schutt wrote: If you want I can attempt to duplicate my memory of the first test I reported, writing the files today and doing the strace tomorrow (with timestamps, this time). Also, would it be helpful to write the files with minimal logging, in hopes

Re: CephFS Space Accounting and Quotas

2013-03-11 Thread Jim Schutt
On 03/08/2013 07:05 PM, Greg Farnum wrote: On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: On 03/07/2013 08:15 AM, Jim Schutt wrote: On 03/06/2013 05:18 PM, Greg Farnum wrote: On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: [snip] Do you want the MDS log at 10

Re: CephFS Space Accounting and Quotas

2013-03-11 Thread Jim Schutt
On 03/11/2013 09:48 AM, Greg Farnum wrote: On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: On 03/08/2013 07:05 PM, Greg Farnum wrote: On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: On 03/07/2013 08:15 AM, Jim Schutt wrote: On 03/06/2013 05:18 PM, Greg Farnum wrote

Re: Estimating OSD memory requirements (was Re: stuff for v0.56.4)

2013-03-11 Thread Jim Schutt
Hi Bryan, On 03/11/2013 09:10 AM, Bryan K. Wright wrote: s...@inktank.com said: On Thu, 7 Mar 2013, Bryan K. Wright wrote: s...@inktank.com said: - pg log trimming (probably a conservative subset) to avoid memory bloat Anything that reduces the size of OSD processes would be appreciated.

Re: CephFS Space Accounting and Quotas

2013-03-11 Thread Jim Schutt
On 03/11/2013 10:57 AM, Greg Farnum wrote: On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote: On 03/11/2013 09:48 AM, Greg Farnum wrote: On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: For this run, the MDS logging slowed it down enough to cause the client caps to occasionally

Re: CephFS Space Accounting and Quotas

2013-03-08 Thread Jim Schutt
On 03/07/2013 08:15 AM, Jim Schutt wrote: On 03/06/2013 05:18 PM, Greg Farnum wrote: On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: [snip] Do you want the MDS log at 10 or 20? More is better. ;) OK, thanks. I've sent some mds logs via private email... -- Jim

Re: CephFS Space Accounting and Quotas

2013-03-07 Thread Jim Schutt
On 03/06/2013 05:18 PM, Greg Farnum wrote: On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: When I'm doing these stat operations the file system is otherwise idle. What's the cluster look like? This is just one active MDS and a couple hundred clients? 1 mds, 1 mon, 576 osds, 198

Re: CephFS First product release discussion

2013-03-06 Thread Jim Schutt
On 03/05/2013 12:33 PM, Sage Weil wrote: Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic with adding and removing users all
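
The recursive accounting Sage refers to is exposed to clients as virtual extended attributes on directories. Assuming a mounted CephFS tree and the ceph.dir.rbytes virtual xattr (an assumption for illustration; the xattr name is not part of the message above), a client can read a directory's recursive size without walking it:

```c
/* Minimal sketch: read CephFS's recursive-size xattr for a directory.
 * The mount point and the ceph.dir.rbytes xattr name are assumptions
 * for illustration. */
#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    char buf[64];
    const char *path = argc > 1 ? argv[1] : "/mnt/ceph/somedir";
    ssize_t n = getxattr(path, "ceph.dir.rbytes", buf, sizeof(buf) - 1);

    if (n < 0) {
        perror("getxattr");
        return 1;
    }
    buf[n] = '\0';
    printf("%s: %s bytes (recursive)\n", path, buf);
    return 0;
}
```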

Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Jim Schutt
On 03/06/2013 12:13 PM, Greg Farnum wrote: On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote: On 03/05/2013 12:33 PM, Sage Weil wrote: Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al

Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Jim Schutt
On 03/06/2013 01:21 PM, Greg Farnum wrote: Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-28 Thread Jim Schutt
Hi Sage, On 02/26/2013 12:36 PM, Sage Weil wrote: On Tue, 26 Feb 2013, Jim Schutt wrote: I think the right solution is to make an option that will setsockopt on SO_RCVBUF to some value (say, 256KB). I pushed a branch that does this, wip-tcp. Do you mind checking to see if this addresses
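
The knob being discussed is the standard SO_RCVBUF socket option. A minimal userspace sketch of pinning the receive buffer to 256KB (the value suggested in the message) looks like this; it is plain socket code, not the actual wip-tcp messenger change.

```c
/* Generic illustration of capping a socket receive buffer. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int set_rcvbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    /* The kernel typically doubles the requested value for its own
     * bookkeeping; read it back to see what was actually granted. */
    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        printf("SO_RCVBUF: requested %d, kernel reports %d\n", bytes, actual);
    return 0;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    int ret = set_rcvbuf(fd, 256 * 1024);
    close(fd);
    return ret ? 1 : 0;
}
```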

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-26 Thread Jim Schutt
Hi Sage, On 02/20/2013 05:12 PM, Sage Weil wrote: Hi Jim, I'm resurrecting an ancient thread here, but: we've just observed this on another big cluster and remembered that this hasn't actually been fixed. Sorry for the delayed reply - I missed this in a backlog of unread email... I

Re: Slow request in XFS

2013-01-31 Thread Jim Schutt
On 01/31/2013 05:43 AM, Sage Weil wrote: Hi- Can you reproduce this with logs? It looks like there are a few ops that are hanging for a very long time, but there isn't enough information here except to point to osds 610, 612, 615, and 68... FWIW, I have a small pile of disks with bad

Re: Slow request in XFS

2013-01-31 Thread Jim Schutt
On 01/31/2013 01:00 PM, Sage Weil wrote: On Thu, 31 Jan 2013, Jim Schutt wrote: On 01/31/2013 05:43 AM, Sage Weil wrote: Hi- Can you reproduce this with logs? It looks like there are a few ops that are hanging for a very long time, but there isn't enough information here except to point

Re: [PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

2013-01-16 Thread Jim Schutt
Hi Sage, On 01/15/2013 07:55 PM, Sage Weil wrote: Hi Jim- I just realized this didn't make it into our tree. It's now in testing, and will get merged in the next window. D'oh! That's great news - thanks for the update. -- Jim sage -- To unsubscribe from this list: send the line

OSDMonitor: don't allow creation of pools with > 65535 pgs

2012-12-14 Thread Jim Schutt
Hi, I'm looking at commit e3ed28eb2 in the next branch, and I have a question. Shouldn't the limit be pg_num 65536, because PGs are numbered 0 thru pg_num-1? If not, what am I missing? FWIW, up through yesterday I've been using the next branch and this: ceph osd pool set data pg_num 65536
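
The arithmetic behind the question: PG ids run from 0 to pg_num-1, so pg_num = 65536 still keeps the highest id at 65535, which fits in 16 bits. A trivial standalone check of that reasoning (illustrative only, not the OSDMonitor code):

```c
/* Illustrative only -- not the OSDMonitor check itself.  With PG ids
 * numbered 0 .. pg_num-1, pg_num = 65536 still has a maximum id of
 * 65535; the question above is about where the limit should be drawn,
 * not about the id width. */
#include <stdio.h>

int main(void)
{
    for (unsigned pg_num = 65535; pg_num <= 65537; pg_num++) {
        unsigned max_id = pg_num - 1;
        printf("pg_num=%-6u highest PG id=%-6u fits in 16 bits: %s\n",
               pg_num, max_id, max_id <= 0xffff ? "yes" : "no");
    }
    return 0;
}
```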

Re: [EXTERNAL] Re: OSDMonitor: don't allow creation of pools with > 65535 pgs

2012-12-14 Thread Jim Schutt
On 12/14/2012 09:59 AM, Joao Eduardo Luis wrote: On 12/14/2012 03:41 PM, Jim Schutt wrote: Hi, I'm looking at commit e3ed28eb2 in the next branch, and I have a question. Shouldn't the limit be pg_num 65536, because PGs are numbered 0 thru pg_num-1? If not, what am I missing? FWIW, up

Re: 3.7.0-rc8 btrfs locking issue

2012-12-12 Thread Jim Schutt
On 12/11/2012 06:37 PM, Liu Bo wrote: On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote: On 12/09/2012 07:04 AM, Liu Bo wrote: On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: Hi Jim, Could you please apply the following patch to test if it works? Hi, So far

Re: 3.7.0-rc8 btrfs locking issue

2012-12-11 Thread Jim Schutt
On 12/09/2012 07:04 AM, Liu Bo wrote: On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported

Re: 3.7.0-rc8 btrfs locking issue

2012-12-07 Thread Jim Schutt
On 12/05/2012 09:07 AM, Jim Schutt wrote: Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html

3.7.0-rc8 btrfs locking issue

2012-12-05 Thread Jim Schutt
Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be

[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

2012-11-30 Thread Jim Schutt
be relatively rare. Signed-off-by: Jim Schutt jasc...@sandia.gov --- include/linux/ceph/ceph_features.h |4 +++- include/linux/crush/crush.h|2 ++ net/ceph/crush/mapper.c| 13 ++--- net/ceph/osdmap.c |6 ++ 4 files changed, 21 insertions

Re: chooseleaf_descend_once

2012-11-28 Thread Jim Schutt
On 11/28/2012 09:11 AM, Caleb Miles wrote: Hey Jim, Running the third test with tunable chooseleaf_descend_once 0 with no devices marked out yields the following result (999.827397, 0.48667056652539997) so chi squared value is 999 with a corresponding p value of 0.487 so that the

Re: chooseleaf_descend_once

2012-11-27 Thread Jim Schutt
Hi Caleb, On 11/26/2012 07:28 PM, caleb miles wrote: Hello all, Here's what I've done to try and validate the new chooseleaf_descend_once tunable first described in commit f1a53c5e80a48557e63db9c52b83f39391bc69b8 in the wip-crush branch of ceph.git. First I set the new tunable to its

Re: [PATCH] PG: Do not discard op data too early

2012-10-26 Thread Jim Schutt
On 10/26/2012 02:52 PM, Gregory Farnum wrote: Wanted to touch base on this patch again. If Sage and Sam agree that we don't want to play any tricks with memory accounting, we should pull this patch in. I'm pretty sure we want it for Bobtail! I've been running with it since I posted it. I think

[PATCH] FileJournal: correctly check return value of lseek in write_fd

2012-09-27 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/os/FileJournal.cc |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc index d1c92dc..2254720 100644 --- a/src/os/FileJournal.cc +++ b/src/os/FileJournal.cc @@ -945,9
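
The patch body is truncated above; as a generic illustration of the class of bug (not the FileJournal.cc diff itself), lseek() reports failure by returning (off_t)-1, so the result has to be checked against that, or against the expected offset, rather than being ignored or treated as a boolean:

```c
/* Generic illustration of correct lseek() error handling; this is not
 * the FileJournal.cc change, whose body is truncated above. */
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

static int seek_to(int fd, off_t offset)
{
    off_t pos = lseek(fd, offset, SEEK_SET);
    if (pos == (off_t)-1)
        return -errno;      /* propagate the real error */
    if (pos != offset)
        return -EIO;        /* landed somewhere unexpected */
    return 0;
}
```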

[PATCH] PG: Do not discard op data too early

2012-09-27 Thread Jim Schutt
ops. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/osd/ReplicatedPG.cc |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index a64abda..80bec2a 100644 --- a/src/osd/ReplicatedPG.cc +++ b/src/osd/ReplicatedPG.cc

Re: [PATCH] PG: Do not discard op data too early

2012-09-27 Thread Jim Schutt
On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an advisory option to Throttle, and apply

Re: [PATCH] PG: Do not discard op data too early

2012-09-27 Thread Jim Schutt
On 09/27/2012 04:27 PM, Gregory Farnum wrote: On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt jasc...@sandia.gov wrote: On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the

Lots of misdirected client requests with 73984 PGs/pool

2012-08-28 Thread Jim Schutt
Hi, I was testing on 288 OSDs with pg_bits=8, for 73984 PGs/pool, 221952 total PGs. Writing from CephFS clients generates lots of messages like this: 2012-08-28 14:53:33.772344 osd.235 [WRN] client.4533 172.17.135.45:0/1432642641 misdirected client.4533.1:124 pg 0.8b9d12d4 to osd.235 in e7,

Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible

2012-08-09 Thread Jim Schutt
On 08/09/2012 10:26 AM, Tommi Virtanen wrote: mkcephfs is not a viable route forward. For example, it is unable to expand a pre-existing cluster. The new OSD hotplugging style init is much, much nicer. And does more than just mkfs mount. I'm embarrassed to admit I haven't been keeping up

Re: [PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts

2012-08-09 Thread Jim Schutt
On 08/08/2012 07:13 PM, Alex Elder wrote: On 08/08/2012 11:09 AM, Jim Schutt wrote: Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. This is possible because the server

[PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts

2012-08-08 Thread Jim Schutt
of prepare_write_connect() to its callers at all locations except the one where the banner might still need to be sent. Signed-off-by: Jim Schutt jasc...@sandia.gov --- net/ceph/messenger.c | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph

Re: [EXTERNAL] Re: avoiding false detection of down OSDs

2012-07-31 Thread Jim Schutt
On 07/30/2012 06:24 PM, Gregory Farnum wrote: On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt jasc...@sandia.gov wrote: Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred

Re: avoiding false detection of down OSDs

2012-07-30 Thread Jim Schutt
Hi Greg, Thanks for the write-up. I have a couple questions below. On 07/30/2012 12:46 PM, Gregory Farnum wrote: As Ceph gets deployed on larger clusters our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated by

Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))

2012-07-18 Thread Jim Schutt
On 07/17/2012 06:03 PM, Samuel Just wrote: master should now have a fix for that, let me know how it goes. I opened bug #2798 for this issue. Hmmm, it seems handle_osd_ping() now runs into a case where for the first ping it gets, service.osdmap can be empty? 0 2012-07-18

Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))

2012-07-18 Thread Jim Schutt
On 07/18/2012 12:03 PM, Samuel Just wrote: Sorry, master has a fix now for that also. 76efd9772c60b93bbf632e3ecc3b9117dc081427 -Sam That got things running for me. Thanks for the quick reply. -- Jim -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a

osd/OSDMap.h: 330: FAILED assert(is_up(osd))

2012-07-17 Thread Jim Schutt
Hi, Recent master branch is asserting for me like this: ceph version 0.48argonaut-404-gabe05a3 (commit:abe05a3fbbb120d8d354623258d9104584db66f7) 1: (OSDMap::get_cluster_inst(int) const+0xc9) [0x58cde9] 2: (OSD::handle_osd_ping(MOSDPing*)+0x8cf) [0x5d4b4f] 3:

Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))

2012-07-17 Thread Jim Schutt
On 07/17/2012 03:44 PM, Samuel Just wrote: Not quite. OSDService::get_osdmap() returns the most recently published osdmap. Generally, OSD::osdmap is safe to use when you are holding the osd lock. Otherwise, OSDService::get_osdmap() should be used. There are a few other things that should be

Re: Interesting results

2012-07-02 Thread Jim Schutt
On 07/01/2012 01:57 PM, Stefan Priebe wrote: Thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. -- Jim On 29.06.2012 at 00:37, Jim Schutt wrote: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using

Re: Interesting results

2012-07-02 Thread Jim Schutt
On 07/02/2012 08:07 AM, Stefan Priebe - Profihost AG wrote: On 02.07.2012 at 16:04, Jim Schutt wrote: On 07/01/2012 01:57 PM, Stefan Priebe wrote: Thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. Thanks. Have you ever measured random I/O performance

Re: Interesting results

2012-06-29 Thread Jim Schutt
On 06/28/2012 04:53 PM, Mark Nelson wrote: On 06/28/2012 05:37 PM, Jim Schutt wrote: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 3.5-rc4 for my OSD servers, the same kernel

Re: excessive CPU utilization by isolate_freepages?

2012-06-28 Thread Jim Schutt
On 06/28/2012 05:36 AM, Mel Gorman wrote: On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote: Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may

Re: OSD Hardware questions

2012-06-28 Thread Jim Schutt
On 06/28/2012 09:45 AM, Alexandre DERUMIER wrote: Definitely. Seeing perf/oprofile/whatever results for the osd under that workload would be very interesting! We need to get perf going in our testing environment... I'm not an expert, but if you give me command line, I'll do it ;) Thanks to

Interesting results

2012-06-28 Thread Jim Schutt
Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 3.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit

Re: OSD Hardware questions

2012-06-27 Thread Jim Schutt
Hi Mark, On 06/27/2012 07:55 AM, Mark Nelson wrote: For what it's worth, I've got a pair of Dell R515 setup with a single 2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable of about 200MB/s each. Currently I'm topping out at about 600MB/s with rados bench using half of

Re: OSD Hardware questions

2012-06-27 Thread Jim Schutt
On 06/27/2012 09:19 AM, Stefan Priebe wrote: On 27.06.2012 at 16:55, Jim Schutt wrote: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store. Which RAID level do you use

Re: OSD Hardware questions

2012-06-27 Thread Jim Schutt
On 06/27/2012 11:54 AM, Stefan Priebe wrote: On 27.06.2012 at 19:23, Jim Schutt jasc...@sandia.gov wrote: On 06/27/2012 09:19 AM, Stefan Priebe wrote: On 27.06.2012 at 16:55, Jim Schutt wrote: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive

Re: OSD Hardware questions

2012-06-27 Thread Jim Schutt
On 06/27/2012 12:48 PM, Stefan Priebe wrote: On 27.06.2012 at 20:38, Jim Schutt wrote: Actually, when my 166-client test is running, ps -o pid,nlwp,args -C ceph-osd tells me that I typically have ~1200 threads/OSD. Huh, I see only 124 threads per OSD even with your settings. FWIW: 2 threads

excessive CPU utilization by isolate_freepages?

2012-06-27 Thread Jim Schutt
Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to

mkcephfs regression in current master branch

2012-05-24 Thread Jim Schutt
Hi, In my testing I make repeated use of the manual mkcephfs sequence described in the man page: master# mkdir /tmp/foo master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo mdsnode# mkcephfs --init-local-daemons

Re: mkcephfs regression in current master branch

2012-05-24 Thread Jim Schutt
On 05/24/2012 03:13 PM, Sage Weil wrote: Hi Jim, On Thu, 24 May 2012, Jim Schutt wrote: Hi, In my testing I make repeated use of the manual mkcephfs sequence described in the man page: master# mkdir /tmp/foo master# mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo

Re: [EXTERNAL] Re: [RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure

2012-05-14 Thread Jim Schutt
to find an 'in' device during 'spread re-replication around' mode. This makes it less likely we'll give up when the storage cluster has many failed devices. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/crush/mapper.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src

[PATCH 2/2] ceph: add tracepoints for message send queueing and completion, reply handling

2012-05-10 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- include/trace/events/ceph.h | 67 +++ net/ceph/messenger.c|9 +- net/ceph/osd_client.c |1 + 3 files changed, 76 insertions(+), 1 deletions(-) diff --git a/include/trace/events

[PATCH 1/2] ceph: add tracepoints for message submission on read/write requests

2012-05-10 Thread Jim Schutt
Trace callers of ceph_osdc_start_request, so that call locations are identified implicitly. Put the tracepoints after calls to ceph_osdc_start_request, since it fills in the request transaction ID and request OSD. Signed-off-by: Jim Schutt jasc...@sandia.gov --- fs/ceph/addr.c
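
For readers unfamiliar with kernel tracepoints, a much-reduced sketch of what such an event definition can look like follows. The event name and the (tid, osd) fields are assumptions for illustration, not the contents of the real include/trace/events/ceph.h added by the patch.

```c
/* Much-reduced sketch of a tracepoint header; illustrative only. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM ceph

#if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_CEPH_H

#include <linux/tracepoint.h>

TRACE_EVENT(ceph_osdc_start_request,
	TP_PROTO(u64 tid, int osd),
	TP_ARGS(tid, osd),
	TP_STRUCT__entry(
		__field(u64, tid)
		__field(int, osd)
	),
	TP_fast_assign(
		__entry->tid = tid;
		__entry->osd = osd;
	),
	TP_printk("tid=%llu osd=%d",
		  (unsigned long long)__entry->tid, __entry->osd)
);

#endif /* _TRACE_CEPH_H */

/* This part must be outside protection */
#include <trace/define_trace.h>
```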

[PATCH 0/2] Ceph tracepoints

2012-05-10 Thread Jim Schutt
Hi Alex, I ran across tracker #2374 today - I've been carrying these two tracepoint patches for a while. Perhaps you'll find them useful. Jim Schutt (2): ceph: add tracepoints for message submission on read/write requests ceph: add tracepoints for message send queueing and completion, reply

[RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure

2012-05-10 Thread Jim Schutt
. So, this is looking fairly solid to me so far. What do you think? Thanks -- Jim Jim Schutt (2): ceph: retry CRUSH map descent before retrying bucket ceph: retry CRUSH map descent from root if leaf is failed src/crush/mapper.c | 30 ++ 1 files changed, 22

[RFC PATCH 1/2] ceph: retry CRUSH map descent before retrying bucket

2012-05-10 Thread Jim Schutt
. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/crush/mapper.c | 20 ++-- 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/src/crush/mapper.c b/src/crush/mapper.c index 8857577..e5dc950 100644 --- a/src/crush/mapper.c +++ b/src/crush/mapper.c @@ -350,8 +350,7

[RFC PATCH 2/2] ceph: retry CRUSH map descent from root if leaf is failed

2012-05-10 Thread Jim Schutt
, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare. Signed-off-by: Jim Schutt jasc

Re: Re-replicated data does not seem to get uniformly redistributed after OSD failure

2012-04-30 Thread Jim Schutt
On 04/30/2012 11:12 AM, Samuel Just wrote: There is a (unfortunately non-optional at the moment) feature in crush where we retry in the same bucket a few times before restarting the descent when hitting an out leaf. The result of this is to localise recovery at the expense of inadequately
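
A heavily simplified, self-contained toy of the two behaviors being contrasted follows; it is NOT the real crush/mapper.c algorithm. A "cluster" here is a flat array of up/down flags, a "bucket" a contiguous slice of it, and the hash a trivial mix; the only point is that retrying inside the bucket that produced the failed leaf localises the replacement choice, while restarting the descent from the root lets it land anywhere in the hierarchy.

```c
/* Toy illustration only -- not CRUSH. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_OSDS    16
#define BUCKET_SIZE 4                 /* 4 buckets of 4 OSDs each */

static bool osd_up[NUM_OSDS];

static unsigned mix(unsigned x, unsigned r)
{
    return (x * 2654435761u + r) % 65521u;
}

static int pick_bucket(unsigned pgid, unsigned r)
{
    return mix(pgid, r) % (NUM_OSDS / BUCKET_SIZE);
}

static int pick_leaf(int bucket, unsigned pgid, unsigned r)
{
    return bucket * BUCKET_SIZE + mix(pgid ^ 0x9e37u, r) % BUCKET_SIZE;
}

static int choose(unsigned pgid, bool retry_descent_from_root)
{
    for (unsigned attempt = 0; attempt < 50; attempt++) {
        /* Local-retry mode keeps drawing leaves from the same bucket;
         * retry-descent mode re-runs the whole descent each attempt. */
        int bucket = pick_bucket(pgid, retry_descent_from_root ? attempt : 0);
        int osd = pick_leaf(bucket, pgid, attempt);
        if (osd_up[osd])
            return osd;
    }
    return -1;                        /* gave up */
}

int main(void)
{
    unsigned pgid = 7;

    for (int i = 0; i < NUM_OSDS; i++)
        osd_up[i] = true;

    /* Fail the entire bucket this PG initially maps to. */
    int dead = pick_bucket(pgid, 0);
    for (int i = 0; i < BUCKET_SIZE; i++)
        osd_up[dead * BUCKET_SIZE + i] = false;

    printf("local retries only : osd %d\n", choose(pgid, false)); /* -1   */
    printf("retry from root    : osd %d\n", choose(pgid, true));  /* >= 0 */
    return 0;
}
```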

Re: Release/branch naming; input requested

2012-04-27 Thread Jim Schutt
On 04/26/2012 06:09 PM, Tommi Virtanen wrote: Now, here are my actual questions: 1. What should the relative names of the branches be? stable vs latest etc. I especially don't like integration, but I do see a time where it is not ready for stable but still needs to branch off of latest. 2. Do

Re-replicated data does not seem to get uniformly redistributed after OSD failure

2012-04-25 Thread Jim Schutt
Hi, I've been experimenting with failure scenarios to make sure I understand what happens when an OSD drops out. In particular, I've been using ceph osd out n and watching all my OSD servers to see where the data from the removed OSD ends up after recovery. I've been doing this testing with

Re: scaling issues

2012-04-10 Thread Jim Schutt
On 03/09/2012 04:21 PM, Jim Schutt wrote: On 03/09/2012 12:39 PM, Jim Schutt wrote: On 03/08/2012 05:26 PM, Sage Weil wrote: On Thu, 8 Mar 2012, Jim Schutt wrote: Hi, I've been trying to scale up a Ceph filesystem to as big as I have hardware for - up to 288 OSDs right now. (I'm using

Re: [EXTERNAL] Re: scaling issues

2012-04-10 Thread Jim Schutt
On 04/10/2012 10:39 AM, Sage Weil wrote: On Tue, 10 Apr 2012, Jim Schutt wrote: On 03/09/2012 04:21 PM, Jim Schutt wrote: On 03/09/2012 12:39 PM, Jim Schutt wrote: On 03/08/2012 05:26 PM, Sage Weil wrote: On Thu, 8 Mar 2012, Jim Schutt wrote: Hi, I've been trying to scale up a Ceph

[PATCH] Makefile: fix modules that cannot find pk11pub.h when compiling with NSS on RHEL6

2012-03-21 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/Makefile.am |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/src/Makefile.am b/src/Makefile.am index cdfb43d..2062d1c 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -48,7 +48,7 @@ if LINUX ceph_osd_LDADD

Re: platform requirements / centos 6.2

2012-03-21 Thread Jim Schutt
On 03/21/2012 05:16 AM, Plaetinck, Dieter wrote: Hello, Ceph/Rados looks very well designed and engineered. I would like to build a cluster to test the rados distributed object storage (not the distributed FS or block devices) I've seen the list of dependencies on the wiki, but it doesn't

Re: scaling issues

2012-03-09 Thread Jim Schutt
On 03/08/2012 05:26 PM, Sage Weil wrote: On Thu, 8 Mar 2012, Jim Schutt wrote: Hi, I've been trying to scale up a Ceph filesystem to as big as I have hardware for - up to 288 OSDs right now. (I'm using commit ed0f605365e - tip of master branch from a few days ago.) My problem is that I

Re: scaling issues

2012-03-09 Thread Jim Schutt
On 03/09/2012 12:39 PM, Jim Schutt wrote: On 03/08/2012 05:26 PM, Sage Weil wrote: On Thu, 8 Mar 2012, Jim Schutt wrote: Hi, I've been trying to scale up a Ceph filesystem to as big as I have hardware for - up to 288 OSDs right now. (I'm using commit ed0f605365e - tip of master branch from

scaling issues

2012-03-08 Thread Jim Schutt
Hi, I've been trying to scale up a Ceph filesystem to as big as I have hardware for - up to 288 OSDs right now. (I'm using commit ed0f605365e - tip of master branch from a few days ago.) My problem is that I cannot get a 288 OSD filesystem to go active (that's with 1 mon and 1 MDS). Pretty

Re: [PATCH] net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer

2012-02-29 Thread Jim Schutt
Hi Alex, On 02/02/2012 07:07 PM, Alex Elder wrote: On Wed, 2012-02-01 at 08:59 -0700, Jim Schutt wrote: The Ceph messenger would sometimes queue multiple work items to write data to a socket when the socket buffer was full. Fix this problem by making ceph_write_space() use SOCK_NOSPACE

Re: [PATCH] net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer

2012-02-29 Thread Jim Schutt
On 02/29/2012 08:47 AM, Alex Elder wrote: On 02/29/2012 07:30 AM, Jim Schutt wrote: Hi Alex, On 02/02/2012 07:07 PM, Alex Elder wrote: On Wed, 2012-02-01 at 08:59 -0700, Jim Schutt wrote: The Ceph messenger would sometimes queue multiple work items to write data to a socket when the socket
