3.10.0 failed paging request from kthread_data
Hi, I'm trying to test the btrfs and ceph contributions to 3.11, without testing all of 3.11-rc1 (just yet), so I'm testing with the "next" branch of Chris Mason's tree (commit cbacd76bb3 from git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git) merged into the for-linus branch of the ceph tree (commit 8b8cf8917f from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git) One of my ceph clients hit this: [94633.463166] BUG: unable to handle kernel paging request at ffa8 [94633.464003] IP: [] kthread_data+0x10/0x20 [94633.464003] PGD 1a0c067 PUD 1a0e067 PMD 0 [94633.464003] Oops: [#2] SMP [94633.464003] Modules linked in: cbc ceph libceph ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh scsi_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm microcode button serio_raw pcspkr ehci_pci ehci_hcd ib_mthca ib_mad ib_core lpc_ich mfd_core uhci_hcd i5k_amb i5000_edac edac_core dm_mod nfsv4 nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 bnx2 igb ptp pps_core i2c_algo_bit i2c_core dca hwmon e1000 [94633.464003] CPU: 0 PID: 78416 Comm: kworker/0:1 Tainted: G D W 3.10.0-00119-g2925339 #601 [94633.464003] Hardware name: Dell Inc. PowerEdge 1950/0NK937, BIOS 1.1.0 06/21/2006 [94633.464003] task: 880415b6 ti: 88040e39a000 task.ti: 88040e39a000 [94633.464003] RIP: 0010:[] [] kthread_data+0x10/0x20 [94633.464003] RSP: 0018:88040e39b7f8 EFLAGS: 00010092 [94633.464003] RAX: RBX: RCX: 81d30320 [94633.464003] RDX: RSI: RDI: 880415b6 [94633.464003] RBP: 88040e39b7f8 R08: 880415b60070 R09: 0001 [94633.464003] R10: R11: R12: [94633.464003] R13: 880415b603e8 R14: 0001 R15: 0002 [94633.464003] FS: () GS:88042fc0() knlGS: [94633.464003] CS: 0010 DS: ES: CR0: 8005003b [94633.464003] CR2: 0028 CR3: 000415f77000 CR4: 07f0 [94633.464003] DR0: DR1: DR2: [94633.464003] DR3: DR6: 0ff0 DR7: 0400 [94633.464003] Stack: [94633.464003] 88040e39b818 810602a5 88040e39b818 88042fc139c0 [94633.464003] 88040e39b8a8 814ef79e 8804 88040e39bfd8 [94633.464003] 88040e39a000 88040e39a000 88040e39a010 88040e39a000 [94633.464003] Call Trace: [94633.464003] [] wq_worker_sleeping+0x15/0xa0 [94633.464003] [] __schedule+0x17e/0x6b0 [94633.464003] [] schedule+0x5d/0x60 [94633.464003] [] do_exit+0x3eb/0x440 [94633.464003] [] oops_end+0xd8/0xf0 [94633.464003] [] no_context+0x1bf/0x1e0 [94633.464003] [] __bad_area_nosemaphore+0x1f5/0x230 [94633.464003] [] bad_area_nosemaphore+0x13/0x20 [94633.464003] [] __do_page_fault+0x416/0x4b0 [94633.464003] [] ? idle_balance+0x14e/0x180 [94633.464003] [] ? finish_task_switch+0x3f/0x110 [94633.464003] [] ? error_sti+0x5/0x6 [94633.464003] [] ? trace_hardirqs_off_caller+0x29/0xd0 [94633.464003] [] ? trace_hardirqs_off_thunk+0x3a/0x3c [94633.464003] [] do_page_fault+0xe/0x10 [94633.464003] [] page_fault+0x22/0x30 [94633.464003] [] ? rb_erase+0x297/0x3a0 [94633.464003] [] __remove_osd+0x98/0xd0 [libceph] [94633.464003] [] __reset_osd+0xa3/0x1c0 [libceph] [94633.464003] [] ? osd_reset+0x9b/0xd0 [libceph] [94633.464003] [] __kick_osd_requests+0x7b/0x2e0 [libceph] [94633.464003] [] osd_reset+0xa6/0xd0 [libceph] [94633.464003] [] con_work+0x445/0x4a0 [libceph] [94633.464003] [] process_one_work+0x2e5/0x510 [94633.464003] [] ? process_one_work+0x240/0x510 [94633.464003] [] worker_thread+0x215/0x340 [94633.464003] [] ? manage_workers+0x170/0x170 [94633.464003] [] kthread+0xe1/0xf0 [94633.464003] [] ? 
__init_kthread_worker+0x70/0x70 [94633.464003] [] ret_from_fork+0x7c/0xb0 [94633.464003] [] ? __init_kthread_worker+0x70/0x70 [94633.464003] Code: 90 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 90 03 00 00 <48> 8b 40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 [94633.464003] RIP [] kthread_data+0x10/0x20 [94633.464003] RSP [94633.464003] CR2: ffa8 [94633.464003] ---[ end trace 89622896705a7fac ]--- [94633.464003] Fixing recursive fault but reboot is needed! [94633.464003] [ cut here ] kthread_data disassembles to this: (gdb) disassemble kthread_data Dump of assembler code for function kthread_data: 0x8106a060 <+0>: push %rbp 0x8106a061 <+1>: mov%rsp,%rbp 0x8106a064 <+4>: callq 0x814fabc0 0x8106a069 <+9>: mov0x390(%rdi),%rax 0x8106a070 <+16>:
lost bugfix in sync_requester_abort?
Hi Sage, A week or so ago you had a patch in your wip-mon branch (now gone) called "mon: fix raw use of *_dout in sync_requester_abort()". It seems that patch got dropped. AFAICS, without that patch, sync_requester_abort() has different behavior depending on the value of "debug mon". In particular, I believe the sync leader/provider won't get reset if debug mon is less than 10. Am I missing something, or was dropping that change unintended? Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pg balancing
Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: > Hi Jim- > > You mentioned the other day your concerns about the uniformity of the PG > and data distribution. There are several ways to attack it (including > increasing the number of PGs), but one that we haven't tested much yet is > the 'reweight-by-utilization' function in the monitor. > > The idea is that there will always be some statistical variance in the > distribution and a non-zero probability of having outlier OSDs with too > many PG. We adjust for this by taking nodes that are substantially above > the mean down by some adjustment factor in an automated way. > > ceph osd reweight-by-utilization MIN > > where MIN is the minimum relative utilization at which we will start > adjusting down. It is always > 100 (100% of the mean), and defaults to > 120. After it adjusts the reweights, you should see the result in 'ceph > osd tree' output > > Have you played with this at all on your cluster? I'd be very interested > in how well this does/does not improve things for you. I've been experimenting with re-weighting, and have found that it works well to redistribute data, as you expected. Here's a few observations: - when an OSD goes out and comes back in, its weight gets reset to 1. It would be nice if it could remember its old weight. - in order to reach the data distribution uniformity I'm after, I need to run multiple iterations of re-weighting - each iteration pushes data off the most highly utilized OSDs, but some ends up on average OSDs and pushes them over the limit. - as you expected, to reach the uniformity I'm after, a _lot_ of data needs to move. I've got some scripts I'm using to generate 'ceph osd reweight OSD WEIGHT' commands based on PG distribution, and I can use these after I create a new filesystem, to get a suitably uniform PG distribution before there is any data to move. Some iteration is required here as well, and this is working really well for me. When I start writing data to such a re-weighted filesystem, the data distribution pretty closely mirrors the PG distribution (once you write enough data). - re-weighting to get a more uniform data distribution works better if there are more PGs to work with. At 576 OSDs, I can't quite get things as uniform as I'd like with 64K PGs, but I can with 128K PGs. FWIW, here's the (max PGs/OSD)/(min PGs/OSD) I've measured for various numbers of PGs on 576 OSDs, with no re-weighting: PGs (max PGs/OSD) / (min PGs/OSD) 65536 1.478 131073 1.308 262144 1.240 524288 1.155 1048576 1.105 (BTW, your recent leveldb work enabled those 512K and 1M measurements. Thanks!) With 128K PGs and iterative re-weighting, I can get (max PGs/OSD)/(min PGs/OSD) < 1.05, and after writing enough data to consume ~33% of available storage, I get (max OSD data use)/(min OSD data use) ~ 1.06. OSD weights end up in the 0.85 - 1.0 range for such a distribution. So, re-weighting is definitely working for me. -- Jim > > Thanks! > sage > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
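For anyone who wants to try the same kind of iteration, here is a minimal sketch of the sort of reweight-generating script I described above. It is illustrative only, not the exact script I run: it assumes you have already extracted a per-OSD PG count (for example by parsing 'ceph pg dump' output) into a dict, and the damping factor, weight floor, and helper names are all arbitrary choices.

#!/usr/bin/env python
# Sketch: emit 'ceph osd reweight OSD WEIGHT' commands that nudge
# over-full OSDs back toward the mean PG count.  pg_per_osd is assumed
# to be {osd_id: pg_count}, e.g. built by parsing 'ceph pg dump'.

def reweight_commands(pg_per_osd, cur_weight, damping=0.25, floor=0.80):
    mean = float(sum(pg_per_osd.values())) / len(pg_per_osd)
    cmds = []
    for osd, count in sorted(pg_per_osd.items()):
        if count <= mean:
            continue              # only push over-full OSDs down
        # move part of the way toward the "ideal" weight mean/count
        w = cur_weight.get(osd, 1.0) * (1.0 - damping * (1.0 - mean / count))
        w = max(floor, min(1.0, w))
        cmds.append("ceph osd reweight %d %.3f" % (osd, w))
    return cmds

if __name__ == "__main__":
    # toy example: 4 OSDs, one of them overloaded
    pgs = {0: 455, 1: 470, 2: 520, 3: 440}
    weights = dict((osd, 1.0) for osd in pgs)
    for cmd in reweight_commands(pgs, weights):
        print(cmd)

Each pass only reduces weights on the over-full OSDs, so as noted above it takes several iterations (with data movement in between) before the distribution settles.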
Re: leveldb compaction overhead
On 06/05/2013 01:05 PM, Mark Nelson wrote: > > FWIW, I've been fighting with some mon/leveldb issues on 24-node test > cluster causing high CPU utilization, constant reads, laggy osdmap > updates, and mons dropping out of quorum. Work is going on in > wip-mon. Should have some more testing done later today. Other than > some crashing it appears to be behaving much better. ;) Cool! I've seen that branch, but haven't had a chance to take it for a spin yet Thanks -- Jim > > Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: leveldb compaction overhead
Hi Sage, On 05/31/2013 06:00 PM, Sage Weil wrote: > On Fri, 31 May 2013, Jim Schutt wrote: >> Hi Sage, >> >> On 05/29/2013 03:07 PM, Sage Weil wrote: >>> Hi all- >>> >>> I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to >>> make the leveldb compaction on the monitor less expensive by doing it in >>> an async thread and compaction only the trimmed range. If anyone who is >>> experience high monitor io on cuttlefish would like to test it out, >>> feedback on whether/how much it improves things would be much appreciated! >> >> I've been flogging wip-5176-cuttlefish merged into recent cuttlefish >> branch (commit 02ef6e918e), and the result has been very stable for >> me. I've been testing OSD reweights, and so have been getting lots >> of pgmap updates, and lots of data movement. >> >> I'm no longer seeing stalls, and I see much less data movement >> on the monitor hosts. I haven't seen any monitors drop out >> and rejoin, which had been a regular occurrence for me. >> I stopped a mon and reinitialized it, and it resync'ed in >> just a few minutes, which is also a big improvement. >> >> This is all with 128K PGs - next week I'll try much higher >> PG counts. >> >> Thanks a bunch for these fixes - they are working great >> for me! > > This is great news! > > I pushed a few more patches to wip-5176-cuttlefish that should make it > even better (smarter about ranges to compact, and perfcounters so we can > tell what leveldb compactions are taking place). Do you mind trying it > out as well? I've been testing these out, in the cuttlefish branch (at commit 8544ea7518). They've been working well for me, with the possible exception of commit 61135964419 ("mon/Paxos: adjust trimming defaults up; rename options"). FWIW, I've found that the new default values for paxos_trim_min and paxos_service_trim_min aren't working well for me at 256K PGs. I periodically get the classic symptoms of monitor non-responsiveness: mons dropping out of quorum and coming back in later, and an mds going laggy and booting. The mds behavior is slightly new - the active mds doesn't fail over to one of my standby mds instances, it just goes laggy, boots, repeat. I've gone to really, really aggressive trimming (both paxos_trim_min and paxos_service_trim_min at 10), and this has been working really well for me so far. I'm wondering, when leveldb compacts, does it stop committing new objects for the duration of the compaction? If so, then possibly smaller but more frequent compaction causes shorter periods of no updates. So, even though there's more compaction work overall, each episode is much less disruptive, and my cluster is much happier. If not, then I'm not sure why my trim tuning seems to help. Thanks -- Jim > > In the meantime, I'm going to pull it into the cuttlefish branch and test > over the weekend. If that looks good we'll cut a point release with all > of these fixes. > > Thank you to everyone who has helped with the debugging and testing on > these issues! 
> > sage > > >> >> Thanks -- Jim >> >>> >>> Thanks- >>> sage >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
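For reference, the aggressive trim settings I mentioned above amount to something like this in ceph.conf (option names as they appear after the "adjust trimming defaults up; rename options" commit; this is just what I happen to be running at 256K PGs, not a suggested default):

[mon]
    paxos trim min = 10
    paxos service trim min = 10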
Re: leveldb compaction overhead
Hi Sage, On 05/29/2013 03:07 PM, Sage Weil wrote: > Hi all- > > I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to > make the leveldb compaction on the monitor less expensive by doing it in > an async thread and compaction only the trimmed range. If anyone who is > experience high monitor io on cuttlefish would like to test it out, > feedback on whether/how much it improves things would be much appreciated! I've been flogging wip-5176-cuttlefish merged into recent cuttlefish branch (commit 02ef6e918e), and the result has been very stable for me. I've been testing OSD reweights, and so have been getting lots of pgmap updates, and lots of data movement. I'm no longer seeing stalls, and I see much less data movement on the monitor hosts. I haven't seen any monitors drop out and rejoin, which had been a regular occurrence for me. I stopped a mon and reinitialized it, and it resync'ed in just a few minutes, which is also a big improvement. This is all with 128K PGs - next week I'll try much higher PG counts. Thanks a bunch for these fixes - they are working great for me! Thanks -- Jim > > Thanks- > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
On 05/15/2013 10:49 AM, Alex Elder wrote: > On 05/15/2013 11:38 AM, Jim Schutt wrote: >> > Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while >> > holding a lock, but it's spoiled because ceph_pagelist_addpage() always >> > calls kmap(), which might sleep. Here's the result: > This looks good to me, but I admit I didn't take as close > a look at it this time. > > I appreciate your updating the series to include the things > I mentioned. > > I'll commit these for you, and I'll get confirmation on the > byte order thing as well. Cool. Thanks, Alex. -- Jim > > Reviewed-by: Alex Elder > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 [13439.316464] 5 locks held by kworker/1:1/12059: [13439.320998] #0: (ceph-msgr){..}, at: [] process_one_work+0x218/0x480 [13439.329701] #1: ((&(&con->work)->work)){..}, at: [] process_one_work+0x218/0x480 [13439.339446] #2: (&s->s_mutex){..}, at: [] send_mds_reconnect+0xec/0x450 [ceph] [13439.349081] #3: (&mdsc->snap_rwsem){..}, at: [] send_mds_reconnect+0x16e/0x450 [ceph] [13439.359278] #4: (file_lock_lock){..}, at: [] lock_flocks+0x15/0x20 [13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW 3.9.0-00358-g308ae61 #557 [13439.376225] Call Trace: [13439.378757] [] __might_sleep+0xfc/0x110 [13439.384353] [] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [] ? _raw_spin_lock+0x49/0x50 [13439.403775] [] ? lock_flocks+0x15/0x20 [13439.409277] [] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [] ? igrab+0x28/0x70 [13439.420610] [] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [] check_new_map+0x352/0x500 [ceph] [13439.461534] [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [] ? mutex_unlock+0xe/0x10 [13439.473934] [] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [] dispatch+0xbc/0x110 [libceph] [13439.486492] [] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [] ? read_partial_message+0x3e8/0x520 [libceph] [13439.500583] [] ? kernel_recvmsg+0x44/0x60 [13439.506324] [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph] [13439.513140] [] try_read+0x5fe/0x7e0 [libceph] [13439.519246] [] con_work+0x378/0x4a0 [libceph] [13439.525345] [] ? finish_task_switch+0x3f/0x110 [13439.531515] [] process_one_work+0x2b5/0x480 [13439.537439] [] ? process_one_work+0x218/0x480 [13439.543526] [] worker_thread+0x1f5/0x320 [13439.549191] [] ? manage_workers+0x170/0x170 [13439.555102] [] kthread+0xe1/0xf0 [13439.560075] [] ? __init_kthread_worker+0x70/0x70 [13439.566419] [] ret_from_fork+0x7c/0xb0 [13439.571918] [] ? __init_kthread_worker+0x70/0x70 [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt --- fs/ceph/locks.c | 76 +++--- fs/ceph/mds_client.c | 65 +++--- fs/ceph/super.h |9 - 3 files changed, 89 insertions(+), 61 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 4518313..8978851 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -191,29 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) } /** - * Encode the flock and fcntl locks for the given inode into the pagelist. - * Format is: #fcntl locks, sequential fcntl locks, #flock locks, - * sequential flock locks. - * Must be called with lock_flocks() already held. 
- * If we encounter more of a specific lock type than expected, - * we return the value 1. + * Encode the flock and fcntl locks for the given inode into the ceph_filelock + * array. Must be called with lock_flocks() already held. + * If we encounter more of a specific lock type than expected, return -ENOSPC. */ -int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks) +int ceph_encode_locks_to_buffer(struct inode *inode, + struct ceph_filelock *flocks, + int num_fcntl_locks, int num_flock_locks) { struct file_lock *lock; - struct ceph_filelock cephlock; int err = 0; int seen_fcntl = 0; int seen_flock = 0; - __le32 nlocks; + int l = 0; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - nlocks = cpu_to_le32(num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); - if (err) - goto fail; + for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { if (lock->fl_flags & FL_PO
[PATCH v2 2/3] ceph: add missing cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Signed-off-by: Jim Schutt --- fs/ceph/locks.c |7 +-- fs/ceph/mds_client.c |2 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index ffc86cb..4518313 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -206,10 +206,12 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, int err = 0; int seen_fcntl = 0; int seen_flock = 0; + __le32 nlocks; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32)); + nlocks = cpu_to_le32(num_fcntl_locks); + err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); if (err) goto fail; for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { @@ -229,7 +231,8 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, goto fail; } - err = ceph_pagelist_append(pagelist, &num_flock_locks, sizeof(u32)); + nlocks = cpu_to_le32(num_flock_locks); + err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); if (err) goto fail; for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 4f22671..d9ca152 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2485,7 +2485,7 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap, lock_flocks(); ceph_count_locks(inode, &num_fcntl_locks, &num_flock_locks); - rec.v2.flock_len = (2*sizeof(u32) + + rec.v2.flock_len = cpu_to_le32(2*sizeof(u32) + (num_fcntl_locks+num_flock_locks) * sizeof(struct ceph_filelock)); unlock_flocks(); -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt --- fs/ceph/locks.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..ffc86cb 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -169,7 +169,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl) } /** - * Must be called with BKL already held. Fills in the passed + * Must be called with lock_flocks() already held. Fills in the passed * counter variables, so you can prepare pagelist metadata before calling * ceph_encode_locks. */ -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/3] ceph: fix might_sleep while atomic
This patch series fixes an issue where encode_caps_cb() ends up calling kmap() while holding a spinlock. The original patch (http://www.spinics.net/lists/ceph-devel/msg14573.html) was expanded to fix some other outstanding issues noticed by Alex Elder in his review. Changes from the original version: - clean up a comment - add missing calls to cpu_to_le32() - eliminate unnecessary loop around calls to ceph_pagelist_append() - only loop on calls to ceph_count_locks() and ceph_encode_locks_to_buffer() if the number of locks changes; exit on other errors - return proper error code from encode_caps_cb() if kmalloc fails In his review, Alex mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Jim Schutt (3): ceph: fix up comment for ceph_count_locks() as to which lock to hold ceph: add missing cpu_to_le32() calls when encoding a reconnect capability ceph: ceph_pagelist_append might sleep while atomic fs/ceph/locks.c | 75 -- fs/ceph/mds_client.c | 65 +++ fs/ceph/super.h |9 - 3 files changed, 90 insertions(+), 59 deletions(-) -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph-users] OSD state flipping when cluster-network in high utilization
On 05/15/2013 09:12 AM, Chen, Xiaoxi wrote: > Thanks, but i am not quite understand how to determine weather monitor > overloaded? I've also found it difficult, and a process of elimination. Make sure your mon debugging is minimal. If it's not, lower it and see if it helps. Watch your mon host with vmstat or iostat, to see what its disk write rate is during the periods you're having trouble with OSDs having heartbeat issues. If it's near what you think the limiting rate is for whatever device is holding the mon's store.db directory, that might be your bottleneck. > and if yes,will start several monitor help? Sage or one of the other Ceph developers can speak better to this. I will note that with three monitors, if one is too busy, it'll miss renewing its lease. The result of that will be an election to get a new quorum. If you run multiple mons and they hold frequent elections where a mon drops out of quorum, then comes back in on the next election, I've found that to be a sign that my mons are too busy. -- Jim > > Sent from my iPhone > > On 2013-5-15, 23:07, "Jim Schutt" wrote: > >> On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: >>>> How responsive generally is the machine under load? Is there available >>>> CPU? >>>The machine works well, and the issued OSDs are likely the same, seems >>> because they have relative slower disk( disk type are the same but the >>> latency is a bit higher ,8ms -> 10ms). >>> >>>Top show no idle % but still have 30+% of io_wait, my colleague educate >>> me that io_wait can be treated as free. >>> >>> Another information is offload the heartbeat to 1Gb nic doesn't solve the >>> problem, what's more, when we doing random write test, we can still see >>> this flipping happen. So I would like to say it may related with CPU >>> scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu >>> cycle. >>> >> >> FWIW, also take a close look at your monitor daemons, and >> whether they show any signs of being overloaded. >> >> I frequently see OSDs wrongly marked down when my >> mons cannot keep up with their workload. >> >> -- Jim >> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
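To make the "watch your mon host" suggestion concrete, the sort of thing I do is simply (the interval is just an example; watch whatever disk holds the mon's store.db):

    vmstat 2
    iostat -x 2

on the monitor host while the OSDs are flapping, and keep an eye on 'ceph -w' at the same time for repeated monitor election messages.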
RE: [ceph-users] OSD state flipping when cluster-network in high utilization
On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: >> How responsive generally is the machine under load? Is there available CPU? > The machine works well, and the issued OSDs are likely the same, seems > because they have relative slower disk( disk type are the same but the > latency is a bit higher ,8ms -> 10ms). > > Top show no idle % but still have 30+% of io_wait, my colleague > educate me that io_wait can be treated as free. > > Another information is offload the heartbeat to 1Gb nic doesn't solve the > problem, what's more, when we doing random write test, we can still see this > flipping happen. So I would like to say it may related with CPU scheduler ? > The heartbeat thread (in busy OSD ) failed to get enough cpu cycle. > FWIW, also take a close look at your monitor daemons, and whether they show any signs of being overloaded. I frequently see OSDs wrongly marked down when my mons cannot keep up with their workload. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: ceph_pagelist_append might sleep while atomic
On 05/14/2013 10:44 AM, Alex Elder wrote: > On 05/09/2013 09:42 AM, Jim Schutt wrote: >> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while >> holding a lock, but it's spoiled because ceph_pagelist_addpage() always >> calls kmap(), which might sleep. Here's the result: > > I finally took a close look at this today, Jim. Sorry > for the delay. > No worries - thanks for taking a look. > The issue is formatting the reconnect message--which will > hold an arbitrary amount of data and therefore which we'll > need to do some allocation (and kmap) for--in the face of > having to hold the flock spinlock while doing so. > > And as you found, ceph_pagelist_addpage(), which is called > by ceph_pagelist_append(), calls kmap() even if it doesn't > need to allocate anything. This means that despite reserving > the pages, those pages are in the free list and because they'll > need to be the subject of kmap() their preallocation doesn't > help. > > Your solution was to pre-allocate a buffer, format the locks > into that buffer while holding the lock, then append the > buffer contents to a pagelist after releasing the lock. You > check for a changing (increasing) lock count while you format > the locks, which is good. > > So... Given that, I think your change looks good. It's a shame > we can't format directly into the pagelist buffer but this won't > happen much so it's not a big deal. I have a few small suggestions, > below. > > I do find some byte order bugs though. They aren't your doing, > but I think they ought to be fixed first, as a separate patch > that would precede this one. The bug is that the lock counts > that are put into the buffer (num_fcntl_locks and num_flock_locks) > are not properly byte-swapped. I'll point it out inline > in your code, below. > > I'll say that what you have is OK. Consider my suggestions, and > if you choose not to fix the byte order bugs, please let me know. I'll happily fix up a v2 series with your suggestions addressed. Thanks for catching those issues. Stay tuned... Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pg balancing
[resent to list because I missed that Cc:] Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: > Hi Jim- > > You mentioned the other day your concerns about the uniformity of the PG > and data distribution. There are several ways to attack it (including > increasing the number of PGs), but one that we haven't tested much yet is > the 'reweight-by-utilization' function in the monitor. > > The idea is that there will always be some statistical variance in the > distribution and a non-zero probability of having outlier OSDs with too > many PG. We adjust for this by taking nodes that are substantially above > the mean down by some adjustment factor in an automated way. > > ceph osd reweight-by-utilization MIN > > where MIN is the minimum relative utilization at which we will start > adjusting down. It is always > 100 (100% of the mean), and defaults to > 120. After it adjusts the reweights, you should see the result in 'ceph > osd tree' output > > Have you played with this at all on your cluster? I'd be very interested > in how well this does/does not improve things for you. I haven't yet, but likely it's because I don't understand what the reweighting does, exactly. Maybe you can comment inline below where I go wrong? Here's my thinking: I'm only partially motivated by the actual amount of storage used per OSD, although it is a factor. My major concern is a performance issue for our parallel application codes. Their computation cycle is: compute furiously, write results, repeat. The issue is that none of our codes implement write-behind; each task must finish writing results before any can resume computing. So, when some OSDs carry more PGs, they cannot complete their portion of the write phase as quickly as other OSDs with fewer PGs. Thus, the application's ability to resume computation is delayed by the busiest OSDs. My concern is that when we rebalance, we just cause some other subset of the OSDs to be the busiest, in order to send fewer writes to the overused OSDs and more writes to underused OSDs. At least, that's what I was thinking, without actually examining the code to see what is really going on in rebalancing, and without testing. Another thing I haven't done is actually compute from the statistics of uniform distributions what the expected variance is for my specific layout, (now 256K PGs across 576 OSDs, with root/host/device hierarchy, 24 OSDs/host). That's mostly due to my lack of knowledge of statistics If I'm getting more variance than expected I want to understand why, in case it can be fixed. In any event, I think it's past time I tried reweighting. Suppose I use 'ceph osd reweight-by-utilization 101', on the theory that I'd cause continuous, small adjustments to utilization, and I'd learn what the maximum impact can be. Does that seem like a bad idea to you, and if so could you help me understand why? Thanks for taking the time to think about this - I know you're busy. PS - FWIW, another reason I keep pushing the number of PGs is because when we actually deploy Ceph for production, it'll be at a bigger scale than my testbed. So, I'm trying to shake out any scale-related issues now, to make sure our users' first experience with Ceph is a good one. -- Jim > > Thanks! > sage > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
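For what it's worth, the crude estimate I have in mind goes something like this -- it ignores replication and the CRUSH hierarchy and just treats each PG as an independent, uniformly random choice of OSD, so it is only a back-of-envelope guide:

    expected PGs per OSD:   mu ~ P / M
    standard deviation:     sigma ~ sqrt(P / M)
    relative spread:        sigma / mu ~ 1 / sqrt(P / M)

For M = 576 OSDs and P = 256K PGs that gives mu ~ 455 and a relative spread of roughly 5%, and the most-loaded of 576 OSDs will typically sit a few sigma above the mean. Counting replicas would raise the per-OSD counts and shrink the relative spread accordingly. The main point is the scaling: quadrupling the PG count should roughly halve the relative spread, which is consistent with my experience that more PGs per OSD makes the distribution easier to flatten.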
[PATCH] libceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 [13439.316464] 5 locks held by kworker/1:1/12059: [13439.320998] #0: (ceph-msgr){..}, at: [] process_one_work+0x218/0x480 [13439.329701] #1: ((&(&con->work)->work)){..}, at: [] process_one_work+0x218/0x480 [13439.339446] #2: (&s->s_mutex){..}, at: [] send_mds_reconnect+0xec/0x450 [ceph] [13439.349081] #3: (&mdsc->snap_rwsem){..}, at: [] send_mds_reconnect+0x16e/0x450 [ceph] [13439.359278] #4: (file_lock_lock){..}, at: [] lock_flocks+0x15/0x20 [13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW 3.9.0-00358-g308ae61 #557 [13439.376225] Call Trace: [13439.378757] [] __might_sleep+0xfc/0x110 [13439.384353] [] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [] ? _raw_spin_lock+0x49/0x50 [13439.403775] [] ? lock_flocks+0x15/0x20 [13439.409277] [] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [] ? igrab+0x28/0x70 [13439.420610] [] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [] check_new_map+0x352/0x500 [ceph] [13439.461534] [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [] ? mutex_unlock+0xe/0x10 [13439.473934] [] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [] dispatch+0xbc/0x110 [libceph] [13439.486492] [] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [] ? read_partial_message+0x3e8/0x520 [libceph] [13439.500583] [] ? kernel_recvmsg+0x44/0x60 [13439.506324] [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph] [13439.513140] [] try_read+0x5fe/0x7e0 [libceph] [13439.519246] [] con_work+0x378/0x4a0 [libceph] [13439.525345] [] ? finish_task_switch+0x3f/0x110 [13439.531515] [] process_one_work+0x2b5/0x480 [13439.537439] [] ? process_one_work+0x218/0x480 [13439.543526] [] worker_thread+0x1f5/0x320 [13439.549191] [] ? manage_workers+0x170/0x170 [13439.555102] [] kthread+0xe1/0xf0 [13439.560075] [] ? __init_kthread_worker+0x70/0x70 [13439.566419] [] ret_from_fork+0x7c/0xb0 [13439.571918] [] ? __init_kthread_worker+0x70/0x70 [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt --- fs/ceph/locks.c | 73 + fs/ceph/mds_client.c | 62 ++ fs/ceph/super.h |9 +- 3 files changed, 88 insertions(+), 56 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..9a46161 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -191,27 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) } /** - * Encode the flock and fcntl locks for the given inode into the pagelist. - * Format is: #fcntl locks, sequential fcntl locks, #flock locks, - * sequential flock locks. - * Must be called with lock_flocks() already held. 
- * If we encounter more of a specific lock type than expected, - * we return the value 1. + * Encode the flock and fcntl locks for the given inode into the ceph_filelock + * array. Must be called with lock_flocks() already held. + * If we encounter more of a specific lock type than expected, return -ENOSPC. */ -int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks) +int ceph_encode_locks_to_buffer(struct inode *inode, + struct ceph_filelock *flocks, + int num_fcntl_locks, int num_flock_locks) { struct file_lock *lock; - struct ceph_filelock cephlock; int err = 0; int seen_fcntl = 0; int seen_flock = 0; + int l = 0; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32)); - if (err) - goto fail; + for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { if (lock->fl_flags & FL_POSIX) { ++seen_fcntl; @@ -219,19 +215,12 @@ int
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/18/2013 04:38 PM, Jim Schutt wrote: > Hi Greg, > > On 04/16/2013 02:18 PM, Gregory Farnum wrote: >> On Fri, Apr 12, 2013 at 12:41 PM, Jim Schutt wrote: >>> Hi Greg, >>> >>> >>> On 04/10/2013 06:39 PM, Gregory Farnum wrote: >>>> Jim, >>>> I took this patch as a base for setting up config options which people >>>> can tune manually and have pushed those changes to wip-leveldb-config. >>> >>> I was out of the office unexpectedly for a few days, >>> so I'm just now taking a look. >>> >>>> Thanks very much for figuring out how to set up the cache et al! >>> >>> No problem! >>> >>>> >>>> For now I restructured quite a bit of the data ingestion, and I took >>>> your defaults for the monitor on the write buffer, block size, and >>>> compression, but I left the cache off. These also don't apply to the >>>> OSDs at all. In order to enable more experimentation I do pass through >>>> the options though: >>>> OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's >>>> leveldb write buffer size >>>> OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size >>>> OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb >>>> block size >>>> OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per >>>> entry >>>> OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open >>>> files >>>> OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses >>>> compression >>>> (and similar ones for osd_ldb_*). >> >> On request from Sage these are now "*_leveldb_*" instead of "*_ldb_*"; >> I pushed that a couple hours ago. In case you haven't already pulled >> down a copy, and so you know for when it gets into mainline and you go >> to adjust it. :) > > I've been testing this over the last several days, both > via the wip-leveldb-config branch, and then the next branch > after that was merged into next. > > I've gotten slightly better startup behavior with mon_ldb_cache_size Oops, now mon_leveldb_cache_size ;) -- Jim > larger than the default 8 MiB in leveldb - I've used 64 MiB and > 256 MiB, and found little to prefer one over the other. Both > seem to be preferable to the default, at least wrt. startup > behavior at the number of PGs (256K) I'm testing. > > Other than that, I've had no trouble with the new tunings. > > I'm sorry for the delay in reporting - I've had a little > hardware trouble, and was trying to be sure that any issues > I've been having were related to that and not these patches. > > -- Jim > >> -Greg >> Software Engineer #42 @ http://inktank.com | http://ceph.com >> >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
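In case it helps anyone else reproduce this, the tuning I've been running amounts to something like the following in ceph.conf (option names per the renamed *_leveldb_* versions; the write buffer, block size, and compression values are just the defaults Greg listed, and the cache size is the one knob I changed):

[mon]
    mon leveldb write buffer size = 33554432   # 32 MiB
    mon leveldb block size = 4194304           # 4 MiB
    mon leveldb compression = false
    mon leveldb cache size = 268435456         # 256 MiB; default is 0, i.e. leveldb's own 8 MiB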
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
Hi Greg, On 04/16/2013 02:18 PM, Gregory Farnum wrote: > On Fri, Apr 12, 2013 at 12:41 PM, Jim Schutt wrote: >> Hi Greg, >> >> >> On 04/10/2013 06:39 PM, Gregory Farnum wrote: >>> Jim, >>> I took this patch as a base for setting up config options which people >>> can tune manually and have pushed those changes to wip-leveldb-config. >> >> I was out of the office unexpectedly for a few days, >> so I'm just now taking a look. >> >>> Thanks very much for figuring out how to set up the cache et al! >> >> No problem! >> >>> >>> For now I restructured quite a bit of the data ingestion, and I took >>> your defaults for the monitor on the write buffer, block size, and >>> compression, but I left the cache off. These also don't apply to the >>> OSDs at all. In order to enable more experimentation I do pass through >>> the options though: >>> OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's >>> leveldb write buffer size >>> OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size >>> OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb block >>> size >>> OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per >>> entry >>> OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open >>> files >>> OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses >>> compression >>> (and similar ones for osd_ldb_*). > > On request from Sage these are now "*_leveldb_*" instead of "*_ldb_*"; > I pushed that a couple hours ago. In case you haven't already pulled > down a copy, and so you know for when it gets into mainline and you go > to adjust it. :) I've been testing this over the last several days, both via the wip-leveldb-config branch, and then the next branch after that was merged into next. I've gotten slightly better startup behavior with mon_ldb_cache_size larger than the default 8 MiB in leveldb - I've used 64 MiB and 256 MiB, and found little to prefer one over the other. Both seem to be preferable to the default, at least wrt. startup behavior at the number of PGs (256K) I'm testing. Other than that, I've had no trouble with the new tunings. I'm sorry for the delay in reporting - I've had a little hardware trouble, and was trying to be sure that any issues I've been having were related to that and not these patches. -- Jim > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
Hi Greg, On 04/10/2013 06:39 PM, Gregory Farnum wrote: > Jim, > I took this patch as a base for setting up config options which people > can tune manually and have pushed those changes to wip-leveldb-config. I was out of the office unexpectedly for a few days, so I'm just now taking a look. > Thanks very much for figuring out how to set up the cache et al! No problem! > > For now I restructured quite a bit of the data ingestion, and I took > your defaults for the monitor on the write buffer, block size, and > compression, but I left the cache off. These also don't apply to the > OSDs at all. In order to enable more experimentation I do pass through > the options though: > OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's > leveldb write buffer size > OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size > OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb block > size > OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per > entry > OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open files > OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses > compression > (and similar ones for osd_ldb_*). > > If you have the opportunity to verify that these patches work for you > (in particular I'm wondering if the OSDs need any more tuning on their > end which was being masked by your global changes) that would be > wonderful. :) I'll be able to take these for a spin next Tuesday, and I'll definitely keep an eye open for any unexpected behavior on the OSDs. Sorry for the delay... I appreciate you taking a look at this, and fixing it up into something that's more appropriate. Thanks --Jim > Thanks, > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
As reported in this thread http://www.spinics.net/lists/ceph-devel/msg13777.html starting in v0.59 a new filesystem with ~55,000 PGs would not start after a period of ~30 minutes. By comparison, the same filesystem configuration would start in ~1 minute for v0.58. The issue is that starting in v0.59, LevelDB is used for the monitor data store. For moderate to large numbers of PGs, the length of a PGStat value stored via LevelDB is best measured in megabytes. The default tunings for LevelDB data blocking seem tuned for values with lengths measured in tens or hundreds of bytes. With the data blocking tuning provided by this patch, here's a comparison of filesystem startup times for v0.57, v0.58, and v0.59: 55,392 PGs 221,568 PGs v0.57 1m 07s9m 42s v0.58 1m 04s 11m 44s v0.59 48s3m 30s Note that this patch turns off LevelDB's compression by default. The block tuning from this patch with compression enabled made no improvement in the new filesystem startup time for v0.59, for either PG count tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB; perhaps that value length interacts poorly with LevelDB's compression at this block size. Signed-off-by: Jim Schutt --- src/common/config_opts.h |4 src/os/LevelDBStore.cc |9 + src/os/LevelDBStore.h|3 +++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/src/common/config_opts.h b/src/common/config_opts.h index 9d42961..e8f491e 100644 --- a/src/common/config_opts.h +++ b/src/common/config_opts.h @@ -181,6 +181,10 @@ OPTION(paxos_propose_interval, OPT_DOUBLE, 1.0) // gather updates for this long OPTION(paxos_min_wait, OPT_DOUBLE, 0.05) // min time to gather updates for after period of inactivity OPTION(paxos_trim_tolerance, OPT_INT, 30) // number of extra proposals tolerated before trimming OPTION(paxos_trim_disabled_max_versions, OPT_INT, 100) // maximum amount of versions we shall allow passing by without trimming +OPTION(leveldb_block_size, OPT_U64, 4 * 1024 * 1024) // leveldb unit of caching, compression (in bytes) +OPTION(leveldb_write_buffer_size, OPT_U64, 32 * 1024 * 1024) // leveldb unit of I/O (in bytes) +OPTION(leveldb_cache_size, OPT_U64, 256 * 1024 * 1024) // leveldb data cache size (in bytes) +OPTION(leveldb_compression_enabled, OPT_BOOL, false) OPTION(clock_offset, OPT_DOUBLE, 0) // how much to offset the system clock in Clock.cc OPTION(auth_cluster_required, OPT_STR, "cephx") // required of mon, mds, osd daemons OPTION(auth_service_required, OPT_STR, "cephx") // required by daemons of clients diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc index 3d94096..0d41564 100644 --- a/src/os/LevelDBStore.cc +++ b/src/os/LevelDBStore.cc @@ -14,13 +14,22 @@ using std::string; int LevelDBStore::init(ostream &out, bool create_if_missing) { + db_cache = leveldb::NewLRUCache(g_conf->leveldb_cache_size); + leveldb::Options options; options.create_if_missing = create_if_missing; + options.write_buffer_size = g_conf->leveldb_write_buffer_size; + options.block_size = g_conf->leveldb_block_size; + options.block_cache = db_cache; + if (!g_conf->leveldb_compression_enabled) +options.compression = leveldb::kNoCompression; leveldb::DB *_db; leveldb::Status status = leveldb::DB::Open(options, path, &_db); db.reset(_db); if (!status.ok()) { out << status.ToString() << std::endl; +delete db_cache; +db_cache = NULL; return -EINVAL; } else return 0; diff --git a/src/os/LevelDBStore.h b/src/os/LevelDBStore.h index 7f0e154..8199a41 100644 --- a/src/os/LevelDBStore.h +++ b/src/os/LevelDBStore.h @@ -14,18 +14,21 @@ #include 
"leveldb/db.h" #include "leveldb/write_batch.h" #include "leveldb/slice.h" +#include "leveldb/cache.h" /** * Uses LevelDB to implement the KeyValueDB interface */ class LevelDBStore : public KeyValueDB { string path; + leveldb::Cache *db_cache; boost::scoped_ptr db; int init(ostream &out, bool create_if_missing); public: LevelDBStore(const string &path) : path(path) {} + ~LevelDBStore() { delete db_cache; } /// Opens underlying db int open(ostream &out) { -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/05/2013 09:06 AM, Sage Weil wrote: > On Fri, 5 Apr 2013, Jim Schutt wrote: >> On 04/04/2013 10:44 PM, Sage Weil wrote: >>> Fantastic work tracking this down, Jim! >>> >>> Looking at the Riak docs on tuning leveldb, it looks like a large write >>> buffer size is definitely a good idea. The block size of 4MB is >>> significantly larger than what they recommend, though.. if we go this big >>> we also need to make the cache size larger (it defaults to 8MB?). >> >> I wondered about that, but the change in performance with >> just these simple changes was so drastic that I thought >> it was a good immediate change, with further tuning to >> follow. >> >> Although, I just tested startup, not actually running such >> a filesystem. Perhaps I missed the effects of insufficient >> caching. >> >> I ended up picking 4 MB because the 55K PG case has PGStat >> values that are 20 MB in length, so it still takes 5 blocks >> to store one. Plus, I'd really like to be running at 256K >> or 512K PGs to get the data uniformity across OSDs that I'm >> after. >> >> Perhaps these need to be config options - I just hated to add >> more, when it would be hard for a user to discover that's the >> thing that needed tuning to solve some particular performance >> issue. I'm hoping we can find one set of values that work well >> for everyone. > > I think the best option is to figure out the defaults that will work for > everyone, but still make config options. I suspect there will be users > that need to tune for less memory. > > leveldb_block_size, leveldb_write_buffer_size, etc. > >>> Did you >>> try with a large write buffer but a smaller block size (like 256K or >>> 512K)? >> >> I did try a 256K block size with an 8 MB write buffer, also with >> no compression. That caused the 55K PGs case to make much more >> progress towards starting, but it still failed to come up - I >> forget the details of what went awry. >> >>> >>> I think either a larger cache or a smaller block size is okay, but 4MB >>> with an 8MB cache means only 2 blocks cached, which sounds non-ideal. >> >> I can try fixing up a larger cache - I'll need to dig in >> a little to figure out how to do it, so it might take me >> a little while. How big do you think it should be, given >> that the large PG count cases I'm after might have PGStat >> data lengths that are many tens of MB? >> >> 64 MB? 256 MB? Tunable? > > The default is only 8MB, but this can safely go up to several gigs. I > think 256 MB sounds like a reasonable default... > > Sam, want to weigh in? > >> Also, I wondered about whether the writes need to be >> sync'd? Do these tunings change your mind about whether >> that's needed? > > The sync parameters shouldn't need to be changed. os/FileStore.cc is > calling the sync when we do an overall filestore sync prior to a commit > point or btrfs snapshot. OK. Now that I've had a look, setting up the cache should be easy - I'm working on it. I'll fix up v2 of the patch with that, and options for the tunables. Thanks -- Jim > > sage > > >> >>> >>> Thanks! >>> sage >>> >>> >>> On Thu, 4 Apr 2013, Jim Schutt wrote: >>> >>>> As reported in this thread >>>>http://www.spinics.net/lists/ceph-devel/msg13777.html >>>> starting in v0.59 a new filesystem with ~55,000 PGs would not start after >>>> a period of ~30 minutes. By comparison, the same filesystem configuration >>>> would start in ~1 minute for v0.58. >>>> >>>> The issue is that starting in v0.59, LevelDB is used for the monitor >>>> data store. 
For moderate to large numbers of PGs, the length of a PGStat >>>> value >>>> stored via LevelDB is best measured in megabytes. The default tunings for >>>> LevelDB data blocking seem tuned for values with lengths measured in tens >>>> or >>>> hundreds of bytes. >>>> >>>> With the data blocking tuning provided by this patch, here's a comparison >>>> of filesystem startup times for v0.57, v0.58, and v0.59: >>>> >>>> 55,392 PGs 221,568 PGs >>>> v0.57 1m 07s9m 42s >>>> v0.58 1m 04s 11m 44s >>>> v0.59 45s4m 17s >>>> >>>> Note that this patch turns off LevelDB's compre
Re: [PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/04/2013 10:44 PM, Sage Weil wrote: > Fantastic work tracking this down, Jim! > > Looking at the Riak docs on tuning leveldb, it looks like a large write > buffer size is definitely a good idea. The block size of 4MB is > significantly larger than what they recommend, though.. if we go this big > we also need to make the cache size larger (it defaults to 8MB?). I wondered about that, but the change in performance with just these simple changes was so drastic that I thought it was a good immediate change, with further tuning to follow. Although, I just tested startup, not actually running such a filesystem. Perhaps I missed the effects of insufficient caching. I ended up picking 4 MB because the 55K PG case has PGStat values that are 20 MB in length, so it still takes 5 blocks to store one. Plus, I'd really like to be running at 256K or 512K PGs to get the data uniformity across OSDs that I'm after. Perhaps these need to be config options - I just hated to add more, when it would be hard for a user to discover that's the thing that needed tuning to solve some particular performance issue. I'm hoping we can find one set of values that work well for everyone. > Did you > try with a large write buffer but a smaller block size (like 256K or > 512K)? I did try a 256K block size with an 8 MB write buffer, also with no compression. That caused the 55K PGs case to make much more progress towards starting, but it still failed to come up - I forget the details of what went awry. > > I think either a larger cache or a smaller block size is okay, but 4MB > with an 8MB cache means only 2 blocks cached, which sounds non-ideal. I can try fixing up a larger cache - I'll need to dig in a little to figure out how to do it, so it might take me a little while. How big do you think it should be, given that the large PG count cases I'm after might have PGStat data lengths that are many tens of MB? 64 MB? 256 MB? Tunable? Also, I wondered about whether the writes need to be sync'd? Do these tunings change your mind about whether that's needed? > > Thanks! > sage > > > On Thu, 4 Apr 2013, Jim Schutt wrote: > >> As reported in this thread >>http://www.spinics.net/lists/ceph-devel/msg13777.html >> starting in v0.59 a new filesystem with ~55,000 PGs would not start after >> a period of ~30 minutes. By comparison, the same filesystem configuration >> would start in ~1 minute for v0.58. >> >> The issue is that starting in v0.59, LevelDB is used for the monitor >> data store. For moderate to large numbers of PGs, the length of a PGStat >> value >> stored via LevelDB is best measured in megabytes. The default tunings for >> LevelDB data blocking seem tuned for values with lengths measured in tens or >> hundreds of bytes. >> >> With the data blocking tuning provided by this patch, here's a comparison >> of filesystem startup times for v0.57, v0.58, and v0.59: >> >> 55,392 PGs 221,568 PGs >> v0.57 1m 07s9m 42s >> v0.58 1m 04s 11m 44s >> v0.59 45s4m 17s >> >> Note that this patch turns off LevelDB's compression. The block >> tuning from this patch with compression enabled made no improvement >> in the new filesystem startup time for v0.59, for either PG count >> tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB; >> perhaps that value length interacts pooly with LevelDB's compression s/pooly/poorly/ Thanks -- Jim >> at this block size. 
>>
>> Signed-off-by: Jim Schutt
>> ---
>>  src/os/LevelDBStore.cc |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc
>> index 3d94096..1b6ae7d 100644
>> --- a/src/os/LevelDBStore.cc
>> +++ b/src/os/LevelDBStore.cc
>> @@ -16,6 +16,9 @@ int LevelDBStore::init(ostream &out, bool create_if_missing)
>>  {
>>    leveldb::Options options;
>>    options.create_if_missing = create_if_missing;
>> +  options.write_buffer_size = 32 * 1024 * 1024;
>> +  options.block_size = 4 * 1024 * 1024;
>> +  options.compression = leveldb::kNoCompression;
>>    leveldb::DB *_db;
>>    leveldb::Status status = leveldb::DB::Open(options, path, &_db);
>>    db.reset(_db);
>> --
>> 1.7.8.2
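For the larger cache Jim says he still needs to dig into, LevelDB sizes its
block cache through an explicit cache object rather than a plain byte count.
A rough sketch of what that could look like alongside the options from the
patch (the 256 MB figure is purely illustrative; the thread leaves the right
size open):

  #include <leveldb/cache.h>
  #include <leveldb/db.h>

  // Sketch only, not the posted patch.  The caller owns the cache object
  // and must keep it alive for the life of the DB, then delete it after
  // the DB is closed.
  leveldb::Options options;
  options.create_if_missing = create_if_missing;
  options.write_buffer_size = 32 * 1024 * 1024;   // as in the patch
  options.block_size = 4 * 1024 * 1024;           // as in the patch
  options.compression = leveldb::kNoCompression;  // as in the patch
  options.block_cache = leveldb::NewLRUCache(256 * 1024 * 1024);  // illustrative size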
[PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
As reported in this thread

  http://www.spinics.net/lists/ceph-devel/msg13777.html

starting in v0.59 a new filesystem with ~55,000 PGs would not start after
a period of ~30 minutes. By comparison, the same filesystem configuration
would start in ~1 minute for v0.58.

The issue is that starting in v0.59, LevelDB is used for the monitor
data store. For moderate to large numbers of PGs, the length of a PGStat
value stored via LevelDB is best measured in megabytes. The default tunings
for LevelDB data blocking seem tuned for values with lengths measured in
tens or hundreds of bytes.

With the data blocking tuning provided by this patch, here's a comparison
of filesystem startup times for v0.57, v0.58, and v0.59:

             55,392 PGs    221,568 PGs
  v0.57      1m 07s         9m 42s
  v0.58      1m 04s        11m 44s
  v0.59         45s         4m 17s

Note that this patch turns off LevelDB's compression. The block
tuning from this patch with compression enabled made no improvement
in the new filesystem startup time for v0.59, for either PG count
tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB;
perhaps that value length interacts pooly with LevelDB's compression
at this block size.

Signed-off-by: Jim Schutt
---
 src/os/LevelDBStore.cc |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc
index 3d94096..1b6ae7d 100644
--- a/src/os/LevelDBStore.cc
+++ b/src/os/LevelDBStore.cc
@@ -16,6 +16,9 @@ int LevelDBStore::init(ostream &out, bool create_if_missing)
 {
   leveldb::Options options;
   options.create_if_missing = create_if_missing;
+  options.write_buffer_size = 32 * 1024 * 1024;
+  options.block_size = 4 * 1024 * 1024;
+  options.compression = leveldb::kNoCompression;
   leveldb::DB *_db;
   leveldb::Status status = leveldb::DB::Open(options, path, &_db);
   db.reset(_db);
--
1.7.8.2
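Jim's reply elsewhere in the thread floats making these tunables config
options rather than hard-coded constants. A sketch of roughly what that
could look like; the option names below are invented for illustration and
are not existing ceph options:

  // Hypothetical only -- these g_conf names do not exist in the tree at
  // this point; they just show the shape of replacing the constants.
  options.write_buffer_size = g_conf->mon_leveldb_write_buffer_size;
  options.block_size        = g_conf->mon_leveldb_block_size;
  options.compression       = g_conf->mon_leveldb_compression
                                ? leveldb::kSnappyCompression
                                : leveldb::kNoCompression;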
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 04:40 PM, Jim Schutt wrote:
> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>> without those overloads you were seeing; have you seen this issue
>>>> without your high debugging levels and without the bad PG commits (due
>>>> to debugging)?
>>>
>>> I think so, because that's why I started with higher debugging
>>> levels.
>>>
>>> But, as it turns out, I'm just in the process of returning to my
>>> testing of next, with all my debugging back to 0. So, I'll try
>>> the default timeout of 30 seconds first. If I have trouble starting
>>> up a new file system, I'll turn up the timeout and try again, without
>>> any extra debugging. Either way, I'll let you know what happens.
>> I would be curious to hear roughly what value between 30 and 300 is
>> sufficient, if you can experiment just a bit. We probably want to adjust
>> the default.
>>
>> Perhaps more importantly, we'll need to look at the performance of the pg
>> stat updates on the mon. There is a refactor due in that code that should
>> improve life, but it's slated for dumpling.
> OK, here's some results, with all debugging at 0, using current next...
>
> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
> My test procedure is to start 'ceph -w', start all the OSDs, and once
> they're all running start the mon. I report the time from starting
> the mon to all PGs active+clean.

As a sanity check, to be sure I wasn't doing something differently now than
I remember doing before, I re-ran this test for v0.57, v0.58, and v0.59,
using default 'osd mon ack timeout', default 'paxos propose interval', and
no debugging:

             55,392 PGs    221,568 PGs
  v0.57      1m 07s         9m 42s
  v0.58      1m 04s        11m 44s
  v0.59      >30m           not attempted

The v0.57/v0.58 runs showed no signs of stress, e.g. no slow op reports,
etc. The v0.59 run behaved as I previously reported, i.e., lots of stale
peers, OSDs wrongly marked down, etc., before I gave up on it.

-- Jim
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/04/2013 08:15 AM, Jim Schutt wrote: > On 04/03/2013 04:51 PM, Gregory Farnum wrote: >> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt wrote: >>> On 04/03/2013 12:25 PM, Sage Weil wrote: >>>>>>>> Sorry, guess I forgot some of the history since this piece at least is >>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues >>>>>>>> without those overloads you were seeing; have you seen this issue >>>>>>>> without your high debugging levels and without the bad PG commits (due >>>>>>>> to debugging)? >>>>>> >>>>>> I think so, because that's why I started with higher debugging >>>>>> levels. >>>>>> >>>>>> But, as it turns out, I'm just in the process of returning to my >>>>>> testing of next, with all my debugging back to 0. So, I'll try >>>>>> the default timeout of 30 seconds first. If I have trouble starting >>>>>> up a new file system, I'll turn up the timeout and try again, without >>>>>> any extra debugging. Either way, I'll let you know what happens. >>>> I would be curious to hear roughly what value between 30 and 300 is >>>> sufficient, if you can experiment just a bit. We probably want to adjust >>>> the default. >>>> >>>> Perhaps more importantly, we'll need to look at the performance of the pg >>>> stat updates on the mon. There is a refactor due in that code that should >>>> improve life, but it's slated for dumpling. >>> >>> OK, here's some results, with all debugging at 0, using current next... >>> >>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts >>> use 10 GbE NICs now. The mon host uses an SSD for the mon data store. >>> My test procedure is to start 'ceph -w', start all the OSDs, and once >>> they're all running start the mon. I report the time from starting >>> the mon to all PGs active+clean. >>> >>> # PGs osd mon ackstartupnotes >>> timeout time >>> --- - >>> 55392 default >30:00 1 >>> 55392300 18:36 2 >>> 55392 60>30:00 3 >>> 55392150>30:00 4 >>> 55392240>30:00 5 >>> 55392300>30:00 2,6 >>> >>> notes: >>> 1) lots of PGs marked stale, OSDs wrongly marked down >>> before I gave up on this case >>> 2) OSDs report lots of slow requests for "pg_notify(...) v4 >>> currently wait for new map" >>> 3) some OSDs wrongly marked down, OSDs report some slow requests >>> for "pg_notify(...) v4 currently wait for new map" >>> before I gave up on this case >>> 4) appeared to be making progress; then an OSD was marked >>> out at ~21 minutes; many more marked out before I >>> gave up on this case >>> 5) some OSD reports of slow requests for "pg_notify", >>> some OSDs wrongly marked down, appeared to be making >>> progress, then stalled; then I gave up on this case >>> 6) retried this case, appeared to be making progress, but >>> after ~18 min stalled at 19701 active+clean, 35691 peering >>> until I gave up >>> >>> Hmmm, I didn't really expect the above results. I ran out of >>> time before attempting an even longer osd mon ack timeout. >>> >>> But either we're on the wrong trail, or 300 is not sufficient. >>> Or, I'm doing something wrong and haven't yet figured out what >>> it is. >>> >>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs, >>> and my memory is a new filesystem started up in ~5 minutes. For >>> that testing I had to increase 'paxos propose interval' to two >>> or three seconds to keep the monitor writeout rate (as measured >>> by vmstat) down to a sustained 50-70 MB/s during start-up. 
>>> >>> That was with a 1 GbE NIC in the mon; the reason I upgraded >>> it was a filesystem with 512K PGs was taking too long to start, >>> and I thought the mon might be network-limited since it had an >>> SSD for the mon data store. >>> >>> For the testing above I used the default '
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 04:51 PM, Gregory Farnum wrote: > On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt wrote: >> On 04/03/2013 12:25 PM, Sage Weil wrote: >>>>>>> Sorry, guess I forgot some of the history since this piece at least is >>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues >>>>>>> without those overloads you were seeing; have you seen this issue >>>>>>> without your high debugging levels and without the bad PG commits (due >>>>>>> to debugging)? >>>>> >>>>> I think so, because that's why I started with higher debugging >>>>> levels. >>>>> >>>>> But, as it turns out, I'm just in the process of returning to my >>>>> testing of next, with all my debugging back to 0. So, I'll try >>>>> the default timeout of 30 seconds first. If I have trouble starting >>>>> up a new file system, I'll turn up the timeout and try again, without >>>>> any extra debugging. Either way, I'll let you know what happens. >>> I would be curious to hear roughly what value between 30 and 300 is >>> sufficient, if you can experiment just a bit. We probably want to adjust >>> the default. >>> >>> Perhaps more importantly, we'll need to look at the performance of the pg >>> stat updates on the mon. There is a refactor due in that code that should >>> improve life, but it's slated for dumpling. >> >> OK, here's some results, with all debugging at 0, using current next... >> >> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts >> use 10 GbE NICs now. The mon host uses an SSD for the mon data store. >> My test procedure is to start 'ceph -w', start all the OSDs, and once >> they're all running start the mon. I report the time from starting >> the mon to all PGs active+clean. >> >> # PGs osd mon ackstartupnotes >> timeout time >> --- - >> 55392 default >30:00 1 >> 55392300 18:36 2 >> 55392 60>30:00 3 >> 55392150>30:00 4 >> 55392240>30:00 5 >> 55392300>30:00 2,6 >> >> notes: >> 1) lots of PGs marked stale, OSDs wrongly marked down >> before I gave up on this case >> 2) OSDs report lots of slow requests for "pg_notify(...) v4 >> currently wait for new map" >> 3) some OSDs wrongly marked down, OSDs report some slow requests >> for "pg_notify(...) v4 currently wait for new map" >> before I gave up on this case >> 4) appeared to be making progress; then an OSD was marked >> out at ~21 minutes; many more marked out before I >> gave up on this case >> 5) some OSD reports of slow requests for "pg_notify", >> some OSDs wrongly marked down, appeared to be making >> progress, then stalled; then I gave up on this case >> 6) retried this case, appeared to be making progress, but >> after ~18 min stalled at 19701 active+clean, 35691 peering >> until I gave up >> >> Hmmm, I didn't really expect the above results. I ran out of >> time before attempting an even longer osd mon ack timeout. >> >> But either we're on the wrong trail, or 300 is not sufficient. >> Or, I'm doing something wrong and haven't yet figured out what >> it is. >> >> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs, >> and my memory is a new filesystem started up in ~5 minutes. For >> that testing I had to increase 'paxos propose interval' to two >> or three seconds to keep the monitor writeout rate (as measured >> by vmstat) down to a sustained 50-70 MB/s during start-up. >> >> That was with a 1 GbE NIC in the mon; the reason I upgraded >> it was a filesystem with 512K PGs was taking too long to start, >> and I thought the mon might be network-limited since it had an >> SSD for the mon data store. 
>> >> For the testing above I used the default 'paxos propose interval'. >> I don't know if it matters, but vmstat sees only a little data >> being written on the mon system. > > That's odd; I'd actually expect to see more going to disk with v0.59 > than previously. Is vmstat actually looking at disk IO, or might it be > missing DirectIO or something? (Not that
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 12:25 PM, Sage Weil wrote:
>>> Sorry, guess I forgot some of the history since this piece at least is
>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>> without those overloads you were seeing; have you seen this issue
>>> without your high debugging levels and without the bad PG commits (due
>>> to debugging)?
>>
>> I think so, because that's why I started with higher debugging
>> levels.
>>
>> But, as it turns out, I'm just in the process of returning to my
>> testing of next, with all my debugging back to 0. So, I'll try
>> the default timeout of 30 seconds first. If I have trouble starting
>> up a new file system, I'll turn up the timeout and try again, without
>> any extra debugging. Either way, I'll let you know what happens.
> I would be curious to hear roughly what value between 30 and 300 is
> sufficient, if you can experiment just a bit. We probably want to adjust
> the default.
>
> Perhaps more importantly, we'll need to look at the performance of the pg
> stat updates on the mon. There is a refactor due in that code that should
> improve life, but it's slated for dumpling.

OK, here's some results, with all debugging at 0, using current next...

My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
My test procedure is to start 'ceph -w', start all the OSDs, and once
they're all running start the mon. I report the time from starting
the mon to all PGs active+clean.

  # PGs    osd mon ack    startup    notes
           timeout        time
  -----    -----------    -------    -----
  55392    default        >30:00     1
  55392    300             18:36     2
  55392     60            >30:00     3
  55392    150            >30:00     4
  55392    240            >30:00     5
  55392    300            >30:00     2,6

notes:
  1) lots of PGs marked stale, OSDs wrongly marked down
     before I gave up on this case
  2) OSDs report lots of slow requests for "pg_notify(...) v4
     currently wait for new map"
  3) some OSDs wrongly marked down, OSDs report some slow requests
     for "pg_notify(...) v4 currently wait for new map"
     before I gave up on this case
  4) appeared to be making progress; then an OSD was marked
     out at ~21 minutes; many more marked out before I
     gave up on this case
  5) some OSD reports of slow requests for "pg_notify",
     some OSDs wrongly marked down, appeared to be making
     progress, then stalled; then I gave up on this case
  6) retried this case, appeared to be making progress, but
     after ~18 min stalled at 19701 active+clean, 35691 peering
     until I gave up

Hmmm, I didn't really expect the above results. I ran out of
time before attempting an even longer osd mon ack timeout.

But either we're on the wrong trail, or 300 is not sufficient.
Or, I'm doing something wrong and haven't yet figured out what
it is.

FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
and my memory is a new filesystem started up in ~5 minutes. For
that testing I had to increase 'paxos propose interval' to two
or three seconds to keep the monitor writeout rate (as measured
by vmstat) down to a sustained 50-70 MB/s during start-up.

That was with a 1 GbE NIC in the mon; the reason I upgraded
it was a filesystem with 512K PGs was taking too long to start,
and I thought the mon might be network-limited since it had an
SSD for the mon data store.

For the testing above I used the default 'paxos propose interval'.
I don't know if it matters, but vmstat sees only a little data
being written on the mon system.
-- Jim

> sage
>
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
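For reference, both knobs mentioned in these runs are ordinary ceph.conf
settings. A sketch of the overrides being experimented with above; the
section placement reflects where I would expect them to be read ('osd mon
ack timeout' by the OSDs, 'paxos propose interval' by the monitor), and the
values only mirror the experiments, not recommendations:

  [osd]
      ; how long an OSD waits for the mon to ack PGStats before
      ; reconnecting elsewhere
      osd mon ack timeout = 300

  [mon]
      ; batch pgmap updates into fewer, larger paxos proposals
      paxos propose interval = 2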
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 11:49 AM, Gregory Farnum wrote: > On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum wrote: >> On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt wrote: >>> Hi Sage, >>> >>> On 04/03/2013 09:58 AM, Sage Weil wrote: >>>> Hi Jim, >>>> >>>> What happens if you change 'osd mon ack timeout = 300' (from the >>>> default of 30)? I suspect part of the problem is that the mons are just >>>> slow enough that the osd's resend the same thing again and it snowballs >>>> into more work for the monitor. >>> >>> Thanks, that helped. My OSDs aren't reconnecting to the mon any more, >>> and the new filesystem started up as expected. >>> >>> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at >>> about the same time I started testing v0.59. Perhaps before the upgrade >>> I was running right at the edge of that timeout. After the NIC upgrade >>> the PGStat messages come flooding in at startup, and they bunch up >>> enough that working through the backlog pushed me over the timeout cliff? >>> >>> Is there any downside to using a large 'osd mon ack timeout', assuming I >>> run more than one mon? If so, I expect I'll work my way back from >>> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable >>> for my configuration. >> >> It's a timeout, so the generic downsides to larger timeouts — if the >> monitor actually has gone away it's going to take the OSDs more time >> to connect to somebody else for their updates and reports. This will >> probably be most apparent if they're trying to peer and can't make >> progress until they get acks from the monitors, but the one they're >> connected to has died. >> >> >>> Sorry for the noise about paxos. At least it was useful >>> to help Joao find that debug log message that was more expensive >>> than expected >> >> It's not noise — the reason this timeout is causing problems now is >> that the monitor disk commits are taking so long that it looks like >> they've failed. Which is bad. :/ So thanks for reporting it! > > Sorry, guess I forgot some of the history since this piece at least is > resolved now. I'm surprised if 30-second timeouts are causing issues > without those overloads you were seeing; have you seen this issue > without your high debugging levels and without the bad PG commits (due > to debugging)? I think so, because that's why I started with higher debugging levels. But, as it turns out, I'm just in the process of returning to my testing of next, with all my debugging back to 0. So, I'll try the default timeout of 30 seconds first. If I have trouble starting up a new file system, I'll turn up the timeout and try again, without any extra debugging. Either way, I'll let you know what happens. -- Jim > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Trouble getting a new file system to start, for v0.59 and newer
Hi Sage, On 04/03/2013 09:58 AM, Sage Weil wrote: > Hi Jim, > > What happens if you change 'osd mon ack timeout = 300' (from the > default of 30)? I suspect part of the problem is that the mons are just > slow enough that the osd's resend the same thing again and it snowballs > into more work for the monitor. Thanks, that helped. My OSDs aren't reconnecting to the mon any more, and the new filesystem started up as expected. Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at about the same time I started testing v0.59. Perhaps before the upgrade I was running right at the edge of that timeout. After the NIC upgrade the PGStat messages come flooding in at startup, and they bunch up enough that working through the backlog pushed me over the timeout cliff? Is there any downside to using a large 'osd mon ack timeout', assuming I run more than one mon? If so, I expect I'll work my way back from 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable for my configuration. Sorry for the noise about paxos. At least it was useful to help Joao find that debug log message that was more expensive than expected Thanks -- Jim > > sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Trouble getting a new file system to start, for v0.59 and newer
Hi Joao, I alluded in an earlier thread about an issue I've been recently having with starting a new filesystem, which I thought I had tracked into the paxos subsystem. I believe I started having this trouble when I started testing v0.59, and it's still there in v0.60. The basic configuration is 1 mon, 1 mds, 576 OSDs. Some time after starting the mon and the OSDs, PG creation progress stops. I'm finding this in my OSD logs, which I've been thinking is suggestive of why OSDs stop making progress: # cat osd.0.log | grep "acked PGStats" 2013-04-02 15:40:28.944178 7387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere 2013-04-02 15:42:24.570963 7387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere 2013-04-02 15:43:22.664921 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere 2013-04-02 15:43:52.672461 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere 2013-04-02 15:44:22.680691 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere 2013-04-02 15:44:52.688936 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere 2013-04-02 15:45:22.702066 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere 2013-04-02 15:45:52.711134 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere 2013-04-02 15:46:22.719845 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere 2013-04-02 15:46:52.729475 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere I also thought paxos might be implicated because I noticed the auth paxos service appears to stop making progress. The mon is getting lots of auth paxos service messages from all the OSDs: # cat mon.cs31.log | grep paxos | grep "dispatch auth" | head -1 2013-04-02 15:38:59.995457 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 30 bytes epoch 1) v1 from client.? 
172.17.131.29:0/52585 # cat mon.cs31.log | grep paxos | grep "dispatch auth" | wc -l 8802 but at some point the auth paxos service appears to stop making progress: # cat mon.cs31.log | grep paxos | grep -v "dispatch auth" | grep auth 2013-04-02 15:38:42.895202 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:42.905705 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.182158 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.192809 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.290602 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished 2013-04-02 15:38:59.290608 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.300707 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.308260 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.329153 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:38:59.390988 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.421164 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.427781 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.448001 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished 2013-04-02 15:38:59.448027 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.453691 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.459446 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.500315 7328d700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:38:59.569638 7328d700 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.593526 7328d700 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.599250 7328d700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:39:39.390156 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:39:39.435906 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:39:39.456743 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:39:39.462498 72a8c700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending even as it continues to receive auth paxos service messages: # cat mon.cs31.log | grep paxos | grep -A 20 "15:39:39.462498" | grep "dispatch auth" | wc -l 8799 for some time: # cat mon.cs31.log | grep paxos | grep "dispatch auth" | tail -2 2013-04-02 15:46:56.090494 7ff
Re: Trouble with paxos service for large PG count
On 04/02/2013 01:16 PM, Jim Schutt wrote: > On 04/02/2013 12:28 PM, Joao Luis wrote: >> Right. I'll push a patch to bump that sort of output to 30 when I get home. > > Thanks - but FWIW, I don't think it's the root cause of my > issue -- more below OK, I see now that you're talking about the output from list_proposals(). Indeed, when I cause list_proposals() in propose_queued() to only be called at paxos debug 30, the stall in finish_queued_proposal() mostly goes away. Now, with 55392 PGs I'm seeing this: 2013-04-02 14:21:19.401820 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-02 14:21:19.535731 7328d700 5 mon.cs31@0(leader).paxos(paxos active c 1..33) queue_proposal bl 19870402 bytes; ctx = 0x11ecff0 2013-04-02 14:21:19.535917 7328d700 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..33) propose_queued 34 19870402 bytes 2013-04-02 14:21:19.535925 7328d700 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..33) begin for 34 19870402 bytes 2013-04-02 14:21:19.754326 7328d700 10 mon.cs31@0(leader).paxos(paxos updating c 1..33) commit 34 2013-04-02 14:21:19.975747 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal finishing proposal 2013-04-02 14:21:19.975799 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal finish it (proposal = 0x790c0c0) 2013-04-02 14:21:19.975802 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal proposal took 0.439887 to finish 2013-04-02 14:21:19.975913 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) _active >> >> If you're willing, try reducing the paxos debug level to 0 and let us know >> if those delays in the proposal go away. > > When I started debugging why I couldn't start new filesystems, several > days ago, I wasn't using any debugging. It took me until last Friday > to track the issue into paxos, which is when I turned on its debugging. > With that list_proposals() debug output issue out of the way, I'm still having trouble starting up a new filesystem -- still working to try to isolate the source of the issue Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Trouble with paxos service for large PG count
On 04/02/2013 12:28 PM, Joao Luis wrote: > Right. I'll push a patch to bump that sort of output to 30 when I get home. Thanks - but FWIW, I don't think it's the root cause of my issue -- more below > > If you're willing, try reducing the paxos debug level to 0 and let us know > if those delays in the proposal go away. When I started debugging why I couldn't start new filesystems, several days ago, I wasn't using any debugging. It took me until last Friday to track the issue into paxos, which is when I turned on its debugging. Also, I've added the following bits of debug output: diff --git a/src/mon/Paxos.cc b/src/mon/Paxos.cc index bf3eb92..70a8b5d 100644 --- a/src/mon/Paxos.cc +++ b/src/mon/Paxos.cc @@ -494,6 +494,8 @@ void Paxos::begin(bufferlist& v) MonitorDBStore::Transaction t; t.put(get_name(), last_committed+1, new_value); + utime_t tstart = ceph_clock_now(g_ceph_context); + dout(30) << __func__ << " transaction dump:\n"; JSONFormatter f(true); t.dump(&f); @@ -506,7 +508,14 @@ void Paxos::begin(bufferlist& v) f.flush(*_dout); *_dout << dendl; + dout(10) << __func__ << " - JSON formatting, debug_tx decode took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + + tstart = ceph_clock_now(g_ceph_context); get_store()->apply_transaction(t); + dout(10) << __func__ << " - get_store()->apply_transaction took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + if (mon->get_quorum().size() == 1) { // we're alone, take it easy diff --git a/src/mon/PaxosService.cc b/src/mon/PaxosService.cc index a66c5ec..808ce62 100644 --- a/src/mon/PaxosService.cc +++ b/src/mon/PaxosService.cc @@ -158,6 +158,8 @@ void PaxosService::propose_pending() encode_pending(&t); have_pending = false; + utime_t tstart = ceph_clock_now(g_ceph_context); + dout(30) << __func__ << " transaction dump:\n"; JSONFormatter f(true); t.dump(&f); @@ -166,6 +168,9 @@ void PaxosService::propose_pending() t.encode(bl); + dout(10) << __func__ << " - JSON formatting, encode took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + // apply to paxos proposing.set(1); paxos->propose_new_value(bl, new C_Committed(this)); which produces this: 2013-04-02 13:04:22.311743 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-02 13:04:22.426810 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending - JSON formatting, encode took 0.000170 2013-04-02 13:04:22.426874 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 19775198 bytes; ctx = 0x11e81f0 2013-04-02 13:04:22.426999 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 19775198 bytes 2013-04-02 13:04:22.427003 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-02 13:07:51.190312 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes 2013-04-02 13:07:51.190860 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) begin - JSON formatting, debug_tx decode took 0.02 2013-04-02 13:07:51.402134 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) begin - get_store()->apply_transaction took 0.211275 2013-04-02 13:07:51.402178 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-02 13:07:51.635113 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-02 13:07:51.635152 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1261340) 2013-04-02 
13:07:51.635159 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 209.208162 to finish 2013-04-02 13:07:51.635193 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active I thought I was timing the same JSON formatting in propose_pending() and begin(), but they take much different amounts of time. I don't understand why yet. Also, I just noticed the elapsed time between these two lines: 2013-04-02 13:04:22.427003 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-02 13:07:51.190312 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes It accounts for nearly the whole of the delay that finish_queued_proposal() is reporting. Why did begin() take so long to start, I wonder ??? -- Jim > > -Joao -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
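One note on why the two timings Jim added behave so differently: in the
ceph logging macros, everything between dout(v) and dendl is only evaluated
when the subsystem is actually gathering at level v. That is consistent
with the level-30 transaction dump in begin() costing almost nothing at
'debug paxos = 10', while the level-10 list_proposals output in
propose_queued() accounts for the multi-minute gap between the
"list_proposals 1 in queue:" line and "begin". Under that reading, bumping
the bulky dump to a higher level is enough to avoid both the formatting and
the log write at ordinary debug levels; a sketch of the level-30 form,
reusing the pattern from the patch above rather than quoting any actual fix:

  // Sketch: at 'debug paxos = 10' none of this runs, so a ~20 MB proposal
  // is neither formatted nor written to the log.
  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t.dump(&f);
  f.flush(*_dout);
  *_dout << dendl;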
Re: Trouble with paxos service for large PG count
On 04/02/2013 09:42 AM, Joao Eduardo Luis wrote: > On 04/01/2013 10:14 PM, Jim Schutt wrote: >> Hi, >> >> I've been having trouble starting a new file system >> created using the current next branch (most recently, >> commit 3b5f663f11). >> >> I believe the trouble is related to how long it takes paxos to >> process a pgmap proposal. >> >> For a configuration with 1 mon, 1 mds, and 576 osds, using >> pg_bits = 3 and debug paxos = 10, if I start just the monitor, >> here's what I get when paxos processes the first non-trivial >> pgmap proposal: >> > > Just noticed one other thing. With 'debug paxos = 10', you should have a > whole bunch of output (the proposal's dump) after this: > >> 2013-04-01 14:04:16.330735 77fbe780 10 >> mon.cs31@0(leader).paxosservice(pgmap) propose_pending >> 2013-04-01 14:04:16.358973 77fbe780 5 mon.cs31@0(leader).paxos(paxos >> active c 1..3) queue_proposal bl 4943990 bytes; ctx = 0x11e81f0 >> 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) propose_queued 4 4943990 bytes >> 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) propose_queued list_proposals 1 in queue: > > and before this: > >> 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) begin for 4 4943990 bytes > > for every snippet you sent on your previous email. > > The code responsible for that shouldn't ever have made into master, and > should be to blame for a great deal of the time spent. > > Jim, can you confirm such output is present? Yep, here's what I see: 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 4943990 bytes 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: -- entry #0 proposed queued 0.08 tx dump: { "ops": [ { "op_num": 0, "type": "PUT", "prefix": "pgmap", "key": "full_1", "length": 46, "bl": " : 05 04 28 00 00 00 01 00 00 00 00 00 00 00 00 00 : ..(.\n0010 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 33 33 : ..33\n0020 : 73 3f 9a 99 59 3f 40 e8 59 51 90 f0 a4 07 : s?..Y?@.YQ\n"}, { "op_num": 1, "type": "PUT", "prefix": "pgmap", "key": "full_latest", "length": 8, "bl": " : 01 00 00 00 00 00 00 00 : \n"}, { "op_num": 2, "type": "PUT", "prefix": "pgmap", "key": "2", "length": 4943790, "bl": " : 06 05 a8 6f 4b 00 02 00 00 00 00 00 00 00 18 36 : ...oK..6 [snipped lots more of the bufferlist, then] { "op_num": 3, "type": "PUT", "prefix": "pgmap", "key": "last_committed", "length": 8, "bl": " : 02 00 00 00 00 00 00 00 : \n"}]} 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 4943990 bytes As you suggested, I see that sort of thing for every "propose_queued list_proposals" in my logs -- Jim > > > -Joao > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Trouble with paxos service for large PG count
Hi, I've been having trouble starting a new file system created using the current next branch (most recently, commit 3b5f663f11). I believe the trouble is related to how long it takes paxos to process a pgmap proposal. For a configuration with 1 mon, 1 mds, and 576 osds, using pg_bits = 3 and debug paxos = 10, if I start just the monitor, here's what I get when paxos processes the first non-trivial pgmap proposal: 2013-04-01 14:04:16.330735 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:04:16.358973 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 4943990 bytes; ctx = 0x11e81f0 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 4943990 bytes 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 4943990 bytes 2013-04-01 14:04:28.164876 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:04:28.226789 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:04:28.226885 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:04:28.226892 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 11.867914 to finish 2013-04-01 14:04:28.226941 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active When I redo the test with pg_bits = 4 (27696 PGs total) I get this: 2013-04-01 14:25:16.993255 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:25:17.051729 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 9887726 bytes; ctx = 0x11e81f0 2013-04-01 14:25:17.051882 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 9887726 bytes 2013-04-01 14:25:17.051888 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:26:08.185143 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 9887726 bytes 2013-04-01 14:26:08.296351 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:26:08.408900 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:26:08.408928 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:26:08.408931 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 51.357088 to finish 2013-04-01 14:26:08.408974 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active and at pg_bits = 5 (55392 PGs total) I get this: 2013-04-01 14:36:15.385280 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:36:15.500292 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 19775198 bytes; ctx = 0x11e81f0 2013-04-01 14:36:15.500433 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 19775198 bytes 2013-04-01 14:36:15.500438 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:39:43.645062 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes 
2013-04-01 14:39:43.875099 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:39:44.086097 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:39:44.086128 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:39:44.086136 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 208.585710 to finish 2013-04-01 14:39:44.086156 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active For this last configuration, after collecting the above I waited a bit, started all the OSDs, waited a bit longer, then collected this: 2013-04-01 14:54:37.364686 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:54:37.433641 7328d700 5 mon.cs31@0(leader).paxos(paxos active c 1..27) queue_proposal bl 10629660 bytes; ctx = 0x11ece50 2013-04-01 14:54:37.433750 7328d700 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..27) propose_queued 28 10629660 bytes 2013-04-01 14:54:37.433755 7328d700 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..27) propose_queued list_proposals 1 in queue: 2013-04-01 14:55:38.684532 7328
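A quick back-of-the-envelope on the three single-proposal runs above: the
time to finish a proposal grows much faster than the proposal itself,
roughly with the square of its size, which hints that something super-linear
in the proposal size is being done per proposal rather than just the raw
write:

  pg_bits = 3:   4,943,990 bytes  ->  11.9 s
  pg_bits = 4:   9,887,726 bytes  ->  51.4 s   (2x the bytes, ~4.3x the time)
  pg_bits = 5:  19,775,198 bytes  -> 208.6 s   (4x the bytes, ~17.6x the time)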
Re: CephFS Space Accounting and Quotas
On 03/15/2013 05:17 PM, Greg Farnum wrote: > [Putting list back on cc] > > On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote: > >> On 03/15/2013 04:23 PM, Greg Farnum wrote: >>> As I come back and look at these again, I'm not sure what the context >>> for these logs is. Which test did they come from, and which behavior >>> (slow or not slow, etc) did you see? :) -Greg >> >> >> >> They come from a test where I had debug mds = 20 and debug ms = 1 >> on the MDS while writing files from 198 clients. It turns out that >> for some reason I need debug mds = 20 during writing to reproduce >> the slow stat behavior later. >> >> strace.find.dirs.txt.bz2 contains the log of running >> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls >> -lhd {} \; >> >> From that output, I believe that the stat of at least these files is slow: >> zero0.rc11 >> zero0.rc30 >> zero0.rc46 >> zero0.rc8 >> zero0.tc103 >> zero0.tc105 >> zero0.tc106 >> I believe that log shows slow stats on more files, but those are the first >> few. >> >> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the >> find command started, until just after the fifth or sixth slow stat from >> the list above. >> >> I haven't yet tried to find other ways of reproducing this, but so far >> it appears that something happens during the writing of the files that >> ends up causing the condition that results in slow stat commands. >> >> I have the full MDS log from the writing of the files, as well, but it's >> big >> >> Is that what you were after? >> >> Thanks for taking a look! >> >> -- Jim > > I just was coming back to these to see what new information was > available, but I realized we'd discussed several tests and I wasn't > sure what these ones came from. That information is enough, yes. > > If in fact you believe you've only seen this with high-level MDS > debugging, I believe the cause is as I mentioned last time: the MDS > is flapping a bit and so some files get marked as "needsrecover", but > they aren't getting recovered asynchronously, and the first thing > that pokes them into doing a recover is the stat. OK, that makes sense. > That's definitely not the behavior we want and so I'll be poking > around the code a bit and generating bugs, but given that explanation > it's a bit less scary than random slow stats are so it's not such a > high priority. :) Do let me know if you come across it without the > MDS and clients having had connection issues! No problem - thanks! -- Jim > -Greg > > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/11/2013 02:40 PM, Jim Schutt wrote: >>> >> If you want I can attempt to duplicate my memory of the first >>> >> test I reported, writing the files today and doing the strace >>> >> tomorrow (with timestamps, this time). >>> >> >>> >> Also, would it be helpful to write the files with minimal logging, in >>> >> hopes of inducing minimal timing changes, then upping the logging >>> >> for the stat phase? >> > >> > Well that would give us better odds of not introducing failures of >> > any kind during the write phase, and then getting accurate >> > information on what's happening during the stats, so it probably >> > would. Basically I'd like as much logging as possible without >> > changing the states they system goes through. ;) > Hmmm, this is getting more interesting... > > I just did two complete trials where I built a file system, > did two sets of writes with minimal MDS logging, then > turned MDS logging up to 20 with MDS ms logging at 1 for > the stat phase. > > In each trial my strace'd find finished in < 10 seconds, > and there were no slow stat calls (they were taking ~19 ms > each). > > I'm going to do a third trial where I let things rest > overnight, after I write the files. That delay is the > only thing I haven't reproduced from my first trial As you suspected, that didn't make any difference either... That trial didn't reproduce the slow stat behavior. It turns out that the only way I can reliably reproduce the slow stat behavior right now is to turn MDS debugging up to 20. I've got another set of logs with MDS debug ms = 1, which I'll send via an off-list email. I still haven't figured out what made that first test exhibit this behavior, when I was using debug mds = 5. -- Jim > > -- Jim > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/11/2013 10:57 AM, Greg Farnum wrote: > On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote: >> On 03/11/2013 09:48 AM, Greg Farnum wrote: >>> On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: >>>> >>>> For this run, the MDS logging slowed it down enough to cause the >>>> client caps to occasionally go stale. I don't think it's the cause >>>> of the issue, because I was having it before I turned MDS debugging >>>> up. My client caps never go stale at, e.g., debug mds 5. >>> >>> >>> >>> Oh, so this might be behaviorally different than you were seeing before? >>> Drat. >>> >>> You had said before that each newfstatat was taking tens of seconds, >>> whereas in the strace log you sent along most of the individual calls >>> were taking a bit less than 20 milliseconds. Do you have an strace of >>> them individually taking much more than that, or were you just >>> noticing that they took a long time in aggregate? >> >> >> >> When I did the first strace, I didn't turn on timestamps, and I was >> watching it scroll by. I saw several stats in a row take ~30 secs, >> at which point I got bored, and took a look at the strace man page to >> figure out how to get timestamps ;) >> >> Also, another difference is for that test, I was looking at files >> I had written the day before, whereas for the strace log I sent, >> there was only several minutes between writing and the strace of find. >> >> I thought I had eliminated the page cache issue by using fdatasync >> when writing the files. Perhaps the real issue is affected by that >> delay? > > I'm not sure. I can't think of any mechanism by which waiting longer > would increase the time lags, though, so I doubt it. > >>> I suppose if you were going to run it again then just the message >>> logging could also be helpful. That way we could at least check and >>> see the message delays and if the MDS is doing other work in the >>> course of answering a request. >> >> >> >> I can do as many trials as needed to isolate the issue. >> >> What message debugging level is sufficient on the MDS; 1? > Yep, that will capture all incoming and outgoing messages. :) >> >> If you want I can attempt to duplicate my memory of the first >> test I reported, writing the files today and doing the strace >> tomorrow (with timestamps, this time). >> >> Also, would it be helpful to write the files with minimal logging, in >> hopes of inducing minimal timing changes, then upping the logging >> for the stat phase? > > Well that would give us better odds of not introducing failures of > any kind during the write phase, and then getting accurate > information on what's happening during the stats, so it probably > would. Basically I'd like as much logging as possible without > changing the states they system goes through. ;) Hmmm, this is getting more interesting... I just did two complete trials where I built a file system, did two sets of writes with minimal MDS logging, then turned MDS logging up to 20 with MDS ms logging at 1 for the stat phase. In each trial my strace'd find finished in < 10 seconds, and there were no slow stat calls (they were taking ~19 ms each). I'm going to do a third trial where I let things rest overnight, after I write the files. That delay is the only thing I haven't reproduced from my first trial -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
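One small tooling note for the individual-versus-aggregate question above:
strace can also report the elapsed time of each call with -T, alongside the
-tt timestamps, which makes a single slow newfstatat stand out without any
arithmetic. For example, using the same invocation as elsewhere in this
thread:

  strace -tt -T -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;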
Re: Estimating OSD memory requirements (was Re: stuff for v0.56.4)
Hi Bryan, On 03/11/2013 09:10 AM, Bryan K. Wright wrote: > s...@inktank.com said: >> On Thu, 7 Mar 2013, Bryan K. Wright wrote: >> >> s...@inktank.com said: >>> - pg log trimming (probably a conservative subset) to avoid memory bloat >> >> Anything that reduces the size of OSD processes would be appreciated. >> You can probably do this with just >> log max recent = 1000 >> By default it's keeping 100k lines of logs in memory, which can eat a lot of >> ram (but is great when debugging issues). > > Thanks for the tip about "log max recent". I've made this > change, but it doesn't seem to significantly reduce the size of the > OSD processes. > > In general, are there some rules of thumb for estimated the > memory requirements for OSDs? I see processes blow up to 8gb of > resident memory sometimes. If I need to allow for that much memory > per OSD process, I may have to just walk away from ceph. > > Does the memory usage scale with the size of the disks? > I've been trying to run 12 OSDs with 12 2TB disks on a single box. > Would I be better off (memory-usage-wise) if I RAIDed the disks > together and used a single OSD process? > > Thanks for any advice. You might also try tuning "osd client message size cap"; its current default is 500 MiB. During the periods your aggregate applied write load is higher than your OSD aggregate write bandwidth (taking into account replicas), you'll be buffering up this amount of client data. Since it only applies to incoming client messages, to figure total memory use I believe you need to multiply that by the number of replicas you're using. FWIW, for sequential writes from lots of clients, I can maintain full write bandwidth with "osd client message size cap" tuned to 60 MiB. -- Jim > > Bryan > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
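For anyone wanting to try the suggestion above, the cap is an ordinary
ceph.conf option; a sketch of the 60 MiB setting Jim mentions (the value is
specified in bytes, as far as I know):

  [osd]
      ; limit the memory used to buffer in-flight client messages
      ; 60 MiB = 62914560 bytes
      osd client message size cap = 62914560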
Re: CephFS Space Accounting and Quotas
On 03/11/2013 09:48 AM, Greg Farnum wrote: > On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: >> On 03/08/2013 07:05 PM, Greg Farnum wrote: >>> On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: >>>> On 03/07/2013 08:15 AM, Jim Schutt wrote: >>>>> On 03/06/2013 05:18 PM, Greg Farnum wrote: >>>>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >>>>> >>>> >>>> >>>> >>>> >>>> >>>> [snip] >>>> >>>>>>> Do you want the MDS log at 10 or 20? >>>>>> >>>>>> More is better. ;) >>>>> >>>>> >>>>> >>>>> >>>>> OK, thanks. >>>> >>>> >>>> I've sent some mds logs via private email... >>>> >>>> -- Jim >>> >>> I'm going to need to probe into this a bit more, but on an initial >>> examination I see that most of your stats are actually happening very >>> quickly — it's just that occasionally they take quite a while. >> >> >> >> Interesting... >> >>> Going >>> through the MDS log for one of those, the inode in question is >>> flagged with "needsrecover" from its first appearance in the log — >>> that really shouldn't happen unless a client had write caps on it and >>> the client disappeared. Any ideas? The slowness is being caused by >>> the MDS going out and looking at every object which could be in the >>> file — there are a lot since the file has a listed size of 8GB. >> >> >> >> For this run, the MDS logging slowed it down enough to cause the >> client caps to occasionally go stale. I don't think it's the cause >> of the issue, because I was having it before I turned MDS debugging >> up. My client caps never go stale at, e.g., debug mds 5. > > Oh, so this might be behaviorally different than you were seeing before? Drat. > > You had said before that each newfstatat was taking tens of seconds, > whereas in the strace log you sent along most of the individual calls > were taking a bit less than 20 milliseconds. Do you have an strace of > them individually taking much more than that, or were you just > noticing that they took a long time in aggregate? When I did the first strace, I didn't turn on timestamps, and I was watching it scroll by. I saw several stats in a row take ~30 secs, at which point I got bored, and took a look at the strace man page to figure out how to get timestamps ;) Also, another difference is for that test, I was looking at files I had written the day before, whereas for the strace log I sent, there was only several minutes between writing and the strace of find. I thought I had eliminated the page cache issue by using fdatasync when writing the files. Perhaps the real issue is affected by that delay? > I suppose if you were going to run it again then just the message > logging could also be helpful. That way we could at least check and > see the message delays and if the MDS is doing other work in the > course of answering a request. I can do as many trials as needed to isolate the issue. What message debugging level is sufficient on the MDS; 1? If you want I can attempt to duplicate my memory of the first test I reported, writing the files today and doing the strace tomorrow (with timestamps, this time). Also, would it be helpful to write the files with minimal logging, in hopes of inducing minimal timing changes, then upping the logging for the stat phase? > >> Otherwise, there were no signs of trouble while writing the files. >> >> Can you suggest which kernel client debugging I might enable that >> would help understand what is happening? Also, I have the full >> MDS log from writing the files, if that will help. It's big (~10 GiB). 
>> >>> (There are several other mysteries here that can probably be traced >>> to different varieties of non-optimal and buggy code as well — there >>> is a client which has write caps on the inode in question despite it >>> needing recovery, but the recovery isn't triggered until the stat >>> event occurs, etc). >> >> >> >> OK, thanks for taking a look. Let me know if there is other >> logging I can enable that will be helpful. > > I'm going to want to spend more time with the log I've got, but I'll think > about if there's a different set of data we can gather less disruptively. OK, cool. Just let me know. Thanks -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/08/2013 07:05 PM, Greg Farnum wrote: > On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: >> On 03/07/2013 08:15 AM, Jim Schutt wrote: >>> On 03/06/2013 05:18 PM, Greg Farnum wrote: >>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >>> >> >> >> >> [snip] >> >>>>> Do you want the MDS log at 10 or 20? >>>> More is better. ;) >>> >>> >>> >>> OK, thanks. >> >> I've sent some mds logs via private email... >> >> -- Jim > I'm going to need to probe into this a bit more, but on an initial > examination I see that most of your stats are actually happening very > quickly — it's just that occasionally they take quite a while. Interesting... > Going > through the MDS log for one of those, the inode in question is > flagged with "needsrecover" from its first appearance in the log — > that really shouldn't happen unless a client had write caps on it and > the client disappeared. Any ideas? The slowness is being caused by > the MDS going out and looking at every object which could be in the > file — there are a lot since the file has a listed size of 8GB. For this run, the MDS logging slowed it down enough to cause the client caps to occasionally go stale. I don't think it's the cause of the issue, because I was having it before I turned MDS debugging up. My client caps never go stale at, e.g., debug mds 5. Otherwise, there were no signs of trouble while writing the files. Can you suggest which kernel client debugging I might enable that would help understand what is happening? Also, I have the full MDS log from writing the files, if that will help. It's big (~10 GiB). > (There are several other mysteries here that can probably be traced > to different varieties of non-optimal and buggy code as well — there > is a client which has write caps on the inode in question despite it > needing recovery, but the recovery isn't triggered until the stat > event occurs, etc). OK, thanks for taking a look. Let me know if there is other logging I can enable that will be helpful. -- Jim > -Greg > > Software Engineer #42 @ http://inktank.com | http://ceph.com > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/07/2013 08:15 AM, Jim Schutt wrote: > On 03/06/2013 05:18 PM, Greg Farnum wrote: >> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: [snip] >>> Do you want the MDS log at 10 or 20? >> More is better. ;) > > OK, thanks. I've sent some mds logs via private email... -- Jim
Re: CephFS Space Accounting and Quotas
On 03/06/2013 05:18 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >> When I'm doing these stat operations the file system is otherwise >> idle. > > What's the cluster look like? This is just one active MDS and a couple > hundred clients? 1 mds, 1 mon, 576 osds, 198 cephfs clients. > >> What is happening is that once one of these slow stat operations >> on a file completes, it never happens again for that file, from >> any client. At least, that's the case if I'm not writing to >> the file any more. I haven't checked if appending to the files >> restarts the behavior. > > I assume it'll come back, but if you could verify that'd be good. OK, I'll check it out. > > >> On the client side I'm running with 3.8.2 + the ceph patch queue >> that was merged into 3.9-rc1. >> >> On the server side I'm running recent next branch (commit 0f42eddef5), >> with the tcp receive socket buffer option patches cherry-picked. >> I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num >> rather than pg_bits to set initial number of PGs (same for pgp_num), >> and a patch that lets me run with just one pool that contains both >> data and metadata. I'm testing data distribution uniformity with 512K PGs. >> >> My MDS tunables are all at default settings. >> >>> >>> We'll probably want to get a high-debug log of the MDS during these slow >>> stats as well. >> >> OK. >> >> Do you want me to try to reproduce with a more standard setup? > No, this is fine. > >> Also, I see Sage just pushed a patch to pgid decoding - I expect >> I need that as well, if I'm running the latest client code. > > Yeah, if you've got the commit it references you'll want it. > >> Do you want the MDS log at 10 or 20? > More is better. ;) OK, thanks. -- Jim > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/06/2013 02:39 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote: >> On 03/06/2013 01:21 PM, Greg Farnum wrote: >>>>> Also, this issue of stat on files created on other clients seems >>>>> like it's going to be problematic for many interactions our users >>>>> will have with the files created by their parallel compute jobs - >>>>> any suggestion on how to avoid or fix it? >>>> >>> >>> >>> Brief background: stat is required to provide file size information, >>> and so when you do a stat Ceph needs to find out the actual file >>> size. If the file is currently in use by somebody, that requires >>> gathering up the latest metadata from them. Separately, while Ceph >>> allows a client and the MDS to proceed with a bunch of operations >>> (ie, mknod) without having it go to disk first, it requires anything >>> which is visible to a third party (another client) be durable on disk >>> for consistency reasons. >>> >>> These combine to mean that if you do a stat on a file which a client >>> currently has buffered writes for, that buffer must be flushed out to >>> disk before the stat can return. This is the usual cause of the slow >>> stats you're seeing. You should be able to adjust dirty data >>> thresholds to encourage faster writeouts, do fsyncs once a client is >>> done with a file, etc in order to minimize the likelihood of running >>> into this. Also, I'd have to check but I believe opening a file with >>> LAZY_IO or whatever will weaken those requirements — it's probably >>> not the solution you'd like here but it's an option, and if this >>> turns out to be a serious issue then config options to reduce >>> consistency on certain operations are likely to make their way into >>> the roadmap. :) >> >> >> >> That all makes sense. >> >> But, it turns out the files in question were written yesterday, >> and I did the stat operations today. >> >> So, shouldn't the dirty buffer issue not be in play here? > Probably not. :/ > > >> Is there anything else that might be going on? > In that case it sounds like either there's a slowdown on disk access > that is propagating up the chain very bizarrely, there's a serious > performance issue on the MDS (ie, swapping for everything), or the > clients are still holding onto capabilities for the files in question > and you're running into some issues with the capability revocation > mechanisms. > Can you describe your setup a bit more? What versions are you > running, kernel or userspace clients, etc. What config options are > you setting on the MDS? Assuming you're on something semi-recent, > getting a perfcounter dump from the MDS might be illuminating as > well. When I'm doing these stat operations the file system is otherwise idle. What is happening is that once one of these slow stat operations on a file completes, it never happens again for that file, from any client. At least, that's the case if I'm not writing to the file any more. I haven't checked if appending to the files restarts the behavior. On the client side I'm running with 3.8.2 + the ceph patch queue that was merged into 3.9-rc1. On the server side I'm running recent next branch (commit 0f42eddef5), with the tcp receive socket buffer option patches cherry-picked. I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num rather than pg_bits to set initial number of PGs (same for pgp_num), and a patch that lets me run with just one pool that contains both data and metadata. I'm testing data distribution uniformity with 512K PGs. 
My MDS tunables are all at default settings. > > We'll probably want to get a high-debug log of the MDS during these slow > stats as well. OK. Do you want me to try to reproduce with a more standard setup? Also, I see Sage just pushed a patch to pgid decoding - I expect I need that as well, if I'm running the latest client code. Do you want the MDS log at 10 or 20? -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
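Since the usual cause Greg describes is dirty data that has to be flushed before another client's stat can return, the writer-side mitigation looks roughly like the sketch below: fdatasync each output file before closing it, so later stats don't have to wait on a cap recall plus writeback. The filename is a placeholder:

#!/usr/bin/env python
# Sketch: flush a file's data before closing it so a later stat()
# from another cephfs client doesn't block on cap recall + writeback.
import os

path = "/mnt/cephfs/output/rank-0042.dat"   # hypothetical output file

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"simulation output goes here\n")
    os.fdatasync(fd)   # push dirty pages out before other clients
                       # ask the MDS for this file's size
finally:
    os.close(fd)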
Re: CephFS Space Accounting and Quotas
On 03/06/2013 01:21 PM, Greg Farnum wrote: >> > Also, this issue of stat on files created on other clients seems >> > like it's going to be problematic for many interactions our users >> > will have with the files created by their parallel compute jobs - >> > any suggestion on how to avoid or fix it? >> > > Brief background: stat is required to provide file size information, > and so when you do a stat Ceph needs to find out the actual file > size. If the file is currently in use by somebody, that requires > gathering up the latest metadata from them. Separately, while Ceph > allows a client and the MDS to proceed with a bunch of operations > (ie, mknod) without having it go to disk first, it requires anything > which is visible to a third party (another client) be durable on disk > for consistency reasons. > > These combine to mean that if you do a stat on a file which a client > currently has buffered writes for, that buffer must be flushed out to > disk before the stat can return. This is the usual cause of the slow > stats you're seeing. You should be able to adjust dirty data > thresholds to encourage faster writeouts, do fsyncs once a client is > done with a file, etc in order to minimize the likelihood of running > into this. Also, I'd have to check but I believe opening a file with > LAZY_IO or whatever will weaken those requirements — it's probably > not the solution you'd like here but it's an option, and if this > turns out to be a serious issue then config options to reduce > consistency on certain operations are likely to make their way into > the roadmap. :) That all makes sense. But, it turns out the files in question were written yesterday, and I did the stat operations today. So, shouldn't the dirty buffer issue not be in play here? Is there anything else that might be going on? Thanks -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/06/2013 12:13 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote: >> On 03/05/2013 12:33 PM, Sage Weil wrote: >>>>> Running 'du' on each directory would be much faster with Ceph since it >>>>> accounts tracks the subdirectories and shows their total size with an 'ls >>>>> -al'. >>>>> >>>>> Environments with 100k users also tend to be very dynamic with adding and >>>>> removing users all the time, so creating separate filesystems for them >>>>> would >>>>> be very time consuming. >>>>> >>>>> Now, I'm not talking about enforcing soft or hard quotas, I'm just talking >>>>> about knowing how much space uid X and Y consume on the filesystem. >>>> >>> >>> >>> The part I'm most unclear on is what use cases people have where uid X and >>> Y are spread around the file system (not in a single or small set of sub >>> directories) and per-user (not, say, per-project) quotas are still >>> necessary. In most environments, users get their own home directory and >>> everything lives there... >> >> >> >> Hmmm, is there a tool I should be using that will return the space >> used by a directory, and all its descendants? >> >> If it's 'du', that tool is definitely not fast for me. >> >> I'm doing an 'strace du -s ', where has one >> subdirectory which contains ~600 files. I've got ~200 clients >> mounting the file system, and each client wrote 3 files in that >> directory. >> >> I'm doing the 'du' from one of those nodes, and the strace is showing >> me du is doing a 'newfstat' for each file. For each file that was >> written on a different client from where du is running, that 'newfstat' >> takes tens of seconds to return. Which means my 'du' has been running >> for quite some time and hasn't finished yet >> >> I'm hoping there's another tool I'm supposed to be using that I >> don't know about yet. Our use case includes tens of millions >> of files written from thousands of clients, and whatever tool >> we use to do space accounting needs to not walk an entire directory >> tree, checking each file. > > Check out the directory sizes with ls -l or whatever — those numbers are > semantically meaningful! :) That is just exceptionally cool! > > Unfortunately we can't (currently) use those "recursive statistics" > to do proper hard quotas on subdirectories as they're lazily > propagated following client ops, not as part of the updates. (Lazily > in the technical sense — it's actually quite fast in general). But > they'd work fine for soft quotas if somebody wrote the code, or to > block writes on a slight time lag. 'ls -lh ' seems to be just the thing if you already know . And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space. I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find -type d -exec ls -lhd {} \;' Unfortunately, that ended up doing a 'newfstatat' on each file under , evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are? 
Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it? Thanks! -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
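For the "which directories belong to whom" case, the same recursive statistics that show up in 'ls -l' on a directory are also exposed as virtual extended attributes (ceph.dir.rbytes, ceph.dir.rfiles, and so on). Treat the attribute names as an assumption for this particular vintage of client, but if they are present you can pull per-directory usage without statting a single file:

#!/usr/bin/env python3
# Sketch: read CephFS recursive statistics for a directory without
# walking it.  Assumes the client exposes the ceph.dir.* virtual
# xattrs (names may vary by version); needs Python 3.3+ for os.getxattr.
import os
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "."

for name in ("ceph.dir.rbytes",     # total bytes under this directory
             "ceph.dir.rfiles",     # total files under it
             "ceph.dir.rsubdirs"):  # total subdirectories under it
    try:
        value = os.getxattr(path, name)
        print("%-20s %s" % (name, value.decode()))
    except OSError as e:
        print("%-20s unavailable (%s)" % (name, e))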
Re: CephFS First product release discussion
On 03/05/2013 12:33 PM, Sage Weil wrote: >> > Running 'du' on each directory would be much faster with Ceph since it >> > accounts tracks the subdirectories and shows their total size with an 'ls >> > -al'. >> > >> > Environments with 100k users also tend to be very dynamic with adding and >> > removing users all the time, so creating separate filesystems for them >> > would >> > be very time consuming. >> > >> > Now, I'm not talking about enforcing soft or hard quotas, I'm just talking >> > about knowing how much space uid X and Y consume on the filesystem. > The part I'm most unclear on is what use cases people have where uid X and > Y are spread around the file system (not in a single or small set of sub > directories) and per-user (not, say, per-project) quotas are still > necessary. In most environments, users get their own home directory and > everything lives there... Hmmm, is there a tool I should be using that will return the space used by a directory, and all its descendants? If it's 'du', that tool is definitely not fast for me. I'm doing an 'strace du -s ', where has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file. -- Jim > > sage > > >> > >> > Wido -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS First product release discussion
On 03/05/2013 11:36 AM, Wido den Hollander wrote: > > Now, I'm not talking about enforcing soft or hard quotas, I'm just > talking about knowing how much space uid X and Y consume on the > filesystem. FWIW, we'd like this capability for our HPC systems - we need to be able to disable scheduling of new jobs for users that are consuming too much storage. -- Jim > > Wido

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage, On 02/26/2013 12:36 PM, Sage Weil wrote: > On Tue, 26 Feb 2013, Jim Schutt wrote: >>> I think the right solution is to make an option that will setsockopt on >>> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this, >>> wip-tcp. Do you mind checking to see if this addresses the issue (without >>> manually adjusting things in /proc)? >> >> I'll be happy to test it out... > > That would be great! It's branch wip-tcp, and the setting is 'ms tcp > rcvbuf'. I've verified that I can reproduce the slowdown with the default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf, and 'ms tcp rcvbuf' at 0. I've also verified that I could not reproduce any slowdown when I configure 'ms tcp rcvbuf' to 256 KiB on OSDs. So, that's great news - sorry for the delay in testing. Also, FWIW, I ended up testing with commits cb15e6e0f4 and c346282940 cherry-picked on top of next as of a day or so ago (commit f58601d681), as for some reason wip-tcp wouldn't work for me - ceph-mon was non-responsive in some way I didn't dig into. -- Jim > > Thanks! > sage > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
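For reference, what the option boils down to on each messenger socket is an ordinary SO_RCVBUF setsockopt. Setting SO_RCVBUF explicitly also opts that socket out of the kernel's receive-buffer auto-tuning (the tcp_moderate_rcvbuf behavior mentioned above), which is the point here. A minimal sketch of the same call, not the messenger code itself:

#!/usr/bin/env python
# Sketch: pin a socket's receive buffer the way 'ms tcp rcvbuf' does.
# An explicit SO_RCVBUF disables receive-buffer auto-tuning for this
# socket; the kernel doubles the requested value for bookkeeping.
import socket

rcvbuf = 256 * 1024   # 256 KiB, the value discussed in this thread

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
print("asked for %d, kernel reports %d" %
      (rcvbuf, sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)))
sock.close()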
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage, On 02/20/2013 05:12 PM, Sage Weil wrote: > Hi Jim, > > I'm resurrecting an ancient thread here, but: we've just observed this on > another big cluster and remembered that this hasn't actually been fixed. Sorry for the delayed reply - I missed this in a backlog of unread email... > > I think the right solution is to make an option that will setsockopt on > SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this, > wip-tcp. Do you mind checking to see if this addresses the issue (without > manually adjusting things in /proc)? I'll be happy to test it out... > > And perhaps we should consider making this default to 256KB... That's the value I've been using with my /proc adjustments since I figured out what was going on. My servers use a 10 GbE port for each of the cluster and public networks, with cephfs clients using 1 GbE, and I've not detected any issues resulting from that value. So, it seems like a decent starting point for a default... -- Jim > > Thanks! > sage > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Slow request in XFS
On 01/31/2013 01:00 PM, Sage Weil wrote: > On Thu, 31 Jan 2013, Jim Schutt wrote: >> On 01/31/2013 05:43 AM, Sage Weil wrote: >>> Hi- >>> >>> Can you reproduce this with logs? It looks like there are a few ops that >>> are hanging for a very long time, but there isn't enough information here >>> except to point to osds 610, 612, 615, and 68... >> >> FWIW, I have a small pile of disks with bad sections in >> them - these sections aren't bad enough that writes fail, >> but they are bad enough that throughput drops by a factor >> of ~20. >> >> Do OSDs already collect statistics on, say, op commit >> elapsed time (assuming that's the statistic most directly >> related to slow writes on only some sections of a disk), >> in a way that could be used to diagnose such disks? >> >> If not, is there enough structure already in place that >> it would be easy to add? > > They're tracking it internally, but it gets averaged into the totals > before a user gets to see any per-request latencies. The per-daemon > totals are available via the admin socket 'perf dump' command. Have you > looked at that information yet? Not yet - thanks for the pointer. I'll take a look, and see what happens when I put one of my questionable disks back in. Thanks! -- Jim > > sage > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
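Since the admin socket output is JSON, filtering it down to just the latency counters is easy to script. A sketch only; the socket path follows the default run-dir layout and the counter names vary between releases, so both are assumptions:

#!/usr/bin/env python
# Sketch: print only latency-related perf counters from one OSD's
# admin socket.  Latency counters are typically {avgcount, sum} pairs.
import json
import subprocess

asok = "/var/run/ceph/ceph-osd.0.asok"   # hypothetical daemon socket

raw = subprocess.check_output(["ceph", "--admin-daemon", asok,
                               "perf", "dump"])
counters = json.loads(raw.decode())

for section, values in sorted(counters.items()):
    if not isinstance(values, dict):
        continue
    for name, value in sorted(values.items()):
        if "latency" in name or "_lat" in name:
            print("%s/%s = %s" % (section, name, value))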
Re: Slow request in XFS
On 01/31/2013 05:43 AM, Sage Weil wrote: > Hi- > > Can you reproduce this with logs? It looks like there are a few ops that > are hanging for a very long time, but there isn't enough information here > except to point to osds 610, 612, 615, and 68... FWIW, I have a small pile of disks with bad sections in them - these sections aren't bad enough that writes fail, but they are bad enough that throughput drops by a factor of ~20. Do OSDs already collect statistics on, say, op commit elapsed time (assuming that's the statistic most directly related to slow writes on only some sections of a disk), in a way that could be used to diagnose such disks? If not, is there enough structure already in place that it would be easy to add? -- Jim > > thanks! > s -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Single host VM limit when using RBD
On 01/17/2013 11:36 AM, Dan Mick wrote: > How about RLIMIT_NPROC, or memory exhaustion? Also, check /proc/sys/kernel/pid_max. I've solved a similar pthread_create problem by increasing this to 256k, up from 32k. -- Jim > > On Jan 17, 2013, at 12:47 AM, Matthew Anderson wrote: > >> Hi Audrey, >> >> I did try your suggestion beforehand and it doesn't appear to fix the issue. >> >> [root@KVM04 ~]# cat /proc/sys/kernel/threads-max >> 2549635 >> [root@KVM04 ~]# echo 5549635 > /proc/sys/kernel/threads-max >> [root@KVM04 ~]# virsh start EX03 >> error: Failed to start domain EX03 >> error: internal error Process exited while reading console log output: char >> device redirected to /dev/pts/23 >> Thread::try_create(): pthread_create failed with error 11common/Thread.cc: >> In function 'void Thread::create(size_t)' thread 7f5ec9706960 time >> 2013-01-17 16:46:50.935681 >> common/Thread.cc: 110: FAILED assert(ret == 0) >> ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) >> 1: (()+0x2aaa8f) [0x7f5ec6a89a8f] >> 2: (SafeTimer::init()+0x95) [0x7f5ec6973575] >> 3: (librados::RadosClient::connect()+0x72c) [0x7f5ec69099dc] >> 4: (()+0xa0290) [0x7f5ec97c8290] >> 5: (()+0x879dd) [0x7f5ec97af9dd] >> 6: (()+0x87c1b) [0x7f5ec97afc1b] >> 7: (()+0x87ae1) [0x7f5ec97afae1] >> 8: (()+0x87d50) [0x7f5ec97afd50] >> 9: (()+0xb37b2) [0x7f5ec97db7b2] >> 10: (()+0x1e83eb) [0x7f5ec99103eb] >> 11: (()+0x1ab54a) [0x7f5ec98d354a] >> 12: (main()+0x9da) [0x7f5ec9913a3a] >> 13: (__libc_start_main()+0xfd) [0x7f5ec5755cdd] >> 14: (()+0x710b9) [0x7f5ec97990b9] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. >> terminate called after >> >> >> >> -Original Message- >> From: Andrey Korolyov [mailto:and...@xdel.ru] >> Sent: Thursday, 17 January 2013 4:42 PM >> To: Matthew Anderson >> Cc: ceph-devel@vger.kernel.org >> Subject: Re: Single host VM limit when using RBD >> >> Hi Matthew, >> >> Seems to a low value in /proc/sys/kernel/threads-max value. >> >> On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson >> wrote: >>> I've run into a limit on the maximum number of RBD backed VM's that I'm >>> able to run on a single host. I have 20 VM's (21 RBD volumes open) running >>> on a single host and when booting the 21st machine I get the below error >>> from libvirt/QEMU. I'm able to shut down a VM and start another in it's >>> place so there seems to be a hard limit on the amount of volumes I'm able >>> to have open. I did some googling and the error 11 from pthread_create >>> seems to mean 'resource unavailable' so I'm probably running into a thread >>> limit of some sort. I did try increasing the max_thread kernel option but >>> nothing changed. I moved a few VM's to a different empty host and they >>> start with no issues at all. >>> >>> This machine has 4 OSD's running on it in addition to the 20 VM's. Kernel >>> 3.7.1. Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram >>> and no swap. >>> >>> Can anyone suggest where the limit might be or anything I can do to narrow >>> down the problem? 
>>> >>> Thanks >>> -Matt >>> - >>> >>> Error starting domain: internal error Process exited while reading >>> console log output: char device redirected to /dev/pts/23 >>> Thread::try_create(): pthread_create failed with error >>> 11common/Thread.cc: In function 'void Thread::create(size_t)' thread >>> 7f4eb5a65960 time 2013-01-17 02:32:58.096437 >>> common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.56.1 >>> (e4a541624df62ef353e754391cbbb707f54b16f7) >>> 1: (()+0x2aaa8f) [0x7f4eb2de8a8f] >>> 2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575] >>> 3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc] >>> 4: (()+0xa0290) [0x7f4eb5b27290] >>> 5: (()+0x879dd) [0x7f4eb5b0e9dd] >>> 6: (()+0x87c1b) [0x7f4eb5b0ec1b] >>> 7: (()+0x87ae1) [0x7f4eb5b0eae1] >>> 8: (()+0x87d50) [0x7f4eb5b0ed50] >>> 9: (()+0xb37b2) [0x7f4eb5b3a7b2] >>> 10: (()+0x1e83eb) [0x7f4eb5c6f3eb] >>> 11: (()+0x1ab54a) [0x7f4eb5c3254a] >>> 12: (main()+0x9da) [0x7f4eb5c72a3a] >>> 13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd] >>> 14: (()+0x710b9) [0x7f4eb5af80b9] >>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >>> interpret this. >>> terminate called after >>> >>> Traceback (most recent call last): >>> File "/usr/share/virt-manager/virtManager/asyncjob.py", line 96, in >>> cb_wrapper >>>callback(asyncjob, *args, **kwargs) >>> File "/usr/share/virt-manager/virtManager/asyncjob.py", line 117, in tmpcb >>>callback(*args, **kwargs) >>> File "/usr/share/virt-manager/virtManager/domain.py", line 1090, in startup >>>self._backend.create() >>> File "/usr/lib/python2.7/dist-packages/libvirt.py", line 620, in create >>>if ret == -1: raise libvirtError ('virDomainCreate() failed', >>> dom=self) >>> libvirtError: internal error Process exited while reading console log >>> output: char device redirected to /dev/pts/23 >>> Thread::try_
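Several different limits can make pthread_create() fail with EAGAIN, so it helps to look at all of them in one go: the kernel-wide thread and pid limits plus the per-process RLIMIT_NPROC that Dan mentions. A small sketch (standard Linux locations; adjust as needed):

#!/usr/bin/env python
# Sketch: print the limits that commonly make pthread_create() return
# EAGAIN for a thread-hungry process such as qemu with many rbd volumes.
import resource

def read_int(path):
    with open(path) as f:
        return int(f.read())

print("kernel.threads-max : %d" % read_int("/proc/sys/kernel/threads-max"))
print("kernel.pid_max     : %d" % read_int("/proc/sys/kernel/pid_max"))

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC       : soft=%s hard=%s" % (soft, hard))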
Re: [PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Hi Sage, On 01/15/2013 07:55 PM, Sage Weil wrote: > Hi Jim- > > I just realized this didn't make it into our tree. It's now in testing, > and will get merged in the next window. D'oh! That's great news - thanks for the update. -- Jim > > sage
Re: [EXTERNAL] Re: OSDMonitor: don't allow creation of pools with > 65535 pgs
On 12/14/2012 09:59 AM, Joao Eduardo Luis wrote: > On 12/14/2012 03:41 PM, Jim Schutt wrote: >> Hi, >> >> I'm looking at commit e3ed28eb2 in the next branch, >> and I have a question. >> >> Shouldn't the limit be pg_num > 65536, because >> PGs are numbered 0 thru pg_num-1? >> >> If not, what am I missing? >> >> FWIW, up through yesterday I've been using the next branch and this: >> >>ceph osd pool set data pg_num 65536 --allow-experimental-feature >>ceph osd pool set metadata pg_num 65536 --allow-experimental-feature >>ceph osd pool set data pgp_num 65536 --allow-experimental-feature >>ceph osd pool set metadata pgp_num 65536 --allow-experimental-feature >> >> using cephfs clients, and have seen no trouble with >> misdirected ops, etc. >> >> -- Jim >> > > Hi Jim, > > To the best of my knowledge, one of the things that triggered the > required hard cap on the number of pgs was that the kernel side is > still limited to 16 bits, despite that on the osd side this is no > longer true. > I believe the culprit is the "ps" member of struct ceph_pg, which stores what is eventually used as the PG id as __le16. > I'm not familiar with what's going on on the kernel side, but if > there's a slight chance that we are indeed keeping the 'pg_num' on a > 16-bit variable, then that value must be capped to 65535. If that's > not the case and we're just limited by the pg's number/id, then I > guess that accepting up to 65636 would be fine (0..65535). struct ceph_pg_pool in the kernel stores pg_num and friends as __le32. > > Just in case I'll look into this and further implications. Cool, thanks. -- Jim > > Thanks. > > -Joao > > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
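Spelling out the off-by-one: a 16-bit ps field can hold PG ids 0 through 65535, so the largest pg_num that fits without wrapping is 65536, not 65535. Trivial, but here it is as a check:

# The kernel-side placement group id is a 16-bit field (__le16 ps),
# so the largest id it can carry is 2**16 - 1.
max_pg_id = 2**16 - 1          # 65535
max_pg_num = max_pg_id + 1     # ids run 0..65535, so 65536 PGs fit

assert max_pg_num == 65536
print("largest pg id %d, largest pg_num that fits %d" % (max_pg_id, max_pg_num))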
OSDMonitor: don't allow creation of pools with > 65535 pgs
Hi, I'm looking at commit e3ed28eb2 in the next branch, and I have a question. Shouldn't the limit be pg_num > 65536, because PGs are numbered 0 thru pg_num-1? If not, what am I missing? FWIW, up through yesterday I've been using the next branch and this: ceph osd pool set data pg_num 65536 --allow-experimental-feature ceph osd pool set metadata pg_num 65536 --allow-experimental-feature ceph osd pool set data pgp_num 65536 --allow-experimental-feature ceph osd pool set metadata pgp_num 65536 --allow-experimental-feature using cephfs clients, and have seen no trouble with misdirected ops, etc. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/11/2012 06:37 PM, Liu Bo wrote: > On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote: >> On 12/09/2012 07:04 AM, Liu Bo wrote: >>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >>> Hi Jim, >>> >>> Could you please apply the following patch to test if it works? >> >> Hi, >> >> So far, with your patch applied I've been unable to reproduce >> the recursive deadlock. Thanks a lot for this patch! >> This issue has been troubling me for a while. > > Hi Jim, > > Good news for us :) > >> >> I've been trying to learn more about btrfs internals - >> if you have the time to answer a couple questions about >> your patch, I'd really appreciate it. > > See below. > >> >>> >>> (It's against 3.7-rc8.) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 3d3e2c1..100289b 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root >>> *root, int data) >>> >>> if (data) >>> flags = BTRFS_BLOCK_GROUP_DATA; >>> - else if (root == root->fs_info->chunk_root) >>> + else if (root == root->fs_info->chunk_root || >>> +root == root->fs_info->dev_root) >>> flags = BTRFS_BLOCK_GROUP_SYSTEM; >>> else >>> flags = BTRFS_BLOCK_GROUP_METADATA; >>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct >>> btrfs_root *root, u64 type) >>> num_dev = 1;/* DUP or single */ >>> >>> /* metadata for updaing devices and chunk tree */ >>> + num_dev = num_dev << 1 >> >> AFAICS this is doubling the size of the reserve, which >> reduces the chance of a recursive do_chunk_alloc(), right? >> > > Not like that, we hit the deadlock because updating device tree also > uses METADATA chunk, which may be called when we're actually allocating > a METADATA chunk, so the patch I sent you makes updating device tree > use SYSTEM chunk, which we'll have some code to check if it is enough > before allocating a chunk(if not, we'll allocate a SYSTEM chunk first). > > Here I double the size just because the worst case of allocating a > DATA/METADATA chunk -may- results in > > 1)adding a SYSTEM chunk + > 2)adding dev extent per chunk stripe + > 3)updating chunk stripes's bytes_used > >>> return btrfs_calc_trans_metadata_size(root, num_dev + 1); >> >> btrfs_calc_trans_metadata_size(root, num_items) multiplies its >> num_items argument by another factor of three - do you know if >> there is there some rationale behind that number, or is it >> perhaps also an empirically determined factor? > > The height of Btrfs's metadata btree is at most 8, > leaf is on 0 level while node is at most on 7 level. > > Each btree update may results in COWing a node and its sibling nodes, > where the factor of tree comes from > >> >> What I'm wondering about is that if the size of the reserve is >> empirically determined, will it need to be increased again >> later when machines are more capable, and can handle a higher >> load? >> >> Do you think it's feasible to modify the locking for >> do_chunk_alloc to allow it to recurse without deadlock? > > Well, it could be, but IMO it'll bring us complexity, so worse > maintainance. > > Any questions? Feel free to ask. Your response was very helpful. Thanks a lot! -- Jim > > thanks, > liubo > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/09/2012 07:04 AM, Liu Bo wrote: > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >> > Hi, >> > >> > I'm hitting a btrfs locking issue with 3.7.0-rc8. >> > >> > The btrfs filesystem in question is backing a Ceph OSD >> > under a heavy write load from many cephfs clients. >> > >> > I reported this issue a while ago: >> > http://www.spinics.net/lists/linux-btrfs/msg19370.html >> > when I was testing what I thought might be part of the >> > 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. >> > >> > I spent some time attempting to bisect the btrfs patch queue >> > just before it was merged for 3.7, but got nowhere due to >> > false negatives. >> > >> > I've just been able to get back to testing 3.7-rc, and found >> > that I can still trigger the issue. > Hi Jim, > > Could you please apply the following patch to test if it works? Hi, So far, with your patch applied I've been unable to reproduce the recursive deadlock. Thanks a lot for this patch! This issue has been troubling me for a while. I've been trying to learn more about btrfs internals - if you have the time to answer a couple questions about your patch, I'd really appreciate it. > > (It's against 3.7-rc8.) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d3e2c1..100289b 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root > *root, int data) > > if (data) > flags = BTRFS_BLOCK_GROUP_DATA; > - else if (root == root->fs_info->chunk_root) > + else if (root == root->fs_info->chunk_root || > + root == root->fs_info->dev_root) > flags = BTRFS_BLOCK_GROUP_SYSTEM; > else > flags = BTRFS_BLOCK_GROUP_METADATA; > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct > btrfs_root *root, u64 type) > num_dev = 1;/* DUP or single */ > > /* metadata for updaing devices and chunk tree */ > + num_dev = num_dev << 1 AFAICS this is doubling the size of the reserve, which reduces the chance of a recursive do_chunk_alloc(), right? > return btrfs_calc_trans_metadata_size(root, num_dev + 1); btrfs_calc_trans_metadata_size(root, num_items) multiplies its num_items argument by another factor of three - do you know if there is there some rationale behind that number, or is it perhaps also an empirically determined factor? What I'm wondering about is that if the size of the reserve is empirically determined, will it need to be increased again later when machines are more capable, and can handle a higher load? Do you think it's feasible to modify the locking for do_chunk_alloc to allow it to recurse without deadlock? Thanks -- Jim > } > > @@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct > btrfs_fs_info *fs_info) > > fs_info->extent_root->block_rsv = &fs_info->global_block_rsv; > fs_info->csum_root->block_rsv = &fs_info->global_block_rsv; > - fs_info->dev_root->block_rsv = &fs_info->global_block_rsv; > + fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv; > fs_info->tree_root->block_rsv = &fs_info->global_block_rsv; > fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv; > > > thanks, > liubo > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/05/2012 09:07 AM, Jim Schutt wrote: Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives. I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat: [ 1184.201331] = [ 1184.206716] [ INFO: possible recursive locking detected ] [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted [ 1184.217156] - [ 1184.222544] ceph-osd/42177 is trying to acquire lock: [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.237270] [ 1184.237270] but task is already holding lock: [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.252786] [ 1184.252786] other info that might help us debug this: [ 1184.259303] Possible unsafe locking scenario: [ 1184.259303] [ 1184.265220]CPU0 [ 1184.267680] [ 1184.270133] lock(&fs_info->chunk_mutex); [ 1184.274276] lock(&fs_info->chunk_mutex); [ 1184.278417] [ 1184.278417] *** DEADLOCK *** To try to debug this, I applied this patch: -cut here - diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c72ead8..86da319 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1470,6 +1470,10 @@ struct btrfs_fs_info { int backup_root_index; int num_tolerated_disk_barrier_failures; + + /* protect against recursive do_chunk_alloc() */ + const void *chunk_alloc_task; + const struct btrfs_space_info *prev_sinfo; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7cda519..d61e93d 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2087,6 +2087,9 @@ int open_ctree(struct super_block *sb, fs_info->btree_inode->i_ino = BTRFS_BTREE_INODE_OBJECTID; set_nlink(fs_info->btree_inode, 1); + + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; /* * we set the i_size on the btree inode to the max possible int. 
* the real end of the address space is determined by all of diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d3e2c1..1e0e421 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3582,6 +3582,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, } BUG_ON(!space_info); /* Logic error */ + if (mutex_is_locked(&fs_info->chunk_mutex) && + unlikely(fs_info->chunk_alloc_task == current)) { + WARN_ONCE(1, "do_chunk_alloc() about to recursively acquire " + "fs_info->chunk_mutex: impending deadlock avoided!\n" + "outer call space_info = %p flags %#llx\n" + "nested call space_info = %p flags %#llx\n", + fs_info->prev_sinfo, fs_info->prev_sinfo->flags, + space_info, space_info->flags); + return -EDEADLK; + } again: spin_lock(&space_info->lock); if (force < space_info->force_alloc) @@ -3603,6 +3613,8 @@ again: spin_unlock(&space_info->lock); mutex_lock(&fs_info->chunk_mutex); + fs_info->chunk_alloc_task = current; + fs_info->prev_sinfo = space_info; /* * The chunk_mutex is held throughout the entirety of a chunk @@ -3655,6 +3667,8 @@ again: space_info->chunk_alloc = 0; spin_unlock(&space_info->lock); out: + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; mutex_unlock(&fs_info->chunk_mutex); return ret; } -cut here - After a few trials, I got this during a test: [ 1614.213516] [ cut here ] [ 1614.218185] WARNING: at fs/btrfs/extent-tree.c:3592 do_chunk_alloc+0xec/0x3f0 [btrfs]() [ 1614.226228] Hardware name: X8DTH-i/6/iF/6F [ 1614.230484] do_chunk_alloc() about to recursively acquire fs_info->chunk_mutex: impending deadlock avoided! [ 1614.230484] outer call space_info = 880c214dac00 flags 0x1 [ 1614.230484] nested call space_info = 880c214dbc00 flags 0x4 [ 1614.251968] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macv
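The idea in the diagnostic patch above is nothing btrfs-specific: remember which task currently holds a non-recursive lock, and turn an impending self-deadlock into an error return instead of blocking. A toy version of the same guard, just to show the shape of it:

# Toy version of the guard in the diagnostic patch: record the lock
# owner and fail fast (the kernel patch returns -EDEADLK) when the
# same thread tries to take the lock again.
import threading

class GuardedLock(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self):
        me = threading.current_thread()
        if self._owner is me:
            raise RuntimeError("recursive acquire would deadlock")
        self._lock.acquire()
        self._owner = me

    def release(self):
        self._owner = None
        self._lock.release()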
3.7.0-rc8 btrfs locking issue
Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives. I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat: [ 1184.201331] = [ 1184.206716] [ INFO: possible recursive locking detected ] [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted [ 1184.217156] - [ 1184.222544] ceph-osd/42177 is trying to acquire lock: [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.237270] [ 1184.237270] but task is already holding lock: [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.252786] [ 1184.252786] other info that might help us debug this: [ 1184.259303] Possible unsafe locking scenario: [ 1184.259303] [ 1184.265220]CPU0 [ 1184.267680] [ 1184.270133] lock(&fs_info->chunk_mutex); [ 1184.274276] lock(&fs_info->chunk_mutex); [ 1184.278417] [ 1184.278417] *** DEADLOCK *** [ 1184.278417] [ 1184.284325] May be due to missing lock nesting notation [ 1184.284325] [ 1184.291099] 4 locks held by ceph-osd/42177: [ 1184.295277] #0: (sb_writers#7){.+.+.+}, at: [] btrfs_file_aio_write+0x64/0x320 [btrfs] [ 1184.305103] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] btrfs_file_aio_write+0x6e/0x320 [btrfs] [ 1184.316108] #2: (sb_internal){.+.+..}, at: [] start_transaction+0x1c4/0x450 [btrfs] [ 1184.325632] #3: (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.335761] [ 1184.335761] stack backtrace: [ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 #438 [ 1184.347508] Call Trace: [ 1184.349962] [] ? vprintk_emit+0x42a/0x4c0 [ 1184.355619] [] print_deadlock_bug+0xe9/0x100 [ 1184.361556] [] validate_chain+0x596/0x750 [ 1184.367222] [] __lock_acquire+0x449/0x510 [ 1184.372894] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.379417] [] lock_acquire+0xc9/0x120 [ 1184.384855] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.391377] [] ? __lock_acquire+0x449/0x510 [ 1184.397204] [] __mutex_lock_common+0x5d/0x3a0 [ 1184.403221] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.409762] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.416323] [] ? do_chunk_alloc+0x179/0x340 [btrfs] [ 1184.422849] [] mutex_lock_nested+0x4a/0x60 [ 1184.428640] [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.435018] [] find_free_extent+0xa3c/0xb70 [btrfs] [ 1184.441555] [] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.449051] [] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1184.455843] [] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1184.462828] [] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1184.469471] [] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1184.476962] [] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1184.483426] [] btrfs_search_slot+0x360/0x730 [btrfs] [ 1184.490067] [] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1184.497199] [] ? kmem_cache_alloc+0xd3/0x170 [ 1184.503151] [] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1184.510125] [] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1184.516928] [] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1184.523374] [] ? 
check_system_chunk+0x71/0x130 [btrfs] [ 1184.530179] [] do_chunk_alloc+0x2ce/0x340 [btrfs] [ 1184.536555] [] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.544070] [] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1184.551574] [] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1184.558987] [] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1184.566042] [] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1184.572944] [] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1184.579984] [] do_sync_readv_writev+0x94/0xe0 [ 1184.585985] [] do_readv_writev+0xe3/0x1e0 [ 1184.591645] [] ? fget_light+0x122/0x170 [ 1184.597131] [] vfs_writev+0x46/0x60 [ 1184.602266] [] sys_writev+0x5f/0xc0 [ 1184.607398] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1184.613832] [] system_call_fastpath+0x16/0x1b and then a little later: [ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds. [ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1319.478227] ceph-osdD 0246 0 42177 1 0x [ 1319.485330] 880402c4d358 0046 7fffeccc49d0 880402c4dfd8 [ 1319.492852] 880402c4c010 880402c4c000 880402c4c000 880402c4c000 [ 1319.500449] 880402c4dfd8 880402c4c000 8
[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Add libceph support for a new CRUSH tunable recently added to Ceph servers. Consider the CRUSH rule step chooseleaf firstn 0 type This rule means that replicas will be chosen in a manner such that each chosen leaf's branch will contain a unique instance of . When an object is re-replicated after a leaf failure, if the CRUSH map uses a chooseleaf rule the remapped replica ends up under the bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves but one fail under a particular bucket, that remaining leaf holds all the data from its failed peers. This behavior also limits the number of peers that can participate in the re-replication of the data held by the failed leaf, which increases the time required to re-replicate after a failure. For a chooseleaf CRUSH rule, the tree descent has two steps: call them the inner and outer descents. If the tree descent down to is the outer descent, and the descent from down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, so only the inner descent is retried. In order to disperse re-replicated data as widely as possible across a storage cluster after a failure, we want to retry the outer descent. So, fix up crush_choose() to allow the inner descent to return immediately on choosing a failed leaf. Wire this up as a new CRUSH tunable. Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare. Signed-off-by: Jim Schutt --- include/linux/ceph/ceph_features.h |4 +++- include/linux/crush/crush.h|2 ++ net/ceph/crush/mapper.c| 13 ++--- net/ceph/osdmap.c |6 ++ 4 files changed, 21 insertions(+), 4 deletions(-) diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h index dad579b..61e5af4 100644 --- a/include/linux/ceph/ceph_features.h +++ b/include/linux/ceph/ceph_features.h @@ -14,13 +14,15 @@ #define CEPH_FEATURE_DIRLAYOUTHASH (1<<7) /* bits 8-17 defined by user-space; not supported yet here */ #define CEPH_FEATURE_CRUSH_TUNABLES (1<<18) +#define CEPH_FEATURE_CRUSH_TUNABLES2 (1<<25) /* * Features supported. 
*/ #define CEPH_FEATURES_SUPPORTED_DEFAULT \ (CEPH_FEATURE_NOSRCADDR |\ -CEPH_FEATURE_CRUSH_TUNABLES) +CEPH_FEATURE_CRUSH_TUNABLES | \ +CEPH_FEATURE_CRUSH_TUNABLES2) #define CEPH_FEATURES_REQUIRED_DEFAULT \ (CEPH_FEATURE_NOSRCADDR) diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h index 25baa28..6a1101f 100644 --- a/include/linux/crush/crush.h +++ b/include/linux/crush/crush.h @@ -162,6 +162,8 @@ struct crush_map { __u32 choose_local_fallback_tries; /* choose attempts before giving up */ __u32 choose_total_tries; + /* attempt chooseleaf inner descent once; on failure retry outer descent */ + __u32 chooseleaf_descend_once; }; diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c index 35fce75..96c8a58 100644 --- a/net/ceph/crush/mapper.c +++ b/net/ceph/crush/mapper.c @@ -287,6 +287,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in * @outpos: our position in that vector * @firstn: true if choosing "first n" items, false if choosing "indep" * @recurse_to_leaf: true if we want one device under each item of given type + * @descend_once: true if we should only try one descent before giving up * @out2: second output vector for leaf items (if @recurse_to_leaf) */ static int crush_choose(const struct crush_map *map, @@ -295,7 +296,7 @@ static int crush_choose(const struct crush_map *map, int x, int numrep, int type, int *out, int outpos, int firstn, int recurse_to_leaf, - int *out2) + int descend_once, int *out2) { int rep; unsigned int ftotal, flocal; @@ -399,6 +400,7 @@ static int crush_choose(const struct crush_map *map, x, outpos+1, 0, out2, outpos, firstn, 0, + map->chooseleaf_descend_once, NULL) <= outpos) /* didn't get leaf */ reject = 1; @@ -422,7 +424,10 @@ reject:
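Stripped of the CRUSH details, the behavioral change in the patch is: when the leaf chosen under a bucket turns out to be failed, give up on that bucket and re-run the outer descent, rather than retrying only within the failed leaf's bucket. The toy model below uses random choice in place of the real hash-based descent, so it illustrates only the retry structure, not actual CRUSH placement:

# Toy model of the chooseleaf retry change.  'buckets' maps a host to
# its devices, 'failed' is the set of down devices; random.choice
# stands in for the deterministic CRUSH hash.
import random

def choose_leaf(buckets, failed, total_tries=50, descend_once=True):
    hosts = list(buckets)
    for _ in range(total_tries):
        host = random.choice(hosts)          # outer descent: pick a host
        leaf = random.choice(buckets[host])  # inner descent: pick a device
        if leaf not in failed:
            return host, leaf
        if not descend_once:
            # legacy behavior: keep retrying inside the same host, so
            # data from a failed leaf piles onto that leaf's peers
            for _ in range(total_tries):
                leaf = random.choice(buckets[host])
                if leaf not in failed:
                    return host, leaf
        # with the tunable set, a failed inner descent falls through
        # here and the outer descent is retried from the top
    return None

buckets = {"host0": ["osd0", "osd1"], "host1": ["osd2", "osd3"]}
print(choose_leaf(buckets, failed={"osd2"}))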
Re: chooseleaf_descend_once
On 11/28/2012 09:11 AM, Caleb Miles wrote: Hey Jim, Running the third test with tunable chooseleaf_descend_once 0 with no devices marked out yields the following result (999.827397, 0.48667056652539997) so chi squared value is 999 with a corresponding p value of 0.487 so that the placement distribution seems to be drawn from the uniform distribution as desired. Great, thanks for doing that extra test. Plus, I see that Sage has merged it. Cool. Thanks -- Jim Caleb On Tue, Nov 27, 2012 at 1:28 PM, Jim Schutt wrote: Hi Caleb, On 11/26/2012 07:28 PM, caleb miles wrote: Hello all, Here's what I've done to try and validate the new chooseleaf_descend_once tunable first described in commit f1a53c5e80a48557e63db9c52b83f3**9391bc69b8 in the wip-crush branch of ceph.git. First I set the new tunable to it's legacy value, disabled, tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 0 The map contains one thousand osd devices contained in one hundred hosts with the following data rule rule data { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } I then simulate the creation of one million placement groups using the crushtool $ crushtool -i hundred.map --test --min-x 0 --max-x 99 --num-rep 3 --output-csv --weight 120 0.0 --weight 121 0.0 --weight 122 0.0 --weight 123 0.0 --weight 124 0.0 --weight 125 0.0 --weight 125 0.0 --weight 150 0.0 --weight 151 0.0 --weight 152 0.0 --weight 153 0.0 --weight 154 0.0 --weight 155 0.0 --weight 156 0.0 --weight 180 0.0 --weight 181 0.0 --weight 182 0.0 --weight 183 0.0 --weight 184 0.0 --weight 185 0.0 --weight 186 0.0 with the majority of devices in three hosts marked out. Then in (I)Python import scipy.stats as s import matplotlib.mlab as m data = m.csv2rec("data-device_**utilization.csv") s.chisquare(data['number_of_**objects_stored'], data['number_of_objects_* *expected']) which will output (122939.76474477499, 0.0) so that the chi squared value is 122939.795 and the p value is, rounded to, 0.0 and the observed placement distribution statistically differs from a uniform distribution. Repeating with the new tunable set to tunable chooseleaf_descend_once 1 I obtain the following result (998.97643161876761, 0.32151775131589833) so that the chi squared value is 998.976 and the p value is 0.32 and the observed placement distribution is statistically identical to the uniform distribution at the five and ten percent confidence levels, higher as well of course. The p value is the probability of obtaining a chi squared value more extreme than the statistic observed. Basically, from my rudimentary understanding of probability theory, that if you obtain a p value p< P then reject the null hypothesis, in our case that the observed placement distribution is drawn from the uniform distribution, at the P confidence level. Cool. Thanks for doing these tests. Is there any point to doing a third test, with tunable chooseleaf_descend_once 0 and no devices marked out, but in all other respects the same as the above two tests? I would expect the results for that case and the last case you tested to be essentially identical in the degree of uniformity, but is it worth verifying? 
-- Jim
Re: chooseleaf_descend_once
Hi Caleb, On 11/26/2012 07:28 PM, caleb miles wrote: Hello all, Here's what I've done to try and validate the new chooseleaf_descend_once tunable first described in commit f1a53c5e80a48557e63db9c52b83f39391bc69b8 in the wip-crush branch of ceph.git. First I set the new tunable to it's legacy value, disabled, tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 0 The map contains one thousand osd devices contained in one hundred hosts with the following data rule rule data { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } I then simulate the creation of one million placement groups using the crushtool $ crushtool -i hundred.map --test --min-x 0 --max-x 99 --num-rep 3 --output-csv --weight 120 0.0 --weight 121 0.0 --weight 122 0.0 --weight 123 0.0 --weight 124 0.0 --weight 125 0.0 --weight 125 0.0 --weight 150 0.0 --weight 151 0.0 --weight 152 0.0 --weight 153 0.0 --weight 154 0.0 --weight 155 0.0 --weight 156 0.0 --weight 180 0.0 --weight 181 0.0 --weight 182 0.0 --weight 183 0.0 --weight 184 0.0 --weight 185 0.0 --weight 186 0.0 with the majority of devices in three hosts marked out. Then in (I)Python import scipy.stats as s import matplotlib.mlab as m data = m.csv2rec("data-device_utilization.csv") s.chisquare(data['number_of_objects_stored'], data['number_of_objects_expected']) which will output (122939.76474477499, 0.0) so that the chi squared value is 122939.795 and the p value is, rounded to, 0.0 and the observed placement distribution statistically differs from a uniform distribution. Repeating with the new tunable set to tunable chooseleaf_descend_once 1 I obtain the following result (998.97643161876761, 0.32151775131589833) so that the chi squared value is 998.976 and the p value is 0.32 and the observed placement distribution is statistically identical to the uniform distribution at the five and ten percent confidence levels, higher as well of course. The p value is the probability of obtaining a chi squared value more extreme than the statistic observed. Basically, from my rudimentary understanding of probability theory, that if you obtain a p value p < P then reject the null hypothesis, in our case that the observed placement distribution is drawn from the uniform distribution, at the P confidence level. Cool. Thanks for doing these tests. Is there any point to doing a third test, with tunable chooseleaf_descend_once 0 and no devices marked out, but in all other respects the same as the above two tests? I would expect the results for that case and the last case you tested to be essentially identical in the degree of uniformity, but is it worth verifying? -- Jim Caleb -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
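For anyone who wants to rerun the check, here is the snippet above glued into one small script, using the same column names as the crushtool CSV quoted in this thread (scipy and matplotlib are the only dependencies):

#!/usr/bin/env python
# Compare observed per-device object counts from crushtool's CSV
# output against the expected counts with a chi-squared test, as
# described above.
import sys

import matplotlib.mlab as m
import scipy.stats as s

csv = sys.argv[1] if len(sys.argv) > 1 else "data-device_utilization.csv"
data = m.csv2rec(csv)

chi2, p = s.chisquare(data['number_of_objects_stored'],
                      data['number_of_objects_expected'])

print("chi-squared = %.3f, p = %.3f" % (chi2, p))
if p < 0.05:
    print("placement differs from uniform at the 5% level")
else:
    print("no evidence against a uniform placement distribution")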
Re: [PATCH] PG: Do not discard op data too early
On 10/26/2012 02:52 PM, Gregory Farnum wrote: Wanted to touch base on this patch again. If Sage and Sam agree that we don't want to play any tricks with memory accounting, we should pull this patch in. I'm pretty sure we want it for Bobtail! I've been running with it since I posted it. I think it would be great if you could pick it up! -- Jim -Greg On Thu, Sep 27, 2012 at 3:36 PM, Jim Schutt wrote: On 09/27/2012 04:27 PM, Gregory Farnum wrote: On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt wrote: On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. Can't you also do this if you just set up a throttler with a limit of 0? :) Hmmm, I expect so. I guess I just didn't think of doing it that way When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of> 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. Okay, cool. Are you interested in reducing the memory usage a little more by deallocating the memory separately from accounting it? My testing doesn't indicate a need -- even keeping the memory around until the op is done, my daemons use less memory overall to get the same throughput. So, unless some other load condition indicates a need, I'd counsel simplicity. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PG: Do not discard op data too early
On 09/27/2012 04:27 PM, Gregory Farnum wrote: On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt wrote: On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. Can't you also do this if you just set up a throttler with a limit of 0? :) Hmmm, I expect so. I guess I just didn't think of doing it that way When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of> 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. Okay, cool. Are you interested in reducing the memory usage a little more by deallocating the memory separately from accounting it? My testing doesn't indicate a need -- even keeping the memory around until the op is done, my daemons use less memory overall to get the same throughput. So, unless some other load condition indicates a need, I'd counsel simplicity. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PG: Do not discard op data too early
On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of > 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. -- Jim On Thu, Sep 27, 2012 at 2:56 PM, Jim Schutt wrote: Under a sustained cephfs write load where the offered load is higher than the storage cluster write throughput, a backlog of replication ops that arrive via the cluster messenger builds up. The client message policy throttler, which should be limiting the total write workload accepted by the storage cluster, is unable to prevent it, for any value of osd_client_message_size_cap, under such an overload condition. The root cause is that op data is released too early, in op_applied(). If instead the op data is released at op deletion, then the limit imposed by the client policy throttler applies over the entire lifetime of the op, including commits of replication ops. That makes the policy throttler an effective means for an OSD to protect itself from a sustained high offered load, because it can effectively limit the total, cluster-wide resources needed to process in-progress write ops. Signed-off-by: Jim Schutt --- src/osd/ReplicatedPG.cc |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index a64abda..80bec2a 100644 --- a/src/osd/ReplicatedPG.cc +++ b/src/osd/ReplicatedPG.cc @@ -3490,10 +3490,6 @@ void ReplicatedPG::op_applied(RepGather *repop) dout(10)<< "op_applied "<< *repop<< dendl; if (repop->ctx->op) repop->ctx->op->mark_event("op_applied"); - - // discard my reference to the buffer - if (repop->ctx->op) -repop->ctx->op->request->clear_data(); repop->applying = false; repop->applied = true; -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] PG: Do not discard op data too early
Under a sustained cephfs write load where the offered load is higher than the storage cluster write throughput, a backlog of replication ops that arrive via the cluster messenger builds up. The client message policy throttler, which should be limiting the total write workload accepted by the storage cluster, is unable to prevent it, for any value of osd_client_message_size_cap, under such an overload condition. The root cause is that op data is released too early, in op_applied(). If instead the op data is released at op deletion, then the limit imposed by the client policy throttler applies over the entire lifetime of the op, including commits of replication ops. That makes the policy throttler an effective means for an OSD to protect itself from a sustained high offered load, because it can effectively limit the total, cluster-wide resources needed to process in-progress write ops.

Signed-off-by: Jim Schutt
---
 src/osd/ReplicatedPG.cc | 4
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index a64abda..80bec2a 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -3490,10 +3490,6 @@ void ReplicatedPG::op_applied(RepGather *repop)
   dout(10) << "op_applied " << *repop << dendl;
   if (repop->ctx->op)
     repop->ctx->op->mark_event("op_applied");
-
-  // discard my reference to the buffer
-  if (repop->ctx->op)
-    repop->ctx->op->request->clear_data();
   repop->applying = false;
   repop->applied = true;
--
1.7.8.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] FileJournal: correctly check return value of lseek in write_fd
Signed-off-by: Jim Schutt
---
 src/os/FileJournal.cc | 5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index d1c92dc..2254720 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@@ -945,9 +945,10 @@ void FileJournal::align_bl(off64_t pos, bufferlist& bl)
 int FileJournal::write_bl(off64_t& pos, bufferlist& bl)
 {
   align_bl(pos, bl);
+  int ret;

-  int ret = ::lseek64(fd, pos, SEEK_SET);
-  if (ret) {
+  off64_t spos = ::lseek64(fd, pos, SEEK_SET);
+  if (spos < 0) {
     ret = -errno;
     derr << "FileJournal::write_bl : lseek64 failed " << cpp_strerror(ret) << dendl;
     return ret;
--
1.7.8.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
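The bug this patch fixes is that lseek64() does not return 0 on success: it returns the resulting file offset, or (off64_t)-1 on error, so treating any non-zero return as a failure wrongly fails every seek to a non-zero position. A minimal standalone sketch of the corrected pattern (not the FileJournal code itself; the function name write_at is made up for the example):

#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Sketch of correct lseek error handling: the call succeeds unless it
// returns (off_t)-1, in which case errno holds the reason.
int write_at(int fd, off_t pos, const void *buf, size_t len) {
  off_t spos = ::lseek(fd, pos, SEEK_SET);
  if (spos == (off_t)-1) {                 // only -1 signals an error
    int ret = -errno;
    fprintf(stderr, "lseek to %lld failed: %s\n",
            (long long)pos, strerror(-ret));
    return ret;
  }
  ssize_t w = ::write(fd, buf, len);
  return (w < 0) ? -errno : 0;
}

Checking spos < 0, as the patch does, is equivalent for practical purposes since only -1 is ever returned on failure.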
Lots of misdirected client requests with 73984 PGs/pool
Hi, I was testing on 288 OSDs with pg_bits=8, for 73984 PGs/pool, 221952 total PGs. Writing from CephFS clients generates lots of messages like this: 2012-08-28 14:53:33.772344 osd.235 [WRN] client.4533 172.17.135.45:0/1432642641 misdirected client.4533.1:124 pg 0.8b9d12d4 to osd.235 in e7, client e7 pg 0.112d4 features 262282 There's no trouble with 288 OSDs and pg_bits=7 (36992 PGs/pool). This is all with current Linus master branch + ceph-client testing branch, and master branch on the server side. Does this have anything to do with old_pg_t AKA struct ceph_pg only using 16 bits for the placement seed? I'm about to start testing on 576 OSDs, and was hoping to use pg_bits=7 to get more uniform data placement across OSDs. My first step on that path was 288 OSDs with pg_bits=8. Maybe I can't, or I have to just ignore the warnings, until the kernel's libceph learns about 32-bit placement seeds? Or maybe this 16-bit seed theory is all wrong, and something else is going on? If so, what more information is needed to help sort this out? Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
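The numbers in this report line up with the 16-bit-seed theory: 2^16 = 65536, so a pool with 73984 PGs (pg_bits=8 on 288 OSDs) cannot have every PG named by a 16-bit placement seed, while 36992 PGs (pg_bits=7) still fits -- which would explain why only the pg_bits=8 configuration misbehaves. The small standalone program below just restates that arithmetic; it is an illustration of the theory in the message, not the actual old_pg_t/struct ceph_pg encoding or the kernel client's mapping code.

#include <cstdint>
#include <cstdio>

// Back-of-the-envelope check of the 16-bit placement-seed theory: a pool with
// 73984 PGs cannot be addressed by a 16-bit seed, while 36992 PGs still can.
int main() {
  const uint32_t pg_with_8_bits = 73984;    // 288 OSDs, osd pg bits = 8
  const uint32_t pg_with_7_bits = 36992;    // 288 OSDs, osd pg bits = 7
  const uint32_t max_16bit      = 1u << 16; // 65536 distinct 16-bit seeds

  printf("pg_bits=8: %u PGs, fits in 16 bits: %s\n",
         pg_with_8_bits, pg_with_8_bits <= max_16bit ? "yes" : "no");
  printf("pg_bits=7: %u PGs, fits in 16 bits: %s\n",
         pg_with_7_bits, pg_with_7_bits <= max_16bit ? "yes" : "no");
  return 0;
}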
Re: How to increase the number of pgs at pool creation time?
On 08/23/2012 03:26 PM, Tren Blackburn wrote: On Thu, Aug 23, 2012 at 2:17 PM, Jim Schutt wrote: On 08/23/2012 02:39 PM, Tren Blackburn wrote: 2) Increase the number of pgs via ceph.conf (osd pg bits = 7) and create the cluster. This does not work either as the cluster comes up with 6 pgs bits per osd still. re:http://ceph.com/docs/master/config-cluster/osd-config-ref/ I use this regularly. Do you have that in your global section, i.e. [global] osd pg bits = 7 osd pgp bits = 7 Hi Jim; Sorry about the direct reply. I have cc'ed the list. To answer your question I put that setting in the [osd] section, not the [global] section. How would a person know these settings go in [global]? I'm sorry to admit I haven't learned how to know this either. I recall trying it in the [osd] section and not having it work, and thinking maybe the monitor would need to know, and putting it in the [global] section would give it access. It worked, so I moved on. Also, I could not find documentation about "pgp bits" What is a "placement group p bit"? Again, I'm not sure. The default config has pg bits and pgp bits set to the same value, so I think I assumed it would be best to have them the same if I changed pg bits. Thank you in advance! Sorry I don't know more -- Jim Regards, Tren -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to increase the number of pgs at pool creation time?
On 08/23/2012 02:39 PM, Tren Blackburn wrote: 2) Increase the number of pgs via ceph.conf (osd pg bits = 7) and create the cluster. This does not work either as the cluster comes up with 6 pgs bits per osd still. re:http://ceph.com/docs/master/config-cluster/osd-config-ref/ I use this regularly. Do you have that in your global section, i.e. [global] osd pg bits = 7 osd pgp bits = 7 -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts
On 08/08/2012 07:13 PM, Alex Elder wrote: On 08/08/2012 11:09 AM, Jim Schutt wrote: Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. This is possible because the server-side messenger immediately sends its banner and addresses after accepting a connect request, *before* actually attempting to read or verify the banner from the client. It just occurred to me this really should be added to the commit message if you take the patch. I can resend with it added, or you can add it when you commit the patch. Thanks for verifying this aspect of what's happening. When ceph_sock_state_change() notices the connect has completed, it schedules work to process the socket via con_work(). During this time the peer is writing its banner, and arrival of the peer banner races with con_work(). If con_work() calls try_read() before the peer banner arrives, there is nothing for it to do, after which con_work() calls try_write() to send the client's banner. In this case Ceph's protocol negotiation can complete succesfully. If, however, the peer's banner arrives before con_work() calls try_read(), then try_read() will read the banner and prepare protocol negotiation info via prepare_write_connect(). prepare_write_connect() calls con_out_kvec_reset(), which discards the as-yet-unsent client banner. Next, con_work() calls try_write(), which sends the protocol negotiation info rather than the banner that the peer is expecting. I believe your analysis is correct. And I think your solution sounds good as well. I think Sage was going to take a look at this also, and I'll wait for him to have a chance to comment. But in any case, your fix looks good. Nice job. Thanks for taking a look. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible
On 08/09/2012 10:26 AM, Tommi Virtanen wrote: mkcephfs is not a viable route forward. For example, it is unable to expand a pre-existing cluster. The new "OSD hotplugging" style init is much, much nicer. And does more than just mkfs& mount. I'm embarrassed to admit I haven't been keeping up with this, but I seem to recall that early versions didn't handle a journal on a partition. Did I get that wrong, or maybe that capability exists now? In the past I've found it to have a small performance benefit, and would hate to lose it. Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts
Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. When ceph_sock_state_change() notices the connect has completed, it schedules work to process the socket via con_work(). During this time the peer is writing its banner, and arrival of the peer banner races with con_work(). If con_work() calls try_read() before the peer banner arrives, there is nothing for it to do, after which con_work() calls try_write() to send the client's banner. In this case Ceph's protocol negotiation can complete succesfully. If, however, the peer's banner arrives before con_work() calls try_read(), then try_read() will read the banner and prepare protocol negotiation info via prepare_write_connect(). prepare_write_connect() calls con_out_kvec_reset(), which discards the as-yet-unsent client banner. Next, con_work() calls try_write(), which sends the protocol negotiation info rather than the banner that the peer is expecting. The result is that the peer sees an invalid banner, and the client reports "negotiation failed". Fix this by moving con_out_kvec_reset() out of prepare_write_connect() to its callers at all locations except the one where the banner might still need to be sent. Signed-off-by: Jim Schutt --- net/ceph/messenger.c | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index b979675..24c5eea 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -915,7 +915,6 @@ static int prepare_write_connect(struct ceph_connection *con) con->out_connect.authorizer_len = auth ? cpu_to_le32(auth->authorizer_buf_len) : 0; - con_out_kvec_reset(con); con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect); if (auth && auth->authorizer_buf_len) @@ -1557,6 +1556,7 @@ static int process_connect(struct ceph_connection *con) return -1; } con->auth_retry = 1; + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1577,6 +1577,7 @@ static int process_connect(struct ceph_connection *con) ENTITY_NAME(con->peer_name), ceph_pr_addr(&con->peer_addr.in_addr)); reset_connection(con); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1601,6 +1602,7 @@ static int process_connect(struct ceph_connection *con) le32_to_cpu(con->out_connect.connect_seq), le32_to_cpu(con->in_reply.connect_seq)); con->connect_seq = le32_to_cpu(con->in_reply.connect_seq); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1617,6 +1619,7 @@ static int process_connect(struct ceph_connection *con) le32_to_cpu(con->in_reply.global_seq)); get_global_seq(con->msgr, le32_to_cpu(con->in_reply.global_seq)); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -2135,7 +2138,11 @@ more: BUG_ON(con->state != CON_STATE_CONNECTING); con->state = CON_STATE_NEGOTIATING; - /* Banner is good, exchange connection info */ + /* +* Received banner is good, exchange connection info. +* Do not reset out_kvec, as sending our banner raced +* with receiving peer banner after connect completed. +*/ ret = prepare_write_connect(con); if (ret < 0) goto out; -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On 07/31/2012 12:40 PM, Sage Weil wrote: On Tue, 31 Jul 2012, Gregory Farnum wrote: On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt wrote: Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster. Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about? I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny it's the same basic issue you've described. I think that we've done all our diagnosis using profiling tools, but there's now a map cache and it probably wouldn't be too difficult to have it dump data via perfcounters if you poked around...anything like this exist yet, Sage? Much of the bad behavior was triggered by #2860, fixes for which just went into the stable and master branches yesterday. It's difficult to fully observe the bad behavior, though (lots of time spend in generate_past_intervals, reading old maps off disk). With the fix, we pretty much only process maps during handle_osd_map. Adding perfcounters in the methods that grab a map out of the cache or (more importantly) read it off disk will give you better visibility into that. It should be pretty easy to instrument that (and I'll gladly take patches that implement that... :). Without knowing more about what you're seeing, it's hard to say if its related, though. This was triggered by long periods of unclean pgs and lots of data migration, not high load. An issue I've been seeing is unusually high OSD memory use. It seems to be triggered by linux clients timing out requests and resetting OSDs during a heavy write load, but I was hoping to rule out any memory-use issues caused by map processing. However, this morning I started testing your server wip-msgr branch together with the kernel-side patches queued up for 3.6, and so far with that combination I've been unable to trigger the behavior I was seeing. So, that's great news, and I think confirms that issue was unrelated to any map issues. I've also sometimes had issues with my cluster becoming unstable when failing an OSD while the cluster is under a heavy write load, but hadn't been successful at characterizing under what conditions it couldn't recover. I expect that situation is now improved as well, and will retest. Thanks -- Jim sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On 07/30/2012 06:24 PM, Gregory Farnum wrote: On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt wrote: Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100? Sage should probably describe in more depth what we've seen since he's looked at it the most, but I can expand on it a little. In argonaut and earlier version of Ceph, processing a new OSDMap for an OSD is very expensive. I don't remember the precise numbers we'd whittled it down to but it required at least one disk sync as well as pausing all request processing for a while. If you combined this expense with a large number of large maps (if, perhaps, one quarter of your 800-OSD system had been down but not out for 6+ hours), you could cause memory thrashing on OSDs as they came up, which could force them to become very, very, veeery slow. In the next version of Ceph, map processing is much less expensive (no syncs or full-system pauses required), which will prevent request backup. And there are a huge number of ways to reduce the memory utilization of maps, some of which can be backported to argonaut and some of which can't. Now, if we can't prevent our internal processes from running an OSD out of memory, we'll have failed. But we don't think this is an intractable problem; in fact we have reason to hope we've cleared it up now that we've seen the problem — although we don't think it's something that we can absolutely prevent on argonaut (too much code churn). So we're looking for something that we can apply to argonaut as a band-aid, but that we can also keep around in case forces external to Ceph start causing similar cluster-scale resource shortages beyond our control (runaway co-located process eats up all the memory on lots of boxes, switch fails and bandwidth gets cut in half, etc). If something happens that means Ceph can only supply half as much throughput as it was previously, then Ceph should provide that much throughput; right now if that kind of incident occurs then Ceph won't provide any throughput because it'll all be eaten by spurious recovery work. Ah, thanks for the extra context. I hadn't fully appreciated the proposal was primarily a mitigation for argonaut, and otherwise as a fail-safe mechanism. As I mentioned above, I'm concerned this is addressing symptoms, rather than root causes. I'm concerned the root cause has something to do with how the map processing work scales with number of OSDs/PGs, and that this will limit the maximum size of a Ceph storage cluster. I think I discussed this above enough already? :) Yep, thanks. But, if you really just want to not mark down an OSD that is laggy, I know this will sound simplistic, but I keep thinking that the OSD knows for itself if it's up, even when the heartbeat mechanism is backed up. Couldn't there be some way to ask an OSD suspected of being down whether it is or not, separate from the heartbeat mechanism? I mean, if you're considering having the monitor ignore OSD down reports for a while based on some estimate of past behavior, wouldn't it be better for the monitor to just ask such an OSD, "hey, are you still there?" If it gets an immediate "I'm busy, come back later", extend the grace period; otherwise, mark the OSD down. Hmm. 
The concern is that if an OSD is stuck on disk swapping then it's going to be just as stuck for the monitors as the OSDs — they're all using the same network in the basic case, etc. We want to be able to make that guess before the OSD is able to answer such questions. But I'll think on if we could try something else similar. OK - thanks. Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster. Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about? I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny it's the same basic issue you've described. Thanks -- Jim -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: avoiding false detection of down OSDs
Hi Greg, Thanks for the write-up. I have a couple questions below. On 07/30/2012 12:46 PM, Gregory Farnum wrote: As Ceph gets deployed on larger clusters our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated by increases in the OSD (failures, boots, etc) and PG count (osd up-thrus, pg_temp insertions, etc). Lately we haven't had many issues with heartbeats when the OSDs are happy, so it looks like the latest respin of the heartbeat code is going to satisfy us going forward. Fast OSD map generation continues to be a concern, but with the merge of Sam's new map handling code recently (which reduced the amount of disk effort required to process a map and shuffled responsibility out of the main OSD thread and into the more highly-threaded PGs) it has become significantly less expensive, and we have a number of implemented and planned changes (from the short- to the long-term) to continue making it less painful. However, we've started seeing a new issue at the intersection of these separate problems: what happens when an OSD slows down because it's processing too many maps, but continues to operate. In large clusters, an OSD might go down and come back up with hundreds-to-thousands of maps to process — often at the same time as other OSDs. We've started to observe issues during software upgrades where a lot of OSDs come up together and process so many maps that they run out of memory and start swapping[1]. This can easily cause them to miss heartbeats long enough to get marked down — but then they finish map processing, tell the monitors they *are* alive, and get marked back up. This sequence can cause so many new maps to generate that it repeats itself on the new nodes, spreads to other nodes in the cluster, or even causes some running OSD daemons to get marked out. We've taken to calling this "OSD thrashing". It would be great if we could come up with a systemic way to reduce thrashing, independent from our efforts to reduce the triggering factors. (For one thing, when only one node is thrashing we probably want to mark it down to preserve performance, whereas when half the cluster is thrashing we want to keep them up to reduce cluster-wide load increases.) A few weeks ago some of us at Inktank had a meeting to discuss the issue, and I've finally gotten around to writing it up in this email so that we can ask for input from the wider community! After discussing several approaches (including scaling heartbeat intervals as more nodes are marked down, as nodes report being wrongly marked down, putting caps on the number of nodes that can be auto marked down and/or out, applying rate limiters to the auto-marks, etc), we realized that what we really wanted was to do our best to estimate the chances that an OSD which missed its heartbeat window was simply laggy rather than being down. I don't understand the functional difference between an OSD that is too busy to process its heartbeat in a timely fashion, and one that is down. In either case, it cannot meet its obligations to its peers. I understand that wrongly marking an OSD down adds unnecessary map processing work. Also, if an OSD is wrongly marked down then any data that would be written to it while it is marked down will be written to other OSDs, and will need to be migrated when that OSD is marked back up. 
I don't fully understand what is the impact of not marking down an OSD that really is dead, particularly if the cluster is under a heavy write load from many clients. At the very least, write requests that have a replica on such an OSD will stall waiting for an ack that will never come, or a new map, right? It seems to me that each of the discarded solutions has similar properties as the favored solution: they address a symptom, rather than the cause. Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100? While long-term I'm a proponent of pushing most of this heartbeat handling logic to the OSDs, in the short term adjustments to the algorithm are much easier to implement in the monitor (which has a lot more state on the cluster already local). We came up with a broad algorithm to estimate the chance that an OSD is laggy instead of down: first, figure out the probability that the OSD is down based on its past history, and then figure out that probability for the cluster that the OSD belongs to. Basically: 1) Keep track of when an OSD boots if it reports itself as fresh or as wrongly-marked-down. Maintain the probability that the OSD is actually down versus laggy based on that data and an exponential decay (more recent reports matter more), and maintain
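The scheme sketched in this message -- keep a per-OSD history of whether past "boots" were really fresh starts or wrongly-marked-down (laggy) OSDs, decay it exponentially so recent reports matter more, and fold in a cluster-wide estimate -- could look something like the following. This is only a sketch of the idea; the names, the decay constant, and the way the per-OSD and cluster estimates are blended into a heartbeat-grace multiplier are all invented for illustration, not the algorithm that was ultimately implemented in the monitor.

#include <map>

// Sketch: track, per OSD, an exponentially decayed estimate of how often a
// "boot" turned out to be a wrongly-marked-down (laggy) OSD rather than a
// genuinely fresh start.  More recent boots count for more via the decay.
struct LaggyEstimator {
  double decay = 0.75;                 // weight kept from older observations (assumed)
  std::map<int, double> p_laggy;       // per-OSD estimate in [0, 1]
  double cluster_p_laggy = 0.0;        // same estimate pooled over all OSDs

  // Called whenever an OSD boots: wrongly_marked_down is true if the OSD
  // reported that it had been alive all along.
  void note_boot(int osd, bool wrongly_marked_down) {
    double obs = wrongly_marked_down ? 1.0 : 0.0;
    double &p = p_laggy[osd];          // defaults to 0.0 on first boot
    p = decay * p + (1.0 - decay) * obs;
    cluster_p_laggy = decay * cluster_p_laggy + (1.0 - decay) * obs;
  }

  // When failure reports arrive, scale the heartbeat grace period by how
  // likely this OSD (and the cluster as a whole) is to be laggy, not down.
  double grace_multiplier(int osd) const {
    auto it = p_laggy.find(osd);
    double p_osd = (it == p_laggy.end()) ? 0.0 : it->second;
    return 1.0 + p_osd + cluster_p_laggy;   // arbitrary illustrative blend
  }
};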
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/18/2012 12:03 PM, Samuel Just wrote: Sorry, master has a fix now for that also. 76efd9772c60b93bbf632e3ecc3b9117dc081427 -Sam That got things running for me. Thanks for the quick reply. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/17/2012 06:03 PM, Samuel Just wrote: master should now have a fix for that, let me know how it goes. I opened bug #2798 for this issue. Hmmm, it seems handle_osd_ping() now runs into a case where for the first ping it gets, service.osdmap can be empty? 0> 2012-07-18 09:17:23.977497 7fffe6ec6700 -1 *** Caught signal (Segmentation fault) ** in thread 7fffe6ec6700 ceph version 0.48argonaut-419-g4e1d973 (commit:4e1d973e466cd45138f004e84ab8631d9b2a60fa) 1: /usr/bin/ceph-osd() [0x723c39] 2: (()+0xf4a0) [0x776584a0] 3: (OSD::handle_osd_ping(MOSDPing*)+0x7d4) [0x5d7894] 4: (OSD::heartbeat_dispatch(Message*)+0x71) [0x5d8111] 5: (SimpleMessenger::DispatchQueue::entry()+0x583) [0x7d5103] 6: (SimpleMessenger::dispatch_entry()+0x15) [0x7d6485] 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x79523d] 8: (()+0x77f1) [0x776507f1] 9: (clone()+0x6d) [0x76aa1ccd] gdb has this to say: (gdb) bt #0 0x7765836b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 #1 0x00724067 in reraise_fatal (signum=11) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104 #3 #4 get_epoch (this=0x15d, m=0x1587000) at ./osd/OSDMap.h:210 #5 OSD::handle_osd_ping (this=0x15d, m=0x1587000) at osd/OSD.cc:1711 #6 0x005d8111 in OSD::heartbeat_dispatch (this=0x15d, m=0x1587000) at osd/OSD.cc:2769 #7 0x007d5103 in ms_deliver_dispatch (this=0x1472960) at msg/Messenger.h:504 #8 SimpleMessenger::DispatchQueue::entry (this=0x1472960) at msg/SimpleMessenger.cc:367 #9 0x007d6485 in SimpleMessenger::dispatch_entry (this=0x1472880) at msg/SimpleMessenger.cc:384 #10 0x0079523d in SimpleMessenger::DispatchThread::entry (this=) at ./msg/SimpleMessenger.h:807 #11 0x776507f1 in start_thread (arg=0x7fffe6ec6700) at pthread_create.c:301 #12 0x76aa1ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) f 5 #5 OSD::handle_osd_ping (this=0x15d, m=0x1587000) at osd/OSD.cc:1711 1711m->stamp); (gdb) l 1706} 1707 } 1708 Message *r = new MOSDPing(monc->get_fsid(), 1709curmap->get_epoch(), 1710MOSDPing::PING_REPLY, 1711m->stamp); 1712 hbserver_messenger->send_message(r, m->get_connection()); 1713 1714 if (curmap->is_up(from)) { 1715note_peer_epoch(from, m->map_epoch); (gdb) p curmap $1 = std::tr1::shared_ptr (empty) 0x0 -- Jim Thanks for the info! -Sam On Tue, Jul 17, 2012 at 2:54 PM, Jim Schutt wrote: On 07/17/2012 03:44 PM, Samuel Just wrote: Not quite. OSDService::get_osdmap() returns the most recently published osdmap. Generally, OSD::osdmap is safe to use when you are holding the osd lock. Otherwise, OSDService::get_osdmap() should be used. There are a few other things that should be fixed surrounding this issue as well, I'll put some time into it today. The map_lock should probably be removed all together. Thanks for taking a look. Let me know when you get something, and I'll take it for a spin. Thanks -- Jim -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
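The empty shared_ptr in this trace is the classic hazard of a published-map holder: a consumer that runs before the first map has been published sees nothing. The standalone sketch below shows the pattern and the guard a consumer needs; it is illustrative only, not the actual OSDService code or the fix that landed in 76efd9772c60.

#include <memory>
#include <mutex>

struct OSDMapStub { unsigned epoch = 0; };   // stand-in for the real OSDMap

// Sketch of a "published map" holder: writers publish a new snapshot under a
// lock; readers grab the current snapshot, so they must be prepared to see
// no snapshot at all before the first publish.
class MapService {
  std::mutex lock;
  std::shared_ptr<const OSDMapStub> published;
public:
  void publish(std::shared_ptr<const OSDMapStub> m) {
    std::lock_guard<std::mutex> l(lock);
    published = std::move(m);
  }
  std::shared_ptr<const OSDMapStub> get_osdmap() {
    std::lock_guard<std::mutex> l(lock);
    return published;                        // may be empty before first publish
  }
};

// A heartbeat-style consumer has to guard against the empty case, e.g.:
//   auto map = service.get_osdmap();
//   if (!map)
//     return;        // no map published yet; drop or defer the ping
//   use(map->epoch);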
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/17/2012 03:44 PM, Samuel Just wrote: Not quite. OSDService::get_osdmap() returns the most recently published osdmap. Generally, OSD::osdmap is safe to use when you are holding the osd lock. Otherwise, OSDService::get_osdmap() should be used. There are a few other things that should be fixed surrounding this issue as well, I'll put some time into it today. The map_lock should probably be removed all together. Thanks for taking a look. Let me know when you get something, and I'll take it for a spin. Thanks -- Jim -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
osd/OSDMap.h: 330: FAILED assert(is_up(osd))
Hi, Recent master branch is asserting for me like this: ceph version 0.48argonaut-404-gabe05a3 (commit:abe05a3fbbb120d8d354623258d9104584db66f7) 1: (OSDMap::get_cluster_inst(int) const+0xc9) [0x58cde9] 2: (OSD::handle_osd_ping(MOSDPing*)+0x8cf) [0x5d4b4f] 3: (OSD::heartbeat_dispatch(Message*)+0x71) [0x5d5491] 4: (SimpleMessenger::DispatchQueue::entry()+0x583) [0x7d5683] 5: (SimpleMessenger::dispatch_entry()+0x15) [0x7d6a05] 6: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7957bd] 7: (()+0x77f1) [0x776507f1] 8: (clone()+0x6d) [0x76aa1ccd] gdb had this to say: (gdb) bt #0 0x7765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 #1 0x007245b7 in reraise_fatal (signum=6) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 #3 #4 0x769ee885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #5 0x769f0065 in abort () at abort.c:92 #6 0x003be84bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #7 0x003be84bcc06 in ?? () from /usr/lib64/libstdc++.so.6 #8 0x003be84bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6 #9 0x003be84bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6 #10 0x0074b9e3 in ceph::__ceph_assert_fail (assertion=0x1488000 "\001", file=0x2d828a0 "\260m\"\003", line=330, func=0x8701e0 "entity_inst_t OSDMap::get_cluster_inst(int) const") at common/assert.cc:77 #11 0x0058cde9 in OSDMap::get_cluster_inst (this=, osd=) at osd/OSDMap.h:330 #12 0x005d4b4f in OSD::handle_osd_ping (this=0x14d8000, m=) at osd/OSD.cc:1717 #13 0x005d5491 in OSD::heartbeat_dispatch (this=0x14d8000, m=0x24383100) at osd/OSD.cc:2784 #14 0x007d5683 in ms_deliver_dispatch (this=0x1472960) at msg/Messenger.h:504 #15 SimpleMessenger::DispatchQueue::entry (this=0x1472960) at msg/SimpleMessenger.cc:367 #16 0x007d6a05 in SimpleMessenger::dispatch_entry (this=0x1472880) at msg/SimpleMessenger.cc:384 #17 0x007957bd in SimpleMessenger::DispatchThread::entry (this=) at ./msg/SimpleMessenger.h:807 #18 0x776507f1 in start_thread (arg=0x7fffe6ec6700) at pthread_create.c:301 #19 0x76aa1ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) f 12 #12 0x005d4b4f in OSD::handle_osd_ping (this=0x14d8000, m=) at osd/OSD.cc:1717 1717 _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from)); (gdb) l 1712 hbserver_messenger->send_message(r, m->get_connection()); 1713 1714 if (osdmap->is_up(from)) { 1715note_peer_epoch(from, m->map_epoch); 1716if (locked && is_active()) 1717 _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from)); 1718 } 1719} 1720break; 1721 (gdb) f 11 #11 0x0058cde9 in OSDMap::get_cluster_inst (this=, osd=) at osd/OSDMap.h:330 330 assert(is_up(osd)); (gdb) l 325 entity_inst_t get_inst(int osd) const { 326 assert(is_up(osd)); 327 return entity_inst_t(entity_name_t::OSD(osd), get_addr(osd)); 328 } 329 entity_inst_t get_cluster_inst(int osd) const { 330 assert(is_up(osd)); 331 return entity_inst_t(entity_name_t::OSD(osd), get_cluster_addr(osd)); 332 } 333 entity_inst_t get_hb_inst(int osd) const { 334 assert(is_up(osd)); Apparently osdmap member in class OSD don't have the same map contents as the osdmap member in OSDService in this instance? Why are there two osdmaps? Under what conditions is it appropriate for them to have different contents? Is this the appropriate fix? 
@@ -1711,10 +1711,10 @@ void OSD::handle_osd_ping(MOSDPing *m)
                               m->stamp);
       hbserver_messenger->send_message(r, m->get_connection());

-      if (osdmap->is_up(from)) {
+      if (locked && osdmap->is_up(from)) {
        note_peer_epoch(from, m->map_epoch);
-       if (locked && is_active())
-         _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from));
+       if (is_active())
+         _share_map_outgoing(osdmap->get_cluster_inst(from));
       }
     }
     break;

Thanks -- Jim
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 07/02/2012 08:07 AM, Stefan Priebe - Profihost AG wrote: Am 02.07.2012 16:04, schrieb Jim Schutt: On 07/01/2012 01:57 PM, Stefan Priebe wrote: thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. Thanks. Have you ever measured random I/O performance? Or is sequential all you need? So far I've only been testing sequential, because I like to find the limits under sequential load first. --Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 07/01/2012 01:57 PM, Stefan Priebe wrote: thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. -- Jim Am 29.06.2012 00:37, schrieb Jim Schutt: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 2.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel runs into trouble looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 06/28/2012 04:53 PM, Mark Nelson wrote: On 06/28/2012 05:37 PM, Jim Schutt wrote: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 2.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel runs into trouble looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim Nice! Did you see much performance degradation over time? Internally I've sen some slow downs (especially at smaller block sizes) as the osds fill up. How many servers and how many drives? This result is from 12 servers, 24 OSDs/server, starting from a freshly-built filesystem. I use 64KB btrfs metadata nodes. There is some performance degradation during such runs. During the initial 10 TB or so, each server sustains ~2.2 GB/s, as reported by vmstat. Nearer the end of the run, data rate on each server is much more variable, with peaks at ~2 GB/s and valleys at ~1.5 GB/s. I am suspecting that some of that variability comes from the OSDs not filling up uniformly; here's low/high utilization at the end of the run: server 1K-blocks Used Available Use% Mounted on cs42: 939095640 258202860 662416404 29% /ram/mnt/ceph/data.osd.261 cs38: 939095640 259052468 661568524 29% /ram/mnt/ceph/data.osd.154 cs39: 939095640 264803592 655825592 29% /ram/mnt/ceph/data.osd.174 cs34: 939095640 265911256 654711400 29% /ram/mnt/ceph/data.osd.52 cs41: 939095640 270588260 650049820 30% /ram/mnt/ceph/data.osd.238 cs33: 939095640 345327760 575399472 38% /ram/mnt/ceph/data.osd.47 cs40: 939095640 351180832 569558176 39% /ram/mnt/ceph/data.osd.205 cs35: 939095640 351372096 569365696 39% /ram/mnt/ceph/data.osd.89 cs41: 939095640 352522904 568214632 39% /ram/mnt/ceph/data.osd.217 cs33: 939095640 358181684 562561740 39% /ram/mnt/ceph/data.osd.35 max/min: 1.3872 Note that I am using osd_pg_bits=7, osd_pgp_bits=7. I have plans to push that to see what happens. I've also got another dozen servers on a truck somewhere on their way to here The under-utilized OSDs finish early, which I believe contributes to performance tailing off at the end of such a run. I don't have any data on how big this effect might be. I haven't yet tested filling my filesystem to capacity, so I have no data regarding what happens as the disks fill up. Still, those are the kinds of numbers I like to see. Congrats! :) Thanks - I think it's pretty cool that testing Ceph found a performance issue in the kernel. -- Jim Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Interesting results
Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 3.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel gets bogged down looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/28/2012 09:45 AM, Alexandre DERUMIER wrote: >>Definitely. Seeing perf/oprofile/whatever results for the osd under that >>workload would be very interesting! We need to get perf going in our >>testing environment... I'm not an expert, but if you give me command line, I'll do it ;) Thanks to Mark's help, I'm now using: perf record -g -a sleep 10 perf report --sort symbol --call-graph fractal,5 | more -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: excessive CPU utilization by isolate_freepages?
On 06/28/2012 05:36 AM, Mel Gorman wrote: On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote: Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to report for 3.5-rc4. Out of curiosity, why do you think it showed up in 3.4? It's not surprising as such if it did show up there but I'm wondering what you are basing it on. If I remember correctly, when I was testing this workload on 3.4 is when I first saw hundreds of runable threads being reported by vmstat. At that time I couldn't reproduce quite as reliably, and I didn't know how to get perf to give me call chains, so I didn't follow up :( It's not a suprise because it's also where reclaim/compaction stopped depending on lumpy reclaim. In the past we would have reclaimed more pages but now rely on compaction more. It's plassible that for many parallel compactions that there would be higher CPU usage now. 2012-06-27 14:00:03.219-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 75 1 0 566988576 3566480000 2 1355 21 3 1 4 95 0 0 433 1 0 964052576 3506911200 7 456359 102256 20901 2 98 0 0 0 547 3 0 820116576 348939320057 560507 114878 28115 3 96 0 0 0 806 2 0 606992576 3484818000 339 309668 101230 21056 2 98 0 0 0 708 1 0 529624576 3470800000 248 370886 101327 20062 2 97 0 0 0 231 5 0 504772576 3466388000 305 334824 95045 20407 2 97 1 1 0 158 6 01063088576 3351853600 531 847435 130696 47140 4 92 1 2 0 193 0 01449156576 3303557200 363 371279 94470 18955 2 96 1 1 0 266 6 01623512576 327281640077 241114 95730 15483 2 98 0 0 0 243 8 01629504576 326530800081 471018 100223 20920 3 96 0 1 0 70 11 01342140576 3308402000 100 925869 139876 56599 6 88 3 3 0 211 7 01130316576 3347043200 290 1008984 150699 74320 6 83 6 5 0 365 3 0 776736576 3407277200 182 747167 139436 67135 5 88 4 3 0 29 1 01528412576 341106400050 612181 137403 77609 4 87 6 3 0 266 5 01657688576 3410569600 3 258307 62879 38508 2 93 3 2 0 1159 2 02002256576 337754760019 88554 42112 14230 1 98 0 0 0 ok, so System CPU usage through the roof. Right around 14:00 I was able to get a "perf -a -g"; here's the beginning of what "perf report --sort symbol --call-graph fractal,5" had to say: # 64.86% [k] _raw_spin_lock_irqsave | |--97.94%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | | | |--99.56%-- try_to_compact_pages | | __alloc_pages_direct_compact | | __alloc_pages_slowpath | | __alloc_pages_nodemask | | alloc_pages_vma | | do_huge_pmd_anonymous_page | | handle_mm_fault | | do_page_fault | | page_fault | | | | | |--53.53%-- skb_copy_datagram_iovec | | | tcp_recvmsg | | | inet_recvmsg | | | sock_recvmsg | | | sys_recvfrom | | | system_call_fastpath | | | __recv | | | | | | | --100.00%-- (nil) | | | | | |--27.80%-- __pthread_create_2_1 | | | (nil) | | | | | --18.67%-- memcpy | |
excessive CPU utilization by isolate_freepages?
Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to report for 3.5-rc4. (FWIW I'm running the my distro version of perf; please let me know if I need to compile the tools/perf version to match my kernel.) The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers. In the early phase of such a test, when things are running well, here's what vmstat reports for the state of one of these servers: 2012-06-27 13:56:58.356-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 31 15 0 287216576 3860662800 2 11582 14 1 3 95 0 0 27 15 0 225288576 385833840018 016 203357 134876 11 56 17 15 0 28 17 0 219256576 385447360011 2305932 203141 146296 11 49 23 17 0 6 18 0 215596576 3855287200 7 2363207 215264 166502 12 45 22 20 0 22 18 0 226984576 3859640400 3 2445741 223114 179527 12 43 23 22 0 30 12 0 230844576 384616480014 2298537 216580 11 12 45 23 20 0 29 19 0 237856576 3842288400 5 2332741 209408 157138 12 42 25 22 0 17 11 0 222156576 3848399200 4 2380495 210312 173121 12 39 28 22 0 11 13 0 216152576 384638720044 2362186 215236 176454 12 39 27 22 0 12 14 0 223704576 3854672000 2 2395530 214684 177518 12 39 28 22 0 39 17 0 219932576 3859818400 4 2428231 223284 179095 12 42 24 21 0 11 10 0 219956576 3852104800 4 2323853 207472 15 12 39 28 21 0 23 15 0 216580576 3845190400 3 2241800 201049 163496 11 37 31 21 0 9 13 0 225792576 384519160013 2281900 204869 171814 11 36 30 23 0 14 12 0 233820576 3849272800 4 2293423 207686 173019 11 37 31 21 0 11 20 0 213796576 3853320800 3 2288636 205605 168118 11 37 31 21 0 The system begins to struggle over the next several minutes; here's what vmstat has to say: 2012-06-27 13:57:58.831-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 21 16 0 224628576 3852687200 2 12339 19 1 4 95 0 0 12 19 0 232060576 3850102000 4 2366769 221418 159890 12 48 20 19 0 124 8 0 218548576 383796560013 2103075 199660 108618 11 65 13 12 0 24 10 0 300476576 382302880031 1966282 177472 84572 10 76 7 7 0 20 16 0 217584576 3829670000 9 2062571 195936 128810 10 55 20 15 0 53 12 0 235720576 382479680030 2035407 196973 133921 10 52 23 15 0 20 16 0 360340576 3806799200 6 2192179 208692 136784 11 54 19 15 0 26 10 0 310800576 380938840043 2138772 207105 118718 11 64 12 13 0 24 15 0 261108576 380308280068 2174015 205793 135302 11 56 18 15 0 9 17 0 241816576 379820720020 2076145 194971 120285 10 60 16 14 0 37 15 0 255972576 378928680014 2225076 205694 126954 11 59 16 13 0 27 16 0 243212576 3787270400 6 2249476 210885 134684 12 60 15 14 0 30 10 0 217572576 3779538800 3 2128688 205027 118319 11 66 12 11 0 26 11 0 236420576 377401640023 2109709 205105 133925 10 56 19 14 0 45 15 0 330056576 376198960015 1948311 196188 119330 10 62 15 13 0 54 15 0 242696576 3763150000 4 2159530 202595 132588 11 59 16 15 0 2012-06-27 13:58:59.569-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b s
Re: OSD Hardware questions
On 06/27/2012 01:10 PM, Jim Schutt wrote: On 06/27/2012 12:48 PM, Stefan Priebe wrote: Am 27.06.2012 20:38, schrieb Jim Schutt: Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. huh i see only 124 threads per OSD even with your settings. FWIW: 2 threads/messenger (reader+writer): 166 clients ~200 OSD data peers (this depends on PG distribution/number) ~200 OSD heartbeat peers (ditto) ~200 OSD peers because I have 12 such servers with 24 OSDs each. -- Jim plus 24 OSD op threads (my tuning) 24 OSD disk threads (my tuning) 6 OSD filestore op threads (my tuning) So, 2*566 + 54 = 1186 threads/OSD Plus, there's various other worker threads, such as the timer, message dispatch threads, monitor/MDS messenger, etc. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. right and SSD. So it might be some timing issues. I guess so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 12:48 PM, Stefan Priebe wrote: Am 27.06.2012 20:38, schrieb Jim Schutt: Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. huh i see only 124 threads per OSD even with your settings. FWIW: 2 threads/messenger (reader+writer): 166 clients ~200 OSD data peers (this depends on PG distribution/number) ~200 OSD heartbeat peers (ditto) plus 24 OSD op threads (my tuning) 24 OSD disk threads (my tuning) 6 OSD filestore op threads (my tuning) So, 2*566 + 54 = 1186 threads/OSD Plus, there's various other worker threads, such as the timer, message dispatch threads, monitor/MDS messenger, etc. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. right and SSD. So it might be some timing issues. I guess so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 11:54 AM, Stefan Priebe wrote: Am 27.06.2012 um 19:23 schrieb "Jim Schutt": On 06/27/2012 09:19 AM, Stefan Priebe wrote: Am 27.06.2012 16:55, schrieb Jim Schutt: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store. Which raid level do you use? No RAID. Each OSD directly accesses a single disk, via a partition for the journal and a partition for the btrfs file store for that OSD. So you have 24 threads x 24 osds 576 threads running? Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. Since NPTL uses a 1:1 threading model, I recently had to increase /proc/sys/kernel/pid_max from the default 32768 to get them all to fit. The NICs are Chelsio T4, but I'm not using any of the TCP stateful offload features for this testing. I don't know if they have ntuple support, but the ethtool version I'm using (2.6.33) doesn't mention it. For kernels I switch back and forth between latest development kernel from Linus's tree, or latest stable kernel, depending on where the kernel development cycle is. I usually switch to the development kernel around -rc4 or so. Crazy that this works for you. Btrfs is crashing to me in 20s while running full speed on ssd. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. ??? -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 09:53 AM, Mark Nelson wrote: I'm still trying to learn how to get perf to tell me more -- Jim Thanks for doing this! I've been wanting to get perf going on our test boxes for ages but haven't had time to get the packages built yet for our gitbuilder kernels. Try generating a call-graph ala: http://lwn.net/Articles/340010/ That looks like exactly what I need. Thanks! I'll post when I've learned more. -- Jim Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 09:19 AM, Stefan Priebe wrote: Am 27.06.2012 16:55, schrieb Jim Schutt: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store. Which raid level do you use? No RAID. Each OSD directly accesses a single disk, via a partition for the journal and a partition for the btrfs file store for that OSD. I've got my 24 drives spread across three 6 Gb/s SAS HBAs, so I can sustain ~90 MB/s per drive with all drives active, when writing to the outer tracks using dd. I want to rely on Ceph for data protection via replication. At some point I expect to play around with the RAID0 support in btrfs to explore the performance relationship between number of OSDs and size of each OSD, but haven't yet. I'd be very curious to hear how these work for you. My current testing load is streaming writes from 166 linux clients, and the above tunings let me sustain ~2 GB/s on each server (2x replication, so 500 MB/s per server aggregate client bandwidth). 10GBe max speed shoudl be around 1Gbit/s. Do i miss something? Hmmm, not sure. My servers are limited by the bandwidth of the SAS drives and HBAs. So 2 GB/s aggregate disk bandwidth is 1 GB/s for journals and 1 GB/s for data. At 2x replication, that's 500 MB/s client data bandwidth. I have dual-port 10 GbE NICs, and use one port for the cluster and one for the clients. I use jumbo frames because it freed up ~10% CPU cycles over the default config of 1500-byte frames + GRO/GSO/etc on the load I'm currently testing with. Do you have ntuple and lro on or off? Which kernel version do you use and which driver version? Intel cards? # ethtool -k eth2 Offload parameters for eth2: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp-segmentation-offload: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off The NICs are Chelsio T4, but I'm not using any of the TCP stateful offload features for this testing. I don't know if they have ntuple support, but the ethtool version I'm using (2.6.33) doesn't mention it. For kernels I switch back and forth between latest development kernel from Linus's tree, or latest stable kernel, depending on where the kernel development cycle is. I usually switch to the development kernel around -rc4 or so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
Hi Mark,

On 06/27/2012 07:55 AM, Mark Nelson wrote: For what it's worth, I've got a pair of Dell R515 setup with a single 2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable of about 200MB/s each. Currently I'm topping out at about 600MB/s with rados bench using half of the drives for data and half for journals (at 2x replication). Putting journals on the same drive and doing 10 OSDs on each node is slower. Still working on figuring out why.

Just for fun, try the following tunings to see if they make a difference for you. This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store.

  journal dio = true
  osd op threads = 24
  osd disk threads = 24
  filestore op threads = 6
  filestore queue max ops = 24
  osd client message size cap = 1400
  ms dispatch throttle bytes = 1750

I'd be very curious to hear how these work for you.

My current testing load is streaming writes from 166 linux clients, and the above tunings let me sustain ~2 GB/s on each server (2x replication, so 500 MB/s per server aggregate client bandwidth). I have dual-port 10 GbE NICs, and use one port for the cluster and one for the clients. I use jumbo frames because it freed up ~10% CPU cycles over the default config of 1500-byte frames + GRO/GSO/etc on the load I'm currently testing with.

FWIW these servers are dual-socket Intel 5675 Xeons, so total 12 cores at 3.0 GHz. On the above load I usually see 15-30% idle.

FWIW, "perf top" has this to say about where time is being spent under the above load under normal conditions:

  PerfTop: 19134 irqs/sec  kernel:79.2%  exact: 0.0% [1000Hz cycles], (all, 24 CPUs)

    samples    pcnt  function                     DSO
    _________  ____  ___________________________  ______________________________________________________________
    37656.00   15.3% ceph_crc32c_le               /usr/bin/ceph-osd
    23221.00    9.5% copy_user_generic_string     [kernel.kallsyms]
    16857.00    6.9% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko
    16787.00    6.8% __crc32c_le                  [kernel.kallsyms]

But, sometimes I see this:

  PerfTop: 4930 irqs/sec  kernel:97.8%  exact: 0.0% [1000Hz cycles], (all, 24 CPUs)

    samples    pcnt  function                     DSO
    _________  ____  ___________________________  ______________________________________________________________
    147565.00  45.8% _raw_spin_lock_irqsave       [kernel.kallsyms]
    24427.00    7.6% isolate_freepages_block      [kernel.kallsyms]
    23759.00    7.4% ceph_crc32c_le               /usr/bin/ceph-osd
    16521.00    5.1% copy_user_generic_string     [kernel.kallsyms]
    10549.00    3.3% __crc32c_le                  [kernel.kallsyms]
    8901.00     2.8% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko

When this happens, OSDs cannot process heartbeats in a timely fashion, get wrongly marked down, thrashing ensues, clients stall. I'm still trying to learn how to get perf to tell me more

-- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mkcephfs regression in current master branch
On 05/24/2012 03:13 PM, Sage Weil wrote: Hi Jim, On Thu, 24 May 2012, Jim Schutt wrote: Hi, In my testing I make repeated use of the "manual" mkcephfs sequence described in the man page:

  master#  mkdir /tmp/foo
  master#  mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
  osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
  mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
  master#  mkcephfs --prepare-mon -d /tmp/foo
  monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the "mkcephfs --init-local-daemons osd" phase breaks like this:

  2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists
  os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")
  ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS` is needed to interpret this.
  2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")

I just pushed a fix for this to master.

Great, that fixed it for me. Thanks for the quick response!

BTW, the real change happening with these patches is that --mkfs no longer clobbers existing data. If you want to wipe out an osd and start anew, you need to rm -r (and btrfs sub delete snap_* and current), or re-run mkfs.btrfs.

Ah, OK. It turns out I always run mkfs.btrfs anyway, but this is good to know.

-- Jim

Thanks for the report! sage

FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical "mkcephfs --init-local-daemons osd" looks like this:

  2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
  creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
  creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
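As an aside, the leftovers Sage describes are easy to spot before re-running --mkfs. The sketch below scans an OSD data directory for snap_* subvolumes and a "current" entry; the path is only an example, and nothing like this check exists in mkcephfs itself.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Look for leftovers that would make a fresh --mkfs fail with EEXIST
     * (snap_* subvolumes and "current"), as described above. */
    int main(void)
    {
        const char *osd_data = "/ram/mnt/ceph/data.osd.0";  /* example path */
        DIR *d = opendir(osd_data);
        struct dirent *de;
        int leftovers = 0;

        if (!d) {
            perror(osd_data);
            return 1;
        }
        while ((de = readdir(d)) != NULL) {
            if (strncmp(de->d_name, "snap_", 5) == 0 ||
                strcmp(de->d_name, "current") == 0) {
                printf("leftover: %s/%s\n", osd_data, de->d_name);
                leftovers++;
            }
        }
        closedir(d);

        if (leftovers)
            printf("wipe these (btrfs sub delete / rm -r) or re-run mkfs.btrfs\n");
        return 0;
    }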
mkcephfs regression in current master branch
Hi, In my testing I make repeated use of the "manual" mkcephfs sequence described in the man page:

  master#  mkdir /tmp/foo
  master#  mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
  osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
  mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
  master#  mkcephfs --prepare-mon -d /tmp/foo
  monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the "mkcephfs --init-local-daemons osd" phase breaks like this:

  2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists
  os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")
  ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS` is needed to interpret this.
  2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")

FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical "mkcephfs --init-local-daemons osd" looks like this:

  2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
  creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
  creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: [RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure
On 05/12/2012 05:51 PM, Sage Weil wrote: Hey Jim, These both look like reasonable changes. And it's great to see they fix behavior for you. I'm not going to merge them yet, though. We're just kicking off a CRUSH refresh project next week that will include some testing framework to more thoroughly validate the quality of the output, and also take a more holistic look at all of what the algorithm is doing and see what we can improve.

I really didn't expect you'd merge them right away -- I knew you had this CRUSH effort coming, so my goal was to get these to you so you could evaluate them as part of that push.

Most likely these changes will be included, but revving the mapping algorithm is going to be tricky for forward/backward compatibility, and we'd like to get it all in at once. (And/or come up with a better way to deal with mismatched versions...)

Yep, I completely ignored the version compatibility issue - I didn't have any clever ideas on how to handle it.

Also, FWIW I'm running with the patch below on top of the previous two - I think it helps avoid giving up too early in clusters where many OSDs have gone down/out, but I haven't done enough testing on it yet to quantify.

Thanks!

Hey, thanks for taking a look! sage

-- Jim

---
ceph: retry CRUSH map descents from root a little longer before falling back to exhaustive search

The exhaustive search isn't as exhaustive as we'd like if the CRUSH map is several levels deep, so try a few more times to find an "in" device during "spread re-replication around" mode. This makes it less likely we'll give up when the storage cluster has many failed devices.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 698da55..39e9c5d 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -306,7 +306,7 @@ static int crush_choose(const struct crush_map *map,
 	int item = 0;
 	int itemtype;
 	int collide, reject;
-	const unsigned int orig_tries = 5;  /* attempts before we fall back to search */
+	const unsigned int orig_tries = 10; /* attempts before we fall back to search */
 
 	dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d\n",
 		recurse_to_leaf ? "_LEAF" : "", bucket->id, x, outpos, numrep);
@@ -440,7 +440,7 @@ reject:
 			else if (flocal <= in->size + orig_tries)
 				/* exhaustive bucket search */
 				retry_bucket = 1;
-			else if (ftotal < 20)
+			else if (ftotal <= orig_tries + 15)
 				/* then retry descent */
 				retry_descent = 1;
 			else
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/2] ceph: retry CRUSH map descent from root if leaf is failed
When an object is re-replicated after a leaf failure, the remapped replica ends up under the bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves of a bucket but one fail, that remaining leaf holds all the data from its failed peers.

For example, consider the crush rule

  step chooseleaf firstn 0 type <node type>

This rule means that replicas will be chosen in such a manner that each chosen leaf's branch will contain a unique instance of <node type>.

For such ops, the tree descent has two steps: call them the inner and outer descent. If the tree descent down to <node type> is the outer descent, and the descent from <node type> down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, but we want to retry the outer descent. This ensures that re-replication after a leaf failure disperses the re-replicated objects as widely as possible across the storage cluster.

Fix this by causing the inner descent to return immediately on choosing a failed leaf, unless we've fallen back to exhaustive search.

Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to be moved as well. This seems unavoidable but should be relatively rare.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index e5dc950..698da55 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -286,6 +286,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in
  * @param outpos our position in that vector
  * @param firstn true if choosing "first n" items, false if choosing "indep"
  * @param recurse_to_leaf: true if we want one device under each item of given type
+ * @param descend_once true if we should only try one descent before giving up
  * @param out2 second output vector for leaf items (if @a recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -293,7 +294,7 @@ static int crush_choose(const struct crush_map *map,
 			const __u32 *weight,
 			int x, int numrep, int type,
 			int *out, int outpos,
-			int firstn, int recurse_to_leaf,
+			int firstn, int recurse_to_leaf, int descend_once,
 			int *out2)
 {
 	int rep;
@@ -397,6 +398,7 @@ static int crush_choose(const struct crush_map *map,
 							 x, outpos+1, 0,
 							 out2, outpos,
 							 firstn, 0,
+							 ftotal < orig_tries,
 							 NULL) <= outpos)
 					/* didn't get leaf */
 					reject = 1;
@@ -430,7 +432,10 @@ reject:
 				 * descent during that phase so that multiple
 				 * buckets can be exhaustively searched.
 				 */
-				if (ftotal <= orig_tries)
+				if (reject && descend_once)
+					/* let outer call try again */
+					skip_rep = 1;
+				else if (ftotal <= orig_tries)
 					retry_descent = 1;
 				else if (flocal <= in->size + orig_tries)
 					/* exhaustive bucket search */
 					retry_bucket = 1;
@@ -491,6 +496,7 @@ int crush_do_rule(const struct crush_map *map,
 	int i, j;
 	int numrep;
 	int firstn;
+	const int descend_once = 0;
 
 	if ((__u32)ruleno >= map->max_rules) {
 		dprintk(" bad ruleno %d\n", ruleno);
@@ -550,7 +556,7 @@ int crush_do_rule(const struct crush_map *map,
 					       curstep->arg2,
 					       o+osize, j,
 					       firstn,
-					       recurse_to_leaf, c+osize);
+					       recurse_to_leaf, descend_once, c+osize);
 		}
 
 		if (recurse_to_leaf)
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
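For readers who find the inner/outer descent hard to follow in diff form, the following deliberately simplified, self-contained sketch mimics the control flow the patch describes. It is not the real crush_choose(); the toy hierarchy, the "hash", and all names here are stand-ins.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy stand-ins for the real CRUSH structures: 4 hosts, 2 OSDs each. */
    #define NUM_HOSTS     4
    #define OSDS_PER_HOST 2

    static bool osd_down[NUM_HOSTS][OSDS_PER_HOST] = {
        { false, false }, { true, true }, { false, false }, { false, false },
    };

    /* Inner descent: pick one leaf under a host.  With descend_once set, a
     * failed leaf is reported back immediately instead of being retried
     * locally, so the *outer* descent can pick a different host. */
    static int choose_leaf(int host, bool descend_once, int attempt)
    {
        int osd = (host * 31 + attempt) % OSDS_PER_HOST;  /* toy "hash" */

        if (osd_down[host][osd]) {
            if (descend_once)
                return -1;          /* report failure; outer descent retries */
            osd ^= 1;               /* local (exhaustive-style) retry */
            if (osd_down[host][osd])
                return -1;
        }
        return host * OSDS_PER_HOST + osd;
    }

    int main(void)
    {
        const int orig_tries = 5;
        int ftotal, leaf = -1;

        for (ftotal = 0; ftotal < 20; ftotal++) {
            int host = (7 * ftotal + 3) % NUM_HOSTS;      /* outer descent */
            leaf = choose_leaf(host, ftotal < orig_tries, ftotal);
            if (leaf >= 0)
                break;                                    /* got a usable leaf */
        }
        printf("chose osd.%d after %d attempt(s)\n", leaf, ftotal + 1);
        return 0;
    }

The point of the sketch is only the flag: for the first orig_tries attempts a failed leaf bounces straight back to the outer loop, which is what spreads re-replicated objects across buckets instead of piling them onto the failed leaf's sibling.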
[RFC PATCH 1/2] ceph: retry CRUSH map descent before retrying bucket
For the first few rejections or collisions, we'll retry the descent to keep objects spread across the cluster. After that, we'll fall back to exhaustive search of the bucket to avoid trying forever in the event a bucket has only a few "in" items and the hash doesn't do a good job of finding them.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 8857577..e5dc950 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -350,8 +350,7 @@ static int crush_choose(const struct crush_map *map,
 					reject = 1;
 					goto reject;
 				}
-				if (flocal >= (in->size>>1) &&
-				    flocal > orig_tries)
+				if (ftotal >= orig_tries) /* exhaustive bucket search */
 					item = bucket_perm_choose(in, x, r);
 				else
 					item = crush_bucket_choose(in, x, r);
@@ -420,10 +419,19 @@ reject:
 		if (reject || collide) {
 			ftotal++;
 			flocal++;
-
-			if (collide && flocal < 3)
-				/* retry locally a few times */
-				retry_bucket = 1;
+			/*
+			 * For the first couple rejections or collisions,
+			 * we'll retry the descent to keep objects spread
+			 * across the cluster.  After that, we'll fall back
+			 * to exhaustive search of buckets to avoid trying
+			 * forever in the event a bucket has only a few
+			 * "in" items and the hash doesn't do a good job
+			 * of finding them.  Note that we need to retry
+			 * descent during that phase so that multiple
+			 * buckets can be exhaustively searched.
+			 */
+			if (ftotal <= orig_tries)
+				retry_descent = 1;
 			else if (flocal <= in->size + orig_tries)
 				/* exhaustive bucket search */
 				retry_bucket = 1;
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
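As a reading aid, the rejection-handling ladder introduced by this patch can be written out as a stand-alone function. The thresholds mirror the ones in the hunk above; bucket_size stands in for in->size, and orig_tries = 5 is just the value used in mapper.c at this point in the series.

    #include <stdio.h>

    enum retry_action { RETRY_DESCENT, RETRY_BUCKET, GIVE_UP };

    /* Mirror of the decision made after a rejection/collision: first retry
     * the whole descent a few times to spread objects out, then fall back
     * to exhaustively searching the current bucket, then give up. */
    static enum retry_action after_reject(int ftotal, int flocal, int bucket_size)
    {
        const int orig_tries = 5;

        if (ftotal <= orig_tries)
            return RETRY_DESCENT;
        if (flocal <= bucket_size + orig_tries)
            return RETRY_BUCKET;
        return GIVE_UP;
    }

    int main(void)
    {
        const char *names[] = { "retry descent", "retry bucket", "give up" };
        int ftotal, flocal = 0;

        for (ftotal = 1; ftotal <= 12; ftotal++) {
            flocal++;
            printf("ftotal=%2d flocal=%2d -> %s\n", ftotal, flocal,
                   names[after_reject(ftotal, flocal, 4)]);
        }
        return 0;
    }

Run against a toy bucket of size 4, the output shows the three phases in order: descent retries for the first five failures, exhaustive bucket search for the next few, then giving up.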
[RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure
Hi Sage, I've been trying to solve the issue mentioned in tracker #2047, which I think is the same as I described in http://www.spinics.net/lists/ceph-devel/msg05824.html The attached patches seem to fix it for me. I also attempted to address the local search issue you mentioned in #2047.

I'm testing this using a cluster with 3 rows, 2 racks/row, 2 hosts/rack, 4 osds/host. I tested against a CRUSH map with the rules:

  step take root
  step chooseleaf firstn 0 type rack
  step emit

I'm in the process of testing this as follows: I wrote some data to the cluster, then started shutting down OSDs using "init-ceph stop osd.n". For the first rack's worth, I shut OSDs down sequentially. I waited for recovery to complete each time before stopping the next OSD. For the next rack I shut down the first 3 OSDs on a host at the same time, waited for recovery to complete, then shut down the last OSD on that host. For the next racks, I shut down all the OSDs on the hosts in the rack at the same time. Right now I'm waiting for recovery to complete after shutting down the third rack.

Once recovery completed after each phase so far, there were no degraded objects. So, this is looking fairly solid to me so far. What do you think?

Thanks -- Jim

Jim Schutt (2):
  ceph: retry CRUSH map descent before retrying bucket
  ceph: retry CRUSH map descent from root if leaf is failed

 src/crush/mapper.c |   30 ++++++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Ceph tracepoints
Hi Alex, I ran across tracker #2374 today - I've been carrying these two tracepoint patches for a while. Perhaps you'll find them useful.

Jim Schutt (2):
  ceph: add tracepoints for message submission on read/write requests
  ceph: add tracepoints for message send queueing and completion, reply handling

 fs/ceph/addr.c              |    8 +++
 fs/ceph/file.c              |    6 ++
 include/trace/events/ceph.h |  144 +++
 net/ceph/messenger.c        |    9 +++-
 net/ceph/osd_client.c       |    8 +++
 5 files changed, 174 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/ceph.h

-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
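The patches themselves are not reproduced in this cover letter. As a rough sketch of what a tracepoint in include/trace/events/ceph.h looks like when declared with the kernel's TRACE_EVENT() machinery, consider the skeleton below; the event name and fields are invented for illustration and are not taken from these patches.

    /* include/trace/events/ceph.h -- illustrative skeleton only;
     * u64/u32 come from the usual kernel type headers. */
    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM ceph

    #if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_CEPH_H

    #include <linux/tracepoint.h>

    TRACE_EVENT(ceph_msg_send,              /* hypothetical event name */
        TP_PROTO(u64 tid, u32 data_len),

        TP_ARGS(tid, data_len),

        TP_STRUCT__entry(
            __field(u64, tid)
            __field(u32, data_len)
        ),

        TP_fast_assign(
            __entry->tid      = tid;
            __entry->data_len = data_len;
        ),

        TP_printk("tid=%llu data_len=%u",
                  (unsigned long long)__entry->tid, __entry->data_len)
    );

    #endif /* _TRACE_CEPH_H */

    /* This must come outside the include guard, as in other trace headers. */
    #include <trace/define_trace.h>

A callsite would then invoke trace_ceph_msg_send(tid, len) at the point of interest, and the events show up under /sys/kernel/debug/tracing/events/ceph/ once the header is wired into the build.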