3.10.0 failed paging request from kthread_data
Hi, I'm trying to test the btrfs and ceph contributions to 3.11, without testing all of 3.11-rc1 (just yet), so I'm testing with the "next" branch of Chris Mason's tree (commit cbacd76bb3 from git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git) merged into the for-linus branch of the ceph tree (commit 8b8cf8917f from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git) One of my ceph clients hit this: [94633.463166] BUG: unable to handle kernel paging request at ffa8 [94633.464003] IP: [] kthread_data+0x10/0x20 [94633.464003] PGD 1a0c067 PUD 1a0e067 PMD 0 [94633.464003] Oops: [#2] SMP [94633.464003] Modules linked in: cbc ceph libceph ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh scsi_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm microcode button serio_raw pcspkr ehci_pci ehci_hcd ib_mthca ib_mad ib_core lpc_ich mfd_core uhci_hcd i5k_amb i5000_edac edac_core dm_mod nfsv4 nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 bnx2 igb ptp pps_core i2c_algo_bit i2c_core dca hwmon e1000 [94633.464003] CPU: 0 PID: 78416 Comm: kworker/0:1 Tainted: G D W 3.10.0-00119-g2925339 #601 [94633.464003] Hardware name: Dell Inc. PowerEdge 1950/0NK937, BIOS 1.1.0 06/21/2006 [94633.464003] task: 880415b6 ti: 88040e39a000 task.ti: 88040e39a000 [94633.464003] RIP: 0010:[] [] kthread_data+0x10/0x20 [94633.464003] RSP: 0018:88040e39b7f8 EFLAGS: 00010092 [94633.464003] RAX: RBX: RCX: 81d30320 [94633.464003] RDX: RSI: RDI: 880415b6 [94633.464003] RBP: 88040e39b7f8 R08: 880415b60070 R09: 0001 [94633.464003] R10: R11: R12: [94633.464003] R13: 880415b603e8 R14: 0001 R15: 0002 [94633.464003] FS: () GS:88042fc0() knlGS: [94633.464003] CS: 0010 DS: ES: CR0: 8005003b [94633.464003] CR2: 0028 CR3: 000415f77000 CR4: 07f0 [94633.464003] DR0: DR1: DR2: [94633.464003] DR3: DR6: 0ff0 DR7: 0400 [94633.464003] Stack: [94633.464003] 88040e39b818 810602a5 88040e39b818 88042fc139c0 [94633.464003] 88040e39b8a8 814ef79e 8804 88040e39bfd8 [94633.464003] 88040e39a000 88040e39a000 88040e39a010 88040e39a000 [94633.464003] Call Trace: [94633.464003] [] wq_worker_sleeping+0x15/0xa0 [94633.464003] [] __schedule+0x17e/0x6b0 [94633.464003] [] schedule+0x5d/0x60 [94633.464003] [] do_exit+0x3eb/0x440 [94633.464003] [] oops_end+0xd8/0xf0 [94633.464003] [] no_context+0x1bf/0x1e0 [94633.464003] [] __bad_area_nosemaphore+0x1f5/0x230 [94633.464003] [] bad_area_nosemaphore+0x13/0x20 [94633.464003] [] __do_page_fault+0x416/0x4b0 [94633.464003] [] ? idle_balance+0x14e/0x180 [94633.464003] [] ? finish_task_switch+0x3f/0x110 [94633.464003] [] ? error_sti+0x5/0x6 [94633.464003] [] ? trace_hardirqs_off_caller+0x29/0xd0 [94633.464003] [] ? trace_hardirqs_off_thunk+0x3a/0x3c [94633.464003] [] do_page_fault+0xe/0x10 [94633.464003] [] page_fault+0x22/0x30 [94633.464003] [] ? rb_erase+0x297/0x3a0 [94633.464003] [] __remove_osd+0x98/0xd0 [libceph] [94633.464003] [] __reset_osd+0xa3/0x1c0 [libceph] [94633.464003] [] ? osd_reset+0x9b/0xd0 [libceph] [94633.464003] [] __kick_osd_requests+0x7b/0x2e0 [libceph] [94633.464003] [] osd_reset+0xa6/0xd0 [libceph] [94633.464003] [] con_work+0x445/0x4a0 [libceph] [94633.464003] [] process_one_work+0x2e5/0x510 [94633.464003] [] ? process_one_work+0x240/0x510 [94633.464003] [] worker_thread+0x215/0x340 [94633.464003] [] ? manage_workers+0x170/0x170 [94633.464003] [] kthread+0xe1/0xf0 [94633.464003] [] ? 
__init_kthread_worker+0x70/0x70 [94633.464003] [] ret_from_fork+0x7c/0xb0 [94633.464003] [] ? __init_kthread_worker+0x70/0x70 [94633.464003] Code: 90 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 90 03 00 00 <48> 8b 40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 [94633.464003] RIP [] kthread_data+0x10/0x20 [94633.464003] RSP [94633.464003] CR2: ffa8 [94633.464003] ---[ end trace 89622896705a7fac ]--- [94633.464003] Fixing recursive fault but reboot is needed! [94633.464003] [ cut here ] kthread_data disassembles to this: (gdb) disassemble kthread_data Dump of assembler code for function kthread_data: 0x8106a060 <+0>: push %rbp 0x8106a061 <+1>: mov%rsp,%rbp 0x8106a064 <+4>: callq 0x814fabc0 0x8106a069 <+9>: mov0x390(%rdi),%rax 0x8106a070 <+16>:
lost bugfix in sync_requester_abort?
Hi Sage, A week or so ago you had a patch in your wip-mon branch (now gone) called "mon: fix raw use of *_dout in sync_requester_abort()". It seems that patch got dropped. AFAICS, without that patch, sync_requester_abort() has different behavior depending on the value of "debug mon". In particular, I believe the sync leader/provider won't get reset if debug mon is less than 10. Am I missing something, or was dropping that change unintended? Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pg balancing
Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: > Hi Jim- > > You mentioned the other day your concerns about the uniformity of the PG > and data distribution. There are several ways to attack it (including > increasing the number of PGs), but one that we haven't tested much yet is > the 'reweight-by-utilization' function in the monitor. > > The idea is that there will always be some statistical variance in the > distribution and a non-zero probability of having outlier OSDs with too > many PG. We adjust for this by taking nodes that are substantially above > the mean down by some adjustment factor in an automated way. > > ceph osd reweight-by-utilization MIN > > where MIN is the minimum relative utilization at which we will start > adjusting down. It is always > 100 (100% of the mean), and defaults to > 120. After it adjusts the reweights, you should see the result in 'ceph > osd tree' output > > Have you played with this at all on your cluster? I'd be very interested > in how well this does/does not improve things for you. I've been experimenting with re-weighting, and have found that it works well to redistribute data, as you expected. Here's a few observations: - when an OSD goes out and comes back in, its weight gets reset to 1. It would be nice if it could remember its old weight. - in order to reach the data distribution uniformity I'm after, I need to run multiple iterations of re-weighting - each iteration pushes data off the most highly utilized OSDs, but some ends up on average OSDs and pushes them over the limit. - as you expected, to reach the uniformity I'm after, a _lot_ of data needs to move. I've got some scripts I'm using to generate 'ceph osd reweight OSD WEIGHT' commands based on PG distribution, and I can use these after I create a new filesystem, to get a suitably uniform PG distribution before there is any data to move. Some iteration is required here as well, and this is working really well for me. When I start writing data to such a re-weighted filesystem, the data distribution pretty closely mirrors the PG distribution (once you write enough data). - re-weighting to get a more uniform data distribution works better if there are more PGs to work with. At 576 OSDs, I can't quite get things as uniform as I'd like with 64K PGs, but I can with 128K PGs. FWIW, here's the (max PGs/OSD)/(min PGs/OSD) I've measured for various numbers of PGs on 576 OSDs, with no re-weighting: PGs (max PGs/OSD) / (min PGs/OSD) 65536 1.478 131073 1.308 262144 1.240 524288 1.155 1048576 1.105 (BTW, your recent leveldb work enabled those 512K and 1M measurements. Thanks!) With 128K PGs and iterative re-weighting, I can get (max PGs/OSD)/(min PGs/OSD) < 1.05, and after writing enough data to consume ~33% of available storage, I get (max OSD data use)/(min OSD data use) ~ 1.06. OSD weights end up in the 0.85 - 1.0 range for such a distribution. So, re-weighting is definitely working for me. -- Jim > > Thanks! > sage > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
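For anyone who wants to try the same kind of iteration, here is a minimal sketch of the sort of reweight-generating script I described above. It is illustrative only, not the exact script I run: it assumes you have already extracted a per-OSD PG count (for example by parsing 'ceph pg dump' output) into a dict, and the damping factor, weight floor, and helper names are all arbitrary choices.

#!/usr/bin/env python
# Sketch: emit 'ceph osd reweight OSD WEIGHT' commands that nudge
# over-full OSDs back toward the mean PG count.  pg_per_osd is assumed
# to be {osd_id: pg_count}, e.g. built by parsing 'ceph pg dump'.

def reweight_commands(pg_per_osd, cur_weight, damping=0.25, floor=0.80):
    mean = float(sum(pg_per_osd.values())) / len(pg_per_osd)
    cmds = []
    for osd, count in sorted(pg_per_osd.items()):
        if count <= mean:
            continue              # only push over-full OSDs down
        # move part of the way toward the "ideal" weight mean/count
        w = cur_weight.get(osd, 1.0) * (1.0 - damping * (1.0 - mean / count))
        w = max(floor, min(1.0, w))
        cmds.append("ceph osd reweight %d %.3f" % (osd, w))
    return cmds

if __name__ == "__main__":
    # toy example: 4 OSDs, one of them overloaded
    pgs = {0: 455, 1: 470, 2: 520, 3: 440}
    weights = dict((osd, 1.0) for osd in pgs)
    for cmd in reweight_commands(pgs, weights):
        print(cmd)

Each pass only reduces weights on the over-full OSDs, so as noted above it takes several iterations (with data movement in between) before the distribution settles.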
Re: leveldb compaction overhead
On 06/05/2013 01:05 PM, Mark Nelson wrote: > > FWIW, I've been fighting with some mon/leveldb issues on 24-node test > cluster causing high CPU utilization, constant reads, laggy osdmap > updates, and mons dropping out of quorum. Work is going on in > wip-mon. Should have some more testing done later today. Other than > some crashing it appears to be behaving much better. ;) Cool! I've seen that branch, but haven't had a chance to take it for a spin yet Thanks -- Jim > > Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: leveldb compaction overhead
Hi Sage, On 05/31/2013 06:00 PM, Sage Weil wrote: > On Fri, 31 May 2013, Jim Schutt wrote: >> Hi Sage, >> >> On 05/29/2013 03:07 PM, Sage Weil wrote: >>> Hi all- >>> >>> I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to >>> make the leveldb compaction on the monitor less expensive by doing it in >>> an async thread and compaction only the trimmed range. If anyone who is >>> experience high monitor io on cuttlefish would like to test it out, >>> feedback on whether/how much it improves things would be much appreciated! >> >> I've been flogging wip-5176-cuttlefish merged into recent cuttlefish >> branch (commit 02ef6e918e), and the result has been very stable for >> me. I've been testing OSD reweights, and so have been getting lots >> of pgmap updates, and lots of data movement. >> >> I'm no longer seeing stalls, and I see much less data movement >> on the monitor hosts. I haven't seen any monitors drop out >> and rejoin, which had been a regular occurrence for me. >> I stopped a mon and reinitialized it, and it resync'ed in >> just a few minutes, which is also a big improvement. >> >> This is all with 128K PGs - next week I'll try much higher >> PG counts. >> >> Thanks a bunch for these fixes - they are working great >> for me! > > This is great news! > > I pushed a few more patches to wip-5176-cuttlefish that should make it > even better (smarter about ranges to compact, and perfcounters so we can > tell what leveldb compactions are taking place). Do you mind trying it > out as well? I've been testing these out, in the cuttlefish branch (at commit 8544ea7518). They've been working well for me, with the possible exception of commit 61135964419 ("mon/Paxos: adjust trimming defaults up; rename options"). FWIW, I've found that the new default values for paxos_trim_min and paxos_service_trim_min aren't working well for me at 256K PGs. I periodically get the classic symptoms of monitor non-responsiveness: mons dropping out of quorum and coming back in later, and an mds going laggy and booting. The mds behavior is slightly new - the active mds doesn't fail over to one of my standby mds instances, it just goes laggy, boots, repeat. I've gone to really, really aggressive trimming (both paxos_trim_min and paxos_service_trim_min at 10), and this has been working really well for me so far. I'm wondering, when leveldb compacts, does it stop committing new objects for the duration of the compaction? If so, then possibly smaller but more frequent compaction causes shorter periods of no updates. So, even though there's more compaction work overall, each episode is much less disruptive, and my cluster is much happier. If not, then I'm not sure why my trim tuning seems to help. Thanks -- Jim > > In the meantime, I'm going to pull it into the cuttlefish branch and test > over the weekend. If that looks good we'll cut a point release with all > of these fixes. > > Thank you to everyone who has helped with the debugging and testing on > these issues! 
> > sage > > >> >> Thanks -- Jim >> >>> >>> Thanks- >>> sage >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
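For reference, the aggressive trim settings I mentioned above amount to something like this in ceph.conf (option names as they appear after the "adjust trimming defaults up; rename options" commit; this is just what I happen to be running at 256K PGs, not a suggested default):

[mon]
    paxos trim min = 10
    paxos service trim min = 10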
Re: leveldb compaction overhead
Hi Sage, On 05/29/2013 03:07 PM, Sage Weil wrote: > Hi all- > > I have a couple of branches (wip-5176 and wip-5176-cuttlefish) that try to > make the leveldb compaction on the monitor less expensive by doing it in > an async thread and compaction only the trimmed range. If anyone who is > experience high monitor io on cuttlefish would like to test it out, > feedback on whether/how much it improves things would be much appreciated! I've been flogging wip-5176-cuttlefish merged into recent cuttlefish branch (commit 02ef6e918e), and the result has been very stable for me. I've been testing OSD reweights, and so have been getting lots of pgmap updates, and lots of data movement. I'm no longer seeing stalls, and I see much less data movement on the monitor hosts. I haven't seen any monitors drop out and rejoin, which had been a regular occurrence for me. I stopped a mon and reinitialized it, and it resync'ed in just a few minutes, which is also a big improvement. This is all with 128K PGs - next week I'll try much higher PG counts. Thanks a bunch for these fixes - they are working great for me! Thanks -- Jim > > Thanks- > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
On 05/15/2013 10:49 AM, Alex Elder wrote: > On 05/15/2013 11:38 AM, Jim Schutt wrote: >> > Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while >> > holding a lock, but it's spoiled because ceph_pagelist_addpage() always >> > calls kmap(), which might sleep. Here's the result: > This looks good to me, but I admit I didn't take as close > a look at it this time. > > I appreciate your updating the series to include the things > I mentioned. > > I'll commit these for you, and I'll get confirmation on the > byte order thing as well. Cool. Thanks, Alex. -- Jim > > Reviewed-by: Alex Elder > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 [13439.316464] 5 locks held by kworker/1:1/12059: [13439.320998] #0: (ceph-msgr){..}, at: [] process_one_work+0x218/0x480 [13439.329701] #1: ((&(&con->work)->work)){..}, at: [] process_one_work+0x218/0x480 [13439.339446] #2: (&s->s_mutex){..}, at: [] send_mds_reconnect+0xec/0x450 [ceph] [13439.349081] #3: (&mdsc->snap_rwsem){..}, at: [] send_mds_reconnect+0x16e/0x450 [ceph] [13439.359278] #4: (file_lock_lock){..}, at: [] lock_flocks+0x15/0x20 [13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW 3.9.0-00358-g308ae61 #557 [13439.376225] Call Trace: [13439.378757] [] __might_sleep+0xfc/0x110 [13439.384353] [] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [] ? _raw_spin_lock+0x49/0x50 [13439.403775] [] ? lock_flocks+0x15/0x20 [13439.409277] [] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [] ? igrab+0x28/0x70 [13439.420610] [] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [] check_new_map+0x352/0x500 [ceph] [13439.461534] [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [] ? mutex_unlock+0xe/0x10 [13439.473934] [] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [] dispatch+0xbc/0x110 [libceph] [13439.486492] [] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [] ? read_partial_message+0x3e8/0x520 [libceph] [13439.500583] [] ? kernel_recvmsg+0x44/0x60 [13439.506324] [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph] [13439.513140] [] try_read+0x5fe/0x7e0 [libceph] [13439.519246] [] con_work+0x378/0x4a0 [libceph] [13439.525345] [] ? finish_task_switch+0x3f/0x110 [13439.531515] [] process_one_work+0x2b5/0x480 [13439.537439] [] ? process_one_work+0x218/0x480 [13439.543526] [] worker_thread+0x1f5/0x320 [13439.549191] [] ? manage_workers+0x170/0x170 [13439.555102] [] kthread+0xe1/0xf0 [13439.560075] [] ? __init_kthread_worker+0x70/0x70 [13439.566419] [] ret_from_fork+0x7c/0xb0 [13439.571918] [] ? __init_kthread_worker+0x70/0x70 [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt --- fs/ceph/locks.c | 76 +++--- fs/ceph/mds_client.c | 65 +++--- fs/ceph/super.h |9 - 3 files changed, 89 insertions(+), 61 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 4518313..8978851 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -191,29 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) } /** - * Encode the flock and fcntl locks for the given inode into the pagelist. - * Format is: #fcntl locks, sequential fcntl locks, #flock locks, - * sequential flock locks. - * Must be called with lock_flocks() already held. 
- * If we encounter more of a specific lock type than expected, - * we return the value 1. + * Encode the flock and fcntl locks for the given inode into the ceph_filelock + * array. Must be called with lock_flocks() already held. + * If we encounter more of a specific lock type than expected, return -ENOSPC. */ -int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks) +int ceph_encode_locks_to_buffer(struct inode *inode, + struct ceph_filelock *flocks, + int num_fcntl_locks, int num_flock_locks) { struct file_lock *lock; - struct ceph_filelock cephlock; int err = 0; int seen_fcntl = 0; int seen_flock = 0; - __le32 nlocks; + int l = 0; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - nlocks = cpu_to_le32(num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); - if (err) - goto fail; + for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { if (lock->fl_flags & FL_PO
[PATCH v2 2/3] ceph: add missing cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Signed-off-by: Jim Schutt --- fs/ceph/locks.c |7 +-- fs/ceph/mds_client.c |2 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index ffc86cb..4518313 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -206,10 +206,12 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, int err = 0; int seen_fcntl = 0; int seen_flock = 0; + __le32 nlocks; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32)); + nlocks = cpu_to_le32(num_fcntl_locks); + err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); if (err) goto fail; for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { @@ -229,7 +231,8 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, goto fail; } - err = ceph_pagelist_append(pagelist, &num_flock_locks, sizeof(u32)); + nlocks = cpu_to_le32(num_flock_locks); + err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); if (err) goto fail; for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 4f22671..d9ca152 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2485,7 +2485,7 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap, lock_flocks(); ceph_count_locks(inode, &num_fcntl_locks, &num_flock_locks); - rec.v2.flock_len = (2*sizeof(u32) + + rec.v2.flock_len = cpu_to_le32(2*sizeof(u32) + (num_fcntl_locks+num_flock_locks) * sizeof(struct ceph_filelock)); unlock_flocks(); -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/3] ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt --- fs/ceph/locks.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..ffc86cb 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -169,7 +169,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl) } /** - * Must be called with BKL already held. Fills in the passed + * Must be called with lock_flocks() already held. Fills in the passed * counter variables, so you can prepare pagelist metadata before calling * ceph_encode_locks. */ -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/3] ceph: fix might_sleep while atomic
This patch series fixes an issue where encode_caps_cb() ends up calling kmap() while holding a spinlock. The original patch (http://www.spinics.net/lists/ceph-devel/msg14573.html) was expanded to fix some other outstanding issues noticed by Alex Elder in his review. Changes from the original version: - clean up a comment - add missing calls to cpu_to_le32() - eliminate unnecessary loop around calls to ceph_pagelist_append() - only loop on calls to ceph_count_locks() and ceph_encode_locks_to_buffer() if the number of locks changes; exit on other errors - return proper error code from encode_caps_cb() if kmalloc fails In his review, Alex mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Jim Schutt (3): ceph: fix up comment for ceph_count_locks() as to which lock to hold ceph: add missing cpu_to_le32() calls when encoding a reconnect capability ceph: ceph_pagelist_append might sleep while atomic fs/ceph/locks.c | 75 -- fs/ceph/mds_client.c | 65 +++ fs/ceph/super.h |9 - 3 files changed, 90 insertions(+), 59 deletions(-) -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph-users] OSD state flipping when cluster-network in high utilization
On 05/15/2013 09:12 AM, Chen, Xiaoxi wrote: > Thanks, but i am not quite understand how to determine weather monitor > overloaded? I've also found it difficult, and a process of elimination. Make sure your mon debugging is minimal. If it's not, lower it and see if it helps. Watch your mon host with vmstat or iostat, to see what its disk write rate is during the periods you're having trouble with OSDs having heartbeat issues. If it's near what you think the limiting rate is for whatever device is holding the mon's store.db directory, that might be your bottleneck. > and if yes,will start several monitor help? Sage or one of the other Ceph developers can speak better to this. I will note that with three monitors, if one is too busy, it'll miss renewing its lease. The result of that will be an election to get a new quorum. If you run multiple mons and they hold frequent elections where a mon drops out of quorum, then comes back in on the next election, I've found that to be a sign that my mons are too busy. -- Jim > > Sent from my iPhone > > On 2013-5-15, 23:07, "Jim Schutt" wrote: > >> On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: >>>> How responsive generally is the machine under load? Is there available >>>> CPU? >>>The machine works well, and the issued OSDs are likely the same, seems >>> because they have relative slower disk( disk type are the same but the >>> latency is a bit higher ,8ms -> 10ms). >>> >>>Top show no idle % but still have 30+% of io_wait, my colleague educate >>> me that io_wait can be treated as free. >>> >>> Another information is offload the heartbeat to 1Gb nic doesn't solve the >>> problem, what's more, when we doing random write test, we can still see >>> this flipping happen. So I would like to say it may related with CPU >>> scheduler ? The heartbeat thread (in busy OSD ) failed to get enough cpu >>> cycle. >>> >> >> FWIW, also take a close look at your monitor daemons, and >> whether they show any signs of being overloaded. >> >> I frequently see OSDs wrongly marked down when my >> mons cannot keep up with their workload. >> >> -- Jim >> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
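To make the "watch your mon host" suggestion concrete, the sort of thing I do is simply (the interval is just an example; watch whatever disk holds the mon's store.db):

    vmstat 2
    iostat -x 2

on the monitor host while the OSDs are flapping, and keep an eye on 'ceph -w' at the same time for repeated monitor election messages.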
RE: [ceph-users] OSD state flipping when cluster-network in high utilization
On 05/14/2013 09:23 PM, Chen, Xiaoxi wrote: >> How responsive generally is the machine under load? Is there available CPU? > The machine works well, and the issued OSDs are likely the same, seems > because they have relative slower disk( disk type are the same but the > latency is a bit higher ,8ms -> 10ms). > > Top show no idle % but still have 30+% of io_wait, my colleague > educate me that io_wait can be treated as free. > > Another information is offload the heartbeat to 1Gb nic doesn't solve the > problem, what's more, when we doing random write test, we can still see this > flipping happen. So I would like to say it may related with CPU scheduler ? > The heartbeat thread (in busy OSD ) failed to get enough cpu cycle. > FWIW, also take a close look at your monitor daemons, and whether they show any signs of being overloaded. I frequently see OSDs wrongly marked down when my mons cannot keep up with their workload. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: ceph_pagelist_append might sleep while atomic
On 05/14/2013 10:44 AM, Alex Elder wrote: > On 05/09/2013 09:42 AM, Jim Schutt wrote: >> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while >> holding a lock, but it's spoiled because ceph_pagelist_addpage() always >> calls kmap(), which might sleep. Here's the result: > > I finally took a close look at this today, Jim. Sorry > for the delay. > No worries - thanks for taking a look. > The issue is formatting the reconnect message--which will > hold an arbitrary amount of data and therefore which we'll > need to do some allocation (and kmap) for--in the face of > having to hold the flock spinlock while doing so. > > And as you found, ceph_pagelist_addpage(), which is called > by ceph_pagelist_append(), calls kmap() even if it doesn't > need to allocate anything. This means that despite reserving > the pages, those pages are in the free list and because they'll > need to be the subject of kmap() their preallocation doesn't > help. > > Your solution was to pre-allocate a buffer, format the locks > into that buffer while holding the lock, then append the > buffer contents to a pagelist after releasing the lock. You > check for a changing (increasing) lock count while you format > the locks, which is good. > > So... Given that, I think your change looks good. It's a shame > we can't format directly into the pagelist buffer but this won't > happen much so it's not a big deal. I have a few small suggestions, > below. > > I do find some byte order bugs though. They aren't your doing, > but I think they ought to be fixed first, as a separate patch > that would precede this one. The bug is that the lock counts > that are put into the buffer (num_fcntl_locks and num_flock_locks) > are not properly byte-swapped. I'll point it out inline > in your code, below. > > I'll say that what you have is OK. Consider my suggestions, and > if you choose not to fix the byte order bugs, please let me know. I'll happily fix up a v2 series with your suggestions addressed. Thanks for catching those issues. Stay tuned... Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pg balancing
[resent to list because I missed that Cc:] Hi Sage, On 05/13/2013 06:35 PM, Sage Weil wrote: > Hi Jim- > > You mentioned the other day your concerns about the uniformity of the PG > and data distribution. There are several ways to attack it (including > increasing the number of PGs), but one that we haven't tested much yet is > the 'reweight-by-utilization' function in the monitor. > > The idea is that there will always be some statistical variance in the > distribution and a non-zero probability of having outlier OSDs with too > many PG. We adjust for this by taking nodes that are substantially above > the mean down by some adjustment factor in an automated way. > > ceph osd reweight-by-utilization MIN > > where MIN is the minimum relative utilization at which we will start > adjusting down. It is always > 100 (100% of the mean), and defaults to > 120. After it adjusts the reweights, you should see the result in 'ceph > osd tree' output > > Have you played with this at all on your cluster? I'd be very interested > in how well this does/does not improve things for you. I haven't yet, but likely it's because I don't understand what the reweighting does, exactly. Maybe you can comment inline below where I go wrong? Here's my thinking: I'm only partially motivated by the actual amount of storage used per OSD, although it is a factor. My major concern is a performance issue for our parallel application codes. Their computation cycle is: compute furiously, write results, repeat. The issue is that none of our codes implement write-behind; each task must finish writing results before any can resume computing. So, when some OSDs carry more PGs, they cannot complete their portion of the write phase as quickly as other OSDs with fewer PGs. Thus, the application's ability to resume computation is delayed by the busiest OSDs. My concern is that when we rebalance, we just cause some other subset of the OSDs to be the busiest, in order to send fewer writes to the overused OSDs and more writes to underused OSDs. At least, that's what I was thinking, without actually examining the code to see what is really going on in rebalancing, and without testing. Another thing I haven't done is actually compute from the statistics of uniform distributions what the expected variance is for my specific layout, (now 256K PGs across 576 OSDs, with root/host/device hierarchy, 24 OSDs/host). That's mostly due to my lack of knowledge of statistics If I'm getting more variance than expected I want to understand why, in case it can be fixed. In any event, I think it's past time I tried reweighting. Suppose I use 'ceph osd reweight-by-utilization 101', on the theory that I'd cause continuous, small adjustments to utilization, and I'd learn what the maximum impact can be. Does that seem like a bad idea to you, and if so could you help me understand why? Thanks for taking the time to think about this - I know you're busy. PS - FWIW, another reason I keep pushing the number of PGs is because when we actually deploy Ceph for production, it'll be at a bigger scale than my testbed. So, I'm trying to shake out any scale-related issues now, to make sure our users' first experience with Ceph is a good one. -- Jim > > Thanks! > sage > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
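For what it's worth, the crude estimate I have in mind goes something like this -- it ignores replication and the CRUSH hierarchy and just treats each PG as an independent, uniformly random choice of OSD, so it is only a back-of-envelope guide:

    expected PGs per OSD:   mu ~ P / M
    standard deviation:     sigma ~ sqrt(P / M)
    relative spread:        sigma / mu ~ 1 / sqrt(P / M)

For M = 576 OSDs and P = 256K PGs that gives mu ~ 455 and a relative spread of roughly 5%, and the most-loaded of 576 OSDs will typically sit a few sigma above the mean. Counting replicas would raise the per-OSD counts and shrink the relative spread accordingly. The main point is the scaling: quadrupling the PG count should roughly halve the relative spread, which is consistent with my experience that more PGs per OSD makes the distribution easier to flatten.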
[PATCH] libceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 [13439.316464] 5 locks held by kworker/1:1/12059: [13439.320998] #0: (ceph-msgr){..}, at: [] process_one_work+0x218/0x480 [13439.329701] #1: ((&(&con->work)->work)){..}, at: [] process_one_work+0x218/0x480 [13439.339446] #2: (&s->s_mutex){..}, at: [] send_mds_reconnect+0xec/0x450 [ceph] [13439.349081] #3: (&mdsc->snap_rwsem){..}, at: [] send_mds_reconnect+0x16e/0x450 [ceph] [13439.359278] #4: (file_lock_lock){..}, at: [] lock_flocks+0x15/0x20 [13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW 3.9.0-00358-g308ae61 #557 [13439.376225] Call Trace: [13439.378757] [] __might_sleep+0xfc/0x110 [13439.384353] [] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [] ? _raw_spin_lock+0x49/0x50 [13439.403775] [] ? lock_flocks+0x15/0x20 [13439.409277] [] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [] ? igrab+0x28/0x70 [13439.420610] [] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [] check_new_map+0x352/0x500 [ceph] [13439.461534] [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [] ? mutex_unlock+0xe/0x10 [13439.473934] [] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [] dispatch+0xbc/0x110 [libceph] [13439.486492] [] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [] ? read_partial_message+0x3e8/0x520 [libceph] [13439.500583] [] ? kernel_recvmsg+0x44/0x60 [13439.506324] [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph] [13439.513140] [] try_read+0x5fe/0x7e0 [libceph] [13439.519246] [] con_work+0x378/0x4a0 [libceph] [13439.525345] [] ? finish_task_switch+0x3f/0x110 [13439.531515] [] process_one_work+0x2b5/0x480 [13439.537439] [] ? process_one_work+0x218/0x480 [13439.543526] [] worker_thread+0x1f5/0x320 [13439.549191] [] ? manage_workers+0x170/0x170 [13439.555102] [] kthread+0xe1/0xf0 [13439.560075] [] ? __init_kthread_worker+0x70/0x70 [13439.566419] [] ret_from_fork+0x7c/0xb0 [13439.571918] [] ? __init_kthread_worker+0x70/0x70 [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt --- fs/ceph/locks.c | 73 + fs/ceph/mds_client.c | 62 ++ fs/ceph/super.h |9 +- 3 files changed, 88 insertions(+), 56 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..9a46161 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -191,27 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) } /** - * Encode the flock and fcntl locks for the given inode into the pagelist. - * Format is: #fcntl locks, sequential fcntl locks, #flock locks, - * sequential flock locks. - * Must be called with lock_flocks() already held. 
- * If we encounter more of a specific lock type than expected, - * we return the value 1. + * Encode the flock and fcntl locks for the given inode into the ceph_filelock + * array. Must be called with lock_flocks() already held. + * If we encounter more of a specific lock type than expected, return -ENOSPC. */ -int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks) +int ceph_encode_locks_to_buffer(struct inode *inode, + struct ceph_filelock *flocks, + int num_fcntl_locks, int num_flock_locks) { struct file_lock *lock; - struct ceph_filelock cephlock; int err = 0; int seen_fcntl = 0; int seen_flock = 0; + int l = 0; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32)); - if (err) - goto fail; + for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { if (lock->fl_flags & FL_POSIX) { ++seen_fcntl; @@ -219,19 +215,12 @@ int
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/18/2013 04:38 PM, Jim Schutt wrote: > Hi Greg, > > On 04/16/2013 02:18 PM, Gregory Farnum wrote: >> On Fri, Apr 12, 2013 at 12:41 PM, Jim Schutt wrote: >>> Hi Greg, >>> >>> >>> On 04/10/2013 06:39 PM, Gregory Farnum wrote: >>>> Jim, >>>> I took this patch as a base for setting up config options which people >>>> can tune manually and have pushed those changes to wip-leveldb-config. >>> >>> I was out of the office unexpectedly for a few days, >>> so I'm just now taking a look. >>> >>>> Thanks very much for figuring out how to set up the cache et al! >>> >>> No problem! >>> >>>> >>>> For now I restructured quite a bit of the data ingestion, and I took >>>> your defaults for the monitor on the write buffer, block size, and >>>> compression, but I left the cache off. These also don't apply to the >>>> OSDs at all. In order to enable more experimentation I do pass through >>>> the options though: >>>> OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's >>>> leveldb write buffer size >>>> OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size >>>> OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb >>>> block size >>>> OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per >>>> entry >>>> OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open >>>> files >>>> OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses >>>> compression >>>> (and similar ones for osd_ldb_*). >> >> On request from Sage these are now "*_leveldb_*" instead of "*_ldb_*"; >> I pushed that a couple hours ago. In case you haven't already pulled >> down a copy, and so you know for when it gets into mainline and you go >> to adjust it. :) > > I've been testing this over the last several days, both > via the wip-leveldb-config branch, and then the next branch > after that was merged into next. > > I've gotten slightly better startup behavior with mon_ldb_cache_size Oops, now mon_leveldb_cache_size ;) -- Jim > larger than the default 8 MiB in leveldb - I've used 64 MiB and > 256 MiB, and found little to prefer one over the other. Both > seem to be preferable to the default, at least wrt. startup > behavior at the number of PGs (256K) I'm testing. > > Other than that, I've had no trouble with the new tunings. > > I'm sorry for the delay in reporting - I've had a little > hardware trouble, and was trying to be sure that any issues > I've been having were related to that and not these patches. > > -- Jim > >> -Greg >> Software Engineer #42 @ http://inktank.com | http://ceph.com >> >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
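In case it helps anyone else reproduce this, the tuning I've been running amounts to something like the following in ceph.conf (option names per the renamed *_leveldb_* versions; the write buffer, block size, and compression values are just the defaults Greg listed, and the cache size is the one knob I changed):

[mon]
    mon leveldb write buffer size = 33554432   # 32 MiB
    mon leveldb block size = 4194304           # 4 MiB
    mon leveldb compression = false
    mon leveldb cache size = 268435456         # 256 MiB; default is 0, i.e. leveldb's own 8 MiB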
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
Hi Greg, On 04/16/2013 02:18 PM, Gregory Farnum wrote: > On Fri, Apr 12, 2013 at 12:41 PM, Jim Schutt wrote: >> Hi Greg, >> >> >> On 04/10/2013 06:39 PM, Gregory Farnum wrote: >>> Jim, >>> I took this patch as a base for setting up config options which people >>> can tune manually and have pushed those changes to wip-leveldb-config. >> >> I was out of the office unexpectedly for a few days, >> so I'm just now taking a look. >> >>> Thanks very much for figuring out how to set up the cache et al! >> >> No problem! >> >>> >>> For now I restructured quite a bit of the data ingestion, and I took >>> your defaults for the monitor on the write buffer, block size, and >>> compression, but I left the cache off. These also don't apply to the >>> OSDs at all. In order to enable more experimentation I do pass through >>> the options though: >>> OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's >>> leveldb write buffer size >>> OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size >>> OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb block >>> size >>> OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per >>> entry >>> OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open >>> files >>> OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses >>> compression >>> (and similar ones for osd_ldb_*). > > On request from Sage these are now "*_leveldb_*" instead of "*_ldb_*"; > I pushed that a couple hours ago. In case you haven't already pulled > down a copy, and so you know for when it gets into mainline and you go > to adjust it. :) I've been testing this over the last several days, both via the wip-leveldb-config branch, and then the next branch after that was merged into next. I've gotten slightly better startup behavior with mon_ldb_cache_size larger than the default 8 MiB in leveldb - I've used 64 MiB and 256 MiB, and found little to prefer one over the other. Both seem to be preferable to the default, at least wrt. startup behavior at the number of PGs (256K) I'm testing. Other than that, I've had no trouble with the new tunings. I'm sorry for the delay in reporting - I've had a little hardware trouble, and was trying to be sure that any issues I've been having were related to that and not these patches. -- Jim > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
Hi Greg, On 04/10/2013 06:39 PM, Gregory Farnum wrote: > Jim, > I took this patch as a base for setting up config options which people > can tune manually and have pushed those changes to wip-leveldb-config. I was out of the office unexpectedly for a few days, so I'm just now taking a look. > Thanks very much for figuring out how to set up the cache et al! No problem! > > For now I restructured quite a bit of the data ingestion, and I took > your defaults for the monitor on the write buffer, block size, and > compression, but I left the cache off. These also don't apply to the > OSDs at all. In order to enable more experimentation I do pass through > the options though: > OPTION(mon_ldb_write_buffer_size, OPT_U64, 32*1024*1024) // monitor's > leveldb write buffer size > OPTION(mon_ldb_cache_size, OPT_U64, 0) // monitor's leveldb cache size > OPTION(mon_ldb_block_size, OPT_U64, 4*1024*1024) // monitor's leveldb block > size > OPTION(mon_ldb_bloom_size, OPT_INT, 0) // monitor's leveldb bloom bits per > entry > OPTION(mon_ldb_max_open_files, OPT_INT, 0) // monitor's leveldb max open files > OPTION(mon_ldb_compression, OPT_BOOL, false) // monitor's leveldb uses > compression > (and similar ones for osd_ldb_*). > > If you have the opportunity to verify that these patches work for you > (in particular I'm wondering if the OSDs need any more tuning on their > end which was being masked by your global changes) that would be > wonderful. :) I'll be able to take these for a spin next Tuesday, and I'll definitely keep an eye open for any unexpected behavior on the OSDs. Sorry for the delay... I appreciate you taking a look at this, and fixing it up into something that's more appropriate. Thanks --Jim > Thanks, > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
As reported in this thread http://www.spinics.net/lists/ceph-devel/msg13777.html starting in v0.59 a new filesystem with ~55,000 PGs would not start after a period of ~30 minutes. By comparison, the same filesystem configuration would start in ~1 minute for v0.58. The issue is that starting in v0.59, LevelDB is used for the monitor data store. For moderate to large numbers of PGs, the length of a PGStat value stored via LevelDB is best measured in megabytes. The default tunings for LevelDB data blocking seem tuned for values with lengths measured in tens or hundreds of bytes. With the data blocking tuning provided by this patch, here's a comparison of filesystem startup times for v0.57, v0.58, and v0.59: 55,392 PGs 221,568 PGs v0.57 1m 07s9m 42s v0.58 1m 04s 11m 44s v0.59 48s3m 30s Note that this patch turns off LevelDB's compression by default. The block tuning from this patch with compression enabled made no improvement in the new filesystem startup time for v0.59, for either PG count tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB; perhaps that value length interacts poorly with LevelDB's compression at this block size. Signed-off-by: Jim Schutt --- src/common/config_opts.h |4 src/os/LevelDBStore.cc |9 + src/os/LevelDBStore.h|3 +++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/src/common/config_opts.h b/src/common/config_opts.h index 9d42961..e8f491e 100644 --- a/src/common/config_opts.h +++ b/src/common/config_opts.h @@ -181,6 +181,10 @@ OPTION(paxos_propose_interval, OPT_DOUBLE, 1.0) // gather updates for this long OPTION(paxos_min_wait, OPT_DOUBLE, 0.05) // min time to gather updates for after period of inactivity OPTION(paxos_trim_tolerance, OPT_INT, 30) // number of extra proposals tolerated before trimming OPTION(paxos_trim_disabled_max_versions, OPT_INT, 100) // maximum amount of versions we shall allow passing by without trimming +OPTION(leveldb_block_size, OPT_U64, 4 * 1024 * 1024) // leveldb unit of caching, compression (in bytes) +OPTION(leveldb_write_buffer_size, OPT_U64, 32 * 1024 * 1024) // leveldb unit of I/O (in bytes) +OPTION(leveldb_cache_size, OPT_U64, 256 * 1024 * 1024) // leveldb data cache size (in bytes) +OPTION(leveldb_compression_enabled, OPT_BOOL, false) OPTION(clock_offset, OPT_DOUBLE, 0) // how much to offset the system clock in Clock.cc OPTION(auth_cluster_required, OPT_STR, "cephx") // required of mon, mds, osd daemons OPTION(auth_service_required, OPT_STR, "cephx") // required by daemons of clients diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc index 3d94096..0d41564 100644 --- a/src/os/LevelDBStore.cc +++ b/src/os/LevelDBStore.cc @@ -14,13 +14,22 @@ using std::string; int LevelDBStore::init(ostream &out, bool create_if_missing) { + db_cache = leveldb::NewLRUCache(g_conf->leveldb_cache_size); + leveldb::Options options; options.create_if_missing = create_if_missing; + options.write_buffer_size = g_conf->leveldb_write_buffer_size; + options.block_size = g_conf->leveldb_block_size; + options.block_cache = db_cache; + if (!g_conf->leveldb_compression_enabled) +options.compression = leveldb::kNoCompression; leveldb::DB *_db; leveldb::Status status = leveldb::DB::Open(options, path, &_db); db.reset(_db); if (!status.ok()) { out << status.ToString() << std::endl; +delete db_cache; +db_cache = NULL; return -EINVAL; } else return 0; diff --git a/src/os/LevelDBStore.h b/src/os/LevelDBStore.h index 7f0e154..8199a41 100644 --- a/src/os/LevelDBStore.h +++ b/src/os/LevelDBStore.h @@ -14,18 +14,21 @@ #include 
"leveldb/db.h" #include "leveldb/write_batch.h" #include "leveldb/slice.h" +#include "leveldb/cache.h" /** * Uses LevelDB to implement the KeyValueDB interface */ class LevelDBStore : public KeyValueDB { string path; + leveldb::Cache *db_cache; boost::scoped_ptr db; int init(ostream &out, bool create_if_missing); public: LevelDBStore(const string &path) : path(path) {} + ~LevelDBStore() { delete db_cache; } /// Opens underlying db int open(ostream &out) { -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/05/2013 09:06 AM, Sage Weil wrote: > On Fri, 5 Apr 2013, Jim Schutt wrote: >> On 04/04/2013 10:44 PM, Sage Weil wrote: >>> Fantastic work tracking this down, Jim! >>> >>> Looking at the Riak docs on tuning leveldb, it looks like a large write >>> buffer size is definitely a good idea. The block size of 4MB is >>> significantly larger than what they recommend, though.. if we go this big >>> we also need to make the cache size larger (it defaults to 8MB?). >> >> I wondered about that, but the change in performance with >> just these simple changes was so drastic that I thought >> it was a good immediate change, with further tuning to >> follow. >> >> Although, I just tested startup, not actually running such >> a filesystem. Perhaps I missed the effects of insufficient >> caching. >> >> I ended up picking 4 MB because the 55K PG case has PGStat >> values that are 20 MB in length, so it still takes 5 blocks >> to store one. Plus, I'd really like to be running at 256K >> or 512K PGs to get the data uniformity across OSDs that I'm >> after. >> >> Perhaps these need to be config options - I just hated to add >> more, when it would be hard for a user to discover that's the >> thing that needed tuning to solve some particular performance >> issue. I'm hoping we can find one set of values that work well >> for everyone. > > I think the best option is to figure out the defaults that will work for > everyone, but still make config options. I suspect there will be users > that need to tune for less memory. > > leveldb_block_size, leveldb_write_buffer_size, etc. > >>> Did you >>> try with a large write buffer but a smaller block size (like 256K or >>> 512K)? >> >> I did try a 256K block size with an 8 MB write buffer, also with >> no compression. That caused the 55K PGs case to make much more >> progress towards starting, but it still failed to come up - I >> forget the details of what went awry. >> >>> >>> I think either a larger cache or a smaller block size is okay, but 4MB >>> with an 8MB cache means only 2 blocks cached, which sounds non-ideal. >> >> I can try fixing up a larger cache - I'll need to dig in >> a little to figure out how to do it, so it might take me >> a little while. How big do you think it should be, given >> that the large PG count cases I'm after might have PGStat >> data lengths that are many tens of MB? >> >> 64 MB? 256 MB? Tunable? > > The default is only 8MB, but this can safely go up to several gigs. I > think 256 MB sounds like a reasonable default... > > Sam, want to weigh in? > >> Also, I wondered about whether the writes need to be >> sync'd? Do these tunings change your mind about whether >> that's needed? > > The sync parameters shouldn't need to be changed. os/FileStore.cc is > calling the sync when we do an overall filestore sync prior to a commit > point or btrfs snapshot. OK. Now that I've had a look, setting up the cache should be easy - I'm working on it. I'll fix up v2 of the patch with that, and options for the tunables. Thanks -- Jim > > sage > > >> >>> >>> Thanks! >>> sage >>> >>> >>> On Thu, 4 Apr 2013, Jim Schutt wrote: >>> >>>> As reported in this thread >>>>http://www.spinics.net/lists/ceph-devel/msg13777.html >>>> starting in v0.59 a new filesystem with ~55,000 PGs would not start after >>>> a period of ~30 minutes. By comparison, the same filesystem configuration >>>> would start in ~1 minute for v0.58. >>>> >>>> The issue is that starting in v0.59, LevelDB is used for the monitor >>>> data store. 
For moderate to large numbers of PGs, the length of a PGStat >>>> value >>>> stored via LevelDB is best measured in megabytes. The default tunings for >>>> LevelDB data blocking seem tuned for values with lengths measured in tens >>>> or >>>> hundreds of bytes. >>>> >>>> With the data blocking tuning provided by this patch, here's a comparison >>>> of filesystem startup times for v0.57, v0.58, and v0.59: >>>> >>>> 55,392 PGs 221,568 PGs >>>> v0.57 1m 07s9m 42s >>>> v0.58 1m 04s 11m 44s >>>> v0.59 45s4m 17s >>>> >>>> Note that this patch turns off LevelDB's compre
Re: [PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
On 04/04/2013 10:44 PM, Sage Weil wrote: > Fantastic work tracking this down, Jim! > > Looking at the Riak docs on tuning leveldb, it looks like a large write > buffer size is definitely a good idea. The block size of 4MB is > significantly larger than what they recommend, though.. if we go this big > we also need to make the cache size larger (it defaults to 8MB?). I wondered about that, but the change in performance with just these simple changes was so drastic that I thought it was a good immediate change, with further tuning to follow. Although, I just tested startup, not actually running such a filesystem. Perhaps I missed the effects of insufficient caching. I ended up picking 4 MB because the 55K PG case has PGStat values that are 20 MB in length, so it still takes 5 blocks to store one. Plus, I'd really like to be running at 256K or 512K PGs to get the data uniformity across OSDs that I'm after. Perhaps these need to be config options - I just hated to add more, when it would be hard for a user to discover that's the thing that needed tuning to solve some particular performance issue. I'm hoping we can find one set of values that work well for everyone. > Did you > try with a large write buffer but a smaller block size (like 256K or > 512K)? I did try a 256K block size with an 8 MB write buffer, also with no compression. That caused the 55K PGs case to make much more progress towards starting, but it still failed to come up - I forget the details of what went awry. > > I think either a larger cache or a smaller block size is okay, but 4MB > with an 8MB cache means only 2 blocks cached, which sounds non-ideal. I can try fixing up a larger cache - I'll need to dig in a little to figure out how to do it, so it might take me a little while. How big do you think it should be, given that the large PG count cases I'm after might have PGStat data lengths that are many tens of MB? 64 MB? 256 MB? Tunable? Also, I wondered about whether the writes need to be sync'd? Do these tunings change your mind about whether that's needed? > > Thanks! > sage > > > On Thu, 4 Apr 2013, Jim Schutt wrote: > >> As reported in this thread >>http://www.spinics.net/lists/ceph-devel/msg13777.html >> starting in v0.59 a new filesystem with ~55,000 PGs would not start after >> a period of ~30 minutes. By comparison, the same filesystem configuration >> would start in ~1 minute for v0.58. >> >> The issue is that starting in v0.59, LevelDB is used for the monitor >> data store. For moderate to large numbers of PGs, the length of a PGStat >> value >> stored via LevelDB is best measured in megabytes. The default tunings for >> LevelDB data blocking seem tuned for values with lengths measured in tens or >> hundreds of bytes. >> >> With the data blocking tuning provided by this patch, here's a comparison >> of filesystem startup times for v0.57, v0.58, and v0.59: >> >> 55,392 PGs 221,568 PGs >> v0.57 1m 07s9m 42s >> v0.58 1m 04s 11m 44s >> v0.59 45s4m 17s >> >> Note that this patch turns off LevelDB's compression. The block >> tuning from this patch with compression enabled made no improvement >> in the new filesystem startup time for v0.59, for either PG count >> tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB; >> perhaps that value length interacts pooly with LevelDB's compression s/pooly/poorly/ Thanks -- Jim >> at this block size. 
>>
>> Signed-off-by: Jim Schutt
>> ---
>>  src/os/LevelDBStore.cc |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc
>> index 3d94096..1b6ae7d 100644
>> --- a/src/os/LevelDBStore.cc
>> +++ b/src/os/LevelDBStore.cc
>> @@ -16,6 +16,9 @@ int LevelDBStore::init(ostream &out, bool create_if_missing)
>>  {
>>    leveldb::Options options;
>>    options.create_if_missing = create_if_missing;
>> +  options.write_buffer_size = 32 * 1024 * 1024;
>> +  options.block_size = 4 * 1024 * 1024;
>> +  options.compression = leveldb::kNoCompression;
>>    leveldb::DB *_db;
>>    leveldb::Status status = leveldb::DB::Open(options, path, &_db);
>>    db.reset(_db);
>> --
>> 1.7.8.2
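For the larger cache Jim says he still needs to dig into, LevelDB sizes its
block cache through an explicit cache object rather than a plain byte count.
A rough sketch of what that could look like alongside the options from the
patch (the 256 MB figure is purely illustrative; the thread leaves the right
size open):

  #include <leveldb/cache.h>
  #include <leveldb/db.h>

  // Sketch only, not the posted patch.  The caller owns the cache object
  // and must keep it alive for the life of the DB, then delete it after
  // the DB is closed.
  leveldb::Options options;
  options.create_if_missing = create_if_missing;
  options.write_buffer_size = 32 * 1024 * 1024;   // as in the patch
  options.block_size = 4 * 1024 * 1024;           // as in the patch
  options.compression = leveldb::kNoCompression;  // as in the patch
  options.block_cache = leveldb::NewLRUCache(256 * 1024 * 1024);  // illustrative size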
[PATCH] os/LevelDBStore: tune LevelDB data blocking options to be more suitable for PGStat values
As reported in this thread

  http://www.spinics.net/lists/ceph-devel/msg13777.html

starting in v0.59 a new filesystem with ~55,000 PGs would not start after
a period of ~30 minutes. By comparison, the same filesystem configuration
would start in ~1 minute for v0.58.

The issue is that starting in v0.59, LevelDB is used for the monitor
data store. For moderate to large numbers of PGs, the length of a PGStat
value stored via LevelDB is best measured in megabytes. The default tunings
for LevelDB data blocking seem tuned for values with lengths measured in
tens or hundreds of bytes.

With the data blocking tuning provided by this patch, here's a comparison
of filesystem startup times for v0.57, v0.58, and v0.59:

             55,392 PGs    221,568 PGs
  v0.57      1m 07s         9m 42s
  v0.58      1m 04s        11m 44s
  v0.59         45s         4m 17s

Note that this patch turns off LevelDB's compression. The block
tuning from this patch with compression enabled made no improvement
in the new filesystem startup time for v0.59, for either PG count
tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB;
perhaps that value length interacts pooly with LevelDB's compression
at this block size.

Signed-off-by: Jim Schutt
---
 src/os/LevelDBStore.cc |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc
index 3d94096..1b6ae7d 100644
--- a/src/os/LevelDBStore.cc
+++ b/src/os/LevelDBStore.cc
@@ -16,6 +16,9 @@ int LevelDBStore::init(ostream &out, bool create_if_missing)
 {
   leveldb::Options options;
   options.create_if_missing = create_if_missing;
+  options.write_buffer_size = 32 * 1024 * 1024;
+  options.block_size = 4 * 1024 * 1024;
+  options.compression = leveldb::kNoCompression;
   leveldb::DB *_db;
   leveldb::Status status = leveldb::DB::Open(options, path, &_db);
   db.reset(_db);
--
1.7.8.2
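Jim's reply elsewhere in the thread floats making these tunables config
options rather than hard-coded constants. A sketch of roughly what that
could look like; the option names below are invented for illustration and
are not existing ceph options:

  // Hypothetical only -- these g_conf names do not exist in the tree at
  // this point; they just show the shape of replacing the constants.
  options.write_buffer_size = g_conf->mon_leveldb_write_buffer_size;
  options.block_size        = g_conf->mon_leveldb_block_size;
  options.compression       = g_conf->mon_leveldb_compression
                                ? leveldb::kSnappyCompression
                                : leveldb::kNoCompression;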
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 04:40 PM, Jim Schutt wrote:
> On 04/03/2013 12:25 PM, Sage Weil wrote:
>>>> Sorry, guess I forgot some of the history since this piece at least is
>>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>>> without those overloads you were seeing; have you seen this issue
>>>> without your high debugging levels and without the bad PG commits (due
>>>> to debugging)?
>>>
>>> I think so, because that's why I started with higher debugging
>>> levels.
>>>
>>> But, as it turns out, I'm just in the process of returning to my
>>> testing of next, with all my debugging back to 0. So, I'll try
>>> the default timeout of 30 seconds first. If I have trouble starting
>>> up a new file system, I'll turn up the timeout and try again, without
>>> any extra debugging. Either way, I'll let you know what happens.
>> I would be curious to hear roughly what value between 30 and 300 is
>> sufficient, if you can experiment just a bit. We probably want to adjust
>> the default.
>>
>> Perhaps more importantly, we'll need to look at the performance of the pg
>> stat updates on the mon. There is a refactor due in that code that should
>> improve life, but it's slated for dumpling.
> OK, here's some results, with all debugging at 0, using current next...
>
> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
> use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
> My test procedure is to start 'ceph -w', start all the OSDs, and once
> they're all running start the mon. I report the time from starting
> the mon to all PGs active+clean.

As a sanity check, to be sure I wasn't doing something differently now than
I remember doing before, I re-ran this test for v0.57, v0.58, and v0.59,
using default 'osd mon ack timeout', default 'paxos propose interval', and
no debugging:

             55,392 PGs    221,568 PGs
  v0.57      1m 07s         9m 42s
  v0.58      1m 04s        11m 44s
  v0.59      >30m           not attempted

The v0.57/v0.58 runs showed no signs of stress, e.g. no slow op reports,
etc. The v0.59 run behaved as I previously reported, i.e., lots of stale
peers, OSDs wrongly marked down, etc., before I gave up on it.

-- Jim
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/04/2013 08:15 AM, Jim Schutt wrote: > On 04/03/2013 04:51 PM, Gregory Farnum wrote: >> On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt wrote: >>> On 04/03/2013 12:25 PM, Sage Weil wrote: >>>>>>>> Sorry, guess I forgot some of the history since this piece at least is >>>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues >>>>>>>> without those overloads you were seeing; have you seen this issue >>>>>>>> without your high debugging levels and without the bad PG commits (due >>>>>>>> to debugging)? >>>>>> >>>>>> I think so, because that's why I started with higher debugging >>>>>> levels. >>>>>> >>>>>> But, as it turns out, I'm just in the process of returning to my >>>>>> testing of next, with all my debugging back to 0. So, I'll try >>>>>> the default timeout of 30 seconds first. If I have trouble starting >>>>>> up a new file system, I'll turn up the timeout and try again, without >>>>>> any extra debugging. Either way, I'll let you know what happens. >>>> I would be curious to hear roughly what value between 30 and 300 is >>>> sufficient, if you can experiment just a bit. We probably want to adjust >>>> the default. >>>> >>>> Perhaps more importantly, we'll need to look at the performance of the pg >>>> stat updates on the mon. There is a refactor due in that code that should >>>> improve life, but it's slated for dumpling. >>> >>> OK, here's some results, with all debugging at 0, using current next... >>> >>> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts >>> use 10 GbE NICs now. The mon host uses an SSD for the mon data store. >>> My test procedure is to start 'ceph -w', start all the OSDs, and once >>> they're all running start the mon. I report the time from starting >>> the mon to all PGs active+clean. >>> >>> # PGs osd mon ackstartupnotes >>> timeout time >>> --- - >>> 55392 default >30:00 1 >>> 55392300 18:36 2 >>> 55392 60>30:00 3 >>> 55392150>30:00 4 >>> 55392240>30:00 5 >>> 55392300>30:00 2,6 >>> >>> notes: >>> 1) lots of PGs marked stale, OSDs wrongly marked down >>> before I gave up on this case >>> 2) OSDs report lots of slow requests for "pg_notify(...) v4 >>> currently wait for new map" >>> 3) some OSDs wrongly marked down, OSDs report some slow requests >>> for "pg_notify(...) v4 currently wait for new map" >>> before I gave up on this case >>> 4) appeared to be making progress; then an OSD was marked >>> out at ~21 minutes; many more marked out before I >>> gave up on this case >>> 5) some OSD reports of slow requests for "pg_notify", >>> some OSDs wrongly marked down, appeared to be making >>> progress, then stalled; then I gave up on this case >>> 6) retried this case, appeared to be making progress, but >>> after ~18 min stalled at 19701 active+clean, 35691 peering >>> until I gave up >>> >>> Hmmm, I didn't really expect the above results. I ran out of >>> time before attempting an even longer osd mon ack timeout. >>> >>> But either we're on the wrong trail, or 300 is not sufficient. >>> Or, I'm doing something wrong and haven't yet figured out what >>> it is. >>> >>> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs, >>> and my memory is a new filesystem started up in ~5 minutes. For >>> that testing I had to increase 'paxos propose interval' to two >>> or three seconds to keep the monitor writeout rate (as measured >>> by vmstat) down to a sustained 50-70 MB/s during start-up. 
>>> >>> That was with a 1 GbE NIC in the mon; the reason I upgraded >>> it was a filesystem with 512K PGs was taking too long to start, >>> and I thought the mon might be network-limited since it had an >>> SSD for the mon data store. >>> >>> For the testing above I used the default '
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 04:51 PM, Gregory Farnum wrote: > On Wed, Apr 3, 2013 at 3:40 PM, Jim Schutt wrote: >> On 04/03/2013 12:25 PM, Sage Weil wrote: >>>>>>> Sorry, guess I forgot some of the history since this piece at least is >>>>>>> resolved now. I'm surprised if 30-second timeouts are causing issues >>>>>>> without those overloads you were seeing; have you seen this issue >>>>>>> without your high debugging levels and without the bad PG commits (due >>>>>>> to debugging)? >>>>> >>>>> I think so, because that's why I started with higher debugging >>>>> levels. >>>>> >>>>> But, as it turns out, I'm just in the process of returning to my >>>>> testing of next, with all my debugging back to 0. So, I'll try >>>>> the default timeout of 30 seconds first. If I have trouble starting >>>>> up a new file system, I'll turn up the timeout and try again, without >>>>> any extra debugging. Either way, I'll let you know what happens. >>> I would be curious to hear roughly what value between 30 and 300 is >>> sufficient, if you can experiment just a bit. We probably want to adjust >>> the default. >>> >>> Perhaps more importantly, we'll need to look at the performance of the pg >>> stat updates on the mon. There is a refactor due in that code that should >>> improve life, but it's slated for dumpling. >> >> OK, here's some results, with all debugging at 0, using current next... >> >> My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts >> use 10 GbE NICs now. The mon host uses an SSD for the mon data store. >> My test procedure is to start 'ceph -w', start all the OSDs, and once >> they're all running start the mon. I report the time from starting >> the mon to all PGs active+clean. >> >> # PGs osd mon ackstartupnotes >> timeout time >> --- - >> 55392 default >30:00 1 >> 55392300 18:36 2 >> 55392 60>30:00 3 >> 55392150>30:00 4 >> 55392240>30:00 5 >> 55392300>30:00 2,6 >> >> notes: >> 1) lots of PGs marked stale, OSDs wrongly marked down >> before I gave up on this case >> 2) OSDs report lots of slow requests for "pg_notify(...) v4 >> currently wait for new map" >> 3) some OSDs wrongly marked down, OSDs report some slow requests >> for "pg_notify(...) v4 currently wait for new map" >> before I gave up on this case >> 4) appeared to be making progress; then an OSD was marked >> out at ~21 minutes; many more marked out before I >> gave up on this case >> 5) some OSD reports of slow requests for "pg_notify", >> some OSDs wrongly marked down, appeared to be making >> progress, then stalled; then I gave up on this case >> 6) retried this case, appeared to be making progress, but >> after ~18 min stalled at 19701 active+clean, 35691 peering >> until I gave up >> >> Hmmm, I didn't really expect the above results. I ran out of >> time before attempting an even longer osd mon ack timeout. >> >> But either we're on the wrong trail, or 300 is not sufficient. >> Or, I'm doing something wrong and haven't yet figured out what >> it is. >> >> FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs, >> and my memory is a new filesystem started up in ~5 minutes. For >> that testing I had to increase 'paxos propose interval' to two >> or three seconds to keep the monitor writeout rate (as measured >> by vmstat) down to a sustained 50-70 MB/s during start-up. >> >> That was with a 1 GbE NIC in the mon; the reason I upgraded >> it was a filesystem with 512K PGs was taking too long to start, >> and I thought the mon might be network-limited since it had an >> SSD for the mon data store. 
>> >> For the testing above I used the default 'paxos propose interval'. >> I don't know if it matters, but vmstat sees only a little data >> being written on the mon system. > > That's odd; I'd actually expect to see more going to disk with v0.59 > than previously. Is vmstat actually looking at disk IO, or might it be > missing DirectIO or something? (Not that
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 12:25 PM, Sage Weil wrote:
>>> Sorry, guess I forgot some of the history since this piece at least is
>>> resolved now. I'm surprised if 30-second timeouts are causing issues
>>> without those overloads you were seeing; have you seen this issue
>>> without your high debugging levels and without the bad PG commits (due
>>> to debugging)?
>>
>> I think so, because that's why I started with higher debugging
>> levels.
>>
>> But, as it turns out, I'm just in the process of returning to my
>> testing of next, with all my debugging back to 0. So, I'll try
>> the default timeout of 30 seconds first. If I have trouble starting
>> up a new file system, I'll turn up the timeout and try again, without
>> any extra debugging. Either way, I'll let you know what happens.
> I would be curious to hear roughly what value between 30 and 300 is
> sufficient, if you can experiment just a bit. We probably want to adjust
> the default.
>
> Perhaps more importantly, we'll need to look at the performance of the pg
> stat updates on the mon. There is a refactor due in that code that should
> improve life, but it's slated for dumpling.

OK, here's some results, with all debugging at 0, using current next...

My testing is for 1 mon + 576 OSDs, 24/host. All my storage cluster hosts
use 10 GbE NICs now. The mon host uses an SSD for the mon data store.
My test procedure is to start 'ceph -w', start all the OSDs, and once
they're all running start the mon. I report the time from starting
the mon to all PGs active+clean.

  # PGs    osd mon ack    startup    notes
           timeout        time
  -----    -----------    -------    -----
  55392    default        >30:00     1
  55392    300             18:36     2
  55392     60            >30:00     3
  55392    150            >30:00     4
  55392    240            >30:00     5
  55392    300            >30:00     2,6

notes:
  1) lots of PGs marked stale, OSDs wrongly marked down
     before I gave up on this case
  2) OSDs report lots of slow requests for "pg_notify(...) v4
     currently wait for new map"
  3) some OSDs wrongly marked down, OSDs report some slow requests
     for "pg_notify(...) v4 currently wait for new map"
     before I gave up on this case
  4) appeared to be making progress; then an OSD was marked
     out at ~21 minutes; many more marked out before I
     gave up on this case
  5) some OSD reports of slow requests for "pg_notify",
     some OSDs wrongly marked down, appeared to be making
     progress, then stalled; then I gave up on this case
  6) retried this case, appeared to be making progress, but
     after ~18 min stalled at 19701 active+clean, 35691 peering
     until I gave up

Hmmm, I didn't really expect the above results. I ran out of
time before attempting an even longer osd mon ack timeout.

But either we're on the wrong trail, or 300 is not sufficient.
Or, I'm doing something wrong and haven't yet figured out what
it is.

FWIW, on v0.57 or v0.58 I was testing with one pool at 256K PGs,
and my memory is a new filesystem started up in ~5 minutes. For
that testing I had to increase 'paxos propose interval' to two
or three seconds to keep the monitor writeout rate (as measured
by vmstat) down to a sustained 50-70 MB/s during start-up.

That was with a 1 GbE NIC in the mon; the reason I upgraded
it was a filesystem with 512K PGs was taking too long to start,
and I thought the mon might be network-limited since it had an
SSD for the mon data store.

For the testing above I used the default 'paxos propose interval'.
I don't know if it matters, but vmstat sees only a little data
being written on the mon system.
-- Jim

> sage
>
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
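For reference, both knobs mentioned in these runs are ordinary ceph.conf
settings. A sketch of the overrides being experimented with above; the
section placement reflects where I would expect them to be read ('osd mon
ack timeout' by the OSDs, 'paxos propose interval' by the monitor), and the
values only mirror the experiments, not recommendations:

  [osd]
      ; how long an OSD waits for the mon to ack PGStats before
      ; reconnecting elsewhere
      osd mon ack timeout = 300

  [mon]
      ; batch pgmap updates into fewer, larger paxos proposals
      paxos propose interval = 2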
Re: Trouble getting a new file system to start, for v0.59 and newer
On 04/03/2013 11:49 AM, Gregory Farnum wrote: > On Wed, Apr 3, 2013 at 10:14 AM, Gregory Farnum wrote: >> On Wed, Apr 3, 2013 at 10:09 AM, Jim Schutt wrote: >>> Hi Sage, >>> >>> On 04/03/2013 09:58 AM, Sage Weil wrote: >>>> Hi Jim, >>>> >>>> What happens if you change 'osd mon ack timeout = 300' (from the >>>> default of 30)? I suspect part of the problem is that the mons are just >>>> slow enough that the osd's resend the same thing again and it snowballs >>>> into more work for the monitor. >>> >>> Thanks, that helped. My OSDs aren't reconnecting to the mon any more, >>> and the new filesystem started up as expected. >>> >>> Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at >>> about the same time I started testing v0.59. Perhaps before the upgrade >>> I was running right at the edge of that timeout. After the NIC upgrade >>> the PGStat messages come flooding in at startup, and they bunch up >>> enough that working through the backlog pushed me over the timeout cliff? >>> >>> Is there any downside to using a large 'osd mon ack timeout', assuming I >>> run more than one mon? If so, I expect I'll work my way back from >>> 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable >>> for my configuration. >> >> It's a timeout, so the generic downsides to larger timeouts — if the >> monitor actually has gone away it's going to take the OSDs more time >> to connect to somebody else for their updates and reports. This will >> probably be most apparent if they're trying to peer and can't make >> progress until they get acks from the monitors, but the one they're >> connected to has died. >> >> >>> Sorry for the noise about paxos. At least it was useful >>> to help Joao find that debug log message that was more expensive >>> than expected >> >> It's not noise — the reason this timeout is causing problems now is >> that the monitor disk commits are taking so long that it looks like >> they've failed. Which is bad. :/ So thanks for reporting it! > > Sorry, guess I forgot some of the history since this piece at least is > resolved now. I'm surprised if 30-second timeouts are causing issues > without those overloads you were seeing; have you seen this issue > without your high debugging levels and without the bad PG commits (due > to debugging)? I think so, because that's why I started with higher debugging levels. But, as it turns out, I'm just in the process of returning to my testing of next, with all my debugging back to 0. So, I'll try the default timeout of 30 seconds first. If I have trouble starting up a new file system, I'll turn up the timeout and try again, without any extra debugging. Either way, I'll let you know what happens. -- Jim > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Trouble getting a new file system to start, for v0.59 and newer
Hi Sage, On 04/03/2013 09:58 AM, Sage Weil wrote: > Hi Jim, > > What happens if you change 'osd mon ack timeout = 300' (from the > default of 30)? I suspect part of the problem is that the mons are just > slow enough that the osd's resend the same thing again and it snowballs > into more work for the monitor. Thanks, that helped. My OSDs aren't reconnecting to the mon any more, and the new filesystem started up as expected. Hmmm, it occurs to me that I upgraded my mon hosts to 10 GbE NICs at about the same time I started testing v0.59. Perhaps before the upgrade I was running right at the edge of that timeout. After the NIC upgrade the PGStat messages come flooding in at startup, and they bunch up enough that working through the backlog pushed me over the timeout cliff? Is there any downside to using a large 'osd mon ack timeout', assuming I run more than one mon? If so, I expect I'll work my way back from 'osd mon ack timeout = 300' to see how big it needs to be to stay reliable for my configuration. Sorry for the noise about paxos. At least it was useful to help Joao find that debug log message that was more expensive than expected Thanks -- Jim > > sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Trouble getting a new file system to start, for v0.59 and newer
Hi Joao, I alluded in an earlier thread about an issue I've been recently having with starting a new filesystem, which I thought I had tracked into the paxos subsystem. I believe I started having this trouble when I started testing v0.59, and it's still there in v0.60. The basic configuration is 1 mon, 1 mds, 576 OSDs. Some time after starting the mon and the OSDs, PG creation progress stops. I'm finding this in my OSD logs, which I've been thinking is suggestive of why OSDs stop making progress: # cat osd.0.log | grep "acked PGStats" 2013-04-02 15:40:28.944178 7387a700 1 osd.0 2 mon hasn't acked PGStats in 30.034349 seconds, reconnecting elsewhere 2013-04-02 15:42:24.570963 7387a700 1 osd.0 14 mon hasn't acked PGStats in 30.072959 seconds, reconnecting elsewhere 2013-04-02 15:43:22.664921 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.570739 seconds, reconnecting elsewhere 2013-04-02 15:43:52.672461 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007379 seconds, reconnecting elsewhere 2013-04-02 15:44:22.680691 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007095 seconds, reconnecting elsewhere 2013-04-02 15:44:52.688936 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007221 seconds, reconnecting elsewhere 2013-04-02 15:45:22.702066 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.012264 seconds, reconnecting elsewhere 2013-04-02 15:45:52.711134 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007957 seconds, reconnecting elsewhere 2013-04-02 15:46:22.719845 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.007859 seconds, reconnecting elsewhere 2013-04-02 15:46:52.729475 7387a700 1 osd.0 20 mon hasn't acked PGStats in 30.008756 seconds, reconnecting elsewhere I also thought paxos might be implicated because I noticed the auth paxos service appears to stop making progress. The mon is getting lots of auth paxos service messages from all the OSDs: # cat mon.cs31.log | grep paxos | grep "dispatch auth" | head -1 2013-04-02 15:38:59.995457 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) dispatch auth(proto 0 30 bytes epoch 1) v1 from client.? 
172.17.131.29:0/52585 # cat mon.cs31.log | grep paxos | grep "dispatch auth" | wc -l 8802 but at some point the auth paxos service appears to stop making progress: # cat mon.cs31.log | grep paxos | grep -v "dispatch auth" | grep auth 2013-04-02 15:38:42.895202 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:42.905705 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.182158 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.192809 77fbe780 10 mon.cs31@0(probing).paxosservice(auth) restart 2013-04-02 15:38:59.290602 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished 2013-04-02 15:38:59.290608 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.300707 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.308260 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.329153 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:38:59.390988 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.421164 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.427781 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.448001 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) election_finished 2013-04-02 15:38:59.448027 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.453691 77fbe780 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.459446 77fbe780 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:38:59.500315 7328d700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:38:59.569638 7328d700 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:38:59.593526 7328d700 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:38:59.599250 7328d700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending 2013-04-02 15:39:39.390156 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) propose_pending 2013-04-02 15:39:39.435906 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) _active 2013-04-02 15:39:39.456743 72a8c700 10 mon.cs31@0(leader).paxosservice(auth) scrub 2013-04-02 15:39:39.462498 72a8c700 7 mon.cs31@0(leader).paxosservice(auth) _active creating new pending even as it continues to receive auth paxos service messages: # cat mon.cs31.log | grep paxos | grep -A 20 "15:39:39.462498" | grep "dispatch auth" | wc -l 8799 for some time: # cat mon.cs31.log | grep paxos | grep "dispatch auth" | tail -2 2013-04-02 15:46:56.090494 7ff
Re: Trouble with paxos service for large PG count
On 04/02/2013 01:16 PM, Jim Schutt wrote: > On 04/02/2013 12:28 PM, Joao Luis wrote: >> Right. I'll push a patch to bump that sort of output to 30 when I get home. > > Thanks - but FWIW, I don't think it's the root cause of my > issue -- more below OK, I see now that you're talking about the output from list_proposals(). Indeed, when I cause list_proposals() in propose_queued() to only be called at paxos debug 30, the stall in finish_queued_proposal() mostly goes away. Now, with 55392 PGs I'm seeing this: 2013-04-02 14:21:19.401820 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-02 14:21:19.535731 7328d700 5 mon.cs31@0(leader).paxos(paxos active c 1..33) queue_proposal bl 19870402 bytes; ctx = 0x11ecff0 2013-04-02 14:21:19.535917 7328d700 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..33) propose_queued 34 19870402 bytes 2013-04-02 14:21:19.535925 7328d700 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..33) begin for 34 19870402 bytes 2013-04-02 14:21:19.754326 7328d700 10 mon.cs31@0(leader).paxos(paxos updating c 1..33) commit 34 2013-04-02 14:21:19.975747 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal finishing proposal 2013-04-02 14:21:19.975799 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal finish it (proposal = 0x790c0c0) 2013-04-02 14:21:19.975802 7328d700 10 mon.cs31@0(leader).paxos(paxos active c 1..34) finish_queued_proposal proposal took 0.439887 to finish 2013-04-02 14:21:19.975913 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) _active >> >> If you're willing, try reducing the paxos debug level to 0 and let us know >> if those delays in the proposal go away. > > When I started debugging why I couldn't start new filesystems, several > days ago, I wasn't using any debugging. It took me until last Friday > to track the issue into paxos, which is when I turned on its debugging. > With that list_proposals() debug output issue out of the way, I'm still having trouble starting up a new filesystem -- still working to try to isolate the source of the issue Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Trouble with paxos service for large PG count
On 04/02/2013 12:28 PM, Joao Luis wrote: > Right. I'll push a patch to bump that sort of output to 30 when I get home. Thanks - but FWIW, I don't think it's the root cause of my issue -- more below > > If you're willing, try reducing the paxos debug level to 0 and let us know > if those delays in the proposal go away. When I started debugging why I couldn't start new filesystems, several days ago, I wasn't using any debugging. It took me until last Friday to track the issue into paxos, which is when I turned on its debugging. Also, I've added the following bits of debug output: diff --git a/src/mon/Paxos.cc b/src/mon/Paxos.cc index bf3eb92..70a8b5d 100644 --- a/src/mon/Paxos.cc +++ b/src/mon/Paxos.cc @@ -494,6 +494,8 @@ void Paxos::begin(bufferlist& v) MonitorDBStore::Transaction t; t.put(get_name(), last_committed+1, new_value); + utime_t tstart = ceph_clock_now(g_ceph_context); + dout(30) << __func__ << " transaction dump:\n"; JSONFormatter f(true); t.dump(&f); @@ -506,7 +508,14 @@ void Paxos::begin(bufferlist& v) f.flush(*_dout); *_dout << dendl; + dout(10) << __func__ << " - JSON formatting, debug_tx decode took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + + tstart = ceph_clock_now(g_ceph_context); get_store()->apply_transaction(t); + dout(10) << __func__ << " - get_store()->apply_transaction took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + if (mon->get_quorum().size() == 1) { // we're alone, take it easy diff --git a/src/mon/PaxosService.cc b/src/mon/PaxosService.cc index a66c5ec..808ce62 100644 --- a/src/mon/PaxosService.cc +++ b/src/mon/PaxosService.cc @@ -158,6 +158,8 @@ void PaxosService::propose_pending() encode_pending(&t); have_pending = false; + utime_t tstart = ceph_clock_now(g_ceph_context); + dout(30) << __func__ << " transaction dump:\n"; JSONFormatter f(true); t.dump(&f); @@ -166,6 +168,9 @@ void PaxosService::propose_pending() t.encode(bl); + dout(10) << __func__ << " - JSON formatting, encode took " + << ceph_clock_now(g_ceph_context) - tstart << dendl; + // apply to paxos proposing.set(1); paxos->propose_new_value(bl, new C_Committed(this)); which produces this: 2013-04-02 13:04:22.311743 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-02 13:04:22.426810 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending - JSON formatting, encode took 0.000170 2013-04-02 13:04:22.426874 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 19775198 bytes; ctx = 0x11e81f0 2013-04-02 13:04:22.426999 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 19775198 bytes 2013-04-02 13:04:22.427003 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-02 13:07:51.190312 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes 2013-04-02 13:07:51.190860 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) begin - JSON formatting, debug_tx decode took 0.02 2013-04-02 13:07:51.402134 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) begin - get_store()->apply_transaction took 0.211275 2013-04-02 13:07:51.402178 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-02 13:07:51.635113 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-02 13:07:51.635152 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1261340) 2013-04-02 
13:07:51.635159 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 209.208162 to finish 2013-04-02 13:07:51.635193 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active I thought I was timing the same JSON formatting in propose_pending() and begin(), but they take much different amounts of time. I don't understand why yet. Also, I just noticed the elapsed time between these two lines: 2013-04-02 13:04:22.427003 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-02 13:07:51.190312 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes It accounts for nearly the whole of the delay that finish_queued_proposal() is reporting. Why did begin() take so long to start, I wonder ??? -- Jim > > -Joao -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
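One note on why the two timings Jim added behave so differently: in the
ceph logging macros, everything between dout(v) and dendl is only evaluated
when the subsystem is actually gathering at level v. That is consistent
with the level-30 transaction dump in begin() costing almost nothing at
'debug paxos = 10', while the level-10 list_proposals output in
propose_queued() accounts for the multi-minute gap between the
"list_proposals 1 in queue:" line and "begin". Under that reading, bumping
the bulky dump to a higher level is enough to avoid both the formatting and
the log write at ordinary debug levels; a sketch of the level-30 form,
reusing the pattern from the patch above rather than quoting any actual fix:

  // Sketch: at 'debug paxos = 10' none of this runs, so a ~20 MB proposal
  // is neither formatted nor written to the log.
  dout(30) << __func__ << " transaction dump:\n";
  JSONFormatter f(true);
  t.dump(&f);
  f.flush(*_dout);
  *_dout << dendl;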
Re: Trouble with paxos service for large PG count
On 04/02/2013 09:42 AM, Joao Eduardo Luis wrote: > On 04/01/2013 10:14 PM, Jim Schutt wrote: >> Hi, >> >> I've been having trouble starting a new file system >> created using the current next branch (most recently, >> commit 3b5f663f11). >> >> I believe the trouble is related to how long it takes paxos to >> process a pgmap proposal. >> >> For a configuration with 1 mon, 1 mds, and 576 osds, using >> pg_bits = 3 and debug paxos = 10, if I start just the monitor, >> here's what I get when paxos processes the first non-trivial >> pgmap proposal: >> > > Just noticed one other thing. With 'debug paxos = 10', you should have a > whole bunch of output (the proposal's dump) after this: > >> 2013-04-01 14:04:16.330735 77fbe780 10 >> mon.cs31@0(leader).paxosservice(pgmap) propose_pending >> 2013-04-01 14:04:16.358973 77fbe780 5 mon.cs31@0(leader).paxos(paxos >> active c 1..3) queue_proposal bl 4943990 bytes; ctx = 0x11e81f0 >> 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) propose_queued 4 4943990 bytes >> 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) propose_queued list_proposals 1 in queue: > > and before this: > >> 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos >> preparing update c 1..3) begin for 4 4943990 bytes > > for every snippet you sent on your previous email. > > The code responsible for that shouldn't ever have made into master, and > should be to blame for a great deal of the time spent. > > Jim, can you confirm such output is present? Yep, here's what I see: 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 4943990 bytes 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: -- entry #0 proposed queued 0.08 tx dump: { "ops": [ { "op_num": 0, "type": "PUT", "prefix": "pgmap", "key": "full_1", "length": 46, "bl": " : 05 04 28 00 00 00 01 00 00 00 00 00 00 00 00 00 : ..(.\n0010 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 33 33 : ..33\n0020 : 73 3f 9a 99 59 3f 40 e8 59 51 90 f0 a4 07 : s?..Y?@.YQ\n"}, { "op_num": 1, "type": "PUT", "prefix": "pgmap", "key": "full_latest", "length": 8, "bl": " : 01 00 00 00 00 00 00 00 : \n"}, { "op_num": 2, "type": "PUT", "prefix": "pgmap", "key": "2", "length": 4943790, "bl": " : 06 05 a8 6f 4b 00 02 00 00 00 00 00 00 00 18 36 : ...oK..6 [snipped lots more of the bufferlist, then] { "op_num": 3, "type": "PUT", "prefix": "pgmap", "key": "last_committed", "length": 8, "bl": " : 02 00 00 00 00 00 00 00 : \n"}]} 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 4943990 bytes As you suggested, I see that sort of thing for every "propose_queued list_proposals" in my logs -- Jim > > > -Joao > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Trouble with paxos service for large PG count
Hi, I've been having trouble starting a new file system created using the current next branch (most recently, commit 3b5f663f11). I believe the trouble is related to how long it takes paxos to process a pgmap proposal. For a configuration with 1 mon, 1 mds, and 576 osds, using pg_bits = 3 and debug paxos = 10, if I start just the monitor, here's what I get when paxos processes the first non-trivial pgmap proposal: 2013-04-01 14:04:16.330735 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:04:16.358973 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 4943990 bytes; ctx = 0x11e81f0 2013-04-01 14:04:16.359021 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 4943990 bytes 2013-04-01 14:04:16.359025 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:04:28.096284 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 4943990 bytes 2013-04-01 14:04:28.164876 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:04:28.226789 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:04:28.226885 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:04:28.226892 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 11.867914 to finish 2013-04-01 14:04:28.226941 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active When I redo the test with pg_bits = 4 (27696 PGs total) I get this: 2013-04-01 14:25:16.993255 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:25:17.051729 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 9887726 bytes; ctx = 0x11e81f0 2013-04-01 14:25:17.051882 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 9887726 bytes 2013-04-01 14:25:17.051888 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:26:08.185143 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 9887726 bytes 2013-04-01 14:26:08.296351 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:26:08.408900 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:26:08.408928 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:26:08.408931 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 51.357088 to finish 2013-04-01 14:26:08.408974 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active and at pg_bits = 5 (55392 PGs total) I get this: 2013-04-01 14:36:15.385280 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:36:15.500292 77fbe780 5 mon.cs31@0(leader).paxos(paxos active c 1..3) queue_proposal bl 19775198 bytes; ctx = 0x11e81f0 2013-04-01 14:36:15.500433 77fbe780 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued 4 19775198 bytes 2013-04-01 14:36:15.500438 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) propose_queued list_proposals 1 in queue: 2013-04-01 14:39:43.645062 77fbe780 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..3) begin for 4 19775198 bytes 
2013-04-01 14:39:43.875099 77fbe780 10 mon.cs31@0(leader).paxos(paxos updating c 1..3) commit 4 2013-04-01 14:39:44.086097 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finishing proposal 2013-04-01 14:39:44.086128 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal finish it (proposal = 0x1260f50) 2013-04-01 14:39:44.086136 77fbe780 10 mon.cs31@0(leader).paxos(paxos active c 1..4) finish_queued_proposal proposal took 208.585710 to finish 2013-04-01 14:39:44.086156 77fbe780 10 mon.cs31@0(leader).paxosservice(pgmap) _active For this last configuration, after collecting the above I waited a bit, started all the OSDs, waited a bit longer, then collected this: 2013-04-01 14:54:37.364686 7328d700 10 mon.cs31@0(leader).paxosservice(pgmap) propose_pending 2013-04-01 14:54:37.433641 7328d700 5 mon.cs31@0(leader).paxos(paxos active c 1..27) queue_proposal bl 10629660 bytes; ctx = 0x11ece50 2013-04-01 14:54:37.433750 7328d700 5 mon.cs31@0(leader).paxos(paxos preparing update c 1..27) propose_queued 28 10629660 bytes 2013-04-01 14:54:37.433755 7328d700 10 mon.cs31@0(leader).paxos(paxos preparing update c 1..27) propose_queued list_proposals 1 in queue: 2013-04-01 14:55:38.684532 7328
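A quick back-of-the-envelope on the three single-proposal runs above: the
time to finish a proposal grows much faster than the proposal itself,
roughly with the square of its size, which hints that something super-linear
in the proposal size is being done per proposal rather than just the raw
write:

  pg_bits = 3:   4,943,990 bytes  ->  11.9 s
  pg_bits = 4:   9,887,726 bytes  ->  51.4 s   (2x the bytes, ~4.3x the time)
  pg_bits = 5:  19,775,198 bytes  -> 208.6 s   (4x the bytes, ~17.6x the time)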
Re: CephFS Space Accounting and Quotas
On 03/15/2013 05:17 PM, Greg Farnum wrote: > [Putting list back on cc] > > On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote: > >> On 03/15/2013 04:23 PM, Greg Farnum wrote: >>> As I come back and look at these again, I'm not sure what the context >>> for these logs is. Which test did they come from, and which behavior >>> (slow or not slow, etc) did you see? :) -Greg >> >> >> >> They come from a test where I had debug mds = 20 and debug ms = 1 >> on the MDS while writing files from 198 clients. It turns out that >> for some reason I need debug mds = 20 during writing to reproduce >> the slow stat behavior later. >> >> strace.find.dirs.txt.bz2 contains the log of running >> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls >> -lhd {} \; >> >> From that output, I believe that the stat of at least these files is slow: >> zero0.rc11 >> zero0.rc30 >> zero0.rc46 >> zero0.rc8 >> zero0.tc103 >> zero0.tc105 >> zero0.tc106 >> I believe that log shows slow stats on more files, but those are the first >> few. >> >> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the >> find command started, until just after the fifth or sixth slow stat from >> the list above. >> >> I haven't yet tried to find other ways of reproducing this, but so far >> it appears that something happens during the writing of the files that >> ends up causing the condition that results in slow stat commands. >> >> I have the full MDS log from the writing of the files, as well, but it's >> big >> >> Is that what you were after? >> >> Thanks for taking a look! >> >> -- Jim > > I just was coming back to these to see what new information was > available, but I realized we'd discussed several tests and I wasn't > sure what these ones came from. That information is enough, yes. > > If in fact you believe you've only seen this with high-level MDS > debugging, I believe the cause is as I mentioned last time: the MDS > is flapping a bit and so some files get marked as "needsrecover", but > they aren't getting recovered asynchronously, and the first thing > that pokes them into doing a recover is the stat. OK, that makes sense. > That's definitely not the behavior we want and so I'll be poking > around the code a bit and generating bugs, but given that explanation > it's a bit less scary than random slow stats are so it's not such a > high priority. :) Do let me know if you come across it without the > MDS and clients having had connection issues! No problem - thanks! -- Jim > -Greg > > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/11/2013 02:40 PM, Jim Schutt wrote: >>> >> If you want I can attempt to duplicate my memory of the first >>> >> test I reported, writing the files today and doing the strace >>> >> tomorrow (with timestamps, this time). >>> >> >>> >> Also, would it be helpful to write the files with minimal logging, in >>> >> hopes of inducing minimal timing changes, then upping the logging >>> >> for the stat phase? >> > >> > Well that would give us better odds of not introducing failures of >> > any kind during the write phase, and then getting accurate >> > information on what's happening during the stats, so it probably >> > would. Basically I'd like as much logging as possible without >> > changing the states they system goes through. ;) > Hmmm, this is getting more interesting... > > I just did two complete trials where I built a file system, > did two sets of writes with minimal MDS logging, then > turned MDS logging up to 20 with MDS ms logging at 1 for > the stat phase. > > In each trial my strace'd find finished in < 10 seconds, > and there were no slow stat calls (they were taking ~19 ms > each). > > I'm going to do a third trial where I let things rest > overnight, after I write the files. That delay is the > only thing I haven't reproduced from my first trial As you suspected, that didn't make any difference either... That trial didn't reproduce the slow stat behavior. It turns out that the only way I can reliably reproduce the slow stat behavior right now is to turn MDS debugging up to 20. I've got another set of logs with MDS debug ms = 1, which I'll send via an off-list email. I still haven't figured out what made that first test exhibit this behavior, when I was using debug mds = 5. -- Jim > > -- Jim > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/11/2013 10:57 AM, Greg Farnum wrote: > On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote: >> On 03/11/2013 09:48 AM, Greg Farnum wrote: >>> On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: >>>> >>>> For this run, the MDS logging slowed it down enough to cause the >>>> client caps to occasionally go stale. I don't think it's the cause >>>> of the issue, because I was having it before I turned MDS debugging >>>> up. My client caps never go stale at, e.g., debug mds 5. >>> >>> >>> >>> Oh, so this might be behaviorally different than you were seeing before? >>> Drat. >>> >>> You had said before that each newfstatat was taking tens of seconds, >>> whereas in the strace log you sent along most of the individual calls >>> were taking a bit less than 20 milliseconds. Do you have an strace of >>> them individually taking much more than that, or were you just >>> noticing that they took a long time in aggregate? >> >> >> >> When I did the first strace, I didn't turn on timestamps, and I was >> watching it scroll by. I saw several stats in a row take ~30 secs, >> at which point I got bored, and took a look at the strace man page to >> figure out how to get timestamps ;) >> >> Also, another difference is for that test, I was looking at files >> I had written the day before, whereas for the strace log I sent, >> there was only several minutes between writing and the strace of find. >> >> I thought I had eliminated the page cache issue by using fdatasync >> when writing the files. Perhaps the real issue is affected by that >> delay? > > I'm not sure. I can't think of any mechanism by which waiting longer > would increase the time lags, though, so I doubt it. > >>> I suppose if you were going to run it again then just the message >>> logging could also be helpful. That way we could at least check and >>> see the message delays and if the MDS is doing other work in the >>> course of answering a request. >> >> >> >> I can do as many trials as needed to isolate the issue. >> >> What message debugging level is sufficient on the MDS; 1? > Yep, that will capture all incoming and outgoing messages. :) >> >> If you want I can attempt to duplicate my memory of the first >> test I reported, writing the files today and doing the strace >> tomorrow (with timestamps, this time). >> >> Also, would it be helpful to write the files with minimal logging, in >> hopes of inducing minimal timing changes, then upping the logging >> for the stat phase? > > Well that would give us better odds of not introducing failures of > any kind during the write phase, and then getting accurate > information on what's happening during the stats, so it probably > would. Basically I'd like as much logging as possible without > changing the states they system goes through. ;) Hmmm, this is getting more interesting... I just did two complete trials where I built a file system, did two sets of writes with minimal MDS logging, then turned MDS logging up to 20 with MDS ms logging at 1 for the stat phase. In each trial my strace'd find finished in < 10 seconds, and there were no slow stat calls (they were taking ~19 ms each). I'm going to do a third trial where I let things rest overnight, after I write the files. That delay is the only thing I haven't reproduced from my first trial -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
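One small tooling note for the individual-versus-aggregate question above:
strace can also report the elapsed time of each call with -T, alongside the
-tt timestamps, which makes a single slow newfstatat stand out without any
arithmetic. For example, using the same invocation as elsewhere in this
thread:

  strace -tt -T -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;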
Re: Estimating OSD memory requirements (was Re: stuff for v0.56.4)
Hi Bryan, On 03/11/2013 09:10 AM, Bryan K. Wright wrote: > s...@inktank.com said: >> On Thu, 7 Mar 2013, Bryan K. Wright wrote: >> >> s...@inktank.com said: >>> - pg log trimming (probably a conservative subset) to avoid memory bloat >> >> Anything that reduces the size of OSD processes would be appreciated. >> You can probably do this with just >> log max recent = 1000 >> By default it's keeping 100k lines of logs in memory, which can eat a lot of >> ram (but is great when debugging issues). > > Thanks for the tip about "log max recent". I've made this > change, but it doesn't seem to significantly reduce the size of the > OSD processes. > > In general, are there some rules of thumb for estimated the > memory requirements for OSDs? I see processes blow up to 8gb of > resident memory sometimes. If I need to allow for that much memory > per OSD process, I may have to just walk away from ceph. > > Does the memory usage scale with the size of the disks? > I've been trying to run 12 OSDs with 12 2TB disks on a single box. > Would I be better off (memory-usage-wise) if I RAIDed the disks > together and used a single OSD process? > > Thanks for any advice. You might also try tuning "osd client message size cap"; its current default is 500 MiB. During the periods your aggregate applied write load is higher than your OSD aggregate write bandwidth (taking into account replicas), you'll be buffering up this amount of client data. Since it only applies to incoming client messages, to figure total memory use I believe you need to multiply that by the number of replicas you're using. FWIW, for sequential writes from lots of clients, I can maintain full write bandwidth with "osd client message size cap" tuned to 60 MiB. -- Jim > > Bryan > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
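For anyone wanting to try the suggestion above, the cap is an ordinary
ceph.conf option; a sketch of the 60 MiB setting Jim mentions (the value is
specified in bytes, as far as I know):

  [osd]
      ; limit the memory used to buffer in-flight client messages
      ; 60 MiB = 62914560 bytes
      osd client message size cap = 62914560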
Re: CephFS Space Accounting and Quotas
On 03/11/2013 09:48 AM, Greg Farnum wrote: > On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote: >> On 03/08/2013 07:05 PM, Greg Farnum wrote: >>> On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: >>>> On 03/07/2013 08:15 AM, Jim Schutt wrote: >>>>> On 03/06/2013 05:18 PM, Greg Farnum wrote: >>>>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >>>>> >>>> >>>> >>>> >>>> >>>> >>>> [snip] >>>> >>>>>>> Do you want the MDS log at 10 or 20? >>>>>> >>>>>> More is better. ;) >>>>> >>>>> >>>>> >>>>> >>>>> OK, thanks. >>>> >>>> >>>> I've sent some mds logs via private email... >>>> >>>> -- Jim >>> >>> I'm going to need to probe into this a bit more, but on an initial >>> examination I see that most of your stats are actually happening very >>> quickly — it's just that occasionally they take quite a while. >> >> >> >> Interesting... >> >>> Going >>> through the MDS log for one of those, the inode in question is >>> flagged with "needsrecover" from its first appearance in the log — >>> that really shouldn't happen unless a client had write caps on it and >>> the client disappeared. Any ideas? The slowness is being caused by >>> the MDS going out and looking at every object which could be in the >>> file — there are a lot since the file has a listed size of 8GB. >> >> >> >> For this run, the MDS logging slowed it down enough to cause the >> client caps to occasionally go stale. I don't think it's the cause >> of the issue, because I was having it before I turned MDS debugging >> up. My client caps never go stale at, e.g., debug mds 5. > > Oh, so this might be behaviorally different than you were seeing before? Drat. > > You had said before that each newfstatat was taking tens of seconds, > whereas in the strace log you sent along most of the individual calls > were taking a bit less than 20 milliseconds. Do you have an strace of > them individually taking much more than that, or were you just > noticing that they took a long time in aggregate? When I did the first strace, I didn't turn on timestamps, and I was watching it scroll by. I saw several stats in a row take ~30 secs, at which point I got bored, and took a look at the strace man page to figure out how to get timestamps ;) Also, another difference is for that test, I was looking at files I had written the day before, whereas for the strace log I sent, there was only several minutes between writing and the strace of find. I thought I had eliminated the page cache issue by using fdatasync when writing the files. Perhaps the real issue is affected by that delay? > I suppose if you were going to run it again then just the message > logging could also be helpful. That way we could at least check and > see the message delays and if the MDS is doing other work in the > course of answering a request. I can do as many trials as needed to isolate the issue. What message debugging level is sufficient on the MDS; 1? If you want I can attempt to duplicate my memory of the first test I reported, writing the files today and doing the strace tomorrow (with timestamps, this time). Also, would it be helpful to write the files with minimal logging, in hopes of inducing minimal timing changes, then upping the logging for the stat phase? > >> Otherwise, there were no signs of trouble while writing the files. >> >> Can you suggest which kernel client debugging I might enable that >> would help understand what is happening? Also, I have the full >> MDS log from writing the files, if that will help. It's big (~10 GiB). 
>> >>> (There are several other mysteries here that can probably be traced >>> to different varieties of non-optimal and buggy code as well — there >>> is a client which has write caps on the inode in question despite it >>> needing recovery, but the recovery isn't triggered until the stat >>> event occurs, etc). >> >> >> >> OK, thanks for taking a look. Let me know if there is other >> logging I can enable that will be helpful. > > I'm going to want to spend more time with the log I've got, but I'll think > about if there's a different set of data we can gather less disruptively. OK, cool. Just let me know. Thanks -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/08/2013 07:05 PM, Greg Farnum wrote: > On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote: >> On 03/07/2013 08:15 AM, Jim Schutt wrote: >>> On 03/06/2013 05:18 PM, Greg Farnum wrote: >>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >>> >> >> >> >> [snip] >> >>>>> Do you want the MDS log at 10 or 20? >>>> More is better. ;) >>> >>> >>> >>> OK, thanks. >> >> I've sent some mds logs via private email... >> >> -- Jim > I'm going to need to probe into this a bit more, but on an initial > examination I see that most of your stats are actually happening very > quickly — it's just that occasionally they take quite a while. Interesting... > Going > through the MDS log for one of those, the inode in question is > flagged with "needsrecover" from its first appearance in the log — > that really shouldn't happen unless a client had write caps on it and > the client disappeared. Any ideas? The slowness is being caused by > the MDS going out and looking at every object which could be in the > file — there are a lot since the file has a listed size of 8GB. For this run, the MDS logging slowed it down enough to cause the client caps to occasionally go stale. I don't think it's the cause of the issue, because I was having it before I turned MDS debugging up. My client caps never go stale at, e.g., debug mds 5. Otherwise, there were no signs of trouble while writing the files. Can you suggest which kernel client debugging I might enable that would help understand what is happening? Also, I have the full MDS log from writing the files, if that will help. It's big (~10 GiB). > (There are several other mysteries here that can probably be traced > to different varieties of non-optimal and buggy code as well — there > is a client which has write caps on the inode in question despite it > needing recovery, but the recovery isn't triggered until the stat > event occurs, etc). OK, thanks for taking a look. Let me know if there is other logging I can enable that will be helpful. -- Jim > -Greg > > Software Engineer #42 @ http://inktank.com | http://ceph.com > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/07/2013 08:15 AM, Jim Schutt wrote: > On 03/06/2013 05:18 PM, Greg Farnum wrote: >> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: [snip] >>> Do you want the MDS log at 10 or 20? >> More is better. ;) > > OK, thanks. I've sent some mds logs via private email... -- Jim
Re: CephFS Space Accounting and Quotas
On 03/06/2013 05:18 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote: >> When I'm doing these stat operations the file system is otherwise >> idle. > > What's the cluster look like? This is just one active MDS and a couple > hundred clients? 1 mds, 1 mon, 576 osds, 198 cephfs clients. > >> What is happening is that once one of these slow stat operations >> on a file completes, it never happens again for that file, from >> any client. At least, that's the case if I'm not writing to >> the file any more. I haven't checked if appending to the files >> restarts the behavior. > > I assume it'll come back, but if you could verify that'd be good. OK, I'll check it out. > > >> On the client side I'm running with 3.8.2 + the ceph patch queue >> that was merged into 3.9-rc1. >> >> On the server side I'm running recent next branch (commit 0f42eddef5), >> with the tcp receive socket buffer option patches cherry-picked. >> I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num >> rather than pg_bits to set initial number of PGs (same for pgp_num), >> and a patch that lets me run with just one pool that contains both >> data and metadata. I'm testing data distribution uniformity with 512K PGs. >> >> My MDS tunables are all at default settings. >> >>> >>> We'll probably want to get a high-debug log of the MDS during these slow >>> stats as well. >> >> OK. >> >> Do you want me to try to reproduce with a more standard setup? > No, this is fine. > >> Also, I see Sage just pushed a patch to pgid decoding - I expect >> I need that as well, if I'm running the latest client code. > > Yeah, if you've got the commit it references you'll want it. > >> Do you want the MDS log at 10 or 20? > More is better. ;) OK, thanks. -- Jim > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/06/2013 02:39 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote: >> On 03/06/2013 01:21 PM, Greg Farnum wrote: >>>>> Also, this issue of stat on files created on other clients seems >>>>> like it's going to be problematic for many interactions our users >>>>> will have with the files created by their parallel compute jobs - >>>>> any suggestion on how to avoid or fix it? >>>> >>> >>> >>> Brief background: stat is required to provide file size information, >>> and so when you do a stat Ceph needs to find out the actual file >>> size. If the file is currently in use by somebody, that requires >>> gathering up the latest metadata from them. Separately, while Ceph >>> allows a client and the MDS to proceed with a bunch of operations >>> (ie, mknod) without having it go to disk first, it requires anything >>> which is visible to a third party (another client) be durable on disk >>> for consistency reasons. >>> >>> These combine to mean that if you do a stat on a file which a client >>> currently has buffered writes for, that buffer must be flushed out to >>> disk before the stat can return. This is the usual cause of the slow >>> stats you're seeing. You should be able to adjust dirty data >>> thresholds to encourage faster writeouts, do fsyncs once a client is >>> done with a file, etc in order to minimize the likelihood of running >>> into this. Also, I'd have to check but I believe opening a file with >>> LAZY_IO or whatever will weaken those requirements — it's probably >>> not the solution you'd like here but it's an option, and if this >>> turns out to be a serious issue then config options to reduce >>> consistency on certain operations are likely to make their way into >>> the roadmap. :) >> >> >> >> That all makes sense. >> >> But, it turns out the files in question were written yesterday, >> and I did the stat operations today. >> >> So, shouldn't the dirty buffer issue not be in play here? > Probably not. :/ > > >> Is there anything else that might be going on? > In that case it sounds like either there's a slowdown on disk access > that is propagating up the chain very bizarrely, there's a serious > performance issue on the MDS (ie, swapping for everything), or the > clients are still holding onto capabilities for the files in question > and you're running into some issues with the capability revocation > mechanisms. > Can you describe your setup a bit more? What versions are you > running, kernel or userspace clients, etc. What config options are > you setting on the MDS? Assuming you're on something semi-recent, > getting a perfcounter dump from the MDS might be illuminating as > well. When I'm doing these stat operations the file system is otherwise idle. What is happening is that once one of these slow stat operations on a file completes, it never happens again for that file, from any client. At least, that's the case if I'm not writing to the file any more. I haven't checked if appending to the files restarts the behavior. On the client side I'm running with 3.8.2 + the ceph patch queue that was merged into 3.9-rc1. On the server side I'm running recent next branch (commit 0f42eddef5), with the tcp receive socket buffer option patches cherry-picked. I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num rather than pg_bits to set initial number of PGs (same for pgp_num), and a patch that lets me run with just one pool that contains both data and metadata. I'm testing data distribution uniformity with 512K PGs. 
My MDS tunables are all at default settings. > > We'll probably want to get a high-debug log of the MDS during these slow > stats as well. OK. Do you want me to try to reproduce with a more standard setup? Also, I see Sage just pushed a patch to pgid decoding - I expect I need that as well, if I'm running the latest client code. Do you want the MDS log at 10 or 20? -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
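Since the usual cause Greg describes is dirty data that has to be flushed before another client's stat can return, the writer-side mitigation looks roughly like the sketch below: fdatasync each output file before closing it, so later stats don't have to wait on a cap recall plus writeback. The filename is a placeholder:

#!/usr/bin/env python
# Sketch: flush a file's data before closing it so a later stat()
# from another cephfs client doesn't block on cap recall + writeback.
import os

path = "/mnt/cephfs/output/rank-0042.dat"   # hypothetical output file

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"simulation output goes here\n")
    os.fdatasync(fd)   # push dirty pages out before other clients
                       # ask the MDS for this file's size
finally:
    os.close(fd)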
Re: CephFS Space Accounting and Quotas
On 03/06/2013 01:21 PM, Greg Farnum wrote: >> > Also, this issue of stat on files created on other clients seems >> > like it's going to be problematic for many interactions our users >> > will have with the files created by their parallel compute jobs - >> > any suggestion on how to avoid or fix it? >> > > Brief background: stat is required to provide file size information, > and so when you do a stat Ceph needs to find out the actual file > size. If the file is currently in use by somebody, that requires > gathering up the latest metadata from them. Separately, while Ceph > allows a client and the MDS to proceed with a bunch of operations > (ie, mknod) without having it go to disk first, it requires anything > which is visible to a third party (another client) be durable on disk > for consistency reasons. > > These combine to mean that if you do a stat on a file which a client > currently has buffered writes for, that buffer must be flushed out to > disk before the stat can return. This is the usual cause of the slow > stats you're seeing. You should be able to adjust dirty data > thresholds to encourage faster writeouts, do fsyncs once a client is > done with a file, etc in order to minimize the likelihood of running > into this. Also, I'd have to check but I believe opening a file with > LAZY_IO or whatever will weaken those requirements — it's probably > not the solution you'd like here but it's an option, and if this > turns out to be a serious issue then config options to reduce > consistency on certain operations are likely to make their way into > the roadmap. :) That all makes sense. But, it turns out the files in question were written yesterday, and I did the stat operations today. So, shouldn't the dirty buffer issue not be in play here? Is there anything else that might be going on? Thanks -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS Space Accounting and Quotas
On 03/06/2013 12:13 PM, Greg Farnum wrote: > On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote: >> On 03/05/2013 12:33 PM, Sage Weil wrote: >>>>> Running 'du' on each directory would be much faster with Ceph since it >>>>> accounts tracks the subdirectories and shows their total size with an 'ls >>>>> -al'. >>>>> >>>>> Environments with 100k users also tend to be very dynamic with adding and >>>>> removing users all the time, so creating separate filesystems for them >>>>> would >>>>> be very time consuming. >>>>> >>>>> Now, I'm not talking about enforcing soft or hard quotas, I'm just talking >>>>> about knowing how much space uid X and Y consume on the filesystem. >>>> >>> >>> >>> The part I'm most unclear on is what use cases people have where uid X and >>> Y are spread around the file system (not in a single or small set of sub >>> directories) and per-user (not, say, per-project) quotas are still >>> necessary. In most environments, users get their own home directory and >>> everything lives there... >> >> >> >> Hmmm, is there a tool I should be using that will return the space >> used by a directory, and all its descendants? >> >> If it's 'du', that tool is definitely not fast for me. >> >> I'm doing an 'strace du -s ', where has one >> subdirectory which contains ~600 files. I've got ~200 clients >> mounting the file system, and each client wrote 3 files in that >> directory. >> >> I'm doing the 'du' from one of those nodes, and the strace is showing >> me du is doing a 'newfstat' for each file. For each file that was >> written on a different client from where du is running, that 'newfstat' >> takes tens of seconds to return. Which means my 'du' has been running >> for quite some time and hasn't finished yet >> >> I'm hoping there's another tool I'm supposed to be using that I >> don't know about yet. Our use case includes tens of millions >> of files written from thousands of clients, and whatever tool >> we use to do space accounting needs to not walk an entire directory >> tree, checking each file. > > Check out the directory sizes with ls -l or whatever — those numbers are > semantically meaningful! :) That is just exceptionally cool! > > Unfortunately we can't (currently) use those "recursive statistics" > to do proper hard quotas on subdirectories as they're lazily > propagated following client ops, not as part of the updates. (Lazily > in the technical sense — it's actually quite fast in general). But > they'd work fine for soft quotas if somebody wrote the code, or to > block writes on a slight time lag. 'ls -lh ' seems to be just the thing if you already know . And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space. I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find -type d -exec ls -lhd {} \;' Unfortunately, that ended up doing a 'newfstatat' on each file under , evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are? 
Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it? Thanks! -- Jim > -Greg > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
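For the "which directories belong to whom" case, the same recursive statistics that show up in 'ls -l' on a directory are also exposed as virtual extended attributes (ceph.dir.rbytes, ceph.dir.rfiles, and so on). Treat the attribute names as an assumption for this particular vintage of client, but if they are present you can pull per-directory usage without statting a single file:

#!/usr/bin/env python3
# Sketch: read CephFS recursive statistics for a directory without
# walking it.  Assumes the client exposes the ceph.dir.* virtual
# xattrs (names may vary by version); needs Python 3.3+ for os.getxattr.
import os
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "."

for name in ("ceph.dir.rbytes",     # total bytes under this directory
             "ceph.dir.rfiles",     # total files under it
             "ceph.dir.rsubdirs"):  # total subdirectories under it
    try:
        value = os.getxattr(path, name)
        print("%-20s %s" % (name, value.decode()))
    except OSError as e:
        print("%-20s unavailable (%s)" % (name, e))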
Re: CephFS First product release discussion
On 03/05/2013 12:33 PM, Sage Weil wrote: >> > Running 'du' on each directory would be much faster with Ceph since it >> > accounts tracks the subdirectories and shows their total size with an 'ls >> > -al'. >> > >> > Environments with 100k users also tend to be very dynamic with adding and >> > removing users all the time, so creating separate filesystems for them >> > would >> > be very time consuming. >> > >> > Now, I'm not talking about enforcing soft or hard quotas, I'm just talking >> > about knowing how much space uid X and Y consume on the filesystem. > The part I'm most unclear on is what use cases people have where uid X and > Y are spread around the file system (not in a single or small set of sub > directories) and per-user (not, say, per-project) quotas are still > necessary. In most environments, users get their own home directory and > everything lives there... Hmmm, is there a tool I should be using that will return the space used by a directory, and all its descendants? If it's 'du', that tool is definitely not fast for me. I'm doing an 'strace du -s ', where has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file. -- Jim > > sage > > >> > >> > Wido -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CephFS First product release discussion
On 03/05/2013 11:36 AM, Wido den Hollander wrote: > > Now, I'm not talking about enforcing soft or hard quotas, I'm just > talking about knowing how much space uid X and Y consume on the > filesystem. FWIW, we'd like this capability for our HPC systems - we need to be able to disable scheduling of new jobs for users that are consuming too much storage. -- Jim > > Wido

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage, On 02/26/2013 12:36 PM, Sage Weil wrote: > On Tue, 26 Feb 2013, Jim Schutt wrote: >>> I think the right solution is to make an option that will setsockopt on >>> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this, >>> wip-tcp. Do you mind checking to see if this addresses the issue (without >>> manually adjusting things in /proc)? >> >> I'll be happy to test it out... > > That would be great! It's branch wip-tcp, and the setting is 'ms tcp > rcvbuf'. I've verified that I can reproduce the slowdown with the default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf, and 'ms tcp rcvbuf' at 0. I've also verified that I could not reproduce any slowdown when I configure 'ms tcp rcvbuf' to 256 KiB on OSDs. So, that's great news - sorry for the delay in testing. Also, FWIW, I ended up testing with commits cb15e6e0f4 and c346282940 cherry-picked on top of next as of a day or so ago (commit f58601d681), as for some reason wip-tcp wouldn't work for me - ceph-mon was non-responsive in some way I didn't dig into. -- Jim > > Thanks! > sage > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
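For reference, what the option boils down to on each messenger socket is an ordinary SO_RCVBUF setsockopt. Setting SO_RCVBUF explicitly also opts that socket out of the kernel's receive-buffer auto-tuning (the tcp_moderate_rcvbuf behavior mentioned above), which is the point here. A minimal sketch of the same call, not the messenger code itself:

#!/usr/bin/env python
# Sketch: pin a socket's receive buffer the way 'ms tcp rcvbuf' does.
# An explicit SO_RCVBUF disables receive-buffer auto-tuning for this
# socket; the kernel doubles the requested value for bookkeeping.
import socket

rcvbuf = 256 * 1024   # 256 KiB, the value discussed in this thread

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
print("asked for %d, kernel reports %d" %
      (rcvbuf, sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)))
sock.close()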
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage, On 02/20/2013 05:12 PM, Sage Weil wrote: > Hi Jim, > > I'm resurrecting an ancient thread here, but: we've just observed this on > another big cluster and remembered that this hasn't actually been fixed. Sorry for the delayed reply - I missed this in a backlog of unread email... > > I think the right solution is to make an option that will setsockopt on > SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this, > wip-tcp. Do you mind checking to see if this addresses the issue (without > manually adjusting things in /proc)? I'll be happy to test it out... > > And perhaps we should consider making this default to 256KB... That's the value I've been using with my /proc adjustments since I figured out what was going on. My servers use a 10 GbE port for each of the cluster and public networks, with cephfs clients using 1 GbE, and I've not detected any issues resulting from that value. So, it seems like a decent starting point for a default... -- Jim > > Thanks! > sage > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Slow request in XFS
On 01/31/2013 01:00 PM, Sage Weil wrote: > On Thu, 31 Jan 2013, Jim Schutt wrote: >> On 01/31/2013 05:43 AM, Sage Weil wrote: >>> Hi- >>> >>> Can you reproduce this with logs? It looks like there are a few ops that >>> are hanging for a very long time, but there isn't enough information here >>> except to point to osds 610, 612, 615, and 68... >> >> FWIW, I have a small pile of disks with bad sections in >> them - these sections aren't bad enough that writes fail, >> but they are bad enough that throughput drops by a factor >> of ~20. >> >> Do OSDs already collect statistics on, say, op commit >> elapsed time (assuming that's the statistic most directly >> related to slow writes on only some sections of a disk), >> in a way that could be used to diagnose such disks? >> >> If not, is there enough structure already in place that >> it would be easy to add? > > They're tracking it internally, but it gets averaged into the totals > before a user gets to see any per-request latencies. The per-daemon > totals are available via the admin socket 'perf dump' command. Have you > looked at that information yet? Not yet - thanks for the pointer. I'll take a look, and see what happens when I put one of my questionable disks back in. Thanks! -- Jim > > sage > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
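Since the admin socket output is JSON, filtering it down to just the latency counters is easy to script. A sketch only; the socket path follows the default run-dir layout and the counter names vary between releases, so both are assumptions:

#!/usr/bin/env python
# Sketch: print only latency-related perf counters from one OSD's
# admin socket.  Latency counters are typically {avgcount, sum} pairs.
import json
import subprocess

asok = "/var/run/ceph/ceph-osd.0.asok"   # hypothetical daemon socket

raw = subprocess.check_output(["ceph", "--admin-daemon", asok,
                               "perf", "dump"])
counters = json.loads(raw.decode())

for section, values in sorted(counters.items()):
    if not isinstance(values, dict):
        continue
    for name, value in sorted(values.items()):
        if "latency" in name or "_lat" in name:
            print("%s/%s = %s" % (section, name, value))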
Re: Slow request in XFS
On 01/31/2013 05:43 AM, Sage Weil wrote: > Hi- > > Can you reproduce this with logs? It looks like there are a few ops that > are hanging for a very long time, but there isn't enough information here > except to point to osds 610, 612, 615, and 68... FWIW, I have a small pile of disks with bad sections in them - these sections aren't bad enough that writes fail, but they are bad enough that throughput drops by a factor of ~20. Do OSDs already collect statistics on, say, op commit elapsed time (assuming that's the statistic most directly related to slow writes on only some sections of a disk), in a way that could be used to diagnose such disks? If not, is there enough structure already in place that it would be easy to add? -- Jim > > thanks! > s -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Single host VM limit when using RBD
On 01/17/2013 11:36 AM, Dan Mick wrote: > How about RLIMIT_NPROC, or memory exhaustion? Also, check /proc/sys/kernel/pid_max. I've solved a similar pthread_create problem by increasing this to 256k, up from 32k. -- Jim > > On Jan 17, 2013, at 12:47 AM, Matthew Anderson wrote: > >> Hi Audrey, >> >> I did try your suggestion beforehand and it doesn't appear to fix the issue. >> >> [root@KVM04 ~]# cat /proc/sys/kernel/threads-max >> 2549635 >> [root@KVM04 ~]# echo 5549635 > /proc/sys/kernel/threads-max >> [root@KVM04 ~]# virsh start EX03 >> error: Failed to start domain EX03 >> error: internal error Process exited while reading console log output: char >> device redirected to /dev/pts/23 >> Thread::try_create(): pthread_create failed with error 11common/Thread.cc: >> In function 'void Thread::create(size_t)' thread 7f5ec9706960 time >> 2013-01-17 16:46:50.935681 >> common/Thread.cc: 110: FAILED assert(ret == 0) >> ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) >> 1: (()+0x2aaa8f) [0x7f5ec6a89a8f] >> 2: (SafeTimer::init()+0x95) [0x7f5ec6973575] >> 3: (librados::RadosClient::connect()+0x72c) [0x7f5ec69099dc] >> 4: (()+0xa0290) [0x7f5ec97c8290] >> 5: (()+0x879dd) [0x7f5ec97af9dd] >> 6: (()+0x87c1b) [0x7f5ec97afc1b] >> 7: (()+0x87ae1) [0x7f5ec97afae1] >> 8: (()+0x87d50) [0x7f5ec97afd50] >> 9: (()+0xb37b2) [0x7f5ec97db7b2] >> 10: (()+0x1e83eb) [0x7f5ec99103eb] >> 11: (()+0x1ab54a) [0x7f5ec98d354a] >> 12: (main()+0x9da) [0x7f5ec9913a3a] >> 13: (__libc_start_main()+0xfd) [0x7f5ec5755cdd] >> 14: (()+0x710b9) [0x7f5ec97990b9] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. >> terminate called after >> >> >> >> -Original Message- >> From: Andrey Korolyov [mailto:and...@xdel.ru] >> Sent: Thursday, 17 January 2013 4:42 PM >> To: Matthew Anderson >> Cc: ceph-devel@vger.kernel.org >> Subject: Re: Single host VM limit when using RBD >> >> Hi Matthew, >> >> Seems to a low value in /proc/sys/kernel/threads-max value. >> >> On Thu, Jan 17, 2013 at 12:37 PM, Matthew Anderson >> wrote: >>> I've run into a limit on the maximum number of RBD backed VM's that I'm >>> able to run on a single host. I have 20 VM's (21 RBD volumes open) running >>> on a single host and when booting the 21st machine I get the below error >>> from libvirt/QEMU. I'm able to shut down a VM and start another in it's >>> place so there seems to be a hard limit on the amount of volumes I'm able >>> to have open. I did some googling and the error 11 from pthread_create >>> seems to mean 'resource unavailable' so I'm probably running into a thread >>> limit of some sort. I did try increasing the max_thread kernel option but >>> nothing changed. I moved a few VM's to a different empty host and they >>> start with no issues at all. >>> >>> This machine has 4 OSD's running on it in addition to the 20 VM's. Kernel >>> 3.7.1. Ceph 0.56.1 and QEMU 1.3.0. There is currently 65GB of 96GB free ram >>> and no swap. >>> >>> Can anyone suggest where the limit might be or anything I can do to narrow >>> down the problem? 
>>> >>> Thanks >>> -Matt >>> - >>> >>> Error starting domain: internal error Process exited while reading >>> console log output: char device redirected to /dev/pts/23 >>> Thread::try_create(): pthread_create failed with error >>> 11common/Thread.cc: In function 'void Thread::create(size_t)' thread >>> 7f4eb5a65960 time 2013-01-17 02:32:58.096437 >>> common/Thread.cc: 110: FAILED assert(ret == 0) ceph version 0.56.1 >>> (e4a541624df62ef353e754391cbbb707f54b16f7) >>> 1: (()+0x2aaa8f) [0x7f4eb2de8a8f] >>> 2: (SafeTimer::init()+0x95) [0x7f4eb2cd2575] >>> 3: (librados::RadosClient::connect()+0x72c) [0x7f4eb2c689dc] >>> 4: (()+0xa0290) [0x7f4eb5b27290] >>> 5: (()+0x879dd) [0x7f4eb5b0e9dd] >>> 6: (()+0x87c1b) [0x7f4eb5b0ec1b] >>> 7: (()+0x87ae1) [0x7f4eb5b0eae1] >>> 8: (()+0x87d50) [0x7f4eb5b0ed50] >>> 9: (()+0xb37b2) [0x7f4eb5b3a7b2] >>> 10: (()+0x1e83eb) [0x7f4eb5c6f3eb] >>> 11: (()+0x1ab54a) [0x7f4eb5c3254a] >>> 12: (main()+0x9da) [0x7f4eb5c72a3a] >>> 13: (__libc_start_main()+0xfd) [0x7f4eb1ab4cdd] >>> 14: (()+0x710b9) [0x7f4eb5af80b9] >>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >>> interpret this. >>> terminate called after >>> >>> Traceback (most recent call last): >>> File "/usr/share/virt-manager/virtManager/asyncjob.py", line 96, in >>> cb_wrapper >>>callback(asyncjob, *args, **kwargs) >>> File "/usr/share/virt-manager/virtManager/asyncjob.py", line 117, in tmpcb >>>callback(*args, **kwargs) >>> File "/usr/share/virt-manager/virtManager/domain.py", line 1090, in startup >>>self._backend.create() >>> File "/usr/lib/python2.7/dist-packages/libvirt.py", line 620, in create >>>if ret == -1: raise libvirtError ('virDomainCreate() failed', >>> dom=self) >>> libvirtError: internal error Process exited while reading console log >>> output: char device redirected to /dev/pts/23 >>> Thread::try_
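Several different limits can make pthread_create() fail with EAGAIN, so it helps to look at all of them in one go: the kernel-wide thread and pid limits plus the per-process RLIMIT_NPROC that Dan mentions. A small sketch (standard Linux locations; adjust as needed):

#!/usr/bin/env python
# Sketch: print the limits that commonly make pthread_create() return
# EAGAIN for a thread-hungry process such as qemu with many rbd volumes.
import resource

def read_int(path):
    with open(path) as f:
        return int(f.read())

print("kernel.threads-max : %d" % read_int("/proc/sys/kernel/threads-max"))
print("kernel.pid_max     : %d" % read_int("/proc/sys/kernel/pid_max"))

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC       : soft=%s hard=%s" % (soft, hard))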
Re: [PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Hi Sage, On 01/15/2013 07:55 PM, Sage Weil wrote: > Hi Jim- > > I just realized this didn't make it into our tree. It's now in testing, > and will get merged in the next window. D'oh! That's great news - thanks for the update. -- Jim > > sage
Re: [EXTERNAL] Re: OSDMonitor: don't allow creation of pools with > 65535 pgs
On 12/14/2012 09:59 AM, Joao Eduardo Luis wrote: > On 12/14/2012 03:41 PM, Jim Schutt wrote: >> Hi, >> >> I'm looking at commit e3ed28eb2 in the next branch, >> and I have a question. >> >> Shouldn't the limit be pg_num > 65536, because >> PGs are numbered 0 thru pg_num-1? >> >> If not, what am I missing? >> >> FWIW, up through yesterday I've been using the next branch and this: >> >>ceph osd pool set data pg_num 65536 --allow-experimental-feature >>ceph osd pool set metadata pg_num 65536 --allow-experimental-feature >>ceph osd pool set data pgp_num 65536 --allow-experimental-feature >>ceph osd pool set metadata pgp_num 65536 --allow-experimental-feature >> >> using cephfs clients, and have seen no trouble with >> misdirected ops, etc. >> >> -- Jim >> > > Hi Jim, > > To the best of my knowledge, one of the things that triggered the > required hard cap on the number of pgs was that the kernel side is > still limited to 16 bits, despite that on the osd side this is no > longer true. > I believe the culprit is the "ps" member of struct ceph_pg, which stores what is eventually used as the PG id as __le16. > I'm not familiar with what's going on on the kernel side, but if > there's a slight chance that we are indeed keeping the 'pg_num' on a > 16-bit variable, then that value must be capped to 65535. If that's > not the case and we're just limited by the pg's number/id, then I > guess that accepting up to 65636 would be fine (0..65535). struct ceph_pg_pool in the kernel stores pg_num and friends as __le32. > > Just in case I'll look into this and further implications. Cool, thanks. -- Jim > > Thanks. > > -Joao > > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
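Spelling out the off-by-one: a 16-bit ps field can hold PG ids 0 through 65535, so the largest pg_num that fits without wrapping is 65536, not 65535. Trivial, but here it is as a check:

# The kernel-side placement group id is a 16-bit field (__le16 ps),
# so the largest id it can carry is 2**16 - 1.
max_pg_id = 2**16 - 1          # 65535
max_pg_num = max_pg_id + 1     # ids run 0..65535, so 65536 PGs fit

assert max_pg_num == 65536
print("largest pg id %d, largest pg_num that fits %d" % (max_pg_id, max_pg_num))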
OSDMonitor: don't allow creation of pools with > 65535 pgs
Hi, I'm looking at commit e3ed28eb2 in the next branch, and I have a question. Shouldn't the limit be pg_num > 65536, because PGs are numbered 0 thru pg_num-1? If not, what am I missing? FWIW, up through yesterday I've been using the next branch and this: ceph osd pool set data pg_num 65536 --allow-experimental-feature ceph osd pool set metadata pg_num 65536 --allow-experimental-feature ceph osd pool set data pgp_num 65536 --allow-experimental-feature ceph osd pool set metadata pgp_num 65536 --allow-experimental-feature using cephfs clients, and have seen no trouble with misdirected ops, etc. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/11/2012 06:37 PM, Liu Bo wrote: > On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote: >> On 12/09/2012 07:04 AM, Liu Bo wrote: >>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >>> Hi Jim, >>> >>> Could you please apply the following patch to test if it works? >> >> Hi, >> >> So far, with your patch applied I've been unable to reproduce >> the recursive deadlock. Thanks a lot for this patch! >> This issue has been troubling me for a while. > > Hi Jim, > > Good news for us :) > >> >> I've been trying to learn more about btrfs internals - >> if you have the time to answer a couple questions about >> your patch, I'd really appreciate it. > > See below. > >> >>> >>> (It's against 3.7-rc8.) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 3d3e2c1..100289b 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root >>> *root, int data) >>> >>> if (data) >>> flags = BTRFS_BLOCK_GROUP_DATA; >>> - else if (root == root->fs_info->chunk_root) >>> + else if (root == root->fs_info->chunk_root || >>> +root == root->fs_info->dev_root) >>> flags = BTRFS_BLOCK_GROUP_SYSTEM; >>> else >>> flags = BTRFS_BLOCK_GROUP_METADATA; >>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct >>> btrfs_root *root, u64 type) >>> num_dev = 1;/* DUP or single */ >>> >>> /* metadata for updaing devices and chunk tree */ >>> + num_dev = num_dev << 1 >> >> AFAICS this is doubling the size of the reserve, which >> reduces the chance of a recursive do_chunk_alloc(), right? >> > > Not like that, we hit the deadlock because updating device tree also > uses METADATA chunk, which may be called when we're actually allocating > a METADATA chunk, so the patch I sent you makes updating device tree > use SYSTEM chunk, which we'll have some code to check if it is enough > before allocating a chunk(if not, we'll allocate a SYSTEM chunk first). > > Here I double the size just because the worst case of allocating a > DATA/METADATA chunk -may- results in > > 1)adding a SYSTEM chunk + > 2)adding dev extent per chunk stripe + > 3)updating chunk stripes's bytes_used > >>> return btrfs_calc_trans_metadata_size(root, num_dev + 1); >> >> btrfs_calc_trans_metadata_size(root, num_items) multiplies its >> num_items argument by another factor of three - do you know if >> there is there some rationale behind that number, or is it >> perhaps also an empirically determined factor? > > The height of Btrfs's metadata btree is at most 8, > leaf is on 0 level while node is at most on 7 level. > > Each btree update may results in COWing a node and its sibling nodes, > where the factor of tree comes from > >> >> What I'm wondering about is that if the size of the reserve is >> empirically determined, will it need to be increased again >> later when machines are more capable, and can handle a higher >> load? >> >> Do you think it's feasible to modify the locking for >> do_chunk_alloc to allow it to recurse without deadlock? > > Well, it could be, but IMO it'll bring us complexity, so worse > maintainance. > > Any questions? Feel free to ask. Your response was very helpful. Thanks a lot! -- Jim > > thanks, > liubo > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/09/2012 07:04 AM, Liu Bo wrote: > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >> > Hi, >> > >> > I'm hitting a btrfs locking issue with 3.7.0-rc8. >> > >> > The btrfs filesystem in question is backing a Ceph OSD >> > under a heavy write load from many cephfs clients. >> > >> > I reported this issue a while ago: >> > http://www.spinics.net/lists/linux-btrfs/msg19370.html >> > when I was testing what I thought might be part of the >> > 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. >> > >> > I spent some time attempting to bisect the btrfs patch queue >> > just before it was merged for 3.7, but got nowhere due to >> > false negatives. >> > >> > I've just been able to get back to testing 3.7-rc, and found >> > that I can still trigger the issue. > Hi Jim, > > Could you please apply the following patch to test if it works? Hi, So far, with your patch applied I've been unable to reproduce the recursive deadlock. Thanks a lot for this patch! This issue has been troubling me for a while. I've been trying to learn more about btrfs internals - if you have the time to answer a couple questions about your patch, I'd really appreciate it. > > (It's against 3.7-rc8.) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d3e2c1..100289b 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root > *root, int data) > > if (data) > flags = BTRFS_BLOCK_GROUP_DATA; > - else if (root == root->fs_info->chunk_root) > + else if (root == root->fs_info->chunk_root || > + root == root->fs_info->dev_root) > flags = BTRFS_BLOCK_GROUP_SYSTEM; > else > flags = BTRFS_BLOCK_GROUP_METADATA; > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct > btrfs_root *root, u64 type) > num_dev = 1;/* DUP or single */ > > /* metadata for updaing devices and chunk tree */ > + num_dev = num_dev << 1 AFAICS this is doubling the size of the reserve, which reduces the chance of a recursive do_chunk_alloc(), right? > return btrfs_calc_trans_metadata_size(root, num_dev + 1); btrfs_calc_trans_metadata_size(root, num_items) multiplies its num_items argument by another factor of three - do you know if there is there some rationale behind that number, or is it perhaps also an empirically determined factor? What I'm wondering about is that if the size of the reserve is empirically determined, will it need to be increased again later when machines are more capable, and can handle a higher load? Do you think it's feasible to modify the locking for do_chunk_alloc to allow it to recurse without deadlock? Thanks -- Jim > } > > @@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct > btrfs_fs_info *fs_info) > > fs_info->extent_root->block_rsv = &fs_info->global_block_rsv; > fs_info->csum_root->block_rsv = &fs_info->global_block_rsv; > - fs_info->dev_root->block_rsv = &fs_info->global_block_rsv; > + fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv; > fs_info->tree_root->block_rsv = &fs_info->global_block_rsv; > fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv; > > > thanks, > liubo > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.7.0-rc8 btrfs locking issue
On 12/05/2012 09:07 AM, Jim Schutt wrote: Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives. I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat: [ 1184.201331] = [ 1184.206716] [ INFO: possible recursive locking detected ] [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted [ 1184.217156] - [ 1184.222544] ceph-osd/42177 is trying to acquire lock: [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.237270] [ 1184.237270] but task is already holding lock: [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.252786] [ 1184.252786] other info that might help us debug this: [ 1184.259303] Possible unsafe locking scenario: [ 1184.259303] [ 1184.265220]CPU0 [ 1184.267680] [ 1184.270133] lock(&fs_info->chunk_mutex); [ 1184.274276] lock(&fs_info->chunk_mutex); [ 1184.278417] [ 1184.278417] *** DEADLOCK *** To try to debug this, I applied this patch: -cut here - diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c72ead8..86da319 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1470,6 +1470,10 @@ struct btrfs_fs_info { int backup_root_index; int num_tolerated_disk_barrier_failures; + + /* protect against recursive do_chunk_alloc() */ + const void *chunk_alloc_task; + const struct btrfs_space_info *prev_sinfo; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7cda519..d61e93d 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2087,6 +2087,9 @@ int open_ctree(struct super_block *sb, fs_info->btree_inode->i_ino = BTRFS_BTREE_INODE_OBJECTID; set_nlink(fs_info->btree_inode, 1); + + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; /* * we set the i_size on the btree inode to the max possible int. 
* the real end of the address space is determined by all of diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d3e2c1..1e0e421 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3582,6 +3582,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, } BUG_ON(!space_info); /* Logic error */ + if (mutex_is_locked(&fs_info->chunk_mutex) && + unlikely(fs_info->chunk_alloc_task == current)) { + WARN_ONCE(1, "do_chunk_alloc() about to recursively acquire " + "fs_info->chunk_mutex: impending deadlock avoided!\n" + "outer call space_info = %p flags %#llx\n" + "nested call space_info = %p flags %#llx\n", + fs_info->prev_sinfo, fs_info->prev_sinfo->flags, + space_info, space_info->flags); + return -EDEADLK; + } again: spin_lock(&space_info->lock); if (force < space_info->force_alloc) @@ -3603,6 +3613,8 @@ again: spin_unlock(&space_info->lock); mutex_lock(&fs_info->chunk_mutex); + fs_info->chunk_alloc_task = current; + fs_info->prev_sinfo = space_info; /* * The chunk_mutex is held throughout the entirety of a chunk @@ -3655,6 +3667,8 @@ again: space_info->chunk_alloc = 0; spin_unlock(&space_info->lock); out: + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; mutex_unlock(&fs_info->chunk_mutex); return ret; } -cut here - After a few trials, I got this during a test: [ 1614.213516] [ cut here ] [ 1614.218185] WARNING: at fs/btrfs/extent-tree.c:3592 do_chunk_alloc+0xec/0x3f0 [btrfs]() [ 1614.226228] Hardware name: X8DTH-i/6/iF/6F [ 1614.230484] do_chunk_alloc() about to recursively acquire fs_info->chunk_mutex: impending deadlock avoided! [ 1614.230484] outer call space_info = 880c214dac00 flags 0x1 [ 1614.230484] nested call space_info = 880c214dbc00 flags 0x4 [ 1614.251968] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macv
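The idea in the diagnostic patch above is nothing btrfs-specific: remember which task currently holds a non-recursive lock, and turn an impending self-deadlock into an error return instead of blocking. A toy version of the same guard, just to show the shape of it:

# Toy version of the guard in the diagnostic patch: record the lock
# owner and fail fast (the kernel patch returns -EDEADLK) when the
# same thread tries to take the lock again.
import threading

class GuardedLock(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self):
        me = threading.current_thread()
        if self._owner is me:
            raise RuntimeError("recursive acquire would deadlock")
        self._lock.acquire()
        self._owner = me

    def release(self):
        self._owner = None
        self._lock.release()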
3.7.0-rc8 btrfs locking issue
Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives. I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat: [ 1184.201331] = [ 1184.206716] [ INFO: possible recursive locking detected ] [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted [ 1184.217156] - [ 1184.222544] ceph-osd/42177 is trying to acquire lock: [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.237270] [ 1184.237270] but task is already holding lock: [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.252786] [ 1184.252786] other info that might help us debug this: [ 1184.259303] Possible unsafe locking scenario: [ 1184.259303] [ 1184.265220]CPU0 [ 1184.267680] [ 1184.270133] lock(&fs_info->chunk_mutex); [ 1184.274276] lock(&fs_info->chunk_mutex); [ 1184.278417] [ 1184.278417] *** DEADLOCK *** [ 1184.278417] [ 1184.284325] May be due to missing lock nesting notation [ 1184.284325] [ 1184.291099] 4 locks held by ceph-osd/42177: [ 1184.295277] #0: (sb_writers#7){.+.+.+}, at: [] btrfs_file_aio_write+0x64/0x320 [btrfs] [ 1184.305103] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] btrfs_file_aio_write+0x6e/0x320 [btrfs] [ 1184.316108] #2: (sb_internal){.+.+..}, at: [] start_transaction+0x1c4/0x450 [btrfs] [ 1184.325632] #3: (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.335761] [ 1184.335761] stack backtrace: [ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 #438 [ 1184.347508] Call Trace: [ 1184.349962] [] ? vprintk_emit+0x42a/0x4c0 [ 1184.355619] [] print_deadlock_bug+0xe9/0x100 [ 1184.361556] [] validate_chain+0x596/0x750 [ 1184.367222] [] __lock_acquire+0x449/0x510 [ 1184.372894] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.379417] [] lock_acquire+0xc9/0x120 [ 1184.384855] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.391377] [] ? __lock_acquire+0x449/0x510 [ 1184.397204] [] __mutex_lock_common+0x5d/0x3a0 [ 1184.403221] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.409762] [] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.416323] [] ? do_chunk_alloc+0x179/0x340 [btrfs] [ 1184.422849] [] mutex_lock_nested+0x4a/0x60 [ 1184.428640] [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.435018] [] find_free_extent+0xa3c/0xb70 [btrfs] [ 1184.441555] [] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.449051] [] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1184.455843] [] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1184.462828] [] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1184.469471] [] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1184.476962] [] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1184.483426] [] btrfs_search_slot+0x360/0x730 [btrfs] [ 1184.490067] [] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1184.497199] [] ? kmem_cache_alloc+0xd3/0x170 [ 1184.503151] [] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1184.510125] [] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1184.516928] [] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1184.523374] [] ? 
check_system_chunk+0x71/0x130 [btrfs] [ 1184.530179] [] do_chunk_alloc+0x2ce/0x340 [btrfs] [ 1184.536555] [] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.544070] [] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1184.551574] [] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1184.558987] [] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1184.566042] [] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1184.572944] [] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1184.579984] [] do_sync_readv_writev+0x94/0xe0 [ 1184.585985] [] do_readv_writev+0xe3/0x1e0 [ 1184.591645] [] ? fget_light+0x122/0x170 [ 1184.597131] [] vfs_writev+0x46/0x60 [ 1184.602266] [] sys_writev+0x5f/0xc0 [ 1184.607398] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1184.613832] [] system_call_fastpath+0x16/0x1b and then a little later: [ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds. [ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1319.478227] ceph-osdD 0246 0 42177 1 0x [ 1319.485330] 880402c4d358 0046 7fffeccc49d0 880402c4dfd8 [ 1319.492852] 880402c4c010 880402c4c000 880402c4c000 880402c4c000 [ 1319.500449] 880402c4dfd8 880402c4c000 8
[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Add libceph support for a new CRUSH tunable recently added to Ceph servers. Consider the CRUSH rule step chooseleaf firstn 0 type This rule means that replicas will be chosen in a manner such that each chosen leaf's branch will contain a unique instance of . When an object is re-replicated after a leaf failure, if the CRUSH map uses a chooseleaf rule the remapped replica ends up under the bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves but one fail under a particular bucket, that remaining leaf holds all the data from its failed peers. This behavior also limits the number of peers that can participate in the re-replication of the data held by the failed leaf, which increases the time required to re-replicate after a failure. For a chooseleaf CRUSH rule, the tree descent has two steps: call them the inner and outer descents. If the tree descent down to is the outer descent, and the descent from down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, so only the inner descent is retried. In order to disperse re-replicated data as widely as possible across a storage cluster after a failure, we want to retry the outer descent. So, fix up crush_choose() to allow the inner descent to return immediately on choosing a failed leaf. Wire this up as a new CRUSH tunable. Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare. Signed-off-by: Jim Schutt --- include/linux/ceph/ceph_features.h |4 +++- include/linux/crush/crush.h|2 ++ net/ceph/crush/mapper.c| 13 ++--- net/ceph/osdmap.c |6 ++ 4 files changed, 21 insertions(+), 4 deletions(-) diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h index dad579b..61e5af4 100644 --- a/include/linux/ceph/ceph_features.h +++ b/include/linux/ceph/ceph_features.h @@ -14,13 +14,15 @@ #define CEPH_FEATURE_DIRLAYOUTHASH (1<<7) /* bits 8-17 defined by user-space; not supported yet here */ #define CEPH_FEATURE_CRUSH_TUNABLES (1<<18) +#define CEPH_FEATURE_CRUSH_TUNABLES2 (1<<25) /* * Features supported. 
*/ #define CEPH_FEATURES_SUPPORTED_DEFAULT \ (CEPH_FEATURE_NOSRCADDR |\ -CEPH_FEATURE_CRUSH_TUNABLES) +CEPH_FEATURE_CRUSH_TUNABLES | \ +CEPH_FEATURE_CRUSH_TUNABLES2) #define CEPH_FEATURES_REQUIRED_DEFAULT \ (CEPH_FEATURE_NOSRCADDR) diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h index 25baa28..6a1101f 100644 --- a/include/linux/crush/crush.h +++ b/include/linux/crush/crush.h @@ -162,6 +162,8 @@ struct crush_map { __u32 choose_local_fallback_tries; /* choose attempts before giving up */ __u32 choose_total_tries; + /* attempt chooseleaf inner descent once; on failure retry outer descent */ + __u32 chooseleaf_descend_once; }; diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c index 35fce75..96c8a58 100644 --- a/net/ceph/crush/mapper.c +++ b/net/ceph/crush/mapper.c @@ -287,6 +287,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in * @outpos: our position in that vector * @firstn: true if choosing "first n" items, false if choosing "indep" * @recurse_to_leaf: true if we want one device under each item of given type + * @descend_once: true if we should only try one descent before giving up * @out2: second output vector for leaf items (if @recurse_to_leaf) */ static int crush_choose(const struct crush_map *map, @@ -295,7 +296,7 @@ static int crush_choose(const struct crush_map *map, int x, int numrep, int type, int *out, int outpos, int firstn, int recurse_to_leaf, - int *out2) + int descend_once, int *out2) { int rep; unsigned int ftotal, flocal; @@ -399,6 +400,7 @@ static int crush_choose(const struct crush_map *map, x, outpos+1, 0, out2, outpos, firstn, 0, + map->chooseleaf_descend_once, NULL) <= outpos) /* didn't get leaf */ reject = 1; @@ -422,7 +424,10 @@ reject:
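Stripped of the CRUSH details, the behavioral change in the patch is: when the leaf chosen under a bucket turns out to be failed, give up on that bucket and re-run the outer descent, rather than retrying only within the failed leaf's bucket. The toy model below uses random choice in place of the real hash-based descent, so it illustrates only the retry structure, not actual CRUSH placement:

# Toy model of the chooseleaf retry change.  'buckets' maps a host to
# its devices, 'failed' is the set of down devices; random.choice
# stands in for the deterministic CRUSH hash.
import random

def choose_leaf(buckets, failed, total_tries=50, descend_once=True):
    hosts = list(buckets)
    for _ in range(total_tries):
        host = random.choice(hosts)          # outer descent: pick a host
        leaf = random.choice(buckets[host])  # inner descent: pick a device
        if leaf not in failed:
            return host, leaf
        if not descend_once:
            # legacy behavior: keep retrying inside the same host, so
            # data from a failed leaf piles onto that leaf's peers
            for _ in range(total_tries):
                leaf = random.choice(buckets[host])
                if leaf not in failed:
                    return host, leaf
        # with the tunable set, a failed inner descent falls through
        # here and the outer descent is retried from the top
    return None

buckets = {"host0": ["osd0", "osd1"], "host1": ["osd2", "osd3"]}
print(choose_leaf(buckets, failed={"osd2"}))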
Re: chooseleaf_descend_once
On 11/28/2012 09:11 AM, Caleb Miles wrote: Hey Jim, Running the third test with tunable chooseleaf_descend_once 0 with no devices marked out yields the following result (999.827397, 0.48667056652539997) so chi squared value is 999 with a corresponding p value of 0.487 so that the placement distribution seems to be drawn from the uniform distribution as desired. Great, thanks for doing that extra test. Plus, I see that Sage has merged it. Cool. Thanks -- Jim Caleb On Tue, Nov 27, 2012 at 1:28 PM, Jim Schutt wrote: Hi Caleb, On 11/26/2012 07:28 PM, caleb miles wrote: Hello all, Here's what I've done to try and validate the new chooseleaf_descend_once tunable first described in commit f1a53c5e80a48557e63db9c52b83f3**9391bc69b8 in the wip-crush branch of ceph.git. First I set the new tunable to it's legacy value, disabled, tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 0 The map contains one thousand osd devices contained in one hundred hosts with the following data rule rule data { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } I then simulate the creation of one million placement groups using the crushtool $ crushtool -i hundred.map --test --min-x 0 --max-x 99 --num-rep 3 --output-csv --weight 120 0.0 --weight 121 0.0 --weight 122 0.0 --weight 123 0.0 --weight 124 0.0 --weight 125 0.0 --weight 125 0.0 --weight 150 0.0 --weight 151 0.0 --weight 152 0.0 --weight 153 0.0 --weight 154 0.0 --weight 155 0.0 --weight 156 0.0 --weight 180 0.0 --weight 181 0.0 --weight 182 0.0 --weight 183 0.0 --weight 184 0.0 --weight 185 0.0 --weight 186 0.0 with the majority of devices in three hosts marked out. Then in (I)Python import scipy.stats as s import matplotlib.mlab as m data = m.csv2rec("data-device_**utilization.csv") s.chisquare(data['number_of_**objects_stored'], data['number_of_objects_* *expected']) which will output (122939.76474477499, 0.0) so that the chi squared value is 122939.795 and the p value is, rounded to, 0.0 and the observed placement distribution statistically differs from a uniform distribution. Repeating with the new tunable set to tunable chooseleaf_descend_once 1 I obtain the following result (998.97643161876761, 0.32151775131589833) so that the chi squared value is 998.976 and the p value is 0.32 and the observed placement distribution is statistically identical to the uniform distribution at the five and ten percent confidence levels, higher as well of course. The p value is the probability of obtaining a chi squared value more extreme than the statistic observed. Basically, from my rudimentary understanding of probability theory, that if you obtain a p value p< P then reject the null hypothesis, in our case that the observed placement distribution is drawn from the uniform distribution, at the P confidence level. Cool. Thanks for doing these tests. Is there any point to doing a third test, with tunable chooseleaf_descend_once 0 and no devices marked out, but in all other respects the same as the above two tests? I would expect the results for that case and the last case you tested to be essentially identical in the degree of uniformity, but is it worth verifying? 
-- Jim
Re: chooseleaf_descend_once
Hi Caleb, On 11/26/2012 07:28 PM, caleb miles wrote: Hello all, Here's what I've done to try and validate the new chooseleaf_descend_once tunable first described in commit f1a53c5e80a48557e63db9c52b83f39391bc69b8 in the wip-crush branch of ceph.git. First I set the new tunable to it's legacy value, disabled, tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 0 The map contains one thousand osd devices contained in one hundred hosts with the following data rule rule data { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } I then simulate the creation of one million placement groups using the crushtool $ crushtool -i hundred.map --test --min-x 0 --max-x 99 --num-rep 3 --output-csv --weight 120 0.0 --weight 121 0.0 --weight 122 0.0 --weight 123 0.0 --weight 124 0.0 --weight 125 0.0 --weight 125 0.0 --weight 150 0.0 --weight 151 0.0 --weight 152 0.0 --weight 153 0.0 --weight 154 0.0 --weight 155 0.0 --weight 156 0.0 --weight 180 0.0 --weight 181 0.0 --weight 182 0.0 --weight 183 0.0 --weight 184 0.0 --weight 185 0.0 --weight 186 0.0 with the majority of devices in three hosts marked out. Then in (I)Python import scipy.stats as s import matplotlib.mlab as m data = m.csv2rec("data-device_utilization.csv") s.chisquare(data['number_of_objects_stored'], data['number_of_objects_expected']) which will output (122939.76474477499, 0.0) so that the chi squared value is 122939.795 and the p value is, rounded to, 0.0 and the observed placement distribution statistically differs from a uniform distribution. Repeating with the new tunable set to tunable chooseleaf_descend_once 1 I obtain the following result (998.97643161876761, 0.32151775131589833) so that the chi squared value is 998.976 and the p value is 0.32 and the observed placement distribution is statistically identical to the uniform distribution at the five and ten percent confidence levels, higher as well of course. The p value is the probability of obtaining a chi squared value more extreme than the statistic observed. Basically, from my rudimentary understanding of probability theory, that if you obtain a p value p < P then reject the null hypothesis, in our case that the observed placement distribution is drawn from the uniform distribution, at the P confidence level. Cool. Thanks for doing these tests. Is there any point to doing a third test, with tunable chooseleaf_descend_once 0 and no devices marked out, but in all other respects the same as the above two tests? I would expect the results for that case and the last case you tested to be essentially identical in the degree of uniformity, but is it worth verifying? -- Jim Caleb -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
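For anyone who wants to rerun the check, here is the snippet above glued into one small script, using the same column names as the crushtool CSV quoted in this thread (scipy and matplotlib are the only dependencies):

#!/usr/bin/env python
# Compare observed per-device object counts from crushtool's CSV
# output against the expected counts with a chi-squared test, as
# described above.
import sys

import matplotlib.mlab as m
import scipy.stats as s

csv = sys.argv[1] if len(sys.argv) > 1 else "data-device_utilization.csv"
data = m.csv2rec(csv)

chi2, p = s.chisquare(data['number_of_objects_stored'],
                      data['number_of_objects_expected'])

print("chi-squared = %.3f, p = %.3f" % (chi2, p))
if p < 0.05:
    print("placement differs from uniform at the 5% level")
else:
    print("no evidence against a uniform placement distribution")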
Re: [PATCH] PG: Do not discard op data too early
On 10/26/2012 02:52 PM, Gregory Farnum wrote: Wanted to touch base on this patch again. If Sage and Sam agree that we don't want to play any tricks with memory accounting, we should pull this patch in. I'm pretty sure we want it for Bobtail! I've been running with it since I posted it. I think it would be great if you could pick it up! -- Jim -Greg On Thu, Sep 27, 2012 at 3:36 PM, Jim Schutt wrote: On 09/27/2012 04:27 PM, Gregory Farnum wrote: On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt wrote: On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. Can't you also do this if you just set up a throttler with a limit of 0? :) Hmmm, I expect so. I guess I just didn't think of doing it that way When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of> 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. Okay, cool. Are you interested in reducing the memory usage a little more by deallocating the memory separately from accounting it? My testing doesn't indicate a need -- even keeping the memory around until the op is done, my daemons use less memory overall to get the same throughput. So, unless some other load condition indicates a need, I'd counsel simplicity. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PG: Do not discard op data too early
On 09/27/2012 04:27 PM, Gregory Farnum wrote: On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt wrote: On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. Can't you also do this if you just set up a throttler with a limit of 0? :) Hmmm, I expect so. I guess I just didn't think of doing it that way When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of> 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. Okay, cool. Are you interested in reducing the memory usage a little more by deallocating the memory separately from accounting it? My testing doesn't indicate a need -- even keeping the memory around until the op is done, my daemons use less memory overall to get the same throughput. So, unless some other load condition indicates a need, I'd counsel simplicity. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PG: Do not discard op data too early
On 09/27/2012 04:07 PM, Gregory Farnum wrote: Have you tested that this does what you want? If it does, I think we'll want to implement this so that we actually release the memory, but continue accounting it. Yes. I have diagnostic patches where I add an "advisory" option to Throttle, and apply it in advisory mode to the cluster throttler. In advisory mode Throttle counts bytes but never throttles. When I run all the clients I can muster (222) against a relatively small number of OSDs (48-96), with osd_client_message_size_cap set to 10,000,000 bytes I see spikes of > 100,000,000 bytes tied up in ops that came through the cluster messenger, and I see long wait times (> 60 secs) on ops coming through the client throttler. With this patch applied, I can raise osd_client_message_size_cap to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes tied up in ops that came through the cluster messenger. Wait times for ops coming through the client policy throttler are lower, overall daemon memory usage is lower, but throughput is the same. Overall, with this patch applied, my storage cluster "feels" much less brittle when overloaded. -- Jim On Thu, Sep 27, 2012 at 2:56 PM, Jim Schutt wrote: Under a sustained cephfs write load where the offered load is higher than the storage cluster write throughput, a backlog of replication ops that arrive via the cluster messenger builds up. The client message policy throttler, which should be limiting the total write workload accepted by the storage cluster, is unable to prevent it, for any value of osd_client_message_size_cap, under such an overload condition. The root cause is that op data is released too early, in op_applied(). If instead the op data is released at op deletion, then the limit imposed by the client policy throttler applies over the entire lifetime of the op, including commits of replication ops. That makes the policy throttler an effective means for an OSD to protect itself from a sustained high offered load, because it can effectively limit the total, cluster-wide resources needed to process in-progress write ops. Signed-off-by: Jim Schutt --- src/osd/ReplicatedPG.cc |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index a64abda..80bec2a 100644 --- a/src/osd/ReplicatedPG.cc +++ b/src/osd/ReplicatedPG.cc @@ -3490,10 +3490,6 @@ void ReplicatedPG::op_applied(RepGather *repop) dout(10)<< "op_applied "<< *repop<< dendl; if (repop->ctx->op) repop->ctx->op->mark_event("op_applied"); - - // discard my reference to the buffer - if (repop->ctx->op) -repop->ctx->op->request->clear_data(); repop->applying = false; repop->applied = true; -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] PG: Do not discard op data too early
Under a sustained cephfs write load where the offered load is higher than the storage cluster write throughput, a backlog of replication ops that arrive via the cluster messenger builds up. The client message policy throttler, which should be limiting the total write workload accepted by the storage cluster, is unable to prevent it, for any value of osd_client_message_size_cap, under such an overload condition. The root cause is that op data is released too early, in op_applied(). If instead the op data is released at op deletion, then the limit imposed by the client policy throttler applies over the entire lifetime of the op, including commits of replication ops. That makes the policy throttler an effective means for an OSD to protect itself from a sustained high offered load, because it can effectively limit the total, cluster-wide resources needed to process in-progress write ops.

Signed-off-by: Jim Schutt
---
 src/osd/ReplicatedPG.cc | 4
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index a64abda..80bec2a 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -3490,10 +3490,6 @@ void ReplicatedPG::op_applied(RepGather *repop)
   dout(10) << "op_applied " << *repop << dendl;
   if (repop->ctx->op)
     repop->ctx->op->mark_event("op_applied");
-
-  // discard my reference to the buffer
-  if (repop->ctx->op)
-    repop->ctx->op->request->clear_data();
   repop->applying = false;
   repop->applied = true;
--
1.7.8.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] FileJournal: correctly check return value of lseek in write_fd
Signed-off-by: Jim Schutt
---
 src/os/FileJournal.cc | 5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index d1c92dc..2254720 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@@ -945,9 +945,10 @@ void FileJournal::align_bl(off64_t pos, bufferlist& bl)
 int FileJournal::write_bl(off64_t& pos, bufferlist& bl)
 {
   align_bl(pos, bl);
+  int ret;

-  int ret = ::lseek64(fd, pos, SEEK_SET);
-  if (ret) {
+  off64_t spos = ::lseek64(fd, pos, SEEK_SET);
+  if (spos < 0) {
     ret = -errno;
     derr << "FileJournal::write_bl : lseek64 failed " << cpp_strerror(ret) << dendl;
     return ret;
--
1.7.8.2
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
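The bug this patch fixes is that lseek64() does not return 0 on success: it returns the resulting file offset, or (off64_t)-1 on error, so treating any non-zero return as a failure wrongly fails every seek to a non-zero position. A minimal standalone sketch of the corrected pattern (not the FileJournal code itself; the function name write_at is made up for the example):

#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Sketch of correct lseek error handling: the call succeeds unless it
// returns (off_t)-1, in which case errno holds the reason.
int write_at(int fd, off_t pos, const void *buf, size_t len) {
  off_t spos = ::lseek(fd, pos, SEEK_SET);
  if (spos == (off_t)-1) {                 // only -1 signals an error
    int ret = -errno;
    fprintf(stderr, "lseek to %lld failed: %s\n",
            (long long)pos, strerror(-ret));
    return ret;
  }
  ssize_t w = ::write(fd, buf, len);
  return (w < 0) ? -errno : 0;
}

Checking spos < 0, as the patch does, is equivalent for practical purposes since only -1 is ever returned on failure.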
Lots of misdirected client requests with 73984 PGs/pool
Hi, I was testing on 288 OSDs with pg_bits=8, for 73984 PGs/pool, 221952 total PGs. Writing from CephFS clients generates lots of messages like this: 2012-08-28 14:53:33.772344 osd.235 [WRN] client.4533 172.17.135.45:0/1432642641 misdirected client.4533.1:124 pg 0.8b9d12d4 to osd.235 in e7, client e7 pg 0.112d4 features 262282 There's no trouble with 288 OSDs and pg_bits=7 (36992 PGs/pool). This is all with current Linus master branch + ceph-client testing branch, and master branch on the server side. Does this have anything to do with old_pg_t AKA struct ceph_pg only using 16 bits for the placement seed? I'm about to start testing on 576 OSDs, and was hoping to use pg_bits=7 to get more uniform data placement across OSDs. My first step on that path was 288 OSDs with pg_bits=8. Maybe I can't, or I have to just ignore the warnings, until the kernel's libceph learns about 32-bit placement seeds? Or maybe this 16-bit seed theory is all wrong, and something else is going on? If so, what more information is needed to help sort this out? Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
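The numbers in this report line up with the 16-bit-seed theory: 2^16 = 65536, so a pool with 73984 PGs (pg_bits=8 on 288 OSDs) cannot have every PG named by a 16-bit placement seed, while 36992 PGs (pg_bits=7) still fits -- which would explain why only the pg_bits=8 configuration misbehaves. The small standalone program below just restates that arithmetic; it is an illustration of the theory in the message, not the actual old_pg_t/struct ceph_pg encoding or the kernel client's mapping code.

#include <cstdint>
#include <cstdio>

// Back-of-the-envelope check of the 16-bit placement-seed theory: a pool with
// 73984 PGs cannot be addressed by a 16-bit seed, while 36992 PGs still can.
int main() {
  const uint32_t pg_with_8_bits = 73984;    // 288 OSDs, osd pg bits = 8
  const uint32_t pg_with_7_bits = 36992;    // 288 OSDs, osd pg bits = 7
  const uint32_t max_16bit      = 1u << 16; // 65536 distinct 16-bit seeds

  printf("pg_bits=8: %u PGs, fits in 16 bits: %s\n",
         pg_with_8_bits, pg_with_8_bits <= max_16bit ? "yes" : "no");
  printf("pg_bits=7: %u PGs, fits in 16 bits: %s\n",
         pg_with_7_bits, pg_with_7_bits <= max_16bit ? "yes" : "no");
  return 0;
}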
Re: How to increase the number of pgs at pool creation time?
On 08/23/2012 03:26 PM, Tren Blackburn wrote: On Thu, Aug 23, 2012 at 2:17 PM, Jim Schutt wrote: On 08/23/2012 02:39 PM, Tren Blackburn wrote: 2) Increase the number of pgs via ceph.conf (osd pg bits = 7) and create the cluster. This does not work either as the cluster comes up with 6 pgs bits per osd still. re:http://ceph.com/docs/master/config-cluster/osd-config-ref/ I use this regularly. Do you have that in your global section, i.e. [global] osd pg bits = 7 osd pgp bits = 7 Hi Jim; Sorry about the direct reply. I have cc'ed the list. To answer your question I put that setting in the [osd] section, not the [global] section. How would a person know these settings go in [global]? I'm sorry to admit I haven't learned how to know this either. I recall trying it in the [osd] section and not having it work, and thinking maybe the monitor would need to know, and putting it in the [global] section would give it access. It worked, so I moved on. Also, I could not find documentation about "pgp bits" What is a "placement group p bit"? Again, I'm not sure. The default config has pg bits and pgp bits set to the same value, so I think I assumed it would be best to have them the same if I changed pg bits. Thank you in advance! Sorry I don't know more -- Jim Regards, Tren -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to increase the number of pgs at pool creation time?
On 08/23/2012 02:39 PM, Tren Blackburn wrote: 2) Increase the number of pgs via ceph.conf (osd pg bits = 7) and create the cluster. This does not work either as the cluster comes up with 6 pgs bits per osd still. re:http://ceph.com/docs/master/config-cluster/osd-config-ref/ I use this regularly. Do you have that in your global section, i.e. [global] osd pg bits = 7 osd pgp bits = 7 -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts
On 08/08/2012 07:13 PM, Alex Elder wrote: On 08/08/2012 11:09 AM, Jim Schutt wrote: Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. This is possible because the server-side messenger immediately sends its banner and addresses after accepting a connect request, *before* actually attempting to read or verify the banner from the client. It just occurred to me this really should be added to the commit message if you take the patch. I can resend with it added, or you can add it when you commit the patch. Thanks for verifying this aspect of what's happening. When ceph_sock_state_change() notices the connect has completed, it schedules work to process the socket via con_work(). During this time the peer is writing its banner, and arrival of the peer banner races with con_work(). If con_work() calls try_read() before the peer banner arrives, there is nothing for it to do, after which con_work() calls try_write() to send the client's banner. In this case Ceph's protocol negotiation can complete succesfully. If, however, the peer's banner arrives before con_work() calls try_read(), then try_read() will read the banner and prepare protocol negotiation info via prepare_write_connect(). prepare_write_connect() calls con_out_kvec_reset(), which discards the as-yet-unsent client banner. Next, con_work() calls try_write(), which sends the protocol negotiation info rather than the banner that the peer is expecting. I believe your analysis is correct. And I think your solution sounds good as well. I think Sage was going to take a look at this also, and I'll wait for him to have a chance to comment. But in any case, your fix looks good. Nice job. Thanks for taking a look. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible
On 08/09/2012 10:26 AM, Tommi Virtanen wrote: mkcephfs is not a viable route forward. For example, it is unable to expand a pre-existing cluster. The new "OSD hotplugging" style init is much, much nicer. And does more than just mkfs& mount. I'm embarrassed to admit I haven't been keeping up with this, but I seem to recall that early versions didn't handle a journal on a partition. Did I get that wrong, or maybe that capability exists now? In the past I've found it to have a small performance benefit, and would hate to lose it. Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.6-rc1] libceph: ensure banner is written on client connect before feature negotiation starts
Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. When ceph_sock_state_change() notices the connect has completed, it schedules work to process the socket via con_work(). During this time the peer is writing its banner, and arrival of the peer banner races with con_work(). If con_work() calls try_read() before the peer banner arrives, there is nothing for it to do, after which con_work() calls try_write() to send the client's banner. In this case Ceph's protocol negotiation can complete succesfully. If, however, the peer's banner arrives before con_work() calls try_read(), then try_read() will read the banner and prepare protocol negotiation info via prepare_write_connect(). prepare_write_connect() calls con_out_kvec_reset(), which discards the as-yet-unsent client banner. Next, con_work() calls try_write(), which sends the protocol negotiation info rather than the banner that the peer is expecting. The result is that the peer sees an invalid banner, and the client reports "negotiation failed". Fix this by moving con_out_kvec_reset() out of prepare_write_connect() to its callers at all locations except the one where the banner might still need to be sent. Signed-off-by: Jim Schutt --- net/ceph/messenger.c | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index b979675..24c5eea 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -915,7 +915,6 @@ static int prepare_write_connect(struct ceph_connection *con) con->out_connect.authorizer_len = auth ? cpu_to_le32(auth->authorizer_buf_len) : 0; - con_out_kvec_reset(con); con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect); if (auth && auth->authorizer_buf_len) @@ -1557,6 +1556,7 @@ static int process_connect(struct ceph_connection *con) return -1; } con->auth_retry = 1; + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1577,6 +1577,7 @@ static int process_connect(struct ceph_connection *con) ENTITY_NAME(con->peer_name), ceph_pr_addr(&con->peer_addr.in_addr)); reset_connection(con); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1601,6 +1602,7 @@ static int process_connect(struct ceph_connection *con) le32_to_cpu(con->out_connect.connect_seq), le32_to_cpu(con->in_reply.connect_seq)); con->connect_seq = le32_to_cpu(con->in_reply.connect_seq); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -1617,6 +1619,7 @@ static int process_connect(struct ceph_connection *con) le32_to_cpu(con->in_reply.global_seq)); get_global_seq(con->msgr, le32_to_cpu(con->in_reply.global_seq)); + con_out_kvec_reset(con); ret = prepare_write_connect(con); if (ret < 0) return ret; @@ -2135,7 +2138,11 @@ more: BUG_ON(con->state != CON_STATE_CONNECTING); con->state = CON_STATE_NEGOTIATING; - /* Banner is good, exchange connection info */ + /* +* Received banner is good, exchange connection info. +* Do not reset out_kvec, as sending our banner raced +* with receiving peer banner after connect completed. +*/ ret = prepare_write_connect(con); if (ret < 0) goto out; -- 1.7.8.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On 07/31/2012 12:40 PM, Sage Weil wrote: On Tue, 31 Jul 2012, Gregory Farnum wrote: On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt wrote: Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster. Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about? I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny it's the same basic issue you've described. I think that we've done all our diagnosis using profiling tools, but there's now a map cache and it probably wouldn't be too difficult to have it dump data via perfcounters if you poked around...anything like this exist yet, Sage? Much of the bad behavior was triggered by #2860, fixes for which just went into the stable and master branches yesterday. It's difficult to fully observe the bad behavior, though (lots of time spend in generate_past_intervals, reading old maps off disk). With the fix, we pretty much only process maps during handle_osd_map. Adding perfcounters in the methods that grab a map out of the cache or (more importantly) read it off disk will give you better visibility into that. It should be pretty easy to instrument that (and I'll gladly take patches that implement that... :). Without knowing more about what you're seeing, it's hard to say if its related, though. This was triggered by long periods of unclean pgs and lots of data migration, not high load. An issue I've been seeing is unusually high OSD memory use. It seems to be triggered by linux clients timing out requests and resetting OSDs during a heavy write load, but I was hoping to rule out any memory-use issues caused by map processing. However, this morning I started testing your server wip-msgr branch together with the kernel-side patches queued up for 3.6, and so far with that combination I've been unable to trigger the behavior I was seeing. So, that's great news, and I think confirms that issue was unrelated to any map issues. I've also sometimes had issues with my cluster becoming unstable when failing an OSD while the cluster is under a heavy write load, but hadn't been successful at characterizing under what conditions it couldn't recover. I expect that situation is now improved as well, and will retest. Thanks -- Jim sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On 07/30/2012 06:24 PM, Gregory Farnum wrote: On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt wrote: Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100? Sage should probably describe in more depth what we've seen since he's looked at it the most, but I can expand on it a little. In argonaut and earlier version of Ceph, processing a new OSDMap for an OSD is very expensive. I don't remember the precise numbers we'd whittled it down to but it required at least one disk sync as well as pausing all request processing for a while. If you combined this expense with a large number of large maps (if, perhaps, one quarter of your 800-OSD system had been down but not out for 6+ hours), you could cause memory thrashing on OSDs as they came up, which could force them to become very, very, veeery slow. In the next version of Ceph, map processing is much less expensive (no syncs or full-system pauses required), which will prevent request backup. And there are a huge number of ways to reduce the memory utilization of maps, some of which can be backported to argonaut and some of which can't. Now, if we can't prevent our internal processes from running an OSD out of memory, we'll have failed. But we don't think this is an intractable problem; in fact we have reason to hope we've cleared it up now that we've seen the problem — although we don't think it's something that we can absolutely prevent on argonaut (too much code churn). So we're looking for something that we can apply to argonaut as a band-aid, but that we can also keep around in case forces external to Ceph start causing similar cluster-scale resource shortages beyond our control (runaway co-located process eats up all the memory on lots of boxes, switch fails and bandwidth gets cut in half, etc). If something happens that means Ceph can only supply half as much throughput as it was previously, then Ceph should provide that much throughput; right now if that kind of incident occurs then Ceph won't provide any throughput because it'll all be eaten by spurious recovery work. Ah, thanks for the extra context. I hadn't fully appreciated the proposal was primarily a mitigation for argonaut, and otherwise as a fail-safe mechanism. As I mentioned above, I'm concerned this is addressing symptoms, rather than root causes. I'm concerned the root cause has something to do with how the map processing work scales with number of OSDs/PGs, and that this will limit the maximum size of a Ceph storage cluster. I think I discussed this above enough already? :) Yep, thanks. But, if you really just want to not mark down an OSD that is laggy, I know this will sound simplistic, but I keep thinking that the OSD knows for itself if it's up, even when the heartbeat mechanism is backed up. Couldn't there be some way to ask an OSD suspected of being down whether it is or not, separate from the heartbeat mechanism? I mean, if you're considering having the monitor ignore OSD down reports for a while based on some estimate of past behavior, wouldn't it be better for the monitor to just ask such an OSD, "hey, are you still there?" If it gets an immediate "I'm busy, come back later", extend the grace period; otherwise, mark the OSD down. Hmm. 
The concern is that if an OSD is stuck on disk swapping then it's going to be just as stuck for the monitors as the OSDs — they're all using the same network in the basic case, etc. We want to be able to make that guess before the OSD is able to answer such questions. But I'll think on if we could try something else similar. OK - thanks. Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster. Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about? I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny it's the same basic issue you've described. Thanks -- Jim -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: avoiding false detection of down OSDs
Hi Greg, Thanks for the write-up. I have a couple questions below. On 07/30/2012 12:46 PM, Gregory Farnum wrote: As Ceph gets deployed on larger clusters our most common scaling issues have related to 1) our heartbeat system, and 2) handling the larger numbers of OSDMaps that get generated by increases in the OSD (failures, boots, etc) and PG count (osd up-thrus, pg_temp insertions, etc). Lately we haven't had many issues with heartbeats when the OSDs are happy, so it looks like the latest respin of the heartbeat code is going to satisfy us going forward. Fast OSD map generation continues to be a concern, but with the merge of Sam's new map handling code recently (which reduced the amount of disk effort required to process a map and shuffled responsibility out of the main OSD thread and into the more highly-threaded PGs) it has become significantly less expensive, and we have a number of implemented and planned changes (from the short- to the long-term) to continue making it less painful. However, we've started seeing a new issue at the intersection of these separate problems: what happens when an OSD slows down because it's processing too many maps, but continues to operate. In large clusters, an OSD might go down and come back up with hundreds-to-thousands of maps to process — often at the same time as other OSDs. We've started to observe issues during software upgrades where a lot of OSDs come up together and process so many maps that they run out of memory and start swapping[1]. This can easily cause them to miss heartbeats long enough to get marked down — but then they finish map processing, tell the monitors they *are* alive, and get marked back up. This sequence can cause so many new maps to generate that it repeats itself on the new nodes, spreads to other nodes in the cluster, or even causes some running OSD daemons to get marked out. We've taken to calling this "OSD thrashing". It would be great if we could come up with a systemic way to reduce thrashing, independent from our efforts to reduce the triggering factors. (For one thing, when only one node is thrashing we probably want to mark it down to preserve performance, whereas when half the cluster is thrashing we want to keep them up to reduce cluster-wide load increases.) A few weeks ago some of us at Inktank had a meeting to discuss the issue, and I've finally gotten around to writing it up in this email so that we can ask for input from the wider community! After discussing several approaches (including scaling heartbeat intervals as more nodes are marked down, as nodes report being wrongly marked down, putting caps on the number of nodes that can be auto marked down and/or out, applying rate limiters to the auto-marks, etc), we realized that what we really wanted was to do our best to estimate the chances that an OSD which missed its heartbeat window was simply laggy rather than being down. I don't understand the functional difference between an OSD that is too busy to process its heartbeat in a timely fashion, and one that is down. In either case, it cannot meet its obligations to its peers. I understand that wrongly marking an OSD down adds unnecessary map processing work. Also, if an OSD is wrongly marked down then any data that would be written to it while it is marked down will be written to other OSDs, and will need to be migrated when that OSD is marked back up. 
I don't fully understand what is the impact of not marking down an OSD that really is dead, particularly if the cluster is under a heavy write load from many clients. At the very least, write requests that have a replica on such an OSD will stall waiting for an ack that will never come, or a new map, right? It seems to me that each of the discarded solutions has similar properties as the favored solution: they address a symptom, rather than the cause. Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100? While long-term I'm a proponent of pushing most of this heartbeat handling logic to the OSDs, in the short term adjustments to the algorithm are much easier to implement in the monitor (which has a lot more state on the cluster already local). We came up with a broad algorithm to estimate the chance that an OSD is laggy instead of down: first, figure out the probability that the OSD is down based on its past history, and then figure out that probability for the cluster that the OSD belongs to. Basically: 1) Keep track of when an OSD boots if it reports itself as fresh or as wrongly-marked-down. Maintain the probability that the OSD is actually down versus laggy based on that data and an exponential decay (more recent reports matter more), and maintain
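The scheme sketched in this message -- keep a per-OSD history of whether past "boots" were really fresh starts or wrongly-marked-down (laggy) OSDs, decay it exponentially so recent reports matter more, and fold in a cluster-wide estimate -- could look something like the following. This is only a sketch of the idea; the names, the decay constant, and the way the per-OSD and cluster estimates are blended into a heartbeat-grace multiplier are all invented for illustration, not the algorithm that was ultimately implemented in the monitor.

#include <map>

// Sketch: track, per OSD, an exponentially decayed estimate of how often a
// "boot" turned out to be a wrongly-marked-down (laggy) OSD rather than a
// genuinely fresh start.  More recent boots count for more via the decay.
struct LaggyEstimator {
  double decay = 0.75;                 // weight kept from older observations (assumed)
  std::map<int, double> p_laggy;       // per-OSD estimate in [0, 1]
  double cluster_p_laggy = 0.0;        // same estimate pooled over all OSDs

  // Called whenever an OSD boots: wrongly_marked_down is true if the OSD
  // reported that it had been alive all along.
  void note_boot(int osd, bool wrongly_marked_down) {
    double obs = wrongly_marked_down ? 1.0 : 0.0;
    double &p = p_laggy[osd];          // defaults to 0.0 on first boot
    p = decay * p + (1.0 - decay) * obs;
    cluster_p_laggy = decay * cluster_p_laggy + (1.0 - decay) * obs;
  }

  // When failure reports arrive, scale the heartbeat grace period by how
  // likely this OSD (and the cluster as a whole) is to be laggy, not down.
  double grace_multiplier(int osd) const {
    auto it = p_laggy.find(osd);
    double p_osd = (it == p_laggy.end()) ? 0.0 : it->second;
    return 1.0 + p_osd + cluster_p_laggy;   // arbitrary illustrative blend
  }
};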
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/18/2012 12:03 PM, Samuel Just wrote: Sorry, master has a fix now for that also. 76efd9772c60b93bbf632e3ecc3b9117dc081427 -Sam That got things running for me. Thanks for the quick reply. -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/17/2012 06:03 PM, Samuel Just wrote: master should now have a fix for that, let me know how it goes. I opened bug #2798 for this issue. Hmmm, it seems handle_osd_ping() now runs into a case where for the first ping it gets, service.osdmap can be empty? 0> 2012-07-18 09:17:23.977497 7fffe6ec6700 -1 *** Caught signal (Segmentation fault) ** in thread 7fffe6ec6700 ceph version 0.48argonaut-419-g4e1d973 (commit:4e1d973e466cd45138f004e84ab8631d9b2a60fa) 1: /usr/bin/ceph-osd() [0x723c39] 2: (()+0xf4a0) [0x776584a0] 3: (OSD::handle_osd_ping(MOSDPing*)+0x7d4) [0x5d7894] 4: (OSD::heartbeat_dispatch(Message*)+0x71) [0x5d8111] 5: (SimpleMessenger::DispatchQueue::entry()+0x583) [0x7d5103] 6: (SimpleMessenger::dispatch_entry()+0x15) [0x7d6485] 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x79523d] 8: (()+0x77f1) [0x776507f1] 9: (clone()+0x6d) [0x76aa1ccd] gdb has this to say: (gdb) bt #0 0x7765836b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 #1 0x00724067 in reraise_fatal (signum=11) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104 #3 #4 get_epoch (this=0x15d, m=0x1587000) at ./osd/OSDMap.h:210 #5 OSD::handle_osd_ping (this=0x15d, m=0x1587000) at osd/OSD.cc:1711 #6 0x005d8111 in OSD::heartbeat_dispatch (this=0x15d, m=0x1587000) at osd/OSD.cc:2769 #7 0x007d5103 in ms_deliver_dispatch (this=0x1472960) at msg/Messenger.h:504 #8 SimpleMessenger::DispatchQueue::entry (this=0x1472960) at msg/SimpleMessenger.cc:367 #9 0x007d6485 in SimpleMessenger::dispatch_entry (this=0x1472880) at msg/SimpleMessenger.cc:384 #10 0x0079523d in SimpleMessenger::DispatchThread::entry (this=) at ./msg/SimpleMessenger.h:807 #11 0x776507f1 in start_thread (arg=0x7fffe6ec6700) at pthread_create.c:301 #12 0x76aa1ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) f 5 #5 OSD::handle_osd_ping (this=0x15d, m=0x1587000) at osd/OSD.cc:1711 1711m->stamp); (gdb) l 1706} 1707 } 1708 Message *r = new MOSDPing(monc->get_fsid(), 1709curmap->get_epoch(), 1710MOSDPing::PING_REPLY, 1711m->stamp); 1712 hbserver_messenger->send_message(r, m->get_connection()); 1713 1714 if (curmap->is_up(from)) { 1715note_peer_epoch(from, m->map_epoch); (gdb) p curmap $1 = std::tr1::shared_ptr (empty) 0x0 -- Jim Thanks for the info! -Sam On Tue, Jul 17, 2012 at 2:54 PM, Jim Schutt wrote: On 07/17/2012 03:44 PM, Samuel Just wrote: Not quite. OSDService::get_osdmap() returns the most recently published osdmap. Generally, OSD::osdmap is safe to use when you are holding the osd lock. Otherwise, OSDService::get_osdmap() should be used. There are a few other things that should be fixed surrounding this issue as well, I'll put some time into it today. The map_lock should probably be removed all together. Thanks for taking a look. Let me know when you get something, and I'll take it for a spin. Thanks -- Jim -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
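The empty shared_ptr in this trace is the classic hazard of a published-map holder: a consumer that runs before the first map has been published sees nothing. The standalone sketch below shows the pattern and the guard a consumer needs; it is illustrative only, not the actual OSDService code or the fix that landed in 76efd9772c60.

#include <memory>
#include <mutex>

struct OSDMapStub { unsigned epoch = 0; };   // stand-in for the real OSDMap

// Sketch of a "published map" holder: writers publish a new snapshot under a
// lock; readers grab the current snapshot, so they must be prepared to see
// no snapshot at all before the first publish.
class MapService {
  std::mutex lock;
  std::shared_ptr<const OSDMapStub> published;
public:
  void publish(std::shared_ptr<const OSDMapStub> m) {
    std::lock_guard<std::mutex> l(lock);
    published = std::move(m);
  }
  std::shared_ptr<const OSDMapStub> get_osdmap() {
    std::lock_guard<std::mutex> l(lock);
    return published;                        // may be empty before first publish
  }
};

// A heartbeat-style consumer has to guard against the empty case, e.g.:
//   auto map = service.get_osdmap();
//   if (!map)
//     return;        // no map published yet; drop or defer the ping
//   use(map->epoch);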
Re: osd/OSDMap.h: 330: FAILED assert(is_up(osd))
On 07/17/2012 03:44 PM, Samuel Just wrote: Not quite. OSDService::get_osdmap() returns the most recently published osdmap. Generally, OSD::osdmap is safe to use when you are holding the osd lock. Otherwise, OSDService::get_osdmap() should be used. There are a few other things that should be fixed surrounding this issue as well, I'll put some time into it today. The map_lock should probably be removed all together. Thanks for taking a look. Let me know when you get something, and I'll take it for a spin. Thanks -- Jim -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
osd/OSDMap.h: 330: FAILED assert(is_up(osd))
Hi, Recent master branch is asserting for me like this: ceph version 0.48argonaut-404-gabe05a3 (commit:abe05a3fbbb120d8d354623258d9104584db66f7) 1: (OSDMap::get_cluster_inst(int) const+0xc9) [0x58cde9] 2: (OSD::handle_osd_ping(MOSDPing*)+0x8cf) [0x5d4b4f] 3: (OSD::heartbeat_dispatch(Message*)+0x71) [0x5d5491] 4: (SimpleMessenger::DispatchQueue::entry()+0x583) [0x7d5683] 5: (SimpleMessenger::dispatch_entry()+0x15) [0x7d6a05] 6: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7957bd] 7: (()+0x77f1) [0x776507f1] 8: (clone()+0x6d) [0x76aa1ccd] gdb had this to say: (gdb) bt #0 0x7765836b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42 #1 0x007245b7 in reraise_fatal (signum=6) at global/signal_handler.cc:58 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:104 #3 #4 0x769ee885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #5 0x769f0065 in abort () at abort.c:92 #6 0x003be84bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #7 0x003be84bcc06 in ?? () from /usr/lib64/libstdc++.so.6 #8 0x003be84bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6 #9 0x003be84bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6 #10 0x0074b9e3 in ceph::__ceph_assert_fail (assertion=0x1488000 "\001", file=0x2d828a0 "\260m\"\003", line=330, func=0x8701e0 "entity_inst_t OSDMap::get_cluster_inst(int) const") at common/assert.cc:77 #11 0x0058cde9 in OSDMap::get_cluster_inst (this=, osd=) at osd/OSDMap.h:330 #12 0x005d4b4f in OSD::handle_osd_ping (this=0x14d8000, m=) at osd/OSD.cc:1717 #13 0x005d5491 in OSD::heartbeat_dispatch (this=0x14d8000, m=0x24383100) at osd/OSD.cc:2784 #14 0x007d5683 in ms_deliver_dispatch (this=0x1472960) at msg/Messenger.h:504 #15 SimpleMessenger::DispatchQueue::entry (this=0x1472960) at msg/SimpleMessenger.cc:367 #16 0x007d6a05 in SimpleMessenger::dispatch_entry (this=0x1472880) at msg/SimpleMessenger.cc:384 #17 0x007957bd in SimpleMessenger::DispatchThread::entry (this=) at ./msg/SimpleMessenger.h:807 #18 0x776507f1 in start_thread (arg=0x7fffe6ec6700) at pthread_create.c:301 #19 0x76aa1ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) f 12 #12 0x005d4b4f in OSD::handle_osd_ping (this=0x14d8000, m=) at osd/OSD.cc:1717 1717 _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from)); (gdb) l 1712 hbserver_messenger->send_message(r, m->get_connection()); 1713 1714 if (osdmap->is_up(from)) { 1715note_peer_epoch(from, m->map_epoch); 1716if (locked && is_active()) 1717 _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from)); 1718 } 1719} 1720break; 1721 (gdb) f 11 #11 0x0058cde9 in OSDMap::get_cluster_inst (this=, osd=) at osd/OSDMap.h:330 330 assert(is_up(osd)); (gdb) l 325 entity_inst_t get_inst(int osd) const { 326 assert(is_up(osd)); 327 return entity_inst_t(entity_name_t::OSD(osd), get_addr(osd)); 328 } 329 entity_inst_t get_cluster_inst(int osd) const { 330 assert(is_up(osd)); 331 return entity_inst_t(entity_name_t::OSD(osd), get_cluster_addr(osd)); 332 } 333 entity_inst_t get_hb_inst(int osd) const { 334 assert(is_up(osd)); Apparently osdmap member in class OSD don't have the same map contents as the osdmap member in OSDService in this instance? Why are there two osdmaps? Under what conditions is it appropriate for them to have different contents? Is this the appropriate fix? 
@@ -1711,10 +1711,10 @@ void OSD::handle_osd_ping(MOSDPing *m)
                               m->stamp);
       hbserver_messenger->send_message(r, m->get_connection());

-      if (osdmap->is_up(from)) {
+      if (locked && osdmap->is_up(from)) {
        note_peer_epoch(from, m->map_epoch);
-       if (locked && is_active())
-         _share_map_outgoing(service.get_osdmap()->get_cluster_inst(from));
+       if (is_active())
+         _share_map_outgoing(osdmap->get_cluster_inst(from));
       }
     }
     break;

Thanks -- Jim
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 07/02/2012 08:07 AM, Stefan Priebe - Profihost AG wrote: Am 02.07.2012 16:04, schrieb Jim Schutt: On 07/01/2012 01:57 PM, Stefan Priebe wrote: thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. Thanks. Have you ever measured random I/O performance? Or is sequential all you need? So far I've only been testing sequential, because I like to find the limits under sequential load first. --Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 07/01/2012 01:57 PM, Stefan Priebe wrote: thanks for sharing. Which btrfs mount options did you use? -o noatime is all I use. -- Jim Am 29.06.2012 00:37, schrieb Jim Schutt: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 2.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel runs into trouble looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting results
On 06/28/2012 04:53 PM, Mark Nelson wrote: On 06/28/2012 05:37 PM, Jim Schutt wrote: Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 2.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel runs into trouble looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim Nice! Did you see much performance degradation over time? Internally I've sen some slow downs (especially at smaller block sizes) as the osds fill up. How many servers and how many drives? This result is from 12 servers, 24 OSDs/server, starting from a freshly-built filesystem. I use 64KB btrfs metadata nodes. There is some performance degradation during such runs. During the initial 10 TB or so, each server sustains ~2.2 GB/s, as reported by vmstat. Nearer the end of the run, data rate on each server is much more variable, with peaks at ~2 GB/s and valleys at ~1.5 GB/s. I am suspecting that some of that variability comes from the OSDs not filling up uniformly; here's low/high utilization at the end of the run: server 1K-blocks Used Available Use% Mounted on cs42: 939095640 258202860 662416404 29% /ram/mnt/ceph/data.osd.261 cs38: 939095640 259052468 661568524 29% /ram/mnt/ceph/data.osd.154 cs39: 939095640 264803592 655825592 29% /ram/mnt/ceph/data.osd.174 cs34: 939095640 265911256 654711400 29% /ram/mnt/ceph/data.osd.52 cs41: 939095640 270588260 650049820 30% /ram/mnt/ceph/data.osd.238 cs33: 939095640 345327760 575399472 38% /ram/mnt/ceph/data.osd.47 cs40: 939095640 351180832 569558176 39% /ram/mnt/ceph/data.osd.205 cs35: 939095640 351372096 569365696 39% /ram/mnt/ceph/data.osd.89 cs41: 939095640 352522904 568214632 39% /ram/mnt/ceph/data.osd.217 cs33: 939095640 358181684 562561740 39% /ram/mnt/ceph/data.osd.35 max/min: 1.3872 Note that I am using osd_pg_bits=7, osd_pgp_bits=7. I have plans to push that to see what happens. I've also got another dozen servers on a truck somewhere on their way to here The under-utilized OSDs finish early, which I believe contributes to performance tailing off at the end of such a run. I don't have any data on how big this effect might be. I haven't yet tested filling my filesystem to capacity, so I have no data regarding what happens as the disks fill up. Still, those are the kinds of numbers I like to see. Congrats! :) Thanks - I think it's pretty cool that testing Ceph found a performance issue in the kernel. -- Jim Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Interesting results
Hi, Lots of trouble reports go by on the list - I thought it would be useful to report a success. Using a patch (https://lkml.org/lkml/2012/6/28/446) on top of 3.5-rc4 for my OSD servers, the same kernel for my Linux clients, and a recent master branch tip (git://github.com/ceph/ceph commit 4142ac44b3f), I was able to sustain streaming writes from 166 linux clients for 2 hours: On 166 clients: dd conv=fdatasync if=/dev/zero of=/mnt/ceph/stripe-4M/1/zero0.`hostname -s` bs=4k count=65536k Elapsed time: 7274.55 seconds Total data: 45629732.553 MB (43515904 MiB) Aggregate rate: 6272.516 MB/s That kernel patch was critical; without it this test runs into trouble after a few minutes because the kernel gets bogged down looking for pages to merge during page compaction. Also critical were the ceph tunings I mentioned here: http://www.spinics.net/lists/ceph-devel/msg07128.html -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/28/2012 09:45 AM, Alexandre DERUMIER wrote: >>Definitely. Seeing perf/oprofile/whatever results for the osd under that >>workload would be very interesting! We need to get perf going in our >>testing environment... I'm not an expert, but if you give me command line, I'll do it ;) Thanks to Mark's help, I'm now using: perf record -g -a sleep 10 perf report --sort symbol --call-graph fractal,5 | more -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: excessive CPU utilization by isolate_freepages?
On 06/28/2012 05:36 AM, Mel Gorman wrote: On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote: Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to report for 3.5-rc4. Out of curiosity, why do you think it showed up in 3.4? It's not surprising as such if it did show up there but I'm wondering what you are basing it on. If I remember correctly, when I was testing this workload on 3.4 is when I first saw hundreds of runable threads being reported by vmstat. At that time I couldn't reproduce quite as reliably, and I didn't know how to get perf to give me call chains, so I didn't follow up :( It's not a suprise because it's also where reclaim/compaction stopped depending on lumpy reclaim. In the past we would have reclaimed more pages but now rely on compaction more. It's plassible that for many parallel compactions that there would be higher CPU usage now. 2012-06-27 14:00:03.219-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 75 1 0 566988576 3566480000 2 1355 21 3 1 4 95 0 0 433 1 0 964052576 3506911200 7 456359 102256 20901 2 98 0 0 0 547 3 0 820116576 348939320057 560507 114878 28115 3 96 0 0 0 806 2 0 606992576 3484818000 339 309668 101230 21056 2 98 0 0 0 708 1 0 529624576 3470800000 248 370886 101327 20062 2 97 0 0 0 231 5 0 504772576 3466388000 305 334824 95045 20407 2 97 1 1 0 158 6 01063088576 3351853600 531 847435 130696 47140 4 92 1 2 0 193 0 01449156576 3303557200 363 371279 94470 18955 2 96 1 1 0 266 6 01623512576 327281640077 241114 95730 15483 2 98 0 0 0 243 8 01629504576 326530800081 471018 100223 20920 3 96 0 1 0 70 11 01342140576 3308402000 100 925869 139876 56599 6 88 3 3 0 211 7 01130316576 3347043200 290 1008984 150699 74320 6 83 6 5 0 365 3 0 776736576 3407277200 182 747167 139436 67135 5 88 4 3 0 29 1 01528412576 341106400050 612181 137403 77609 4 87 6 3 0 266 5 01657688576 3410569600 3 258307 62879 38508 2 93 3 2 0 1159 2 02002256576 337754760019 88554 42112 14230 1 98 0 0 0 ok, so System CPU usage through the roof. Right around 14:00 I was able to get a "perf -a -g"; here's the beginning of what "perf report --sort symbol --call-graph fractal,5" had to say: # 64.86% [k] _raw_spin_lock_irqsave | |--97.94%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | | | |--99.56%-- try_to_compact_pages | | __alloc_pages_direct_compact | | __alloc_pages_slowpath | | __alloc_pages_nodemask | | alloc_pages_vma | | do_huge_pmd_anonymous_page | | handle_mm_fault | | do_page_fault | | page_fault | | | | | |--53.53%-- skb_copy_datagram_iovec | | | tcp_recvmsg | | | inet_recvmsg | | | sock_recvmsg | | | sys_recvfrom | | | system_call_fastpath | | | __recv | | | | | | | --100.00%-- (nil) | | | | | |--27.80%-- __pthread_create_2_1 | | | (nil) | | | | | --18.67%-- memcpy | |
excessive CPU utilization by isolate_freepages?
Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to report for 3.5-rc4. (FWIW I'm running the my distro version of perf; please let me know if I need to compile the tools/perf version to match my kernel.) The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers. In the early phase of such a test, when things are running well, here's what vmstat reports for the state of one of these servers: 2012-06-27 13:56:58.356-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 31 15 0 287216576 3860662800 2 11582 14 1 3 95 0 0 27 15 0 225288576 385833840018 016 203357 134876 11 56 17 15 0 28 17 0 219256576 385447360011 2305932 203141 146296 11 49 23 17 0 6 18 0 215596576 3855287200 7 2363207 215264 166502 12 45 22 20 0 22 18 0 226984576 3859640400 3 2445741 223114 179527 12 43 23 22 0 30 12 0 230844576 384616480014 2298537 216580 11 12 45 23 20 0 29 19 0 237856576 3842288400 5 2332741 209408 157138 12 42 25 22 0 17 11 0 222156576 3848399200 4 2380495 210312 173121 12 39 28 22 0 11 13 0 216152576 384638720044 2362186 215236 176454 12 39 27 22 0 12 14 0 223704576 3854672000 2 2395530 214684 177518 12 39 28 22 0 39 17 0 219932576 3859818400 4 2428231 223284 179095 12 42 24 21 0 11 10 0 219956576 3852104800 4 2323853 207472 15 12 39 28 21 0 23 15 0 216580576 3845190400 3 2241800 201049 163496 11 37 31 21 0 9 13 0 225792576 384519160013 2281900 204869 171814 11 36 30 23 0 14 12 0 233820576 3849272800 4 2293423 207686 173019 11 37 31 21 0 11 20 0 213796576 3853320800 3 2288636 205605 168118 11 37 31 21 0 The system begins to struggle over the next several minutes; here's what vmstat has to say: 2012-06-27 13:57:58.831-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 21 16 0 224628576 3852687200 2 12339 19 1 4 95 0 0 12 19 0 232060576 3850102000 4 2366769 221418 159890 12 48 20 19 0 124 8 0 218548576 383796560013 2103075 199660 108618 11 65 13 12 0 24 10 0 300476576 382302880031 1966282 177472 84572 10 76 7 7 0 20 16 0 217584576 3829670000 9 2062571 195936 128810 10 55 20 15 0 53 12 0 235720576 382479680030 2035407 196973 133921 10 52 23 15 0 20 16 0 360340576 3806799200 6 2192179 208692 136784 11 54 19 15 0 26 10 0 310800576 380938840043 2138772 207105 118718 11 64 12 13 0 24 15 0 261108576 380308280068 2174015 205793 135302 11 56 18 15 0 9 17 0 241816576 379820720020 2076145 194971 120285 10 60 16 14 0 37 15 0 255972576 378928680014 2225076 205694 126954 11 59 16 13 0 27 16 0 243212576 3787270400 6 2249476 210885 134684 12 60 15 14 0 30 10 0 217572576 3779538800 3 2128688 205027 118319 11 66 12 11 0 26 11 0 236420576 377401640023 2109709 205105 133925 10 56 19 14 0 45 15 0 330056576 376198960015 1948311 196188 119330 10 62 15 13 0 54 15 0 242696576 3763150000 4 2159530 202595 132588 11 59 16 15 0 2012-06-27 13:58:59.569-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b s
Re: OSD Hardware questions
On 06/27/2012 01:10 PM, Jim Schutt wrote: On 06/27/2012 12:48 PM, Stefan Priebe wrote: Am 27.06.2012 20:38, schrieb Jim Schutt: Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. huh i see only 124 threads per OSD even with your settings. FWIW: 2 threads/messenger (reader+writer): 166 clients ~200 OSD data peers (this depends on PG distribution/number) ~200 OSD heartbeat peers (ditto) ~200 OSD peers because I have 12 such servers with 24 OSDs each. -- Jim plus 24 OSD op threads (my tuning) 24 OSD disk threads (my tuning) 6 OSD filestore op threads (my tuning) So, 2*566 + 54 = 1186 threads/OSD Plus, there's various other worker threads, such as the timer, message dispatch threads, monitor/MDS messenger, etc. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. right and SSD. So it might be some timing issues. I guess so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 12:48 PM, Stefan Priebe wrote: Am 27.06.2012 20:38, schrieb Jim Schutt: Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. huh i see only 124 threads per OSD even with your settings. FWIW: 2 threads/messenger (reader+writer): 166 clients ~200 OSD data peers (this depends on PG distribution/number) ~200 OSD heartbeat peers (ditto) plus 24 OSD op threads (my tuning) 24 OSD disk threads (my tuning) 6 OSD filestore op threads (my tuning) So, 2*566 + 54 = 1186 threads/OSD Plus, there's various other worker threads, such as the timer, message dispatch threads, monitor/MDS messenger, etc. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. right and SSD. So it might be some timing issues. I guess so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 11:54 AM, Stefan Priebe wrote: Am 27.06.2012 um 19:23 schrieb "Jim Schutt": On 06/27/2012 09:19 AM, Stefan Priebe wrote: Am 27.06.2012 16:55, schrieb Jim Schutt: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store. Which raid level do you use? No RAID. Each OSD directly accesses a single disk, via a partition for the journal and a partition for the btrfs file store for that OSD. So you have 24 threads x 24 osds 576 threads running? Actually, when my 166-client test is running, "ps -o pid,nlwp,args -C ceph-osd" tells me that I typically have ~1200 threads/OSD. Since NPTL uses a 1:1 threading model, I recently had to increase /proc/sys/kernel/pid_max from the default 32768 to get them all to fit. The NICs are Chelsio T4, but I'm not using any of the TCP stateful offload features for this testing. I don't know if they have ntuple support, but the ethtool version I'm using (2.6.33) doesn't mention it. For kernels I switch back and forth between latest development kernel from Linus's tree, or latest stable kernel, depending on where the kernel development cycle is. I usually switch to the development kernel around -rc4 or so. Crazy that this works for you. Btrfs is crashing to me in 20s while running full speed on ssd. Hmmm. The only other obvious difference, based on what I remember from your other posts, is that you're testing against RBD, right? I've been testing exclusively with the Linux kernel client. ??? -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 09:53 AM, Mark Nelson wrote: I'm still trying to learn how to get perf to tell me more -- Jim Thanks for doing this! I've been wanting to get perf going on our test boxes for ages but haven't had time to get the packages built yet for our gitbuilder kernels. Try generating a call-graph ala: http://lwn.net/Articles/340010/ That looks like exactly what I need. Thanks! I'll post when I've learned more. -- Jim Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
On 06/27/2012 09:19 AM, Stefan Priebe wrote: Am 27.06.2012 16:55, schrieb Jim Schutt: This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store. Which raid level do you use? No RAID. Each OSD directly accesses a single disk, via a partition for the journal and a partition for the btrfs file store for that OSD. I've got my 24 drives spread across three 6 Gb/s SAS HBAs, so I can sustain ~90 MB/s per drive with all drives active, when writing to the outer tracks using dd. I want to rely on Ceph for data protection via replication. At some point I expect to play around with the RAID0 support in btrfs to explore the performance relationship between number of OSDs and size of each OSD, but haven't yet. I'd be very curious to hear how these work for you. My current testing load is streaming writes from 166 linux clients, and the above tunings let me sustain ~2 GB/s on each server (2x replication, so 500 MB/s per server aggregate client bandwidth). 10GBe max speed shoudl be around 1Gbit/s. Do i miss something? Hmmm, not sure. My servers are limited by the bandwidth of the SAS drives and HBAs. So 2 GB/s aggregate disk bandwidth is 1 GB/s for journals and 1 GB/s for data. At 2x replication, that's 500 MB/s client data bandwidth. I have dual-port 10 GbE NICs, and use one port for the cluster and one for the clients. I use jumbo frames because it freed up ~10% CPU cycles over the default config of 1500-byte frames + GRO/GSO/etc on the load I'm currently testing with. Do you have ntuple and lro on or off? Which kernel version do you use and which driver version? Intel cards? # ethtool -k eth2 Offload parameters for eth2: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp-segmentation-offload: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off The NICs are Chelsio T4, but I'm not using any of the TCP stateful offload features for this testing. I don't know if they have ntuple support, but the ethtool version I'm using (2.6.33) doesn't mention it. For kernels I switch back and forth between latest development kernel from Linus's tree, or latest stable kernel, depending on where the kernel development cycle is. I usually switch to the development kernel around -rc4 or so. -- Jim Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Hardware questions
Hi Mark,

On 06/27/2012 07:55 AM, Mark Nelson wrote: For what it's worth, I've got a pair of Dell R515 setup with a single 2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable of about 200MB/s each. Currently I'm topping out at about 600MB/s with rados bench using half of the drives for data and half for journals (at 2x replication). Putting journals on the same drive and doing 10 OSDs on each node is slower. Still working on figuring out why.

Just for fun, try the following tunings to see if they make a difference for you. This is my current best tuning for my hardware, which uses 24 SAS drives/server, and 1 OSD/drive with a journal partition on the outer tracks and btrfs for the data store.

  journal dio = true
  osd op threads = 24
  osd disk threads = 24
  filestore op threads = 6
  filestore queue max ops = 24
  osd client message size cap = 1400
  ms dispatch throttle bytes = 1750

I'd be very curious to hear how these work for you.

My current testing load is streaming writes from 166 linux clients, and the above tunings let me sustain ~2 GB/s on each server (2x replication, so 500 MB/s per server aggregate client bandwidth). I have dual-port 10 GbE NICs, and use one port for the cluster and one for the clients. I use jumbo frames because it freed up ~10% CPU cycles over the default config of 1500-byte frames + GRO/GSO/etc on the load I'm currently testing with.

FWIW these servers are dual-socket Intel 5675 Xeons, so total 12 cores at 3.0 GHz. On the above load I usually see 15-30% idle.

FWIW, "perf top" has this to say about where time is being spent under the above load under normal conditions:

  PerfTop: 19134 irqs/sec  kernel:79.2%  exact: 0.0% [1000Hz cycles], (all, 24 CPUs)

    samples    pcnt  function                     DSO
    _________  ____  ___________________________  ______________________________________________________________
    37656.00   15.3% ceph_crc32c_le               /usr/bin/ceph-osd
    23221.00    9.5% copy_user_generic_string     [kernel.kallsyms]
    16857.00    6.9% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko
    16787.00    6.8% __crc32c_le                  [kernel.kallsyms]

But, sometimes I see this:

  PerfTop: 4930 irqs/sec  kernel:97.8%  exact: 0.0% [1000Hz cycles], (all, 24 CPUs)

    samples    pcnt  function                     DSO
    _________  ____  ___________________________  ______________________________________________________________
    147565.00  45.8% _raw_spin_lock_irqsave       [kernel.kallsyms]
    24427.00    7.6% isolate_freepages_block      [kernel.kallsyms]
    23759.00    7.4% ceph_crc32c_le               /usr/bin/ceph-osd
    16521.00    5.1% copy_user_generic_string     [kernel.kallsyms]
    10549.00    3.3% __crc32c_le                  [kernel.kallsyms]
    8901.00     2.8% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko

When this happens, OSDs cannot process heartbeats in a timely fashion, get wrongly marked down, thrashing ensues, clients stall. I'm still trying to learn how to get perf to tell me more

-- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mkcephfs regression in current master branch
On 05/24/2012 03:13 PM, Sage Weil wrote: Hi Jim, On Thu, 24 May 2012, Jim Schutt wrote: Hi, In my testing I make repeated use of the "manual" mkcephfs sequence described in the man page:

  master#  mkdir /tmp/foo
  master#  mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
  osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
  mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
  master#  mkcephfs --prepare-mon -d /tmp/foo
  monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the "mkcephfs --init-local-daemons osd" phase breaks like this:

  2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists
  os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")
  ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS` is needed to interpret this.
  2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")

I just pushed a fix for this to master.

Great, that fixed it for me. Thanks for the quick response!

BTW, the real change happening with these patches is that --mkfs no longer clobbers existing data. If you want to wipe out an osd and start anew, you need to rm -r (and btrfs sub delete snap_* and current), or re-run mkfs.btrfs.

Ah, OK. It turns out I always run mkfs.btrfs anyway, but this is good to know.

-- Jim

Thanks for the report! sage

FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical "mkcephfs --init-local-daemons osd" looks like this:

  2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
  creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
  creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
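As an aside, the leftovers Sage describes are easy to spot before re-running --mkfs. The sketch below scans an OSD data directory for snap_* subvolumes and a "current" entry; the path is only an example, and nothing like this check exists in mkcephfs itself.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Look for leftovers that would make a fresh --mkfs fail with EEXIST
     * (snap_* subvolumes and "current"), as described above. */
    int main(void)
    {
        const char *osd_data = "/ram/mnt/ceph/data.osd.0";  /* example path */
        DIR *d = opendir(osd_data);
        struct dirent *de;
        int leftovers = 0;

        if (!d) {
            perror(osd_data);
            return 1;
        }
        while ((de = readdir(d)) != NULL) {
            if (strncmp(de->d_name, "snap_", 5) == 0 ||
                strcmp(de->d_name, "current") == 0) {
                printf("leftover: %s/%s\n", osd_data, de->d_name);
                leftovers++;
            }
        }
        closedir(d);

        if (leftovers)
            printf("wipe these (btrfs sub delete / rm -r) or re-run mkfs.btrfs\n");
        return 0;
    }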
mkcephfs regression in current master branch
Hi, In my testing I make repeated use of the "manual" mkcephfs sequence described in the man page:

  master#  mkdir /tmp/foo
  master#  mkcephfs -c /etc/ceph/ceph.conf --prepare-monmap -d /tmp/foo
  osdnode# mkcephfs --init-local-daemons osd -d /tmp/foo
  mdsnode# mkcephfs --init-local-daemons mds -d /tmp/foo
  master#  mkcephfs --prepare-mon -d /tmp/foo
  monnode# mkcephfs --init-local-daemons mon -d /tmp/foo

Using current master branch (commit ca79f45a33f9), the "mkcephfs --init-local-daemons osd" phase breaks like this:

  2012-05-24 13:24:06.325905 7fdcd61d4780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:24:06.326768 7fdcd1ee2700 -1 filestore(/ram/mnt/ceph/data.osd.0) async snap create 'snap_1' transid 0 got (17) File exists
  os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")
  ceph version 0.47.1-157-gb003815 (commit:b003815c222add8bdcf645d9ba4ef7e13f34587e)
  1: (FileStore::sync_entry()+0x34f0) [0x68e190]
  2: (FileStore::SyncThread::entry()+0xd) [0x6a2abd]
  3: (Thread::_entry_func(void*)+0x1b9) [0x803499]
  4: (()+0x77f1) [0x7fdcd58637f1]
  5: (clone()+0x6d) [0x7fdcd4cb4ccd]
  NOTE: a copy of the executable, or `objdump -rdS` is needed to interpret this.
  2012-05-24 13:24:06.327991 7fdcd1ee2700 -1 os/FileStore.cc: In function 'void FileStore::sync_entry()' thread 7fdcd1ee2700 time 2012-05-24 13:24:06.326792
  os/FileStore.cc: 3564: FAILED assert(0 == "async snap ioctl error")

FWIW, with commit 598dea12411 (filestore: mkfs: only create snap_0 if we created current_op_seq) reverted, I am able to create a new filesystem using the above sequence, and a typical "mkcephfs --init-local-daemons osd" looks like this:

  2012-05-24 13:06:25.918663 7fe7ac829780 -1 filestore(/ram/mnt/ceph/data.osd.0) could not find 23c2fcde/osd_superblock/0 in index: (2) No such file or directory
  2012-05-24 13:06:26.301738 7fe7ac829780 -1 created object store /ram/mnt/ceph/data.osd.0 journal /dev/mapper/cs32s01p2 for osd.0 fsid f8cc9fa2-a300-45a1-ae6d-e0c0ef418d0f
  creating private key for osd.0 keyring /mnt/ceph/misc.osd.0/keyring.osd.0
  creating /mnt/ceph/misc.osd.0/keyring.osd.0

Thanks -- Jim
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: [RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure
On 05/12/2012 05:51 PM, Sage Weil wrote: Hey Jim, These both look like reasonable changes. And it's great to see they fix behavior for you. I'm not going to merge them yet, though. We're just kicking off a CRUSH refresh project next week that will include some testing framework to more thoroughly validate the quality of the output, and also take a more holistic look at all of what the algorithm is doing and see what we can improve.

I really didn't expect you'd merge them right away -- I knew you had this CRUSH effort coming, so my goal was to get these to you so you could evaluate them as part of that push.

Most likely these changes will be included, but revving the mapping algorithm is going to be tricky for forward/backward compatibility, and we'd like to get it all in at once. (And/or come up with a better way to deal with mismatched versions...)

Yep, I completely ignored the version compatibility issue - I didn't have any clever ideas on how to handle it.

Also, FWIW I'm running with the patch below on top of the previous two - I think it helps avoid giving up too early in clusters where many OSDs have gone down/out, but I haven't done enough testing on it yet to quantify.

Thanks!

Hey, thanks for taking a look! sage

-- Jim

---
ceph: retry CRUSH map descents from root a little longer before falling back to exhaustive search

The exhaustive search isn't as exhaustive as we'd like if the CRUSH map is several levels deep, so try a few more times to find an "in" device during "spread re-replication around" mode. This makes it less likely we'll give up when the storage cluster has many failed devices.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 698da55..39e9c5d 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -306,7 +306,7 @@ static int crush_choose(const struct crush_map *map,
 	int item = 0;
 	int itemtype;
 	int collide, reject;
-	const unsigned int orig_tries = 5;  /* attempts before we fall back to search */
+	const unsigned int orig_tries = 10; /* attempts before we fall back to search */
 
 	dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d\n",
 		recurse_to_leaf ? "_LEAF" : "", bucket->id, x, outpos, numrep);
@@ -440,7 +440,7 @@ reject:
 			else if (flocal <= in->size + orig_tries)
 				/* exhaustive bucket search */
 				retry_bucket = 1;
-			else if (ftotal < 20)
+			else if (ftotal <= orig_tries + 15)
 				/* then retry descent */
 				retry_descent = 1;
 			else
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/2] ceph: retry CRUSH map descent from root if leaf is failed
When an object is re-replicated after a leaf failure, the remapped replica ends up under the bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves of a bucket but one fail, that remaining leaf holds all the data from its failed peers.

For example, consider the crush rule

  step chooseleaf firstn 0 type <node type>

This rule means that replicas will be chosen in such a manner that each chosen leaf's branch will contain a unique instance of <node type>.

For such ops, the tree descent has two steps: call them the inner and outer descent. If the tree descent down to <node type> is the outer descent, and the descent from <node type> down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, but we want to retry the outer descent. This ensures that re-replication after a leaf failure disperses the re-replicated objects as widely as possible across the storage cluster.

Fix this by causing the inner descent to return immediately on choosing a failed leaf, unless we've fallen back to exhaustive search.

Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to be moved as well. This seems unavoidable but should be relatively rare.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index e5dc950..698da55 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -286,6 +286,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in
  * @param outpos our position in that vector
  * @param firstn true if choosing "first n" items, false if choosing "indep"
  * @param recurse_to_leaf: true if we want one device under each item of given type
+ * @param descend_once true if we should only try one descent before giving up
  * @param out2 second output vector for leaf items (if @a recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -293,7 +294,7 @@ static int crush_choose(const struct crush_map *map,
 			const __u32 *weight,
 			int x, int numrep, int type,
 			int *out, int outpos,
-			int firstn, int recurse_to_leaf,
+			int firstn, int recurse_to_leaf, int descend_once,
 			int *out2)
 {
 	int rep;
@@ -397,6 +398,7 @@ static int crush_choose(const struct crush_map *map,
 							 x, outpos+1, 0,
 							 out2, outpos,
 							 firstn, 0,
+							 ftotal < orig_tries,
 							 NULL) <= outpos)
 					/* didn't get leaf */
 					reject = 1;
@@ -430,7 +432,10 @@ reject:
 				 * descent during that phase so that multiple
 				 * buckets can be exhaustively searched.
 				 */
-				if (ftotal <= orig_tries)
+				if (reject && descend_once)
+					/* let outer call try again */
+					skip_rep = 1;
+				else if (ftotal <= orig_tries)
 					retry_descent = 1;
 				else if (flocal <= in->size + orig_tries)
 					/* exhaustive bucket search */
 					retry_bucket = 1;
@@ -491,6 +496,7 @@ int crush_do_rule(const struct crush_map *map,
 	int i, j;
 	int numrep;
 	int firstn;
+	const int descend_once = 0;
 
 	if ((__u32)ruleno >= map->max_rules) {
 		dprintk(" bad ruleno %d\n", ruleno);
@@ -550,7 +556,7 @@ int crush_do_rule(const struct crush_map *map,
 					       curstep->arg2,
 					       o+osize, j,
 					       firstn,
-					       recurse_to_leaf, c+osize);
+					       recurse_to_leaf, descend_once, c+osize);
 		}
 
 		if (recurse_to_leaf)
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
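For readers who find the inner/outer descent hard to follow in diff form, the following deliberately simplified, self-contained sketch mimics the control flow the patch describes. It is not the real crush_choose(); the toy hierarchy, the "hash", and all names here are stand-ins.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy stand-ins for the real CRUSH structures: 4 hosts, 2 OSDs each. */
    #define NUM_HOSTS     4
    #define OSDS_PER_HOST 2

    static bool osd_down[NUM_HOSTS][OSDS_PER_HOST] = {
        { false, false }, { true, true }, { false, false }, { false, false },
    };

    /* Inner descent: pick one leaf under a host.  With descend_once set, a
     * failed leaf is reported back immediately instead of being retried
     * locally, so the *outer* descent can pick a different host. */
    static int choose_leaf(int host, bool descend_once, int attempt)
    {
        int osd = (host * 31 + attempt) % OSDS_PER_HOST;  /* toy "hash" */

        if (osd_down[host][osd]) {
            if (descend_once)
                return -1;          /* report failure; outer descent retries */
            osd ^= 1;               /* local (exhaustive-style) retry */
            if (osd_down[host][osd])
                return -1;
        }
        return host * OSDS_PER_HOST + osd;
    }

    int main(void)
    {
        const int orig_tries = 5;
        int ftotal, leaf = -1;

        for (ftotal = 0; ftotal < 20; ftotal++) {
            int host = (7 * ftotal + 3) % NUM_HOSTS;      /* outer descent */
            leaf = choose_leaf(host, ftotal < orig_tries, ftotal);
            if (leaf >= 0)
                break;                                    /* got a usable leaf */
        }
        printf("chose osd.%d after %d attempt(s)\n", leaf, ftotal + 1);
        return 0;
    }

The point of the sketch is only the flag: for the first orig_tries attempts a failed leaf bounces straight back to the outer loop, which is what spreads re-replicated objects across buckets instead of piling them onto the failed leaf's sibling.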
[RFC PATCH 1/2] ceph: retry CRUSH map descent before retrying bucket
For the first few rejections or collisions, we'll retry the descent to keep objects spread across the cluster. After that, we'll fall back to exhaustive search of the bucket to avoid trying forever in the event a bucket has only a few "in" items and the hash doesn't do a good job of finding them.

Signed-off-by: Jim Schutt
---
 src/crush/mapper.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 8857577..e5dc950 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -350,8 +350,7 @@ static int crush_choose(const struct crush_map *map,
 					reject = 1;
 					goto reject;
 				}
-				if (flocal >= (in->size>>1) &&
-				    flocal > orig_tries)
+				if (ftotal >= orig_tries) /* exhaustive bucket search */
 					item = bucket_perm_choose(in, x, r);
 				else
 					item = crush_bucket_choose(in, x, r);
@@ -420,10 +419,19 @@ reject:
 		if (reject || collide) {
 			ftotal++;
 			flocal++;
-
-			if (collide && flocal < 3)
-				/* retry locally a few times */
-				retry_bucket = 1;
+			/*
+			 * For the first couple rejections or collisions,
+			 * we'll retry the descent to keep objects spread
+			 * across the cluster.  After that, we'll fall back
+			 * to exhaustive search of buckets to avoid trying
+			 * forever in the event a bucket has only a few
+			 * "in" items and the hash doesn't do a good job
+			 * of finding them.  Note that we need to retry
+			 * descent during that phase so that multiple
+			 * buckets can be exhaustively searched.
+			 */
+			if (ftotal <= orig_tries)
+				retry_descent = 1;
 			else if (flocal <= in->size + orig_tries)
 				/* exhaustive bucket search */
 				retry_bucket = 1;
-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
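As a reading aid, the rejection-handling ladder introduced by this patch can be written out as a stand-alone function. The thresholds mirror the ones in the hunk above; bucket_size stands in for in->size, and orig_tries = 5 is just the value used in mapper.c at this point in the series.

    #include <stdio.h>

    enum retry_action { RETRY_DESCENT, RETRY_BUCKET, GIVE_UP };

    /* Mirror of the decision made after a rejection/collision: first retry
     * the whole descent a few times to spread objects out, then fall back
     * to exhaustively searching the current bucket, then give up. */
    static enum retry_action after_reject(int ftotal, int flocal, int bucket_size)
    {
        const int orig_tries = 5;

        if (ftotal <= orig_tries)
            return RETRY_DESCENT;
        if (flocal <= bucket_size + orig_tries)
            return RETRY_BUCKET;
        return GIVE_UP;
    }

    int main(void)
    {
        const char *names[] = { "retry descent", "retry bucket", "give up" };
        int ftotal, flocal = 0;

        for (ftotal = 1; ftotal <= 12; ftotal++) {
            flocal++;
            printf("ftotal=%2d flocal=%2d -> %s\n", ftotal, flocal,
                   names[after_reject(ftotal, flocal, 4)]);
        }
        return 0;
    }

Run against a toy bucket of size 4, the output shows the three phases in order: descent retries for the first five failures, exhaustive bucket search for the next few, then giving up.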
[RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure
Hi Sage, I've been trying to solve the issue mentioned in tracker #2047, which I think is the same as I described in http://www.spinics.net/lists/ceph-devel/msg05824.html The attached patches seem to fix it for me. I also attempted to address the local search issue you mentioned in #2047.

I'm testing this using a cluster with 3 rows, 2 racks/row, 2 hosts/rack, 4 osds/host. I tested against a CRUSH map with the rules:

  step take root
  step chooseleaf firstn 0 type rack
  step emit

I'm in the process of testing this as follows: I wrote some data to the cluster, then started shutting down OSDs using "init-ceph stop osd.n". For the first rack's worth, I shut OSDs down sequentially. I waited for recovery to complete each time before stopping the next OSD. For the next rack I shut down the first 3 OSDs on a host at the same time, waited for recovery to complete, then shut down the last OSD on that host. For the next racks, I shut down all the OSDs on the hosts in the rack at the same time. Right now I'm waiting for recovery to complete after shutting down the third rack.

Once recovery completed after each phase so far, there were no degraded objects. So, this is looking fairly solid to me so far. What do you think?

Thanks -- Jim

Jim Schutt (2):
  ceph: retry CRUSH map descent before retrying bucket
  ceph: retry CRUSH map descent from root if leaf is failed

 src/crush/mapper.c |   30 ++++++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Ceph tracepoints
Hi Alex, I ran across tracker #2374 today - I've been carrying these two tracepoint patches for a while. Perhaps you'll find them useful.

Jim Schutt (2):
  ceph: add tracepoints for message submission on read/write requests
  ceph: add tracepoints for message send queueing and completion, reply handling

 fs/ceph/addr.c              |    8 +++
 fs/ceph/file.c              |    6 ++
 include/trace/events/ceph.h |  144 +++
 net/ceph/messenger.c        |    9 +++-
 net/ceph/osd_client.c       |    8 +++
 5 files changed, 174 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/ceph.h

-- 
1.7.8.2
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
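The patches themselves are not reproduced in this cover letter. As a rough sketch of what a tracepoint in include/trace/events/ceph.h looks like when declared with the kernel's TRACE_EVENT() machinery, consider the skeleton below; the event name and fields are invented for illustration and are not taken from these patches.

    /* include/trace/events/ceph.h -- illustrative skeleton only;
     * u64/u32 come from the usual kernel type headers. */
    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM ceph

    #if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_CEPH_H

    #include <linux/tracepoint.h>

    TRACE_EVENT(ceph_msg_send,              /* hypothetical event name */
        TP_PROTO(u64 tid, u32 data_len),

        TP_ARGS(tid, data_len),

        TP_STRUCT__entry(
            __field(u64, tid)
            __field(u32, data_len)
        ),

        TP_fast_assign(
            __entry->tid      = tid;
            __entry->data_len = data_len;
        ),

        TP_printk("tid=%llu data_len=%u",
                  (unsigned long long)__entry->tid, __entry->data_len)
    );

    #endif /* _TRACE_CEPH_H */

    /* This must come outside the include guard, as in other trace headers. */
    #include <trace/define_trace.h>

A callsite would then invoke trace_ceph_msg_send(tid, len) at the point of interest, and the events show up under /sys/kernel/debug/tracing/events/ceph/ once the header is wired into the build.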