Re: Writing to RBD image while its snapshot is being created causes I/O errors
On Friday 14 of June 2013 08:56:55 Sage Weil wrote:
> On Fri, 14 Jun 2013, Karol Jurak wrote:
> > I noticed that writing to an RBD image using the kernel driver while its
> > snapshot is being created causes I/O errors, and the filesystem (reiserfs)
> > eventually aborts and remounts itself in read-only mode.
>
> This is definitely a bug; you should be able to create a snapshot at any
> time. After a rollback, it should look (to the fs) like a crash or power
> cycle.
>
> How easy is this to reproduce? Does it happen every time?

I can reproduce it in the following way:

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024

and in another shell while dd is running:

# rbd snap create test@snap1

After 2 or 3 seconds dmesg shows I/O errors:

[429532.259910] end_request: I/O error, dev rbd1, sector 1384448
[429532.272554] end_request: I/O error, dev rbd1, sector 872
[429532.275556] REISERFS abort (device rbd1): Journal write error in flush_commit_list

and dd fails:

dd: writing `/mnt/test/a': Cannot allocate memory
590+0 records in
589+0 records out
618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s

This happens every time I repeat the test.

-- 
Karol Jurak
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] rbd: silence GCC warnings
Building rbd.o triggers two GCC warnings:

    drivers/block/rbd.c: In function 'rbd_img_request_fill':
    drivers/block/rbd.c:1272:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
    drivers/block/rbd.c:2170:14: note: 'bio_list' was declared here
    drivers/block/rbd.c:2231:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]

Apparently GCC has trouble determining that bio_list is unused if type is
OBJ_REQUEST_PAGES and, conversely, that pages will be unused if type is
OBJ_REQUEST_BIO. Add harmless initializations to NULL to help GCC.

Signed-off-by: Paul Bolle <pebo...@tiscali.nl>
---
0) Compile tested only.

1) These warnings were introduced in v3.10-rc1, apparently through commit
f1a4739f33 ("rbd: support page array image requests").

2) Note that rbd_assert(type == OBJ_REQUEST_PAGES); seems redundant. I see
no way that this assertion could ever be false.

 drivers/block/rbd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3063452..b8a58178 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2185,9 +2185,11 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 	if (type == OBJ_REQUEST_BIO) {
 		bio_list = data_desc;
 		rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
+		pages = NULL;
 	} else {
 		rbd_assert(type == OBJ_REQUEST_PAGES);
 		pages = data_desc;
+		bio_list = NULL;
 	}
 
 	while (resid) {
-- 
1.8.1.4
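The pattern the patch silences can be reproduced in a minimal standalone sketch (hypothetical simplified types, not the actual rbd code): the two branches are mutually exclusive, but GCC cannot always prove that the pointer left unset on one path is never read afterwards, so it emits -Wmaybe-uninitialized. Initializing both pointers to NULL is a harmless way to quiet it.

```c
#include <stddef.h>

enum obj_request_type { OBJ_REQUEST_BIO, OBJ_REQUEST_PAGES };

/* Illustration only: mirrors the shape of rbd_img_request_fill(),
 * where exactly one of two pointers is set depending on 'type'. */
static const char *classify(enum obj_request_type type, void *data_desc)
{
	void *bio_list = NULL;	/* harmless init: silences the warning */
	void *pages = NULL;	/* harmless init: silences the warning */

	if (type == OBJ_REQUEST_BIO)
		bio_list = data_desc;
	else
		pages = data_desc;

	/* Only the pointer matching 'type' is ever non-NULL here. */
	return bio_list ? "bio" : (pages ? "pages" : "none");
}
```

Without the two NULL initializations, building this with `gcc -O2 -Wmaybe-uninitialized` may warn even though the logic is sound, which is exactly the situation the patch describes.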
[PATCH 0/8] misc fixes for mds
From: Yan, Zheng <zheng.z@intel.com>

These patches are also in:
  git://github.com/ukernel/ceph.git wip-mds

Regards
Yan, Zheng
[PATCH 4/8] mds: try purging stray inode after storing backtrace
From: Yan, Zheng <zheng.z@intel.com>

The inode is auth pinned and can't be purged while its backtrace is being
stored, so we should try purging the stray inode after the backtrace has
been stored.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/CInode.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 0e14293..4a592bc 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1069,11 +1069,12 @@ void CInode::_stored_backtrace(version_t v, Context *fin)
 {
   dout(10) << "_stored_backtrace" << dendl;
 
+  auth_unpin(this);
   if (v == inode.backtrace_version)
     clear_dirty_parent();
-  auth_unpin(this);
   if (fin)
     fin->complete(0);
+  mdcache->maybe_eval_stray(this);
 }
 
 void CInode::_mark_dirty_parent(LogSegment *ls, bool dirty_pool)
-- 
1.8.1.4
[PATCH 5/8] mds: fix cross-authority rename race
From: Yan, Zheng <zheng.z@intel.com>

When doing a cross-authority rename, we need to make sure bystanders have
received all messages sent by the inode's original auth MDS before changing
the inode's authority. Otherwise lock messages sent by the original and new
auth MDS can arrive at bystanders out of order. The fix is: the inode's
original auth MDS sends notify messages to the bystanders and performs the
slave rename only after receiving all bystanders' notify acks.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc              | 31 +++++++++++++++++-
 src/mds/Server.cc               | 51 +++++++++++++++++++++++++++--
 src/mds/Server.h                |  1 +
 src/messages/MMDSSlaveRequest.h |  6 ++++
 4 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index b9b154d..2b0029f 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2651,6 +2651,15 @@ void MDCache::handle_mds_failure(int who)
     if (p->second->slave_to_mds == who) {
       if (p->second->slave_did_prepare()) {
 	dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
+	if (!p->second->more()->waiting_on_slave.empty()) {
+	  assert(p->second->more()->srcdn_auth_mds == mds->get_nodeid());
+	  // will rollback, no need to wait
+	  if (p->second->slave_request) {
+	    p->second->slave_request->put();
+	    p->second->slave_request = 0;
+	  }
+	  p->second->more()->waiting_on_slave.clear();
+	}
       } else {
 	dout(10) << " slave request " << *p->second << " has no prepare, finishing up" << dendl;
 	if (p->second->slave_request)
@@ -2660,12 +2669,22 @@ void MDCache::handle_mds_failure(int who)
       }
     }
 
-    if (p->second->is_slave() &&
-	p->second->slave_did_prepare() && p->second->more()->srcdn_auth_mds == who &&
-	mds->mdsmap->is_clientreplay_or_active_or_stopping(p->second->slave_to_mds)) {
-      // rename srcdn's auth mds failed, resolve even I'm a survivor.
-      dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
-      add_ambiguous_slave_update(p->first, p->second->slave_to_mds);
+    if (p->second->is_slave() && p->second->slave_did_prepare()) {
+      if (p->second->more()->waiting_on_slave.count(who)) {
+	assert(p->second->more()->srcdn_auth_mds == mds->get_nodeid());
+	dout(10) << " slave request " << *p->second << " no longer need rename notify ack from mds."
+		 << who << dendl;
+	p->second->more()->waiting_on_slave.erase(who);
+	if (p->second->more()->waiting_on_slave.empty())
+	  mds->queue_waiter(new C_MDS_RetryRequest(this, p->second));
+      }
+
+      if (p->second->more()->srcdn_auth_mds == who &&
+	  mds->mdsmap->is_clientreplay_or_active_or_stopping(p->second->slave_to_mds)) {
+	// rename srcdn's auth mds failed, resolve even I'm a survivor.
+	dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
+	add_ambiguous_slave_update(p->first, p->second->slave_to_mds);
+      }
     }
 
     // failed node is slave?
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index c3162e7..00bf018 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -1280,6 +1280,16 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
   if (m->is_reply())
     return handle_slave_request_reply(m);
 
+  // the purpose of rename notify is enforcing causal message ordering. making sure
+  // bystanders have received all messages from rename srcdn's auth MDS.
+  if (m->get_op() == MMDSSlaveRequest::OP_RENAMENOTIFY) {
+    MMDSSlaveRequest *reply = new MMDSSlaveRequest(m->get_reqid(), m->get_attempt(),
+						   MMDSSlaveRequest::OP_RENAMENOTIFYACK);
+    mds->send_message(reply, m->get_connection());
+    m->put();
+    return;
+  }
+
   CDentry *straydn = NULL;
   if (m->stray.length() > 0) {
     straydn = mdcache->add_replica_stray(m->stray, from);
@@ -1432,6 +1442,10 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m)
     handle_slave_rename_prep_ack(mdr, m);
     break;
 
+  case MMDSSlaveRequest::OP_RENAMENOTIFYACK:
+    handle_slave_rename_notify_ack(mdr, m);
+    break;
+
   default:
     assert(0);
   }
@@ -6560,6 +6574,9 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
 
   // am i srcdn auth?
   if (srcdn->is_auth()) {
+    set<int> srcdnrep;
+    srcdn->list_replicas(srcdnrep);
+
     bool reply_witness = false;
     if (srcdnl->is_primary() && !srcdnl->get_inode()->state_test(CInode::STATE_AMBIGUOUSAUTH)) {
       // freeze?
@@ -6594,12 +6611,19 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
       if (mdr->slave_request->witnesses.size() > 1) {
 	dout(10) << " set srci ambiguous auth; providing srcdn replica list" << dendl;
 	reply_witness = true;
+	for (set<int>::iterator p = srcdnrep.begin(); p != srcdnrep.end(); ++p) {
+	  if (*p == mdr->slave_to_mds ||
+	      !mds->mdsmap->is_clientreplay_or_active_or_stopping(*p))
+	    continue;
[PATCH 8/8] mds: fix remote wrlock rejoin
From: Yan, Zheng <zheng.z@intel.com>

The remote wrlock's target is not always the inode's auth MDS.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc | 40 ++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index f1ebedf..2b127b5 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4523,25 +4523,29 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 	mdr->locks.insert(lock);
       }
     }
-    // wrlock(s)?
-    if (strong->wrlocked_inodes.count(in->vino())) {
-      for (map<int, list<MMDSCacheRejoin::slave_reqid> >::iterator q = strong->wrlocked_inodes[in->vino()].begin();
-	   q != strong->wrlocked_inodes[in->vino()].end();
-	   ++q) {
-	SimpleLock *lock = in->get_lock(q->first);
-	for (list<MMDSCacheRejoin::slave_reqid>::iterator r = q->second.begin();
-	     r != q->second.end();
-	     ++r) {
-	  dout(10) << " inode wrlock by " << *r << " on " << *lock << " on " << *in << dendl;
-	  MDRequest *mdr = request_get(r->reqid);  // should have this from auth_pin above.
+  }
+  // wrlock(s)?
+  for (map<vinodeno_t, map<int, list<MMDSCacheRejoin::slave_reqid> > >::iterator p = strong->wrlocked_inodes.begin();
+       p != strong->wrlocked_inodes.end();
+       ++p) {
+    CInode *in = get_inode(p->first);
+    for (map<int, list<MMDSCacheRejoin::slave_reqid> >::iterator q = p->second.begin();
+	 q != p->second.end();
+	 ++q) {
+      SimpleLock *lock = in->get_lock(q->first);
+      for (list<MMDSCacheRejoin::slave_reqid>::iterator r = q->second.begin();
+	   r != q->second.end();
+	   ++r) {
+	dout(10) << " inode wrlock by " << *r << " on " << *lock << " on " << *in << dendl;
+	MDRequest *mdr = request_get(r->reqid);  // should have this from auth_pin above.
+	if (in->is_auth())
 	  assert(mdr->is_auth_pinned(in));
-	  lock->set_state(LOCK_MIX);
-	  if (lock == &in->filelock)
-	    in->loner_cap = -1;
-	  lock->get_wrlock(true);
-	  mdr->wrlocks.insert(lock);
-	  mdr->locks.insert(lock);
-	}
+	lock->set_state(LOCK_MIX);
+	if (lock == &in->filelock)
+	  in->loner_cap = -1;
+	lock->get_wrlock(true);
+	mdr->wrlocks.insert(lock);
+	mdr->locks.insert(lock);
       }
     }
   }
-- 
1.8.1.4
[PATCH 7/8] mds: fix race between scatter gather and dirfrag export
From: Yan, Zheng <zheng.z@intel.com>

If we gather dirty scatter lock state while the corresponding dirfrag is
being exported, we may receive different dirfrag states from two MDSes and
need to find out which one is the newest. The solution is to add a new
variable, a migrate seq, to the dirfrag, and increase it by one each time
the dirfrag's auth MDS changes. When gathering dirty scatter lock state,
use the migrate seq to find the newest dirfrag state.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/CDir.cc    |  4 ++++
 src/mds/CDir.h     |  4 +++-
 src/mds/CInode.cc  | 18 ++++++++++++++++++
 src/mds/MDCache.cc | 13 +++++++++++--
 4 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 8c83eba..2b991d7 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -154,6 +154,7 @@ ostream& CDir::print_db_line_prefix(ostream& out)
 // CDir
 
 CDir::CDir(CInode *in, frag_t fg, MDCache *mdcache, bool auth) :
+  mseq(0),
   dirty_rstat_inodes(member_offset(CInode, dirty_rstat_item)),
   item_dirty(this), item_new(this),
   pop_me(ceph_clock_now(g_ceph_context)),
@@ -2121,6 +2122,8 @@ void CDir::_committed(version_t v)
 void CDir::encode_export(bufferlist& bl)
 {
   assert(!is_projected());
+  ceph_seq_t seq = mseq + 1;
+  ::encode(seq, bl);
   ::encode(first, bl);
   ::encode(fnode, bl);
   ::encode(dirty_old_rstat, bl);
@@ -2150,6 +2153,7 @@ void CDir::finish_export(utime_t now)
 
 void CDir::decode_import(bufferlist::iterator& blp, utime_t now, LogSegment *ls)
 {
+  ::decode(mseq, blp);
   ::decode(first, blp);
   ::decode(fnode, blp);
   ::decode(dirty_old_rstat, blp);
diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index 87c79c2..11f4a76 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -170,6 +170,7 @@ public:
 
   fnode_t fnode;
   snapid_t first;
+  ceph_seq_t mseq;  // migrate sequence
 
   map<snapid_t,old_rstat_t> dirty_old_rstat;  // [value.first,key]
 
   // my inodes with dirty rstat data
@@ -547,7 +548,8 @@ public:
   // -- import/export --
   void encode_export(bufferlist& bl);
   void finish_export(utime_t now);
-  void abort_export() {
+  void abort_export() {
+    mseq += 2;
     put(PIN_TEMPEXPORTING);
   }
   void decode_import(bufferlist::iterator& blp, utime_t now, LogSegment *ls);
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 8936acd..c1ce8a1 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1222,6 +1222,7 @@ void CInode::encode_lock_state(int type, bufferlist& bl)
       dout(20) << fg << " fragstat " << pf->fragstat << dendl;
       dout(20) << fg << " accounted_fragstat " << pf->accounted_fragstat << dendl;
       ::encode(fg, tmp);
+      ::encode(dir->mseq, tmp);
       ::encode(dir->first, tmp);
       ::encode(pf->fragstat, tmp);
       ::encode(pf->accounted_fragstat, tmp);
@@ -1255,6 +1256,7 @@ void CInode::encode_lock_state(int type, bufferlist& bl)
       dout(10) << fg << " " << pf->rstat << dendl;
       dout(10) << fg << " " << dir->dirty_old_rstat << dendl;
       ::encode(fg, tmp);
+      ::encode(dir->mseq, tmp);
       ::encode(dir->first, tmp);
       ::encode(pf->rstat, tmp);
       ::encode(pf->accounted_rstat, tmp);
@@ -1404,10 +1406,12 @@ void CInode::decode_lock_state(int type, bufferlist& bl)
       dout(10) << " ...got " << n << " fragstats on " << *this << dendl;
       while (n--) {
 	frag_t fg;
+	ceph_seq_t mseq;
 	snapid_t fgfirst;
 	frag_info_t fragstat;
 	frag_info_t accounted_fragstat;
 	::decode(fg, p);
+	::decode(mseq, p);
 	::decode(fgfirst, p);
 	::decode(fragstat, p);
 	::decode(accounted_fragstat, p);
@@ -1420,6 +1424,12 @@ void CInode::decode_lock_state(int type, bufferlist& bl)
 	  assert(dir);  // i am auth; i had better have this dir open
 	  dout(10) << fg << " first " << dir->first << " -> " << fgfirst << " on " << *dir << dendl;
+	  if (dir->fnode.fragstat.version == inode.dirstat.version &&
+	      ceph_seq_cmp(mseq, dir->mseq) < 0) {
+	    dout(10) << " mseq " << mseq << " < " << dir->mseq << ", ignoring" << dendl;
+	    continue;
+	  }
+	  dir->mseq = mseq;
 	  dir->first = fgfirst;
 	  dir->fnode.fragstat = fragstat;
 	  dir->fnode.accounted_fragstat = accounted_fragstat;
@@ -1462,11 +1472,13 @@ void CInode::decode_lock_state(int type, bufferlist& bl)
       ::decode(n, p);
       while (n--) {
 	frag_t fg;
+	ceph_seq_t mseq;
 	snapid_t fgfirst;
 	nest_info_t rstat;
 	nest_info_t accounted_rstat;
 	map<snapid_t,old_rstat_t> dirty_old_rstat;
 	::decode(fg, p);
+	::decode(mseq, p);
 	::decode(fgfirst, p);
 	::decode(rstat, p);
 	::decode(accounted_rstat, p);
@@ -1481,6 +1493,12 @@ void CInode::decode_lock_state(int type, bufferlist& bl)
 	  assert(dir);  // i am auth; i had better have this dir open
 	  dout(10) << fg << " first " << dir->first << " -> " << fgfirst << " on " << *dir << dendl;
+	  if (dir->fnode.rstat.version ==
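The "keep the newest state by migrate seq" idea can be sketched standalone. The helper below assumes `ceph_seq_cmp()`-style semantics (a signed, wraparound-safe comparison of unsigned sequence numbers); the struct is a hypothetical stand-in for the dirfrag state, not Ceph's actual types.

```c
/* Wraparound-safe sequence comparison: <0 if a is older than b,
 * >0 if newer, 0 if equal (assumed semantics of ceph_seq_cmp). */
static int seq_cmp(unsigned int a, unsigned int b)
{
	return (int)(a - b);
}

/* Simplified stand-in for a dirfrag's scatter-gathered state. */
struct frag_state {
	unsigned int mseq;	/* migrate sequence */
	long long size;		/* stand-in for fragstat */
};

/* Apply an incoming state only if it is at least as new as what we
 * already have; returns 1 if applied, 0 if ignored as stale. */
static int frag_state_apply(struct frag_state *cur, unsigned int mseq,
			    long long size)
{
	if (seq_cmp(mseq, cur->mseq) < 0)
		return 0;	/* stale, ignore */
	cur->mseq = mseq;
	cur->size = size;
	return 1;
}
```

The signed-difference trick makes the comparison correct even when the sequence counter wraps past `UINT_MAX`, which is why a plain `mseq < cur->mseq` test would not be safe.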
[PATCH 1/8] mds: don't update migrate_seq when importing non-auth cap
From: Yan, Zheng <zheng.z@intel.com>

We use migrate_seq to distinguish the old and new auth MDS, so we should
not change migrate_seq when importing a non-auth cap.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/Capability.h | 5 +++--
 src/mds/Migrator.cc  | 8 ++++----
 src/mds/Migrator.h   | 3 ++-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/src/mds/Capability.h b/src/mds/Capability.h
index 54d2312..fdecb90 100644
--- a/src/mds/Capability.h
+++ b/src/mds/Capability.h
@@ -273,7 +273,7 @@ public:
     return Export(_wanted, issued(), pending(), client_follows, mseq+1, last_issue_stamp);
   }
   void rejoin_import() { mseq++; }
-  void merge(Export& other) {
+  void merge(Export& other, bool auth_cap) {
     // issued + pending
     int newpending = other.pending | pending();
     if (other.issued & ~newpending)
@@ -286,7 +286,8 @@ public:
 
     // wanted
     _wanted = _wanted | other.wanted;
-    mseq = other.mseq;
+    if (auth_cap)
+      mseq = other.mseq;
   }
   void merge(int otherwanted, int otherissued) {
     // issued + pending
diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 6ea28c9..0647448 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2223,7 +2223,7 @@ void Migrator::import_logged_start(dirfrag_t df, CDir *dir, int from,
   for (map<CInode*, map<client_t,Capability::Export> >::iterator p = import_caps[dir].begin();
        p != import_caps[dir].end();
        ++p) {
-    finish_import_inode_caps(p->first, from, p->second);
+    finish_import_inode_caps(p->first, true, p->second);
   }
 
   // send notify's etc.
@@ -2398,7 +2398,7 @@ void Migrator::decode_import_inode_caps(CInode *in,
   }
 }
 
-void Migrator::finish_import_inode_caps(CInode *in, int from,
+void Migrator::finish_import_inode_caps(CInode *in, bool auth_cap,
 					map<client_t,Capability::Export>& cap_map)
 {
   for (map<client_t,Capability::Export>::iterator it = cap_map.begin();
@@ -2412,7 +2412,7 @@ void Migrator::finish_import_inode_caps(CInode *in, int from,
     if (!cap) {
       cap = in->add_client_cap(it->first, session);
     }
-    cap->merge(it->second);
+    cap->merge(it->second, auth_cap);
 
     mds->mdcache->do_cap_import(session, in, cap);
   }
@@ -2688,7 +2688,7 @@ void Migrator::logged_import_caps(CInode *in,
   mds->server->finish_force_open_sessions(client_map, sseqmap);
 
   assert(cap_imports.count(in));
-  finish_import_inode_caps(in, from, cap_imports[in]);
+  finish_import_inode_caps(in, false, cap_imports[in]);
   mds->locker->eval(in, CEPH_CAP_LOCKS, true);
 
   mds->send_message_mds(new MExportCapsAck(in->ino()), from);
diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
index 70b59bc..afe2e6c 100644
--- a/src/mds/Migrator.h
+++ b/src/mds/Migrator.h
@@ -256,7 +256,8 @@ public:
   void decode_import_inode_caps(CInode *in, bufferlist::iterator& blp,
 				map<CInode*, map<client_t,Capability::Export> >& cap_imports);
-  void finish_import_inode_caps(CInode *in, int from, map<client_t,Capability::Export>& cap_map);
+  void finish_import_inode_caps(CInode *in, bool auth_cap,
+				map<client_t,Capability::Export>& cap_map);
   int decode_import_dir(bufferlist::iterator& blp,
 			int oldauth,
 			CDir *import_root,
-- 
1.8.1.4
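The auth-gated merge can be illustrated with a small standalone sketch. These structs are hypothetical simplified stand-ins for `Capability` and `Capability::Export`, not the Ceph classes; the point is that issued/pending/wanted bits always merge, while `mseq` only advances when the cap comes from the auth MDS.

```c
/* Simplified stand-in for Capability::Export (illustration only). */
struct cap_export {
	int wanted;
	int issued;
	int pending;
	unsigned int mseq;	/* migrate sequence from the exporting MDS */
};

/* Simplified stand-in for the client capability being merged into. */
struct capability {
	int wanted;
	int issued;
	int pending;
	unsigned int mseq;
};

/* Merge an exported cap into 'cap'; only an auth cap may update the
 * migrate sequence, mirroring the idea of the patch above. */
static void cap_merge(struct capability *cap, const struct cap_export *other,
		      int auth_cap)
{
	cap->pending |= other->pending;
	cap->issued |= other->issued;
	cap->wanted |= other->wanted;
	if (auth_cap)
		cap->mseq = other->mseq;
}
```

With this split, importing a non-auth cap can no longer make the client believe it has already heard from the new auth MDS.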
[PATCH 2/8] mds: fix frozen check in Server::try_open_auth_dirfrag()
From: Yan, Zheng <zheng.z@intel.com>

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/Server.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 253c56d..c3162e7 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -2204,7 +2204,7 @@ CDir* Server::try_open_auth_dirfrag(CInode *diri, frag_t fg, MDRequest *mdr)
   }
 
   // not open and inode frozen?
-  if (!dir && diri->is_frozen_dir()) {
+  if (!dir && diri->is_frozen()) {
     dout(10) << "try_open_auth_dirfrag: dir inode is frozen, waiting " << *diri << dendl;
     assert(diri->get_parent_dir());
     diri->get_parent_dir()->add_waiter(CDir::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
-- 
1.8.1.4
[PATCH 6/8] mds: don't journal bare dirfrag
From: Yan, Zheng <zheng.z@intel.com>

Don't journal a bare dirfrag when starting scatter. Also add debug code to
catch bare dirfrag modification.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/CDir.cc   | 1 +
 src/mds/CInode.cc | 2 ++
 src/mds/Server.cc | 5 +++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 211cec0..8c83eba 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1211,6 +1211,7 @@ void CDir::finish_waiting(uint64_t mask, int result)
 
 fnode_t *CDir::project_fnode()
 {
+  assert(get_version() != 0);
   fnode_t *p = new fnode_t;
   *p = *get_projected_fnode();
   projected_fnode.push_back(p);
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 4a592bc..8936acd 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1625,6 +1625,8 @@ void CInode::finish_scatter_update(ScatterLock *lock, CDir *dir,
     if (dir->is_frozen()) {
       dout(10) << "finish_scatter_update " << fg << " frozen, marking " << *lock << " stale " << *dir << dendl;
+    } else if (dir->get_version() == 0) {
+      dout(10) << "finish_scatter_update " << fg << " not loaded, marking " << *lock << " stale " << *dir << dendl;
     } else {
       if (dir_accounted_version != inode_version) {
 	dout(10) << "finish_scatter_update " << fg << " journaling accounted scatterstat update v" << inode_version << dendl;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 00bf018..1d16d04 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -4010,7 +4010,8 @@ public:
     if (newi->inode.is_dir()) {
       CDir *dir = newi->get_dirfrag(frag_t());
       assert(dir);
-      dir->mark_dirty(1, mdr->ls);
+      dir->fnode.version--;
+      dir->mark_dirty(dir->fnode.version + 1, mdr->ls);
       dir->mark_new(mdr->ls);
     }
 
@@ -4169,7 +4170,7 @@ void Server::handle_client_mkdir(MDRequest *mdr)
   // ...and that new dir is empty.
   CDir *newdir = newi->get_or_open_dirfrag(mds->mdcache, frag_t());
   newdir->mark_complete();
-  newdir->pre_dirty();
+  newdir->fnode.version = newdir->pre_dirty();
 
   // prepare finisher
   mdr->ls = mdlog->get_current_segment();
-- 
1.8.1.4
[PATCH 3/8] mds: handle undefined dirfrags when opening inode
From: Yan, Zheng <zheng.z@intel.com>

When the MDS is in the rejoin stage, cache rejoin messages can add
undefined inodes and dirfrags to the cache. These undefined objects can
affect lookup-by-ino processes. If an undefined dirfrag is encountered, we
should fetch it from disk.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
 src/mds/MDCache.cc | 37 +++++++++++++++++++++++++++++++++----
 src/mds/MDCache.h  |  1 +
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 2b7ad71..b9b154d 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -8015,6 +8015,29 @@ void MDCache::_open_ino_traverse_dir(inodeno_t ino, open_ino_info_t& info, int r
   do_open_ino(ino, info, ret);
 }
 
+void MDCache::_open_ino_fetch_dir(inodeno_t ino, MMDSOpenIno *m, CDir *dir)
+{
+  if (dir->state_test(CDir::STATE_REJOINUNDEF) && dir->get_frag() == frag_t()) {
+    rejoin_undef_dirfrags.erase(dir);
+    dir->state_clear(CDir::STATE_REJOINUNDEF);
+
+    CInode *diri = dir->get_inode();
+    diri->force_dirfrags();
+    list<CDir*> ls;
+    diri->get_dirfrags(ls);
+
+    C_GatherBuilder gather(g_ceph_context, _open_ino_get_waiter(ino, m));
+    for (list<CDir*>::iterator p = ls.begin(); p != ls.end(); ++p) {
+      rejoin_undef_dirfrags.insert(*p);
+      (*p)->state_set(CDir::STATE_REJOINUNDEF);
+      (*p)->fetch(gather.new_sub());
+    }
+    assert(gather.has_subs());
+    gather.activate();
+  } else
+    dir->fetch(_open_ino_get_waiter(ino, m));
+}
+
 int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
 				   vector<inode_backpointer_t>& ancestors,
 				   bool discover, bool want_xlocked, int *hint)
@@ -8032,8 +8055,14 @@ int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
       continue;
     }
 
-    if (diri->state_test(CInode::STATE_REJOINUNDEF))
-      continue;
+    if (diri->state_test(CInode::STATE_REJOINUNDEF)) {
+      CDir *dir = diri->get_parent_dir();
+      while (dir->state_test(CDir::STATE_REJOINUNDEF) &&
+	     dir->get_inode()->state_test(CInode::STATE_REJOINUNDEF))
+	dir = dir->get_inode()->get_parent_dir();
+      _open_ino_fetch_dir(ino, m, dir);
+      return 1;
+    }
 
     if (!diri->is_dir()) {
       dout(10) << " " << *diri << " is not dir" << dendl;
@@ -8067,14 +8096,14 @@ int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
       if (dnl && dnl->is_primary() &&
 	  dnl->get_inode()->state_test(CInode::STATE_REJOINUNDEF)) {
 	dout(10) << " fetching undef " << *dnl->get_inode() << dendl;
-	dir->fetch(_open_ino_get_waiter(ino, m));
+	_open_ino_fetch_dir(ino, m, dir);
 	return 1;
       }
 
       if (!dnl && !dir->is_complete() &&
 	  (!dir->has_bloom() || dir->is_in_bloom(name))) {
 	dout(10) << " fetching incomplete " << *dir << dendl;
-	dir->fetch(_open_ino_get_waiter(ino, m));
+	_open_ino_fetch_dir(ino, m, dir);
 	return 1;
       }
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index 3da8a36..36a322c 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -790,6 +790,7 @@ protected:
   void _open_ino_backtrace_fetched(inodeno_t ino, bufferlist& bl, int err);
   void _open_ino_parent_opened(inodeno_t ino, int ret);
   void _open_ino_traverse_dir(inodeno_t ino, open_ino_info_t& info, int err);
+  void _open_ino_fetch_dir(inodeno_t ino, MMDSOpenIno *m, CDir *dir);
   Context* _open_ino_get_waiter(inodeno_t ino, MMDSOpenIno *m);
   int open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
 			    vector<inode_backpointer_t>& ancestors,
-- 
1.8.1.4
[PATCH] ceph: remove sb_start/end_write in ceph_aio_write.
Both vfs_write and io_submit call file_start/end_write. The difference
between file_start/end_write and sb_start/end_write is that the file_*
variants only handle regular files, but ceph_aio_write is only used for
regular files anyway.

Signed-off-by: Jianpeng Ma <majianp...@gmail.com>
---
 fs/ceph/file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 656e169..7c69f4f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -716,7 +716,6 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
-	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
 	hold_mutex = true;
 
@@ -809,7 +808,6 @@ retry_snap:
 out:
 	if (hold_mutex)
 		mutex_unlock(&inode->i_mutex);
-	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 
 	return written ? written : err;
-- 
1.8.3.rc1.44.gb387c77
Fwd: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Elbandi,

Can you give me some info about your test case so I can figure out what's
going on.

1) In the graphs you attached, what am I looking at? My best guess is that
it's traffic on a 10gigE card, but I can't tell from the graph since there
are no labels.
2) Can you give me more info about your serving case? What application are
you using to serve the video (http server)? Are you serving static mp4
files from the Ceph filesystem?
3) What's the hardware? Most importantly, how big is the partition that
cachefilesd is on, and what kind of disk are you hosting it on (rotating,
SSD)?
4) Statistics from fscache. Can you paste the output of
/proc/fs/fscache/stats and /proc/fs/fscache/histogram.
5) dmesg lines for ceph/fscache/cachefiles like:

[2049099.198234] CacheFiles: Loaded
[2049099.541721] FS-Cache: Cache "mycache" added (type cachefiles)
[2049099.541727] CacheFiles: File cache on md0 registered
[2049120.650897] Key type ceph registered
[2049120.651015] libceph: loaded (mon/osd proto 15/24)
[2049120.673202] FS-Cache: Netfs 'ceph' registered for caching
[2049120.673207] ceph: loaded (mds proto 32)
[2049120.680919] libceph: client6473 fsid e23a1bfc-8328-46bf-bc59-1209df3f5434
[2049120.683397] libceph: mon0 10.0.5.226:6789 session established

I think with these answers I'll be better able to diagnose what's going on
for you.

- Milosz

On Mon, Jun 17, 2013 at 9:16 AM, Elso Andras <elso.and...@gmail.com> wrote:
> Hi,
>
> I tested your patches on an ubuntu lucid system with the ubuntu raring
> kernel (3.8), using the for-linus branch from ceph-client and your
> fscache. There were no problems under heavy load. But I don't see any
> difference with/without fscache in our test case (mp4 video streaming,
> ~5500 connections):
> with fscache: http://imageshack.us/photo/my-images/109/xg5a.png/
> without fscache: http://imageshack.us/photo/my-images/5/xak.png/
>
> Elbandi
>
> 2013/5/29 Milosz Tanski <mil...@adfin.com>:
>> Sage,
>>
>> Thanks for taking a look at this. No worries about the timing.
>>
>> I added two extra changes into my branch located here:
>> https://bitbucket.org/adfin/linux-fs/commits/branch/forceph. The first
>> one is a fix for a kernel deadlock. The second one makes fsc cache a
>> non-default mount option (akin to NFS). Finally, I observed an
>> occasional oops in fscache that's fixed in David's branch that's
>> waiting to get into mainline. The fix for the issue is here:
>> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache&id=82958c45e35963c93fc6cbe6a27752e2d97e9f9a.
>> I can only cause that issue by forcing the kernel to drop its caches
>> in some cases.
>>
>> Let me know if you have any other feedback, or if I can help in any way.
>>
>> Thanks,
>> - Milosz
>>
>> On Tue, May 28, 2013 at 1:11 PM, Sage Weil <s...@inktank.com> wrote:
>>> Hi Milosz,
>>>
>>> Just a heads up that I hope to take a closer look at the patch this
>>> afternoon or tomorrow. Just catching up after the long weekend.
>>>
>>> Thanks!
>>> sage
>>>
>>> On Thu, 23 May 2013, Milosz Tanski wrote:
>>>> Enable fscache as an optional feature of ceph.
>>>>
>>>> Adding support for fscache to the Ceph filesystem. This would bring
>>>> it on par with some of the other network filesystems in Linux (like
>>>> NFS, AFS, etc...). This exploits the existing Ceph cache lazyio
>>>> capabilities.
>>>>
>>>> Signed-off-by: Milosz Tanski <mil...@adfin.com>
>>>> ---
>>>>  fs/ceph/Kconfig  |    9 ++
>>>>  fs/ceph/Makefile |    2 ++
>>>>  fs/ceph/addr.c   |   85 --
>>>>  fs/ceph/caps.c   |   21 +-
>>>>  fs/ceph/file.c   |    9 ++
>>>>  fs/ceph/inode.c  |   25 ++--
>>>>  fs/ceph/super.c  |   25 ++--
>>>>  fs/ceph/super.h  |   12
>>>>  8 files changed, 162 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig
>>>> index 49bc782..ac9a2ef 100644
>>>> --- a/fs/ceph/Kconfig
>>>> +++ b/fs/ceph/Kconfig
>>>> @@ -16,3 +16,12 @@ config CEPH_FS
>>>>
>>>>  	  If unsure, say N.
>>>>
>>>> +if CEPH_FS
>>>> +config CEPH_FSCACHE
>>>> +	bool "Enable Ceph client caching support"
>>>> +	depends on CEPH_FS=m && FSCACHE || CEPH_FS=y && FSCACHE=y
>>>> +	help
>>>> +	  Choose Y here to enable persistent, read-only local
>>>> +	  caching support for Ceph clients using FS-Cache
>>>> +
>>>> +endif
>>>> diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
>>>> index bd35212..0af0678 100644
>>>> --- a/fs/ceph/Makefile
>>>> +++ b/fs/ceph/Makefile
>>>> @@ -9,3 +9,5 @@ ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
>>>>  	mds_client.o mdsmap.o strings.o ceph_frag.o \
>>>>  	debugfs.o
>>>>
>>>> +ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
>>>> +
>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>> index 3e68ac1..fd3a1cc 100644
>>>> --- a/fs/ceph/addr.c
>>>> +++ b/fs/ceph/addr.c
>>>> @@ -11,6 +11,7 @@
>>>>
>>>>  #include "super.h"
>>>>  #include "mds_client.h"
>>>> +#include "cache.h"
>>>>  #include <linux/ceph/osd_client.h>
>>>>
>>>>  /*
>>>> @@ -149,11 +150,26 @@ static void ceph_invalidatepage(struct page *page, unsigned long offset)
>>>>  	struct ceph_inode_info *ci;
>>>>  	struct ceph_snap_context *snapc = page_snap_context(page);
[PATCH v3 04/13] locks: make added in __posix_lock_file a bool
...save 3 bytes of stack space.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Acked-by: J. Bruce Fields <bfie...@fieldses.org>
---
 fs/locks.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 1e6301b..c186649 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -791,7 +791,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 	struct file_lock *left = NULL;
 	struct file_lock *right = NULL;
 	struct file_lock **before;
-	int error, added = 0;
+	int error;
+	bool added = false;
 
 	/*
 	 * We may need two file_lock structures for this operation,
@@ -885,7 +886,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 				continue;
 			}
 			request = fl;
-			added = 1;
+			added = true;
 		} else {
 			/* Processing for different lock types is a bit
@@ -896,7 +897,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 			if (fl->fl_start > request->fl_end)
 				break;
 			if (request->fl_type == F_UNLCK)
-				added = 1;
+				added = true;
 			if (fl->fl_start < request->fl_start)
 				left = fl;
 			/* If the next lock in the list has a higher end
@@ -926,7 +927,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 			locks_release_private(fl);
 			locks_copy_private(fl, request);
 			request = fl;
-			added = 1;
+			added = true;
 		}
 	}
 	/* Go on to next lock.
-- 
1.7.1
[PATCH v3 10/13] locks: add a new lm_owner_key lock operation
Currently, the hashing that the locking code uses to add these values to the blocked_hash is simply calculated using fl_owner field. That's valid in most cases except for server-side lockd, which validates the owner of a lock based on fl_owner and fl_pid. In the case where you have a small number of NFS clients doing a lot of locking between different processes, you could end up with all the blocked requests sitting in a very small number of hash buckets. Add a new lm_owner_key operation to the lock_manager_operations that will generate an unsigned long to use as the key in the hashtable. That function is only implemented for server-side lockd, and simply XORs the fl_owner and fl_pid. Signed-off-by: Jeff Layton jlay...@redhat.com Acked-by: J. Bruce Fields bfie...@fieldses.org --- Documentation/filesystems/Locking | 16 +++- fs/lockd/svclock.c| 12 fs/locks.c| 12 ++-- include/linux/fs.h|1 + 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 413685f..dfeb01b 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -351,6 +351,7 @@ fl_release_private: maybe no --- lock_manager_operations --- prototypes: int (*lm_compare_owner)(struct file_lock *, struct file_lock *); + unsigned long (*lm_owner_key)(struct file_lock *); void (*lm_notify)(struct file_lock *); /* unblock callback */ int (*lm_grant)(struct file_lock *, struct file_lock *, int); void (*lm_break)(struct file_lock *); /* break_lease callback */ @@ -360,16 +361,21 @@ locking rules: inode-i_lock file_lock_lock may block lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no lm_notify: yes yes no lm_grant: no no no lm_break: yes no no lm_change yes no no -[1]: -lm_compare_owner is generally called with *an* inode-i_lock held. It -may not be the i_lock of the inode for either file_lock being compared! 
This is -the case with deadlock detection, since the code has to chase down the owners -of locks that may be entirely unrelated to the one on which the lock is being -acquired. When doing a search for deadlocks, the file_lock_lock is also held. +[1]: -lm_compare_owner and -lm_owner_key are generally called with +*an* inode-i_lock held. It may not be the i_lock of the inode +associated with either file_lock argument! This is the case with deadlock +detection, since the code has to chase down the owners of locks that may +be entirely unrelated to the one on which the lock is being acquired. +For deadlock detection however, the file_lock_lock is also held. The +fact that these locks are held ensures that the file_locks do not +disappear out from under you while doing the comparison or generating an +owner key. --- buffer_head --- prototypes: diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c index e703318..ce2cdab 100644 --- a/fs/lockd/svclock.c +++ b/fs/lockd/svclock.c @@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct file_lock *fl2) return fl1-fl_owner == fl2-fl_owner fl1-fl_pid == fl2-fl_pid; } +/* + * Since NLM uses two keys for tracking locks, we need to hash them down + * to one for the blocked_hash. Here, we're just xor'ing the host address + * with the pid in order to create a key value for picking a hash bucket. 
+ */ +static unsigned long +nlmsvc_owner_key(struct file_lock *fl) +{ + return (unsigned long)fl-fl_owner ^ (unsigned long)fl-fl_pid; +} + const struct lock_manager_operations nlmsvc_lock_operations = { .lm_compare_owner = nlmsvc_same_owner, + .lm_owner_key = nlmsvc_owner_key, .lm_notify = nlmsvc_notify_blocked, .lm_grant = nlmsvc_grant_deferred, }; diff --git a/fs/locks.c b/fs/locks.c index d93b291..55f3af7 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -505,10 +505,18 @@ locks_delete_global_locks(struct file_lock *waiter) spin_unlock(file_lock_lock); } +static unsigned long +posix_owner_key(struct file_lock *fl) +{ + if (fl-fl_lmops fl-fl_lmops-lm_owner_key) + return fl-fl_lmops-lm_owner_key(fl); + return (unsigned long)fl-fl_owner; +} + static inline void locks_insert_global_blocked(struct file_lock *waiter) { - hash_add(blocked_hash, waiter-fl_link, (unsigned long)waiter-fl_owner); + hash_add(blocked_hash, waiter-fl_link, posix_owner_key(waiter)); } static inline void @@ -739,7 +747,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl) {
[PATCH v3 11/13] locks: give the blocked_hash its own spinlock
There's no reason we have to protect the blocked_hash and file_lock_list with the same spinlock. With the tests I have, breaking it in two gives a barely measurable performance benefit, but it seems reasonable to make this locking as granular as possible. Signed-off-by: Jeff Layton jlay...@redhat.com --- Documentation/filesystems/Locking | 16 fs/locks.c| 33 ++--- 2 files changed, 26 insertions(+), 23 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index dfeb01b..cf04448 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -359,20 +359,20 @@ prototypes: locking rules: - inode-i_lock file_lock_lock may block -lm_compare_owner: yes[1] maybe no -lm_owner_key yes[1] yes no -lm_notify: yes yes no -lm_grant: no no no -lm_break: yes no no -lm_change yes no no + inode-i_lock blocked_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no [1]: -lm_compare_owner and -lm_owner_key are generally called with *an* inode-i_lock held. It may not be the i_lock of the inode associated with either file_lock argument! This is the case with deadlock detection, since the code has to chase down the owners of locks that may be entirely unrelated to the one on which the lock is being acquired. -For deadlock detection however, the file_lock_lock is also held. The +For deadlock detection however, the blocked_lock_lock is also held. The fact that these locks are held ensures that the file_locks do not disappear out from under you while doing the comparison or generating an owner key. diff --git a/fs/locks.c b/fs/locks.c index 55f3af7..5db80c7 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -159,10 +159,11 @@ int lease_break_time = 45; * by the file_lock_lock. 
*/ static HLIST_HEAD(file_lock_list); +static DEFINE_SPINLOCK(file_lock_lock); /* * The blocked_hash is used to find POSIX lock loops for deadlock detection. - * It is protected by file_lock_lock. + * It is protected by blocked_lock_lock. * * We hash locks by lockowner in order to optimize searching for the lock a * particular lockowner is waiting on. @@ -174,8 +175,8 @@ static HLIST_HEAD(file_lock_list); #define BLOCKED_HASH_BITS 7 static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS); -/* Protects the file_lock_list, the blocked_hash and fl-fl_block list */ -static DEFINE_SPINLOCK(file_lock_lock); +/* protects blocked_hash and fl-fl_block list */ +static DEFINE_SPINLOCK(blocked_lock_lock); static struct kmem_cache *filelock_cache __read_mostly; @@ -528,7 +529,7 @@ locks_delete_global_blocked(struct file_lock *waiter) /* Remove waiter from blocker's block list. * When blocker ends up pointing to itself then the list is empty. * - * Must be called with file_lock_lock held. + * Must be called with blocked_lock_lock held. */ static void __locks_delete_block(struct file_lock *waiter) { @@ -539,9 +540,9 @@ static void __locks_delete_block(struct file_lock *waiter) static void locks_delete_block(struct file_lock *waiter) { - spin_lock(file_lock_lock); + spin_lock(blocked_lock_lock); __locks_delete_block(waiter); - spin_unlock(file_lock_lock); + spin_unlock(blocked_lock_lock); } /* Insert waiter into blocker's block list. @@ -549,9 +550,9 @@ static void locks_delete_block(struct file_lock *waiter) * the order they blocked. The documentation doesn't require this but * it seems like the reasonable thing to do. * - * Must be called with both the i_lock and file_lock_lock held. The fl_block + * Must be called with both the i_lock and blocked_lock_lock held. 
The fl_block * list itself is protected by the file_lock_list, but by ensuring that the - * i_lock is also held on insertions we can avoid taking the file_lock_lock + * i_lock is also held on insertions we can avoid taking the blocked_lock_lock * in some cases when we see that the fl_block list is empty. */ static void __locks_insert_block(struct file_lock *blocker, @@ -568,9 +569,9 @@ static void __locks_insert_block(struct file_lock *blocker, static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter) { - spin_lock(file_lock_lock); + spin_lock(blocked_lock_lock);
[PATCH v3 07/13] locks: avoid taking global lock if possible when waking up blocked waiters
Since we always hold the i_lock when inserting a new waiter onto the fl_block list, we can avoid taking the global lock at all if we find that it's empty when we go to wake up blocked waiters. Signed-off-by: Jeff Layton jlay...@redhat.com --- fs/locks.c | 17 ++--- 1 files changed, 14 insertions(+), 3 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 8f56651..a8f3b33 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -532,7 +532,10 @@ static void locks_delete_block(struct file_lock *waiter) * the order they blocked. The documentation doesn't require this but * it seems like the reasonable thing to do. * - * Must be called with file_lock_lock held! + * Must be called with both the i_lock and file_lock_lock held. The fl_block + * list itself is protected by the file_lock_list, but by ensuring that the + * i_lock is also held on insertions we can avoid taking the file_lock_lock + * in some cases when we see that the fl_block list is empty. */ static void __locks_insert_block(struct file_lock *blocker, struct file_lock *waiter) @@ -560,8 +563,16 @@ static void locks_insert_block(struct file_lock *blocker, */ static void locks_wake_up_blocks(struct file_lock *blocker) { + /* +* Avoid taking global lock if list is empty. This is safe since new +* blocked requests are only added to the list under the i_lock, and +* the i_lock is always held here. +*/ + if (list_empty(blocker-fl_block)) + return; + spin_lock(file_lock_lock); - while (!list_empty(blocker-fl_block)) { + do { struct file_lock *waiter; waiter = list_first_entry(blocker-fl_block, @@ -571,7 +582,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker) waiter-fl_lmops-lm_notify(waiter); else wake_up(waiter-fl_wait); - } + } while (!list_empty(blocker-fl_block)); spin_unlock(file_lock_lock); } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 12/13] seq_file: add seq_list_*_percpu helpers
When we convert the file_lock_list to a set of percpu lists, we'll need a way to iterate over them in order to output /proc/locks info. Add some seq_list_*_percpu helpers to handle that. Signed-off-by: Jeff Layton jlay...@redhat.com Acked-by: J. Bruce Fields bfie...@fieldses.org --- fs/seq_file.c| 54 ++ include/linux/seq_file.h |6 + 2 files changed, 60 insertions(+), 0 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 774c1eb..3135c25 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -921,3 +921,57 @@ struct hlist_node *seq_hlist_next_rcu(void *v, return rcu_dereference(node-next); } EXPORT_SYMBOL(seq_hlist_next_rcu); + +/** + * seq_hlist_start_precpu - start an iteration of a percpu hlist array + * @head: pointer to percpu array of struct hlist_heads + * @cpu: pointer to cpu cursor + * @pos: start position of sequence + * + * Called at seq_file-op-start(). + */ +struct hlist_node * +seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos) +{ + struct hlist_node *node; + + for_each_possible_cpu(*cpu) { + hlist_for_each(node, per_cpu_ptr(head, *cpu)) { + if (pos-- == 0) + return node; + } + } + return NULL; +} +EXPORT_SYMBOL(seq_hlist_start_percpu); + +/** + * seq_hlist_next_percpu - move to the next position of the percpu hlist array + * @v:pointer to current hlist_node + * @head: pointer to percpu array of struct hlist_heads + * @cpu: pointer to cpu cursor + * @pos: start position of sequence + * + * Called at seq_file-op-next(). 
+ */ +struct hlist_node * +seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head, + int *cpu, loff_t *pos) +{ + struct hlist_node *node = v; + + ++*pos; + + if (node-next) + return node-next; + + for (*cpu = cpumask_next(*cpu, cpu_possible_mask); *cpu nr_cpu_ids; +*cpu = cpumask_next(*cpu, cpu_possible_mask)) { + struct hlist_head *bucket = per_cpu_ptr(head, *cpu); + + if (!hlist_empty(bucket)) + return bucket-first; + } + return NULL; +} +EXPORT_SYMBOL(seq_hlist_next_percpu); diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 2da29ac..4e32edc 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -173,4 +173,10 @@ extern struct hlist_node *seq_hlist_start_head_rcu(struct hlist_head *head, extern struct hlist_node *seq_hlist_next_rcu(void *v, struct hlist_head *head, loff_t *ppos); + +/* Helpers for iterating over per-cpu hlist_head-s in seq_files */ +extern struct hlist_node *seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos); + +extern struct hlist_node *seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head, int *cpu, loff_t *pos); + #endif -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock
Having a global lock that protects all of this code is a clear scalability problem. Instead of doing that, move most of the code to be protected by the i_lock instead. The exceptions are the global lists that the -fl_link sits on, and the -fl_block list. -fl_link is what connects these structures to the global lists, so we must ensure that we hold those locks when iterating over or updating these lists. Furthermore, sound deadlock detection requires that we hold the blocked_list state steady while checking for loops. We also must ensure that the search and update to the list are atomic. For the checking and insertion side of the blocked_list, push the acquisition of the global lock into __posix_lock_file and ensure that checking and update of the blocked_list is done without dropping the lock in between. On the removal side, when waking up blocked lock waiters, take the global lock before walking the blocked list and dequeue the waiters from the global list prior to removal from the fl_block list. With this, deadlock detection should be race free while we minimize excessive file_lock_lock thrashing. Finally, in order to avoid a lock inversion problem when handling /proc/locks output we must ensure that manipulations of the fl_block list are also protected by the file_lock_lock. 
Signed-off-by: Jeff Layton jlay...@redhat.com --- Documentation/filesystems/Locking | 21 -- fs/afs/flock.c|5 +- fs/ceph/locks.c |2 +- fs/ceph/mds_client.c |8 +- fs/cifs/cifsfs.c |2 +- fs/cifs/file.c| 13 ++-- fs/gfs2/file.c|2 +- fs/lockd/svcsubs.c| 12 ++-- fs/locks.c| 151 ++--- fs/nfs/delegation.c | 10 +- fs/nfs/nfs4state.c|8 +- fs/nfsd/nfs4state.c |8 +- include/linux/fs.h| 11 --- 13 files changed, 140 insertions(+), 113 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 0706d32..413685f 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -344,7 +344,7 @@ prototypes: locking rules: - file_lock_lock may block + inode-i_lock may block fl_copy_lock: yes no fl_release_private:maybe no @@ -357,12 +357,19 @@ prototypes: int (*lm_change)(struct file_lock **, int); locking rules: - file_lock_lock may block -lm_compare_owner: yes no -lm_notify: yes no -lm_grant: no no -lm_break: yes no -lm_change yes no + + inode-i_lock file_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no + +[1]: -lm_compare_owner is generally called with *an* inode-i_lock held. It +may not be the i_lock of the inode for either file_lock being compared! This is +the case with deadlock detection, since the code has to chase down the owners +of locks that may be entirely unrelated to the one on which the lock is being +acquired. When doing a search for deadlocks, the file_lock_lock is also held. 
--- buffer_head --- prototypes: diff --git a/fs/afs/flock.c b/fs/afs/flock.c index 2497bf3..03fc0d1 100644 --- a/fs/afs/flock.c +++ b/fs/afs/flock.c @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key) */ static int afs_do_setlk(struct file *file, struct file_lock *fl) { + struct inode = file_inode(file); struct afs_vnode *vnode = AFS_FS_I(file-f_mapping-host); afs_lock_type_t type; struct key *key = file-private_data; @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl) type = (fl-fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE; - lock_flocks(); + spin_lock(inode-i_lock); /* make sure we've got a callback on this file and that our view of the * data version is up to date */ @@ -420,7 +421,7 @@ given_lock: afs_vnode_fetch_status(vnode, NULL, key); error: - unlock_flocks(); + spin_unlock(inode-i_lock); _leave( = %d, ret); return ret; diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index ebbf680..690f73f 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -192,7 +192,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) /** * Encode the flock and fcntl locks for the given inode into the
[PATCH v3 00/13] locks: scalability improvements for file locking
Summary of Significant Changes:
-------------------------------
v3:
- Change spinlock handling to avoid the need to traverse the global
  blocked_hash when generating /proc/locks output. This means that the
  fl_block list must continue to be protected by a global lock, but the
  fact that the i_lock is also held in most cases means that we can
  avoid taking it in certain situations.

v2:
- Fix potential races in deadlock detection. Manipulation of the global
  blocked_hash and deadlock detection are now atomic. This is a little
  slower than the earlier set, but is provably correct. Also, the patch
  that converts to using the i_lock has been split out from most of the
  other changes. That should make it easier to review, but it does leave
  a potential race in the deadlock detection that is fixed up by the
  following patch. It may make sense to fold patches 7 and 8 together
  before merging.
- Add percpu hlists and lglocks for the global file_lock_list. This
  gives us some speedup since this list is seldom read.

Abstract (tl;dr version):
-------------------------
This patchset represents an overhaul of the file locking code with an
aim toward improving its scalability and making the code a bit easier to
understand.

Longer version:
---------------
When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_lock spinlock. It was an expedient way to deal with the file
locking code at the time, but having a giant spinlock around all of this
code is clearly not great for scalability.

Red Hat has bug reports that go back into the 2.6.18 era that point to
BKL scalability problems in the file locking code, and the
file_lock_lock suffers from the same issues. This patchset is my first
attempt to make this code less dependent on global locking. The main
change is to switch most of the file locking code to be protected by the
inode->i_lock instead of the file_lock_lock.
While that works for most things, there are a couple of global data structures (lists in the current code) that need a global lock to protect them. So we still need a global lock in order to deal with those. The remaining patches are intended to make that global locking less painful. The big gains are made by turning the blocked_list into a hashtable, which greatly speeds up the deadlock detection code and making the file_lock_list percpu. This is not the first attempt at doing this. The conversion to the i_lock was originally attempted by Bruce Fields a few years ago. His approach was NAK'ed since it involved ripping out the deadlock detection. People also really seem to like /proc/locks for debugging, so keeping that in is probably worthwhile. There's more work to be done in this area and this patchset is just a start. There's a horrible thundering herd problem when a blocking lock is released, for instance. There was also interest in solving the goofy unlock on any close POSIX lock semantics at this year's LSF. I think this patchset will help lay the groundwork for those changes as well. While file locking is not usually considered to be a high-performance codepath, it *is* an IPC mechanism and I think it behooves us to try to make it as fast as possible. I'd like to see this considered for 3.11, but some soak time in -next would be good. Comments and suggestions welcome. Performance testing and results: In order to measure the benefit of this set, I've written some locking performance tests that I've made available here: git://git.samba.org/jlayton/lockperf.git Here are the results from the same 32-way, 4 NUMA node machine that I used to generate the v2 patch results. The first number is the mean time spent in locking for the test. The number in parenthesis is the standard deviation. 
		3.10.0-rc5-00219-ga2648eb	3.10.0-rc5-00231-g7569869
		-------------------------	-------------------------
flock01		24119.96 (266.08)		24542.51 (254.89)
flock02		 1345.09 (37.37)		    8.60 (0.31)
posix01		31217.14 (320.91)		24899.20 (254.27)
posix02		 1348.60 (36.83)		   12.70 (0.44)

I wasn't able to reserve the exact same smaller machine for testing this
set, but this one is comparable with 4 CPUs and UMA architecture:

		3.10.0-rc5-00219-ga2648eb	3.10.0-rc5-00231-g7569869
		-------------------------	-------------------------
flock01		1787.51 (11.23)			1797.75 (9.27)
flock02		 314.90 (8.84)			  34.87 (2.82)
posix01		1843.43 (11.63)			1880.47 (13.47)
posix02		 325.13 (8.53)			  54.09 (4.02)

I think the conclusion we can draw here is that this patchset is roughly
as fast as the previous one. In addition, the posix02 test saw a vast
increase in performance. I believe that's mostly
[PATCH v3 09/13] locks: turn the blocked_list into a hashtable
Break up the blocked_list into a hashtable, using the fl_owner as a key. This speeds up searching the hash chains, which is especially significant for deadlock detection. Note that the initial implementation assumes that hashing on fl_owner is sufficient. In most cases it should be, with the notable exception being server-side lockd, which compares ownership using a tuple of the nlm_host and the pid sent in the lock request. So, this may degrade to a single hash bucket when you only have a single NFS client. That will be addressed in a later patch. The careful observer may note that this patch leaves the file_lock_list alone. There's much less of a case for turning the file_lock_list into a hashtable. The only user of that list is the code that generates /proc/locks, and it always walks the entire list. Signed-off-by: Jeff Layton jlay...@redhat.com Acked-by: J. Bruce Fields bfie...@fieldses.org --- fs/locks.c | 25 + 1 files changed, 17 insertions(+), 8 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 32826ed..d93b291 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -126,6 +126,7 @@ #include linux/time.h #include linux/rcupdate.h #include linux/pid_namespace.h +#include linux/hashtable.h #include asm/uaccess.h @@ -160,12 +161,20 @@ int lease_break_time = 45; static HLIST_HEAD(file_lock_list); /* - * The blocked_list is used to find POSIX lock loops for deadlock detection. - * Protected by file_lock_lock. + * The blocked_hash is used to find POSIX lock loops for deadlock detection. + * It is protected by file_lock_lock. + * + * We hash locks by lockowner in order to optimize searching for the lock a + * particular lockowner is waiting on. + * + * FIXME: make this value scale via some heuristic? We generally will want more + * buckets when we have more lockowners holding locks, but that's a little + * difficult to determine without knowing what the workload will look like. 
*/ -static HLIST_HEAD(blocked_list); +#define BLOCKED_HASH_BITS 7 +static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS); -/* Protects the two list heads above, and fl-fl_block list. */ +/* Protects the file_lock_list, the blocked_hash and fl-fl_block list */ static DEFINE_SPINLOCK(file_lock_lock); static struct kmem_cache *filelock_cache __read_mostly; @@ -499,13 +508,13 @@ locks_delete_global_locks(struct file_lock *waiter) static inline void locks_insert_global_blocked(struct file_lock *waiter) { - hlist_add_head(waiter-fl_link, blocked_list); + hash_add(blocked_hash, waiter-fl_link, (unsigned long)waiter-fl_owner); } static inline void locks_delete_global_blocked(struct file_lock *waiter) { - hlist_del_init(waiter-fl_link); + hash_del(waiter-fl_link); } /* Remove waiter from blocker's block list. @@ -730,7 +739,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl) { struct file_lock *fl; - hlist_for_each_entry(fl, blocked_list, fl_link) { + hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl-fl_owner) { if (posix_same_owner(fl, block_fl)) return fl-fl_next; } @@ -866,7 +875,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str /* * New lock request. Walk all POSIX locks and look for conflicts. If * there are any, either return error or put the request on the -* blocker's list of waiters and the global blocked_list. +* blocker's list of waiters and the global blocked_hash. */ if (request-fl_type != F_UNLCK) { for_each_lock(inode, before) { -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 02/13] locks: make generic_add_lease and generic_delete_lease static
Signed-off-by: Jeff Layton jlay...@redhat.com Acked-by: J. Bruce Fields bfie...@fieldses.org --- fs/locks.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 7a02064..e3140b8 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp) return type; } -int generic_add_lease(struct file *filp, long arg, struct file_lock **flp) +static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp) { struct file_lock *fl, **before, **my_before = NULL, *lease; struct dentry *dentry = filp-f_path.dentry; @@ -1402,7 +1402,7 @@ out: return error; } -int generic_delete_lease(struct file *filp, struct file_lock **flp) +static int generic_delete_lease(struct file *filp, struct file_lock **flp) { struct file_lock *fl, **before; struct dentry *dentry = filp-f_path.dentry; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 05/13] locks: encapsulate the fl_link list handling
Move the fl_link list handling routines into a separate set of helpers. Also ensure that locks and requests are always put on global lists last (after fully initializing them) and are taken off before unintializing them. Signed-off-by: Jeff Layton jlay...@redhat.com --- fs/locks.c | 45 - 1 files changed, 36 insertions(+), 9 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index c186649..c0e613f 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -153,13 +153,15 @@ int lease_break_time = 45; #define for_each_lock(inode, lockp) \ for (lockp = inode-i_flock; *lockp != NULL; lockp = (*lockp)-fl_next) +/* The global file_lock_list is only used for displaying /proc/locks. */ static LIST_HEAD(file_lock_list); + +/* The blocked_list is used to find POSIX lock loops for deadlock detection. */ static LIST_HEAD(blocked_list); + +/* Protects the two list heads above, plus the inode-i_flock list */ static DEFINE_SPINLOCK(file_lock_lock); -/* - * Protects the two list heads above, plus the inode-i_flock list - */ void lock_flocks(void) { spin_lock(file_lock_lock); @@ -484,13 +486,37 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2) return fl1-fl_owner == fl2-fl_owner; } +static inline void +locks_insert_global_locks(struct file_lock *waiter) +{ + list_add_tail(waiter-fl_link, file_lock_list); +} + +static inline void +locks_delete_global_locks(struct file_lock *waiter) +{ + list_del_init(waiter-fl_link); +} + +static inline void +locks_insert_global_blocked(struct file_lock *waiter) +{ + list_add(waiter-fl_link, blocked_list); +} + +static inline void +locks_delete_global_blocked(struct file_lock *waiter) +{ + list_del_init(waiter-fl_link); +} + /* Remove waiter from blocker's block list. * When blocker ends up pointing to itself then the list is empty. 
*/ static void __locks_delete_block(struct file_lock *waiter) { + locks_delete_global_blocked(waiter); list_del_init(waiter-fl_block); - list_del_init(waiter-fl_link); waiter-fl_next = NULL; } @@ -512,10 +538,10 @@ static void locks_insert_block(struct file_lock *blocker, struct file_lock *waiter) { BUG_ON(!list_empty(waiter-fl_block)); - list_add_tail(waiter-fl_block, blocker-fl_block); waiter-fl_next = blocker; + list_add_tail(waiter-fl_block, blocker-fl_block); if (IS_POSIX(blocker)) - list_add(waiter-fl_link, blocked_list); + locks_insert_global_blocked(request); } /* @@ -543,13 +569,13 @@ static void locks_wake_up_blocks(struct file_lock *blocker) */ static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl) { - list_add(fl-fl_link, file_lock_list); - fl-fl_nspid = get_pid(task_tgid(current)); /* insert into file's list */ fl-fl_next = *pos; *pos = fl; + + locks_insert_global_locks(fl); } /* @@ -562,9 +588,10 @@ static void locks_delete_lock(struct file_lock **thisfl_p) { struct file_lock *fl = *thisfl_p; + locks_delete_global_locks(fl); + *thisfl_p = fl-fl_next; fl-fl_next = NULL; - list_del_init(fl-fl_link); if (fl-fl_nspid) { put_pid(fl-fl_nspid); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 03/13] locks: comment cleanups and clarifications
Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/locks.c         |   21 +
 include/linux/fs.h |   18 ++
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e3140b8..1e6301b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -518,9 +518,10 @@ static void locks_insert_block(struct file_lock *blocker,
 		list_add(&waiter->fl_link, &blocked_list);
 }

-/* Wake up processes blocked waiting for blocker.
- * If told to wait then schedule the processes until the block list
- * is empty, otherwise empty the block list ourselves.
+/*
+ * Wake up processes blocked waiting for blocker.
+ *
+ * Must be called with the file_lock_lock held!
  */
 static void locks_wake_up_blocks(struct file_lock *blocker)
 {
@@ -806,6 +807,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 	}

 	lock_flocks();
+	/*
+	 * New lock request. Walk all POSIX locks and look for conflicts. If
+	 * there are any, either return error or put the request on the
+	 * blocker's list of waiters and the global blocked_list.
+	 */
 	if (request->fl_type != F_UNLCK) {
 		for_each_lock(inode, before) {
 			fl = *before;
@@ -844,7 +850,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 		before = &fl->fl_next;
 	}

-	/* Process locks with this owner.  */
+	/* Process locks with this owner. */
 	while ((fl = *before) && posix_same_owner(request, fl)) {
 		/* Detect adjacent or overlapping regions (if same lock type) */
@@ -930,10 +936,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 	}

 	/*
-	 * The above code only modifies existing locks in case of
-	 * merging or replacing. If new lock(s) need to be inserted
-	 * all modifications are done bellow this, so it's safe yet to
-	 * bail out.
+	 * The above code only modifies existing locks in case of merging or
+	 * replacing. If new lock(s) need to be inserted all modifications are
+	 * done below this, so it's safe yet to bail out.
 	 */
 	error = -ENOLCK; /* "no luck" */
 	if (right && left == right && !new_fl2)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9d7816..94105d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,24 @@ int locks_in_grace(struct net *);
 /* that will die - we need it for nfs_lock_info */
 #include <linux/nfs_fs_i.h>

+/*
+ * struct file_lock represents a generic file lock. It's used to represent
+ * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
+ * note that the same struct is used to represent both a request for a lock and
+ * the lock itself, but the same object is never used for both.
+ *
+ * FIXME: should we create a separate struct lock_request to help distinguish
+ * these two uses?
+ *
+ * The i_flock list is ordered by:
+ *
+ * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
+ * 2) lock owner
+ * 3) lock range start
+ * 4) lock range end
+ *
+ * Obviously, the last two criteria only matter for POSIX locks.
+ */
 struct file_lock {
 	struct file_lock *fl_next;	/* singly linked list for this inode */
 	struct list_head fl_link;	/* doubly linked list of all locks */
--
1.7.1
[PATCH v3 08/13] locks: convert fl_link to a hlist_node
Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A
later patch will do the conversion of the blocked_list to a hashtable.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/locks.c         |   24
 include/linux/fs.h |    2 +-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index a8f3b33..32826ed 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -157,13 +157,13 @@ int lease_break_time = 45;
 /*
  * The global file_lock_list is only used for displaying /proc/locks. Protected
  * by the file_lock_lock.
  */
-static LIST_HEAD(file_lock_list);
+static HLIST_HEAD(file_lock_list);

 /*
  * The blocked_list is used to find POSIX lock loops for deadlock detection.
  * Protected by file_lock_lock.
  */
-static LIST_HEAD(blocked_list);
+static HLIST_HEAD(blocked_list);

 /* Protects the two list heads above, and fl->fl_block list.
  */
 static DEFINE_SPINLOCK(file_lock_lock);

@@ -172,7 +172,7 @@ static struct kmem_cache *filelock_cache __read_mostly;

 static void locks_init_lock_heads(struct file_lock *fl)
 {
-	INIT_LIST_HEAD(&fl->fl_link);
+	INIT_HLIST_NODE(&fl->fl_link);
 	INIT_LIST_HEAD(&fl->fl_block);
 	init_waitqueue_head(&fl->fl_wait);
 }
@@ -206,7 +206,7 @@ void locks_free_lock(struct file_lock *fl)
 {
 	BUG_ON(waitqueue_active(&fl->fl_wait));
 	BUG_ON(!list_empty(&fl->fl_block));
-	BUG_ON(!list_empty(&fl->fl_link));
+	BUG_ON(!hlist_unhashed(&fl->fl_link));

 	locks_release_private(fl);
 	kmem_cache_free(filelock_cache, fl);
@@ -484,7 +484,7 @@ static inline void
 locks_insert_global_locks(struct file_lock *waiter)
 {
 	spin_lock(&file_lock_lock);
-	list_add_tail(&waiter->fl_link, &file_lock_list);
+	hlist_add_head(&waiter->fl_link, &file_lock_list);
 	spin_unlock(&file_lock_lock);
 }
@@ -492,20 +492,20 @@ static inline void
 locks_delete_global_locks(struct file_lock *waiter)
 {
 	spin_lock(&file_lock_lock);
-	list_del_init(&waiter->fl_link);
+	hlist_del_init(&waiter->fl_link);
 	spin_unlock(&file_lock_lock);
 }

 static inline void
 locks_insert_global_blocked(struct file_lock *waiter)
 {
-	list_add(&waiter->fl_link, &blocked_list);
+	hlist_add_head(&waiter->fl_link, &blocked_list);
 }

 static inline void
 locks_delete_global_blocked(struct file_lock *waiter)
 {
-	list_del_init(&waiter->fl_link);
+	hlist_del_init(&waiter->fl_link);
 }

 /* Remove waiter from blocker's block list.
@@ -730,7 +730,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 {
 	struct file_lock *fl;

-	list_for_each_entry(fl, &blocked_list, fl_link) {
+	hlist_for_each_entry(fl, &blocked_list, fl_link) {
 		if (posix_same_owner(fl, block_fl))
 			return fl->fl_next;
 	}
@@ -2285,7 +2285,7 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct file_lock *fl, *bfl;

-	fl = list_entry(v, struct file_lock, fl_link);
+	fl = hlist_entry(v, struct file_lock, fl_link);

 	lock_get_status(f, fl, *((loff_t *)f->private), "");

@@ -2301,14 +2301,14 @@ static void *locks_start(struct seq_file *f, loff_t *pos)

 	spin_lock(&file_lock_lock);
 	*p = (*pos + 1);
-	return seq_list_start(&file_lock_list, *pos);
+	return seq_hlist_start(&file_lock_list, *pos);
 }

 static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
 {
 	loff_t *p = f->private;

 	++*p;
-	return seq_list_next(v, &file_lock_list, pos);
+	return seq_hlist_next(v, &file_lock_list, pos);
 }

 static void locks_stop(struct seq_file *f, void *v)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2f896d..3b340f7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -946,7 +946,7 @@ int locks_in_grace(struct net *);
  */
 struct file_lock {
 	struct file_lock *fl_next;	/* singly linked list for this inode */
-	struct list_head fl_link;	/* doubly linked list of all locks */
+	struct hlist_node fl_link;	/* node in global lists */
 	struct list_head fl_block;	/* circular list of blocked processes */
 	fl_owner_t fl_owner;
 	unsigned int fl_flags;
--
1.7.1
[PATCH v3 01/13] cifs: use posix_unblock_lock instead of locks_delete_block
commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
the locks_delete_block symbol. There's already an exported helper function
that provides this capability however, so make cifs use that instead and
turn locks_delete_block back into a static function.

Note that if fl->fl_next == NULL then this lock has already been through
locks_delete_block(), so we should be OK to ignore an ENOENT error here
and simply not retry the lock.

Cc: Pavel Shilovsky piastr...@gmail.com
Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/cifs/file.c     |    2 +-
 fs/locks.c         |    3 +--
 include/linux/fs.h |    5 -----
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d2..44a4f18 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -999,7 +999,7 @@ try_again:
 		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
 		if (!rc)
 			goto try_again;
-		locks_delete_block(flock);
+		posix_unblock_lock(file, flock);
 	}
 	return rc;
 }
diff --git a/fs/locks.c b/fs/locks.c
index cb424a4..7a02064 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)
 /*
  */
-void locks_delete_block(struct file_lock *waiter)
+static void locks_delete_block(struct file_lock *waiter)
 {
 	lock_flocks();
 	__locks_delete_block(waiter);
 	unlock_flocks();
 }
-EXPORT_SYMBOL(locks_delete_block);

 /*
  * Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..b9d7816 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
 extern int lease_modify(struct file_lock **, int);
 extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
 extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void locks_delete_block(struct file_lock *waiter);
 extern void lock_flocks(void);
 extern void unlock_flocks(void);
 #else /* !CONFIG_FILE_LOCKING */
@@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 	return 1;
 }

-static inline void locks_delete_block(struct file_lock *waiter)
-{
-}
-
 static inline void lock_flocks(void)
 {
 }
--
1.7.1
Re: [PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock
On Mon, 17 Jun 2013 11:13:49 -0400
Jeff Layton jlay...@redhat.com wrote:

> Having a global lock that protects all of this code is a clear
> scalability problem. Instead of doing that, move most of the code to be
> protected by the i_lock instead. The exceptions are the global lists
> that the ->fl_link sits on, and the ->fl_block list.
>
> ->fl_link is what connects these structures to the global lists, so we
> must ensure that we hold those locks when iterating over or updating
> these lists.
>
> Furthermore, sound deadlock detection requires that we hold the
> blocked_list state steady while checking for loops. We also must ensure
> that the search and update to the list are atomic. For the checking and
> insertion side of the blocked_list, push the acquisition of the global
> lock into __posix_lock_file and ensure that checking and update of the
> blocked_list is done without dropping the lock in between.
>
> On the removal side, when waking up blocked lock waiters, take the
> global lock before walking the blocked list and dequeue the waiters from
> the global list prior to removal from the fl_block list.
>
> With this, deadlock detection should be race free while we minimize
> excessive file_lock_lock thrashing.
>
> Finally, in order to avoid a lock inversion problem when handling
> /proc/locks output we must ensure that manipulations of the fl_block
> list are also protected by the file_lock_lock.
>
> Signed-off-by: Jeff Layton jlay...@redhat.com
> ---
>  Documentation/filesystems/Locking |   21 --
>  fs/afs/flock.c                    |    5 +-
>  fs/ceph/locks.c                   |    2 +-
>  fs/ceph/mds_client.c              |    8 +-
>  fs/cifs/cifsfs.c                  |    2 +-
>  fs/cifs/file.c                    |   13 ++--
>  fs/gfs2/file.c                    |    2 +-
>  fs/lockd/svcsubs.c                |   12 ++--
>  fs/locks.c                        |  151 ++---
>  fs/nfs/delegation.c               |   10 +-
>  fs/nfs/nfs4state.c                |    8 +-
>  fs/nfsd/nfs4state.c               |    8 +-
>  include/linux/fs.h                |   11 ---
>  13 files changed, 140 insertions(+), 113 deletions(-)
>
> [...]
>
> @@ -1231,7 +1254,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  	if (IS_ERR(new_fl))
>  		return PTR_ERR(new_fl);
>
> -	lock_flocks();
> +	spin_lock(&inode->i_lock);
>
>  	time_out_leases(inode);
>
> @@ -1281,11 +1304,11 @@ restart:
>  			break_time++;
>  	}
>  	locks_insert_block(flock, new_fl);
> -	unlock_flocks();
> +	spin_unlock(&inode->i_lock);
>  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
> 						!new_fl->fl_next, break_time);
> -	lock_flocks();
> -	__locks_delete_block(new_fl);
> +	spin_lock(&inode->i_lock);
> +	locks_delete_block(new_fl);

Doh -- bug here. This should not have been changed to
locks_delete_block(). My apologies.

>  	if (error >= 0) {
>  		if (error == 0)
>  			time_out_leases(inode);
>
> [...]
>
>  posix_unblock_lock(struct file *filp, struct file_lock *waiter)
>  {
> +	struct inode *inode = file_inode(filp);
>  	int status = 0;
>
> -	lock_flocks();
> +	spin_lock(&inode->i_lock);
>  	if (waiter->fl_next)
> -		__locks_delete_block(waiter);
> +		locks_delete_block(waiter);

Ditto here...

>  	else
> 		status = -ENOENT;
> -	unlock_flocks();
> +	spin_unlock(&inode->i_lock);
> 	return status;
>  }

--
Jeff Layton jlay...@redhat.com
Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Hi,

> 1) In the graphs you attached what am I looking at? My best guess is
> that it's traffic on a 10gigE card, but I can't tell from the graph
> since there's no labels.

Yes, 10G traffic on the switch port. So "incoming" means server-to-switch,
"outgoing" means switch-to-server. No separate card for ceph traffic :(

> 2) Can you give me more info about your serving case. What application
> are you using to serve the video (http server)? Are you serving static
> mp4 files from Ceph filesystem?

A lighttpd server with the mp4 streaming mod
(http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2);
the files live on cephfs. There is a speed limit, controlled by the mp4
mod: the bandwidth is the video bitrate value.

mount options: name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test

rsize=0 and rasize=131072 were found by testing; with other values there
was 4x more incoming (from osd) traffic than outgoing (to internet) traffic.

> 3) What's the hardware, most importantly how big is your partition that
> cachefilesd is on and what kind of disk are you hosting it on (rotating,
> SSD)?

There are 5 osd servers: HP DL380 G6, 32G ram, 16 x HP sas disks (10k rpm)
with raid0, bonding two 1G interfaces together. (In a previous life, this
hw could serve ~2.3G of traffic with raid5 and three bonded interfaces.)

> 4) Statistics from fscache. Can you paste the output of
> /proc/fs/fscache/stats and /proc/fs/fscache/histogram.

FS-Cache statistics
Cookies: idx=1 dat=8001 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages  : mrk=0 unc=0
Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=2265 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops    : pend=0 run=0 enq=0 can=0 rej=0
Ops    : dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

No histogram; I'll try a build with it enabled.

> 5) dmesg lines for ceph/fscache/cachefiles like:

[  264.186887] FS-Cache: Loaded
[  264.223851] Key type ceph registered
[  264.223902] libceph: loaded (mon/osd proto 15/24)
[  264.246334] FS-Cache: Netfs 'ceph' registered for caching
[  264.246341] ceph: loaded (mds proto 32)
[  264.249497] libceph: client31274 fsid 1d78ebe5-f254-44ff-81c1-f641bb2036b6

Elbandi
Re: [PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock
On Mon, 17 Jun 2013 11:46:09 -0400
Jeff Layton jlay...@redhat.com wrote:

> On Mon, 17 Jun 2013 11:13:49 -0400
> Jeff Layton jlay...@redhat.com wrote:
>
> > Having a global lock that protects all of this code is a clear
> > scalability problem. Instead of doing that, move most of the code to be
> > protected by the i_lock instead. The exceptions are the global lists
> > that the ->fl_link sits on, and the ->fl_block list.
> >
> > [...]
> >
> > @@ -1281,11 +1304,11 @@ restart:
> >  			break_time++;
> >  	}
> >  	locks_insert_block(flock, new_fl);
> > -	unlock_flocks();
> > +	spin_unlock(&inode->i_lock);
> >  	error = wait_event_interruptible_timeout(new_fl->fl_wait,
> > 						!new_fl->fl_next, break_time);
> > -	lock_flocks();
> > -	__locks_delete_block(new_fl);
> > +	spin_lock(&inode->i_lock);
> > +	locks_delete_block(new_fl);
>
> Doh -- bug here. This should not have been changed to
> locks_delete_block(). My apologies.
>
> > [...]
> >
> >  posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> >  {
> > +	struct inode *inode = file_inode(filp);
> >  	int status = 0;
> >
> > -	lock_flocks();
> > +	spin_lock(&inode->i_lock);
> >  	if (waiter->fl_next)
> > -		__locks_delete_block(waiter);
> > +		locks_delete_block(waiter);
>
> Ditto here...
>
> > 	else
> > 		status = -ENOENT;
> > -	unlock_flocks();
> > +	spin_unlock(&inode->i_lock);
> > 	return status;
> >  }

Bah, scratch that -- this patch is actually fine. We hold the i_lock
here, and locks_delete_block takes the file_lock_lock, which is correct.

There is a potential race in patch 7 though. I'll reply to that patch to
point it out in a minute.

--
Jeff Layton jlay...@redhat.com
Re: [PATCH v3 07/13] locks: avoid taking global lock if possible when waking up blocked waiters
On Mon, 17 Jun 2013 11:13:50 -0400
Jeff Layton jlay...@redhat.com wrote:

> Since we always hold the i_lock when inserting a new waiter onto the
> fl_block list, we can avoid taking the global lock at all if we find
> that it's empty when we go to wake up blocked waiters.
>
> Signed-off-by: Jeff Layton jlay...@redhat.com
> ---
>  fs/locks.c |   17 ++---
>  1 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 8f56651..a8f3b33 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -532,7 +532,10 @@ static void locks_delete_block(struct file_lock *waiter)
>   * the order they blocked. The documentation doesn't require this but
>   * it seems like the reasonable thing to do.
>   *
> - * Must be called with file_lock_lock held!
> + * Must be called with both the i_lock and file_lock_lock held. The fl_block
> + * list itself is protected by the file_lock_list, but by ensuring that the
> + * i_lock is also held on insertions we can avoid taking the file_lock_lock
> + * in some cases when we see that the fl_block list is empty.
>   */
>  static void __locks_insert_block(struct file_lock *blocker,
> 					struct file_lock *waiter)
> @@ -560,8 +563,16 @@ static void locks_insert_block(struct file_lock *blocker,
>   */
>  static void locks_wake_up_blocks(struct file_lock *blocker)
>  {
> +	/*
> +	 * Avoid taking global lock if list is empty. This is safe since new
> +	 * blocked requests are only added to the list under the i_lock, and
> +	 * the i_lock is always held here.
> +	 */
> +	if (list_empty(&blocker->fl_block))
> +		return;
> +

Ok, potential race here. We hold the i_lock when we check list_empty()
above, but it's possible for the fl_block list to become empty between
that check and when we take the spinlock below. locks_delete_block does
not require that you hold the i_lock, and some callers don't hold it.

This is trivially fixable by just keeping this as a while() loop. We'll
do the list_empty() check twice in that case, but that shouldn't change
the performance here much.

I'll fix that in my tree and it'll be in the next resend. Sorry for the
noise...

> 	spin_lock(&file_lock_lock);
> -	while (!list_empty(&blocker->fl_block)) {
> +	do {
>  		struct file_lock *waiter;
>
>  		waiter = list_first_entry(&blocker->fl_block,
> @@ -571,7 +582,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
>  			waiter->fl_lmops->lm_notify(waiter);
>  		else
>  			wake_up(&waiter->fl_wait);
> -	}
> +	} while (!list_empty(&blocker->fl_block));
> 	spin_unlock(&file_lock_lock);
>  }

--
Jeff Layton jlay...@redhat.com
Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Elbandi,

It looks like it's trying to use fscache (from the stats) but there's no
data. Did you install, configure and enable the cachefilesd daemon? It's
the user-space component of fscache, and the only officially supported
fscache backend on Ubuntu, RHEL & SUSE.

I'm guessing that's your problem, since I don't see any of the below
lines in your dmesg snippet:

[2049099.198234] CacheFiles: Loaded
[2049099.541721] FS-Cache: Cache "mycache" added (type cachefiles)
[2049099.541727] CacheFiles: File cache on md0 registered

- Milosz

On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras elso.and...@gmail.com wrote:
> Hi,
>
> > 1) In the graphs you attached what am I looking at? My best guess is
> > that it's traffic on a 10gigE card, but I can't tell from the graph
> > since there's no labels.
>
> Yes, 10G traffic on the switch port. So "incoming" means server-to-switch,
> "outgoing" means switch-to-server. No separate card for ceph traffic :(
>
> [...]
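For reference, a minimal cachefilesd setup looks roughly like the following. The values are illustrative defaults, not taken from this thread:

```
# /etc/cachefilesd.conf -- minimal example (illustrative values)
dir /var/cache/fscache   # cache directory; put this on a fast local disk
tag mycache              # shows up in dmesg as: FS-Cache: Cache "mycache" added
brun  10%                # culling stops when at least 10% of the disk is free
bcull  7%                # culling starts when free space drops below 7%
bstop  3%                # cache refuses new allocations below 3% free
```

After enabling the daemon (on Debian/Ubuntu, set RUN=yes in /etc/default/cachefilesd and start the service) and mounting ceph with the fsc option, the "CacheFiles: File cache on ... registered" line should appear in dmesg and the Lookups/Stores counters in /proc/fs/fscache/stats should start moving.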
RE: Comments on Ceph distributed parity implementation
Loic,

As Benoit points out, Mojette uses discrete geometry rather than algebra,
so simple XOR is all that is needed.

Benoit,

Microsoft's paper states that their [12,2,2] LRC provides better
availability than 3x replication with 1.33x efficiency. 1.5x is certainly
a good number. I'm just pointing out that better efficiency can be had
without losing availability.

All the best,
Paul

On 6/16/2013 02:31 PM Loic Dachary wrote:
> Hi Benoît,
>
> From the ( naïve ) point of view of engineering, performances are
> important. The recent works of James Plank ( cc'ed ) greatly improved
> them and I'm looking forward to the next version of jerasure
> ( http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html ).
>
> Rozofs Mojette Transform implementation
> ( https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.h
>   https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.cc )
> does not seem to make use of SIMD. Is it because the performances are
> good enough to not require them ?
>
> Cheers
>
> On 06/16/2013 09:51 PM, Benoît Parrein wrote:
> > Paul Von-Stamwitz PVonStamwitz at us.fujitsu.com writes:
> >
> > Hi Paul, Loic,
> >
> > > I know nothing about Mojette Transforms. From what little I gleaned,
> > > it might be good for repair (needing only a subset of chunks within
> > > a range to recalculate a missing chunk) but I'm worried about the
> > > storage efficiency. RozoFS claims 1.5x. I'd like to do better than
> > > that.
> > >
> > > All the best,
> > > Paul
> >
> > If you want to do better than that, you will probably lose in
> > availability. 1.5x gives the same availability as 3 replicas, and that
> > holds for any kind of erasure coding. FYI, the Mojette transform has no
> > constraint in terms of Galois fields. That is the big advantage of
> > using discrete geometry rather than algebra.
> >
> > best regards,
> > bp
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Hi,

Oh, I forgot about this daemon... but this daemon caches the data to a
file. Thus it's useless for us: caching to disk is slower than the whole
set of osds.

Elbandi

2013/6/17 Milosz Tanski mil...@adfin.com:
> Elbandi,
>
> It looks like it's trying to use fscache (from the stats) but there's no
> data. Did you install, configure and enable the cachefilesd daemon? It's
> the user-space component of fscache, and the only officially supported
> fscache backend on Ubuntu, RHEL & SUSE.
>
> I'm guessing that's your problem, since I don't see any of the below
> lines in your dmesg snippet:
>
> [2049099.198234] CacheFiles: Loaded
> [2049099.541721] FS-Cache: Cache "mycache" added (type cachefiles)
> [2049099.541727] CacheFiles: File cache on md0 registered
>
> - Milosz
>
> [...]
Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Elso, It does cache the data to the file, thus it may not be useful for your situation. By default the ceph filesystem already uses the (in memory) page cache provided by linux kernel. So if that's all you want, than you're good with the current implementation. Generally large sequential data transfers will not be improved (although there's cases where we observed improvements). The motivation for us to implement fscache has been the following use-case. We have a large distributed analytics databases (built in house) and we have a few different access patterns present. First, there's seemingly random access on the compressed indexes. Second, there's also random access in the column data files for extent indexes. Finally, there's either sequential or random access over the actual data (depending on the query). In our cases the machines that run the database have multiple large SSD drives in a raid0 configuration. We're using it the SSD drives for scratch storage (housekeeping background jobs) and the ceph fscache. In some conditions we can get up to 1GB/s reads from these SSD drives. We're currently in our last stages of deploying this to production. And for most workloads our query performance for data stored locally versus on ceph backed by fscache is pretty much the same. Our biggest gain probably comes from much lower latency to get metadata and indexes to make the query due the large number random iops the SSD drives afford us. I'm going to published some updates numbers compared to previous quick and dirty prototype. I realize that's not going to be the case for everybody. However, if you have a data access pattern that follows the 80/20 rule or the zipfan distribution and fast local disks for caching this is a great. Thanks, - Milosz On Mon, Jun 17, 2013 at 1:09 PM, Elso Andras elso.and...@gmail.com wrote: Hi, Oh, i forgot about this daemon... but this daemon cache the data to file. Thus it's useless, the caching to disk is more slow than the whole osds. 
Elbandi 2013/6/17 Milosz Tanski mil...@adfin.com: Elbandi, It looks like it's trying to use fscache (from the stats) but there's no data. Did you install, configure and enable the cachefilesd daemon? It's the user-space component of fscache. It's the only officially supported fsache backed by Ubuntu, RHEL SUSE. I'm guessing that's your problem since I don't see any of the bellow lines in your dmesg snippet. [2049099.198234] CacheFiles: Loaded [2049099.541721] FS-Cache: Cache mycache added (type cachefiles) [2049099.541727] CacheFiles: File cache on md0 registered - Milosz On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras elso.and...@gmail.com wrote: Hi, 1) In the graphs you attached what am I looking at? My best guess is that it's traffic on a 10gigE card, but I can't tell from the graph since there's no labels. Yes, 10G traffic on switch port. So incoming means server-to-switch, outgoing means switch-to-server. No separated card for ceph traffic :( 2) Can you give me more info about your serving case. What application are you using to serve the video (http server)? Are you serving static mp4 files from Ceph filesystem? lighttpd server with mp4 streaming mod (http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2), the files lives on cephfs. there is a speed limit, controlled by mp4 mod. the bandwidth is the video bitrate value. mount options: name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test rsize=0 and rasize=131072 is a tested, with other values there was 4x incoming (from osd) traffic than outgoing (to internet) traffic. 3) What's the hardware, most importantly how big is your partition that cachefilesd is on and what kind of disk are you hosting it on (rotating, SSD)? there are 5 osd servers: HP DL380 G6, 32G ram, 16 X HP sas disk (10k rpm) with raid0. bonding two 1G interface together. (In previous life, this hw could serve the ~2.3G traffic with raid5 and three bonding interface) 4) Statistics from fscache. 
Can you paste the output of /proc/fs/fscache/stats and /proc/fs/fscache/histogram?

FS-Cache statistics
Cookies: idx=1 dat=8001 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages  : mrk=0 unc=0
Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=2265 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops    : pend=0 run=0 enq=0 can=0 rej=0
Ops    : dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

No histogram; I'll try a rebuild to enable it.

5) dmesg
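As an aside, the telltale pattern in those stats (every retrieval rejected with nbf, "no backing file", and zero Lookups) can be checked mechanically. The following is a hypothetical helper, not part of ceph or cachefilesd; the counter labels are taken from the stats output pasted above.

```python
# Hypothetical diagnostic sketch: parse /proc/fs/fscache/stats-style text
# and flag the "cache backend never attached" signature: zero Lookups and
# all retrievals rejected with nbf, as in Elbandi's stats above.

import re

def parse_fscache_stats(text):
    """Return {"Label.key": value} from /proc/fs/fscache/stats-style text."""
    counters = {}
    for label, block in re.findall(r'(\w+)\s*:((?:\s+\w+=\d+)+)', text):
        for key, val in re.findall(r'(\w+)=(\d+)', block):
            # Some labels repeat (e.g. two "Retrvls" lines); keep the first.
            counters.setdefault(f'{label}.{key}', int(val))
    return counters

def cache_looks_detached(counters):
    """True when no lookups ever ran and every retrieval was rejected nbf."""
    return (counters.get('Lookups.n', 0) == 0
            and counters.get('Retrvls.n', 0) > 0
            and counters.get('Retrvls.nbf', 0) == counters.get('Retrvls.n', 0))

sample = ('Lookups: n=0 neg=0 pos=0 crt=0 tmo=0 '
          'Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0')
print(cache_looks_detached(parse_fscache_stats(sample)))  # True
```

With a working cachefilesd attached, Lookups.n would be non-zero and the check would return False.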
Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.
Hi,

1. In the cases where client caching is useful, AFS disk caching is still common, though yes, giant memory caches became more common over time, and
2. a memory fscache backend is probably out there (I wonder if you can write one in kernel mode); at worst, it looks like you can use cachefilesd on tmpfs?

Matt

- Elso Andras elso.and...@gmail.com wrote:

[...]

5) dmesg lines for ceph/fscache/cachefiles:

[  264.186887] FS-Cache: Loaded
[  264.223851] Key type ceph registered
[  264.223902] libceph: loaded (mon/osd proto 15/24)
[  264.246334] FS-Cache: Netfs 'ceph' registered for caching
[  264.246341] ceph: loaded (mds proto 32)
[  264.249497] libceph: client31274 fsid 1d78ebe5-f254-44ff-81c1-f641bb2036b6

Elbandi

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
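For reference, Matt's cachefilesd-on-tmpfs idea could be sketched roughly as below. The mount point, size, tag, and culling thresholds are placeholders, not values from this thread, and whether tmpfs satisfies CacheFiles' extended-attribute requirements depends on the kernel in use.

```
# /etc/fstab: RAM-backed cache directory (size is a placeholder)
tmpfs  /var/cache/fscache  tmpfs  size=8G,mode=0700  0  0

# /etc/cachefilesd.conf
dir /var/cache/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
```

After mounting and restarting cachefilesd, dmesg should show the "Cache ... added (type cachefiles)" registration lines quoted earlier in the thread.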
Re: [ceph-users] rbd rm image results in osd marked down wrongly with 0.61.3
Hi Florian,

If you can trigger this with logs, we're very eager to see what they say about this! The http://tracker.ceph.com/issues/5336 bug is open to track this issue.

Thanks!
sage

On Thu, 13 Jun 2013, Smart Weblications GmbH - Florian Wiessner wrote:

Hi,

Is really no one on the list interested in fixing this? Or am I the only one having this kind of bug/problem?

On 11.06.2013 16:19, Smart Weblications GmbH - Florian Wiessner wrote:

Hi List,

I observed that an "rbd rm image" results in some OSDs wrongly marking one OSD as down in cuttlefish. The situation gets even worse if more than one "rbd rm image" is running in parallel. Please see the attached logfiles. The "rbd rm" command was issued at 20:24:00 via cronjob; 40 seconds later osd 6 got marked down...

ceph osd tree
# id  weight  type name          up/down  reweight
-1    7       pool default
-3    7         rack unknownrack
-2    1           host node01
0     1             osd.0        up       1
-4    1           host node02
1     1             osd.1        up       1
-5    1           host node03
2     1             osd.2        up       1
-6    1           host node04
3     1             osd.3        up       1
-7    1           host node06
5     1             osd.5        up       1
-8    1           host node05
4     1             osd.4        up       1
-9    1           host node07
6     1             osd.6        up       1

I have seen some patches to parallelize rbd rm, but I think there must be some other issue, as my clients seem unable to do I/O while ceph is recovering... I think this worked better in 0.56.x; there was I/O while recovering. I also observed in the log of osd.6 that after heartbeat_map reset_timeout, the osd tries to connect to the other osds, but it retries so fast that you could think it is a DoS attack...

Please advise.

--
Kind regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
HRB no.: HRB 3840, Amtsgericht Hof
*from a German landline; prices from mobile networks may differ

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
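As a small illustration (this is a hypothetical sketch, not part of any ceph tooling), the plain-text `ceph osd tree` listing above can be scanned mechanically for OSDs reported down, which is the condition Florian is watching for during the `rbd rm` runs. The column layout is assumed to match the listing above.

```python
# Hypothetical monitoring sketch: find OSDs marked "down" in the
# plain-text output of `ceph osd tree` (0.61-era column layout:
# id, weight, "osd.N", up/down, reweight).

import re

OSD_LINE = re.compile(r'(osd\.\d+)\s+(up|down)\b')

def down_osds(tree_text):
    """Return the names of OSDs listed as down in `ceph osd tree` output."""
    return [name for name, state in OSD_LINE.findall(tree_text)
            if state == 'down']

sample = """
0   1   osd.0   up    1
6   1   osd.6   down  1
"""
print(down_osds(sample))  # ['osd.6']
```

In Florian's scenario this would report osd.6 during the window after the rbd rm cronjob fires, and an empty list once the cluster settles.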
Re: Writing to RBD image while its snapshot is being created causes I/O errors
On Mon, 17 Jun 2013, Karol Jurak wrote:

On Friday 14 of June 2013 08:56:55 Sage Weil wrote:

On Fri, 14 Jun 2013, Karol Jurak wrote:

I noticed that writing to an RBD image using the kernel driver while its snapshot is being created causes I/O errors, and the filesystem (reiserfs) eventually aborts and remounts itself in read-only mode.

This is definitely a bug; you should be able to create a snapshot at any time. After a rollback, it should look (to the fs) like a crash or power cycle. How easy is this to reproduce? Does it happen every time?

I can reproduce it in the following way:

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024

and in another shell while dd is running:

# rbd snap create test@snap1

After 2 or 3 seconds dmesg shows I/O errors:

[429532.259910] end_request: I/O error, dev rbd1, sector 1384448
[429532.272554] end_request: I/O error, dev rbd1, sector 872
[429532.275556] REISERFS abort (device rbd1): Journal write error in flush_commit_list

and dd fails:

dd: writing `/mnt/test/a': Cannot allocate memory
590+0 records in
589+0 records out
618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s

This happens every time I repeat the test.

What kernel version are you using? I'm not able to reproduce this with ext4 or reiserfs and many snapshots over several minutes of write workload.

sage
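A side note on the dd failure message: dd simply prints the C strerror() text for the errno it got back from write(), so "Cannot allocate memory" is ENOMEM surfacing from the kernel rbd path rather than a filesystem-level message. A quick sketch for mapping such a message back to its errno constant, which helps when grepping kernel sources for the origin of the error:

```python
# Map a strerror() message (as printed by dd, cp, etc.) back to the errno
# constants that produce it, using only the standard library.

import errno
import os

def errno_for_message(message):
    """Return the errno constant names whose strerror text matches message."""
    return [name for name in dir(errno)
            if name.startswith('E')
            and os.strerror(getattr(errno, name)) == message]

print(errno_for_message('Cannot allocate memory'))  # ['ENOMEM']
```

Knowing the error is ENOMEM (rather than EIO) narrows the search to allocation failures in the osd_client/rbd request path, consistent with it being a kernel-client bug rather than disk corruption.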