Re: Writing to RBD image while its snapshot is being created causes I/O errors

2013-06-17 Thread Karol Jurak
On Friday 14 of June 2013 08:56:55 Sage Weil wrote:
 On Fri, 14 Jun 2013, Karol Jurak wrote:
  I noticed that writing to an RBD image using the kernel driver while its
  snapshot is being created causes I/O errors, and the filesystem
  (reiserfs) eventually aborts and remounts itself read-only:
 
 This is definitely a bug; you should be able to create a snapshot at any
 time.  After a rollback, it should look (to the fs) like a crash or power
 cycle.
 
 How easy is this to reproduce?  Does it happen every time?

I can reproduce it in the following way:

# rbd create -s 10240 test
# rbd map test
# mkfs -t reiserfs /dev/rbd/rbd/test
# mount /dev/rbd/rbd/test /mnt/test
# dd if=/dev/zero of=/mnt/test/a bs=1M count=1024

and in another shell while dd is running:

# rbd snap create test@snap1

After 2 or 3 seconds dmesg shows I/O errors:

[429532.259910] end_request: I/O error, dev rbd1, sector 1384448
[429532.272554] end_request: I/O error, dev rbd1, sector 872
[429532.275556] REISERFS abort (device rbd1): Journal write error in 
flush_commit_list

and dd fails:

dd: writing `/mnt/test/a': Cannot allocate memory
590+0 records in
589+0 records out
618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s

This happens every time I repeat the test.

-- 
Karol Jurak

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: silence GCC warnings

2013-06-17 Thread Paul Bolle
Building rbd.o triggers two GCC warnings:
drivers/block/rbd.c: In function ‘rbd_img_request_fill’:
drivers/block/rbd.c:1272:22: warning: ‘bio_list’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2170:14: note: ‘bio_list’ was declared here
drivers/block/rbd.c:2231:10: warning: ‘pages’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Apparently GCC has trouble determining that bio_list is unused if type
is OBJ_REQUEST_PAGES and, conversely, that pages will be unused if
type is OBJ_REQUEST_BIO. Add harmless initializations to NULL to
help GCC.
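A reduced sketch (not the rbd code itself; the names below just mirror its shape) of the branch pattern that defeats GCC's flow analysis, with the same NULL initializations applied:

```cpp
#include <string>

enum obj_request_type { OBJ_REQUEST_BIO, OBJ_REQUEST_PAGES };

// Each variable is written on exactly one branch and read only on the
// matching branch later, but GCC cannot prove the two tests stay
// paired, so it warns unless both start out initialized.
static std::string describe(obj_request_type type, void *data_desc) {
    void *bio_list = nullptr;  // the "harmless initializations" from the patch
    void *pages = nullptr;

    if (type == OBJ_REQUEST_BIO)
        bio_list = data_desc;
    else
        pages = data_desc;

    // later use mirrors the same branch condition
    if (type == OBJ_REQUEST_BIO)
        return bio_list ? "bio" : "none";
    return pages ? "pages" : "none";
}
```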

Signed-off-by: Paul Bolle pebo...@tiscali.nl
---
0) Compile tested only.

1) These warnings were introduced in v3.10-rc1, apparently through
commit f1a4739f33 (rbd: support page array image requests). 

2) Note that
rbd_assert(type == OBJ_REQUEST_PAGES);

seems redundant. I see no way that this assertion could ever be false. 

 drivers/block/rbd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3063452..b8a58178 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2185,9 +2185,11 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 	if (type == OBJ_REQUEST_BIO) {
 		bio_list = data_desc;
 		rbd_assert(img_offset == bio_list->bi_sector << SECTOR_SHIFT);
+		pages = NULL;
 	} else {
 		rbd_assert(type == OBJ_REQUEST_PAGES);
 		pages = data_desc;
+		bio_list = NULL;
 	}
 
 	while (resid) {
-- 
1.8.1.4



[PATCH 0/8] misc fixes for mds

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

these patches are also in:
  git://github.com/ukernel/ceph.git wip-mds

Regards
Yan, Zheng


[PATCH 4/8] mds: try purging stray inode after storing backtrace

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

The inode is auth-pinned and can't be purged while its backtrace is
being stored, so try purging the stray inode after the backtrace has
been stored.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/CInode.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 0e14293..4a592bc 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1069,11 +1069,12 @@ void CInode::_stored_backtrace(version_t v, Context *fin)
 {
   dout(10) << "_stored_backtrace" << dendl;
 
+  auth_unpin(this);
   if (v == inode.backtrace_version)
     clear_dirty_parent();
-  auth_unpin(this);
   if (fin)
     fin->complete(0);
+  mdcache->maybe_eval_stray(this);
 }
 
 void CInode::_mark_dirty_parent(LogSegment *ls, bool dirty_pool)
-- 
1.8.1.4



[PATCH 5/8] mds: fix cross-authority rename race

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

When doing a cross-authority rename, we need to make sure bystanders
have received all messages sent by the inode's original auth MDS
before changing the inode's authority. Otherwise lock messages sent by
the original and new auth MDS can arrive at bystanders out of order.
The fix: the inode's original auth MDS sends notify messages to the
bystanders and performs the slave rename only after receiving notify
acks from all of them.
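A toy model of that handshake (assumed names, not the actual MDS classes): the original auth MDS tracks outstanding notifies in a set and only proceeds once every bystander has acked:

```cpp
#include <set>

// Hypothetical sketch of the ack-gated slave rename. Because each
// bystander acks only after receiving the notify, any lock message the
// original auth MDS sent earlier on the same ordered channel has
// already been delivered before the inode's authority changes.
struct SlaveRenameState {
    std::set<int> waiting_on_slave;  // bystanders we still await acks from
    bool rename_performed = false;

    void send_notifies(const std::set<int>& bystanders) {
        waiting_on_slave = bystanders;
        maybe_perform();
    }
    void handle_notify_ack(int mds) {
        waiting_on_slave.erase(mds);
        maybe_perform();
    }

private:
    void maybe_perform() {
        if (!rename_performed && waiting_on_slave.empty())
            rename_performed = true;  // safe to change inode authority now
    }
};
```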

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc  | 31 -
 src/mds/Server.cc   | 51 +++--
 src/mds/Server.h|  1 +
 src/messages/MMDSSlaveRequest.h |  6 +
 4 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index b9b154d..2b0029f 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -2651,6 +2651,15 @@ void MDCache::handle_mds_failure(int who)
     if (p->second->slave_to_mds == who) {
       if (p->second->slave_did_prepare()) {
 	dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
+	if (!p->second->more()->waiting_on_slave.empty()) {
+	  assert(p->second->more()->srcdn_auth_mds == mds->get_nodeid());
+	  // will rollback, no need to wait
+	  if (p->second->slave_request) {
+	    p->second->slave_request->put();
+	    p->second->slave_request = 0;
+	  }
+	  p->second->more()->waiting_on_slave.clear();
+	}
       } else {
 	dout(10) << " slave request " << *p->second << " has no prepare, finishing up" << dendl;
 	if (p->second->slave_request)
@@ -2660,12 +2669,22 @@ void MDCache::handle_mds_failure(int who)
       }
     }
 
-    if (p->second->is_slave() &&
-	p->second->slave_did_prepare() && p->second->more()->srcdn_auth_mds == who &&
-	mds->mdsmap->is_clientreplay_or_active_or_stopping(p->second->slave_to_mds)) {
-      // rename srcdn's auth mds failed, resolve even I'm a survivor.
-      dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
-      add_ambiguous_slave_update(p->first, p->second->slave_to_mds);
+    if (p->second->is_slave() && p->second->slave_did_prepare()) {
+      if (p->second->more()->waiting_on_slave.count(who)) {
+	assert(p->second->more()->srcdn_auth_mds == mds->get_nodeid());
+	dout(10) << " slave request " << *p->second << " no longer need rename notify ack from mds." << who << dendl;
+	p->second->more()->waiting_on_slave.erase(who);
+	if (p->second->more()->waiting_on_slave.empty())
+	  mds->queue_waiter(new C_MDS_RetryRequest(this, p->second));
+      }
+
+      if (p->second->more()->srcdn_auth_mds == who &&
+	  mds->mdsmap->is_clientreplay_or_active_or_stopping(p->second->slave_to_mds)) {
+	// rename srcdn's auth mds failed, resolve even I'm a survivor.
+	dout(10) << " slave request " << *p->second << " uncommitted, will resolve shortly" << dendl;
+	add_ambiguous_slave_update(p->first, p->second->slave_to_mds);
+      }
     }
 
     // failed node is slave?
 
 // failed node is slave?
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index c3162e7..00bf018 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -1280,6 +1280,16 @@ void Server::handle_slave_request(MMDSSlaveRequest *m)
   if (m->is_reply())
     return handle_slave_request_reply(m);
 
+  // the purpose of rename notify is enforcing causal message ordering. making sure
+  // bystanders have received all messages from rename srcdn's auth MDS.
+  if (m->get_op() == MMDSSlaveRequest::OP_RENAMENOTIFY) {
+    MMDSSlaveRequest *reply = new MMDSSlaveRequest(m->get_reqid(), m->get_attempt(),
+						   MMDSSlaveRequest::OP_RENAMENOTIFYACK);
+    mds->send_message(reply, m->get_connection());
+    m->put();
+    return;
+  }
+
   CDentry *straydn = NULL;
   if (m->stray.length() > 0) {
     straydn = mdcache->add_replica_stray(m->stray, from);
@@ -1432,6 +1442,10 @@ void Server::handle_slave_request_reply(MMDSSlaveRequest *m)
     handle_slave_rename_prep_ack(mdr, m);
     break;
 
+  case MMDSSlaveRequest::OP_RENAMENOTIFYACK:
+    handle_slave_rename_notify_ack(mdr, m);
+    break;
+
   default:
     assert(0);
   }
@@ -6560,6 +6574,9 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
 
   // am i srcdn auth?
   if (srcdn->is_auth()) {
+    set<int> srcdnrep;
+    srcdn->list_replicas(srcdnrep);
+
     bool reply_witness = false;
     if (srcdnl->is_primary() &&
 	!srcdnl->get_inode()->state_test(CInode::STATE_AMBIGUOUSAUTH)) {
       // freeze?
@@ -6594,12 +6611,19 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
       if (mdr->slave_request->witnesses.size() > 1) {
 	dout(10) << " set srci ambiguous auth; providing srcdn replica list" << dendl;
 	reply_witness = true;
+	for (set<int>::iterator p = srcdnrep.begin(); p != srcdnrep.end(); ++p) {
+	  if (*p == mdr->slave_to_mds ||
+	      !mds->mdsmap->is_clientreplay_or_active_or_stopping(*p))
+	    continue;

[PATCH 8/8] mds: fix remote wrlock rejoin

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

A remote wrlock's target is not always the inode's auth MDS.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index f1ebedf..2b127b5 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4523,25 +4523,29 @@ void MDCache::handle_cache_rejoin_strong(MMDSCacheRejoin *strong)
 	mdr->locks.insert(lock);
       }
     }
-    // wrlock(s)?
-    if (strong->wrlocked_inodes.count(in->vino())) {
-      for (map<int, list<MMDSCacheRejoin::slave_reqid> >::iterator q = strong->wrlocked_inodes[in->vino()].begin();
-	   q != strong->wrlocked_inodes[in->vino()].end();
-	   ++q) {
-	SimpleLock *lock = in->get_lock(q->first);
-	for (list<MMDSCacheRejoin::slave_reqid>::iterator r = q->second.begin();
-	     r != q->second.end();
-	     ++r) {
-	  dout(10) << " inode wrlock by " << *r << " on " << *lock << " on " << *in << dendl;
-	  MDRequest *mdr = request_get(r->reqid);  // should have this from auth_pin above.
+  }
+  // wrlock(s)?
+  for (map<vinodeno_t, map<int, list<MMDSCacheRejoin::slave_reqid> > >::iterator p = strong->wrlocked_inodes.begin();
+       p != strong->wrlocked_inodes.end();
+       ++p) {
+    CInode *in = get_inode(p->first);
+    for (map<int, list<MMDSCacheRejoin::slave_reqid> >::iterator q = p->second.begin();
+	 q != p->second.end();
+	 ++q) {
+      SimpleLock *lock = in->get_lock(q->first);
+      for (list<MMDSCacheRejoin::slave_reqid>::iterator r = q->second.begin();
+	   r != q->second.end();
+	   ++r) {
+	dout(10) << " inode wrlock by " << *r << " on " << *lock << " on " << *in << dendl;
+	MDRequest *mdr = request_get(r->reqid);  // should have this from auth_pin above.
+	if (in->is_auth())
 	  assert(mdr->is_auth_pinned(in));
-	  lock->set_state(LOCK_MIX);
-	  if (lock == in->filelock)
-	    in->loner_cap = -1;
-	  lock->get_wrlock(true);
-	  mdr->wrlocks.insert(lock);
-	  mdr->locks.insert(lock);
-	}
+	lock->set_state(LOCK_MIX);
+	if (lock == in->filelock)
+	  in->loner_cap = -1;
+	lock->get_wrlock(true);
+	mdr->wrlocks.insert(lock);
+	mdr->locks.insert(lock);
       }
     }
   }
-- 
1.8.1.4



[PATCH 7/8] mds: fix race between scatter gather and dirfrag export

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

If we gather dirty scatter lock state while the corresponding dirfrag
is being exported, we may receive different dirfrag states from two
MDSes and need to determine which one is newer. The solution is to add
a new variable, the migrate seq, to the dirfrag and increase it by one
whenever the dirfrag's auth MDS changes. When gathering dirty scatter
lock state, the migrate seq identifies the newest dirfrag state.
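The comparison this relies on can be sketched as a wraparound-safe sequence compare (assumed here to match ceph_seq_cmp's semantics); a negative result marks the incoming dirfrag state as stale:

```cpp
#include <cstdint>

// Wraparound-safe comparison of 32-bit sequence numbers: returns <0 if
// a is older than b, >0 if newer, 0 if equal, even when the counter
// has wrapped past UINT32_MAX. (Illustrative sketch, not Ceph's code.)
static int seq_cmp(uint32_t a, uint32_t b) {
    return static_cast<int32_t>(a - b);
}
```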

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/CDir.cc|  4 
 src/mds/CDir.h |  4 +++-
 src/mds/CInode.cc  | 18 ++
 src/mds/MDCache.cc | 13 +++--
 4 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 8c83eba..2b991d7 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -154,6 +154,7 @@ ostream CDir::print_db_line_prefix(ostream out)
 // CDir
 
 CDir::CDir(CInode *in, frag_t fg, MDCache *mdcache, bool auth) :
+  mseq(0),
   dirty_rstat_inodes(member_offset(CInode, dirty_rstat_item)),
   item_dirty(this), item_new(this),
   pop_me(ceph_clock_now(g_ceph_context)),
@@ -2121,6 +2122,8 @@ void CDir::_committed(version_t v)
 void CDir::encode_export(bufferlist &bl)
 {
   assert(!is_projected());
+  ceph_seq_t seq = mseq + 1;
+  ::encode(seq, bl);
   ::encode(first, bl);
   ::encode(fnode, bl);
   ::encode(dirty_old_rstat, bl);
@@ -2150,6 +2153,7 @@ void CDir::finish_export(utime_t now)
 
 void CDir::decode_import(bufferlist::iterator &blp, utime_t now, LogSegment *ls)
 {
+  ::decode(mseq, blp);
   ::decode(first, blp);
   ::decode(fnode, blp);
   ::decode(dirty_old_rstat, blp);
diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index 87c79c2..11f4a76 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -170,6 +170,7 @@ public:
 
   fnode_t fnode;
   snapid_t first;
+  ceph_seq_t mseq; // migrate sequence
   map<snapid_t,old_rstat_t> dirty_old_rstat;  // [value.first,key]
 
   // my inodes with dirty rstat data
@@ -547,7 +548,8 @@ public:
   // -- import/export --
   void encode_export(bufferlist &bl);
   void finish_export(utime_t now);
-  void abort_export() { 
+  void abort_export() {
+mseq += 2;
 put(PIN_TEMPEXPORTING);
   }
   void decode_import(bufferlist::iterator &blp, utime_t now, LogSegment *ls);
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 8936acd..c1ce8a1 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1222,6 +1222,7 @@ void CInode::encode_lock_state(int type, bufferlist &bl)
 	  dout(20) << fg << " fragstat " << pf->fragstat << dendl;
 	  dout(20) << fg << " accounted_fragstat " << pf->accounted_fragstat << dendl;
 	  ::encode(fg, tmp);
+	  ::encode(dir->mseq, tmp);
 	  ::encode(dir->first, tmp);
 	  ::encode(pf->fragstat, tmp);
 	  ::encode(pf->accounted_fragstat, tmp);
@@ -1255,6 +1256,7 @@ void CInode::encode_lock_state(int type, bufferlist &bl)
 	  dout(10) << fg << " " << pf->rstat << dendl;
 	  dout(10) << fg << " " << dir->dirty_old_rstat << dendl;
 	  ::encode(fg, tmp);
+	  ::encode(dir->mseq, tmp);
 	  ::encode(dir->first, tmp);
 	  ::encode(pf->rstat, tmp);
 	  ::encode(pf->accounted_rstat, tmp);
@@ -1404,10 +1406,12 @@ void CInode::decode_lock_state(int type, bufferlist &bl)
       dout(10) << " ...got " << n << " fragstats on " << *this << dendl;
       while (n--) {
 	frag_t fg;
+	ceph_seq_t mseq;
 	snapid_t fgfirst;
 	frag_info_t fragstat;
 	frag_info_t accounted_fragstat;
 	::decode(fg, p);
+	::decode(mseq, p);
 	::decode(fgfirst, p);
 	::decode(fragstat, p);
 	::decode(accounted_fragstat, p);
@@ -1420,6 +1424,12 @@ void CInode::decode_lock_state(int type, bufferlist &bl)
 	  assert(dir);              // i am auth; i had better have this dir open
 	  dout(10) << fg << " first " << dir->first << " -> " << fgfirst << " on " << *dir << dendl;
+	  if (dir->fnode.fragstat.version == inode.dirstat.version &&
+	      ceph_seq_cmp(mseq, dir->mseq) < 0) {
+	    dout(10) << " mseq " << mseq << " < " << dir->mseq << ", ignoring" << dendl;
+	    continue;
+	  }
+	  dir->mseq = mseq;
 	  dir->first = fgfirst;
 	  dir->fnode.fragstat = fragstat;
 	  dir->fnode.accounted_fragstat = accounted_fragstat;
@@ -1462,11 +1472,13 @@ void CInode::decode_lock_state(int type, bufferlist &bl)
       ::decode(n, p);
       while (n--) {
 	frag_t fg;
+	ceph_seq_t mseq;
 	snapid_t fgfirst;
 	nest_info_t rstat;
 	nest_info_t accounted_rstat;
 	map<snapid_t,old_rstat_t> dirty_old_rstat;
 	::decode(fg, p);
+	::decode(mseq, p);
 	::decode(fgfirst, p);
 	::decode(rstat, p);
 	::decode(accounted_rstat, p);
@@ -1481,6 +1493,12 @@ void CInode::decode_lock_state(int type, bufferlist &bl)
 	  assert(dir);              // i am auth; i had better have this dir open
 	  dout(10) << fg << " first " << dir->first << " -> " << fgfirst << " on " << *dir << dendl;
+	  if (dir->fnode.rstat.version ==

[PATCH 1/8] mds: don't update migrate_seq when importing non-auth cap

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

We use migrate_seq to distinguish the old auth MDS from the new one,
so we should not change migrate_seq when importing a non-auth cap.
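As a toy model (not the actual Capability class), the merge rule the patch introduces can be sketched as:

```cpp
// Hypothetical reduction of Capability::merge(): mseq tracks which MDS
// is the cap's auth, so it must only advance when the imported cap is
// the auth cap; non-auth imports still merge the issued/pending/wanted
// bits but leave mseq alone.
struct Cap {
    int pending = 0, issued = 0, wanted = 0;
    unsigned mseq = 0;

    void merge(const Cap& other, bool auth_cap) {
        pending |= other.pending;
        issued  |= other.issued;
        wanted  |= other.wanted;
        if (auth_cap)            // the gate this patch adds
            mseq = other.mseq;
    }
};
```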

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Capability.h | 5 +++--
 src/mds/Migrator.cc  | 8 
 src/mds/Migrator.h   | 3 ++-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/src/mds/Capability.h b/src/mds/Capability.h
index 54d2312..fdecb90 100644
--- a/src/mds/Capability.h
+++ b/src/mds/Capability.h
@@ -273,7 +273,7 @@ public:
     return Export(_wanted, issued(), pending(), client_follows, mseq+1, last_issue_stamp);
   }
   void rejoin_import() { mseq++; }
-  void merge(Export& other) {
+  void merge(Export& other, bool auth_cap) {
     // issued + pending
     int newpending = other.pending | pending();
     if (other.issued & ~newpending)
@@ -286,7 +286,8 @@ public:
 
     // wanted
     _wanted = _wanted | other.wanted;
-    mseq = other.mseq;
+    if (auth_cap)
+      mseq = other.mseq;
   }
   void merge(int otherwanted, int otherissued) {
     // issued + pending
   void merge(int otherwanted, int otherissued) {
 // issued + pending
diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 6ea28c9..0647448 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2223,7 +2223,7 @@ void Migrator::import_logged_start(dirfrag_t df, CDir *dir, int from,
   for (map<CInode*, map<client_t,Capability::Export> >::iterator p = import_caps[dir].begin();
        p != import_caps[dir].end();
        ++p) {
-    finish_import_inode_caps(p->first, from, p->second);
+    finish_import_inode_caps(p->first, true, p->second);
   }
   
   // send notify's etc.
@@ -2398,7 +2398,7 @@ void Migrator::decode_import_inode_caps(CInode *in,
   }
 }
 
-void Migrator::finish_import_inode_caps(CInode *in, int from, 
+void Migrator::finish_import_inode_caps(CInode *in, bool auth_cap,
 					map<client_t,Capability::Export> &cap_map)
 {
   for (map<client_t,Capability::Export>::iterator it = cap_map.begin();
@@ -2412,7 +2412,7 @@ void Migrator::finish_import_inode_caps(CInode *in, bool auth_cap,
     if (!cap) {
       cap = in->add_client_cap(it->first, session);
     }
-    cap->merge(it->second);
+    cap->merge(it->second, auth_cap);
 
     mds->mdcache->do_cap_import(session, in, cap);
   }
@@ -2688,7 +2688,7 @@ void Migrator::logged_import_caps(CInode *in,
   mds->server->finish_force_open_sessions(client_map, sseqmap);
 
   assert(cap_imports.count(in));
-  finish_import_inode_caps(in, from, cap_imports[in]);  
+  finish_import_inode_caps(in, false, cap_imports[in]);
   mds->locker->eval(in, CEPH_CAP_LOCKS, true);
 
   mds->send_message_mds(new MExportCapsAck(in->ino()), from);
diff --git a/src/mds/Migrator.h b/src/mds/Migrator.h
index 70b59bc..afe2e6c 100644
--- a/src/mds/Migrator.h
+++ b/src/mds/Migrator.h
@@ -256,7 +256,8 @@ public:
   void decode_import_inode_caps(CInode *in,
 				bufferlist::iterator &blp,
 				map<CInode*, map<client_t,Capability::Export> > &cap_imports);
-  void finish_import_inode_caps(CInode *in, int from, map<client_t,Capability::Export> &cap_map);
+  void finish_import_inode_caps(CInode *in, bool auth_cap,
+				map<client_t,Capability::Export> &cap_map);
   int decode_import_dir(bufferlist::iterator &blp,
 			int oldauth,
 			CDir *import_root,
-- 
1.8.1.4



[PATCH 2/8] mds: fix frozen check in Server::try_open_auth_dirfrag()

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Server.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 253c56d..c3162e7 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -2204,7 +2204,7 @@ CDir* Server::try_open_auth_dirfrag(CInode *diri, frag_t fg, MDRequest *mdr)
   }
 
   // not open and inode frozen?
-  if (!dir && diri->is_frozen_dir()) {
+  if (!dir && diri->is_frozen()) {
     dout(10) << "try_open_auth_dirfrag: dir inode is frozen, waiting " << *diri << dendl;
     assert(diri->get_parent_dir());
     diri->get_parent_dir()->add_waiter(CDir::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
-- 
1.8.1.4



[PATCH 6/8] mds: don't journal bare dirfrag

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Don't journal a bare dirfrag when starting scatter. Also add debug
code that catches modification of a bare dirfrag.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/CDir.cc   | 1 +
 src/mds/CInode.cc | 2 ++
 src/mds/Server.cc | 5 +++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index 211cec0..8c83eba 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1211,6 +1211,7 @@ void CDir::finish_waiting(uint64_t mask, int result)
 
 fnode_t *CDir::project_fnode()
 {
+  assert(get_version() != 0);
   fnode_t *p = new fnode_t;
   *p = *get_projected_fnode();
   projected_fnode.push_back(p);
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 4a592bc..8936acd 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1625,6 +1625,8 @@ void CInode::finish_scatter_update(ScatterLock *lock, CDir *dir,
 
   if (dir->is_frozen()) {
     dout(10) << "finish_scatter_update " << fg << " frozen, marking " << *lock << " stale " << *dir << dendl;
+  } else if (dir->get_version() == 0) {
+    dout(10) << "finish_scatter_update " << fg << " not loaded, marking " << *lock << " stale " << *dir << dendl;
   } else {
     if (dir_accounted_version != inode_version) {
       dout(10) << "finish_scatter_update " << fg << " journaling accounted scatterstat update v" << inode_version << dendl;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 00bf018..1d16d04 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -4010,7 +4010,8 @@ public:
     if (newi->inode.is_dir()) { 
       CDir *dir = newi->get_dirfrag(frag_t());
       assert(dir);
-      dir->mark_dirty(1, mdr->ls);
+      dir->fnode.version--;
+      dir->mark_dirty(dir->fnode.version + 1, mdr->ls);
       dir->mark_new(mdr->ls);
     }
 
@@ -4169,7 +4170,7 @@ void Server::handle_client_mkdir(MDRequest *mdr)
   // ...and that new dir is empty.
   CDir *newdir = newi->get_or_open_dirfrag(mds->mdcache, frag_t());
   newdir->mark_complete();
-  newdir->pre_dirty();
+  newdir->fnode.version = newdir->pre_dirty();
 
   // prepare finisher
   mdr->ls = mdlog->get_current_segment();
-- 
1.8.1.4



[PATCH 3/8] mds: handle undefined dirfrags when opening inode

2013-06-17 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

When the MDS is in the rejoin stage, cache rejoin messages can add
undefined inodes and dirfrags to the cache. These undefined objects
can affect lookup-by-ino processes, so if an undefined dirfrag is
encountered, we should fetch it from disk.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 37 +
 src/mds/MDCache.h  |  1 +
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 2b7ad71..b9b154d 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -8015,6 +8015,29 @@ void MDCache::_open_ino_traverse_dir(inodeno_t ino, open_ino_info_t& info, int ret)
   do_open_ino(ino, info, ret);
 }
 
+void MDCache::_open_ino_fetch_dir(inodeno_t ino, MMDSOpenIno *m, CDir *dir)
+{
+  if (dir->state_test(CDir::STATE_REJOINUNDEF) && dir->get_frag() == frag_t()) {
+    rejoin_undef_dirfrags.erase(dir);
+    dir->state_clear(CDir::STATE_REJOINUNDEF);
+
+    CInode *diri = dir->get_inode();
+    diri->force_dirfrags();
+    list<CDir*> ls;
+    diri->get_dirfrags(ls);
+
+    C_GatherBuilder gather(g_ceph_context, _open_ino_get_waiter(ino, m));
+    for (list<CDir*>::iterator p = ls.begin(); p != ls.end(); ++p) {
+      rejoin_undef_dirfrags.insert(*p);
+      (*p)->state_set(CDir::STATE_REJOINUNDEF);
+      (*p)->fetch(gather.new_sub());
+    }
+    assert(gather.has_subs());
+    gather.activate();
+  } else
+    dir->fetch(_open_ino_get_waiter(ino, m));
+}
+
+
 int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
 				   vector<inode_backpointer_t>& ancestors,
 				   bool discover, bool want_xlocked, int *hint)
@@ -8032,8 +8055,14 @@ int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
       continue;
     }
 
-    if (diri->state_test(CInode::STATE_REJOINUNDEF))
-      continue;
+    if (diri->state_test(CInode::STATE_REJOINUNDEF)) {
+      CDir *dir = diri->get_parent_dir();
+      while (dir->state_test(CDir::STATE_REJOINUNDEF) &&
+	     dir->get_inode()->state_test(CInode::STATE_REJOINUNDEF))
+	dir = dir->get_inode()->get_parent_dir();
+      _open_ino_fetch_dir(ino, m, dir);
+      return 1;
+    }
 
     if (!diri->is_dir()) {
       dout(10) << *diri << " is not dir" << dendl;
@@ -8067,14 +8096,14 @@ int MDCache::open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
 	if (dnl && dnl->is_primary() &&
 	    dnl->get_inode()->state_test(CInode::STATE_REJOINUNDEF)) {
 	  dout(10) << " fetching undef " << *dnl->get_inode() << dendl;
-	  dir->fetch(_open_ino_get_waiter(ino, m));
+	  _open_ino_fetch_dir(ino, m, dir);
 	  return 1;
 	}
 
 	if (!dnl && !dir->is_complete() &&
 	    (!dir->has_bloom() || dir->is_in_bloom(name))) {
 	  dout(10) << " fetching incomplete " << *dir << dendl;
-	  dir->fetch(_open_ino_get_waiter(ino, m));
+	  _open_ino_fetch_dir(ino, m, dir);
 	  return 1;
 	}
 
 
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index 3da8a36..36a322c 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -790,6 +790,7 @@ protected:
   void _open_ino_backtrace_fetched(inodeno_t ino, bufferlist& bl, int err);
   void _open_ino_parent_opened(inodeno_t ino, int ret);
   void _open_ino_traverse_dir(inodeno_t ino, open_ino_info_t& info, int err);
+  void _open_ino_fetch_dir(inodeno_t ino, MMDSOpenIno *m, CDir *dir);
   Context* _open_ino_get_waiter(inodeno_t ino, MMDSOpenIno *m);
   int open_ino_traverse_dir(inodeno_t ino, MMDSOpenIno *m,
 			    vector<inode_backpointer_t>& ancestors,
-- 
1.8.1.4



[PATCH] ceph: remove sb_start/end_write in ceph_aio_write.

2013-06-17 Thread majianpeng

Both vfs_write and io_submit call file_start/end_write. The difference
between file_start/end_write and sb_start/end_write is that the file_
variants only act on regular files. Since ceph_aio_write is only used
for regular files, the sb_start/end_write calls are redundant.

Signed-off-by: Jianpeng Ma majianp...@gmail.com
---
 fs/ceph/file.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 656e169..7c69f4f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -716,7 +716,6 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
-	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
 	hold_mutex = true;
 
@@ -809,7 +808,6 @@ retry_snap:
 
 out:
 	if (hold_mutex)
 		mutex_unlock(&inode->i_mutex);
-	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 
 	return written ? written : err;

-- 
1.8.3.rc1.44.gb387c77



Fwd: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Milosz Tanski
Elbandi,

Can you give me some info about your test case so I can figure out
what's going on.

1) In the graphs you attached what am I looking at? My best guess is
that it's traffic on a 10gigE card, but I can't tell from the graph
since there's no labels.
2) Can you give me more info about your serving case. What application
are you using to serve the video (http server)? Are you serving static
mp4 files from Ceph filesystem?
3) What's the hardware, most importantly how big is your partition
that cachefilesd is on and what kind of disk are you hosting it on
(rotating, SSD)?
4) Statistics from fscache: can you paste the output of
/proc/fs/fscache/stats and /proc/fs/fscache/histogram?
5) dmesg lines for ceph/fscache/cachefiles like:

[2049099.198234] CacheFiles: Loaded
[2049099.541721] FS-Cache: Cache mycache added (type cachefiles)
[2049099.541727] CacheFiles: File cache on md0 registered
[2049120.650897] Key type ceph registered
[2049120.651015] libceph: loaded (mon/osd proto 15/24)
[2049120.673202] FS-Cache: Netfs 'ceph' registered for caching
[2049120.673207] ceph: loaded (mds proto 32)
[2049120.680919] libceph: client6473 fsid e23a1bfc-8328-46bf-bc59-1209df3f5434
[2049120.683397] libceph: mon0 10.0.5.226:6789 session established

I think with these answers I'll be better able to diagnose what's
going on for you.

- Milosz

On Mon, Jun 17, 2013 at 9:16 AM, Elso Andras elso.and...@gmail.com wrote:

 Hi,

 I tested your patches on an Ubuntu Lucid system with the Ubuntu Raring
 kernel (3.8), using the for-linus branch from ceph-client plus your
 fscache patches. There were no problems under heavy load.
 But I don't see any difference with or without fscache in our test case
 (mp4 video streaming, ~5500 connections):
 with fscache: http://imageshack.us/photo/my-images/109/xg5a.png/
 without fscache: http://imageshack.us/photo/my-images/5/xak.png/

 Elbandi

 2013/5/29 Milosz Tanski mil...@adfin.com:
  Sage,
 
  Thanks for taking a look at this. No worries about the timing.
 
  I added two extra changes into my branch located here:
  https://bitbucket.org/adfin/linux-fs/commits/branch/forceph. The first
  one is a fix for kernel deadlock. The second one makes fsc cache a
  non-default mount option (akin to NFS).
 
  Finally, I observed an occasional oops in fscache that's fixed in
  David's branch, which is waiting to get into mainline. The fix for the
  issue is here:
  http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache&id=82958c45e35963c93fc6cbe6a27752e2d97e9f9a.
  I can only trigger that issue by forcing the kernel to drop its caches
  in some cases.
 
  Let me know if you any other feedback, or if I can help in anyway.
 
  Thanks,
  - Milosz
 
  On Tue, May 28, 2013 at 1:11 PM, Sage Weil s...@inktank.com wrote:
  Hi Milosz,
 
  Just a heads up that I hope to take a closer look at the patch this
  afternoon or tomorrow.  Just catching up after the long weekend.
 
  Thanks!
  sage
 
 
  On Thu, 23 May 2013, Milosz Tanski wrote:
 
  Enable fscache as an optional feature of ceph.
 
  Adding support for fscache to the Ceph filesystem. This would bring it
  on par with some of the other network filesystems in Linux (like NFS,
  AFS, etc...)
 
  This exploits the existing Ceph cache & lazyio capabilities.
 
  Signed-off-by: Milosz Tanski mil...@adfin.com
  ---
   fs/ceph/Kconfig  |9 ++
   fs/ceph/Makefile |2 ++
   fs/ceph/addr.c   |   85 
  --
   fs/ceph/caps.c   |   21 +-
   fs/ceph/file.c   |9 ++
   fs/ceph/inode.c  |   25 ++--
   fs/ceph/super.c  |   25 ++--
   fs/ceph/super.h  |   12 
   8 files changed, 162 insertions(+), 26 deletions(-)
 
  diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig
  index 49bc782..ac9a2ef 100644
  --- a/fs/ceph/Kconfig
  +++ b/fs/ceph/Kconfig
  @@ -16,3 +16,12 @@ config CEPH_FS
 
 If unsure, say N.
 
  +if CEPH_FS
  +config CEPH_FSCACHE
  +	bool "Enable Ceph client caching support"
  +	depends on CEPH_FS=m && FSCACHE || CEPH_FS=y && FSCACHE=y
  +	help
  +	  Choose Y here to enable persistent, read-only local
  +	  caching support for Ceph clients using FS-Cache
  +
  +endif
  diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
  index bd35212..0af0678 100644
  --- a/fs/ceph/Makefile
  +++ b/fs/ceph/Makefile
  @@ -9,3 +9,5 @@ ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
  	mds_client.o mdsmap.o strings.o ceph_frag.o \
  	debugfs.o
 
  +ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
  +
  diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
  index 3e68ac1..fd3a1cc 100644
  --- a/fs/ceph/addr.c
  +++ b/fs/ceph/addr.c
  @@ -11,6 +11,7 @@
 
    #include "super.h"
    #include "mds_client.h"
   +#include "cache.h"
    #include <linux/ceph/osd_client.h>
 
   /*
  @@ -149,11 +150,26 @@ static void ceph_invalidatepage(struct page
  *page, unsigned long offset)
struct ceph_inode_info *ci;
struct ceph_snap_context *snapc = page_snap_context(page);
 
  - 

[PATCH v3 04/13] locks: make "added" in __posix_lock_file a bool

2013-06-17 Thread Jeff Layton
...save 3 bytes of stack space.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/locks.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 1e6301b..c186649 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -791,7 +791,8 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
struct file_lock *left = NULL;
struct file_lock *right = NULL;
struct file_lock **before;
-   int error, added = 0;
+   int error;
+   bool added = false;
 
/*
 * We may need two file_lock structures for this operation,
@@ -885,7 +886,7 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
continue;
}
request = fl;
-   added = 1;
+   added = true;
}
else {
/* Processing for different lock types is a bit
@@ -896,7 +897,7 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
 		if (fl->fl_start > request->fl_end)
 			break;
 		if (request->fl_type == F_UNLCK)
-			added = 1;
+			added = true;
 		if (fl->fl_start < request->fl_start)
left = fl;
/* If the next lock in the list has a higher end
@@ -926,7 +927,7 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
locks_release_private(fl);
locks_copy_private(fl, request);
request = fl;
-   added = 1;
+   added = true;
}
}
/* Go on to next lock.
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 10/13] locks: add a new lm_owner_key lock operation

2013-06-17 Thread Jeff Layton
Currently, the hashing that the locking code uses to add these values
to the blocked_hash is simply calculated using fl_owner field. That's
valid in most cases except for server-side lockd, which validates the
owner of a lock based on fl_owner and fl_pid.

In the case where you have a small number of NFS clients doing a lot
of locking between different processes, you could end up with all
the blocked requests sitting in a very small number of hash buckets.

Add a new lm_owner_key operation to the lock_manager_operations that
will generate an unsigned long to use as the key in the hashtable.
That function is only implemented for server-side lockd, and simply
XORs the fl_owner and fl_pid.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 Documentation/filesystems/Locking |   16 +++-
 fs/lockd/svclock.c|   12 
 fs/locks.c|   12 ++--
 include/linux/fs.h|1 +
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index 413685f..dfeb01b 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -351,6 +351,7 @@ fl_release_private: maybe   no
 --- lock_manager_operations ---
 prototypes:
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
+   unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *);  /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
@@ -360,16 +361,21 @@ locking rules:
 
 	inode->i_lock	file_lock_lock	may block
 lm_compare_owner:  yes[1]  maybe   no
+lm_owner_key   yes[1]  yes no
 lm_notify: yes yes no
 lm_grant:  no  no  no
 lm_break:  yes no  no
 lm_change  yes no  no
 
-[1]:	->lm_compare_owner is generally called with *an* inode->i_lock held. It
-may not be the i_lock of the inode for either file_lock being compared! This is
-the case with deadlock detection, since the code has to chase down the owners
-of locks that may be entirely unrelated to the one on which the lock is being
-acquired. When doing a search for deadlocks, the file_lock_lock is also held.
+[1]:	->lm_compare_owner and ->lm_owner_key are generally called with
+*an* inode->i_lock held. It may not be the i_lock of the inode
+associated with either file_lock argument! This is the case with deadlock
+detection, since the code has to chase down the owners of locks that may
+be entirely unrelated to the one on which the lock is being acquired.
+For deadlock detection however, the file_lock_lock is also held. The
+fact that these locks are held ensures that the file_locks do not
+disappear out from under you while doing the comparison or generating an
+owner key.
 
 --- buffer_head ---
 prototypes:
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index e703318..ce2cdab 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct 
file_lock *fl2)
 	return fl1->fl_owner == fl2->fl_owner && fl1->fl_pid == fl2->fl_pid;
 }
 
+/*
+ * Since NLM uses two keys for tracking locks, we need to hash them down
+ * to one for the blocked_hash. Here, we're just xor'ing the host address
+ * with the pid in order to create a key value for picking a hash bucket.
+ */
+static unsigned long
+nlmsvc_owner_key(struct file_lock *fl)
+{
+	return (unsigned long)fl->fl_owner ^ (unsigned long)fl->fl_pid;
+}
+
 const struct lock_manager_operations nlmsvc_lock_operations = {
.lm_compare_owner = nlmsvc_same_owner,
+   .lm_owner_key = nlmsvc_owner_key,
.lm_notify = nlmsvc_notify_blocked,
.lm_grant = nlmsvc_grant_deferred,
 };
diff --git a/fs/locks.c b/fs/locks.c
index d93b291..55f3af7 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -505,10 +505,18 @@ locks_delete_global_locks(struct file_lock *waiter)
 	spin_unlock(&file_lock_lock);
 }
 
+static unsigned long
+posix_owner_key(struct file_lock *fl)
+{
+	if (fl->fl_lmops && fl->fl_lmops->lm_owner_key)
+		return fl->fl_lmops->lm_owner_key(fl);
+	return (unsigned long)fl->fl_owner;
+}
+
 static inline void
 locks_insert_global_blocked(struct file_lock *waiter)
 {
-	hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
+	hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
 }
 
 static inline void
@@ -739,7 +747,7 @@ static struct file_lock *what_owner_is_waiting_for(struct 
file_lock *block_fl)
 {

[PATCH v3 11/13] locks: give the blocked_hash its own spinlock

2013-06-17 Thread Jeff Layton
There's no reason we have to protect the blocked_hash and file_lock_list
with the same spinlock. With the tests I have, breaking it in two gives
a barely measurable performance benefit, but it seems reasonable to make
this locking as granular as possible.

Signed-off-by: Jeff Layton jlay...@redhat.com
---
 Documentation/filesystems/Locking |   16 
 fs/locks.c|   33 ++---
 2 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index dfeb01b..cf04448 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -359,20 +359,20 @@ prototypes:
 
 locking rules:
 
-	inode->i_lock	file_lock_lock	may block
-lm_compare_owner:  yes[1]  maybe   no
-lm_owner_key   yes[1]  yes no
-lm_notify: yes yes no
-lm_grant:  no  no  no
-lm_break:  yes no  no
-lm_change  yes no  no
+	inode->i_lock	blocked_lock_lock	may block
+lm_compare_owner:  yes[1]  maybe   no
+lm_owner_key   yes[1]  yes no
+lm_notify: yes yes no
+lm_grant:  no  no  no
+lm_break:  yes no  no
+lm_change  yes no  no
 
 [1]:	->lm_compare_owner and ->lm_owner_key are generally called with
 *an* inode->i_lock held. It may not be the i_lock of the inode
 associated with either file_lock argument! This is the case with deadlock
 detection, since the code has to chase down the owners of locks that may
 be entirely unrelated to the one on which the lock is being acquired.
-For deadlock detection however, the file_lock_lock is also held. The
+For deadlock detection however, the blocked_lock_lock is also held. The
 fact that these locks are held ensures that the file_locks do not
 disappear out from under you while doing the comparison or generating an
 owner key.
diff --git a/fs/locks.c b/fs/locks.c
index 55f3af7..5db80c7 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -159,10 +159,11 @@ int lease_break_time = 45;
  * by the file_lock_lock.
  */
 static HLIST_HEAD(file_lock_list);
+static DEFINE_SPINLOCK(file_lock_lock);
 
 /*
  * The blocked_hash is used to find POSIX lock loops for deadlock detection.
- * It is protected by file_lock_lock.
+ * It is protected by blocked_lock_lock.
  *
  * We hash locks by lockowner in order to optimize searching for the lock a
  * particular lockowner is waiting on.
@@ -174,8 +175,8 @@ static HLIST_HEAD(file_lock_list);
 #define BLOCKED_HASH_BITS  7
 static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
 
-/* Protects the file_lock_list, the blocked_hash and fl->fl_block list */
-static DEFINE_SPINLOCK(file_lock_lock);
+/* protects blocked_hash and fl->fl_block list */
+static DEFINE_SPINLOCK(blocked_lock_lock);
 
 static struct kmem_cache *filelock_cache __read_mostly;
 
@@ -528,7 +529,7 @@ locks_delete_global_blocked(struct file_lock *waiter)
 /* Remove waiter from blocker's block list.
  * When blocker ends up pointing to itself then the list is empty.
  *
- * Must be called with file_lock_lock held.
+ * Must be called with blocked_lock_lock held.
  */
 static void __locks_delete_block(struct file_lock *waiter)
 {
@@ -539,9 +540,9 @@ static void __locks_delete_block(struct file_lock *waiter)
 
 static void locks_delete_block(struct file_lock *waiter)
 {
-	spin_lock(&file_lock_lock);
+	spin_lock(&blocked_lock_lock);
 	__locks_delete_block(waiter);
-	spin_unlock(&file_lock_lock);
+	spin_unlock(&blocked_lock_lock);
 }
 
 /* Insert waiter into blocker's block list.
@@ -549,9 +550,9 @@ static void locks_delete_block(struct file_lock *waiter)
  * the order they blocked. The documentation doesn't require this but
  * it seems like the reasonable thing to do.
  *
- * Must be called with both the i_lock and file_lock_lock held. The fl_block
+ * Must be called with both the i_lock and blocked_lock_lock held. The fl_block
  * list itself is protected by the file_lock_list, but by ensuring that the
- * i_lock is also held on insertions we can avoid taking the file_lock_lock
+ * i_lock is also held on insertions we can avoid taking the blocked_lock_lock
  * in some cases when we see that the fl_block list is empty.
  */
 static void __locks_insert_block(struct file_lock *blocker,
@@ -568,9 +569,9 @@ static void __locks_insert_block(struct file_lock *blocker,
 static void locks_insert_block(struct file_lock *blocker,
struct file_lock *waiter)
 {
-	spin_lock(&file_lock_lock);
+	spin_lock(&blocked_lock_lock);

[PATCH v3 07/13] locks: avoid taking global lock if possible when waking up blocked waiters

2013-06-17 Thread Jeff Layton
Since we always hold the i_lock when inserting a new waiter onto the
fl_block list, we can avoid taking the global lock at all if we find
that it's empty when we go to wake up blocked waiters.

Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/locks.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 8f56651..a8f3b33 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -532,7 +532,10 @@ static void locks_delete_block(struct file_lock *waiter)
  * the order they blocked. The documentation doesn't require this but
  * it seems like the reasonable thing to do.
  *
- * Must be called with file_lock_lock held!
+ * Must be called with both the i_lock and file_lock_lock held. The fl_block
+ * list itself is protected by the file_lock_list, but by ensuring that the
+ * i_lock is also held on insertions we can avoid taking the file_lock_lock
+ * in some cases when we see that the fl_block list is empty.
  */
 static void __locks_insert_block(struct file_lock *blocker,
struct file_lock *waiter)
@@ -560,8 +563,16 @@ static void locks_insert_block(struct file_lock *blocker,
  */
 static void locks_wake_up_blocks(struct file_lock *blocker)
 {
+   /*
+* Avoid taking global lock if list is empty. This is safe since new
+* blocked requests are only added to the list under the i_lock, and
+* the i_lock is always held here.
+*/
+	if (list_empty(&blocker->fl_block))
+   return;
+
 	spin_lock(&file_lock_lock);
-	while (!list_empty(&blocker->fl_block)) {
+   do {
struct file_lock *waiter;
 
 		waiter = list_first_entry(&blocker->fl_block,
 				struct file_lock, fl_block);
@@ -571,7 +582,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
 			waiter->fl_lmops->lm_notify(waiter);
 		else
 			wake_up(&waiter->fl_wait);
-   }
+	} while (!list_empty(&blocker->fl_block));
 	spin_unlock(&file_lock_lock);
 }
 
-- 
1.7.1



[PATCH v3 12/13] seq_file: add seq_list_*_percpu helpers

2013-06-17 Thread Jeff Layton
When we convert the file_lock_list to a set of percpu lists, we'll need
a way to iterate over them in order to output /proc/locks info. Add
some seq_list_*_percpu helpers to handle that.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/seq_file.c|   54 ++
 include/linux/seq_file.h |6 +
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 774c1eb..3135c25 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -921,3 +921,57 @@ struct hlist_node *seq_hlist_next_rcu(void *v,
 	return rcu_dereference(node->next);
 }
 EXPORT_SYMBOL(seq_hlist_next_rcu);
+
+/**
+ * seq_hlist_start_percpu - start an iteration of a percpu hlist array
+ * @head: pointer to percpu array of struct hlist_heads
+ * @cpu:  pointer to cpu cursor
+ * @pos:  start position of sequence
+ *
+ * Called at seq_file->op->start().
+ */
+struct hlist_node *
+seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos)
+{
+   struct hlist_node *node;
+
+   for_each_possible_cpu(*cpu) {
+   hlist_for_each(node, per_cpu_ptr(head, *cpu)) {
+   if (pos-- == 0)
+   return node;
+   }
+   }
+   return NULL;
+}
+EXPORT_SYMBOL(seq_hlist_start_percpu);
+
+/**
+ * seq_hlist_next_percpu - move to the next position of the percpu hlist array
+ * @v:pointer to current hlist_node
+ * @head: pointer to percpu array of struct hlist_heads
+ * @cpu:  pointer to cpu cursor
+ * @pos:  start position of sequence
+ *
+ * Called at seq_file->op->next().
+ */
+struct hlist_node *
+seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head,
+   int *cpu, loff_t *pos)
+{
+   struct hlist_node *node = v;
+
+   ++*pos;
+
+	if (node->next)
+		return node->next;
+
+	for (*cpu = cpumask_next(*cpu, cpu_possible_mask); *cpu < nr_cpu_ids;
+	     *cpu = cpumask_next(*cpu, cpu_possible_mask)) {
+   struct hlist_head *bucket = per_cpu_ptr(head, *cpu);
+
+   if (!hlist_empty(bucket))
+			return bucket->first;
+   }
+   return NULL;
+}
+EXPORT_SYMBOL(seq_hlist_next_percpu);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 2da29ac..4e32edc 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -173,4 +173,10 @@ extern struct hlist_node *seq_hlist_start_head_rcu(struct 
hlist_head *head,
 extern struct hlist_node *seq_hlist_next_rcu(void *v,
   struct hlist_head *head,
   loff_t *ppos);
+
+/* Helpers for iterating over per-cpu hlist_head-s in seq_files */
+extern struct hlist_node *seq_hlist_start_percpu(struct hlist_head __percpu 
*head, int *cpu, loff_t pos);
+
+extern struct hlist_node *seq_hlist_next_percpu(void *v, struct hlist_head 
__percpu *head, int *cpu, loff_t *pos);
+
 #endif
-- 
1.7.1



[PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock

2013-06-17 Thread Jeff Layton
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.

Signed-off-by: Jeff Layton jlay...@redhat.com
---
 Documentation/filesystems/Locking |   21 --
 fs/afs/flock.c|5 +-
 fs/ceph/locks.c   |2 +-
 fs/ceph/mds_client.c  |8 +-
 fs/cifs/cifsfs.c  |2 +-
 fs/cifs/file.c|   13 ++--
 fs/gfs2/file.c|2 +-
 fs/lockd/svcsubs.c|   12 ++--
 fs/locks.c|  151 ++---
 fs/nfs/delegation.c   |   10 +-
 fs/nfs/nfs4state.c|8 +-
 fs/nfsd/nfs4state.c   |8 +-
 include/linux/fs.h|   11 ---
 13 files changed, 140 insertions(+), 113 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index 0706d32..413685f 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -344,7 +344,7 @@ prototypes:
 
 
 locking rules:
-   file_lock_lock  may block
+	inode->i_lock	may block
 fl_copy_lock:  yes no
 fl_release_private:maybe   no
 
@@ -357,12 +357,19 @@ prototypes:
int (*lm_change)(struct file_lock **, int);
 
 locking rules:
-   file_lock_lock  may block
-lm_compare_owner:  yes no
-lm_notify: yes no
-lm_grant:  no  no
-lm_break:  yes no
-lm_change  yes no
+
+	inode->i_lock	file_lock_lock	may block
+lm_compare_owner:  yes[1]  maybe   no
+lm_notify: yes yes no
+lm_grant:  no  no  no
+lm_break:  yes no  no
+lm_change  yes no  no
+
+[1]:	->lm_compare_owner is generally called with *an* inode->i_lock held. It
+may not be the i_lock of the inode for either file_lock being compared! This is
+the case with deadlock detection, since the code has to chase down the owners
+of locks that may be entirely unrelated to the one on which the lock is being
+acquired. When doing a search for deadlocks, the file_lock_lock is also held.
 
 --- buffer_head ---
 prototypes:
diff --git a/fs/afs/flock.c b/fs/afs/flock.c
index 2497bf3..03fc0d1 100644
--- a/fs/afs/flock.c
+++ b/fs/afs/flock.c
@@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, 
struct key *key)
  */
 static int afs_do_setlk(struct file *file, struct file_lock *fl)
 {
+	struct inode *inode = file_inode(file);
 	struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
 	afs_lock_type_t type;
 	struct key *key = file->private_data;
@@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock 
*fl)
 
 	type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
 
-   lock_flocks();
+	spin_lock(&inode->i_lock);
 
/* make sure we've got a callback on this file and that our view of the
 * data version is up to date */
@@ -420,7 +421,7 @@ given_lock:
afs_vnode_fetch_status(vnode, NULL, key);
 
 error:
-   unlock_flocks();
+	spin_unlock(&inode->i_lock);
 	_leave(" = %d", ret);
return ret;
 
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index ebbf680..690f73f 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -192,7 +192,7 @@ void ceph_count_locks(struct inode *inode, int 
*fcntl_count, int *flock_count)
 
 /**
  * Encode the flock and fcntl locks for the given inode into the 

[PATCH v3 00/13] locks: scalability improvements for file locking

2013-06-17 Thread Jeff Layton
Summary of Significant Changes:
---
v3:
- Change spinlock handling to avoid the need to traverse the global
  blocked_hash when doing output of /proc/locks. This means that the
  fl_block list must continue to be protected by a global lock, but
  the fact that the i_lock is also held in most cases means that we
  can avoid taking it in certain situations.

v2:
- Fix potential races in deadlock detection. Manipulation of global
  blocked_hash and deadlock detection are now atomic. This is a
  little slower than the earlier set, but is provably correct. Also,
  the patch that converts to using the i_lock has been split out from
  most of the other changes. That should make it easier to review, but
  it does leave a potential race in the deadlock detection that is fixed
  up by the following patch. It may make sense to fold patches 7 and 8
  together before merging.

- Add percpu hlists and lglocks for global file_lock_list. This gives
  us some speedup since this list is seldom read.

Abstract (tl;dr version):
-
This patchset represents an overhaul of the file locking code with an
aim toward improving its scalability and making the code a bit easier to
understand.

Longer version:
---
When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_lock spinlock. It was an expedient way to deal with the file
locking code at the time, but having a giant spinlock around all of this
code is clearly not great for scalability. Red Hat has bug reports that
go back into the 2.6.18 era that point to BKL scalability problems in
the file locking code and the file_lock_lock suffers from the same
issues.

This patchset is my first attempt to make this code less dependent on
global locking. The main change is to switch most of the file locking
code to be protected by the inode->i_lock instead of the file_lock_lock.

While that works for most things, there are a couple of global data
structures (lists in the current code) that need a global lock to
protect them. So we still need a global lock in order to deal with
those. The remaining patches are intended to make that global locking
less painful. The big gains are made by turning the blocked_list into a
hashtable, which greatly speeds up the deadlock detection code and
making the file_lock_list percpu.

This is not the first attempt at doing this. The conversion to the
i_lock was originally attempted by Bruce Fields a few years ago. His
approach was NAK'ed since it involved ripping out the deadlock
detection. People also really seem to like /proc/locks for debugging, so
keeping that in is probably worthwhile.

There's more work to be done in this area and this patchset is just a
start. There's a horrible thundering herd problem when a blocking lock
is released, for instance. There was also interest in solving the goofy
"unlock on any close" POSIX lock semantics at this year's LSF. I think
this patchset will help lay the groundwork for those changes as well.

While file locking is not usually considered to be a high-performance
codepath, it *is* an IPC mechanism and I think it behooves us to try to
make it as fast as possible.

I'd like to see this considered for 3.11, but some soak time in -next
would be good. Comments and suggestions welcome.

Performance testing and results:

In order to measure the benefit of this set, I've written some locking
performance tests that I've made available here:

git://git.samba.org/jlayton/lockperf.git

Here are the results from the same 32-way, 4 NUMA node machine that I
used to generate the v2 patch results. The first number is the mean
time spent in locking for the test. The number in parenthesis is the
standard deviation.

3.10.0-rc5-00219-ga2648eb   3.10.0-rc5-00231-g7569869
---
flock01 24119.96 (266.08)   24542.51 (254.89)
flock02  1345.09  (37.37)   8.60   (0.31)
posix01 31217.14 (320.91)   24899.20 (254.27)
posix02  1348.60  (36.83)  12.70   (0.44)

I wasn't able to reserve the exact same smaller machine for testing this
set, but this one is comparable with 4 CPUs and UMA architecture:

3.10.0-rc5-00219-ga2648eb   3.10.0-rc5-00231-g7569869
---
flock01 1787.51 (11.23) 1797.75  (9.27)
flock02  314.90  (8.84)   34.87  (2.82)
posix01 1843.43 (11.63) 1880.47 (13.47)
posix02  325.13  (8.53)   54.09  (4.02)

I think the conclusion we can draw here is that this patchset is roughly
as fast as the previous one. In addition, the posix02 test saw a vast
increase in performance.

I believe that's mostly 

[PATCH v3 09/13] locks: turn the blocked_list into a hashtable

2013-06-17 Thread Jeff Layton
Break up the blocked_list into a hashtable, using the fl_owner as a key.
This speeds up searching the hash chains, which is especially significant
for deadlock detection.

Note that the initial implementation assumes that hashing on fl_owner is
sufficient. In most cases it should be, with the notable exception being
server-side lockd, which compares ownership using a tuple of the
nlm_host and the pid sent in the lock request. So, this may degrade to a
single hash bucket when you only have a single NFS client. That will be
addressed in a later patch.

The careful observer may note that this patch leaves the file_lock_list
alone. There's much less of a case for turning the file_lock_list into a
hashtable. The only user of that list is the code that generates
/proc/locks, and it always walks the entire list.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/locks.c |   25 +
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 32826ed..d93b291 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -126,6 +126,7 @@
 #include <linux/time.h>
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
+#include <linux/hashtable.h>
 
 #include <asm/uaccess.h>
 
@@ -160,12 +161,20 @@ int lease_break_time = 45;
 static HLIST_HEAD(file_lock_list);
 
 /*
- * The blocked_list is used to find POSIX lock loops for deadlock detection.
- * Protected by file_lock_lock.
+ * The blocked_hash is used to find POSIX lock loops for deadlock detection.
+ * It is protected by file_lock_lock.
+ *
+ * We hash locks by lockowner in order to optimize searching for the lock a
+ * particular lockowner is waiting on.
+ *
+ * FIXME: make this value scale via some heuristic? We generally will want more
+ * buckets when we have more lockowners holding locks, but that's a little
+ * difficult to determine without knowing what the workload will look like.
  */
-static HLIST_HEAD(blocked_list);
+#define BLOCKED_HASH_BITS  7
+static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
 
-/* Protects the two list heads above, and fl->fl_block list. */
+/* Protects the file_lock_list, the blocked_hash and fl->fl_block list */
 static DEFINE_SPINLOCK(file_lock_lock);
 
 static struct kmem_cache *filelock_cache __read_mostly;
@@ -499,13 +508,13 @@ locks_delete_global_locks(struct file_lock *waiter)
 static inline void
 locks_insert_global_blocked(struct file_lock *waiter)
 {
-	hlist_add_head(&waiter->fl_link, &blocked_list);
+	hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
 }
 
 static inline void
 locks_delete_global_blocked(struct file_lock *waiter)
 {
-	hlist_del_init(&waiter->fl_link);
+	hash_del(&waiter->fl_link);
 }
 
 /* Remove waiter from blocker's block list.
@@ -730,7 +739,7 @@ static struct file_lock *what_owner_is_waiting_for(struct 
file_lock *block_fl)
 {
struct file_lock *fl;
 
-	hlist_for_each_entry(fl, &blocked_list, fl_link) {
+	hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
if (posix_same_owner(fl, block_fl))
return fl-fl_next;
}
@@ -866,7 +875,7 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
/*
 * New lock request. Walk all POSIX locks and look for conflicts. If
 * there are any, either return error or put the request on the
-* blocker's list of waiters and the global blocked_list.
+* blocker's list of waiters and the global blocked_hash.
 */
 	if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
-- 
1.7.1



[PATCH v3 02/13] locks: make generic_add_lease and generic_delete_lease static

2013-06-17 Thread Jeff Layton
Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/locks.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 7a02064..e3140b8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
return type;
 }
 
-int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
+static int generic_add_lease(struct file *filp, long arg, struct file_lock 
**flp)
 {
struct file_lock *fl, **before, **my_before = NULL, *lease;
 	struct dentry *dentry = filp->f_path.dentry;
@@ -1402,7 +1402,7 @@ out:
return error;
 }
 
-int generic_delete_lease(struct file *filp, struct file_lock **flp)
+static int generic_delete_lease(struct file *filp, struct file_lock **flp)
 {
struct file_lock *fl, **before;
 	struct dentry *dentry = filp->f_path.dentry;
-- 
1.7.1



[PATCH v3 05/13] locks: encapsulate the fl_link list handling

2013-06-17 Thread Jeff Layton
Move the fl_link list handling routines into a separate set of helpers.
Also ensure that locks and requests are always put on global lists
last (after fully initializing them) and are taken off before
uninitializing them.

Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/locks.c |   45 -
 1 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c186649..c0e613f 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -153,13 +153,15 @@ int lease_break_time = 45;
 #define for_each_lock(inode, lockp) \
 	for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
 
+/* The global file_lock_list is only used for displaying /proc/locks. */
 static LIST_HEAD(file_lock_list);
+
+/* The blocked_list is used to find POSIX lock loops for deadlock detection. */
 static LIST_HEAD(blocked_list);
+
+/* Protects the two list heads above, plus the inode->i_flock list */
 static DEFINE_SPINLOCK(file_lock_lock);
 
-/*
- * Protects the two list heads above, plus the inode->i_flock list
- */
 void lock_flocks(void)
 {
 	spin_lock(&file_lock_lock);
@@ -484,13 +486,37 @@ static int posix_same_owner(struct file_lock *fl1, struct 
file_lock *fl2)
 	return fl1->fl_owner == fl2->fl_owner;
 }
 
+static inline void
+locks_insert_global_locks(struct file_lock *waiter)
+{
+	list_add_tail(&waiter->fl_link, &file_lock_list);
+}
+
+static inline void
+locks_delete_global_locks(struct file_lock *waiter)
+{
+	list_del_init(&waiter->fl_link);
+}
+
+static inline void
+locks_insert_global_blocked(struct file_lock *waiter)
+{
+	list_add(&waiter->fl_link, &blocked_list);
+}
+
+static inline void
+locks_delete_global_blocked(struct file_lock *waiter)
+{
+   list_del_init(waiter-fl_link);
+}
+
 /* Remove waiter from blocker's block list.
  * When blocker ends up pointing to itself then the list is empty.
  */
 static void __locks_delete_block(struct file_lock *waiter)
 {
+   locks_delete_global_blocked(waiter);
list_del_init(&waiter->fl_block);
-   list_del_init(&waiter->fl_link);
waiter->fl_next = NULL;
 }
 
@@ -512,10 +538,10 @@ static void locks_insert_block(struct file_lock *blocker,
   struct file_lock *waiter)
 {
BUG_ON(!list_empty(&waiter->fl_block));
-   list_add_tail(&waiter->fl_block, &blocker->fl_block);
waiter->fl_next = blocker;
+   list_add_tail(&waiter->fl_block, &blocker->fl_block);
if (IS_POSIX(blocker))
-   list_add(&waiter->fl_link, &blocked_list);
+   locks_insert_global_blocked(waiter);
 }
 
 /*
@@ -543,13 +569,13 @@ static void locks_wake_up_blocks(struct file_lock 
*blocker)
  */
 static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
 {
-   list_add(&fl->fl_link, &file_lock_list);
-
fl->fl_nspid = get_pid(task_tgid(current));

/* insert into file's list */
fl->fl_next = *pos;
*pos = fl;
+
+   locks_insert_global_locks(fl);
 }
 
 /*
@@ -562,9 +588,10 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
 {
struct file_lock *fl = *thisfl_p;
 
+   locks_delete_global_locks(fl);
+
*thisfl_p = fl->fl_next;
fl->fl_next = NULL;
-   list_del_init(&fl->fl_link);
 
if (fl->fl_nspid) {
	put_pid(fl->fl_nspid);
-- 
1.7.1



[PATCH v3 03/13] locks: comment cleanups and clarifications

2013-06-17 Thread Jeff Layton
Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/locks.c |   21 +
 include/linux/fs.h |   18 ++
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e3140b8..1e6301b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -518,9 +518,10 @@ static void locks_insert_block(struct file_lock *blocker,
list_add(&waiter->fl_link, &blocked_list);
 }
 
-/* Wake up processes blocked waiting for blocker.
- * If told to wait then schedule the processes until the block list
- * is empty, otherwise empty the block list ourselves.
+/*
+ * Wake up processes blocked waiting for blocker.
+ *
+ * Must be called with the file_lock_lock held!
  */
 static void locks_wake_up_blocks(struct file_lock *blocker)
 {
@@ -806,6 +807,11 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
}
 
lock_flocks();
+   /*
+* New lock request. Walk all POSIX locks and look for conflicts. If
+* there are any, either return error or put the request on the
+* blocker's list of waiters and the global blocked_list.
+*/
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -844,7 +850,7 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
before = &fl->fl_next;
}
 
-   /* Process locks with this owner.  */
+   /* Process locks with this owner. */
while ((fl = *before) && posix_same_owner(request, fl)) {
/* Detect adjacent or overlapping regions (if same lock type)
 */
@@ -930,10 +936,9 @@ static int __posix_lock_file(struct inode *inode, struct 
file_lock *request, str
}
 
/*
-* The above code only modifies existing locks in case of
-* merging or replacing.  If new lock(s) need to be inserted
-* all modifications are done bellow this, so it's safe yet to
-* bail out.
+* The above code only modifies existing locks in case of merging or
+* replacing. If new lock(s) need to be inserted all modifications are
+* done below this, so it's safe yet to bail out.
 */
error = -ENOLCK; /* no luck */
if (right && left == right && !new_fl2)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9d7816..94105d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,24 @@ int locks_in_grace(struct net *);
 /* that will die - we need it for nfs_lock_info */
#include <linux/nfs_fs_i.h>
 
+/*
+ * struct file_lock represents a generic file lock. It's used to represent
+ * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
+ * note that the same struct is used to represent both a request for a lock and
+ * the lock itself, but the same object is never used for both.
+ *
+ * FIXME: should we create a separate struct lock_request to help distinguish
+ * these two uses?
+ *
+ * The i_flock list is ordered by:
+ *
+ * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
+ * 2) lock owner
+ * 3) lock range start
+ * 4) lock range end
+ *
+ * Obviously, the last two criteria only matter for POSIX locks.
+ */
 struct file_lock {
struct file_lock *fl_next;  /* singly linked list for this inode  */
struct list_head fl_link;   /* doubly linked list of all locks */
-- 
1.7.1



[PATCH v3 08/13] locks: convert fl_link to a hlist_node

2013-06-17 Thread Jeff Layton
Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A later
patch will do the conversion of the blocked_list to a hashtable.

Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/locks.c |   24 
 include/linux/fs.h |2 +-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index a8f3b33..32826ed 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -157,13 +157,13 @@ int lease_break_time = 45;
  * The global file_lock_list is only used for displaying /proc/locks. Protected
  * by the file_lock_lock.
  */
-static LIST_HEAD(file_lock_list);
+static HLIST_HEAD(file_lock_list);
 
 /*
  * The blocked_list is used to find POSIX lock loops for deadlock detection.
  * Protected by file_lock_lock.
  */
-static LIST_HEAD(blocked_list);
+static HLIST_HEAD(blocked_list);
 
/* Protects the two list heads above, and fl->fl_block list. */
 static DEFINE_SPINLOCK(file_lock_lock);
@@ -172,7 +172,7 @@ static struct kmem_cache *filelock_cache __read_mostly;
 
 static void locks_init_lock_heads(struct file_lock *fl)
 {
-   INIT_LIST_HEAD(&fl->fl_link);
+   INIT_HLIST_NODE(&fl->fl_link);
INIT_LIST_HEAD(&fl->fl_block);
init_waitqueue_head(&fl->fl_wait);
 }
@@ -206,7 +206,7 @@ void locks_free_lock(struct file_lock *fl)
 {
BUG_ON(waitqueue_active(&fl->fl_wait));
BUG_ON(!list_empty(&fl->fl_block));
-   BUG_ON(!list_empty(&fl->fl_link));
+   BUG_ON(!hlist_unhashed(&fl->fl_link));
 
locks_release_private(fl);
kmem_cache_free(filelock_cache, fl);
@@ -484,7 +484,7 @@ static inline void
 locks_insert_global_locks(struct file_lock *waiter)
 {
spin_lock(&file_lock_lock);
-   list_add_tail(&waiter->fl_link, &file_lock_list);
+   hlist_add_head(&waiter->fl_link, &file_lock_list);
spin_unlock(&file_lock_lock);
 }
 
@@ -492,20 +492,20 @@ static inline void
 locks_delete_global_locks(struct file_lock *waiter)
 {
spin_lock(&file_lock_lock);
-   list_del_init(&waiter->fl_link);
+   hlist_del_init(&waiter->fl_link);
spin_unlock(&file_lock_lock);
 }
 
 static inline void
 locks_insert_global_blocked(struct file_lock *waiter)
 {
-   list_add(&waiter->fl_link, &blocked_list);
+   hlist_add_head(&waiter->fl_link, &blocked_list);
 }
 
 static inline void
 locks_delete_global_blocked(struct file_lock *waiter)
 {
-   list_del_init(&waiter->fl_link);
+   hlist_del_init(&waiter->fl_link);
 }
 
 /* Remove waiter from blocker's block list.
@@ -730,7 +730,7 @@ static struct file_lock *what_owner_is_waiting_for(struct 
file_lock *block_fl)
 {
struct file_lock *fl;
 
-   list_for_each_entry(fl, &blocked_list, fl_link) {
+   hlist_for_each_entry(fl, &blocked_list, fl_link) {
if (posix_same_owner(fl, block_fl))
return fl->fl_next;
}
@@ -2285,7 +2285,7 @@ static int locks_show(struct seq_file *f, void *v)
 {
struct file_lock *fl, *bfl;
 
-   fl = list_entry(v, struct file_lock, fl_link);
+   fl = hlist_entry(v, struct file_lock, fl_link);
 
lock_get_status(f, fl, *((loff_t *)f->private), "");
 
@@ -2301,14 +2301,14 @@ static void *locks_start(struct seq_file *f, loff_t 
*pos)
 
spin_lock(&file_lock_lock);
*p = (*pos + 1);
-   return seq_list_start(&file_lock_list, *pos);
+   return seq_hlist_start(&file_lock_list, *pos);
 }
 
 static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
 {
loff_t *p = f->private;
++*p;
-   return seq_list_next(v, &file_lock_list, pos);
+   return seq_hlist_next(v, &file_lock_list, pos);
 }
 
 static void locks_stop(struct seq_file *f, void *v)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2f896d..3b340f7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -946,7 +946,7 @@ int locks_in_grace(struct net *);
  */
 struct file_lock {
struct file_lock *fl_next;  /* singly linked list for this inode  */
-   struct list_head fl_link;   /* doubly linked list of all locks */
+   struct hlist_node fl_link;  /* node in global lists */
struct list_head fl_block;  /* circular list of blocked processes */
fl_owner_t fl_owner;
unsigned int fl_flags;
-- 
1.7.1



[PATCH v3 01/13] cifs: use posix_unblock_lock instead of locks_delete_block

2013-06-17 Thread Jeff Layton
commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
the locks_delete_block symbol. There's already an exported helper
function that provides this capability however, so make cifs use that
instead and turn locks_delete_block back into a static function.

Note that if fl->fl_next == NULL then this lock has already been through
locks_delete_block(), so we should be OK to ignore an ENOENT error here
and simply not retry the lock.

Cc: Pavel Shilovsky piastr...@gmail.com
Signed-off-by: Jeff Layton jlay...@redhat.com
Acked-by: J. Bruce Fields bfie...@fieldses.org
---
 fs/cifs/file.c |2 +-
 fs/locks.c |3 +--
 include/linux/fs.h |5 -
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d2..44a4f18 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -999,7 +999,7 @@ try_again:
rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
if (!rc)
goto try_again;
-   locks_delete_block(flock);
+   posix_unblock_lock(file, flock);
}
return rc;
 }
diff --git a/fs/locks.c b/fs/locks.c
index cb424a4..7a02064 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)
 
 /*
  */
-void locks_delete_block(struct file_lock *waiter)
+static void locks_delete_block(struct file_lock *waiter)
 {
lock_flocks();
__locks_delete_block(waiter);
unlock_flocks();
 }
-EXPORT_SYMBOL(locks_delete_block);
 
 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..b9d7816 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct 
file_lock **);
 extern int lease_modify(struct file_lock **, int);
 extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
 extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void locks_delete_block(struct file_lock *waiter);
 extern void lock_flocks(void);
 extern void unlock_flocks(void);
 #else /* !CONFIG_FILE_LOCKING */
@@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, 
loff_t start,
return 1;
 }
 
-static inline void locks_delete_block(struct file_lock *waiter)
-{
-}
-
 static inline void lock_flocks(void)
 {
 }
-- 
1.7.1



Re: [PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock

2013-06-17 Thread Jeff Layton
On Mon, 17 Jun 2013 11:13:49 -0400
Jeff Layton jlay...@redhat.com wrote:

 Having a global lock that protects all of this code is a clear
 scalability problem. Instead of doing that, move most of the code to be
 protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.
 
->fl_link is what connects these structures to the
 global lists, so we must ensure that we hold those locks when iterating
 over or updating these lists.
 
 Furthermore, sound deadlock detection requires that we hold the
 blocked_list state steady while checking for loops. We also must ensure
 that the search and update to the list are atomic.
 
 For the checking and insertion side of the blocked_list, push the
 acquisition of the global lock into __posix_lock_file and ensure that
 checking and update of the  blocked_list is done without dropping the
 lock in between.
 
 On the removal side, when waking up blocked lock waiters, take the
 global lock before walking the blocked list and dequeue the waiters from
 the global list prior to removal from the fl_block list.
 
 With this, deadlock detection should be race free while we minimize
 excessive file_lock_lock thrashing.
 
 Finally, in order to avoid a lock inversion problem when handling
 /proc/locks output we must ensure that manipulations of the fl_block
 list are also protected by the file_lock_lock.
 
 Signed-off-by: Jeff Layton jlay...@redhat.com
 ---
  Documentation/filesystems/Locking |   21 --
  fs/afs/flock.c|5 +-
  fs/ceph/locks.c   |2 +-
  fs/ceph/mds_client.c  |8 +-
  fs/cifs/cifsfs.c  |2 +-
  fs/cifs/file.c|   13 ++--
  fs/gfs2/file.c|2 +-
  fs/lockd/svcsubs.c|   12 ++--
  fs/locks.c|  151 
 ++---
  fs/nfs/delegation.c   |   10 +-
  fs/nfs/nfs4state.c|8 +-
  fs/nfsd/nfs4state.c   |8 +-
  include/linux/fs.h|   11 ---
  13 files changed, 140 insertions(+), 113 deletions(-)
 

[...]

 @@ -1231,7 +1254,7 @@ int __break_lease(struct inode *inode, unsigned int 
 mode)
   if (IS_ERR(new_fl))
   return PTR_ERR(new_fl);
  
 - lock_flocks();
+ spin_lock(&inode->i_lock);
  
   time_out_leases(inode);
  
 @@ -1281,11 +1304,11 @@ restart:
   break_time++;
   }
   locks_insert_block(flock, new_fl);
 - unlock_flocks();
+ spin_unlock(&inode->i_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
 - lock_flocks();
 - __locks_delete_block(new_fl);
+ spin_lock(&inode->i_lock);
 + locks_delete_block(new_fl);

Doh -- bug here. This should not have been changed to
locks_delete_block(). My apologies.

if (error >= 0) {
   if (error == 0)
   time_out_leases(inode);

[...]

  posix_unblock_lock(struct file *filp, struct file_lock *waiter)
  {
 + struct inode *inode = file_inode(filp);
   int status = 0;
  
 - lock_flocks();
+ spin_lock(&inode->i_lock);
if (waiter->fl_next)
 - __locks_delete_block(waiter);
 + locks_delete_block(waiter);


Ditto here...

   else
   status = -ENOENT;
 - unlock_flocks();
+ spin_unlock(&inode->i_lock);
   return status;
  }
  

-- 
Jeff Layton jlay...@redhat.com


Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Elso Andras
Hi,


 1) In the graphs you attached what am I looking at? My best guess is that
 it's traffic on a 10gigE card, but I can't tell from the graph since there's
 no labels.
Yes, 10G traffic on switch port. So incoming means server-to-switch,
outgoing means switch-to-server. No separated card for ceph traffic
:(

 2) Can you give me more info about your serving case. What application are
 you using to serve the video (http server)? Are you serving static mp4 files
 from Ceph filesystem?
lighttpd server with mp4 streaming mod
(http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2),
the files live on cephfs.
there is a speed limit, controlled by mp4 mod. the bandwidth is the
video bitrate value.

mount options:
name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test

rsize=0 and rasize=131072 are tested values; with other values there was 4x
more incoming (from osd) traffic than outgoing (to internet) traffic.

 3) What's the hardware, most importantly how big is your partition that
 cachefilesd is on and what kind of disk are you hosting it on (rotating,
 SSD)?
there are 5 osd servers: HP DL380 G6, 32G ram, 16 x HP sas disks (10k
rpm) with raid0, bonding two 1G interfaces together.
(In a previous life, this hw could serve the ~2.3G traffic with raid5
and three bonded interfaces)

 4) Statistics from fscache. Can you paste the output /proc/fs/fscache/stats
 and /proc/fs/fscache/histogram.

FS-Cache statistics
Cookies: idx=1 dat=8001 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages  : mrk=0 unc=0
Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=2265 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops: pend=0 run=0 enq=0 can=0 rej=0
Ops: dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

No histogram; I'll try to build a kernel with it enabled.

 5) dmesg lines for ceph/fscache/cachefiles like:
[  264.186887] FS-Cache: Loaded
[  264.223851] Key type ceph registered
[  264.223902] libceph: loaded (mon/osd proto 15/24)
[  264.246334] FS-Cache: Netfs 'ceph' registered for caching
[  264.246341] ceph: loaded (mds proto 32)
[  264.249497] libceph: client31274 fsid 1d78ebe5-f254-44ff-81c1-f641bb2036b6


Elbandi


Re: [PATCH v3 06/13] locks: protect most of the file_lock handling with i_lock

2013-06-17 Thread Jeff Layton
On Mon, 17 Jun 2013 11:46:09 -0400
Jeff Layton jlay...@redhat.com wrote:

 On Mon, 17 Jun 2013 11:13:49 -0400
 Jeff Layton jlay...@redhat.com wrote:
 
  Having a global lock that protects all of this code is a clear
  scalability problem. Instead of doing that, move most of the code to be
  protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.
  
->fl_link is what connects these structures to the
  global lists, so we must ensure that we hold those locks when iterating
  over or updating these lists.
  
  Furthermore, sound deadlock detection requires that we hold the
  blocked_list state steady while checking for loops. We also must ensure
  that the search and update to the list are atomic.
  
  For the checking and insertion side of the blocked_list, push the
  acquisition of the global lock into __posix_lock_file and ensure that
  checking and update of the  blocked_list is done without dropping the
  lock in between.
  
  On the removal side, when waking up blocked lock waiters, take the
  global lock before walking the blocked list and dequeue the waiters from
  the global list prior to removal from the fl_block list.
  
  With this, deadlock detection should be race free while we minimize
  excessive file_lock_lock thrashing.
  
  Finally, in order to avoid a lock inversion problem when handling
  /proc/locks output we must ensure that manipulations of the fl_block
  list are also protected by the file_lock_lock.
  
  Signed-off-by: Jeff Layton jlay...@redhat.com
  ---
   Documentation/filesystems/Locking |   21 --
   fs/afs/flock.c|5 +-
   fs/ceph/locks.c   |2 +-
   fs/ceph/mds_client.c  |8 +-
   fs/cifs/cifsfs.c  |2 +-
   fs/cifs/file.c|   13 ++--
   fs/gfs2/file.c|2 +-
   fs/lockd/svcsubs.c|   12 ++--
   fs/locks.c|  151 
  ++---
   fs/nfs/delegation.c   |   10 +-
   fs/nfs/nfs4state.c|8 +-
   fs/nfsd/nfs4state.c   |8 +-
   include/linux/fs.h|   11 ---
   13 files changed, 140 insertions(+), 113 deletions(-)
  
 
 [...]
 
  @@ -1231,7 +1254,7 @@ int __break_lease(struct inode *inode, unsigned int 
  mode)
  if (IS_ERR(new_fl))
  return PTR_ERR(new_fl);
   
  -   lock_flocks();
+   spin_lock(&inode->i_lock);
   
  time_out_leases(inode);
   
  @@ -1281,11 +1304,11 @@ restart:
  break_time++;
  }
  locks_insert_block(flock, new_fl);
  -   unlock_flocks();
+   spin_unlock(&inode->i_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
  -   lock_flocks();
  -   __locks_delete_block(new_fl);
+   spin_lock(&inode->i_lock);
  +   locks_delete_block(new_fl);
 
 Doh -- bug here. This should not have been changed to
 locks_delete_block(). My apologies.
 
if (error >= 0) {
  if (error == 0)
  time_out_leases(inode);
 
 [...]
 
   posix_unblock_lock(struct file *filp, struct file_lock *waiter)
   {
  +   struct inode *inode = file_inode(filp);
  int status = 0;
   
  -   lock_flocks();
+   spin_lock(&inode->i_lock);
if (waiter->fl_next)
  -   __locks_delete_block(waiter);
  +   locks_delete_block(waiter);
 
 
 Ditto here...
 
  else
  status = -ENOENT;
  -   unlock_flocks();
+   spin_unlock(&inode->i_lock);
  return status;
   }
   
 

Bah, scratch that -- this patch is actually fine. We hold the i_lock
here and locks_delete_block takes the file_lock_lock, which is correct.
There is a potential race in patch 7 though. I'll reply to that patch
to point it out in a minute.

-- 
Jeff Layton jlay...@redhat.com


Re: [PATCH v3 07/13] locks: avoid taking global lock if possible when waking up blocked waiters

2013-06-17 Thread Jeff Layton
On Mon, 17 Jun 2013 11:13:50 -0400
Jeff Layton jlay...@redhat.com wrote:

 Since we always hold the i_lock when inserting a new waiter onto the
 fl_block list, we can avoid taking the global lock at all if we find
 that it's empty when we go to wake up blocked waiters.
 
 Signed-off-by: Jeff Layton jlay...@redhat.com
 ---
  fs/locks.c |   17 ++---
  1 files changed, 14 insertions(+), 3 deletions(-)
 
 diff --git a/fs/locks.c b/fs/locks.c
 index 8f56651..a8f3b33 100644
 --- a/fs/locks.c
 +++ b/fs/locks.c
 @@ -532,7 +532,10 @@ static void locks_delete_block(struct file_lock *waiter)
   * the order they blocked. The documentation doesn't require this but
   * it seems like the reasonable thing to do.
   *
 - * Must be called with file_lock_lock held!
 + * Must be called with both the i_lock and file_lock_lock held. The fl_block
+ * list itself is protected by the file_lock_lock, but by ensuring that the
 + * i_lock is also held on insertions we can avoid taking the file_lock_lock
 + * in some cases when we see that the fl_block list is empty.
   */
  static void __locks_insert_block(struct file_lock *blocker,
   struct file_lock *waiter)
 @@ -560,8 +563,16 @@ static void locks_insert_block(struct file_lock *blocker,
   */
  static void locks_wake_up_blocks(struct file_lock *blocker)
  {
 + /*
 +  * Avoid taking global lock if list is empty. This is safe since new
 +  * blocked requests are only added to the list under the i_lock, and
 +  * the i_lock is always held here.
 +  */
+ if (list_empty(&blocker->fl_block))
 + return;
 +


Ok, potential race here. We hold the i_lock when we check list_empty()
above, but it's possible for the fl_block list to become empty between
that check and when we take the spinlock below. locks_delete_block does
not require that you hold the i_lock, and some callers don't hold it.

This is trivially fixable by just keeping this as a while() loop. We'll
do the list_empty() check twice in that case, but that shouldn't change
the performance here much.

I'll fix that in my tree and it'll be in the next resend. Sorry for the
noise...

spin_lock(&file_lock_lock);
- while (!list_empty(&blocker->fl_block)) {
 + do {

   struct file_lock *waiter;
  
waiter = list_first_entry(&blocker->fl_block,
 @@ -571,7 +582,7 @@ static void locks_wake_up_blocks(struct file_lock 
 *blocker)
waiter->fl_lmops->lm_notify(waiter);
else
wake_up(&waiter->fl_wait);
 - }
+ } while (!list_empty(&blocker->fl_block));
spin_unlock(&file_lock_lock);
  }
  


-- 
Jeff Layton jlay...@redhat.com


Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Milosz Tanski
Elbandi,

It looks like it's trying to use fscache (from the stats) but there's
no data. Did you install, configure and enable the cachefilesd daemon?
It's the user-space component of fscache, and the only officially
supported fscache backend on Ubuntu, RHEL & SUSE. I'm guessing that's
your problem, since I don't see any of the below lines in your dmesg
snippet.

[2049099.198234] CacheFiles: Loaded
[2049099.541721] FS-Cache: Cache mycache added (type cachefiles)
[2049099.541727] CacheFiles: File cache on md0 registered

- Milosz

On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras elso.and...@gmail.com wrote:
 Hi,


 1) In the graphs you attached what am I looking at? My best guess is that
 it's traffic on a 10gigE card, but I can't tell from the graph since there's
 no labels.
 Yes, 10G traffic on switch port. So incoming means server-to-switch,
 outgoing means switch-to-server. No separated card for ceph traffic
 :(

 2) Can you give me more info about your serving case. What application are
 you using to serve the video (http server)? Are you serving static mp4 files
 from Ceph filesystem?
 lighttpd server with mp4 streaming mod
 (http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2),
 the files lives on cephfs.
 there is a speed limit, controlled by mp4 mod. the bandwidth is the
 video bitrate value.

 mount options:
 name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test

 rsize=0 and rasize=131072 is a tested, with other values there was 4x
 incoming (from osd) traffic than outgoing (to internet) traffic.

 3) What's the hardware, most importantly how big is your partition that
 cachefilesd is on and what kind of disk are you hosting it on (rotating,
 SSD)?
 there are 5 osd servers: HP DL380 G6, 32G ram, 16 X HP sas disk (10k
 rpm) with raid0. bonding two 1G interface together.
 (In previous life, this hw could serve the ~2.3G traffic with raid5
 and three bonding interface)

 4) Statistics from fscache. Can you paste the output /proc/fs/fscache/stats
 and /proc/fs/fscache/histogram.

 FS-Cache statistics
 Cookies: idx=1 dat=8001 spc=0
 Objects: alc=0 nal=0 avl=0 ded=0
 ChkAux : non=0 ok=0 upd=0 obs=0
 Pages  : mrk=0 unc=0
 Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
 Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
 Invals : n=0 run=0
 Updates: n=0 nul=0 run=0
 Relinqs: n=2265 nul=0 wcr=0 rtr=0
 AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
 Allocs : n=0 ok=0 wt=0 nbf=0 int=0
 Allocs : ops=0 owt=0 abt=0
 Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
 Retrvls: ops=0 owt=0 abt=0
 Stores : n=0 ok=0 agn=0 nbf=0 oom=0
 Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
 VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
 Ops: pend=0 run=0 enq=0 can=0 rej=0
 Ops: dfr=0 rel=0 gc=0
 CacheOp: alo=0 luo=0 luc=0 gro=0
 CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
 CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

 No histogram, i try to build to enable this.

 5) dmesg lines for ceph/fscache/cachefiles like:
 [  264.186887] FS-Cache: Loaded
 [  264.223851] Key type ceph registered
 [  264.223902] libceph: loaded (mon/osd proto 15/24)
 [  264.246334] FS-Cache: Netfs 'ceph' registered for caching
 [  264.246341] ceph: loaded (mds proto 32)
 [  264.249497] libceph: client31274 fsid 1d78ebe5-f254-44ff-81c1-f641bb2036b6


 Elbandi


RE: Comments on Ceph distributed parity implementation

2013-06-17 Thread Paul Von-Stamwitz
Loic,

As Benoit points out, Mojette uses discrete geometry rather than algebra, so 
simple XOR is all that is needed.

Benoit,

Microsoft's paper states that their [12,2,2] LRC provides better availability 
than 3x replication with 1.33x efficiency. 1.5x is certainly a good number. I'm 
just pointing out that better efficiency can be had without losing availability.

All the best,
Paul

On 6/16/2013 02:31 PM Loic Dachary wrote:
 Hi Benoît,
 
 From the ( naïve ) point of view of engineering, performances are
 important. The recent works of James Plank ( cc'ed ) greatly improved them
  and I'm looking forward to the next version of jerasure
 ( http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html ). Rozofs
 Mojette Transform implementation
 ( https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.h 
 https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.cc )
 does not seem to make use of SIMD. Is it because the performances are good
 enough to not require them ?
 
 Cheers
 
 On 06/16/2013 09:51 PM, Benoît Parrein wrote:
  Paul Von-Stamwitz PVonStamwitz at us.fujitsu.com writes:
 
  Hi Paul,
 
 
  Loic, I know nothing about Mojette Transforms. From what little I
 gleaned,
  it might be good for repair
  (needing only a subset of chunks within a range to recalculate a
 missing
  chunk) but I'm worried about the
  storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
 
  All the best,
  Paul
 
 
  If you want to do better than that you will probably lose in
 availability.
  1.5x gives the same availability as 3 replicas, and that holds for
 any kind of erasure coding.
  FYI, the Mojette transform has no constraint in terms of Galois fields.
 That is the
  big advantage of using discrete geometry rather than algebra.
 
  best regards,
  bp
 
 
 
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Loïc Dachary, Artisan Logiciel Libre
 All that is necessary for the triumph of evil is that good people do
 nothing.



Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Elso Andras
Hi,

Oh, I forgot about this daemon... but this daemon caches the data to
files. Thus it's useless here: caching to disk is slower than the
OSDs themselves.

Elbandi

2013/6/17 Milosz Tanski mil...@adfin.com:
 Elbandi,

 It looks like it's trying to use fscache (from the stats) but there's
 no data. Did you install, configure and enable the cachefilesd daemon?
 It's the user-space component of fscache. It's the only officially
 supported fsache backed by Ubuntu, RHEL  SUSE. I'm guessing that's
 your problem since I don't see any of the bellow lines in your dmesg
 snippet.

 [2049099.198234] CacheFiles: Loaded
 [2049099.541721] FS-Cache: Cache mycache added (type cachefiles)
 [2049099.541727] CacheFiles: File cache on md0 registered

 - Milosz

 On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras elso.and...@gmail.com wrote:
 Hi,


 1) In the graphs you attached what am I looking at? My best guess is that
 it's traffic on a 10gigE card, but I can't tell from the graph since there
 are no labels.
 Yes, 10G traffic on the switch port. So incoming means server-to-switch,
 outgoing means switch-to-server. No separate card for ceph traffic
 :(

 2) Can you give me more info about your serving case. What application are
 you using to serve the video (http server)? Are you serving static mp4 files
 from Ceph filesystem?
 lighttpd server with mp4 streaming mod
 (http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2),
 the files live on cephfs.
 There is a speed limit, controlled by the mp4 mod; the bandwidth is the
 video bitrate value.

 mount options:
 name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test

 rsize=0 and rasize=131072 are tested values; with other settings there
 was 4x more incoming (from osd) traffic than outgoing (to internet) traffic.

 3) What's the hardware, most importantly how big is your partition that
 cachefilesd is on and what kind of disk are you hosting it on (rotating,
 SSD)?
 there are 5 osd servers: HP DL380 G6, 32G ram, 16x HP sas disks (10k
 rpm) in raid0, bonding two 1G interfaces together.
 (In its previous life, this hw could serve ~2.3G of traffic with raid5
 and three bonded interfaces.)

 4) Statistics from fscache. Can you paste the output /proc/fs/fscache/stats
 and /proc/fs/fscache/histogram.

 FS-Cache statistics
 Cookies: idx=1 dat=8001 spc=0
 Objects: alc=0 nal=0 avl=0 ded=0
 ChkAux : non=0 ok=0 upd=0 obs=0
 Pages  : mrk=0 unc=0
 Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
 Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
 Invals : n=0 run=0
 Updates: n=0 nul=0 run=0
 Relinqs: n=2265 nul=0 wcr=0 rtr=0
 AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
 Allocs : n=0 ok=0 wt=0 nbf=0 int=0
 Allocs : ops=0 owt=0 abt=0
 Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
 Retrvls: ops=0 owt=0 abt=0
 Stores : n=0 ok=0 agn=0 nbf=0 oom=0
 Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
 VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
 Ops: pend=0 run=0 enq=0 can=0 rej=0
 Ops: dfr=0 rel=0 gc=0
 CacheOp: alo=0 luo=0 luc=0 gro=0
 CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
 CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

 No histogram; I'll try to rebuild with it enabled.

 5) dmesg lines for ceph/fscache/cachefiles like:
 [  264.186887] FS-Cache: Loaded
 [  264.223851] Key type ceph registered
 [  264.223902] libceph: loaded (mon/osd proto 15/24)
 [  264.246334] FS-Cache: Netfs 'ceph' registered for caching
 [  264.246341] ceph: loaded (mds proto 32)
 [  264.249497] libceph: client31274 fsid 1d78ebe5-f254-44ff-81c1-f641bb2036b6


 Elbandi


Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Milosz Tanski
Elso,

It does cache the data to a file, so it may not be useful for your
situation. By default the ceph filesystem already uses the (in memory)
page cache provided by the Linux kernel. So if that's all you want,
then you're good with the current implementation.

Generally large sequential data transfers will not be improved
(although there are cases where we observed improvements). The
motivation for us to implement fscache was the following use-case.

We have a large distributed analytics database (built in house) with a
few different access patterns. First, there's seemingly random access
on the compressed indexes. Second, there's also random access in the
column data files for extent indexes. Finally, there's either
sequential or random access over the actual data (depending on the
query).

In our case the machines that run the database have multiple large
SSD drives in a raid0 configuration. We're using the SSD drives for
scratch storage (housekeeping background jobs) and the ceph fscache.
In some conditions we can get up to 1GB/s reads from these SSD drives.

We're currently in the last stages of deploying this to production,
and for most workloads our query performance for data stored locally
versus on ceph backed by fscache is pretty much the same. Our biggest
gain probably comes from much lower latency in fetching metadata and
indexes for a query, due to the large number of random iops the SSD
drives afford us. I'm going to publish some updated numbers compared
to the previous quick-and-dirty prototype.

I realize that's not going to be the case for everybody. However, if
you have a data access pattern that follows the 80/20 rule or a
Zipfian distribution, and fast local disks for caching, this is a
great fit.
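For anyone who does want to try it with fast local disks, the setup Milosz refers to can be sketched roughly as below. This is only an illustration: the package name, paths, culling thresholds, and the monitor address are all assumptions that vary by distro and cluster.

```shell
# Minimal cachefilesd + cephfs 'fsc' setup sketch (Debian/Ubuntu-style paths).
apt-get install cachefilesd
cat > /etc/cachefilesd.conf <<'EOF'
# cache directory: must be on a local filesystem, ideally SSD-backed
dir /var/cache/fscache
tag mycache
# example culling thresholds (tune to taste)
brun 10%
bcull 7%
bstop 3%
EOF
echo 'RUN=yes' >> /etc/default/cachefilesd
service cachefilesd start
# dmesg should now show the "CacheFiles: File cache on ... registered" line;
# then mount cephfs with -o fsc so the kernel client actually uses the cache:
mount -t ceph mon1:6789:/ /mnt/ceph -o name=test,fsc
```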

Thanks,
- Milosz


On Mon, Jun 17, 2013 at 1:09 PM, Elso Andras elso.and...@gmail.com wrote:
 Hi,

 Oh, I forgot about this daemon... but this daemon caches the data to a
 file. That makes it useless here: caching to disk is slower than
 reading from the osds themselves.

 Elbandi

 2013/6/17 Milosz Tanski mil...@adfin.com:
 Elbandi,

 It looks like it's trying to use fscache (from the stats) but there's
 no data. Did you install, configure and enable the cachefilesd daemon?
 It's the user-space component of fscache. It's the only officially
 supported fscache backend on Ubuntu, RHEL & SUSE. I'm guessing that's
 your problem since I don't see any of the below lines in your dmesg
 snippet.

 [2049099.198234] CacheFiles: Loaded
 [2049099.541721] FS-Cache: Cache mycache added (type cachefiles)
 [2049099.541727] CacheFiles: File cache on md0 registered

 - Milosz

 On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras elso.and...@gmail.com wrote:
 Hi,


 1) In the graphs you attached what am I looking at? My best guess is that
 it's traffic on a 10gigE card, but I can't tell from the graph since 
 there's
 no labels.
 Yes, 10G traffic on the switch port. So incoming means server-to-switch,
 outgoing means switch-to-server. No separate card for ceph traffic
 :(

 2) Can you give me more info about your serving case. What application are
 you using to serve the video (http server)? Are you serving static mp4 
 files
 from Ceph filesystem?
 lighttpd server with mp4 streaming mod
 (http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2),
 the files live on cephfs.
 There is a speed limit, controlled by the mp4 mod; the bandwidth is the
 video bitrate value.

 mount options:
 name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test

 rsize=0 and rasize=131072 are tested values; with other settings there
 was 4x more incoming (from osd) traffic than outgoing (to internet) traffic.

 3) What's the hardware, most importantly how big is your partition that
 cachefilesd is on and what kind of disk are you hosting it on (rotating,
 SSD)?
 there are 5 osd servers: HP DL380 G6, 32G ram, 16x HP sas disks (10k
 rpm) in raid0, bonding two 1G interfaces together.
 (In its previous life, this hw could serve ~2.3G of traffic with raid5
 and three bonded interfaces.)

 4) Statistics from fscache. Can you paste the output /proc/fs/fscache/stats
 and /proc/fs/fscache/histogram.

 FS-Cache statistics
 Cookies: idx=1 dat=8001 spc=0
 Objects: alc=0 nal=0 avl=0 ded=0
 ChkAux : non=0 ok=0 upd=0 obs=0
 Pages  : mrk=0 unc=0
 Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
 Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
 Invals : n=0 run=0
 Updates: n=0 nul=0 run=0
 Relinqs: n=2265 nul=0 wcr=0 rtr=0
 AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
 Allocs : n=0 ok=0 wt=0 nbf=0 int=0
 Allocs : ops=0 owt=0 abt=0
 Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
 Retrvls: ops=0 owt=0 abt=0
 Stores : n=0 ok=0 agn=0 nbf=0 oom=0
 Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
 VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
 Ops: pend=0 run=0 enq=0 can=0 rej=0
 Ops: dfr=0 rel=0 gc=0
 CacheOp: alo=0 luo=0 luc=0 gro=0
 CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
 CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

 No histogram; I'll try to rebuild with it enabled.

 5) dmesg 

Re: [PATCH 2/2] Enable fscache as an optional feature of ceph.

2013-06-17 Thread Matt W. Benjamin
Hi,

1. in the cases where client caching is useful, AFS disk caching is still
common--though yes, giant memory caches became more common over time, and

2. a memory fs-cache backend is probably out there (I wonder if you can write
one in kernel mode); at worst, it looks like you can run cachefilesd on tmpfs?

Matt
- Elso Andras elso.and...@gmail.com wrote:

 Hi,
 
  Oh, I forgot about this daemon... but this daemon caches the data to a
  file. That makes it useless here: caching to disk is slower than
  reading from the osds themselves.
 
 Elbandi
 
 2013/6/17 Milosz Tanski mil...@adfin.com:
  Elbandi,
 
  It looks like it's trying to use fscache (from the stats) but
 there's
  no data. Did you install, configure and enable the cachefilesd
 daemon?
  It's the user-space component of fscache. It's the only officially
  supported fscache backend on Ubuntu, RHEL & SUSE. I'm guessing that's
  your problem since I don't see any of the below lines in your dmesg
  snippet.
 
  [2049099.198234] CacheFiles: Loaded
  [2049099.541721] FS-Cache: Cache mycache added (type cachefiles)
  [2049099.541727] CacheFiles: File cache on md0 registered
 
  - Milosz
 
  On Mon, Jun 17, 2013 at 11:47 AM, Elso Andras
 elso.and...@gmail.com wrote:
  Hi,
 
 
  1) In the graphs you attached what am I looking at? My best guess
 is that
  it's traffic on a 10gigE card, but I can't tell from the graph
 since there's
  no labels.
  Yes, 10G traffic on the switch port. So incoming means
  server-to-switch, outgoing means switch-to-server. No separate card
  for ceph traffic :(
 
  2) Can you give me more info about your serving case. What
 application are
  you using to serve the video (http server)? Are you serving static
 mp4 files
  from Ceph filesystem?
  lighttpd server with mp4 streaming mod
 
 (http://h264.code-shop.com/trac/wiki/Mod-H264-Streaming-Lighttpd-Version2),
  the files live on cephfs.
  There is a speed limit, controlled by the mp4 mod; the bandwidth is
  the video bitrate value.
 
  mount options:
  name=test,rsize=0,rasize=131072,noshare,fsc,key=client.test
 
  rsize=0 and rasize=131072 are tested values; with other settings
  there was 4x more incoming (from osd) traffic than outgoing (to
  internet) traffic.
 
  3) What's the hardware, most importantly how big is your partition
 that
  cachefilesd is on and what kind of disk are you hosting it on
 (rotating,
  SSD)?
  there are 5 osd servers: HP DL380 G6, 32G ram, 16x HP sas disks (10k
  rpm) in raid0, bonding two 1G interfaces together.
  (In its previous life, this hw could serve ~2.3G of traffic with
  raid5 and three bonded interfaces.)
 
  4) Statistics from fscache. Can you paste the output
 /proc/fs/fscache/stats
  and /proc/fs/fscache/histogram.
 
  FS-Cache statistics
  Cookies: idx=1 dat=8001 spc=0
  Objects: alc=0 nal=0 avl=0 ded=0
  ChkAux : non=0 ok=0 upd=0 obs=0
  Pages  : mrk=0 unc=0
  Acquire: n=8002 nul=0 noc=0 ok=8002 nbf=0 oom=0
  Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
  Invals : n=0 run=0
  Updates: n=0 nul=0 run=0
  Relinqs: n=2265 nul=0 wcr=0 rtr=0
  AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
  Allocs : n=0 ok=0 wt=0 nbf=0 int=0
  Allocs : ops=0 owt=0 abt=0
  Retrvls: n=2983745 ok=0 wt=0 nod=0 nbf=2983745 int=0 oom=0
  Retrvls: ops=0 owt=0 abt=0
  Stores : n=0 ok=0 agn=0 nbf=0 oom=0
  Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
  VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
  Ops: pend=0 run=0 enq=0 can=0 rej=0
  Ops: dfr=0 rel=0 gc=0
  CacheOp: alo=0 luo=0 luc=0 gro=0
  CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
  CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
 
  No histogram; I'll try to rebuild with it enabled.
 
  5) dmesg lines for ceph/fscache/cachefiles like:
  [  264.186887] FS-Cache: Loaded
  [  264.223851] Key type ceph registered
  [  264.223902] libceph: loaded (mon/osd proto 15/24)
  [  264.246334] FS-Cache: Netfs 'ceph' registered for caching
  [  264.246341] ceph: loaded (mds proto 32)
  [  264.249497] libceph: client31274 fsid
 1d78ebe5-f254-44ff-81c1-f641bb2036b6
 
 
  Elbandi

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


[no subject]

2013-06-17 Thread AFG GTBANK LOAN



Loan Syndication

At AFG Guaranty Trust Bank, we structure credit lines to meet our
customers' specific business requirements and deliver clear added
value to our customers' companies.
A division of AFG Finance and Private Bank plc.

If you are considering a large acquisition or a major project, you may
need a substantial amount of credit. AFG Guaranty Trust Bank can put
together the syndicate that packages the entire loan for you.


As a bank with international reach, we have come to identify loan
syndications as part of our core business, and by pursuing this line
aggressively we have reached the point of being recognized as a major
player in this field.


Open a current account today with a minimum bank balance of £500, get
up to £10,000 as a loan, and also have a chance of winning the star
prize of £500,000 in the save-and-win promo in May. Apply now


with the following information to attorney Steven Lee, the account
officer.


FULL NAME;


RESIDENTIAL ADDRESS;


E-MAIL ADDRESS;

PHONE NUMBER;

NEXT OF KIN;

MOTHER'S MAIDEN NAME;


MARITAL STATUS;


OFFICE ADDRESS;

ALTERNATIVE PHONE NUMBER;

TO: bar.stevenlee @ yahoo.com
NOTE; ALL LOANS ARE VALID AT A 10-YEAR RATE
OFFER ENDS SOON, SO HURRY NOW



Re: [ceph-users] rbd rm image results in osd marked down wrongly with 0.61.3

2013-06-17 Thread Sage Weil
Hi Florian,

If you can trigger this with logs, we're very eager to see what they say 
about this!  The http://tracker.ceph.com/issues/5336 bug is open to track 
this issue.

Thanks!
sage


On Thu, 13 Jun 2013, Smart Weblications GmbH - Florian Wiessner wrote:

 Hi,
 
 Is really no one on the list interested in fixing this? Or am I the only
 one having this kind of bug/problem?
 
 Am 11.06.2013 16:19, schrieb Smart Weblications GmbH - Florian Wiessner:
  Hi List,
  
  I observed that an rbd rm of an image results in some osds wrongly
  marking one osd as down in cuttlefish.
  
  The situation gets even worse if there is more than one rbd rm
  running in parallel.
  
  Please see the attached logfiles. The rbd rm command was issued at
  20:24:00 via cronjob; 40 seconds later osd.6 got marked down...
  
  
  ceph osd tree
  
  # idweight  type name   up/down reweight
  -1  7   pool default
  -3  7   rack unknownrack
  -2  1   host node01
  0   1   osd.0   up  1
  -4  1   host node02
  1   1   osd.1   up  1
  -5  1   host node03
  2   1   osd.2   up  1
  -6  1   host node04
  3   1   osd.3   up  1
  -7  1   host node06
  5   1   osd.5   up  1
  -8  1   host node05
  4   1   osd.4   up  1
  -9  1   host node07
  6   1   osd.6   up  1
  
  
  I have seen some patches to parallelize rbd rm, but I think there must
  be some other issue, as my clients seem unable to do IO while ceph is
  recovering... I think this worked better in 0.56.x - there was IO
  while recovering.
  
  I also observed in the log of osd.6 that after heartbeat_map
  reset_timeout, the osd tries to connect to the other osds, but it
  retries so fast that you could think it is a DoS attack...
  
  
  Please advise..
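One way to narrow this down while waiting for advice would be to compare the heartbeat settings on the affected daemon while a removal is running. A sketch only: the socket path follows common defaults, osd.6 is taken from the log above, and the 35s grace value is an arbitrary experiment, not a recommendation:

```shell
# Dump heartbeat-related settings from the wrongly marked-down osd
# (osd.6 in the log above) via its admin socket:
ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok config show | grep heartbeat
# As an experiment (not a fix), inject a longer grace period at runtime
# and see whether the spurious down-marking goes away:
ceph tell osd.6 injectargs '--osd-heartbeat-grace 35'
```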
  
 
 
 -- 
 
 Mit freundlichen Grüßen,
 
 Florian Wiessner
 
 Smart Weblications GmbH
 Martinsberger Str. 1
 D-95119 Naila
 
 fon.: +49 9282 9638 200
 fax.: +49 9282 9638 205
 24/7: +49 900 144 000 00 - 0,99 EUR/Min*
 http://www.smart-weblications.de
 
 --
 Sitz der Gesellschaft: Naila
 Geschäftsführer: Florian Wiessner
 HRB-Nr.: HRB 3840 Amtsgericht Hof
 *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


Re: Writing to RBD image while its snapshot is being created causes I/O errors

2013-06-17 Thread Sage Weil
On Mon, 17 Jun 2013, Karol Jurak wrote:
 On Friday 14 of June 2013 08:56:55 Sage Weil wrote:
  On Fri, 14 Jun 2013, Karol Jurak wrote:
   I noticed that writing to an RBD image using the kernel driver while
   its snapshot is being created causes I/O errors, and the filesystem
   (reiserfs) eventually aborts and remounts itself in read-only mode:
  
  This is definitely a bug; you should be able to create a snapshot at any
  time.  After a rollback, it should look (to the fs) like a crash or power
  cycle.
  
  How easy is this to reproduce?  Does it happen every time?
 
 I can reproduce it in the following way:
 
 # rbd create -s 10240 test
 # rbd map test
 # mkfs -t reiserfs /dev/rbd/rbd/test
 # mount /dev/rbd/rbd/test /mnt/test
 # dd if=/dev/zero of=/mnt/test/a bs=1M count=1024
 
 and in another shell while dd is running:
 
 # rbd snap create test@snap1
 
 After 2 or 3 seconds dmesg shows I/O errors:
 
 [429532.259910] end_request: I/O error, dev rbd1, sector 1384448
 [429532.272554] end_request: I/O error, dev rbd1, sector 872
 [429532.275556] REISERFS abort (device rbd1): Journal write error in 
 flush_commit_list
 
 and dd fails:
 
 dd: writing `/mnt/test/a': Cannot allocate memory
 590+0 records in
 589+0 records out
 618225664 bytes (618 MB) copied, 15.8701 s, 39.0 MB/s
 
 This happens every time I repeat the test.

What kernel version are you using?  I'm not able to reproduce this with 
ext4 or reiserfs and many snapshots over several minutes of write 
workload.
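For anyone else trying to reproduce, the reported steps can be wrapped into one script that keeps snapshotting in the background while dd writes. This is just a sketch of the commands quoted above; the snapshot cadence and count are assumptions, and it must run on a scratch cluster since it creates and maps an image named "test":

```shell
#!/bin/sh
# Repro sketch: write to a mapped rbd image while snapshots are created,
# then look for the reported I/O errors in dmesg.
set -e
rbd create -s 10240 test
rbd map test
mkfs -t reiserfs /dev/rbd/rbd/test
mkdir -p /mnt/test
mount /dev/rbd/rbd/test /mnt/test
# take a snapshot every 2 seconds in the background while dd runs
( i=1; while [ $i -le 10 ]; do rbd snap create test@snap$i; i=$((i+1)); sleep 2; done ) &
dd if=/dev/zero of=/mnt/test/a bs=1M count=1024
wait
dmesg | grep -i 'rbd.*I/O error' || echo "no I/O errors this run"
```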

sage