Re: CEPH Erasure Encoding + OSD Scalability

2013-09-22 Thread Loic Dachary
Hi Andreas,

That sounds reasonable. Would you be so kind as to send a patch with your 
changes? I'll rework it into something that fits the test infrastructure of 
Ceph.

Cheers

On 22/09/2013 09:26, Andreas Joachim Peters wrote:
 Hi Loic, 
 I'll run a benchmark with the changed code tomorrow ... I actually had to insert 
 some of my realtime benchmark macros into your Jerasure code to see the 
 different time fractions between the buffer preparation and encoding steps, but for 
 your QA suite it is probably enough to get a total value after your fix. I 
 will send you a program sampling the performance at different buffer sizes 
 and encoding types.
 
 I changed my code to use vector operations (128-bit XORs) and it gives 
 another 10% gain. I also want to try out whether it makes sense to do the CRC32C 
 computation inline in the encoding step and compare it with the two-step 
 procedure: first encoding all blocks, then CRC32C on all blocks.
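 
 (For context, this is the kind of 128-bit XOR loop I mean: a minimal SSE2 sketch, 
 not the actual 3P code; the function name, buffer names and the 16-byte-multiple 
 length are my assumptions.)
 
   #include <emmintrin.h>   // SSE2: __m128i, _mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128
   #include <cstddef>
 
   // XOR `src` into `dst` 16 bytes at a time; len is assumed to be a multiple of 16.
   static void xor_region_128(char *dst, const char *src, std::size_t len)
   {
     for (std::size_t i = 0; i < len; i += 16) {
       __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(dst + i));
       __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
       _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), _mm_xor_si128(a, b));
     }
   }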
 
 Cheers Andreas.
 
 
 
 
 From: Loic Dachary [l...@dachary.org]
 Sent: 21 September 2013 17:11
 To: Andreas Joachim Peters
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: CEPH Erasure Encoding + OSD Scalability
 
 Hi Andreas,
 
 It's probably too soon to be smart about reducing the number of copies, but 
 you're right: this copy is not necessary. The following pull request gets 
 rid of it:
 
 https://github.com/ceph/ceph/pull/615
 
 Cheers
 
 On 20/09/2013 18:49, Loic Dachary wrote:
 Hi,

 This is a first attempt at avoiding unnecessary copy:

 https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66

 I'm not sure how it could be made more readable / terse with bufferlist 
 iterators. Any kind of hint would be welcome :-)

 Cheers

 On 20/09/2013 17:36, Sage Weil wrote:
 On Fri, 20 Sep 2013, Loic Dachary wrote:
 Hi Andreas,

 Great work on these benchmarks! It's definitely an incentive to improve 
 as much as possible. Could you push / send the scripts and sequence of 
 operations you've used? I'll reproduce this locally while getting rid of 
 the extra copy. It would be useful to capture that into a script that can 
 be conveniently run from the teuthology integration tests to check 
 against performance regressions.

 Regarding the 3P implementation, in my opinion it would be very valuable 
 for some people who prefer low CPU consumption. And I'm eager to see more 
 than one plugin in the erasure code plugin directory ;-)

 One way to approach this might be to make a bufferlist 'multi-iterator'
 that you give bufferlist::iterators and that will give you back a pair of
 pointer and length for each contiguous segment.  This would capture the
 annoying iterator details and let the user focus on processing chunks that
 are as large as possible.
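 
 To illustrate the idea (a rough, untested sketch; a plain vector of
 (pointer, length) segments stands in for the real bufferlist internals, and
 the class and method names are made up):
 
   #include <cstddef>
   #include <utility>
   #include <vector>
 
   // Stand-in for a bufferlist: a sequence of contiguous, non-empty segments.
   typedef std::pair<const char *, size_t> segment;
 
   // Walks several buffer lists in lock step and hands back one pointer per
   // input plus the largest length that is contiguous in all of them, so the
   // caller can process chunks that are as large as possible.
   class multi_iterator {
   public:
     explicit multi_iterator(const std::vector<std::vector<segment> > &l)
       : lists(l), index(l.size(), 0), offset(l.size(), 0) {}
 
     // Fill `ptrs` and return the common contiguous length, 0 once any list ends.
     size_t next(std::vector<const char *> &ptrs) {
       ptrs.resize(lists.size());
       size_t len = 0;
       for (size_t i = 0; i < lists.size(); ++i) {
         if (index[i] >= lists[i].size())
           return 0;                              // this list is exhausted
         const segment &s = lists[i][index[i]];
         ptrs[i] = s.first + offset[i];
         size_t remaining = s.second - offset[i];
         if (i == 0 || remaining < len)
           len = remaining;
       }
       for (size_t i = 0; i < lists.size(); ++i) { // advance every cursor by len
         offset[i] += len;
         if (offset[i] == lists[i][index[i]].second) {
           ++index[i];
           offset[i] = 0;
         }
       }
       return len;
     }
 
   private:
     std::vector<std::vector<segment> > lists;
     std::vector<size_t> index, offset;
   };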

 sage


  
 Cheers

 On 20/09/2013 13:35, Andreas Joachim Peters wrote:
 Hi Loic,

 I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) 
 for ENCODING based on the CEPH Jerasure port.
 I measured for objects from 128k to 512 MB with random contents (if you 
 encode 1 GB objects you see slow downs due to caching inefficiencies 
 ...), otherwise results are stable for the given object sizes.

 I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2); 
 the others are significantly slower (2-3x), and my 3P(3,2,1) 
 implementation provides the same redundancy level as RS-RAID6[3,2] 
 (double disk failure) but uses more space (66% vs 100% overhead).

 The effect of out.c_str() is significant (it contributes a factor 2 
 slow-down for the best jerasure algorithm for [3,2]).

 Averaged results for 4 MB object size:
 
 1) Erasure CRS [3,2]: 2.6 ms buffer preparation (out.c_str()) + 2.4 ms 
 encoding = ~780 MB/s
 2) 3P [3,2,1]: 0.005 ms buffer preparation (3P adjusts the padding in 
 the algorithm) + 0.87 ms encoding = ~4.4 GB/s

 I think it pays off to avoid the copy in the encoding if it does not 
 matter for the buffer handling upstream and pad only the last chunk.
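 
 (For what it's worth, a sketch of what I mean by padding only the last chunk; 
 the function name and the std::string stand-in for the real buffer type are my 
 inventions, and it assumes the object is big enough that only the final chunk 
 is partial.)
 
   #include <cstddef>
   #include <string>
   #include <utility>
   #include <vector>
 
   // Carve `in` into k equally sized, alignment-rounded chunks.  The first k-1
   // chunks are (pointer, length) views straight into `in` (no copy); only the
   // final, partial chunk is copied into `last` and zero padded to full size.
   // `in` and `last` must outlive the returned views.
   std::vector<std::pair<const char *, size_t> >
   carve_chunks(const std::string &in, size_t k, size_t alignment,
                std::string &last)
   {
     size_t chunk = (in.size() + k - 1) / k;
     if (chunk % alignment)
       chunk += alignment - chunk % alignment;    // round chunk size up
     std::vector<std::pair<const char *, size_t> > out;
     for (size_t i = 0; i + 1 < k; ++i)
       out.push_back(std::make_pair(in.data() + i * chunk, chunk));
     size_t begin = (k - 1) * chunk;
     last.assign(in, begin, in.size() - begin);   // partial tail ...
     last.resize(chunk, '\0');                    // ... padded with zeroes
     out.push_back(std::make_pair(last.data(), chunk));
     return out;
   }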

 The last thing I tested is how performance scales with the number of cores, 
 running 4 tests in parallel:
 
 Jerasure (3,2) tops out at ~2.0 GB/s on a 4-core CPU (Xeon 2.27 GHz).
 3P(3,2,1) tops out at ~8 GB/s on a 4-core CPU (Xeon 2.27 GHz).

 I also implemented the decoding for 3P, but haven't yet tested all 
 reconstruction cases. There is probably room for improvement using AVX 
 support for XOR operations in both implementations.

 Before I invest more time, do you think it is useful to have this fast 3P 
 algorithm for double disk failures with 100% space overhead? I 
 believe that people will always optimize for space and would rather use 
 something like (10,2) even if the performance degrades and CPU 
 consumption goes up. Let me know, no problem in any case!

 Finally I tested some combinations for 
 ErasureCodeJerasureReedSolomonRAID6:

 (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s

 Cheers 

Re: [ceph-users] Ceph write performance and my Dell R515's

2013-09-22 Thread Mark Nelson

On 09/22/2013 03:12 AM, Quenten Grasso wrote:


Hi All,

I’m finding my write performance is less than I would have expected. 
After spending a considerable amount of time testing several 
different configurations I can never seem to break over ~360 MB/s 
write, even when using tmpfs for journaling.


So I’ve purchased 3x Dell R515’s with 1 x AMD 6C CPU, 12 x 3TB SAS 
& 2 x 100GB Intel DC S3700 SSD’s & 32GB RAM, with the Perc H710p RAID 
controller and dual-port 10GbE network cards.


So first up, I realise the SSD’s were a mistake; I should have bought 
the 200GB ones as they have considerably better write throughput: 
~375 MB/s vs 200 MB/s.


So, to our node configuration:

2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, plus 12 disks 
each in a single-disk Raid0 (JBOD fashion) with a 1MB stripe size.


(The stripe size was particularly important: I found the stripe size 
matters considerably even on a single-disk Raid0, contrary 
to what you might read on the internet.)


Also each disk is configured with write back cache enabled and 
read ahead disabled.


For networking, all nodes are connected via an LACP bond with L3 hashing, 
and using iperf I can get up to 16 Gbit/s tx and rx between the nodes.


OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to 
upgrade kernel due to 10Gbit Intel NIC’s driver issues)


So this gives me 11 OSD’s & 2 SSD’s per node.



I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you 
definitely will want to do some investigation to make sure that OSD 
isn't holding the other ones back. iostat or collectl might be useful, 
along with the ceph osd admin socket and the dump_ops_in_flight and 
dump_historic_ops commands.


Next I’ve tried several different configurations, two of which I’ll 
briefly describe below.


1)Cluster Configuration 1,

33 OSD’s with 6x SSD’s as Journals, w/ 15GB Journals on SSD.

# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 
seconds or 0 objects


Total time run: 180.250417

Total writes made: 10152

Write size: 4194304

Bandwidth (MB/sec): 225.287

Stddev Bandwidth: 35.0897

Max bandwidth (MB/sec): 312

Min bandwidth (MB/sec): 0

Average Latency: 0.284054

Stddev Latency: 0.199075

Max latency: 1.46791

Min latency: 0.038512

--



What was your pool replication set to?


# rados bench -p benchmark1 180 seq

-

Total time run: 43.782554

Total reads made: 10120

Read size: 4194304

Bandwidth (MB/sec): 924.569

Average Latency: 0.0691903

Max latency: 0.262542

Min latency: 0.015756

-

In this configuration I found my write performance suffers a lot: the 
SSD’s seem to be a bottleneck, and my write performance using rados 
bench was around 224-230 MB/s.


2)Cluster Configuration 2,

33 OSD’s with 1Gbyte Journals on tmpfs.

# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 
seconds or 0 objects


Total time run: 180.044669

Total writes made: 15328

Write size: 4194304

Bandwidth (MB/sec): 340.538

Stddev Bandwidth: 26.6096

Max bandwidth (MB/sec): 380

Min bandwidth (MB/sec): 0

Average Latency: 0.187916

Stddev Latency: 0.0102989

Max latency: 0.336581

Min latency: 0.034475

--



Definitely low, especially with journals on tmpfs. :( How are the CPUs 
doing at this point? We have some R515s in our lab, and they definitely 
are slow too. Ours have 7 OSD disks and 1 Dell branded SSD (usually 
unused) each and can do about ~150MB/s writes per system. It's actually 
a puzzle we've been trying to solve for quite some time.


Some thoughts:

Could the expander backplane be having issues due to having to tunnel 
STP for the SATA SSDs (or potentially be causing expander wide resets)? 
Could the H700 (and apparently H710) be doing something unusual that the 
stock LSI firmware handles better? We replaced the H700 with an Areca 
1880 and definitely saw changes in performance (better large IO 
throughput and worse IOPS), but the performance was still much lower 
than in a supermicro node with no expanders in the backplane using 
either an LSI 2208 or Areca 1880.


Things you might want to try:

- single node tests and, if you have an alternate controller you can 
try, seeing if that works better.
- removing the S3700s from the chassis entirely and retrying the tmpfs 
journal tests.
- Since the H710 is SAS2208 based, you may be able to use megacli to set 
it into JBOD mode and see if that works any better (it may if you are 
using SSD or tmpfs backed journals).


MegaCli -AdpSetProp -EnableJBOD 

Re: [ceph-users] Ceph write performance and my Dell R515's

2013-09-22 Thread Leen Besselink
On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:
 On 09/22/2013 03:12 AM, Quenten Grasso wrote:
 
 Hi All,
 
 I’m finding my write performance is less than I would have
 expected. After spending some considerable amount of time testing
 several different configurations I can never seem to break over
 ~360 MB/s write even when using tmpfs for journaling.
 
 So I’ve purchased 3x Dell R515’s with 1 x AMD 6C CPU with 12 x 3TB
 SAS & 2 x 100GB Intel DC S3700 SSD’s & 32GB Ram with the Perc
 H710p Raid controller and Dual Port 10GBE Network Cards.
 
 So first up I realise the SSD’s were a mistake, I should have
 bought the 200GB ones as they have considerably better write
 throughput of ~375 MB/s vs 200 MB/s
 
 So to our Nodes Configuration,
 
 2 x 3TB disks in Raid1 for OS/MON & 1 partition for OSD, 12 disks
 each in a single-disk Raid0 (JBOD fashion) with a 1MB
 stripe size,
 
 (Stripe size this part was particularly important because I found
 the stripe size matters considerably even on a single disk raid0.
 contrary to what you might read on the internet)
 
 Also each disk is configured with write back cache enabled
 and read ahead disabled.
 
 For Networking, All nodes are connected via LACP bond with L3
 hashing and using iperf I can get up to 16gbit/s tx and rx between
 the nodes.
 
 OS: Ubuntu 12.04.3 LTS w/ Kernel 3.10.12-031012-generic (had to
 upgrade kernel due to 10Gbit Intel NIC’s driver issues)
 
 So this gives me 11 OSD’s & 2 SSD’s per node.
 
 
 I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but
 you definitely will want to do some investigation to make sure that
 OSD isn't holding the other ones back. iostat or collectl might be
 useful, along with the ceph osd admin socket and the
 dump_ops_in_flight and dump_historic_ops commands.
 

I was wondering if latency on the network was OK; perhaps some kind of 
LACP bonding is not working correctly, or the L3 hashing is causing problems.

ifstat or iptraf or graphs of SNMP from the switch might show you where the 
traffic went.

I did my last tests with nping echo mode from nmap, using --rate, to do latency 
tests under network load.

It does generate a lot of output, which slows it down a bit; you might want to 
redirect it somewhere.

 Next I’ve tried several different configurations which I’ll
 briefly describe 2 of which below,
 
 1)Cluster Configuration 1,
 
 33 OSD’s with 6x SSD’s as Journals, w/ 15GB Journals on SSD.
 
 # ceph osd pool create benchmark1 1800 1800
 
 # rados bench -p benchmark1 180 write --no-cleanup
 
 --
 
 Maintaining 16 concurrent writes of 4194304 bytes for up to 180
 seconds or 0 objects
 
 Total time run: 180.250417
 
 Total writes made: 10152
 
 Write size: 4194304
 
 Bandwidth (MB/sec): 225.287
 
 Stddev Bandwidth: 35.0897
 
 Max bandwidth (MB/sec): 312
 
 Min bandwidth (MB/sec): 0
 
 Average Latency: 0.284054
 
 Stddev Latency: 0.199075
 
 Max latency: 1.46791
 
 Min latency: 0.038512
 
 --
 
 
 What was your pool replication set to?
 
 # rados bench -p benchmark1 180 seq
 
 -
 
 Total time run: 43.782554
 
 Total reads made: 10120
 
 Read size: 4194304
 
 Bandwidth (MB/sec): 924.569
 
 Average Latency: 0.0691903
 
 Max latency: 0.262542
 
 Min latency: 0.015756
 
 -
 
 In this configuration I found my write performance suffers a lot:
 the SSD’s seem to be a bottleneck, and my write performance
 using rados bench was around 224-230 MB/s
 
 2)Cluster Configuration 2,
 
 33 OSD’s with 1Gbyte Journals on tmpfs.
 
 # ceph osd pool create benchmark1 1800 1800
 
 # rados bench -p benchmark1 180 write --no-cleanup
 
 --
 
 Maintaining 16 concurrent writes of 4194304 bytes for up to 180
 seconds or 0 objects
 
 Total time run: 180.044669
 
 Total writes made: 15328
 
 Write size: 4194304
 
 Bandwidth (MB/sec): 340.538
 
 Stddev Bandwidth: 26.6096
 
 Max bandwidth (MB/sec): 380
 
 Min bandwidth (MB/sec): 0
 
 Average Latency: 0.187916
 
 Stddev Latency: 0.0102989
 
 Max latency: 0.336581
 
 Min latency: 0.034475
 
 --
 
 
 Definitely low, especially with journals on tmpfs. :( How are the

I'm no expert, but I did notice the tmpfs journals were only 1GB, which
seems kinda small. But the systems didn't have a lot more memory, so there
wasn't much choice.

Even if you make them slightly larger it will cut into the memory available
for the filesystem cache. That might be a bad idea as well, I guess.

 CPUs doing at this point? We have some R515s in our lab, and they
 definitely are slow too. Ours have 7 OSD disks and 1 Dell branded
 SSD (usually unused) each and can do about ~150MB/s writes per
 system. It's actually a puzzle we've been trying to solve for quite
 some time.
 
 Some thoughts:
 
 Could the 

Hiding auth key string for the qemu process

2013-09-22 Thread Andrey Korolyov
Hello,

Since cephx has been enabled by default for a long time and we may
assume that everyone is using it, it seems worthwhile to introduce bits of
code hiding the key from the command line. The first applicable place for such
an improvement is most likely OpenStack environments, with their sparse security
and their use of the admin key as the default one.
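
To illustrate the kind of change I mean (a toy sketch only, not the actual
qemu/librbd plumbing, and the "--keyfile" option name is made up): accept a path
to a root-only key file instead of the key itself, so the secret never shows up
in /proc/<pid>/cmdline.

  #include <fstream>
  #include <iostream>
  #include <string>

  // Toy example: "--keyfile <path>" instead of "--key <secret>".
  int main(int argc, char **argv)
  {
    std::string keyfile, key;
    for (int i = 1; i + 1 < argc; ++i)
      if (std::string(argv[i]) == "--keyfile")
        keyfile = argv[i + 1];
    if (keyfile.empty()) {
      std::cerr << "usage: demo --keyfile <path>" << std::endl;
      return 1;
    }
    std::ifstream in(keyfile.c_str());               // key file should be mode 0600
    if (!std::getline(in, key) || key.empty()) {
      std::cerr << "could not read key from " << keyfile << std::endl;
      return 1;
    }
    // ... hand `key` to the library through an in-memory option, not argv ...
    return 0;
  }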


RE: CEPH Erasure Encoding + OSD Scalability

2013-09-22 Thread Andreas Joachim Peters
Hi Loic, 
I applied the changes and the situation improves; however, there is still one 
important thing which actually dominates all the measurements that need larger 
packet sizes (everything besides RAID6):

 pad_in_length(unsigned in_length)


The implementation is sort of 'unlucky' and slow when one increases the 
packetsize.

  while (in_length%(k*w*packetsize*sizeof(int)) != 0)
    in_length++;

It would be better to do it like this:

  alignment = k*w*packetsize*sizeof(int);
  if (in_length % alignment)
    in_length += alignment - (in_length % alignment);
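
(As a quick standalone check that the closed form matches the original loop; the
k/w/packetsize values below are just examples:)

  #include <cassert>
  #include <cstdio>

  int main()
  {
    const unsigned k = 3, w = 8, packetsize = 16;
    const unsigned alignment = k * w * packetsize * sizeof(int);
    for (unsigned in_length = 1; in_length < 4 * alignment; in_length++) {
      unsigned looped = in_length;
      while (looped % alignment != 0)            // original implementation
        looped++;
      unsigned closed = in_length;
      if (closed % alignment)                    // constant-time replacement
        closed += alignment - (closed % alignment);
      assert(looped == closed);
    }
    printf("ok\n");
    return 0;
  }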

E.g. for the CauchyGood algorithm one should increase the packetsize; with the 
changed pad_in_length implementation one gets excellent (pure encoding) 
performance for (3+2): 2.6 GB/s, and it scales well with the number of cores 
to > 8 GB/s.

I compared this with the output of the 'encode' example from the jerasure 
examples and it gives the same result for (3+2), so that now looks good 
and consistent! (10,4) is ~610 MB/s.

... 
Finally the description of Jerasure 2.0 looks really great and will probably 
shift all the performance problems upstream  ;-)

Do you eventually want to add support in the plugin for local parities (like Xorbas 
does) to improve the disk-draining performance?

Cheers Andreas.


[PATCH] ceph: fix sync read eof check deadlock

2013-09-22 Thread majianpeng
As Yan, Zheng said, commit 0913444208db introduced a bug: getattr needs to
take the read lock on the inode's filelock, but the lock can be in an
unstable state. The getattr request waits for the lock's state to become
stable, while the lock waits for the client to release the Fr cap.

Commit 6a026589ba333185c466c90 resolved the same kind of bug.
Before doing getattr, we must put the caps we already hold to avoid
the deadlock.

Reported-by: Yan, Zheng zheng.z@intel.com
Signed-off-by: Jianpeng Ma majianp...@gmail.com
---
 fs/ceph/file.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7da35c7..bc00ace 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -839,7 +839,15 @@ again:
   ret = ceph_sync_read(iocb, &i, &checkeof);
 
   if (checkeof && ret >= 0) {
-   int statret = ceph_do_getattr(inode,
+   int statret;
+   /*
+    * Before getattr, we should put the caps we hold to avoid
+    * deadlock.
+    */
+   ceph_put_cap_refs(ci, got);
+   got = 0;
+
+   statret = ceph_do_getattr(inode,
  CEPH_STAT_CAP_SIZE);
 
/* hit EOF or hole? */
@@ -851,16 +859,23 @@ again:
 
read += ret;
checkeof = 0;
-   goto again;
+   ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD,
+   want, &got, -1);
+   if (ret < 0)
+   ret = 0;
+   else
+   goto again;
}
}
 
} else
ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-   dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
-    inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
-   ceph_put_cap_refs(ci, got);
+   if (got) {
+   dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
+    inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
+   ceph_put_cap_refs(ci, got);
+   }
 
   if (ret >= 0)
ret += read;
-- 
1.8.4-rc0

Re: [PATCH 1/3] ceph: queue cap release in __ceph_remove_cap()

2013-09-22 Thread Sage Weil
Reviewed-by: Sage Weil s...@inktank.com

On Sun, 22 Sep 2013, Yan, Zheng wrote:

 From: Yan, Zheng zheng.z@intel.com
 
 call __queue_cap_release() in __ceph_remove_cap(); this avoids
 acquiring s_cap_lock twice.
 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  fs/ceph/caps.c   | 21 +++--
  fs/ceph/mds_client.c |  6 ++
  fs/ceph/super.h  |  8 +---
  3 files changed, 14 insertions(+), 21 deletions(-)
 
 diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
 index 13976c3..d2d6e40 100644
 --- a/fs/ceph/caps.c
 +++ b/fs/ceph/caps.c
 @@ -897,7 +897,7 @@ static int __ceph_is_any_caps(struct ceph_inode_info *ci)
   * caller should hold i_ceph_lock.
   * caller will not hold session s_mutex if called from destroy_inode.
   */
 -void __ceph_remove_cap(struct ceph_cap *cap)
 +void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
  {
   struct ceph_mds_session *session = cap->session;
   struct ceph_inode_info *ci = cap->ci;
 @@ -909,6 +909,10 @@ void __ceph_remove_cap(struct ceph_cap *cap)
  
   /* remove from session list */
   spin_lock(&session->s_cap_lock);
  + if (queue_release)
  + __queue_cap_release(session, ci->i_vino.ino, cap->cap_id,
  + cap->mseq, cap->issue_seq);
 +
   if (session->s_cap_iterator == cap) {
   /* not yet, we are iterating over this very cap */
   dout("__ceph_remove_cap  delaying %p removal from session %p\n",
  @@ -1023,7 +1027,6 @@ void __queue_cap_release(struct ceph_mds_session *session,
   struct ceph_mds_cap_release *head;
   struct ceph_mds_cap_item *item;
  
  - spin_lock(&session->s_cap_lock);
   BUG_ON(!session-s_num_cap_releases);
   msg = list_first_entry(&session->s_cap_releases,
  struct ceph_msg, list_head);
  @@ -1052,7 +1055,6 @@ void __queue_cap_release(struct ceph_mds_session *session,
   (int)CEPH_CAPS_PER_RELEASE,
   (int)msg->front.iov_len);
   }
  - spin_unlock(&session->s_cap_lock);
  }
  
  /*
 @@ -1067,12 +1069,8 @@ void ceph_queue_caps_release(struct inode *inode)
   p = rb_first(&ci->i_caps);
   while (p) {
   struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
  - struct ceph_mds_session *session = cap->session;
  -
  - __queue_cap_release(session, ceph_ino(inode), cap->cap_id,
  - cap->mseq, cap->issue_seq);
   p = rb_next(p);
 - __ceph_remove_cap(cap);
 + __ceph_remove_cap(cap, true);
   }
  }
  
  @@ -2791,7 +2789,7 @@ static void handle_cap_export(struct inode *inode, struct ceph_mds_caps *ex,
   }
   spin_unlock(&mdsc->cap_dirty_lock);
   }
 - __ceph_remove_cap(cap);
 + __ceph_remove_cap(cap, false);
   }
   /* else, we already released it */
  
 @@ -2931,9 +2929,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
   if (!inode) {
   dout(" i don't have ino %llx\n", vino.ino);
  
 - if (op == CEPH_CAP_OP_IMPORT)
 + if (op == CEPH_CAP_OP_IMPORT) {
  + spin_lock(&session->s_cap_lock);
   __queue_cap_release(session, vino.ino, cap_id,
   mseq, seq);
  + spin_unlock(&session->s_cap_lock);
 + }
   goto flush_cap_releases;
   }
  
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index f51ab26..8f8f5c0 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
  @@ -986,7 +986,7 @@ static int remove_session_caps_cb(struct inode *inode, struct ceph_cap *cap,
   dout("removing cap %p, ci is %p, inode is %p\n",
   cap, ci, &ci->vfs_inode);
   spin_lock(&ci->i_ceph_lock);
 - __ceph_remove_cap(cap);
 + __ceph_remove_cap(cap, false);
   if (!__ceph_is_any_real_caps(ci)) {
   struct ceph_mds_client *mdsc =
   ceph_sb_to_client(inode->i_sb)->mdsc;
  @@ -1231,9 +1231,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
   session->s_trim_caps--;
   if (oissued) {
   /* we aren't the only cap.. just remove us */
  - __queue_cap_release(session, ceph_ino(inode), cap->cap_id,
  - cap->mseq, cap->issue_seq);
 - __ceph_remove_cap(cap);
 + __ceph_remove_cap(cap, true);
   } else {
   /* try to drop referring dentries */
   spin_unlock(&ci->i_ceph_lock);
 diff --git a/fs/ceph/super.h b/fs/ceph/super.h
 index a538b51..8de94b5 100644
 --- a/fs/ceph/super.h
 +++ b/fs/ceph/super.h
 @@ -742,13 +742,7 @@ extern int ceph_add_cap(struct inode *inode,
   int fmode, unsigned issued, unsigned wanted,
   unsigned cap, unsigned seq, u64 realmino, int flags,