RE: Regarding key/value interface

2014-09-11 Thread Allen Samuels
Another thing we're looking into is compression. The intersection of 
compression and object striping (fracturing) is interesting. Is the striping 
variable on a per-object basis? 

Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, September 11, 2014 6:55 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
ceph-devel@vger.kernel.org
Subject: RE: Regarding key/value interface

On Fri, 12 Sep 2014, Somnath Roy wrote:
> Makes perfect sense, Sage.
> 
> Regarding striping of file data: you are saying the KeyValue interface will 
> do the following for me?
> 
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
> to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
> smaller sizes (configurable?) and stripe it as multiple key/value pairs?
> 
> 2. Also, while reading, it will take care of accumulating the pieces and 
> sending them back.

Precisely.

A smarter thing we might want to make it do in the future would be to take a 
4 KB write and create a new key that logically overwrites part of the larger, 
say, 1 MB key, and apply it on read.  And maybe give up and rewrite the entire 
1 MB stripe after too many small overwrites have accumulated.  
Something along those lines to reduce the cost of small IOs to large objects.
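As a rough illustration of that scheme (names, thresholds and the in-memory 
layout below are all hypothetical, not code from the tree), small writes could 
be kept as delta keys next to the stripe key, applied at read time, and folded 
back into the stripe once too many of them accumulate:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One logical stripe (e.g. 1 MB of a 4 MB rbd object) plus the small
// overwrites that have not yet been folded back into it.
struct Delta { uint64_t off; std::string data; };
struct Stripe {
  std::string base;              // value of the big stripe key
  std::vector<Delta> deltas;     // values of the small overwrite keys
};

class StripeCache {
  static const size_t max_deltas = 16;   // hypothetical fold threshold
  std::map<uint64_t, Stripe> stripes;    // stripe index -> contents

  static void apply(const std::vector<Delta>& deltas, std::string& out) {
    for (const Delta& d : deltas) {
      if (d.off + d.data.size() > out.size())
        out.resize(d.off + d.data.size(), '\0');
      out.replace(d.off, d.data.size(), d.data);   // logical overwrite
    }
  }

public:
  // A 4 KB write becomes one small key instead of a 1 MB read-modify-write.
  void write(uint64_t idx, uint64_t off, const std::string& data) {
    Stripe& s = stripes[idx];
    s.deltas.push_back(Delta{off, data});
    if (s.deltas.size() > max_deltas) {  // give up: rewrite the whole stripe
      apply(s.deltas, s.base);
      s.deltas.clear();
    }
  }

  // Reads reassemble the stripe by applying pending overwrites on the fly.
  std::string read(uint64_t idx) {
    Stripe& s = stripes[idx];
    std::string out = s.base;
    apply(s.deltas, out);
    return out;
  }
};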

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: Re: Regarding key/value interface
> 
> Hi Somnath,
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range queries 
> > (and I don't need any explicit caching etc.) and I want to replace 
> > filestore (and leveldb omap) with it, which interface would you recommend 
> > I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from the ObjectStore 
> > interfaces earlier (pre key/value-interface days), but have not tested it 
> > thoroughly enough to see what functionality is broken (basic functionality 
> > of RGW/RBD is working fine).
> >
> > Basically, I want to know what are the advantages (and 
> > disadvantages) of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting 
> > all the ObjectStore interfaces like clone and all ?
> 
> Everything is supported, I think, except for perhaps some IO hints that 
> don't make sense in a k/v context.  The big things that you get by using 
> KeyValueStore and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
> 
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
> 
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
> 
> sage
> 
> 
> 
> 
> 

RE: FW: CURSH optimization for unbalanced pg distribution

2014-09-11 Thread Sage Weil
Hi,

This is pretty exciting.  I haven't read through all of it, but have 
some initial comments on the pps mapping portion.

On Wed, 10 Sep 2014, Zhang, Jian wrote:
> Thanks. 
> 
> Created a feature here: http://tracker.ceph.com/issues/9410, to include all 
> the attachments.
> http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
> http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
> http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf
> 
> 
> Hi all,
> Several months ago we hit a read performance issue (17% degradation) while 
> working on ceph object storage performance evaluation with 10M objects 
> (scaling from 10K objects to 1 million objects), and found the root cause to 
> be unbalanced pg distribution among the osd disks, leading to unbalanced 
> data distribution. We then did some further investigation and identified 
> that CRUSH failed to map pgs evenly to each osd. Please refer to the 
> attached pdf (crush_proposals) for details.
> 
> Key Message:
> As mentioned in the attached pdf, we described possible optimization 
> proposals 
> (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for 
> CRUSH and got some feedback from the community 
> (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage 
> suggested we take the idea of "Change placement strategy only for the step 
> of selecting devices from hosts", by adding a new bucket type called 
> "linear" and applying a modulo-like hash function to this kind of bucket to 
> achieve a balanced distribution. We followed this suggestion and designed an 
> optimized CRUSH algorithm, with new hash methods and an adaptive module. 
> Please refer to the Design and Implementation part for details. We also 
> wrote a POC for it; see the attached patch. As a result, we got more than 
> 10% read performance improvement using the optimized CRUSH algorithm.
> 
> Design and Implementation:
> 1. Problem Identification
> 1.1 The input key (pps) space of CRUSH is not uniform
> Since the PG count on the nested devices of a host is not uniform even if we 
> select the device using a simple modulo operation, we decided to change the 
> algorithm that hashes a raw pg to a pps.
> 1.2 The algorithm for selecting items from buckets is not uniform
> Once we have a uniform input key space, we also need the procedure of 
> selecting devices from a host to be uniform. Since the current CRUSH 
> algorithm uses Jenkins-hash-based strategies and fails to reach that goal, 
> we decided to add a new bucket type and apply a new (modulo-based) hash 
> algorithm to it.
> 2. Design
> 2.1 New pps hash algorithm
> We designed the new pps hash algorithm based on the "congruential 
> pseudo-random number generator" 
> (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a 
> bijection between the original sequence {0, ..., 2^N-1} and some permutation 
> of it. In other words, given different keys between 0 and 2^N-1, the 
> generator will produce different integers, but within the same range 
> {0, ..., 2^N-1}. 
> Assuming there are np PGs in a pool, we can regard pgid (0 <= pgid < 2^n, 
> np <= 2^n < 2*np) as the key, which will then be hashed into a pps value 
> between 0 and 2^n-1. Since the PG count in a pool is usually a power of two, 
> the generator in this case just shuffles the original pgid sequence as 
> output, making the key space a permutation of {0, ..., 2^n-1}, which 
> achieves the best uniformity. Moreover, the poolid can be regarded as a seed 
> of the generator, producing different pps values even for the same pgid with 
> different poolids. Therefore, the pgid sequences of different pools are 
> mapped to distinct pps sequences, getting rid of PG overlap.

I made a few comments on github at

https://github.com/ceph/ceph/pull/2402/files#r17462015

I have some questions about the underlying math.  If this is similar to 
the approach used by the uniform buckets, I think 1033 needs to be > the 
denominator?  Also, I looked a bit at the referenced paper and I think the 
denominator should be prime, not 2^n-1 (pg_num_mask).

My other concern is with raw_pg_to_congruential_pps.  Adding poolid into 
the numerator before you do the modulo means that each pool has a 
different permutation.  But, if you have two pools both with (say) 1024 
PGs, they will map to the same 1024 outputs (0..1023).  The poolid is added 
into the final pps, but this doesn't really help, as it only means a 
handful of PGs get unique mappings... and they'll overlap with the next 
pool.  This is exactly the problem we were solving with the HASHPSPOOL 
flag.  Perhaps adding a pseudorandom value between 0 and 2^32 based on 
the poolid will (usually) give the pools distinct output ranges, and the 
linear mapping will still be happy with that (since the inputs for each 
pool live in a contiguous range).
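As a rough sketch of the kind of mapping being discussed (purely illustrative 
and built on the assumptions above: the multiplier 1033 is taken from the 
review comment, pg_num is assumed to be a power of two, and mix64 is just a 
stand-in for whatever per-pool hash is ultimately used):

#include <cstdint>

// Simple 64-bit mixer used as a placeholder for a per-pool pseudorandom value.
static uint64_t mix64(uint64_t x) {
  x ^= x >> 33;  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x;
}

// pgid -> pps: a linear congruential permutation of [0, 2^n), plus a per-pool
// pseudorandom offset in [0, 2^32) so different pools (usually) land in
// distinct, contiguous output ranges.
static uint64_t pgid_to_pps(uint32_t pgid, uint64_t poolid, unsigned n_bits) {
  const uint64_t mod = 1ULL << n_bits;         // 2^n (pg_num rounded up to a power of two)
  const uint64_t a = 1033;                     // odd multiplier: bijective mod 2^n
  uint64_t pps = (a * pgid + poolid) % mod;    // permutation of {0, ..., 2^n - 1}
  uint64_t pool_off = mix64(poolid) & 0xffffffffULL;
  return pps + pool_off;
}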

In any case, though, yes: this general approach will mean that the pps 
values live in a packed range in

RE: Regarding key/value interface

2014-09-11 Thread Somnath Roy
Hi Haomai,

> Makes perfect sense, Sage.
>
> Regarding striping of file data: you are saying the KeyValue interface will 
> do the following for me?
>
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
> to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
> smaller sizes (configurable?) and stripe it as multiple key/value pairs?


Yes, and the stripe size can be configured.

[Somnath] That's great, thanks

>
>
> 2. Also, while reading, it will take care of accumulating the pieces and 
> sending them back.



Do you have any other ideas? 

[Somnath] No, I was just asking

By the way, could you tell us more about your key/value interface? I'm doing 
some work on an NVMe interface with Intel NVMe SSDs.

[Somnath] It has the following interfaces.

1. Init & shutdown

2. A container concept

3. Read/write objects, delete objects, enumerate objects, multi put/get support

4. Transaction semantics

5. Range query support

6. Container-level snapshots

7. Statistics

Let me know if you need anything more specific.

Thanks & Regards
Somnath

>
>
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: Re: Regarding key/value interface
>
> Hi Somnath,
>
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range queries 
> > (and I don't need any explicit caching etc.) and I want to replace 
> > filestore (and leveldb omap) with it, which interface would you recommend 
> > I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from the ObjectStore 
> > interfaces earlier (pre key/value-interface days), but have not tested it 
> > thoroughly enough to see what functionality is broken (basic functionality 
> > of RGW/RBD is working fine).
> >
> > Basically, I want to know what are the advantages (and 
> > disadvantages) of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting 
> > all the ObjectStore interfaces like clone and all ?
>
> Everything is supported, I think, except for perhaps some IO hints that 
> don't make sense in a k/v context.  The big things that you get by using 
> KeyValueStore and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
>
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
>
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
>
> sage
>
> 
>
>



-- 

Best Regards,

Wheat


RE: Regarding key/value interface

2014-09-11 Thread Sage Weil
On Fri, 12 Sep 2014, Somnath Roy wrote:
> Thanks Sage...
> Basically, we are doing similar chunking in our current implementation, 
> which is derived from ObjectStore. 
> Moving to key/value will save us from that :-)
> Also, I was thinking we may want to do compression (and later maybe dedupe?) 
> at that key/value layer as well.
> 
> Yes, partial read/write is definitely a performance killer for object 
> stores, and our object store is no exception. We need to see how we can 
> counter that.
> 
> But I think these are reasons enough for me to move our implementation to 
> the key/value interfaces now.

Sounds good.

By the way, hopefully this is a pretty painless process of wrapping your 
kv library with the KeyValueDB interface.  If not, that will be good to 
know.  I'm hoping it will fit well with a broad range of backends, but so 
far we've only done leveldb/rocksdb (same interface) and kinetic.  I'd 
like to see us try LMDB in this context as well...
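For anyone who has not looked at that layer, a very reduced sketch of the 
shape such a wrapper takes (simplified and hypothetical; the real interface 
uses bufferlists, iterators and more methods, and a std::map stands in for 
the native library here):

#include <map>
#include <string>
#include <vector>

// Batched-write interface: operations are queued into a transaction and
// submitted atomically, which is what ObjectStore-level code expects.
class KVDBLike {
public:
  struct Transaction {
    struct Op { bool del; std::string prefix, key, value; };
    std::vector<Op> ops;
    void set(const std::string& p, const std::string& k, const std::string& v) {
      ops.push_back(Op{false, p, k, v});
    }
    void rmkey(const std::string& p, const std::string& k) {
      ops.push_back(Op{true, p, k, ""});
    }
  };
  virtual int submit_transaction(const Transaction& t) = 0;  // atomic commit
  virtual int get(const std::string& p, const std::string& k, std::string* v) = 0;
  virtual ~KVDBLike() {}
};

// Wrapping a backend mostly means replaying the batched ops inside the
// backend's native transaction and committing it in submit_transaction().
class MapBackedDB : public KVDBLike {
  std::map<std::string, std::string> db;   // placeholder for the native store
public:
  int submit_transaction(const Transaction& t) override {
    for (const auto& op : t.ops) {
      std::string full = op.prefix + "/" + op.key;
      if (op.del) db.erase(full); else db[full] = op.value;
    }
    return 0;
  }
  int get(const std::string& p, const std::string& k, std::string* v) override {
    auto it = db.find(p + "/" + k);
    if (it == db.end()) return -1;
    *v = it->second;
    return 0;
  }
};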

sage

> 
> Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com] 
> Sent: Thursday, September 11, 2014 6:55 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: RE: Regarding key/value interface
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> > Makes perfect sense, Sage.
> > 
> > Regarding striping of file data: you are saying the KeyValue interface 
> > will do the following for me?
> > 
> > 1. Say, in the case of an rbd image of order 4 MB, for a write request 
> > coming to the Key/Value interface, it will chunk the object (say the full 
> > 4 MB) into smaller sizes (configurable?) and stripe it as multiple 
> > key/value pairs?
> > 
> > 2. Also, while reading, it will take care of accumulating the pieces and 
> > sending them back.
> 
> Precisely.
> 
> A smarter thing we might want to make it do in the future would be to take a 
> 4 KB write and create a new key that logically overwrites part of the 
> larger, say, 1 MB key, and apply it on read.  And maybe give up and rewrite 
> the entire 1 MB stripe after too many small overwrites have accumulated.  
> Something along those lines to reduce the cost of small IOs to large objects.
> 
> sage
> 
> 
> 
>  > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -Original Message-
> > From: Sage Weil [mailto:sw...@redhat.com]
> > Sent: Thursday, September 11, 2014 6:31 PM
> > To: Somnath Roy
> > Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> > ceph-devel@vger.kernel.org
> > Subject: Re: Regarding key/value interface
> > 
> > Hi Somnath,
> > 
> > On Fri, 12 Sep 2014, Somnath Roy wrote:
> > >
> > > Hi Sage/Haomai,
> > >
> > > If I have a key/value backend that supports transactions and range 
> > > queries (and I don't need any explicit caching etc.) and I want to 
> > > replace filestore (and leveldb omap) with it, which interface would you 
> > > recommend I derive from: ObjectStore directly, or KeyValueDB?
> > >
> > > I have already integrated this backend by deriving from the ObjectStore 
> > > interfaces earlier (pre key/value-interface days), but have not tested 
> > > it thoroughly enough to see what functionality is broken (basic 
> > > functionality of RGW/RBD is working fine).
> > >
> > > Basically, I want to know what are the advantages (and 
> > > disadvantages) of deriving it from the new key/value interfaces ?
> > >
> > > Also, what state is it in ? Is it feature complete and supporting 
> > > all the ObjectStore interfaces like clone and all ?
> > 
> > Everything is supported, I think, except for perhaps some IO hints that 
> > don't make sense in a k/v context.  The big things that you get by using 
> > KeyValueStore and plugging into the lower-level interface are:
> > 
> >  - striping of file data across keys
> >  - efficient clone
> >  - a zillion smaller methods that aren't conceptually difficult to 
> > implement but are tedious to do.
> > 
> > The other nice thing about reusing this code is that you can use a leveldb 
> > or rocksdb backend as a reference for testing or performance or whatever.
> > 
> > The main thing that will be a challenge going forward, I predict, is making 
> > storage of the object byte payload in key/value pairs efficient.  I think 
> > KeyValuestore is doing some simple striping, but it will suffer for small 
> > overwrites (like 512-byte or 4k writes from an RBD).  There are probably 
> > some pretty simple heuristics and tricks that can be done to mitigate the 
> > most common patterns, but there is no simple solution since the backends 
> > generally don't support partial value updates (I assume yours doesn't 
> > either?).  But, any work done here will benefit the other backends too so 
> > that would be a win..
> > 
> > sage
> > 
> > 
> > 

RE: Regarding key/value interface

2014-09-11 Thread Somnath Roy
Thanks Sage...
Basically, we are doing similar chunking in our current implementation, which 
is derived from ObjectStore. 
Moving to key/value will save us from that :-)
Also, I was thinking we may want to do compression (and later maybe dedupe?) 
at that key/value layer as well.

Yes, partial read/write is definitely a performance killer for object stores, 
and our object store is no exception. We need to see how we can counter that.

But I think these are reasons enough for me to move our implementation to 
the key/value interfaces now.

Regards
Somnath


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Thursday, September 11, 2014 6:55 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
ceph-devel@vger.kernel.org
Subject: RE: Regarding key/value interface

On Fri, 12 Sep 2014, Somnath Roy wrote:
> Makes perfect sense, Sage.
> 
> Regarding striping of file data: you are saying the KeyValue interface will 
> do the following for me?
> 
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
> to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
> smaller sizes (configurable?) and stripe it as multiple key/value pairs?
> 
> 2. Also, while reading, it will take care of accumulating the pieces and 
> sending them back.

Precisely.

A smarter thing we might want to make it do in the future would be to take a 
4 KB write and create a new key that logically overwrites part of the larger, 
say, 1 MB key, and apply it on read.  And maybe give up and rewrite the entire 
1 MB stripe after too many small overwrites have accumulated.  
Something along those lines to reduce the cost of small IOs to large objects.

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: Re: Regarding key/value interface
> 
> Hi Somnath,
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range queries 
> > (and I don't need any explicit caching etc.) and I want to replace 
> > filestore (and leveldb omap) with it, which interface would you recommend 
> > I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from the ObjectStore 
> > interfaces earlier (pre key/value-interface days), but have not tested it 
> > thoroughly enough to see what functionality is broken (basic functionality 
> > of RGW/RBD is working fine).
> >
> > Basically, I want to know what are the advantages (and 
> > disadvantages) of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting 
> > all the ObjectStore interfaces like clone and all ?
> 
> Everything is supported, I think, except for perhaps some IO hints that 
> don't make sense in a k/v context.  The big things that you get by using 
> KeyValueStore and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
> 
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
> 
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
> 
> sage
> 
> 
> 
> 
> 


[RFC]New Message Implementation Based on Event

2014-09-11 Thread Haomai Wang
Hi all,

Recently I did some basic work on a new messenger implementation based on
events (https://github.com/yuyuyu101/ceph/tree/msg-event). The basic
idea is that we use a Processor thread for each Messenger to monitor
all sockets and dispatch ready fds to a threadpool. The event mechanism
can be epoll, kqueue, poll or select. A thread in the threadpool then
reads/writes on the socket and dispatches the message later.
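As a rough sketch of that structure (purely illustrative; this is not the code 
in the branch, and error handling, message decoding and fd re-arming are all 
left out; a real version would use EPOLLONESHOT or similar so two workers 
never service the same socket at once):

#include <sys/epoll.h>
#include <unistd.h>
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One Processor thread epoll_wait()s on all sockets and hands ready fds to a
// small worker pool, instead of dedicating threads to each connection the way
// the Pipe implementation does.
class Processor {
  int epfd = epoll_create1(0);
  std::queue<int> ready;                 // fds with pending work
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> stopping{false};

  void handle(int fd) {                  // worker side: read, decode, dispatch
    char buf[4096];
    ssize_t r = ::read(fd, buf, sizeof(buf));
    (void)r;                             // message decode/dispatch would go here
  }

  void worker() {
    for (;;) {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [&] { return stopping || !ready.empty(); });
      if (stopping && ready.empty())
        return;
      int fd = ready.front(); ready.pop();
      l.unlock();
      handle(fd);
    }
  }

public:
  void add_socket(int fd) {              // register a connection with the loop
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
  }

  void shutdown() { stopping = true; cv.notify_all(); }

  void run(int nworkers) {               // the single event-loop thread
    std::vector<std::thread> pool;
    for (int i = 0; i < nworkers; i++)
      pool.emplace_back(&Processor::worker, this);
    epoll_event evs[64];
    while (!stopping) {
      int n = epoll_wait(epfd, evs, 64, 100 /* ms */);
      std::lock_guard<std::mutex> g(m);
      for (int i = 0; i < n; i++)
        ready.push(evs[i].data.fd);
      cv.notify_all();
    }
    for (auto& t : pool) t.join();
    close(epfd);
  }
};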

Now the branch has passed basic tests. Before making it more stable and
passing more QA suites, I want to do some benchmark tests against the
pipe implementation on a large-scale cluster; I would like to use at
least 100 OSDs (SSD) and hundreds of clients to test it. In the current
benchmark with only one OSD, the client gets the same latency as with
the pipe implementation, and the latency stdev is smaller.

The background for this implementation is that the pipe implementation
incurs too much overhead from context switches and thread resources. In
our environment, several ceph-osd daemons run on compute nodes that also
run KVM processes.

Do you have any ideas about this, or any serious concerns compared to pipe?

-- 

Best Regards,

Wheat


Re: RBD readahead strategies

2014-09-11 Thread Sage Weil
On Wed, 10 Sep 2014, Adam Crume wrote:
> I've been testing a few strategies for RBD readahead and wanted to
> share my results as well as ask for input.
> 
> I have four sample workloads that I replayed at maximum speed with
> rbd-replay.  boot-ide and boot-virtio are captured from booting a VM
> with the image on the IDE and virtio buses, respectively.  Likewise,
> grep-ide and grep-virtio are captured from a large grep run.  (I'm not
> entirely sure why the IDE and virtio workloads are different, but part
> of it is the number of pending requests allowed.)
> 
> The readahead strategies are:
> - none: No readahead.
> - plain: My initial implementation.  The readahead window doubles for
> each readahead request, up to a limit, and resets when a random
> request is detected.
> - aligned: Same as above, but readahead requests are aligned with
> object boundaries, when possible.
> - eager: When activated, read to the end of the object.
> 
> For all of these, 10 sequential requests trigger readahead, the
> maximum readahead size is 4 MB, and "rbd readahead disable after
> bytes" is disabled (meaning that readahead is enabled for the entire
> workload).  The object size is the default 4 MB, and data is striped
> over a single object.  (Alignment with stripes or object sets is
> ignored for now.)
> 
> Here's the data:
> 
> workload      strategy   time (seconds)     RA ops   RA MB   read ops   read MB
> boot-ide      none        46.22 +/- 0.41          0       0      57516       407
> boot-ide      plain       11.42 +/- 0.25        281     203      57516       407
> boot-ide      aligned     11.46 +/- 0.13        276     201      57516       407
> boot-ide      eager       12.48 +/- 0.61        111     303      57516       407
> boot-virtio   none         9.05 +/- 0.25          0       0      11851       393
> boot-virtio   plain        8.05 +/- 0.38        451     221      11851       393
> boot-virtio   aligned      7.86 +/- 0.27        452     213      11851       393
> boot-virtio   eager        9.17 +/- 0.34        249     600      11851       393
> grep-ide      none       138.55 +/- 1.67          0       0     130104      3044
> grep-ide      plain      136.07 +/- 1.57        397     867     130104      3044
> grep-ide      aligned    137.30 +/- 1.77        379     844     130104      3044
> grep-ide      eager      138.77 +/- 1.52        346     993     130104      3044
> grep-virtio   none       120.73 +/- 1.33          0       0     130061      2820
> grep-virtio   plain      121.29 +/- 1.28       1186    1485     130061      2820
> grep-virtio   aligned    123.32 +/- 1.29       1139    1409     130061      2820
> grep-virtio   eager      127.75 +/- 1.32        842    2218     130061      2820
> 
> (The time is the mean wall-clock time +/- the margin of error with
> 99.7% confidence.  RA=readahead.)
> 
> Right off the bat, readahead is a huge improvement for the boot-ide
> workload, which is no surprise because it issues 50,000 sequential,
> single-sector reads.  (Why the early boot process is so inefficient is
> open for speculation, but that's a real, natural workload.)
> boot-virtio also sees an improvement, although not nearly so dramatic.
> The grep workloads show no statistically significant improvement.
> 
> One conclusion I draw is that 'eager' is, well, too eager.  'aligned'
> shows no statistically significant difference from 'plain', and
> 'plain' is no worse than 'none' (at statistically significant levels)
> and sometimes better.
> 
> Should the readahead strategy be configurable, or should we just stick
> with whichever seems the best one?  Is there anything big I'm missing?

Aligned seems like, even if it is no faster from the client's perspective, it 
will result in fewer IOs on the backend, right?  That makes me think we 
should go with that if we have to choose one.

Have you looked at what it might take to put the readahead logic in 
ObjectCacher somewhere, or in some other piece of shared code that would 
allow us to subsume the Client.cc readahead code as well?  Perhaps simply 
wrapping the readahead logic in a single class such that the calling code 
is super simple (just feeds in current offset and conditionally issues a 
readahead IO) would work as well.
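As a rough sketch of such a wrapper (names, defaults and the exact policy 
below are placeholders, not the eventual shared class): the caller feeds in 
each read's offset and length and gets back an extent to read ahead, or a 
zero length for "do nothing".

#include <algorithm>
#include <cstdint>
#include <utility>

// Toy readahead state machine: after `trigger` sequential reads, suggest a
// readahead extent whose size doubles per hit up to `max_bytes`, optionally
// clipped to the next object boundary; a random read resets the window.
class Readahead {
  uint64_t trigger = 10;               // sequential requests before readahead starts
  uint64_t max_bytes = 4 << 20;        // 4 MB cap, as in the tests above
  uint64_t object_size = 4 << 20;      // used by the "aligned" variant
  bool align = true;

  uint64_t next_offset = 0;            // where the next sequential read would start
  uint64_t streak = 0;                 // consecutive sequential reads seen
  uint64_t window = 0;                 // current readahead size
  uint64_t ra_pos = 0;                 // end of data already requested

public:
  // Returns {offset, length} to read ahead; length 0 means no readahead.
  std::pair<uint64_t, uint64_t> update(uint64_t off, uint64_t len) {
    if (off != next_offset) {          // random read: reset the window
      streak = 0;
      window = 0;
      next_offset = off + len;
      ra_pos = next_offset;
      return {0, 0};
    }
    next_offset = off + len;
    if (ra_pos < next_offset)
      ra_pos = next_offset;
    if (++streak < trigger)
      return {0, 0};

    window = window ? std::min(window * 2, max_bytes) : std::max<uint64_t>(len, 4096);
    uint64_t start = ra_pos;
    uint64_t end = start + window;
    if (align)                         // stop at the next object boundary
      end = std::min(end, (start / object_size + 1) * object_size);
    ra_pos = end;
    return {start, end - start};
  }
};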

sage


Re: Regarding key/value interface

2014-09-11 Thread Haomai Wang
On Fri, Sep 12, 2014 at 9:46 AM, Somnath Roy  wrote:
>
> Makes perfect sense, Sage.
>
> Regarding striping of file data: you are saying the KeyValue interface will 
> do the following for me?
>
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
> to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
> smaller sizes (configurable?) and stripe it as multiple key/value pairs?


Yes, and the stripe size can be configured.
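As a rough sketch of that striping (illustrative only; key naming and the 
stripe size here are made up, and the real KeyValueStore pushes these writes 
through transactions rather than touching a map directly):

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

static const uint64_t stripe_size = 64 * 1024;   // hypothetical, configurable

using KV = std::map<std::string, std::string>;   // stands in for the backend

// Write: split the object's data into stripe_size chunks, one key per chunk.
void write_object(KV& db, const std::string& oid, const std::string& data) {
  for (uint64_t off = 0; off < data.size(); off += stripe_size) {
    std::string key = oid + "." + std::to_string(off / stripe_size);
    db[key] = data.substr(off, stripe_size);
  }
}

// Read: accumulate the chunks in order and hand back one contiguous buffer.
std::string read_object(const KV& db, const std::string& oid, uint64_t len) {
  std::string out;
  for (uint64_t stripe = 0; out.size() < len; ++stripe) {
    auto it = db.find(oid + "." + std::to_string(stripe));
    if (it == db.end())
      break;                                     // hole or end of object
    out += it->second;
  }
  out.resize(std::min<uint64_t>(out.size(), len));
  return out;
}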

>
>
> 2. Also, while reading, it will take care of accumulating the pieces and 
> sending them back.



Do you have any other ideas? By the way, could you tell us more about your
key/value interface? I'm doing some work on an NVMe interface with Intel
NVMe SSDs.

>
>
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: Re: Regarding key/value interface
>
> Hi Somnath,
>
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range queries
> > (and I don't need any explicit caching etc.) and I want to replace
> > filestore (and leveldb omap) with it, which interface would you recommend
> > I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from the ObjectStore
> > interfaces earlier (pre key/value-interface days), but have not tested it
> > thoroughly enough to see what functionality is broken (basic functionality
> > of RGW/RBD is working fine).
> >
> > Basically, I want to know what are the advantages (and disadvantages)
> > of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting all
> > the ObjectStore interfaces like clone and all ?
>
> Everything is supported, I think, except for perhaps some IO hints that 
> don't make sense in a k/v context.  The big things that you get by using 
> KeyValueStore and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
>
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
>
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
>
> sage
>
> 
>
>



-- 

Best Regards,

Wheat


RE: Regarding key/value interface

2014-09-11 Thread Sage Weil
On Fri, 12 Sep 2014, Somnath Roy wrote:
> Makes perfect sense, Sage.
> 
> Regarding striping of file data: you are saying the KeyValue interface will 
> do the following for me?
> 
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
> to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
> smaller sizes (configurable?) and stripe it as multiple key/value pairs?
> 
> 2. Also, while reading, it will take care of accumulating the pieces and 
> sending them back.

Precisely.

A smarter thing we might want to make it do in the future would be to take 
a 4 KB write and create a new key that logically overwrites part of the 
larger, say, 1 MB key, and apply it on read.  And maybe give up and rewrite 
the entire 1 MB stripe after too many small overwrites have accumulated.  
Something along those lines to reduce the cost of small IOs to large 
objects.

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
> ceph-devel@vger.kernel.org
> Subject: Re: Regarding key/value interface
> 
> Hi Somnath,
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range queries
> > (and I don't need any explicit caching etc.) and I want to replace
> > filestore (and leveldb omap) with it, which interface would you recommend
> > I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from the ObjectStore
> > interfaces earlier (pre key/value-interface days), but have not tested it
> > thoroughly enough to see what functionality is broken (basic functionality
> > of RGW/RBD is working fine).
> >
> > Basically, I want to know what are the advantages (and disadvantages)
> > of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting all
> > the ObjectStore interfaces like clone and all ?
> 
> Everything is supported, I think, except for perhaps some IO hints that 
> don't make sense in a k/v context.  The big things that you get by using 
> KeyValueStore and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
> 
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
> 
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
> 
> sage
> 
> 
> 
> 
> 


RE: Regarding key/value interface

2014-09-11 Thread Somnath Roy
Makes perfect sense, Sage.

Regarding striping of file data: you are saying the KeyValue interface will do 
the following for me?

1. Say, in the case of an rbd image of order 4 MB, for a write request coming 
to the Key/Value interface, it will chunk the object (say the full 4 MB) into 
smaller sizes (configurable?) and stripe it as multiple key/value pairs?

2. Also, while reading, it will take care of accumulating the pieces and 
sending them back.


Thanks & Regards
Somnath


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, September 11, 2014 6:31 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
ceph-devel@vger.kernel.org
Subject: Re: Regarding key/value interface

Hi Somnath,

On Fri, 12 Sep 2014, Somnath Roy wrote:
>
> Hi Sage/Haomai,
>
> If I have a key/value backend that supports transactions and range queries
> (and I don't need any explicit caching etc.) and I want to replace
> filestore (and leveldb omap) with it, which interface would you recommend
> I derive from: ObjectStore directly, or KeyValueDB?
>
> I have already integrated this backend by deriving from the ObjectStore
> interfaces earlier (pre key/value-interface days), but have not tested it
> thoroughly enough to see what functionality is broken (basic functionality
> of RGW/RBD is working fine).
>
> Basically, I want to know what are the advantages (and disadvantages)
> of deriving it from the new key/value interfaces ?
>
> Also, what state is it in ? Is it feature complete and supporting all
> the ObjectStore interfaces like clone and all ?

Everything is supported, I think, except for perhaps some IO hints that don't 
make sense in a k/v context.  The big things that you get by using 
KeyValueStore and plugging into the lower-level interface are:

 - striping of file data across keys
 - efficient clone
 - a zillion smaller methods that aren't conceptually difficult to implement 
but are tedious to do.

The other nice thing about reusing this code is that you can use a leveldb or 
rocksdb backend as a reference for testing or performance or whatever.

The main thing that will be a challenge going forward, I predict, is making 
storage of the object byte payload in key/value pairs efficient.  I think 
KeyValuestore is doing some simple striping, but it will suffer for small 
overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
pretty simple heuristics and tricks that can be done to mitigate the most 
common patterns, but there is no simple solution since the backends generally 
don't support partial value updates (I assume yours doesn't either?).  But, any 
work done here will benefit the other backends too so that would be a win..

sage






Re: Regarding key/value interface

2014-09-11 Thread Sage Weil
Hi Somnath,

On Fri, 12 Sep 2014, Somnath Roy wrote:
> 
> Hi Sage/Haomai,
> 
> If I have a key/value backend that supports transactions and range queries 
> (and I don't need any explicit caching etc.) and I want to replace filestore 
> (and leveldb omap) with it, which interface would you recommend I derive 
> from: ObjectStore directly, or KeyValueDB?
> 
> I have already integrated this backend by deriving from the ObjectStore
> interfaces earlier (pre key/value-interface days), but have not tested it
> thoroughly enough to see what functionality is broken (basic functionality
> of RGW/RBD is working fine).
> 
> Basically, I want to know what are the advantages (and disadvantages) of
> deriving it from the new key/value interfaces ?
> 
> Also, what state is it in ? Is it feature complete and supporting all the
> ObjectStore interfaces like clone and all ?

Everything is supported, I think, except for perhaps some IO hints that don't 
make sense in a k/v context.  The big things that you get by using 
KeyValueStore and plugging into the lower-level interface are:

 - striping of file data across keys
 - efficient clone
 - a zillion smaller methods that aren't conceptually difficult to 
implement but are tedious to do.

The other nice thing about reusing this code is that you can use a leveldb 
or rocksdb backend as a reference for testing or performance or whatever.

The main thing that will be a challenge going forward, I predict, is 
making storage of the object byte payload in key/value pairs efficient.  I 
think KeyValueStore is doing some simple striping, but it will suffer for 
small overwrites (like 512-byte or 4k writes from an RBD).  There are 
probably some pretty simple heuristics and tricks that can be done to 
mitigate the most common patterns, but there is no simple solution since 
the backends generally don't support partial value updates (I assume yours 
doesn't either?).  But, any work done here will benefit the other backends 
too so that would be a win..

sage

RE: osd cpu usage is bigger than 100%

2014-09-11 Thread Chen, Xiaoxi
1. 12% wa is quite normal; with more disks and more load you could even see 
30%+ in a random-write case.
2. Our BKM (best known method) is to set osd_op_threads to 20 (see the 
ceph.conf sketch below).
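A ceph.conf sketch of that tuning, using only the option names and values that 
appear in this thread (the defaults are the ones shown in the config dump 
quoted below); treat it as a starting point for benchmarking rather than a 
blanket recommendation:

[osd]
    osd op threads = 20          # default 2
    osd disk threads = 1         # default
    osd recovery threads = 1     # default
    filestore op threads = 2     # default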

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of yue longguang
Sent: Thursday, September 11, 2014 3:44 PM
To: ceph-devel@vger.kernel.org
Subject: osd cpu usage is bigger than 100%

Hi all,
I am testing rbd performance. At the moment there is only one VM, which is
using rbd as its disk, and inside it fio is doing r/w.
The big difference is that I set a large iodepth rather than iodepth=1.
According to my tests, the bigger the iodepth, the higher the CPU usage.

Analysing the output of the top command:

1. 12% wa: does that mean the disk is not fast enough?

2. How can we tell whether ceph's number of threads is enough or not?


What do you think is using up the CPU? I want to find the root cause of why a 
big iodepth leads to high CPU usage.


--- default options ---
  "osd_op_threads": "2",
  "osd_disk_threads": "1",
  "osd_recovery_threads": "1",
  "filestore_op_threads": "2",


thanks

--- top (iodepth=16) ---
top - 15:27:34 up 2 days,  6:03,  2 users,  load average: 0.49, 0.56, 0.62
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 19.0%us,  8.1%sy,  0.0%ni, 59.3%id, 12.1%wa,  0.0%hi,  0.8%si,  0.7%st
Mem:   1922540k total,  1853180k used,    69360k free,     7012k buffers
Swap:  1048568k total,    76796k used,   971772k free,  1034272k cached

  PID USER   PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 2763 root   20   0 1112m 386m 5028 S  60.8 20.6 200:43.47  ceph-osd

--- top ---
top - 19:50:08 up 1 day, 10:26,  2 users,  load average: 1.55, 0.97, 0.81
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 37.6%us, 14.2%sy,  0.0%ni, 37.0%id,  9.4%wa,  0.0%hi,  1.3%si,  0.5%st
Mem:   1922540k total,  1820196k used,   102344k free,    23100k buffers
Swap:  1048568k total,    91724k used,   956844k free,  1052292k cached

  PID USER   PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 4312 root   20   0 1100m 337m 5192 S 107.3 18.0  88:33.27  ceph-osd
 1704 root   20   0  514m 272m 3648 S   0.7 14.5   3:27.19  ceph-mon



--- iostat ---

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
vdd        5.50   137.50  247.00  782.00  2896.00   8773.00     11.34      7.08   3.55   0.63  65.05
vdd        9.50   119.00  327.50  458.50  3940.00   4733.50     11.03     12.03  19.66   0.70  55.40
vdd       15.50    10.50  324.00  559.50  3784.00   3398.00      8.13      1.98   2.22   0.81  71.25
vdd        4.50   253.50  273.50  803.00  3056.00  12155.00     14.13      4.70   4.32   0.55  59.55
vdd       10.00     6.00  294.00  488.00  3200.00   2933.50      7.84      1.10   1.49   0.70  54.85
vdd       10.00    14.00  333.00  645.00  3780.00   3846.00      7.80      2.13   2.15   0.90  87.55
vdd       11.00   240.50  259.00  579.00  3144.00  10035.50     15.73      8.51  10.18   0.84  70.20
vdd       10.50    17.00  318.50  707.00  3876.00   4084.50      7.76      1.32   1.30   0.61  62.65
vdd        4.50   208.00  233.50  918.00  2648.00  19214.50     18.99      5.43   4.71   0.55  63.20
vdd        7.00     1.50  306.00  212.00  3376.00   2176.50     10.72      1.03   1.83   0.96  49.70

[GIT PULL] Ceph fixes for 3.17-rc5

2014-09-11 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The main thing here is a set of three patches that fix a buffer overrun 
for large authentication tickets (sigh).  There is also a trivial warning 
fix and an error path fix that are both regressions.

Thanks!
sage


Ilya Dryomov (3):
  rbd: avoid format-security warning inside alloc_workqueue()
  libceph: add process_one_ticket() helper
  libceph: do not hard code max auth ticket len

Sage Weil (1):
  libceph: gracefully handle large reply messages from the mon

Wei Yongjun (1):
  rbd: fix error return code in rbd_dev_device_setup()

 drivers/block/rbd.c   |   6 +-
 net/ceph/auth_x.c | 256 ++
 net/ceph/mon_client.c |   8 ++
 3 files changed, 147 insertions(+), 123 deletions(-)


Re: set_alloc_hint old osds

2014-09-11 Thread Samuel Just
Yeah, so that's part of it.  The larger question is whether it's ok
for the client to indiscriminately send that op in the first place.
-Sam

On Thu, Sep 11, 2014 at 2:05 PM, Gregory Farnum  wrote:
> Oh, in that case the peers could just share their supported ops with
> the primary or something (like we do with mon commands). That sounds
> good to me, anyway?
> -Greg
>
> On Thu, Sep 11, 2014 at 1:46 PM, Samuel Just  wrote:
>> No, we don't put the transaction into the pg log.
>> -Sam
>>
>> On Thu, Sep 11, 2014 at 1:40 PM, Gregory Farnum  wrote:
>>> Does the hint not go into the pg log? Which could be retried on an older 
>>> OSD?
>>>
>>> On Thu, Sep 11, 2014 at 1:33 PM, Samuel Just  wrote:
 That part is harmless, the transaction would be recreated for the new
 acting set taking into account the new acting set features.  It
 doesn't have any actual effect on the contents of the object.
 -Sam

 On Thu, Sep 11, 2014 at 1:30 PM, Gregory Farnum  wrote:
> On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
>> http://tracker.ceph.com/issues/9419
>>
>> librbd unconditionally sends set_alloc_hint.  Do we require that users
>> upgrade the osds first?  Also, should the primary respond with
>> ENOTSUPP if any replicas don't support it?
>
> Something closer to the second option, I think...but then you run into
> the problem where maybe the PG gets moved from a set of new OSDs to a
> set of old ones that don't support the op. :/ I think for anything
> that goes to disk you need to go through a full features-in-the-osdmap
> process like we did for erasure coding.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: set_alloc_hint old osds

2014-09-11 Thread Gregory Farnum
Oh, in that case the peers could just share their supported ops with
the primary or something (like we do with mon commands). That sounds
good to me, anyway?
-Greg

On Thu, Sep 11, 2014 at 1:46 PM, Samuel Just  wrote:
> No, we don't put the transaction into the pg log.
> -Sam
>
> On Thu, Sep 11, 2014 at 1:40 PM, Gregory Farnum  wrote:
>> Does the hint not go into the pg log? Which could be retried on an older OSD?
>>
>> On Thu, Sep 11, 2014 at 1:33 PM, Samuel Just  wrote:
>>> That part is harmless, the transaction would be recreated for the new
>>> acting set taking into account the new acting set features.  It
>>> doesn't have any actual effect on the contents of the object.
>>> -Sam
>>>
>>> On Thu, Sep 11, 2014 at 1:30 PM, Gregory Farnum  wrote:
 On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
> http://tracker.ceph.com/issues/9419
>
> librbd unconditionally sends set_alloc_hint.  Do we require that users
> upgrade the osds first?  Also, should the primary respond with
> ENOTSUPP if any replicas don't support it?

 Something closer to the second option, I think...but then you run into
 the problem where maybe the PG gets moved from a set of new OSDs to a
 set of old ones that don't support the op. :/ I think for anything
 that goes to disk you need to go through a full features-in-the-osdmap
 process like we did for erasure coding.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


jerasure buffer misaligned

2014-09-11 Thread Loic Dachary
Hi Janne,

In the context of the jerasure buffers you found to be misaligned 
(http://tracker.ceph.com/issues/9408), it would help in fixing the issue if you 
had a simple way to reproduce it. I read the code once more and I don't see 
anything obviously wrong. 

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: set_alloc_hint old osds

2014-09-11 Thread Samuel Just
No, we don't put the transaction into the pg log.
-Sam

On Thu, Sep 11, 2014 at 1:40 PM, Gregory Farnum  wrote:
> Does the hint not go into the pg log? Which could be retried on an older OSD?
>
> On Thu, Sep 11, 2014 at 1:33 PM, Samuel Just  wrote:
>> That part is harmless, the transaction would be recreated for the new
>> acting set taking into account the new acting set features.  It
>> doesn't have any actual effect on the contents of the object.
>> -Sam
>>
>> On Thu, Sep 11, 2014 at 1:30 PM, Gregory Farnum  wrote:
>>> On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
 http://tracker.ceph.com/issues/9419

 librbd unconditionally sends set_alloc_hint.  Do we require that users
 upgrade the osds first?  Also, should the primary respond with
 ENOTSUPP if any replicas don't support it?
>>>
>>> Something closer to the second option, I think...but then you run into
>>> the problem where maybe the PG gets moved from a set of new OSDs to a
>>> set of old ones that don't support the op. :/ I think for anything
>>> that goes to disk you need to go through a full features-in-the-osdmap
>>> process like we did for erasure coding.
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: set_alloc_hint old osds

2014-09-11 Thread Gregory Farnum
Does the hint not go into the pg log? Which could be retried on an older OSD?

On Thu, Sep 11, 2014 at 1:33 PM, Samuel Just  wrote:
> That part is harmless, the transaction would be recreated for the new
> acting set taking into account the new acting set features.  It
> doesn't have any actual effect on the contents of the object.
> -Sam
>
> On Thu, Sep 11, 2014 at 1:30 PM, Gregory Farnum  wrote:
>> On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
>>> http://tracker.ceph.com/issues/9419
>>>
>>> librbd unconditionally sends set_alloc_hint.  Do we require that users
>>> upgrade the osds first?  Also, should the primary respond with
>>> ENOTSUPP if any replicas don't support it?
>>
>> Something closer to the second option, I think...but then you run into
>> the problem where maybe the PG gets moved from a set of new OSDs to a
>> set of old ones that don't support the op. :/ I think for anything
>> that goes to disk you need to go through a full features-in-the-osdmap
>> process like we did for erasure coding.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: set_alloc_hint old osds

2014-09-11 Thread Samuel Just
That part is harmless, the transaction would be recreated for the new
acting set taking into account the new acting set features.  It
doesn't have any actual effect on the contents of the object.
-Sam

On Thu, Sep 11, 2014 at 1:30 PM, Gregory Farnum  wrote:
> On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
>> http://tracker.ceph.com/issues/9419
>>
>> librbd unconditionally sends set_alloc_hint.  Do we require that users
>> upgrade the osds first?  Also, should the primary respond with
>> ENOTSUPP if any replicas don't support it?
>
> Something closer to the second option, I think...but then you run into
> the problem where maybe the PG gets moved from a set of new OSDs to a
> set of old ones that don't support the op. :/ I think for anything
> that goes to disk you need to go through a full features-in-the-osdmap
> process like we did for erasure coding.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: set_alloc_hint old osds

2014-09-11 Thread Gregory Farnum
On Thu, Sep 11, 2014 at 1:19 PM, Samuel Just  wrote:
> http://tracker.ceph.com/issues/9419
>
> librbd unconditionally sends set_alloc_hint.  Do we require that users
> upgrade the osds first?  Also, should the primary respond with
> ENOTSUPP if any replicas don't support it?

Something closer to the second option, I think...but then you run into
the problem where maybe the PG gets moved from a set of new OSDs to a
set of old ones that don't support the op. :/ I think for anything
that goes to disk you need to go through a full features-in-the-osdmap
process like we did for erasure coding.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
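
To make the upgrade concern above concrete, here is a minimal C++ sketch of
the transaction re-filtering Sam describes: the advisory hint op is simply
dropped when the acting set's common feature mask lacks the relevant bit,
rather than failing the whole write with ENOTSUPP. The names
(CEPH_FEATURE_OSD_SET_ALLOC_HINT, OP_SETALLOCHINT, filter_for_peers) and the
feature-bit value are illustrative assumptions, not the actual Ceph code.

// Sketch only: drop the alloc-hint op for peers that lack the (assumed)
// feature bit; removing it has no effect on the object's contents.
#include <cstdint>
#include <vector>

constexpr uint64_t CEPH_FEATURE_OSD_SET_ALLOC_HINT = 1ULL << 50;  // hypothetical bit
constexpr int OP_SETALLOCHINT = 39;                               // illustrative opcode

struct OpEntry {
  int opcode;   // e.g. OP_SETALLOCHINT, a write, a truncate, ...
};

// Re-filter the ops for the current acting set's minimum feature mask.
std::vector<OpEntry> filter_for_peers(const std::vector<OpEntry>& ops,
                                      uint64_t min_peer_features)
{
  std::vector<OpEntry> out;
  for (const auto& op : ops) {
    if (op.opcode == OP_SETALLOCHINT &&
        !(min_peer_features & CEPH_FEATURE_OSD_SET_ALLOC_HINT))
      continue;   // old replica in the set: skip the hint instead of ENOTSUPP
    out.push_back(op);
  }
  return out;
}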


set_alloc_hint old osds

2014-09-11 Thread Samuel Just
http://tracker.ceph.com/issues/9419

librbd unconditionally sends set_alloc_hint.  Do we require that users
upgrade the osds first?  Also, should the primary respond with
ENOTSUPP if any replicas don't support it?
-Sam


Re: OpTracker optimization

2014-09-11 Thread Samuel Just
Just added it to wip-sam-testing.
-Sam

On Thu, Sep 11, 2014 at 11:30 AM, Somnath Roy  wrote:
> Sam/Sage,
> I have addressed all of your comments and pushed the changes to the same pull 
> request.
>
> https://github.com/ceph/ceph/pull/2440
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Wednesday, September 10, 2014 8:33 PM
> To: Somnath Roy
> Cc: Samuel Just; ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
> Subject: RE: OpTracker optimization
>
> I had two substantive comments on the first patch and then some trivial
> whitespace nits. Otherwise looks good!
>
> thanks-
> sage
>
> On Thu, 11 Sep 2014, Somnath Roy wrote:
>
>> Sam/Sage,
>> I have incorporated all of your comments. Please have a look at the same 
>> pull request.
>>
>> https://github.com/ceph/ceph/pull/2440
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: Samuel Just [mailto:sam.j...@inktank.com]
>> Sent: Wednesday, September 10, 2014 3:25 PM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org;
>> ceph-us...@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Oh, I changed my mind, your approach is fine.  I was unclear.
>> Currently, I just need you to address the other comments.
>> -Sam
>>
>> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
>> > As I understand, you want me to implement the following.
>> >
>> > 1. Keep this implementation: one sharded optracker for the IOs going
>> > through the ms_dispatch path.
>> >
>> > 2. Additionally, for IOs going through ms_fast_dispatch, you want me
>> > to implement an optracker (without internal sharding) per opwq shard.
>> >
>> > Am I right?
>> >
>> > Thanks & Regards
>> > Somnath
>> >
>> > -Original Message-
>> > From: Samuel Just [mailto:sam.j...@inktank.com]
>> > Sent: Wednesday, September 10, 2014 3:08 PM
>> > To: Somnath Roy
>> > Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org;
>> > ceph-us...@lists.ceph.com
>> > Subject: Re: OpTracker optimization
>> >
>> > I don't quite understand.
>> > -Sam
>> >
>> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  
>> > wrote:
>> >> Thanks Sam.
>> >> So, you want me to go with optracker/ShardedOpWQ, right?
>> >>
>> >> Regards
>> >> Somnath
>> >>
>> >> -Original Message-
>> >> From: Samuel Just [mailto:sam.j...@inktank.com]
>> >> Sent: Wednesday, September 10, 2014 2:36 PM
>> >> To: Somnath Roy
>> >> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org;
>> >> ceph-us...@lists.ceph.com
>> >> Subject: Re: OpTracker optimization
>> >>
>> >> Responded with cosmetic nonsense.  Once you've got that and the other 
>> >> comments addressed, I can put it in wip-sam-testing.
>> >> -Sam
>> >>
>> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  
>> >> wrote:
>> >>> Thanks Sam..I responded back :-)
>> >>>
>> >>> -Original Message-
>> >>> From: ceph-devel-ow...@vger.kernel.org
>> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>> >>> Sent: Wednesday, September 10, 2014 11:17 AM
>> >>> To: Somnath Roy
>> >>> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org;
>> >>> ceph-us...@lists.ceph.com
>> >>> Subject: Re: OpTracker optimization
>> >>>
>> >>> Added a comment about the approach.
>> >>> -Sam
>> >>>
>> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  
>> >>> wrote:
>>  Hi Sam/Sage,
>> 
>>  As we discussed earlier, enabling the present OpTracker code degrades
>>  performance severely. For example, in my setup a single OSD node with
>>  10 clients is reaching ~103K read iops with io served from memory
>>  while optracking is disabled, but with optracker enabled it is reduced to
>>  ~39K iops.
>>  Running the OSD without OpTracker enabled is probably not an option
>>  for many Ceph users.
>> 
>>  Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
>>  ops_in_flight) and removing some other bottlenecks, I am able to
>>  match the performance of an OpTracking-enabled OSD with OpTracking
>>  disabled, but at the expense of ~1 extra CPU core.
>> 
>>  In this process I have also fixed the following tracker.
>> 
>> 
>> 
>>  http://tracker.ceph.com/issues/9384
>> 
>> 
>> 
>>  and probably http://tracker.ceph.com/issues/8885 too.
>> 
>> 
>> 
>>  I have created the following pull request for the same. Please review it.
>> 
>> 
>> 
>>  https://github.com/ceph/ceph/pull/2440
>> 
>> 
>> 
>>  Thanks & Regards
>> 
>>  Somnath
>> 
>> 
>> 
>> 

RE: OpTracker optimization

2014-09-11 Thread Somnath Roy
Sam/Sage,
I have addressed all of your comments and pushed the changes to the same pull 
request.

https://github.com/ceph/ceph/pull/2440

Thanks & Regards
Somnath

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Wednesday, September 10, 2014 8:33 PM
To: Somnath Roy
Cc: Samuel Just; ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
Subject: RE: OpTracker optimization

I had two substantive comments on the first patch and then some trivial
whitespace nits. Otherwise looks good!

thanks-
sage

On Thu, 11 Sep 2014, Somnath Roy wrote:

> Sam/Sage,
> I have incorporated all of your comments. Please have a look at the same pull 
> request.
> 
> https://github.com/ceph/ceph/pull/2440
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 3:25 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; 
> ceph-us...@lists.ceph.com
> Subject: Re: OpTracker optimization
> 
> Oh, I changed my mind, your approach is fine.  I was unclear.
> Currently, I just need you to address the other comments.
> -Sam
> 
> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
> > As I understand, you want me to implement the following.
> >
> > 1. Keep this implementation: one sharded optracker for the IOs going
> > through the ms_dispatch path.
> >
> > 2. Additionally, for IOs going through ms_fast_dispatch, you want me
> > to implement an optracker (without internal sharding) per opwq shard.
> >
> > Am I right?
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: Samuel Just [mailto:sam.j...@inktank.com]
> > Sent: Wednesday, September 10, 2014 3:08 PM
> > To: Somnath Roy
> > Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; 
> > ceph-us...@lists.ceph.com
> > Subject: Re: OpTracker optimization
> >
> > I don't quite understand.
> > -Sam
> >
> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  
> > wrote:
> >> Thanks Sam.
> >> So, you want me to go with optracker/ShardedOpWQ, right?
> >>
> >> Regards
> >> Somnath
> >>
> >> -Original Message-
> >> From: Samuel Just [mailto:sam.j...@inktank.com]
> >> Sent: Wednesday, September 10, 2014 2:36 PM
> >> To: Somnath Roy
> >> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; 
> >> ceph-us...@lists.ceph.com
> >> Subject: Re: OpTracker optimization
> >>
> >> Responded with cosmetic nonsense.  Once you've got that and the other 
> >> comments addressed, I can put it in wip-sam-testing.
> >> -Sam
> >>
> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  
> >> wrote:
> >>> Thanks Sam..I responded back :-)
> >>>
> >>> -Original Message-
> >>> From: ceph-devel-ow...@vger.kernel.org 
> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> >>> Sent: Wednesday, September 10, 2014 11:17 AM
> >>> To: Somnath Roy
> >>> Cc: Sage Weil (sw...@redhat.com); ceph-devel@vger.kernel.org; 
> >>> ceph-us...@lists.ceph.com
> >>> Subject: Re: OpTracker optimization
> >>>
> >>> Added a comment about the approach.
> >>> -Sam
> >>>
> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  
> >>> wrote:
>  Hi Sam/Sage,
> 
>  As we discussed earlier, enabling the present OpTracker code degrades
>  performance severely. For example, in my setup a single OSD node with
>  10 clients is reaching ~103K read iops with io served from memory
>  while optracking is disabled, but with optracker enabled it is reduced to
>  ~39K iops.
>  Running the OSD without OpTracker enabled is probably not an option
>  for many Ceph users.
> 
>  Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
>  ops_in_flight) and removing some other bottlenecks, I am able to
>  match the performance of an OpTracking-enabled OSD with OpTracking
>  disabled, but at the expense of ~1 extra CPU core.
> 
>  In this process I have also fixed the following tracker.
> 
> 
> 
>  http://tracker.ceph.com/issues/9384
> 
> 
> 
>  and probably http://tracker.ceph.com/issues/8885 too.
> 
> 
> 
>  I have created the following pull request for the same. Please review it.
> 
> 
> 
>  https://github.com/ceph/ceph/pull/2440
> 
> 
> 
>  Thanks & Regards
> 
>  Somnath
> 
> 
> 
> 
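
The sharding Somnath describes above (splitting the single
OpTracker::ops_in_flight_lock and its xlist across N independently locked
shards so that hot-path register/unregister calls rarely contend) looks
roughly like the following C++ sketch. The class and member names are
illustrative only, not the code in pull request #2440.

// Illustrative sharded in-flight op tracker (not the actual PR #2440 code).
#include <cstdint>
#include <list>
#include <mutex>
#include <vector>

struct TrackedOp { uint64_t seq = 0; /* ... */ };

class ShardedOpTracker {
  struct Shard {
    std::mutex lock;             // replaces the single ops_in_flight_lock
    std::list<TrackedOp*> ops;   // replaces the single ops_in_flight xlist
  };
  std::vector<Shard> shards;

  Shard& shard_for(uint64_t seq) { return shards[seq % shards.size()]; }

public:
  explicit ShardedOpTracker(size_t n_shards = 32) : shards(n_shards) {}

  void register_op(TrackedOp* op) {
    Shard& s = shard_for(op->seq);
    std::lock_guard<std::mutex> l(s.lock);
    s.ops.push_back(op);
  }

  void unregister_op(TrackedOp* op) {
    Shard& s = shard_for(op->seq);
    std::lock_guard<std::mutex> l(s.lock);
    s.ops.remove(op);
  }

  // Dumping ops still walks every shard; that stays the slow path.
  size_t ops_in_flight() {
    size_t n = 0;
    for (auto& s : shards) {
      std::lock_guard<std::mutex> l(s.lock);
      n += s.ops.size();
    }
    return n;
  }
};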

Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder
On 09/11/2014 11:17 AM, Ilya Dryomov wrote:
> On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder  wrote:
>> On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
>>> Trying to map an image out of a pool for which we don't have an 'x'
>>> permission bit fails with -ERANGE from ceph_extract_encoded_string()
>>> due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
>>> sink, thus exposing rbd::get_id cls method return value.  (I've seen
>>> a bunch of unexplained -ERANGE reports, I bet this is it).
>>>
>>> Signed-off-by: Ilya Dryomov 
>>
>> I often think people are annoyed by my explicit type casts
>> all over the place.  This (missed) one matters a lot.
>>
>> I think the -EINVAL was to ensure an error code that was
>> expected by a write() call would be returned.
> 
> Yeah, the way it's written it's possible in theory to get a positive
> return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
> not needed at all - ceph_extract_encoded_string() deals with short
> buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
> string), neither userspace nor us check against empty strings in
> similar cases (object prefix, snapshot name, etc).
> 
> With the above in mind, how about this?
> 
> From 3ded0a7fee82f2204c58b4fc00fc74f05331514d Mon Sep 17 00:00:00 2001
> From: Ilya Dryomov 
> Date: Thu, 11 Sep 2014 18:49:18 +0400
> Subject: [PATCH] rbd: do not return -ERANGE on auth failures
> 
> Trying to map an image out of a pool for which we don't have an 'x'
> permission bit fails with -ERANGE from ceph_extract_encoded_string()
> due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
> sink, thus propagating rbd::get_id cls method errors.  (I've seen
> a bunch of unexplained -ERANGE reports, I bet this is it).
> 
> Signed-off-by: Ilya Dryomov 

So now we know that the value returned by rbd_dev_image_id()
will be either 0 or a negative errno.  It could still
return something that write(2) isn't defined to return,
but at least it's an error.  That's OK with me...

Reviewed-by: Alex Elder 

> ---
>  drivers/block/rbd.c |4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 4b97baf8afa3..ce457db5d847 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
> ret = image_id ? 0 : -ENOMEM;
> if (!ret)
> rbd_dev->image_format = 1;
> -   } else if (ret > sizeof (__le32)) {
> +   } else if (ret >= 0) {
> void *p = response;
> 
> image_id = ceph_extract_encoded_string(&p, p + ret,
> @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
> ret = PTR_ERR_OR_ZERO(image_id);
> if (!ret)
> rbd_dev->image_format = 2;
> -   } else {
> -   ret = -EINVAL;
> }
> 
> if (!ret) {
> 



Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder
On 09/11/2014 11:27 AM, Ilya Dryomov wrote:
>> I should have asked this before.  Why is a permission error
>> > leading to ceph_extract_encoded_string() finding a short
>> > buffer?  I didn't take the time to trace the error path
>> > you're talking about here all the way back.
> rbd_obj_method_sync() returns -EPERM, which, when compared with the size_t
> from sizeof(), ends up as a big positive value.  ceph_extract_encoded_string()
> is then called and the safe decode macro kicks in.

I somehow got lost along the way and thought rbd_obj_method_sync()
was returning -ERANGE.  The problem starts because rbd_obj_method_sync()
returns an error other than -ENOENT, and that led me astray I think.

Sorry for the confusion.

-Alex



Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Ilya Dryomov
On Thu, Sep 11, 2014 at 8:24 PM, Alex Elder  wrote:
> On 09/11/2014 11:17 AM, Ilya Dryomov wrote:
>> On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder  wrote:
>>> On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
 Trying to map an image out of a pool for which we don't have an 'x'
 permission bit fails with -ERANGE from ceph_extract_encoded_string()
 due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
 sink, thus exposing rbd::get_id cls method return value.  (I've seen
 a bunch of unexplained -ERANGE reports, I bet this is it).

 Signed-off-by: Ilya Dryomov 
>>>
>>> I often think people are annoyed by my explicit type casts
>>> all over the place.  This (missed) one matters a lot.
>>>
>>> I think the -EINVAL was to ensure an error code that was
>>> expected by a write() call would be returned.
>>
>> Yeah, the way it's written it's possible in theory to get a positive
>> return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
>> not needed at all - ceph_extract_encoded_string() deals with short
>> buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
>> string), neither userspace nor us check against empty strings in
>> similar cases (object prefix, snapshot name, etc).
>
> I should have asked this before.  Why is a permission error
> leading to ceph_extract_encoded_string() finding a short
> buffer?  I didn't take the time to trace the error path
> you're talking about here all the way back.

rbd_obj_method_sync() returns -EPERM, which, when compared with the size_t
from sizeof(), ends up as a big positive value.  ceph_extract_encoded_string()
is then called and the safe decode macro kicks in.

Thanks,

Ilya
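
The signed-vs-unsigned trap described above is easy to reproduce in
isolation. The small standalone C++ program below (not the rbd code itself,
and using a stand-in errno constant) shows why a negative return value slips
past a plain "> sizeof(...)" check but not past one with an explicit cast.

// Comparing a signed return value against sizeof() promotes the signed side
// to an unsigned type, so a negative errno becomes a huge positive number.
#include <cstdio>

constexpr int FAKE_EPERM = 1;                     // stand-in for EPERM

int fake_method_call() { return -FAKE_EPERM; }    // simulated auth failure

int main() {
  int ret = fake_method_call();

  if (ret > (int)sizeof(unsigned int))            // cast: branch correctly not taken
    std::puts("with cast: would decode");

  if (ret > sizeof(unsigned int))                 // no cast: -1 wraps to a huge value, taken
    std::puts("without cast: decoding a buffer that only holds an errno");

  return 0;
}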


Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder
On 09/11/2014 11:17 AM, Ilya Dryomov wrote:
> On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder  wrote:
>> On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
>>> Trying to map an image out of a pool for which we don't have an 'x'
>>> permission bit fails with -ERANGE from ceph_extract_encoded_string()
>>> due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
>>> sink, thus exposing rbd::get_id cls method return value.  (I've seen
>>> a bunch of unexplained -ERANGE reports, I bet this is it).
>>>
>>> Signed-off-by: Ilya Dryomov 
>>
>> I often think people are annoyed by my explicit type casts
>> all over the place.  This (missed) one matters a lot.
>>
>> I think the -EINVAL was to ensure an error code that was
>> expected by a write() call would be returned.
> 
> Yeah, the way it's written it's possible in theory to get a positive
> return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
> not needed at all - ceph_extract_encoded_string() deals with short
> buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
> string), neither userspace nor us check against empty strings in
> similar cases (object prefix, snapshot name, etc).

I should have asked this before.  Why is a permission error
leading to ceph_extract_encoded_string() finding a short
buffer?  I didn't take the time to trace the error path
you're talking about here all the way back.

(I'm looking at your new patch in the mean time.)

-Alex


Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Ilya Dryomov
On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder  wrote:
> On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
>> Trying to map an image out of a pool for which we don't have an 'x'
>> permission bit fails with -ERANGE from ceph_extract_encoded_string()
>> due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
>> sink, thus exposing rbd::get_id cls method return value.  (I've seen
>> a bunch of unexplained -ERANGE reports, I bet this is it).
>>
>> Signed-off-by: Ilya Dryomov 
>
> I often think people are annoyed by my explicit type casts
> all over the place.  This (missed) one matters a lot.
>
> I think the -EINVAL was to ensure an error code that was
> expected by a write() call would be returned.

Yeah, the way it's written it's possible in theory to get a positive
return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
not needed at all - ceph_extract_encoded_string() deals with short
buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
string), neither userspace nor us check against empty strings in
similar cases (object prefix, snapshot name, etc).

With the above in mind, how about this?

From 3ded0a7fee82f2204c58b4fc00fc74f05331514d Mon Sep 17 00:00:00 2001
From: Ilya Dryomov 
Date: Thu, 11 Sep 2014 18:49:18 +0400
Subject: [PATCH] rbd: do not return -ERANGE on auth failures

Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors.  (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).

Signed-off-by: Ilya Dryomov 
---
 drivers/block/rbd.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4b97baf8afa3..ce457db5d847 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
ret = image_id ? 0 : -ENOMEM;
if (!ret)
rbd_dev->image_format = 1;
-   } else if (ret > sizeof (__le32)) {
+   } else if (ret >= 0) {
void *p = response;

image_id = ceph_extract_encoded_string(&p, p + ret,
@@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
ret = PTR_ERR_OR_ZERO(image_id);
if (!ret)
rbd_dev->image_format = 2;
-   } else {
-   ret = -EINVAL;
}

if (!ret) {
-- 
1.7.10.4


Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder
On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
> Trying to map an image out of a pool for which we don't have an 'x'
> permission bit fails with -ERANGE from ceph_extract_encoded_string()
> due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
> sink, thus exposing rbd::get_id cls method return value.  (I've seen
> a bunch of unexplained -ERANGE reports, I bet this is it).
> 
> Signed-off-by: Ilya Dryomov 

I often think people are annoyed by my explicit type casts
all over the place.  This (missed) one matters a lot.

I think the -EINVAL was to ensure an error code that was
expected by a write() call would be returned.

In any case, this looks good to me.

Reviewed-by: Alex Elder 

> ---
>  drivers/block/rbd.c |4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 4b97baf8afa3..fe3726c62a37 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
>   ret = image_id ? 0 : -ENOMEM;
>   if (!ret)
>   rbd_dev->image_format = 1;
> - } else if (ret > sizeof (__le32)) {
> + } else if (ret > (int)sizeof(u32)) {
>   void *p = response;
>  
>   image_id = ceph_extract_encoded_string(&p, p + ret,
> @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
>   ret = PTR_ERR_OR_ZERO(image_id);
>   if (!ret)
>   rbd_dev->image_format = 2;
> - } else {
> - ret = -EINVAL;
>   }
>  
>   if (!ret) {
> 



[PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Ilya Dryomov
Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
sink, thus exposing rbd::get_id cls method return value.  (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).

Signed-off-by: Ilya Dryomov 
---
 drivers/block/rbd.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4b97baf8afa3..fe3726c62a37 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
ret = image_id ? 0 : -ENOMEM;
if (!ret)
rbd_dev->image_format = 1;
-   } else if (ret > sizeof (__le32)) {
+   } else if (ret > (int)sizeof(u32)) {
void *p = response;
 
image_id = ceph_extract_encoded_string(&p, p + ret,
@@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
ret = PTR_ERR_OR_ZERO(image_id);
if (!ret)
rbd_dev->image_format = 2;
-   } else {
-   ret = -EINVAL;
}
 
if (!ret) {
-- 
1.7.10.4



RGW threads hung - more logs

2014-09-11 Thread Guang Yang
Hi Sage, Sam and Greg,
With the radosgw hung issue we discussed this today, I finally got some more 
logs showing that the reply message has been received by ragosgw, but failed to 
be dispatched as dispatcher thread was hung. I put all the logs into the 
tracker - http://tracker.ceph.com/issues/9008

While the logs explain what we observed, I failed to find any clue that why the 
dispatcher would need to wait for objecter_bytes throttler budget, did I miss 
anything obvious here?

Tracker link - http://tracker.ceph.com/issues/9008

Thanks,
Guang
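
For context on what waiting for objecter_bytes throttler budget means in
practice, below is a minimal C++ sketch of a byte-budget throttle of the
general kind the Objecter uses: a caller that cannot obtain budget blocks
until completed replies return some. The class is an illustration of the
concept, not Ceph's actual Throttle implementation.

// Minimal byte-budget throttle sketch (illustrative only).
#include <condition_variable>
#include <cstdint>
#include <mutex>

class ByteThrottle {
  std::mutex lock;
  std::condition_variable cond;
  uint64_t max_bytes;
  uint64_t in_flight = 0;

public:
  explicit ByteThrottle(uint64_t max) : max_bytes(max) {}

  // Take budget for an outgoing op; blocks while the budget is exhausted.
  void get(uint64_t bytes) {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [&] { return in_flight + bytes <= max_bytes; });
    in_flight += bytes;
  }

  // Return budget once the reply for an op has been handled.
  void put(uint64_t bytes) {
    std::lock_guard<std::mutex> l(lock);
    in_flight -= bytes;
    cond.notify_all();
  }
};

// If the thread that should eventually call put() is itself stuck in get(),
// the budget is never returned and the dispatcher appears hung.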


Re: [PATCH] libceph: fix a memory leak in handle_watch_notify

2014-09-11 Thread Ilya Dryomov
On Thu, Sep 11, 2014 at 2:50 PM, Alex Elder  wrote:
> On 09/11/2014 03:31 AM, Ilya Dryomov wrote:
>>
>> On Thu, Sep 11, 2014 at 5:41 AM, Alex Elder  wrote:
>>>
>>> On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote:


 From: Li RongQing 

 event_work should be freed when adding it to the queue fails

 Signed-off-by: Li RongQing 
>>>
>>>
>>>
>>> Looks good.
>>>
>>> Reviewed-by: Alex Elder 
>>
>>
>> Hmm, queue_work() returns %false if @work was already on a queue, %true
>> otherwise, so this seems bogus to me.  I'd go with something like this
>> (mangled).
>
>
> The original change was fine.  Whether it matters is another question.
> Your suggestion looks good as well, and on the assumption that, if you
> choose to use it instead, your "real" fix is done correctly, you can
> use "Reviewed-by: " if you like.

Well, the original change makes something bogus even more bogus.  It's
basically:

foo = kmalloc(...);
foo->bar = 0;

if (foo->bar & BAZ) {
/* WARNING */
kfree(foo);
goto ...
}

So yeah, I'm going to use your Reviewed-by on my "real" fix ;)

Thanks,

Ilya


Re: [PATCH] libceph: fix a memory leak in handle_watch_notify

2014-09-11 Thread Alex Elder

On 09/11/2014 03:31 AM, Ilya Dryomov wrote:

On Thu, Sep 11, 2014 at 5:41 AM, Alex Elder  wrote:

On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote:


From: Li RongQing 

event_work should be freed when adding it to the queue fails

Signed-off-by: Li RongQing 



Looks good.

Reviewed-by: Alex Elder 


Hmm, queue_work() returns %false if @work was already on a queue, %true
otherwise, so this seems bogus to me.  I'd go with something like this
(mangled).


The original change was fine.  Whether it matters is another question.
Your suggestion looks good as well, and on the assumption that, if you
choose to use it instead, your "real" fix is done correctly, you can
use "Reviewed-by: " if you like.

-Alex



 From c0711eee447b199b1c2193460fce8c9d958f23f4 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov 
Date: Thu, 11 Sep 2014 12:18:53 +0400
Subject: [PATCH] libceph: don't try checking queue_work() return value

queue_work() doesn't "fail to queue", it returns false if work was
already on a queue, which can't happen here since we allocate
event_work right before we queue it.  So don't bother at all.

Signed-off-by: Ilya Dryomov 
---
  net/ceph/osd_client.c |   15 +--
  1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0f569d322405..952e9c254cc7 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2355,26 +2355,21 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
 if (event) {
 event_work = kmalloc(sizeof(*event_work), GFP_NOIO);
 if (!event_work) {
-   dout("ERROR: could not allocate event_work\n");
-   goto done_err;
+   pr_err("couldn't allocate event_work\n");
+   ceph_osdc_put_event(event);
+   return;
 }
 INIT_WORK(&event_work->work, do_event_work);
 event_work->event = event;
 event_work->ver = ver;
 event_work->notify_id = notify_id;
 event_work->opcode = opcode;
-   if (!queue_work(osdc->notify_wq, &event_work->work)) {
-   dout("WARNING: failed to queue notify event work\n");
-   goto done_err;
-   }
+
+   queue_work(osdc->notify_wq, &event_work->work);
 }

 return;

-done_err:
-   ceph_osdc_put_event(event);
-   return;
-
  bad:
 pr_err("osdc handle_watch_notify corrupt msg\n");
  }





Re: [PATCH] libceph: fix a memory leak in handle_watch_notify

2014-09-11 Thread Ilya Dryomov
On Thu, Sep 11, 2014 at 5:41 AM, Alex Elder  wrote:
> On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote:
>>
>> From: Li RongQing 
>>
>> event_work should be freed when adding it to the queue fails
>>
>> Signed-off-by: Li RongQing 
>
>
> Looks good.
>
> Reviewed-by: Alex Elder 

Hmm, queue_work() returns %false if @work was already on a queue, %true
otherwise, so this seems bogus to me.  I'd go with something like this
(mangled).

From c0711eee447b199b1c2193460fce8c9d958f23f4 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov 
Date: Thu, 11 Sep 2014 12:18:53 +0400
Subject: [PATCH] libceph: don't try checking queue_work() return value

queue_work() doesn't "fail to queue", it returns false if work was
already on a queue, which can't happen here since we allocate
event_work right before we queue it.  So don't bother at all.

Signed-off-by: Ilya Dryomov 
---
 net/ceph/osd_client.c |   15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0f569d322405..952e9c254cc7 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2355,26 +2355,21 @@ static void handle_watch_notify(struct ceph_osd_client *osdc,
if (event) {
event_work = kmalloc(sizeof(*event_work), GFP_NOIO);
if (!event_work) {
-   dout("ERROR: could not allocate event_work\n");
-   goto done_err;
+   pr_err("couldn't allocate event_work\n");
+   ceph_osdc_put_event(event);
+   return;
}
INIT_WORK(&event_work->work, do_event_work);
event_work->event = event;
event_work->ver = ver;
event_work->notify_id = notify_id;
event_work->opcode = opcode;
-   if (!queue_work(osdc->notify_wq, &event_work->work)) {
-   dout("WARNING: failed to queue notify event work\n");
-   goto done_err;
-   }
+
+   queue_work(osdc->notify_wq, &event_work->work);
}

return;

-done_err:
-   ceph_osdc_put_event(event);
-   return;
-
 bad:
pr_err("osdc handle_watch_notify corrupt msg\n");
 }
-- 
1.7.10.4


osd cpu usage is bigger than 100%

2014-09-11 Thread yue longguang
Hi all,
I am testing rbd performance. Right now there is only one VM, which is
using rbd as its disk, and inside it fio is doing r/w.
The big difference is that I set a large iodepth rather than iodepth=1.
According to my tests, the bigger the iodepth, the higher the CPU usage.

Analysing the output of the top command:

1. 12% wa -- does this mean the disk is not fast enough?

2. How can we tell whether Ceph's number of threads is sufficient or not?


What do you think -- which part is using up the CPU? I want to find
the root cause of why a large iodepth leads to high CPU usage.


--- default options ---
  "osd_op_threads": "2",
  "osd_disk_threads": "1",
  "osd_recovery_threads": "1",
  "filestore_op_threads": "2",


thanks

--- top (iodepth=16) ---
top - 15:27:34 up 2 days,  6:03,  2 users,  load average: 0.49, 0.56, 0.62
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 19.0%us,  8.1%sy,  0.0%ni, 59.3%id, 12.1%wa,  0.0%hi,  0.8%si,  0.7%st
Mem:   1922540k total,  1853180k used,    69360k free,     7012k buffers
Swap:  1048568k total,    76796k used,   971772k free,  1034272k cached
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2763 root  20   0 1112m 386m 5028 S 60.8 20.6 200:43.47 ceph-osd

--- top ---
top - 19:50:08 up 1 day, 10:26,  2 users,  load average: 1.55, 0.97, 0.81
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 37.6%us, 14.2%sy,  0.0%ni, 37.0%id,  9.4%wa,  0.0%hi,  1.3%si,  0.5%st
Mem:   1922540k total,  1820196k used,   102344k free,    23100k buffers
Swap:  1048568k total,    91724k used,   956844k free,  1052292k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4312 root  20   0 1100m 337m 5192 S 107.3 18.0  88:33.27 ceph-osd
 1704 root  20   0  514m 272m 3648 S  0.7 14.5   3:27.19 ceph-mon



--- iostat --- (successive samples for device vdd, one interval per row)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vdd               5.50   137.50  247.00  782.00  2896.00   8773.00    11.34     7.08    3.55   0.63  65.05
vdd               9.50   119.00  327.50  458.50  3940.00   4733.50    11.03    12.03   19.66   0.70  55.40
vdd              15.50    10.50  324.00  559.50  3784.00   3398.00     8.13     1.98    2.22   0.81  71.25
vdd               4.50   253.50  273.50  803.00  3056.00  12155.00    14.13     4.70    4.32   0.55  59.55
vdd              10.00     6.00  294.00  488.00  3200.00   2933.50     7.84     1.10    1.49   0.70  54.85
vdd              10.00    14.00  333.00  645.00  3780.00   3846.00     7.80     2.13    2.15   0.90  87.55
vdd              11.00   240.50  259.00  579.00  3144.00  10035.50    15.73     8.51   10.18   0.84  70.20
vdd              10.50    17.00  318.50  707.00  3876.00   4084.50     7.76     1.32    1.30   0.61  62.65
vdd               4.50   208.00  233.50  918.00  2648.00  19214.50    18.99     5.43    4.71   0.55  63.20
vdd               7.00     1.50  306.00  212.00  3376.00   2176.50    10.72     1.03    1.83   0.96  49.70


[PATCH 2/2] ceph: make sure request isn't in any waiting list when kicking request.

2014-09-11 Thread Yan, Zheng
From: "Yan, Zheng" 

We may corrupt the waiting list if a request in the waiting list is kicked.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 267ba44..a17fc49 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2078,6 +2078,7 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
if (req->r_session &&
req->r_session->s_mds == mds) {
dout(" kicking tid %llu\n", req->r_tid);
+   list_del_init(&req->r_wait);
__do_request(mdsc, req);
}
}
-- 
1.9.3



[PATCH 1/2] ceph: protect kick_requests() with mdsc->mutex

2014-09-11 Thread Yan, Zheng
From: "Yan, Zheng" 

Signed-off-by: Yan, Zheng 
---
 fs/ceph/mds_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f751fea..267ba44 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2471,9 +2471,8 @@ static void handle_session(struct ceph_mds_session *session,
if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
pr_info("mds%d reconnect denied\n", session->s_mds);
remove_session_caps(session);
-   wake = 1; /* for good measure */
+   wake = 2; /* for good measure */
wake_up_all(&mdsc->session_close_wq);
-   kick_requests(mdsc, mds);
break;
 
case CEPH_SESSION_STALE:
@@ -2503,6 +2502,8 @@ static void handle_session(struct ceph_mds_session *session,
if (wake) {
mutex_lock(&mdsc->mutex);
__wake_requests(mdsc, &session->s_waiting);
+   if (wake == 2)
+   kick_requests(mdsc, mds);
mutex_unlock(&mdsc->mutex);
}
return;
-- 
1.9.3
