Oh, yeah, we'll definitely test async reads on filestore for
correctness; I'm just worried about validating the performance
assumptions.  The 3700s might be just fine for that validation, though.
-Sam

On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L
<stephen.l.blin...@intel.com> wrote:
> This sounds OK, with a synchronous interface to the ObjectStore still 
> possible based on the return code.
>
> I'd think that the async read interface can be evaluated with any hardware, 
> at least for correctness, by observing the queue depth to the device during a 
> test run.  Also, I think asynchronous reads may benefit various types of NAND 
> SSDs, since they do better with more parallelism, and I typically see a very 
> low queue depth to them today with Filestore (one of the reasons I think 
> doubling up OSDs on a single flash device helps benchmarks).
>
> Thanks,
>
> Stephen
>
>
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> Sent: Thursday, August 27, 2015 4:22 PM
> To: Milosz Tanski
> Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
> Subject: Re: Async reads, sync writes, op thread model discussion
>
> It's been a couple of weeks, so I thought I'd send out a short progress 
> update.  I've started by trying to nail down enough of the threading 
> design/async interface to start refactoring do_op.  For the moment, I've 
> backtracked on the token approach, mostly because it seemed more complicated 
> than necessary.  I'm thinking we'll keep a callback-like mechanism, but move 
> responsibility for queuing and execution back to the interface user by 
> allowing the user to pass a completion queue and an uninterpreted completion 
> pointer.  These two commits have the gist of the direction I'm going in (the 
> actual code is more of a placeholder today).  An OSDReactor instance will 
> replace each of the "shards" in the current sharded work queue.  Any aio 
> initiated by a pg operation from a reactor will pass that reactor's queue, 
> ensuring that the completion winds up back in the same thread.
> Writes would work pretty much the same way, but with two callbacks.
>
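> To make the shape of that concrete, here's a minimal sketch of what
> "pass a completion queue and an uninterpreted completion pointer" could
> look like.  This is illustrative only (CompletionQueue and the member
> names are made up); the commits below have the real placeholders:
>
>   // Hypothetical queue owned by one reactor thread.  The objectstore
>   // pushes the caller's opaque pointer here when an aio finishes, so
>   // the completion runs in the reactor that issued the operation.
>   class CompletionQueue {
>   public:
>     void push(void *completion);  // called from aio completion context
>     void *pop();                  // drained by the owning OSDReactor
>   };
>
>   // The interface user supplies both the queue and the pointer; the
>   // objectstore never interprets the pointer, it only hands it back.
>   int queue_transaction(Transaction *t, OpSequencer *osr,
>                         CompletionQueue *cq, void *completion);
>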
> My plan is to flesh this out to the point where the OSD works again, and then 
> refactor the OSD write path to use this mechanism for basic rbd writes.  That 
> should be enough to let us evaluate whether this is a good path forward for 
> async writes.  Async reads may be a bit tricky to evaluate: it seems like 
> we'd need hardware that benefits from that kind of queue depth, and an 
> objectstore implementation which can exploit it.
> I'll wire up filestore to do async reads optionally for testing purposes, but 
> it's not clear to me that there will be cases where filestore would want to 
> do an async read rather than a sync read.
>
> https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
> https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
> -Sam
>
> On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <mil...@adfin.com> wrote:
>> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenja...@redhat.com> wrote:
>>> Hi,
>>>
>>> I tend to agree with your comments regarding swapcontext/fibers.  I am not 
>>> much more enamored of jumping to new models (new! frameworks!) in a single 
>>> leap, either.
>>
>> I'm not suggesting those particular libraries/frameworks, just bringing
>> up promises as an alternative technique to coroutines.  Dealing with
>> spaghetti evented/callback code gets old after doing it for 10+ years,
>> and then you throw blocking IO on top of it.
>>
>> And FYI, dataflow promises go back in computer science to the '80s.
>>
>> Cheers,
>> - Milosz
>>
>>>
>>> I like where Sam's design seems to be going (as I read it), and in
>>> particular that it seems to allow for consistent handling of read and
>>> write transactions.  I also would like to see how Yehuda's system works
>>> before arguing generalities.
>>>
>>> My intuition is: since the goal is more deterministic performance over a
>>> short horizon, you
>>>
>>> a. need to prioritize transparency over novel abstractions
>>> b. need to build solid microbenchmarks that encapsulate small, then
>>>    larger pieces of the work pipeline
>>>
>>> My .05.
>>>
>>> Matt
>>>
>>> --
>>> Matt Benjamin
>>> Red Hat, Inc.
>>> 315 West Huron Street, Suite 140A
>>> Ann Arbor, Michigan 48103
>>>
>>> http://www.redhat.com/en/technologies/storage
>>>
>>> tel.  734-761-4689
>>> fax.  734-769-8938
>>> cel.  734-216-5309
>>>
>>> ----- Original Message -----
>>>> From: "Milosz Tanski" <mil...@adfin.com>
>>>> To: "Haomai Wang" <haomaiw...@gmail.com>
>>>> Cc: "Yehuda Sadeh-Weinraub" <ysade...@redhat.com>, "Samuel Just"
>>>> <sj...@redhat.com>, "Sage Weil" <s...@newdream.net>,
>>>> ceph-devel@vger.kernel.org
>>>> Sent: Friday, August 14, 2015 4:56:26 PM
>>>> Subject: Re: Async reads, sync writes, op thread model discussion
>>>>
>>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
>>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>>>> > <ysade...@redhat.com> wrote:
>>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>>>> >> completeness.  I did some infrastructure work for rgw and it seems
>>>> >> (at least to me) that it could be at least partially useful here.
>>>> >> Basically it's an async execution framework that utilizes coroutines.
>>>> >> It's composed of an aio notification manager that can also be tied
>>>> >> into coroutine execution.  The coroutines themselves are
>>>> >> stackless; they are implemented as state machines, but use some
>>>> >> boost trickery to hide the details so they can be written very
>>>> >> similarly to blocking methods.  Coroutines can also execute other
>>>> >> coroutines and can be stacked, or can generate concurrent
>>>> >> execution.  It's still somewhat in flux, but I think it's mostly
>>>> >> done and already useful at this point, so if there's anything you
>>>> >> could use, it might be a good idea to avoid effort duplication.
>>>> >>
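>>>> >> As a rough illustration of the "stackless coroutine as state machine"
>>>> >> idea (a sketch using boost::asio's coroutine helper, not the actual
>>>> >> rgw classes; the start_async_* helpers are made up):
>>>> >>
>>>> >>   #include <boost/asio/coroutine.hpp>
>>>> >>   #include <boost/asio/yield.hpp>
>>>> >>
>>>> >>   struct ReadThenStat;
>>>> >>   // Hypothetical helpers: kick off an aio and re-invoke the
>>>> >>   // coroutine with the result once it completes.
>>>> >>   void start_async_read(ReadThenStat *self);
>>>> >>   void start_async_stat(ReadThenStat *self);
>>>> >>
>>>> >>   // The reenter/yield macros expand to a switch-based state machine,
>>>> >>   // so the body reads almost like blocking code even though each
>>>> >>   // yield returns control to the caller until the aio completes.
>>>> >>   struct ReadThenStat : boost::asio::coroutine {
>>>> >>     void operator()(int ret = 0) {
>>>> >>       reenter (this) {
>>>> >>         yield start_async_read(this);
>>>> >>         if (ret < 0) return;          // read failed, bail out
>>>> >>         yield start_async_stat(this);
>>>> >>         // both operations done; finish up here
>>>> >>       }
>>>> >>     }
>>>> >>   };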
>>>> >
>>>> > Coroutines like qemu's are cool.  The only thing I'm afraid of is the
>>>> > complexity of debugging, and it's really a big task :-(
>>>> >
>>>> > I agree with Sage that this design is really a new implementation of
>>>> > objectstore, so it's hard on the existing objectstore implementations.
>>>> > I also suffer the pain of synchronous xattr reads; maybe we could add
>>>> > an async read interface to solve this?
>>>> >
>>>> > As for the context switch thing: we now have at least 3 context
>>>> > switches per op on the OSD side (messenger -> op queue -> objectstore
>>>> > queue).  I guess op queue -> objectstore is the easier one to kick
>>>> > off, just as Sam said.  We could make the journal write inline with
>>>> > queue_transaction, so the caller could handle the transaction directly.
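>>>> >
>>>> > In terms of the calling convention, something like this (a sketch,
>>>> > assuming a queue_transaction that can report synchronous completion):
>>>> >
>>>> >   // Hypothetical: 0 means the journal write completed inline and
>>>> >   // this thread can continue immediately; -EAGAIN means the
>>>> >   // completion will be delivered asynchronously as today.
>>>> >   int r = store->queue_transaction(&t, osr, token);
>>>> >   if (r == 0) {
>>>> >     // journal write finished synchronously; no context switch
>>>> >   }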
>>>>
>>>> I would caution against coroutines (fibers), especially in a
>>>> multi-threaded environment.  POSIX officially obsoleted the swapcontext
>>>> family of functions in 1003.1-2004 and removed it in 1003.1-2008,
>>>> because they were notoriously non-portable and buggy.  And yes, you can
>>>> use something like boost::context / boost::coroutine instead, but they
>>>> also have platform limitations.  These implementations tend to abuse or
>>>> turn off various platform sanity checks (like the ones for
>>>> setjmp/longjmp), and on top of that many platforms don't account for
>>>> alternative contexts, so you end up with obscure bugs.  I've debugged my
>>>> fair share of bugs in Mordor coroutines involving C++ exceptions and
>>>> errno (errno is really a function call on linux, and since the function
>>>> returning the pointer to the thread's errno is marked pure, a coroutine
>>>> that migrates threads can end up reading a stale errno).  And you need
>>>> to migrate them, because of blocking and uneven processor/thread
>>>> distribution.
>>>>
>>>> None of these are obstacles that can't be solved, but added together
>>>> they become a pretty long-term liability, so think long and hard
>>>> about it.  Qemu doesn't have some of those issues because it uses a
>>>> single thread and a much simpler C ABI.
>>>>
>>>> An alternative to coroutines that goes a long way towards solving
>>>> the callback-spaghetti problem is futures/promises.  I'm not talking
>>>> about the bare future model in the C++11 standard library, but more
>>>> along the lines of what exists in other languages (like what's being
>>>> done in Javascript today).  There's a good implementation in Folly
>>>> (the facebook c++11 library), with a very nice piece of documentation
>>>> explaining how they work and how they differ.
>>>>
>>>> That future model is very handy when dealing with the callback
>>>> control-flow problem.  You can chain a bunch of processing steps, each
>>>> of which requires some async action, by returning a future at every
>>>> step, and so on and so forth.
>>>> Also, it makes handling complex error cases easy by giving you a way
>>>> to skip lots of processing steps straight to an onError handler at the
>>>> end of the chain.
>>>>
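>>>> A rough sketch of what that chaining looks like (folly-style; Buf,
>>>> Header, Result and the step functions are made up for illustration):
>>>>
>>>>   #include <folly/futures/Future.h>
>>>>
>>>>   // Each step returns a future, so the async hops chain without
>>>>   // nested callbacks; a throw anywhere along the chain skips
>>>>   // straight to the onError handler at the end.
>>>>   folly::Future<Result> handle_op(Op op) {
>>>>     return read_object(op)                  // async -> Future<Buf>
>>>>       .then([] (Buf b)    { return parse_header(std::move(b)); })
>>>>       .then([] (Header h) { return process(std::move(h)); })
>>>>       .onError([] (const std::exception& e) {
>>>>         return Result::error(e.what());     // one place for failures
>>>>       });
>>>>   }
>>>>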
>>>> Take a look at folly.  Take a look at the expanded boost futures
>>>> (they call this feature future continuations:
>>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
>>>> ).  Also, building a cut-down future framework just for Ceph (or a
>>>> reduced set of the folly one) might be another option.
>>>>
>>>> Just an alternative.
>>>>
>>>> >
>>>> > Anyway, I think we need to do some changes for this field.
>>>> >
>>>> >> Yehuda
>>>> >>
>>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sj...@redhat.com> wrote:
>>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all
>>>> >>> tied to the actual interface I presented so much as the notion
>>>> >>> that the next thing to do is restructure the OpWQ users as async
>>>> >>> state machines.
>>>> >>> -Sam
>>>> >>>
>>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <s...@newdream.net> wrote:
>>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops onto
>>>> >>>>> threads:
>>>> >>>>>
>>>> >>>>> 1. Reads are always synchronous, limiting the queue depth seen by
>>>> >>>>>    the device and therefore the possible parallelism.
>>>> >>>>> 2. Writes are always asynchronous, forcing even very fast writes to
>>>> >>>>>    be completed in a separate thread.
>>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation,
>>>> >>>>>    forcing reads required to continue the operation to be
>>>> >>>>>    synchronous.
>>>> >>>>>
>>>> >>>>> For spinning disks, this is mostly ok since they don't benefit
>>>> >>>>> as much from large read queues, and writes (filestore with
>>>> >>>>> journal) are too slow for the thread switches to make a big
>>>> >>>>> difference.  For very fast flash, however, we want the
>>>> >>>>> flexibility to allow the backend to perform writes
>>>> >>>>> synchronously or asynchronously when it makes sense, and to
>>>> >>>>> maintain a larger number of outstanding reads than we have
>>>> >>>>> threads.  To that end, I suggest changing the ObjectStore
>>>> >>>>> interface to be somewhat polling-based:
>>>> >>>>>
>>>> >>>>> /// Create a new token
>>>> >>>>> void *create_operation_token() = 0;
>>>> >>>>> bool is_operation_complete(void *token) = 0;
>>>> >>>>> bool is_operation_committed(void *token) = 0;
>>>> >>>>> bool is_operation_applied(void *token) = 0;
>>>> >>>>> void wait_for_committed(void *token) = 0;
>>>> >>>>> void wait_for_applied(void *token) = 0;
>>>> >>>>> void wait_for_complete(void *token) = 0;
>>>> >>>>> /// Get result of operation
>>>> >>>>> int get_result(void *token) = 0;
>>>> >>>>> /// Must only be called once is_operation_complete(token) is true
>>>> >>>>> void reset_operation_token(void *token) = 0;
>>>> >>>>> /// Must only be called once is_operation_complete(token) is true
>>>> >>>>> void destroy_operation_token(void *token) = 0;
>>>> >>>>>
>>>> >>>>> /**
>>>> >>>>>  * Queue a transaction
>>>> >>>>>  *
>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>> >>>>>  * If the operation completed synchronously, the token can be reused
>>>> >>>>>  * without calling reset_operation_token.
>>>> >>>>>  *
>>>> >>>>>  * @result 0 if completed synchronously, -EAGAIN if async
>>>> >>>>>  */
>>>> >>>>> int queue_transaction(
>>>> >>>>>   Transaction *t,
>>>> >>>>>   OpSequencer *osr,
>>>> >>>>>   void *token
>>>> >>>>>   ) = 0;
>>>> >>>>>
>>>> >>>>> /**
>>>> >>>>>  * Perform a read
>>>> >>>>>  *
>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>> >>>>>  * If the operation completed synchronously, the token can be reused
>>>> >>>>>  * without calling reset_operation_token.
>>>> >>>>>  *
>>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>> >>>>>  */
>>>> >>>>> int read(..., void *token) = 0;
>>>> >>>>> ...
>>>> >>>>>
>>>> >>>>> The "token" concept here is opaque to allow the implementation
>>>> >>>>> some flexibility.  Ideally, it would be nice to be able to
>>>> >>>>> include libaio operation contexts directly.
>>>> >>>>>
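>>>> >>>>> As an illustration of that idea, a hypothetical token layout for a
>>>> >>>>> libaio-based backend might look like this (not part of the proposed
>>>> >>>>> interface, just a sketch):
>>>> >>>>>
>>>> >>>>>   #include <libaio.h>
>>>> >>>>>   #include <atomic>
>>>> >>>>>
>>>> >>>>>   // The iocb submitted via io_submit() lives inside the token, so
>>>> >>>>>   // no extra allocation or lookup is needed when the completion
>>>> >>>>>   // event is reaped.
>>>> >>>>>   struct AioToken {
>>>> >>>>>     struct iocb cb;                     // libaio control block
>>>> >>>>>     std::atomic<bool> complete{false};  // set when event arrives
>>>> >>>>>     int result = 0;                     // filled from the io_event
>>>> >>>>>   };
>>>> >>>>>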
>>>> >>>>> The main goal here is for the backend to have the freedom to
>>>> >>>>> complete writes and reads asynchronously or synchronously as the
>>>> >>>>> situation warrants.
>>>> >>>>> It also leaves the interface user in control of where the
>>>> >>>>> operation completion is handled.  Each op thread can therefore
>>>> >>>>> handle its own completions:
>>>> >>>>>
>>>> >>>>> struct InProgressOp {
>>>> >>>>>   PGRef pg;
>>>> >>>>>   void *token;  // from create_operation_token()
>>>> >>>>>   OpContext *ctx;
>>>> >>>>> };
>>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>> >>>>
>>>> >>>> Probably a deque<> since we'll be pushing new requests and
>>>> >>>> slurping off completed ones?  Or, we can make the token not
>>>> >>>> completely opaque, so that it includes a boost::intrusive::list
>>>> >>>> node and can be strung on a user-managed queue.
>>>> >>>>
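>>>> >>>> Something along these lines, say (a sketch of the non-opaque
>>>> >>>> variant; the names are made up):
>>>> >>>>
>>>> >>>>   #include <boost/intrusive/list.hpp>
>>>> >>>>
>>>> >>>>   // The list hook lives inside the token itself, so stringing
>>>> >>>>   // tokens on a user-managed queue needs no extra allocation.
>>>> >>>>   struct OpToken : boost::intrusive::list_base_hook<> {
>>>> >>>>     // implementation-private state goes here
>>>> >>>>   };
>>>> >>>>   using TokenQueue = boost::intrusive::list<OpToken>;
>>>> >>>>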
>>>> >>>>> for (auto &op : in_progress) {
>>>> >>>>>   op.token = objectstore->create_operation_token();
>>>> >>>>> }
>>>> >>>>>
>>>> >>>>> uint64_t next_to_start = 0;
>>>> >>>>> uint64_t next_to_complete = 0;
>>>> >>>>>
>>>> >>>>> while (1) {
>>>> >>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {
>>>> >>>>>     // every slot is in flight; block on the oldest outstanding op
>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>> >>>>>     objectstore->wait_for_complete(op.token);
>>>> >>>>>   }
>>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>> >>>>>       PGRef pg = op.pg;
>>>> >>>>>       OpContext *ctx = op.ctx;
>>>> >>>>>       op.pg = PGRef();
>>>> >>>>>       op.ctx = nullptr;
>>>> >>>>>       objectstore->reset_operation_token(op.token);
>>>> >>>>>       if (pg->continue_op(
>>>> >>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
>>>> >>>>>               == -EAGAIN) {
>>>> >>>>>         ++next_to_start;
>>>> >>>>>         continue;
>>>> >>>>>       }
>>>> >>>>>     } else {
>>>> >>>>>       break;
>>>> >>>>>     }
>>>> >>>>>   }
>>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>> >>>>>   if (dq.second->do_op(
>>>> >>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
>>>> >>>>>           == -EAGAIN) {
>>>> >>>>>     ++next_to_start;
>>>> >>>>>   }
>>>> >>>>> }
>>>> >>>>>
>>>> >>>>> A design like this would allow the op thread to move on to
>>>> >>>>> another task if the objectstore implementation wants to
>>>> >>>>> perform an async operation.  For this to work, there is some
>>>> >>>>> work to be done:
>>>> >>>>>
>>>> >>>>> 1. All current reads in the read and write paths (probably including
>>>> >>>>>    the attr reads in get_object_context and friends) need to be able
>>>> >>>>>    to handle getting -EAGAIN from the objectstore.
>>>> >>>>
>>>> >>>> Can we leave the old read methods in place as blocking
>>>> >>>> versions, and have them block on the token before returning?
>>>> >>>> That'll make the transition less painful.
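>>>> >>>>
>>>> >>>> Something like this, perhaps (a sketch of such a wrapper; read_sync
>>>> >>>> is a made-up name, and the "..." mirrors the elided arguments in
>>>> >>>> the proposal above):
>>>> >>>>
>>>> >>>>   // If the backend went async, just block on the token so existing
>>>> >>>>   // callers keep the old synchronous semantics.
>>>> >>>>   int read_sync(..., void *token) {
>>>> >>>>     int r = read(..., token);
>>>> >>>>     if (r == -EAGAIN) {
>>>> >>>>       wait_for_complete(token);
>>>> >>>>       r = get_result(token);
>>>> >>>>     }
>>>> >>>>     return r;
>>>> >>>>   }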
>>>> >>>>
>>>> >>>>> 2. Writes and reads need to be able to handle having the pg lock
>>>> >>>>>    dropped during the operation.  This should be ok since the actual
>>>> >>>>>    object information is protected by the RWState locks.
>>>> >>>>
>>>> >>>> All of the async write pieces already handle this (recheck PG
>>>> >>>> state after taking the lock).  If they don't get -EAGAIN they'd
>>>> >>>> just call the next stage, probably with a flag indicating that
>>>> >>>> validation can be skipped (since the lock hasn't been dropped)?
>>>> >>>>
>>>> >>>>> 3. OpContext needs to have enough information to pick up where the
>>>> >>>>>    operation left off.  This suggests that we should obtain all
>>>> >>>>>    required ObjectContexts at the beginning of the operation.
>>>> >>>>>    Cache/Tiering complicates this.
>>>> >>>>
>>>> >>>> Yeah...
>>>> >>>>
>>>> >>>>> 4. The object class interface will need to be replaced with a new
>>>> >>>>>    interface based on possibly-async reads.  We can maintain
>>>> >>>>>    compatibility with the current ones by launching a new thread to
>>>> >>>>>    handle any message which happens to contain an old-style object
>>>> >>>>>    class operation.
>>>> >>>>
>>>> >>>> Again, for now, wrappers would avoid this?
>>>> >>>>
>>>> >>>> s
>>>> >>>>>
>>>> >>>>> Most of this needs to happen to support object class
>>>> >>>>> operations on ec pools anyway.
>>>> >>>>> -Sam
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best Regards,
>>>> >
>>>> > Wheat
>>>>
>>>>
>>>>
>>>> --
>>>> Milosz Tanski
>>>> CTO
>>>> 16 East 34th Street, 15th floor
>>>> New York, NY 10016
>>>>
>>>> p: 646-253-9055
>>>> e: mil...@adfin.com
>>>>
>>
>>
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: mil...@adfin.com