I think the O(n) scan is fine for now. Later we can add an index pointing
to the most recent log entry for each object, plus embedded pointers in
the log entries, which would let us walk backwards through the entries
for an object.
-Sam
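The index-plus-backpointer idea above could be sketched roughly as follows. This is an illustrative model only; the type and field names (`ObjLogEntry`, `ObjectLogIndex`, `prev_for_oid`) are hypothetical, not Ceph's actual structures:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: each log entry embeds the index of the previous
// entry for the same object, and a side map points at the most recent
// entry per object, so a per-object walk is O(entries for that object).
struct ObjLogEntry {
  std::string oid;        // object this entry touches
  int reqid;              // request id recorded in the entry
  int prev_for_oid = -1;  // previous entry for same object, -1 if none
};

struct ObjectLogIndex {
  std::vector<ObjLogEntry> log;
  std::map<std::string, int> latest;  // oid -> index of most recent entry

  void append(const std::string& oid, int reqid) {
    ObjLogEntry e{oid, reqid, -1};
    auto it = latest.find(oid);
    if (it != latest.end())
      e.prev_for_oid = it->second;  // embed pointer to the prior entry
    log.push_back(e);
    latest[oid] = (int)log.size() - 1;
  }

  // Walk backwards through this object's entries looking for a dup reqid.
  bool is_dup(const std::string& oid, int reqid) const {
    auto it = latest.find(oid);
    for (int i = (it == latest.end() ? -1 : it->second); i >= 0;
         i = log[i].prev_for_oid)
      if (log[i].reqid == reqid)
        return true;
    return false;
  }
};
```

The walk only visits entries for the object in question, so it avoids scanning the whole log while still not requiring a full per-reqid index.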

On Tue, Jan 27, 2015 at 6:30 AM, Sage Weil <sw...@redhat.com> wrote:
> On Tue, 27 Jan 2015, Wang, Zhiqiang wrote:
>> Not sure if it is correct, but below is my understanding. Correct me if
>> I'm wrong.
>>
>> Yes, in the current code a hash lookup on each IO is used to check for a
>> dup op. But when adding extra_reqids as a log entry in the pg_log, 2 hash
>> lookups may not be sufficient. The extra_reqids are organized by object
>> id, as in your code. We may have other ops on an object that was just
>> promoted. If a dup op comes in after these ops, then we have to search
>> the pg_log for the log entries of this object.
>
> Not sure I follow... we have a second hash_map for that?
>
>         
> https://github.com/liewegas/ceph/commit/fd66f515953e1af5af107cae2cc9bb01d1fab011#diff-d0421fbca7045e2a15a4ff980f3eb00cR60
>
> Perhaps that should be an unordered_multimap in case a reqid shows up in
> two entries (e.g., object is modified, promoted, then flushed).  But
> either way it will be a single lookup to determine if it's a dup...?
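The multimap point above can be sketched briefly. This is not Ceph's actual index type, just an illustration of why an `unordered_multimap` keeps the dup check a single hash lookup even when one reqid appears in several entries:

```cpp
#include <string>
#include <unordered_map>

// Illustrative sketch: one reqid may map to several log entries (e.g. the
// original write plus a later promote/flush entry), but checking for a dup
// is still a single hash lookup regardless of how many entries match.
using ReqidIndex = std::unordered_multimap<std::string, int>;  // reqid -> entry version

inline bool is_dup(const ReqidIndex& idx, const std::string& reqid) {
  return idx.find(reqid) != idx.end();  // one lookup, any multiplicity
}
```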
>
> sage
>
>
>>
>> -----Original Message-----
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Tuesday, January 27, 2015 1:21 AM
>> To: Wang, Zhiqiang
>> Cc: ceph-devel@vger.kernel.org; Gregory Farnum
>> Subject: RE: idempotent op (esp delete)
>>
>> On Mon, 26 Jan 2015, Wang, Zhiqiang wrote:
>> > The downside of this approach is that we may need to search the pg_log
>> > for a specific object in every write io?
>>
>> Not quite.  IndexedLog maintains a hash_map of all of the request ids in the 
>> log, so it's just a hash lookup on each IO.  (Well, now 2 hash lookups, 
>> because I put the additional request IDs in a second auxiliary map to 
>> handle dups properly.  I think we can avoid that lookup if we use the 
>> request flags carefully, though... the RETRY and REDIRECTED flags, I think?  
>> Need to check carefully.)
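The two-lookup scheme and the flag-gated optimization floated above might look something like this sketch. The structure and the flag constants are hypothetical (the real flags would be the RETRY/REDIRECTED bits mentioned, whose exact semantics would need the careful check Sage notes):

```cpp
#include <string>
#include <unordered_set>

// Hypothetical sketch of the two-map dup check: a primary set of reqids
// indexed from ordinary log entries, plus an auxiliary set for the extra
// reqids recorded on promote/flush. The optimization is to consult the
// auxiliary set only for retried/redirected requests, keeping the common
// path at a single hash lookup.
enum OpFlags { FLAG_RETRY = 1, FLAG_REDIRECTED = 2 };

struct DupIndex {
  std::unordered_set<std::string> primary;  // reqids from log entries
  std::unordered_set<std::string> extra;    // auxiliary reqids (promote/flush)

  bool is_dup(const std::string& reqid, int flags) const {
    if (primary.count(reqid))
      return true;
    // Only a retry or redirected op can be a dup that is recorded solely
    // in the auxiliary map, so skip the second lookup on the fast path.
    if (flags & (FLAG_RETRY | FLAG_REDIRECTED))
      return extra.count(reqid) > 0;
    return false;
  }
};
```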
>>
>> > Maybe we can combine this
>> > approach with the changes in PR 3447. For the flush case, when the
>> > object is deleted in the base tier, we search the pg_log for the dup
>> > op. This should be a rare case. Otherwise, when the object exists, we
>> > check the reqid list in the object_info_t for the dup op.
>>
>> We could do a hybrid approach, but there is some cost to the per-object
>> tracking: a tiny bit more memory, and an O(n) search of the items in that 
>> list (~10 or 20?) for the dup check.  I suspect the hash lookup is cheaper?  
>> And simpler.
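The per-object cost being weighed above is concretely an O(n) linear scan over a short bounded list. A minimal sketch, with illustrative names (not Ceph's actual `object_info_t` fields):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the per-object tracking cost: a small bounded list of recent
// reqids on each object, checked with a linear scan of ~10-20 entries.
struct ObjectReqids {
  static constexpr size_t kMax = 20;
  std::vector<std::string> reqids;  // most recent last

  void record(const std::string& reqid) {
    reqids.push_back(reqid);
    if (reqids.size() > kMax)       // keep only the last kMax entries
      reqids.erase(reqids.begin());
  }

  bool is_dup(const std::string& reqid) const {
    return std::find(reqids.begin(), reqids.end(), reqid) != reqids.end();
  }
};
```

For n around 20 the scan is cheap in absolute terms, but a hash lookup into a single log-wide index avoids both the per-object memory and the scan, which is the trade-off Sage is pointing at.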
>>
>> sage
>>
>>
>> >
>> > -----Original Message-----
>> > From: ceph-devel-ow...@vger.kernel.org
>> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
>> > Sent: Monday, January 26, 2015 10:35 AM
>> > To: Sage Weil; Gregory Farnum
>> > Cc: ceph-devel@vger.kernel.org
>> > Subject: RE: idempotent op (esp delete)
>> >
>> > This method puts the reqid list in the pg_log instead of the 
>> > object_info_t, so that it's preserved even in the delete case, which 
>> > sounds more reasonable.
>> >
>> > -----Original Message-----
>> > From: ceph-devel-ow...@vger.kernel.org
>> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>> > Sent: Saturday, January 24, 2015 6:19 AM
>> > To: Gregory Farnum
>> > Cc: ceph-devel@vger.kernel.org
>> > Subject: Re: idempotent op (esp delete)
>> >
>> > On Fri, 23 Jan 2015, Gregory Farnum wrote:
>> > > On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil <sw...@redhat.com> wrote:
>> > > > Background:
>> > > >
>> > > > 1) Way back when we made a task that would thrash the cache modes
>> > > > by adding and removing the cache tier while ceph_test_rados was 
>> > > > running.
>> > > > This mostly worked, but would occasionally fail because we would
>> > > >
>> > > >  - delete an object from the cache tier
>> > > >  - a network failure injection would lose the reply
>> > > >  - we'd disable the cache
>> > > >  - the delete would resend to the base tier, not get recognized as
>> > > > a dup (different pool, different pg log)
>> > > >    -> -ENOENT instead of 0
>> > > >
>> > > > 2) The proxy write code hits a similar problem:
>> > > >
>> > > >  - delete gets proxied
>> > > >  - we initiate async promote
>> > > >  - a network failure injection loses the delete reply
>> > > >  - delete resends and blocks on promote (or arrives after it
>> > > > finishes)
>> > > >  - promote finishes
>> > > >  - delete is handled
>> > > >   -> -ENOENT instead of 0
>> > > >
>> > > > The ticket is http://tracker.ceph.com/issues/8935
>> > > >
>> > > > The problem is partially addressed by
>> > > >
>> > > >         https://github.com/ceph/ceph/pull/3447
>> > > >
>> > > > by logging a few request ids on every object_info_t and preserving
>> > > > that on promote and flush.
>> > > >
>> > > > However, it doesn't solve the problem for delete because we throw
>> > > > out object_info_t so that reqid_t is lost.
>> > > >
>> > > > I think we have two options, not necessarily mutually exclusive:
>> > > >
>> > > > 1) When promoting an object that doesn't exist (to create a
>> > > > whiteout), pull reqids out of the base tier's pg log so that the
>> > > > whiteout is primed with request ids.
>> > > >
>> > > > 1.5) When flushing... well, that is harder because we have nowhere
>> > > > to put the reqids.  Unless we make a way to cram a list of reqid's
>> > > > into a single PG log entry...?  In that case, we wouldn't strictly
>> > > > need the per-object list since we could pile the base tier's
>> > > > reqids into the promote log entry in the cache tier.
>> > > >
>> > > > 2) Make delete idempotent (0 instead of ENOENT if the object
>> > > > doesn't exist).  This will require a delicate compat transition
>> > > > (let's ignore that a moment) but you can preserve the old behavior
>> > > > for callers that care by preceding the delete with an assert_exists op.
>> > > > Most callers don't care, but a handful do.  This simplifies the
>> > > > semantics we need to support going forward.
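Option 2 can be sketched as follows. This is a toy model, not librados; the types and the `assert_exists`-style strict variant are illustrative of the compat idea in the text:

```cpp
#include <set>
#include <string>

// Sketch of option 2: delete returns 0 whether or not the object exists.
// Callers that still need the old ENOENT behavior effectively precede the
// delete with an assert_exists check, modeled here as remove_strict().
const int ENOENT_ERR = -2;

struct Pool {
  std::set<std::string> objects;

  // Idempotent delete: a resend after a lost reply still returns 0.
  int remove_idempotent(const std::string& oid) {
    objects.erase(oid);  // erasing a missing key is a no-op
    return 0;
  }

  // Old-style semantics for the handful of callers that care.
  int remove_strict(const std::string& oid) {
    if (!objects.count(oid))
      return ENOENT_ERR;  // assert_exists failed
    return remove_idempotent(oid);
  }
};
```

Under these semantics the resent delete in both failure scenarios above returns 0 regardless of which tier handles it, which is why the compat transition is the hard part rather than the mechanism itself.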
>> > > >
>> > > > Of course, it's all a bit delicate.  The idempotent op semantics
>> > > > have a time horizon so it's all a bit wishy-washy... :/
>> > > >
>> > > > Thoughts?
>> > >
>> > > Do we have other cases that we're worried about which would be
>> > > improved by maintaining reqids across pool cache transitions? I'm
>> > > not a big fan of maintaining those per-op lists (they sound really
>> > > expensive?), but if we need them for something else that's a point
>> > > in their favor.
>> >
>> > I don't think they're *too* expensive (say, a vector of 20 per 
>> > object_info_t?).  But the only thing I can think of beyond the cache 
>> > tiering stuff would be cases where the pg log isn't long enough for a very 
>> > laggy client.  In general, ops will be distributed across objects, so we 
>> > will catch the dup from another angle.
>> >
>> > However.. I just hacked up a patch that lets us cram lots of reqids into a 
>> > single pg_log_entry_t and I think that may be a simpler solution.  We can 
>> > cram all reqids (for the last N of them) for promote and flush into the 
>> > single log entry and the delete is no longer special.. it'd work equally 
>> > well for other dups and for class methods that do who knows what.  The 
>> > patch is here:
>> >
>> >     https://github.com/liewegas/ceph/commit/wip-pg-reqids
>> >
>> > What do you think?
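A rough sketch of the shape of that idea, without claiming to match the actual patch: a single log entry (for a promote or flush) carries a list of extra reqids, and the dup index absorbs all of them at once. Field and type names here are illustrative, not the real `pg_log_entry_t`:

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative model of cramming many reqids into one log entry: the
// entry has its own reqid plus a list of extra (reqid, return code)
// pairs carried over from the base tier on promote/flush.
struct LogEntry {
  std::string reqid;                                       // entry's own reqid
  std::vector<std::pair<std::string, int>> extra_reqids;   // carried-over reqids
};

// Collect every reqid a new entry contributes to the dup index.
inline std::vector<std::string> indexed_reqids(const LogEntry& e) {
  std::vector<std::string> out{e.reqid};
  for (const auto& p : e.extra_reqids)
    out.push_back(p.first);
  return out;
}
```

The appeal is that the delete stops being a special case: whatever reqids the base tier knew about ride along in the promote/flush entry and are indexed like any others.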
>> >
>> > > We could make delete idempotent instead and that's what I initially
>> > > favor, but it also seems a bit scary (it's not like our operations
>> > > can be made idempotent; lots of them invoke classes that will differ
>> > > or
>> > > whatever!) and I can't think of which callers might care so I'm
>> > > having trouble formulating the bounds of this solution.
>> >
>> > Yeah, it seems like an easier endpoint but a dangerous path to get there...
>> >
>> > sage
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> > in the body of a message to majord...@vger.kernel.org More majordomo
>> > info at  http://vger.kernel.org/majordomo-info.html