I think the O(n) scan is fine for now, and we can add an index to the most recent entry for each object + embedded pointers in the log entries allowing us to walk backwards through the entries for an object. -Sam
On Tue, Jan 27, 2015 at 6:30 AM, Sage Weil <sw...@redhat.com> wrote: > On Tue, 27 Jan 2015, Wang, Zhiqiang wrote: >> Not sure if it is correct, but below is my understanding. Correct me if >> I'm wrong. >> >> Yes, in the current code, a hash lookup on each IO is used to check dup >> op. But when adding extra_reqids as a log entry in the pg_log, 2 hash >> lookups may be not sufficient. The extra_reqids is organized by object >> id as in your code. We may have other ops on an object just promoted. If >> a dup op comes in after these ops, then we have to search for log >> entries of this object in pg_log. > > Not sure I follow... we have a second hash_map for that? > > > https://github.com/liewegas/ceph/commit/fd66f515953e1af5af107cae2cc9bb01d1fab011#diff-d0421fbca7045e2a15a4ff980f3eb00cR60 > > Perhaps that should be a unordered_multimap in case a reqid shows up in > two entries (e.g., object is modified, promoted, then flushed). But > either way it will be a single lookup to determine if it's a dup...? > > sage > > >> >> -----Original Message----- >> From: ceph-devel-ow...@vger.kernel.org >> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil >> Sent: Tuesday, January 27, 2015 1:21 AM >> To: Wang, Zhiqiang >> Cc: ceph-devel@vger.kernel.org; Gregory Farnum >> Subject: RE: idempotent op (esp delete) >> >> On Mon, 26 Jan 2015, Wang, Zhiqiang wrote: >> > The downside of this approach is that we may need to search the pg_log >> > for a specific object in every write io? >> >> Not quite. IndexedLog maintains a hash_map of all of the request ids in the >> log, so it's just a hash lookup on each IO. (Well, now 2 hash lookups, >> because I put the additional request IDs in a second auxilliary map to >> handle dups properly. I think we can avoid that lookup if we use the >> request flags carefully, though.. the RETRY and REDIRECTED flags I think? >> Need to check carefully.) >> >> > Maybe we can combine this >> > approach and the changes in PR 3447. For the flush case when the >> > object is deleted in the base, we search the pg_log for dup op. This >> > should be rare cases. Otherwise the object exists, we check the reqid >> > list in the object_info_t for dup op. >> >> We could do a hybrid approach, but there is some cost to the per-object >> tracking: a tiny bit more memory, and an O(n) search of the items in that >> list (~10 or 20?) for the dup check. I suspect the hash lookup is cheaper? >> And simpler. >> >> sage >> >> >> > >> > -----Original Message----- >> > From: ceph-devel-ow...@vger.kernel.org >> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang >> > Sent: Monday, January 26, 2015 10:35 AM >> > To: Sage Weil; Gregory Farnum >> > Cc: ceph-devel@vger.kernel.org >> > Subject: RE: idempotent op (esp delete) >> > >> > This method puts the reqid list in the pg_log instead of the >> > object_info_t, so that it's preserved even in the delete case, which >> > sounds more reasonable. >> > >> > -----Original Message----- >> > From: ceph-devel-ow...@vger.kernel.org >> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil >> > Sent: Saturday, January 24, 2015 6:19 AM >> > To: Gregory Farnum >> > Cc: ceph-devel@vger.kernel.org >> > Subject: Re: idempotent op (esp delete) >> > >> > On Fri, 23 Jan 2015, Gregory Farnum wrote: >> > > On Fri, Jan 23, 2015 at 1:43 PM, Sage Weil <sw...@redhat.com> wrote: >> > > > Background: >> > > > >> > > > 1) Way back when we made a task that would thrash the cache modes >> > > > by adding and removing the cache tier while ceph_test_rados was >> > > > running. >> > > > This mostly worked, but would occasionally fail because we would >> > > > >> > > > - delete an object from the cache tier >> > > > - a network failure injection would lose the reply >> > > > - we'd disable the cache >> > > > - the delete would resend to the base tier, not get recognized as >> > > > a dup (different pool, different pg log) >> > > > -> -ENOENT instead of 0 >> > > > >> > > > 2) The proxy write code hits a similar problem: >> > > > >> > > > - delete gets proxied >> > > > - we initiate async promote >> > > > - a network failure injection loses the delete reply >> > > > - delete resends and blocks on promote (or arrives after it >> > > > finishes) >> > > > - promote finishes >> > > > - delete is handled >> > > > -> -ENOENT instead of 0 >> > > > >> > > > The ticket is http://tracker.ceph.com/issues/8935 >> > > > >> > > > The problem is partially addressed by >> > > > >> > > > https://github.com/ceph/ceph/pull/3447 >> > > > >> > > > by logging a few request ids on every object_info_t and preserving >> > > > that on promote and flush. >> > > > >> > > > However, it doesn't solve the problem for delete because we throw >> > > > out object_info_t so that reqid_t is lost. >> > > > >> > > > I think we have two options, not necessarily mutually exclusive: >> > > > >> > > > 1) When promoting an object that doesn't exist (to create a >> > > > whiteout), pull reqids out of the base tier's pg log so that the >> > > > whiteout is primed with request ids. >> > > > >> > > > 1.5) When flushing... well, that is harder because we have nowhere >> > > > to put the reqids. Unless we make a way to cram a list of reqid's >> > > > into a single PG log entry...? In that case, we wouldn't strictly >> > > > need the per-object list since we could pile the base tier's >> > > > reqids into the promote log entry in the cache tier. >> > > > >> > > > 2) Make delete idempotent (0 instead of ENOENT if the object >> > > > doesn't exist). This will require a delicate compat transition >> > > > (let's ignore that a moment) but you can preserve the old behavior >> > > > for callers that care by preceding the delete with an assert_exists op. >> > > > Most callers don't care, but a handful do. This simplifies the >> > > > semantics we need to support going forward. >> > > > >> > > > Of course, it's all a bit delicate. The idempotent op semantics >> > > > have a time horizon so it's all a bit wishy-washy... :/ >> > > > >> > > > Thoughts? >> > > >> > > Do we have other cases that we're worried about which would be >> > > improved by maintaining reqids across pool cache transitions? I'm >> > > not a big fan of maintaining those per-op lists (they sound really >> > > expensive?), but if we need them for something else that's a point >> > > in their favor. >> > >> > I don't think they're *too* expensive (say, vector of 20 per >> > object_info_t?). But the only thing I can think of beyond the cache >> > tiering stuff would be cases where the pg log isnt long enough for a very >> > laggy client. In general ops will be distributed across ops so it will be >> > catch the dup from another angle. >> > >> > However.. I just hacked up a patch that lets us cram lots of reqids into a >> > single pg_log_entry_t and I think that may be a simpler solution. We can >> > cram all reqids (for the last N of them) for promote and flush into the >> > single log entry and the delete is no longer special.. it'd work equally >> > well for other dups and for class methods that do who knows what. The >> > patch is here: >> > >> > https://github.com/liewegas/ceph/commit/wip-pg-reqids >> > >> > What do you think? >> > >> > > We could make delete idempotent instead and that's what I initially >> > > favor, but it also seems a bit scary (it's not like our operations >> > > can be made idempotent; lots of them invoke classes that will differ >> > > or >> > > whatever!) and I can't think of which callers might care so I'm >> > > having trouble formulating the bounds of this solution. >> > >> > Yeah, it seems like an easier endpoint but a dangerous path to get there... >> > >> > sage >> > -- >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" >> > in the body of a message to majord...@vger.kernel.org More majordomo >> > info at http://vger.kernel.org/majordomo-info.html >> > -- >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" >> > in the body of a message to majord...@vger.kernel.org More majordomo >> > info at http://vger.kernel.org/majordomo-info.html >> > >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the >> body of a message to majord...@vger.kernel.org More majordomo info at >> http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html