I've added the test cases and made the style changes. The test cases ran without problems. Please merge the pull request https://github.com/ceph/ceph/pull/2177. Thanks.
-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, July 29, 2014 11:44 PM
To: Wang, Zhiqiang
Cc: Zhang, Jian; 'ico...@redhat.com'; 'sam.j...@inktank.com'; 'ceph-devel@vger.kernel.org'
Subject: RE: Cache tiering read-proxy mode

On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
> Thanks for the review.
>
> I have one question about the comment "move the hit_set check into
> maybe_handle_cache". The current code inserts 'oid' into the hit set
> before calling maybe_handle_cache. If 'oid' is the same as
> 'missing_oid', and we move the hit_set check into maybe_handle_cache,
> we'll always see this 'oid' in the in-memory hit sets, and never
> redirect the 1st read. That's the reason why I added the hit_set
> check before the insert.

Ah, yeah, that makes sense!

sage

> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, July 29, 2014 4:00 AM
> To: Wang, Zhiqiang
> Cc: Zhang, Jian; 'ico...@redhat.com'; 'sam.j...@inktank.com';
> 'ceph-devel@vger.kernel.org'
> Subject: RE: Cache tiering read-proxy mode
>
> On Mon, 28 Jul 2014, Wang, Zhiqiang wrote:
> > Hi Sage,
> >
> > I made this change in
> > https://github.com/wonzhq/ceph/commit/924e418abb831338e2df7f4a4ec9409b02ee5524
> > and unit tested it. Could you take a look and give comments? Thanks.
>
> I made a few comments on the commit on github. Overall it looks good,
> but we should add a test to ceph_test_rados_api_tier (test/librados/tier.cc).
>
> Thanks!
> sage
>
> > -----Original Message-----
> > From: Wang, Zhiqiang
> > Sent: Tuesday, July 22, 2014 9:38 AM
> > To: Sage Weil
> > Cc: Zhang, Jian; ico...@redhat.com; sam.j...@inktank.com;
> > ceph-devel@vger.kernel.org
> > Subject: RE: Cache tiering read-proxy mode
> >
> > Since we can't be accurate at the seconds level, how about making
> > the min_read_recency_for_promote option the number of 'hit set
> > intervals' instead of the number of seconds? So that, when
> > min_read_recency_for_promote is:
> > 1) 0, promote on the first read
> > 2) 1, promote on the second read, checking only the current hit set
> > 3) any other number N, promote on the second read, keeping N
> > (including the current one) hit sets in memory, and checking object
> > existence in these hit sets regardless of hit set rotation
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sw...@redhat.com]
> > Sent: Monday, July 21, 2014 10:20 PM
> > To: Wang, Zhiqiang
> > Cc: Zhang, Jian; ico...@redhat.com; sam.j...@inktank.com;
> > ceph-devel@vger.kernel.org
> > Subject: RE: Cache tiering read-proxy mode
> >
> > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > In the current code, when the evict mode is idle, we just keep the
> > > current hit set in memory. All the other hit sets
> > > (hit_set_count-1) are on disk. And when the evict mode is not
> > > idle, all the hit sets are loaded into memory. When the current
> > > hit set is full or exceeds its interval, it is persisted to disk.
> > > A new hit set is created to act as the current one, and the oldest
> > > is removed from disk.
> > >
> > > So, if we introduce the min_read_recency_for_promote option, say
> > > the user sets its value to 200, and the value of 'hit set
> > > interval' to 60, does it mean we need to always keep the
> > > 200/60+1=4 latest hit sets in memory (assuming 'hit set count' is
> > > greater than 4; 'hit set count' of them if not), even if the evict
> > > mode is idle?
> > > And when persisting the current hit set, is it still kept in
> > > memory, while the oldest in-memory hit set is removed from memory?
> >
> > Exactly. We can probably just make the helper that loads these into
> > memory for the tiering agent sufficiently generic (if it isn't
> > already) so that it keeps the right number of them in memory when the
> > agent is inactive.
> >
> > > Btw, I don't quite get what you said about the normal hit set
> > > rotation part.
> >
> > If we set the tunable to, say, one hour, and the HitSet interval is
> > also an hour, then does this mean we always have 2 HitSets in RAM, so
> > that we cover *at least* an hour while the newest is being populated?
> > If we decide to check the first and second HitSets, then we are
> > actually covering up to double the configured period.
> >
> > sage
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sw...@redhat.com]
> > > Sent: Monday, July 21, 2014 11:55 AM
> > > To: Wang, Zhiqiang
> > > Cc: Zhang, Jian; ico...@redhat.com; sam.j...@inktank.com;
> > > ceph-devel@vger.kernel.org
> > > Subject: RE: Cache tiering read-proxy mode
> > >
> > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > For the min_read_recency_for_promote option, it's easy to
> > > > understand the '0' and '<= hit set interval' cases. But for the
> > > > '> hit set interval' case, do you mean we always keep all the
> > > > hit sets in RAM and check for the object's existence in all of
> > > > them, or just load all the hit sets and check for object
> > > > existence before the read? In other words, when
> > > > min_read_recency_for_promote is greater than 'hit set interval',
> > > > do we always keep all the hit sets in RAM?
> > >
> > > I'm thinking we would keep as many HitSets as are needed to cover
> > > whatever the configured interval is. Setting the option to the
> > > same value as the hitset interval (or just '1'?) would be the
> > > simplest thing, and probably the default?
> > >
> > > We would need to decide what behavior we want with respect to the
> > > normal HitSet rotation, though. If they each cover, say, one hour,
> > > then on average they will cover half of that, and sometimes almost
> > > no time at all (if they just rotated). So probably we'd want to
> > > keep the next-most-recent in memory for some period? It'll always
> > > be a bit imprecise, though, but hopefully it won't really matter...
> > >
> > > sage
> > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sw...@redhat.com]
> > > > Sent: Monday, July 21, 2014 9:44 AM
> > > > To: Wang, Zhiqiang
> > > > Cc: Zhang, Jian; ico...@redhat.com; sam.j...@inktank.com;
> > > > ceph-devel@vger.kernel.org
> > > > Subject: RE: Cache tiering read-proxy mode
> > > >
> > > > [Adding ceph-devel]
> > > >
> > > > On Mon, 21 Jul 2014, Wang, Zhiqiang wrote:
> > > > > Sage,
> > > > >
> > > > > I agree with you that promotion on the 2nd read could improve
> > > > > cache tiering's performance for some kinds of workloads. The
> > > > > general idea here is to implement some kind of policy in the
> > > > > cache tier to measure the warmness of the data. If the cache
> > > > > tier is aware of the data warmness, it could even initiate
> > > > > data movement between the cache tier and the base tier. This
> > > > > means data could be prefetched into the cache tier before
> > > > > reading or writing. But I think this is something we could do
> > > > > in the future.
> > > >
> > > > Yeah. I suspect it will be challenging to put this sort of
> > > > prefetching intelligence directly into the OSDs, though. It
> > > > could possibly be done by an external agent, maybe, or could be
> > > > driven by explicit hints from clients ("I will probably access
> > > > this data soon").
> > > >
> > > > > The 'promotion on 2nd read' policy is straightforward. Sure it
> > > > > will benefit some kinds of workload, but not all.
> > > > > If it is implemented as a cache tier option, the user needs to
> > > > > decide whether to turn it on or not. But I'm afraid most users
> > > > > won't have a good sense of this. This increases the difficulty
> > > > > of using cache tiering.
> > > >
> > > > I suspect the 2nd read behavior will be something we'll want to
> > > > do by default... but yeah, there will be a new pool option (or
> > > > options) that controls the behavior.
> > > >
> > > > > One question for the implementation of 'promotion on 2nd read':
> > > > > what do we do for the 1st read? Does the cache tier read the
> > > > > object from the base tier but not do replication, or just
> > > > > redirect it?
> > > >
> > > > For the first read, we just redirect the client. Then on the
> > > > second read, we call promote_object(). See maybe_handle_cache()
> > > > in ReplicatedPG.cc. We can pretty easily tell the difference by
> > > > checking the in-memory HitSet for a match.
> > > >
> > > > Perhaps the option in the pool would be something like
> > > > min_read_recency_for_promote? If we measure "recency" as "(avg)
> > > > seconds since last access" (loosely), 0 would mean it would
> > > > promote on first read, and anything <= the HitSet interval would
> > > > mean promote if the object is in the current HitSet. Anything
> > > > greater than that would mean we'd need to keep additional
> > > > previous HitSets in RAM.
> > > >
> > > > ...which leads us to a separate question of how to describe
> > > > access frequency vs recency. We keep N HitSets, each covering a
> > > > time period of T seconds. Normally we only keep the most recent
> > > > HitSet in memory, unless the agent is active (flushing data).
> > > > So what I described above is checking how recently the last
> > > > access was (within how many multiples of T seconds).
> > > > Additionally, though, we could describe the frequency of access:
> > > > was the object accessed at least once in each of the N intervals
> > > > of T seconds?
> > > > Or some fraction of them? That is probably best described as
> > > > "temperature"? I'm not too fond of the term "recency," though I
> > > > can't think of anything better right now.
> > > >
> > > > Anyway, for the read promote behavior, recency is probably
> > > > sufficient, but for the tiering agent flush/evict behavior,
> > > > temperature might be a good thing to consider...
> > > >
> > > > sage
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > > ceph-devel" in the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html