Re: Fwd: how io works when backfill
If we add in osd.7 and 7 becomes the primary, pg1.0 [1, 2, 3] --> pg1.0
[7, 2, 3], is it similar to the example above? That is, a pg_temp entry is
installed mapping the PG back to [1, 2, 3], then backfill happens to 7,
normal IO writes go to [1, 2, 3], and IO to the portion of the PG that has
already been backfilled is also sent to osd.7?

How about these examples of removing an OSD:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [1, 2, 5], but osd.5 has no data, so a pg_temp
  entry is installed mapping the PG back to [1, 2], then backfill happens
  to 5
- normal IO writes go to [1, 2]; if IO hits an object which has already
  been backfilled to osd.5, the IO is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
  changes back to [1, 2, 5]

Another example:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [5, 1, 2], but osd.5 has no data for the PG, so a
  pg_temp entry is installed mapping the PG back to [1, 2], with osd.1
  temporarily becoming the primary, then backfill happens to 5
- normal IO writes go to [1, 2]; if IO hits an object which has already
  been backfilled to osd.5, the IO is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
  changes back to [5, 1, 2]

Is my analysis right?

2015-12-29 1:30 GMT+08:00 Sage Weil <s...@newdream.net>:
> On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
>> 2015-12-27 20:48 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> > Hi,
>> > When adding or removing an OSD, Ceph backfills to rebalance data.
>> > eg:
>> > - pg1.0 [1, 2, 3]
>> > - add an osd (eg. osd.7)
>> > - ceph starts backfill, then the pg1.0 osd set changes to [1, 2, 7]
>> > - if [a, b, c, d, e] are objects needing to be backfilled to osd.7
>> >   and object a is now backfilling
>> > - when a write IO hits object a, the IO needs to wait for the
>> >   backfill of a to complete, then goes on
>> > - but if IO hits object b which has not been backfilled, the IO
>> >   reaches osd.1, then osd.1 sends the IO to osd.2 and osd.7, but
>> >   osd.7 does not have object b, so osd.7 needs to wait for object b
>> >   to be backfilled, then write. Is that right? Or does osd.1 only
>> >   send the IO to osd.2, not both?
>>
>> I think in this case, when the write of object b reaches osd.1, it
>> holds the client write, raises the priority of the recovery of object
>> b, and kicks off its recovery. When the recovery of object b is done,
>> it requeues the client write, and then everything goes on as usual.
>
> It's more complicated than that. In a normal (log-based) recovery
> situation, it is something like the above: if the acting set is [1,2,3]
> but 3 is missing the latest copy of A, a write to A will block on the
> primary while the primary initiates recovery of A immediately. Once
> that completes the IO will continue.
>
> For backfill, it's different. In your example, you start with [1,2,3]
> then add in osd.7. The OSD will see that 7 has no data for the PG and
> install a pg_temp entry mapping the PG back to [1,2,3] temporarily.
> Then things will proceed normally while backfill happens to 7.
> Backfill won't interfere with normal IO at all, except that IO to the
> portion of the PG that has already been backfilled will also be sent
> to the backfill target (7) so that it stays up to date. Once it
> completes, the pg_temp entry is removed and the mapping changes back
> to [1,2,7]. Then osd.3 is allowed to remove its copy of the PG.
>
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
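The IO routing Sage describes can be sketched as a toy model (not Ceph
source code; Ceph actually tracks the backfill position as a hash-ordered
"last_backfill" marker, which the plain string comparison below only
stands in for):

```python
# Toy model: while a pg_temp entry pins the acting set, client writes go
# to the old replicas; writes to the already-backfilled portion of the PG
# are additionally sent to the backfill target so its copy stays current.

def write_targets(acting, backfill_target, last_backfill, obj):
    """Return the set of OSDs a client write to `obj` is replicated to."""
    targets = set(acting)             # pg_temp keeps IO on the old set
    if obj <= last_backfill:          # object already copied to the target
        targets.add(backfill_target)  # keep the new copy up to date
    return targets

# pg1.0 mapped [1,2,3] -> [1,2,7]; pg_temp pins the acting set to [1,2,3]
acting, target = [1, 2, 3], 7

# backfill has progressed through object "b"
assert write_targets(acting, target, "b", "a") == {1, 2, 3, 7}
assert write_targets(acting, target, "b", "d") == {1, 2, 3}
```

Once backfill finishes, the pg_temp entry is dropped and writes simply go
to the final mapping; no special case remains.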
how io works when backfill
Hi,
When adding or removing an OSD, Ceph backfills to rebalance data. eg:
- pg1.0 [1, 2, 3]
- add an osd (eg. osd.7)
- ceph starts backfill, then the pg1.0 osd set changes to [1, 2, 7]
- if [a, b, c, d, e] are objects needing to be backfilled to osd.7 and
  object a is now backfilling
- when a write IO hits object a, the IO needs to wait for the backfill of
  a to complete, then goes on
- but if IO hits object b which has not been backfilled, the IO reaches
  osd.1, then osd.1 sends the IO to osd.2 and osd.7, but osd.7 does not
  have object b, so osd.7 needs to wait for object b to be backfilled,
  then write. Is that right? Or does osd.1 only send the IO to osd.2, not
  both?
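For the log-based recovery case discussed earlier in this thread (a write
hits a degraded object), the hold-prioritize-requeue behavior can be
sketched roughly as follows. This is an illustrative model, not Ceph's
implementation; all class and method names are made up:

```python
# Toy model of log-based recovery: a write to a degraded object is held
# on the primary, the object's recovery is bumped to top priority, and
# the held write is requeued once recovery completes.
import heapq

class Primary:
    def __init__(self, missing):
        self.missing = set(missing)   # objects a replica still lacks
        self.recovery_q = []          # (priority, obj) min-heap
        self.waiting = {}             # obj -> writes held until recovered
        self.log = []                 # applied (obj, data) writes

    def client_write(self, obj, data):
        if obj in self.missing:       # degraded: hold the client write
            self.waiting.setdefault(obj, []).append(data)
            heapq.heappush(self.recovery_q, (0, obj))  # 0 = top priority
        else:
            self.log.append((obj, data))  # normal replicated write

    def recover_one(self):
        _, obj = heapq.heappop(self.recovery_q)
        self.missing.discard(obj)     # object pushed to the replica
        for data in self.waiting.pop(obj, []):
            self.client_write(obj, data)  # requeue the held writes

p = Primary(missing=["a"])
p.client_write("a", "v1")   # held: object a is degraded
p.client_write("b", "v1")   # proceeds normally
p.recover_one()             # a recovered; the held write is requeued
```

Backfill, as Sage explains, deliberately avoids this blocking path: the
pg_temp entry keeps IO on the fully-populated replicas instead.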
Re: [ceph-users] why not add (offset,len) to pglog
Thank you for your reply. I am looking forward to Sage's opinion too
@sage. Also I'll keep up with BlueStore's and KStore's progress.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao <zay11...@gmail.com>:
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate and delete) as well as the different effects
> on different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of bringing in
> side effects. We may prepare and propose the whole design at the next
> CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery and
> so on). So this kind of patch may not solve a critical problem but just
> make things better, and I am not quite sure that it will really bring a
> big improvement. Based on my previous test, it works excellently on
> slow disks (say hdd), and also for short-time maintenance. Otherwise,
> it will trigger the backfill process. So wait for Sage's opinion @sage
>
> If you are interested in this, we may cooperate to do it.
>
> Regards
> Ning Yao
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> Thanks, from this pull request I learned that this issue is not
>> completed; is there any new progress on it?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) <xmdx...@gmail.com>:
>>> Yeah, this is a good idea for recovery, but not for backfill.
>>> @YaoNing made a pull request about this,
>>> https://github.com/ceph/ceph/pull/3837, this year.
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>>>> Hi,
>>>> I have a doubt about the pglog. The pglog contains
>>>> (op, object, version) etc. When peering, the pglog is used to
>>>> construct the missing list, and then the whole object in the missing
>>>> list is recovered even if the data that differs among replicas is
>>>> less than a whole object (eg, 4MB).
>>>> Why not add (offset, len) to the pglog? If so, the missing list can
>>>> contain (object, offset, len), and then we can reduce the amount of
>>>> recovered data.
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-us...@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> --
>>> Regards,
>>> Xinze Chi
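The proposal being discussed (extent-granular missing lists) can be
sketched like this. It is only an illustration of the idea, not Ceph
code; the interval-merging scheme is an assumption of mine:

```python
# Toy model: if pglog entries carried (offset, len), peering could build
# a missing list of byte extents per object instead of whole objects,
# so recovery copies only the divergent bytes.

def merge_extents(extents):
    """Coalesce overlapping or adjacent (offset, len) extents."""
    out = []
    for off, ln in sorted(extents):
        if out and off <= out[-1][0] + out[-1][1]:
            last_off, last_ln = out[-1]
            out[-1] = (last_off, max(last_off + last_ln, off + ln) - last_off)
        else:
            out.append((off, ln))
    return out

def missing_extents(divergent_log):
    """Missing list: object -> merged extents to recover."""
    per_obj = {}
    for obj, off, ln in divergent_log:
        per_obj.setdefault(obj, []).append((off, ln))
    return {obj: merge_extents(ext) for obj, ext in per_obj.items()}

# A replica missed three small writes: only ~12 KB needs recovery
# instead of two whole 4 MB objects.
log = [("objA", 0, 4096), ("objA", 4096, 4096), ("objB", 65536, 4096)]
assert missing_extents(log) == {"objA": [(0, 8192)], "objB": [(65536, 4096)]}
```

As Ning Yao notes above, the hard part is not this bookkeeping but the
interactions with truncate, delete, and the EC backend, and the fact
that long outages fall back to backfill anyway.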
Re: [ceph-users] why not add (offset,len) to pglog
Thanks, from this pull request I learned that this issue is not
completed; is there any new progress on it?

2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) <xmdx...@gmail.com>:
> Yeah, this is a good idea for recovery, but not for backfill.
> @YaoNing made a pull request about this,
> https://github.com/ceph/ceph/pull/3837, this year.
>
> 2015-12-25 11:16 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> Hi,
>> I have a doubt about the pglog. The pglog contains (op, object,
>> version) etc. When peering, the pglog is used to construct the missing
>> list, and then the whole object in the missing list is recovered even
>> if the data that differs among replicas is less than a whole object
>> (eg, 4MB).
>> Why not add (offset, len) to the pglog? If so, the missing list can
>> contain (object, offset, len), and then we can reduce the amount of
>> recovered data.
>
> --
> Regards,
> Xinze Chi
why not add (offset,len) to pglog
Hi,
I have a doubt about the pglog. The pglog contains (op, object, version)
etc. When peering, the pglog is used to construct the missing list, and
then the whole object in the missing list is recovered even if the data
that differs among replicas is less than a whole object (eg, 4MB).
Why not add (offset, len) to the pglog? If so, the missing list can
contain (object, offset, len), and then we can reduce the amount of
recovered data.
[no subject]
subscribe ceph-devel