Re: PG Backend Proposal
On Thu, 1 Aug 2013, Samuel Just wrote: > I think there are some tricky edge cases with the above approach. You > might end up with two pg replicas in the same acting set which happen > for reasons of history to have the same chunk for one or more objects. > That would have to be detected and repaired even though the object > would be missing from neither replica (and might not even be in the pg > log). The erasure_code_rank would have to be somehow maintained > through recovery (do we remember the original holder of a particular > chunk in case it ever comes back?). > > The chunk rank doesn't *need* to match the acting set position, but > there are some good reasons to arrange for that to be the case: > 1) Otherwise, we need something else to assign the chunk ranks > 2) This way, a new primary can determine which osds hold which > replicas of which chunk rank by looking at past osd maps. > > It seems to me that given an OSDMap and an object, we should know > immediately where all chunks should be stored since a future primary > may need to do that without access to the objects themselves. > > Importantly, while it may be possible for an acting set transition > like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a > mode which will cause replacement to behave well for erasure codes: > > initial: [0,1,2] > 0 fails: [3,1,2] > 2 fails: [3,1,4] > 0 recovers: [0,1,4] > > We do, however, need to decouple primariness from position in the > acting set so that backfill can work well. BTW, this reminds me: this might also be a good time to add the ability in the OSDMap to probabilistically reorder acting sets (and adjust primariness) for any mapping so that we can cheaply shift read traffic around (e.g., based on normal statistical skew, or temorary workload balance). Currently the only way to do this also involves moving data. s > -Sam > > On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary wrote: > > Hi Sam, > > > > I'm under the impression that > > https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions > > assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc. > > > > The chunk rank does not need to match the OSD position in the acting set. > > As long as each object chunk is stored with its rank in an attribute, > > changing the order of the acting set does not require to move the chunks > > around. > > > > With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on > > [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute > > set to their rank. > > > > If the acting set changes to [2,1,0] the read would reorder the chunk based > > on their 'erasure_code_rank' attribute instead of the rank of the OSD they > > originate from in the current acting set. And then be able to decode them > > with the erasure code library, which requires that the chunks are provided > > in a specific order. > > > > When doing a full write, the chunks are written in the same order as the > > acting set. This implies that the order of the chunks of the previous > > version of the object may be different but I don't see a problem with that. > > > > When doing an append, the primary must first retrieve the order in which > > the objects are stored by retrieving their 'erasure_code_rank' attribute, > > because the order of the acting set is not the same as the order of the > > chunks. It then maps the chunks to the OSDs matching their rank and pushes > > them to the OSDs. 
> > > > The only downside is that it may make things more complicated to implement > > optimizations based on the fact that, sometimes, chunks can just be > > concatenated to recover the content of the object and don't need to be > > decoded ( when using systematic codes and the M data chunks are available ). > > > > Cheers > > > > On 01/08/2013 19:14, Loic Dachary wrote: > >> > >> > >> On 01/08/2013 18:42, Loic Dachary wrote: > >>> Hi Sam, > >>> > >>> When the acting set changes order two chunks for the same object may > >>> co-exist in the same placement group. The key should therefore also > >>> contain the chunk number. > >>> > >>> That's probably the most sensible comment I have so far. This document is > >>> immensely useful (even in its current state) because it shows me your > >>> perspective on the implementation. > >>> > >>> I'm puzzled by: > >> > >> I get it ( thanks to yanzheng ). Object is deleted, then created again ... > >> spurious non version chunks would get in the way. > >> > >> :-) > >> > >>> > >>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires > >>> that we retain the deleted object until all replicas have persisted the > >>> deletion event. ErasureCoded backend will therefore need to store objects > >>> with the version at which they were created included in the key provided > >>> to the filestore. Old versions of an object can be pruned when
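To make the reordering Loic describes concrete, here is a minimal sketch (in plain C, not the actual Ceph OSD implementation, which is C++) of placing fetched chunks by their stored 'erasure_code_rank' attribute before handing them to the erasure code library, rather than by the position of the OSD they came from in the current acting set. The struct and the decode callback are assumptions for illustration only.

/*
 * Sketch only: order fetched chunks by their stored rank, then hand
 * them to the erasure code library, which expects a fixed order.
 */
#include <stdlib.h>

struct fetched_chunk {
    int    rank;    /* value of the 'erasure_code_rank' attribute */
    char  *data;
    size_t len;
};

int decode_object(struct fetched_chunk *chunks, int nchunks,
                  int (*ec_decode)(char **ordered, int n))
{
    char **ordered = calloc(nchunks, sizeof(*ordered));
    int i, r;

    if (!ordered)
        return -1;
    for (i = 0; i < nchunks; i++) {
        if (chunks[i].rank < 0 || chunks[i].rank >= nchunks) {
            free(ordered);
            return -1;      /* missing or corrupt rank attribute */
        }
        ordered[chunks[i].rank] = chunks[i].data;
    }
    r = ec_decode(ordered, nchunks);
    free(ordered);
    return r;
}

With this arrangement an acting set reordering such as [0,1,2] -> [2,1,0] only changes where chunks are fetched from, not how they are fed to the decoder.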
Re: [PATCH] ceph: fix bugs about handling short-read for sync read mode.
On Fri, 2 Aug 2013, majianpeng wrote: > cephfs . show_layout > >layyout.data_pool: 0 > >layout.object_size: 4194304 > >layout.stripe_unit: 4194304 > >layout.stripe_count: 1 > > TestA: > >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct > >dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct > >dd if=test of=/dev/null bs=6M count=1 iflag=direct > The messages from func striped_read are: > ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 > HITSTRIPE SHORT > ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got > 0 HITSTRIPE SHORT > ceph: file.c:381 : zero tail 4194304 > ceph: file.c:390 : striped_read returns 6291456 > The hole of file is from 2M--4M.But actualy it zero the last 4M include > the last 2M area which isn't a hole. > Using this patch, the messages are: > ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 > HITSTRIPE SHORT > ceph: file.c:358 : zero gap 2097152 to 4194304 > ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got > 2097152 > ceph: file.c:384 : striped_read returns 6291456 > > TestB: > >echo majianpeng > test > >dd if=test of=/dev/null bs=2M count=1 iflag=direct > The messages are: > ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 > HITSTRIPE SHORT > ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 > HITSTRIPE SHORT > ceph: file.c:390 : striped_read returns 11 > For this case,it did once more striped_read.It's no meaningless. > Using this patch, the message are: > ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 > HITSTRIPE SHORT > ceph: file.c:384 : striped_read returns 11 > > Big thanks to Yan Zheng for the patch. Thanks for working through this! Do you mind putting the test into a simple .sh script (or whatever) that verifies this is working properly (that the read returns the correct data) on a file in the currenct directory? Then we can add this to the collection of tests in ceph.git/qa/workunits. Probably writing the same thing to $CWD/test and /tmp/test.$$ and running cmp to verify they match is the simplest. Thanks! sage > > Signed-off-by: Jianpeng Ma > Reviewed-by: Yan, Zheng > --- > fs/ceph/file.c | 40 +--- > 1 file changed, 17 insertions(+), 23 deletions(-) > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > index 2ddf061..3d8d14d 100644 > --- a/fs/ceph/file.c > +++ b/fs/ceph/file.c > @@ -349,44 +349,38 @@ more: > dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read, >ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : ""); > > - if (ret > 0) { > - int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT; > - > - if (read < pos - off) { > - dout(" zero gap %llu to %llu\n", off + read, pos); > - ceph_zero_page_vector_range(page_align + read, > - pos - off - read, pages); > + if (ret >= 0) { > + int didpages; > + if (was_short && (pos + ret < inode->i_size)) { > + u64 tmp = min(this_len - ret, > + inode->i_size - pos - ret); > + dout(" zero gap %llu to %llu\n", > + pos + ret, pos + ret + tmp); > + ceph_zero_page_vector_range(page_align + read + ret, > + tmp, pages); > + ret += tmp; > } > + > + didpages = (page_align + ret) >> PAGE_CACHE_SHIFT; > pos += ret; > read = pos - off; > left -= ret; > page_pos += didpages; > pages_left -= didpages; > > - /* hit stripe? */ > - if (left && hit_stripe) > + /* hit stripe and need continue*/ > + if (left && hit_stripe && pos < inode->i_size) > goto more; > + > } > > - if (was_short) { > + if (ret >= 0) { > + ret = read; > /* did we bounce off eof? 
*/ > if (pos + left > inode->i_size) > *checkeof = 1; > - > - /* zero trailing bytes (inside i_size) */ > - if (left > 0 && pos < inode->i_size) { > - if (pos + left > inode->i_size) > - left = inode->i_size - pos; > - > - dout("zero tail %d\n", left); > - ceph_zero_page_vector_range(page_align + read, left, > - pages); > - read += left; > - } > } > > - if (ret >= 0) > - ret = read; > dout("striped_read returns %d\n", ret); > return ret; > } > -- > 1.8.1.2
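Sage asks above for the test to be captured as a qa workunit. The check itself boils down to something like the following, expressed here in C rather than shell as a rough sketch only: write 2M of data at offset 0 and 2M at offset 4M (leaving a 2M hole), read the whole 6M back with O_DIRECT, and verify the hole reads back as zeros while the data regions match. It assumes it is run from a directory inside a CephFS mount with the default 4M layout; the file name is illustrative.

/* Sketch of the short-read check described in TestA above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(void)
{
    char *wbuf, *rbuf;
    ssize_t got;
    size_t i;
    int fd;

    /* O_DIRECT requires aligned buffers. */
    if (posix_memalign((void **)&wbuf, 4096, 2 * MB) ||
        posix_memalign((void **)&rbuf, 4096, 6 * MB))
        return 1;
    memset(wbuf, 0xab, 2 * MB);

    fd = open("shortread-test", O_CREAT | O_RDWR | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (pwrite(fd, wbuf, 2 * MB, 0) != 2 * MB ||
        pwrite(fd, wbuf, 2 * MB, 4 * MB) != 2 * MB) {
        perror("pwrite"); return 1;
    }

    got = pread(fd, rbuf, 6 * MB, 0);
    if (got != 6 * MB) {
        fprintf(stderr, "short read: got %zd\n", got);
        return 1;
    }
    for (i = 0; i < 6 * MB; i++) {
        char want = (i < 2 * MB || i >= 4 * MB) ? (char)0xab : 0;
        if (rbuf[i] != want) {
            fprintf(stderr, "mismatch at offset %zu\n", i);
            return 1;
        }
    }
    printf("ok\n");
    close(fd);
    return 0;
}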
Re: PG Backend Proposal
I think there are some tricky edge cases with the above approach. You might end up with two pg replicas in the same acting set which happen for reasons of history to have the same chunk for one or more objects. That would have to be detected and repaired even though the object would be missing from neither replica (and might not even be in the pg log). The erasure_code_rank would have to be somehow maintained through recovery (do we remember the original holder of a particular chunk in case it ever comes back?). The chunk rank doesn't *need* to match the acting set position, but there are some good reasons to arrange for that to be the case: 1) Otherwise, we need something else to assign the chunk ranks 2) This way, a new primary can determine which osds hold which replicas of which chunk rank by looking at past osd maps. It seems to me that given an OSDMap and an object, we should know immediately where all chunks should be stored since a future primary may need to do that without access to the objects themselves. Importantly, while it may be possible for an acting set transition like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a mode which will cause replacement to behave well for erasure codes: initial: [0,1,2] 0 fails: [3,1,2] 2 fails: [3,1,4] 0 recovers: [0,1,4] We do, however, need to decouple primariness from position in the acting set so that backfill can work well. -Sam On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary wrote: > Hi Sam, > > I'm under the impression that > https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions > assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc. > > The chunk rank does not need to match the OSD position in the acting set. As > long as each object chunk is stored with its rank in an attribute, changing > the order of the acting set does not require to move the chunks around. > > With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on > [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute set > to their rank. > > If the acting set changes to [2,1,0] the read would reorder the chunk based > on their 'erasure_code_rank' attribute instead of the rank of the OSD they > originate from in the current acting set. And then be able to decode them > with the erasure code library, which requires that the chunks are provided in > a specific order. > > When doing a full write, the chunks are written in the same order as the > acting set. This implies that the order of the chunks of the previous version > of the object may be different but I don't see a problem with that. > > When doing an append, the primary must first retrieve the order in which the > objects are stored by retrieving their 'erasure_code_rank' attribute, because > the order of the acting set is not the same as the order of the chunks. It > then maps the chunks to the OSDs matching their rank and pushes them to the > OSDs. > > The only downside is that it may make things more complicated to implement > optimizations based on the fact that, sometimes, chunks can just be > concatenated to recover the content of the object and don't need to be > decoded ( when using systematic codes and the M data chunks are available ). 
> > Cheers > > On 01/08/2013 19:14, Loic Dachary wrote: >> >> >> On 01/08/2013 18:42, Loic Dachary wrote: >>> Hi Sam, >>> >>> When the acting set changes order two chunks for the same object may >>> co-exist in the same placement group. The key should therefore also contain >>> the chunk number. >>> >>> That's probably the most sensible comment I have so far. This document is >>> immensely useful (even in its current state) because it shows me your >>> perspective on the implementation. >>> >>> I'm puzzled by: >> >> I get it ( thanks to yanzheng ). Object is deleted, then created again ... >> spurious non version chunks would get in the way. >> >> :-) >> >>> >>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that >>> we retain the deleted object until all replicas have persisted the deletion >>> event. ErasureCoded backend will therefore need to store objects with the >>> version at which they were created included in the key provided to the >>> filestore. Old versions of an object can be pruned when all replicas have >>> committed up to the log event deleting the object. >>> >>> because I don't understand why the version would be necessary. I thought >>> that deleting an erasure coded object could be even easier than erasing a >>> replicated object because it cannot be resurrected if enough chunks are >>> lots, therefore you don't need to wait for ack from all OSDs in the up set. >>> I'm obviously missing something. >>> >>> I failed to understand how important the pg logs were to maintaining the >>> consistency of the PG. For some reason I thought about them only in terms >>> of
Re: PG Backend Proposal
Hi Sam, I'm under the impression that https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc. The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require to move the chunks around. With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute set to their rank. If the acting set changes to [2,1,0] the read would reorder the chunk based on their 'erasure_code_rank' attribute instead of the rank of the OSD they originate from in the current acting set. And then be able to decode them with the erasure code library, which requires that the chunks are provided in a specific order. When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks of the previous version of the object may be different but I don't see a problem with that. When doing an append, the primary must first retrieve the order in which the objects are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not the same as the order of the chunks. It then maps the chunks to the OSDs matching their rank and pushes them to the OSDs. The only downside is that it may make things more complicated to implement optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ). Cheers On 01/08/2013 19:14, Loic Dachary wrote: > > > On 01/08/2013 18:42, Loic Dachary wrote: >> Hi Sam, >> >> When the acting set changes order two chunks for the same object may >> co-exist in the same placement group. The key should therefore also contain >> the chunk number. >> >> That's probably the most sensible comment I have so far. This document is >> immensely useful (even in its current state) because it shows me your >> perspective on the implementation. >> >> I'm puzzled by: > > I get it ( thanks to yanzheng ). Object is deleted, then created again ... > spurious non version chunks would get in the way. > > :-) > >> >> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that >> we retain the deleted object until all replicas have persisted the deletion >> event. ErasureCoded backend will therefore need to store objects with the >> version at which they were created included in the key provided to the >> filestore. Old versions of an object can be pruned when all replicas have >> committed up to the log event deleting the object. >> >> because I don't understand why the version would be necessary. I thought >> that deleting an erasure coded object could be even easier than erasing a >> replicated object because it cannot be resurrected if enough chunks are >> lots, therefore you don't need to wait for ack from all OSDs in the up set. >> I'm obviously missing something. >> >> I failed to understand how important the pg logs were to maintaining the >> consistency of the PG. For some reason I thought about them only in terms of >> being a light weight version of the operation logs. Adding a payload to the >> pg_log_entry ( i.e. 
APPEND size or attribute ) is a new idea for me and I >> would have never thought or dared think the logs could be extended in such a >> way. Given the recent problems with logs writes having a high impact on >> performances ( I'm referring to what forced you to introduce code to reduce >> the amount of logs being written to only those that have been changed >> instead of the complete logs ) I thought about the pg logs as something >> immutable. >> >> I'm still trying to figure out how PGBackend::perform_write / read / >> try_rollback would fit in the current backfilling / write / read / scrubbing >> ... code path. >> >> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst >> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h >> >> Cheers >> > -- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing. signature.asc Description: OpenPGP digital signature
Re: Rados Protocol
Hi Niklas, The RADOS reference implementation in C++ is quite large. Reproducing it all in another language would be interesting, but I'm curious whether wrapping the C interface is an option for you? There are Java bindings being worked on here: https://github.com/wido/rados-java. There are links on ceph.com/docs to some information about Ceph, as well as videos on YouTube and links to academic papers. -Noah On Thu, Aug 1, 2013 at 1:01 PM, Niklas Goerke wrote: > Hi, > > I was wondering why there is no native Java implementation of librados. I'm > thinking about creating one and I'm thus looking for a documentation of the > RADOS protocol. > Also the way I see it librados implements the crush algorithm. Is there a > documentation for it? > Also an educated guess about whether the RADOS Protocol is due to changes > would be very much appreciated. > > Thank you in advance > > Niklas
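For reference, wrapping the C interface looks roughly like this: a small librados C example of the surface a Java binding (e.g. via JNI) would wrap instead of reimplementing the RADOS protocol. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named "data"; error handling is abbreviated.

#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    const char *payload = "hello rados";
    char buf[64] = { 0 };
    int got;

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "failed to connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "data", &io) < 0) {
        fprintf(stderr, "failed to open pool\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* Write a whole object, then read it back. */
    rados_write_full(io, "greeting", payload, strlen(payload));
    got = rados_read(io, "greeting", buf, sizeof(buf) - 1, 0);
    if (got >= 0)
        printf("read back %d bytes: %s\n", got, buf);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}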
Re: still recovery issues with cuttlefish
Can you dump your osd settings? sudo ceph --admin-daemon ceph-osd..asok config show -Sam On Thu, Aug 1, 2013 at 12:07 PM, Stefan Priebe wrote: > Mike we already have the async patch running. Yes it helps but only helps it > does not solve. It just hides the issue ... > Am 01.08.2013 20:54, schrieb Mike Dawson: > >> I am also seeing recovery issues with 0.61.7. Here's the process: >> >> - ceph osd set noout >> >> - Reboot one of the nodes hosting OSDs >> - VMs mounted from RBD volumes work properly >> >> - I see the OSD's boot messages as they re-join the cluster >> >> - Start seeing active+recovery_wait, peering, and active+recovering >> - VMs mounted from RBD volumes become unresponsive. >> >> - Recovery completes >> - VMs mounted from RBD volumes regain responsiveness >> >> - ceph osd unset noout >> >> Would joshd's async patch for qemu help here, or is there something else >> going on? >> >> Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY >> >> Thanks, >> >> Mike Dawson >> Co-Founder & Director of Cloud Architecture >> Cloudapt LLC >> 6330 East 75th Street, Suite 170 >> Indianapolis, IN 46250 >> >> On 8/1/2013 2:34 PM, Samuel Just wrote: >>> >>> Can you reproduce and attach the ceph.log from before you stop the osd >>> until after you have started the osd and it has recovered? >>> -Sam >>> >>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG >>> wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Rados Protocol
Hi, I was wondering why there is no native Java implementation of librados. I'm thinking about creating one, and I'm thus looking for documentation of the RADOS protocol. Also, the way I see it, librados implements the CRUSH algorithm. Is there documentation for it? An educated guess about whether the RADOS protocol is subject to change would also be very much appreciated. Thank you in advance Niklas
Re: still recovery issues with cuttlefish
Mike we already have the async patch running. Yes it helps but only helps it does not solve. It just hides the issue ... Am 01.08.2013 20:54, schrieb Mike Dawson: I am also seeing recovery issues with 0.61.7. Here's the process: - ceph osd set noout - Reboot one of the nodes hosting OSDs - VMs mounted from RBD volumes work properly - I see the OSD's boot messages as they re-join the cluster - Start seeing active+recovery_wait, peering, and active+recovering - VMs mounted from RBD volumes become unresponsive. - Recovery completes - VMs mounted from RBD volumes regain responsiveness - ceph osd unset noout Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
I am also seeing recovery issues with 0.61.7. Here's the process: - ceph osd set noout - Reboot one of the nodes hosting OSDs - VMs mounted from RBD volumes work properly - I see the OSD's boot messages as they re-join the cluster - Start seeing active+recovery_wait, peering, and active+recovering - VMs mounted from RBD volumes become unresponsive. - Recovery completes - VMs mounted from RBD volumes regain responsiveness - ceph osd unset noout Would joshd's async patch for qemu help here, or is there something else going on? Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 On 8/1/2013 2:34 PM, Samuel Just wrote: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
here it is Am 01.08.2013 20:36, schrieb Samuel Just: For now, just the main ceph.log. -Sam On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe wrote: m 01.08.2013 20:34, schrieb Samuel Just: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam Sure which log levels? On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ceph.log.gz Description: application/gzip
Re: still recovery issues with cuttlefish
Is there a bug open for this? I suspect we don't sufficiently throttle the snapshot removal work. -Sam On Thu, Aug 1, 2013 at 7:50 AM, Andrey Korolyov wrote: > Second this. Also for long-lasting snapshot problem and related > performance issues I may say that cuttlefish improved things greatly, > but creation/deletion of large snapshot (hundreds of gigabytes of > commited data) still can bring down cluster for a minutes, despite > usage of every possible optimization. > > On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG > wrote: >> Hi, >> >> i still have recovery issues with cuttlefish. After the OSD comes back >> it seem to hang for around 2-4 minutes and then recovery seems to start >> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I >> get a lot of slow request messages an hanging VMs. >> >> What i noticed today is that if i leave the OSD off as long as ceph >> starts to backfill - the recovery and "re" backfilling wents absolutely >> smooth without any issues and no slow request messages at all. >> >> Does anybody have an idea why? >> >> Greets, >> Stefan >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
It doesn't have log levels, should be in /var/log/ceph/ceph.log. -Sam On Thu, Aug 1, 2013 at 11:36 AM, Samuel Just wrote: > For now, just the main ceph.log. > -Sam > > On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe wrote: >> m 01.08.2013 20:34, schrieb Samuel Just: >> >>> Can you reproduce and attach the ceph.log from before you stop the osd >>> until after you have started the osd and it has recovered? >>> -Sam >> >> >> Sure which log levels? >> >> >>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG >>> wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
For now, just the main ceph.log. -Sam On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe wrote: > m 01.08.2013 20:34, schrieb Samuel Just: > >> Can you reproduce and attach the ceph.log from before you stop the osd >> until after you have started the osd and it has recovered? >> -Sam > > > Sure which log levels? > > >> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG >> wrote: >>> >>> Hi, >>> >>> i still have recovery issues with cuttlefish. After the OSD comes back >>> it seem to hang for around 2-4 minutes and then recovery seems to start >>> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I >>> get a lot of slow request messages an hanging VMs. >>> >>> What i noticed today is that if i leave the OSD off as long as ceph >>> starts to backfill - the recovery and "re" backfilling wents absolutely >>> smooth without any issues and no slow request messages at all. >>> >>> Does anybody have an idea why? >>> >>> Greets, >>> Stefan >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
m 01.08.2013 20:34, schrieb Samuel Just: Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam Sure which log levels? On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG wrote: Hi, i still have recovery issues with cuttlefish. After the OSD comes back it seem to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages an hanging VMs. What i noticed today is that if i leave the OSD off as long as ceph starts to backfill - the recovery and "re" backfilling wents absolutely smooth without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: still recovery issues with cuttlefish
Can you reproduce and attach the ceph.log from before you stop the osd until after you have started the osd and it has recovered? -Sam On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG wrote: > Hi, > > i still have recovery issues with cuttlefish. After the OSD comes back > it seem to hang for around 2-4 minutes and then recovery seems to start > (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I > get a lot of slow request messages an hanging VMs. > > What i noticed today is that if i leave the OSD off as long as ceph > starts to backfill - the recovery and "re" backfilling wents absolutely > smooth without any issues and no slow request messages at all. > > Does anybody have an idea why? > > Greets, > Stefan > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
On Thu, 1 Aug 2013, Yan, Zheng wrote: > On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju wrote: > > From: Sha Zhengju > > > > Following we will begin to add memcg dirty page accounting around > __set_page_dirty_ > > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to > avoid exporting > > those details to filesystems. > > > > Signed-off-by: Sha Zhengju > > --- > > fs/ceph/addr.c | 13 + > > 1 file changed, 1 insertion(+), 12 deletions(-) > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > > index 3e68ac1..1445bf1 100644 > > --- a/fs/ceph/addr.c > > +++ b/fs/ceph/addr.c > > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page) > > if (unlikely(!mapping)) > > return !TestSetPageDirty(page); > > > > - if (TestSetPageDirty(page)) { > > + if (!__set_page_dirty_nobuffers(page)) { > it's too early to set the radix tree tag here. We should set page's snapshot > context and increase the i_wrbuffer_ref first. This is because once the tag > is set, writeback thread can find and start flushing the page. Unfortunately I only remember being frustrated by this code. :) Looking at it now, though, it seems like the minimum fix is to set the page->private before marking the page dirty. I don't know the locking rules around that, though. If that is potentially racy, maybe the safest thing would be if __set_page_dirty_nobuffers() took a void* to set page->private to atomically while holding the tree_lock. sage > > > dout("%p set_page_dirty %p idx %lu -- already dirty\n", > > mapping->host, page, page->index); > > return 0; > > @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page) > > snapc, snapc->seq, snapc->num_snaps); > > spin_unlock(&ci->i_ceph_lock); > > > > - /* now adjust page */ > > - spin_lock_irq(&mapping->tree_lock); > > if (page->mapping) { /* Race with truncate? */ > > - WARN_ON_ONCE(!PageUptodate(page)); > > - account_page_dirtied(page, page->mapping); > > - radix_tree_tag_set(&mapping->page_tree, > > - page_index(page), PAGECACHE_TAG_DIRTY); > > - > > this code was coped from __set_page_dirty_nobuffers(). I think the reason > Sage did this is to handle the race described in > __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping == > NULL" can still happen here. Because truncate_inode_page() unmap page from > processes's address spaces first, then delete page from page cache. > > Regards > Yan, Zheng > > > /* > > * Reference snap context in page->private. Also set > > * PagePrivate so that we get invalidatepage callback. > > @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page) > > undo = 1; > > } > > > > - spin_unlock_irq(&mapping->tree_lock); > > > > > > - > > if (undo) > > /* whoops, we failed to dirty the page */ > > ceph_put_wrbuffer_cap_refs(ci, 1, snapc); > > > > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > - > > BUG_ON(!PageDirty(page)); > > return 1; > > } > > -- > > 1.7.9.5 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > >
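To illustrate the direction Sage floats at the end, here is a purely hypothetical sketch of a __set_page_dirty_nobuffers() variant that attaches caller state under the tree_lock. No such helper exists in the kernel; the body paraphrases the 3.x mm flow from memory (using only calls that also appear in the quoted ceph code) and the real locking details may differ.

/*
 * Hypothetical sketch only.  The point is that page->private (the ceph
 * snap context) would be attached in the same tree_lock critical
 * section that tags the page dirty, so writeback can never find a
 * dirty ceph page without its snap context.
 */
static int __set_page_dirty_nobuffers_private(struct page *page, void *priv)
{
	struct address_space *mapping;

	if (TestSetPageDirty(page))
		return 0;			/* already dirty */

	mapping = page_mapping(page);
	if (!mapping)
		return 1;

	spin_lock_irq(&mapping->tree_lock);
	if (page_mapping(page)) {		/* race with truncate? */
		/* attach caller state before the page becomes findable */
		set_page_private(page, (unsigned long)priv);
		SetPagePrivate(page);
		account_page_dirtied(page, mapping);
		radix_tree_tag_set(&mapping->page_tree, page_index(page),
				   PAGECACHE_TAG_DIRTY);
	}
	spin_unlock_irq(&mapping->tree_lock);
	if (mapping->host)
		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	return 1;
}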
v0.67-rc3 Dumpling release candidate
We've tagged and pushed out packages for another release candidate for Dumpling. At this point things are looking very good. There are a few odds and ends with the CLI changes but the core ceph functionality is looking quite stable. Please test! Packages are available in the -testing repos: http://ceph.com/debian-testing/ http://ceph.com/rpm-testing/ Note that at any time you can also run the latest code for the pending release from http://gitbuilder.ceph.com/ceph-deb-$DISTRO-x86_64-basic/ref/next/ The draft release notes for v0.67 dumpling are at http://ceph.com/docs/master/release-notes/ You can see our bug queue at http://tracker.ceph.com/projects/ceph/issues?query_id=27 Happy testing! sage
Re: [PATCH] mds: remove waiting lock before merging with neighbours
On Thu, 1 Aug 2013, David Disseldorp wrote: > Hi, > > Did anyone get a chance to look at this change? > Any comments/feedback/ridicule would be appreciated. Sorry, not yet--and Greg just headed out for vacation yesterday. It's on my list to look at when I have some time tonight or tomorrow, though. Thanks! I'm hopeful this will clear up some of the locking hangs we've seen with the samba and flock tests... sage > > Cheers, David
Re: PG Backend Proposal
On 01/08/2013 18:42, Loic Dachary wrote: > Hi Sam, > > When the acting set changes order two chunks for the same object may co-exist > in the same placement group. The key should therefore also contain the chunk > number. > > That's probably the most sensible comment I have so far. This document is > immensely useful (even in its current state) because it shows me your > perspective on the implementation. > > I'm puzzled by: I get it ( thanks to yanzheng ). Object is deleted, then created again ... spurious non version chunks would get in the way. :-) > > CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we > retain the deleted object until all replicas have persisted the deletion > event. ErasureCoded backend will therefore need to store objects with the > version at which they were created included in the key provided to the > filestore. Old versions of an object can be pruned when all replicas have > committed up to the log event deleting the object. > > because I don't understand why the version would be necessary. I thought that > deleting an erasure coded object could be even easier than erasing a > replicated object because it cannot be resurrected if enough chunks are lots, > therefore you don't need to wait for ack from all OSDs in the up set. I'm > obviously missing something. > > I failed to understand how important the pg logs were to maintaining the > consistency of the PG. For some reason I thought about them only in terms of > being a light weight version of the operation logs. Adding a payload to the > pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I > would have never thought or dared think the logs could be extended in such a > way. Given the recent problems with logs writes having a high impact on > performances ( I'm referring to what forced you to introduce code to reduce > the amount of logs being written to only those that have been changed instead > of the complete logs ) I thought about the pg logs as something immutable. > > I'm still trying to figure out how PGBackend::perform_write / read / > try_rollback would fit in the current backfilling / write / read / scrubbing > ... code path. > > https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst > https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h > > Cheers > -- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing. signature.asc Description: OpenPGP digital signature
Re: PG Backend Proposal
DELETE can always be rolled forward, but there may be other operations in the log that can't be (like an append). So we need to be able to roll it back (I think) perform_write, read, try_rollback probably don't matter to backfill, scrubbing. You are correct, we need to include the chunk number in the object as well! -Sam On Thu, Aug 1, 2013 at 9:42 AM, Loic Dachary wrote: > Hi Sam, > > When the acting set changes order two chunks for the same object may co-exist > in the same placement group. The key should therefore also contain the chunk > number. > > That's probably the most sensible comment I have so far. This document is > immensely useful (even in its current state) because it shows me your > perspective on the implementation. > > I'm puzzled by: > > CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we > retain the deleted object until all replicas have persisted the deletion > event. ErasureCoded backend will therefore need to store objects with the > version at which they were created included in the key provided to the > filestore. Old versions of an object can be pruned when all replicas have > committed up to the log event deleting the object. > > because I don't understand why the version would be necessary. I thought that > deleting an erasure coded object could be even easier than erasing a > replicated object because it cannot be resurrected if enough chunks are lots, > therefore you don't need to wait for ack from all OSDs in the up set. I'm > obviously missing something. > > I failed to understand how important the pg logs were to maintaining the > consistency of the PG. For some reason I thought about them only in terms of > being a light weight version of the operation logs. Adding a payload to the > pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I > would have never thought or dared think the logs could be extended in such a > way. Given the recent problems with logs writes having a high impact on > performances ( I'm referring to what forced you to introduce code to reduce > the amount of logs being written to only those that have been changed instead > of the complete logs ) I thought about the pg logs as something immutable. > > I'm still trying to figure out how PGBackend::perform_write / read / > try_rollback would fit in the current backfilling / write / read / scrubbing > ... code path. > > https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst > https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h > > Cheers > > -- > Loïc Dachary, Artisan Logiciel Libre > All that is necessary for the triumph of evil is that good people do nothing. > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
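To make the versioned-key idea from the proposal concrete, here is a purely illustrative sketch, not the actual PGBackend/filestore encoding: the key handed to the filestore carries the chunk rank and the version at which the object was created, so a "deleted" object survives until every replica has persisted the deletion event, and stale generations can be pruned afterwards. The struct, field names, and key format below are assumptions.

#include <stdio.h>

struct chunk_key {
    const char *oid;                        /* logical object name */
    unsigned chunk_rank;                    /* position expected by the EC library */
    unsigned long long created_at_version;  /* pg log version at create time */
};

/* Render the on-disk key, e.g. "foo.r2.v1842". */
static int format_chunk_key(const struct chunk_key *k, char *buf, size_t len)
{
    return snprintf(buf, len, "%s.r%u.v%llu",
                    k->oid, k->chunk_rank, k->created_at_version);
}

int main(void)
{
    struct chunk_key k = { "foo", 2, 1842 };
    char buf[128];

    format_chunk_key(&k, buf, sizeof(buf));
    /* Older generations (smaller v) can be pruned once all replicas have
     * committed past the log entry that deleted them. */
    printf("%s\n", buf);
    return 0;
}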
PG Backend Proposal
Hi Sam, When the acting set changes order, two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number. That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation. I'm puzzled by: CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object. because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than erasing a replicated object because it cannot be resurrected if enough chunks are lost, therefore you don't need to wait for ack from all OSDs in the up set. I'm obviously missing something. I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only in terms of being a lightweight version of the operation logs. Adding a payload to the pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would have never thought or dared think the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance ( I'm referring to what forced you to introduce code to reduce the amount of logs being written to only those that have been changed instead of the complete logs ) I thought about the pg logs as something immutable. I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path. https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h Cheers -- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing.
Re: still recovery issues with cuttlefish
Second this. Also, regarding the long-lasting snapshot problem and related performance issues, I can say that cuttlefish improved things greatly, but creation/deletion of a large snapshot (hundreds of gigabytes of committed data) can still bring the cluster down for minutes, despite usage of every possible optimization. On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG wrote: > Hi, > > i still have recovery issues with cuttlefish. After the OSD comes back > it seem to hang for around 2-4 minutes and then recovery seems to start > (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I > get a lot of slow request messages an hanging VMs. > > What i noticed today is that if i leave the OSD off as long as ceph > starts to backfill - the recovery and "re" backfilling wents absolutely > smooth without any issues and no slow request messages at all. > > Does anybody have an idea why? > > Greets, > Stefan
Re: [PATCH] Add missing buildrequires for Fedora
Hi, I've opened a pull request with some additional fixes for this issue: https://github.com/ceph/ceph/pull/478 Danny On 30.07.2013 09:53, Erik Logtenberg wrote: > Hi, > > This patch adds two buildrequires to the ceph.spec file, that are needed > to build the rpms under Fedora. Danny Al-Gaaf commented that the > snappy-devel dependency should actually be added to the leveldb-devel > package. I will try to get that fixed too, in the mean time, this patch > does make sure Ceph builds on Fedora. > > Signed-off-by: Erik Logtenberg > ---
Re: [PATCH] mds: remove waiting lock before merging with neighbours
Hi, Did anyone get a chance to look at this change? Any comments/feedback/ridicule would be appreciated. Cheers, David
[PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
From: Sha Zhengju Following we will begin to add memcg dirty page accounting around __set_page_dirty_ {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to avoid exporting those details to filesystems. Signed-off-by: Sha Zhengju --- fs/ceph/addr.c | 13 + 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 3e68ac1..1445bf1 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page) if (unlikely(!mapping)) return !TestSetPageDirty(page); - if (TestSetPageDirty(page)) { + if (!__set_page_dirty_nobuffers(page)) { dout("%p set_page_dirty %p idx %lu -- already dirty\n", mapping->host, page, page->index); return 0; @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page) snapc, snapc->seq, snapc->num_snaps); spin_unlock(&ci->i_ceph_lock); - /* now adjust page */ - spin_lock_irq(&mapping->tree_lock); if (page->mapping) {/* Race with truncate? */ - WARN_ON_ONCE(!PageUptodate(page)); - account_page_dirtied(page, page->mapping); - radix_tree_tag_set(&mapping->page_tree, - page_index(page), PAGECACHE_TAG_DIRTY); - /* * Reference snap context in page->private. Also set * PagePrivate so that we get invalidatepage callback. @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page) undo = 1; } - spin_unlock_irq(&mapping->tree_lock); - if (undo) /* whoops, we failed to dirty the page */ ceph_put_wrbuffer_cap_refs(ci, 1, snapc); - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - BUG_ON(!PageDirty(page)); return 1; } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: LFS & Ceph
Hello, Sorry for the late answer as I was travelling lately. The LFS work has been heavily in progress by Peter (in CC) and others; there is some documentation in this review: https://review.openstack.org/#/c/30051/ (summarized in Pete's gist here https://gist.github.com/portante/5488238/raw/66b0bf2a91a8ca75301fa68dc0fef2d3dc76e5a2/gistfile1.txt), some preliminary work here: https://review.openstack.org/#/c/35381/ and some old documentation here: https://raw.github.com/zaitcev/swift-lfs/master/doc/source/lfs_plugin.rst I am sure Pete would appreciate the feedback. Cheers, Chmouel. On 24/07/2013 17:00, Loic Dachary wrote: > Hi, > > Thanks for take the time to discuss LFS today @ OSCON :-) Would you be so > kind as to send links to the current discussion about the LFS driver API ? > > Cheers > -- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing.
still recovery issues with cuttlefish
Hi, I still have recovery issues with cuttlefish. After the OSD comes back, it seems to hang for around 2-4 minutes and then recovery seems to start (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I get a lot of slow request messages and hanging VMs. What I noticed today is that if I leave the OSD off long enough that ceph starts to backfill, the recovery and "re" backfilling went absolutely smoothly without any issues and no slow request messages at all. Does anybody have an idea why? Greets, Stefan
Re: Re: question about striped_read
On Thu, Aug 1, 2013 at 2:30 PM, majianpeng wrote: >>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng wrote: On Wed, Jul 31, 2013 at 3:32 PM, majianpeng wrote: > > [snip] > Test case > A: touch file > dd if=file of=/dev/null bs=5M count=1 iflag=direct > B: [data(2M)|hole(2m)][data(2M)] >dd if=file of=/dev/null bs=8M count=1 iflag=direct > C: [data(4M)[hole(4M)][hole(4M)][data(2M)] > dd if=file of=/dev/null bs=16M count=1 iflag=direct > D: touch file;truncate -s 5M file > dd if=file of=/dev/null bs=8M count=1 iflag=direct > > Those cases can work. > Now i make different processing for short-read between 'ret > 0' and > "ret =0". > For the short-read which ret > 0, it don't do read-page rather than zero > the left area. > This means reduce one meaningless read operation. > This patch looks good. But I still hope not to duplicate code. how about change "hit_stripe = this_len < left;" to "hit_stripe = this_len < left && (ret == this_len || pos + this_len < inode->i_size);" >>> To make the code easy to understand, i don't apply your suggestion.But i >>> add this check on the judgement of >>> whether read more contents. >>> The follow is the latest patch.Can you check? >>> >>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c >>> index 2ddf061..3d8d14d 100644 >>> --- a/fs/ceph/file.c >>> +++ b/fs/ceph/file.c >>> @@ -349,44 +349,38 @@ more: >>> dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read, >>> ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : >>> ""); >>> >>> - if (ret > 0) { >>> - int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT; >>> - >>> - if (read < pos - off) { >>> - dout(" zero gap %llu to %llu\n", off + read, pos); >>> - ceph_zero_page_vector_range(page_align + read, >>> - pos - off - read, >>> pages); >>> + if (ret >= 0) { >>> + int didpages; >>> + if (was_short && (pos + ret < inode->i_size)) { >>> + u64 tmp = min(this_len - ret, >>> +inode->i_size - pos - ret); >>> + dout(" zero gap %llu to %llu\n", >>> + pos + ret, pos + ret + tmp); >>> + ceph_zero_page_vector_range(page_align + read + ret, >>> + tmp, pages); >>> + ret += tmp; >>> } >>> + >>> + didpages = (page_align + ret) >> PAGE_CACHE_SHIFT; >>> pos += ret; >>> read = pos - off; >>> left -= ret; >>> page_pos += didpages; >>> pages_left -= didpages; >>> >>> - /* hit stripe? */ >>> - if (left && hit_stripe) >>> + /* hit stripe and need continue*/ >>> + if (left && hit_stripe && pos < inode->i_size) >>> goto more; >>> + >>> } >>> >>> - if (was_short) { >>> + if (ret >= 0) { >>> + ret = read; >>> /* did we bounce off eof? */ >>> if (pos + left > inode->i_size) >>> *checkeof = 1; >>> - >>> - /* zero trailing bytes (inside i_size) */ >>> - if (left > 0 && pos < inode->i_size) { >>> - if (pos + left > inode->i_size) >>> - left = inode->i_size - pos; >>> - >>> - dout("zero tail %d\n", left); >>> - ceph_zero_page_vector_range(page_align + read, left, >>> - pages); >>> - read += left; >>> - } >>> } >>> >>> - if (ret >= 0) >>> - ret = read; >> >>I think this line should be "if (read > 0) ret = read;". Other than >>this, your patch looks good. > Because you metioned this, I noticed for ceph_sync_read/write the result are > && the every striped read/write. > That is if we met one error, the total result is error.It can't return > partial result. This behavior is not correct. If we read/write some data, then meet an error, we should return the size we have read/written. I think all other FS behave like this. See generic_file_aio_read() and do_generic_file_read(). 
Regards Yan, Zheng > I think i should write another patch for that. >>You can add "Reviewed-by: Yan, Zheng " to your >>formal patch. > Ok, thanks for your time. > > Thanks! > Jianpeng Ma >> >>Regards >>Yan, Zheng >> >> >>> dout("striped_read returns %d\n", ret); >>> return ret; >>> } >>> >>> Thanks! >>> Jianpeng Ma > Thanks! > Jianpeng Ma >>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng wrote: On Wed, Jul 31, 2013 at 3:32 PM, maj
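A small, generic sketch of the convention Yan describes above (and which do_generic_file_read() follows), not the actual fs/ceph code: once some bytes have been transferred, a later error is dropped in favour of the partial count, and the error only surfaces when nothing was read. The function and callback names are illustrative.

#include <sys/types.h>
#include <stddef.h>

/* Return bytes read so far if an error hits mid-stream; return the
 * error only when no data was transferred at all. */
ssize_t read_in_pieces(ssize_t (*read_piece)(char *buf, size_t len, off_t off),
                       char *buf, size_t len, off_t off)
{
    size_t done = 0;

    while (done < len) {
        ssize_t ret = read_piece(buf + done, len - done, off + done);
        if (ret < 0)
            return done ? (ssize_t)done : ret;  /* partial result wins */
        if (ret == 0)
            break;                              /* EOF */
        done += ret;
    }
    return done;
}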