Re: packaging init systems in a more autotools style way.
On 06/03/2015 06:26 PM, Sage Weil wrote:
> On Wed, 3 Jun 2015, Owen Synge wrote:
>> Dear ceph-devel,
>>
>> Linux has more than one init system. We at SUSE are in the process of upstreaming our spec files, and all our releases are systemd based. Ceph seems more tested with sysvinit upstream.
>>
>> We have 3 basic options for doing this in a packaged upstream system.
>>
>> 1) We don't install init scripts/config as part of "make install" and install all the init components via conditionals in the spec file.
>>
>> 2) We install init scripts/config for all flavours of init using "make install" and delete unwanted init systems via conditionals in the spec file.
>>
>> 3) We add an autotools conditional for each init system, and "make install" only installs the scripts/config for the enabled init systems.
>>
>> <snip>
>>
>> There are many ways to follow policy (3), so I would propose that when no init system is selected, policy (1) and policy (3) should appear identical.
>
> Let's do it!

Great :)

> <snip>
>
> I'm hoping that phase 3 can be avoided entirely. The upgrade/conversion path (at least for upstream packages) will be firefly -> infernalis; I don't think it will be that useful to build infernalis packages that use sysvinit for systemd distros. (Maybe this situation gets more complicated if we backport this transition to hammer or downstream does the same, but even then the transition will be an upgrade one.)

Agreed.

> <snip>
>
> Also, I think we should do 1 and 2 basically at the same time. I don't think it's worth spending any effort trying to make things behave with just 1 (and not 2). Am I talking sense? I can never tell with this stuff. :)
>
> sage

I think you speak sense. If I understand right, you favor the user interface as:

  --with-init=systemd
  --with-init=sysv
  --with-init=upstart
  --with-init=bsd

This is wiser when you start adding up all the possible init systems that can exist.
Owen

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash with object store set to newstore
Hi Sage,

I saw the crash again. Here is the output after adding the debug message from wip-newstore-debuglist:

  -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is --.7fff..!!!.

Here is the id of the file I posted:

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth

On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi srikanth.madugu...@gmail.com wrote:
> Hi Sage,
>
> Unfortunately I purged the cluster yesterday and restarted the backfill tool. I did not see the osd crash yet on the cluster. I am monitoring the OSDs and will update you once I see the crash.
>
> With the new backfill run I have reduced the rps by half; not sure if this is the reason for not seeing the crash yet.
>
> Regards
> Srikanth
>
> On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
>> I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.)
>>
>> Thanks!
>> sage
>>
>> On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
>>> Hi Sage,
>>>
>>> The assertion failed at line 1639. Here is the log message:
>>>
>>> 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
>>> os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)
>>>
>>> Just before the crash, here are the debug statements printed by the method (collection_list_partial):
>>>
>>> 2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
>>> 2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0
>>>
>>> Regards
>>> Srikanth
>>>
>>> On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
>>>> On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
>>>>> Hi Sage and all,
>>>>>
>>>>> I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.
>>>>>
>>>>> ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
>>>>> 1: /usr/bin/ceph-osd() [0xb84652]
>>>>> 2: (()+0xf130) [0x7f915f84f130]
>>>>> 3: (gsignal()+0x39) [0x7f915e2695c9]
>>>>> 4: (abort()+0x148) [0x7f915e26acd8]
>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
>>>>> 6: (()+0x5e946) [0x7f915eb6b946]
>>>>> 7: (()+0x5e973) [0x7f915eb6b973]
>>>>> 8: (()+0x5eb9f) [0x7f915eb6bb9f]
>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
>>>>> 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
>>>>> 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
>>>>> 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
>>>>> 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
>>>>> 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
>>>>> 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
>>>>> 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
>>>>> 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
>>>>> 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
>>>>> 19: (()+0x7df3) [0x7f915f847df3]
>>>>> 20: (clone()+0x6d) [0x7f915e32a01d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is the crash seems to happen when two OSDs try to communicate and replicate a particular PG.
>>>>
>>>> Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed in what function and at what line number. I haven't seen this crash so I'm not sure offhand what it is.
>>>>
>>>> Thanks!
>>>> sage
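The failing assertion checks that each key returned by the key/value iterator falls inside the half-open key range computed for the collection. A minimal Python sketch of that invariant (the function name and sample keys are illustrative, not NewStore's actual iterator code; the key strings are taken from the debug output above):

```python
def check_in_range(k: str, start_key: str, end_key: str) -> bool:
    """Half-open range check, as in the NewStore assert:
    the enumerated key must satisfy start_key <= k < end_key
    (lexicographic byte comparison)."""
    return start_key <= k < end_key

# The debug log above shows a range like "--.7fb4.." to "--.7fb4.0800.";
# a key such as "--.7fff..", which sorts outside it, trips the assert.
assert check_in_range("--.7fb4.0123.", "--.7fb4..", "--.7fb4.0800.")
assert not check_in_range("--.7fff..", "--.7fb4..", "--.7fb4.0800.")
```

The crash report is consistent with this: `start is -1/0//0/0 ... k is --.7fff..` shows an iterator key landing outside the computed bounds.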
Re: packaging init systems in a more autotools style way.
On 06/03/2015 03:38 PM, Sage Weil wrote:
> On Wed, 3 Jun 2015, Ken Dreyer wrote:
>> On 06/03/2015 02:45 PM, Sage Weil wrote:
>>> Sounds good to me. It could (should?) even error out if no init system is specified? Otherwise someone will likely be in for a surprise.
>>
>> I was picturing that we'd just autodetect based on OS version (eg Ubuntu 15.04 should default to --with-init=systemd). It's one less thing to get wrong during the build process. What do you think?
>
> ./configure ... --with-init=`src/ceph-detect-init` ?
>
> sage

I should have been clearer. I was thinking that we'd call that detect-init script inside ./configure itself, unless the user specifies --with-init=foo.

- Ken
Re: packaging init systems in a more autotools style way.
On 06/03/2015 03:38 PM, Gregory Farnum wrote:
> We could maybe autodetect if they don't specify one?

Sorry, yes, that's what I meant; my last email was unclear.

- Ken
Re: Discuss: New default recovery config settings
On Mon, 1 Jun 2015, Gregory Farnum wrote:
> On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz pvonstamw...@us.fujitsu.com wrote:
>> On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
>>> On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
>>>> Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:
>>>>
>>>>   osd_max_backfills to 1 (from 10)
>>>>   osd_recovery_max_active to 3 (from 15)
>>>>   osd_recovery_op_priority to 1 (from 10)
>>>>   osd_recovery_max_single_start to 1 (from 5)
>>>
>>> I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
>>> -Greg
>>
>> Greg,
>> When we set...
>>
>>   osd recovery max active = 1
>>   osd max backfills = 1
>>
>> we see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO expecting recovery time to get worse, but we got a 2-for-1. This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start at their defaults.
>>
>> We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (Higher osd_max_backfills helps recovery time with erasure coding.)
>
> Well, recovery max active and max backfills are similar in many ways. Both are about moving data into a new or outdated copy of the PG -- the difference is that recovery refers to our log-based recovery (where we compare the PG logs and move over the objects which have changed) whereas backfill requires us to incrementally move through the entire PG's hash space and compare. I suspect dropping down max backfills is more important than reducing max recovery (gathering recovery metadata happens largely in memory), but I don't really know either way. My comment was meant to convey that I'd prefer we not reduce the recovery op priority levels. :)

We could make a less extreme move than to 1, but IMO we have to reduce it one way or another. Every major operator I've talked to does this, our PS folks have been recommending it for years, and I've yet to see a single complaint about recovery times... meanwhile we're drowning in a sea of complaints about the impact on clients.

How about

  osd_max_backfills to 1 (from 10)
  osd_recovery_max_active to 3 (from 15)
  osd_recovery_op_priority to 3 (from 10)
  osd_recovery_max_single_start to 1 (from 5)

(same as above, but 1/3rd the recovery op prio instead of 1/10th)?

sage
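For reference, Sage's compromise proposal above would look roughly like this in ceph.conf (illustrative only -- these were proposed defaults under discussion at the time, not shipped values):

```ini
[osd]
; throttle recovery/backfill so client I/O keeps priority
osd max backfills = 1                 ; was 10
osd recovery max active = 3           ; was 15
osd recovery op priority = 3          ; was 10 (1/3rd, per the compromise)
osd recovery max single start = 1     ; was 5
```

Operators could also apply these at runtime with `ceph tell osd.* injectargs` to test the impact before changing the config file.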
Re: packaging init systems in a more autotools style way.
On Wed, 3 Jun 2015, Ken Dreyer wrote:
> On 06/03/2015 02:45 PM, Sage Weil wrote:
>> Sounds good to me. It could (should?) even error out if no init system is specified? Otherwise someone will likely be in for a surprise.
>
> I was picturing that we'd just autodetect based on OS version (eg Ubuntu 15.04 should default to --with-init=systemd). It's one less thing to get wrong during the build process. What do you think?

./configure ... --with-init=`src/ceph-detect-init` ?

sage
Re: packaging init systems in a more autotools style way.
On Wed, Jun 3, 2015 at 2:36 PM, Ken Dreyer kdre...@redhat.com wrote:
> On 06/03/2015 02:45 PM, Sage Weil wrote:
>> Sounds good to me. It could (should?) even error out if no init system is specified? Otherwise someone will likely be in for a surprise.
>
> I was picturing that we'd just autodetect based on OS version (eg Ubuntu 15.04 should default to --with-init=systemd). It's one less thing to get wrong during the build process. What do you think?

Debian users will get very angry at us. ;) We could maybe autodetect if they don't specify one?
-Greg
Re: packaging init systems in a more autotools style way.
On Wed, 3 Jun 2015, Ken Dreyer wrote:
> On 06/03/2015 03:38 PM, Sage Weil wrote:
>> On Wed, 3 Jun 2015, Ken Dreyer wrote:
>>> On 06/03/2015 02:45 PM, Sage Weil wrote:
>>>> Sounds good to me. It could (should?) even error out if no init system is specified? Otherwise someone will likely be in for a surprise.
>>>
>>> I was picturing that we'd just autodetect based on OS version (eg Ubuntu 15.04 should default to --with-init=systemd). It's one less thing to get wrong during the build process. What do you think?
>>
>> ./configure ... --with-init=`src/ceph-detect-init` ?
>>
>> sage
>
> I should have been clearer. I was thinking that we'd call that detect-init script inside ./configure itself, unless the user specifies --with-init=foo.

Works for me, as long as there is only 1 piece of code (ceph-detect-init) that does the detection!

sage
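The detection logic being discussed could look roughly like this. This is only a sketch: the real ceph-detect-init tool maintains its own platform tables, and the distro-to-init mapping below is a simplified assumption for illustration:

```python
def detect_init(os_release: str) -> str:
    """Guess the init system from /etc/os-release contents.

    Simplified illustrative mapping: Fedora/SUSE get systemd,
    Ubuntu >= 15.04 gets systemd while older Ubuntu gets upstart,
    and everything else falls back to sysvinit.
    """
    fields = {}
    for line in os_release.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value.strip('"')

    distro = fields.get("ID", "")
    version = fields.get("VERSION_ID", "")

    if distro in ("fedora", "opensuse", "sles"):
        return "systemd"
    if distro == "ubuntu":
        ver = tuple(int(x) for x in version.split(".")) if version else (0,)
        return "systemd" if ver >= (15, 4) else "upstart"
    return "sysvinit"

print(detect_init('ID=ubuntu\nVERSION_ID="15.04"'))  # -> systemd
```

Keeping this in a single script, as Sage suggests, means the spec file, ./configure, and any packaging tooling all agree on the answer.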
Re: packaging init systems in a more autotools style way.
On 06/03/2015 02:45 PM, Sage Weil wrote:
> Sounds good to me. It could (should?) even error out if no init system is specified? Otherwise someone will likely be in for a surprise.

I was picturing that we'd just autodetect based on OS version (eg Ubuntu 15.04 should default to --with-init=systemd). It's one less thing to get wrong during the build process. What do you think?

- Ken
Re: RBD journal draft design
On Wed, Jun 3, 2015 at 9:13 AM, Jason Dillaman dilla...@redhat.com wrote:
>>> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal.
>>
>> Am I misremembering what actually got done with our journal v2 format? I think this is done — or at least we made a move in this direction.
>
> Assuming journal v2 is the code in osdc/Journaler.cc, there is a new resilient format that helps in detecting corruption, but it appears to be still largely based upon offsets and using the Filer/Striper for I/O. This does remind me that I probably want to include a magic preamble value at the start of each journal entry to facilitate recovery.

Ah yeah, I was confusing the changes we did there and in our MDLog wrapper bits. Ignore me on this bit.

>>> A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case.
>>
>> The object class is going to be a big barrier to using EC pools; unless you want to block the use of EC pools on EC pools supporting object classes. :(
>
> Josh mentioned (via Sam) that reads were not currently supported by object classes on EC pools. Are appends not supported either?

We discussed this briefly and certain object class functions might work by mistake on EC pools, but you should assume nothing does (is my recollection of the conclusions). For instance, even if it's technically possible, the append thing is really hard for this sort of write; I think I mentioned in Josh's thread about needing to have an entire stripe at a time (and the smallest you could even think about doing reasonably is 4KB * N, and really that's not big enough given metadata overheads).

>>> A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.
>>
>> I'm confused. How does this work with the splay count thing you mentioned above? Can you define splay count?
>
> Similar to the stripe width.

Okay, that sort of makes sense, but I don't see how you could legally be writing to different sets, so why not just make it an explicit striping thing and move all journal entries for that set at once?

...Actually, doesn't *not* forcing a coordinated move from one object set to another mean that you don't actually have an ordering guarantee across tags if you replay the journal objects in order?

>> What happens if users submit sequenced entries substantially out of order? It sounds like if you have multiple writers (or even just a misbehaving client) it would not be hard for one of them to grab sequence value N, for another to fill up one of the journal entry objects with sequences in the range [N+1]...[N+x], and then for the user of N to get an error response.
>
> I was thinking that when a client submits their journal entry payload, the journaler will allocate the next available sequence number, compute which active journal object that sequence should be submitted to, and start an AIO append op to write the journal entry. The next journal entry to be appended to the same journal object would be splay count/width entries later. This does bring up a good point that if you are generating journal entries fast enough, the delayed response saying the object is full could cause multiple later journal entry ops to need to be resent to the new (non-full) object. Given that, it might be best to scrap the hard error when the journal object gets full and just let the journaler eventually switch to a new object when it receives a response saying the object is now full.

I was misunderstanding where the seqs came from and that they were associated with the tag, not the journal. So this shouldn't be such a problem.

>>> Since the journal is designed to be append-only, there needs to be support for cases where a journal entry needs to be updated out-of-band (e.g. fixing a corrupt entry similar to CephFS's current journal recovery tools). The proposed solution is to just append a new journal entry with the same sequence number as the record to be replaced to the end of the journal (i.e. last entry for a given sequence number wins). This also protects against accidental replays of the original append operation. An alternative suggestion would be to use a compare-and-swap mechanism to update the full journal object with the updated contents.

I'm confused by this bit. It seems to imply that fetching a single entry requires checking the
Re: Discuss: New default recovery config settings
On Wed, Jun 3, 2015 at 3:44 PM, Sage Weil s...@newdream.net wrote:
> On Mon, 1 Jun 2015, Gregory Farnum wrote:
>> On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz pvonstamw...@us.fujitsu.com wrote:
>>> On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:
>>>> On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:
>>>>> Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:
>>>>>
>>>>>   osd_max_backfills to 1 (from 10)
>>>>>   osd_recovery_max_active to 3 (from 15)
>>>>>   osd_recovery_op_priority to 1 (from 10)
>>>>>   osd_recovery_max_single_start to 1 (from 5)
>>>>
>>>> I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
>>>> -Greg
>>>
>>> Greg,
>>> When we set...
>>>
>>>   osd recovery max active = 1
>>>   osd max backfills = 1
>>>
>>> we see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO expecting recovery time to get worse, but we got a 2-for-1. This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start at their defaults.
>>>
>>> We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (Higher osd_max_backfills helps recovery time with erasure coding.)
>>
>> Well, recovery max active and max backfills are similar in many ways. Both are about moving data into a new or outdated copy of the PG -- the difference is that recovery refers to our log-based recovery (where we compare the PG logs and move over the objects which have changed) whereas backfill requires us to incrementally move through the entire PG's hash space and compare. I suspect dropping down max backfills is more important than reducing max recovery (gathering recovery metadata happens largely in memory), but I don't really know either way. My comment was meant to convey that I'd prefer we not reduce the recovery op priority levels. :)
>
> We could make a less extreme move than to 1, but IMO we have to reduce it one way or another. Every major operator I've talked to does this, our PS folks have been recommending it for years, and I've yet to see a single complaint about recovery times... meanwhile we're drowning in a sea of complaints about the impact on clients.
>
> How about
>
>   osd_max_backfills to 1 (from 10)
>   osd_recovery_max_active to 3 (from 15)
>   osd_recovery_op_priority to 3 (from 10)
>   osd_recovery_max_single_start to 1 (from 5)
>
> (same as above, but 1/3rd the recovery op prio instead of 1/10th)?

Do we actually have numbers for these changes individually? We might, but I have a suspicion that at some point there was just a "well, you could turn them all down" comment and that state was preferred to our defaults. I mean, I have no real knowledge of how changing the op priority impacts things, but I don't think many (any?) other people do either, so I'd rather mutate slowly and see if that works better. :) Especially given Paul's comment that just the recovery_max and max_backfills values made a huge positive difference without any change to priorities.
-Greg
Re: RBD journal draft design
On 02/06/2015 16:11, Jason Dillaman wrote:
> I am posting to get wider review/feedback on this draft design. In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd. The implementation is designed to carry opaque journal entry payloads so it will be possible to re-use it in other applications as well in the future. It will also use the librados API for all operations. At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> ...
>
> A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case. A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

Can you clarify the procedure when a client write gets an "I'm full" return code from a journal object? The key part I'm not clear on is whether the client will first update the header to add an object to the active set (and then write it), or whether it goes ahead and writes objects and then lazily updates the header.

* If it's object first, header later, what bounds how far ahead of the active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of latency whenever we cross an object bound.

Nothing intractable about mitigating either case, just wondering what the idea is in this design.

> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal. Additionally, a given journal entry will not be striped across multiple journal objects. Journal entries will be mapped to journal objects using the sequence number: sequence number mod splay count == object number mod splay count for active journal objects. The rationale for this difference is to facilitate parallelism for appends as journal entries will be splayed across a configurable number of journal objects.
>
> The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhancement to delay appends to the journal in order to satisfy EC-pool alignment requirements).

When two clients are both doing splayed writes, and they both send writes in parallel, it seems like the per-object fullness check via the object class could result in the writes getting staggered across different objects. E.g. if we have two objects that both only have one slot left, then A could end up taking the slot in one (call it 1) and B could end up taking the slot in the other (call it 2). Then when B's write lands at object 1, it gets an "I'm full" response and has to send the entry... where? I guess to some arbitrarily-higher-numbered journal object depending on how many other writes landed in the meantime.

This potentially leads to the stripes (splays?) of a given journal entry being separated arbitrarily far across different journal objects, which would be fine as long as everything was well formed, but will make detecting issues during replay harder (we would have to remember partially-read entries when looking for their remaining stripes through the rest of the journal).

You could apply the object class behaviour only to the object containing the 0th splay, but then you'd have to wait for the write there to complete before writing to the rest of the splays, so the latency benefit would go away. Or it's equally possible that there's a trick in the design that has gone over my head :-)

Cheers,
John
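The sequence-to-object mapping the draft describes (sequence number mod splay count selects which active journal object an entry lands on) can be sketched as follows. The "active set" index is my own simplifying assumption for numbering successive groups of journal objects; the draft text only states the mod rule:

```python
def journal_object_for(seq: int, splay_count: int, active_set: int) -> int:
    """Map a journal entry sequence number to a journal object number.

    The draft's rule: seq mod splay_count == object number mod splay_count
    for the active journal objects. With a hypothetical active-set index,
    the absolute object number becomes:
    """
    return active_set * splay_count + (seq % splay_count)

# With splay_count=4 and the first active set, sequences 0..7 land on:
print([journal_object_for(s, 4, 0) for s in range(8)])
# -> [0, 1, 2, 3, 0, 1, 2, 3]
```

This makes John's staggering concern concrete: consecutive sequence numbers go to different objects in parallel, so per-object fullness responses can arrive out of order relative to sequence allocation.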
Teuthology error 'exception on parallel execution'
Hi all, Several teuthology jobs fail with the error 'exception on parallel execution' in my testing. I don't see any failures/errors/assertions in the mon/osd logs. And these failed jobs always have 'sentry event' log. However, I'm not able to open the sentry link. Here is an example: http://pulpito.ceph.com/yuan-2015-06-02_20:18:44-rados-wip-proxy-write---basic-multi/918584/ Log: http://qa-proxy.ceph.com/teuthology/yuan-2015-06-02_20:18:44-rados-wip-proxy-write---basic-multi/918584/teuthology.log Sentry event: http://sentry.ceph.com/sepia/teuthology/search?q=38a00a22270748e79e8f0dd69a624ab4 2015-06-02T20:27:18.149 ERROR:teuthology.parallel:Exception in parallel execution Traceback (most recent call last): File /home/teuthworker/src/teuthology_master/teuthology/parallel.py, line 82, in __exit__ for result in self: File /home/teuthworker/src/teuthology_master/teuthology/parallel.py, line 101, in next resurrect_traceback(result) File /home/teuthworker/src/teuthology_master/teuthology/parallel.py, line 19, in capture_traceback return func(*args, **kwargs) File /var/lib/teuthworker/src/ceph-qa-suite_master/tasks/workunit.py, line 361, in _run_tests label=workunit test {workunit}.format(workunit=workunit) File /home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py, line 156, in run r = self._runner(client=self.ssh, name=self.shortname, **kwargs) File /home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py, line 378, in run r.wait() File /home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py, line 114, in wait label=self.label) CommandFailedError: Command failed (workunit test cephtool/test.sh) on plana52 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5a52869dd23ae1fbdf3e80d54e855560e467ff6 TESTDIR=/home/ubuntu/cephtest CEPH_ID=0 PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 
3h /home/ubuntu/cephtest/workunit.client.0/cephtool/test.sh' 2015-06-02T20:27:18.150 ERROR:teuthology.run_tasks:Saw exception from tasks. Is this something wrong with the teuthology itself? Anyone has experience with this? Any hints are welcomed~ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
06/03/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual! Discussion topics include: cache tiering update. Please feel free to add your own!

Here are the links:

Etherpad URL: http://pad.ceph.com/p/performance_weekly
To join the Meeting: https://bluejeans.com/268261044
To join via Browser: https://bluejeans.com/268261044/browser
To join with Lync: https://bluejeans.com/268261044/lync
To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152, Meeting ID: 268261044
To join via Phone: 1) Dial: +1 408 740 7256, +1 888 240 2560 (US Toll Free), +1 408 317 9253 (Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044

Mark
RE: 'Racing read got wrong version' during proxy write testing
I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect that the op is a 'copy get' op (looping over the ops vector doesn't seem smart?) and copy the reqids in this case. -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write. It is like this:
- Client sends 3 writes and a read on the same object to the base tier
- Set up cache tiering
- Client retries ops and sends the 3 writes and 1 read to the cache tier
- The 3 writes finish on the base tier, say with versions v1, v2 and v3
- Cache tier proxies the 1st write, and starts to promote the object for the 2nd write; the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4, and returns to the cache tier. But somehow the cache tier fails to send the reply due to socket failure injection
- Client retries the writes and the read again; the writes are identified as dup ops
- The promotion finishes; it copies the pg_log entries from the base tier and puts them in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write
- The writes dispatch after the promotion; they are identified as completed dup ops. Cache tier replies to these write ops with the versions from the base tier (v1, v2 and v3)
- Finally, the read dispatches; it reads the version of the proxied write (v4) and replies to the client
- Client complains that 'racing read got wrong version'
In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to the cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? sage
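To illustrate why the proxied write must carry the client's original reqid, here is a toy model (not Ceph's actual pg_log code; names are invented for illustration) of an OSD-side dup check keyed on the (reqid, retry) pair:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Toy model of dup detection: completed ops are remembered by
// (reqid, retry), and a retry of a completed op is answered with the
// version assigned the first time, not a fresh one.
using ReqId = std::pair<uint64_t, uint32_t>;  // stand-in for (client id, tid)

struct DupCheck {
  std::map<ReqId, uint64_t> completed;  // reqid -> version assigned

  // Apply an op: if the reqid was already completed, reply with the
  // recorded version (dup); otherwise record the new version.
  uint64_t apply(const ReqId& id, uint64_t new_version) {
    auto it = completed.find(id);
    if (it != completed.end())
      return it->second;
    completed[id] = new_version;
    return new_version;
  }
};
```

If the cache tier proxies the write under a fresh reqid of its own, the base tier records that id instead of the client's, so the client's retry misses the dup check and observes a different version -- the mismatch described above.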
packaging init systems in a more autotools style way.
Dear ceph-devel, Linux has more than one init system. We in SUSE are in the process of upstreaming our spec files, and all our releases are systemd based. Ceph seems more tested with sysVinit upstream. We have 3 basic options for doing this in a packaged upstream system:
1) We don't install init scripts/config as part of make install, and install all the init components via conditionals in the spec file.
2) We install all init scripts/config for all flavours of init using make install, and delete unwanted init systems via conditionals in the spec file.
3) We add an autotools conditional for each init system, and only install the enabled init systems' scripts/config with make install.
-- We are currently following policy (1). I propose we follow policy (3), because (1) requires many distribution-specific conditionals and duplication for each platform for all files not installed with make install. - There are many ways to follow policy (3), so I would propose that when no init system is selected, policy (1) and policy (3) should appear identical. - For a transition period from policy (1) to policy (3):
phase (1) I would expect we would add a conditional to ceph.spec for SUSE to add to the configure step: --with-init-systemd And when other distributions want to move to a full systemd flavour they also add a similar conditional.
phase (2) We add a new configure level conditional: --with-init-sysv All sysV init installs are removed from the spec file and added to the make install process.
phase (3) Distributions with more than one init system, or init systems that can emulate sysVinit, can build packages with either init system, so migration can be tested.
- Does anyone object to this plan? Does anyone agree with this plan? Does anyone see difficulties with the plan?
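As a sketch of what option (3) could look like at the autotools level (flag and conditional names here are illustrative, not a settled interface):

```
dnl configure.ac: one conditional per init system
AC_ARG_WITH([init-systemd],
  [AS_HELP_STRING([--with-init-systemd], [install systemd unit files])],
  [], [with_init_systemd=no])
AM_CONDITIONAL([WITH_INIT_SYSTEMD], [test "x$with_init_systemd" = "xyes"])

AC_ARG_WITH([init-sysv],
  [AS_HELP_STRING([--with-init-sysv], [install sysVinit scripts])],
  [], [with_init_sysv=no])
AM_CONDITIONAL([WITH_INIT_SYSV], [test "x$with_init_sysv" = "xyes"])

# Makefile.am: only enabled init systems are installed by make install
if WITH_INIT_SYSTEMD
systemdsystemunitdir = $(libdir)/systemd/system
systemdsystemunit_DATA = systemd/ceph-osd@.service
endif
if WITH_INIT_SYSV
initddir = $(sysconfdir)/init.d
initd_SCRIPTS = src/init-ceph
endif
```

With neither flag given, make install installs no init files at all, which is the "policy (1) and policy (3) appear identical" property above.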
Best regards, Owen -- SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Maxfeldstraße 5 90409 Nürnberg Germany
Re: RBD journal draft design
In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal. Am I misremembering what actually got done with our journal v2 format? I think this is done — or at least we made a move in this direction. Assuming journal v2 is the code in osdc/Journaler.cc, there is a new resilient format that helps in detecting corruption, but it appears to still be largely based upon offsets, using the Filer/Striper for I/O. This does remind me that I probably want to include a magic preamble value at the start of each journal entry to facilitate recovery. A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case. The object class is going to be a big barrier to using EC pools; unless you want to block the use of EC pools on EC pools supporting object classes. :( Josh mentioned (via Sam) that reads were not currently supported by object classes on EC pools. Are appends not supported either? A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received. I'm confused. How does this work with the splay count thing you mentioned above? Can you define splay count? Similar to the stripe width. What happens if users submit sequenced entries substantially out of order?
It sounds like if you have multiple writers (or even just a misbehaving client) it would not be hard for one of them to grab sequence value N, for another to fill up one of the journal entry objects with sequences in the range [N+1]...[N+x], and then for the user of N to get an error response. I was thinking that when a client submits their journal entry payload, the journaler will allocate the next available sequence number, compute which active journal object that sequence should be submitted to, and start an AIO append op to write the journal entry. The next journal entry to be appended to the same journal object would be splay count/width entries later. This does bring up a good point that if you are generating journal entries fast enough, the delayed response saying the object is full could cause multiple later journal entry ops to need to be resent to the new (non-full) object. Given that, it might be best to scrap the hard error when the journal object gets full and just let the journaler eventually switch to a new object when it receives a response saying the object is now full. Since the journal is designed to be append-only, there needs to be support for cases where a journal entry needs to be updated out-of-band (e.g. fixing a corrupt entry, similar to CephFS's current journal recovery tools). The proposed solution is to just append a new journal entry, with the same sequence number as the record to be replaced, to the end of the journal (i.e. last entry for a given sequence number wins). This also protects against accidental replays of the original append operation. An alternative suggestion would be to use a compare-and-swap mechanism to update the full journal object with the updated contents. I'm confused by this bit. It seems to imply that fetching a single entry requires checking the entire object to make sure there's no replacement.
Certainly if we were doing replay we couldn't just apply each entry sequentially any more, because an overwritten entry might have its value replaced by a later (by sequence number) entry that occurs earlier (by offset) in the journal. The goal would be to use prefetching on the replay. Since the whole object is already in-memory, scanning for duplicates would be fairly trivial. If there is a way to prevent the OSDs from potentially replaying a duplicate append journal entry message, the CAS update technique could be used. I'd also like it if we could organize a single Journal implementation within the Ceph project, or at least have a blessed one going forward that we use for new stuff and might plausibly migrate existing users to. The big things I see different from osdc/Journaler are: Agreed. While librbd will be the first user of this, I wasn't planning to locate it within the librbd library.
1) (design) class-based
2) (design) uses librados instead of Objecter (hurray)
3) (need) should allow multiple writers
4) (fallout of other choices?) does not stripe entries across multiple objects
For striping, I assume this is a function of how large MDS
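The "last entry for a given sequence number wins" replay rule discussed above could be sketched as follows (illustrative only; names and types are not from the actual design):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Replay cannot apply entries in object offset order, because an entry
// may be superseded by a later append carrying the same sequence
// number. Scan the whole fetched object first, letting the last
// occurrence of each sequence number win, then apply in sequence order.
struct Entry {
  uint64_t seq;
  std::string payload;
};

std::map<uint64_t, std::string> resolve(const std::vector<Entry>& scanned) {
  std::map<uint64_t, std::string> latest;  // seq -> winning payload
  for (const Entry& e : scanned)
    latest[e.seq] = e.payload;  // a later offset overwrites an earlier one
  return latest;
}
```

Since the whole object is fetched into memory for prefetching anyway, this scan adds little cost, which is the point made above.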
Re: RBD journal draft design
A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case. A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received. Can you clarify the procedure when a client write gets an 'I'm full' return code from a journal object? The key part I'm not clear on is whether the client will first update the header to add an object to the active set (and then write it) or whether it goes ahead and writes objects and then lazily updates the header.
* If it's object first, header later, what bounds how far ahead of the active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of latency whenever we cross an object bound
Nothing intractable about mitigating either case, just wondering what the idea is in this design. I was thinking object first, header later. As I mentioned in my response to Greg, I now think this 'I'm full' response should only be used as a guide to kick future (un-submitted) requests over to a new journal object. For example, if you submitted 16 4K AIO journal entry append requests, it's possible that the first request filled the journal -- so now your soft max size will include an extra 15 4K journal entries before the response to the first request indicates that the journal object is full and future requests should use a new journal object.
The rationale for this difference is to facilitate parallelism for appends, as journal entries will be splayed across a configurable number of journal objects. The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhancement to delay appends to the journal in order to satisfy EC-pool alignment requirements). When two clients are both doing splayed writes, and they both send writes in parallel, it seems like the per-object fullness check via the object class could result in the writes getting staggered across different objects. E.g. if we have two objects that both only have one slot left, then A could end up taking the slot in one (call it 1) and B could end up taking the slot in the other (call it 2). Then when B's write lands at object 1, it gets an 'I'm full' response and has to send the entry... where? I guess to some arbitrarily-higher-numbered journal object depending on how many other writes landed in the meantime. In this case, assuming B sent the request to journal object 0, it would send the re-request to journal object 0 + splay width, since the request sequence number mod splay width must equal the object number mod splay width. However, at this point I think it would be better to eliminate the 'I'm full' error code and stick with the extra soft max object size. This potentially leads to the stripes (splays?) of a given journal entry being separated arbitrarily far across different journal objects, which would be fine as long as everything was well formed, but will make detecting issues during replay harder (would have to remember partially-read entries when looking for their remaining stripes through the rest of the journal).
You could apply the object class behaviour only to the object containing the 0th splay, but then you'd have to wait for the write there to complete before writing to the rest of the splays, so the latency benefit would go away. Or it's equally possible that there's a trick in the design that has gone over my head :-) I'm probably missing something here. A journal entry won't be partially striped across multiple journal objects. The journal entry in its entirety would be written to one of the splay width active journal objects.
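To make the splay arithmetic above concrete, here is a small sketch (hypothetical names; the real journaler code may differ) of the invariant that the request sequence number mod splay width equals the object number mod splay width, and of the soft-max "kick future appends forward" behaviour:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: entries are splayed across 'splay_width' concurrently active
// journal objects. An entry with sequence number seq always lands on an
// object where (object % splay_width) == (seq % splay_width), so when an
// object fills up its slot advances by exactly splay_width.
struct Journaler {
  uint64_t splay_width;
  std::vector<uint64_t> active;  // active object per splay slot

  explicit Journaler(uint64_t width) : splay_width(width) {
    for (uint64_t i = 0; i < width; ++i)
      active.push_back(i);  // initial active set: objects 0..width-1
  }

  // Object targeted by a new append for this sequence number.
  uint64_t object_for(uint64_t seq) const {
    return active[seq % splay_width];
  }

  // Called when an append reply reports the object passed its soft max
  // size: future (un-submitted) appends for that slot move forward.
  void on_object_full(uint64_t object) {
    active[object % splay_width] = object + splay_width;
  }
};
```

Note the append itself is never rejected under the soft-max scheme; in-flight entries may overshoot the limit slightly, and only future entries are redirected.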
Re: packaging init systems in a more autotools style way.
On Wed, 3 Jun 2015, Owen Synge wrote: Dear ceph-devel, Linux has more than one init systems. We in SUSE are in the process of up streaming our spec files, and all our releases are systemd based. Ceph seems more tested with sysVinit upstream. We have 3 basic options for doing this in a packaged upstream system. 1) We dont install init scripts/config as part of make install and install all the init components via conditionals in the spec file. 2) We install all init scripts/config for all flavours of init using make install and delete unwanted init systems via conditionals in the spec file. 3) We add autotools an conditional for each init system, and only install with make install enabled init systems scripts/config. -- We are currently following policy (1) I propose we follow policy (3) because (1) makes many distribution specific conditionals and requires duplication for each platform for all files not installed with make install. - Their are many ways to follow policy 3 so I would propose that when no init system is followed, policy (1) and policy (3) should appear identical. - Let's do it! For a transition period between following policy (1) to policy (3) phase (1) I would expect we would add a conditional to ceph.spec for suse to add to the configure step: --with-init-systemd And when other distributions want to move to a full systemd flavour they also add a similar conditional. phase (2) We add a new configure level conditional: --with-init-sysv All sysV init installs are removed from from the spec file and added to the make install process. phase (3) Distributions with more than one init system, or init systems that can emulate sysVinit, can build packages with either init system and so migration can be tested. - Does anyone object to this plan? Does anyone agree with this plan? Does anyone see difficulties with the plan? I'm hoping that phase 3 can be avoided entirely. 
The upgrade/conversion path (at least for upstream packages) will be firefly -> infernalis; I don't think it will be that useful to build infernalis packages that do sysvinit for systemd distros. (Maybe this situation gets more complicated if we backport this transition to hammer or downstream does the same, but even then the transition will be an upgrade one.) Ken and I talked a bit about this yesterday and he convinced me that catering to multiple init systems w/in a single distro (e.g., by letting sysvinit and systemd files coexist) is not worth our time. This has the nice benefit of letting us sunset the /var/lib/ceph/*/*/{sysvinit,upstart,systemd} files. Also, I think we should do 1 and 2 basically at the same time. I don't think it's worth spending any effort trying to make things behave with just 1 (and not 2). Am I talking sense? I can never tell with this stuff. :) sage
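With a configure-level switch in place, the per-distro part of ceph.spec reduces to choosing a flag. A sketch (the macro and flag names here are illustrative, not the eventual implementation):

```
# ceph.spec: pick the init system per distro at the configure step
%if 0%{?suse_version}
%define init_flags --with-init-systemd
%else
%define init_flags --with-init-sysv
%endif

%build
./autogen.sh
%{configure} %{?init_flags}
```

Everything else (file lists, scriptlets) then keys off the same choice, instead of each distro carrying its own install/delete logic.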
RE: 'Racing read got wrong version' during proxy write testing
Making the 'copy get' op to be a cache op seems like a good idea. -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, June 4, 2015 9:14 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: RE: 'Racing read got wrong version' during proxy write testing On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect the op is a 'copy get' op (looping the ops vector doesn't seem smart?) and copy the reqids in this case. Hmm. I think the idea here is/was that the ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true? Sam, what do you think? sage -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write.
It is like this: - Client sends 3 writes and a read on the same object to base tier - Set up cache tiering - Client retries ops and sends the 3 writes and 1 read to the cache tier - The 3 writes finished on the base tier, say with versions v1, v2 and v3 - Cache tier proxies the 1st write, and start to promote the object for the 2nd write, the 2nd and 3rd writes and the read are blocked - The proxied 1st write finishes on the base tier with version v4, and returns to cache tier. But somehow the cache tier fails to send the reply due to socket failure injecting - Client retries the writes and the read again, the writes are identified as dup ops - The promotion finishes, it copies the pg_log entries from the base tier and put it in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write - The writes dispatches after the promotion, they are identified as completed dup ops. Cache tier replies these write ops with the version from the base tier (v1, v2 and v3) - In the last, the read dispatches, it reads the version of the proxied write (v4) and replies to client - Client complains that 'racing read got wrong version' In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. 
And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? sage
Re: 'Racing read got wrong version' during proxy write testing
I wonder if this issue could be the cause of #11511. Could a proxy write have raced with the fill_in_copy_get() so that the object_info_t size doesn't correspond with the size of the object in the filestore? David On 6/3/15 6:22 PM, Wang, Zhiqiang wrote: Making the 'copy get' op to be a cache op seems like a good idea. -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, June 4, 2015 9:14 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: RE: 'Racing read got wrong version' during proxy write testing On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect the op is a 'copy get' op (looping the ops vector doesn't seem smart?) and copy the reqids in this case. Hmm. I think the idea here is/was that the ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true? Sam, what do you think? sage -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write.
It is like this: - Client sends 3 writes and a read on the same object to base tier - Set up cache tiering - Client retries ops and sends the 3 writes and 1 read to the cache tier - The 3 writes finished on the base tier, say with versions v1, v2 and v3 - Cache tier proxies the 1st write, and start to promote the object for the 2nd write, the 2nd and 3rd writes and the read are blocked - The proxied 1st write finishes on the base tier with version v4, and returns to cache tier. But somehow the cache tier fails to send the reply due to socket failure injecting - Client retries the writes and the read again, the writes are identified as dup ops - The promotion finishes, it copies the pg_log entries from the base tier and put it in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write - The writes dispatches after the promotion, they are identified as completed dup ops. Cache tier replies these write ops with the version from the base tier (v1, v2 and v3) - In the last, the read dispatches, it reads the version of the proxied write (v4) and replies to client - Client complains that 'racing read got wrong version' In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. 
And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? sage
RE: 'Racing read got wrong version' during proxy write testing
Hi David, Proxy write hasn't been merged into master yet. It's not likely this is causing #11511. -Original Message- From: David Zafman [mailto:dzaf...@redhat.com] Sent: Thursday, June 4, 2015 9:46 AM To: Wang, Zhiqiang; Sage Weil Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing I wonder if this issue could be the cause of #11511. Could a proxy write have raced with the fill_in_copy_get() so that the object_info_t size doesn't correspond with the size of the object in the filestore? David On 6/3/15 6:22 PM, Wang, Zhiqiang wrote: Making the 'copy get' op to be a cache op seems like a good idea. -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, June 4, 2015 9:14 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: RE: 'Racing read got wrong version' during proxy write testing On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect the op is a 'copy get' op (looping the ops vector doesn't seem smart?) and copy the reqids in this case. Hmm. I think the idea here is/was that the ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true? Sam, what do you think?
sage -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write. It is like this: - Client sends 3 writes and a read on the same object to base tier - Set up cache tiering - Client retries ops and sends the 3 writes and 1 read to the cache tier - The 3 writes finished on the base tier, say with versions v1, v2 and v3 - Cache tier proxies the 1st write, and start to promote the object for the 2nd write, the 2nd and 3rd writes and the read are blocked - The proxied 1st write finishes on the base tier with version v4, and returns to cache tier. But somehow the cache tier fails to send the reply due to socket failure injecting - Client retries the writes and the read again, the writes are identified as dup ops - The promotion finishes, it copies the pg_log entries from the base tier and put it in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write - The writes dispatches after the promotion, they are identified as completed dup ops. Cache tier replies these write ops with the version from the base tier (v1, v2 and v3) - In the last, the read dispatches, it reads the version of the proxied write (v4) and replies to client - Client complains that 'racing read got wrong version' In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? 
I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? sage
RE: 'Racing read got wrong version' during proxy write testing
On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of 'fill_in_copy_get'. If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect the op is a 'copy get' op (looping the ops vector doesn't seem smart?) and copy the reqids in this case. Hmm. I think the idea here is/was that the ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). I think the trick is getting it past the generic check in do_op:

  if (!op->may_write() &&
      !op->may_cache() &&
      (!obc->obs.exists ||
       ((m->get_snapid() != CEPH_SNAPDIR) &&
        obc->obs.oi.is_whiteout()))) {
    reply_ctx(ctx, -ENOENT);
    return;
  }

Maybe we mark these as cache operations so that may_cache is true? Sam, what do you think? sage -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write. It is like this: - Client sends 3 writes and a read on the same object to base tier - Set up cache tiering - Client retries ops and sends the 3 writes and 1 read to the cache tier - The 3 writes finished on the base tier, say with versions v1, v2 and v3 - Cache tier proxies the 1st write, and start to promote the object for the 2nd write, the 2nd and 3rd writes and the read are blocked - The proxied 1st write finishes on the base tier with version v4, and returns to cache tier.
But somehow the cache tier fails to send the reply due to socket failure injecting - Client retries the writes and the read again, the writes are identified as dup ops - The promotion finishes, it copies the pg_log entries from the base tier and put it in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write - The writes dispatches after the promotion, they are identified as completed dup ops. Cache tier replies these write ops with the version from the base tier (v1, v2 and v3) - In the last, the read dispatches, it reads the version of the proxied write (v4) and replies to client - Client complains that 'racing read got wrong version' In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? 
sage
Re: preparing v0.80.11
On 05/26/2015 10:28 PM, Nathan Cutler wrote: Hi Loic: The first round of 0.80.11 backports, including all trivial backports (where trivial is defined as those I was able to do by myself without help), is now ready for integration testing in the firefly-backports branch of the SUSE fork: https://github.com/SUSE/ceph/commits/firefly-backports The non-trivial backports (on which I hereby solicit help) are:
http://tracker.ceph.com/issues/11699 Objecter: resend linger ops on split
http://tracker.ceph.com/issues/11700 make all the osd/filestore thread pool suicide timeouts separately configurable
http://tracker.ceph.com/issues/11704 erasure-code: misalignment
http://tracker.ceph.com/issues/11720 rgw deleting S3 objects leaves __shadow_ objects behind
Could I also ask for this one to be backported? https://github.com/ceph/ceph/pull/4844 It breaks a couple of setups I know of. It's not in master yet, but it's a very trivial fix. Nathan -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on