Re: [ceph-users] Help needed porting Ceph to RSockets
Hi Sean,

I will re-check until the end of the week; there is some test scheduling
issue with our test system, which affects my access times.

Thanks

Andreas

On Mon, 19 Aug 2013 17:10:11 + "Hefty, Sean" wrote:
> Can you see if the patch below fixes the hang?
>
> Signed-off-by: Sean Hefty
> ---
>  src/rsocket.c |   11 ++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..e45b26d 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
> 		rs = idm_lookup(&idm, fds[i].fd);
> 		if (rs) {
> +			fastlock_acquire(&rs->cq_wait_lock);
> 			if (rs->type == SOCK_STREAM)
> 				rs_get_cq_event(rs);
> 			else
> 				ds_get_cq_event(rs);
> +			fastlock_release(&rs->cq_wait_lock);
> 			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
> 		} else {
> 			fds[i].revents = rfds[i].revents;
> @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
>  /*
>   * For graceful disconnect, notify the remote side that we're
> - * disconnecting and wait until all outstanding sends complete.
> + * disconnecting and wait until all outstanding sends complete, provided
> + * that the remote side has not sent a disconnect message.
>   */
>  int rshutdown(int socket, int how)
>  {
> @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
> 	if (rs->state & rs_connected)
> 		rs_process_cq(rs, 0, rs_conn_all_sends_done);
>
> +	if (rs->state & rs_disconnected) {
> +		/* Generate event by flushing receives to unblock rpoll */
> +		ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> +		rdma_disconnect(rs->cm_id);
> +	}
> +
> 	if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
> 		rs_set_nonblocking(rs, rs->fd_flags);
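For readers following the thread, the hang being chased sits in the interaction between rpoll() and rshutdown(). The snippet below is only my own illustration of that pattern, not Ceph or librdcmacm code from this thread: one thread blocks in rpoll() waiting for data while teardown happens elsewhere, which is roughly the kind of pattern exercised when an application runs over rsockets. Without the change above, if the remote side has already disconnected, nothing generates a completion event and the rpoll() call can block indefinitely; the new rs_disconnected branch in rshutdown() flushes the receive CQ so the poller wakes up.

    #include <rdma/rsocket.h>   // rsockets: rpoll(), rrecv(), rshutdown(), rclose()
    #include <sys/socket.h>     // SHUT_RDWR
    #include <poll.h>

    // Minimal sketch of the blocking pattern discussed in this thread.
    // Error handling is trimmed; "fd" is an rsocket connected elsewhere.
    static void reader_loop(int fd)
    {
        struct pollfd pfd;
        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        char buf[4096];

        for (;;) {
            // Blocks until the rsocket has a completion event to report.
            // This is the call observed to hang when the peer had already
            // torn the connection down.
            if (rpoll(&pfd, 1, -1) <= 0)
                break;
            ssize_t n = rrecv(fd, buf, sizeof(buf), 0);
            if (n <= 0)
                break;              // peer closed or error: leave the loop
        }
    }

    static void shutdown_path(int fd)
    {
        // With the patch above, rshutdown() flushes receives when the socket
        // is already disconnected, generating the CQ event needed to unblock
        // a thread sitting in rpoll().
        rshutdown(fd, SHUT_RDWR);
        rclose(fd);
    }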
Re: RGW blueprint for plugin architecture
On Tue, Aug 20, 2013 at 2:58 AM, Yehuda Sadeh wrote:
> Well, practically I'd like to have such work doing baby steps, rather
> than sweeping changes. Such changes have higher chances of getting
> completed and eventually merged upstream. That's why I prefer the
> current model of directly linking the plugins (whether statically or
> dynamically), with (relatively) minor internal adjustments.

What current model of "directly linking plugins" are you referring to, exactly?

> Maybe start with thinking about the use cases, and then figure what
> kind of api that would be. As I said, I'm not sure that an internal
> api is the way to go, but rather exposing some lower level
> functionality externally. The big difference is that with the former
> we tie in the internal architecture, while the latter hides the [gory]
> details.

The problem is that right now basically everything is "lower level
functionality", because a lot of generic stuff depends on S3 stuff, which in
turn depends on generic stuff. Take, for example, the following:

  class RGWHandler_Usage   : public RGWHandler_Auth_S3 { }
  class RGWHandler_Auth_S3 : public RGWHandler_ObjStore { }

This basically ties usage statistics collection, authentication handling and
the object store all together. I think this needs to be completely
unravelled, but before drawing up all kinds of use cases (like usage
statistics collection or authentication in this case) it might be wise to
know what the design decisions were that made the S3 API so tightly
integrated into everything else. Or is this just legacy?

Roald
Re: [ceph-users] Help needed porting Ceph to RSockets
Hi,

I have added the patch and re-tested: I still encounter hangs of my
application. I am not quite sure whether I hit the same error on the
shutdown, because now I don't hit the error always, but only every now and
then.

When adding the patch to my code base (git tag v1.0.17) I notice an offset
of "-34 lines". Which code base are you using?

Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200 Andreas Bluemle wrote:
> Hi Sean,
>
> I will re-check until the end of the week; there is
> some test scheduling issue with our test system, which
> affects my access times.
>
> Thanks
>
> Andreas
>
> On Mon, 19 Aug 2013 17:10:11 + "Hefty, Sean" wrote:
> > Can you see if the patch below fixes the hang?
Re: Review request : Erasure Code plugin loader implementation
Hi Sage,

I created "erasure code : convenience functions to code / decode"
http://tracker.ceph.com/issues/6064 to implement the suggested functions.
Please let me know if this should be merged with another task.

Cheers

On 19/08/2013 17:06, Loic Dachary wrote:
>
> On 19/08/2013 02:01, Sage Weil wrote:
>> On Sun, 18 Aug 2013, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> Unless I misunderstood something (which is still possible at this stage
>>> ;-) decode() is used both for recovery of missing chunks and retrieval of
>>> the original buffer. Decoding the M data chunks is a special case of
>>> decoding N <= M chunks out of the M+K chunks that were produced by
>>> encode(). It can be used to recover parity chunks as well as data chunks.
>>>
>>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
>>>
>>>   map<int, buffer> decode(const set<int> &want_to_read,
>>>                           const map<int, buffer> &chunks)
>>>
>>> decode chunks to read the content of the want_to_read chunks and return
>>> a map associating the chunk number with its decoded content. For instance,
>>> in the simplest case M=2,K=1 for an encoded payload of data A and B with
>>> parity Z, calling
>>>
>>>   decode([1,2], { 1 => 'A', 2 => 'B', 3 => 'Z' })
>>>   => { 1 => 'A', 2 => 'B' }
>>>
>>> If, however, the chunk B is to be read but is missing it will be:
>>>
>>>   decode([2], { 1 => 'A', 3 => 'Z' })
>>>   => { 2 => 'B' }
>>
>> Ah, I guess this works when some of the chunks contain the original
>> data (as with a parity code). There are codes that don't work that way,
>> although I suspect we won't use them.
>>
>> Regardless, I wonder if we should generalize slightly and have some
>> methods work in terms of (offset, length) of the original stripe to
>> generalize that bit. Then we would have something like
>>
>>   map<int, buffer> transcode(const set<int> &want_to_read,
>>                              const map<int, buffer> &chunks);
>>
>> to go from chunks -> chunks (as we would want to do with, say, an LRC-like
>> code where we can rebuild some shards from a subset of the other shards).
>> And then also have
>>
>>   int decode(const map<int, buffer> &chunks, unsigned offset,
>>              unsigned len, bufferlist *out);
>
> This function would be implemented more or less as:
>
>   set<int> want_to_read = range_to_chunks(offset, len); // compute what chunks must be retrieved
>   set<int> available = the up set
>   set<int> minimum = minimum_to_decode(want_to_read, available);
>   map<int, buffer> available_chunks = retrieve_chunks_from_osds(minimum);
>   map<int, buffer> chunks = transcode(want_to_read, available_chunks); // repairs if necessary
>   out = bufferptr(concat_chunks(chunks), offset - offset of the first chunk, len)
>
> or do you have something else in mind?
>
>> that recovers the original data.
>>
>> In our case, the read path would use decode, and for recovery we would use
>> transcode.
>>
>> We'd also want to have alternate minimum_to_decode* methods, like
>>
>>   virtual set<int> minimum_to_decode(unsigned offset, unsigned len,
>>                                      const set<int> &available_chunks) = 0;
>
> I also have a convenience wrapper in mind for this but I feel I'm missing
> something.
>
> Cheers
>
>> What do you think?
>>
>> sage
>>
>>> Cheers
>>>
>>> On 18/08/2013 19:34, Sage Weil wrote:
>>>> On Sun, 18 Aug 2013, Loic Dachary wrote:
>>>>> Hi Ceph,
>>>>>
>>>>> I've implemented a draft of the Erasure Code plugin loader in the context
>>>>> of http://tracker.ceph.com/issues/5878. It has a trivial unit test and an
>>>>> example plugin. It would be great if someone could do a quick review.
>>>>> The general idea is that the erasure code pool calls something like:
>>>>>
>>>>>   ErasureCodePlugin::factory(&erasure_code, "example", parameters)
>>>>>
>>>>> as shown at
>>>>>
>>>>>   https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/test/osd/TestErasureCode.cc#L28
>>>>>
>>>>> to get an object implementing the interface
>>>>>
>>>>>   https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/osd/ErasureCodeInterface.h
>>>>>
>>>>> which matches the proposal described at
>>>>>
>>>>>   https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
>>>>>
>>>>> The draft is at
>>>>>
>>>>>   https://github.com/ceph/ceph/commit/5a2b1d66ae17b78addc14fee68c73985412f3c8c
>>>>>
>>>>> Thanks in advance :-)
>>>>
>>>> I haven't been following this discussion too closely, but taking a look
>>>> now, the first 3 make sense, but
>>>>
>>>>   virtual map<int, buffer> decode(const set<int> &want_to_read,
>>>>                                   const map<int, buffer> &chunks) = 0;
>>>>
>>>> it seems like this one should be more like
>>>>
>>>>   virtual int decode(const map<int, buffer> &chunks, bufferlist *out);
>>>>
>>>> As in, you'd decode the chunks you have to get the actual data. If you
>>>> want to get (missing) chunks for recovery, you'd do
>>>>
>>>>   minimum_to_decode(...);  // see what we need
>>>>   decode(...);             // reconst
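To make the shape of the convenience wrapper discussed above concrete, here is a rough C++ sketch of a decode-by-byte-range helper. It is only an illustration of the idea from this thread, not merged Ceph code: the type ErasureCodeReader, the fixed chunk_size, the stored_chunks map and the trivial transcode() that skips repair are all assumptions made for the example, and the real interface would work on ceph::bufferlist and talk to OSDs.

    #include <map>
    #include <set>
    #include <string>

    // Illustrative stand-in for ceph::buffer; real code would use bufferlist.
    typedef std::string buffer;

    struct ErasureCodeReader {
      unsigned chunk_size;                  // assumed fixed for the sketch
      std::map<int, buffer> stored_chunks;  // pretend this is "the up set"

      explicit ErasureCodeReader(unsigned cs) : chunk_size(cs) {}

      // Chunks covering the byte range [offset, offset+len) of the payload.
      std::set<int> range_to_chunks(unsigned offset, unsigned len) const {
        std::set<int> out;
        for (unsigned c = offset / chunk_size; c <= (offset + len - 1) / chunk_size; ++c)
          out.insert(c);
        return out;
      }

      // Stand-in for minimum_to_decode() + transcode(): here every wanted
      // chunk is simply assumed to be available; real code would repair.
      std::map<int, buffer> transcode(const std::set<int> &want) const {
        std::map<int, buffer> out;
        for (std::set<int>::const_iterator i = want.begin(); i != want.end(); ++i)
          out[*i] = stored_chunks.find(*i)->second;
        return out;
      }

      // The convenience wrapper sketched in the thread: decode a byte range.
      int decode(unsigned offset, unsigned len, buffer *out) const {
        std::set<int> want = range_to_chunks(offset, len);
        std::map<int, buffer> chunks = transcode(want);
        buffer all;
        for (std::map<int, buffer>::const_iterator i = chunks.begin(); i != chunks.end(); ++i)
          all += i->second;                           // concat_chunks()
        unsigned first = *want.begin() * chunk_size;  // offset of the first chunk
        *out = all.substr(offset - first, len);
        return 0;
      }
    };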
Erasure Code plugin system with an example : review request
Hi Ceph,

Yesterday I implemented a simple erasure code plugin that can sustain the
loss of a single chunk.

  https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeExample.h

and it works as shown in the unit test

  https://github.com/dachary/ceph/blob/wip-5878/src/test/osd/TestErasureCodeExample.cc

It would be of limited use in a production environment because it only saves
25% space (M=2, K=1) over a 2-replica pool, but it would work.

I would very much appreciate a review of the erasure code plugin system and
the associated example plugin:

  https://github.com/ceph/ceph/pull/515

When it's good enough, creating a jerasure plugin will be next :-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
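For anyone wondering what a plugin that "can sustain the loss of a single chunk" boils down to, the classic construction is a single XOR parity chunk. The sketch below is my own toy illustration of that idea, not the code in ErasureCodeExample.h: two data chunks plus one parity chunk means 1.5x the payload size on disk, i.e. 25% less than the 2x of a two-replica pool, and any one missing chunk can be rebuilt by XOR-ing the other two.

    #include <string>
    #include <cassert>

    // Toy code with two data chunks and one parity chunk (M=2, K=1).
    // Chunks are equal-length byte strings.
    static std::string xor_bytes(const std::string &a, const std::string &b) {
      assert(a.size() == b.size());
      std::string out(a.size(), '\0');
      for (size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] ^ b[i];
      return out;
    }

    int main() {
      std::string A = "hello wo";          // data chunk 1
      std::string B = "rld!!!!!";          // data chunk 2
      std::string Z = xor_bytes(A, B);     // parity chunk

      // Lose any single chunk and rebuild it from the other two.
      assert(xor_bytes(B, Z) == A);        // recover A
      assert(xor_bytes(A, Z) == B);        // recover B
      assert(xor_bytes(A, B) == Z);        // recover Z
      return 0;
    }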
Re: RGW blueprint for plugin architecture
On Tue, Aug 20, 2013 at 1:58 AM, Roald van Loon wrote:
> On Tue, Aug 20, 2013 at 2:58 AM, Yehuda Sadeh wrote:
>> Well, practically I'd like to have such work doing baby steps, rather
>> than sweeping changes. Such changes have higher chances of getting
>> completed and eventually merged upstream. That's why I prefer the
>> current model of directly linking the plugins (whether statically or
>> dynamically), with (relatively) minor internal adjustments.
>
> What current model of "directly linking plugins" are you referring to, exactly?

I was referring to your work in wip-rgw-plugin, where the plugin code itself
still needs to rely on the rgw utility code.

>> Maybe start with thinking about the use cases, and then figure what
>> kind of api that would be. As I said, I'm not sure that an internal
>> api is the way to go, but rather exposing some lower level
>> functionality externally. The big difference is that with the former
>> we tie in the internal architecture, while the latter hides the [gory]
>> details.
>
> The problem is that right now basically everything is "lower level
> functionality", because a lot of generic stuff depends on S3 stuff,
> which in turn depends on generic stuff. Take, for example, the
> following:
>
>   class RGWHandler_Usage   : public RGWHandler_Auth_S3 { }
>   class RGWHandler_Auth_S3 : public RGWHandler_ObjStore { }
>
> This basically ties usage statistics collection, authentication handling
> and the object store all together.

That's not quite a hard dependency. At the moment it's like that, as we made
a decision to use the S3 auth for the admin utilities. Switching to a
different auth system (atm) would require defining a new auth class and
inheriting from it instead. It's not very flexible, but it's not very
intrusive. I'd certainly be interested in removing this inheritance
relationship and switching to a different pipeline model.

> I think this needs to be completely unravelled, but before drawing up all
> kinds of use cases (like usage statistics collection or authentication in
> this case) it might be wise to know what the design decisions were that
> made the S3 API so tightly integrated into everything else. Or is this
> just legacy?

As I said, I don't see it as such. We do use it all over the place, but in
the same way you could just switch these to use RGWHandler_Auth_Swift and it
should work (give or take a few tweaks).

Yehuda
RE: [ceph-users] Help needed porting Ceph to RSockets
> I have added the patch and re-tested: I still encounter
> hangs of my application. I am not quite sure whether I hit
> the same error on the shutdown, because now I don't hit
> the error always, but only every now and then.

I guess this is at least some progress... :/

> When adding the patch to my code base (git tag v1.0.17) I notice
> an offset of "-34 lines". Which code base are you using?

This patch was generated against the tip of the git tree.
libvirt: Removing RBD volumes with snapshots, auto purge or not?
Hi,

The current [0] libvirt storage pool code simply calls "rbd_remove" without
anything else.

As far as I know rbd_remove will fail if the image still has snapshots; you
have to remove those snapshots first before you can remove the image.

The problem is that libvirt's storage pools do not support listing
snapshots, so we can't integrate that.

Libvirt however has a flag you can pass down to indicate that you want the
device to be zeroed. The normal procedure is that the device is filled with
zeros before actually removing it.

I was thinking about "abusing" this flag to use it as a snap purge for RBD.
So a regular volume removal will call only rbd_remove, but when the flag
VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots prior to
calling rbd_remove.

Another way would be to always purge snapshots, but I'm afraid that could
make somebody very unhappy at some point.

Currently "virsh" doesn't support flags, but that could be fixed in a
different patch.

Does my idea sound sane?

[0]: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407

--
Wido den Hollander
42on B.V.
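For concreteness, here is a rough sketch of what the deletion path could look like with the proposed flag handling. This is illustrative pseudo-backend code written by the editor of this digest, not a patch against storage_backend_rbd.c: the function name delete_rbd_vol is made up, an already-created rados_ioctx_t is assumed, libvirt's error-reporting conventions are skipped, and protected snapshots (which would additionally need rbd_snap_unprotect) are glossed over. The librbd calls themselves (rbd_open, rbd_snap_list, rbd_snap_remove, rbd_remove) are the ones the real backend would use.

    #include <rbd/librbd.h>
    #include <rados/librados.h>
    #include <errno.h>
    #include <stdlib.h>

    /* Sketch: delete an RBD volume, purging snapshots only when the caller
     * requested it (stand-in for VIR_STORAGE_VOL_DELETE_ZEROED as proposed). */
    static int delete_rbd_vol(rados_ioctx_t ioctx, const char *name, int purge)
    {
        if (purge) {
            rbd_image_t image;
            if (rbd_open(ioctx, name, &image, NULL) < 0)
                return -1;

            int max_snaps = 16;
            rbd_snap_info_t *snaps =
                (rbd_snap_info_t *) calloc(max_snaps, sizeof(*snaps));
            int n = rbd_snap_list(image, snaps, &max_snaps);
            if (n == -ERANGE) {
                /* Buffer too small: max_snaps now holds the required count. */
                free(snaps);
                snaps = (rbd_snap_info_t *) calloc(max_snaps, sizeof(*snaps));
                n = rbd_snap_list(image, snaps, &max_snaps);
            }
            for (int i = 0; i < n; i++)
                rbd_snap_remove(image, snaps[i].name);

            rbd_snap_list_end(snaps);
            free(snaps);
            rbd_close(image);
        }

        /* rbd_remove fails if the image still has snapshots. */
        return rbd_remove(ioctx, name);
    }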
Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?
On Tue, Aug 20, 2013 at 7:36 PM, Wido den Hollander wrote:
> I was thinking about "abusing" this flag to use it as a snap purge for RBD.
> So a regular volume removal will call only rbd_remove, but when the flag
> VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots prior
> to calling rbd_remove.

Hi Wido,

Not so long ago you mentioned the same idea I had about a year and a half
ago: placing memory dumps along with the regular snapshot in Ceph using the
libvirt mechanisms. That sounds pretty nice, since we'd have something other
than qcow2 with the same snapshot functionality, but your current proposal
does not extend to this. Placing a custom side hook seems much more
expandable than putting a snap purge behind a specific flag.
Re: RGW blueprint for plugin architecture
On Tue, Aug 20, 2013 at 4:49 PM, Yehuda Sadeh wrote:
> I was referring to your work in wip-rgw-plugin, where the plugin code
> itself still needs to rely on the rgw utility code.

Right. So we can agree on ditching the dynamic loading thing and the clean
internal API (for now), but at least start separating code into "plugins"
like this?

> That's not quite a hard dependency. At the moment it's like that, as
> we made a decision to use the S3 auth for the admin utilities.
> Switching to a different auth system (atm) would require defining a
> new auth class and inheriting from it instead. It's not very flexible,
> but it's not very intrusive.
> I'd certainly be interested in removing this inheritance relationship
> and switching to a different pipeline model.

I don't know if you looked at it in detail, but for the wip-rgw-plugin work
I created a RGWAuthManager / RGWAuthPipeline relation to segregate
authentication-specific stuff from the REST handlers. Is that in general a
model you would like to see discussed in more detail? If so, it would
probably be wise to start a separate blueprint for it.

> As I said, I don't see it as such. We do use it all over the place,
> but in the same way you could just switch these to use
> RGWHandler_Auth_Swift and it should work (give or take a few tweaks).

IMHO, REST handlers should leave authentication/authorization/accounting
specific tasks to a separate component (like the aforementioned pipelining
system, and maybe integrate that with current RGWUser related code),
although this will likely never be purely abstracted (at least for
authentication). This just makes the whole system more modular (albeit just
a bit).

But for now I propose to implement a small "plugin" system where plugins are
still linked into the rgw core (but code-wise as much separated as
possible), and keep the auth stuff for later.

Roald
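To give the pipeline idea some shape, here is a rough C++ sketch of what a decoupled auth stage could look like. The class and method names (ReqState, AuthEngine, AuthPipeline, authorize) are hypothetical illustrations invented for this digest, not the actual wip-rgw-plugin code or any RGW API; the point is only that a REST handler would hand the request to a chain of engines instead of inheriting from an S3 auth handler.

    #include <cerrno>
    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical request context; real radosgw code has much richer state.
    struct ReqState {
      std::string access_key;
      std::string signature;
      std::string user;        // filled in on successful auth
    };

    // One authentication mechanism (e.g. S3 signatures, Swift tokens, Keystone).
    class AuthEngine {
    public:
      virtual ~AuthEngine() {}
      virtual bool applies(const ReqState &s) const = 0;  // does this request look like ours?
      virtual int authorize(ReqState *s) = 0;             // 0 on success, negative errno on failure
    };

    // Ordered chain of engines; the first applicable engine decides.
    class AuthPipeline {
    public:
      void add(std::shared_ptr<AuthEngine> e) { engines.push_back(e); }

      int authorize(ReqState *s) {
        for (size_t i = 0; i < engines.size(); ++i) {
          if (engines[i]->applies(*s))
            return engines[i]->authorize(s);
        }
        return -EACCES;   // no engine recognized the request
      }

    private:
      std::vector<std::shared_ptr<AuthEngine> > engines;
    };

In such a model a REST handler would call pipeline.authorize(&s) before dispatching the operation, so swapping S3 auth for Swift or an external system becomes a matter of registering a different engine rather than changing the handler's base class.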
Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?
On 08/20/2013 05:43 PM, Andrey Korolyov wrote:
> Not so long ago you mentioned the same idea I had about a year and a half
> ago: placing memory dumps along with the regular snapshot in Ceph using
> the libvirt mechanisms. That sounds pretty nice, since we'd have something
> other than qcow2 with the same snapshot functionality, but your current
> proposal does not extend to this.

Correct; since this is about the storage pool support, that is something
completely different.

> Placing a custom side hook seems much more expandable than putting a snap
> purge behind a specific flag.

My proposal is a bit selfish, since I'm running into this with CloudStack.
CloudStack now has a work-around for RBD, since images could still have
snapshots, whereas other storage types are handled by libvirt. I want to
have it all handled by libvirt.

Wido
app design recommendations
Hi,

I am creating an email system which will handle a whole company's email,
mostly internal mail. There will be thousands of companies and hundreds of
users per company, so I am planning to use one pool per company to store
email messages. Can Ceph manage thousands, or maybe hundreds of thousands,
of pools? Could there be any slowdown in production with such a design after
some growth?

Every email will be stored as an individual ceph object (emails will average
512 bytes and rarely have attachments). Is it OK to store them as individual
ceph objects, or will it be less efficient than storing multiple emails in a
single ceph object? What is the minimum object size at which storing data as
an individual ceph object is still preferable to writing it through omap
with leveldb? (a kind of "ceph object vs omap" benchmark question)

Also, I will be putting mini-chat sessions between users in a ceph object:
each time a user sends a message to another user, I will append the text to
the ceph object. So my question is, will Ceph rewrite the whole object into
a new physical location on disk when I do an append, or will it just rewrite
the block that was modified?

And last questions: which is faster, storing small key/value pairs in omap
or in xattrs? Will storing key/value pairs in xattrs result in space wasted
by allocating a block for a zero-sized object on the OSD? (I won't write any
data to the object, just use xattrs.)

Will appreciate your comments very much.

Best Regards
Nulik
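Some of these questions are easier to discuss with the client calls in front of us. The snippet below is a hedged illustration of the three access paths being compared (append to an object, omap key/value pairs, xattrs) using the public librados C++ API; the pool name, object names and keys are made up for the example, and it says nothing about which path is faster -- that still needs benchmarking.

    #include <rados/librados.hpp>
    #include <map>
    #include <string>

    int main() {
      librados::Rados cluster;
      cluster.init("admin");           // client id; assumes a standard admin keyring
      cluster.conf_read_file(NULL);    // read the default ceph.conf
      if (cluster.connect() < 0)
        return 1;

      librados::IoCtx io;
      if (cluster.ioctx_create("company-acme", io) < 0)   // hypothetical per-company pool
        return 1;

      // 1. Append a small message to a mini-chat object (or write one email
      //    as its own object with write_full instead).
      librados::bufferlist msg;
      msg.append("2013-08-20 alice->bob: hello\n");
      io.append("chat.alice.bob", msg, msg.length());

      // 2. The same payload stored as an omap key/value pair on an index object.
      std::map<std::string, librados::bufferlist> kv;
      kv["msg-000001"] = msg;
      librados::ObjectWriteOperation op;
      op.omap_set(kv);
      io.operate("chat.alice.bob.index", &op);

      // 3. Small metadata stored as an xattr on a (possibly zero-length) object.
      librados::bufferlist flag;
      flag.append("unread");
      io.setxattr("email.000001", "status", flag);

      cluster.shutdown();
      return 0;
    }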
Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?
On 08/20/2013 08:36 AM, Wido den Hollander wrote:
> The problem is that libvirt's storage pools do not support listing
> snapshots, so we can't integrate that.

libvirt's storage pools don't have any concept of snapshots, which is the
real problem. Ideally they would have functions to at least create, list and
delete snapshots (and probably rollback and create a volume from a snapshot
too).

> I was thinking about "abusing" this flag to use it as a snap purge for RBD.
>
> So a regular volume removal will call only rbd_remove, but when the flag
> VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots prior
> to calling rbd_remove.

I don't think we should reinterpret the flag like that. A new flag for that
purpose could work, but since libvirt storage pools don't manage snapshots
at all right now I'd rather CloudStack delete the snapshots via librbd,
since it's the service creating them in this case. You could see what the
libvirt devs think about a new flag though.

> Another way would be to always purge snapshots, but I'm afraid that could
> make somebody very unhappy at some point.

I agree this would be too unsafe for a default. It seems that's what the LVM
storage pool does now, maybe because it doesn't expect snapshots to be used.

> Currently "virsh" doesn't support flags, but that could be fixed in a
> different patch.

No backend actually uses the flags yet either.

Josh
Re: Need some help with the RBD Java bindings
Wido,

I pushed up a patch to

  https://github.com/ceph/rados-java/commit/ca16d82bc5b596620609880e429ec9f4eaa4d5ce

that includes a fix for this problem. The fix is a bit hacky, but the tests
pass now. I included more details about the hack in the code.

On Thu, Aug 15, 2013 at 9:57 AM, Noah Watkins wrote:
> On Thu, Aug 15, 2013 at 8:51 AM, Wido den Hollander wrote:
>>
>> public List<RbdSnapInfo> snapList() throws RbdException {
>>     IntByReference numSnaps = new IntByReference(16);
>>     PointerByReference snaps = new PointerByReference();
>>     List<RbdSnapInfo> list = new ArrayList<RbdSnapInfo>();
>>     RbdSnapInfo snapInfo, snapInfos[];
>>
>>     while (true) {
>>         int r = rbd.rbd_snap_list(this.getPointer(), snaps, numSnaps);
>
> I think you need to allocate the memory for `snaps` yourself. Here is
> the RBD wrapper for Python which does that:
>
>     self.snaps = (rbd_snap_info_t * num_snaps.value)()
>     ret = self.librbd.rbd_snap_list(image.image, byref(self.snaps),
>                                     byref(num_snaps))
>
> - Noah
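For context on why the binding has to guess a buffer size at all, this is roughly the call pattern of the underlying librbd C function that the JNA code wraps. The sketch below is my own illustration in C++ against the librbd C API, not the rados-java fix itself; the helper name list_snaps is made up. rbd_snap_list() fills a caller-allocated array and, if the array is too small, returns -ERANGE and updates max_snaps with the required count, so callers typically retry in a loop.

    #include <rbd/librbd.h>
    #include <errno.h>
    #include <vector>

    // List all snapshots of an already-opened image, growing the buffer on -ERANGE.
    static int list_snaps(rbd_image_t image, std::vector<rbd_snap_info_t> *out) {
      int max_snaps = 16;                       // initial guess, like the Java code
      for (;;) {
        std::vector<rbd_snap_info_t> snaps(max_snaps);
        int r = rbd_snap_list(image, snaps.data(), &max_snaps);
        if (r >= 0) {
          snaps.resize(r);
          *out = snaps;                         // caller should later hand the array
          return r;                             // to rbd_snap_list_end() to free names
        }
        if (r != -ERANGE)
          return r;                             // real error
        // -ERANGE: max_snaps now holds the required count; loop and retry.
      }
    }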
do not upgrade bobtail -> dumpling directly until 0.67.2
We've identified a problem when upgrading directly from bobtail to dumpling;
please wait until 0.67.2 before doing so. Upgrades from bobtail -> cuttlefish
-> dumpling are fine. It is only the long jump between versions that is
problematic.

The fix is already in the dumpling branch. Another point release will be out
in the next day or two.

Thanks!
sage
Re: RGW blueprint for plugin architecture
On Tue, Aug 20, 2013 at 9:03 AM, Roald van Loon wrote:
> I don't know if you looked at it in detail, but for the wip-rgw-plugin
> work I created a RGWAuthManager / RGWAuthPipeline relation to segregate
> authentication-specific stuff from the REST handlers. Is that in general
> a model you would like to see discussed in more detail? If so, it would
> probably be wise to start a separate blueprint for it.

I didn't look closely at all the details, but yeah, something along those
lines. But it'll need to be clearly defined.

> IMHO, REST handlers should leave authentication/authorization/accounting
> specific tasks to a separate component (like the aforementioned pipelining
> system, and maybe integrate that with current RGWUser related code),
> although this will likely never be purely abstracted (at least for
> authentication). This just makes the whole system more modular (albeit
> just a bit).

Can't think of examples off the top of my head right now, but the devil's
always in the details. Hopefully wrt the auth system there aren't many
hidden issues.

> But for now I propose to implement a small "plugin" system where plugins
> are still linked into the rgw core (but code-wise as much separated as
> possible), and keep the auth stuff for later.

Sounds good.

Yehuda
New Defects reported by Coverity Scan for ceph (fwd)
Coverity picked up some issues with the filestore code. These are mostly old
issues that appear new because code moved around, but this is probably a
good opportunity to fix them... :)

sage

--- Begin Message ---

Hi,

Please find the latest report on new defect(s) introduced to ceph found with
Coverity Scan.

Defect(s) Reported-by: Coverity Scan
Showing 7 of 9 defects

** CID 1063704: Uninitialized scalar field (UNINIT_CTOR)
   /os/BtrfsFileStoreBackend.cc: 57

** CID 1063703: Time of check time of use (TOCTOU)
   /os/GenericFileStoreBackend.cc: 170

** CID 1063702: Time of check time of use (TOCTOU)
   /os/BtrfsFileStoreBackend.cc: 246

** CID 1063701: Copy into fixed size buffer (STRING_OVERFLOW)
   /os/BtrfsFileStoreBackend.cc: 458

** CID 1063700: Copy into fixed size buffer (STRING_OVERFLOW)
   /os/BtrfsFileStoreBackend.cc: 370

** CID 1063699: Resource leak (RESOURCE_LEAK)
   /os/BtrfsFileStoreBackend.cc: 345

** CID 1063698: Improper use of negative value (NEGATIVE_RETURNS)

CID 1063704: Uninitialized scalar field (UNINIT_CTOR)

/os/BtrfsFileStoreBackend.h: 25 (member_decl)
   22  private:
   23    bool has_clone_range;      ///< clone range ioctl is supported
   24    bool has_snap_create;      ///< snap create ioctl is supported
>>> Class member declaration for "has_snap_destroy".
   25    bool has_snap_destroy;     ///< snap destroy ioctl is supported
   26    bool has_snap_create_v2;   ///< snap create v2 ioctl (async!) is supported
   27    bool has_wait_sync;        ///< wait sync ioctl is supported
   28    bool stable_commits;
   29    bool m_filestore_btrfs_clone_range;

/os/BtrfsFileStoreBackend.cc: 57 (uninit_member)
   54    GenericFileStoreBackend(fs), has_clone_range(false), has_snap_create(false),
   55    has_snap_create_v2(false), has_wait_sync(false), stable_commits(false),
   56    m_filestore_btrfs_clone_range(g_conf->filestore_btrfs_clone_range),
>>> CID 1063704: Uninitialized scalar field (UNINIT_CTOR)
>>> Non-static class member "has_snap_destroy" is not initialized in this
>>> constructor nor in any functions that it calls.
   57    m_filestore_btrfs_snap(g_conf->filestore_btrfs_snap) { }
   58
   59  int BtrfsFileStoreBackend::detect_features()
   60  {
   61    int r;

CID 1063703: Time of check time of use (TOCTOU)

/os/GenericFileStoreBackend.cc: 170 (fs_check_call)
  167  int GenericFileStoreBackend::create_current()
  168  {
  169    struct stat st;
>>> CID 1063703: Time of check time of use (TOCTOU)
>>> Calling function "stat(char const *, stat *)" to perform check on
>>> "this->get_current_path()->c_str()".
  170    int ret = ::stat(get_current_path().c_str(), &st);
  171    if (ret == 0) {
  172      // current/ exists
  173      if (!S_ISDIR(st.st_mode)) {
  174        dout(0) << "_create_current: current/ exists but is not a directory" << dendl;

/os/GenericFileStoreBackend.cc: 178 (toctou)
  175        ret = -EINVAL;
  176      }
  177    } else {
>>> Calling function "mkdir(char const *, __mode_t)" that uses
>>> "this->get_current_path()->c_str()" after a check function. This can
>>> cause a time-of-check, time-of-use race condition.
  178      ret = ::mkdir(get_current_path().c_str(), 0755);
  179      if (ret < 0) {
  180        ret = -errno;
  181        dout(0) << "_create_current: mkdir " << get_current_path() << " failed: " << cpp_strerror(ret) << dendl;
  182      }

CID 1063702: Time of check time of use (TOCTOU)

/os/BtrfsFileStoreBackend.cc: 246 (fs_check_call)
  243  int BtrfsFileStoreBackend::create_current()
  244  {
  245    struct stat st;
>>> CID 1063702: Time of check time of use (TOCTOU)
>>> Calling function "stat(char const *, stat *)" to perform check on
>>> "this->get_current_path()->c_str()".
  246    int ret = ::stat(get_current_path().c_str(), &st);
  247    if (ret == 0) {
  248      // current/ exists
  249      if (!S_ISDIR(st.st_mode)) {
  250        dout(0) << "create_current: current/ exists but is not a directory" << dendl;

/os/BtrfsFileStoreBackend.cc: 288 (toctou)
  285    }
  286
  287    dout(2) << "create_current: created btrfs subvol " << get_current_path() << dendl;
>>> Calling function "chmod(char const *, __mode_t)" that uses
>>> "this->get_current_path()->c_str()" after a check function. This can
>>> cause a time-of-check, time-of-use race condition.
  288    if (::chmod(get_current_path().c_str(), 0755) < 0) {
  289      ret = -errno;
  290      dout(0) << "create_current: failed to chmod " << get_current_path() << " to 0755: "
  291              << cpp_strerror(ret) << dendl;
  292      return ret;
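Two of these defect classes have well-known remedies. The sketch below is a hedged illustration of those remedies rather than a patch against the actual files (the types and function here are simplified stand-ins, not the real BtrfsFileStoreBackend code): for the UNINIT_CTOR report, every member appears in the constructor's initializer list; for the TOCTOU reports, the code attempts the mkdir unconditionally and treats EEXIST as "already there" instead of stat-ing first.

    #include <sys/stat.h>
    #include <sys/types.h>
    #include <cerrno>
    #include <string>

    // CID 1063704-style fix: initialize every bool member in the ctor.
    struct BtrfsBackendSketch {
      bool has_clone_range;
      bool has_snap_create;
      bool has_snap_destroy;       // the member Coverity flagged
      bool has_snap_create_v2;

      BtrfsBackendSketch()
        : has_clone_range(false),
          has_snap_create(false),
          has_snap_destroy(false), // the missing initialization
          has_snap_create_v2(false) {}
    };

    // TOCTOU-style fix: don't stat() then mkdir(); mkdir() and inspect errno.
    static int create_current(const std::string &path) {
      if (::mkdir(path.c_str(), 0755) == 0)
        return 0;                  // created it
      if (errno == EEXIST) {
        struct stat st;
        if (::stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode))
          return 0;                // already there and a directory
        return -EINVAL;            // exists but is not a directory
      }
      return -errno;               // some other mkdir failure
    }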