Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Andreas Bluemle
Hi Sean,

I will re-check by the end of the week; there is
some test scheduling issue with our test system, which
affects my access times.

Thanks

Andreas


On Mon, 19 Aug 2013 17:10:11 +
"Hefty, Sean"  wrote:

> Can you see if the patch below fixes the hang?
> 
> Signed-off-by: Sean Hefty 
> ---
>  src/rsocket.c |   11 ++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..e45b26d 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
> 
>          rs = idm_lookup(&idm, fds[i].fd);
>          if (rs) {
> +            fastlock_acquire(&rs->cq_wait_lock);
>              if (rs->type == SOCK_STREAM)
>                  rs_get_cq_event(rs);
>              else
>                  ds_get_cq_event(rs);
> +            fastlock_release(&rs->cq_wait_lock);
>              fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
>          } else {
>              fds[i].revents = rfds[i].revents;
> @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
> 
>  /*
>   * For graceful disconnect, notify the remote side that we're
> - * disconnecting and wait until all outstanding sends complete.
> + * disconnecting and wait until all outstanding sends complete, provided
> + * that the remote side has not sent a disconnect message.
>   */
>  int rshutdown(int socket, int how)
>  {
> @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
>      if (rs->state & rs_connected)
>          rs_process_cq(rs, 0, rs_conn_all_sends_done);
>  
> +    if (rs->state & rs_disconnected) {
> +        /* Generate event by flushing receives to unblock rpoll */
> +        ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> +        rdma_disconnect(rs->cm_id);
> +    }
> +
>      if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
>          rs_set_nonblocking(rs, rs->fd_flags);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RGW blueprint for plugin architecture

2013-08-20 Thread Roald van Loon
On Tue, Aug 20, 2013 at 2:58 AM, Yehuda Sadeh  wrote:
> Well, practically I'd like to have such work doing baby steps, rather
> than sweeping changes. Such changes have higher chances of getting
> completed and eventually merged upstream.  That's why I prefer the
> current model of directly linking the plugins (whether statically or
> dynamically), with (relatively) minor internal adjustments.

What current model of "directly linking plugins" do you refer to exactly?

> Maybe start with thinking about the use cases, and then figure what
> kind of api that would be. As I said, I'm not sure that an internal
> api is the way to go, but rather exposing some lower level
> functionality externally. The big difference is that with the former
> we tie in the internal architecture, while the latter hides the [gory]
> details.

The problem is that right now basically everything is 'lower level
functionality', because a lot of generic stuff depends on S3 stuff,
which in turn depends on generic stuff. Take for example the
following:

class RGWHandler_Usage : public RGWHandler_Auth_S3 { }
class RGWHandler_Auth_S3 : public RGWHandler_ObjStore { }

This basically ties usage statistics collection + authentication
handling + object store all together.

I think this needs to be completely unravelled, but before drawing up all
kinds of use cases (like usage statistics collection or
authentication in this case) it might be wise to know what the design
decisions were that made the S3 API so tightly integrated into
everything else. Or is this just legacy?

Roald
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Andreas Bluemle
Hi,

I have added the patch and re-tested: I still encounter
hangs of my application. I am not quite sure whether I hit
the same error on shutdown, because now I don't hit the
error every time, but only every now and then.

When adding the patch to my code base (git tag v1.0.17) I noticed
an offset of "-34 lines". Which code base are you using?


Best Regards

Andreas Bluemle

On Tue, 20 Aug 2013 09:21:13 +0200
Andreas Bluemle  wrote:

> Hi Sean,
> 
> I will re-check by the end of the week; there is
> some test scheduling issue with our test system, which
> affects my access times.
> 
> Thanks
> 
> Andreas
> 
> 
> On Mon, 19 Aug 2013 17:10:11 +
> "Hefty, Sean"  wrote:
> 
> > Can you see if the patch below fixes the hang?
> > 
> > Signed-off-by: Sean Hefty 
> > ---
> >  src/rsocket.c |   11 ++-
> >  1 files changed, 10 insertions(+), 1 deletions(-)
> > 
> > diff --git a/src/rsocket.c b/src/rsocket.c
> > index d544dd0..e45b26d 100644
> > --- a/src/rsocket.c
> > +++ b/src/rsocket.c
> > @@ -2948,10 +2948,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
> > 
> >          rs = idm_lookup(&idm, fds[i].fd);
> >          if (rs) {
> > +            fastlock_acquire(&rs->cq_wait_lock);
> >              if (rs->type == SOCK_STREAM)
> >                  rs_get_cq_event(rs);
> >              else
> >                  ds_get_cq_event(rs);
> > +            fastlock_release(&rs->cq_wait_lock);
> >              fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
> >          } else {
> >              fds[i].revents = rfds[i].revents;
> > @@ -3098,7 +3100,8 @@ int rselect(int nfds, fd_set *readfds, fd_set *writefds,
> > 
> >  /*
> >   * For graceful disconnect, notify the remote side that we're
> > - * disconnecting and wait until all outstanding sends complete.
> > + * disconnecting and wait until all outstanding sends complete, provided
> > + * that the remote side has not sent a disconnect message.
> >   */
> >  int rshutdown(int socket, int how)
> >  {
> > @@ -3138,6 +3141,12 @@ int rshutdown(int socket, int how)
> >      if (rs->state & rs_connected)
> >          rs_process_cq(rs, 0, rs_conn_all_sends_done);
> >  
> > +    if (rs->state & rs_disconnected) {
> > +        /* Generate event by flushing receives to unblock rpoll */
> > +        ibv_req_notify_cq(rs->cm_id->recv_cq, 0);
> > +        rdma_disconnect(rs->cm_id);
> > +    }
> > +
> >      if ((rs->fd_flags & O_NONBLOCK) && (rs->state & rs_connected))
> >          rs_set_nonblocking(rs, rs->fd_flags);
> >  
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-rdma" in the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
> 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Review request : Erasure Code plugin loader implementation

2013-08-20 Thread Loic Dachary
Hi Sage,

I created "erasure code : convenience functions to code / decode" 
http://tracker.ceph.com/issues/6064 to implement the suggested functions. 
Please let me know if this should be merged with another task.

Cheers

On 19/08/2013 17:06, Loic Dachary wrote:
> 
> 
> On 19/08/2013 02:01, Sage Weil wrote:
>> On Sun, 18 Aug 2013, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> Unless I misunderstood something ( which is still possible at this stage 
>>> ;-) decode() is used both for recovery of missing chunks and retrieval of 
>>> the original buffer. Decoding the M data chunks is a special case of 
>>> decoding N <= M chunks out of the M+K chunks that were produced by 
>>> encode(). It can be used to recover parity chunks as well as data chunks.
>>>
>>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
>>>
>>> map<int, buffer> decode(const set<int> &want_to_read, const map<int, buffer> &chunks)
>>>
>>> decode chunks to read the content of the want_to_read chunks and return 
>>> a map associating the chunk number with its decoded content. For instance, 
>>> in the simplest case M=2,K=1 for an encoded payload of data A and B with 
>>> parity Z, calling
>>>
>>> decode([1,2], { 1 => 'A', 2 => 'B', 3 => 'Z' })
>>> => { 1 => 'A', 2 => 'B' }
>>>
>>> If however, the chunk B is to be read but is missing it will be:
>>>
>>> decode([2], { 1 => 'A', 3 => 'Z' })
>>> => { 2 => 'B' }
>>
>> Ah, I guess this works when some of the chunks contain the original 
>> data (as with a parity code).  There are codes that don't work that way, 
>> although I suspect we won't use them.
>>
>> Regardless, I wonder if we should generalize slightly and have some 
>> methods work in terms of (offset,length) of the original stripe to 
>> generalize that bit.  Then we would have something like
>>
>>  map<int, buffer> transcode(const set<int> &want_to_read, const map<int, buffer>& chunks);
>>
>> to go from chunks -> chunks (as we would want to do with, say, a LRC-like 
>> code where we can rebuild some shards from a subset of the other shards).  
>> And then also have
>>
>>  int decode(const map<int, buffer>& chunks, unsigned offset, 
>>  unsigned len, bufferlist *out);
> 
> This function would be implemented more or less as:
> 
>   set<int> want_to_read = range_to_chunks(offset, len) // compute what chunks must be retrieved
>   set<int> available = the up set
>   set<int> minimum = minimum_to_decode(want_to_read, available);
>   map<int, buffer> available_chunks = retrieve_chunks_from_osds(minimum);
>   map<int, buffer> chunks = transcode(want_to_read, available_chunks); // repairs if necessary
>   out = bufferptr(concat_chunks(chunks), offset - offset of the first chunk, len)
> 
> or do you have something else in mind ?
> 
>>
>> that recovers the original data.
>>
>> In our case, the read path would use decode, and for recovery we would use 
>> transcode.  
>>
>> We'd also want to have alternate minimum_to_decode* methods, like
>>
>> virtual set<int> minimum_to_decode(unsigned offset, unsigned len, const 
>>  set<int> &available_chunks) = 0;
> 
> I also have a convenience wrapper in mind for this but I feel I'm missing 
> something.
> 
> Cheers
> 
>>
>> What do you think?
>>
>> sage
>>
>>
>>
>>
>>>
>>> Cheers
>>>
>>> On 18/08/2013 19:34, Sage Weil wrote:
 On Sun, 18 Aug 2013, Loic Dachary wrote:
> Hi Ceph,
>
> I've implemented a draft of the Erasure Code plugin loader in the context 
> of http://tracker.ceph.com/issues/5878. It has a trivial unit test and an 
> example plugin. It would be great if someone could do a quick review. The 
> general idea is that the erasure code pool calls something like:
>
> ErasureCodePlugin::factory(&erasure_code, "example", parameters)
>
> as shown at
>
> https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/test/osd/TestErasureCode.cc#L28
>
> to get an object implementing the interface
>
> https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/osd/ErasureCodeInterface.h
>
> which matches the proposal described at
>
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
>
> The draft is at
>
> https://github.com/ceph/ceph/commit/5a2b1d66ae17b78addc14fee68c73985412f3c8c
>
> Thanks in advance :-)

 I haven't been following this discussion too closely, but taking a look 
 now, the first 3 make sense, but

 virtual map<int, buffer> decode(const set<int> &want_to_read, const 
 map<int, buffer> &chunks) = 0;

 it seems like this one should be more like

 virtual int decode(const map<int, buffer> &chunks, bufferlist *out);

 As in, you'd decode the chunks you have to get the actual data.  If you 
 want to get (missing) chunks for recovery, you'd do

   minimum_to_decode(...);  // see what we need
   
   decode(...);   // reconst
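To make the decode() semantics from the exchange above concrete, here is a toy, self-contained C++ sketch. It is not Ceph code: the buffer type is simplified to std::string, the layout is the M=2, K=1 XOR-parity case from the quoted example, and the helper names are made up.

// Toy illustration (not Ceph code) of decode() returning requested chunks,
// rebuilding a missing one from the others. Chunks 1 and 2 hold data,
// chunk 3 holds parity = chunk1 XOR chunk2.
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <string>

using Chunks = std::map<int, std::string>;

// XOR two equally sized chunks.
static std::string xor_chunks(const std::string &a, const std::string &b) {
  std::string out(a.size(), '\0');
  for (size_t i = 0; i < a.size(); ++i)
    out[i] = static_cast<char>(a[i] ^ b[i]);
  return out;
}

// Return the chunks in want_to_read, recovering a missing one by XOR-ing
// the two chunks that are available (only valid for this M=2, K=1 toy case).
static Chunks decode(const std::set<int> &want_to_read, const Chunks &chunks) {
  Chunks result;
  for (int i : want_to_read) {
    auto it = chunks.find(i);
    if (it != chunks.end()) {
      result[i] = it->second;                        // chunk is available as-is
    } else {
      auto a = chunks.begin(), b = std::next(a);     // the two surviving chunks
      result[i] = xor_chunks(a->second, b->second);  // rebuild the missing one
    }
  }
  return result;
}

int main() {
  Chunks stored = { {1, "A"}, {3, xor_chunks("A", "B")} };  // chunk 2 ("B") was lost
  Chunks out = decode({2}, stored);
  assert(out[2] == "B");   // decode([2], { 1 => 'A', 3 => 'Z' }) => { 2 => 'B' }
  return 0;
}

Running it recovers chunk 2 ('B') from chunk 1 ('A') and the parity chunk, which is exactly the decode([2], { 1 => 'A', 3 => 'Z' }) => { 2 => 'B' } case described above.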

Erasure Code plugin system with an example : review request

2013-08-20 Thread Loic Dachary
Hi Ceph,

Yesterday I implemented a simple erasure code plugin that can sustain the loss 
of a single chunk.

https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeExample.h

and it works as shown in the unit test

https://github.com/dachary/ceph/blob/wip-5878/src/test/osd/TestErasureCodeExample.cc

It would be of limited use in a production environment because it only saves 
25% space ( M=2 K=1 ) over a 2 replica pool, but it would work.
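As a quick sanity check of the space figure above (illustrative arithmetic only, not part of the plugin): an (M, K) layout stores (M+K)/M raw bytes per byte of data, so M=2, K=1 gives 1.5x raw versus 2.0x for a 2-replica pool, i.e. a 25% saving.

// Back-of-the-envelope check of erasure-code overhead vs 2x replication.
#include <cstdio>

int main() {
  struct Layout { int m, k; };
  const Layout layouts[] = { {2, 1}, {4, 2}, {10, 4} };
  for (const Layout &l : layouts) {
    double raw_per_byte = double(l.m + l.k) / l.m;   // raw bytes stored per byte of data
    double saved_vs_2x = (2.0 - raw_per_byte) / 2.0; // saving relative to a 2-replica pool
    std::printf("M=%d K=%d -> %.2fx raw, %.0f%% saved vs 2x replication\n",
                l.m, l.k, raw_per_byte, saved_vs_2x * 100.0);
  }
  return 0;
}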

I would very much appreciate a review of the erasure code plugin system and the 
associated example plugin :

https://github.com/ceph/ceph/pull/515

When it's good enough, creating a jerasure plugin will be next :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



signature.asc
Description: OpenPGP digital signature


Re: RGW blueprint for plugin architecture

2013-08-20 Thread Yehuda Sadeh
On Tue, Aug 20, 2013 at 1:58 AM, Roald van Loon  wrote:
> On Tue, Aug 20, 2013 at 2:58 AM, Yehuda Sadeh  wrote:
>> Well, practically I'd like to have such work doing baby steps, rather
>> than sweeping changes. Such changes have higher chances of getting
>> completed and eventually merged upstream.  That's why I prefer the
>> current model of directly linking the plugins (whether statically or
>> dynamically), with (relatively) minor internal adjustments.
>
> What current model of "directly linking plugins" do you refer to exactly?

I was referring to your work at wip-rgw-plugin, where the plugin code
itself still needs to rely on the rgw utility code.

>
>> Maybe start with thinking about the use cases, and then figure what
>> kind of api that would be. As I said, I'm not sure that an internal
>> api is the way to go, but rather exposing some lower level
>> functionality externally. The big difference is that with the former
>> we tie in the internal architecture, while the latter hides the [gory]
>> details.
>
> The problem is that right now basically everything is 'lower level
> functionality', because a lot of generic stuff depends on S3 stuff,
> which in turn depends on generic stuff. Take for example the
> following:
>
> class RGWHandler_Usage : public RGWHandler_Auth_S3 { }
> class RGWHandler_Auth_S3 : public RGWHandler_ObjStore { }
>
> This basically ties usage statistics collection + authentication
> handling + object store all together.

That's not quite a hard dependency. At the moment it's like that, as
we made a decision to use the S3 auth for the admin utilities.
Switching to a different auth system (atm) would require defining a
new auth class and inheriting from it instead. It's not very flexible,
but it's not very intrusive.
I'd certainly be interested in removing this inheritance relationship
and switching to a different pipeline model.

>
> I think this needs to be completely unravelled, but before drawing up all
> kinds of use cases (like usage statistics collection or
> authentication in this case) it might be wise to know what the design
> decisions were that made the S3 API so tightly integrated into
> everything else. Or is this just legacy?
>

As I said, I don't see it as such. We do use it all over the place,
but in the same way you could just switch these to use
RGWHandler_Auth_Swift and it should work (give or take a few tweaks).

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ceph-users] Help needed porting Ceph to RSockets

2013-08-20 Thread Hefty, Sean
> I have added the patch and re-tested: I still encounter
> hangs of my application. I am not quite sure whether I hit
> the same error on shutdown, because now I don't hit the
> error every time, but only every now and then.

I guess this is at least some progress... :/
 
> When adding the patch to my code base (git tag v1.0.17) I noticed
> an offset of "-34 lines". Which code base are you using?

This patch was generated against the tip of the git tree. 
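For context on the pattern being chased here, a minimal sketch of the situation the patch targets, assuming the standard rsockets API from <rdma/rsocket.h>; this is an illustration of the race, not Ceph's messenger code: one thread is parked in rpoll() on a connected rsocket while another thread tears the connection down with rshutdown().

// Illustrative only: one thread blocked in rpoll() while another calls rshutdown().
#include <poll.h>
#include <sys/socket.h>
#include <thread>
#include <rdma/rsocket.h>

// Thread A: waits for incoming data, as a messenger read thread would.
static void reader(int fd) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  // The reported hang is this call never waking up when the peer goes away
  // while the local side is shutting down.
  rpoll(&pfd, 1, -1);
}

// Thread B: tears the connection down. The patch above makes rshutdown()
// request a CQ notification and flush receives so a concurrent rpoll()
// gets an event and returns.
static void closer(int fd) {
  rshutdown(fd, SHUT_RDWR);
  rclose(fd);
}

int main() {
  // Placeholder: create a connected rsocket with rsocket()/rconnect() to
  // actually exercise the race; we bail out here to keep the sketch inert.
  int fd = -1;
  if (fd < 0)
    return 0;
  std::thread a(reader, fd);
  std::thread b(closer, fd);
  a.join();
  b.join();
  return 0;
}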

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


libvirt: Removing RBD volumes with snapshots, auto purge or not?

2013-08-20 Thread Wido den Hollander

Hi,

The current [0] libvirt storage pool code simply calls "rbd_remove" 
without anything else.


As far as I know rbd_remove will fail if the image still has snapshots, 
you have to remove those snapshots first before you can remove the image.


The problem is that libvirt's storage pools do not support listing 
snapshots, so we can't integrate that.


Libvirt however has a flag you can pass down to tell you want the device 
to be zeroed.


The normal procedure is that the device is filled with zeros before 
actually removing it.


I was thinking about "abusing" this flag to use it as a snap purge for RBD.

So a regular volume removal will call only rbd_remove, but when the flag 
VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots 
prior to calling rbd_remove.


Another way would be to always purge snapshots, but I'm afraid that 
could make somebody very unhappy at some point.


Currently "virsh" doesn't support flags, but that could be fixed in a 
different patch.


Does my idea sound sane?

[0]: 
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407
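For reference, a rough sketch of what a purge-before-remove path could look like in terms of the public librbd C API. This is illustrative only, not the proposed libvirt change: error handling, the flag plumbing and the libvirt pool/ioctx wiring are all left out.

// Purge every snapshot of an image, then remove the image itself.
#include <cerrno>
#include <vector>
#include <rados/librados.h>
#include <rbd/librbd.h>

static int purge_and_remove(rados_ioctx_t ioctx, const char *name) {
  rbd_image_t image;
  int r = rbd_open(ioctx, name, &image, NULL);
  if (r < 0)
    return r;

  int max_snaps = 16;
  std::vector<rbd_snap_info_t> snaps(max_snaps);
  r = rbd_snap_list(image, snaps.data(), &max_snaps);
  if (r == -ERANGE) {                 // buffer too small; max_snaps now holds the count
    snaps.resize(max_snaps);
    r = rbd_snap_list(image, snaps.data(), &max_snaps);
  }
  if (r >= 0) {                       // r is the number of snapshots returned
    for (int i = 0; i < r; ++i)
      rbd_snap_remove(image, snaps[i].name);
    rbd_snap_list_end(snaps.data());
  }
  rbd_close(image);
  return rbd_remove(ioctx, name);     // should now succeed, no snapshots left
}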


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?

2013-08-20 Thread Andrey Korolyov
On Tue, Aug 20, 2013 at 7:36 PM, Wido den Hollander  wrote:
> Hi,
>
> The current [0] libvirt storage pool code simply calls "rbd_remove" without
> anything else.
>
> As far as I know rbd_remove will fail if the image still has snapshots, you
> have to remove those snapshots first before you can remove the image.
>
> The problem is that libvirt's storage pools do not support listing
> snapshots, so we can't integrate that.
>
> Libvirt however has a flag you can pass down to tell you want the device to
> be zeroed.
>
> The normal procedure is that the device is filled with zeros before actually
> removing it.
>
> I was thinking about "abusing" this flag to use it as a snap purge for RBD.
>
> So a regular volume removal will call only rbd_remove, but when the flag
> VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots prior to
> calling rbd_remove.
>
> Another way would be to always purge snapshots, but I'm afraid that could
> make somebody very unhappy at some point.
>
> Currently "virsh" doesn't support flags, but that could be fixed in a
> different patch.
>
> Does my idea sound sane?
>
> [0]:
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407
>
> --
> Wido den Hollander
> 42on B.V.

Hi Wido,


You mentioned not so long ago the same idea I had about a year
and a half ago: placing memory dumps along with the regular
snapshot in Ceph using libvirt mechanisms. That sounds pretty nice,
since we'll have something other than qcow2 with the same snapshot
functionality, but your current proposal does not extend to this.
Placing a custom side hook seems much more extensible than putting
the snap purge behind a specific flag.

>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RGW blueprint for plugin architecture

2013-08-20 Thread Roald van Loon
On Tue, Aug 20, 2013 at 4:49 PM, Yehuda Sadeh  wrote:
> I was referring to your work at wip-rgw-plugin, where the plugin code
> itself still needs to rely on the rgw utility code.

Right. So we can agree on ditching the dynamic loading thing and clean
internal API (for now), but at least start separating code into
"plugins" like this?

> That's not quite a hard dependency. At the moment it's like that, as
> we made a decision to use the S3 auth for the admin utilities.
> Switching to a different auth system (atm) would require defining a
> new auth class and inheriting from it instead. It's not very flexible,
> but it's not very intrusive.
> I'd certainly be interested in removing this inheritance relationship
> and switching to a different pipeline model.

I don't know if you looked at it in detail, but for the wip-rgw-plugin
work I created a RGWAuthManager / RGWAuthPipeline relation to
segregate authentication-specific stuff from the REST handlers. Is
that in general a model you like to see discussed in more detail? If
so, it would probably be wise to start a separate blueprint for it.

> As I said, I don't see it as such. We do use it all over the place,
> but in the same way you could just switch these to use
> RGWHandler_Auth_Swift and it should work (give or take a few tweaks).

IMHO, REST handlers should leave
authentication/authorization/accounting specific tasks to a separate
component (like the aforementioned pipelining system, and maybe
integrate that with current RGWUser related code), although this will
likely never be purely abstracted (at least for authentication). This
just makes the whole system more modular (albeit just a bit).
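To make the pipeline idea a bit more tangible, here is a minimal sketch of the kind of separation being proposed. The name RGWAuthPipeline comes from this thread, but the methods, request structure and stage interface are invented for the example and are not the wip-rgw-plugin code.

// Sketch: REST handlers delegate auth to a pipeline of stages instead of
// inheriting from an S3-specific handler class.
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct RGWAuthRequest {            // whatever the handler extracted from the HTTP request
  std::string access_key;
  std::string signature;
};

class RGWAuthStage {               // one step: S3 signature check, Swift token check, ACLs, ...
public:
  virtual ~RGWAuthStage() {}
  virtual int authenticate(const RGWAuthRequest &req) = 0;   // 0 = ok, <0 = deny
};

class RGWAuthPipeline {            // ordered chain of stages shared by all handlers
public:
  void append(std::unique_ptr<RGWAuthStage> stage) { stages.push_back(std::move(stage)); }
  int run(const RGWAuthRequest &req) {
    for (auto &s : stages) {
      int r = s->authenticate(req);
      if (r < 0)
        return r;                  // first stage to reject wins
    }
    return 0;
  }
private:
  std::vector<std::unique_ptr<RGWAuthStage>> stages;
};

// Dummy stage: accept any request that carries a non-empty access key.
class AllowNonEmptyKey : public RGWAuthStage {
public:
  int authenticate(const RGWAuthRequest &req) override {
    return req.access_key.empty() ? -1 : 0;
  }
};

int main() {
  RGWAuthPipeline pipeline;
  pipeline.append(std::unique_ptr<RGWAuthStage>(new AllowNonEmptyKey));
  RGWAuthRequest req{"AKIAEXAMPLE", "sig"};
  return pipeline.run(req) == 0 ? 0 : 1;
}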

But for now I propose to implement a small "plugin" system where
plugins are still linked into the rgw core (but code wise as much
separated as possible), and keep the auth stuff for later.

Roald
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?

2013-08-20 Thread Wido den Hollander

On 08/20/2013 05:43 PM, Andrey Korolyov wrote:

> On Tue, Aug 20, 2013 at 7:36 PM, Wido den Hollander  wrote:
>> Hi,
>>
>> The current [0] libvirt storage pool code simply calls "rbd_remove" without
>> anything else.
>>
>> As far as I know rbd_remove will fail if the image still has snapshots, you
>> have to remove those snapshots first before you can remove the image.
>>
>> The problem is that libvirt's storage pools do not support listing
>> snapshots, so we can't integrate that.
>>
>> Libvirt however has a flag you can pass down to tell you want the device to
>> be zeroed.
>>
>> The normal procedure is that the device is filled with zeros before actually
>> removing it.
>>
>> I was thinking about "abusing" this flag to use it as a snap purge for RBD.
>>
>> So a regular volume removal will call only rbd_remove, but when the flag
>> VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots prior to
>> calling rbd_remove.
>>
>> Another way would be to always purge snapshots, but I'm afraid that could
>> make somebody very unhappy at some point.
>>
>> Currently "virsh" doesn't support flags, but that could be fixed in a
>> different patch.
>>
>> Does my idea sound sane?
>>
>> [0]:
>> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=e3340f63f412c22d025f615beb7cfed25f00107b;hb=master#l407
>>
>> --
>> Wido den Hollander
>> 42on B.V.


> Hi Wido,
>
> You mentioned not so long ago the same idea I had about a year
> and a half ago: placing memory dumps along with the regular
> snapshot in Ceph using libvirt mechanisms. That sounds pretty nice,
> since we'll have something other than qcow2 with the same snapshot
> functionality, but your current proposal does not extend to this.


Correct, since this is about the storage pool support, that is something 
completely different.



> Placing a custom side hook seems much more extensible than putting
> the snap purge behind a specific flag.



My proposal is a bit selfish, since I'm running into this with 
CloudStack. CloudStack now has a work-around for RBD since images could 
still have snapshots where other storage types are handled by libvirt.


I want to have it all handled by libvirt.

Wido



Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


app design recommendations

2013-08-20 Thread Nulik Nol
Hi,
I am creating an email system which will handle a whole company's email,
mostly internal mail. There will be thousands of companies and
hundreds of users per company, so I am planning to use one pool per
company to store email messages. Can Ceph manage thousands or maybe
hundreds of thousands of pools? Could there be any slowdown in production
with such a design after some growth?

Every email will be stored as an individual Ceph object (emails will
average 512 bytes and rarely have attachments). Is it OK to store
them as individual objects, or will that be less efficient than packing
multiple emails into one object? At what object size does storing an
object individually become preferable to writing it through omap with
leveldb? (kind of a "ceph object vs omap" benchmark question)

Also, I will be putting mini-chat sessions between users in a Ceph
object: each time a user sends a message to another user, I will
append the text to the object. Will Ceph rewrite the whole object to
a new physical location on disk when I do an append, or will it just
rewrite the block that was modified?

And the last questions: which is faster, storing small key/value pairs in
omap or in xattrs? Will storing key/value pairs in xattrs waste space
by allocating a block for a zero-sized object on the OSD? (I won't
write any data to the object, just use xattrs.)
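For readers weighing these options, here is a minimal sketch of the three interfaces being compared (append to an object, omap, xattrs) using the librados C++ API. The pool and object names are made up, and no claim is made here about which is faster.

// Hedged sketch of the interfaces in question; names are invented examples.
#include <map>
#include <string>
#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(NULL);                          // default client id
  cluster.conf_read_file(NULL);                // read the default ceph.conf
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx io;
  cluster.ioctx_create("company-12345", io);   // hypothetical per-company pool

  // 1) One email per object, written as an appended blob.
  librados::bufferlist mail;
  mail.append("From: alice\nTo: bob\n\nhi");
  io.append("mail.20130820.0001", mail, mail.length());

  // 2) Small key/value pairs in omap (kept in the OSD's leveldb).
  librados::ObjectWriteOperation op;
  std::map<std::string, librados::bufferlist> kv;
  kv["subject"].append("hi");
  op.omap_set(kv);
  io.operate("mail.20130820.0001", &op);

  // 3) The same pair stored as an xattr on the object.
  librados::bufferlist val;
  val.append("hi");
  io.setxattr("mail.20130820.0001", "subject", val);

  cluster.shutdown();
  return 0;
}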

Will appreciate very much your comments.

Best Regards
Nulik
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libvirt: Removing RBD volumes with snapshots, auto purge or not?

2013-08-20 Thread Josh Durgin

On 08/20/2013 08:36 AM, Wido den Hollander wrote:

> Hi,
>
> The current [0] libvirt storage pool code simply calls "rbd_remove"
> without anything else.
>
> As far as I know rbd_remove will fail if the image still has snapshots,
> you have to remove those snapshots first before you can remove the image.
>
> The problem is that libvirt's storage pools do not support listing
> snapshots, so we can't integrate that.


libvirt's storage pools don't have any concept of snapshots, which is
the real problem. Ideally they would have functions to at least create,
list and delete snapshots (and probably rollback and create a volume 
from a snapshot too).



> Libvirt however has a flag you can pass down to tell you want the device
> to be zeroed.
>
> The normal procedure is that the device is filled with zeros before
> actually removing it.
>
> I was thinking about "abusing" this flag to use it as a snap purge for RBD.
>
> So a regular volume removal will call only rbd_remove, but when the flag
> VIR_STORAGE_VOL_DELETE_ZEROED is passed it will purge all snapshots
> prior to calling rbd_remove.


I don't think we should reinterpret the flag like that. A new flag
for that purpose could work, but since libvirt storage pools don't
manage snapshots at all right now I'd rather CloudStack delete the
snapshots via librbd, since it's the service creating them in this case.
You could see what the libvirt devs think about a new flag though.


> Another way would be to always purge snapshots, but I'm afraid that
> could make somebody very unhappy at some point.


I agree this would be too unsafe for a default. It seems that's what
the LVM storage pool does now, maybe because it doesn't expect
snapshots to be used.


Currently "virsh" doesn't support flags, but that could be fixed in a
different patch.


No backend actually uses the flags yet either.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need some help with the RBD Java bindings

2013-08-20 Thread Noah Watkins
Wido,

I pushed up a patch to

   
https://github.com/ceph/rados-java/commit/ca16d82bc5b596620609880e429ec9f4eaa4d5ce

That includes a fix for this problem. The fix is a bit hacky, but the
tests pass now. I included more details about the hack in the code.

On Thu, Aug 15, 2013 at 9:57 AM, Noah Watkins  wrote:
> On Thu, Aug 15, 2013 at 8:51 AM, Wido den Hollander  wrote:
>>
>> public List snapList() throws RbdException {
>> IntByReference numSnaps = new IntByReference(16);
>> PointerByReference snaps = new PointerByReference();
>> List list = new ArrayList();
>> RbdSnapInfo snapInfo, snapInfos[];
>>
>> while (true) {
>> int r = rbd.rbd_snap_list(this.getPointer(), snaps, numSnaps);
>
> I think you need to allocate the memory for `snaps` yourself. Here is
> the RBD wrapper for Python which does that:
>
>   self.snaps = (rbd_snap_info_t * num_snaps.value)()
>   ret = self.librbd.rbd_snap_list(image.image, byref(self.snaps),
>byref(num_snaps))
>
> - Noah
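For readers without the librbd headers at hand, the allocation issue discussed above comes from rbd_snap_list()'s calling convention: the caller supplies the rbd_snap_info_t array and its size, and gets -ERANGE (plus the required count) back if the array is too small. A hedged sketch of the usual retry loop against the C API, which is the contract the Java binding has to mirror (illustrative, not the rados-java fix itself):

// Fill `names` with the snapshot names of an open image, growing the array on -ERANGE.
#include <cerrno>
#include <string>
#include <vector>
#include <rbd/librbd.h>

static int list_snap_names(rbd_image_t image, std::vector<std::string> &names) {
  int max_snaps = 16;                        // initial guess, as in the Java code above
  std::vector<rbd_snap_info_t> snaps(max_snaps);
  int r = rbd_snap_list(image, snaps.data(), &max_snaps);
  if (r == -ERANGE) {                        // too small: max_snaps now holds the real count
    snaps.resize(max_snaps);
    r = rbd_snap_list(image, snaps.data(), &max_snaps);
  }
  if (r < 0)
    return r;
  for (int i = 0; i < r; ++i)                // r is the number of snapshots returned
    names.push_back(snaps[i].name);
  rbd_snap_list_end(snaps.data());           // frees the name strings librbd allocated
  return 0;
}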
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


do not upgrade bobtail -> dumpling directly until 0.67.2

2013-08-20 Thread Sage Weil
We've identified a problem when upgrading directly from bobtail to 
dumpling; please wait until 0.67.2 before doing so.

Upgrades from bobtail -> cuttlefish -> dumpling are fine.  It is only the 
long jump between versions that is problematic.

The fix is already in the dumpling branch.  Another point release will be 
out in the next day or two.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RGW blueprint for plugin architecture

2013-08-20 Thread Yehuda Sadeh
On Tue, Aug 20, 2013 at 9:03 AM, Roald van Loon  wrote:
> On Tue, Aug 20, 2013 at 4:49 PM, Yehuda Sadeh  wrote:
>> I was referring to your work at wip-rgw-plugin, where the plugin code
>> itself still needs to rely on the rgw utility code.
>
> Right. So we can agree on ditching the dynamic loading thing and clean
> internal API (for now), but at least start separating code into
> "plugins" like this?
>
>> That's not quite a hard dependency. At the moment it's like that, as
>> we made a decision to use the S3 auth for the admin utilities.
>> Switching to a different auth system (atm) would require defining a
>> new auth class and inheriting from it instead. It's not very flexible,
>> but it's not very intrusive.
>> I'd certainly be interested in removing this inheritance relationship
>> and switching to a different pipeline model.
>
> I don't know if you looked at it in detail, but for the wip-rgw-plugin
> work I created a RGWAuthManager / RGWAuthPipeline relation to
> segregate authentication-specific stuff from the REST handlers. Is
> that in general a model you like to see discussed in more detail? If
> so, it would probably be wise to start a separate blueprint for it.

I didn't look closely at all the details, but yeah, something along
those lines. But it'll need to be clearly defined.

>
>> As I said, I don't see it as such. We do use it all over the place,
>> but in the same way you could just switch these to use
>> RGWHandler_Auth_Swift and it should work (give or take a few tweaks).
>
> IMHO, REST handlers should leave
> authentication/authorization/accounting specific tasks to a separate
> component (like the aforementioned pipelining system, and maybe
> integrate that with current RGWUser related code), although this will
> likely never be purely abstracted (at least for authentication). This
> just makes the whole system more modular (albeit it just a bit).

Can't think of examples off the top of my head right now, but the
devil's always in the details. Hopefully wrt the auth system there
aren't many hidden issues.
>
> But for now I propose to implement a small "plugin" system where
> plugins are still linked into the rgw core (but code wise as much
> separated as possible), and keep the auth stuff for later.
>
Sounds good.

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


New Defects reported by Coverity Scan for ceph (fwd)

2013-08-20 Thread Sage Weil
Coverity picked up some issues with the filestore code.  These are mostly 
old issues that appear new because code moved around, but this is probably 
a good opportunity to fix them... :)

sage--- Begin Message ---


Hi,

Please find the latest report on new defect(s) introduced to ceph found with 
Coverity Scan

Defect(s) Reported-by: Coverity Scan
Showing 7 of 9 defects

** CID 1063704: Uninitialized scalar field (UNINIT_CTOR)
/os/BtrfsFileStoreBackend.cc: 57

** CID 1063703: Time of check time of use (TOCTOU)
/os/GenericFileStoreBackend.cc: 170

** CID 1063702: Time of check time of use (TOCTOU)
/os/BtrfsFileStoreBackend.cc: 246

** CID 1063701: Copy into fixed size buffer (STRING_OVERFLOW)
/os/BtrfsFileStoreBackend.cc: 458

** CID 1063700: Copy into fixed size buffer (STRING_OVERFLOW)
/os/BtrfsFileStoreBackend.cc: 370

** CID 1063699: Resource leak (RESOURCE_LEAK)
/os/BtrfsFileStoreBackend.cc: 345

** CID 1063698: Improper use of negative value (NEGATIVE_RETURNS)



CID 1063704: Uninitialized scalar field (UNINIT_CTOR)

/os/BtrfsFileStoreBackend.h: 25 ( member_decl)
   22  private:
   23  bool has_clone_range;   ///< clone range ioctl is supported
   24  bool has_snap_create;   ///< snap create ioctl is supported
>>> Class member declaration for "has_snap_destroy".
   25  bool has_snap_destroy;  ///< snap destroy ioctl is supported
   26  bool has_snap_create_v2;  ///< snap create v2 ioctl (async!) is supported
   27  bool has_wait_sync; ///< wait sync ioctl is supported
   28  bool stable_commits;
   29  bool m_filestore_btrfs_clone_range;
  

/os/BtrfsFileStoreBackend.cc: 57 ( uninit_member)
   54    GenericFileStoreBackend(fs), has_clone_range(false), has_snap_create(false),
   55    has_snap_create_v2(false), has_wait_sync(false), stable_commits(false),
   56    m_filestore_btrfs_clone_range(g_conf->filestore_btrfs_clone_range),
>>> CID 1063704: Uninitialized scalar field (UNINIT_CTOR)
>>> Non-static class member "has_snap_destroy" is not initialized in this 
>>> constructor nor in any functions that it calls.
   57    m_filestore_btrfs_snap (g_conf->filestore_btrfs_snap) { }
   58
   59  int BtrfsFileStoreBackend::detect_features()
   60  {
   61  int r;
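The usual remedy for an UNINIT_CTOR report like this is simply to give the member a value in the constructor's initializer list. A toy, self-contained illustration of the pattern (made-up type; not necessarily the fix that was committed to BtrfsFileStoreBackend):

// Every bool member gets a value in the initializer list, so nothing is
// left reading garbage and the scanner has nothing to flag.
struct FeatureFlags {
  bool has_clone_range;
  bool has_snap_create;
  bool has_snap_destroy;   // the member the scan flagged as uninitialized

  FeatureFlags()
    : has_clone_range(false),
      has_snap_create(false),
      has_snap_destroy(false) {}
};

int main() {
  FeatureFlags f;
  return f.has_snap_destroy ? 1 : 0;   // always 0 now, never garbage
}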
  

CID 1063703: Time of check time of use (TOCTOU)

/os/GenericFileStoreBackend.cc: 170 ( fs_check_call)
   167  int GenericFileStoreBackend::create_current()
   168  {
   169  struct stat st;
>>> CID 1063703: Time of check time of use (TOCTOU)
>>> Calling function "stat(char const *, stat *)" to perform check on 
>>> "this->get_current_path()->c_str()".
   170  int ret = ::stat(get_current_path().c_str(), &st);
   171  if (ret == 0) {
   172    // current/ exists
   173    if (!S_ISDIR(st.st_mode)) {
   174      dout(0) << "_create_current: current/ exists but is not a directory" << dendl;
  

/os/GenericFileStoreBackend.cc: 178 ( toctou)
   175  ret = -EINVAL;
   176}
   177  } else {
>>> Calling function "mkdir(char const *, __mode_t)" that uses 
>>> "this->get_current_path()->c_str()" after a check function. This can cause 
>>> a time-of-check, time-of-use race condition.
   178    ret = ::mkdir(get_current_path().c_str(), 0755);
   179    if (ret < 0) {
   180      ret = -errno;
   181      dout(0) << "_create_current: mkdir " << get_current_path() << " failed: " << cpp_strerror(ret) << dendl;
   182}
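The common way to quiet this kind of TOCTOU report is to drop the stat()-then-mkdir() sequence and just attempt the mkdir(), treating EEXIST as the already-exists case. A self-contained sketch of that pattern (illustrative only, not the fix that was actually applied in the filestore code):

// Create a directory without a separate existence check.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/stat.h>
#include <sys/types.h>

static int create_dir(const char *path) {
  if (::mkdir(path, 0755) == 0)
    return 0;                          // we created it
  if (errno == EEXIST) {
    struct stat st;
    if (::stat(path, &st) == 0 && S_ISDIR(st.st_mode))
      return 0;                        // already present and a directory
    return -EINVAL;                    // present but not a directory
  }
  int ret = -errno;
  std::fprintf(stderr, "create_dir: mkdir %s failed: %s\n", path, std::strerror(-ret));
  return ret;
}

int main() {
  return create_dir("current") == 0 ? 0 : 1;
}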
  

CID 1063702: Time of check time of use (TOCTOU)

/os/BtrfsFileStoreBackend.cc: 246 ( fs_check_call)
   243  int BtrfsFileStoreBackend::create_current()
   244  {
   245  struct stat st;
>>> CID 1063702: Time of check time of use (TOCTOU)
>>> Calling function "stat(char const *, stat *)" to perform check on 
>>> "this->get_current_path()->c_str()".
   246  int ret = ::stat(get_current_path().c_str(), &st);
   247  if (ret == 0) {
   248    // current/ exists
   249    if (!S_ISDIR(st.st_mode)) {
   250      dout(0) << "create_current: current/ exists but is not a directory" << dendl;
  

/os/BtrfsFileStoreBackend.cc: 288 ( toctou)
   285  }
   286
   287    dout(2) << "create_current: created btrfs subvol " << get_current_path() << dendl;
>>> Calling function "chmod(char const *, __mode_t)" that uses 
>>> "this->get_current_path()->c_str()" after a check function. This can cause 
>>> a time-of-check, time-of-use race condition.
   288  if (::chmod(get_current_path().c_str(), 0755) < 0) {
   289    ret = -errno;
   290    dout(0) << "create_current: failed to chmod " << get_current_path() << " to 0755: "
   291      << cpp_strerror(ret) << dendl;
   292    return ret;
  
___