Re: Speeding up rbd_stat() in libvirt

2016-01-04 Thread Wido den Hollander


On 04-01-16 16:38, Jason Dillaman wrote:
> Short term, assuming there wouldn't be an objection from the libvirt 
> community, I think spawning a thread pool and concurrently executing several 
> rbd_stat calls would be the easiest and cleanest solution.  I 
> wouldn't suggest trying to roll your own solution for retrieving image sizes 
> for format 1 and 2 RBD images directly within libvirt.
> 

I'll ask in the libvirt community if they allow such a thing.
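For reference, the thread pool variant would boil down to something like
this with the existing librbd C API (purely a sketch: one pthread per
image, no cap on concurrency, error handling trimmed, and the
rados_ioctx_t plus the rbd_list() results are assumed to be set up by the
caller):

#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <rbd/librbd.h>

struct stat_job {
    rados_ioctx_t ioctx;   /* pool I/O context, opened elsewhere */
    const char *name;      /* image name, e.g. from rbd_list() */
    uint64_t size;         /* filled in by the worker */
    int ret;
};

static void *stat_worker(void *arg)
{
    struct stat_job *job = arg;
    rbd_image_t image;
    rbd_image_info_t info;

    job->ret = rbd_open(job->ioctx, job->name, &image, NULL);
    if (job->ret < 0)
        return NULL;

    job->ret = rbd_stat(image, &info, sizeof(info));
    if (job->ret == 0)
        job->size = info.size;

    rbd_close(image);
    return NULL;
}

int stat_images_parallel(struct stat_job *jobs, size_t count)
{
    pthread_t *threads = calloc(count, sizeof(*threads));
    size_t i;

    if (!threads)
        return -ENOMEM;
    for (i = 0; i < count; i++)
        pthread_create(&threads[i], NULL, stat_worker, &jobs[i]);
    for (i = 0; i < count; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    return 0;
}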

> Longer term, given this use case, perhaps it would make sense to add an async 
> version of rbd_open.  The rbd_stat call itself just reads the data from 
> memory initialized by rbd_open.  On the Jewel branch, librbd has had some 
> major rework and image loading is asynchronous under the hood already.
> 

Hmm, that would be nice. In the callback I could call rbd_stat() and
populate the volume list within libvirt.

I would very much like to go that route since it saves me a lot of code
inside libvirt ;)

Wido



Speeding up rbd_stat() in libvirt

2015-12-28 Thread Wido den Hollander
Hi,

The storage pools of libvirt know a mechanism called 'refresh' which
will scan a storage pool to refresh the contents.

The current implementation does:
* List all images via rbd_list()
* Call rbd_stat() on each image

Source:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329

This works, but an RBD pool with 10k images takes a couple of minutes to
scan.

Now, Ceph is distributed, so this could be done in parallel, but before
I start on this I was wondering if somebody had a good idea to fix this?

I don't know if it is allowed in libvirt to spawn multiple threads and
have workers do this, but it was something which came to mind.

libvirt only wants to know the size of an image, and this is not stored in
the rbd_directory object, so the rbd_stat() call is required.

Suggestions or ideas? I would like this process to be as fast as
possible.

Wido


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:20 PM, Josh Durgin wrote:
> On 12/21/2015 11:00 AM, Wido den Hollander wrote:
>> My discard code now works, but I wanted to verify. If I understand Jason
>> correctly it would be a matter of figuring out the 'order' of a image
>> and call rbd_discard in a loop until you reach the end of the image.
> 
> You'd need to get the order via rbd_stat(), convert it to object size
> (i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count().
> 
> Then do the discards in (object size * stripe_count) chunks. This
> ensures you discard entire objects. This is the size you'd want to use
> for import/export as well, ideally.
> 

Thanks! I just implemented this, could you take a look?

https://github.com/wido/libvirt/commit/b07925ad50fdb6683b5b21deefceb0829a7842dc
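For reference, the chunked discard Josh describes boils down to roughly
this (a sketch against librbd's C API; error handling is trimmed and the
open image handle is assumed):

#include <stdint.h>
#include <rbd/librbd.h>

/* Discard a whole image in (object size * stripe count) sized chunks so
 * that entire backing objects can be removed instead of zeroed. */
int rbd_wipe_by_discard(rbd_image_t image)
{
    rbd_image_info_t info;
    uint64_t stripe_count, chunk, offset = 0;
    int r;

    r = rbd_stat(image, &info, sizeof(info));
    if (r < 0)
        return r;

    r = rbd_get_stripe_count(image, &stripe_count);
    if (r < 0)
        return r;

    chunk = info.obj_size * stripe_count;   /* obj_size == 1 << order */

    while (offset < info.size) {
        uint64_t len = info.size - offset;
        if (len > chunk)
            len = chunk;
        r = rbd_discard(image, offset, len);
        if (r < 0)
            return r;
        offset += len;
    }
    return 0;
}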

>> I just want libvirt to be as feature complete as possible when it comes
>> to RBD.
> 
> I see, makes sense.
> 
> Josh


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: RBD performance with many children and snapshots

2015-12-22 Thread Wido den Hollander
On 12/21/2015 11:51 PM, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much effect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-<timestamp>', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 
> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

I gave it a shot. Callback functions are a bit new to me, but here is my
attempt:
https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223

Could you take a look?

> Josh


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: RBD performance with many children and snapshots

2015-12-22 Thread Wido den Hollander


On 21-12-15 23:51, Josh Durgin wrote:
> On 12/21/2015 11:06 AM, Wido den Hollander wrote:
>> Hi,
>>
>> While implementing the buildvolfrom method in libvirt for RBD I'm stuck
>> at some point.
>>
>> $ virsh vol-clone --pool myrbdpool image1 image2
>>
>> This would clone image1 to a new RBD image called 'image2'.
>>
>> The code I've written now does:
>>
>> 1. Create a snapshot called image1@libvirt-
>> 2. Protect the snapshot
>> 3. Clone the snapshot to 'image1'
>>
>> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
>> rbdpool image1 image2
>> Vol image2 cloned from image1
>>
>> wido@wido-desktop:~/repos/libvirt$
>>
>> root@alpha:~# rbd -p libvirt info image2
>> rbd image 'image2':
>> size 10240 MB in 2560 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.1976451ead36b
>> format: 2
>> features: layering, striping
>> flags:
>> parent: libvirt/image1@libvirt-1450724650
>> overlap: 10240 MB
>> stripe unit: 4096 kB
>> stripe count: 1
>> root@alpha:~#
>>
>> But this could potentially lead to a lot of snapshots with children on
>> 'image1'.
>>
>> image1 itself will probably never change, but I'm wondering about the
>> negative performance impact this might have on a OSD.
> 
> Creating them isn't so bad, more snapshots that don't change don't have
> much effect on the osds. Deleting them is what's expensive, since the
> osds need to scan the objects to see which ones are part of the
> snapshot and can be deleted. If you have too many snapshots created and
> deleted, it can affect cluster load, so I'd rather avoid always
> creating a snapshot.
> 
>> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
>> into libvirt. There is however no way to pass something like a snapshot
>> name in libvirt when cloning.
>>
>> Any bright suggestions? Or is it fine to create so many snapshots?
> 
> You could have canonical names for the libvirt snapshots like you
> suggest, 'libvirt-<timestamp>', and check via rbd_diff_iterate2()
> whether the parent image changed since the last snapshot. That's a bit
> slower than plain cloning, but with object map + fast diff it's fast
> again, since it doesn't need to scan all the objects anymore.
> 

I'll give that a try, seems like a good suggestion!

I'll have to use rbd_diff_iterate() though, since iterate2() is
post-hammer and will not be available on all systems.
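Roughly, with the older call, that check could look like this (sketch
only; the callback and the return convention are mine, not libvirt's):

#include <stddef.h>
#include <stdint.h>
#include <rbd/librbd.h>

/* Called for every extent that differs from the snapshot; we only care
 * whether anything differs at all. */
static int diff_cb(uint64_t offset, size_t len, int exists, void *arg)
{
    (void)offset; (void)len; (void)exists;
    *(int *)arg = 1;
    return 0;
}

/* Returns 1 if the image changed since snap_name, 0 if not, <0 on error. */
int image_changed_since(rbd_image_t image, const char *snap_name)
{
    rbd_image_info_t info;
    int changed = 0;
    int r;

    r = rbd_stat(image, &info, sizeof(info));
    if (r < 0)
        return r;

    r = rbd_diff_iterate(image, snap_name, 0, info.size, diff_cb, &changed);
    if (r < 0)
        return r;
    return changed;
}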

> I think libvirt would need to expand its api a bit to be able to really
> use it effectively to manage rbd. Hiding the snapshots becomes
> cumbersome if the application wants to use them too. If libvirt's
> current model of clones lets parents be deleted before children,
> that may be a hassle to hide too...
> 

Yes, I would love to see:

- vol-snap-list
- vol-snap-create
- vol-snap-delete
- vol-snap-revert

And then:

- vol-clone --snapshot <snapshot> --pool <pool> image1 image2

But this would need some more work inside libvirt. Would be very nice
though.

At CloudStack we want to do as much as possible using libvirt, the more
features it has there, the less we have to do in Java code :)

Wido

> Josh


RBD performance with many children and snapshots

2015-12-21 Thread Wido den Hollander
Hi,

While implementing the buildvolfrom method in libvirt for RBD I'm stuck
at some point.

$ virsh vol-clone --pool myrbdpool image1 image2

This would clone image1 to a new RBD image called 'image2'.

The code I've written now does:

1. Create a snapshot called image1@libvirt-<timestamp>
2. Protect the snapshot
3. Clone the snapshot to 'image2'
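In librbd C API terms those three steps are roughly this (a sketch; the
snapshot name format, feature bits and error handling are simplified):

#include <stdio.h>
#include <time.h>
#include <rbd/librbd.h>

/* Snapshot the parent, protect the snapshot, then clone it into a new image. */
int clone_volume(rados_ioctx_t ioctx, const char *parent, const char *clone)
{
    char snap[64];
    rbd_image_t image;
    int order = 0;
    int r;

    snprintf(snap, sizeof(snap), "libvirt-%ld", (long)time(NULL));

    r = rbd_open(ioctx, parent, &image, NULL);
    if (r < 0)
        return r;

    r = rbd_snap_create(image, snap);
    if (r == 0)
        r = rbd_snap_protect(image, snap);
    rbd_close(image);
    if (r < 0)
        return r;

    /* Layering is required for clones; other features are left out here. */
    return rbd_clone(ioctx, parent, snap, ioctx, clone,
                     RBD_FEATURE_LAYERING, &order);
}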

wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool
rbdpool image1 image2
Vol image2 cloned from image1

wido@wido-desktop:~/repos/libvirt$

root@alpha:~# rbd -p libvirt info image2
rbd image 'image2':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1976451ead36b
format: 2
features: layering, striping
flags:
parent: libvirt/image1@libvirt-1450724650
overlap: 10240 MB
stripe unit: 4096 kB
stripe count: 1
root@alpha:~#

But this could potentially lead to a lot of snapshots with children on
'image1'.

image1 itself will probably never change, but I'm wondering about the
negative performance impact this might have on an OSD.

I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
into libvirt. There is however no way to pass something like a snapshot
name in libvirt when cloning.

Any bright suggestions? Or is it fine to create so many snapshots?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Is rbd_discard enough to wipe an RBD image?

2015-12-21 Thread Wido den Hollander
On 12/21/2015 04:50 PM, Josh Durgin wrote:
> On 12/21/2015 07:09 AM, Jason Dillaman wrote:
>> You will have to ensure that your writes are properly aligned with the
>> object size (or object set if fancy striping is used on the RBD
>> volume).  In that case, the discard is translated to remove operations
>> on each individual backing object.  The only time zeros are written to
>> disk is if you specify an offset somewhere in the middle of an object
>> (i.e. the whole object cannot be deleted nor can it be truncated) --
>> this is the partial discard case controlled by that configuration param.
>>
> 
> I'm curious what's using the virVolWipe stuff - it can't guarantee it's
> actually wiping the data in many common configurations, not just with
> ceph but with any kind of disk, since libvirt is usually not consuming
> raw disks, and with modern flash and smr drives even that is not enough.
> There's a recent patch improving the docs on this [1].
> 
> If the goal is just to make the data inaccessible to the libvirt user,
> removing the image is just as good.
> 
> That said, with rbd there's not much cost to zeroing the image with
> object map enabled - it's effectively just doing the data removal step
> of 'rbd rm' early.
> 

I was looking at the features the RBD storage pool driver is missing in
libvirt, and they are:

- Build from Volume. That's RBD cloning
- Uploading and Downloading Volume
- Wiping Volume

The thing about wiping in libvirt is that the volume still exists
afterwards, it is just empty.

My discard code now works, but I wanted to verify. If I understand Jason
correctly, it would be a matter of figuring out the 'order' of an image
and calling rbd_discard in a loop until you reach the end of the image.

I just want libvirt to be as feature complete as possible when it comes
to RBD.

Wido

> Josh
> 
> [1] http://comments.gmane.org/gmane.comp.emulators.libvirt/122235


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Is rbd_discard enough to wipe an RBD image?

2015-12-20 Thread Wido den Hollander
Hi,

I'm busy implementing the volume wiping method of the libvirt storage
pool backend, and instead of writing zeroes to the whole RBD image
I'm using rbd_discard.

Using a 4MB length I'm starting at offset 0 and working my way through
the whole RBD image.

A quick try shows me that my partition table + filesystem are gone on
the RBD image after I've run rbd_discard.

I just want to know if this is sufficient to wipe an RBD image? Or would
it be better to fully fill the image with zeroes?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Querying since when a PG is inactive

2015-12-09 Thread Wido den Hollander
On 12/09/2015 02:50 PM, Sage Weil wrote:
> Hi Wido!
> 
> On Wed, 9 Dec 2015, Wido den Hollander wrote:
>> Hi,
>>
>> I'm working on a patch in PGMonitor.cc that sets the state to HEALTH_ERR
>> if >= X PGs are stuck non-active.
>>
>> This works for me now, but I would like to add a timer that a PG has to
>> be inactive for more than Y seconds.
>>
>> The PGMap contains "last_active" and "last_clean", but these timestamps
>> are never updated. So I can't query for last_active =< (now() - 300) for
>> example.
>>
>> On a idle test cluster I have a PG for example:
>>
>> "last_active": "2015-12-09 02:32:31.540712",
>>
>> It's currently 08:53:56 here, so I can't check against last_active.
>>
>> What would a good way be to see for how long a PG has been inactive?
> 
> It sounds like maybe the current code is subtley broken:
> 
>   https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2566
> 
> The last_active/clean etc should be fresh within 
> osd_pg_stat_report_interval_max seconds...
> 

Indeed, that seems broken. I created an issue for it:
http://tracker.ceph.com/issues/14028

I'm not sure where to start (yet).

> sage


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: CodingStyle on existing code

2015-12-09 Thread Wido den Hollander
On 12/03/2015 12:12 PM, Joao Eduardo Luis wrote:
> On 12/01/2015 03:18 PM, Sage Weil wrote:
>> On Tue, 1 Dec 2015, Wido den Hollander wrote:
>>>
>>> On 01-12-15 16:00, Gregory Farnum wrote:
>>>> On Tue, Dec 1, 2015 at 5:47 AM, Loic Dachary <l...@dachary.org> wrote:
>>>>>
>>>>>
>>>>> On 01/12/2015 14:10, Wido den Hollander wrote:
>>>>>> Hi,
>>>>>>
>>>>>> While working on mon/PGMonitor.cc I see that there is a lot of
>>>>>> inconsistency on the code.
>>>>>>
>>>>>> A lot of whitespaces, indentation which is not correct, well, a lot of
>>>>>> things.
>>>>>>
>>>>>> Is this something we want to fix? With some scripts we can probably do
>>>>>> this easily, but it might cause merge hell with people working on 
>>>>>> features.
>>>>>
>>>>> A sane (but long) way to do that is to cleanup when fixing a bug or 
>>>>> adding a feature. With (a lot) of patience, it will eventually be better 
>>>>> :-)
>>>>
>>>> Yeah, we generally want you to follow the standards in any new code. A
>>>> mass update of the code style on existing code makes navigating the
>>>> history a little harder so a lot of people don't like it much, though.
>>>
>>> Understood. But in this case I'm working in PGMonitor.cc. For just 20
>>> lines of code I probably shouldn't refactor the whole file, should I?
> 
> While it annoys me finding a given commit in history that just changes
> every single line to fix styling, I also recognize that this is the sort
> of janitorial task that may be warranted for certain files.
> 
> As sage mentions below, this is a low-traffic file. And whenever it's
> changed, is often just tiny bits here and there. That tends to add to
> the style divergence rather than to convergence.
> 
> I'd say go for it. If you are indeed changing those 20 or so lines
> though, please add those lines on a separate patch from the style changes.
> 

Here is the first pull request without a functional change. It is just
the coding style fix: https://github.com/ceph/ceph/pull/6881

If that goes through I can put in the actual code change.

Wido

>   -Joao
> 
>>
>> Easiest thing is to fix the code around your change.
>>
>> I'm also open to a wholesale cleanup since it's a low-traffic file and 
>> likely won't conflict with other stuff in flight.  But, up to you!
>>
>> sage
>>
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Querying since when a PG is inactive

2015-12-08 Thread Wido den Hollander
Hi,

I'm working on a patch in PGMonitor.cc that sets the state to HEALTH_ERR
if >= X PGs are stuck non-active.

This works for me now, but I would like to add a timer that a PG has to
be inactive for more than Y seconds.

The PGMap contains "last_active" and "last_clean", but these timestamps
are never updated. So I can't query for last_active <= (now() - 300), for
example.

On an idle test cluster I have a PG, for example:

"last_active": "2015-12-09 02:32:31.540712",

It's currently 08:53:56 here, so I can't check against last_active.

What would be a good way to see for how long a PG has been inactive?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: CodingStyle on existing code

2015-12-01 Thread Wido den Hollander


On 01-12-15 16:00, Gregory Farnum wrote:
> On Tue, Dec 1, 2015 at 5:47 AM, Loic Dachary <l...@dachary.org> wrote:
>>
>>
>> On 01/12/2015 14:10, Wido den Hollander wrote:
>>> Hi,
>>>
>>> While working on mon/PGMonitor.cc I see that there is a lot of
>>> inconsistency on the code.
>>>
>>> A lot of whitespaces, indentation which is not correct, well, a lot of
>>> things.
>>>
>>> Is this something we want to fix? With some scripts we can probably do
>>> this easily, but it might cause merge hell with people working on features.
>>
>> A sane (but long) way to do that is to cleanup when fixing a bug or adding a 
>> feature. With (a lot) of patience, it will eventually be better :-)
> 
> Yeah, we generally want you to follow the standards in any new code. A
> mass update of the code style on existing code makes navigating the
> history a little harder so a lot of people don't like it much, though.

Understood. But in this case I'm working in PGMonitor.cc. For just 20
lines of code I probably shouldn't refactor the whole file, should I?

> *shrug*
> -Greg


CodingStyle on existing code

2015-12-01 Thread Wido den Hollander
Hi,

While working on mon/PGMonitor.cc I see that there is a lot of
inconsistency on the code.

A lot of whitespace issues, indentation which is not correct, well, a lot
of things.

Is this something we want to fix? With some scripts we can probably do
this easily, but it might cause merge hell with people working on features.

Wido


Re: Would it make sense to require ntp

2015-11-06 Thread Wido den Hollander
On 11/06/2015 11:06 AM, Nathan Cutler wrote:
> Hi Ceph:
> 
> Recently I encountered some a "clock skew" issue with 0.94.3. I have
> some small demo clusters in AWS. When I boot them up, in most cases the
> cluster will start in HEALTH_WARN due to clock skew on some of the MONs.
> 
> I surmise that this is due to a race condition between the ceph-mon and
> ntpd systemd services. Sometimes ntpd.service starts *after* ceph-mon -
> in this case the MON sees a wrong/unsynchronized time value.
> 
> Now, even though ntpd.service starts (and fixes the time value) very
> soon afterwards, the cluster remains in clock skew for a long time - but
> that is a separate issue. What I would like to ask is this:
> 
> Is there any reasonable Ceph cluster node configuration that does not
> include running the NTP daemon?
> 

Well, the MONs are very, very time sensitive. OSDs somewhat less so, but
if they drift too far they run into trouble authenticating.

> If the answer is "no", would it make sense to make NTP a runtime
> dependency and tell the ceph-mon systemd service to wait for
> ntpd.service before it starts?
> 

I think it makes sense; correct time is essential, imho.
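With systemd that ordering could be expressed with a small drop-in; a
sketch, assuming the NTP unit is called ntpd.service (on some
distributions it is ntp.service or chronyd.service):

# /etc/systemd/system/ceph-mon@.service.d/ntp.conf  (hypothetical drop-in)
[Unit]
# Pull in the NTP daemon and only start the monitor after it has started.
Wants=ntpd.service
After=ntpd.service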

> Thanks and regards
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: civetweb upstream/downstream divergence

2015-10-29 Thread Wido den Hollander


On 29-10-15 10:19, Nathan Cutler wrote:
> Hi Ceph:
> 
> The civetweb code in RGW is taken from https://github.com/ceph/civetweb/
> which is a fork of https://github.com/civetweb/civetweb. The last commit
> to our fork took place on March 18.
> 
> Upstream civetweb development has progressed ("This branch is 19 commits
> ahead, 972 commits behind civetweb:master.")
> 
> Are there plans to rebase to a newer upstream version or should we think
> more in terms of backporting (to ceph/civetweb.git) from upstream
> (civetweb/civetweb.git) when we need to fix bugs or add features?
> 

I think it would be smart to keep tracking civetweb from upstream,
otherwise we have effectively forked Civetweb.

We might run into some issues with Civetweb which we need to fix
upstream; that's a lot easier if we are close to where upstream is.

Wido

> Thanks and regards
> 


Re: newstore direction

2015-10-19 Thread Wido den Hollander
ile approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: radosgw with openstack auth v3

2015-10-07 Thread Wido den Hollander
On 10/06/2015 11:03 AM, Luis Periquito wrote:
> Hi,
> 
> I was trying to get radosgw to authenticate using API v3, as I thought
> it would be relatively straightforward, but my C++ is not up to
> standard.
> 
> That and the time it takes to compile and make the radosgw binary is
> way too long. Is there a way to just compile radosgw so I can make the
> tests quickly?
> 

You need to compile Ceph first; afterwards you can run this in the src
directory:

$ cd src
$ make radosgw

> I can share all the progress I made, and share some thoughts. I also
> have a complete setup that I can use to test the solution.
> 
> Is there anyone currently looking into this?
> 
> thanks,


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: radosgw socket dir

2015-09-14 Thread Wido den Hollander


On 14-09-15 17:05, Sage Weil wrote:
> radosgw now runs as ceph:ceph under systemd.  /run/ceph-rgw is 755 owned 
> as the web user.  I think this will break anybody using apache and 
> fastcgi.  That's not the default, but I don't think that combo 
> makes any sense anyway.
> 
> Should we...
> 
>  - make /run/ceph-rgw ceph:ceph 0755, make the sockets 770, and then make 
> the user add the web user to the ceph group?
> 
>  - make /run/ceph-rgw ceph:www 0755 + group sticky bit and then make the 
> sockets 770 so that the web user can open them?
> 
>  - get rid of /run/ceph-rgw entirely and let the admin do this on their 
> own if they want to do fastcgi instead of civetweb?
> 
> I like of like the last option...

I agree. I would not promote FastCGI in any way. Civetweb is easier and
faster.

So if anyone wants FastCGI, let them override the systemd service file
and add the socket.

> sage


Re: [radosgw] absolute uri

2015-08-27 Thread Wido den Hollander
On 08/27/2015 06:36 AM, Lorieri wrote:
 Hi,
 
 I'm debugging a golang aws client and I've noticed it makes absolute
 uri requests on amazon.
 http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
 
 it would be great if radosgw implement it.
 

I think the Golang client is the only one that does it. I've never seen
any other client do it.

Even the AWS SDK and Python Boto don't do this, so shouldn't the Golang
AWS client be fixed?

 cheers,
 -lorieri


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: civetweb health check

2015-08-06 Thread Wido den Hollander


On 05-08-15 18:37, Srikanth Madugundi wrote:
 Hi,
 
 We are planning to move our radosgw setup from apache to civetweb. We
 were successfully able to setup and run civetweb on a test cluster.
 
 The radosgw instances are fronted by a VIP with currently checks the
 health by getting /status.html file, after moving to civetweb the vip
 is unable to get the health of radosgw server using /status.html
 endpoint and assumes the server is down.
 
 I looked at ceph radosgw documentation and did not find any
 configuration to rewrite urls. What is the best approach for VIP to
 get the health of radosgw?
 

You can simply query /

This is what I use in Varnish to do a health check:

backend rgw {
    .host            = "127.0.0.1";
    .port            = "7480";
    .connect_timeout = 1s;
    .probe = {
        .timeout   = 30s;
        .interval  = 3s;
        .window    = 10;
        .threshold = 3;
        .request =
            "GET / HTTP/1.1"
            "Host: localhost"
            "User-Agent: Varnish-health-check"
            "Connection: close";
    }
}

Works fine; RGW will respond with a 200 OK on /

Wido

 Thanks
 Srikanth


Re: C++11 and librados C++

2015-08-04 Thread Wido den Hollander


On 03-08-15 22:25, Samuel Just wrote:
 It seems like it's about time for us to make the jump to C++11.  This
 is probably going to have an impact on users of the librados C++
 bindings.  It seems like such users would have to recompile code using
 the librados C++ libraries after upgrading the librados library
 version.  Is that reasonable?  What do people expect here?

Well, some people use Qemu built by their distro, but they use librados
/ librbd from ceph.com.

So if they suddenly have to rebuild Qemu, that would hurt them, I think.

Wido

 -Sam


Re: Signed-off-by and aliases

2015-08-01 Thread Wido den Hollander
On 07/31/2015 09:59 PM, Loic Dachary wrote:
 Hi Ceph,
 
 We require that each commit has a Signed-off-by line with the name and email 
 of the author. The general idea is that the Ceph project trusts each 
 developer to understand what it entails[1]. There is no formal verification : 
 the person submitting the patch could use a fake name or publish code from 
 someone else. In reality the odds of that happening and causing problem are 
 so low that neither Ceph nor the Linux kernel felt the need to impose a more 
 formal process. There is no bullet proof process anyway, it's all about 
 balancing risks and costs.
 
 If a contributor was using an alias that looks like a real name (for instance 
 I could contribute under the name Louis Lavile), (s)he would go unnoticed and 
 her/his contribution would be accepted as any other. If the same contributor 
 was using an alias that is obviously an alias (such as A. Nonymous), it would 
 raise the question of accepting contributions Signed-off with an alias.
 
 I think Ceph should accept contributions that are signed with an alias 
 because it does not make a difference.
 
 From a lawyer perspective, there is a difference between an alias and a real 
 name, of course. Should the author be in court, (s)he would have to prove 
 (s)he is the person behind the alias. If (s)he was using her/his real name, 
 an ID card would be enough. And probably other differences that I don't see 
 because IANAL. However since we already accept Signed-off-by that are not 
 formally verified, we're already in a situation where we implicitly accept 
 aliases. Explicitly accepting aliases would not change that, therefore it is 
 not actually something we need to run by lawyers because nothing changes from 
 a legal standpoint.
 
 What do you think ?
 

Using an alias is just dumb since it would only make you lose the
copyright, since it's not you doing the commit.

However, if we want to go for security, there is also a way to sign your
Git commits using GPG [2].

[2]: https://git-scm.com/book/tr/v2/Git-Tools-Signing-Your-Work
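For completeness, this is roughly what that looks like on the command
line (the key ID and commit message are just placeholders):

# Tell Git which GPG key to sign with (placeholder key ID)
$ git config --global user.signingkey 0A46826A
# Sign an individual commit
$ git commit -S -m 'mon: fix something'
# Show/verify the signature afterwards
$ git log --show-signature -1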

 Cheers
 
 [1] SIGNING CONTRIBUTIONS 
 https://github.com/ceph/ceph/blob/master/SubmittingPatches#L13
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: pre_start command not executed with upstart

2015-07-30 Thread Wido den Hollander


On 30-07-15 14:28, Sage Weil wrote:
 On Thu, 30 Jul 2015, Wido den Hollander wrote:
 On 28-07-15 14:29, Sage Weil wrote:
 On Tue, 28 Jul 2015, Wido den Hollander wrote:
 Hi,

 I was trying to inject a pre_start command on a bunch of OSDs under
 Ubuntu 14.04, but that didn't work.

 I found out that only the sysvinit script execute pre_start commands,
 but the upstart nor the systemd scripts do this.

 Is this a thing which will disappear or is this just a oversight?

 I'm actually using sleep 30 as a pre-start to prevent all OSDs from
 starting at the same time. Big boxes which die under the workload of all
 OSDs starting at the same time.

 I also try to prevent a lot of OSDMap changes by booting the OSDs slowly
 so I tried to fix this with pre_start.

 I was hoping to let it die...

 I would take a fresh look at systemd and see if there is a 
 different/better mechanism there to address the startup issue.  You can 
 manually add a prestart script/command to the unit file...


 Oh, it doesn't matter if it dies out. It's just that there will probably
 be scenarios where users want to do some sort of pre-start before a OSD
 is executed.

 The update_crush hook is one of the examples which is already executed
 for example.

 A pre_start which does a mount for example (for those without udev)
 might be useful.
 
 For what it's worth, the view from Lennart is that unit files are simple 
 enough to be configuration, in which case such users can just add a 
 ExecStartPre to their ceph-osd@.service file (instead of doing the same in 
 the ceph.conf file).
 

Agreed, and you are right. We shouldn't do anything special here. Since
both Ubuntu and CentOS/RHEL are migrating to systemd, we probably
shouldn't really care about pre_start anymore.

Wido

 sage
 


pre_start command not executed with upstart

2015-07-28 Thread Wido den Hollander
Hi,

I was trying to inject a pre_start command on a bunch of OSDs under
Ubuntu 14.04, but that didn't work.

I found out that only the sysvinit script executes pre_start commands;
neither the upstart nor the systemd scripts do this.

Is this a thing which will disappear or is this just an oversight?

I'm actually using 'sleep 30' as a pre-start to prevent all OSDs from
starting at the same time; big boxes can die under the workload of all
OSDs starting at once.

I also try to prevent a lot of OSDMap changes by booting the OSDs
slowly, so I tried to fix this with pre_start.

Wido


Inactive PGs should trigger a HEALTH_ERR state

2015-07-22 Thread Wido den Hollander
Hi,

I was just testing with a cluster on VMs and I noticed that
undersized+degraded+peering PGs do not trigger a HEALTH_ERR state. Why
is that?

In my opinion any PG which is not active+? should trigger a HEALTH_ERR
state since I/O is blocking at that point.

Is that a sane thing to do or am I missing something?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: rados gateway Parallel Uploads

2015-07-13 Thread Wido den Hollander


On 13-07-15 08:07, 王 健 wrote:
 Hello Cepher,
 I am working on prorting Google Cloud Storage API to rados gateway and I 
 notice that Google Cloud Storage support parallel uploads as below:
 Object composition enables a simple mechanism for uploading an object in 
 parallel: simply divide your data into as many chunks as required to fully 
 utilize your available bandwidth, upload each chunk to a distinct object, 
 compose your final object, and delete any temporary objects. “
 So I am wondering whether radios gateway support the similar feature to let 
 user parallel upload large objects?

Yes, you can do a multipart upload with the RADOS gateway, no problem.

Wido

 Thanks
 Jian
 


Re: rados gateway Parallel Uploads

2015-07-13 Thread Wido den Hollander


On 13-07-15 08:32, 王 健 wrote:
 Does rados gateway support object composition? Like Google could storage, 
 user can upload divide a large object into several objects and upload 
 parallel, then combine them as a single object.
 Thanks
 Jian

Yes, that is how multipart uploads work. The RADOS Gateway supports it.

Wido

 On Jul 13, 2015, at 2:16 AM, Wido den Hollander w...@42on.com wrote:



 On 13-07-15 08:07, 王 健 wrote:
 Hello Cepher,
 I am working on prorting Google Cloud Storage API to rados gateway and I 
 notice that Google Cloud Storage support parallel uploads as below:
 Object composition enables a simple mechanism for uploading an object in 
 parallel: simply divide your data into as many chunks as required to fully 
 utilize your available bandwidth, upload each chunk to a distinct object, 
 compose your final object, and delete any temporary objects. “
 So I am wondering whether radios gateway support the similar feature to let 
 user parallel upload large objects?

 Yes, you can do a multipart upload with the RADOS gateway, no problem.

 Wido

 Thanks
 Jian



Re: [IANA #826907] Application for a Port Number and/or Service Name ceph (Completed) (fwd)

2015-07-10 Thread Wido den Hollander
On 07/10/2015 11:08 PM, Sage Weil wrote:
 It's official!  We have a new port number for the monitor:
 
   3300 (that's CE4h, or 0xCE4).
 
 Sometime in the next cycle we'll need to make a transition plan to move 
 from 6789.
 

Very nice! We should be clear though that nothing will break for
existing users.

The new default will become 3300, but every mon has IP+Port in the
monmap, so it won't hit them. Just to make sure people aren't afraid to
upgrade their Ceph cluster.

 sage
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


RGW with Civetweb and HTTP Keep-Alive

2015-06-22 Thread Wido den Hollander
Hi,

When running Varnish in front of RGW+Civetweb you want to pipe [0]
requests from Varnish to Civetweb.
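A sketch of what that piping looks like in VCL (Varnish 4 syntax; which
requests get piped is of course deployment specific):

sub vcl_recv {
    # Hand uploads straight through to Civetweb as a raw TCP pipe.
    if (req.method == "PUT" || req.method == "POST") {
        return (pipe);
    }
}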

Varnish will act as a simple TCP proxy and sends a Connection: close
header to the backend.

Civetweb doesn't seem to honor this header, so it doesn't send back a
Connection: close header in the response.

This causes a client to retry the request to Varnish, but it gets back
a Broken Pipe since Varnish has already closed the connection to Civetweb.

Any ideas on how we can turn this behavior of Civetweb off? With Varnish
in front I actually don't want Civetweb to do KeepAlive. Varnish talks
keepalive with my clients, but not between Varnish and Civetweb.

It's Civetweb which doesn't honor the Connection: close header
properly, but that might be harder to fix.

I opened a new issue [1] where we should make it configurable whether
HTTP KeepAlive is enabled for Civetweb or not.

Wido

[0]: https://www.varnish-software.com/blog/using-pipe-varnish
[1]: http://tracker.ceph.com/issues/12110


Re: rbd top

2015-06-16 Thread Wido den Hollander
On 06/15/2015 06:52 PM, John Spray wrote:
 
 
 On 15/06/2015 17:10, Robert LeBlanc wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 John, let me see if I understand what you are saying...

 When a person runs `rbd top`, each OSD would receive a message saying
 please capture all the performance, grouped by RBD and limit it to
 'X'. That way the OSD doesn't have to constantly update performance
 for each object, but when it is requested it starts tracking it?
 
 Right, initially the OSD isn't collecting anything, it starts as soon as
 it sees a query get loaded up (published via OSDMap or some other
 mechanism).
 

I like that idea very much. Currently the OSDs are already CPU bound; a
lot of time is spent on processing a request while it's not waiting on
the disk.

Although tracking IOps might seem like a small and cheap thing to do,
it's yet more CPU time spent by the system on something other than
processing the I/O.

So I'm in favor of not always collecting, but only on demand.

Go for performance, low-latency and high IOps.

Wido

 That said, in practice I can see people having some set of queries that
 they always have loaded and feeding into graphite in the background.

 If so, that is an interesting idea. I wonder if that would be simpler
 than tracking the performance of each/MRU objects in some format like
 /proc/diskstats where it is in memory and not necessarily consistent.
 The benefit is that you could have lifelong stats that show up like
 iostat and it would be a simple operation.
 
 Hmm, not sure we're on the same page about this part, what I'm talking
 about is all in memory and would be lost across daemon restarts.  Some
 other component would be responsible for gathering the stats across all
 the daemons in one place (that central part could persist stats if
 desired).
 
 Each object should be able
 to reference back to RBD/CephFS upon request and the client could even
 be responsible for that load. Client performance data would need stats
 in addition to the object stats.
 
 You could extend the mechanism to clients.  However, as much as possible
 it's a good thing to keep it server side, as servers are generally fewer
 (still have to reduce these stats across N servers to present to user),
 and we have multiple client implementations (kernel/userspace).  What
 kind of thing do you want to get from clients?
 My concern is that adding additional SQL like logic to each op is
 going to get very expensive. I guess if we could push that to another
 thread early in the op, then it might not be too bad. I'm enjoying the
 discussion and new ideas.
 
 Hopefully in most cases the query can be applied very cheaply, for
 operations like comparing pool ID or grouping by client ID. However, I
 would also envisage an optional sampling number, such that e.g. only 1
 in every 100 ops would go through the query processing.  Useful for
 systems where keeping highest throughput is paramount, and the numbers
 will still be useful if clients are doing many thousands of ops per second.
 
 Cheers,
 John


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Bucket name restrictions in RGW

2015-06-13 Thread Wido den Hollander
On 06/13/2015 01:29 AM, Robin H. Johnson wrote:
 On Fri, Jun 12, 2015 at 07:13:48PM -0400,  Yehuda Sadeh-Weinraub wrote:
 Whatever we end up doing, we need to make it configurable, and also
 keep backward compatibility, so that buckets that were created prior
 to such a change will still remain accessible. Some setups would not
 need this limitation and will find it too restricting so I'm not sure
 that it's really that needed. In short, make it configurable.
 Configurable:
 - Can we obsolete 'rgw relaxed s3 bucket names', and convert it to a new
   option: 'rgw s3 bucket name create strictness'
   Value '0' = existing 'rgw relaxed s3 bucket names = true' logic
   Value '1' = existing 'rgw relaxed s3 bucket names = false' logic
   Value '2' = compliance with AmazonS3 DNS rules
 
 Backwards-Compatibility:
 - Make a new option 'rgw s3 bucket name access strictness'
   Same values as above, but used to access buckets, not create new ones.
 - Proposed default values:
   rgw s3 bucket name create strictness = 2
   rgw s3 bucket name access strictness = 1
 
 So you can only create DNS-compliant buckets, but still access your
 existing non-compliant buckets. Maybe also have keywords of major
 releases and 'relaxed' supported in addition to the integer values.
 
 I don't like the names of the config keys, but I'm coming up blank on
 something that is shorter while still being immediately clear.
 

Seems like a good plan to me. I would like to restrict them as much as
possible, but we shouldn't break anything which is online now.

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Bucket name restrictions in RGW

2015-06-12 Thread Wido den Hollander
On 06/12/2015 05:28 PM, Harshal Gupta wrote:
 Hi,
 I was looking into the bucket creation and found out that we are able
 to create buckets with names which are not DNS compliant. One such
 example is names ending with a non-alphanumeric character. There are
 other rules which make bucket name restrictions in RGW more lenient
 than what is recommended for DNS compliant names as well.
 
 In case we plan to support website hosting in future on RGW, we will
 need to make bucket names DNS compliant. Keeping that in mind, I am
 thinking about modifying the bucket name rules and applying more
 restrictions to make them more towards DNS compliant.
 
 Please share your opinion about this.
 

I'm in favor. I would even like stricter bucket names, e.g. a setting
where you can force all names to lowercase or refuse names with
uppercase characters in them. Mixed-case names sometimes conflict with
DNS names.

 Thanks,
 Harshal Gupta


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: preparing v0.80.11

2015-06-03 Thread Wido den Hollander
On 05/26/2015 10:28 PM, Nathan Cutler wrote:
 Hi Loic:
 
 The first round of 0.80.11 backports, including all trivial backports
 (where trivial is defined as those I was able to do by myself without
 help), is now ready for integration testing in the firefly-backports
 branch of the SUSE fork:
 
 https://github.com/SUSE/ceph/commits/firefly-backports
 
 The non-trivial backports (on which I hereby solicit help) are:
 
 http://tracker.ceph.com/issues/11699 Objecter: resend linger ops on split
 http://tracker.ceph.com/issues/11700 make the all osd/filestore thread
 pool suicide timeouts separately configurable
 http://tracker.ceph.com/issues/11704 erasure-code: misalignment
 http://tracker.ceph.com/issues/11720 rgw deleting S3 objects leaves
 __shadow_ objects behind
 

Could I also ask for this one to be backported?

https://github.com/ceph/ceph/pull/4844

It breaks a couple of setups I know of. It's not in master yet, but it's
a very trivial fix.

 Nathan


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: preparing v0.80.11

2015-05-27 Thread Wido den Hollander
On 05/26/2015 10:28 PM, Nathan Cutler wrote:
 Hi Loic:
 
 The first round of 0.80.11 backports, including all trivial backports
 (where trivial is defined as those I was able to do by myself without
 help), is now ready for integration testing in the firefly-backports
 branch of the SUSE fork:
 

Would it be possible to backport this as well to 0.80.11:

http://tracker.ceph.com/issues/9792#change-46498

And I think this commit would be the easiest to backport:
https://github.com/ceph/ceph/commit/6b982e4cc00f9f201d7fbffa0282f8f3295f2309

This way we add a simple safeguard against pool removal into Firefly as
well.

Wido

 https://github.com/SUSE/ceph/commits/firefly-backports
 
 The non-trivial backports (on which I hereby solicit help) are:
 
 http://tracker.ceph.com/issues/11699 Objecter: resend linger ops on split
 http://tracker.ceph.com/issues/11700 make the all osd/filestore thread
 pool suicide timeouts separately configurable
 http://tracker.ceph.com/issues/11704 erasure-code: misalignment
 http://tracker.ceph.com/issues/11720 rgw deleting S3 objects leaves
 __shadow_ objects behind
 
 Nathan


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Will the number of objects that have ever existed be infinite?

2015-05-23 Thread Wido den Hollander
On 05/23/2015 08:58 AM, 李沛伦 wrote:
 Hello!
 
 I'm a GSoC student this year and my job is to introduce Missing Rate
 Curve (or reuse distance exactly) of objects into OSD. Now I'm trying
 to find a proper algorithm to implement but there is a problem: Should
 I take the number of objects tracked in an OSD as infinite or
 constant?
 

An OSD doesn't track on a per-object basis, but it keeps track of
Placement Groups (PGs). An OSD can have some number X of PGs.

Technically the number of PGs might be infinite, but in practice you are
bound by CPU and memory limits.

So I would be careful with the word infinite, since nothing is really
infinite; e.g. the size of an int/long might be the limiting factor somewhere.

But in theory there is no object or PG limit per OSD.

 The point is that there is an algorithm that use hash to sample only
 constant number of references to do the analysis and is proved to be
 accurate, which makes it possible to do online MRC construction. That
 accuracy is supported by the fact that the memory addresses is
 bounded, while objects can be deleted and created again and again in
 Ceph. Is is reasonable to think that an OSD only serves bounded number
 of objects in its life time (or the time period that we want to
 compute MRC)?
 
 Any other comment about this project is also welcomed :)
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Default config value for osd_disk_thread_ioprio_class

2015-04-29 Thread Wido den Hollander
On 04/29/2015 02:10 PM, Wido den Hollander wrote:
 Hi,
 
 In the process of upgrading a cluster from Giant to Hammer I saw this
 on the OSD logs:
 
 2015-04-29 14:02:37.015454 7f887875e900 -1 osd.456 43089
 set_disk_tp_priority(22) Invalid argument:
 osd_disk_thread_ioprio_class is  but only the following values are
 allowed: idle, be or rt
 
 That is correct, since config_opts.h says:
 
 OPTION(osd_disk_thread_ioprio_class, OPT_STR, "") // rt realtime be
 best effort idle
 
 It's nothing bad, but it would be nicer if we got rid of it.
 
 What to do here? Allow "" as a config setting and then ignore it, or
 set the default to rt, be or idle?
 

I see there actually is a check for it:

  if (cct->_conf->osd_disk_thread_ioprio_class.empty() ||
      cct->_conf->osd_disk_thread_ioprio_priority < 0)
    return;

So empty() does not return True there while it should, since the setting
is set to ""?

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RADOS backend for Dovecot mail storage

2015-03-31 Thread Wido den Hollander
Hi,

Recently Dovecot merged with OpenXchange as announced here:
http://www.dovecot.fi/open-xchange-and-dovecot-announce-merger-to-create-worlds-leading-open-source-messaging-software-provider/

In the past few years I've contacted Dovecot a couple of times because
it would be very, very cool to have your e-mail stored via native RADOS.

They thought it was a very cool idea, but since they had closed-source
object store plugins for Dovecot, like the Scality one, they couldn't allow
RADOS to be implemented for object mail storage.

My idea is basically: Dovecot -> libmailrados -> librados -> Ceph

Now that OpenXchange acquired Dovecot it might be that this possibility
opens up!

Just wanted to get this out there. I currently won't be able to look
into this, but it would be very cool if this happens.

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: who is using radosgw with civetweb?

2015-02-26 Thread Wido den Hollander


 On 26 Feb 2015, at 18:22, Sage Weil sw...@redhat.com wrote:
 
 On Thu, 26 Feb 2015, Wido den Hollander wrote:
 On 25-02-15 20:31, Sage Weil wrote:
 Hey,
 
 We are considering switching to civetweb (the embedded/standalone rgw web
 server) as the primary supported RGW frontend instead of the current
 apache + mod-fastcgi or mod-proxy-fcgi approach.  Supported here means
 both the primary platform the upstream development focuses on and what the
 downstream Red Hat product will officially support.
 
 How many people are using RGW standalone using the embedded civetweb
 server instead of apache?  In production?  At what scale?  What
 version(s) (civetweb first appeared in firefly and we've backported most
 fixes).
 
 Have you seen any problems?  Any other feedback?  The hope is to (vastly)
 simplify deployment.
 
 It seems like Civetweb listens on 0.0.0.0 by default and that doesn't seem
 safe to me.
 
 Can you clarify?  Is that because people may inadvertantly run this on a 
 public host and not realize that the host is answering requests?
 

Yes, mainly. I think we should encourage users to run Apache, Nginx or Varnish 
as a proxy/filter in front.

I'd just suggest binding to localhost by default and letting the user choose 
otherwise.

 If we move to a world where this is the default/preferred route, this 
 seems like a good thing.. if they don't want to respond on an address they 
 can specify which IP to bind to?
 

Most services listen on localhost unless specified otherwise.

 In most deployments you'll put Apache, Nginx or Varnish in front of RGW to do
 the proper HTTP handling.
 
 I'd say that Civetweb should listen on 127.0.0.1:7480/[::1]:7480 by default.
 
 And make sure it listens on IPv6 by default :-)
 
 Yeah, +1 on IPv6:)
 
 sage
 
 
 
 Wido
 
 Thanks!
 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: who is using radosgw with civetweb?

2015-02-26 Thread Wido den Hollander



On 25-02-15 20:31, Sage Weil wrote:

Hey,

We are considering switching to civetweb (the embedded/standalone rgw web
server) as the primary supported RGW frontend instead of the current
apache + mod-fastcgi or mod-proxy-fcgi approach.  Supported here means
both the primary platform the upstream development focuses on and what the
downstream Red Hat product will officially support.

How many people are using RGW standalone using the embedded civetweb
server instead of apache?  In production?  At what scale?  What
version(s) (civetweb first appeared in firefly and we've backported most
fixes).

Have you seen any problems?  Any other feedback?  The hope is to (vastly)
simplify deployment.



It seems like Civetweb listens on 0.0.0.0 by default and that doesn't 
seem safe to me.


In most deployments you'll put Apache, Nginx or Varnish in front of RGW 
to do the proper HTTP handling.


I'd say that Civetweb should listen on 127.0.0.1:7480/[::1]:7480 by default.

And make sure it listens on IPv6 by default :-)

Wido


Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: full_ratios - please explain?

2015-02-18 Thread Wido den Hollander
On 18-02-15 15:39, Wyllys Ingersoll wrote:
 Can someone explain the interaction and effects of all of these
 full_ratio parameters?  I havent found any real good explanation of how
 they affect the distribution of data once the cluster gets above the
 nearfull and close to the close ratios.
 

When only ONE (1) OSD goes over the mon_osd_nearfull_ratio the cluster
goes from HEALTH_OK into HEALTH_WARN state.

 
 mon_osd_full_ratio
 mon_osd_nearfull_ratio
 
 osd_backfill_full_ratio
 osd_failsafe_full_ratio
 osd_failsafe_nearfull_ratio
 
 We have a cluster with about 144 OSDs (518 TB) and trying to get it to a
 90% full rate for testing purposes.
 
 We've found that when some of the OSDs get above the mon_osd_full_ratio
 value (.95 in our system), then it stops accepting any new data, even
 though there is plenty of space left on other OSDs that are not yet even up
 to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
 bit, but eventually it becomes unbalanced and stops working again.
 

Yes, that is because with Ceph safety comes first. When only one OSD goes
over the full ratio the whole cluster stops I/O.

CRUSH does not take OSD utilization into account when placing data, so
it's almost impossible to predict which I/O can continue.

Data safety and integrity is priority number 1. Full disks are a danger
to those priorities, so I/O is stopped.

 Is there a recommended combination of values to use that will allow the
 cluster to continue accepting data and rebalancing correctly above 90%.
 

No, not with those values. Monitor your filesystems so that they stay below
those values. If one OSD becomes too full you can weigh it down using
CRUSH to have some data move away from it.

 thanks,
  Wyllys Ingersoll
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linking libraries into librados

2015-02-16 Thread Wido den Hollander
On 15-02-15 06:15, Rafael Shuker wrote:
 Hey guys,
 
 This is my first time posting to this list, and I am a newbie in Ceph
 development.
 
 I am trying to write new functions into librados.cc and call external
 libraries from that very function.
 
 My problem is that I can't get my head around this build system when
 using ./run-make-check.sh and adding a new some-library-path.a to
 the src/librados/Makefile.am is giving me a headache.
 
 I have successfully adjusted librados.cc and librados.h to do some
 what I wanted, but whenever I reference a function in my external
 library I get a undefined reference exception during compilation. I
 have assumed that this is because I am not linking the library
 correctly and would like to know what the correct way to link a .a
 library for librados.cc.
 
 TLDR: What is the correct way to add a library for librados.cc?
 

To understand this correctly: you are not using librados, but you are
trying to modify librados by adding an external library to it?

 Thank you all for your help, please tell me if I can provide any more
 information.
 
 Cheers
 
 Rafa.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 'Immutable bit' on pools to prevent deletion

2015-01-17 Thread Wido den Hollander
On 01/17/2015 03:31 AM, Alex Elsayed wrote:
 Wido den Hollander wrote:
 
 <snip>
 Is it a sane thing to look at 'features' which pools could have? Other
 features which might be set on a pool:

 - Read Only (all write operations return -EPERM)
 - Delete Protected
 
 There's another pool feature I'd find very useful: a WORM flag, that permits 
 only create & append (at the RADOS level, not the RBD level as was an 
 Emperor blueprint).
 

Yes, that seems like a good addition. If we introduce the system of
'flags' for pools such a flag could be implemented as well.

 In particular, I'd _love_ being able to make something that takes Postgres 
 WAL logs and puts them in such a pool, providing real guarantees re: 
 consistency. Similarly, audit logs and such for compliance.
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS aborted after recovery and active, FAILED assert (r =0)

2015-01-16 Thread Wido den Hollander
On 01/16/2015 09:36 AM, Mohd Bazli Ab Karim wrote:
 Agree. I was about to upgrade to 0.90, but has postponed it due to this error.
 Any chance for me to recover it first before upgrading it?
 

I'm not sure, but I think you won't be able to recover this under 0.72.

 Thanks Wido.
 
 Regards,
 Bazli
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den Hollander
 Sent: Friday, January 16, 2015 3:50 PM
 To: Mohd Bazli Ab Karim; ceph-us...@lists.ceph.com; ceph-devel@vger.kernel.org
 Subject: Re: MDS aborted after recovery and active, FAILED assert (r =0)
 
 On 01/16/2015 08:37 AM, Mohd Bazli Ab Karim wrote:
 Dear Ceph-Users, Ceph-Devel,

 Apologize me if you get double post of this email.

 I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3, 2 
 down and only 1 up) at the moment.
 Plus I have one CephFS client mounted to it.

 
 In the Ceph world 0.72.2 is ancient and pretty old. If you want to play with 
 CephFS I recommend you upgrade to 0.90 and also use at least kernel 3.18
 
 Now, the MDS always get aborted after recovery and active for 4 secs.
 Some parts of the log are as below:

 -3 2015-01-15 14:10:28.464706 7fbcc8226700  1 --
 10.4.118.21:6800/5390 == osd.19 10.4.118.32:6821/243161 73 
 osd_op_re
 ply(3742 1000240c57e. [create 0~0,setxattr (99)]
 v56640'1871414 uv1871414 ondisk = 0) v6  221+0+0 (261801329 0 0)
 0x
 7770bc80 con 0x69c7dc0
 -2 2015-01-15 14:10:28.464730 7fbcc8226700  1 --
 10.4.118.21:6800/5390 == osd.18 10.4.118.32:6818/243072 67 
 osd_op_re
 ply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567
 ondisk = 0) v6  179+0+0 (3759887079 0 0) 0x7757ec80 con
 0x1c6bb00
 -1 2015-01-15 14:10:28.464754 7fbcc8226700  1 --
 10.4.118.21:6800/5390 == osd.47 10.4.118.35:6809/8290 79 
 osd_op_repl
 y(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90
 (Message too long)) v6  174+0+0 (3942056372 0 0) 0x69f94
 a00 con 0x1c6b9a0
  0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In
 function 'void MDSTable::save_2(int, version_t)' thread 7
 fbcc8226700 time 2015-01-15 14:10:28.46
 mds/MDSTable.cc: 83: FAILED assert(r >= 0)

  ceph version  ()
  1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
  2: (Context::complete(int)+0x9) [0x568d29]
  3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
  4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
  5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
  6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
  7: (DispatchQueue::entry()+0x549) [0x975739]
  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
  9: (()+0x7e9a) [0x7fbcccb0de9a]
  10: (clone()+0x6d) [0x7fbccb4ba3fd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.

 Is there any workaround/patch to fix this issue? Let me know if need to see 
 the log with debug-mds of certain level as well.
 Any helps would be very much appreciated.

 Thanks.
 Bazli

 
 DISCLAIMER:


 This e-mail (including any attachments) is for the addressee(s) only and may 
 be confidential, especially as regards personal data. If you are not the 
 intended recipient, please note that any dealing, review, distribution, 
 printing, copying or use of this e-mail is strictly prohibited. If you have 
 received this email in error, please notify the sender immediately and 
 delete the original message (including any attachments).


 MIMOS Berhad is a research and development institution under the purview of 
 the Malaysian Ministry of Science, Technology and Innovation. Opinions, 
 conclusions and other information in this e-mail that do not relate to the 
 official business of MIMOS Berhad and/or its subsidiaries shall be 
 understood as neither given nor endorsed by MIMOS Berhad and/or its 
 subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts 
 responsibility for the same. All liability arising from or in connection 
 with computer viruses and/or corrupted e-mails is excluded to the fullest 
 extent permitted by law.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
 in the body of a message to majord...@vger.kernel.org More majordomo
 info at  http://vger.kernel.org/majordomo-info.html

 
 
 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
 body of a message to majord...@vger.kernel.org More majordomo info at  
 http://vger.kernel.org/majordomo-info.html
 
 
 DISCLAIMER:
 
 
 This e-mail (including any attachments) is for the addressee(s) only and may 
 be confidential, especially as regards personal data. If you are not the 
 intended recipient, please note that any dealing, review, distribution, 
 printing, copying or use

Re: 'Immutable bit' on pools to prevent deletion

2015-01-16 Thread Wido den Hollander
On 01/16/2015 10:50 AM, Sebastien Han wrote:
 Hum if I understand correctly you’re all more in favour of a conf setting in 
 the ceph.conf;
 The problem for me is that this will apply to all the pools by default and 
 I’ll have to inject an arg to change this.
 Injecting the arg will remove this “lock” and then all of the sudden all the 
 pools become deletable through the lib again (who knows what users can do 
 simultaneously)
 

No, from what I understand it's just easier to implement, not necessarily the better way.

 I’m more in favour of a new flag to set to the pool, something like:
 
 ceph osd pool set foo protect true
 ceph osd pool delete foo foo —yes….
 ERROR: pool foo is protected against deletion
 
 ceph osd pool delete foo protect false
 ceph osd pool delete foo foo —yes….
 Pool successfully deleted


Something like that per pool seems better to me as well. But I'd then
opt for a 'feature' which can be set on a pool.

ceph osd pool set foo nodelete
ceph osd pool set foo nopgchange
ceph osd pool set foo nosizechange


 The good thing with that is that owners of the pool (or admin), will be able 
 to set this flag or remove it.
 We stick with the ceph osd pool delete foo foo —yes….” command as well, so 
 we don’t change too much things.
 
 Moreover we can also make use of a config option to protect all new created 
 pools by default:
 
 mon protect pool default = true
 
 This automatically set the protected flag to a new pool.
 
 What do you think?
 

Setting a nodelete flag or something like that by default is fine with
me. Like Sage mentioned earlier, almost nobody will have ephemeral pools
in their cluster. You don't want to lose data because you accidentally
removed a pool.

Wido

 On 15 Jan 2015, at 18:24, Sage Weil s...@newdream.net wrote:

 Then secondary question is whether the cluster should implicitly clear the 
 allow-delete after some time period (maybe 'pending-delete' would make 
 more sense in that case), or whether we deny IO during that period.  Seems 
 perhaps too complicated.
 
 
 Cheers.
  
 Sébastien Han 
 Cloud Architect 
 
 Always give 100%. Unless you're giving blood.
 
 Phone: +33 (0)1 49 70 99 72 
 Mail: sebastien@enovance.com 
 Address : 11 bis, rue Roquépine - 75008 Paris
 Web : www.enovance.com - Twitter : @enovance 
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 'Immutable bit' on pools to prevent deletion

2015-01-16 Thread Wido den Hollander




 On 16 Jan 2015, at 15:47, Sage Weil s...@newdream.net wrote:
 
 On Fri, 16 Jan 2015, Wido den Hollander wrote:
 On 01/16/2015 10:50 AM, Sebastien Han wrote:
 Hum if I understand correctly you’re all more in favour of a conf setting 
 in the ceph.conf;
 The problem for me is that this will apply to all the pools by default and 
 I’ll have to inject an arg to change this.
 Injecting the arg will remove this “lock” and then all of the sudden all 
 the pools become deletable through the lib again (who knows what users can 
 do simultaneously)
 
 No, from what I understand it's easier to implement, not the better way.
 
 I'd like to do both, actually.  :)
 

Sounds good!

 I’m more in favour of a new flag to set to the pool, something like:
 
 ceph osd pool set foo protect true
 ceph osd pool delete foo foo —yes….
 ERROR: pool foo is protected against deletion
 
 ceph osd pool delete foo protect false
 ceph osd pool delete foo foo —yes….
 Pool successfully deleted
 
 Something like that per pool seems better to me as well. But I'd then
 opt for a 'feature' which can be set on a pool.
 
 ceph osd pool set foo nodelete
 ceph osd pool set foo nopgchange
 ceph osd pool set foo nosizechange
 
 I like this since it fits into the current flags nicely.  The downside is 
 we don't grandfather existing pools on upgrade.  Not sure if people think 
 that's a good idea.
 
 The good thing with that is that owners of the pool (or admin), will be 
 able to set this flag or remove it.
 We stick with the ceph osd pool delete foo foo —yes….” command as well, so 
 we don’t change too much things.
 
 Moreover we can also make use of a config option to protect all new created 
 pools by default:
 
 mon protect pool default = true
 
 This automatically set the protected flag to a new pool.
 
 What do you think?
 
 Setting a nodelete flag or something like that by default is fine with
 me. Like Sage mentioned earlier, almost nobody will have ephemeral pools
 in their cluster. You don't want to lose data because you accidentally
 removed a pool.
 
 We should mirror this option:
 
 OPTION(osd_pool_default_flag_hashpspool, OPT_BOOL, true)   // use new pg 
 hashing to prevent pool/pg overlap
 
 So,
 
 osd_pool_default_flag_nodelete = true
 osd_pool_default_flag_nopgchange = true
 osd_pool_default_flag_nosizechange = true
 
 The big question for me is should we enable these by default in hammer?
 

I would say yes. We should probably protect users against something stupid 
which makes them lose data.

Data safety is prio #1

Wido


 sage
 
 
 
 
 Wido
 
 On 15 Jan 2015, at 18:24, Sage Weil s...@newdream.net wrote:
 
 Then secondary question is whether the cluster should implicitly clear the 
 allow-delete after some time period (maybe 'pending-delete' would make 
 more sense in that case), or whether we deny IO during that period.  Seems 
 perhaps too complicated.
 
 
 Cheers.
  
 Sébastien Han 
 Cloud Architect 
 
 Always give 100%. Unless you're giving blood.
 
 Phone: +33 (0)1 49 70 99 72 
 Mail: sebastien@enovance.com 
 Address : 11 bis, rue Roquépine - 75008 Paris
 Web : www.enovance.com - Twitter : @enovance
 
 
 -- 
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


'Immutable bit' on pools to prevent deletion

2015-01-15 Thread Wido den Hollander
Hi,

Although the userland tools like 'ceph' and 'rados' have a safeguard
against fat fingers when it comes to removing a pool there is no such
safeguard when using native librados.

The danger still exists that you accidentally remove a pool which is then
completely gone, with no way to restore it.

This is still something I find quite dangerous, so I was thinking about
an additional 'Immutable bit' which could be set on a pool before
rados_pool_delete() allows this pool to be removed.

Is it a sane thing to look at 'features' which pools could have? Other
features which might be set on a pool:

- Read Only (all write operations return -EPERM)
- Delete Protected

It's just that looking at a 20TB RBD pool and thinking that just one API
call could remove this pool makes me a bit scared.
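
To illustrate the point, this is all it takes through the librados C API (a
minimal sketch; the config path and pool name are just placeholders and error
checking is left out):

#include <rados/librados.h>

int main(void)
{
    rados_t cluster;

    /* connect as client.admin using the local ceph.conf */
    rados_create(&cluster, NULL);
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    /* one call, no confirmation, and the pool plus all objects are gone */
    rados_pool_delete(cluster, "rbd");

    rados_shutdown(cluster);
    return 0;
}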

Am I the only one or is this something worth looking into?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 'Immutable bit' on pools to prevent deletion

2015-01-15 Thread Wido den Hollander
On 01/15/2015 04:39 PM, Dan Van Der Ster wrote:
 Hi Wido,
 
 +1 for safeguards.
 
 Yeah that is scary: it's one api call to delete a pool, and perhaps even a 
 client with w capability on a pool can delete it?? (I didn’t try...)
 

Quick try, yes! I created a pool and a user which only has access to that
pool. With that user I was able to remove that pool.

 I can think of many ways that fat fingers can create crazy loads, deny client 
 access, ...
 

Sure, but none of those actually make you lose your data.

I know that you should create backups, but accidentally removing a pool
is something that is very dangerous and it will take a lot of time to
restore from backups.

  1. changing pool size
  2. setting pool quotas
  3. unplanned PG splitting
  4. creating an EC pool on a cluster with dumpling clients
  5. reweight-by-utilization
  6. changing crush rules/tunables
 
 —yes-i-really-really-mean-it is nice when it’s there. But regardless it is 
 probably not a good practice to work daily (or run librados cron jobs) in a 
 shell that has access to the client.admin keyring. I’ve thought of using sudo 
 to restrict our admin shell to subset of ceph admin commands. But even better 
 would be a internal bit which locks out the API beneath “ceph osd pool …” and 
 “ceph osd crush …”, even for client.admin.
 
 Maybe this is already possible by creating a client.admin-readonly account 
 for daily work and crons, and limit access to client.admin except when 
 absolutely necessary ?
 

That would be great indeed. The client.admin key currently has all the
capabilities and I would indeed like an RO account.

But still, another safeguard against deleting pools would be something
I'd like to see.

Wido

 Cheers, Dan
 
 
 On 15 Jan 2015, at 15:46, Wido den Hollander w...@42on.com wrote:

 Hi,

 Although the userland tools like 'ceph' and 'rados' have a safeguard
 against fat fingers when it comes to removing a pool there is no such
 safeguard when using native librados.

 The danger still exists that by accident you remove a pool which is then
 completely gone, no way to restore it.

 This is still something I find quite dangerous, so I was thinking about
 a additional 'Immutable bit' which could be set on a pool before
 rados_pool_delete() allows this pool to be removed.

 Is it a sane thing to look at 'features' which pools could have? Other
 features which might be set on a pool:

 - Read Only (all write operations return -EPERM)
 - Delete Protected

 It's just that looking at a 20TB RBD pool and thinking that just one API
 call could remove this pool make me a bit scared.

 Am I the only one or is this something worth looking in to?

 -- 
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS aborted after recovery and active, FAILED assert (r =0)

2015-01-15 Thread Wido den Hollander
On 01/16/2015 08:37 AM, Mohd Bazli Ab Karim wrote:
 Dear Ceph-Users, Ceph-Devel,
 
 Apologize me if you get double post of this email.
 
 I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3, 2 
 down and only 1 up) at the moment.
 Plus I have one CephFS client mounted to it.
 

In the Ceph world 0.72.2 is ancient and pretty old. If you want to play with
CephFS I recommend you upgrade to 0.90 and also use at least kernel 3.18

 Now, the MDS always get aborted after recovery and active for 4 secs.
 Some parts of the log are as below:
 
 -3 2015-01-15 14:10:28.464706 7fbcc8226700  1 -- 10.4.118.21:6800/5390 
 == osd.19 10.4.118.32:6821/243161 73  osd_op_re
 ply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 
 uv1871414 ondisk = 0) v6  221+0+0 (261801329 0 0) 0x
 7770bc80 con 0x69c7dc0
 -2 2015-01-15 14:10:28.464730 7fbcc8226700  1 -- 10.4.118.21:6800/5390 
 == osd.18 10.4.118.32:6818/243072 67  osd_op_re
 ply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567 ondisk = 
 0) v6  179+0+0 (3759887079 0 0) 0x7757ec80 con
 0x1c6bb00
 -1 2015-01-15 14:10:28.464754 7fbcc8226700  1 -- 10.4.118.21:6800/5390 
 == osd.47 10.4.118.35:6809/8290 79  osd_op_repl
 y(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message 
 too long)) v6  174+0+0 (3942056372 0 0) 0x69f94
 a00 con 0x1c6b9a0
  0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In 
 function 'void MDSTable::save_2(int, version_t)' thread 7
 fbcc8226700 time 2015-01-15 14:10:28.46
 mds/MDSTable.cc: 83: FAILED assert(r >= 0)
 
  ceph version  ()
  1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
  2: (Context::complete(int)+0x9) [0x568d29]
  3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
  4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
  5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
  6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
  7: (DispatchQueue::entry()+0x549) [0x975739]
  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
  9: (()+0x7e9a) [0x7fbcccb0de9a]
  10: (clone()+0x6d) [0x7fbccb4ba3fd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.
 
 Is there any workaround/patch to fix this issue? Let me know if need to see 
 the log with debug-mds of certain level as well.
 Any helps would be very much appreciated.
 
 Thanks.
 Bazli
 
 
 DISCLAIMER:
 
 
 This e-mail (including any attachments) is for the addressee(s) only and may 
 be confidential, especially as regards personal data. If you are not the 
 intended recipient, please note that any dealing, review, distribution, 
 printing, copying or use of this e-mail is strictly prohibited. If you have 
 received this email in error, please notify the sender immediately and delete 
 the original message (including any attachments).
 
 
 MIMOS Berhad is a research and development institution under the purview of 
 the Malaysian Ministry of Science, Technology and Innovation. Opinions, 
 conclusions and other information in this e-mail that do not relate to the 
 official business of MIMOS Berhad and/or its subsidiaries shall be understood 
 as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and 
 neither MIMOS Berhad nor its subsidiaries accepts responsibility for the 
 same. All liability arising from or in connection with computer viruses 
 and/or corrupted e-mails is excluded to the fullest extent permitted by law.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 'Immutable bit' on pools to prevent deletion

2015-01-15 Thread Wido den Hollander
On 01/15/2015 08:07 PM, Sage Weil wrote:
 On Thu, 15 Jan 2015, John Spray wrote:
 On Thu, Jan 15, 2015 at 6:07 PM, Sage Weil sw...@redhat.com wrote:
 What would that buy us? Preventing injectargs on it would require mon
 restarts, which is unfortunate ? and makes it sounds more like a
 security feature than a safety blanket.

 I meant 'ceph tell mon.* injectargs ...' as distinct from 'ceph daemon ...
 config set', which requires access to the host.  But yeah, if we went to
 the effort to limit injectargs (maybe a blanket option that disables
 injectargs on mons?), it could double as a security feature.

 But whether it may also useful for security doesn't change whether it is a
 good safety blanket.  I like it because it's simple, easy to implement,
 and easy to disable for testing... :)

 The trouble with this is admin socket part is that any tool that
 manages Ceph must use the admin socket interface as well as the normal
 over-the-network command interface, and by extension must be able to
 execute locally on a mon.  We would no longer have a comprehensive
 remote management interface for the mon: management tools would have
 to run some code locally too.
 
 True.. if we make that option enabled by default.  If we it's off by 
 default them it's an opt-in layer of protection.  Most clusters don't have 
 ephemeral pools so I think lots of people would want this.
  

If this is the easiest route I'm +1 for that way.

I'd turn it off right away on the clusters I manage. Pools don't change
that often and I simply want another safeguard against deleting them.

 I think it's sufficient to require two API calls (set the flag or
 config option, then do the delete) within the remote API, rather than
 requiring that anyone driving the interface knows how to speak two
 network protocols (usual mon remote command + SSH-to-asok).
 
 Yeah...
 
 sage
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: /usr/bin/cephfs tool

2015-01-09 Thread Wido den Hollander
On 01/09/2015 04:20 PM, Sage Weil wrote:
 Should we drop this entirely in hammer?  If I remember correctly all of 
 the layout stuff is fully supported using virtual xattrs and standard 
 tools.  The only thing left is the tool that shows you how file blocks map 
 to objects map to OSDs (or something like that), which I've never used and 
 have never seen anyone use...
 

I'm in favor of dropping it! It confused me a couple of weeks ago since
I thought for a second that it was still the right way to go.

With CephFS becoming more mature in Hammer we should not encourage users
to use this deprecated tool.

They should use vxattrs from the beginning.
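
For example, reading the layout boils down to a plain getxattr() call on the
CephFS virtual xattr (a minimal sketch; the mount path and file name are
placeholders, and sub-fields such as ceph.file.layout.object_size can be read
the same way):

#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
    char buf[1024];

    /* "ceph.file.layout" is the CephFS virtual xattr exposing the file layout */
    ssize_t len = getxattr("/mnt/cephfs/somefile", "ceph.file.layout",
                           buf, sizeof(buf) - 1);
    if (len < 0) {
        perror("getxattr");
        return 1;
    }
    buf[len] = '\0';
    printf("%s\n", buf);
    return 0;
}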

I'm +1 in dropping this tool entirely in Hammer.

 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Higher OSD disk util due to RBD snapshots from Dumpling to Firefly

2015-01-07 Thread Wido den Hollander
On 01/07/2015 05:51 PM, Dan van der Ster wrote:
 Hi Wido,
 I've been trying to reproduce this but haven't been able yet.
 
 What I've tried so far is use fio rbd with a 0.80.7 client connected
 to a 0.80.7 cluster. I created a 10GB format 2 block device, then
 measured the 4k randwrite iops before and after having snaps. I
 measured around 2000 iops to the image before any snapshots, then
 created 200 snapshots on the device and ran fio again. Initially the
 iops were low (I guess this is from the 4MB CoW resulting from the
 first 4k write to each underlying object). But eventually the speed
 stabilized to around 2000 iops again. Actually the initial slowdown
 was the same whether I created 1 snapshot or 200.
 
 This was just quick subjective test so far, since from your report I
 was expecting something obvious to stick out. But it appears pretty
 OK, no? Would you have expected something different from these tests?
 

Well, I'm not sure what to expect. But what I noticed is that when I
removed all the snapshots the slow requests were gone and the disk util
dropped on the OSDs.

Wido

 Cheers, Dan
 
 
 On Wed, Dec 31, 2014 at 5:21 PM, Wido den Hollander w...@42on.com wrote:
 Hi,

 Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly
 0.80.7 and after the upgrade there was a severe performance drop on the
 cluster.

 It started raining slow requests after the upgrade and most of them
 included a 'snapc' in the request.

 That lead me to investigate the RBD snapshots and I found that a rogue
 process had created ~1800 snapshots spread out over 200 volumes.

 One image even had 181 snapshots!

 As the snapshots weren't used I removed them all and after the snapshots
 were removed the performance of the cluster came back to normal level again.

 I'm wondering what changed between Dumpling and Firefly which caused
 this? I saw OSDs spiking to 100% disk util constantly under Firefly
 where this didn't happen with Dumpling.

 Did something change in the way OSDs handle RBD snapshots which causes
 them to create more disk I/O?

 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

2015-01-05 Thread Wido den Hollander



On 05-01-15 11:53, Chaitanya Huilgol wrote:

Hi All,

The stock ceph-client modules with Ubuntu 14.04 LTS are quite dated and we are 
seeing crashes and soft-lockup issues which have been fixed in the current 
ceph-client code base.
What would be recommended ceph-client branch compatible with the Ubuntu 14.04 
(3.13.0-x) kernels so that we can get as many fixes as possible?



I recommend you take a look here: 
http://kernel.ubuntu.com/~kernel-ppa/mainline/


That should give you some new kernels.

Wido


Regards,
Chaitanya



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph-deploy osd destroy feature

2015-01-04 Thread Wido den Hollander
On 01/02/2015 10:31 PM, Travis Rhoden wrote:
 Hi everyone,
 
 There has been a long-standing request [1] to implement an OSD
 destroy capability to ceph-deploy.  A community user has submitted a
 pull request implementing this feature [2].  While the code needs a
 bit of work (there are a few things to work out before it would be
 ready to merge), I want to verify that the approach is sound before
 diving into it.
 
 As it currently stands, the new feature would do allow for the following:
 
 ceph-deploy osd destroy host --osd-id id
 
 From that command, ceph-deploy would reach out to the host, do ceph
 osd out, stop the ceph-osd service for the OSD, then finish by doing
 ceph osd crush remove, ceph auth del, and ceph osd rm.  Finally,
 it would umount the OSD, typically in /var/lib/ceph/osd/...
 

Prior to the unmount, shouldn't it also clean up the 'ready' file to
prevent the OSD from starting after a reboot?

Although its key has been removed from the cluster it shouldn't matter
that much, but it seems a bit cleaner.

It could even be more destructive: if you pass --zap-disk to it, it could
also run wipefs or something to clean the whole disk.

 
 Does this high-level approach seem sane?  Anything that is missing
 when trying to remove an OSD?
 
 
 There are a few specifics to the current PR that jump out to me as
 things to address.  The format of the command is a bit rough, as other
 ceph-deploy osd commands take a list of [host[:disk[:journal]]] args
 to specify a bunch of disks/osds to act on at one.  But this command
 only allows one at a time, by virtue of the --osd-id argument.  We
 could try to accept [host:disk] and look up the OSD ID from that, or
 potentially take [host:ID] as input.
 
 Additionally, what should be done with the OSD's journal during the
 destroy process?  Should it be left untouched?
 
 Should there be any additional barriers to performing such a
 destructive command?  User confirmation?
 
 
  - Travis
 
 [1] http://tracker.ceph.com/issues/3480
 [2] https://github.com/ceph/ceph-deploy/pull/254
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Higher OSD disk util due to RBD snapshots from Dumpling to Firefly

2014-12-31 Thread Wido den Hollander
Hi,

Last week I upgraded a 250 OSD cluster from Dumpling 0.67.10 to Firefly
0.80.7 and after the upgrade there was a severe performance drop on the
cluster.

It started raining slow requests after the upgrade and most of them
included a 'snapc' in the request.

That lead me to investigate the RBD snapshots and I found that a rogue
process had created ~1800 snapshots spread out over 200 volumes.

One image even had 181 snapshots!

As the snapshots weren't used I removed them all, and after the snapshots
were removed the performance of the cluster came back to a normal level again.
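
For anyone who needs to script a similar cleanup, the calls map directly onto
the librbd C API. A minimal sketch (fixed-size snapshot buffer, no handling of
protected snapshots, and the pool ioctx plus image name are assumed to be set
up elsewhere):

#include <rados/librados.h>
#include <rbd/librbd.h>

/* remove every snapshot of a single image in the pool behind 'io' */
static int purge_snaps(rados_ioctx_t io, const char *image_name)
{
    rbd_image_t image;
    rbd_snap_info_t snaps[256];
    int max = 256;
    int r, i;

    r = rbd_open(io, image_name, &image, NULL);
    if (r < 0)
        return r;

    /* returns the number of snapshots, or -ERANGE if the buffer is too small */
    r = rbd_snap_list(image, snaps, &max);
    if (r >= 0) {
        for (i = 0; i < r; i++)
            rbd_snap_remove(image, snaps[i].name);
        rbd_snap_list_end(snaps);
    }

    return rbd_close(image);
}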

I'm wondering what changed between Dumpling and Firefly which caused
this? I saw OSDs spiking to 100% disk util constantly under Firefly
where this didn't happen with Dumpling.

Did something change in the way OSDs handle RBD snapshots which causes
them to create more disk I/O?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: new rados whereis command

2014-12-25 Thread Wido den Hollander
On 12/23/2014 11:22 PM, Loic Dachary wrote:
 Hi Andreas,
 
 I took a closer look at https://github.com/ceph/ceph/pull/2730 implementing 
 rados whereis [--dns] and I think it deserves a discussion here. If I 
 understand correctly, it relies on a new function of the rados API:
 
    typedef struct whereis {
      int64_t osd_id;                              // ID of the OSD hosting this object
      std::string osd_state;                       // state of the OSD - either 'active' or 'inactive'
      int64_t pg_seed;                             // Seed of the PG hosting this object
      std::string ip_string;                       // Ip as string
      std::vector<std::string> host_names;         // optional reverse DNS HostNames
      std::map<std::string, std::string> user_map; // optional user KV map
      void resolve();                              // reverse DNS OSD IPs and store in HostNames
    } whereis_t;
 

Looking at this, will it return the public or the cluster IP? I think the
public one, which seems the right thing, but shouldn't the struct also
facilitate returning the cluster IP?

The rados tool doesn't have to, but you never know what people want in
the future?

Great idea though! Very helpful!

    static int whereis(IoCtx& ioctx, const std::string& oid,
                       std::vector<whereis_t>& locations);
 
 which needs to be added there because the rados API does not expose some 
 details that are needed to fill the fields of the whereis_t structure.
 
 It looks fine to me but ... I'm not used to maintaining or developing the 
 rados API and someone else may have a more informed opinion.
 
 There is a technical detail that also needs to be sorted out : the current 
 implementation exposes the RadosWhereis class (for dump) and this should 
 either be moved to rados.cc or be part of the rados API (which probably is 
 not the best option because it would also expose Formatter as a consequence).
 
 Cheers
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CephFS hangs when writing 10GB files in loop

2014-12-18 Thread Wido den Hollander
On 12/17/2014 07:42 PM, Gregory Farnum wrote:
 On Wed, Dec 17, 2014 at 8:35 AM, Wido den Hollander w...@42on.com wrote:
 Hi,

 Today I've been playing with CephFS and the morning started great with
 CephFS playing along just fine.

 Some information first:
 - Ceph 0.89
 - Linux kernel 3.18
 - Ceph fuse 0.89
 - One Active MDS, one Standby

 This morning I could write a 10GB file like this using the kclient:
 $ dd if=/dev/zero of=10GB.bin bs=1M count=10240 conv=fsync

 That gave me 850MB/sec (all 10G network) and I could read the same file
 again with 610MB/sec.

 After writing to it multiple times it suddenly started to hang.

 No real evidence on the MDS (debug mds set to 20) or anything on the
 client. That specific operation just blocked, but I could still 'ls' the
 filesystem in a second terminal.

 The MDS was showing in it's log that it was checking active sessions of
 clients. It showed the active session of my single client.

 The client renewed it's caps and proceeded.
 
 Can you clarify this? I'm not quite sure what you mean.
 

I currently don't have the logs available. That was my problem when
typing the original e-mail.

 I currently don't have any logs, but I'm just looking for a direction to
 be pointed towards.

 Any ideas?
 
 Well, now that you're on v0.89 you should explore the admin
 socket...there are commands on the MDS to dump ops in flight (and
 maybe to look at session states? I don't remember when that merged).

Sage's pointer towards the kernel debugging and the new admin socket
showed me that it was RADOS calls which were hanging.

I investigated even further and it seems that this is not a CephFS
problem, but a local TCP issue which is only triggered when using CephFS.

At some point, which is still unclear to me, data transfer becomes very
slow. The MDS doesn't seem to be able to update the journal and the
client can't write to the OSDs anymore.

It happened after I did some very basic TCP tuning (timestamp, rmem,
wmem, sack, fastopen).

Reverting back to the Ubuntu 14.04 defaults resolved it all and CephFS
is running happily now.

I'll dig a bit deeper to see why this system was affected by those
changes. I applied these settings earlier on an RBD-only cluster without
any problems.

 -Greg
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CephFS hangs when writing 10GB files in loop

2014-12-18 Thread Wido den Hollander
On 12/18/2014 11:13 AM, Wido den Hollander wrote:
 On 12/17/2014 07:42 PM, Gregory Farnum wrote:
 On Wed, Dec 17, 2014 at 8:35 AM, Wido den Hollander w...@42on.com wrote:
 Hi,

 Today I've been playing with CephFS and the morning started great with
 CephFS playing along just fine.

 Some information first:
 - Ceph 0.89
 - Linux kernel 3.18
 - Ceph fuse 0.89
 - One Active MDS, one Standby

 This morning I could write a 10GB file like this using the kclient:
 $ dd if=/dev/zero of=10GB.bin bs=1M count=10240 conv=fsync

 That gave me 850MB/sec (all 10G network) and I could read the same file
 again with 610MB/sec.

 After writing to it multiple times it suddenly started to hang.

 No real evidence on the MDS (debug mds set to 20) or anything on the
 client. That specific operation just blocked, but I could still 'ls' the
 filesystem in a second terminal.

 The MDS was showing in it's log that it was checking active sessions of
 clients. It showed the active session of my single client.

 The client renewed it's caps and proceeded.

 Can you clarify this? I'm not quite sure what you mean.

 
 I currently don't have the logs available. That was my problem when
 typing the original e-mail.
 
 I currently don't have any logs, but I'm just looking for a direction to
 be pointed towards.

 Any ideas?

 Well, now that you're on v0.89 you should explore the admin
 socket...there are commands on the MDS to dump ops in flight (and
 maybe to look at session states? I don't remember when that merged).
 
 Sage's pointer towards the kernel debugging and the new admin socket
 showed me that it were RADOS calls which were hanging.
 
 I investigated even further and it seems that this is not a CephFS
 problem, but a local TCP issue which is only triggered when using CephFS.
 
 At some point, which is still unclear to me, data transfer becomes very
 slow. The MDS doesn't seem to be able to update the journal and the
 client can't write to the OSDs anymore.
 
 It happened after I did some very basic TCP tuning (timestamp, rmem,
 wmem, sack, fastopen).
 

So it was tcp_sack. With tcp_sack=0 the MDS has problems talking to
OSDs. Other clients still work fine, but the MDS couldn't replay its
journal and such.

Enabling tcp_sack again resolved the problem. The new admin socket
really helped there!

 Reverting back to the Ubuntu 14.04 defaults resolved it all and CephFS
 is running happily now.
 
 I'll dig some deeper to see why this system was affected by those
 changes. I applied these settings earlier on a RBD-only cluster without
 any problems.
 
 -Greg

 
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CephFS hangs when writing 10GB files in loop

2014-12-18 Thread Wido den Hollander
On 12/18/2014 05:32 PM, Atchley, Scott wrote:
 On Dec 18, 2014, at 10:54 AM, Wido den Hollander w...@42on.com wrote:
 
 On 12/18/2014 11:13 AM, Wido den Hollander wrote:
 On 12/17/2014 07:42 PM, Gregory Farnum wrote:
 On Wed, Dec 17, 2014 at 8:35 AM, Wido den Hollander w...@42on.com wrote:
 Hi,

 Today I've been playing with CephFS and the morning started great with
 CephFS playing along just fine.

 Some information first:
 - Ceph 0.89
 - Linux kernel 3.18
 - Ceph fuse 0.89
 - One Active MDS, one Standby

 This morning I could write a 10GB file like this using the kclient:
 $ dd if=/dev/zero of=10GB.bin bs=1M count=10240 conv=fsync

 That gave me 850MB/sec (all 10G network) and I could read the same file
 again with 610MB/sec.

 After writing to it multiple times it suddenly started to hang.

 No real evidence on the MDS (debug mds set to 20) or anything on the
 client. That specific operation just blocked, but I could still 'ls' the
 filesystem in a second terminal.

 The MDS was showing in it's log that it was checking active sessions of
 clients. It showed the active session of my single client.

 The client renewed it's caps and proceeded.

 Can you clarify this? I'm not quite sure what you mean.


 I currently don't have the logs available. That was my problem when
 typing the original e-mail.

 I currently don't have any logs, but I'm just looking for a direction to
 be pointed towards.

 Any ideas?

 Well, now that you're on v0.89 you should explore the admin
 socket...there are commands on the MDS to dump ops in flight (and
 maybe to look at session states? I don't remember when that merged).

 Sage's pointer towards the kernel debugging and the new admin socket
 showed me that it were RADOS calls which were hanging.

 I investigated even further and it seems that this is not a CephFS
 problem, but a local TCP issue which is only triggered when using CephFS.

 At some point, which is still unclear to me, data transfer becomes very
 slow. The MDS doesn't seem to be able to update the journal and the
 client can't write to the OSDs anymore.

 It happened after I did some very basic TCP tuning (timestamp, rmem,
 wmem, sack, fastopen).


 So it was tcp_sack. With tcp_sack=0 the MDS has problems talking to
 OSDs. Other clients still work fine, but the MDS couldn't replay it's
 journal and such.

 Enabling tcp_sack again resolved the problem. The new admin socket
 really helped there!
 
 What was the reasoning behind disabling SACK to begin with? Without it, any 
 drops or reordering might require resending potentially a lot of data.
 

I was testing with various TCP settings and sack was one of those. I
didn't think earlier that it might be the problem.


 Reverting back to the Ubuntu 14.04 defaults resolved it all and CephFS
 is running happily now.

 I'll dig some deeper to see why this system was affected by those
 changes. I applied these settings earlier on a RBD-only cluster without
 any problems.

 -Greg





 -- 
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


CephFS hangs when writing 10GB files in loop

2014-12-17 Thread Wido den Hollander
Hi,

Today I've been playing with CephFS and the morning started great with
CephFS playing along just fine.

Some information first:
- Ceph 0.89
- Linux kernel 3.18
- Ceph fuse 0.89
- One Active MDS, one Standby

This morning I could write a 10GB file like this using the kclient:
$ dd if=/dev/zero of=10GB.bin bs=1M count=10240 conv=fsync

That gave me 850MB/sec (all 10G network) and I could read the same file
again with 610MB/sec.

After writing to it multiple times it suddenly started to hang.

No real evidence on the MDS (debug mds set to 20) or anything on the
client. That specific operation just blocked, but I could still 'ls' the
filesystem in a second terminal.

The MDS was showing in its log that it was checking active sessions of
clients. It showed the active session of my single client.

The client renewed its caps and proceeded.

I currently don't have any logs, but I'm just looking for a direction to
be pointed towards.

Any ideas?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CephFS hangs when writing 10GB files in loop

2014-12-17 Thread Wido den Hollander
On 12/17/2014 05:40 PM, Sage Weil wrote:
 On Wed, 17 Dec 2014, Wido den Hollander wrote:
 Hi,

 Today I've been playing with CephFS and the morning started great with
 CephFS playing along just fine.

 Some information first:
 - Ceph 0.89
 - Linux kernel 3.18
 - Ceph fuse 0.89
 - One Active MDS, one Standby

 This morning I could write a 10GB file like this using the kclient:
 $ dd if=/dev/zero of=10GB.bin bs=1M count=10240 conv=fsync

 That gave me 850MB/sec (all 10G network) and I could read the same file
 again with 610MB/sec.

 After writing to it multiple times it suddenly started to hang.

 No real evidence on the MDS (debug mds set to 20) or anything on the
 client. That specific operation just blocked, but I could still 'ls' the
 filesystem in a second terminal.

 The MDS was showing in its log that it was checking active sessions of
 clients. It showed the active session of my single client.

 The client renewed its caps and proceeded.

 I currently don't have any logs, but I'm just looking for a direction to
 be pointed towards.
 
 Hmm.  Try
 
  cat /sys/kernel/debug/ceph/*/mdsc
  cat /sys/kernel/debug/ceph/*/osdc
 

I'll check that, good point.

 to see requests in flight (you may need to mount -t debugfs none 
 /sys/kernel/debug first).  What kernel version?
 

I tried with 3.18

Also tried with ceph-fuse 0.89, same result. It is slower, but it also
hangs at some point.

 sage
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Expose pool quota for Libvirt

2014-12-05 Thread Wido den Hollander
On 12/03/2014 05:39 PM, Logan Barfield wrote:
 I'm not sure if this is already in place, but if not I think pool quotas
 should be exposed in the same way as pool size so Libvirt can pick up on
 them.
 

Yes, that would be ideal indeed.

 We currently run several KVM hypervisors backed by Ceph RBD.  We have a
 CRUSH ruleset that defines SSD backed servers for RBD, and high capacity
 HDD backed servers for RadosGW.
 
 Right now when adding our RBD pool via Libvirt it sees the entire cluster's
 capacity, even though the actual pool capacity is much lower.  This means
 that any deployment tools that look at the available capacity when creating
 VMs can't be relied upon, and we have to continually track usage manually
 (we audit it anyway, but that should be in addition to built-in checks).
 We can currently work around this by manually setting the capacity in the
 deployment tool, but fixing it at the source seems like a much better
 option.
 
 Obviously capacity can't be automatically estimated from the CRUSH ruleset,
 but with the recent addition of pool quotas it would be useful to let
 Libvirt (and other clients) pull the quota size so we could set them as
 needed.
 
 Some changes will need to be made on the Libvirt side as well, but the
 functionality has to be implemented in Ceph first.
 

The main problem here is that I currently don't see a way in librados to
fetch this information.

rados_pool_stat_t in librados.h does not expose a quota or anything
similar. So what currently happens is that Libvirt does a cluster stat()
in librados and fetches how large the cluster is.

If librados exposes a way to get the quota from a pool it would be very
simple to fix this in Libvirt.
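
To illustrate: the information is already available on the monitor side,
e.g. (pool name is just an example):

$ ceph osd pool get-quota libvirt-pool
$ ceph osd pool get-quota libvirt-pool -f json

So a librados getter would mostly be a thin wrapper; worst case Libvirt
could fetch the JSON itself through rados_mon_command(), but a proper
call in librados would be much cleaner.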

 I have opened an issue in the tracker for this:
 http://tracker.ceph.com/issues/10226
 

I'll assign this to myself.

 If this functionality already exists I'll close the issue and work on
 things from the Libvirt side.
 
 
 Thank You,
 
 Logan Barfield
 Tranquil Hosting
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


inode64 mount option for XFS

2014-11-03 Thread Wido den Hollander
Hi,

While looking at init-ceph and ceph-disk I noticed a discrepancy between them.

init-ceph mounts XFS filesystems with rw,noatime,inode64, but
ceph-disk(-active) with rw,noatime

As inode64 gives the best performance, shouldn't ceph-disk do the same?

Any implications if we add inode64 on running deployments?
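
In the meantime it can be forced per deployment; if I read ceph-disk
correctly it honours the mount options setting from ceph.conf, so
something like this should do it (option name from memory):

[osd]
osd mount options xfs = rw,noatime,inode64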

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inode64 mount option for XFS

2014-11-03 Thread Wido den Hollander
On 11/03/2014 01:34 PM, Stefan Priebe - Profihost AG wrote:
 
 Am 03.11.2014 um 13:28 schrieb Wido den Hollander:
 Hi,

 While looking at init-ceph and ceph-disk I noticed a discrepancy between them.

 init-ceph mounts XFS filesystems with rw,noatime,inode64, but
 ceph-disk(-active) with rw,noatime

 As inode64 gives the best performance, shouldn't ceph-disk do the same?

 Any implications if we add inode64 on running deployments?
 
 Isn't inode64 XFS default anyway?
 

The XFS website suggests it isn't:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F

By default, with 32bit inodes, XFS places inodes only in the first 1TB
of a disk.

However, if you look at bit further:
http://xfs.org/index.php/XFS_status_update_for_2012

Linux 3.7 will be a fairly boring release as far as XFS is concerned,
the biggest user visible changes are an intelligent implementation of
the lseek SEEK_HOLE/SEEK_DATA calls, and finally the switch to use the
inode64 allocator by default. 

So it seems you are partially right. It depends on the kernel you are
running whether it is enabled by default.
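
A quick way to see what a given OSD mount actually ended up with (defaults
are not always listed there, so absence is not conclusive):

$ grep /var/lib/ceph/osd /proc/mounts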

Wido

 Stefan
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
Hi,

On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
due to a crash they encounter when reading the PGLog.

A snippet of the log:

   -11 2014-10-27 21:56:04.690046 7f034a006800 10
filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
   -10 2014-10-27 21:56:04.690078 7f034a006800 20
filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
-9 2014-10-27 21:56:04.690140 7f034a006800 20
filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
-8 2014-10-27 21:56:04.690150 7f034a006800 15
filestore(/var/lib/ceph/osd/ceph-25) touch meta/a1630ecd/pglog_14.1a56/0//-1
-7 2014-10-27 21:56:04.690184 7f034a006800 10
filestore(/var/lib/ceph/osd/ceph-25) touch
meta/a1630ecd/pglog_14.1a56/0//-1 = 0
-6 2014-10-27 21:56:04.690196 7f034a006800 15
filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
meta/a1630ecd/pglog_14.1a56/0//-1
-5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
-4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
header.spos 0.0.0
-3 2014-10-27 21:56:04.690314 7f034a006800  0
filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
-2 2014-10-27 21:56:04.690325 7f034a006800  0
filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
-1 2014-10-27 21:56:04.690327 7f034a006800  0
filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
{ ops: [
{ op_num: 0,
  op_name: nop},
{ op_num: 1,
  op_name: touch,
  collection: meta,
  oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
{ op_num: 2,
  op_name: omap_rmkeys,
  collection: meta,
  oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
{ op_num: 3,
  op_name: omap_setkeys,
  collection: meta,
  oid: a1630ecd\/pglog_14.1a56\/0\/\/-1,
  attr_lens: { can_rollback_to: 12}}]}
 0 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 21:56:04.690368
os/FileStore.cc: 2559: FAILED assert(0 == unexpected error)


The backing XFS filesystem seems to be OK, but isn't this a leveldb
issue where the omap information is stored?

Anyone seen this before? I have about 5 OSDs (out of the 336) which are
showing this problem when booting.

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
On 10/27/2014 10:35 PM, Samuel Just wrote:
 The file is supposed to be 0 bytes, can you attach the log which went
 with that strace?

Yes, two URLs:

* http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
* http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

It was with debug_filestore on 20.

Wido

 -Sam
 
 On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:05 PM, Samuel Just wrote:
 Try reproducing with an strace.

 I did so and this is the result:
 http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It does this stat:

 stat(/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C

 That fails with: -1 ENOENT (No such file or directory)

Afterwards it opens this pglog:
 /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none

 That file is however 0 bytes. (And all other files in the same directory).

 Afterwards the OSD asserts and writes to the log.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander w...@42on.com wrote:
 Hi,

 On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
 due to a crash they encounter when reading the PGLog.

 A snippet of the log:

-11 2014-10-27 21:56:04.690046 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
-10 2014-10-27 21:56:04.690078 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
 -9 2014-10-27 21:56:04.690140 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
 -8 2014-10-27 21:56:04.690150 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) touch 
 meta/a1630ecd/pglog_14.1a56/0//-1
 -7 2014-10-27 21:56:04.690184 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) touch
 meta/a1630ecd/pglog_14.1a56/0//-1 = 0
 -6 2014-10-27 21:56:04.690196 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
 meta/a1630ecd/pglog_14.1a56/0//-1
 -5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
 a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
 -4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
 header.spos 0.0.0
 -3 2014-10-27 21:56:04.690314 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
 not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
 -2 2014-10-27 21:56:04.690325 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
 -1 2014-10-27 21:56:04.690327 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: nop},
 { op_num: 1,
   op_name: touch,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 2,
   op_name: omap_rmkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 3,
   op_name: omap_setkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1,
   attr_lens: { can_rollback_to: 12}}]}
  0 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 21:56:04.690368
 os/FileStore.cc: 2559: FAILED assert(0 == unexpected error)


 The backing XFS filesystem seems to be OK, but isn't this a leveldb
 issue where the omap information is stored?

 Anyone seen this before? I have about 5 OSDs (out of the 336) which are
 showing this problem when booting.

 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
On 10/27/2014 10:48 PM, Samuel Just wrote:
 Different nodes?

No, they are both from osd.25

I re-ran the strace with an empty logfile since the old logfile became
pretty big.

Wido

 -Sam
 
 On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:35 PM, Samuel Just wrote:
 The file is supposed to be 0 bytes, can you attach the log which went
 with that strace?

 Yes, two URLs:

 * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
 * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It was with debug_filestore on 20.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:05 PM, Samuel Just wrote:
 Try reproducing with an strace.

 I did so and this is the result:
 http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It does this stat:

 stat(/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C

 That fails with: -1 ENOENT (No such file or directory)

 Afterwards it opens this pglog:
 /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none

 That file is however 0 bytes. (And all other files in the same directory).

 Afterwards the OSD asserts and writes to the log.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander w...@42on.com wrote:
 Hi,

 On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
 due to a crash they encounter when reading the PGLog.

 A snippet of the log:

-11 2014-10-27 21:56:04.690046 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
-10 2014-10-27 21:56:04.690078 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
 -9 2014-10-27 21:56:04.690140 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
 -8 2014-10-27 21:56:04.690150 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) touch 
 meta/a1630ecd/pglog_14.1a56/0//-1
 -7 2014-10-27 21:56:04.690184 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) touch
 meta/a1630ecd/pglog_14.1a56/0//-1 = 0
 -6 2014-10-27 21:56:04.690196 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
 meta/a1630ecd/pglog_14.1a56/0//-1
 -5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
 a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
 -4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
 header.spos 0.0.0
 -3 2014-10-27 21:56:04.690314 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
 not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
 -2 2014-10-27 21:56:04.690325 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
 -1 2014-10-27 21:56:04.690327 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: nop},
 { op_num: 1,
   op_name: touch,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 2,
   op_name: omap_rmkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 3,
   op_name: omap_setkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1,
   attr_lens: { can_rollback_to: 12}}]}
  0 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 
 21:56:04.690368
 os/FileStore.cc: 2559: FAILED assert(0 == unexpected error)


 The backing XFS filesystem seems to be OK, but isn't this a leveldb
 issue where the omap information is stored?

 Anyone seen this before? I have about 5 OSDs (out of the 336) which are
 showing this problem when booting.

 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
On 10/27/2014 10:52 PM, Samuel Just wrote:
 I mean, the 5 osds, different nodes?

Yes. The cluster consists of 16 nodes and all these OSDs are on
different nodes.

All running Ubuntu 12.04 with Ceph 0.80.7

Wido

 -Sam
 
 On Mon, Oct 27, 2014 at 2:50 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:48 PM, Samuel Just wrote:
 Different nodes?

 No, they are both from osd.25

 I re-ran the strace with an empty logfile since the old logfile became
 pretty big.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:35 PM, Samuel Just wrote:
 The file is supposed to be 0 bytes, can you attach the log which went
 with that strace?

 Yes, two URLs:

 * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
 * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It was with debug_filestore on 20.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:05 PM, Samuel Just wrote:
 Try reproducing with an strace.

 I did so and this is the result:
 http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It does this stat:

 stat(/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C

 That fails with: -1 ENOENT (No such file or directory)

 Afterwards it opens this pglog:
 /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none

 That file is however 0 bytes. (And all other files in the same 
 directory).

 Afterwards the OSD asserts and writes to the log.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander w...@42on.com 
 wrote:
 Hi,

 On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
 due to a crash they encounter when reading the PGLog.

 A snippet of the log:

-11 2014-10-27 21:56:04.690046 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
-10 2014-10-27 21:56:04.690078 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no 
 xattr
 -9 2014-10-27 21:56:04.690140 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
 -8 2014-10-27 21:56:04.690150 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) touch 
 meta/a1630ecd/pglog_14.1a56/0//-1
 -7 2014-10-27 21:56:04.690184 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) touch
 meta/a1630ecd/pglog_14.1a56/0//-1 = 0
 -6 2014-10-27 21:56:04.690196 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
 meta/a1630ecd/pglog_14.1a56/0//-1
 -5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
 a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
 -4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
 header.spos 0.0.0
 -3 2014-10-27 21:56:04.690314 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
 not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
 -2 2014-10-27 21:56:04.690325 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
 -1 2014-10-27 21:56:04.690327 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: nop},
 { op_num: 1,
   op_name: touch,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 2,
   op_name: omap_rmkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 3,
   op_name: omap_setkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1,
   attr_lens: { can_rollback_to: 12}}]}
  0 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 
 21:56:04.690368
 os/FileStore.cc: 2559: FAILED assert(0 == unexpected error)


 The backing XFS filesystem seems to be OK, but isn't this a leveldb
 issue where the omap information is stored?

 Anyone seen this before? I have about 5 OSDs (out of the 336) which are
 showing this problem when booting.

 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel 
 in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe

Re: OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
On 10/27/2014 10:55 PM, Samuel Just wrote:
 There is nothing in dmesg?

No. The filesystem mounts cleanly and I even ran xfs_repair to see if
there was anything wrong with it.

All goes just fine. It's only the OSD which is crashing.

Wido

 -Sam
 
 On Mon, Oct 27, 2014 at 2:53 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:52 PM, Samuel Just wrote:
 I mean, the 5 osds, different nodes?

 Yes. The cluster consists of 16 nodes and all these OSDs are on
 different nodes.

 All running Ubuntu 12.04 with Ceph 0.80.7

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:50 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:48 PM, Samuel Just wrote:
 Different nodes?

 No, they are both from osd.25

 I re-ran the strace with an empty logfile since the old logfile became
 pretty big.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:35 PM, Samuel Just wrote:
 The file is supposed to be 0 bytes, can you attach the log which went
 with that strace?

 Yes, two URLs:

 * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
 * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It was with debug_filestore on 20.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander w...@42on.com 
 wrote:
 On 10/27/2014 10:05 PM, Samuel Just wrote:
 Try reproducing with an strace.

 I did so and this is the result:
 http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It does this stat:

 stat(/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C

 That fails with: -1 ENOENT (No such file or directory)

 Afterwards it opens this pglog:
 /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none

 That file is however 0 bytes. (And all other files in the same 
 directory).

 Afterwards the OSD asserts and writes to the log.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander w...@42on.com 
 wrote:
 Hi,

 On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to 
 start
 due to a crash they encounter when reading the PGLog.

 A snippet of the log:

-11 2014-10-27 21:56:04.690046 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
-10 2014-10-27 21:56:04.690078 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no 
 xattr
 -9 2014-10-27 21:56:04.690140 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
 -8 2014-10-27 21:56:04.690150 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) touch 
 meta/a1630ecd/pglog_14.1a56/0//-1
 -7 2014-10-27 21:56:04.690184 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) touch
 meta/a1630ecd/pglog_14.1a56/0//-1 = 0
 -6 2014-10-27 21:56:04.690196 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
 meta/a1630ecd/pglog_14.1a56/0//-1
 -5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
 a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
 -4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
 header.spos 0.0.0
 -3 2014-10-27 21:56:04.690314 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not 
 permitted
 not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
 -2 2014-10-27 21:56:04.690325 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
 -1 2014-10-27 21:56:04.690327 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: nop},
 { op_num: 1,
   op_name: touch,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 2,
   op_name: omap_rmkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 3,
   op_name: omap_setkeys,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1,
   attr_lens: { can_rollback_to: 12}}]}
  0 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: 
 In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 
 21:56:04.690368
 os/FileStore.cc: 2559: FAILED assert(0 == unexpected error)


 The backing XFS filesystem seems to be OK, but isn't this a leveldb
 issue where the omap information is stored?

 Anyone seen this before? I have about 5 OSDs (out of the 336) which 
 are
 showing this problem when booting.

 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 --
 To unsubscribe from this list: send the line unsubscribe 
 ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on


 --
 Wido den Hollander
 42on B.V.
 Ceph trainer

Re: OSDs crashing with Operation Not Permitted on reading PGLog

2014-10-27 Thread Wido den Hollander
On 10/27/2014 11:00 PM, Samuel Just wrote:
 Try running with osd_leveldb_paranoid=true and
 osd_leveldb_log=/var/log/ceph/osd/ceph-osd.id.log.leveldb on that
 osd.

Done and it was quite a clear message from leveldb:

2014/10/27-23:06:56.525355 7f14d0ea9800 Recovering log #164296
2014/10/27-23:06:56.554527 7f14d0ea9800 Delete type=0 #164296
2014/10/27-23:06:56.554644 7f14d0ea9800 Delete type=2 #164297
2014/10/27-23:06:56.555415 7f14d0ea9800 Delete type=2 #164298
2014/10/27-23:06:56.555709 7f14d0ea9800 Delete type=3 #164295
2014/10/27-23:06:56.556116 7f14cbc45700 Compacting 1@1 + 2@2 files
2014/10/27-23:06:56.626336 7f14cbc45700 Generated table #164299: 57
keys, 2193624 bytes
2014/10/27-23:06:56.642292 7f14cbc45700 compacted to: files[ 10 15 32 0
0 0 0 ]
2014/10/27-23:06:56.642310 7f14cbc45700 Compaction error: Corruption:
block checksum mismatch

What happened at this cluster is that an admin made a mistake and
accidentally reset all machines using the IPMI, so all the
filesystems (and thus leveldb) were not closed properly.

5 OSDs however didn't seem to have survived. (Which now causes 4 PGs to
be down).
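
In case anyone wants to reproduce the leveldb logging: those settings can
simply be passed on the command line for a one-off foreground run,
something like:

$ ceph-osd -i 25 -f --osd-leveldb-paranoid=true --osd-leveldb-log=/var/log/ceph/ceph-osd.25.leveldb.log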

Wido

 -Sam
 
 On Mon, Oct 27, 2014 at 2:56 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:55 PM, Samuel Just wrote:
 There is nothing in dmesg?

 No. The filesystem mounts cleanly and I even ran xfs_repair to see if
 there was anything wrong with it.

 All goes just fine. It's only the OSD which is crashing.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:53 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:52 PM, Samuel Just wrote:
 I mean, the 5 osds, different nodes?

 Yes. The cluster consists of 16 nodes and all these OSDs are on
 different nodes.

 All running Ubuntu 12.04 with Ceph 0.80.7

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:50 PM, Wido den Hollander w...@42on.com wrote:
 On 10/27/2014 10:48 PM, Samuel Just wrote:
 Different nodes?

 No, they are both from osd.25

 I re-ran the strace with an empty logfile since the old logfile became
 pretty big.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander w...@42on.com 
 wrote:
 On 10/27/2014 10:35 PM, Samuel Just wrote:
 The file is supposed to be 0 bytes, can you attach the log which went
 with that strace?

 Yes, two URLs:

 * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
 * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It was with debug_filestore on 20.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander w...@42on.com 
 wrote:
 On 10/27/2014 10:05 PM, Samuel Just wrote:
 Try reproducing with an strace.

 I did so and this is the result:
 http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz

 It does this stat:

 stat(/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C

 That fails with: -1 ENOENT (No such file or directory)

 Afterwards it opens this pglog:
 /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none

 That file is however 0 bytes. (And all other files in the same 
 directory).

 Afterwards the OSD asserts and writes to the log.

 Wido

 -Sam

 On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander w...@42on.com 
 wrote:
 Hi,

 On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to 
 start
 due to a crash they encounter when reading the PGLog.

 A snippet of the log:

-11 2014-10-27 21:56:04.690046 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
-10 2014-10-27 21:56:04.690078 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no 
 xattr
 -9 2014-10-27 21:56:04.690140 7f034a006800 20
 filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
 -8 2014-10-27 21:56:04.690150 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) touch 
 meta/a1630ecd/pglog_14.1a56/0//-1
 -7 2014-10-27 21:56:04.690184 7f034a006800 10
 filestore(/var/lib/ceph/osd/ceph-25) touch
 meta/a1630ecd/pglog_14.1a56/0//-1 = 0
 -6 2014-10-27 21:56:04.690196 7f034a006800 15
 filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
 meta/a1630ecd/pglog_14.1a56/0//-1
 -5 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
 a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
 -4 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  
 header.spos 0.0.0
 -3 2014-10-27 21:56:04.690314 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not 
 permitted
 not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
 -2 2014-10-27 21:56:04.690325 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
 -1 2014-10-27 21:56:04.690327 7f034a006800  0
 filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: nop},
 { op_num: 1,
   op_name: touch,
   collection: meta,
   oid: a1630ecd\/pglog_14.1a56\/0\/\/-1},
 { op_num: 2,
   op_name: omap_rmkeys,
   collection: meta,
   oid: a1630ecd

Re: WriteBack Throttle kill the performace of the disk

2014-10-14 Thread Wido den Hollander
On 10/14/2014 02:19 PM, Mark Nelson wrote:
 On 10/14/2014 12:15 AM, Nicheal wrote:
 Yes, Greg.
 But Unix based system always have a parameter dirty_ratio to prevent
 the system memory from being exhausted. If Journal speed is so fast
 while backing store cannot catch up with Journal, then the backing
 store write will be blocked by the hard limitation of system dirty
 pages. The problem here may be that system call, sync(), cannot return
 since the system always has lots of dirty pages. Consequently, 1)
 FileStore::sync_entry() will be timeout and then ceph_osd_daemon
 abort.  2) Even if the thread is not timed out, Since the Journal
 committed point cannot be updated so that the Journal will be blocked,
 waiting for the sync() return and update Journal committed point.
 So the Throttle is added to solve the above problems, right?
 
 Greg or Sam can correct me if I'm wrong, but I always thought of the
 wbthrottle code as being more an attempt to smooth out spikes in write
 throughput to prevent the journal from getting too far ahead of the
 backing store.  IE have more frequent, shorter flush periods rather than
 less frequent longer ones.  For Ceph that's probably a reasonable
 idea since you want all of the OSDs behaving as consistently as possible
 to prevent hitting the max outstanding client IOs/Bytes on the client
 and starving other ready OSDs.  I'm not sure it's worked out in practice
 as well as it might have in theory, though I'm not sure we've really
 investigated what's going on enough to be sure.
 

I thought that as well. So in the case of an SSD-based OSD where the
journal is on partition #1 and the data on partition #2, you would disable
wbthrottle, correct?

Since the journal is just as fast as the data partition.
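
If so, I assume simply setting this in ceph.conf for those OSDs is the way
to turn it off (assuming the switch is still called
filestore_wbthrottle_enable):

[osd]
filestore_wbthrottle_enable = false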

 However, in my tested ARM ceph cluster(3nodes, 9osds, 3osds/node), it
 will cause problem (SSD as journal, and HDD as data disk, fio 4k
 ramdom write iodepth 64):
  WritebackThrottle enable: Based on blktrace, we trace the back-end
 hdd io behaviour. Because of frequently calling fdatasync() in
 Writeback Throttle, it cause every back-end hdd spent more time to
 finish one io. This causes the total sync time longer. For example,
 default sync_max_interval is 5 seconds, total dirty data in 5 seconds
 is 10M. If I disable WritebackThrottle, 10M dirty data will be sync to
 disk within 4 second, So cat /proc/meminfo, the dirty data of my
 system is always clean(near zero). However, If I enable
 WritebackThrottle, fdatasync() slows down the sync process. Thus, it
 seems 8-9M random io will be sync to the disk within 5s. Thus the
 dirty data is always growing to the critical point (system
 up-limitation), and then sync_entry() is always timed out. So I means,
 in my case, disabling WritebackThrottle, I may always have 600 IOPS.
 If enabling WritebackThrottle, IOPS always drop to 200 since fdatasync
 cause back-end HDD disk overloaded.
 
 We never did a blktrace investigation, but we did see pretty bad
 performance with the default wbthrottle code when it was first
 implemented.  We ended up raising the throttles pretty considerably in
 dumpling RC2.  It would be interesting to repeat this test on an Intel
 system.
 
 So I would like that we can dynamically throttle the IOPS in
 FileStore. We cannot know the average sync() speed of the back-end
 Store since different disk own different IO performance. However, we
 can trace the average write speed in FileStore and Journal, Also, we
 can know, whether start_sync() is return and finished. Thus, If this
 time, Journal is writing so fast that the back-end cannot catch up the
 Journal(e.g. 1000IOPS/s). We cannot Throttle the Journal speed(e.g.
 800IOPS/s) in next operation interval(the interval maybe 1 to 5
 seconds, in the third second, Thottle become 1000*e^-x where x is the
 tick interval, ), if in this interval, Journal write reach the
 limitation, the following submitting write should waiting in OSD
 waiting queue.So in this way, Journal may provide a boosting IO, but
 finally, back-end sync() will return and catch up with Journal become
 we always slow down the Journal speed after several seconds.

 
 I will wait for Sam's input, but it seems reasonable to me.  Perhaps you
 might write it up as a blueprint for CDS?
 
 Mark
 -- 
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: the state of cephfs in giant

2014-10-13 Thread Wido den Hollander
On 13-10-14 20:16, Sage Weil wrote:
 We've been doing a lot of work on CephFS over the past few months. This
 is an update on the current state of things as of Giant.
 
 What we've working on:
 
 * better mds/cephfs health reports to the monitor
 * mds journal dump/repair tool
 * many kernel and ceph-fuse/libcephfs client bug fixes
 * file size recovery improvements
 * client session management fixes (and tests)
 * admin socket commands for diagnosis and admin intervention
 * many bug fixes
 
 We started using CephFS to back the teuthology (QA) infrastructure in the
 lab about three months ago. We fixed a bunch of stuff over the first
 month or two (several kernel bugs, a few MDS bugs). We've had no problems
 for the last month or so. We're currently running 0.86 (giant release
 candidate) with a single MDS and ~70 OSDs. Clients are running a 3.16
 kernel plus several fixes that went into 3.17.
 
 
 With Giant, we are at a point where we would ask that everyone try
 things out for any non-production workloads. We are very interested in
 feedback around stability, usability, feature gaps, and performance. We
 recommend:
 

A question to clarify this for anybody out there. Do you think it is
safe to run CephFS on a cluster which is doing production RBD/RGW I/O?

Will it be the MDS/CephFS part which breaks, or are there potential issues
with OSD classes which might cause OSDs to crash due to bugs in CephFS?

I know you can't fully rule it out, but it would be useful to have this
clarified.

 * Single active MDS. You can run any number of standby MDS's, but we are
   not focusing on multi-mds bugs just yet (and our existing multimds test
   suite is already hitting several).
 * No snapshots. These are disabled by default and require a scary admin
   command to enable them. Although these mostly work, there are
   several known issues that we haven't addressed and they complicate
   things immensely. Please avoid them for now.
 * Either the kernel client (kernel 3.17 or later) or userspace (ceph-fuse
   or libcephfs) clients are in good working order.
 
 The key missing feature right now is fsck (both check and repair). This is 
 *the* development focus for Hammer.
 
 
 Here's a more detailed rundown of the status of various features:
 
 * multi-mds: implemented. limited test coverage. several known issues.
   use only for non-production workloads and expect some stability
   issues that could lead to data loss.
 
 * snapshots: implemented. limited test coverage. several known issues.
   use only for non-production workloads and expect some stability issues
   that could lead to data loss.
 
 * hard links: stable. no known issues, but there is somewhat limited
   test coverage (we don't test creating huge link farms).
 
 * direct io: implemented and tested for kernel client. no special
   support for ceph-fuse (the kernel fuse driver handles this).
 
 * xattrs: implemented, stable, tested. no known issues (for both kernel
   and userspace clients).
 
 * ACLs: implemented, tested for kernel client. not implemented for
   ceph-fuse.
 
 * file locking (fcntl, flock): supported and tested for kernel client.
   limited test coverage. one known minor issue for kernel with fix
   pending. implementation in progress for ceph-fuse/libcephfs.
 
 * kernel fscache support: implemented. no test coverage. used in
   production by adfin.
 
 * hadoop bindings: implemented, limited test coverage. a few known
   issues.
 
 * samba VFS integration: implemented, limited test coverage.
 
 * ganesha NFS integration: implemented, no test coverage.
 
 * kernel NFS reexport: implemented. limited test coverage. no known
   issues.
 
 
 Anybody who has experienced bugs in the past should be excited by:
 
 * new MDS admin socket commands to look at pending operations and client 
   session states. (Check them out with ceph daemon mds.a help!) These 
   will make diagnosing, debugging, and even fixing issues a lot simpler.
 
 * the cephfs_journal_tool, which is capable of manipulating mds journal 
   state without doing difficult exports/imports and using hexedit.
 
 Thanks!
 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph-devel irc channel archive

2014-10-10 Thread Wido den Hollander
On 10/10/2014 09:58 AM, Loic Dachary wrote:
 Hi,
 
 Are there publicly accessible archives of the ceph-devel IRC channel ? It 
 would be most convenient to search for past conversations.
 

Yes there is: http://irclogs.ceph.widodh.nl/

But while typing this message I see that my bot has died again... I'll
restart it!

 Cheers
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph-devel irc channel archive

2014-10-10 Thread Wido den Hollander
On 10/10/2014 10:56 AM, Loic Dachary wrote:
 
 
 On 10/10/2014 10:53, Wido den Hollander wrote:
 On 10/10/2014 09:58 AM, Loic Dachary wrote:
 Hi,

 Are there publicly accessible archives of the ceph-devel IRC channel ? It 
 would be most convenient to search for past conversations.


 Yes there is: http://irclogs.ceph.widodh.nl/

 But while typing this message I see that my bot has died again... I'll
 restart it!
 
 Hi Wido,
 
 But this is archiving #ceph only right ? Or is it also archiving #ceph-devel ?
 

Ah, indeed, that is #ceph only.

I was planning on fixing this anyway, so I'll set up a new bot which
also archives #ceph-devel

 Cheers

 Cheers



 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: C++11

2014-09-29 Thread Wido den Hollander
On 29-09-14 17:24, Sage Weil wrote:
 On Mon, 29 Sep 2014, Milosz Tanski wrote:
 A second more general Ceph question is somewhat off-topic. What about
 C++11 use in the Ceph code base (like in this case)? It's not
 explicitly prohibited by the coding style document, but I imagine the
 goal is to build on as many systems as possible and quite a few
 supported distros have pretty old versions of GCC. I'm asking this
 because I imagine some of the performance work that's about to happen
 will want to use things like lockless queues, and then you get into
 C++11 memory model and std::atomic... etc.
 
 We are all very eager to move to C++11.  The challenge is that we still 
 need to build packages for the target distros.  That doesn't necessarily 
 mean that the compilers on those distros need to support c++11 as long as 
 the runtime does... if we can make the build environment sane.
 
 I'm really curious what other projects do here...
 

At the CloudStack project we recently switched from Java 6 to Java 7 and
we said that from version X we required Java 7 on the system.

For Ceph, what keeps us from saying that version H (after Giant)
requires a C++11 compliant compiler?

That might rule out Ubuntu 12.04 and CentOS6/RHEL6, but does that really
matter that much for something 'new' like Ceph?
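
For what it's worth, a quick way to probe what a given builder supports
(older gcc only understands -std=c++0x, not -std=c++11):

$ g++ --version | head -n1
$ echo 'int main() { auto x = 42; return x; }' | g++ -std=c++0x -x c++ - -o /dev/null && echo ok

gcc 4.4 on EL6 passes that, but it only covers part of C++11.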

 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL 6.5 shared library upgrade safety

2014-08-18 Thread Wido den Hollander

On 08/18/2014 01:57 PM, Loic Dachary wrote:

Hi Ceph,

In RHEL 6.5, is the following scenario possible :

a) an OSD dlopen a shared library for erasure-code,
b) the shared library file is replaced while the OSD is running,
c) the OSD starts using the new file instead of the old one.

It seems unlikely but it would explain a weird stack trace at 
http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)



Well, it could be that it does. I'm not 100% sure, but afaik it can
happen that parts of a library are not yet in memory when you replace the
file, and are then paged in from the new file later on.


See: 
http://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running
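
One quick way to check whether a running OSD is still mapped to the old
copy (this works when the file was replaced by rename, which is what
package managers normally do):

$ lsof -p <osd pid> | grep -i deleted

If the old .so shows up as (deleted) the process is still executing the
old code; if the file was overwritten in place, all bets are off.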



Cheers




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: any reason why multiple ranged-reads not currently supported in rados-java?

2014-08-16 Thread Wido den Hollander

On 08/15/2014 05:40 PM, Sage Weil wrote:

On Fri, 15 Aug 2014, Daniel Hsueh wrote:

Hello,

I'm looking at the Java librados JNA-based API, and note that ranged
reads (rados_create_read_op, rados_release_read_op,
rados_read_op_read, rados_read_op_operate) are not currently
accessible to a Java program.

Are there any difficulties in implementing access to these calls?  If
it is straightforward, I'll implement them myself and contribute back
to the github repository.


My guess is this was just an oversight.  A patch / pull request wiring
this up would be awesome.



Indeed. I was a bit lazy and implemented what I needed at the time of 
writing rados-java.


A pull request is welcome, or just create an issue in the tracker on 
http://tracker.ceph.com/


Thanks!


Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: object versioning

2014-08-14 Thread Wido den Hollander
 current version smaller than the
operation id.
The journal will be used for keeping order, and the entries in the
journal will serve as a blueprint that the gateways will need to
follow when applying changes. In order to ensure that operations that
needed to be complete were done, we'll mark the olh before going to
the bucket index, so that if the gateway died before completing the
operation, next time we try to access the object we'll know that we
need to go to the bucket index and complete the operation.

Things will then work like this:

* object creation

1. Create object instance
2. Mark olh that it's about to be modified
3. Update bucket index about new object instance
4. Read bucket index object op journal

Note that the journal should have at this point an entry that says
'point olh to specific object version, subject to olh is at version
X'.

5. Apply journal ops
6. Trim journal, unmark olh

* object removal (olh)

1. Mark olh that it's about to be modified
2. Update bucket index about the new deletion marker
3. Read bucket index object op journal

The journal entry should say something like 'mark olh as removed,
subject to olh is at version X'

4. Apply ops
5. Trim journal, unmark olh

Another option is to actually remove the olh, but in this case we'll
lose the olh versioning. We can in that case use the object
non-existent state as a check, but that will not be enough as there
are some corner cases where we could end up with the olh pointing at
the wrong object.

* object version removal

1. Mark olh as it will potentially be modified
2. Update bucket index about object instance removal
3. Read bucket index op journal
4. apply ops journal ...
Now the journal might just say something like 'remove object
instance', which means that the olh was pointing at a different object
version. The more interesting case is when the olh pointing at this
specific object version. In this case the journal will say something
like 'first point the olh at version V2, subject to olh is at version
X. Now, remove object instance V1'.

5. Trim journal, unmark olh


Note about olh marking: The olh mark will create an attr on the olh
that will have an id and a timestamp. There could be multiple marks on
the olh, and the marks should have some expiration, so that operations
that did not really start would be removed after a while.


Let me know if that makes sense, or if you have any questions.

Thanks,
Yehuda
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Non existing monitor

2014-08-01 Thread Wido den Hollander

On 07/30/2014 12:07 PM, Aanchal Agrawal wrote:

Hi,

We found a case(bug?) in ceph mon code, where in, an attempt to start a non-existing monitor is 
throwing up a levelDB error saying failed to create new leveldb store, instead we 
thought an appropriate message say No Monitor present with that id would do, by 
checking for the monitor existence way ahead.

It seems that 'mon_exists()' checks for the existence of the mon data 
directory(via 'mon_data_exists()') and also for the non-empty nature of that 
directory(via 'mon_data_empty()'). The fix seemed pretty simple, as to flag the 
appropriate message if 'mon_data_exists()' were to set 'exists' to 'false', in 
case mkfs is not set.

The other behavior that we are seeking clarity, again in case of mkfs not being set is, 
if 'mon_data_exists()' sets 'exists' to 'true' and 'mon_data_empty()' sets 'exists' to 
'false' (meaning the mon data directory is present, but it is empty), then the current 
code seems to be going ahead in an attempt to open the 'store.db', and when open fails, 
it tries to create a new 'store.db' (though mkfs is not set) and eventually gives up 
throwing unable to read magic from mon data.

The questions we had around this were:

1) Though in case of mkfs not being set, what is the reason for creating a new 
levelDB store in case an attempt to open the 'store.db' is a failure, as 
levelDB anyways seem to be throwing 'magic' error going forward. Are there any 
use cases for this scenario?


Separate from Ceph it seems like bad behavior of a daemon to start 
creating directories when --mkfs is not supplied.


It should spit out errors about not being able to access its data
store, but not start creating them without explicitly being told to do so.



2) And also, is it valid to flag No Monitor present in case the mon data 
directory is existing, but with no data('store.db') in it, in case mkfs is not set?

Thanks,
Aanchal



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Wido den Hollander

Hi,

Currently on Ubuntu with Upstart when you invoke a restart like this:

$ sudo restart ceph-osd-all

It will restart all OSDs at once, which can increase the load on the 
system quite a bit.


It's better to restart all OSDs by restarting them one by one:

$ sudo restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in 
/var/lib/ceph/osd and that's more manual work.
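
Something like this does the trick, but it is clunky (assuming the default
cluster name and data directory layout):

$ for id in $(ls /var/lib/ceph/osd/ | sed 's/^ceph-//'); do
    sudo restart ceph-osd id=$id
    sleep 180
  done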


I'm thinking of patching the init scripts to allow something like this:

$ sudo restart ceph-osd-all delay=180

It then waits 180 seconds between each OSD restart, making the process
even smoother.


I know there are currently sysvinit, upstart and systemd scripts, so it 
has to be implemented in various places, but how does the general idea
sound?


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adding a delay when restarting all OSDs on a host

2014-07-22 Thread Wido den Hollander

On 07/22/2014 03:48 PM, Andrey Korolyov wrote:

On Tue, Jul 22, 2014 at 5:19 PM, Wido den Hollander w...@42on.com wrote:

Hi,

Currently on Ubuntu with Upstart when you invoke a restart like this:

$ sudo restart ceph-osd-all

It will restart all OSDs at once, which can increase the load on the system
 quite a bit.

It's better to restart all OSDs by restarting them one by one:

 $ sudo restart ceph-osd id=X

But you then have to figure out all the IDs by doing a find in
/var/lib/ceph/osd and that's more manual work.

I'm thinking of patching the init scripts which allows something like this:

$ sudo restart ceph-osd-all delay=180

 It then waits 180 seconds between each OSD restart, making the process even
smoother.

I know there are currently sysvinit, upstart and systemd scripts, so it has
 to be implemented in various places, but how does the general idea sound?

--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--



Hi,

 this behaviour obviously has a negative side of increased overall
 peering time and a larger integral value of out-of-SLA delays. I'd vote
 for warming up necessary files, most likely collections, just before
restart. If there are no enough room to hold all of them at once, we
can probably combine both methods to achieve lower impact value on
restart, although adding a simple delay sounds much more straight than
putting file cache to ram.



In the case I'm talking about there are 23 OSDs running on a single 
machine, and restarting all the OSDs causes a lot of peering and reading 
of PG logs.


A warm-up mechanism might work, but that would be a lot of work.

When upgrading your cluster you simply want to do this:

$ dsh -g ceph-osd sudo restart ceph-osd-all delay=180

That might take hours to complete, but if it's just an upgrade that 
doesn't matter. You want as little impact on the service as possible.


--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] EU mirror now supports rsync

2014-07-17 Thread Wido den Hollander

On 07/16/2014 09:48 PM, David Moreau Simard wrote:

Hi,

Thanks for making this available.
I am currently synchronizing off of it and will make it available on our 4 Gbps 
mirror on the Canadian east coast by the end of this week.



Cool! More mirrors is always better.


Are you able to share how you are synchronizing from the Ceph repositories ?
It would probably be better for us to synchronize from the source rather than 
the Europe mirror.



I have an SSH account on the ceph.com server and use that to sync the 
packages. I've set this up with the community guys Ross and Patrick, you 
might want to ping them.


There is no official distribution mechanism for this right now. I simply 
set up an rsyncd to provide this to the community.


Wido


--
David Moreau Simard

On Apr 9, 2014, at 2:04 AM, Wido den Hollander w...@42on.com wrote:


Hi,

I just enabled rsync on the eu.ceph.com mirror.

eu.ceph.com mirrors from Ceph.com every 3 hours.

Feel free to rsync all the contents to your local environment; it might be useful
for some large deployments where you want to save external bandwidth by not
having each machine fetch the Deb/RPM packages from the internet.

Rsync is available over IPv4 and IPv6, simply sync with this command:
$ mkdir cephmirror
$ rsync -avr --stats --progress eu.ceph.com::ceph cephmirror

I ask you all to be gentle. It's a free service, so don't start hammering the
server by setting your Cron to sync every 5 minutes. Once every couple of hours
should be sufficient.

Also, please don't all start syncing at the first minute of the hour. When
setting up the Cron, select a random minute from the hour. This way the load on
the system can be spread out.
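
For example, a crontab entry like this (just a sketch; pick your own random
minute and target directory):

37 */6 * * * rsync -avr --stats eu.ceph.com::ceph /srv/mirror/ceph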

Should you have any questions or issues, let me know!

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Why does crushtool send all it's output to stderr?

2014-07-17 Thread Wido den Hollander

Hi,

When running tests on a crushmap with crushtool I noticed that all 
output is sent to stderr, for example:


$ crushtool -i cm --test --rule 4 --num-rep 3 --show-statistics

All the output goes to stderr while I think stdout would be a better place.
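
Right now you have to redirect stderr just to page or grep through the
results, for example:

$ crushtool -i cm --test --rule 4 --num-rep 3 --show-statistics 2>&1 | less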

Any good reason for this?

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What the ceph do when detecting one OSD overloaded?

2014-06-06 Thread Wido den Hollander

On 06/06/2014 10:14 AM, zou wonder wrote:

Hi buddies,

   I am doing an investigation of Ceph and Swift, and I am a newbie to Ceph.

I am unclear about the behaviour of Ceph when there is an overload situation.
According to the docs, when Ceph does CRUSH, if it finds that an OSD is
overloaded, it will skip it and select another OSD. That means that if the OSD
does not get overloaded, the object should be put on this OSD. So what happens
to the objects already put on this OSD before it got overloaded?
Can we still read them? When doing CRUSH, will the OSD be skipped?



CRUSH will not take any performance characteristics into account. If an 
OSD is 100% utilized it will still be selected by CRUSH.


Keep in mind, however, that Block Devices are striped in 4MB chunks and 
the same happens for objects stored via the RADOS Gateway.
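
For reference: 4MB objects correspond to an object order of 22 (2^22 bytes),
which is the default; for example (pool and image name are just examples):

$ rbd create --size 10240 --order 22 rbd/myimage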



If my understanding is wrong, please correct me.

Best Regards,
zou
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What the ceph do when detecting one OSD overloaded?

2014-06-06 Thread Wido den Hollander

On 06/06/2014 11:27 AM, zou wonder wrote:

Hi Wido:

Thanks for your kindness. I checked the code just now, and it seems there
is no overload-related logic in the CRUSH code. That is a little bit
inconsistent with the CRUSH paper.
  So if a 100% utilized OSD is returned, will the objects still be
written to the underlying storage device?



Yes. If the OSD is up/in it will be selected by CRUSH and data will be 
read from it and written to it.


Again, having one OSD at 100% utilization and the rest at 40% is not 
something you'll see very often since you stripe data over objects.



How about the device failure case? Will all the objects on the failed
device be replicated to the good ones? Once the device is
recovered, will the data be replicated back?


When the OSD fails, recovery will kick in after 5 minutes and the data 
will find a new location.
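
Those 5 minutes are the default 'mon osd down out interval'; it can be tuned
in ceph.conf if needed, for example:

[mon]
mon osd down out interval = 300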


If the OSD comes back, the data goes back to that OSD.



Best Regards
Zou

On Fri, Jun 6, 2014 at 4:27 PM, Wido den Hollander w...@42on.com wrote:

On 06/06/2014 10:14 AM, zou wonder wrote:


Hi buddies,

I am doing an investigation of Ceph and Swift, and I am a newbie to Ceph.

I am unclear about the behaviour of Ceph when there is an overload
situation.
According to the docs, when Ceph does CRUSH, if it finds that an OSD is
overloaded, it will skip it and select another OSD. That means that if the OSD
does not get overloaded, the object should be put on this OSD. So what happens
to the objects already put on this OSD before it got overloaded?
Can we still read them? When doing CRUSH, will the OSD be skipped?



CRUSH will not take any performance characteristics into account. If an OSD
is 100% utilized it will still be selected by CRUSH.

Keep in mind, however, that Block Devices are striped in 4MB chunks and the
same happens for objects stored via the RADOS Gateway.


If my understanding is wrong, please correct me.

Best Regards,
zou
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on



--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose

2014-06-05 Thread Wido den Hollander

On 06/05/2014 09:01 AM, Haomai Wang wrote:

Hi,
Previously I sent a mail about the difficulty of rbd snapshot size
statistics. The main solution is to use an object map to store the changes.
The problem is that we can't handle concurrent modifications by multiple clients.

The lack of an object map (like the pointer map in qcow2) causes many problems
in librbd, such as clone depth: a deep clone chain will cause remarkable
latency. Usually each clone layer roughly doubles the latency.

I am considering a tradeoff between multi-client support and
single-client support for librbd. In practice, most of the
volumes/images are used by a VM, so only one client will
access/modify the image. We shouldn't make shared images possible
at the cost of making most use cases worse. So we can add a new flag called
'shared' when creating an image. If 'shared' is false, librbd will
maintain an object map for each image. The object map is meant to be
durable: each image_close call will store the map in RADOS. If the
client crashes and fails to dump the object map, the next client that
opens the image will consider the object map out of date and reset it.


Why not flush out the object map every X period? Assume a client runs 
for weeks or months; you would keep that map in memory all the time 
since the image is never closed.




The advantages of this feature are easy to see:
1. Avoid the clone performance problem
2. Make snapshot statistics possible
3. Improve librbd operation performance, including read and copy-on-write operations.

What do you think of the above? More feedback is appreciated!




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions of KeyValueStore (leveldb) backend

2014-05-26 Thread Wido den Hollander

On 05/26/2014 06:55 AM, Haomai Wang wrote:

On Mon, May 26, 2014 at 9:46 AM, Guang Yang yguan...@outlook.com wrote:

Hello Haomai,
We are evaluating the key-value store backend which comes along with the Firefly 
release (thanks for implementing it in Ceph); it is very promising for a couple 
of our use cases. After going through the related code changes, I have a couple 
of questions which need your help:
   1. One observation is that an object larger than 1KB will be striped 
into multiple chunks (k-v pairs in the leveldb table), with each strip being 1KB in 
size. Is there any particular reason we chose 1KB as the strip size (and I didn't 
find a configuration option to tune this value)?


1KB is not a deliberately chosen value; it will become configurable in the near future.



So that is currently hardcoded? I can't find any reference to it in 
config_opts.h




   2. This is probably a leveldb question: do we expect performance degradation 
as the leveldb instance keeps growing (e.g. to several TB)?


A Ceph OSD is expected to own a physical disk, normally several
TB (1-4TB). LevelDB can handle that easily, especially since we use it to
store large values (compared to normal application usage).



With a large value do you mean something like 4MB? The regular stripe size 
for RBD, CephFS and such?




Thanks,
Guang







--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feedback: default FS pools and newfs behavior

2014-05-21 Thread Wido den Hollander

On 05/21/2014 05:32 PM, John Spray wrote:

In response to #8010[1], I'm looking at making it possible to
explicitly disable CephFS, so that the (often unused) filesystem pools
don't hang around if they're unwanted.

The administrative behavior would change such that:
  * To enable the filesystem it is necessary to create two pools and
use ceph newfs <metadata> <data> (see the sketch after this list)
  * There's a new ceph rmfs command to disable the filesystem and
allow removing its pools
  * Initially, the filesystem is disabled and the 'data' and 'metadata'
pools are not created by default
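
For concreteness, a sketch of the proposed flow (pool names, pg counts and the
exact newfs syntax are only illustrative):

$ ceph osd pool create cephfs_metadata 64
$ ceph osd pool create cephfs_data 256
$ ceph newfs cephfs_metadata cephfs_data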

There's an initial cut of this on a branch:
https://github.com/ceph/ceph/commits/wip-nullfs

Questions:
  * Are there strong opinions about whether the CephFS pools should
exist by default?  I think it makes life simpler if they don't,
avoiding "what the heck is the 'data' pool?" type questions from
newcomers.


+1

The simpler the clusters are, the better imho. A lot of users don't 
require CephFS, so don't enable what ain't used.



  * Is it too unfriendly to require users to explicitly create pools
before running newfs, or do we need to auto-create pools when they run
newfs?  Auto-creating some pools from newfs is a bit awkward
internally because it requires modifying both OSD and MDS maps in one
command.

Cheers,
John

1. http://tracker.ceph.com/issues/8010
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Set ms_bind_ipv6 to true when IPv6 is used

2014-05-09 Thread Wido den Hollander

Signed-off-by: Wido den Hollander w...@widodh.nl
---
 ceph_deploy/new.py |6 ++
 1 file changed, 6 insertions(+)

diff --git a/ceph_deploy/new.py b/ceph_deploy/new.py
index fc4c5f4..e573128 100644
--- a/ceph_deploy/new.py
+++ b/ceph_deploy/new.py
@@ -91,9 +91,11 @@ def new(args):
         ip = net.get_nonlocal_ip(host)
         LOG.debug('Monitor %s at %s', name, ip)
         mon_initial_members.append(name)
+        ms_ipv6 = False
         try:
             socket.inet_pton(socket.AF_INET6, ip)
             mon_host.append("[" + ip + "]")
+            ms_ipv6 = True
         except socket.error:
             mon_host.append(ip)
 
@@ -106,6 +108,10 @@ def new(args):
     cfg.set('global', 'mon initial members', ', '.join(mon_initial_members))
     # no spaces here, see http://tracker.newdream.net/issues/3145
     cfg.set('global', 'mon host', ','.join(mon_host))
+
+    if ms_ipv6 == True:
+        LOG.debug('Monitors are IPv6, binding Messenger traffic on IPv6')
+        cfg.set('global', 'ms bind ipv6', 'true')
 
     # override undesirable defaults, needed until bobtail
 
-- 
1.7.9.5
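
For reference, with this change a ceph.conf generated for IPv6 monitors would
end up looking something like this (hostnames and addresses are made up):

[global]
mon initial members = mon1, mon2, mon3
mon host = [2001:db8::10],[2001:db8::11],[2001:db8::12]
ms bind ipv6 = true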

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Write correct 'mon_host' with IPv6 addresses

2014-05-08 Thread Wido den Hollander
When the IP-address from a monitor is a IPv6 address, encapsulate
it with [ and ]

Fixes: #8309

Signed-off-by: Wido den Hollander w...@widodh.nl
---
 ceph_deploy/new.py |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/ceph_deploy/new.py b/ceph_deploy/new.py
index 3680821..fc4c5f4 100644
--- a/ceph_deploy/new.py
+++ b/ceph_deploy/new.py
@@ -5,6 +5,7 @@ import uuid
 import struct
 import time
 import base64
+import socket
 
 from ceph_deploy.cliutil import priority
 from ceph_deploy import conf, hosts, exc
@@ -90,7 +91,12 @@ def new(args):
         ip = net.get_nonlocal_ip(host)
         LOG.debug('Monitor %s at %s', name, ip)
         mon_initial_members.append(name)
-        mon_host.append(ip)
+        try:
+            socket.inet_pton(socket.AF_INET6, ip)
+            mon_host.append("[" + ip + "]")
+        except socket.error:
+            mon_host.append(ip)
+
         if args.ssh_copykey:
             ssh_copy_keys(host, args.username)
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What should the Ceph Foundation be about ?

2014-05-01 Thread Wido den Hollander

On 05/01/2014 04:35 PM, Loic Dachary wrote:

Hi Ceph co-developers,

I wrote down a few ideas about what I'd like the Ceph Foundation to provide, 
from a developer perspective :

https://wiki.ceph.com/Development/Foundation

This is no more than a wishlist, feel free to add whatever you have in mind ;-)


I added two lines which I think are important:
- Trademark ownership
- Logo ownership (implied by the first one)

I fully understand the people at Inktank are currently very busy with 
arranging all kinds of things, so I'm expecting a bit of lag on this topic.




Cheers




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: Kernel-client crash when adding OSDs

2014-04-30 Thread Wido den Hollander

On 04/30/2014 04:16 PM, Thorwald Lundqvist wrote:

Hi!

A few days ago I added a bunch of new OSDs to our production servers.
We have 3 hosts that map and unmap hundreds of RBD devices every
day, with approx 100 RBD devs mapped on each host at any given time.

Adding OSDs is usually quite smooth if you do it the right way. I
usually follow the manual adding/removing of an OSD as explained in the
documentation, except when it comes to the 'crush add' step: I prefer
decompiling and compiling my own crush map.



So I proceeded as usual, prepared the OSD disks, keyring and so on,
and then I started up 4 new ceph-osd daemons (osd.{9,10,11,12}) on the new
OSD host. BAM; a host that had a bunch of RBD devs mapped crashed and
rebooted with this in the log: http://pastebin.com/YKJSdWLv



That procedure seems right, I've done that multiple times and it all 
worked fine. That was with librbd and RGW though, so I'm not sure if 
this was a kernel issue.



I realise that this is not much to go on since there's no stack trace
or anything, but if anyone can help me reproduce this, I'd be
grateful. And if anyone has had the same issue, I would really like to hear
from them too.

I'm running Linux 3.14.1 and ceph 0.72.2.

Thank you for your time,
Thorwald Lundqvist.
Jumpstarter AB
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Switch to RBD format 2 by default after Firefly

2014-04-08 Thread Wido den Hollander

Hi,

With Ubuntu 14.04 now shipping with the 3.13 kernel, it fully supports 
format 2 in both librbd and krbd.


For RHEL 7 there will be an extra yum repo with a module which supports 
format 2, so shouldn't we change 'rbd_default_format' to 2 after 
Firefly?


I still see a lot of images being created with format 1 and that will 
otherwise be haunting us for a long time.


I'd vote for making format 2 the default, or at least patching the 'rbd' 
tool to create format 2 images.
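
In the meantime users can already opt in themselves, either per image or via
ceph.conf; a minimal example:

$ rbd create --image-format 2 --size 10240 rbd/myimage

[client]
rbd default format = 2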


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Switch to RBD format 2 by default after Firefly

2014-04-08 Thread Wido den Hollander

On 04/08/2014 02:56 PM, Ilya Dryomov wrote:

On Tue, Apr 8, 2014 at 4:47 PM, Wido den Hollander w...@42on.com wrote:

Hi,

With Ubuntu 14.04 now shipping with the 3.13 kernel, it fully supports format
2 in both librbd and krbd.

For RHEL 7 there will be an extra yum repo with a module which supports
format 2, so shouldn't we change 'rbd_default_format' to 2 after Firefly?

I still see a lot of images being created with format 1 and that will
otherwise be haunting us for a long time.

I'd vote for making format 2 the default, or at least patching the 'rbd' tool to
create format 2 images.


Fancy format 2 striping is not supported by krbd.  We've had a couple
people who confused format 2 support with the ability to set their own
stripe settings, so we'll need to be more vocal about it.


Ah, that's something I've missed. Never used that in combination with 
krbd. So yes, we need to be more vocal about that.


Maybe even set a warning in the 'rbd' cli tool? Or at least in the 
help/usage.
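
For reference, this is the kind of image that krbd can not map, i.e. anything
with non-default striping (names and sizes are just examples):

$ rbd create --image-format 2 --size 10240 --stripe-unit 65536 --stripe-count 16 rbd/striped-image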




Thanks,

 Ilya




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [librbd] Add interface of get the snapshot size?

2014-03-24 Thread Wido den Hollander

On 03/24/2014 02:30 PM, Haomai Wang wrote:

Hi all,

As we know, a snapshot is a lightweight resource in librbd and we
don't have any statistics about it. But it causes some
problems for cloud management.

We can't measure the size of a snapshot; different snapshots will occupy
different amounts of space. So we don't have a way to estimate the resource
usage of a user.

Maybe we can have a counter to record space usage when the volume is created.


What do you mean by space usage? Cluster-wide or pool usage?


When creating a snapshot, the counter is frozen and stored as the size of
the snapshot. A new counter, set to zero, is assigned to the volume.

Any feedback is appreciated!




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph-brag ready

2014-03-04 Thread Wido den Hollander

On 03/04/2014 12:35 AM, Sage Weil wrote:

Hi Sebastien,

Good seeing you last week in Frankfurt!

I meant to follow up earlier (or hack on this a bit myself) but haven't
had time.  After taking a look, my wish list here would be:

- simplify the 'bytes' info to just be bytes.

- maybe prefix these all with 'num_'

- for crush_types, make it a map of type to count, so we can tell how many
racks/rows/hosts/etc there are.

- i wouldn't include the pool names in the pool metadata; that is probably
too sensitive.



Ack, just the IDs are fine. Pool names can indeed have too much 
information in them.



- but, we can include 'type' (replicated|erasure) and change 'rep_size' to
'size' (which is a more general name)



Why not also include the number of PGs per pool? Now we only see the number of 
PGs in total, but we probably want to map this to the pools as well.



- for sysinfo, i would remove nw_info entirely?  not sure what this would
be for, but generally if there is any identifying info people will not
want to use this

- for all the other metadata, i wonder if it would be better to break it
down into a histogram (like the crush types) with a value and count, so
that we just see how many of each version/distro/kernel/os/arch/cpu/etc
are running.  unless people think it would be useful to see how they
correlate?

In general, it seems like the more compact the info is, the easier and
more likely it is for a person to look at it, see it is safe to share, and
then do so.



I would also find out whether a cluster is running on IPv4 or IPv6. It would be 
interesting to see which it is using.
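
To make it concrete, a rough sketch of what such a report could look like with
these suggestions applied (all field names and values below are just made up
for illustration):

{
  "num_osds": 24,
  "num_bytes": 48003175350272,
  "crush_types": { "root": 1, "rack": 2, "host": 4, "osd": 24 },
  "pools": [ { "id": 0, "type": "replicated", "size": 3, "pg_num": 1024 } ],
  "ip_version": 6
}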



Thanks!
sage




On Sun, 16 Feb 2014, Sebastien Han wrote:


Sorry for the late response Sage..
As discussed on IRC, LGPL is fine.

Thanks for taking care of this :)

Cheers.

Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address : 10, rue de la Victoire - 75009 Paris
Web : www.enovance.com - Twitter : @enovance

On 15 Feb 2014, at 18:43, Sage Weil s...@inktank.com wrote:


On Wed, 12 Feb 2014, Sebastien Han wrote:

Hi guys,

First implementation of the ceph-brag is ready.
We have a public repo available here, so can try it out.

https://github.com/enovance/ceph-brag

However I don’t have any idea on how to submit this to github.com/ceph.
Can someone help me on that?


I'm merging this into ceph.git now (src/brag/{client, server}).  I don't
see Signed-off-by lines on the commits, though, or an indication of what
the license is.  Can you confirm whether this should be LGPL or MIT or
whatever?

Thanks!
sage





--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Erasure Code user guide

2014-03-03 Thread Wido den Hollander

On 03/03/2014 02:20 PM, Loic Dachary wrote:

Hi John,

I'd like to draft a user guide for erasure coded pools, oriented toward system 
administrators. Where do you advise me to insert this ? Should I get 
inspiration from an example ?



Isn't the docs dir good for that? If not, the Wiki might be a 
proper location?



Thanks in advance for any pointers or advice :-)




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Erasure Code user guide

2014-03-03 Thread Wido den Hollander

On 03/03/2014 04:35 PM, Loic Dachary wrote:



On 03/03/2014 15:47, Wido den Hollander wrote:

On 03/03/2014 02:20 PM, Loic Dachary wrote:

Hi John,

I'd like to draft a user guide for erasure coded pools, oriented toward system 
administrators. Where do you advise me to insert this ? Should I get 
inspiration from an example ?



Isn't the docs dir good for that? If not, the Wiki might be a proper 
location?


Are you suggesting something like doc/dev/erasure-coded-pool.rst next to 
doc/dev/cache-pool.rst ? I'm not sure about the wiki.



Yes, why not? That way all the docs are in a central place.


Cheers




Thanks in advance for any pointers or advice :-)









--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Libvirt patches for Ubuntu 14.04

2014-02-25 Thread Wido den Hollander

Hi James, Ceph-dev,

Some time ago you manually pushed a patch for libvirt to the 14.04 
repositories, but this patch [0] has now made it upstream and will be in 
libvirt 1.2.2.


It's a patch for creating RBD images with format 2 by default.

Another patch [1] of mine just got accepted upstream; it sets timeout 
options for librados to prevent libvirt from locking up when something 
happens to the network or the Ceph cluster, where the last one is 
obviously impossible due to the design of Ceph ;-)


Looking at the release schedule [2] for Ubuntu 14.04 I see that there is 
a Beta 1 Freeze planned for Feb 27th and currently libvirt 1.2.1 is in 
the 14.04 repo [3].


Looking at the Ubuntu patches [4] I see that the RBD format 2 patch has 
been backported.


What's the chance of libvirt 1.2.2 making it into 14.04? And is it 
possible that you (also) backport the second patch into 14.04?


[0]: 
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=0227889ab0dfbbf16ba0e146800d9efcb631e281
[1]: 
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=60f70542f97805af49d656a582be35445046a0c9

[2]: https://wiki.ubuntu.com/TrustyTahr/ReleaseSchedule
[3]: http://packages.ubuntu.com/trusty/libvirt-bin
[4]: 
http://archive.ubuntu.com/ubuntu/pool/main/libv/libvirt/libvirt_1.2.1-0ubuntu9.debian.tar.gz


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disabling updatedb

2014-02-18 Thread Wido den Hollander

On 02/18/2014 05:47 PM, Sage Weil wrote:

Dan at CERN noticed that his performance was tanking because updatedb was
running against /var/lib/ceph.  updatedb has a PRUNE line in
/etc/updatedb.conf that we should presumably be adding ourselves to.  One
user pointed out a package that uses sed to rewrite this line in the init
script on start.

I have two questions:

- is there no better way than sed to add ourselves to this list?
- should we do it in the init script, or postinst, or both?



Well, I don't have a clue. postinst seems very dirty to me. It's a 
bummer that updatedb doesn't support a .d directory for configuration.
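
For the record, the sed approach would look roughly like this (just a sketch;
it assumes PRUNEPATHS is a quoted list and guards against adding the path twice):

grep -q '/var/lib/ceph' /etc/updatedb.conf || \
    sed -i 's|^PRUNEPATHS="|PRUNEPATHS="/var/lib/ceph |' /etc/updatedb.conf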



Presumably this is a problem others have solved with other packages.

http://tracker.ceph.com/issues/7451



Not completely the same, but I filed something similar at Ubuntu 
yesterday: https://bugs.launchpad.net/ubuntu/+source/mlocate/+bug/1281074


It's to prevent Ceph clients from indexing CephFS or Ceph Fuse. 
Hopefully that makes it into 14.04.



sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Libvirt patch for Ubuntu 14.04 to create image format 2 by default

2014-02-11 Thread Wido den Hollander

On 01/17/2014 02:39 PM, Wido den Hollander wrote:

Hi James (CC to devel),

I've been trying to get a patch [0] into libvirt for the last couple of
months to have libvirt create RBD images with format 2 by default.

The response from the RedHat people has been zero thus far. Since
this would be great for both OpenStack and CloudStack I'm thinking about
Ubuntu 14.04.



So, I've been trying to poke people at RedHat over and over for the last 
couple of weeks, but no results so far...


I got a response that the patch didn't apply to master anymore, so I 
came up with a new version [0] of the patch, but afterwards there has 
been radio silence again.


I'm working on a couple of extra patches for libvirt, but I want the 
first patch in there asap since that's the most important one.


If there is somebody with contacts inside RedHat who could help, please 
point them to that patch. I would very much appreciate that!


[0]: https://www.redhat.com/archives/libvir-list/2014-January/msg01522.html


If the patch makes it into Libvirt in let's say 6 weeks it will probably
not make it into 14.04 since that has been feature frozen.

Is there a way we can get this patch into Libvirt for Ubuntu 14.04? I'm
confident it will make it into Libvirt upstream, I just need somebody to
look at it.

[0]:
https://www.redhat.com/archives/libvir-list/2013-December/msg00645.html




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

