Re: Ceph version 0.56.1, data loss on power failure

2013-01-16 Thread Jeff Mitchell
FWIW, my ceph data dirs (e.g. for mons) are all on XFS. I've
experienced a lot of corruption on these on power loss to the node --
and in some cases even when power wasn't lost and the box was simply
rebooted. This is on Ubuntu 12.04 with the Ceph-provided 3.6.3 kernel
(as I'm using RBD on these).

It's pretty much to the point where I'm thinking of switching them all
over to ext4 for these data dirs, as the hassle of constantly rebuilding
mons just isn't worth it.
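
If it helps anyone compare notes, here's roughly how I'd check whether an
explicit flush is what makes the difference (as it did in the thread quoted
below) -- the image and mount names are just examples, not my actual setup:

    # Hypothetical reproduction sketch: write with and without an explicit
    # flush, then cut power to the node and compare after reboot.
    rbd map rbd/testimg                  # shows up as e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/test

    # Case 1: no flush -- data may still live only in the page cache / XFS log
    dd if=/dev/zero of=/mnt/test/unsynced bs=1M count=256

    # Case 2: flush before the power cut
    dd if=/dev/zero of=/mnt/test/synced bs=1M count=256 conv=fsync
    sync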

--Jeff

On Wed, Jan 16, 2013 at 9:32 AM, Marcin Szukala
 wrote:
> 2013/1/16 Yann Dupont :
>> On 16/01/2013 11:53, Wido den Hollander wrote:
>>
>>>
>>>
>>> On 01/16/2013 11:50 AM, Marcin Szukala wrote:
>>
>>
 Hi all,

 Any ideas how can I resolve my issue? Or where the problem is?

 Let me describe the issue.
 Host boots up and maps RBD image with XFS filesystems
 Host mounts the filesystems from the RBD image
 Host starts to write data to the mounted filesystems
 Host experiences power failure
>>
>> You are not doing a sync there, right?
>
> Nope, no sync.
>>
>>
 Host comes up and map the RBD image
 Host mounts the filesystems from the RBD image
 All data from all filesystems is lost
 Host is able to use the filesystems with no problems.

 Filesystem is XFS, no errors on filesystem,
>>
>>
>> you MAY have hit an XFS issue.
>>
>> Please follow XFS list, in particular this thread :
>> http://oss.sgi.com/pipermail/xfs/2012-December/023021.html
>>
>> If I remember correctly, this one appeared after the 3.4 kernel, and I think
>> the fix isn't in the current Ubuntu kernel.
>
> It looks like it; with ext4 I have no issue. Also, if I do a sync, the
> data is not lost.
>
> Thank You All for help.
>
> Regards,
> Marcin


Re: Understanding Ceph

2013-01-19 Thread Jeff Mitchell

Sage Weil wrote:

On Sun, 20 Jan 2013, Peter Smith wrote:

Thanks for the reply, Sage and everyone.

Sage, so I can expect Ceph RBD to work well on CentOS 6.3 if I only use
it as the Cinder volume backend, because the librbd in QEMU doesn't
make use of the kernel client, right?


Then the dependency is on the qemu version.  I don't remember that off the
top of my head, or know what version rhel6 ships.  Most people deploying
openstack and rbd are using a more modern distro (like ubuntu 12.04).


This discussion has made me curious: I'm using Ganeti to manage VMs, 
which manages the storage using the kernel client and passes the 
resulting block device to qemu.


Can you comment on any known performance differences between the two 
methods -- native qemu+librbd creating a block device vs. the kernel 
client creating a block device?
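
For concreteness, here's a rough sketch of the two paths I mean (image names
and qemu flags are illustrative, not taken from my Ganeti config):

    # 1) Kernel client: map the image on the host, hand qemu a block device
    rbd map rbd/vm-disk                      # e.g. /dev/rbd0
    qemu-system-x86_64 -drive file=/dev/rbd0,if=virtio,cache=none        # plus the usual VM options

    # 2) qemu built against librbd: qemu talks to the cluster directly
    qemu-system-x86_64 -drive file=rbd:rbd/vm-disk,if=virtio,cache=none  # no /dev/rbd* involved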


Thanks,
Jeff


Re: questions on networks and hardware

2013-01-20 Thread Jeff Mitchell

Wido den Hollander wrote:

No, not really. 1Gbit should be more than enough for your monitors. 3
monitors should also be good. No need to go for 5 or 7.


I have 5 monitors across 16 different OSD-hosting machines... is that 
going to *harm* anything?


(I have had issues in my cluster when doing upgrades where my monitor 
count fell to 3, so having the extras felt nice at that point.)


Thanks,
Jeff


Re: questions on networks and hardware

2013-01-22 Thread Jeff Mitchell

Wido den Hollander wrote:

One thing is still having multiple Varnish caches and object banning. I
proposed something for this some time ago: some "hook" in RGW you could
use to inform an upstream cache to "purge" something from its cache.


Hopefully not Varnish-specific; something like the Last-Modified header 
would be good.


Also there are tricks you can do with queries; see for instance 
http://forum.nginx.org/read.php?2,1047,1052
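
A rough sketch of both ideas (hostnames and paths are examples):

    # Revalidate against Last-Modified instead of a cache-specific purge protocol
    curl -sI -H 'If-Modified-Since: Tue, 22 Jan 2013 12:00:00 GMT' \
         http://cache.example.com/bucket/object

    # Query-string trick: if the cache treats a designated argument as a
    # bypass/refresh signal, requesting with it set fetches a fresh copy
    curl -sI 'http://cache.example.com/bucket/object?nocache=1'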


--Jeff


Re: RGW object purging in upstream caches

2013-01-22 Thread Jeff Mitchell

Wido den Hollander wrote:

Now, running just one Varnish instance which does load balancing
over multiple RGW instances is not a real problem. When it sees a PUT
operation it can "purge" (called banning in Varnish) the object from
its cache.

When looking at the scenario where you have multiple caches, you run into
the cache-consistency problem. If an object is modified, the caches are
not notified and will continue to serve an outdated object.

Looking at the Last-Modified header is not an option, since the cache
will not contact RGW when serving out of its cache.

To handle this there has to be some kind of "hook" inside RGW that can
notify Varnish (or some other cache) when an object changes.


For nginx, it appears there is a well-tested production module that does 
this: http://labs.frickle.com/nginx_ngx_cache_purge/ (see the examples 
at the end of the README)
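
A minimal sketch of what such a hook could send, assuming a purge location is
configured roughly as in that README (the hostname and URL layout here are
examples):

    # nginx side (roughly, from the module's README):
    #   location ~ /purge(/.*) { allow <rgw hosts>; proxy_cache_purge my_cache "$1$is_args$args"; }
    # RGW hook side: one HTTP request per upstream cache after a PUT/DELETE
    curl -s http://cache.example.com/purge/bucket/object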


--Jeff



Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell

Mark Nelson wrote:

It may (or may not) help to use a power-of-2 number of PGs. It's
generally a good idea to do this anyway, so if you haven't set up your
production cluster yet, you may want to play around with this. Basically,
just take whatever number you were planning on using and round it up (or
down slightly). I.e., if you were going to use 7,000 PGs, round up to 8192.


As I was asking about earlier on IRC, I'm in a situation where the docs 
did not mention this in the section about calculating PGs, so I have a 
non-power-of-2 count -- and since there are some production things running 
on that pool, I can't currently change it.


If indeed that makes a difference, here's one vote for a resilvering 
mechanism   :-)


Alternatively, if I stand up a second pool, is there any easy way to 
(offline) migrate an RBD from one to the other? (Knowing that this means 
I'd have to update anything using it afterward.) The only thing I know of 
right now is to make a second RBD, map both to a client, and dd.


Thanks,
Jeff


Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell

Stefan Priebe wrote:

Hi,
Am 22.01.2013 22:26, schrieb Jeff Mitchell:

Mark Nelson wrote:

It may (or may not) help to use a power-of-2 number of PGs. It's
generally a good idea to do this anyway, so if you haven't set up your
production cluster yet, you may want to play around with this. Basically,
just take whatever number you were planning on using and round it up (or
down slightly). I.e., if you were going to use 7,000 PGs, round up to 8192.


As I was asking about earlier on IRC, I'm in a situation where the docs
did not mention this in the section about calculating PGs so I have a
non-power-of-2 -- and since there are some production things running on
that pool I can't currently change it.


Oh, same thing here -- did I miss the doc, or can someone point me to the
location?


Here you go: http://ceph.com/docs/master/rados/operations/placement-groups/

(Notice the lack of any power-of-2 mention  :-)  )

--Jeff


Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell

Mark Nelson wrote:

On 01/22/2013 03:50 PM, Stefan Priebe wrote:

Hi,
Am 22.01.2013 22:26, schrieb Jeff Mitchell:

Mark Nelson wrote:

It may (or may not) help to use a power-of-2 number of PGs. It's
generally a good idea to do this anyway, so if you haven't set up your
production cluster yet, you may want to play around with this. Basically,
just take whatever number you were planning on using and round it up (or
down slightly). I.e., if you were going to use 7,000 PGs, round up to 8192.


As I was asking about earlier on IRC, I'm in a situation where the docs
did not mention this in the section about calculating PGs so I have a
non-power-of-2 -- and since there are some production things running on
that pool I can't currently change it.


Oh, same thing here -- did I miss the doc, or can someone point me to the
location?

Is there a chance to change the number of PGs for a pool?

Greets,
Stefan


Honestly I don't know if it will actually have a significant effect. 
ceph_stable_mod will map things optimally when pg_num is a power of 2, 
but that's only part of how things work. It may not matter very much 
with high PG counts.


Yeah, that's why I said *if* it matters -- once someone runs suitable 
benchmarks, please provide a resilvering mechanism   :-)
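
To make that concrete, here's a small sketch of the skew -- my own
illustration of the behavior described, not Ceph source, and the mask values
are what I believe Ceph derives for those pg_nums:

    demo() {
      local b=$1 bmask=$2 pg min=999999 max=0
      declare -A hits=()
      # ceph_stable_mod-style mapping: keep (x & bmask) when it's below pg_num,
      # otherwise fall back to (x & (bmask >> 1))
      for ((x = 0; x < 8192; x++)); do
        if (( (x & bmask) < b )); then
          pg=$(( x & bmask ))
        else
          pg=$(( x & (bmask >> 1) ))
        fi
        hits[$pg]=$(( ${hits[$pg]:-0} + 1 ))
      done
      for c in "${hits[@]}"; do
        (( c < min )) && min=$c
        (( c > max )) && max=$c
      done
      echo "pg_num=$b: hash values per PG range from $min to $max"
    }
    demo 4096 4095   # power of 2: every PG gets the same share
    demo 7000 8191   # 7000 PGs: some PGs get double the share of others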

I'd be interested in figuring out the right way to migrate an RBD from one pool 
to another regardless.

--Jeff


Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell
On Tue, Jan 22, 2013 at 7:25 PM, Josh Durgin  wrote:
> On 01/22/2013 01:58 PM, Jeff Mitchell wrote:
>>
>> I'd be interested in figuring out the right way to migrate an RBD from
>> one pool to another regardless.
>
>
> Each way involves copying data, since by definition a different pool
> will use different placement groups.
>
> You could export/import with the rbd tool, do a manual dd like you
> mentioned, or clone and then flatten in the new pool. The simplest
> is probably 'rbd cp pool1/image pool2/image'.

Awesome -- I didn't know about 'rbd cp', and I'll have to look into
cloning/flattening. Thanks for the info.
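
For my own notes, the three approaches as I understand them (pool/image names
are examples; the clone route needs format 2 images):

    rbd cp pool1/image pool2/image              # simplest: copy within the cluster

    rbd export pool1/image /tmp/image.img       # or export/import via a file
    rbd import /tmp/image.img pool2/image

    rbd snap create pool1/image@migrate         # or clone, then flatten in the new pool
    rbd snap protect pool1/image@migrate
    rbd clone pool1/image@migrate pool2/image
    rbd flatten pool2/image                     # detach the clone from its parent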

--Jeff


Re: rest mgmt api

2013-02-06 Thread Jeff Mitchell

Dimitri Maziuk wrote:

(Last I looked "?op=create&poolname=foo" was the Old Busted CGI, The New
Shiny Hotness(tm) was supposed to look like "/create/foo" -- and I never
understood how the optional parameters are supposed to work. But that's
beside the point.)


They're different. One uses the path to express the operation; the other 
uses query parameters. The former requires custom path parsing/interpreting 
code for your particular application; the latter is a very well-supported 
and well-understood way of passing key/value pairs.

Neither is right or wrong; they're just different. People seem to prefer the 
path method these days because it looks cleaner/nicer. The other thing people 
do is just POST the parameters instead of GETting them, which lets you still 
use key/value parameters but avoids an ugly URL.
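
Purely as illustration -- these endpoints are made up, not the actual Ceph
management API:

    curl 'http://mgmt.example.com/api?op=create&poolname=foo&pg_num=128'   # query-parameter style
    curl 'http://mgmt.example.com/api/pool/create/foo?pg_num=128'          # path style, optional args as query params
    curl -d op=create -d poolname=foo -d pg_num=128 \
         'http://mgmt.example.com/api'                                     # POSTed key/value pairs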

--Jeff


Re: test osd on zfs

2013-04-17 Thread Jeff Mitchell

Henry C Chang wrote:

I looked into this problem earlier. The problem is that zfs does not
return ERANGE when the size of the value buffer passed to getxattr is too
small; zfs returns a truncated xattr value instead.


Is this a bug in ZFS, or simply different behavior?

I've used ZFSonLinux quite a bit, and they do seem to be very eager to 
fix bugs related to improper behavior, so if it's actually a bug, 
I (or someone) can talk to them and try to get them to look at it soonish.
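
For anyone who wants to poke at it, a rough way to observe the call pattern
in question (assumes the attr tools and strace are installed; the path is an
example ZFS mount):

    touch /tank/testfile
    setfattr -n user.test -v "$(head -c 300 /dev/zero | tr '\0' x)" /tank/testfile
    strace -f -e trace=getxattr getfattr -n user.test /tank/testfile
    # getxattr(2) is documented to fail with ERANGE when the caller's buffer
    # is too small, which is how callers know to retry with a bigger buffer;
    # silently returning a truncated value instead defeats that retry.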


--Jeff



Re: test osd on zfs

2013-04-17 Thread Jeff Mitchell
> On 04/17/2013 02:09 PM, Stefan Priebe wrote:
>>
>> Sorry to disturb, but what is the reason / advantage of using zfs for
>> ceph?

A few things off the top of my head:

1) Very mature filesystem with full xattr support (this bug
notwithstanding) and copy-on-write snapshots. While the port to Linux
sometimes has some rough edges (but in my experience over the past few
years is generally very good), the main code from Solaris (and now the
Illumos project) is well-tested and very well regarded. Btrfs has many
of the same features, but in my real-world experience I've had
multiple btrfs filesystems go corrupt with very innocuous usage
patterns and across a variety of kernel versions. The zfsonlinux bugs
don't tend to be data-destructive once data has been written to the filesystem.
2) Very intelligent caching; also supports external devices (like
SSDs) for a level 2 cache. This speeds up reads dramatically.
3) Very robust error-checking. There are lots of stories of ZFS
finding bad memory, bad controllers, and bad hard drives because of
its checksumming (which you can optionally turn off for speed). If you
set up the OSDs such that each OSD is backed by a ZFS mirror, you
get these benefits locally. For some people, especially workloads heavy on
reads (due to the intelligent caching), a solution that knocks the
remote replication level down by one but uses local mirrors for OSDs
may offer a good compromise between functionality and safety (a rough
setup sketch follows below).
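
Something like this, roughly -- device names, paths, and property choices are
examples of what I'd try, not a recommendation:

    zpool create -o ashift=12 osd0 mirror /dev/sdb /dev/sdc   # local mirror per OSD
    zpool add osd0 cache /dev/sdd1                            # SSD partition as L2ARC
    zfs set xattr=sa osd0                                     # assuming a ZoL version with xattr=sa
    zfs set mountpoint=/var/lib/ceph/osd/ceph-0 osd0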

--Jeff


Re: test osd on zfs

2013-04-19 Thread Jeff Mitchell

Alex Elsayed wrote:

Since Btrfs has implemented raid5/6 support (meaning raidz is only a feature
gain if you want 3x parity, which is unlikely to be useful for an OSD[1]),
the checksumming may be the only real benefit since it supports sha256 (in
addition to the non-cryptographic fletcher2/fletcher4), whereas btrfs only
has crc32c at this time.


Plus (in my real-world experience) *far* better robustness. If Ceph 
could use either and both had feature parity, I'd choose ZFS in a 
heartbeat. I've had too many simple Btrfs filesystems go corrupt, not 
even using any fancy RAID features.


I wasn't aware that Ceph was using btrfs' file-scope clone command. ZFS 
doesn't have that, although in theory with the new capabilities system 
it could be supported in one implementation without requiring an on-disk 
format change.
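
(For anyone else unfamiliar with it, my understanding is that this file-scope
clone is what userspace reflink copies use on btrfs -- extents are shared and
only copied on write:)

    cp --reflink=always /mnt/btrfs/source-object /mnt/btrfs/cloned-object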


--Jeff


Re: poor write performance

2013-04-20 Thread Jeff Mitchell

James Harper wrote:

Hi James,

do you have VLAN interfaces configured on your bonding interfaces? Because
I saw a similar situation in my setup.



No VLANs on my bonding interface, although they're extensively used elsewhere.


What the OP described is *exactly* like a problem I've been struggling 
with. I thought the blame lay elsewhere, but maybe not.


My setup:

4 Ceph nodes, with 6 OSDs each and dual (bonded) 10GbE, with VLANs, 
running Precise. OSDs are using XFS. Replica count of 3. 3 of these are 
mons.
4 compute nodes, with dual (bonded) 10GbE, with VLANs, running a base of 
Precise with the Ceph-provided 3.6.3 kernel and hosting KVM-based VMs. 
2 of these are also mons. VMs are Precise and access RBD through the 
kernel client.


(Eventually there will be 12 Ceph nodes. 5 mons seemed an appropriate 
number and when I've run into issues in the past I've actually gotten to 
cases where > 3 mons were knocked out, so 5 is a comfortable number 
unless it's problematic.)


In the VMs, I/O with ext4 is fine -- 10-15MB/s sustained. However, using 
ZFS (via ZFSonLinux, not FUSE), I see write speeds of about 150kb/sec, 
just like the OP.


I had figured that the problem lay with ZFS inside the VM (I've used 
ZFSonLinux on many bare-metal machines without a problem for a couple of 
years now). The VMs were using virtio, and I'd since heard that pre-1.4 
Qemu versions could have some serious problems with virtio (which I didn't 
know at the time); also, I know that the kernel client is not the preferred 
client, and the version I'm using is a rather older one of the 
Ceph-provided builds. As a result, my plan was to try the updated Qemu 
version along with native Qemu librados RBD support once Raring was out, 
as I figured the problem was either something in ZFSonLinux (though I 
reported the issue and nobody had ever heard of any such problem, or had 
any idea why it would be happening) or something specific to ZFS running 
inside Qemu, since ext4 in the VMs is fine.


But this thread has made me wonder if what's actually happening is in 
fact something else -- either something to do with using VLANs on the 
bonded interface, as someone else saw (although I don't see such a write 
problem with any other traffic going through these VLANs), or something 
about how ZFS inside the VM writes to the RBD disk that causes some kind 
of giant slowdown in Ceph. The numbers the OP cited were exactly in line 
with what I was seeing.


I don't know offhand what block sizes the kernel client was using, or 
what the different filesystems inside the VMs might be using when writing 
to their virtual disks (I'm guessing that with virtio, as I'm using, it 
could potentially be anything). But perhaps ZFS writes extremely small 
blocks and ext4 doesn't.


Unfortunately, I don't have access to this testbed for the next few 
weeks, so for the moment I can only recount my experience and not 
actually test out any suggestions (unless I can corral someone with 
access to it to run tests).


Thanks,
Jeff