Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Dimitri Maziuk
On 04/05/2013 12:50 PM, Gregory Farnum wrote:

... I'm sorry if this aspect of the system is problematic
 for you, but it's pretty fundamental to any distributed or cloudy
 system that chooses consistency over availability.

It isn't, actually -- as long as 'chooses consistency over availability'
is printed in big bold letters on the front page of the sales brochure. ;)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: rest mgmt api

2013-02-11 Thread Dimitri Maziuk
On 02/11/2013 04:00 PM, Sage Weil wrote:
 On Mon, 11 Feb 2013, Gregory Farnum wrote:
...

 That doesn't really help; it means the mon still has to understand the 
 CLI grammar.
 
 What we are talking about is the difference between:
 
 [ 'osd', 'down', '123' ]
 
 and
 
 {
   URI: '/osd/down',
   OSD-Id: 123
 }
 
 or however we generically translate the HTTP request into JSON.

I think the setup we have in mind is where the MON reads something like
{"who": "osd", "which": 123, "what": "down", "when": "now"} from a socket
(pipe, whatever),

the CLI reads "osd down 123 now" from the prompt and pushes {"who": "osd",
"which": 123, "what": "down", "when": "now"} into that socket,

the webapp gets whatever: /osd/down/123/now or
?who=osd&command=down&id=123&when=now from whoever impersonates the
browser and pipes {"who": "osd", "which": 123, "what": "down",
"when": "now"} into that same socket,

and all three of them are three completely separate applications that
don't try to do what they don't need to.
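
Something like this, say -- a throwaway python sketch; the socket path,
field names, and helper functions are all made up to illustrate the
separation, nothing the mon actually exposes:

import json
import socket

MON_CMD_SOCK = "/var/run/ceph/mon-cmd.sock"   # hypothetical control socket

def send_command(cmd):
    """Push one canonical JSON command to the (hypothetical) mon socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(MON_CMD_SOCK)
    s.sendall(json.dumps(cmd).encode() + b"\n")
    s.close()

def from_cli(line):
    """CLI front end: "osd down 123 now" -> canonical dict."""
    who, what, which, when = line.split()
    return {"who": who, "what": what, "which": int(which), "when": when}

def from_query(args):
    """Web front end: parsed query args -> the same canonical dict."""
    return {"who": args["who"], "what": args["command"],
            "which": int(args["id"]), "when": args.get("when", "now")}

# both front ends end up pushing the same thing:
#   send_command(from_cli("osd down 123 now"))
#   send_command(from_query({"who": "osd", "command": "down", "id": "123"}))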

 FWIW you could pass the CLI command as JSON, but that's no different than 
 encoding vector<string>; it's still a different way of describing the same 
 command.

The devil is of course in the details: in (e.g.) python, json.loads() parses
the string and gives you the map you could plug into a lookup table or
something to get right to the function call. My c++ is way rusty and I've
no idea what's available in boost & co -- if you have to roll your own
json parser then you indeed don't care how that vector<string> is encoded.
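
In python terms it could be as dumb as this (handler names and bodies made
up, obviously not ceph's actual internals):

import json

def osd_down(osd_id, when="now"):
    print("marking osd.%d down (%s)" % (osd_id, when))   # stand-in for the real work

def osd_up(osd_id, when="now"):
    print("marking osd.%d up (%s)" % (osd_id, when))

# one registry keyed on (who, what); every front end feeds the same grammar
DISPATCH = {
    ("osd", "down"): osd_down,
    ("osd", "up"): osd_up,
}

def handle(raw):
    cmd = json.loads(raw)                                   # string -> dict
    DISPATCH[(cmd["who"], cmd["what"])](cmd["which"], cmd.get("when", "now"))

handle('{"who": "osd", "what": "down", "which": 123, "when": "now"}')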

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: CEPHFS mount error !!!

2013-02-06 Thread Dimitri Maziuk

On 2/6/2013 5:54 AM, Dennis Jacobfeuerborn wrote:
...

To mount cephfs like that you need to have kernel support. As the Linux
kernel on CentOS 6.3 is version 2.6.32 and Ceph support wasn't added until
2.6.34, you need to compile your own kernel.



The better alternative is probably to install a kernel from
http://elrepo.org/tiki/kernel-lt . "lt" stands for "long term" and should be
fairly stable; "ml" is "mainline", which is even more current but, because
of that, not quite as stable (currently 3.7.6).


I had problems booting ml on some/most (depending on the version) of our 
machines, plus it's a pain to track: there's a new one every day.


I do have lt running without problems and mounting cephfs; however, I 
haven't gotten around to actual ceph testing on it yet, so I can't 
say anything about the ceph client's performance/stability on it. (lt is 
3.0; as I understand it, it doesn't have the latest and greatest ceph module.)


Dimitri



Re: rest mgmt api

2013-02-06 Thread Dimitri Maziuk
On 02/06/2013 01:34 PM, Sage Weil wrote:

 I think the one caveat here is that having a single registry for commands 
 in the monitor means that commands can come in two flavors: vector<string> 
 (cli) and URL (presumably in json form).  But a single command 
 dispatch/registry framework will make that distinction pretty simple...

Any reason you can't have your CLI json-encode the commands (or,
conversely, your cgi/wsgi/php/servlet URL handler decode them into
vector<string>) before passing them on to the monitor?
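
For example (a python sketch with a made-up query scheme), the web handler
could flatten a query string back into the same token list the CLI produces
and hand that on:

from urllib.parse import parse_qs

def query_to_tokens(query):
    # "op=create&poolname=foo&numpgs=12" -> ["osd", "pool", "create", "foo", "12"]
    args = dict((k, v[0]) for k, v in parse_qs(query).items())
    return ["osd", "pool", args["op"], args["poolname"], args["numpgs"]]

print(query_to_tokens("op=create&poolname=foo&numpgs=12"))
# ['osd', 'pool', 'create', 'foo', '12'] -- json-encode it or pass it on as-is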

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: rest mgmt api

2013-02-06 Thread Dimitri Maziuk
On 02/06/2013 02:14 PM, Sage Weil wrote:
 On Wed, 6 Feb 2013, Dimitri Maziuk wrote:

 Any reason you can't have your CLI json-encode the commands (or,
 conversely, your cgi/wsgi/php/servlet URL handler decode them into
 vector<string>) before passing them on to the monitor?
 
 We can, but they won't necessarily look the same, because it is unlikely 
 we can make a sane 1:1 translation of the CLI to REST that makes sense, 
 and it would be nice to avoid baking knowledge about the individual 
 commands into the client side.
 
  ceph osd pool create <poolname> <numpgs>
vs
  /osd/pool/?op=create&poolname=foo&numpgs=bar
 
 or whatever.  I know next to nothing about REST API design best practices, 
 but I'm guessing it doesn't look like a CLI.

(Last I looked, ?op=create&poolname=foo was the Old Busted CGI; the New
Shiny Hotness(tm) was supposed to look like /create/foo -- and I never
understood how the optional parameters are supposed to work. But that's
beside the point.)

It sounded to me like you think the piece that actually does the work
(daemon?) should understand both (and have a built-in httpd on top).
What I meant is that it should know just one and let the UI modules do
the translation.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Ceph Development with Eclipse‏

2013-02-02 Thread Dimitri Maziuk
On 02/02/2013 11:40 AM, charles L wrote:
 
 
 Hi
 
 I am a beginner at c++ and eclipse. I need some startup help to
 develop ceph with eclipse. If you could provide your config file on
 eclipse, it will be a great starting point and very appreciated.

(giggle) Real Men use vi. Or was it emacs?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk

On 1/24/2013 2:49 AM, Gandalf Corvotempesta wrote:

2013/1/24 Dimitri Maziuk dmaz...@bmrb.wisc.edu:

So I'm stuck at a point way before those guides become relevant: once I
had one OSD/MDS/MON box up, I got "HEALTH_WARN 384 pgs degraded; 384 pgs
stuck unclean; recovery 21/42 degraded (50.000%)" (384 appears to be the
number of placement groups created by default).

What does that mean? That I only have one OSD? Or is it genuinely unhealthy?


ceph is building its cluster. You should wait for it.
In my case, it needed 5-10 minutes.


No, that's not it: it was stuck in that state for 40 minutes or so.

Dima



Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk

On 1/24/2013 8:20 AM, Sam Lang wrote:


Yep it means that you only have one OSD with replication level of 2.
If you had a rep level of 3, you would see degraded (66.667%).  If you
just want to make the message go away (for testing purposes), you can
set the rep level to 1
(http://ceph.com/w/index.php?title=Adjusting_replication_level&redirect=no).


OK, thanks Sam and Dino -- I kinda suspected that but didn't find any docs.

This looks like it's not adjustable via ceph.conf; I can only do it at 
runtime, correct?


Dima



Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk


One other question I have left (so far) is: I read and tried to follow 
http://ceph.com/docs/master/install/rpm/ and 
http://ceph.com/docs/master/start/quick-start/ on centos 6.3.


The mkcephfs step fails without the rbd kernel module.

I just tried to find "libvirt", "kernel", "module", and "qemu" on those 
pages: "kernel" occurs in the "add ceph packages" section and "module" 
occurs in the header, footer, and the side menu. 0 hits for the others.


So when I read "after learning that qemu uses librbd (and thus doesn't 
rely on the rbd kernel module) I was happy to stick with the stock 
CentOS kernel for my servers (with updated qemu and libvirt builds)" -- 
forgive me for being dense, but I have no context for this. Where in 
ceph.conf do I tell it to use qemu and librbd instead of the kernel module? 
Or does it mean I'm to set up my OSDs in virtual machines? Seems I'm 
missing an important piece of information here (possibly because it's 
blatantly obvious and is staring me in the face -- wouldn't be the first 
time).


So what is it that I'm missing?

TIA
Dima



Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk

On 1/24/2013 9:58 AM, Wido den Hollander wrote:

On 01/24/2013 04:53 PM, Jens Kristian Søgaard wrote:

Hi Dimitri,


Where in ceph.conf do I tell it to use qemu and librbd instead of
kernel module?


You do not need to specify that in ceph.conf.

When you run qemu then specify the disk for example like this:

  -drive format=rbd,file=rbd:/pool/imagename,if=virtio,index=0,boot=on



Small typo :) It has to be:

  -drive format=rbd,file=rbd:pool/imagename,if=virtio,index=0,boot=on


Thanks but I'm still missing the context. I'm following this document:
 http://ceph.com/docs/master/start/quick-start/
to set up an osd/mds/mon *server*.

The step that's failing without the kernel module is "Deploy the 
configuration", step 2:

 mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

Are you saying I'm to run qemu -drive ... instead of mkcephfs?

Dima (I'm assuming either you aren't or qemu has changed a lot since I 
last looked)




Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk

On 1/24/2013 10:22 AM, Sam Lang wrote:

...  Does that make sense?

Yes, but when I'm trying to set up a ceph server using the quick start 
guide, mkcephfs is failing with an error message I didn't write down, 
but the complaint was along the lines of a missing rbd.ko. Booting a 3.7 
kernel made it go away.


This is the part where everyone says server stuff should run on the 
stock centos kernel, but in my reality it doesn't. (So I'm trying to 
figure out why my reality is different from everyone else's ;)


I'll see if I can reproduce it and post the exact error message.

Dima



Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
On 01/24/2013 12:15 PM, Dan Mick wrote:
 On 01/24/2013 07:28 AM, Dimitri Maziuk wrote:
 On 1/24/2013 8:20 AM, Sam Lang wrote:

 Yep it means that you only have one OSD with replication level of 2.
 If you had a rep level of 3, you would see degraded (66.667%). If you
 just want to make the message go away (for testing purposes), you can
 set the rep level to 1
 (http://ceph.com/w/index.php?title=Adjusting_replication_level&redirect=no).


 OK, thanks Sam and Dino -- I kinda suspected that but didn't find any
 docs.

 This looks like it's not adjustable via ceph.conf, I can only do it at
 runtime, correct?
 
 or you could just add another OSD.

Obviously. You'd think that a single [osd] section in ceph.conf would imply
nrep = 1, though. (And then you can go on adding OSDs and changing nrep
accordingly -- that was my plan.)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
On 01/24/2013 12:38 PM, John Wilkins wrote:
 Dima,
 
 I'm working on a new monitoring and troubleshooting guide now that will
 answer most of the questions related to OSD and placement group states. I
 hope to have it done this week. I have not actually tested the quick starts
 on centos or rhel distributions, but it's on our radar. The intention of
 the quick starts is to get you up and running quickly. It doesn't cover
 deeper issues like how to monitor and troubleshoot. I'm working on adding a
 lot more substantive content there now.

A couple of things in the quick start:

- there should be no space between "rw," and "noatime" in
"osd mount options {fs-type} = {mount options}  # default mount option is
rw, noatime"

- for ext4, you need to specify user_xattr there or mkcephfs will fail
(with --mkfs at least); see the snippet below.
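
E.g. something along these lines in ceph.conf -- just the quick-start
template filled in for ext4, adjust to taste, not a tested recommendation:

[osd]
        osd mount options ext4 = rw,noatime,user_xattr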

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
On 01/24/2013 12:16 PM, Dan Mick wrote:

 This is an apparently-unique problem, and we'd love to see details.

I hate it when it makes a liar out of me: this time around it worked on
2.6.32 -- FSVO "worked": I did get it to the "384 pgs stuck unclean" stage.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
On 01/24/2013 03:07 PM, Dan Mick wrote:
...
 Yeah; it's probably mostly just that one-OSD configurations are so
 uncommon that we never special-cased that small user set.  Also, you can
 run with a cluster in that state forever (well, until that one OSD dies
 at least); I do that regularly with the default vstart.sh local test
 cluster

Well, this goes back to the quick start guide: to me a more natural way
to start is with one host, then add another. That's what I was trying to
do; however, the quick start page ends with

"When your cluster echoes back HEALTH_OK, you may begin using Ceph."

and that doesn't happen with one host: you get "384 pgs stuck unclean"
instead of HEALTH_OK. To me that means I may *not* begin using ceph.

I did run "ceph osd pool set ... size 1" on each of the 3 default pools,
verified that it took with "ceph osd dump | grep 'rep size'", and gave
it a good half hour to settle. I still got "384 pgs stuck unclean" from
ceph health.

So I redid it with 2 OSDs and got the expected HEALTH_OK right from
the start.

John,

a) a note saying "if you have only one OSD you won't get HEALTH_OK until
you add another one; you can start using the cluster" may be a useful
addition to the quick start,

b) more importantly, if there are any plans to write more quickstart
pages, I'd love to see "add another OSD (MDS, MON) to an existing
pool in 5 minutes".

Thanks all,
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
On 01/24/2013 03:48 PM, Sage Weil wrote:
 On Thu, 24 Jan 2013, Dimitri Maziuk wrote:

 So I re-done it with 2 OSDs and got the expected HEALTH_OK right from
 the start.

 There may be a related issue at work here: the default crush rules now 
 replicate across hosts instead of across osds, so single-host configs may 
 have similar problems (depending on whether you used mkcephfs to create 
 the cluster or not).

Right, that's with the 2nd OSD on another host, not with 2 OSDs on the same
host.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-24 Thread Dimitri Maziuk
John,

in block device quick start (http://ceph.com/docs/master/start/quick-rbd/)

sudo rbd map foo --pool rbd --name client.admin

maps the image to /dev/rbd0 here (centos 6.3/bobtail) so the subsequent

4. Use the block device. In the following example, create a file system.

sudo mkfs.ext4 -m0 /dev/rbd/rbd/foo

should end with /dev/rbd0 instead of /dev/rbd/rbd/foo.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-23 Thread Dimitri Maziuk
On 01/23/2013 10:19 AM, Patrick McGarry wrote:

 http://ceph.com/howto/building-a-public-ami-with-ceph-and-openstack/

 On Wed, Jan 23, 2013 at 10:13 AM, Sam Lang sam.l...@inktank.com wrote:

 http://ceph.com/docs/master/rbd/rbd-openstack/

These are both great, I'm sure, but Patrick's page says "I chose to
follow the 5 minute quickstart guide" and the rbd-openstack page says
"Important ... you must have a running Ceph cluster."

My problem is I can't find a 5 minute quickstart guide for RHEL 6, and
I didn't get a running ceph cluster by trying to follow the existing
(ubuntu) guide and adjust for centos 6.3.

So I'm stuck at a point way before those guides become relevant: once I
had one OSD/MDS/MON box up, I got "HEALTH_WARN 384 pgs degraded; 384 pgs
stuck unclean; recovery 21/42 degraded (50.000%)" (384 appears to be the
number of placement groups created by default).

What does that mean? That I only have one OSD? Or is it genuinely unhealthy?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-23 Thread Dimitri Maziuk
On 01/23/2013 06:17 PM, John Nielsen wrote:
...
 http://ceph.com/docs/master/install/rpm/
 http://ceph.com/docs/master/start/quick-start/
 
 Between those two links my own quick-start on CentOS 6.3 was maybe 6 minutes. 
 YMMV.

It does, obviously, since

Deploy the configuration
...
2. Execute the following on the Ceph server host
cd /etc/ceph
sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring


was failing here until I booted an elrepo 3.7 kernel with rbd.ko.

 HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded 
 (50.000%)

 What does that mean? That I only have one OSD? Or is it genuinely unhealthy?

 Assuming you have more than one host ...

I just said I have one host. So is that expected when I only have one host?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: handling fs errors

2013-01-22 Thread Dimitri Maziuk
On 01/22/2013 12:05 AM, Sage Weil wrote:
 We observed an interesting situation over the weekend.  The XFS volume 
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 
 minutes.
...

FWIW I see this often enough on cheap sata drives: they have a failure
mode that makes the sata driver time out, reset the link, resend the
command, rinse, lather, repeat. (You usually get "slow to respond, please
be patient" and/or "resetting link" in syslog & console.) It's at a low
enough level to freeze the whole system for minutes.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Understanding Ceph

2013-01-20 Thread Dimitri Maziuk

On 1/19/2013 12:16 PM, Sage Weil wrote:


We generally recommend the KVM+librbd route, as it is easier to manage the
dependencies, and is well integrated with libvirt.  FWIW this is what
OpenStack and CloudStack normally use.


OK, so is there a quick start document for that configuration?

(Oh, and "form" in my other message is supposed to be "from": tyop)

Dima




Grid data placement

2013-01-15 Thread Dimitri Maziuk
Hi everyone,

quick question: can I get ceph to replicate a bunch of files to every
host in the compute cluster and then have those hosts read those files
from local disk?

From TFM it looks like a custom crush map should get the files to [an osd
on] every host, but I'm not clear on the read step: do I need an mds on
every host and mount the fs off localhost's mds?

(We've $APP running on the cluster, normally one instance/cpu core, that
mmap's (read only) ~30GB of binary files. I/O over NFS kills the cluster
even with a few hosts. Currently the files are rsync'ed to every host at
the start of the batch; that'll only scale to a few dozen hosts at best.)

TIA,
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu







Re: Grid data placement

2013-01-15 Thread Dimitri Maziuk
On 01/15/2013 12:36 PM, Gregory Farnum wrote:
 On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu 
 wrote:

 At the start of the batch #cores-in-the-cluster processes try to mmap
 the same 2GB and start reading it from SEEK_SET at the same time. I
 won't know until I try but I suspect it won't like that.
 
 Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
 It's possible you could overwhelm their networking but my bet is
 they'll just get spread out slightly on the first block and then not
 contend in the future.

In the future the application spreads out the reads as well: running
instances go through the data at different speeds, and when one's
finished, the next one starts on the same core & it mmap's the first
chunk again.

 Just as long as you're thinking of it as a test system that would make
 us very happy. :)

Well, IRL this is throw-away data generated at the start of a batch, and
we're good if one batch a month runs to completion. So if it doesn't
crash all the time every time, that actually should be good enough for
me. However, not all of the nodes have spare disk slots, so I couldn't
do a full-scale deployment anyway, not without rebuilding half the nodes.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


