Re: [ceph-users] Ceph mon quorum
On 04/05/2013 12:50 PM, Gregory Farnum wrote:
> ... I'm sorry if this aspect of the system is problematic for you, but
> it's pretty fundamental to any distributed or cloudy system that
> chooses consistency over availability.

It isn't actually -- as long as "chooses consistency over availability" is printed in big bold letters on the front page of the sales brochure. ;)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: rest mgmt api
On 02/11/2013 04:00 PM, Sage Weil wrote:
> On Mon, 11 Feb 2013, Gregory Farnum wrote:
> ...
> That doesn't really help; it means the mon still has to understand the
> CLI grammar. What we are talking about is the difference between
>   [ 'osd', 'down', '123' ]
> and
>   { URI: '/osd/down', OSD-Id: 123 }
> or however we generically translate the HTTP request into JSON.

I think the setup we have in mind is where the MON reads something like {who: osd, which: 123, what: down, when: now} from a socket (pipe, whatever); the CLI reads "osd down 123 now" from the prompt and pushes {who: osd, which: 123, what: down, when: now} into that socket; the webapp gets /osd/down/123/now or ?who=osd&command=down&id=123&when=now from whoever impersonates the browser and pipes {who: osd, which: 123, what: down, when: now} into that same socket; and all three of them are completely separate applications that don't try to do what they don't need to.

> FWIW you could pass the CLI command as JSON, but that's no different
> than encoding a vector<string>; it's still a different way of
> describing the same command.

The devil is of course in the details: in (e.g.) Python, json.loads() parses the string and gives you a map you could plug into a lookup table or something to get right to the function call. My C++ is way rusty and I've no idea what's available in boost & co -- if you have to roll your own JSON parser then you indeed don't care how that vector<string> is encoded.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
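For illustration, a minimal Python sketch of that split: both front-ends normalize to the same JSON command and the receiving side dispatches from a lookup table. The names and command shape here are made up for the example, not Ceph's actual wire format.

    import json

    # mon side: one registry, no knowledge of CLI grammar or URL layout
    def osd_down(which, when="now"):
        print("marking osd.%s down (%s)" % (which, when))

    DISPATCH = {("osd", "down"): osd_down}

    def handle(line):
        cmd = json.loads(line)  # e.g. read off the socket
        DISPATCH[(cmd["who"], cmd["what"])](cmd["which"], cmd.get("when", "now"))

    # CLI front-end: "osd down 123 now" -> JSON
    def cli_to_json(prompt_line):
        who, what, which, when = prompt_line.split()
        return json.dumps({"who": who, "what": what, "which": which, "when": when})

    # web front-end: "/osd/down/123/now" -> the very same JSON
    def url_to_json(path):
        who, what, which, when = path.strip("/").split("/")
        return json.dumps({"who": who, "what": what, "which": which, "when": when})

    handle(cli_to_json("osd down 123 now"))
    handle(url_to_json("/osd/down/123/now"))

The point being: the dispatcher only ever sees the one JSON shape, and each UI is free to grow or die independently.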
Re: CEPHFS mount error !!!
On 2/6/2013 5:54 AM, Dennis Jacobfeuerborn wrote:
> ...
> To mount cephfs like that you need to have kernel support. As the
> Linux kernel on CentOS 6.3 is version 2.6.32 and Ceph support wasn't
> added until 2.6.34, you need to compile your own kernel. The better
> alternative is probably to install a kernel from
> http://elrepo.org/tiki/kernel-lt . "lt" stands for long term and
> should be fairly stable; "ml" is mainline, which is even more current
> but because of that not quite as stable (currently 3.7.6).

I had problems booting ml on some/most (depending on the version) of our machines, plus it's a pain to track: there's a new one every day. I do have lt running without problems and mounting cephfs; however, I haven't gotten around to the actual ceph testing on it yet, so I can't say anything about the ceph client's performance/stability on it. (lt is 3.0; as I understand it, that doesn't have the latest and greatest ceph module(?))

Dimitri
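For reference, installing the lt kernel from elrepo looks roughly like this -- a sketch that assumes the elrepo-release package is already installed; check elrepo.org for the current instructions:

    # kernel-lt (long term) lives in elrepo's kernel repository;
    # kernel-ml is the mainline equivalent
    yum --enablerepo=elrepo-kernel install kernel-lt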
Re: rest mgmt api
On 02/06/2013 01:34 PM, Sage Weil wrote:
> I think the one caveat here is that having a single registry for
> commands in the monitor means that commands can come in two flavors:
> vector<string> (cli) and URL (presumably in json form). But a single
> command dispatch/registry framework will make that distinction pretty
> simple...

Any reason you can't have your CLI json-encode the commands (or, conversely, your cgi/wsgi/php/servlet URL handler decode them into vector<string>) before passing them on to the monitor?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
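That translation is a few lines either way; a rough Python sketch, assuming a made-up {who, what, which} JSON shape purely for the example:

    import json

    # CLI flavor -> JSON flavor
    def encode(argv):
        who, what, which = argv  # e.g. ['osd', 'down', '123']
        return json.dumps({"who": who, "what": what, "which": which})

    # JSON flavor -> CLI flavor (the vector<string> equivalent)
    def decode(blob):
        cmd = json.loads(blob)
        return [cmd["who"], cmd["what"], cmd["which"]]

    assert decode(encode(["osd", "down", "123"])) == ["osd", "down", "123"]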
Re: rest mgmt api
On 02/06/2013 02:14 PM, Sage Weil wrote:
> On Wed, 6 Feb 2013, Dimitri Maziuk wrote:
>> Any reason you can't have your CLI json-encode the commands (or,
>> conversely, your cgi/wsgi/php/servlet URL handler decode them into
>> vector<string>) before passing them on to the monitor?
> We can, but they won't necessarily look the same, because it is
> unlikely we can make a sane 1:1 translation of the CLI to REST that
> makes sense, and it would be nice to avoid baking knowledge about the
> individual commands into the client side.
>   ceph osd pool create <poolname> <numpgs>
> vs
>   /osd/pool/?op=create&poolname=foo&numpgs=bar
> or whatever.

I know next to nothing about REST API design best practices, but I'm guessing it doesn't look like a CLI. (Last I looked, ?op=create&poolname=foo was the Old Busted CGI; The New Shiny Hotness(tm) was supposed to look like /create/foo -- and I never understood how the optional parameters are supposed to work. But that's beside the point.)

To me it sounded like you meant the piece that actually does the work (daemon?) should understand both (and have a built-in httpd on top). What I meant is it should know just one and let the UI modules do the translation.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
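On the optional-parameters aside: the usual convention is that required arguments go in the path and optional ones stay in the query string, so both styles parse down to the same map. A Python sketch, with URL layouts invented for the example:

    from urllib.parse import urlsplit, parse_qs

    def parse_cgi(url):
        # old style: '/osd/pool/?op=create&poolname=foo&numpgs=128'
        parts = urlsplit(url)
        return {k: v[0] for k, v in parse_qs(parts.query).items()}

    def parse_rest(url):
        # path style: '/osd/pool/create/foo?numpgs=128'
        # required args in the path, optional args in the query string
        parts = urlsplit(url)
        _, _, op, poolname = parts.path.strip("/").split("/")
        cmd = {"op": op, "poolname": poolname}
        cmd.update({k: v[0] for k, v in parse_qs(parts.query).items()})
        return cmd

    assert (parse_cgi("/osd/pool/?op=create&poolname=foo&numpgs=128")
            == parse_rest("/osd/pool/create/foo?numpgs=128"))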
Re: Ceph Development with Eclipse
On 02/02/2013 11:40 AM, charles L wrote:
> Hi, I am a beginner at c++ and eclipse. I need some startup help to
> develop ceph with eclipse. If you could provide your config file on
> eclipse, it will be a great starting point and very appreciated.

(giggle) Real Men use vi. Or was it emacs?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
On 1/24/2013 2:49 AM, Gandalf Corvotempesta wrote:
> 2013/1/24 Dimitri Maziuk <dmaz...@bmrb.wisc.edu>:
>> So I'm stuck at a point way before those guides become relevant: once
>> I had one OSD/MDS/MON box up, I got HEALTH_WARN 384 pgs degraded; 384
>> pgs stuck unclean; recovery 21/42 degraded (50.000%) (384 appears to
>> be the number of placement groups created by default). What does that
>> mean? That I only have one OSD? Or is it genuinely unhealthy?
> Ceph is building its cluster. You should wait for it. In my case, it
> needed 5-10 minutes.

No, that's not it: it was stuck in that state for 40 minutes or so.

Dima
Re: Understanding Ceph
On 1/24/2013 8:20 AM, Sam Lang wrote:
> Yep, it means that you only have one OSD with a replication level of
> 2. If you had a rep level of 3, you would see degraded (66.667%). If
> you just want to make the message go away (for testing purposes), you
> can set the rep level to 1
> (http://ceph.com/w/index.php?title=Adjusting_replication_level&redirect=no).

OK, thanks Sam and Dino -- I kinda suspected that but didn't find any docs. This looks like it's not adjustable via ceph.conf; I can only do it at runtime, correct?

Dima
Re: Understanding Ceph
One other question I have left (so far): I read and tried to follow http://ceph.com/docs/master/install/rpm/ and http://ceph.com/docs/master/start/quick-start/ on CentOS 6.3. The mkcephfs step fails without the rbd kernel module.

I just tried to find "libvirt", "kernel", "module", and "qemu" on those pages: "kernel" occurs in the "add ceph packages" section and "module" occurs in the header, footer, and the side menu. 0 hits for the others. So when I read "after learning that qemu uses librbd (and thus doesn't rely on the rbd kernel module) I was happy to stick with the stock CentOS kernel for my servers (with updated qemu and libvirt builds)" -- forgive me for being dense, but I have no context for this. Where in ceph.conf do I tell it to use qemu and librbd instead of the kernel module? Or does it mean I'm to set up my OSDs in virtual machines?

Seems I'm missing an important piece of information here (possibly because it's blatantly obvious and is staring me in the face -- wouldn't be the first time). So what is it that I'm missing?

TIA

Dima
Re: Understanding Ceph
On 1/24/2013 9:58 AM, Wido den Hollander wrote:
> On 01/24/2013 04:53 PM, Jens Kristian Søgaard wrote:
>> Hi Dimitri,
>>> Where in ceph.conf do I tell it to use qemu and librbd instead of
>>> the kernel module?
>> You do not need to specify that in ceph.conf. When you run qemu,
>> specify the disk for example like this:
>>   -drive format=rbd,file=rbd:/pool/imagename,if=virtio,index=0,boot=on
> Small typo :) It has to be:
>   -drive format=rbd,file=rbd:pool/imagename,if=virtio,index=0,boot=on

Thanks, but I'm still missing the context. I'm following this document: http://ceph.com/docs/master/start/quick-start/ to set up an osd/mds/mon *server*. The step that's failing without the kernel module is "Deploy the configuration", #2:

    mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

Are you saying I'm to run qemu -drive ... instead of mkcephfs? (I'm assuming either you aren't, or qemu has changed a lot since I last looked.)

Dima
Re: Understanding Ceph
On 1/24/2013 10:22 AM, Sam Lang wrote:
> ... Does that make sense?

Yes, but when I try to set up a ceph server using the quick start guide, mkcephfs fails with an error message I didn't write down, but the complaint was along the lines of a missing rbd.ko. Booting a 3.7 kernel made it go away. This is the part where everyone says server stuff should run on the stock CentOS kernel, but in my reality it doesn't. (So I'm trying to figure out why my reality is different from everyone else's ;)

I'll see if I can reproduce it and post the exact error message.

Dima
Re: Understanding Ceph
On 01/24/2013 12:15 PM, Dan Mick wrote:
> On 01/24/2013 07:28 AM, Dimitri Maziuk wrote:
>> On 1/24/2013 8:20 AM, Sam Lang wrote:
>>> Yep, it means that you only have one OSD with a replication level of
>>> 2. If you had a rep level of 3, you would see degraded (66.667%). If
>>> you just want to make the message go away (for testing purposes),
>>> you can set the rep level to 1
>>> (http://ceph.com/w/index.php?title=Adjusting_replication_level&redirect=no).
>> OK, thanks Sam and Dino -- I kinda suspected that but didn't find any
>> docs. This looks like it's not adjustable via ceph.conf; I can only
>> do it at runtime, correct?
> or you could just add another OSD.

Obviously. You'd think that only one [osd] section in ceph.conf implies nrep = 1, though. (And then you can go on adding OSDs and changing nrep accordingly -- that was my plan.)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
On 01/24/2013 12:38 PM, John Wilkins wrote:
> Dima, I'm working on a new monitoring and troubleshooting guide now
> that will answer most of the questions related to OSD and placement
> group states. I hope to have it done this week. I have not actually
> tested the quick starts on centos or rhel distributions, but it's on
> our radar. The intention of the quick starts is to get you up and
> running quickly. It doesn't cover deeper issues like how to monitor
> and troubleshoot. I'm working on adding a lot more substantive content
> there now.

A couple of things in the quick start (a combined example follows below):

- there should be no space between "rw," and "noatime" in

    osd mount options {fs-type} = {mount options}  # default mount option is "rw, noatime"

- for ext4, you need to specify user_xattr there or mkcephfs will fail (with --mkfs at least).

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
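Putting both fixes together, the relevant ceph.conf lines for ext4 would look something like this -- a sketch, where "osd mount options ext4" follows the {fs-type} pattern from the quick start and the values are examples:

    [osd]
        osd mount options ext4 = rw,noatime,user_xattr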
Re: Understanding Ceph
On 01/24/2013 12:16 PM, Dan Mick wrote:
> This is an apparently-unique problem, and we'd love to see details.

I hate it when it makes a liar out of me: this time around it worked on 2.6.32 -- FSVO "worked": I did get it to the "384 pgs stuck unclean" stage.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
On 01/24/2013 03:07 PM, Dan Mick wrote:
> ... Yeah; it's probably mostly just that one-OSD configurations are so
> uncommon that we never special-cased that small user set. Also, you
> can run with a cluster in that state forever (well, until that one OSD
> dies at least); I do that regularly with the default vstart.sh local
> test cluster

Well, this goes back to the quick start guide: to me a more natural way to start is with one host, then add another. That's what I was trying to do; however, the quick start page ends with "When your cluster echoes back HEALTH_OK, you may begin using Ceph." and that doesn't happen with one host: you get "384 pgs stuck unclean" instead of HEALTH_OK. To me that means I may *not* begin using ceph.

I did run "ceph osd pool set ... size 1" on each of the 3 default pools, verified that it took with "ceph osd dump | grep 'rep size'", and gave it a good half hour to settle. I still got "384 pgs stuck unclean" from ceph health. So I redid it with 2 OSDs and got the expected HEALTH_OK right from the start.

John, a) a note saying "if you have only one OSD you won't get HEALTH_OK until you add another one; you can start using the cluster" may be a useful addition to the quick start; b) more importantly, if there are any plans to write more quickstart pages, I'd love to see "add another OSD (MDS, MON) to an existing pool in 5 minutes".

Thanks all,

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
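Spelled out, that runtime change was along these lines -- assuming the three default pools of that era were named data, metadata, and rbd; substitute your own pool names:

    ceph osd pool set data size 1
    ceph osd pool set metadata size 1
    ceph osd pool set rbd size 1
    ceph osd dump | grep 'rep size'   # verify the change took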
Re: Understanding Ceph
On 01/24/2013 03:48 PM, Sage Weil wrote:
> On Thu, 24 Jan 2013, Dimitri Maziuk wrote:
>> So I redid it with 2 OSDs and got the expected HEALTH_OK right from
>> the start.
> There may be a related issue at work here: the default crush rules now
> replicate across hosts instead of across osds, so single-host configs
> may have similar problems (depending on whether you used mkcephfs to
> create the cluster or not).

Right, that's with the 2nd OSD on another host, not with 2 OSDs on the same host.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
John, in the block device quick start (http://ceph.com/docs/master/start/quick-rbd/),

    sudo rbd map foo --pool rbd --name client.admin

maps the image to /dev/rbd0 here (CentOS 6.3/bobtail), so the subsequent step 4, "Use the block device. In the following example, create a file system."

    sudo mkfs.ext4 -m0 /dev/rbd/rbd/foo

should end with /dev/rbd0 instead of /dev/rbd/rbd/foo.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
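In other words, the sequence that works on this particular setup is (same commands as the guide, just with the device path as observed above; the mount point is an arbitrary example):

    sudo rbd map foo --pool rbd --name client.admin
    sudo mkfs.ext4 -m0 /dev/rbd0
    sudo mount /dev/rbd0 /mnt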
Re: Understanding Ceph
On 01/23/2013 10:19 AM, Patrick McGarry wrote:
> http://ceph.com/howto/building-a-public-ami-with-ceph-and-openstack/

On Wed, Jan 23, 2013 at 10:13 AM, Sam Lang <sam.l...@inktank.com> wrote:
> http://ceph.com/docs/master/rbd/rbd-openstack/

These are both great, I'm sure, but Patrick's page says "I chose to follow the 5 minute quickstart guide" and the rbd-openstack page says "Important ... you must have a running Ceph cluster." My problem is I can't find a 5 minute quickstart guide for RHEL 6, and I didn't get a running ceph cluster by trying to follow the existing (Ubuntu) guide and adjusting for CentOS 6.3.

So I'm stuck at a point way before those guides become relevant: once I had one OSD/MDS/MON box up, I got

    HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)

(384 appears to be the number of placement groups created by default). What does that mean? That I only have one OSD? Or is it genuinely unhealthy?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
On 01/23/2013 06:17 PM, John Nielsen wrote:
> ...
> http://ceph.com/docs/master/install/rpm/
> http://ceph.com/docs/master/start/quick-start/
> Between those two links my own quick-start on CentOS 6.3 was maybe 6
> minutes. YMMV.

It does, obviously, since "Deploy the configuration ... 2. Execute the following on the Ceph server host":

    cd /etc/ceph
    sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

was failing here until I booted an elrepo 3.7 kernel with rbd.ko.

>> HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42
>> degraded (50.000%) What does that mean? That I only have one OSD? Or
>> is it genuinely unhealthy?
> Assuming you have more than one host ...

I just said I have one host. So is that expected when I only have one host?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: handling fs errors
On 01/22/2013 12:05 AM, Sage Weil wrote:
> We observed an interesting situation over the weekend. The XFS volume
> ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
> minutes. ...

FWIW I see this often enough on cheap SATA drives: they have a failure mode that makes the SATA driver time out, reset the link, resend the command, rinse, lather, repeat. (You usually get "slow to respond, please be patient" and/or "resetting link" in syslog/console.) It's at a low enough level to freeze the whole system for minutes.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: Understanding Ceph
On 1/19/2013 12:16 PM, Sage Weil wrote:
> We generally recommend the KVM+librbd route, as it is easier to manage
> the dependencies, and is well integrated with libvirt. FWIW this is
> what OpenStack and CloudStack normally use.

OK, so is there a quick start document for that configuration?

(Oh, and "form" in my other message is supposed to be "from": tyop.)

Dima
Grid data placement
Hi everyone, quick question: can I get ceph to replicate a bunch of files to every host in a compute cluster and then have those hosts read those files from local disk?

TFM makes it look like a custom crush map should get the files to [an osd on] every host, but I'm not clear on the read step: do I need an mds on every host and mount the fs off localhost's mds?

(We've got $APP running on the cluster, normally one instance per cpu core, that mmap's (read-only) ~30GB of binary files. I/O over NFS kills the cluster even with a few hosts. Currently the files are rsync'ed to every host at the start of the batch; that'll only scale to a few dozen hosts at best.)

TIA,

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
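For a sense of the access pattern: each instance does roughly the equivalent of the following, run once per core. (A Python sketch with a made-up file name; the real application and its file layout are of course different.)

    import mmap

    # map the shared data set read-only and walk through it sequentially
    with open("/scratch/dataset.bin", "rb") as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        pos = 0
        while pos < len(m):
            chunk = m[pos:pos + 4 * 1024 * 1024]  # read in 4MB slices
            pos += len(chunk)
        m.close()

Multiply that by every core in the cluster hitting one NFS server and you can see why it falls over.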
Re: Grid data placement
On 01/15/2013 12:36 PM, Gregory Farnum wrote:
> On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk
> <dmaz...@bmrb.wisc.edu> wrote:
>> At the start of the batch, #cores-in-the-cluster processes try to
>> mmap the same 2GB and start reading it from SEEK_SET at the same
>> time. I won't know until I try but I suspect it won't like that.
> Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
> It's possible you could overwhelm their networking, but my bet is
> they'll just get spread out slightly on the first block and then not
> contend in the future.

In the future the application spreads out the reads as well: running instances go through the data at different speeds, and when one's finished, the next one starts on the same core and mmap's the first chunk again.

> Just as long as you're thinking of it as a test system that would make
> us very happy. :)

Well, IRL this is throw-away data generated at the start of a batch, and we're good if one batch a month runs to completion. So if it doesn't crash all the time every time, that actually should be good enough for me. However, not all of the nodes have spare disk slots, so I couldn't do a full-scale deployment anyway, not without rebuilding half the nodes.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu