[ceph-users] Replacing an OSD Drive
When the time comes to replace an OSD, I've used the following procedure:

1) Stop/down/out the osd and replace the drive
2) Create the ceph osd directory: ceph-osd -i N --mkfs
3) Copy the osd key out of the authorized keys list
4) ceph osd crush rm osd.N
5) ceph osd crush add osd.N $osd_size root=default host=$(hostname -s)
6) ceph osd in osd.N
7) service ceph start osd.N

If I don't do steps 4 and 5, the osd process times out in futex:

[pid 22822] futex(0x4604cc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 98, {1423237460, 296281000},
[pid 22821] futex(0x4604cc0, FUTEX_WAKE_PRIVATE, 1
[pid 22822] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)

Upping the debugging only shows:

2015-02-06 10:48:22.656012 7f9acf967700 20 osd.40 396 update_osd_stat osd_stat(62060 kB used, 2793 GB avail, 2793 GB total, peers []/[] op hist [])
2015-02-06 10:48:22.656025 7f9acf967700  5 osd.40 396 heartbeat: osd_stat(62060 kB used, 2793 GB avail, 2793 GB total, peers []/[] op hist [])
2015-02-06 10:48:23.356299 7f9ae76c7700  5 osd.40 396 tick
2015-02-06 10:48:23.356308 7f9ae76c7700 10 osd.40 396 do_waiters -- start
2015-02-06 10:48:23.356310 7f9ae76c7700 10 osd.40 396 do_waiters -- finish
2015-02-06 10:48:24.356114 7f9acf967700 20 osd.40 396 update_osd_stat osd_stat(62060 kB used, 2793 GB avail, 2793 GB total, peers []/[] op hist [])

in the osd log file.

What is ceph-osd doing that recreating the osd in the crush map changes?

Thanks for any enlightenment on this.

-Gaylord

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
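For reference, the steps above can be collected into one small script. This is only a hedged sketch of the procedure described in the post: the OSD id, weight, and hostname are placeholders, and it runs in dry-run mode (RUN=echo) by default so it just prints the commands it would execute.

```shell
#!/bin/sh
# Dry-run sketch of the drive-replacement procedure described above.
# OSD id, weight, and hostname are placeholders; RUN=echo only prints
# the commands. Clear RUN (RUN=) on a real cluster to execute them.
RUN="${RUN:-echo}"

replace_osd() {
    id=$1; weight=$2; host=$3
    $RUN service ceph stop osd.$id
    $RUN ceph osd out osd.$id
    # ...physically swap the drive, then rebuild the osd directory...
    $RUN ceph-osd -i $id --mkfs
    # ...re-register the osd's key (step 3 in the list above)...
    $RUN ceph osd crush rm osd.$id
    $RUN ceph osd crush add osd.$id $weight root=default host=$host
    $RUN ceph osd in osd.$id
    $RUN service ceph start osd.$id
}

PLAN=$(replace_osd 40 2.73 cephnode1)
echo "$PLAN"
```

Running it as-is prints the command plan for osd.40 without touching anything.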
[ceph-users] Openstack Instances and RBDs
http://www.sebastien-han.fr/blog/2013/06/03/ceph-integration-in-openstack-grizzly-update-and-roadmap-for-havana/ suggests it is possible to run OpenStack instances (not only images) off of RBDs in Grizzly and Havana (which I'm running), and to use RBDs in lieu of a shared file system.

I've followed http://ceph.com/docs/next/rbd/libvirt/ but I can only get boot-from-volume to work. Instances are still being housed in /var/lib/nova/instances, making live-migration a non-starter.

Is there a better guide for running OpenStack instances out of RBDs, or is it just not ready yet?

Thanks,
-Gaylord
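For what it's worth, in the Havana era the knob that moves ephemeral instance disks out of /var/lib/nova/instances and into RBD lives in nova.conf on each compute node. The fragment below is a hedged sketch, not a verified recipe: the pool name, user name, and secret uuid are assumptions, and the option names changed in later releases, so check the ceph/OpenStack integration docs for your exact versions.

```ini
# nova.conf (each compute node) -- Havana-era option names, assumed setup
libvirt_images_type = rbd
libvirt_images_rbd_pool = vms
libvirt_images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <your libvirt secret uuid>
```

With libvirt_images_type=rbd, new instances' ephemeral disks should be RBD images rather than files under /var/lib/nova/instances, which is what makes live migration feasible.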
Re: [ceph-users] Non-Ceph cluster name
Works perfectly. My only gripe is that --cluster isn't listed as a valid argument in ceph-mon --help, and the only reference a search for --cluster turns up in the ceph documentation is in regards to ceph-rest-api. Shall I file a bug to correct the documentation?

Thanks again for the quick and accurate response.

-Gaylord

On 10/24/2013 08:11 AM, Sage Weil wrote:
> Try passing --cluster csceph instead of the config file path and I suspect it will work.
>
> sage
>
> Gaylord Holder wrote:
>> I'm trying to bring up a ceph cluster not named ceph. I'm running version 0.61.
>>
>> From my reading of the documentation, the $cluster metavariable is set by the basename of the configuration file: specifying the configuration file "/etc/ceph/mycluster.conf" sets the $cluster metavariable to "mycluster".
>>
>> However, given a configuration file /etc/ceph/csceph.conf:
>>
>> [global]
>> fsid = 70d421fe-28ca-4804-bce8-d51a16b531ec
>> mon host = 192.168.124.202
>> mon_initial_members = a
>>
>> [mon.a]
>> host = monnode
>> mon addr = 192.168.124.202:6789
>>
>> and running:
>>
>> ceph-authtool csceph.mon.keyring --create-keyring --name=mon. --gen-key --cap mon 'allow *'
>> ceph-mon -c /etc/ceph/csceph.conf --mkfs -i a --keyring csceph.mon.keyring
>>
>> ceph-mon tries to create its monfs in /var/lib/ceph/mon/ceph-a, not /var/lib/ceph/mon/csceph-a as expected.
>>
>> Thank you for any help you can give.
>>
>> Cheers,
>> -Gaylord
[ceph-users] Non-Ceph cluster name
I'm trying to bring up a ceph cluster not named ceph. I'm running version 0.61.

From my reading of the documentation, the $cluster metavariable is set by the basename of the configuration file: specifying the configuration file "/etc/ceph/mycluster.conf" sets the $cluster metavariable to "mycluster".

However, given a configuration file /etc/ceph/csceph.conf:

[global]
fsid = 70d421fe-28ca-4804-bce8-d51a16b531ec
mon host = 192.168.124.202
mon_initial_members = a

[mon.a]
host = monnode
mon addr = 192.168.124.202:6789

and running:

ceph-authtool csceph.mon.keyring --create-keyring --name=mon. --gen-key --cap mon 'allow *'
ceph-mon -c /etc/ceph/csceph.conf --mkfs -i a --keyring csceph.mon.keyring

ceph-mon tries to create its monfs in /var/lib/ceph/mon/ceph-a, not /var/lib/ceph/mon/csceph-a as expected.

Thank you for any help you can give.

Cheers,
-Gaylord
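A tiny illustration of the naming convention at issue: the $cluster metavariable is what gets substituted into the default data paths, so mon `a` of a cluster named `csceph` should end up under /var/lib/ceph/mon/csceph-a. The path template below is the documented default; adjust if your build was configured with different directories.

```shell
# Toy illustration of the $cluster metavariable in ceph's default paths.
# The template is the documented default mon data dir: /var/lib/ceph/mon/$cluster-$id
mon_data_dir() { cluster=$1; id=$2; echo "/var/lib/ceph/mon/$cluster-$id"; }

mon_data_dir ceph a      # -> /var/lib/ceph/mon/ceph-a   (default cluster name)
mon_data_dir csceph a    # -> /var/lib/ceph/mon/csceph-a (what --cluster csceph should give)
```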
Re: [ceph-users] How many rbds can you map?
Always nice to see I've hit a real problem, and not just my being dumb.

-Gaylord

On 10/08/2013 01:46 PM, Gregory Farnum wrote:
> I believe this is a result of how we used the kernel interfaces (allocating a major device ID for each RBD volume), and some kernel limits (only 8 bits for storing major device IDs, and some used for other purposes). See http://tracker.ceph.com/issues/5048
>
> I believe we have discussed not using a major device ID for each mounted RBD volume, but I don't remember the details as they involved kernel-fu beyond what I'm familiar with.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Tue, Oct 8, 2013 at 10:19 AM, Gaylord Holder wrote:
>> I'm testing how many rbds I can map on a single server. I've created 10,000 rbds in the rbd pool, but I can only actually map 230. Mapping the 230th one fails with:
>>
>> rbd: add failed: (16) Device or resource busy
>>
>> Is there a way to bump this up?
>>
>> -Gaylord
[ceph-users] How many rbds can you map?
I'm testing how many rbds I can map on a single server. I've created 10,000 rbds in the rbd pool, but I can only actually map 230. Mapping the 230th one fails with:

rbd: add failed: (16) Device or resource busy

Is there a way to bump this up?

-Gaylord
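The ~230 ceiling is consistent with the kernel's 8-bit block-device major numbers, since the rbd driver of this era consumed one major per mapped image. A back-of-the-envelope check; the count of majors reserved by other drivers is an assumption here, not a figure from the kernel source.

```python
# Back-of-the-envelope check on the ~230 mapped-rbd ceiling.
# Old rbd.ko allocated one block-device *major* number per mapped image,
# and majors are 8 bits wide (0-255).  How many of those 256 values are
# already claimed by other drivers is an ASSUMPTION here (~25), which
# leaves roughly 230 usable -- about the limit observed in the post.
TOTAL_MAJORS = 1 << 8   # 8-bit major device numbers
RESERVED = 25           # assumed: majors reserved for other drivers

usable = TOTAL_MAJORS - RESERVED
print(usable)
```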
Re: [ceph-users] Full OSD questions
On 09/22/2013 02:12 AM, yy-nm wrote:
> On 2013/9/10 6:38, Gaylord Holder wrote:
>> Indeed, that pool was created with the default 8 pg_nums.
>>
>> 8 pg_num * 2T/OSD / 2 repl ~ 8TB, which is about how far I got.
>>
>> I bumped up the pg_num to 600 for that pool and nothing happened. I bumped up the pgp_num to 600 for that pool and ceph started shifting things around.
>>
>> Can you explain the difference between pg_num and pgp_num to me? I can't understand the distinction.
>>
>> Thank you for your help!
>> -Gaylord
>>
>> On 09/09/2013 04:58 PM, Samuel Just wrote:
>>> This is usually caused by having too few pgs. Each pool with a significant amount of data needs at least around 100 pgs/osd.
>>> -Sam
>>>
>>> On Mon, Sep 9, 2013 at 10:32 AM, Gaylord Holder wrote:
>>>> I'm starting to load up my ceph cluster. I currently have 12 2TB drives (10 up and in, 2 defined but down and out).
>>>>
>>>> rados df says I have 8TB free, but I have 2 nearly full OSDs. I don't understand how/why these two disks are filled while the others are relatively empty.
>>>>
>>>> How do I tell ceph to spread the data around more, and why isn't it already doing it?
>>>>
>>>> Thank you for helping me understand this system better.
>>>>
>>>> Cheers,
>>>> -Gaylord
> Well, pg_num is the total number of pgs, and pgp_num is the number of pgs actually used for placement. For reference, see http://ceph.com/docs/master/rados/operations/pools/#create-a-pool

The description of pgp_num there simply says pgp_num is:

> The total number of placement groups for placement purposes.

Why is the number of placement groups different from the number of placement groups for placement purposes? When would you want them to be different?

Thank you for helping me understand this.

Cheers,
-Gaylord
[ceph-users] Understanding ceph status
There are a lot of numbers ceph status prints. Is there any documentation on what they are?

I'm particularly curious about what seems to be the total data: ceph status says I have 314TB, when by my calculation I have 24TB.

It also says:

10615 GB used, 8005 GB / 18621 GB avail;

which I take to be 10TB used, 8TB available for use, and 18TB total available. This doesn't make sense to me, as I have 24TB raw and, with default 2x replication, I should only have 12TB available??

I see MB/s, K/s, o/s, but what are E/s units?

-Gaylord
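One possible reading of those numbers, hedged: the used/avail figures in ceph status are raw bytes summed across the OSDs that are up, before replication is applied; 18621 GB raw is roughly what ten 2 TB drives present (with two of the twelve down/out), and dividing by the replica count gives the effective space for user data. A quick sanity check with the numbers from the post (the 2x replica count is the assumed default):

```python
# ceph status reports *raw* capacity summed over the in/up OSDs, before
# replication.  Dividing by the replica count gives effective capacity.
raw_total_gb = 18621   # "8005 GB / 18621 GB avail" from ceph status
raw_used_gb  = 10615   # "10615 GB used"
replicas     = 2       # assumed default 2x replication

effective_total = raw_total_gb / replicas   # ~9.3 TB usable for user data
effective_used  = raw_used_gb / replicas    # ~5.3 TB of user data stored
print(effective_total, effective_used)      # 9310.5 5307.5
```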
Re: [ceph-users] Full OSD questions
Indeed, that pool was created with the default 8 pg_nums.

8 pg_num * 2T/OSD / 2 repl ~ 8TB, which is about how far I got.

I bumped up the pg_num to 600 for that pool and nothing happened. I bumped up the pgp_num to 600 for that pool and ceph started shifting things around.

Can you explain the difference between pg_num and pgp_num to me? I can't understand the distinction.

Thank you for your help!

-Gaylord

On 09/09/2013 04:58 PM, Samuel Just wrote:
> This is usually caused by having too few pgs. Each pool with a significant amount of data needs at least around 100 pgs/osd.
> -Sam
>
> On Mon, Sep 9, 2013 at 10:32 AM, Gaylord Holder wrote:
>> I'm starting to load up my ceph cluster. I currently have 12 2TB drives (10 up and in, 2 defined but down and out).
>>
>> rados df says I have 8TB free, but I have 2 nearly full OSDs. I don't understand how/why these two disks are filled while the others are relatively empty.
>>
>> How do I tell ceph to spread the data around more, and why isn't it already doing it?
>>
>> Thank you for helping me understand this system better.
>>
>> Cheers,
>> -Gaylord
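The pg_num vs pgp_num distinction can be seen with a toy model: pg_num is how many PGs exist (how the pool's data is split), while pgp_num is how many distinct placement seeds the placement function uses; until pgp_num is raised, the extra PGs reuse the old seeds and nothing moves. The hash-based placement below is a crude stand-in for CRUSH, not the real algorithm.

```python
import hashlib

def osd_for(pg, pgp_num, n_osds=12):
    """Toy stand-in for CRUSH: placement depends only on pg % pgp_num,
    so at most pgp_num distinct placements exist no matter how many PGs do."""
    seed = pg % pgp_num  # crude analogue of ceph's placement-seed folding
    h = int(hashlib.sha1(str(seed).encode()).hexdigest(), 16)
    return h % n_osds

# 600 PGs but only 8 placement seeds: at most 8 distinct OSD targets,
# which is why bumping pg_num alone moved no data.
before = {osd_for(pg, pgp_num=8) for pg in range(600)}

# Raising pgp_num to 600 gives every PG its own seed: data spreads out,
# which is why bumping pgp_num made ceph start shifting things around.
after = {osd_for(pg, pgp_num=600) for pg in range(600)}
print(len(before), len(after))
```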
[ceph-users] Full OSD questions
I'm starting to load up my ceph cluster. I currently have 12 2TB drives (10 up and in, 2 defined but down and out).

rados df says I have 8TB free, but I have 2 nearly full OSDs. I don't understand how/why these two disks are filled while the others are relatively empty.

How do I tell ceph to spread the data around more, and why isn't it already doing it?

Thank you for helping me understand this system better.

Cheers,
-Gaylord
[ceph-users] RBD map question
Is it possible to know whether an RBD is mapped by a machine?

-Gaylord
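One hedged approach: a client that has an rbd image mapped holds a watch on the image's header object, so listing the watchers on that object reveals who has it mapped. The sketch below assumes a format-1 image (header object named "<image>.rbd") and runs in dry-run mode by default.

```shell
# Hedged sketch: list watchers on an rbd image's header object.
# A client with the image mapped holds a watch on it.  Assumes a
# format-1 image whose header object is "<image>.rbd".
RUN="${RUN:-echo}"   # dry run by default; set RUN= to really query

check_watchers() { pool=$1; image=$2; $RUN rados -p "$pool" listwatchers "$image.rbd"; }

CMD=$(check_watchers rbd myimage)
echo "$CMD"
```

On a live cluster (RUN=), the output lists watcher addresses, i.e. the clients mapping the image.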
Re: [ceph-users] How to force lost PGs
Awesome, Sage! I knew I had lost data. I'm trying to find out what will happen when the worst happens (like the ceph administrator is an idiot).

So those PGs are hanging around in an OSD/pool somewhere with some kind of reference count, and they just need to be recreated?

Thanks again for unsticking me.

-Gaylord

On 09/03/2013 10:44 AM, Sage Weil wrote:
> On Sun, 1 Sep 2013, Gaylord Holder wrote:
>> I created a pool with no replication and an RBD within that pool. I mapped the RBD to a machine, formatted it with a file system and dumped data on it.
>>
>> Just to see what kind of trouble I can get into, I stopped the OSD the RBD was using, marked the OSD as out, and reformatted the OSD tree. When I brought the OSD back up, I now have three stale PGs.
>>
>> Now I'm trying to clear the stale PGs. I've tried removing the OSD from the crush maps, the OSD lists, etc., without any luck.
>
> Note that this means that you destroyed all copies of those 3 PGs, which means this experiment lost data.
>
> You can make ceph recreate the PGs (empty!) with
>
> ceph pg force_create_pg
>
> sage
>
>> Running
>>
>> ceph pg 3.1 query
>> ceph pg 3.1 mark_unfound_lost revert
>>
>> ceph explains it doesn't have a PG 3.1.
>>
>> Running
>>
>> ceph osd repair osd.1
>>
>> hangs after pg 2.3e.
>>
>> Running
>>
>> ceph osd lost 1 --yes-i-really-mean-it
>>
>> nukes the osd. Rebuilding osd.1 goes fine, but I still have 3 stale PGs.
>>
>> Any help clearing these stale PGs would be appreciated.
>>
>> Thanks,
>> -Gaylord
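Sage's fix lends itself to a small loop when several PGs are gone. A hedged, dry-run sketch; the pg ids below are examples, and recreated PGs come back empty, so this is only for data you have accepted is lost.

```shell
# Dry-run sketch of recreating destroyed PGs with force_create_pg.
# Recreated PGs come back EMPTY -- only do this for data you have
# accepted is gone.  The pg ids here are examples, not real ones.
RUN="${RUN:-echo}"   # dry run by default; set RUN= to really execute

recreate_pgs() {
    for pgid in "$@"; do
        $RUN ceph pg force_create_pg "$pgid"
    done
}

PLAN=$(recreate_pgs 3.1 3.4 3.7)
echo "$PLAN"
```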
[ceph-users] How to force lost PGs
I created a pool with no replication and an RBD within that pool. I mapped the RBD to a machine, formatted it with a file system and dumped data on it.

Just to see what kind of trouble I can get into, I stopped the OSD the RBD was using, marked the OSD as out, and reformatted the OSD tree. When I brought the OSD back up, I now have three stale PGs.

Now I'm trying to clear the stale PGs. I've tried removing the OSD from the crush maps, the OSD lists, etc., without any luck.

Running

ceph pg 3.1 query
ceph pg 3.1 mark_unfound_lost revert

ceph explains it doesn't have a PG 3.1.

Running

ceph osd repair osd.1

hangs after pg 2.3e.

Running

ceph osd lost 1 --yes-i-really-mean-it

nukes the osd. Rebuilding osd.1 goes fine, but I still have 3 stale PGs.

Any help clearing these stale PGs would be appreciated.

Thanks,
-Gaylord
[ceph-users] RBD Mapping
Is it possible to find out which machines are mapping an RBD?

-Gaylord
Re: [ceph-users] Ceph pgs stuck or degraded.
If I understand what the #tunables page is saying, changing the tunables kicks the OSD re-balancing mechanism a bit and resets it to try again.

I'll see about getting a 3.9 kernel in for my RBD machines, and reset everything to optimal.

Thanks again.

-Gaylord

On 07/22/2013 04:51 PM, Sage Weil wrote:
> On Mon, 22 Jul 2013, Gaylord Holder wrote:
>> Sage,
>>
>> The crush tunables did the trick. Why? Could you explain what was causing the problem?
>
> This has a good explanation, I think:
>
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
>> I haven't installed 3.9 on my RBD servers yet. Will setting crush tunables back to default or legacy cause me similar problems in the future?
>
> Yeah. For 3.6+ kernels, you can set slightly different tunables and it will be very close to optimal...
>
> sage
>
>> Thank you again Sage!
>>
>> -Gaylord
>>
>> On 07/22/2013 02:27 PM, Sage Weil wrote:
>>> On Mon, 22 Jul 2013, Gaylord Holder wrote:
>>>> I have a 12 OSD/3 host setup, and have been stuck with a bunch of stuck pgs. I've verified the OSDs are all up and in. The crushmap looks fine. I've tried restarting all the daemons.
>>>>
>>>> root@never:/var/lib/ceph/mon# ceph status
>>>> health HEALTH_WARN 139 pgs degraded; 461 pgs stuck unclean; recovery 216/6213 degraded (3.477%)
>>>> monmap e4: 2 mons at {a=192.168.225.9:6789/0,b=192.168.225.10:6789/0}, election epoch 14, quorum 0,1 a,b
>>>
>>> Add another monitor; right now if 1 fails the cluster is unavailable.
>>>
>>>> osdmap e238: 12 osds: 12 up, 12 in
>>>> pgmap v7396: 2528 pgs: 2067 active+clean, 322 active+remapped, 139 active+degraded; 8218 MB data, 103 GB used, 22241 GB / 22345 GB avail; 216/6213 degraded (3.477%)
>>>> mdsmap e1: 0/0/1 up
>>>
>>> My guess is crush tunables. Try
>>>
>>> ceph osd crush tunables optimal
>>>
>>> unless you are using a pre-3.8(ish) kernel or other very old (pre-bobtail) clients.
>>>
>>> sage
>>>
>>>> I have one non-default pool with 3x replication. Fewer than half of the pgs have expanded to 3x (278/400 pgs still have acting 2x sets).
>>>>
>>>> Where can I go look for the trouble?
>>>>
>>>> Thank you for any light someone can shed on this.
>>>>
>>>> Cheers,
>>>> -Gaylord
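The rule of thumb from this thread can be written down as a toy helper: optimal tunables want a 3.9-ish kernel on every kernel-rbd client, while older clients should stay on legacy (or the intermediate profiles Sage mentions for 3.6+). The 3.9 cutoff here is taken from the thread itself, not from an authoritative compatibility table, so double-check against the crush-map documentation for your release.

```shell
# Toy helper encoding this thread's rule of thumb: which crush tunables
# profile is safe for a given kernel-rbd client kernel version.
# The 3.9 cutoff comes from the thread, not an authoritative table.
tunables_for_kernel() {
    major=${1%%.*}
    minor=${1#*.}; minor=${minor%%.*}
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }; then
        echo optimal
    else
        echo legacy
    fi
}

tunables_for_kernel 3.9   # -> optimal
tunables_for_kernel 3.2   # -> legacy
```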
Re: [ceph-users] Ceph pgs stuck or degraded.
Sage,

The crush tunables did the trick. Why? Could you explain what was causing the problem?

I haven't installed 3.9 on my RBD servers yet. Will setting crush tunables back to default or legacy cause me similar problems in the future?

Thank you again Sage!

-Gaylord

On 07/22/2013 02:27 PM, Sage Weil wrote:
> On Mon, 22 Jul 2013, Gaylord Holder wrote:
>> I have a 12 OSD/3 host setup, and have been stuck with a bunch of stuck pgs. I've verified the OSDs are all up and in. The crushmap looks fine. I've tried restarting all the daemons.
>>
>> root@never:/var/lib/ceph/mon# ceph status
>> health HEALTH_WARN 139 pgs degraded; 461 pgs stuck unclean; recovery 216/6213 degraded (3.477%)
>> monmap e4: 2 mons at {a=192.168.225.9:6789/0,b=192.168.225.10:6789/0}, election epoch 14, quorum 0,1 a,b
>
> Add another monitor; right now if 1 fails the cluster is unavailable.
>
>> osdmap e238: 12 osds: 12 up, 12 in
>> pgmap v7396: 2528 pgs: 2067 active+clean, 322 active+remapped, 139 active+degraded; 8218 MB data, 103 GB used, 22241 GB / 22345 GB avail; 216/6213 degraded (3.477%)
>> mdsmap e1: 0/0/1 up
>
> My guess is crush tunables. Try
>
> ceph osd crush tunables optimal
>
> unless you are using a pre-3.8(ish) kernel or other very old (pre-bobtail) clients.
>
> sage
>
>> I have one non-default pool with 3x replication. Fewer than half of the pgs have expanded to 3x (278/400 pgs still have acting 2x sets).
>>
>> Where can I go look for the trouble?
>>
>> Thank you for any light someone can shed on this.
>>
>> Cheers,
>> -Gaylord
[ceph-users] Ceph pgs stuck or degraded.
I have a 12 OSD/3 host setup, and have been stuck with a bunch of stuck pgs. I've verified the OSDs are all up and in. The crushmap looks fine. I've tried restarting all the daemons.

root@never:/var/lib/ceph/mon# ceph status
health HEALTH_WARN 139 pgs degraded; 461 pgs stuck unclean; recovery 216/6213 degraded (3.477%)
monmap e4: 2 mons at {a=192.168.225.9:6789/0,b=192.168.225.10:6789/0}, election epoch 14, quorum 0,1 a,b
osdmap e238: 12 osds: 12 up, 12 in
pgmap v7396: 2528 pgs: 2067 active+clean, 322 active+remapped, 139 active+degraded; 8218 MB data, 103 GB used, 22241 GB / 22345 GB avail; 216/6213 degraded (3.477%)
mdsmap e1: 0/0/1 up

I have one non-default pool with 3x replication. Fewer than half of the pgs have expanded to 3x (278/400 pgs still have acting 2x sets).

Where can I go look for the trouble?

Thank you for any light someone can shed on this.

Cheers,
-Gaylord
Re: [ceph-users] feature set mismatch
On 07/17/2013 05:49 PM, Josh Durgin wrote:
> [please keep replies on the list]
>
> On 07/17/2013 04:04 AM, Gaylord Holder wrote:
>> On 07/16/2013 09:22 PM, Josh Durgin wrote:
>>> On 07/16/2013 06:06 PM, Gaylord Holder wrote:
>>>> Now whenever I try to map an RBD to a machine, mon0 complains:
>>>>
>>>> feature set mismatch, my 2 < server's 2040002, missing 204
>>>> missing required protocol features.
>>>
>>> Your cluster is using newer crush tunables to get better data distribution, but your kernel client doesn't support that. You'll need to upgrade to linux 3.9, or set the tunables to 'legacy', which your kernel understands [1].
>>>
>>> Josh
>>>
>>> [1] http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush
>>
>> Josh,
>>
>> That was certainly the trick.
>>
>> ceph osd crush tunables legacy
>>
>> now allows me to map the rbd.
>
> To be clear, did you change the tunables before? If the upgrade enabled them somehow without your intervention, it would be a bug.

No bugs on this issue. I had changed the tunables and not connected the tunables to the protocol mismatch error messages.

Thanks again for your help.

-gaylord

>> Who needs to be running 3.9? Just the machines mounting the rbd, or everyone?
>
> Just the machines mounting it.
>
>> Is there a better place in the documentation to track the recommended kernel version than http://ceph.com/docs/next/install/os-recommendations/
>
> That and the release notes are the best places to look. Nothing incompatible with old kernels should be enabled by default, but some new features (like the crush tunables) may require newer kernel clients.
>
> Josh
[ceph-users] feature set mismatch
I had RBDs working and mapping working. Then I grew the cluster and increased the OSDs.

Now whenever I try to map an RBD to a machine, mon0 complains:

feature set mismatch, my 2 < server's 2040002, missing 204
missing required protocol features.

I don't see any other problems with the cluster, only

rbd map -p pool image

hanging.

Any help or pointers would be appreciated.

Thank you,
-Gaylord
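The mismatch line itself can be decoded: each feature is one bit, and the server bits the client lacks identify the culprit. A short check using the masks from the error above; the bit names are taken from ceph's feature header of this era and are worth double-checking against your source tree.

```python
# Decoding the "feature set mismatch" bitmasks from the error above.
# Each feature is one bit; the missing features are the server bits the
# client lacks.  Bit names from ceph's ceph_features.h (this era):
#   bit 18 = CEPH_FEATURE_CRUSH_TUNABLES
#   bit 25 = CEPH_FEATURE_CRUSH_TUNABLES2
server = 0x2040002   # "server's 2040002"
client = 0x2         # "my 2"

missing = server & ~client
bits = [b for b in range(64) if missing >> b & 1]
print(hex(missing), bits)  # 0x2040000 [18, 25]
# -- exactly the crush tunables features the old kernel client lacked,
# matching the tunables diagnosis in this thread.
```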