Re: Dynamic osd-devices selection for Ceph charm
On Sat, Nov 29, 2014 at 11:25 AM, John McEleney john.mcele...@netservers.co.uk wrote:

Hi all,

I've been working on the Ceph charm with the intention of making it much more powerful when it comes to the selection of OSD devices. I wanted to knock a few ideas around to see what might be possible.

The main problem I'm trying to address is that with the existing implementation, when a new SAS controller is added, or drive caddies get swapped around, drive letters (/dev/sd[a-z]) get swapped around. As the current charm just asks for a list of devices, and that list of devices is global across the entire cluster, it pretty much requires all machines to be identical and unchanging. I also looked into using /dev/disk/by-id, but found this to be too inflexible.

Below I've pasted a patch I wrote as a stop-gap for myself. This patch allows you to list model numbers for your drives instead of /dev/ devices. It then dynamically generates the list of /dev/ devices on each host. The patch is pretty unsophisticated, but it solves my immediate problem.

However, I think we can do better than this. I've been thinking that XPath strings might be a better way to go. I played around with this idea a little. This will give some idea how it could work:

==
root@ceph-store1:~# lshw -xml -class disk > /tmp/disk.xml
root@ceph-store1:~# echo 'cat //node[contains(product,"MG03SCA400")]/logicalname/text()' | xmllint --shell /tmp/disk.xml | grep '^/dev/'
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/sdi
/dev/sdj
/dev/sdk
/dev/sdl
==

So, that takes care of selecting by model number. How about selecting drives that are larger than 3TB?

==
root@ceph-store1:~# echo 'cat //node[size>3000000000000]/logicalname/text()' | xmllint --shell /tmp/disk.xml | grep '^/dev/'
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/sdi
/dev/sdj
/dev/sdk
/dev/sdl
==

Just to give some idea of the power of this, take a look at the info lshw compiles:

<node id="disk:3" claimed="true" class="disk" handle="GUID:-a5c7-4657-924d-8ed94e1b1aaa">
  <description>SCSI Disk</description>
  <product>MG03SCA400</product>
  <vendor>TOSHIBA</vendor>
  <physid>0.3.0</physid>
  <businfo>scsi@1:0.3.0</businfo>
  <logicalname>/dev/sdf</logicalname>
  <dev>8:80</dev>
  <version>DG02</version>
  <serial>X470A0XX</serial>
  <size units="bytes">4000787030016</size>
  <capacity units="bytes">5334969415680</capacity>
  <configuration>
    <setting id="ansiversion" value="6" />
    <setting id="guid" value="-a5c7-4657-924d-8ed94e1b1aaa" />
    <setting id="sectorsize" value="512" />
  </configuration>
  <capabilities>
    <capability id="7200rpm">7200 rotations per minute</capability>
    <capability id="gpt-1.00">GUID Partition Table version 1.00</capability>
    <capability id="partitioned">Partitioned disk</capability>
    <capability id="partitioned:gpt">GUID partition table</capability>
  </capabilities>
</node>

So, you could be selecting your drives by vendor, size, model, sector size, or any combination of these and other attributes.

The only reason I didn't go any further with this idea yet is that lshw -C disk is incredibly slow. I tried messing around with disabling tests, but it still crawls along. I figure that this wouldn't be that big a deal if you could cache the resulting XML file, but that's not fully satisfactory either. What if I want to hot-plug a new hard drive into the system? lshw would need to be run again. I thought that maybe udev could be used for doing this, but I certainly don't want udev running lshw once per drive at boot time as the drives are detected.
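Just to sketch how a charm hook might consume that XML without shelling out to xmllint, here's a rough Python illustration. The cache path and helper names are invented for the example, and it assumes the lshw XML parses as a single document (older lshw versions may need the output wrapped in a root element first):

==
import os
import subprocess
import xml.etree.ElementTree as ET

CACHE = '/var/cache/lshw-disk.xml'  # illustrative cache location, not charm code

def disk_nodes(refresh=False):
    """Run 'lshw -xml -class disk' once and cache the result for the hook run."""
    if refresh or not os.path.exists(CACHE):
        with open(CACHE, 'wb') as f:
            f.write(subprocess.check_output(['lshw', '-xml', '-class', 'disk']))
    tree = ET.parse(CACHE)
    # lshw nests <node> elements; keep only the ones describing disks
    return [n for n in tree.iter('node') if n.get('class') == 'disk']

def osd_devices(product=None, min_bytes=None):
    """Return /dev paths for disks matching a product substring and/or a minimum size."""
    devices = []
    for node in disk_nodes():
        name = node.findtext('logicalname')
        prod = node.findtext('product', default='')
        size = int(node.findtext('size', default='0'))
        if not name:
            continue
        if product and product not in prod:
            continue
        if min_bytes and size < min_bytes:
            continue
        devices.append(name)
    return devices

# e.g. every MG03SCA400, or anything over 3TB
print(osd_devices(product='MG03SCA400'))
print(osd_devices(min_bytes=3 * 10**12))
==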
I'm really wondering if anyone else has any advice on either speeding up lshw, or whether there's any other simple way of pulling this kind of functionality off. Maybe I'm worrying too much about this. As long as the charm only fires this hook rarely, and caches the data for the duration of the hook run, maybe I don't need to worry?

I'm wondering if, instead of lshw and the time it consumes, we could continue with lsblk. There's a bit more information there (size, model, rotational, etc.), which seems to satisfy most of the lshw examples you've given, and it is relatively fast in comparison; i.e. https://gist.github.com/kapilt/d0485d6fac3be6caaed2 (a rough sketch of that approach follows John's patch below).

Another option: here's a script around a similar use case that gives hierarchical info on drives from the controller on down and supports layered block devices: http://www.spinics.net/lists/raid/msg34460.html
Current implementation @ https://github.com/pturmel/lsdrv/blob/master/lsdrv

cheers,
Kapil

John

Patch to match against model number (NOT REGRESSION TESTED):

=== modified file 'config.yaml'
--- config.yaml 2014-10-06 22:07:41 +0000
+++ config.yaml 2014-11-29 15:42:41 +0000
@@ -42,16 +42,35 @@
 These devices are the range of devices that will
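For comparison, here is a rough sketch of the lsblk route Kapil mentions (Python again; it assumes an lsblk that supports -P key="value" pairs output and -b for byte sizes, and it's only meant to illustrate the idea, not mirror the gist exactly):

==
import shlex
import subprocess

def list_block_devices():
    """Parse 'lsblk -P' output into one dict per whole disk."""
    out = subprocess.check_output(
        ['lsblk', '-d', '-b', '-P', '-o', 'NAME,MODEL,SIZE,ROTA,TYPE'])
    devices = []
    for line in out.decode().splitlines():
        # Each line looks like: NAME="sdc" MODEL="MG03SCA400" SIZE="4000787030016" ...
        fields = dict(kv.split('=', 1) for kv in shlex.split(line))
        if fields.get('TYPE') == 'disk':
            devices.append(fields)
    return devices

def select(model=None, min_bytes=None, rotational=None):
    """Pick /dev paths by model substring, minimum size and/or rotational flag."""
    picked = []
    for dev in list_block_devices():
        if model and model not in dev.get('MODEL', ''):
            continue
        if min_bytes and int(dev.get('SIZE', '0')) < min_bytes:
            continue
        if rotational is not None and dev.get('ROTA') != ('1' if rotational else '0'):
            continue
        picked.append('/dev/' + dev['NAME'])
    return picked

print(select(model='MG03SCA400'))
print(select(min_bytes=3 * 10**12, rotational=True))
==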
Re: Dynamic osd-devices selection for Ceph charm
On Sun, Nov 30, 2014 at 12:25 AM, John McEleney john.mcele...@netservers.co.uk wrote:

[...]
I'm really wondering if anyone else has any advice on either speeding up lshw, or if there's any other simple way of pulling this kind of functionality off. Maybe I'm worrying too much about this. As long as the charm only fires this hook rarely, and caches the data for the duration of the hook run, maybe I don't need to worry?

Hi John,

I don't have any particular suggestions re speeding up lshw. If you only need a subset of the information, and speed is really important, it may be worth just using lsblk and/or trawling /sys/block (a rough sketch of the latter is at the end of this message).

I'm mainly replying because I wanted to let you know that we're working on adding storage capabilities to Juju now. Charms (such as ceph) will be able to indicate that they require storage (in this case, block devices), and when you deploy the charm you'll be able to indicate how that storage should be provisioned. Often that will just be a count and size specification (e.g. deploy ceph with three 1TB disks assigned to the osd-devices storage). You will also be able to dynamically allocate storage, including hot-plugged physically attached disks. Juju will periodically list the block devices available on each machine, and a CLI will be introduced to list them, and
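To illustrate the /sys/block option, here's a rough sketch (plain Python; the sysfs attributes used are the standard kernel ones, but the helper itself is just an illustration, not anything Juju or the charm ships):

==
import os

SECTOR = 512  # /sys/block/<dev>/size is reported in 512-byte sectors

def sysfs_disks():
    """Walk /sys/block and yield (device, model, size_bytes) for physical disks."""
    for name in sorted(os.listdir('/sys/block')):
        base = os.path.join('/sys/block', name)
        # Virtual devices (loop, ram, ...) have no 'device' link; skip them
        if not os.path.exists(os.path.join(base, 'device')):
            continue
        try:
            with open(os.path.join(base, 'size')) as f:
                size = int(f.read().strip()) * SECTOR
            with open(os.path.join(base, 'device', 'model')) as f:
                model = f.read().strip()
        except (IOError, ValueError):
            continue
        yield ('/dev/' + name, model, size)

# e.g. list every disk larger than 3TB
for dev, model, size in sysfs_disks():
    if size > 3 * 10**12:
        print(dev, model, size)
==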