Re: Dynamic osd-devices selection for Ceph charm

2014-12-01 Thread Kapil Thangavelu
On Sat, Nov 29, 2014 at 11:25 AM, John McEleney 
john.mcele...@netservers.co.uk wrote:

 Hi all,

 I've been working on the Ceph charm with the intention of making it much
 more powerful when it comes to the selection of OSD devices. I wanted to
 knock a few ideas around to see what might be possible.

 The main problem I'm trying to address is that with the existing
 implementation, when a new SAS controller is added, or drive caddies get
 swapped around, drive letters (/dev/sd[a-z]) get swapped around. As the
 current charm just asks for a list of devices, and that list of devices
 is global across the entire cluster, it pretty much requires all
 machines to be identical and unchanging. I also looked into using
 /dev/disk/by-id, but found this to be too inflexible.


 Below I've pasted a patch I wrote as a stop-gap for myself. This patch
 allows you to list model numbers for your drives instead of /dev/
 devices. It then dynamically generates the list of /dev/ devices on each
 host. The patch is pretty unsophisticated, but it solves my immediate
 problem. However, I think we can do better than this.
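
As a rough sketch of that idea (not the actual patch, which appears at the
end of this mail), a hook could map configured model numbers to device
nodes by reading the model strings the kernel exposes under /sys/block;
devices_for_models() is a hypothetical helper name:

==
import glob

def devices_for_models(models):
    """Return /dev paths for whole disks whose reported model matches
    one of the configured model numbers (illustrative sketch only)."""
    matched = []
    for model_file in glob.glob('/sys/block/*/device/model'):
        with open(model_file) as f:
            model = f.read().strip()
        if any(m in model for m in models):
            name = model_file.split('/')[3]   # e.g. 'sdc'
            matched.append('/dev/' + name)
    return sorted(matched)

# e.g. devices_for_models(['MG03SCA400']) -> ['/dev/sdc', ..., '/dev/sdl']
==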


 I've been thinking that xpath strings might be a better way to go. I
 played around with this idea a little. This will give some idea how it
 could work:


 ==
 root@ceph-store1:~# lshw -xml -class disk > /tmp/disk.xml
 root@ceph-store1:~# echo 'cat //node[contains(product,"MG03SCA400")]/logicalname/text()' | xmllint --shell /tmp/disk.xml | grep '^/dev/'
 /dev/sdc
 /dev/sdd
 /dev/sde
 /dev/sdf
 /dev/sdg
 /dev/sdh
 /dev/sdi
 /dev/sdj
 /dev/sdk
 /dev/sdl
 ==
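
A minimal sketch of how a charm hook might evaluate such an XPath
expression against the lshw output, assuming the python-lxml package is
available on the unit (the standard library's xml.etree only supports a
small XPath subset and has no contains()):

==
import subprocess
from lxml import etree  # full XPath 1.0 support, unlike xml.etree

def osd_devices_from_xpath(xpath):
    """Evaluate a configured XPath against 'lshw -xml -class disk'
    (sketch only; lshw normally needs to run as root)."""
    xml = subprocess.check_output(['lshw', '-xml', '-class', 'disk'])
    tree = etree.fromstring(xml)
    # keep only /dev/... results, mirroring the grep in the shell example
    return [str(v) for v in tree.xpath(xpath) if str(v).startswith('/dev/')]

# e.g. osd_devices_from_xpath(
#          "//node[contains(product,'MG03SCA400')]/logicalname/text()")
==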

 So, that takes care of selecting by model number. How about selecting
 drives that are larger than 3TB?

 ==
 root@ceph-store1:~# echo 'cat //node[size>3000000000000]/logicalname/text()' | xmllint --shell /tmp/disk.xml | grep '^/dev/'
 /dev/sdc
 /dev/sdd
 /dev/sde
 /dev/sdf
 /dev/sdg
 /dev/sdh
 /dev/sdi
 /dev/sdj
 /dev/sdk
 /dev/sdl
 ==

 Just to give some idea of the power of this, take a look at the info
 lshw compiles:

   <node id="disk:3" claimed="true" class="disk" handle="GUID:-a5c7-4657-924d-8ed94e1b1aaa">
    <description>SCSI Disk</description>
    <product>MG03SCA400</product>
    <vendor>TOSHIBA</vendor>
    <physid>0.3.0</physid>
    <businfo>scsi@1:0.3.0</businfo>
    <logicalname>/dev/sdf</logicalname>
    <dev>8:80</dev>
    <version>DG02</version>
    <serial>X470A0XX</serial>
    <size units="bytes">4000787030016</size>
    <capacity units="bytes">5334969415680</capacity>
    <configuration>
     <setting id="ansiversion" value="6" />
     <setting id="guid" value="-a5c7-4657-924d-8ed94e1b1aaa" />
     <setting id="sectorsize" value="512" />
    </configuration>
    <capabilities>
     <capability id="7200rpm">7200 rotations per minute</capability>
     <capability id="gpt-1.00">GUID Partition Table version 1.00</capability>
     <capability id="partitioned">Partitioned disk</capability>
     <capability id="partitioned:gpt">GUID partition table</capability>
    </capabilities>
   </node>

 So, you could be selecting your drives by vendor, size, model, sector
 size, or any combination of these and other attributes.
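
For example, combining criteria is just a matter of combining predicates;
with the hypothetical helper sketched above:

==
# e.g. TOSHIBA disks larger than 3 TB, in one expression (illustrative):
osd_devices_from_xpath(
    "//node[vendor='TOSHIBA' and size>3000000000000]/logicalname/text()")
==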

 The only reason I didn't go any further with this idea yet is that lshw
 -C disk is incredibly slow. I tried messing around with disabling
 tests, but it still crawls along. I figure that this wouldn't be that
 big a deal if you could cache the resulting XML file, but that's not
 fully satisfactory either. What if I want to hot-plug a new hard drive
 into the system? lshw would need to be run again. I thought that maybe
 udev could be used for doing this, but I certainly don't want udev
 running lshw once per drive at boot time as the drives are detected.

 I'm really wondering if anyone else has any advice on either speeding up
 lshw, or if there's any other simple way of pulling this kind of
 functionality off. Maybe I'm worrying too much about this. As long as
 the charm only fires this hook rarely, and caches the data for the
 duration of the hook run, maybe I don't need to worry?
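
On the caching point: since each hook invocation is a fresh, short-lived
process, something as simple as a module-level cache would limit the cost
to one lshw run per hook. A sketch of that idea (not existing charm code):

==
import subprocess

_LSHW_XML = None

def lshw_disk_xml():
    """Run 'lshw -xml -class disk' at most once per hook invocation."""
    global _LSHW_XML
    if _LSHW_XML is None:
        _LSHW_XML = subprocess.check_output(['lshw', '-xml', '-class', 'disk'])
    return _LSHW_XML
==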



I'm wondering if, instead of lshw and the time it consumes, we could go
with lsblk. There's a fair bit of information there (size, model,
rotational, etc.), which seems to satisfy most of the lshw examples you've
given, and it's relatively fast in comparison, e.g.
https://gist.github.com/kapilt/d0485d6fac3be6caaed2
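
For reference, a sketch of what an lsblk-based equivalent might look like
(assumes a util-linux new enough to support lsblk -P/--pairs; the helper
name is made up):

==
import shlex
import subprocess

def lsblk_disks():
    """Whole-disk inventory via lsblk: name, model, size (bytes), rotational."""
    out = subprocess.check_output(
        ['lsblk', '-d', '-b', '-n', '-P', '-o', 'NAME,MODEL,SIZE,ROTA'])
    disks = []
    for line in out.decode().splitlines():
        fields = dict(kv.split('=', 1) for kv in shlex.split(line))
        disks.append({
            'device': '/dev/' + fields['NAME'],
            'model': fields.get('MODEL', '').strip(),
            'size': int(fields['SIZE']),
            'rotational': fields['ROTA'] == '1',
        })
    return disks

# e.g. [d['device'] for d in lsblk_disks()
#       if 'MG03SCA400' in d['model'] and d['size'] > 3 * 10**12]
==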

Another option: here's a script around a similar use case that gives a
hierarchical view of drives from the controller on down and supports
layered block devices:
http://www.spinics.net/lists/raid/msg34460.html
Current implementation @ https://github.com/pturmel/lsdrv/blob/master/lsdrv

cheers,

Kapil



 John

 Patch to match against model number (NOT REGRESSION TESTED):
 === modified file 'config.yaml'
 --- config.yaml 2014-10-06 22:07:41 +
 +++ config.yaml 2014-11-29 15:42:41 +
 @@ -42,16 +42,35 @@
These devices are the range of devices that will 

Re: Dynamic osd-devices selection for Ceph charm

2014-11-30 Thread Andrew Wilkins
On Sun, Nov 30, 2014 at 12:25 AM, John McEleney 
john.mcele...@netservers.co.uk wrote:

 [John's original message, quoted in full above, trimmed]


Hi John,

I don't have any particular suggestions re speeding up lshw. If you only
need a subset of the information, and speed is really important, it may be
worth just using lsblk and/or trawling /sys/block.
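
For what it's worth, a minimal sketch of the /sys/block approach
(illustrative only; /sys/block/<dev>/size is in 512-byte sectors
regardless of the device's logical sector size):

==
import os

def sys_block_disks():
    """Size and rotational flag per disk, read straight from sysfs."""
    disks = {}
    for name in os.listdir('/sys/block'):
        if name.startswith(('loop', 'ram', 'dm-')):
            continue  # skip loop/ram/device-mapper devices
        base = os.path.join('/sys/block', name)
        with open(os.path.join(base, 'size')) as f:
            sectors = int(f.read())
        with open(os.path.join(base, 'queue', 'rotational')) as f:
            rotational = f.read().strip() == '1'
        disks['/dev/' + name] = {'size': sectors * 512,
                                 'rotational': rotational}
    return disks
==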

I'm mainly replying because I wanted to let you know that we're working on
adding storage capabilities to Juju now. Charms (such as ceph) will be able
to indicate that they require storage (in this case, block devices), and
when you deploy the charm you'll be able to indicate how that storage
should be provisioned. Often that will just be a count and size
specification (e.g. deploy ceph with three 1TB disks assigned to the
osd-devices storage). You will also be able to dynamically allocate
storage, including hot-plugged physically attached disks. Juju will
periodically list the block devices available on each machine, and a CLI
will be introduced to list them, and