Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Craig Lewis
I'm running a small cluster, but I'll chime in since nobody else has.

Cern had a presentation a while ago (dumpling time-frame) about their
deployment.  They go over some of your questions:
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

My philosophy on Config Management is that it should save me time.  If it's
going to take me longer to write a recipe to do something, I'll just do it
by hand. Since my cluster is small, there are many things I can do faster
by hand.  This may or may not work for you, depending on your documentation
/ repeatability requirements.  For things that need to be documented, I'll
usually write the recipe anyway (I accept Chef recipes as documentation).


For my clusters, I'm using Chef to set up all nodes and manage ceph.conf.
I manually manage my pools, CRUSH map, RadosGW users, and disk
replacement.  I was using Chef to add new disks, but I ran into load
problems due to my small cluster size.  I'm currently adding disks
manually, to manage cluster load better.  As my cluster gets larger,
that'll be less important.

I'm also doing upgrades manually, because it's less work than writing the
Chef recipe to do a cluster upgrade.  Since Chef isn't cluster aware, it
would be a pain to make the recipe cluster-aware enough to handle the
upgrade.  And I figure if I stall long enough, somebody else will write it
:-)  Ansible, with its cluster-wide coordination, looks like it would
handle that a bit better.



Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Steve Anthony
For reference, I'm currently running 26 nodes (338 OSDs); will be 35
nodes (455 OSDs) in the near future.

Node/OSD provisioning and replacements:

Mostly I'm using ceph-deploy, at least to do node/osd adds and
replacements. Right now the process is:

Use FAI (http://fai-project.org) to set up software RAID1/LVM for the OS
disks, and do a minimal installation, including the salt-minion.

Accept the new minion on the salt-master node and deploy the
configuration: LDAP auth, nrpe, the diamond collector, udev configuration,
a custom Python disk-add script, and everything on the Ceph preflight page
(http://ceph.com/docs/firefly/start/quick-start-preflight/).

Insert the journals into the case. Udev triggers my python code, which
partitions the SSDs and fires a Prowl alert (http://www.prowlapp.com/)
to my phone when it's finished.

Insert the OSDs into the case. Same thing, udev triggers the python
code, which selects the next available partition on the journals so OSDs
go on journal1partA, journal2partA, journal3partA, journal1partB,... for
the three journals in each node. The code then fires a salt event at the
master node with the OSD dev path, journal /dev/by-id/ path and node
hostname. The salt reactor on the master node takes this event and runs
a script on the admin node which passes those parameters to ceph-deploy,
which does the OSD deployment. Send Prowl alert on success or fail with
details.
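
The actual script isn't included here, but a trimmed-down sketch of that
kind of udev-triggered helper might look like the following. The journal
device names, the event tag, and the JSON payload shape are all made up
for illustration; only the round-robin partition selection and the
salt-call event.fire_master call reflect the flow described above.

#!/usr/bin/env python
# Sketch of a udev-triggered helper along the lines described above: pick
# the next free journal partition round-robin across the journal SSDs and
# fire a salt event at the master. Device names and the event tag are
# placeholders.
import glob
import os
import socket
import subprocess
import sys

JOURNAL_DISKS = [
    '/dev/disk/by-id/ata-JOURNAL_SSD_1',   # hypothetical stable names
    '/dev/disk/by-id/ata-JOURNAL_SSD_2',
    '/dev/disk/by-id/ata-JOURNAL_SSD_3',
]

def used_journal_partitions():
    """Partitions already claimed as journals by existing OSDs."""
    used = set()
    for link in glob.glob('/var/lib/ceph/osd/ceph-*/journal'):
        if os.path.islink(link):
            used.add(os.path.realpath(link))
    return used

def next_free_partition():
    """journal1partA, journal2partA, journal3partA, journal1partB, ..."""
    used = used_journal_partitions()
    for part_no in range(1, 9):
        for disk in JOURNAL_DISKS:
            part = '%s-part%d' % (disk, part_no)
            if os.path.exists(part) and os.path.realpath(part) not in used:
                return part
    return None

def main():
    osd_dev = sys.argv[1]                  # e.g. /dev/sdk, passed in by udev
    journal = next_free_partition()
    if journal is None:
        sys.exit('no free journal partition left')
    # The reactor on the salt master hands these parameters to ceph-deploy
    # on the admin node.
    subprocess.check_call([
        'salt-call', 'event.fire_master',
        '{"osd_dev": "%s", "journal": "%s", "host": "%s"}'
        % (osd_dev, journal, socket.gethostname()),
        'ceph/osd/add',                    # hypothetical event tag
    ])

if __name__ == '__main__':
    main()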

Similarly, when an OSD fails, I remove it and insert the new OSD. The
same process as above occurs. Logical removal I do manually, since I'm
not at a scale where it's common yet. Eventually, I imagine I'll write
code to trigger OSD removal on certain events using the same
event/reactor Salt framework.

Pool/CRUSH management:

Pool configuration and CRUSH management are mostly one-time operations.
That is, I'll make a change rarely and when I do it will persist in that
new state for a long time. Given that and the fact that I can make the
changes from one node and inject them into the cluster, I haven't needed
to automate that portion of Ceph as I've added more nodes, at least not yet.

Replacing journals:

I haven't had to do this yet; I'd probably remove/readd all the OSDs if
it happened today, but will be reading the post you linked.

Upgrading releases:

Change /etc/apt/sources.list.d/ceph.list to point at the new release and
push it to all the nodes with Salt. Then run salt -N 'ceph' pkg.upgrade to
upgrade the packages on all the nodes in the ceph nodegroup. Then use Salt
to restart the monitors, then the OSDs on each
node, one by one. Finally run the following command on all nodes with
Salt to verify all monitors/OSDs are using the new version:

for i in $(ls /var/run/ceph/ceph-*.asok); do echo $i; ceph --admin-daemon $i version; done
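
For illustration, the restart-and-wait part of that sequence could be
driven from the admin node roughly like the sketch below. The hostnames
and the upstart-style restart commands (ceph-mon-all, ceph-osd id=N) are
assumptions to adapt to your own init system; this isn't the exact
procedure used here.

#!/usr/bin/env python
# Rough sketch only: restart mons, then OSDs one by one per node, waiting
# for HEALTH_OK in between. Hostnames and the upstart-style "restart"
# commands are assumptions.
import subprocess
import time

MONS = ['ceph01', 'ceph02', 'ceph03']      # hypothetical
OSD_NODES = ['ceph01', 'ceph02', 'ceph03']

def salt_run(host, command):
    return subprocess.check_output(['salt', host, 'cmd.run', command])

def wait_for_health_ok():
    while True:
        if subprocess.check_output(['ceph', 'health']).startswith('HEALTH_OK'):
            return
        time.sleep(30)

def osd_ids(host):
    # Read the OSD ids off the admin sockets on that node.
    out = salt_run(host, 'ls /var/run/ceph/ | grep ceph-osd')
    return [tok.split('.')[1] for tok in out.split()
            if tok.startswith('ceph-osd')]

for mon in MONS:
    salt_run(mon, 'restart ceph-mon-all')
    wait_for_health_ok()

for node in OSD_NODES:
    for osd in osd_ids(node):
        salt_run(node, 'restart ceph-osd id=%s' % osd)
        wait_for_health_ok()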

Node decommissioning:

I have a script which enumerates all the OSDs on a given host and stores
that list in a file. Another script (run by cron every 10 minutes)
checks if the cluster health is OK, and if so pops the next OSD from
that file and executes the steps to remove it from the host, trickling
the node out of service.
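
A minimal sketch of that kind of health-gated drain job is below; it is
not the script described above, just the same idea. The list-file path
and "hostname osd_id" format, and the upstart-style stop command run via
Salt, are assumptions.

#!/usr/bin/env python
# Minimal sketch of a health-gated drain script run from cron: if the
# cluster is HEALTH_OK, pop the next "hostname osd_id" line from a file,
# stop that OSD via Salt, and remove it from the cluster. File path and
# the upstart-style stop command are assumptions.
import subprocess
import sys

OSD_LIST = '/root/decommission-osds.txt'       # hypothetical

def sh(*cmd):
    subprocess.check_call(list(cmd))

if not subprocess.check_output(['ceph', 'health']).startswith('HEALTH_OK'):
    sys.exit(0)                                # still backfilling; retry later

with open(OSD_LIST) as f:
    entries = [line.split() for line in f if line.strip()]
if not entries:
    sys.exit(0)                                # nothing left to drain
host, osd = entries.pop(0)

sh('ceph', 'osd', 'out', osd)                          # stop mapping PGs to it
sh('salt', host, 'cmd.run', 'stop ceph-osd id=%s' % osd)
sh('ceph', 'osd', 'crush', 'remove', 'osd.%s' % osd)   # triggers rebalancing
sh('ceph', 'auth', 'del', 'osd.%s' % osd)
sh('ceph', 'osd', 'rm', osd)

with open(OSD_LIST, 'w') as f:
    for entry in entries:
        f.write(' '.join(entry) + '\n')

Run from cron every few minutes, this gives the same trickle-out
behavior: one OSD per healthy interval until the file is empty.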





Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Quentin Hartman
I also have a fairly small deployment of 14 nodes, 42 OSDs, but even I use
some automation. I do my OS installs and partitioning with PXE / kickstart,
then use Chef for the baseline install of the standard server stuff in our
environment and the admin accounts. The ceph-specific stuff I handle by
hand, with ceph-deploy and some light wrapper scripts. Monitoring /
alerting is Sensu and Graphite. I tried Calamari, and it was nice, but it
produced a lot of load on the admin machine (especially relative to the
amount of work it should have been doing), and once I figured out how to
get metrics into plain Graphite, the appeal of a ceph-specific tool was
reduced substantially.
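
For what it's worth, pushing a few basic cluster metrics into plain
Graphite only takes a handful of lines. A hedged sketch: the carbon host,
metric prefix, and the exact JSON key names (which differ a bit between
Ceph releases) are assumptions, not what's actually running here.

#!/usr/bin/env python
# Hedged sketch: poll 'ceph status' and push a few basic numbers to
# Graphite over carbon's plaintext protocol. The carbon host, metric
# prefix, and exact JSON keys are assumptions.
import json
import socket
import subprocess
import time

CARBON = ('graphite.example.com', 2003)    # hypothetical
PREFIX = 'ceph.cluster'

status = json.loads(
    subprocess.check_output(['ceph', 'status', '--format', 'json']))
pgmap = status.get('pgmap', {})
osdmap = status.get('osdmap', {}).get('osdmap', {})

now = int(time.time())
metrics = {
    'bytes_used': pgmap.get('bytes_used', 0),
    'bytes_total': pgmap.get('bytes_total', 0),
    'num_pgs': pgmap.get('num_pgs', 0),
    'num_osds': osdmap.get('num_osds', 0),
    'num_up_osds': osdmap.get('num_up_osds', 0),
}

sock = socket.create_connection(CARBON)
for name, value in metrics.items():
    sock.sendall(('%s.%s %s %d\n' % (PREFIX, name, value, now)).encode())
sock.close()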

QH


[ceph-users] Managing larger ceph clusters

2015-04-15 Thread Stillwell, Bryan
I'm curious what people managing larger ceph clusters are doing with
configuration management and orchestration to simplify their lives?

We've been using ceph-deploy to manage our ceph clusters so far, but
feel that moving the management of our clusters to standard tools would
provide a little more consistency and help prevent some mistakes that
have happened while using ceph-deploy.

We're looking at using the same tools we use in our OpenStack
environment (puppet/ansible), but I'm interested in hearing from people
using chef/salt/juju as well.

Some of the cluster operation tasks that I can think of along with
ideas/concerns I have are:

Keyring management
  Seems like hiera-eyaml is a natural fit for storing the keyrings.

ceph.conf
  I believe the puppet ceph module can be used to manage this file, but
  I'm wondering if using a template (erb?) might be a better method for
  keeping it organized and properly documented.

Pool configuration
  The puppet module seems to be able to handle managing replicas and the
  number of placement groups, but I don't see support for erasure coded
  pools yet.  This is probably something we would want puppet to set up
  initially, but not something we would want it changing on a production
  cluster.
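
  For illustration, that one-shot pool setup could look something like the
  sketch below, guarded behind an "unless" in a puppet exec or just run by
  hand. The pool names, pg counts, and erasure-code profile are
  placeholders.

#!/usr/bin/env python
# Sketch of one-shot pool setup: the sort of thing to run once (or guard
# behind an "unless" in puppet) rather than let config management keep
# enforcing on a live cluster. Pool names, pg counts, and the EC profile
# are placeholders.
import subprocess

def sh(*cmd):
    subprocess.check_call(list(cmd))

pools = subprocess.check_output(['ceph', 'osd', 'pool', 'ls']).split()

# Replicated pool: create with a pg count, then pin size/min_size.
if 'volumes' not in pools:
    sh('ceph', 'osd', 'pool', 'create', 'volumes', '1024')
    sh('ceph', 'osd', 'pool', 'set', 'volumes', 'size', '3')
    sh('ceph', 'osd', 'pool', 'set', 'volumes', 'min_size', '2')

# Erasure-coded pool: define a profile first, then create the pool with it.
if 'ec-archive' not in pools:
    sh('ceph', 'osd', 'erasure-code-profile', 'set', 'archive-profile',
       'k=4', 'm=2', 'ruleset-failure-domain=host')
    sh('ceph', 'osd', 'pool', 'create', 'ec-archive', '512', '512',
       'erasure', 'archive-profile')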

CRUSH maps
  Describing the infrastructure in yaml makes sense.  Things like which
  servers are in which rows/racks/chassis.  Also describing the type of
  server (model, number of HDDs, number of SSDs) makes sense.

CRUSH rules
  I could see puppet managing the various rules based on the backend
  storage (HDD, SSD, primary affinity, erasure coding, etc).
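
  A sketch of how a yaml description of racks/hosts like the one above
  could be pushed into the CRUSH map, plus a simple rule. The yaml layout,
  bucket names, and rule name are invented; separate hdd/ssd roots and
  their rules would follow the same pattern.

#!/usr/bin/env python
# Sketch (one-shot, not idempotent): build the rack/host part of the CRUSH
# hierarchy from a yaml description and add a simple replicated rule. The
# yaml layout, bucket names, and rule name are invented.
import subprocess
import yaml   # PyYAML

def sh(*cmd):
    subprocess.check_call(list(cmd))

# layout.yaml (hypothetical):
#   racks:
#     rack-a1: [ceph01, ceph02]
#     rack-a2: [ceph03, ceph04]
with open('layout.yaml') as f:
    layout = yaml.safe_load(f)

for rack, hosts in layout['racks'].items():
    sh('ceph', 'osd', 'crush', 'add-bucket', rack, 'rack')
    sh('ceph', 'osd', 'crush', 'move', rack, 'root=default')
    for host in hosts:
        # Host buckets exist once their OSDs register; just move them.
        sh('ceph', 'osd', 'crush', 'move', host, 'rack=%s' % rack)

# A basic replicated rule choosing hosts under the default root.
sh('ceph', 'osd', 'crush', 'rule', 'create-simple',
   'replicated-by-host', 'default', 'host')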

Replacing a failed HDD
  Do you automatically identify the new drive and start using it right
  away?  I've seen people talk about using a combination of udev and
  special GPT partition IDs to automate this.  If you have a cluster
  with thousands of drives I think automating the replacement makes
  sense.  How do you handle the journal partition on the SSD?  Does
  removing the old journal partition and creating a new one create a
  hole in the partition map (because the old partition is removed and
  the new one is created at the end of the drive)?
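
  One common shape for the automated version is a udev "add" rule that
  hands brand-new devices to a small helper which calls ceph-disk;
  ceph-disk then allocates the next journal partition on the SSD and sets
  the GPT type GUIDs, after which Ceph's stock udev rules activate the
  OSD. A hedged sketch of such a helper follows; the journal device path
  is a placeholder.

#!/usr/bin/env python
# Hedged sketch of a replacement-drive helper (called from a udev "add"
# rule, or by hand): if the new device is blank, hand it to ceph-disk
# together with the journal SSD; ceph-disk allocates the journal partition
# and sets the GPT type codes. The journal device path is a placeholder.
import subprocess
import sys

JOURNAL_SSD = '/dev/disk/by-id/ata-JOURNAL_SSD_1'   # hypothetical

def is_blank(dev):
    # blkid exits non-zero when it finds no filesystem/partition signature,
    # which is a reasonable "this is a fresh replacement" check.
    return subprocess.call(['blkid', dev]) != 0

def main():
    dev = sys.argv[1]                                # e.g. /dev/sdk
    if not is_blank(dev):
        sys.exit('%s is not blank; refusing to prepare it' % dev)
    subprocess.check_call(['ceph-disk', 'prepare', dev, JOURNAL_SSD])

if __name__ == '__main__':
    main()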

Replacing a failed SSD journal
  Has anyone automated recreating the journal drive using Sebastien
  Han's instructions, or do you have to rebuild all the OSDs as well?


http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
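
  A rough sketch of that kind of journal rebuild (not a substitute for the
  post's full procedure): once the replacement SSD is partitioned, repoint
  each affected OSD at its new journal partition and write a fresh
  journal. The OSD ids, partition paths, and the upstart-style start
  command are placeholders.

#!/usr/bin/env python
# Rough sketch of a journal rebuild after replacing the SSD: repoint each
# affected OSD at its new journal partition, write a fresh journal, and
# bring the OSD back. OSD ids, partition paths, and the upstart-style
# start command are placeholders; see the linked post for the full story.
import os
import subprocess

def sh(*cmd):
    subprocess.check_call(list(cmd))

AFFECTED = {                                # osd id -> new journal partition
    '12': '/dev/disk/by-id/ata-NEW_SSD-part1',
    '13': '/dev/disk/by-id/ata-NEW_SSD-part2',
    '14': '/dev/disk/by-id/ata-NEW_SSD-part3',
}

sh('ceph', 'osd', 'set', 'noout')           # don't rebalance while they're down
for osd, journal in AFFECTED.items():
    link = '/var/lib/ceph/osd/ceph-%s/journal' % osd
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(journal, link)
    sh('ceph-osd', '-i', osd, '--mkjournal')    # write a fresh, empty journal
    sh('start', 'ceph-osd', 'id=%s' % osd)
sh('ceph', 'osd', 'unset', 'noout')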

Adding new OSD servers
  How are you adding multiple new OSD servers to the cluster?  I could
  see an ansible playbook which disables nobackfill, noscrub, and
  nodeep-scrub followed by adding all the OSDs to the cluster being
  useful.
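
  The flag handling in such a playbook reduces to something like the
  sketch below; the deploy step in the middle is just a placeholder for
  whatever actually creates the OSDs (ceph-deploy, ceph-disk, an ansible
  play, ...).

#!/usr/bin/env python
# Sketch of the flag-wrapping logic around a bulk OSD add: set nobackfill
# and the scrub flags, bring in the new OSDs, then unset and let backfill
# run. The deploy step is a placeholder.
import subprocess

FLAGS = ['nobackfill', 'noscrub', 'nodeep-scrub']

def sh(*cmd):
    subprocess.check_call(list(cmd))

for flag in FLAGS:
    sh('ceph', 'osd', 'set', flag)
try:
    pass   # placeholder: deploy all OSDs on the new hosts here
finally:
    for flag in FLAGS:
        sh('ceph', 'osd', 'unset', flag)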

Upgrading releases
  I've found an ansible playbook for doing a rolling upgrade which looks
  like it would work well, but are there other methods people are using?


http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/

Decommissioning hardware
  Seems like another ansible playbook for reducing the OSDs' weights to
  zero, marking the OSDs out, stopping the service, removing the OSD ID,
  removing the CRUSH entry, unmounting the drives, and finally removing
  the server would be the best method here.  Any other ideas on how to
  approach this?


That's all I can think of right now.  Are there any other tasks that
people have run into that are missing from this list?

Thanks,
Bryan

