Hello all, medium-term user of Ceph, avid reader of this list for the hints and
tricks, but first-time poster...

I have a working cluster that has been operating for around 18 months across
various versions, and we decided we knew enough about how it worked to depend
on it for long-term storage.

The basic storage nodes are not high-end but quite capable: Supermicro-based
dual-AMD boards with 32 GB of RAM and 24 OSDs attached to each, 9 nodes in
total, a mix of 2 TB and 3 TB drives with a single OSD per drive, 216 OSDs
overall. Everything runs Ubuntu 14.04 LTS with Ceph 0.80.10, the default for
Ubuntu "trusty tahr". All nodes boot from a 120 GB SSD (80 GB OS, 40 GB swap),
all journals live on the spinning storage, and everything Ceph-related was
installed via ceph-deploy.

I have a separate CRUSH ruleset for the production OSDs, so that new (or
flapping) OSDs don't automatically get added to the pool.
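
The gist of that arrangement is roughly as follows; a sketch only, the root,
rule and pool names here are placeholders rather than the real ones. In
ceph.conf:

[osd]
# keep new/restarted OSDs from placing themselves in the CRUSH map
osd crush update on start = false

plus a dedicated root and rule for production, along the lines of:

ceph osd crush add-bucket production root
ceph osd crush rule create-simple production-hdd production host
ceph osd pool set rbd crush_ruleset 1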

Normal running memory use was 24 GB of RAM (out of 32 GB) in userland, so it
wasn't quick, but it worked well enough for our storage needs, with a plan for
a RAM upgrade when needed.

Six of the OSDs went "near full" over the Christmas break, so the plan was to
add two nodes on the return to work on the 4th of January.

That turned out... badly.

On adding the new nodes, "ceph -w" reported the normal backfilling when I moved
the OSDs into the correct place in the CRUSH map... and then three of the nodes
started gobbling RAM like mad, hitting swap and eventually dropping all their
OSDs, which then cascaded to the rest of the nodes, with the end result that
all the OSDs were marked out and moved back into the default CRUSH map
location.
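
For what it's worth, the new OSDs were moved into place with commands along
these lines (the weights and host names here are illustrative, not the exact
ones used):

ceph osd crush create-or-move osd.216 2.73 root=production host=ceph-stor10
# or, to move an entire host bucket in one go:
ceph osd crush move ceph-stor10 root=production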

Now, 24 hours later, despite my best efforts and reading all the hints and
tricks for reducing memory usage, as soon as I move more than three OSDs into
the production HDD ruleset, RAM gets gobbled up and the machine(s) die with
load averages in the thousands, producing funky messages in syslog and
generally being unhappy.
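
The sort of thing I've been trying to keep the recovery load down, beyond the
ceph.conf below, is roughly this (a sketch, not an exact transcript):

ceph osd set noout        # stop dropped OSDs being marked out and re-shuffled
ceph osd set nobackfill   # pause backfill while OSDs are brought back up
ceph osd set norecover
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'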

We upgraded 7 of the nodes to 48 GB of RAM (what we had lying around) and
reduced the last nodes to 12 OSDs each, moving the remaining 24 OSDs to the two
new nodes. Still no joy.

What I ended up with in ceph.conf...

[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = ceph-iscsi01, ceph-iscsi02
mon_host = 10.201.4.198,10.201.4.199
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.201.4.0/24
cluster network = 10.201.248.0/24
osd map message max = 5
[osd]
osd target transaction size = 50
osd recovery max active = 1
osd max backfills = 1
osd op threads = 1
osd disk threads = 1
osd map cache size = 4
osd map cache bl size = 1
osd map max advance = 1
osd map share max epochs = 1
osd pg epoch persisted max stale = 1
osd backfill scan min = 2
osd backfill scan max = 4
osd_min_pg_log_entries = 100
osd_max_pg_log_entries = 500
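
For reference, the runtime values of the settings above can be double-checked
via the admin socket on the node hosting a given OSD, along these lines (osd.0
is just an example ID):

ceph daemon osd.0 config show | grep -E 'osd_map_cache|osd_max_backfills|osd_recovery_max_active'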

It's not a PID/thread issue: thread counts generally sit around 28k-29k or so,
and limits.conf has a 64k open-file limit set for both root and ceph-admin.
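
The checks behind that statement were roughly the usual suspects, and nothing
looked close to a ceiling:

ps -eLf | wc -l                            # rough system-wide thread count
sysctl kernel.pid_max kernel.threads-max   # kernel limits
ulimit -u; ulimit -n                       # per-user process / open-file limits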

I can see all the data on the OSDs, and I'm prepared to do a recovery (there
were five RBD images on there, all XFS; hopefully it's just a matter of gluing
the images back together), which could take a while. That's OK.
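
By "gluing the images back together" I mean the usual trick of collecting each
image's objects straight off the OSD filestores and dd'ing them into a sparse
file at their offsets; very roughly like this, where the rb.0.xxxx prefix,
$obj and $idx are illustrative placeholders:

# gather the filestore object files for one image prefix across all OSDs
find /var/lib/ceph/osd/ceph-*/current -name 'rb.0.xxxx.*' > objects.txt
# each object name carries a hex chunk index; write the object into a sparse
# image at index * 4 MB (the default RBD object size)
dd if="$obj" of=image.raw bs=4M seek=$((16#$idx)) conv=notrunc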

What's concerning me is... what have I done wrong, and once I rebuild the
cluster after the recovery, will it happen again when some of the OSDs get over
85% used?
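
The 85% figure is, as I understand it, the default nearfull threshold; the
relevant knobs seem to be the mon ratios:

# default .85 triggers the "near full" warning; at .95 writes are blocked
mon osd nearfull ratio = .85
mon osd full ratio = .95

so part of the rebuild plan is to expand well before getting anywhere near
those.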

Thanks in advance for the help...

Mark.


Mark Dignam
Technical Director

t: +61 8 6141 1011
f: +61 8 9499 4083
e: mark.dig...@dctwo.com.au
w: www.dctwo.com.au


