Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-26 Thread Christian Balzer

Hello,

I'm leaving the original post below for your perusal, but will top-post
here with the conclusions/decisions so far.

Firstly, I got a quotation from our vendors here (assuming a slightly
incorrect 100:1 exchange rate of Japanese Yen to USD).
A storage node as described below, with 24 3TB disks, 2 Intel 3700DC SSDs,
a mobo with two Opteron 4300 CPUs (12 cores total), 32GB RAM and one dual port
Infiniband HBA will cost about $7700. 
The Areca 1882 24port controller with 4GB cache is another $1900 (I'm sure
it would be cheaper in the US, like all the other gear).
The above times two for about 60TB capacity total.

To get the same storage space and (roughly) reliability relying on Ceph
replication and rebalancing with one OSD per disk would require an
additional storage node at the aforementioned $7700, plus roughly $700 per
node (I looked this up on Google, so these are US prices and thus certainly
more expensive in Japan, but let's just go with that number) for HBAs that I
trust (and that have the right connectors for the backplane), namely an LSI
16 port and an 8 port one.
In all likelihood with 24 OSDs per node larger SSDs would be in order,
too. 

Which winds up being $6000 more expensive than the approach with the
insanely expensive RAID controllers.

I fully acknowledge the point about spindle contention in my design that Mike
Dawson brought up; however, I'm confident that between enabling RBD caching,
OSD journals on SSDs and the 4GB controller cache this won't be as crippling
as one might think.
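
For reference, a minimal sketch of the client-side caching bit (option names
are from the Ceph docs; the values are untested assumptions on my part):

  [client]
  rbd cache = true
  rbd cache size = 67108864                  # 64MB per client, assumption
  rbd cache max dirty = 50331648             # writeback threshold, assumption
  rbd cache writethrough until flush = true  # safety net for old guest kernels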

I'm simply not convinced that the additional $6000, 4U of rack space,
having to deal with failed OSDs on a frequent basis and requiring people
with at least half a clue to replace disks are worth the additional IOPs
in my case.

Further comments are of course still welcome!

Christian

On Tue, 17 Dec 2013 16:44:29 +0900 Christian Balzer wrote:

> 
> Hello,
> 
> I've been doing a lot of reading and am looking at the following design
> for a storage cluster based on Ceph. I will address all the likely
> knee-jerk reactions and reasoning below, so hold your guns until you've
> read it all. I also have a number of questions I've not yet found the
> answer to or determined it by experimentation.
> 
> Hardware: 
> 2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2
> internal OS (journal?) drives, probably Opteron 4300 CPUs (see below),
> Areca 1882 controller with 4GB cache, 2 or 3 2-port Infiniband HCAs. 
> 24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6, 2 of
> them hotspares, giving us 60TB per node and thus with a replication
> factor of 2 that's also the usable space.
> Space for 2 more identical servers if need be.
> 
> Network: 
> Infiniband QDR, 2x 18port switches (interconnected of course), redundant
> paths everywhere, including to the clients (compute nodes).
> 
> Ceph configuration:
> Additional server with mon, mons also on the 2 storage nodes, at least 2
> OSDs per node (see below)
> 
> This is for a private cloud with about 500 VMs at most. There will be 2
> types of VMs, the majority writing a small amount of log chatter to their
> volumes, the other type (a few dozen) writing a more substantial data
> stream. 
> I estimate less than 100MB/s of read/writes at full build out, which
> should be well within the abilities of this setup.
> 
> 
> Now for the rationale of this design that goes contrary to anything
> normal Ceph layouts suggest:
> 
> 1. Idiot (aka NOC monkey) proof hotswap of disks.
> This will be deployed in a remote data center, meaning that qualified
> people will not be available locally and thus would have to travel there
> each time a disk or two fails. 
> In short, telling somebody to pull the disk tray with the red flashing
> LED and put a new one from the spare pile in there is a lot more likely
> to result in success than telling them to pull the 3rd row, 4th column
> disk in server 2. ^o^
> 
> 2. Density, TCO
> Ideally I would love to deploy something like this:
> http://www.mbx.com/60-drive-4u-storage-server/
> but they seem to not really have a complete product description, price
> list, etc. ^o^ With a monster like that, I'd be willing to reconsider
> local raids and just overspec things in a way that a LOT of disks can fail
> before somebody (with a clue) needs to visit that DC.
> However failing that, the typical approach to use many smaller servers
> for OSDs increases the costs and/or reduces density. Replacing the 4U
> servers with 2U ones (that hold 12 disks) would require some sort of
> controller (to satisfy my #1 requirement) and similar amounts of HCAs
> per node, clearly driving the TCO up. 1U servers with typically 4 disks
> would be even worse.
> 
> 3. Increased reliability/stability 
> Failure of a single disk has no impact on the whole cluster, no need for
> any CPU/network intensive rebalancing.
> 
> 
> Questions/remarks:
> 
> Due to the fact that there will be redundancy, reliability on the disk
> level and that there will be only 2 storage nodes initially, I'm planning
> to disable rebalancing.

Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-22 Thread Christian Balzer

Hello,

On Sun, 22 Dec 2013 12:27:46 +0100 Gandalf Corvotempesta wrote:

> 2013/12/17 Christian Balzer :
> > Network:
> > Infiniband QDR, 2x 18port switches (interconnected of course),
> > redundant paths everywhere, including to the clients (compute nodes).
> 
> Are you using IPoIB? How do you interconnect both switches without
> making loops? AFAIK, IB switches don't support STP. Will you
> interconnect IB switches with just one cable?
> 
It is my understanding that OpenSM will deal with this (potential loops)
nicely. 
Also for redundancy, I don't really need any switch interconnection at
all, but I plan on using it anyway.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-22 Thread Gandalf Corvotempesta
2013/12/17 Christian Balzer :
> Network:
> Infiniband QDR, 2x 18port switches (interconnected of course), redundant
> paths everywhere, including to the clients (compute nodes).

Are you using IPoIB? How do you interconnect both switches without
making loops? AFAIK, IB switches don't support STP. Will you
interconnect IB switches with just one cable?


Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-17 Thread Christian Balzer

Hello,

Quick update, one more data point.
On a currently inactive cluster (DRBD) with a 7 disk (7200rpm SATA) RAID6
backing storage (crappy Adaptec controller with slower RAID6 engine than
the proposed Areca and just 512MB cache) I just installed fio and ran the
iometer compatible script on the mounted (ext4) drbd device:

  write: io=841813KB, bw=600318 B/s, iops=190 , runt=1435932msec

Which is about what I expected, given that this cluster handled 14000 mail
deliveries per minute sustained (postal, 10 threads, average 10KB size)
during testing. 
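
In case anybody wants to reproduce this: the job was essentially fio's
iometer example; from memory it boils down to something like the following
(treat the exact block size mix as an approximation, not the real job file):

  [iometer-like]
  ioengine=libaio
  direct=1
  rw=randrw
  rwmixread=80
  bsrange=512-64k
  size=4g
  iodepth=64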

Definitely higher than the write IOPs one would have expected by just
punching in the numbers for the backing storage, never mind that this is
on top of DRBD:
7 disks * 75 IOPs / 6 (RAID6 write penalty) = 87.5 write IOPs
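
Or, spelled out as a trivial sketch (same numbers as above, nothing more):

  # naive write IOPS ceiling for a RAID6 set, ignoring controller cache
  def raid6_write_iops(disks, iops_per_disk, write_penalty=6):
      return disks * iops_per_disk / float(write_penalty)

  estimate = raid6_write_iops(disks=7, iops_per_disk=75)  # ~87.5
  measured = 190.0                                        # the fio result above
  print("estimated %.1f write IOPs, measured %.0f (%.1fx higher)"
        % (estimate, measured, measured / estimate))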

Christian

On Wed, 18 Dec 2013 09:12:15 +0900 Christian Balzer wrote:

> 
> Hello Mike,
> 
> On Tue, 17 Dec 2013 12:32:35 -0500 Mike Dawson wrote:
> 
> > Christian,
> > 
> > I think you are going to suffer the effects of spindle contention with 
> > this type of setup. Based on your email and my assumptions, I will use 
> > the following inputs:
> > 
> > - 4 OSDs, each backed by a 12-disk RAID 6 set
> > - 75iops for each 7200rpm 3TB drive
> > - RAID 6 write penalty of 6
> > - OSD Journal co-located with OSD
> As I wrote, the journals would be on SSDs if proven to be beneficial, but
> let's continue with this assumption.
> 
> > - Ceph replication size of 2
> > 
> > 4osds * 12disks * 75iops / 6(RAID6WritePenalty) / 2(OsdJournalHit) / 
> > 2(CephReplication) = 150 Writes/second max
> > 
> I have a gut feeling that this can't be quite right given my experience
> with DRBD based clusters (but on those I keep the closest analogy to the
> OSD journal, the activity log on SSDs).
> For starters it would seem to ignore the ability of the RAID controller
> to merge I/O, especially with a 4GB cache.
> 
> For example this iostat output right now, sda being the backing device
> for drbd0:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00   354.00    1.00  766.00     4.00  4558.67    11.90     0.79    1.03    6.67    1.03   0.03   2.27
> drbd0             0.00     0.00    1.00 1120.00     4.00  4558.67     8.14    16.64   14.84    6.67   14.85   0.05   5.47
> 
> 
> > 4osds * 12disks * 75iops / 2xCephReplication = 1800 Reads/second max
> > 
> > My guess is 150 writes/second is far lower than your 500 VMs will 
> > require. After all, this setup will likely give you lower
> > writes/second than a single 15K SAS drive. Further, if you need to
> > replace a drive, I suspect this setup would grind to a halt as the
> RAID6 set attempts to repair.
> > 
> 150 IOPs would indeed be a downer. Though the 500 VMs will be the end
> state, something like 2 years in the future. Thus also my plan to leave
> space for more storage nodes.
> As for RAID recovery, it is the usual trade off between impact to the
> RAID performance and the time you're willing to wait for full
> restoration of redundancy. Not unlike what is configurable with
> Ceph for OSD rebalancing after a failure. Usually that means low impact
> (and thus low resync speed) during busy times, but a quicker recovery
> during off-peak hours.
> 
> > On the other hand, if you planned for 48 individual drives with OSD 
> > journals on SSDs in a typical setup of perhaps 5:1 or lower ratio of 
> > SSDs:HDs, the calculation would look like:
> > 
> > 48osds * 75iops / 2xCephReplication = 1800 Writes/second max
> > 
> > 48osds * 75iops / 2xCephReplication = 1800 Reads/second max
> > 
> > As you can see, I estimate 12x more random writes without RAID6 (6x)
> > and co-located osd journals (2x).
> > 
> Alas at this point we are comparing one round fruit to a different one.
> To get the same resilience and redundancy as my design the "Ceph" way I
> think the result would be:
> 7x 2U nodes with 12 drives each (2 SSDs, 10 for storage), 3-way replication.
> 6 nodes are needed to get the same storage capacity, a 7th to avoid hitting
> the full ratio when a node goes down.
> 14U instead of 8, 70 disks instead of 48, etc.
> 
> > Plus you'll be able to configure 12x more placement groups in your
> > CRUSH rules by going from 4 osds to 48 osds. That will allow Ceph's 
> pseudo-random placement rules to significantly improve the
> > distribution of data and io load across the cluster to decrease the
> > risk of hot-spots.
> > 
> > A few other notes:
> > 
> > - You'll certainly want QEMU 1.4.2 or later to get asynchronous io for
> > RBD.
> > 
> > - You'll likely want to enable RBD writeback cache. It helps coalesce 
> > small writes before hitting the disks.
> > 
> Sage advice that I would love to follow but likely won't be able to for
> initial deployment, as native QEMU RBD support is just being added to
> Ganeti. And no, OpenStack is not an option at this point in time (no CPU
> pinning, manual node failure recovery), especially given that this is
> supposed to be up and running in 3 months tops.

Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-17 Thread Christian Balzer

Hello Mike,

On Tue, 17 Dec 2013 12:32:35 -0500 Mike Dawson wrote:

> Christian,
> 
> I think you are going to suffer the effects of spindle contention with 
> this type of setup. Based on your email and my assumptions, I will use 
> the following inputs:
> 
> - 4 OSDs, each backed by a 12-disk RAID 6 set
> - 75iops for each 7200rpm 3TB drive
> - RAID 6 write penalty of 6
> - OSD Journal co-located with OSD
As I wrote, the journals would be on SSDs if proven to be beneficial, but
let's continue with this assumption.

> - Ceph replication size of 2
> 
> 4osds * 12disks * 75iops / 6(RAID6WritePenalty) / 2(OsdJournalHit) / 
> 2(CephReplication) = 150 Writes/second max
> 
I have a gut feeling that this can't be quite right given my experience
with DRBD based clusters (but on those I keep the closest analogy to the
OSD journal, the activity log on SSDs).
For starters it would seem to ignore the ability of the RAID controller to
merge I/O, especially with a 4GB cache.

For example this iostat output right now, sda being the backing device for
drbd0:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   354.00    1.00  766.00     4.00  4558.67    11.90     0.79    1.03    6.67    1.03   0.03   2.27
drbd0             0.00     0.00    1.00 1120.00     4.00  4558.67     8.14    16.64   14.84    6.67   14.85   0.05   5.47


> 4osds * 12disks * 75iops / 2xCephReplication = 1800 Reads/second max
> 
> My guess is 150 writes/second is far lower than your 500 VMs will 
> require. After all, this setup will likely give you lower writes/second 
> than a single 15K SAS drive. Further, if you need to replace a drive, I 
> suspect this setup would grind to a halt as the RAID6 set attempts 
> to repair.
> 
150 IOPs would indeed be a downer. Though the 500 VMs will be the end
state, something like 2 years in the future. Thus also my plan to leave
space for more storage nodes.
As for RAID recovery, it is the usual trade off between impact to the RAID
performance and the time you're willing to wait for full restoration of
redundancy. Not unlike what is configurable with Ceph for OSD
rebalancing after a failure. Usually that means low impact (and thus low
resync speed) during busy times, but a quicker recovery during off-peak
hours.

> On the other hand, if you planned for 48 individual drives with OSD 
> journals on SSDs in a typical setup of perhaps 5:1 or lower ratio of 
> SSDs:HDs, the calculation would look like:
> 
> 48osds * 75iops / 2xCephReplication = 1800 Writes/second max
> 
> 48osds * 75iops / 2xCephReplication = 1800 Reads/second max
> 
> As you can see, I estimate 12x more random writes without RAID6 (6x) and 
> co-located osd journals (2x).
> 
Alas at this point we are comparing one round fruit to a different one.
To get the same resilience and redundancy as my design the "Ceph" way I
think the result would be:
7x 2U nodes with 12 drives each (2 SSDs, 10 for storage), 3-way replication.
6 nodes are needed to get the same storage capacity, a 7th to avoid hitting
the full ratio when a node goes down.
14U instead of 8, 70 disks instead of 48, etc.

> Plus you'll be able to configure 12x more placement groups in your CRUSH 
> rules by going from 4 osds to 48 osds. That will allow Ceph's 
> pseudo-random placement rules to significantly improve the distribution 
> of data and io load across the cluster to decrease the risk of hot-spots.
> 
> A few other notes:
> 
> - You'll certainly want QEMU 1.4.2 or later to get asynchronous io for
> RBD.
> 
> - You'll likely want to enable RBD writeback cache. It helps coalesce 
> small writes before hitting the disks.
> 
Sage advice that I would love to follow but likely won't be able to for
initial deployment, as native QEMU RBD support is just being added to
Ganeti. And no, OpenStack is not an option at this point in time (no CPU
pinning, manual node failure recovery), especially given that this is
supposed to be up and running in 3 months tops. 

Thanks,

Christian
> 
> Cheers,
> Mike
> 
> 
> 
> On 12/17/2013 2:44 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > I've been doing a lot of reading and am looking at the following design
> > for a storage cluster based on Ceph. I will address all the likely
> > knee-jerk reactions and reasoning below, so hold your guns until you've
> > read it all. I also have a number of questions I've not yet found the
> > answer to or determined it by experimentation.
> >
> > Hardware:
> > 2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays,
> > 2 internal OS (journal?) drives, probably Opteron 4300 CPUs (see
> > below), Areca 1882 controller with 4GB cache, 2 or 3 2-port Infiniband
> > HCAs. 24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6,
> > 2 of them hotspares, giving us 60TB per node and thus with a
> > replication factor of 2 that's also the usable space.
> > Space for 2 more identical servers if need be.
> >
> > Network:
> > Infiniband QDR, 2x 18port switches (interconnected of course), redundant
> > paths everywhere, including to the clients (compute nodes).

Re: [ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-17 Thread Mike Dawson

Christian,

I think you are going to suffer the effects of spindle contention with 
this type of setup. Based on your email and my assumptions, I will use 
the following inputs:


- 4 OSDs, each backed by a 12-disk RAID 6 set
- 75iops for each 7200rpm 3TB drive
- RAID 6 write penalty of 6
- OSD Journal co-located with OSD
- Ceph replication size of 2

4osds * 12disks * 75iops / 6(RAID6WritePenalty) / 2(OsdJournalHit) / 
2(CephReplication) = 150 Writes/second max


4osds * 12disks * 75iops / 2xCephReplication = 1800 Reads/second max

My guess is 150 writes/second is far lower than your 500 VMs will 
require. After all, this setup will likely give you lower writes/second 
than a single 15K SAS drive. Further, if you need to replace a drive, I 
suspect this setup would grind to a halt as the RAID6 set attempts 
to repair.


On the other hand, if you planned for 48 individual drives with OSD 
journals on SSDs in a typical setup of perhaps 5:1 or lower ratio of 
SSDs:HDs, the calculation would look like:


48osds * 75iops / 2xCephReplication = 1800 Writes/second max

48osds * 75iops / 2xCephReplication = 1800 Reads/second max

As you can see, I estimate 12x more random writes without RAID6 (6x) and 
co-located osd journals (2x).
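
To make the arithmetic explicit, here are the same estimates as a small
script (the inputs and assumptions from above, nothing measured):

  # rough write/read IOPS ceilings for the two layouts discussed above
  DISK_IOPS = 75      # 7200rpm 3TB drive, assumption
  REPLICATION = 2     # Ceph replication size

  # 4 OSDs, each on a 12-disk RAID6, journal co-located with the OSD
  raid6_writes = 4 * 12 * DISK_IOPS / 6 / 2 / REPLICATION  # penalty 6, journal hit 2
  raid6_reads  = 4 * 12 * DISK_IOPS / REPLICATION

  # 48 individual OSDs, journals on SSDs
  jbod_writes = 48 * DISK_IOPS / REPLICATION
  jbod_reads  = 48 * DISK_IOPS / REPLICATION

  print("RAID6 layout:  %d writes/s, %d reads/s" % (raid6_writes, raid6_reads))
  print("48-OSD layout: %d writes/s, %d reads/s" % (jbod_writes, jbod_reads))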


Plus you'll be able to configure 12x more placement groups in your CRUSH 
rules by going from 4 osds to 48 osds. That will allow Ceph's 
pseudo-random placement rules to significantly improve the distribution 
of data and io load across the cluster to decrease the risk of hot-spots.
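
(The usual rule of thumb from the Ceph docs is on the order of 100 placement
groups per OSD divided by the replica count, rounded up to a power of two; a
quick sketch of what that means for the two layouts:)

  # rule-of-thumb PG count: ~100 PGs per OSD / replicas, next power of two
  def suggested_pg_count(num_osds, replicas=2, pgs_per_osd=100):
      raw = num_osds * pgs_per_osd / replicas
      power = 1
      while power < raw:
          power *= 2
      return power

  print(suggested_pg_count(4))    # 256  with 4 RAID6-backed OSDs
  print(suggested_pg_count(48))   # 4096 with 48 individual OSDs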


A few other notes:

- You'll certainly want QEMU 1.4.2 or later to get asynchronous io for RBD.

- You'll likely want to enable RBD writeback cache. It helps coalesce 
small writes before hitting the disks.



Cheers,
Mike



On 12/17/2013 2:44 AM, Christian Balzer wrote:


Hello,

I've been doing a lot of reading and am looking at the following design
for a storage cluster based on Ceph. I will address all the likely
knee-jerk reactions and reasoning below, so hold your guns until you've
read it all. I also have a number of questions I've not yet found the
answer to or determined it by experimentation.

Hardware:
2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2
internal OS (journal?) drives, probably Opteron 4300 CPUs (see below),
Areca 1882 controller with 4GB cache, 2 or 3 2-port Infiniband HCAs.
24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6, 2 of them
hotspares, giving us 60TB per node and thus with a replication factor of 2
that's also the usable space.
Space for 2 more identical servers if need be.

Network:
Infiniband QDR, 2x 18port switches (interconnected of course), redundant
paths everywhere, including to the clients (compute nodes).

Ceph configuration:
Additional server with mon, mons also on the 2 storage nodes, at least 2
OSDs per node (see below)

This is for a private cloud with about 500 VMs at most. There will be 2 types
of VMs, the majority writing a small amount of log chatter to their
volumes, the other type (a few dozen) writing a more substantial data
stream.
I estimate less than 100MB/s of read/writes at full build out, which
should be well within the abilities of this setup.


Now for the rationale of this design that goes contrary to anything normal
Ceph layouts suggest:

1. Idiot (aka NOC monkey) proof hotswap of disks.
This will be deployed in a remote data center, meaning that qualified
people will not be available locally and thus would have to travel there
each time a disk or two fails.
In short, telling somebody to pull the disk tray with the red flashing LED
and put a new one from the spare pile in there is a lot more likely to
result in success than telling them to pull the 3rd row, 4th column disk
in server 2. ^o^

2. Density, TCO
Ideally I would love to deploy something like this:
http://www.mbx.com/60-drive-4u-storage-server/
but they seem to not really have a complete product description, price
list, etc. ^o^ With a monster like that, I'd be willing to reconsider local
raids and just overspec things in a way that a LOT of disks can fail before
somebody (with a clue) needs to visit that DC.
However failing that, the typical approach to use many smaller servers for
OSDs increases the costs and/or reduces density. Replacing the 4U servers
with 2U ones (that hold 12 disks) would require some sort of controller (to
satisfy my #1 requirement) and similar amounts of HCAs per node, clearly
driving the TCO up. 1U servers with typically 4 disks would be even worse.

3. Increased reliability/stability
Failure of a single disk has no impact on the whole cluster, no need for
any CPU/network intensive rebalancing.


Questions/remarks:

Due to the fact that there will be redundancy, reliability on the disk
level and that there will be only 2 storage nodes initially, I'm
planning to disable rebalancing.
Or will Ceph realize that making replicas on the same server won't really
save the day and refrain from doing so?
If more nodes are added later, I will likely set an appropriate full ratio
and activate rebalancing on a permanent basis again (except for planned
maintenances of course).

[ceph-users] Sanity check of deploying Ceph very unconventionally (on top of RAID6, with very few nodes and OSDs)

2013-12-16 Thread Christian Balzer

Hello,

I've been doing a lot of reading and am looking at the following design
for a storage cluster based on Ceph. I will address all the likely
knee-jerk reactions and reasoning below, so hold your guns until you've
read it all. I also have a number of questions I've not yet found the
answer to or determined it by experimentation.

Hardware: 
2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2
internal OS (journal?) drives, probably Opteron 4300 CPUs (see below),
Areca 1882 controller with 4GB cache, 2 or 3 2-port Infiniband HCAs. 
24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6, 2 of them
hotspares, giving us 60TB per node and thus with a replication factor of 2
that's also the usable space.
Space for 2 more identical servers if need be.

Network: 
Infiniband QDR, 2x 18port switches (interconnected of course), redundant
paths everywhere, including to the clients (compute nodes).

Ceph configuration:
Additional server with mon, mons also on the 2 storage nodes, at least 2
OSDs per node (see below)

This is for a private cloud with about 500 VMs at most. There will be 2 types
of VMs, the majority writing a small amount of log chatter to their
volumes, the other type (a few dozen) writing a more substantial data
stream. 
I estimate less than 100MB/s of read/writes at full build out, which
should be well within the abilities of this setup.


Now for the rationale of this design that goes contrary to anything normal
Ceph layouts suggest:

1. Idiot (aka NOC monkey) proof hotswap of disks.
This will be deployed in a remote data center, meaning that qualified
people will not be available locally and thus would have to travel there
each time a disk or two fails. 
In short, telling somebody to pull the disk tray with the red flashing LED
and put a new one from the spare pile in there is a lot more likely to
result in success than telling them to pull the 3rd row, 4th column disk
in server 2. ^o^

2. Density, TCO
Ideally I would love to deploy something like this:
http://www.mbx.com/60-drive-4u-storage-server/
but they seem to not really have a complete product description, price
list, etc. ^o^ With a monster like that, I'd be willing to reconsider local
raids and just overspec things in a way that a LOT of disks can fail before
somebody (with a clue) needs to visit that DC.
However failing that, the typical approach to use many smaller servers for
OSDs increases the costs and/or reduces density. Replacing the 4U servers
with 2U ones (that hold 12 disks) would require some sort of controller (to
satisfy my #1 requirement) and similar amounts of HCAs per node, clearly
driving the TCO up. 1U servers with typically 4 disks would be even worse.

3. Increased reliability/stability 
Failure of a single disk has no impact on the whole cluster, no need for
any CPU/network intensive rebalancing.


Questions/remarks:

Due to the fact that there will be redundancy, reliability on the disk
level and that there will be only 2 storage nodes initially, I'm
planning to disable rebalancing. 
Or will Ceph realize that making replicas on the same server won't really
save the day and refrain from doing so? 
If more nodes are added later, I will likely set an appropriate full ratio
and activate rebalancing on a permanent basis again (except for planned
maintenances of course).
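
(To be concrete about what "disable rebalancing" would look like on my end,
and assuming I read the docs right that the default CRUSH rule already keeps
replicas on separate hosts; the ratio below is a placeholder, nothing here
has been tested:)

  # keep OSDs from being marked out (and thus rebalanced) on disk/node failure
  ceph osd set noout

  # in ceph.conf, to be revisited once more nodes are added
  [global]
  mon osd full ratio = 0.90

  # the relevant step of the default replicated CRUSH rule: one replica per host
  step chooseleaf firstn 0 type host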
My experience tells me that an actual node failure will be due to:
1. Software bugs, kernel or otherwise.
2. Marginal hardware (CPU/memory/mainboard hairline cracks, I've seen it
all)
Actual total loss of power in the DC doesn't worry me, because if that
happens I'm likely under a ton of rubble, this being Japan. ^_^

Given that a RAID6 with just 7 disk connected to an Areca 1882
controller in a different cluster I'm running here gives me about
800MB/s writes and 1GB/s reads I have a feeling that putting the journal on
SSDs (Intel DC S3700) would be a waste, if not outright harmful. 
But I guess I shall determine that by testing, maybe the higher IOPS rate
will still be beneficial. 
Since the expected performance of this RAID will be at least double the
bandwidth available on a single IB interface, I'm thinking of splitting it
in half and having an OSD for each half, each bound to a different interface.
One hopes that nothing in the OSD design stops it from dealing with these
speeds/bandwidths.

The plan is to use Ceph only for RBD, so would "filestore xattr use omap"
really be needed in case tests determine ext4 to be faster than xfs in my
setup?
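
In ceph.conf terms I'm picturing something like the following per storage
node (paths, hostnames and addresses are made-up placeholders; whether the
omap option is actually needed for ext4 is exactly what I'm asking):

  [osd]
  osd journal size = 10240                   # MB, assumption
  filestore xattr use omap = true            # only if we end up on ext4

  [osd.0]
  host = stor01
  public addr = 10.0.0.11                    # first IPoIB interface (placeholder)
  osd journal = /dev/disk/by-id/ssd0-part1   # placeholder, Intel DC S3700

  [osd.1]
  host = stor01
  public addr = 10.0.1.11                    # second IPoIB interface (placeholder)
  osd journal = /dev/disk/by-id/ssd1-part1   # placeholder, Intel DC S3700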

Given the above configuration, I'm wondering how many CPU cores would be
sufficient in the storage nodes. 
Somewhere in the documentation
http://ceph.com/docs/master/start/hardware-recommendations/ 
is a recommendation for 1GB RAM per 1TB of storage, but later on the same
page we see a storage server example with 36TB and 16GB RAM. 
Ideally I would love to use just one 6 or 8 core Opteron 4300 with 32GB of
memory, thus having only one NUMA domain and keeping all the
processes dealing with I/O local to it.