Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-13 Thread Leen Besselink
On Mon, May 13, 2013 at 09:30:38PM +0200, Tim Mohlmann wrote:
> Hi,
> 
> Ok, thanks for all the info.
> 
> Just FYI, I am a mechanical / electrical marine service engineer, so
> basically I think in pressure, flow, contents, voltage, (milli)amps, power
> and torque. I am just trying to relate it to the same principles, hence my
> questions. I am certainly not a noob in Linux, open source and that kind of
> stuff.
> 
> It is just that I got interested in online storage, and by some googling I
> came across certain "products" (most of them proprietary), some of them
> open source (but looking unmaintained / not very modern), and one of them
> was Ceph. After reading the docs I had some questions, and in my opinion
> they are answered.
> 
> I know now how to spend the money, and now it's time to start finding out
> how to make it. I've got a whole bucket of ideas about public apps for my
> storage, and all of this needs to be researched for feasibility. (This was
> just the start of my quest.)
> 

Every journey starts with the first step. :-)

> Again, thanks for the info. If this baby is going to fly, I will keep you
> posted about my findings. Maybe (and really, really maybe) I will try to
> contribute to the source for some features I already think I want to have ;).
> 

Don't be shy about sharing your ideas for new features; maybe some are already
available, or you might be able to do them with some scripting. And maybe
someone else thinks it is something they would want as well and will start
working on it.

Recently there was an online developer summit to try to compile such a list
for the near term:

http://ceph.com/events/ceph-developer-summit-summary-and-session-videos/

> Regards, Tim
> 
> 
> On Monday 13 May 2013 00:25:19 Dmitri Maziuk wrote:
> > On 2013-05-12 08:34, Tim Mohlmann wrote:
> > > As for choking the backplane: That would just slow things down a bit, am I
> > > right?
> > 
> > A bit, a lot, or not at all -- I think IRL you'll have to test it under
> > your workload and see.
> > 
> > [ WD performance ]
> > 
> > > Did not know that. Do you have any references? Does this also apply to
> > > the enterprise disks?
> > 
> > Here's one write-up: https://wiki.archlinux.org/index.php/Advanced_Format
> > 
> > Have not tested "enterprise" disks.
> > 
> > > Another question: do you use desktop or enterprise disks in your cluster?
> > > I am having trouble finding MTBFs for desktop drives, and when I do find
> > > them, they are almost the same as for enterprise drives. Is there a caveat
> > > in there? Is the failure test done under different conditions? (Not that
> > > you have to know that.)
> > > 
> > > If the annual failure rate were double, it would still be cheaper to
> > > use desktop drives in a large cluster, but I'd just like to know for
> > > sure.
> > 
> > I don't think anyone knows for sure how much of it is marketing bull.
> > One rumour is that the difference between "enterprise" and "desktop" drives
> > is very often only the firmware and the price tag. So yeah, we use the
> > desktop versions because they're cheaper, but we use them in RAID (usually
> > 1/10 -- and it's still cheaper), and we don't do super high performance
> > I/O on them. (Our requirements are size rather than speed.)
> > 
> > Dima
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-13 Thread Tim Mohlmann
Hi,

Ok, thanks for all the info.

Just FYI, I am a mechanical / electrical marine service engineer, so basically
I think in pressure, flow, contents, voltage, (milli)amps, power and torque. I
am just trying to relate it to the same principles, hence my questions. I am
certainly not a noob in Linux, open source and that kind of stuff.

It is just that I got interested in online storage, and by some googling I came
across certain "products" (most of them proprietary), some of them open source
(but looking unmaintained / not very modern), and one of them was Ceph. After
reading the docs I had some questions, and in my opinion they are answered.

I know now how to spend the money, and now it's time to start finding out how
to make it. I've got a whole bucket of ideas about public apps for my storage,
and all of this needs to be researched for feasibility. (This was just the
start of my quest.)

Again, thanks for the info. If this baby is going to fly, I will keep you
posted about my findings. Maybe (and really, really maybe) I will try to
contribute to the source for some features I already think I want to have ;).

Regards, Tim


On Monday 13 May 2013 00:25:19 Dmitri Maziuk wrote:
> On 2013-05-12 08:34, Tim Mohlmann wrote:
> > As for choking the backplane: That would just slow things down a bit, am I
> > right?
> 
> A bit, a lot, or not at all -- I think IRL you'll have to test it under
> your workload and see.
> 
> [ WD performance ]
> 
> > Did not know that. Do you have any references? Does this also apply to
> > the enterprise disks?
> 
> Here's one write-up: https://wiki.archlinux.org/index.php/Advanced_Format
> 
> Have not tested "enterprise" disks.
> 
> > Another question: do you use desktop or enterprise disks in your cluster?
> > I am having trouble finding MTBFs for desktop drives, and when I do find
> > them, they are almost the same as for enterprise drives. Is there a caveat
> > in there? Is the failure test done under different conditions? (Not that
> > you have to know that.)
> > 
> > If the annual failure rate were double, it would still be cheaper to
> > use desktop drives in a large cluster, but I'd just like to know for
> > sure.
> 
> I don't think anyone knows for sure how much of it is marketing bull.
> One rumour is that the difference between "enterprise" and "desktop" drives
> is very often only the firmware and the price tag. So yeah, we use the
> desktop versions because they're cheaper, but we use them in RAID (usually
> 1/10 -- and it's still cheaper), and we don't do super high performance
> I/O on them. (Our requirements are size rather than speed.)
> 
> Dima
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Dmitri Maziuk

On 2013-05-12 08:34, Tim Mohlmann wrote:

> As for choking the backplane: That would just slow things down a bit, am I
> right?

A bit, a lot, or not at all -- I think IRL you'll have to test it under
your workload and see.

[ WD performance ]

> Did not know that. Do you have any references? Does this also apply to the
> enterprise disks?

Here's one write-up: https://wiki.archlinux.org/index.php/Advanced_Format

Have not tested "enterprise" disks.

> Another question: do you use desktop or enterprise disks in your cluster? I am
> having trouble finding MTBFs for desktop drives, and when I do find them, they
> are almost the same as for enterprise drives. Is there a caveat in there? Is
> the failure test done under different conditions? (Not that you have to know
> that.)
>
> If the annual failure rate were double, it would still be cheaper to use
> desktop drives in a large cluster, but I'd just like to know for sure.

I don't think anyone knows for sure how much of it is marketing bull.
One rumour is that the difference between "enterprise" and "desktop" drives
is very often only the firmware and the price tag. So yeah, we use the
desktop versions because they're cheaper, but we use them in RAID (usually
1/10 -- and it's still cheaper), and we don't do super high performance
I/O on them. (Our requirements are size rather than speed.)


Dima

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Leen Besselink
On Sun, May 12, 2013 at 10:22:10PM +0200, Tim Mohlmann wrote:
> Hi,
> 
> On Sunday 12 May 2013 18:05:16 Leen Besselink wrote:
> 
> > 
> > I did see you mentioned you wanted to have many disks in the same machine,
> > not just machines with, let's say, 12 disks.
> > 
> > Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the
> > times when recovery is happening?
> Nope, did not know that.
> 
> The current intent is to install 2x 2.4 GHz Xeon CPUs, handling 8 threads
> each. So, 2*8*2.4 = 38.4 GHz for the maximum number of OSDs. It should be
> fine.
> 
> If I go for the 72-disk option, I have to consider doubling that power. The
> current max I can select from the dealer I am looking at, for the socket used
> in the SuperMicro 72x 3.5" version, is 2x a Xeon X5680, handling 12 threads
> each at 3.33 GHz. So, 2*12*3.33 = 79.92 GHz for the maximum number of OSDs.
> This should also be fine.
> 
> What will happen if the CPU is maxed out anyway? Will it slow things down or
> crash things? In my opinion it is not a bad thing if a system is maxed out
> during such a massive migration, which should not occur on a daily basis.
> Sure, a disk that fails every two weeks, no problem. What are we talking
> about? 0.3% of the complete storage cluster, or even 0.15% if I take the
> 72x 3.5" servers.
> 

Even if only one disk/OSD fails, Ceph needs to recheck where each placement
group should be stored and move data around if needed.

If your CPUs are maxed out during this, you might start to lose connections
between OSDs and the process will need to start over.

At least that is how I understand it; I've done a few test installations, but
have not yet deployed it in production.

The Inktank people said in the presentations I've seen (and looking at the
picture in the video from DreamHost, I have a feeling that is what they've
deployed):

12 HDDs == 12 OSDs per machine is ideal, maybe with 2 or 3 SSDs for journaling
if you want more performance.
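
To make that concrete, here is a minimal sketch of how the 12-HDD / 3-SSD split
could be laid out as an old-style ceph.conf fragment. The host name, device
names and partition numbering are made-up examples; only the osd data /
osd journal option names are standard:

#!/usr/bin/env python
# Sketch: 12 HDD-backed OSDs with their journals spread over 3 SSDs
# (4 journal partitions per SSD). Emits an old-style ceph.conf fragment.
# All device names and the host name below are made up for illustration.

HOST = "osd-node-01"
N_OSDS = 12
SSDS = ["/dev/sdm", "/dev/sdn", "/dev/sdo"]
JOURNALS_PER_SSD = N_OSDS // len(SSDS)   # 4 journal partitions per SSD

for osd_id in range(N_OSDS):
    ssd = SSDS[osd_id // JOURNALS_PER_SSD]
    partition = osd_id % JOURNALS_PER_SSD + 1
    print("[osd.%d]" % osd_id)
    print("    host = %s" % HOST)
    print("    osd data = /var/lib/ceph/osd/ceph-%d" % osd_id)
    print("    osd journal = %s%d" % (ssd, partition))
    print("")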

> If a complete server stops working, that is something else. But as I said in
> a different split of this thread: if that happens I have different things to
> worry about than a slow migration of data. As long as no data is lost, I
> don't really care if it takes a bit longer.
> 
> Thanks for the advice.
> 
> Tim
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Tim Mohlmann
Hi,

On Sunday 12 May 2013 18:05:16 Leen Besselink wrote:

> 
> I did see you mentioned you wanted to have many disks in the same machine,
> not just machines with, let's say, 12 disks.
> 
> Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the
> times when recovery is happening?
Nope, did not know that.

The current intent is to install 2x 2.4 GHz Xeon CPUs, handling 8 threads
each. So, 2*8*2.4 = 38.4 GHz for the maximum number of OSDs. It should be fine.

If I go for the 72-disk option, I have to consider doubling that power. The
current max I can select from the dealer I am looking at, for the socket used
in the SuperMicro 72x 3.5" version, is 2x a Xeon X5680, handling 12 threads
each at 3.33 GHz. So, 2*12*3.33 = 79.92 GHz for the maximum number of OSDs.
This should also be fine.
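
A quick back-of-the-envelope sketch of that rule of thumb (this counts
hyperthreads as full threads, which is optimistic; the 1 GHz-per-OSD budget is
the guideline quoted above):

#!/usr/bin/env python
# Sketch: how many OSDs the "1 GHz of Xeon core per OSD during recovery"
# rule of thumb allows for the two CPU options discussed above.

def max_osds(sockets, threads_per_cpu, ghz_per_thread, ghz_per_osd=1.0):
    total_ghz = sockets * threads_per_cpu * ghz_per_thread
    return total_ghz, int(total_ghz / ghz_per_osd)

for label, spec in [("2x E5620, 36-bay chassis", (2, 8, 2.40)),
                    ("2x X5680, 72-bay chassis", (2, 12, 3.33))]:
    total, osds = max_osds(*spec)
    print("%-25s %6.2f GHz -> roughly %2d OSDs" % (label, total, osds))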

What will happen if the CPU is maxed out anyway? Will it slow things down or
crash things? In my opinion it is not a bad thing if a system is maxed out
during such a massive migration, which should not occur on a daily basis.
Sure, a disk that fails every two weeks, no problem. What are we talking
about? 0.3% of the complete storage cluster, or even 0.15% if I take the
72x 3.5" servers.

If a complete server stops working, that is something else. But as I said in a
different split of this thread: if that happens I have different things to
worry about than a slow migration of data. As long as no data is lost, I don't
really care if it takes a bit longer.

Thanks for the advice.

Tim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Leen Besselink
On Sun, May 12, 2013 at 03:14:15PM +0200, Tim Mohlmann wrote:
> Hi,
> 
> On Saturday 11 May 2013 16:04:27 Leen Besselink wrote:
>  
> > Someone is going to correct me if I'm wrong, but I think you misread
> > something.
> >
> >
> > The Mon-daemon doesn't need that much RAM:
> > 
> > The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.
> > 
> Gosh, I feel embarrassed. This actually was my main concern / bottleneck.
> Thanks for pointing this out. It seems Ceph really rocks for deploying
> affordable data clusters.
> 

I did see you mentioned you wanted to have many disks in the same machine,
not just machines with, let's say, 12 disks.

Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the times
when recovery is happening?

> Regards, Tim
> 
> > On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> > > Hi,
> > > 
> > > First of all I am new to ceph and this mailing list. At this moment I am
> > > looking into the possibilities to get involved in the storage business. I
> > > am trying to get an estimate about costs and after that I will start to
> > > determine how to get sufficient income.
> > > 
> > > First I will describe my case, at the bottom you will find my questions.
> > > 
> > > 
> > > GENERAL LAYOUT:
> > > 
> > > Part of this cost calculation is of course hardware. For the larger part
> > > I've already figured it out. In my plans I will be leasing a full rack
> > > (46U). Depending on the domestic needs I will be using 36 or 40U for OSD
> > > storage servers. (I will assume 36U from here on, to keep a solid value
> > > for calculation and have enough spare space for extra devices).
> > > 
> > > Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put
> > > 36/4=9 OSD servers, containing 9*36=324 HDDs.
> > > 
> > > 
> > > HARD DISK DRIVES
> > > 
> > > I have been looking at the WD RE and Red series. RE is more expensive
> > > per GB, but has a larger MTBF and offers a 4TB model. Red is (really)
> > > cheap per GB, but only goes as far as 3TB.
> > > 
> > > At my current calculations it does not matter much whether I put in
> > > expensive WD RE 4TB disks or cheaper WD Red 3TB disks: the price per GB
> > > over the complete cluster expense and 3 years of running costs
> > > (including AFR) is almost the same.
> > > 
> > > So basically, if I could reduce the costs of all the other components
> > > used in the cluster, I would go for the 3TB disk, and if the costs turn
> > > out higher than my first calculation, I would use the 4TB disk.
> > > 
> > > Let's assume 4TB from now on. So, 4*324 = 1296TB. So let's go petabyte ;).
> > > 
> > > 
> > > NETWORK
> > > 
> > > I will use a redundant 2x 10GbE network connection for each node. Two
> > > independent 10GbE switches will be used and I will use bonding between
> > > the interfaces on each node. (Thanks to some guy on the #ceph IRC for
> > > pointing this option out.) I will use VLANs to split the front-side,
> > > back-side and Internet networks.
> > > 
> > > 
> > > OSD SERVER
> > > 
> > > SuperMicro based, 36 HDD hot-swap, dual-socket mainboard, 16x DIMM
> > > sockets. It is advertised that they can take up to 512GB of RAM. I will
> > > install 2x Intel Xeon E5620 2.40 GHz processors, having 4 cores and 8
> > > threads each. For the RAM I am in doubt (see below). I am looking into
> > > running 1 OSD per disk.
> > > 
> > > 
> > > MON AND MDS SERVERS
> > > 
> > > Now comes the big question: what specs are required? At first I had the
> > > plan to use 4 SuperMicro superservers, with 4-socket mainboards that can
> > > take the new 16-core AMD processors and up to 1TB of RAM.
> > > 
> > > I want all 4 of the servers to run a MON service, an MDS service and
> > > customer / public services. Probably I would use VMs (KVM) to separate
> > > them. I will compile my own kernel to enable Kernel Samepage Merging,
> > > hugepage support and memory compaction to make RAM use more efficient.
> > > The requirements for my public services will be added up once I know
> > > what I need for MON and MDS.
> > > 
> > > 
> > > RAM FOR ALL SERVERS
> > > 
> > > So what would you estimate the RAM usage to be?
> > > http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations
> > > 
> > > Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM
> > > requirement for my OSD servers at 18GB, so 32GB should be more than
> > > enough. Although I would like to see if it is possible to use btrfs
> > > compression; in that case I'd need more RAM in there.
> > > 
> > > What I really want to know: how much RAM do I need for the MON and MDS
> > > servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is
> > > expensive!
> > > 
> > > In my case I would need at least 324 GB of RAM for each of them. Initially
> > > I was planning to use 4 servers and each of them running both. Joining
> > > those in a single system, with the other duties the system has to perform
> > > I woul

Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Tim Mohlmann
Hi,

On Saturday 11 May 2013 16:22:15 Dimitri Maziuk wrote:
> SuperMicro has a new 4U chassis w/ 72x3.5" drives (2/canister). You can
> double the number of drives. (With faster drives you may be getting
> close to choking the expander backplane, though.)
Just checked their site and those are awesome. Did not run into them before
because they are not available / advertised yet in the Netherlands. Probably
requesting a quote from a distributor would still be possible.

As for choking the backplane: That would just slow things down a bit, am I 
right?

Probably I intend to write some management scripting, so as not to keep all the
disks in the cluster all the time. When storage grows, I will let the script
add disks / OSDs to the cluster. The unused disks will be in standby mode /
spun down. Probably well before the last disks are put in the cluster, I
should consider re-investment and adding servers anyway.
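
A minimal sketch of what that script could do for the not-yet-deployed disks,
assuming hdparm is installed and it is run as root; the device names are made
up and would in practice come from whatever inventory the script keeps:

#!/usr/bin/env python
# Sketch: put spare, not-yet-deployed disks into standby so they stay spun
# down until they are added to the cluster as OSDs. Needs root and hdparm.

import subprocess

SPARE_DISKS = ["/dev/sdx", "/dev/sdy", "/dev/sdz"]   # example device names

for dev in SPARE_DISKS:
    # -S 241: spin down after 30 minutes idle; -y: go to standby immediately
    subprocess.check_call(["hdparm", "-S", "241", "-y", dev])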

> WD 3+TB drives don't have the option to turn off "advanced format" or
> whatever it's called: the part where they lie to the OS about sector
> size because they ran out of bits for some other counter (will they ever
> learn). In my tests iostat shows 10x the I/O wait on "desktop" WD drives
> compared to Seagates. Aligning partitions to 4096, 16384, or any other
> sector boundary didn't seem to make any difference.

Did not know that. Do you have any references? Does this also apply to the
enterprise disks?

> So we quit buying WDs. Consider Seagates; they go to 4TB in both
> "enterprise" and desktop lines, too.
Pricing is about the same, so why not?

Another question: do you use desktop or enterprise disks in your cluster? I am
having trouble finding MTBFs for desktop drives, and when I do find them, they
are almost the same as for enterprise drives. Is there a caveat in there? Is
the failure test done under different conditions? (Not that you have to know
that.)

If the annual failure rate were double, it would still be cheaper to use
desktop drives in a large cluster, but I'd just like to know for sure.

Thanks and regards,

Tim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-12 Thread Tim Mohlmann
Hi,

On Saturday 11 May 2013 16:04:27 Leen Besselink wrote:
 
> Someone is going to correct me if I'm wrong, but I think you misread
> something.
>
>
> The Mon-daemon doesn't need that much RAM:
> 
> The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.
> 
Gosh, I feel embarrassed. This actually was my main concern / bottleneck.
Thanks for pointing this out. It seems Ceph really rocks for deploying
affordable data clusters.

Regards, Tim

> On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> > Hi,
> > 
> > First of all I am new to ceph and this mailing list. At this moment I am
> > looking into the possibilities to get involved in the storage business. I
> > am trying to get an estimate about costs and after that I will start to
> > determine how to get sufficient income.
> > 
> > First I will describe my case, at the bottom you will find my questions.
> > 
> > 
> > GENERAL LAYOUT:
> > 
> > Part of this cost calculation is of course hardware. For the larger part
> > I've already figured it out. In my plans I will be leasing a full rack
> > (46U). Depending on the domestic needs I will be using 36 or 40U for OSD
> > storage servers. (I will assume 36U from here on, to keep a solid value
> > for calculation and have enough spare space for extra devices).
> > 
> > Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put
> > 36/4=9 OSD servers, containing 9*36=324 HDDs.
> > 
> > 
> > HARD DISK DRIVES
> > 
> > I have been looking at the WD RE and Red series. RE is more expensive
> > per GB, but has a larger MTBF and offers a 4TB model. Red is (really)
> > cheap per GB, but only goes as far as 3TB.
> > 
> > At my current calculations it does not matter much whether I put in
> > expensive WD RE 4TB disks or cheaper WD Red 3TB disks: the price per GB
> > over the complete cluster expense and 3 years of running costs
> > (including AFR) is almost the same.
> > 
> > So basically, if I could reduce the costs of all the other components
> > used in the cluster, I would go for the 3TB disk, and if the costs turn
> > out higher than my first calculation, I would use the 4TB disk.
> > 
> > Let's assume 4TB from now on. So, 4*324 = 1296TB. So let's go petabyte ;).
> > 
> > 
> > NETWORK
> > 
> > I will use a redundant 2x 10GbE network connection for each node. Two
> > independent 10GbE switches will be used and I will use bonding between
> > the interfaces on each node. (Thanks to some guy on the #ceph IRC for
> > pointing this option out.) I will use VLANs to split the front-side,
> > back-side and Internet networks.
> > 
> > 
> > OSD SERVER
> > 
> > SuperMicro based, 36 HDD hot-swap, dual-socket mainboard, 16x DIMM
> > sockets. It is advertised that they can take up to 512GB of RAM. I will
> > install 2x Intel Xeon E5620 2.40 GHz processors, having 4 cores and 8
> > threads each. For the RAM I am in doubt (see below). I am looking into
> > running 1 OSD per disk.
> > 
> > 
> > MON AND MDS SERVERS
> > 
> > Now comes the big question: what specs are required? At first I had the
> > plan to use 4 SuperMicro superservers, with 4-socket mainboards that can
> > take the new 16-core AMD processors and up to 1TB of RAM.
> > 
> > I want all 4 of the servers to run a MON service, an MDS service and
> > customer / public services. Probably I would use VMs (KVM) to separate
> > them. I will compile my own kernel to enable Kernel Samepage Merging,
> > hugepage support and memory compaction to make RAM use more efficient.
> > The requirements for my public services will be added up once I know
> > what I need for MON and MDS.
> > 
> > 
> > RAM FOR ALL SERVERS
> > 
> > So what would you estimate the RAM usage to be?
> > http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations
> > 
> > Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM
> > requirement for my OSD servers at 18GB, so 32GB should be more than
> > enough. Although I would like to see if it is possible to use btrfs
> > compression; in that case I'd need more RAM in there.
> > 
> > What I really want to know: how much RAM do I need for the MON and MDS
> > servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is
> > expensive!
> > 
> > In my case I would need at least 324 GB of RAM for each of them. Initially
> > I was planning to use 4 servers, each of them running both. Joining those
> > in a single system, with the other duties the system has to perform, I
> > would need the full 1TB of RAM. I would need to use 32GB modules, which
> > are really expensive per GB and difficult to find (not many server
> > hardware vendors in the Netherlands have them).
> > 
> > 
> > QUESTIONS
> > 
> > Question 1: Is it really the number of OSDs that counts for MON and MDS
> > RAM usage, or the size of the object store?
> > 
> > Question 2: can I do it with less RAM? Any statistics, or better: a
> > calculation? I can imagine memory pages becoming redundant if the cluster
> > grows, so less memory r

Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-11 Thread Dimitri Maziuk
On 05/11/2013 08:42 AM, Tim Mohlmann wrote:

> Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put 
> 36/4=9 OSD servers, containing 9*36=324 HDDs.

SuperMicro has a new 4U chassis w/ 72x3.5" drives (2/canister). You can
double the number of drives. (With faster drives you may be getting
close to choking the expander backplane, though.)

> HARD DISK DRIVES
> 
> I have been looking for WD digital RE and RED series. RE is more expensive 
> per 
> GB, but has a larger MTBF and offers a 4TB model. RED is (real) cheap per GB, 
> but only goes as far a 3TB.

WD 3+TB drives don't have the option to turn off "advanced format" or
whatever it's called: the part where they lie to the OS about sector
size because they ran out of bits for some other counter (will they ever
learn). In my tests iostat shows 10x the I/O wait on "desktop" WD drives
compared to Seagates. Aligning partitions to 4096, 16384, or any other
sector boundary didn't seem to make any difference.
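
For anyone who wants to check their own layout, a small sketch of the alignment
arithmetic behind this (the partition start sector, in 512-byte units, can be
read from fdisk output or /sys/block/<disk>/<partition>/start):

#!/usr/bin/env python
# Sketch: check whether a partition start, given in 512-byte logical sectors,
# lands on the 4096-byte physical sector boundary of an Advanced Format drive.

LOGICAL_BYTES = 512      # sector size the drive reports to the OS
PHYSICAL_BYTES = 4096    # sector size it actually uses internally

def is_aligned(start_sector):
    return (start_sector * LOGICAL_BYTES) % PHYSICAL_BYTES == 0

# 63 is the old DOS/CHS default start and is misaligned;
# 2048 (1 MiB) is what current partitioning tools use and is aligned.
for start in (63, 2048):
    print("start sector %4d: %s" % (start, "aligned" if is_aligned(start) else "MISALIGNED"))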

So we quit buying WDs. Consider Seagates; they go to 4TB in both
"enterprise" and desktop lines, too.

HTH
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendation / calculation for large cluster

2013-05-11 Thread Leen Besselink
Hi,

Someone is going to correct me if I'm wrong, but I think you misread something.

The Mon-daemon doesn't need that much RAM:

The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.

The same for disk-space.

You should read this page again:

http://ceph.com/docs/master/install/hardware-recommendations/

Some of the other questions are answered there as well.

Like how much memory an OSD daemon needs, and why/when.



On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> Hi,
> 
> First of all I am new to ceph and this mailing list. At this moment I am 
> looking into the possibilities to get involved in the storage business. I am 
> trying to get an estimate about costs and after that I will start to 
> determine 
> how to get sufficient income.
> 
> First I will describe my case, at the bottom you will find my questions.
> 
> 
> GENERAL LAYOUT:
> 
> Part of this cost calculation is of course hardware. For the larger part I've 
> already figured it out. In my plans I will be leasing a full rack (46U). 
> Depending on the domestic needs I will be using 36 or 40U for OSD storage
> servers. (I will assume 36U from here on, to keep a solid value for 
> calculation and have enough spare space for extra devices).
> 
> Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put 
> 36/4=9 OSD servers, containing 9*36=324 HDDs.
> 
> 
> HARD DISK DRIVES
> 
> I have been looking at the WD RE and Red series. RE is more expensive per
> GB, but has a larger MTBF and offers a 4TB model. Red is (really) cheap per
> GB, but only goes as far as 3TB.
> 
> At my current calculations it does not matter much whether I put in expensive
> WD RE 4TB disks or cheaper WD Red 3TB disks: the price per GB over the
> complete cluster expense and 3 years of running costs (including AFR) is
> almost the same.
> 
> So basically, if I could reduce the costs of all the other components used in
> the cluster, I would go for the 3TB disk, and if the costs turn out higher
> than my first calculation, I would use the 4TB disk.
> 
> Let's assume 4TB from now on. So, 4*324 = 1296TB. So let's go petabyte ;).
> 
> 
> NETWORK
> 
> I will use a redundant 2x 10GbE network connection for each node. Two
> independent 10GbE switches will be used and I will use bonding between the
> interfaces on each node. (Thanks to some guy on the #ceph IRC for pointing
> this option out.) I will use VLANs to split the front-side, back-side and
> Internet networks.
> 
> 
> OSD SERVER
> 
> SuperMicro based, 36 HDD hot-swap, dual-socket mainboard, 16x DIMM sockets. It
> is advertised that they can take up to 512GB of RAM. I will install 2x Intel
> Xeon E5620 2.40 GHz processors, having 4 cores and 8 threads each. For the RAM
> I am in doubt (see below). I am looking into running 1 OSD per disk.
> 
> 
> MON AND MDS SERVERS
> 
> Now comes the big question: what specs are required? At first I had the plan
> to use 4 SuperMicro superservers, with 4-socket mainboards that can take the
> new 16-core AMD processors and up to 1TB of RAM.
> 
> I want all 4 of the servers to run a MON service, an MDS service and customer
> / public services. Probably I would use VMs (KVM) to separate them. I will
> compile my own kernel to enable Kernel Samepage Merging, hugepage support and
> memory compaction to make RAM use more efficient. The requirements for my
> public services will be added up once I know what I need for MON and MDS.
> 
> 
> RAM FOR ALL SERVERS
> 
> So what would you estimate the RAM usage to be?
> http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations
> 
> Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM
> requirement for my OSD servers at 18GB, so 32GB should be more than enough.
> Although I would like to see if it is possible to use btrfs compression; in
> that case I'd need more RAM in there.
> 
> What I really want to know: how much RAM do I need for the MON and MDS
> servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is
> expensive!
> 
> In my case I would need at least 324 GB of RAM for each of them. Initially I
> was planning to use 4 servers, each of them running both. Joining those in
> a single system, with the other duties the system has to perform, I would
> need the full 1TB of RAM. I would need to use 32GB modules, which are really
> expensive per GB and difficult to find (not many server hardware vendors in
> the Netherlands have them).
> 
> 
> QUESTIONS
> 
> Question 1: Is it really the number of OSDs that counts for MON and MDS RAM
> usage, or the size of the object store?
> 
> Question 2: Can I do it with less RAM? Any statistics, or better: a
> calculation? I can imagine memory pages becoming redundant as the cluster
> grows, so less memory required per OSD.
> 
> Question 3: If it is the number of OSDs that counts, would it be beneficial
> to combine disks in a RAID 0 (LVM or btrfs) array?
> 
> Question 4: Is

[ceph-users] Hardware recommendation / calculation for large cluster

2013-05-11 Thread Tim Mohlmann
Hi,

First of all, I am new to Ceph and this mailing list. At this moment I am
looking into the possibility of getting involved in the storage business. I am
trying to get an estimate of the costs, and after that I will start to
determine how to get sufficient income.

First I will describe my case, at the bottom you will find my questions.


GENERAL LAYOUT:

Part of this cost calculation is of course hardware. For the larger part I've 
already figured it out. In my plans I will be leasing a full rack (46U). 
Depending on the domestic needs I will be using 36 or 40U for OSD storage
servers. (I will assume 36U from here on, to keep a solid value for 
calculation and have enough spare space for extra devices).

Each OSD server uses 4U and can take 36x3.5" drives. So in 36U I can put 
36/4=9 OSD servers, containing 9*36=324 HDDs.


HARD DISK DRIVES

I have been looking at the WD RE and Red series. RE is more expensive per
GB, but has a larger MTBF and offers a 4TB model. Red is (really) cheap per GB,
but only goes as far as 3TB.

At my current calculations it does not matter much whether I put in expensive
WD RE 4TB disks or cheaper WD Red 3TB disks: the price per GB over the complete
cluster expense and 3 years of running costs (including AFR) is almost the
same.

So basically, if I could reduce the costs of all the other components used in
the cluster, I would go for the 3TB disk, and if the costs turn out higher than
my first calculation, I would use the 4TB disk.

Let's assume 4TB from now on. So, 4*324 = 1296TB. So let's go petabyte ;).
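
Note that 1296TB is raw capacity; a small sketch of what is left once Ceph's
replication is taken into account (the replica counts are assumptions for
illustration, not something settled in this thread):

#!/usr/bin/env python
# Sketch: raw vs. roughly usable capacity for 324 x 4TB drives under
# replicated pools. The replica counts below are assumptions.

DRIVES = 324
DRIVE_TB = 4
raw_tb = DRIVES * DRIVE_TB   # 1296 TB raw

for replicas in (2, 3):
    print("%d TB raw with %dx replication -> about %d TB usable"
          % (raw_tb, replicas, raw_tb // replicas))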


NETWORK

I will use a redundant 2x 10GbE network connection for each node. Two
independent 10GbE switches will be used and I will use bonding between the
interfaces on each node. (Thanks to some guy on the #ceph IRC for pointing
this option out.) I will use VLANs to split the front-side, back-side and
Internet networks.
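
A sketch of the iproute2 commands such a bonded + VLAN setup could boil down
to. The commands are only printed here; interface names, VLAN IDs and addresses
are made up, and a real deployment would put this in the distribution's network
configuration instead:

#!/usr/bin/env python
# Sketch: LACP bond over two 10GbE ports, with VLANs for the front-side
# (public) and back-side (cluster) networks. Names and addresses are examples.

BOND = "bond0"
SLAVES = ["eth2", "eth3"]                # the two 10GbE interfaces (assumed)
VLANS = {101: "10.0.101.11/24",          # front-side / public network
         102: "10.0.102.11/24"}          # back-side / cluster network

cmds = ["ip link add %s type bond mode 802.3ad" % BOND]
for nic in SLAVES:
    cmds += ["ip link set %s down" % nic,
             "ip link set %s master %s" % (nic, BOND),
             "ip link set %s up" % nic]
cmds += ["ip link set %s up" % BOND]
for vid, addr in sorted(VLANS.items()):
    vlan_if = "%s.%d" % (BOND, vid)
    cmds += ["ip link add link %s name %s type vlan id %d" % (BOND, vlan_if, vid),
             "ip addr add %s dev %s" % (addr, vlan_if),
             "ip link set %s up" % vlan_if]

print("\n".join(cmds))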


OSD SERVER

SuperMicro based, 36 HDD hot-swap, dual-socket mainboard, 16x DIMM sockets. It
is advertised that they can take up to 512GB of RAM. I will install 2x Intel
Xeon E5620 2.40 GHz processors, having 4 cores and 8 threads each. For the RAM
I am in doubt (see below). I am looking into running 1 OSD per disk.


MON AND MDS SERVERS

Now comes the big question: what specs are required? At first I had the plan to
use 4 SuperMicro superservers, with 4-socket mainboards that can take the new
16-core AMD processors and up to 1TB of RAM.

I want all 4 of the servers to run a MON service, an MDS service and customer /
public services. Probably I would use VMs (KVM) to separate them. I will
compile my own kernel to enable Kernel Samepage Merging, hugepage support and
memory compaction to make RAM use more efficient. The requirements for my
public services will be added up once I know what I need for MON and MDS.


RAM FOR ALL SERVERS

So what would you estimate the RAM usage to be?
http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations

Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM
requirement for my OSD servers at 18GB, so 32GB should be more than enough.
Although I would like to see if it is possible to use btrfs compression; in
that case I'd need more RAM in there.

What I really want to know: how much RAM do I need for the MON and MDS servers?
1GB per daemon sounds pretty steep. As everybody knows, RAM is expensive!

In my case I would need at least 324 GB of RAM for each of them. Initially I
was planning to use 4 servers, each of them running both. Joining those in a
single system, with the other duties the system has to perform, I would need
the full 1TB of RAM. I would need to use 32GB modules, which are really
expensive per GB and difficult to find (not many server hardware vendors in
the Netherlands have them).
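
A sizing sketch of how those numbers work out when the recommendations are read
per daemon type (the ~500 MB per OSD daemon and 1 GB per MON/MDS daemon figures
are the ones quoted in this thread; recovery can push OSD usage higher, so
treat this as a floor):

#!/usr/bin/env python
# Sketch: minimum RAM per box when the recommendations are read per daemon
# type, rather than as 1 GB per OSD in the whole cluster.

OSDS_PER_BOX = 36
MONS_PER_BOX = 1
MDS_PER_BOX = 1

osd_box_gb = OSDS_PER_BOX * 0.5                      # 36 x 500 MB = 18 GB
mon_mds_box_gb = MONS_PER_BOX * 1 + MDS_PER_BOX * 1  # 2 GB, not 324 GB

print("OSD box minimum:     %2.0f GB (32 GB leaves headroom for recovery)" % osd_box_gb)
print("MON+MDS box minimum: %2.0f GB (plus whatever the public services need)" % mon_mds_box_gb)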


QUESTIONS

Question 1: Is it really the number of OSDs that counts for MON and MDS RAM
usage, or the size of the object store?

Question 2: Can I do it with less RAM? Any statistics, or better: a
calculation? I can imagine memory pages becoming redundant as the cluster
grows, so less memory required per OSD.

Question 3: If it is the number of OSDs that counts, would it be beneficial to
combine disks in a RAID 0 (LVM or btrfs) array?

Question 4: Is it safe / possible to store the MON files inside the cluster
itself? The 10GB per daemon requirement would mean I need 3240GB of storage
for each MON, meaning I would need to get some huge disks and an (LVM) RAID 1
array for redundancy, while I have a huge redundant file system at hand
already.

Question 5: Is it possible to enable btrfs compression? I know btrfs is not
stable for production yet, but it would be nice if compression is supported in
the future, when it does become stable.
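
For reference, a sketch of what enabling it would look like on a btrfs-backed
OSD data partition; compress=lzo and noatime are standard btrfs mount options,
while the device and mount point below are made-up examples:

#!/usr/bin/env python
# Sketch: mount command and fstab line for a btrfs OSD data partition with
# compression enabled. Device and mount point are examples only.

dev, mnt = "/dev/sdb1", "/var/lib/ceph/osd/ceph-0"
print("mount -o noatime,compress=lzo %s %s" % (dev, mnt))
print("%s  %s  btrfs  noatime,compress=lzo  0 0" % (dev, mnt))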

If the RAM requirement is not so steep, I am thinking about the possibility of
running the MON service from 4 of the OSD servers. Upgrading them to 16x 16GB
of RAM would give me 256GB of RAM. (Again, 32GB modules are too expensive and
not an
option). T