Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Thu, May 17, 2012 at 2:27 PM, Gregory Farnum g...@inktank.com wrote:
 Sorry this got left for so long...

 On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
 s.pri...@profihost.ag wrote:
 Hi,

 the Designing a cluster guide
 http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
 still leaves some questions unanswered.

 It mentions for example Fast CPU for the mds system. What does fast
 mean? Just the speed of one core? Or is ceph designed to use multi core?
 Is multi core or more speed important?
 Right now, it's primarily the speed of a single core. The MDS is
 highly threaded but doing most things requires grabbing a big lock.
 How fast is a qualitative rather than quantitative assessment at this
 point, though.

 The Cluster Design Recommendations mention separating all daemons onto
 dedicated machines. Is this also useful for the MON? As they're so
 lightweight, why not run them on the OSDs?
 It depends on what your nodes look like, and what sort of cluster
 you're running. The monitors are pretty lightweight, but they will add
 *some* load. More important is their disk access patterns — they have
 to do a lot of syncs. So if they're sharing a machine with some other
 daemon you want them to have an independent disk and to be running a
 new kernel+glibc so that they can use syncfs rather than sync. (The
 only distribution I know for sure does this is Ubuntu 12.04.)

I just had it pointed out to me that I rather overstated the
importance of syncfs if you were going to do this. The monitor mostly
does fsync, not sync/syncfs(), so that's not so important. What is
important is that it has highly seeky disk behavior, so you don't want
a ceph-osd and ceph-mon daemon to be sharing a disk. :)
-Greg
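
For anyone who wants to see the distinction concretely, here is a minimal
sketch of the three calls involved (Python via ctypes, since syncfs isn't
exposed directly; the path is only a placeholder):

    import ctypes
    import os

    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    fd = os.open("/var/lib/mon-data", os.O_RDONLY)   # placeholder data dir

    os.fsync(fd)      # flush one file/dir -- roughly what the monitor does
    libc.syncfs(fd)   # flush only the filesystem holding fd (Linux >= 2.6.39)
    libc.sync()       # flush every mounted filesystem -- the costly fallback

    os.close(fd)

Whichever call is in play, the workload stays seek-heavy, which is the real
reason not to share a spindle with an OSD.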


Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 11:07 AM, Gregory Farnum g...@inktank.com wrote:
 the Designing a cluster guide
 http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
 still leaves some questions unanswered.

Oh, thank you.  I've been poking through the Ceph docs, but somehow
had not managed to turn up the wiki yet.

What are the likely and worst case scenarios if the OSD journal were
to simply be on a garden variety ramdisk, no battery backing?  In the
case of a single node losing power, and thus losing some data, surely
Ceph can recognize this, and handle it through normal redundancy?  I
could see it being an issue if the whole cluster lost power at once.
Anything I'm missing?

Brian.


Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum g...@inktank.com wrote:
 If you lose a journal, you lose the OSD.

Really?  Everything?  Not just recent commits?  I would have hoped it
would just come back up in an old state.  Replication should have
already been taking care of regaining redundancy for the stuff that
was on it, particularly the newest stuff that wouldn't return with it
and say Hi, I'm back.

I suppose it makes the design easier though. =)

Brian.


Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 1:59 PM, Brian Edmonds mor...@gmail.com wrote:
 On Fri, Jun 29, 2012 at 11:50 AM, Gregory Farnum g...@inktank.com wrote:
 If you lose a journal, you lose the OSD.

 Really?  Everything?  Not just recent commits?  I would have hoped it
 would just come back up in an old state.  Replication should have
 already been taking care of regaining redundancy for the stuff that
 was on it, particularly the newest stuff that wouldn't return with it
 and say Hi, I'm back.

 I suppose it makes the design easier though. =)

Well, actually this depends on the filesystem you're using. With
btrfs, the OSD will roll back to a consistent state, but you don't
know how out-of-date that state is. (Practically speaking, it's pretty
new, but if you were doing any writes it is going to be data loss.)
With xfs/ext4/other, the OSD can't create consistency points the same
way it can with btrfs, and so the loss of a journal means that it
can't repair itself.
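
A toy summary of the behaviour just described (purely illustrative, not Ceph
code; the function name and return strings are made up):

    def outcome_after_journal_loss(filestore_fs):
        """What happens to an OSD whose journal is gone, per the explanation above."""
        if filestore_fs == "btrfs":
            # btrfs snapshots give the OSD a consistent (possibly stale) state
            # to roll back to; anything newer than that snapshot is lost.
            return "roll back to the last consistent snapshot; recent writes lost"
        # xfs/ext4/other: no consistency points without the journal, so the
        # OSD cannot repair itself and must be rebuilt from its replicas.
        return "OSD unrecoverable; re-create it and let replication backfill"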

Sorry for not mentioning the distinction earlier; I didn't think we'd
implemented the rollback on btrfs. :)
-Greg


Re: Designing a cluster guide

2012-06-29 Thread Brian Edmonds
On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote:
 Well, actually this depends on the filesystem you're using. With
 btrfs, the OSD will roll back to a consistent state, but you don't
 know how out-of-date that state is.

Ok, so assuming btrfs, then a single machine failure with a ramdisk
journal should not result in any data loss, assuming replication is
working?  The cluster would then be at risk of data loss primarily
from a full power outage.  (In practice I'd expect either one machine
to die, or a power loss to take out all of them, and smaller but
non-unitary losses would be uncommon.)

Something to play with, perhaps.

Brian.


Re: Designing a cluster guide

2012-06-29 Thread Gregory Farnum
On Fri, Jun 29, 2012 at 2:18 PM, Brian Edmonds mor...@gmail.com wrote:
 On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote:
 Well, actually this depends on the filesystem you're using. With
 btrfs, the OSD will roll back to a consistent state, but you don't
 know how out-of-date that state is.

 Ok, so assuming btrfs, then a single machine failure with a ramdisk
 journal should not result in any data loss, assuming replication is
 working?  The cluster would then be at risk of data loss primarily
 from a full power outage.  (In practice I'd expect either one machine
 to die, or a power loss to take out all of them, and smaller but
 non-unitary losses would be uncommon.)

That's correct. And replication will be working — it's all
synchronous, so if the replication isn't working, you won't be able to
write. :) There are some edge cases here — if an OSD is down but not
out then you might not have the same number of data copies as
normal, but that's all configurable.
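
For reference, the knobs behind "down but not out" and the copy count live in
ceph.conf; a sketch with illustrative values only (the option names are the
standard ones, the numbers are not recommendations):

    [global]
        ; replicas kept per object in newly created pools (example value)
        osd pool default size = 2

        ; how long an OSD may stay "down" before the monitors mark it "out"
        ; and re-replication starts (seconds, example value)
        mon osd down out interval = 300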


 Something to play with, perhaps.

 Brian.


Re: Designing a cluster guide

2012-06-29 Thread Sage Weil
On Fri, 29 Jun 2012, Brian Edmonds wrote:
 On Fri, Jun 29, 2012 at 2:11 PM, Gregory Farnum g...@inktank.com wrote:
  Well, actually this depends on the filesystem you're using. With
  btrfs, the OSD will roll back to a consistent state, but you don't
  know how out-of-date that state is.
 
 Ok, so assuming btrfs, then a single machine failure with a ramdisk
 journal should not result in any data loss, assuming replication is
 working?  The cluster would then be at risk of data loss primarily
 from a full power outage.  (In practice I'd expect either one machine
 to die, or a power loss to take out all of them, and smaller but
 non-unitary losses would be uncommon.)

Right.  From a data-safety perspective ("the cluster said my writes were 
safe... are they?"), consider journal loss an OSD failure.  If there aren't 
other surviving replicas, something may be lost.

From a recovery perspective, it is a partial failure; not everything was 
lost, and recovery will be quick (only recent objects get copied around).  
Maybe your application can tolerate that, maybe it can't.

sage



RE: Designing a cluster guide

2012-05-29 Thread Quenten Grasso
Interesting. I've been thinking about this, and I think most Ceph installations 
could benefit from more nodes and fewer disks per node.

For example:

We have a replica level of 2 and an RBD block size of 4MB. You start writing a 
file of 10GB, which is effectively divided into 4MB chunks.

The first chunk goes to node 1 and node 2 (at the same time, I assume), where it 
is written to a journal and then replayed to the data file system.

The second chunk might be sent to nodes 2 and 3 at the same time, where it is 
written to a journal and then replayed (we now have overlap with chunk 1).

The third chunk might be sent to nodes 1 and 3 (we have more overlap with chunks 
1 and 2), and as you can see this quickly becomes an issue.

So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see 
better write and read performance, as you would have less overlap.
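
A rough way to see the overlap argument is to simulate it. This is not CRUSH,
just uniform-random placement of 4MB chunks with two replicas, but it shows how
the per-node share of writes shrinks as nodes are added:

    import random
    from collections import Counter

    def simulate(num_chunks, num_nodes, replicas=2, seed=1):
        """Place each chunk on `replicas` distinct random nodes; return per-node counts."""
        rng = random.Random(seed)
        load = Counter()
        for _ in range(num_chunks):
            for node in rng.sample(range(num_nodes), replicas):
                load[node] += 1
        return load

    chunks = (10 * 1024) // 4   # a 10GB file in 4MB chunks = 2560 chunks

    for nodes in (3, 10):
        load = simulate(chunks, nodes)
        print(nodes, "nodes: busiest node holds", max(load.values()),
              "chunks, average", sum(load.values()) // nodes)

With 3 nodes and 2 replicas every node ends up handling roughly two thirds of
all chunks; with 10 nodes each handles roughly a fifth, which is the overlap
effect described above.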

Now we take btrfs into the picture: as I understand it, journals are not necessary 
due to the way it writes/snapshots and reads data, and this alone would be a major 
performance increase on a btrfs RAID level (like ZFS RAIDZ).

Side note: this may sound crazy, but the more I read about SSDs the less I wish 
to use/rely on them, and RAM SSDs are crazily priced IMO. =)

Regards,
Quenten


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Slawomir Skowron
Sent: Tuesday, 22 May 2012 3:52 PM
To: Quenten Grasso
Cc: Gregory Farnum; ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

I get performance from an RBD cluster of near 320MB/s on a VM from a 3
node cluster, but with 10GbE, and with 26 2.5" SAS drives used in every
machine it's not everything that it could be.
Every OSD drive is a single-drive RAID 0 behind battery-backed NVRAM cache
in a hardware RAID controller.
Every OSD takes a lot of RAM for caching.

That's why I'm thinking about changing 2 drives to SSDs in RAID 1, with
the HPA tuned to increase drive durability, for journaling - but only if
this will work ;)

With the newest drives I can theoretically get 500MB/s with a long queue
depth. This means that I can in theory improve the bandwidth score, get
lower latency, and handle multiple IO writes from many hosts better.
Reads are cached in RAM by the OSD daemon, the VFS in the kernel, NVRAM in
the controller, and in the near future will improve with the cache in KVM
(I need to test that - it should improve performance).

But if the SSD drive slows down, it can drag the whole write performance
down. It's very delicate.

Regards,

iSS

On 22 May 2012, at 02:47, Quenten Grasso qgra...@onq.com.au wrote:

 I should have added: for storage I'm considering something like enterprise 
 nearline SAS 3TB disks, running individual disks (not RAIDed) with a rep level of 
 2 as suggested :)


 Regards,
 Quenten



Re: Designing a cluster guide

2012-05-29 Thread Tommi Virtanen
On Tue, May 29, 2012 at 12:25 AM, Quenten Grasso qgra...@onq.com.au wrote:
 So if we have 10 nodes vs. 3 nodes with the same amount of disks, we should see 
 better write and read performance, as you would have less overlap.

First of all, a typical way to run Ceph is with say 8-12 disks per
node, and an OSD per disk. That means your 3-10 node clusters actually
have 24-120 OSDs on them. The number of physical machines is not
really a factor, number of OSDs is what matters.

Secondly, 10-node or 3-node clusters are fairly uninteresting for
Ceph. The real challenge is at the hundreds, thousands and above
range.

 Now we take BTRFS into the picture as I understand journals are not necessary 
 due to the nature of the way it writes/snapshots and reads data this alone 
 would be a major performance increase on a BTRFS Raid level (like ZFS RAIDZ).

A journal is still needed on btrfs; snapshots just enable us to write
to the journal in parallel with the real write, instead of needing to
journal first.


Re: Designing a cluster guide

2012-05-24 Thread Jerker Nyberg

On Wed, 23 May 2012, Gregory Farnum wrote:


On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote:


 * Scratch file system for HPC. (kernel client)
 * Scratch file system for research groups. (SMB, NFS, SSH)
 * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
 * Metropolitan cluster.
 * VDI backend. KVM with RBD.


Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.


Oh, sorry, those were my use cases for Ceph in general. Yes, scratch is 
mostly of interest, but also fast backup. Currently IOPS is limiting our 
backup speed on a small cluster with many files but not much data. I have 
problems scanning through and backing up all changed files every night. 
Currently I am backing up to ZFS, but Ceph might help with scaling up 
performance and size. Another option is going for SSDs instead of 
mechanical drives.



Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)


Thank you for your thoughts on this. I hope to be able to do that soon.

Regards,
Jerker Nyberg, Uppsala, Sweden.

Re: Designing a cluster guide

2012-05-23 Thread Gregory Farnum
On Wed, May 23, 2012 at 12:47 PM, Jerker Nyberg jer...@update.uu.se wrote:
 On Tue, 22 May 2012, Gregory Farnum wrote:

 Direct users of the RADOS object store (i.e., librados) can do all kinds
 of things with the integrity guarantee options. But I don't believe there's
 currently a way to make the filesystem do so; among other things, you're
 running through the page cache and other writeback caches anyway, so it
 generally wouldn't be useful except when running an fsync or similar. And at
 that point you probably really want to not be lying to the application
 that's asking for it.


 I am comparing with in-memory databases. If replication and failovers are
 used, couldn't in-memory in some cases be good enough? And faster.


 do you have a use case on Ceph?


 Currently of interest:

  * Scratch file system for HPC. (kernel client)
  * Scratch file system for research groups. (SMB, NFS, SSH)
  * Backend for simple disk backup. (SSH/rsync, AFP, BackupPC)
  * Metropolitan cluster.
  * VDI backend. KVM with RBD.
Hmm. Sounds to me like scratch filesystems would get a lot out of not
having to hit disk on the commit, but not much out of having separate
caching locations versus just letting the OSD page cache handle it. :)
The others, I don't really see collaborative caching helping much either.

So basically it sounds like you want to be able to toggle off Ceph's
data safety requirements. That would have to be done in the clients;
it wouldn't even be hard in ceph-fuse (although I'm not sure about the
kernel client). It's probably a pretty easy way to jump into the code
base :)
Anyway, make a bug for it in the tracker (I don't think one exists
yet, though I could be wrong) and someday when we start work on the
filesystem again we should be able to get to it. :)
-Greg


Re: Designing a cluster guide

2012-05-22 Thread Stefan Priebe - Profihost AG
On 21.05.2012 20:13, Gregory Farnum wrote:
 On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 So would you recommend a fast (more GHz) Core i3 instead of a single Xeon
 for this system? (price per GHz is better).
 
 If that's all the MDS is doing there, probably? (It would also depend
 on cache sizes and things; I don't have a good sense for how that
 impacts the MDS' performance.)
As I'm only using KVM / RBD, I don't have any MDS.

 Well, RAID1 isn't going to make it any faster than just the single
 SSD, which is why I pointed that out.

 I wouldn't recommend using a ramdisk for the journal — that will
 guarantee local data loss in the event the server doesn't shut down
 properly, and if it happens to several servers at once you get a good
 chance of losing client writes.
Sure, but it's the same when NOT using a RAID 1 for the journal, isn't it?

Stefan


Re: Designing a cluster guide

2012-05-22 Thread Jerker Nyberg

On Mon, 21 May 2012, Gregory Farnum wrote:


This one — the write is considered safe once it is on-disk on all
OSDs currently responsible for hosting the object.


Is it possible to configure the client to consider the write successful 
when the data is hitting RAM on all the OSDs but not yet committed to 
disk?


Also, the IBM zFS research file system is talking about cooperative cache 
and Lustre about a collaborative cache. Do you have any thoughts on this 
regarding Ceph?


Regards,
Jerker Nyberg, Uppsala, Sweden.

Re: Designing a cluster guide

2012-05-21 Thread Stefan Priebe - Profihost AG
On 20.05.2012 10:31, Christian Brunner wrote:
 That's exactly what I thought too, but then you need a separate Ceph / RBD
 cluster for each type.

 Which will result in a minimum of:
 3x mon servers per type
 4x osd servers per type
 ---

 so you'll need a minimum of 12x osd systems and 9x mon systems.
 
 You can arrange the storage types in different pools, so that you
 don't need separate mon servers (this can be done by adjusting the
 crushmap) and you could even run multiple OSDs per server.
That sounds great. Can you give me a hint on how to set up pools? Right now
I have data, metadata and rbd (the default pools), but I wasn't able to
find any page in the wiki which describes how to set up pools.

Thanks,
Stefan


Re: Designing a cluster guide

2012-05-21 Thread Christian Brunner
2012/5/20 Tim O'Donovan t...@icukhosting.co.uk:
 - High performance Block Storage (RBD)

   Many large SATA SSDs for the storage (probably in a RAID5 config)
   stec zeusram ssd drive for the journal

 How do you think standard SATA disks would perform in comparison to
 this, and is a separate journaling device really necessary?

A journaling device improves write latency a lot, and write
latency is directly related to the throughput you get in your virtual
machine. If you have a RAID controller with a battery-backed write
cache, you could try to put the journal on a separate, small partition
of your SATA disk. I haven't tried this, but I think it could work.
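
If someone wants to try that, the relevant ceph.conf fragment would look
roughly like this ('osd journal' and 'osd journal size' are the standard
option names; the host and paths are just examples):

    [osd.0]
        host = node1                                   ; example host
        ; journal file on a small dedicated partition of the same SATA disk,
        ; sitting behind the controller's battery-backed write cache
        osd journal = /srv/ceph/journal/osd.0/journal  ; example path
        osd journal size = 1024                        ; MB, example value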

Apart from that, you should calculate the sum of the IOPS your guests
generate. In the end everything has to be written to your backend
storage, and it has to be able to deliver those IOPS.

With the journal you might be able to compensate for short write peaks, and
there might be a gain from merging write requests on the OSDs, but for a
solid sizing I would neglect this. Read requests can be delivered from
the OSDs' cache (RAM), but again this will probably give you only a
small gain.

For a single SATA disk you can calculate with 100-150 IOPS (depending
on the speed of the disk). SSDs can deliver much higher IOPS values.

 Perhaps three servers, each with 12 x 1TB SATA disks configured in
 RAID10, an osd on each server and three separate mon servers.

With a replication level of two this would be 1350 IOPS:

150 IOPS per disk * 12 disks * 3 servers / 2 for the RAID10 / 2 for
ceph replication

Comments on this formula would be welcome...
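
Spelled out (same assumptions as the formula above; adjust the inputs for your
own hardware):

    iops_per_disk  = 150   # per-SATA-disk estimate from above
    disks_per_node = 12
    nodes          = 3
    raid10_factor  = 2     # each write lands on both halves of a mirror
    ceph_replicas  = 2     # each write lands on two OSDs

    write_iops = iops_per_disk * disks_per_node * nodes / raid10_factor / ceph_replicas
    print(write_iops)      # -> 1350.0

Note this is a write-side estimate; reads are not multiplied by replication in
the same way.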

 Would this be suitable for the storage backend for a small OpenStack
 cloud, performance wise, for instance?

That depends on what you are doing in your guests.

Regards,
Christian


Re: Designing a cluster guide

2012-05-21 Thread Christian Brunner
2012/5/21 Stefan Priebe - Profihost AG s.pri...@profihost.ag:
 On 20.05.2012 10:31, Christian Brunner wrote:
 That's exactly what I thought too, but then you need a separate Ceph / RBD
 cluster for each type.

 Which will result in a minimum of:
 3x mon servers per type
 4x osd servers per type
 ---

 so you'll need a minimum of 12x osd systems and 9x mon systems.

 You can arrange the storage types in different pools, so that you
 don't need separate mon servers (this can be done by adjusting the
 crushmap) and you could even run multiple OSDs per server.
 That sounds great. Can you give me a hint on how to set up pools? Right now
 I have data, metadata and rbd (the default pools), but I wasn't able to
 find any page in the wiki which describes how to set up pools.

rados mkpool pool-name [123 [4]]   -- create pool 'pool-name'
                                      [with auid 123 [and using crush rule 4]]

Christian


Re: Designing a cluster guide

2012-05-21 Thread Tomasz Paszkowski
Another great thing that should be mentioned is:
https://github.com/facebook/flashcache/. It gives really huge
performance improvements for reads/writes (especially on FusionIO
drives) even without using librbd caching :-)



On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
 Hi,

 For your journal, if you have money, you can use the

 STEC ZeusRAM SSD drive (around 2000€ / 8GB / 10 iops read/write with 4k 
 blocks).
 I'm using them with a ZFS SAN; they rock for journals.
 http://www.stec-inc.com/product/zeusram.php

 Another interesting product is the DDRdrive:
 http://www.ddrdrive.com/




 --

 --




        Alexandre DERUMIER
 Systems Engineer
 Phone: 03 20 68 88 90
 Fax: 03 20 68 90 81
 45 Bvd du Général Leclerc 59100 Roubaix - France
 12 rue Marivaux 75002 Paris - France




-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


Re: Designing a cluster guide

2012-05-21 Thread Tomasz Paszkowski
The project is indeed very interesting, but it requires patching the kernel
source. For me, using an LKM is safer ;)


On Mon, May 21, 2012 at 5:30 PM, Kiran Patil kirantpa...@gmail.com wrote:
 Hello,

 Has someone looked into bcache (http://bcache.evilpiepirate.org/) ?

 It seems, it is superior to flashcache.

 Lwn.net article: https://lwn.net/Articles/497024/

 Mailing list: http://news.gmane.org/gmane.linux.kernel.bcache.devel

 Source code: http://evilpiepirate.org/cgi-bin/cgit.cgi/linux-bcache.git/

 Thanks,
 Kiran Patil.


 On Mon, May 21, 2012 at 8:42 PM, Tomasz Paszkowski ss7...@gmail.com wrote:

 If you're using Qemu/KVM you can use 'info blockstats' command for
 measuring I/O on a particular VM.


 On Mon, May 21, 2012 at 5:05 PM, Stefan Priebe - Profihost AG
 s.pri...@profihost.ag wrote:
  On 21.05.2012 16:59, Christian Brunner wrote:
  Apart from that, you should calculate the sum of the IOPS your guests
  generate. In the end everything has to be written to your backend
  storage, and it has to be able to deliver those IOPS.
  How do I measure the IOPS of an actual dedicated system?
 
  Stefan



 --
 Tomasz Paszkowski
 SS7, Asterisk, SAN, Datacenter, Cloud Computing
 +48500166299





-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


Re: Designing a cluster guide

2012-05-21 Thread Gregory Farnum
On Sat, May 19, 2012 at 1:37 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi Greg,

 On 17.05.2012 23:27, Gregory Farnum wrote:

 It mentions for example Fast CPU for the mds system. What does fast
 mean? Just the speed of one core? Or is ceph designed to use multi core?
 Is multi core or more speed important?

 Right now, it's primarily the speed of a single core. The MDS is
 highly threaded but doing most things requires grabbing a big lock.
 How fast is a qualitative rather than quantitative assessment at this
 point, though.

 So would you recommend a fast (more GHz) Core i3 instead of a single Xeon
 for this system? (price per GHz is better).

If that's all the MDS is doing there, probably? (It would also depend
on cache sizes and things; I don't have a good sense for how that
impacts the MDS' performance.)

 It depends on what your nodes look like, and what sort of cluster
 you're running. The monitors are pretty lightweight, but they will add
 *some* load. More important is their disk access patterns — they have
 to do a lot of syncs. So if they're sharing a machine with some other
 daemon you want them to have an independent disk and to be running a
 new kernel+glibc so that they can use syncfs rather than sync. (The
 only distribution I know for sure does this is Ubuntu 12.04.)

 Which kernel and which glibc version supports this? I have searched google
 but haven't found an exact version. We're using debian lenny squeeze with a
 custom kernel.

syncfs is in Linux 2.6.39; I'm not sure about glibc but from a quick
web search it looks like it might have appeared in glibc 2.15?
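
Rather than guessing from version numbers, it's easy to check what a given box
actually provides (gnu_get_libc_version() is a standard glibc call; looking up
the syncfs symbol simply fails on a glibc that lacks the wrapper):

    import ctypes
    import platform

    print("kernel:", platform.release())        # syncfs needs 2.6.39 or newer

    libc = ctypes.CDLL("libc.so.6")
    get_version = libc.gnu_get_libc_version
    get_version.restype = ctypes.c_char_p
    print("glibc:", get_version().decode())

    print("syncfs wrapper present:", hasattr(libc, "syncfs"))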

 Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
 perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
 and you should go for 22x SSD Disks in a Raid 6?

 You'll need to do your own failure calculations on this one, I'm
 afraid. Just take note that you'll presumably be limited to the speed
 of your journaling device here.

 Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or is
 this still too slow? Another idea was to use only a ramdisk for the journal
 and back up the files to disk while shutting down and restore them after
 boot.

Well, RAID1 isn't going to make it any faster than just the single
SSD, which is why I pointed that out.
I wouldn't recommend using a ramdisk for the journal — that will
guarantee local data loss in the event the server doesn't shut down
properly, and if it happens to several servers at once you get a good
chance of losing client writes.

 Is it more useful the use a Raid 6 HW Controller or the btrfs raid?

 I would use the hardware controller over btrfs raid for now; it allows
 more flexibility in eg switching to xfs. :)

 OK, but overall you would recommend running one OSD per disk, right? So
 instead of using a RAID 6 with for example 10 disks, you would run 6 OSDs on
 this machine?
Right now all the production systems I'm involved in are using 1 OSD
per disk, but honestly we don't know if that's the right answer or
not. It's a tradeoff — more OSDs increases cpu and memory requirements
(per storage space) but also localizes failure a bit more.

 Use single socket Xeon for the OSDs or Dual Socket?

 Dual socket servers will be overkill given the setup you're
 describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
 daemon. You might consider it if you decided you wanted to do an OSD
 per disk instead (that's a more common configuration, but it requires
 more CPU and RAM per disk and we don't know yet which is the better
 choice).

 Is there also a rule of thumb for the memory?
About 200MB per daemon right now, plus however much you want the page
cache to be able to use. :) This might go up a bit during peering, but
under normal operation it shouldn't be more than another couple
hundred MB.

 My biggest problem with Ceph right now is the awfully slow speed while doing
 random reads and writes.

 Sequential reads and writes are at 200MB/s (that's pretty good for bonded
 dual Gbit/s). But random reads and writes are only at 0.8 - 1.5 MB/s, which is
 definitely too slow.
Hmm. I'm not super-familiar with where our random IO performance is right
now (and lots of other people seem to have advice on journaling
devices :), but that's about in line with what you get from a hard
disk normally. Unless you've designed your application very carefully
(lots and lots of parallel IO), an individual client doing synchronous
random IO is unlikely to be able to get much faster than a regular
drive.
-Greg
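
To make the parallelism point concrete, here is a crude random-write exerciser
(a sketch only -- fio is the proper tool -- but it shows how aggregate IOPS
scales with the number of concurrent synchronous writers, i.e. queue depth;
point PATH at a file on the disk or cluster you actually want to test):

    import os, random, threading, time

    PATH, FILE_SIZE, BLOCK, WRITES = "/tmp/randio.dat", 1 << 30, 4096, 500

    fd = os.open(PATH, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o600)
    os.ftruncate(fd, FILE_SIZE)

    def writer():
        buf = os.urandom(BLOCK)
        for _ in range(WRITES):
            # synchronous 4k write at a random offset
            os.pwrite(fd, buf, random.randrange(0, FILE_SIZE - BLOCK))

    for nthreads in (1, 4, 16):
        threads = [threading.Thread(target=writer) for _ in range(nthreads)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(nthreads, "writers:",
              round(nthreads * WRITES / (time.time() - start)), "IOPS")

    os.close(fd)
    os.unlink(PATH)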


Re: Designing a cluster guide

2012-05-21 Thread Damien Churchill
On 21 May 2012 16:36, Tomasz Paszkowski ss7...@gmail.com wrote:
 Project is indeed very interesting, but requires to patch a kernel
 source. For me using lkm is safer ;)


I believe bcache is actually in the process of being mainlined and
moved to a device mapper target, although I could be wrong about one or
more of those things.


Re: Designing a cluster guide

2012-05-21 Thread Stefan Priebe

On 21.05.2012 17:12, Tomasz Paszkowski wrote:

If you're using Qemu/KVM you can use 'info blockstats' command for
measuring I/O on a particular VM.


I want to migrate physical servers to KVM. Any idea for that?

Stefan


Re: Designing a cluster guide

2012-05-21 Thread Tomasz Paszkowski
Just to clarify: you'd like to measure I/O on those systems which are
currently running on physical machines?


On Mon, May 21, 2012 at 10:11 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 On 21.05.2012 17:12, Tomasz Paszkowski wrote:

 If you're using Qemu/KVM you can use 'info blockstats' command for
 measuring I/O on a particular VM.


 I want to migrate physical servers to KVM. Any idea for that?

 Stefan



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


Re: Designing a cluster guide

2012-05-21 Thread Stefan Priebe

On 21.05.2012 22:13, Tomasz Paszkowski wrote:

Just to clarify. You'd like to measure I/O on those system which are
currently running on physical machines ?

IOPS, not just I/O.

Stefan


Re: Designing a cluster guide

2012-05-21 Thread Tomasz Paszkowski
On Linux boxes you may use the output from 'iostat -x /dev/sda' and connect
it to any monitoring system like Zabbix or Cacti :-)
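
If you would rather sample the numbers yourself before wiring iostat into
Zabbix or Cacti, the same counters come straight from /proc/diskstats (fields
4 and 8 of each line are reads and writes completed); a small sketch:

    import time

    def completed_ios(device):
        """Return (reads_completed, writes_completed) for e.g. 'sda'."""
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == device:
                    return int(parts[3]), int(parts[7])
        raise ValueError("device not found: " + device)

    def iops(device, interval=5.0):
        r1, w1 = completed_ios(device)
        time.sleep(interval)
        r2, w2 = completed_ios(device)
        return (r2 - r1) / interval, (w2 - w1) / interval

    read_iops, write_iops = iops("sda")
    print("read:", read_iops, "IOPS  write:", write_iops, "IOPS")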


On Mon, May 21, 2012 at 10:14 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 On 21.05.2012 22:13, Tomasz Paszkowski wrote:

 Just to clarify. You'd like to measure I/O on those system which are
 currently running on physical machines ?

 IOPs not just I/O.

 Stefan



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299


Re: Designing a cluster guide

2012-05-21 Thread Sławomir Skowron
Maybe two cheap MLC Intel drives on SandForce (320/520), 120GB or 240GB,
with the HPA reduced to 20-30GB and used only for separate journaling
partitions in hardware RAID1, would be good for the journal.

I'd like to test a setup like this, but maybe someone has real-life info??

On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski ss7...@gmail.com wrote:
 Another great thing that should be mentioned is:
 https://github.com/facebook/flashcache/. It gives really huge
 performance improvements for reads/writes (especially on FusionIO
 drives) even without using librbd caching :-)



Re: Designing a cluster guide

2012-05-21 Thread Gregory Farnum
On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote:
 Hi All,


 I've been thinking about this issue myself past few days, and an idea I've 
 come up with is running 16 x 2.5 15K 72/146GB Disks,
 in raid 10 inside a 2U Server with JBOD's attached to the server for actual 
 storage.

 Can someone help clarify this one,

 Once the data is written to the (journal disk) and then read from the 
 (journal disk) then written to the (storage disk) once this is complete this 
 is considered a successful write by the client?
 Or
 Once the data is written to the (journal disk) is this considered successful 
 by the client?
This one — the write is considered safe once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
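
The amplification Greg is pointing at is easy to put numbers on (illustrative
figures only):

    raw_tb       = 3 * 12 * 3.0        # e.g. 3 servers x 12 x 3TB disks = 108TB raw
    after_raid10 = raw_tb / 2          # mirroring halves usable capacity
    usable_tb    = after_raid10 / 2    # Ceph rep level 2 halves it again
    print(usable_tb, "TB usable of", raw_tb, "TB raw")   # -> 27.0 TB of 108.0 TB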

 Or
 Once the data is written to the (journal disk) and written to the (storage 
 disk) at the same time, once complete this is considered a successful write 
 by the client? (if this is the case SSD's may not be so useful)


 Pros
 Quite fast Write throughput to the journal disks,
 No write wareout of SSD's
 RAID 10 with 1GB Cache Controller also helps improve things (if really keen 
 you could use a cachecade as well)


 Cons
 Not as fast as SSD's
 More rackspace required per server.


 Regards,
 Quenten


RE: Designing a cluster guide

2012-05-21 Thread Quenten Grasso
Hi Greg,

I'm only talking about journal disks not storage. :)



Regards,
Quenten 


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote:
 Hi All,


 I've been thinking about this issue myself past few days, and an idea I've 
 come up with is running 16 x 2.5 15K 72/146GB Disks,
 in raid 10 inside a 2U Server with JBOD's attached to the server for actual 
 storage.

 Can someone help clarify this one,

 Once the data is written to the (journal disk) and then read from the 
 (journal disk) then written to the (storage disk) once this is complete this 
 is considered a successful write by the client?
 Or
 Once the data is written to the (journal disk) is this considered successful 
 by the client?
This one — the write is considered safe once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?

 Or
 Once the data is written to the (journal disk) and written to the (storage 
 disk) at the same time, once complete this is considered a successful write 
 by the client? (if this is the case SSD's may not be so useful)


 Pros
 Quite fast Write throughput to the journal disks,
 No write wareout of SSD's
 RAID 10 with 1GB Cache Controller also helps improve things (if really keen 
 you could use a cachecade as well)


 Cons
 Not as fast as SSD's
 More rackspace required per server.


 Regards,
 Quenten


RE: Designing a cluster guide

2012-05-21 Thread Quenten Grasso
I should have added: for storage I'm considering something like enterprise 
nearline SAS 3TB disks, running individual disks (not RAIDed) with a rep level of 2 
as suggested :)


Regards,
Quenten 


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Quenten Grasso
Sent: Tuesday, 22 May 2012 10:43 AM
To: 'Gregory Farnum'
Cc: ceph-devel@vger.kernel.org
Subject: RE: Designing a cluster guide

Hi Greg,

I'm only talking about journal disks not storage. :)



Regards,
Quenten 


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@vger.kernel.org
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso qgra...@onq.com.au wrote:
 Hi All,


 I've been thinking about this issue myself past few days, and an idea I've 
 come up with is running 16 x 2.5 15K 72/146GB Disks,
 in raid 10 inside a 2U Server with JBOD's attached to the server for actual 
 storage.

 Can someone help clarify this one,

 Once the data is written to the (journal disk) and then read from the 
 (journal disk) then written to the (storage disk) once this is complete this 
 is considered a successful write by the client?
 Or
 Once the data is written to the (journal disk) is this considered successful 
 by the client?
This one — the write is considered safe once it is on-disk on all
OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?

 Or
 Once the data is written to the (journal disk) and written to the (storage 
 disk) at the same time, once complete this is considered a successful write 
 by the client? (if this is the case SSD's may not be so useful)


 Pros
 Quite fast Write throughput to the journal disks,
 No write wareout of SSD's
 RAID 10 with 1GB Cache Controller also helps improve things (if really keen 
 you could use a cachecade as well)


 Cons
 Not as fast as SSD's
 More rackspace required per server.


 Regards,
 Quenten

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Slawomir Skowron
 Sent: Tuesday, 22 May 2012 7:22 AM
 To: ceph-devel@vger.kernel.org
 Cc: Tomasz Paszkowski
 Subject: Re: Designing a cluster guide

 Maybe two cheap MLC Intel drives on SandForce (320/520), 120GB or 240GB,
 with the HPA reduced to 20-30GB, would be good for the journal: used only
 for separate journaling partitions with hardware RAID1.

 I'd like to test a setup like this, but maybe someone has some real-life info?

 On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski ss7...@gmail.com wrote:
 Another great thing that should be mentioned is:
 https://github.com/facebook/flashcache/. It gives really huge
 performance improvements for reads/writes (especially on FusionIO
 drives) even without using librbd caching :-)




Re: Designing a cluster guide

2012-05-21 Thread Sławomir Skowron
I'm seeing around 320MB/s on a VM from a 3-node rbd cluster, with 10GbE and
26 2.5" SAS drives used in every machine, and that's not everything it could
be.
Every OSD drive is a single-drive RAID0 behind battery-backed NVRAM cache in
the hardware RAID controller.
Every OSD takes a lot of RAM for caching.

That's why I'm thinking about swapping 2 drives for SSDs in RAID1, with the
HPA tuned to increase drive durability for journaling - if this works ;)
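
A rough way to sanity-check the wear side of that idea; every number below is
an assumption, to be replaced with your own measurements and the drive's rated
endurance:

    /* Estimate how many TB per day the journal SSD would absorb; compare
     * the result with the drive's rated endurance (TB written).
     * The write rate and write amplification are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double avg_write_mb_s = 50.0;  /* assumed average write load on the node */
        double write_amp = 1.2;        /* assumed; a big spare area keeps this low */

        double tb_per_day = avg_write_mb_s * write_amp * 86400.0 / (1024.0 * 1024.0);
        printf("journal sees roughly %.1f TB of writes per day\n", tb_per_day);
        printf("divide the drive's rated endurance (TBW) by %.1f for a lifetime in days\n",
               tb_per_day);
        return 0;
    }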

With the newest drives you can theoretically get 500MB/s with a long queue
depth. This means that in theory I can improve the bandwidth numbers, get
lower latency, and handle multiple IO writes from many hosts better.
Reads are cached in RAM by the OSD daemon, by the VFS in the kernel, and by
the NVRAM in the controller, and in the near future will also benefit from
the cache in KVM (I need to test that - it should improve performance).

But if the SSD drive slows down, it can drag overall write performance
down. It's very delicate.

Regards,

iSS

On 22 May 2012, at 02:47, Quenten Grasso qgra...@onq.com.au wrote:

 I should have added: for storage I'm considering something like enterprise
 nearline SAS 3TB disks, run as individual disks (not RAIDed) with a rep
 level of 2 as suggested :)


Re: Designing a cluster guide

2012-05-20 Thread Stefan Priebe

On 19.05.2012 at 18:15, Alexandre DERUMIER wrote:

Hi,

For your journal, if you have money, you can use

stec zeusram ssd drive (around 2000€ / 8GB / 10 iops read/write with 4k
block).
I'm using them with zfs san; they rock for journal.
http://www.stec-inc.com/product/zeusram.php

another interesting product is ddrdrive
http://www.ddrdrive.com/


Great products but really expensive. The question is whether we really need
this in the case of an rbd block device.


Stefan


Re: Designing a cluster guide

2012-05-20 Thread Alexandre DERUMIER
I think that depends on how much random write IO you have and what latency
is acceptable to you.

(As the purpose of the journal is to absorb random IO and then flush it
sequentially to slow storage.)

Maybe some slower SSDs will fill your needs
(just be careful about performance degradation over time, TRIM, ...).
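
As a rough illustration of what the journal buys you for small writes (the
numbers below are assumed): as long as the backing disks can keep up with the
periodic flushes, the ceiling on small random writes per OSD is roughly the
journal's sequential bandwidth divided by the write size.

    /* Upper bound on small random writes a journal can absorb:
     * sequential journal bandwidth / write size.  Assumed figures only. */
    #include <stdio.h>

    int main(void)
    {
        double journal_mb_s = 200.0;  /* assumed sequential write speed of the journal */
        double write_kb = 4.0;        /* 4k random writes */

        double iops = journal_mb_s * 1024.0 / write_kb;
        printf("journal can absorb up to ~%.0f writes/s of %.0fk\n", iops, write_kb);
        return 0;
    }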




- Original Message -

From: Stefan Priebe s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org, Gregory Farnum g...@inktank.com
Sent: Sunday, 20 May 2012 09:56:21
Subject: Re: Designing a cluster guide



Re: Designing a cluster guide

2012-05-20 Thread Christian Brunner
2012/5/20 Stefan Priebe s.pri...@profihost.ag:
 On 19.05.2012 at 18:15, Alexandre DERUMIER wrote:

 Hi,

 For your journal, if you have money, you can use

 stec zeusram ssd drive (around 2000€ / 8GB / 10 iops read/write with
 4k block).
 I'm using them with zfs san; they rock for journal.
 http://www.stec-inc.com/product/zeusram.php

 another interesting product is ddrdrive
 http://www.ddrdrive.com/


 Great products but really expensive. The question is whether we really need
 this in the case of an rbd block device.

I think it depends on what you are planning to do. I was working out
different storage types for our cloud solution lately. I think that
there are three different types that make sense (at least for us):

- Cheap Object Storage (S3):

  Many 3.5" SATA drives for the storage (probably in a RAID config)
  A small and cheap SSD for the journal

- Basic Block Storage (RBD):

  Many 2.5" SATA drives for the storage (RAID10 and/or multiple OSDs)
  Small MaxIOPS SSDs for each OSD journal

- High performance Block Storage (RBD):

  Many large SATA SSDs for the storage (probably in a RAID5 config)
  A STEC ZeusRAM SSD drive for the journal

Regards,
Christian


Re: Designing a cluster guide

2012-05-20 Thread Stefan Priebe

On 20.05.2012 at 10:19, Christian Brunner wrote:

- Cheap Object Storage (S3):

   Many 3.5" SATA drives for the storage (probably in a RAID config)
   A small and cheap SSD for the journal

- Basic Block Storage (RBD):

   Many 2.5" SATA drives for the storage (RAID10 and/or multiple OSDs)
   Small MaxIOPS SSDs for each OSD journal

- High performance Block Storage (RBD)

   Many large SATA SSDs for the storage (probably in a RAID5 config)
   A STEC ZeusRAM SSD drive for the journal
That's exactly what I thought too, but then you need a separate ceph /
rbd cluster for each type.


Which will result in a minimum of:
3x mon servers per type
4x osd servers per type
---

So you'll need a minimum of 12x OSD systems and 9x mon systems.

Regards,
Stefan


Re: Designing a cluster guide

2012-05-20 Thread Christian Brunner
2012/5/20 Stefan Priebe s.pri...@profihost.ag:
 On 20.05.2012 at 10:19, Christian Brunner wrote:

 - Cheap Object Storage (S3):

   Many 3.5" SATA drives for the storage (probably in a RAID config)
   A small and cheap SSD for the journal

 - Basic Block Storage (RBD):

   Many 2.5" SATA drives for the storage (RAID10 and/or multiple OSDs)
   Small MaxIOPS SSDs for each OSD journal

 - High performance Block Storage (RBD)

   Many large SATA SSDs for the storage (probably in a RAID5 config)
   A STEC ZeusRAM SSD drive for the journal

 That's exactly what I thought too, but then you need a separate ceph / rbd
 cluster for each type.

 Which will result in a minimum of:
 3x mon servers per type
 4x osd servers per type
 ---

 so you'll need a minimum of 12x osd systems and 9x mon systems.

You can arrange the storage types in different pools, so that you
don't need separate mon servers (this can be done by adjusting the
crushmap) and you could even run multiple OSDs per server.

Christian


Re: Designing a cluster guide

2012-05-20 Thread Tim O'Donovan
 - High performance Block Storage (RBD)
 
  Many large SATA SSDs for the storage (probably in a RAID5 config)
   stec zeusram ssd drive for the journal

How do you think standard SATA disks would perform in comparison to
this, and is a separate journaling device really necessary?

Perhaps three servers, each with 12 x 1TB SATA disks configured in
RAID10, an OSD on each server, and three separate mon servers.

Would this be suitable as the storage backend for a small OpenStack
cloud, performance-wise, for instance?


Regards,
Tim O'Donovan


Re: Designing a cluster guide

2012-05-20 Thread Stefan Priebe
On 20.05.2012 at 10:56, Tim O'Donovan t...@icukhosting.co.uk wrote:

 - High performance Block Storage (RBD)
 
  Many large SATA SSDs for the storage (probably in a RAID5 config)
  stec zeusram ssd drive for the journal
 
 How do you think standard SATA disks would perform in comparison to
 this, and is a separate journaling device really necessary?
 
 Perhaps three servers, each with 12 x 1TB SATA disks configured in
 RAID10, an osd on each server and three separate mon servers.
 
He's talking about SSDs, not normal SATA disks.

Stefan


Re: Designing a cluster guide

2012-05-20 Thread Tim O'Donovan
 He's talking about ssd's not normal sata disks.

I realise that. I'm looking for similar advice and have been following
this thread. It didn't seem off topic to ask here.


Regards,
Tim O'Donovan


Re: Designing a cluster guide

2012-05-20 Thread Stefan Priebe
No, sorry, I just wanted to clarify, as you quoted the SSD part.

Stefan

On 20.05.2012 at 11:46, Tim O'Donovan t...@icukhosting.co.uk wrote:

 He's talking about SSDs, not normal SATA disks.
 
 I realise that. I'm looking for similar advice and have been following
 this thread. It didn't seem off topic to ask here.
 
 
 Regards,
 Tim O'Donovan


Re: Designing a cluster guide

2012-05-19 Thread Stefan Priebe

Hi Greg,

On 17.05.2012 at 23:27, Gregory Farnum wrote:

It mentions for example Fast CPU for the mds system. What does fast
mean? Just the speed of one core? Or is ceph designed to use multi core?
Is multi core or more speed important?

Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.
So would you recommend a fast (higher GHz) Core i3 instead of a single
Xeon for this system? (The price per GHz is better.)



It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns — they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel/glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)
Which kernel and which glibc version support this? I have searched
Google but haven't found an exact version. We're using Debian
Lenny/Squeeze with a custom kernel.



Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
and you should go for 22x SSD Disks in a Raid 6?

You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or
is this still too slow? Another idea was to use only a ramdisk for the
journal, back the files up to disk while shutting down, and restore
them after boot.
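
If I understand the journaling right, every write hits the journal device
first, so the node's sustained write bandwidth is capped by it no matter how
many SATA disks sit behind it (and a RAID 1 of SSDs writes at roughly the
speed of a single SSD, since both mirrors receive every byte). A quick sanity
check with assumed numbers:

    /* Node write throughput ~ min(journal device speed, aggregate data-disk
     * speed), since everything is journaled first.  Assumed numbers only. */
    #include <stdio.h>

    int main(void)
    {
        double ssd_raid1_mb_s = 250.0;              /* ~ one SSD's sequential write speed */
        double sata_disks = 22.0, per_disk = 100.0; /* assumed per-disk streaming speed */
        double data_mb_s = sata_disks * per_disk;

        double cap = ssd_raid1_mb_s < data_mb_s ? ssd_raid1_mb_s : data_mb_s;
        printf("data disks could stream ~%.0f MB/s, but the journal caps the node at ~%.0f MB/s\n",
               data_mb_s, cap);
        return 0;
    }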



Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
— do you really want to multiply your storage requirements by another
factor of two?

OK, correct, bad idea.


Is it more useful the use a Raid 6 HW Controller or the btrfs raid?

I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)
OK, but overall you would recommend running one OSD per disk, right? So
instead of using a RAID 6 with for example 10 disks you would run 6 OSDs
on this machine?



Use single socket Xeon for the OSDs or Dual Socket?

Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
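
Working that rule of thumb through for the kind of boxes discussed here (the
configurations below are just examples, not recommendations):

    /* "1GHz of modern CPU per OSD daemon" -> aggregate GHz needed.
     * The node configurations below are examples only. */
    #include <stdio.h>

    int main(void)
    {
        int osds_per_node = 22;       /* one OSD per disk on a 22-disk box */
        double ghz_per_osd = 1.0;     /* rule of thumb quoted above */
        double need = osds_per_node * ghz_per_osd;

        int cores = 12; double clock = 2.0;   /* e.g. a dual 6-core 2.0GHz node */
        printf("need ~%.0f GHz; a %d-core %.1fGHz box offers ~%.0f GHz\n",
               need, cores, clock, cores * clock);
        return 0;
    }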

Is there also a rule of thumb for the memory?

My biggest problem with Ceph right now is the awfully slow speed while
doing random reads and writes.


Sequential reads and writes are at 200MB/s (that's pretty good for bonded
dual Gbit/s). But random reads and writes are only at 0.8 - 1.5 MB/s,
which is definitely too slow.
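
For perspective, MB/s is a slightly misleading unit for random IO; assuming
the benchmark issues 4k requests (a guess on my part), the conversion is just
IOPS = MB/s * 1024 / request size:

    /* Convert random-IO throughput to IOPS.  The 4k request size is an
     * assumption about the benchmark; adjust as needed. */
    #include <stdio.h>

    int main(void)
    {
        double req_kb = 4.0;
        double mbps[] = { 0.8, 1.5 };
        for (int i = 0; i < 2; i++)
            printf("%.1f MB/s at %.0fk -> ~%.0f IOPS\n",
                   mbps[i], req_kb, mbps[i] * 1024.0 / req_kb);
        return 0;
    }

So 0.8 - 1.5 MB/s corresponds to only about 200 - 400 random IOPS end to end.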


Stefan


Re: Designing a cluster guide

2012-05-19 Thread Alexandre DERUMIER
Hi,

For your journal, if you have money, you can use

stec zeusram ssd drive (around 2000€ / 8GB / 10 iops read/write with 4k
block).
I'm using them with zfs san; they rock for journal.
http://www.stec-inc.com/product/zeusram.php

another interesting product is ddrdrive
http://www.ddrdrive.com/

- Original Message -

From: Stefan Priebe s.pri...@profihost.ag
To: Gregory Farnum g...@inktank.com
Cc: ceph-devel@vger.kernel.org
Sent: Saturday, 19 May 2012 10:37:01
Subject: Re: Designing a cluster guide



Re: Designing a cluster guide

2012-05-17 Thread Gregory Farnum
Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi,

 the Designing a cluster guide
 http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
 still leaves some questions unanswered.

 It mentions for example Fast CPU for the mds system. What does fast
 mean? Just the speed of one core? Or is ceph designed to use multi core?
 Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded but doing most things requires grabbing a big lock.
How fast is a qualitative rather than quantitative assessment at this
point, though.

 The Cluster Design Recommendations mention separating all daemons onto
 dedicated machines. Is this also useful for the MON? As they're so
 lightweight, why not run them on the OSDs?
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns — they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel/glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)

 Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
 perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
 and you should go for 22x SSD Disks in a Raid 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10
— do you really want to multiply your storage requirements by another
factor of two?
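
To put rough numbers on that amplification (the disk counts and sizes below
are only an example):

    /* RAID10 halves raw capacity, and 2x Ceph replication halves it again,
     * so only ~25% of the raw space is usable.  Example figures only. */
    #include <stdio.h>

    int main(void)
    {
        double disks = 22.0, size_tb = 1.0;        /* e.g. 22 x 1TB per node */
        double raw = disks * size_tb;
        double after_raid10 = raw / 2.0;           /* mirrored pairs */
        double after_ceph_2x = after_raid10 / 2.0; /* rep level 2 */

        printf("raw %.0f TB -> RAID10 %.0f TB -> usable %.1f TB (%.0f%% of raw)\n",
               raw, after_raid10, after_ceph_2x, 100.0 * after_ceph_2x / raw);
        return 0;
    }
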
 Is it more useful to use a RAID 6 HW controller or the btrfs RAID?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in eg switching to xfs. :)

 Use single socket Xeon for the OSDs or Dual Socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
-Greg