Re: Ceph journal

2012-11-03 Thread Gregory Farnum
On Thu, Nov 1, 2012 at 10:33 PM, Gandalf Corvotempesta
 wrote:
> 2012/11/1 Mark Nelson :
>> It will do that for a while, based on how you've tweaked the flush intervals
>> and various journal settings to determine how much data ceph will allow to
>> hang out in the journal while still accepting new requests.
>
> Is Ceph not able to write to both the journal and the disks simultaneously?
> For example, by using an SSD for the operating system and the journal, we
> would be able to have no less than 100GB of journal, which is a large
> amount of data written at SSD speed.
>
> When Ceph has to flush this data to the disks, will it stop accepting new
> writes to the journal?

No, of course it will flush data to the disk at the same time as it
will take writes to the journal. However, if you have a 1GB journal
that writes at 200MB/s and a backing disk that writes at 100MB/s, and
you then push 200MB/s through long enough that the journal fills up,
then you will slow down to writing at 100MB/s because that's as fast
as Ceph can fill up the backing store, and the journal is no longer
buffering.
-Greg
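
[A back-of-the-envelope version of Greg's numbers; a minimal sketch in plain
Python, not Ceph code, using only the figures from his example. The point is
that the journal only absorbs the difference between ingest rate and drain
rate until it fills, after which sustained throughput is bounded by the
backing disk.]

# Greg's example: 1 GB journal, 200 MB/s journal writes, 100 MB/s backing disk.
journal_size_mb = 1024
client_rate_mb_s = 200
disk_rate_mb_s = 100

# The journal fills at the difference between what comes in and what drains out.
fill_rate_mb_s = client_rate_mb_s - disk_rate_mb_s
seconds_until_full = journal_size_mb / fill_rate_mb_s

print(f"Burst absorbed for ~{seconds_until_full:.0f} s")    # ~10 s
print(f"Sustained rate afterwards: {disk_rate_mb_s} MB/s")  # disk-bound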


Re: Ceph journal

2012-11-01 Thread Mark Nelson

On 11/01/2012 04:18 PM, Gandalf Corvotempesta wrote:

2012/10/31 Stefan Kleijkers :

As far as I know, this is correct. You get an ACK on the write back after
it has landed on ALL three journals (and/or OSDs, in the case of btrfs in
parallel mode). So if you lose one node, you still have the data on two more
nodes and they will commit it to disk. After the missing node/OSD recovers,
it will get the data from one of the other nodes. So you won't lose any data.


In this case I suppose that Ceph's write speed is determined by the
journal's write speed and never by the OSD disks.



Eventually you will need to write all of that data out to disk and 
writes to the journal will have to stop to allow the underlying disk to 
catch up.  In cases like that you will often see performance going along 
seemingly speedily and then all of a sudden see long pauses and possibly 
chaotic performance characteristics.
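
[A toy model of the burst-then-stall pattern Mark describes; a minimal
sketch in plain Python, not Ceph code, reusing the 1 GB / 200 MB/s /
100 MB/s figures quoted elsewhere in this thread. The stop-and-drain
policy is a simplifying assumption; real Ceph throttles more gradually
through its journal/filestore settings.]

# Clients offer 200 MB/s into a 1024 MB journal while the backing disk
# drains it at 100 MB/s; when the journal is full, new writes are held
# back until space is freed.
journal_cap, journal_used = 1024.0, 0.0   # MB
client_rate, disk_rate = 200.0, 100.0     # MB/s

accepted = []
for second in range(30):
    room = journal_cap - journal_used
    took = min(client_rate, room)         # accept less when the journal is full
    journal_used += took
    journal_used = max(0.0, journal_used - disk_rate)  # background flush
    accepted.append(took)

print(" ".join(f"{mb:.0f}" for mb in accepted))
# Prints ~200 for the first ten seconds, then drops to roughly the
# disk's 100 MB/s sustained rate.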



Let's assume a journal size of 150GB, capable of writing at 200MB/s, on a
2Gbit/s network (LACP across two gigabit ports), no replication between
OSDs, and a very, very slow SATA disk (5400 RPM, for example), much slower
than the journal. Just a single OSD.
Ceph will write at 200MB/s, and in the background it will flush the journal
to disk, right?


It will do that for a while, based on how you've tweaked the flush 
intervals and various journal settings to determine how much data ceph 
will allow to hang out in the journal while still accepting new requests.
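
[For reference, these are the kinds of knobs Mark is referring to; a hedged
ceph.conf sketch with option names as documented for the filestore/journal
of this era. The values are purely illustrative, not recommendations, and
defaults may differ between releases, so check your version's documentation.]

[osd]
    ; journal size in MB (how much data may queue up before new writes block)
    osd journal size = 10240

    ; how often the filestore syncs the backing filesystem (seconds);
    ; bounds how long data sits only in the journal before being flushed
    filestore min sync interval = 0.01
    filestore max sync interval = 5

    ; throttles on individual journal writes and on data queued for the journal
    journal max write bytes = 10485760
    journal queue max bytes = 33554432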




So I can assume that the journal is a buffer and that RBD writes only to it.





Re: Ceph journal

2012-10-31 Thread Sébastien Han
Hi,

Personally I wouldn't take the risk of losing transactions. If a client
writes into a journal, assuming it's the first write, and the server then
crashes for whatever reason, you have a high risk of inconsistent data,
because you have just lost whatever was in the journal.
Tmpfs is the cheapest way to get better performance, but it's definitely
not the most reliable. Keep in mind that you don't really want to see
your load/data rebalancing across the cluster while recovering from a
failure...
As a last resort, I would use the root filesystem to store the journals,
if it's decently fast.

My 2 cents...

Cheers!

--
Best regards.
Sébastien HAN.



On Wed, Oct 31, 2012 at 11:07 PM, Gandalf Corvotempesta
 wrote:
>
> 2012/10/31 Stefan Kleijkers :
> > As far as I know, this is correct. You get an ACK on the write back after
> > it has landed on ALL three journals (and/or OSDs, in the case of btrfs in
> > parallel mode). So if you lose one node, you still have the data on two
> > more nodes and they will commit it to disk. After the missing node/OSD
> > recovers, it will get the data from one of the other nodes. So you won't
> > lose any data.
>
> Sounds perfect. This will allow me to avoid SSD disks and use all 12
> disks on a Dell R515 as OSDs.


Re: Ceph journal

2012-10-31 Thread Stefan Kleijkers

Hello,

On 10/31/2012 10:58 PM, Gandalf Corvotempesta wrote:

2012/10/31 Tren Blackburn :

Unless you're using btrfs which writes to the journal and osd fs
concurrently, if you lose the journal device (such as due to a
reboot), you've lost the osd device, requiring it to be remade and
re-added.

I don't understand.
Does losing a journal result in losing the whole OSD?

AFAIK, Ceph writes to the journal and returns an "OK" after that write.
After that, the journal is written to disk in the background, so losing a
journal should only result in losing that portion of data, not the
whole OSD.

Now, in the case of 3 replicated nodes, will Ceph write the same data to
the three journals at the same time? If so, losing a single journal/OSD
should not result in data loss, because the same data is still on the
other 2 nodes. In that case, it should be possible to use tmpfs as the
journal and rely on the replicas for redundancy.


As far as I know, this is correct. You get an ACK on the write back
after it has landed on ALL three journals (and/or OSDs, in the case of
btrfs in parallel mode). So if you lose one node, you still have the
data on two more nodes and they will commit it to disk. After the
missing node/OSD recovers, it will get the data from one of the other
nodes. So you won't lose any data.


Stefan


Re: Ceph journal

2012-10-31 Thread Sage Weil
On Wed, 31 Oct 2012, Tren Blackburn wrote:
> On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
>  wrote:
> > In a multi-replica cluster (for example, replica = 3), is it safe to put
> > the journal on tmpfs?
> > As far as I understand, with the journal enabled all writes go to the
> > journal first and then to disk later.
> > If a node hangs while data is still in the journal (and the journal is
> > not on a permanent disk), some data loss could happen.
> >
> > In a multi-replica environment, the other nodes should be able to write
> > the same data to disk, right? In this case, using a journal on tmpfs
> > should be safe enough.
> 
> Unless you're using btrfs which writes to the journal and osd fs
> concurrently, if you lose the journal device (such as due to a
> reboot), you've lost the osd device, requiring it to be remade and
> re-added.

This is correct.  For non-btrfs file systems we rely on the journal for 
basic consistency.

sage


Re: Ceph journal

2012-10-31 Thread Stefan Kleijkers

Hello,

On 10/31/2012 10:24 PM, Tren Blackburn wrote:

On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
 wrote:

In a multi-replica cluster (for example, replica = 3), is it safe to put
the journal on tmpfs?
As far as I understand, with the journal enabled all writes go to the
journal first and then to disk later.
If a node hangs while data is still in the journal (and the journal is
not on a permanent disk), some data loss could happen.

In a multi-replica environment, the other nodes should be able to write
the same data to disk, right? In this case, using a journal on tmpfs
should be safe enough.

Unless you're using btrfs which writes to the journal and osd fs
concurrently, if you lose the journal device (such as due to a
reboot), you've lost the osd device, requiring it to be remade and
re-added.

This is what I understand at least. If I'm wrong one of the devs will
strike me down I'm sure ;)


There's an option to recreate the journal, so you don't lose the OSD;
of course you do lose the data that was in the journal.
It's possible to have the journal on a tmpfs device, but of course it's
not 100% safe: if you lose all three nodes holding the replicas, you can
lose data. Then again, the more replicas you have, the smaller the
chance of losing them all. It's a trade-off between speed, reliability
and space.


Stefan
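
[For reference, the option Stefan mentions corresponds to the ceph-osd
tool's journal commands; a rough sketch of the procedure. The OSD id and
the service invocations are placeholders that vary by distro and release,
and the --flush-journal step only applies if the old journal device is
still readable.]

# stop the OSD whose journal is being replaced
service ceph stop osd.0

# flush whatever is still in the old journal into the object store
# (skip this if the journal device is already gone)
ceph-osd -i 0 --flush-journal

# point 'osd journal' in ceph.conf at the new device or file, then
# initialize the new journal and bring the OSD back up
ceph-osd -i 0 --mkjournal
service ceph start osd.0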



Re: Ceph journal

2012-10-31 Thread Tren Blackburn
On Wed, Oct 31, 2012 at 2:18 PM, Gandalf Corvotempesta
 wrote:
> In a multi-replica cluster (for example, replica = 3), is it safe to put
> the journal on tmpfs?
> As far as I understand, with the journal enabled all writes go to the
> journal first and then to disk later.
> If a node hangs while data is still in the journal (and the journal is
> not on a permanent disk), some data loss could happen.
>
> In a multi-replica environment, the other nodes should be able to write
> the same data to disk, right? In this case, using a journal on tmpfs
> should be safe enough.

Unless you're using btrfs which writes to the journal and osd fs
concurrently, if you lose the journal device (such as due to a
reboot), you've lost the osd device, requiring it to be remade and
re-added.

This is what I understand at least. If I'm wrong one of the devs will
strike me down I'm sure ;)

t.