Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Mark Nelson
One small point:  It's a bit easier to observe distinct WAL and DB 
behavior when they are on separate partitions.  I often do this for 
benchmarking and testing though I don't know that it would be enough of 
a benefit to do it in production.
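
As a rough illustration of why separate partitions make this easier: Linux keeps I/O counters per partition in /proc/diskstats, so WAL and DB traffic can be read off separately. Below is a minimal Python sketch. The partition names (nvme0n1p1 for the WAL, nvme0n1p2 for the DB) and the sample counters are made up for illustration, not taken from a real cluster.

```python
from typing import Dict, Iterable

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def written_bytes(diskstats_text: str, partitions: Iterable[str]) -> Dict[str, int]:
    """Return bytes written so far for each named partition."""
    wanted = set(partitions)
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        # fields: major, minor, name, then the stats; index 9 = sectors written
        name = fields[2]
        if name in wanted:
            result[name] = int(fields[9]) * SECTOR_BYTES
    return result


# Hypothetical sample; on a real host read open("/proc/diskstats").read()
sample = (
    "259 1 nvme0n1p1 120 0 9600 15 5000 0 400000 800 0 500 815\n"
    "259 2 nvme0n1p2 80 0 6400 10 90000 0 1440000 2000 0 900 2010\n"
    "8 16 sdb 10 0 80 5 20 0 160 9 0 12 14\n"
)
stats = written_bytes(sample, ["nvme0n1p1", "nvme0n1p2"])
for name, nbytes in sorted(stats.items()):
    print(f"{name}: {nbytes} bytes written")
```

Sampling this twice and subtracting gives the write volume per partition over an interval, which is not possible when the WAL and DB share one partition.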


Mark

On 11/09/2017 04:16 AM, Richard Hesketh wrote:

You're correct, if you were going to put the WAL and DB on the same device you 
should just make one partition and allocate the DB to it, the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device, otherwise 
you're just making your setup more complex for no gain.


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Denes Dolhay

-sorry, wrong address


Hi Richard,

I have seen a few lectures about bluestore, and they made it abundantly 
clear that bluestore is superior to filestore in that it writes data to 
the disk only once (this is how they could achieve a 2x-3x speed 
increase).


So is this still true if there is no separate WAL (and DB) device? How does 
this work?



Thanks!

Denke.
On 11/09/2017 11:16 AM, Richard Hesketh wrote:

You're correct, if you were going to put the WAL and DB on the same device you 
should just make one partition and allocate the DB to it, the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device, otherwise 
you're just making your setup more complex for no gain.


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Richard Hesketh
You're correct: if you were going to put the WAL and DB on the same device, you 
should just make one partition and allocate the DB to it; the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device. Otherwise 
you're just making your setup more complex for no gain.
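
The rule above can be sketched in a few lines of Python. This builds, but does not run, an argument list for `ceph-volume lvm create`. The choice of ceph-volume as the deployment tool, the function name `osd_create_args`, and every device path are assumptions for illustration only.

```python
from typing import List, Optional


def osd_create_args(data_dev: str,
                    db_dev: Optional[str] = None,
                    wal_dev: Optional[str] = None,
                    wal_much_faster_than_db: bool = False) -> List[str]:
    """Build (but do not execute) a ceph-volume command for one BlueStore OSD."""
    args = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data_dev]
    if db_dev is not None:
        args += ["--block.db", db_dev]
    # Only name a WAL device when it sits on a different, much faster device;
    # otherwise the WAL is simply stored together with the DB.
    if wal_dev is not None and wal_much_faster_than_db:
        args += ["--block.wal", wal_dev]
    return args


# WAL and DB partitions on the same SSD: only the DB is specified.
same_ssd = osd_create_args("/dev/sdb", db_dev="/dev/sdc1", wal_dev="/dev/sdc2")
print(" ".join(same_ssd))

# A tiny NVMe for the WAL next to a larger SSD for the DB: both are specified.
tiny_nvme = osd_create_args("/dev/sdb", db_dev="/dev/sdc1",
                            wal_dev="/dev/nvme0n1p1",
                            wal_much_faster_than_db=True)
print(" ".join(tiny_nvme))
```

Whether the WAL device counts as "much faster" is left as an explicit flag because that is a judgment about the hardware, not something the sketch can detect.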

On 09/11/17 08:05, jorpilo wrote:
> 
> I get confused there because on the documentation:
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
> 
> "If there is more, provisioning a DB device makes more sense. The BlueStore 
> journal will always be placed on the fastest device available, so using a DB 
> device will provide the same benefit that the WAL device would while also 
> allowing additional metadata to be stored there"
> 
> So I guess it doesn't make any sense to implicit put WAL and DB on a SSD, 
> only with DB, the biggest you can, would be enough, unless you have 2 
> different kinds of SSD (for example a tiny Nvme and a SSD)
> 
> Am I right? Or would I get any benefit from setting implicit WAL partition on 
> the same SSD?

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread jorpilo

I get confused there because the documentation at
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
says:

"If there is more, provisioning a DB device makes more sense. The BlueStore 
journal will always be placed on the fastest device available, so using a DB 
device will provide the same benefit that the WAL device would while also 
allowing additional metadata to be stored there"

So I guess it doesn't make any sense to explicitly put both the WAL and the DB 
on an SSD; a DB partition alone, as big as you can make it, would be enough, 
unless you have two different kinds of SSD (for example a tiny NVMe and an SSD).

Am I right? Or would I get any benefit from explicitly setting up a WAL 
partition on the same SSD?

-------- Original message --------
From: Nick Fisk <n...@fisk.me.uk>
Date: 8/11/17 10:16 p.m. (GMT+01:00)
To: 'Mark Nelson' <mnel...@redhat.com>, 'Wolfgang Lendl' <wolfgang.le...@meduniwien.ac.at>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 08 November 2017 19:46
> To: Wolfgang Lendl <wolfgang.le...@meduniwien.ac.at>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> Hi Wolfgang,
> 
> You've got the right idea.  RBD is probably going to benefit less since you
> have a small number of large objects and little extra OMAP data.
> Having the allocation and object metadata on flash certainly shouldn't hurt,
> and you should still have less overhead for small (<64k) writes.
> With RGW however you also have to worry about bucket index updates
> during writes and that's a big potential bottleneck that you don't need to
> worry about with RBD.

If you are running anything which is sensitive to sync write latency, like
databases. You will see a big performance improvement in using WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
1-2us difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.



Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Nick Fisk
> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: 08 November 2017 21:42
> To: n...@fisk.me.uk; 'Wolfgang Lendl' <wolfgang.le...@meduniwien.ac.at>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> 
> 
> Hi Nick,
> 
> You've done more investigation in this area than most I think.  Once you get
> to the point under continuous load where RocksDB is compacting, do you see
> better than a 2X gain?
> 
> Mark

Hi Mark,

I've not really been testing it in a way where all the OSDs would be under 
100% load for a long period of time. It's been more of a real-world, user-facing 
test where IO comes and goes in short bursts and spikes. I've been busy in other 
areas for the last few months and so have sort of missed out on all the 
official Luminous/bluestore goodness. I hope to get round to doing some more 
testing towards the end of the year though. Once I do, I will look into the 
compaction and see what impact it might be having.


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson



On 11/08/2017 03:16 PM, Nick Fisk wrote:

If you are running anything which is sensitive to sync write latency, like
databases. You will see a big performance improvement in using WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
1-2us difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.


Hi Nick,

You've done more investigation in this area than most I think.  Once you 
get to the point under continuous load where RocksDB is compacting, do 
you see better than a 2X gain?


Mark





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 08 November 2017 19:46
> To: Wolfgang Lendl <wolfgang.le...@meduniwien.ac.at>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> Hi Wolfgang,
> 
> You've got the right idea.  RBD is probably going to benefit less since you
> have a small number of large objects and little extra OMAP data.
> Having the allocation and object metadata on flash certainly shouldn't hurt,
> and you should still have less overhead for small (<64k) writes.
> With RGW however you also have to worry about bucket index updates
> during writes and that's a big potential bottleneck that you don't need to
> worry about with RBD.

If you are running anything which is sensitive to sync write latency, like
databases, you will see a big performance improvement from putting the WAL on
SSD. As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
1-2us difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match the performance
of an enterprise storage array with BBWC, I would say DB and WAL on SSD is a
requirement.
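
To put rough numbers on the latency argument: a client that issues one sync write at a time and waits for each acknowledgement cannot do more than one write per acknowledgement latency. Below is a minimal Python sketch of that bound. The 200 us value is the upper end of the SSD range quoted above; the 5000 us (5 ms) value for an HDD-only OSD is an assumed figure for illustration, not a measurement from this thread.

```python
def max_sync_iops(ack_latency_us: float) -> float:
    """Upper bound on writes/s for a single writer at queue depth 1."""
    return 1_000_000 / ack_latency_us


ssd_iops = max_sync_iops(200)    # ack once the write is on the SSD
hdd_iops = max_sync_iops(5000)   # assumed HDD-only ack latency

print(f"WAL on SSD: up to {ssd_iops:.0f} sync writes/s")
print(f"HDD only:   up to {hdd_iops:.0f} sync writes/s")
```

The bound ignores network and replication latency, so real numbers will be lower in both cases; it only shows how directly acknowledgement latency limits a sync-heavy workload.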

> 
> Mark
> 
> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> > Hi Mark,
> >
> > thanks for your reply!
> > I'm a big fan of keeping things simple - this means that there has to
> > be a very good reason to put the WAL and DB on a separate device
> > otherwise I'll keep it collocated (and simpler).
> >
> > as far as I understood - putting the WAL,DB on a faster (than hdd)
> > device makes more sense in cephfs and rgw environments (more metadata)
> > - and less sense in rbd environments - correct?
> >
> > br
> > wolfgang
> >
> > On 11/08/2017 02:21 PM, Mark Nelson wrote:
> >> Hi Wolfgang,
> >>
> >> In bluestore the WAL serves sort of a similar purpose to filestore's
> >> journal, but bluestore isn't dependent on it for guaranteeing
> >> durability of large writes.  With bluestore you can often get higher
> >> large-write throughput than with filestore when using HDD-only or
> >> flash-only OSDs.
> >>
> >> Bluestore also stores allocation, object, and cluster metadata in the
> >> DB.  That, in combination with the way bluestore stores objects,
> >> dramatically improves behavior during certain workloads.  A big one
> >> is creating millions of small objects as quickly as possible.  In
> >> filestore, PG splitting has a huge impact on performance and tail
> >> latency.  Bluestore is much better just on HDD, and putting the DB
> >> and WAL on flash makes it better still since metadata no longer is a
> >> bottleneck.
> >>
> >> Bluestore does have a couple of shortcomings vs filestore currently.
> >> The allocator is not as good as XFS's and can fragment more over time.
> >> There is no server-side readahead so small sequential read
> >> performance is very dependent on client-side readahead.  There's
> >> still a number of optimizations to various things ranging from
> >> threading and locking in the shardedopwq to pglog and dup_ops that
> >> potentially could improve performance.
> >>
> >> I have a blog post that we've been working on that explores some of
> >> these things but I'm still waiting on review before I publish it.
> >>
> >> Mark
> >>
> >> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
> >>> Hello,
> >>>
> >>> it's clear to me getting a performance gain from putting the journal
> >>> on a fast device (ssd,nvme) when using filestore backend.
> >>> it's not when it comes to bluestore - are there any resources,
> >>> performance test, etc. out there how a fast wal,db device impacts
> >>> performance?
> >>>
> >>>
> >>> br
> >>> wolfgang
> >>>


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since 
you have a small number of large objects and little extra OMAP data. 
Having the allocation and object metadata on flash certainly shouldn't 
hurt, and you should still have less overhead for small (<64k) writes. 
With RGW however you also have to worry about bucket index updates 
during writes and that's a big potential bottleneck that you don't need 
to worry about with RBD.


Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to be
a very good reason to put the WAL and DB on a separate device; otherwise
I'll keep it collocated (and simpler).

As far as I understand, putting the WAL and DB on a faster (than HDD)
device makes more sense in CephFS and RGW environments (more metadata)
and less sense in RBD environments - correct?

br
wolfgang




Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Wolfgang Lendl
Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to be
a very good reason to put the WAL and DB on a separate device; otherwise
I'll keep it collocated (and simpler).

As far as I understand, putting the WAL and DB on a faster (than HDD)
device makes more sense in CephFS and RGW environments (more metadata)
and less sense in RBD environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:
> Hi Wolfgang,
>
> In bluestore the WAL serves sort of a similar purpose to filestore's
> journal, but bluestore isn't dependent on it for guaranteeing
> durability of large writes.  With bluestore you can often get higher
> large-write throughput than with filestore when using HDD-only or
> flash-only OSDs.
>
> Bluestore also stores allocation, object, and cluster metadata in the
> DB.  That, in combination with the way bluestore stores objects,
> dramatically improves behavior during certain workloads.  A big one is
> creating millions of small objects as quickly as possible.  In
> filestore, PG splitting has a huge impact on performance and tail
> latency.  Bluestore is much better just on HDD, and putting the DB and
> WAL on flash makes it better still since metadata is no longer a
> bottleneck.
>
> Bluestore does have a couple of shortcomings vs filestore currently.
> The allocator is not as good as XFS's and can fragment more over time.
> There is no server-side readahead so small sequential read performance
> is very dependent on client-side readahead.  There are still a number of
> optimizations to make, ranging from threading and locking in the
> shardedopwq to pglog and dup_ops, that could potentially improve
> performance.
>
> I have a blog post that we've been working on that explores some of
> these things but I'm still waiting on review before I publish it.
>
> Mark


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-08 Thread Mark Nelson

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's 
journal, but bluestore isn't dependent on it for guaranteeing durability 
of large writes.  With bluestore you can often get higher large-write 
throughput than with filestore when using HDD-only or flash-only OSDs.


Bluestore also stores allocation, object, and cluster metadata in the 
DB.  That, in combination with the way bluestore stores objects, 
dramatically improves behavior during certain workloads.  A big one is 
creating millions of small objects as quickly as possible.  In 
filestore, PG splitting has a huge impact on performance and tail 
latency.  Bluestore is much better just on HDD, and putting the DB and
WAL on flash makes it better still since metadata is no longer a bottleneck.
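
As a rough illustration of what "DB and WAL on flash" can look like at
provisioning time, here is a minimal sketch that only builds (and does not
run) an OSD creation command.  It assumes the ceph-volume tool and its
--data, --block.db and --block.wal options from later Ceph releases; the
helper name and the device paths are hypothetical.  It follows the advice
given elsewhere in this thread: a separate WAL is only passed when it would
live on a different device than the DB, since the WAL is otherwise stored
with the DB automatically.

```python
from typing import List, Optional


def ceph_volume_args(data_dev: str,
                     db_dev: Optional[str] = None,
                     wal_dev: Optional[str] = None) -> List[str]:
    """Build (but do not execute) a bluestore OSD creation command.

    Hypothetical helper for illustration; device paths are placeholders.
    """
    args = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data_dev]
    if db_dev:
        # DB on a faster device; the WAL follows the DB automatically.
        args += ["--block.db", db_dev]
    if wal_dev and wal_dev != db_dev:
        # Only worth specifying when the WAL device differs from the DB device.
        args += ["--block.wal", wal_dev]
    return args


# HDD data device with the DB (and implicitly the WAL) on an NVMe partition.
print(" ".join(ceph_volume_args("/dev/sdb", db_dev="/dev/nvme0n1p1")))
```

Passing the same device for both DB and WAL yields the same command as
passing the DB alone, which keeps the simple case simple.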


Bluestore does have a couple of shortcomings vs filestore currently. 
The allocator is not as good as XFS's and can fragment more over time. 
There is no server-side readahead so small sequential read performance 
is very dependent on client-side readahead.  There are still a number of
optimizations to make, ranging from threading and locking in the
shardedopwq to pglog and dup_ops, that could potentially improve
performance.
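
To make the readahead point concrete, here is a toy model (not a
measurement, and not how any Ceph client is actually implemented) that
counts how many requests reach the backend during a small sequential scan,
with and without a client-side readahead window.  The function name and
the sizes are assumptions chosen for illustration only.

```python
def backend_reads(total_bytes: int, io_size: int, readahead: int = 0) -> int:
    """Count backend requests for a sequential scan in a toy client model."""
    requests = 0
    cached_until = 0  # exclusive end offset of data the client already holds
    offset = 0
    while offset < total_bytes:
        end = min(offset + io_size, total_bytes)
        if end > cached_until:
            # Miss: fetch this read plus the readahead window in one request.
            cached_until = min(max(end, offset + readahead), total_bytes)
            requests += 1
        offset = end
    return requests


MiB = 1024 * 1024
print(backend_reads(4 * MiB, 4096))                        # 1024
print(backend_reads(4 * MiB, 4096, readahead=512 * 1024))  # 8
```

In this model a 4 MiB scan in 4 KiB reads costs 1024 backend requests
without readahead and 8 with a 512 KiB window, which is why the client
setting matters so much when the server does no readahead of its own.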


I have a blog post that we've been working on that explores some of 
these things but I'm still waiting on review before I publish it.


Mark

On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:

Hello,

It's clear to me that putting the journal on a fast device (SSD, NVMe)
gives a performance gain when using the filestore backend.
It's less clear when it comes to bluestore - are there any resources,
performance tests, etc. out there showing how a fast WAL/DB device impacts
performance?


br
wolfgang

