Re: [ceph-users] Bluestore "separate" WAL and DB

2017-10-16 Thread Wido den Hollander
I thought I'd pick up on this older thread instead of starting a new one.

For the WAL, something between 512MB and 2GB should be sufficient, as Mark Nelson 
explained in a different thread.

The DB size, however, I'm not certain about at this moment. The general consensus 
seems to be "use as much as is available", but that could be a lot of space.

The DB will roll over to the DATA partition if it grows too large.

There is a relation between the number of objects and the size of the DB. For 
each object (regardless of its size) you will have a RocksDB entry, and that 
will occupy space in the DB.
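
As a rough way to gauge this on a running cluster, the BlueFS perf counters can be 
queried per OSD. This is only a sketch: osd.0 is a placeholder, and it assumes the 
bluefs counters (db_used_bytes, slow_used_bytes) are exposed by your build.

ceph daemon osd.0 perf dump | grep -E '"db_(total|used)_bytes"|"slow_used_bytes"'  # RocksDB space on the DB device
# a non-zero slow_used_bytes suggests the DB has already rolled over onto the data device

Watching db_used_bytes grow as objects are added gives a rough per-object figure to 
extrapolate from.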

Hopefully Mark (or somebody else) can shed some light on this. E.g., is 10GB 
sufficient for the DB? 20GB? 100GB? What is a reasonable amount?

Wido

> On 20 September 2017 at 20:50, Alejandro Comisario wrote:
> 
> 
> Bump! I would love to hear thoughts about this!
> 
> On Fri, Sep 8, 2017 at 7:44 AM, Richard Hesketh <
> richard.hesk...@rd.bbc.co.uk> wrote:
> 
> > Hi,
> >
> > Reading the ceph-users list I'm obviously seeing a lot of people talking
> > about using bluestore now that Luminous has been released. I note that many
> > users seem to be under the impression that they need separate block devices
> > for the bluestore data block, the DB, and the WAL... even when they are
> > going to put the DB and the WAL on the same device!
> >
> > As per the docs at http://docs.ceph.com/docs/master/rados/configuration/
> > bluestore-config-ref/ this is nonsense:
> >
> > > If there is only a small amount of fast storage available (e.g., less than
> > > a gigabyte), we recommend using it as a WAL device. If there is more,
> > > provisioning a DB device makes more sense. The BlueStore journal will always
> > > be placed on the fastest device available, so using a DB device will provide
> > > the same benefit that the WAL device would while also allowing additional
> > > metadata to be stored there (if it will fix). [sic, I assume that should be
> > > "fit"]
> >
> > I understand that if you've got three speeds of storage available, there
> > may be some sense to dividing these. For instance, if you've got lots of
> > HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD,
> > DB on SSD and WAL on NVMe may be a sensible division of data. That's not
> > the case for most of the examples I'm reading; they're talking about
> > putting DB and WAL on the same block device, but in different partitions.
> > There's even one example of someone suggesting to try partitioning a single
> > SSD to put data/DB/WAL all in separate partitions!
> >
> > Are the docs wrong and/or am I missing something about optimal bluestore
> > setup, or do people simply have the wrong end of the stick? I ask because
> > I'm just going through switching all my OSDs over to Bluestore now and I've
> > just been reusing the partitions I set up for journals on my SSDs as DB
> > devices for Bluestore HDDs without specifying anything to do with the WAL,
> > and I'd like to know sooner rather than later if I'm making some sort of
> > horrible mistake.
> >
> > Rich
> > --
> > Richard Hesketh
> >
> >
> >
> 
> 
> -- 
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.com | Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com


Re: [ceph-users] Bluestore "separate" WAL and DB

2017-09-20 Thread Alejandro Comisario
Bump! I would love to hear thoughts about this!

On Fri, Sep 8, 2017 at 7:44 AM, Richard Hesketh <
richard.hesk...@rd.bbc.co.uk> wrote:

> Hi,
>
> Reading the ceph-users list I'm obviously seeing a lot of people talking
> about using bluestore now that Luminous has been released. I note that many
> users seem to be under the impression that they need separate block devices
> for the bluestore data block, the DB, and the WAL... even when they are
> going to put the DB and the WAL on the same device!
>
> As per the docs at http://docs.ceph.com/docs/master/rados/configuration/
> bluestore-config-ref/ this is nonsense:
>
> > If there is only a small amount of fast storage available (e.g., less than
> > a gigabyte), we recommend using it as a WAL device. If there is more,
> > provisioning a DB device makes more sense. The BlueStore journal will always
> > be placed on the fastest device available, so using a DB device will provide
> > the same benefit that the WAL device would while also allowing additional
> > metadata to be stored there (if it will fix). [sic, I assume that should be
> > "fit"]
>
> I understand that if you've got three speeds of storage available, there
> may be some sense to dividing these. For instance, if you've got lots of
> HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD,
> DB on SSD and WAL on NVMe may be a sensible division of data. That's not
> the case for most of the examples I'm reading; they're talking about
> putting DB and WAL on the same block device, but in different partitions.
> There's even one example of someone suggesting to try partitioning a single
> SSD to put data/DB/WAL all in separate partitions!
>
> Are the docs wrong and/or am I missing something about optimal bluestore
> setup, or do people simply have the wrong end of the stick? I ask because
> I'm just going through switching all my OSDs over to Bluestore now and I've
> just been reusing the partitions I set up for journals on my SSDs as DB
> devices for Bluestore HDDs without specifying anything to do with the WAL,
> and I'd like to know sooner rather than later if I'm making some sort of
> horrible mistake.
>
> Rich
> --
> Richard Hesketh
>
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com | Cell: +54 9 11 3770 1857
_
www.nubeliu.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Richard Hesketh
I do run with osd_max_backfills and osd_recovery_max_active turned up quite a bit 
from the defaults, as I'm trying for as much recovery throughput as possible. I 
would hazard a guess that the impact seen from the sleep settings is proportionally 
much smaller if your other recovery-related parameters are closer to the defaults - 
but it starts to dominate once you remove the other bottlenecks on recovery I/O.
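
For illustration, this is the kind of runtime change I mean (the values here are 
only examples, not recommendations, and assume a Luminous cluster where the extra 
impact on client I/O is acceptable):

ceph tell osd.* injectargs '--osd-max-backfills 8 --osd-recovery-max-active 8'  # raise concurrent backfills/recovery ops per OSD

These take effect immediately but are not persistent; matching entries in ceph.conf 
are needed for them to survive an OSD restart.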

Rich

On 14/09/17 15:02, Mark Nelson wrote:
> I'm really glad to hear that it wasn't bluestore! :)
> 
> It raises another concern though. We didn't expect to see that much of a 
> slowdown with the current throttle settings.  An order of magnitude slowdown 
> in recovery performance isn't ideal at all.
> 
> I wonder if we could improve things dramatically if we kept track of client 
> IO activity on the OSD and removed the throttle if there's been no client 
> activity for X seconds.  Theoretically more advanced heuristics might cover 
> this, but in the interim it seems to me like this would solve the very 
> specific problem you are seeing while still throttling recovery when IO is 
> happening.
> 
> Mark
> 
> On 09/14/2017 06:19 AM, Richard Hesketh wrote:
>> Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
>> recovery sleep times brings the recovery speed back up to (and beyond!) the 
>> levels I was expecting to see - recovery is almost an order of magnitude 
>> faster now. Thanks for educating me about those changes!
>>
>> Rich
>>
>> On 14/09/17 11:16, Richard Hesketh wrote:
>>> Hi Mark,
>>>
>>> No, I wasn't familiar with that work. I am in fact comparing speed of 
>>> recovery to maintenance work I did while the cluster was in Jewel; I 
>>> haven't manually done anything to sleep settings, only adjusted max 
>>> backfills OSD settings. New options that introduce arbitrary slowdown to 
>>> recovery operations to preserve client performance would explain what I'm 
>>> seeing! I'll have a tinker with adjusting those values (in my particular 
>>> case client load on the cluster is very low and I don't have to honour any 
>>> guarantees about client performance - getting back into HEALTH_OK asap is 
>>> preferable).
>>>
>>> Rich
>>>
>>> On 13/09/17 21:14, Mark Nelson wrote:
 Hi Richard,

 Regarding recovery speed, have you looked through any of Neha's results on 
 recovery sleep testing earlier this summer?

 https://www.spinics.net/lists/ceph-devel/msg37665.html

 She tested bluestore and filestore under a couple of different scenarios.  
 The gist of it is that time to recover changes pretty dramatically 
 depending on the sleep setting.

 I don't recall if you said earlier, but are you comparing filestore and 
 bluestore recovery performance on the same version of ceph with the same 
 sleep settings?

 Mark

 On 09/12/2017 05:24 AM, Richard Hesketh wrote:
> Thanks for the links. That does seem to largely confirm that I 
> haven't horribly misunderstood anything and I've not been doing anything 
> obviously wrong while converting my disks; there's no point specifying 
> separate WAL/DB partitions if they're going to go on the same device, 
> throw as much space as you have available at the DB partitions and 
> they'll use all the space they can, and significantly reduced I/O on the 
> DB/WAL device compared to Filestore is expected since bluestore's nixed 
> the write amplification as much as possible.
>
> I'm still seeing much reduced recovery speed on my newly Bluestored 
> cluster, but I guess that's a tuning issue rather than evidence of 
> catastrophe.
>
> Rich
>>>
>>>
>>>


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development





Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Mark Nelson

I'm really glad to hear that it wasn't bluestore! :)

It raises another concern though. We didn't expect to see that much of a 
slowdown with the current throttle settings.  An order of magnitude 
slowdown in recovery performance isn't ideal at all.


I wonder if we could improve things dramatically if we kept track of 
client IO activity on the OSD and removed the throttle if there's been no 
client activity for X seconds.  Theoretically more advanced heuristics 
might cover this, but in the interim it seems to me like this would 
solve the very specific problem you are seeing while still throttling 
recovery when IO is happening.


Mark

On 09/14/2017 06:19 AM, Richard Hesketh wrote:

Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
recovery sleep times brings the recovery speed back up to (and beyond!) the 
levels I was expecting to see - recovery is almost an order of magnitude faster 
now. Thanks for educating me about those changes!

Rich

On 14/09/17 11:16, Richard Hesketh wrote:

Hi Mark,

No, I wasn't familiar with that work. I am in fact comparing speed of recovery 
to maintenance work I did while the cluster was in Jewel; I haven't manually 
done anything to sleep settings, only adjusted max backfills OSD settings. New 
options that introduce arbitrary slowdown to recovery operations to preserve 
client performance would explain what I'm seeing! I'll have a tinker with 
adjusting those values (in my particular case client load on the cluster is 
very low and I don't have to honour any guarantees about client performance - 
getting back into HEALTH_OK asap is preferable).

Rich

On 13/09/17 21:14, Mark Nelson wrote:

Hi Richard,

Regarding recovery speed, have you looked through any of Neha's results on 
recovery sleep testing earlier this summer?

https://www.spinics.net/lists/ceph-devel/msg37665.html

She tested bluestore and filestore under a couple of different scenarios.  The 
gist of it is that time to recover changes pretty dramatically depending on the 
sleep setting.

I don't recall if you said earlier, but are you comparing filestore and 
bluestore recovery performance on the same version of ceph with the same sleep 
settings?

Mark

On 09/12/2017 05:24 AM, Richard Hesketh wrote:

Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich












Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Richard Hesketh
Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
recovery sleep times brings the recovery speed back up to (and beyond!) the 
levels I was expecting to see - recovery is almost an order of magnitude faster 
now. Thanks for educating me about those changes!
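
For reference, a sketch of the sleep knobs in question (assuming the Luminous 
defaults of 0.1s per recovery op on HDD OSDs and 0 on SSD OSDs; setting them to 0 
removes the throttle entirely and is only appropriate when client I/O can tolerate 
it):

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_ssd 0'  # disable the per-op recovery sleep at runtime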

Rich

On 14/09/17 11:16, Richard Hesketh wrote:
> Hi Mark,
> 
> No, I wasn't familiar with that work. I am in fact comparing speed of 
> recovery to maintenance work I did while the cluster was in Jewel; I haven't 
> manually done anything to sleep settings, only adjusted max backfills OSD 
> settings. New options that introduce arbitrary slowdown to recovery 
> operations to preserve client performance would explain what I'm seeing! I'll 
> have a tinker with adjusting those values (in my particular case client load 
> on the cluster is very low and I don't have to honour any guarantees about 
> client performance - getting back into HEALTH_OK asap is preferable).
> 
> Rich
> 
> On 13/09/17 21:14, Mark Nelson wrote:
>> Hi Richard,
>>
>> Regarding recovery speed, have you looked through any of Neha's results on 
>> recovery sleep testing earlier this summer?
>>
>> https://www.spinics.net/lists/ceph-devel/msg37665.html
>>
>> She tested bluestore and filestore under a couple of different scenarios.  
>> The gist of it is that time to recover changes pretty dramatically depending 
>> on the sleep setting.
>>
>> I don't recall if you said earlier, but are you comparing filestore and 
>> bluestore recovery performance on the same version of ceph with the same 
>> sleep settings?
>>
>> Mark
>>
>> On 09/12/2017 05:24 AM, Richard Hesketh wrote:
>>> Thanks for the links. That does seem to largely confirm that I haven't 
>>> horribly misunderstood anything and I've not been doing anything obviously 
>>> wrong while converting my disks; there's no point specifying separate 
>>> WAL/DB partitions if they're going to go on the same device, throw as much 
>>> space as you have available at the DB partitions and they'll use all the 
>>> space they can, and significantly reduced I/O on the DB/WAL device compared 
>>> to Filestore is expected since bluestore's nixed the write amplification as 
>>> much as possible.
>>>
>>> I'm still seeing much reduced recovery speed on my newly Bluestored 
>>> cluster, but I guess that's a tuning issue rather than evidence of 
>>> catastrophe.
>>>
>>> Rich
> 
> 
> 
> 


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development





Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-13 Thread Mark Nelson

Hi Richard,

Regarding recovery speed, have you looked through any of Neha's results 
on recovery sleep testing earlier this summer?


https://www.spinics.net/lists/ceph-devel/msg37665.html

She tested bluestore and filestore under a couple of different 
scenarios.  The gist of it is that time to recover changes pretty 
dramatically depending on the sleep setting.


I don't recall if you said earlier, but are you comparing filestore and 
bluestore recovery performance on the same version of ceph with the same 
sleep settings?
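
A quick way to check what an OSD is actually running with, via its admin socket 
(osd.0 is just a placeholder, and this has to be run on the node hosting that OSD):

ceph daemon osd.0 config show | grep recovery_sleep  # current sleep settings
ceph daemon osd.0 version  # confirm which ceph version the OSD is running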


Mark

On 09/12/2017 05:24 AM, Richard Hesketh wrote:

Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich

On 12/09/17 00:13, Brad Hubbard wrote:

Take a look at these which should answer at least some of your questions.

http://ceph.com/community/new-luminous-bluestore/

http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/

On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh wrote:

On 08/09/17 11:44, Richard Hesketh wrote:

Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about 
using bluestore now that Luminous has been released. I note that many users 
seem to be under the impression that they need separate block devices for the 
bluestore data block, the DB, and the WAL... even when they are going to put 
the DB and the WAL on the same device!

As per the docs at 
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this 
is nonsense:


If there is only a small amount of fast storage available (e.g., less than a 
gigabyte), we recommend using it as a WAL device. If there is more, provisioning a 
DB device makes more sense. The BlueStore journal will always be placed on the 
fastest device available, so using a DB device will provide the same benefit that 
the WAL device would while also allowing additional metadata to be stored there (if 
it will fix). [sic, I assume that should be "fit"]


I understand that if you've got three speeds of storage available, there may be 
some sense to dividing these. For instance, if you've got lots of HDD, a bit of 
SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL 
on NVMe may be a sensible division of data. That's not the case for most of the 
examples I'm reading; they're talking about putting DB and WAL on the same 
block device, but in different partitions. There's even one example of someone 
suggesting to try partitioning a single SSD to put data/DB/WAL all in separate 
partitions!

Are the docs wrong and/or am I missing something about optimal bluestore setup, 
or do people simply have the wrong end of the stick? I ask because I'm just 
going through switching all my OSDs over to Bluestore now and I've just been 
reusing the partitions I set up for journals on my SSDs as DB devices for 
Bluestore HDDs without specifying anything to do with the WAL, and I'd like to 
know sooner rather than later if I'm making some sort of horrible mistake.

Rich


Having had no explanatory reply so far I'll ask further...

I have been continuing to update my OSDs and so far the performance offered by 
bluestore has been somewhat underwhelming. Recovery operations after replacing 
the Filestore OSDs with Bluestore equivalents have been much slower than 
expected, not even half the speed of recovery ops when I was upgrading 
Filestore OSDs with larger disks a few months ago. This contributes to my sense 
that I am doing something wrong.

I've found that if I allow ceph-disk to partition my DB SSDs rather than 
reusing the rather large journal partitions I originally created for Filestore, 
it is only creating very small 1GB partitions. Attempting to search for 
bluestore configuration parameters has pointed me towards 
bluestore_block_db_size and bluestore_block_wal_size config settings. 
Unfortunately these settings are completely undocumented so I'm not sure what 
their functional purpose is. In any event in my running config I seem to have 
the following default values:

# ceph-conf --show-config | grep bluestore
...
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path =
bluestore_block_db_size = 0
bluestore_block_path =
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false

Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-12 Thread Richard Hesketh
Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.
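
In other words, when the DB and WAL share a device, specifying only the DB is 
enough. A minimal sketch (the device names are placeholders):

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1  # no --block.wal given; the WAL is placed on the block.db device automatically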

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich

On 12/09/17 00:13, Brad Hubbard wrote:
> Take a look at these which should answer at least some of your questions.
> 
> http://ceph.com/community/new-luminous-bluestore/
> 
> http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/
> 
> On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh wrote:
>> On 08/09/17 11:44, Richard Hesketh wrote:
>>> Hi,
>>>
>>> Reading the ceph-users list I'm obviously seeing a lot of people talking 
>>> about using bluestore now that Luminous has been released. I note that many 
>>> users seem to be under the impression that they need separate block devices 
>>> for the bluestore data block, the DB, and the WAL... even when they are 
>>> going to put the DB and the WAL on the same device!
>>>
>>> As per the docs at 
>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
>>> this is nonsense:
>>>
 If there is only a small amount of fast storage available (e.g., less than a 
 gigabyte), we recommend using it as a WAL device. If there is more, provisioning 
 a DB device makes more sense. The BlueStore journal will always be placed on the 
 fastest device available, so using a DB device will provide the same benefit that 
 the WAL device would while also allowing additional metadata to be stored there 
 (if it will fix). [sic, I assume that should be "fit"]
>>>
>>> I understand that if you've got three speeds of storage available, there 
>>> may be some sense to dividing these. For instance, if you've got lots of 
>>> HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD, 
>>> DB on SSD and WAL on NVMe may be a sensible division of data. That's not 
>>> the case for most of the examples I'm reading; they're talking about 
>>> putting DB and WAL on the same block device, but in different partitions. 
>>> There's even one example of someone suggesting to try partitioning a single 
>>> SSD to put data/DB/WAL all in separate partitions!
>>>
>>> Are the docs wrong and/or am I missing something about optimal bluestore 
>>> setup, or do people simply have the wrong end of the stick? I ask because 
>>> I'm just going through switching all my OSDs over to Bluestore now and I've 
>>> just been reusing the partitions I set up for journals on my SSDs as DB 
>>> devices for Bluestore HDDs without specifying anything to do with the WAL, 
>>> and I'd like to know sooner rather than later if I'm making some sort of 
>>> horrible mistake.
>>>
>>> Rich
>>
>> Having had no explanatory reply so far I'll ask further...
>>
>> I have been continuing to update my OSDs and so far the performance offered 
>> by bluestore has been somewhat underwhelming. Recovery operations after 
>> replacing the Filestore OSDs with Bluestore equivalents have been much 
>> slower than expected, not even half the speed of recovery ops when I was 
>> upgrading Filestore OSDs with larger disks a few months ago. This 
>> contributes to my sense that I am doing something wrong.
>>
>> I've found that if I allow ceph-disk to partition my DB SSDs rather than 
>> reusing the rather large journal partitions I originally created for 
>> Filestore, it is only creating very small 1GB partitions. Attempting to 
>> search for bluestore configuration parameters has pointed me towards 
>> bluestore_block_db_size and bluestore_block_wal_size config settings. 
>> Unfortunately these settings are completely undocumented so I'm not sure 
>> what their functional purpose is. In any event in my running config I seem 
>> to have the following default values:
>>
>> # ceph-conf --show-config | grep bluestore
>> ...
>> bluestore_block_create = true
>> bluestore_block_db_create = false
>> bluestore_block_db_path =
>> bluestore_block_db_size = 0
>> bluestore_block_path =
>> bluestore_block_preallocate_file = false
>> bluestore_block_size = 10737418240
>> bluestore_block_wal_create = false
>> bluestore_block_wal_path =
>> bluestore_block_wal_size = 100663296
>> ...
>>
>> I have been creating bluestore osds by:
>>
>> ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1 --osd-id Z # 
>> re-using existing partitions for DB
>> or
>> ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --osd-id Z # 
>> letting ceph-disk partition 

Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-11 Thread Brad Hubbard
Take a look at these which should answer at least some of your questions.

http://ceph.com/community/new-luminous-bluestore/

http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/

On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh wrote:
> On 08/09/17 11:44, Richard Hesketh wrote:
>> Hi,
>>
>> Reading the ceph-users list I'm obviously seeing a lot of people talking 
>> about using bluestore now that Luminous has been released. I note that many 
>> users seem to be under the impression that they need separate block devices 
>> for the bluestore data block, the DB, and the WAL... even when they are 
>> going to put the DB and the WAL on the same device!
>>
>> As per the docs at 
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
>> this is nonsense:
>>
>>> If there is only a small amount of fast storage available (e.g., less than
>>> a gigabyte), we recommend using it as a WAL device. If there is more,
>>> provisioning a DB device makes more sense. The BlueStore journal will always
>>> be placed on the fastest device available, so using a DB device will provide
>>> the same benefit that the WAL device would while also allowing additional
>>> metadata to be stored there (if it will fix). [sic, I assume that should be
>>> "fit"]
>>
>> I understand that if you've got three speeds of storage available, there may 
>> be some sense to dividing these. For instance, if you've got lots of HDD, a 
>> bit of SSD, and a tiny NVMe available in the same host, data on HDD, DB on 
>> SSD and WAL on NVMe may be a sensible division of data. That's not the case 
>> for most of the examples I'm reading; they're talking about putting DB and 
>> WAL on the same block device, but in different partitions. There's even one 
>> example of someone suggesting to try partitioning a single SSD to put 
>> data/DB/WAL all in separate partitions!
>>
>> Are the docs wrong and/or am I missing something about optimal bluestore 
>> setup, or do people simply have the wrong end of the stick? I ask because 
>> I'm just going through switching all my OSDs over to Bluestore now and I've 
>> just been reusing the partitions I set up for journals on my SSDs as DB 
>> devices for Bluestore HDDs without specifying anything to do with the WAL, 
>> and I'd like to know sooner rather than later if I'm making some sort of 
>> horrible mistake.
>>
>> Rich
>
> Having had no explanatory reply so far I'll ask further...
>
> I have been continuing to update my OSDs and so far the performance offered 
> by bluestore has been somewhat underwhelming. Recovery operations after 
> replacing the Filestore OSDs with Bluestore equivalents have been much slower 
> than expected, not even half the speed of recovery ops when I was upgrading 
> Filestore OSDs with larger disks a few months ago. This contributes to my 
> sense that I am doing something wrong.
>
> I've found that if I allow ceph-disk to partition my DB SSDs rather than 
> reusing the rather large journal partitions I originally created for 
> Filestore, it is only creating very small 1GB partitions. Attempting to 
> search for bluestore configuration parameters has pointed me towards 
> bluestore_block_db_size and bluestore_block_wal_size config settings. 
> Unfortunately these settings are completely undocumented so I'm not sure what 
> their functional purpose is. In any event in my running config I seem to have 
> the following default values:
>
> # ceph-conf --show-config | grep bluestore
> ...
> bluestore_block_create = true
> bluestore_block_db_create = false
> bluestore_block_db_path =
> bluestore_block_db_size = 0
> bluestore_block_path =
> bluestore_block_preallocate_file = false
> bluestore_block_size = 10737418240
> bluestore_block_wal_create = false
> bluestore_block_wal_path =
> bluestore_block_wal_size = 100663296
> ...
>
> I have been creating bluestore osds by:
>
> ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1 --osd-id Z # 
> re-using existing partitions for DB
> or
> ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --osd-id Z # 
> letting ceph-disk partition DB, after zapping original partitions
>
> Are these sane values? What does it mean that block_db_size is 0 - is it just 
> using the entire block device specified or not actually using it at all? Is 
> the WAL actually being placed on the DB block device? And is that 1GB default 
> really a sensible size for the DB partition?
>
> Rich
>
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-11 Thread Richard Hesketh
On 08/09/17 11:44, Richard Hesketh wrote:
> Hi,
> 
> Reading the ceph-users list I'm obviously seeing a lot of people talking 
> about using bluestore now that Luminous has been released. I note that many 
> users seem to be under the impression that they need separate block devices 
> for the bluestore data block, the DB, and the WAL... even when they are going 
> to put the DB and the WAL on the same device!
> 
> As per the docs at 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
> this is nonsense:
> 
>> If there is only a small amount of fast storage available (e.g., less than a
>> gigabyte), we recommend using it as a WAL device. If there is more, provisioning
>> a DB device makes more sense. The BlueStore journal will always be placed on the
>> fastest device available, so using a DB device will provide the same benefit
>> that the WAL device would while also allowing additional metadata to be stored
>> there (if it will fix). [sic, I assume that should be "fit"]
> 
> I understand that if you've got three speeds of storage available, there may 
> be some sense to dividing these. For instance, if you've got lots of HDD, a 
> bit of SSD, and a tiny NVMe available in the same host, data on HDD, DB on 
> SSD and WAL on NVMe may be a sensible division of data. That's not the case 
> for most of the examples I'm reading; they're talking about putting DB and 
> WAL on the same block device, but in different partitions. There's even one 
> example of someone suggesting to try partitioning a single SSD to put 
> data/DB/WAL all in separate partitions!
> 
> Are the docs wrong and/or am I missing something about optimal bluestore 
> setup, or do people simply have the wrong end of the stick? I ask because I'm 
> just going through switching all my OSDs over to Bluestore now and I've just 
> been reusing the partitions I set up for journals on my SSDs as DB devices 
> for Bluestore HDDs without specifying anything to do with the WAL, and I'd 
> like to know sooner rather than later if I'm making some sort of horrible 
> mistake.
> 
> Rich

Having had no explanatory reply so far I'll ask further...

I have been continuing to update my OSDs and so far the performance offered by 
bluestore has been somewhat underwhelming. Recovery operations after replacing 
the Filestore OSDs with Bluestore equivalents have been much slower than 
expected, not even half the speed of recovery ops when I was upgrading 
Filestore OSDs with larger disks a few months ago. This contributes to my sense 
that I am doing something wrong.

I've found that if I allow ceph-disk to partition my DB SSDs rather than 
reusing the rather large journal partitions I originally created for Filestore, 
it is only creating very small 1GB partitions. Attempting to search for 
bluestore configuration parameters has pointed me towards 
bluestore_block_db_size and bluestore_block_wal_size config settings. 
Unfortunately these settings are completely undocumented so I'm not sure what 
their functional purpose is. In any event in my running config I seem to have 
the following default values:

# ceph-conf --show-config | grep bluestore
...
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path = 
bluestore_block_db_size = 0
bluestore_block_path = 
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false
bluestore_block_wal_path = 
bluestore_block_wal_size = 100663296
...

I have been creating bluestore osds by:

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1 --osd-id Z # 
re-using existing partitions for DB
or
ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --osd-id Z # letting 
ceph-disk partition DB, after zapping original partitions

Are these sane values? What does it mean that block_db_size is 0 - is it just 
using the entire block device specified or not actually using it at all? Is the 
WAL actually being placed on the DB block device? And is that 1GB default 
really a sensible size for the DB partition?
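
For what it's worth, a minimal ceph.conf sketch of how these sizes can be set before 
running ceph-disk prepare - the assumption being that ceph-disk reads them when it 
creates the block.db/block.wal partitions itself; the sizes are purely illustrative 
(in bytes):

[osd]
bluestore_block_db_size = 32212254720   # 30 GB for block.db
bluestore_block_wal_size = 1073741824   # 1 GB for block.wal

When an existing partition is passed via --block.db, the whole partition appears to 
be used regardless of these settings.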

Rich


