Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Pardhiv Karri
Hi David,

Here is the output of ceph df. We have a lot of space in our ceph cluster. We
had 2 OSDs (266, 500) go down earlier due to a hardware issue and never got a
chance to fix them.

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    1101T     701T      400T         36.37
POOLS:
    NAME                   ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                    0      0          0         159T          0
    .rgw.root              3      780        0         159T          3
    .rgw.control           4      0          0         159T          8
    .rgw.gc                5      0          0         159T          35
    .users.uid             6      6037       0         159T          32
    images                 7      16462G     4.38      159T          2660844
    .rgw                   10     820        0         159T          4
    volumes                11     106T       28.91     159T          28011837
    compute                12     11327G     3.01      159T          1467722
    backups                15     0          0         159T          0
    .rgw.buckets.index     16     0          0         159T          2
    .rgw.buckets           17     0          0         159T          0


Thanks,
Pardhiv K

On Fri, May 11, 2018 at 7:14 PM, David Turner  wrote:

> What's your `ceph osd tree`, `ceph df`, `ceph osd df`? You sound like you
> just have a fairly full cluster that you haven't balanced the crush weights
> on.
>
>
> On Fri, May 11, 2018, 10:06 PM Pardhiv Karri 
> wrote:
>
>> Hi David,
>>
>> Thanks for the reply. Yeah, we are seeing that 0.0001 usage on pretty much
>> all OSDs. But on this node it is different: whether at full weight or just
>> 0.2 for OSD 611, OSD 611's utilization starts increasing.
>>
>> --Pardhiv K
>>
>>
>> On Fri, May 11, 2018 at 10:50 AM, David Turner 
>> wrote:
>>
>>> There was a time in the history of Ceph where a weight of 0.0 was not
>>> always what you thought.  People had better experiences with crush weights
>>> of something like 0.0001 or something.  This is just a memory tickling in
>>> the back of my mind of things I've read on the ML years back.
>>>
>>> On Fri, May 11, 2018 at 1:26 PM Bryan Stillwell 
>>> wrote:
>>>
 > We have a large 1PB ceph cluster. We recently added 6 nodes with 16
 2TB disks
 > each to the cluster. All the 5 nodes rebalanced well without any
 issues and
 > the sixth/last node OSDs started acting weird as I increase weight of
 one osd
 > the utilization doesn't change but a different osd on the same node
 > utilization is getting increased. Rebalance complete fine but
 utilization is
 > not right.
 >
 > Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
 > started increasing but its weight is 0.0. If I increase weight of OSD
 611 to
 > 0.2 then its overall utilization is growing to what if its weight is
 0.4. So
 > if I increase weight of 610 and 615 to their full weight then
 utilization on
 > OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to
 stop and
 > downsize the OSD's crush weight back to 0.0 to avoid any implications
 on ceph
 > cluster. Its not just one osd but different OSD's on that one node.
 The only
 > correlation I found out is 610 and 611 OSD Journal partitions are on
 the same
 > SSD drive and all the OSDs are SAS drives. Any help on how to debug or
 > resolve this will be helpful.

 You didn't say which version of Ceph you were using, but based on the
 output
 of 'ceph osd df' I'm guessing it's pre-Jewel (maybe Hammer?) cluster?

 I've found that data placement can be a little weird when you have
 really
 low CRUSH weights (0.2) on one of the nodes where the other nodes have
 large
 CRUSH weights (2.0).  I've had it where a single OSD in a node was
 getting
 almost all the data.  It wasn't until I increased the weights to be
 more in
 line with the rest of the cluster that it evened back out.

 I believe this can also be caused by not having enough PGs in your
 cluster.
 Or the PGs you do have aren't distributed correctly based on the data
 usage
 in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
 correct number of PGs you should have per pool?

 Since you are likely running a pre-Jewel cluster it could also be that
 you
 haven't switched your tunables to use the straw2 data placement
 algorithm:

 http://docs.ceph.com/docs/master/rados/operations/crush-
 map/#hammer-crush-v4

 That should help as well.  Once that's enabled you can convert your
 existing
 buckets to straw2 as well.  Just be careful you don't have any old
 clients
 connecting to your cluster that don't support that feature yet.

 Bryan

 

Re: [ceph-users] Open-sourcing GRNET's Ceph-related tooling

2018-05-11 Thread Brad Hubbard
+ceph-devel

On Wed, May 9, 2018 at 10:00 PM, Nikos Kormpakis  wrote:
> Hello,
>
> I'm happy to announce that GRNET [1] is open-sourcing its Ceph-related
> tooling on GitHub [2]. This repo includes multiple monitoring health
> checks compatible with Luminous and tooling in order deploy quickly our
> new Ceph clusters based on Luminous, ceph-volume lvm and BlueStore.
>
> We hope that people may find these tools useful.
>
> Best regards,
> Nikos
>
> [1] https://grnet.gr/en/
> [2] https://github.com/grnet/cephtools
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
That's right. I didn't actually use Jewel for very long. I'm glad it worked
for you.

On Fri, May 11, 2018, 4:49 PM Webert de Souza Lima 
wrote:

> Thanks David.
> Although you mentioned this was introduced with Luminous, it's working
> with Jewel.
>
> ~# ceph osd pool stats
>
> Fri May 11 17:41:39 2018
>
> pool rbd id 5
>   client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr
>
> pool rbd_cache id 6
>   client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
>   cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing
>
> pool cephfs_metadata id 7
>   client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr
>
> pool cephfs_data_ssd id 8
>   client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr
>
> pool cephfs_data id 9
>   client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr
>
> pool cephfs_data_cache id 10
>   client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
>   cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Fri, May 11, 2018 at 5:14 PM David Turner 
> wrote:
>
>> `ceph osd pool stats` with the option to specify the pool you are
>> interested in should get you the breakdown of IO per pool.  This was
>> introduced with luminous.
>>
>> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> I think ceph doesn't have IO metrics will filters by pool right? I see
>>> IO metrics from clients only:
>>>
>>> ceph_client_io_ops
>>> ceph_client_io_read_bytes
>>> ceph_client_io_read_ops
>>> ceph_client_io_write_bytes
>>> ceph_client_io_write_ops
>>>
>>> and pool "byte" metrics, but not "io":
>>>
>>> ceph_pool(write/read)_bytes(_total)
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>> *IRC NICK - WebertRLZ*
>>>
>>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>>
 Hey Jon!

 On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

 Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
 sure what I should be looking at).
 My current SSD disks have 2 partitions.
  - One is used for cephfs cache tier pool,
  - The other is used for both:  cephfs meta-data pool and cephfs
 data-ssd (this is an additional cephfs data pool with only ssds with file
 layout for a specific direcotory to use it)

 Because of this, iostat shows me peaks of 12k IOPS in the metadata
 partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

 I have yet to measure it the right way but I'd assume my metadata fits
 in RAM (a few 100s of MB only).

 This is an email hosting cluster with dozens of thousands of users so
 there are a lot of random reads and writes, but not too many small files.
 Email messages are concatenated together in files up to 4MB in size
 (when a rotation happens).
 Most user operations are dovecot's INDEX operations and I will keep
 index directory in a SSD-dedicaded pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

 This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

 Saturarion will only happen in peak workloads, not often. By heavy
 write I mean there are much more writes than reads, yes.
 So I think I can start sharing the OSDs, if I think this is impacting
 performance I can just change the ruleset and move metadata to a SSD-only
 pool, right?


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall 

Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread David Turner
What's your `ceph osd tree`, `ceph df`, `ceph osd df`? You sound like you
just have a fairly full cluster that you haven't balanced the crush weights
on.

On Fri, May 11, 2018, 10:06 PM Pardhiv Karri  wrote:

> Hi David,
>
> Thanks for the reply. Yeah, we are seeing that 0.0001 usage on pretty much
> all OSDs. But on this node it is different: whether at full weight or just
> 0.2 for OSD 611, OSD 611's utilization starts increasing.
>
> --Pardhiv K
>
>
> On Fri, May 11, 2018 at 10:50 AM, David Turner 
> wrote:
>
>> There was a time in the history of Ceph where a weight of 0.0 was not
>> always what you thought.  People had better experiences with crush weights
>> of something like 0.0001 or something.  This is just a memory tickling in
>> the back of my mind of things I've read on the ML years back.
>>
>> On Fri, May 11, 2018 at 1:26 PM Bryan Stillwell 
>> wrote:
>>
>>> > We have a large 1PB ceph cluster. We recently added 6 nodes with 16
>>> 2TB disks
>>> > each to the cluster. All the 5 nodes rebalanced well without any
>>> issues and
>>> > the sixth/last node OSDs started acting weird as I increase weight of
>>> one osd
>>> > the utilization doesn't change but a different osd on the same node
>>> > utilization is getting increased. Rebalance complete fine but
>>> utilization is
>>> > not right.
>>> >
>>> > Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
>>> > started increasing but its weight is 0.0. If I increase weight of OSD
>>> 611 to
>>> > 0.2 then its overall utilization is growing to what if its weight is
>>> 0.4. So
>>> > if I increase weight of 610 and 615 to their full weight then
>>> utilization on
>>> > OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to
>>> stop and
>>> > downsize the OSD's crush weight back to 0.0 to avoid any implications
>>> on ceph
>>> > cluster. Its not just one osd but different OSD's on that one node.
>>> The only
>>> > correlation I found out is 610 and 611 OSD Journal partitions are on
>>> the same
>>> > SSD drive and all the OSDs are SAS drives. Any help on how to debug or
>>> > resolve this will be helpful.
>>>
>>> You didn't say which version of Ceph you were using, but based on the
>>> output
>>> of 'ceph osd df' I'm guessing it's pre-Jewel (maybe Hammer?) cluster?
>>>
>>> I've found that data placement can be a little weird when you have really
>>> low CRUSH weights (0.2) on one of the nodes where the other nodes have
>>> large
>>> CRUSH weights (2.0).  I've had it where a single OSD in a node was
>>> getting
>>> almost all the data.  It wasn't until I increased the weights to be more
>>> in
>>> line with the rest of the cluster that it evened back out.
>>>
>>> I believe this can also be caused by not having enough PGs in your
>>> cluster.
>>> Or the PGs you do have aren't distributed correctly based on the data
>>> usage
>>> in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
>>> correct number of PGs you should have per pool?
>>>
>>> Since you are likely running a pre-Jewel cluster it could also be that
>>> you
>>> haven't switched your tunables to use the straw2 data placement
>>> algorithm:
>>>
>>>
>>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#hammer-crush-v4
>>>
>>> That should help as well.  Once that's enabled you can convert your
>>> existing
>>> buckets to straw2 as well.  Just be careful you don't have any old
>>> clients
>>> connecting to your cluster that don't support that feature yet.
>>>
>>> Bryan
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> *Pardhiv Karri*
> "Rise and Rise again until LAMBS become LIONS"
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Pardhiv Karri
Hi David,

Thanks for the reply. Yeah, we are seeing that 0.0001 usage on pretty much
all OSDs. But on this node it is different: whether at full weight or just
0.2 for OSD 611, OSD 611's utilization starts increasing.

--Pardhiv K


On Fri, May 11, 2018 at 10:50 AM, David Turner 
wrote:

> There was a time in the history of Ceph where a weight of 0.0 was not
> always what you thought.  People had better experiences with crush weights
> of something like 0.0001 or something.  This is just a memory tickling in
> the back of my mind of things I've read on the ML years back.
>
> On Fri, May 11, 2018 at 1:26 PM Bryan Stillwell 
> wrote:
>
>> > We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB
>> disks
>> > each to the cluster. All the 5 nodes rebalanced well without any issues
>> and
>> > the sixth/last node OSDs started acting weird as I increase weight of
>> one osd
>> > the utilization doesn't change but a different osd on the same node
>> > utilization is getting increased. Rebalance complete fine but
>> utilization is
>> > not right.
>> >
>> > Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
>> > started increasing but its weight is 0.0. If I increase weight of OSD
>> 611 to
>> > 0.2 then its overall utilization is growing to what if its weight is
>> 0.4. So
>> > if I increase weight of 610 and 615 to their full weight then
>> utilization on
>> > OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to
>> stop and
>> > downsize the OSD's crush weight back to 0.0 to avoid any implications
>> on ceph
>> > cluster. Its not just one osd but different OSD's on that one node. The
>> only
>> > correlation I found out is 610 and 611 OSD Journal partitions are on
>> the same
>> > SSD drive and all the OSDs are SAS drives. Any help on how to debug or
>> > resolve this will be helpful.
>>
>> You didn't say which version of Ceph you were using, but based on the
>> output
>> of 'ceph osd df' I'm guessing it's pre-Jewel (maybe Hammer?) cluster?
>>
>> I've found that data placement can be a little weird when you have really
>> low CRUSH weights (0.2) on one of the nodes where the other nodes have
>> large
>> CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
>> almost all the data.  It wasn't until I increased the weights to be more
>> in
>> line with the rest of the cluster that it evened back out.
>>
>> I believe this can also be caused by not having enough PGs in your
>> cluster.
>> Or the PGs you do have aren't distributed correctly based on the data
>> usage
>> in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
>> correct number of PGs you should have per pool?
>>
>> Since you are likely running a pre-Jewel cluster it could also be that you
>> haven't switched your tunables to use the straw2 data placement algorithm:
>>
>> http://docs.ceph.com/docs/master/rados/operations/crush-
>> map/#hammer-crush-v4
>>
>> That should help as well.  Once that's enabled you can convert your
>> existing
>> buckets to straw2 as well.  Just be careful you don't have any old clients
>> connecting to your cluster that don't support that feature yet.
>>
>> Bryan
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Pardhiv Karri
Hi Bryan,

Thank you for the reply.

We are on Hammer, ceph version 0.94.9
(fe6d859066244b97b24f09d46552afc2071e6f90)

We tried with full weight on all OSDs on that node and OSDs like 611 were
going above 90%, so we downsized and tested with only 0.2.

Our PGs are at 119 for all 12 pools in the cluster.
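(If it helps, here is a quick sketch of how pg_num/pgp_num per pool can be
double-checked; nothing beyond the stock CLI, and the pool name is just an
example:)

# pg_num / pgp_num for every pool, straight from the osdmap
ceph osd dump | grep pg_num

# or for a single pool
ceph osd pool get volumes pg_num
ceph osd pool get volumes pgp_num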

We are using tree algorithm for our clusters.

We deleted and re-added the OSDs and still the same issue.

Not sure if upgrading the cluster might fix it, but we are afraid of
upgrading, so we are hoping for a fix that doesn't require an upgrade.
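For what it's worth, Bryan's straw2 suggestion shouldn't need an upgrade; on
0.94.x it is a CRUSH map edit. A rough sketch (file paths are arbitrary, and
note that all clients must support straw2/CRUSH_V4 and that the change will
move data):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# edit /tmp/crushmap.txt: change "alg tree" (or "alg straw") to "alg straw2"
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
crushtool -i /tmp/crushmap.new --test --show-statistics   # sanity-check first
ceph osd setcrushmap -i /tmp/crushmap.new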

Thanks,
Pardhiv K



On Fri, May 11, 2018 at 10:26 AM, Bryan Stillwell 
wrote:

> > We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB
> disks
> > each to the cluster. All the 5 nodes rebalanced well without any issues
> and
> > the sixth/last node OSDs started acting weird as I increase weight of
> one osd
> > the utilization doesn't change but a different osd on the same node
> > utilization is getting increased. Rebalance complete fine but
> utilization is
> > not right.
> >
> > Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
> > started increasing but its weight is 0.0. If I increase weight of OSD
> 611 to
> > 0.2 then its overall utilization is growing to what if its weight is
> 0.4. So
> > if I increase weight of 610 and 615 to their full weight then
> utilization on
> > OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to stop
> and
> > downsize the OSD's crush weight back to 0.0 to avoid any implications on
> ceph
> > cluster. Its not just one osd but different OSD's on that one node. The
> only
> > correlation I found out is 610 and 611 OSD Journal partitions are on the
> same
> > SSD drive and all the OSDs are SAS drives. Any help on how to debug or
> > resolve this will be helpful.
>
> You didn't say which version of Ceph you were using, but based on the
> output
> of 'ceph osd df' I'm guessing it's pre-Jewel (maybe Hammer?) cluster?
>
> I've found that data placement can be a little weird when you have really
> low CRUSH weights (0.2) on one of the nodes where the other nodes have
> large
> CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
> almost all the data.  It wasn't until I increased the weights to be more in
> line with the rest of the cluster that it evened back out.
>
> I believe this can also be caused by not having enough PGs in your cluster.
> Or the PGs you do have aren't distributed correctly based on the data usage
> in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
> correct number of PGs you should have per pool?
>
> Since you are likely running a pre-Jewel cluster it could also be that you
> haven't switched your tunables to use the straw2 data placement algorithm:
>
> http://docs.ceph.com/docs/master/rados/operations/crush-
> map/#hammer-crush-v4
>
> That should help as well.  Once that's enabled you can convert your
> existing
> buckets to straw2 as well.  Just be careful you don't have any old clients
> connecting to your cluster that don't support that feature yet.
>
> Bryan
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Test for Leo

2018-05-11 Thread Tom W
Test for Leo, please ignore.




NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named 
person(s). If you are not the intended recipient, notify the sender 
immediately, delete this email from your system and do not disclose or use for 
any purpose. We may monitor all incoming and outgoing emails in line with 
current legislation. We have taken steps to ensure that this email and 
attachments are free from any virus, but it remains your responsibility to 
ensure that viruses do not adversely affect you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
Thanks David.
Although you mentioned this was introduced with Luminous, it's working with
Jewel.

~# ceph osd pool stats

Fri May 11 17:41:39 2018

pool rbd id 5
  client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr

pool rbd_cache id 6
  client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
  cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing

pool cephfs_metadata id 7
  client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr

pool cephfs_data_ssd id 8
  client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr

pool cephfs_data id 9
  client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr

pool cephfs_data_cache id 10
  client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
  cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 5:14 PM David Turner  wrote:

> `ceph osd pool stats` with the option to specify the pool you are
> interested in should get you the breakdown of IO per pool.  This was
> introduced with luminous.
>
> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> I think ceph doesn't have IO metrics will filters by pool right? I see IO
>> metrics from clients only:
>>
>> ceph_client_io_ops
>> ceph_client_io_read_bytes
>> ceph_client_io_read_ops
>> ceph_client_io_write_bytes
>> ceph_client_io_write_ops
>>
>> and pool "byte" metrics, but not "io":
>>
>> ceph_pool(write/read)_bytes(_total)
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Hey Jon!
>>>
>>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>>
 It depends on the metadata intensity of your workload.  It might be
 quite interesting to gather some drive stats on how many IOPS are
 currently hitting your metadata pool over a week of normal activity.

>>>
>>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>>> sure what I should be looking at).
>>> My current SSD disks have 2 partitions.
>>>  - One is used for cephfs cache tier pool,
>>>  - The other is used for both:  cephfs meta-data pool and cephfs
>>> data-ssd (this is an additional cephfs data pool with only ssds with file
>>> layout for a specific direcotory to use it)
>>>
>>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>>> partition, but this could definitely be IO for the data-ssd pool.
>>>
>>>
 If you are doing large file workloads, and the metadata mostly fits in
 RAM, then the number of IOPS from the MDS can be very, very low.  On
 the other hand, if you're doing random metadata reads from a small
 file workload where the metadata does not fit in RAM, almost every
 client read could generate a read operation, and each MDS could easily
 generate thousands of ops per second.

>>>
>>> I have yet to measure it the right way but I'd assume my metadata fits
>>> in RAM (a few 100s of MB only).
>>>
>>> This is an email hosting cluster with dozens of thousands of users so
>>> there are a lot of random reads and writes, but not too many small files.
>>> Email messages are concatenated together in files up to 4MB in size
>>> (when a rotation happens).
>>> Most user operations are dovecot's INDEX operations and I will keep
>>> index directory in a SSD-dedicaded pool.
>>>
>>>
>>>
 Isolating metadata OSDs is useful if the data OSDs are going to be
 completely saturated: metadata performance will be protected even if
 clients are hitting the data OSDs hard.

>>>
>>> This seems to be the case.
>>>
>>>
 If "heavy write" means completely saturating the cluster, then sharing
 the OSDs is risky.  If "heavy write" just means that there are more
 writes than reads, then it may be fine if the metadata workload is not
 heavy enough to make good use of SSDs.

>>>
>>> Saturarion will only happen in peak workloads, not often. By heavy write
>>> I mean there are much more writes than reads, yes.
>>> So I think I can start sharing the OSDs, if I think this is impacting
>>> performance I can just change the ruleset and move metadata to a SSD-only
>>> pool, right?
>>>
>>>
 The way I'd summarise this is: in the general case, dedicated SSDs are
 the safe way to go -- they're intrinsically better suited to metadata.
 However, in some quite common special cases, the overall number of
 metadata ops is so low that the device doesn't matter.
>>>
>>>
>>>
>>> Thank you very much John!
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> Belo Horizonte - Brasil
>>> IRC NICK - WebertRLZ
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
`ceph osd pool stats` with the option to specify the pool you are
interested in should get you the breakdown of IO per pool.  This was
introduced with luminous.
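For example (the pool name is just an example):

ceph osd pool stats cephfs_metadata
# machine-readable, if you want to scrape it:
ceph osd pool stats cephfs_metadata -f json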

On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima 
wrote:

> I think ceph doesn't have IO metrics will filters by pool right? I see IO
> metrics from clients only:
>
> ceph_client_io_ops
> ceph_client_io_read_bytes
> ceph_client_io_read_ops
> ceph_client_io_write_bytes
> ceph_client_io_write_ops
>
> and pool "byte" metrics, but not "io":
>
> ceph_pool(write/read)_bytes(_total)
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima 
> wrote:
>
>> Hey Jon!
>>
>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>
>>> It depends on the metadata intensity of your workload.  It might be
>>> quite interesting to gather some drive stats on how many IOPS are
>>> currently hitting your metadata pool over a week of normal activity.
>>>
>>
>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>> sure what I should be looking at).
>> My current SSD disks have 2 partitions.
>>  - One is used for cephfs cache tier pool,
>>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
>> (this is an additional cephfs data pool with only ssds with file layout for
>> a specific direcotory to use it)
>>
>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>> partition, but this could definitely be IO for the data-ssd pool.
>>
>>
>>> If you are doing large file workloads, and the metadata mostly fits in
>>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>>> the other hand, if you're doing random metadata reads from a small
>>> file workload where the metadata does not fit in RAM, almost every
>>> client read could generate a read operation, and each MDS could easily
>>> generate thousands of ops per second.
>>>
>>
>> I have yet to measure it the right way but I'd assume my metadata fits in
>> RAM (a few 100s of MB only).
>>
>> This is an email hosting cluster with dozens of thousands of users so
>> there are a lot of random reads and writes, but not too many small files.
>> Email messages are concatenated together in files up to 4MB in size (when
>> a rotation happens).
>> Most user operations are dovecot's INDEX operations and I will keep index
>> directory in a SSD-dedicaded pool.
>>
>>
>>
>>> Isolating metadata OSDs is useful if the data OSDs are going to be
>>> completely saturated: metadata performance will be protected even if
>>> clients are hitting the data OSDs hard.
>>>
>>
>> This seems to be the case.
>>
>>
>>> If "heavy write" means completely saturating the cluster, then sharing
>>> the OSDs is risky.  If "heavy write" just means that there are more
>>> writes than reads, then it may be fine if the metadata workload is not
>>> heavy enough to make good use of SSDs.
>>>
>>
>> Saturarion will only happen in peak workloads, not often. By heavy write
>> I mean there are much more writes than reads, yes.
>> So I think I can start sharing the OSDs, if I think this is impacting
>> performance I can just change the ruleset and move metadata to a SSD-only
>> pool, right?
>>
>>
>>> The way I'd summarise this is: in the general case, dedicated SSDs are
>>> the safe way to go -- they're intrinsically better suited to metadata.
>>> However, in some quite common special cases, the overall number of
>>> metadata ops is so low that the device doesn't matter.
>>
>>
>>
>> Thank you very much John!
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>> IRC NICK - WebertRLZ
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread David Turner
As for whether you should put only the WAL on the NVMe vs. using a filestore
journal, that depends on your write patterns, use case, etc.  In my clusters with 10TB
disks I use 2GB partitions for the WAL and leave the DB on the HDD with the
data.  Those are in archival RGW use cases and that works fine for the
throughput.  The pains of filestore subfolder splitting are too severe for
us to think about using our 10TB disks with filestore and journals, but we
have 100s of millions of tiny objects.
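A rough sketch of that layout, in case it helps (device names are only
examples, and adjust the partition number to whatever sgdisk assigns):

# carve a 2GB WAL partition on the NVMe for one OSD
sgdisk -n 0:0:+2G -c 0:'osd-sdb-wal' -- /dev/nvme0n1
# data and DB stay on the HDD; only the WAL goes to the NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1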

The WAL is pretty static and wouldn't be a problem with DB and WAL on the
same device even if the DB fills up the device.  I'm fairly certain that
ceph will prioritize things such that the WAL won't spill over at all and
just have the DB going over to the HDD.  I didn't want to deal with speed
differentials between OSDs.  Troubleshooting the slow requests that would cause
just sounds awful.

With ceph-volume the partition type id doesn't matter at all.  I honestly
don't know what the ids of my wal partitions are.  That was one of the
goals of ceph-volume: to remove all of the magic ids everywhere that were
required for things to start up on system boot.  It's a lot more deterministic,
with fewer things like partition type ids needing to all be perfect.
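A quick way to see what each OSD is actually pointing at (paths assume the
default cluster name):

ceph-volume lvm list        # shows the data/db/wal devices per OSD
ls -l /var/lib/ceph/osd/ceph-*/block.db /var/lib/ceph/osd/ceph-*/block.wal 2>/dev/null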

On Fri, May 11, 2018 at 2:14 PM Oliver Schulz 
wrote:

> Dear David,
>
> thanks a lot for the detailed answer(s) and clarifications!
> Can I ask just a few more questions?
>
> On 11.05.2018 18:46, David Turner wrote:
> > partitions is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should
> > be looking closer to a 40GB block.db partition.  If your block.db
> > partition is too small, then once it fills up it will spill over onto
> > the data volume and slow things down.
>
> Ooops ... I have 15 x 10 TB disks in the servers, and one Optane
> SSD for all of them - so I don't have 10GB SSD per TB of HDD. :-(
> Will I still get a speed-up if only part of the block.db fits? Or
> should I use the SSD for WAL only? Or even use good old filestore
> with 10GB journals, instead of bluestore?
>
>
> >> And just to make sure - if I specify "--osd-db", I don't need
> >> to set "--osd-wal" as well, since the WAL will end up on the
> >> DB partition automatically, correct?
> > This is correct.  The wal will automatically be placed on the db if not
> > otherwise specified.
>
> Would there still be any benefit to having separate WAL
> and DB partitions (so they that DB doesn't compete
> with WAL for space, or something like that)?
>
>
> > I don't use ceph-deploy, but the process for creating the OSDs should be
> > something like this.  After the OSDs are created it is a good idea to
> > make sure that the OSD is not looking for the db partition with the
> > /dev/nvme0n1p2 distinction as that can change on reboots if you have
>
> Yes, I just put that in as an example. I had thought about creating
> the partitions with
>
>  sgdisk -n 0:0:+10G -t 0:8300 -c 0:"osd-XYZ-db" -- /dev/nvme0n1
>
> and then use "/dev/disk/by-partlabel/osd-XYZ-db" (or the partition
> UUIDs) for "ceph volume ...". Thanks for the tip about checking
> the symlinks! Btw, is "-t 0:8300" Ok? I guess the type number won't
> really matter, though?
>
>
> Cheers,
>
> Oliver
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
I think ceph doesn't have IO metrics with filters by pool, right? I see IO
metrics from clients only:

ceph_client_io_ops
ceph_client_io_read_bytes
ceph_client_io_read_ops
ceph_client_io_write_bytes
ceph_client_io_write_ops

and pool "byte" metrics, but not "io":

ceph_pool(write/read)_bytes(_total)

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima 
wrote:

> Hey Jon!
>
> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>
>> It depends on the metadata intensity of your workload.  It might be
>> quite interesting to gather some drive stats on how many IOPS are
>> currently hitting your metadata pool over a week of normal activity.
>>
>
> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
> sure what I should be looking at).
> My current SSD disks have 2 partitions.
>  - One is used for cephfs cache tier pool,
>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
> (this is an additional cephfs data pool with only ssds with file layout for
> a specific direcotory to use it)
>
> Because of this, iostat shows me peaks of 12k IOPS in the metadata
> partition, but this could definitely be IO for the data-ssd pool.
>
>
>> If you are doing large file workloads, and the metadata mostly fits in
>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>> the other hand, if you're doing random metadata reads from a small
>> file workload where the metadata does not fit in RAM, almost every
>> client read could generate a read operation, and each MDS could easily
>> generate thousands of ops per second.
>>
>
> I have yet to measure it the right way but I'd assume my metadata fits in
> RAM (a few 100s of MB only).
>
> This is an email hosting cluster with dozens of thousands of users so
> there are a lot of random reads and writes, but not too many small files.
> Email messages are concatenated together in files up to 4MB in size (when
> a rotation happens).
> Most user operations are dovecot's INDEX operations and I will keep index
> directory in a SSD-dedicaded pool.
>
>
>
>> Isolating metadata OSDs is useful if the data OSDs are going to be
>> completely saturated: metadata performance will be protected even if
>> clients are hitting the data OSDs hard.
>>
>
> This seems to be the case.
>
>
>> If "heavy write" means completely saturating the cluster, then sharing
>> the OSDs is risky.  If "heavy write" just means that there are more
>> writes than reads, then it may be fine if the metadata workload is not
>> heavy enough to make good use of SSDs.
>>
>
> Saturarion will only happen in peak workloads, not often. By heavy write I
> mean there are much more writes than reads, yes.
> So I think I can start sharing the OSDs, if I think this is impacting
> performance I can just change the ruleset and move metadata to a SSD-only
> pool, right?
>
>
>> The way I'd summarise this is: in the general case, dedicated SSDs are
>> the safe way to go -- they're intrinsically better suited to metadata.
>> However, in some quite common special cases, the overall number of
>> metadata ops is so low that the device doesn't matter.
>
>
>
> Thank you very much John!
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-11 Thread Webert de Souza Lima
This message seems to be very concerning:
 >mds0: Metadata damage detected

but for the rest, the cluster still seems to be recovering. You could try
to speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10
ceph tell osd.* injectargs --osd_recovery_sleep=0.0
ceph tell osd.* injectargs --osd_recovery_threads=2
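As for the metadata damage itself, the MDS keeps a table of what it flagged;
something along these lines should list it (the rank and daemon name here are
assumptions):

ceph tell mds.0 damage ls
# or on the MDS host, via the admin socket:
ceph daemon mds.ceph-0 damage ls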


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:06 PM Daniel Davidson 
wrote:

> Below id the information you were asking for.  I think they are size=2,
> min size=1.
>
> Dan
>
> # ceph status
> cluster
> 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>
>  health
> HEALTH_ERR
>
> 140 pgs are stuck inactive for more than 300 seconds
> 64 pgs backfill_wait
> 76 pgs backfilling
> 140 pgs degraded
> 140 pgs stuck degraded
> 140 pgs stuck inactive
> 140 pgs stuck unclean
> 140 pgs stuck undersized
> 140 pgs undersized
> 210 requests are blocked > 32 sec
> recovery 38725029/695508092 objects degraded (5.568%)
> recovery 10844554/695508092 objects misplaced (1.559%)
> mds0: Metadata damage detected
> mds0: Behind on trimming (71/30)
> noscrub,nodeep-scrub flag(s) set
>  monmap e3: 4 mons at {ceph-0=
> 172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0
> }
> election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
>   fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>  osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> 1444 TB used, 1011 TB / 2455 TB avail
> 38725029/695508092 objects degraded (5.568%)
> 10844554/695508092 objects misplaced (1.559%)
> 1396 active+clean
>   76 undersized+degraded+remapped+backfilling+peered
>   64 undersized+degraded+remapped+wait_backfill+peered
> recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159 host ceph-0
>   0   81.86079 osd.0 up  1.0  1.0
>   1   81.86079 osd.1 up  1.0  1.0
>  -3  163.72159 host ceph-1
>   2   81.86079 osd.2 up  1.0  1.0
>   3   81.86079 osd.3 up  1.0  1.0
>  -4  163.72159 host ceph-2
>   8   81.86079 osd.8 up  1.0  1.0
>   9   81.86079 osd.9 up  1.0  1.0
>  -5  163.72159 host ceph-3
>  10   81.86079 osd.10up  1.0  1.0
>  11   81.86079 osd.11up  1.0  1.0
>  -6  163.72159 host ceph-4
>   4   81.86079 osd.4 up  1.0  1.0
>   5   81.86079 osd.5 up  1.0  1.0
>  -7  163.72159 host ceph-5
>   6   81.86079 osd.6 up  1.0  1.0
>   7   81.86079 osd.7 up  1.0  1.0
>  -8  163.72159 host ceph-6
>  12   81.86079 osd.12up  0.7  1.0
>  13   81.86079 osd.13up  1.0  1.0
>  -9  163.72159 host ceph-7
>  14   81.86079 osd.14up  1.0  1.0
>  15   81.86079 osd.15up  1.0  1.0
> -10  163.72159 host ceph-8
>  16   81.86079 osd.16up  1.0  1.0
>  17   81.86079 osd.17up  1.0  1.0
> -11  163.72159 host ceph-9
>  18   81.86079 osd.18up  1.0  1.0
>  19   81.86079 osd.19up  1.0  1.0
> -12  163.72159 host ceph-10
>  20   81.86079 osd.20up  1.0  1.0
>  21   81.86079 osd.21up  1.0  1.0
> -13  163.72159 host ceph-11
>  22   81.86079 osd.22up  1.0  1.0
>  23   81.86079 osd.23up  1.0  1.0
> -14  163.72159 host ceph-12
>  24   81.86079 osd.24up  1.0  1.0
>  25   81.86079 osd.25up  1.0  1.0
> -15  163.72159 host ceph-13
>  26   81.86079 osd.26  down0  1.0
>  27   81.86079 osd.27  down0  1.0
> -16  163.72159 host ceph-14
>  28   81.86079 osd.28up  1.0  1.0
>  29   81.86079 osd.29up  1.0  1.0
> -17  163.72159 host ceph-15
>  30   

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-11 Thread Webert de Souza Lima
You could use "mds_cache_size" to limit number of CAPS untill you have this
fixed, but I'd say for your number of caps and inodes, 20GB is normal.
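e.g. something like (the value is only an example, and note it is an inode
count, not bytes):

ceph tell mds.0 injectargs '--mds_cache_size=2000000'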

this mds (jewel) here is consuming 24GB RAM:

{
"mds": {
"request": 7194867047,
"reply": 7194866688,
"reply_latency": {
"avgcount": 7194866688,
"sum": 27779142.611775008
},
"forward": 0,
"dir_fetch": 179223482,
"dir_commit": 1529387896,
"dir_split": 0,
"inode_max": 300,
"inodes": 3001264,
"inodes_top": 160517,
"inodes_bottom": 226577,
"inodes_pin_tail": 2614170,
"inodes_pinned": 2770689,
"inodes_expired": 2920014835,
"inodes_with_caps": 2743194,
"caps": 2803568,
"subtrees": 2,
"traverse": 8255083028,
"traverse_hit": 7452972311,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 180547123,
"traverse_remote_ino": 122257,
"traverse_lock": 5957156,
"load_cent": 18446743934203149911,
"q": 54,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER 
wrote:

> Hi,
>
> I'm still seeing memory leak with 12.2.5.
>
> seem to leak some MB each 5 minutes.
>
> I'll try to resent some stats next weekend.
>
>
> - Mail original -
> De: "Patrick Donnelly" 
> À: "Brady Deetz" 
> Cc: "Alexandre Derumier" , "ceph-users" <
> ceph-users@lists.ceph.com>
> Envoyé: Jeudi 10 Mai 2018 21:11:19
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> > ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32
> > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup
> ceph
> >
> >
> > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> > {
> > "pool": {
> > "items": 173261056,
> > "bytes": 76504108600
> > }
> > }
> >
> > So, 80GB is my configured limit for the cache and it appears the mds is
> > following that limit. But, the mds process is using over 100GB RAM in my
> > 128GB host. I thought I was playing it safe by configuring at 80. What
> other
> > things consume a lot of RAM for this process?
> >
> > Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade
> ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
>
> --
> Patrick Donnelly
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread Oliver Schulz

Dear David,

thanks a lot for the detailed answer(s) and clarifications!
Can I ask just a few more questions?

On 11.05.2018 18:46, David Turner wrote:
partitions is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should 
be looking closer to a 40GB block.db partition.  If your block.db 
partition is too small, then once it fills up it will spill over onto 
the data volume and slow things down.


Ooops ... I have 15 x 10 TB disks in the servers, and one Optane
SSD for all of them - so I don't have 10GB SSD per TB of HDD. :-(
Will I still get a speed-up if only part of the block.db fits? Or
should I use the SSD for WAL only? Or even use good old filestore
with 10GB journals, instead of bluestore?



And just to make sure - if I specify "--osd-db", I don't need
to set "--osd-wal" as well, since the WAL will end up on the
DB partition automatically, correct?
This is correct.  The wal will automatically be placed on the db if not 
otherwise specified.


Would there still be any benefit to having separate WAL
and DB partitions (so they that DB doesn't compete
with WAL for space, or something like that)?


I don't use ceph-deploy, but the process for creating the OSDs should be 
something like this.  After the OSDs are created it is a good idea to 
make sure that the OSD is not looking for the db partition with the
/dev/nvme0n1p2 distinction as that can change on reboots if you have 


Yes, I just put that in as an example. I had thought about creating
the partitions with

sgdisk -n 0:0:+10G -t 0:8300 -c 0:"osd-XYZ-db" -- /dev/nvme0n1

and then use "/dev/disk/by-partlabel/osd-XYZ-db" (or the partition
UUIDs) for "ceph volume ...". Thanks for the tip about checking
the symlinks! Btw, is "-t 0:8300" Ok? I guess the type number won't 
really matter, though?



Cheers,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-11 Thread Daniel Davidson
Below is the information you were asking for.  I think they are size=2,
min size=1.


Dan

# ceph status
    cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
 health HEALTH_ERR
    140 pgs are stuck inactive for more than 300 seconds
    64 pgs backfill_wait
    76 pgs backfilling
    140 pgs degraded
    140 pgs stuck degraded
    140 pgs stuck inactive
    140 pgs stuck unclean
    140 pgs stuck undersized
    140 pgs undersized
    210 requests are blocked > 32 sec
    recovery 38725029/695508092 objects degraded (5.568%)
    recovery 10844554/695508092 objects misplaced (1.559%)
    mds0: Metadata damage detected
    mds0: Behind on trimming (71/30)
    noscrub,nodeep-scrub flag(s) set
 monmap e3: 4 mons at 
{ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}

    election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
  fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
 osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
    flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
    1444 TB used, 1011 TB / 2455 TB avail
    38725029/695508092 objects degraded (5.568%)
    10844554/695508092 objects misplaced (1.559%)
    1396 active+clean
  76 undersized+degraded+remapped+backfilling+peered
  64 undersized+degraded+remapped+wait_backfill+peered
recovery io 1244 MB/s, 1612 keys/s, 705 objects/s

ID  WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 2619.54541 root default
 -2  163.72159 host ceph-0
  0   81.86079 osd.0 up  1.0  1.0
  1   81.86079 osd.1 up  1.0  1.0
 -3  163.72159 host ceph-1
  2   81.86079 osd.2 up  1.0  1.0
  3   81.86079 osd.3 up  1.0  1.0
 -4  163.72159 host ceph-2
  8   81.86079 osd.8 up  1.0  1.0
  9   81.86079 osd.9 up  1.0  1.0
 -5  163.72159 host ceph-3
 10   81.86079 osd.10    up  1.0  1.0
 11   81.86079 osd.11    up  1.0  1.0
 -6  163.72159 host ceph-4
  4   81.86079 osd.4 up  1.0  1.0
  5   81.86079 osd.5 up  1.0  1.0
 -7  163.72159 host ceph-5
  6   81.86079 osd.6 up  1.0  1.0
  7   81.86079 osd.7 up  1.0  1.0
 -8  163.72159 host ceph-6
 12   81.86079 osd.12    up  0.7  1.0
 13   81.86079 osd.13    up  1.0  1.0
 -9  163.72159 host ceph-7
 14   81.86079 osd.14    up  1.0  1.0
 15   81.86079 osd.15    up  1.0  1.0
-10  163.72159 host ceph-8
 16   81.86079 osd.16    up  1.0  1.0
 17   81.86079 osd.17    up  1.0  1.0
-11  163.72159 host ceph-9
 18   81.86079 osd.18    up  1.0  1.0
 19   81.86079 osd.19    up  1.0  1.0
-12  163.72159 host ceph-10
 20   81.86079 osd.20    up  1.0  1.0
 21   81.86079 osd.21    up  1.0  1.0
-13  163.72159 host ceph-11
 22   81.86079 osd.22    up  1.0  1.0
 23   81.86079 osd.23    up  1.0  1.0
-14  163.72159 host ceph-12
 24   81.86079 osd.24    up  1.0  1.0
 25   81.86079 osd.25    up  1.0  1.0
-15  163.72159 host ceph-13
 26   81.86079 osd.26  down    0  1.0
 27   81.86079 osd.27  down    0  1.0
-16  163.72159 host ceph-14
 28   81.86079 osd.28    up  1.0  1.0
 29   81.86079 osd.29    up  1.0  1.0
-17  163.72159 host ceph-15
 30   81.86079 osd.30    up  1.0  1.0
 31   81.86079 osd.31    up  1.0  1.0



On 05/11/2018 11:56 AM, David Turner wrote:
What are some outputs of commands to show us the state of your 
cluster.  Most notable is `ceph status` but `ceph osd tree` would be 
helpful. What are the size of the pools in your cluster?  Are they all 
size=3 min_size=2?


On Fri, May 11, 2018 at 12:05 PM Daniel Davidson 
> wrote:


Hello,

Today we had a node crash, and looking at it, it seems there is a
problem with the RAID controller, so it is not coming back up, maybe
ever.  It corrupted the local filesytem for the ceph 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-11 Thread Alexandre DERUMIER
Hi,

I'm still seeing memory leak with 12.2.5.

seem to leak some MB each 5 minutes.

I'll try to resend some stats next weekend.
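(Probably something along these lines captured periodically; the daemon name
is just a placeholder:)

ceph daemon mds.<name> perf dump > /tmp/mds-perf.$(date +%s).json
ceph daemon mds.<name> dump_mempools > /tmp/mds-mempools.$(date +%s).json
ceph tell mds.<name> heap stats    # if built with tcmalloc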


- Mail original -
De: "Patrick Donnelly" 
À: "Brady Deetz" 
Cc: "Alexandre Derumier" , "ceph-users" 

Envoyé: Jeudi 10 Mai 2018 21:11:19
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote: 
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds 
> ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32 
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph 
> 
> 
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status 
> { 
> "pool": { 
> "items": 173261056, 
> "bytes": 76504108600 
> } 
> } 
> 
> So, 80GB is my configured limit for the cache and it appears the mds is 
> following that limit. But, the mds process is using over 100GB RAM in my 
> 128GB host. I thought I was playing it safe by configuring at 80. What other 
> things consume a lot of RAM for this process? 
> 
> Let me know if I need to create a new thread. 

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade 
ASAP. 

[1] https://tracker.ceph.com/issues/22972 

-- 
Patrick Donnelly 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread David Turner
There was a time in the history of Ceph where a weight of 0.0 was not
always what you thought.  People had better experiences with crush weights
of something small like 0.0001 instead.  This is just a memory tickling in
the back of my mind from things I've read on the ML years back.
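i.e. rather than leaving them at 0.0, something like this (the OSD id and
weights are only examples; roughly 1.819 would be the full CRUSH weight of a
2TB drive):

ceph osd crush reweight osd.610 0.0001
# then step it up toward full weight as the cluster settles
ceph osd crush reweight osd.610 0.5
ceph osd crush reweight osd.610 1.81940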

On Fri, May 11, 2018 at 1:26 PM Bryan Stillwell 
wrote:

> > We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB
> disks
> > each to the cluster. All the 5 nodes rebalanced well without any issues
> and
> > the sixth/last node OSDs started acting weird as I increase weight of
> one osd
> > the utilization doesn't change but a different osd on the same node
> > utilization is getting increased. Rebalance complete fine but
> utilization is
> > not right.
> >
> > Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
> > started increasing but its weight is 0.0. If I increase weight of OSD
> 611 to
> > 0.2 then its overall utilization is growing to what if its weight is
> 0.4. So
> > if I increase weight of 610 and 615 to their full weight then
> utilization on
> > OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to stop
> and
> > downsize the OSD's crush weight back to 0.0 to avoid any implications on
> ceph
> > cluster. Its not just one osd but different OSD's on that one node. The
> only
> > correlation I found out is 610 and 611 OSD Journal partitions are on the
> same
> > SSD drive and all the OSDs are SAS drives. Any help on how to debug or
> > resolve this will be helpful.
>
> You didn't say which version of Ceph you were using, but based on the
> output
> of 'ceph osd df' I'm guessing it's pre-Jewel (maybe Hammer?) cluster?
>
> I've found that data placement can be a little weird when you have really
> low CRUSH weights (0.2) on one of the nodes where the other nodes have
> large
> CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
> almost all the data.  It wasn't until I increased the weights to be more in
> line with the rest of the cluster that it evened back out.
>
> I believe this can also be caused by not having enough PGs in your cluster.
> Or the PGs you do have aren't distributed correctly based on the data usage
> in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
> correct number of PGs you should have per pool?
>
> Since you are likely running a pre-Jewel cluster it could also be that you
> haven't switched your tunables to use the straw2 data placement algorithm:
>
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#hammer-crush-v4
>
> That should help as well.  Once that's enabled you can convert your
> existing
> buckets to straw2 as well.  Just be careful you don't have any old clients
> connecting to your cluster that don't support that feature yet.
>
> Bryan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread David Turner
Nope, only detriment.  If you lost sdb, you would have to rebuild 2 OSDs
instead of just 1.  Also you add more complexity as ceph-volume would much
prefer to just take sda and make it the OSD with all data/db/wal without
partitions or anything.

On Fri, May 11, 2018 at 1:06 PM Jacob DeGlopper  wrote:

> Thanks, this is useful in general.  I have a semi-related question:
>
> Given an OSD server with multiple SSDs or NVME devices, is there an
> advantage to putting wal/db on a different device of the same speed?  For
> example, data on sda1, matching wal/db on sdb1,  and then data on sdb2 and
> wal/db on sda2?
>
> -- jacob
>
> On 05/11/2018 12:46 PM, David Turner wrote:
>
> This thread is off in left field and needs to be brought back to how
> things work.
>
> While multiple OSDs can use the same device for block/wal partitions, they
> each need their own partition.  osd.0 could use nvme0n1p1, osd.2/nvme0n1p2,
> etc.  You cannot use the same partition for each osd.  Ceph-volume will not
> create the db/wal partitions for you, you need to manually create the
> partitions to be used by the OSD.  There is no need to put a filesystem on
> top of the partition for the wal/db.  That is wasted overhead that will
> slow things down.
>
> Back to the original email.
>
> > Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> > osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
> This is what you need to do, but like said above, you need to create the
> partitions for --block-db yourself.  You talked about having a 10GB
> partition for this, but the general recommendation for block-db partitions
> is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should be looking
> closer to a 40GB block.db partition.  If your block.db partition is too
> small, then once it fills up it will spill over onto the data volume and
> slow things down.
>
> > And just to make sure - if I specify "--osd-db", I don't need
> > to set "--osd-wal" as well, since the WAL will end up on the
> > DB partition automatically, correct?
> This is correct.  The wal will automatically be placed on the db if not
> otherwise specified.
>
>
> I don't use ceph-deploy, but the process for creating the OSDs should be
> something like this.  After the OSDs are created it is a good idea to make
> sure that the OSD is not looking for the db partition with the
> /dev/nvme0n1p2 distinction as that can change on reboots if you have
> multiple nvme devices.
>
> # Make sure the disks are clean and ready to use as an OSD
> for hdd in /dev/sd{b..c}; do
>   ceph-volume lvm zap $hdd --destroy
> done
>
> # Create the nvme db partitions (assuming 10G size for a 1TB OSD)
> for partition in {2..3}; do
>   sgdisk -n $partition:0:+10G -c $partition:'ceph db' /dev/nvme0n1
> done
>
> # Create the OSD
> echo "/dev/sdb /dev/nvme0n1p2
> /dev/sdc /dev/nvme0n1p3" | while read hdd db; do
>   ceph-volume lvm create --bluestore --data $hdd --block.db $db
> done
>
> # Fix the OSDs to look for the block.db partition by UUID instead of its
> device name.
> for db in /var/lib/ceph/osd/*/block.db; do
>   dev=$(readlink $db | grep -Eo nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+
> || echo false)
>   if [[ "$dev" != false ]]; then
> uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
> ln -sf /dev/disk/by-partuuid/$uuid $db
>   fi
> done
> systemctl restart ceph-osd.target
>
> On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos <
> joaopaulos...@gmail.com> wrote:
>
>> Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you
>> will see that DB/WAL work on a XFS partition, while the data itself goes on
>> a raw block.
>>
>> Also, I told you the wrong command in the last mail. When i said --osd-db
>> it should be --block-db.
>>
>> On Fri, May 11, 2018 at 11:51 AM Oliver Schulz <
>> oliver.sch...@tu-dortmund.de> wrote:
>>
>>> Hi,
>>>
>>> thanks for the advice! I'm a bit confused now, though. ;-)
>>> I thought DB and WAL were supposed to go on raw block
>>> devices, not file systems?
>>>
>>>
>>> Cheers,
>>>
>>> Oliver
>>>
>>>
>>> On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
>>> > Hello Oliver,
>>> >
>>> > As far as I know yet, you can use the same DB device for about 4 or 5
>>> > OSDs, just need to be aware of the free space. I'm also developing a
>>> > bluestore cluster, and our DB and WAL will be in the same SSD of about
>>> > 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its just a
>>> > feeling because I couldn't find yet any clear rule about how to
>>> measure
>>> > the requirements.
>>> >
>>> > * The only concern that took me some time to realize is that you
>>> should
>>> > create a XFS partition if using ceph-deploy because if you don't it
>>> will
>>> > simply give you a RuntimeError that doesn't give any hint about what's
>>> > going on.
>>> >
>>> > So, answering your question, you could do something like:
>>> > $ ceph-deploy osd create --bluestore --data=/dev/sdb 

Re: [ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-11 Thread Bryan Stillwell
> We have a large 1PB ceph cluster. We recently added 6 nodes with 16 2TB disks
> each to the cluster. All the 5 nodes rebalanced well without any issues and
> the sixth/last node OSDs started acting weird as I increase weight of one osd
> the utilization doesn't change but a different osd on the same node
> utilization is getting increased. Rebalance complete fine but utilization is
> not right.
>
> Increased weight of OSD 610 to 0.2 from 0.0 but utilization of OSD 611
> started increasing but its weight is 0.0. If I increase weight of OSD 611 to
> 0.2 then its overall utilization is growing to what if its weight is 0.4. So
> if I increase weight of 610 and 615 to their full weight then utilization on
> OSD 610 is 1% and on OSD 611 is inching towards 100% where I had to stop and
> downsize the OSD's crush weight back to 0.0 to avoid any implications on ceph
> cluster. Its not just one osd but different OSD's on that one node. The only
> correlation I found out is 610 and 611 OSD Journal partitions are on the same
> SSD drive and all the OSDs are SAS drives. Any help on how to debug or
> resolve this will be helpful.

You didn't say which version of Ceph you were using, but based on the output
of 'ceph osd df' I'm guessing it's a pre-Jewel (maybe Hammer?) cluster?

I've found that data placement can be a little weird when you have really
low CRUSH weights (0.2) on one of the nodes where the other nodes have large
CRUSH weights (2.0).  I've had it where a single OSD in a node was getting
almost all the data.  It wasn't until I increased the weights to be more in
line with the rest of the cluster that it evened back out.
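
For example, bringing one of the low-weighted OSDs up to the full weight of a
2TB disk in a single step looks roughly like this (the OSD id and target
weight are only placeholders; a 2TB disk usually weighs in around 1.82):

  ceph osd crush reweight osd.610 1.82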

I believe this can also be caused by not having enough PGs in your cluster.
Or the PGs you do have aren't distributed correctly based on the data usage
in each pool.  Have you used https://ceph.com/pgcalc/ to determine the
correct number of PGs you should have per pool?
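
For reference, the rough rule of thumb behind that calculator (the OSD count
and pool split below are made up purely for illustration) is:

  # target ~100 PGs per OSD, divided by the replication factor,
  # rounded to a power of two, then split across pools by expected data share
  #   e.g. 96 OSDs at size=3:  96 * 100 / 3 = 3200  ->  2048 or 4096 PGs in total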

Since you are likely running a pre-Jewel cluster it could also be that you
haven't switched your tunables to use the straw2 data placement algorithm:

http://docs.ceph.com/docs/master/rados/operations/crush-map/#hammer-crush-v4

That should help as well.  Once that's enabled you can convert your existing
buckets to straw2 as well.  Just be careful you don't have any old clients
connecting to your cluster that don't support that feature yet.
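
For anyone following along, the commands involved look roughly like this (a
sketch only; changing tunables and rewriting the CRUSH map will trigger data
movement, so test it outside production first):

  ceph osd crush show-tunables            # check the current profile
  ceph osd crush tunables hammer          # enables the straw2 (CRUSH_V4) feature
  # convert existing buckets by editing the decompiled CRUSH map:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  sed -i 's/alg straw$/alg straw2/' crushmap.txt
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new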

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread Jacob DeGlopper

Thanks, this is useful in general.  I have a semi-related question:

Given an OSD server with multiple SSDs or NVME devices, is there an 
advantage to putting wal/db on a different device of the same speed?  
For example, data on sda1, matching wal/db on sdb1,  and then data on 
sdb2 and wal/db on sda2?


    -- jacob


On 05/11/2018 12:46 PM, David Turner wrote:
This thread is off in left field and needs to be brought back to how 
things work.


While multiple OSDs can use the same device for block/wal partitions, 
they each need their own partition.  osd.0 could use nvme0n1p1, 
osd.2/nvme0n1p2, etc.  You cannot use the same partition for each 
osd.  Ceph-volume will not create the db/wal partitions for you, you 
need to manually create the partitions to be used by the OSD.  There 
is no need to put a filesystem on top of the partition for the 
wal/db.  That is wasted overhead that will slow things down.


Back to the original email.

> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
This is what you need to do, but like said above, you need to create 
the partitions for --block-db yourself.  You talked about having a 
10GB partition for this, but the general recommendation for block-db 
partitions is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you 
should be looking closer to a 40GB block.db partition.  If your 
block.db partition is too small, then once it fills up it will spill 
over onto the data volume and slow things down.


> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
This is correct.  The wal will automatically be placed on the db if 
not otherwise specified.



I don't use ceph-deploy, but the process for creating the OSDs should 
be something like this.  After the OSDs are created it is a good idea 
to make sure that the OSD is not looking for the db partition with the 
/dev/nvme0n1p2 distinction as that can change on reboots if you have 
multiple nvme devices.


# Make sure the disks are clean and ready to use as an OSD
for hdd in /dev/sd{b..c}; do
  ceph-volume lvm zap $hdd --destroy
done

# Create the nvme db partitions (assuming 10G size for a 1TB OSD)
for partition in {2..3}; do
  sgdisk -n $partition:0:+10G -c $partition:'ceph db' /dev/nvme0n1
done

# Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  ceph-volume lvm create --bluestore --data $hdd --block.db $db
done

# Fix the OSDs to look for the block.db partition by UUID instead of 
its device name.

for db in /var/lib/ceph/osd/*/block.db; do
  dev=$(readlink $db | grep -Eo 
nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+ || echo false)

  if [[ "$dev" != false ]]; then
    uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
    ln -sf /dev/disk/by-partuuid/$uuid $db
  fi
done
systemctl restart ceph-osd.target

On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos 
> wrote:


Actually, if you go to
https://ceph.com/community/new-luminous-bluestore/ you will see
that DB/WAL work on a XFS partition, while the data itself goes on
a raw block.

Also, I told you the wrong command in the last mail. When i said
--osd-db it should be --block-db.

On Fri, May 11, 2018 at 11:51 AM Oliver Schulz
> wrote:

Hi,

thanks for the advice! I'm a bit confused now, though. ;-)
I thought DB and WAL were supposed to go on raw block
devices, not file systems?


Cheers,

Oliver


On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
> Hello Oliver,
>
> As far as I know yet, you can use the same DB device for
about 4 or 5
> OSDs, just need to be aware of the free space. I'm also
developing a
> bluestore cluster, and our DB and WAL will be in the same
SSD of about
> 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its
just a
> feeling because I couldn't find yet any clear rule about how
to measure
> the requirements.
>
> * The only concern that took me some time to realize is that
you should
> create a XFS partition if using ceph-deploy because if you
don't it will
> simply give you a RuntimeError that doesn't give any hint
about what's
> going on.
>
> So, answering your question, you could do something like:
> $ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db
> /dev/nvme0n1p1 $HOSTNAME
> $ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db
> /dev/nvme0n1p1 $HOSTNAME
>
> On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
 

Re: [ceph-users] Node crash, filesytem not usable

2018-05-11 Thread David Turner
What are some outputs of commands to show us the state of your cluster.
Most notable is `ceph status` but `ceph osd tree` would be helpful. What
are the size of the pools in your cluster?  Are they all size=3 min_size=2?
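
Something along these lines captures most of it (the pool-detail command is
one option; `ceph osd dump` shows the same information):

  ceph status
  ceph osd tree
  ceph df
  ceph osd pool ls detail    # shows size/min_size per pool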

On Fri, May 11, 2018 at 12:05 PM Daniel Davidson 
wrote:

> Hello,
>
> Today we had a node crash, and looking at it, it seems there is a
> problem with the RAID controller, so it is not coming back up, maybe
> ever.  It corrupted the local filesystem for the ceph storage there.
>
> The remainder of our storage (10.2.10) cluster is running, and it looks
> to be repairing and our min_size is set to 2.  Normally, I would expect
> that the system would keep running normally from an end-user
> perspective when this happens, but the system is down. All mounts that
> were up when this started look to be stale, and new mounts give the
> following error:
>
> # mount -t ceph ceph-0:/ /test/ -o
> name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes
> mount error 5 = Input/output error
>
> Any suggestions?
>
> Dan
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread David Turner
Note that instead of including the step to use the UUID in the OSD creation,
like [1] below, I opted to separate it out in those instructions.  That was
to simplify the commands and to give people an idea of how to fix their
OSDs if they created them using the device name instead of the UUID.  It would
be simpler to just create the OSD using the partuuid in the first place.  Also
not mentioned in my previous response: if you would like your OSDs to be
encrypted at rest, you should add --dmcrypt to the ceph-volume command
(included in the example below).

[1] # Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  # match on the partition name only (e.g. nvme0n1p2); the full /dev/... path would break the awk pattern
  uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${db#/dev/}'$/ {print $9}')
  ceph-volume lvm create --bluestore --dmcrypt --data $hdd --block.db
/dev/disk/by-partuuid/$uuid
done

On Fri, May 11, 2018 at 12:46 PM David Turner  wrote:

> This thread is off in left field and needs to be brought back to how
> things work.
>
> While multiple OSDs can use the same device for block/wal partitions, they
> each need their own partition.  osd.0 could use nvme0n1p1, osd.2/nvme0n1p2,
> etc.  You cannot use the same partition for each osd.  Ceph-volume will not
> create the db/wal partitions for you, you need to manually create the
> partitions to be used by the OSD.  There is no need to put a filesystem on
> top of the partition for the wal/db.  That is wasted overhead that will
> slow things down.
>
> Back to the original email.
>
> > Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> > osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
> This is what you need to do, but like said above, you need to create the
> partitions for --block-db yourself.  You talked about having a 10GB
> partition for this, but the general recommendation for block-db partitions
> is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should be looking
> closer to a 40GB block.db partition.  If your block.db partition is too
> small, then once it fills up it will spill over onto the data volume and
> slow things down.
>
>
> > And just to make sure - if I specify "--osd-db", I don't need
> > to set "--osd-wal" as well, since the WAL will end up on the
> > DB partition automatically, correct?
> This is correct.  The wal will automatically be placed on the db if not
> otherwise specified.
>
>
> I don't use ceph-deploy, but the process for creating the OSDs should be
> something like this.  After the OSDs are created it is a good idea to make
> sure that the OSD is not looking for the db partition with the
> /dev/nvme0n1p2 distinction as that can change on reboots if you have
> multiple nvme devices.
>
> # Make sure the disks are clean and ready to use as an OSD
> for hdd in /dev/sd{b..c}; do
>   ceph-volume lvm zap $hdd --destroy
> done
>
> # Create the nvme db partitions (assuming 10G size for a 1TB OSD)
> for partition in {2..3}; do
>   sgdisk -n $partition:0:+10G -c $partition:'ceph db' /dev/nvme0n1
> done
>
> # Create the OSD
> echo "/dev/sdb /dev/nvme0n1p2
> /dev/sdc /dev/nvme0n1p3" | while read hdd db; do
>   ceph-volume lvm create --bluestore --data $hdd --block.db $db
> done
>
> # Fix the OSDs to look for the block.db partition by UUID instead of its
> device name.
> for db in /var/lib/ceph/osd/*/block.db; do
>   dev=$(readlink $db | grep -Eo nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+
> || echo false)
>   if [[ "$dev" != false ]]; then
> uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
> ln -sf /dev/disk/by-partuuid/$uuid $db
>   fi
> done
> systemctl restart ceph-osd.target
>
> On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos <
> joaopaulos...@gmail.com> wrote:
>
>> Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you
>> will see that DB/WAL work on a XFS partition, while the data itself goes on
>> a raw block.
>>
>> Also, I told you the wrong command in the last mail. When i said --osd-db
>> it should be --block-db.
>>
>> On Fri, May 11, 2018 at 11:51 AM Oliver Schulz <
>> oliver.sch...@tu-dortmund.de> wrote:
>>
>>> Hi,
>>>
>>> thanks for the advice! I'm a bit confused now, though. ;-)
>>> I thought DB and WAL were supposed to go on raw block
>>> devices, not file systems?
>>>
>>>
>>> Cheers,
>>>
>>> Oliver
>>>
>>>
>>> On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
>>> > Hello Oliver,
>>> >
>>> > As far as I know yet, you can use the same DB device for about 4 or 5
>>> > OSDs, just need to be aware of the free space. I'm also developing a
>>> > bluestore cluster, and our DB and WAL will be in the same SSD of about
>>> > 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its just a
>>> > feeling because I couldn't find yet any clear rule about how to
>>> measure
>>> > the requirements.
>>> >
>>> > * The only concern that took me some time to realize is that you
>>> should
>>> > create a XFS partition if using ceph-deploy because if you don't it
>>> will
>>> > simply give 

Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread David Turner
This thread is off in left field and needs to be brought back to how things
work.

While multiple OSDs can use the same device for block/wal partitions, they
each need their own partition.  osd.0 could use nvme0n1p1, osd.2/nvme0n1p2,
etc.  You cannot use the same partition for each osd.  Ceph-volume will not
create the db/wal partitions for you, you need to manually create the
partitions to be used by the OSD.  There is no need to put a filesystem on
top of the partition for the wal/db.  That is wasted overhead that will
slow things down.

Back to the original email.

> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
This is what you need to do, but as said above, you need to create the
partitions for --block-db yourself.  You talked about having a 10GB
partition for this, but the general recommendation for block-db partitions
is 10GB per 1TB of OSD.  If your OSD is a 4TB disk you should be looking
closer to a 40GB block.db partition.  If your block.db partition is too
small, then once it fills up it will spill over onto the data volume and
slow things down.
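
To make that arithmetic concrete (the disk size and partition number here are
only an example):

  #   4 TB data disk x 10 GB of block.db per 1 TB  ->  ~40 GB block.db partition
  sgdisk -n 2:0:+40G -c 2:'ceph db' /dev/nvme0n1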

> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
This is correct.  The wal will automatically be placed on the db if not
otherwise specified.


I don't use ceph-deploy, but the process for creating the OSDs should be
something like this.  After the OSDs are created it is a good idea to make
sure that the OSD is not referencing the db partition by its /dev/nvme0n1p2
device name, as that can change on reboots if you have multiple nvme
devices.

# Make sure the disks are clean and ready to use as an OSD
for hdd in /dev/sd{b..c}; do
  ceph-volume lvm zap $hdd --destroy
done

# Create the nvme db partitions (assuming 10G size for a 1TB OSD)
for partition in {2..3}; do
  sgdisk -n $partition:0:+10G -c $partition:'ceph db' /dev/nvme0n1
done

# Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
  ceph-volume lvm create --bluestore --data $hdd --block.db $db
done

# Fix the OSDs to look for the block.db partition by UUID instead of its
device name.
for db in /var/lib/ceph/osd/*/block.db; do
  dev=$(readlink $db | grep -Eo nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+
|| echo false)
  if [[ "$dev" != false ]]; then
uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
ln -sf /dev/disk/by-partuuid/$uuid $db
  fi
done
systemctl restart ceph-osd.target

On Fri, May 11, 2018 at 10:59 AM João Paulo Sacchetto Ribeiro Bastos <
joaopaulos...@gmail.com> wrote:

> Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you
> will see that DB/WAL work on a XFS partition, while the data itself goes on
> a raw block.
>
> Also, I told you the wrong command in the last mail. When i said --osd-db
> it should be --block-db.
>
> On Fri, May 11, 2018 at 11:51 AM Oliver Schulz <
> oliver.sch...@tu-dortmund.de> wrote:
>
>> Hi,
>>
>> thanks for the advice! I'm a bit confused now, though. ;-)
>> I thought DB and WAL were supposed to go on raw block
>> devices, not file systems?
>>
>>
>> Cheers,
>>
>> Oliver
>>
>>
>> On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
>> > Hello Oliver,
>> >
>> > As far as I know yet, you can use the same DB device for about 4 or 5
>> > OSDs, just need to be aware of the free space. I'm also developing a
>> > bluestore cluster, and our DB and WAL will be in the same SSD of about
>> > 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its just a
>> > feeling because I couldn't find yet any clear rule about how to measure
>> > the requirements.
>> >
>> > * The only concern that took me some time to realize is that you should
>> > create a XFS partition if using ceph-deploy because if you don't it
>> will
>> > simply give you a RuntimeError that doesn't give any hint about what's
>> > going on.
>> >
>> > So, answering your question, you could do something like:
>> > $ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db
>> > /dev/nvme0n1p1 $HOSTNAME
>> > $ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db
>> > /dev/nvme0n1p1 $HOSTNAME
>> >
>> > On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
>> > >
>> wrote:
>> >
>> > Dear Ceph Experts,
>> >
>> > I'm trying to set up some new OSD storage nodes, now with
>> > bluestore (our existing nodes still use filestore). I'm
>> > a bit unclear on how to specify WAL/DB devices: Can
>> > several OSDs share one WAL/DB partition? So, can I do
>> >
>> >   ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
>> > --data=/dev/sdb HOSTNAME
>> >
>> >   ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
>> > --data=/dev/sdc HOSTNAME
>> >
>> >   ...
>> >
>> > Or do I need to use 

[ceph-users] Bucket reporting content inconsistently

2018-05-11 Thread Sean Redmond
Hi all,

We have recently upgraded to 10.2.10 in preparation for our upcoming
upgrade to Luminous, and I have been attempting to remove a bucket. When
using tools such as s3cmd I can see files are listed, verified by
checking with bi list too, as shown below:



root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bi list --bucket='bucketnamehere' | grep -i "\"idx\":" | wc -l
3278



However, on attempting to delete the bucket and purge the objects, it
appears not to be recognised:



root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket rm --bucket=bucketnamehere --purge-objects
2018-05-10 14:11:05.393851 7f0ab07b6a00 -1 ERROR: unable to remove bucket(2) No such file or directory



Checking the bucket stats, it does appear that the bucket is reporting no
content, and repeating the above content test shows there has been no change
to the 3278 figure:



root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket stats --bucket="bucketnamehere"
{
    "bucket": "bucketnamehere",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.28142894.1",
    "marker": "default.28142894.1",
    "owner": "16355",
    "ver": "0#5463545,1#5483686,2#5483484,3#5474696,4#5479052,5#5480339,6#5469460,7#5463976",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0",
    "mtime": "2015-12-08 12:42:26.286153",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#",
    "usage": {
        "rgw.main": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 0
        },
        "rgw.multimeta": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}



I have attempted a bucket index check and fix on this; however, it does not
appear to have made a difference, and no fixes or errors were reported from it.
Does anyone have any advice on how to proceed with removing this content?
At this stage I am not too concerned if the method needed to remove this
generates orphans, as we will shortly be running a large orphan scan after
our upgrade to Luminous. Cluster health otherwise reports normal.
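
For reference, the check-and-fix attempt was along these lines (bucket name as
above; exact flags may differ slightly between releases):

  radosgw-admin --id rgw.ceph-rgw-1 bucket check --bucket=bucketnamehere --fix --check-objects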


Thanks

Sean Redmond
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Node crash, filesytem not usable

2018-05-11 Thread Daniel Davidson

Hello,

Today we had a node crash, and looking at it, it seems there is a
problem with the RAID controller, so it is not coming back up, maybe
ever.  It corrupted the local filesystem for the ceph storage there.


The remainder of our storage (10.2.10) cluster is running, and it looks
to be repairing, and our min_size is set to 2.  Normally, I would expect
that the system would keep running normally from an end-user
perspective when this happens, but the system is down. All mounts that
were up when this started look to be stale, and new mounts give the
following error:


# mount -t ceph ceph-0:/ /test/ -o 
name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes

mount error 5 = Input/output error

Any suggestions?

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG automatically got "repaired"?

2018-05-11 Thread Nikos Kormpakis

On 2018-05-10 00:39, Gregory Farnum wrote:
On Wed, May 9, 2018 at 8:21 AM Nikos Kormpakis  
wrote:
1) After how much time does RADOS try to read from a secondary replica? Is
this timeout configurable?
2) If a primary shard is missing, does Ceph try to recreate it somehow
automatically?
3) If Ceph recreates the primary shard (even automatically, or with
`ceph pg repair`), why did we not observe IO errors again? Does BlueStore
know which disk blocks are bad and somehow avoid them, or can the same
object be stored on different blocks if recreated? Unfortunately, I'm not
familiar with its internals.
4) Is there any reason why the slow requests appeared? Can we correlate
these requests somehow with our problem?

This behavior looks very confusing at first sight and we'd really want
to know what is happening and what Ceph is doing internally. I'd really
appreciate any insights or pointers.



David and a few other people have been making a lot of changes around this
area lately to make Ceph handle failures more transparently, and I haven't
kept up with all of it. But I *believe* what happened is:
1) the scrub caused a read of the object, and BlueStore returned a read error
2) the OSD would have previously treated this as a catastrophic failure and
crashed, but now it handles it by marking the object as missing and needing
recovery
— I don't quite remember the process here. Either 3') it tries to do
recovery on its own when there are available resources for it, or
3) the user requested an object the OSD had marked as missing, so
4) the recovery code kicked off and the OSD grabbed it from another replica.

In particular reference to your questions:
1) It's not about time; a read error means the object is marked as gone
locally; when that happens it will try and recover the object from elsewhere
2) not a whole shard, but an object, sure. (I mean, it will also try to
recover a shard, but that's the normal peering, recovery, backfill sort of
thing...)
3) I don't know the BlueStore internals well enough to say for sure if it
marks the blocks as bad, but most modern *disks* will do that transparently
to the upper layers, so BlueStore just needs to write the data out again.
To BlueStore, the write will look like a completely different object, so
the fact a previous bit of hard drive was bad won't matter.
4) Probably your cluster was already busy, and ops got backed up on either
the primary OSD or one of the others participating in recovery? I mean,
that generally shouldn't occur, but slow requests tend to happen if you
overload a cluster and maybe the recovery pushed it over the edge...
-Greg


I was accustomed to the old behavior, where OSDs crashed when hitting an IO
error, so this behavior surprised me, hence this mail.

About the slow requests, our cluster does not have any serious load, but
again, I'm not 100% sure and I'll try to reproduce it.

Thanks for your info,
Nikos.









--
Nikos Kormpakis - nk...@noc.grnet.gr
Network Operations Center, Greek Research & Technology Network
Tel: +30 210 7475712 - http://www.grnet.gr
7, Kifisias Av., 115 23 Athens, Greece
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-11 Thread Oliver Freyermuth
Hi David,

Am 11.05.2018 um 16:55 schrieb David C:
> Hi Oliver
> 
> Thanks for the detailed reponse! I've downgraded my libcephfs2 to 12.2.4 and 
> still get a similar error:
> 
> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
> undefined symbol: 
> _Z14common_preinitRK18CephInitParameters18code_environment_ti
> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
> library
> 
> I'm on CentOS 7.4, using the following package versions:
> 
> # rpm -qa | grep ganesha
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> 
> # rpm -qa | grep ceph
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

Mhhhm - that sounds like a messup in the dependencies. 
The symbol you are missing should be provided by
librados2-12.2.4-0.el7.x86_64
which contains
/usr/lib64/ceph/ceph/libcephfs-common.so.0
Do you have a different version of librados2 installed? If so, I wonder how yum 
/ rpm allowed that ;-). 

Thinking again, it might also be (if you indeed have a different version there)
that this is also the cause of the previous error.
If the problematic symbol is indeed not exposed, but can only be resolved if
both libraries (libcephfs-common and libcephfs) are loaded in unison with
matching versions,
it might be that 12.2.5 also works fine...

The first thing, in any case, is to check which version of librados2 you are
using ;-).
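
For example, something like this should show both the installed versions and
whether the symbol is visible (paths as in your error message):

  rpm -qa 'librados2*' 'libcephfs2*'
  nm -D /lib64/libcephfs.so.2 | grep -i common_preinit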

Cheers,
Oliver

> 
> I don't have the ceph user space components installed, assuming they're not 
> nesscary apart from libcephfs2? Any idea why it's giving me this error?
> 
> Thanks,
> 
> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth 
> > wrote:
> 
> Hi David,
> 
> for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph 
> repos on CentOS 7.4 with the following set of versions:
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> Of course, we plan to upgrade to 12.2.5 soon-ish...
> 
> Am 11.05.2018 um 00:05 schrieb David C:
> > Hi All
> > 
> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from 
> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/ 
> 
> > 
> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
> > 
> > With libcephfs-12.2.1 installed I get the following error in my ganesha 
> log:
> > 
> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >     /usr/lib64/ganesha/libfsalceph.so: undefined symbol: 
> ceph_set_deleg_timeout
> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> >     : Can not access a needed shared library
> 
> That looks like an ABI incompatibility, probably the nfs-ganesha packages 
> should block this libcephfs2-version (and older ones).
> 
> > 
> > 
> > With libcephfs-12.2.5 installed I get:
> > 
> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >     /lib64/libcephfs.so.2: undefined symbol: 
> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> >     : Can not access a needed shared library
> 
> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
> Interestingly, we do not have that symbol on 12.2.4:
> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>                  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>                  U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
> and NFS-Ganesha works fine.
> 
> Looking at:
> https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h 
> 
> versus
> https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h 
> 
> it seems this commit:
> 
> https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d 
> 
> changed libcephfs2 ABI.
> 
> I've no idea how that's usually handled and whether ABI breakage should 
> occur within point releases (I would not have expected that...).
> At least, this means either:
> - ABI needs to be reverted to the old state.
> - A new NFS Ganesha build is needed. Probably, if this is a common thing, 
> builds should be automated and be synchronized to ceph releases,
>   and old versions 

Re: [ceph-users] RBD Cache and rbd-nbd

2018-05-11 Thread Jason Dillaman
On Fri, May 11, 2018 at 3:59 AM, Marc Schöchlin  wrote:
> Hello Jason,
>
> thanks for your response.
>
>
> Am 10.05.2018 um 21:18 schrieb Jason Dillaman:
>
> If i configure caches like described at
> http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/, are there dedicated
> caches per rbd-nbd/krbd device or is there a only a single cache area.
>
> The librbd cache is per device, but if you aren't performing direct
> IOs to the device, you would also have the unified Linux pagecache on
> top of all the devices.
>
> XENServer directly utilizes nbd devices which are connected in my
> understanding by blkback (dom-0) and blkfront (dom-U) to the virtual
> machines.
> In my understanding pagecache is only part of the game if i use data on
> mounted filesystems (VFS usage).
> Therefore it would be a good thing to use rbd cache for rbd-nbd (/dev/nbdX).

I cannot speak for Xen, but in general IO to a block device will hit
the pagecache unless the IO operation is flagged as direct (e.g.
O_DIRECT) to bypass the pagecache and directly send it to the block
device.

> How can i identify the rbd cache with the tools provided by the operating
> system?
>
> Identify how? You can enable the admin sockets and use "ceph
> --admin-deamon config show" to display the in-use settings.
>
>
> Ah ok, i discovered that i can gather configuration settings by executing:
> (xen_test is the identity of the xen rbd_nbd user)
>
> ceph --id xen_test --admin-daemon /var/run/ceph/ceph-client.xen_test.asok
> config show | less -p rbd_cache
>
> Sorry, my question was a bit unprecice: I was searching for usage statistics
> of the rbd cache.
> Is there also a possibility to gather rbd_cache usage statistics as a source
> of verification for optimizing the cache settings?

You can run "perf dump" instead of "config show" to dump out the
current performance counters. There are some stats from the in-memory
cache included in there.
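
For example, reusing the admin socket path from above (the client id is of
course site-specific):

  ceph --id xen_test --admin-daemon /var/run/ceph/ceph-client.xen_test.asok perf dump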

> Due to the fact that a rbd cache is created for every device, i assume that
> the rbd cache simply part of the rbd-nbd process memory.

Correct.

>
> Can you provide some hints how to about adequate cache settings for a write
> intensive environment (70% write, 30% read)?
> Is it a good idea to specify a huge rbd cache of 1 GB with a max dirty age
> of 10 seconds?

Depends on your workload and your testing results. I suspect a
database on top of RBD is going to do its own read caching and will be
issuing lots of flush calls to the block device, potentially negating
the need for a large cache.

> The librbd cache is really only useful for sequential read-ahead and
> for small writes (assuming writeback is enabled). Assuming you aren't
> using direct IO, I'd suspect your best performance would be to disable
> the librbd cache and rely on the Linux pagecache to work its magic.
>
> As described, xenserver directly utilizes the nbd devices.
>
> Our typical workload is originated over 70 percent in database write
> operations in the virtual machines.
> Therefore collecting write operations with rbd cache and writing them in
> chunks to ceph might be a good thing.
> A higher limit for "rbd cache max dirty" might be a adequate here.
> At the other side our read workload typically reads huge files in sequential
> manner.
>
> Therefore it might be useful to do start with a configuration like that:
>
> rbd cache size = 64MB
> rbd cache max dirty = 48MB
> rbd cache target dirty = 32MB
> rbd cache max dirty age = 10
>
> What is the strategy of librbd to write data to the storage from rbd_cache
> if "rbd cache max dirty = 48MB" is reached?
> Is there a reduction of io operations (merging of ios) compared to the
> granularity of writes of my virtual machines?

If the cache is full, incoming IO will be stalled as the dirty bits
are written back to the backing RBD image to make room available for
the new IO request.

> Additionally, i would do no non-default settings for readahead on nbd level
> to have the possibility to configure this at operating system level of the
> vms.
>
> Our operating systems in the virtual machines use currently a readahead of
> 256 (256*512 = 128KB).
> From my point of view it would be a good thing for sequential reads in big
> files to increase readahead to a higher value.
> We haven't changed the default rbd object size of 4MB - nevertheless it
> might be a good thing to increase the readahead to 1024 (=512KB) to decrease
> read requests by factor of 4 for sequential reads.
>
> What do you think about this?

Depends on your workload.

> Regards
> Marc
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread João Paulo Sacchetto Ribeiro Bastos
Actually, if you go to https://ceph.com/community/new-luminous-bluestore/ you
will see that the DB/WAL work on an XFS partition, while the data itself goes
on a raw block device.

Also, I told you the wrong command in the last mail: when I said --osd-db,
it should have been --block-db.

On Fri, May 11, 2018 at 11:51 AM Oliver Schulz 
wrote:

> Hi,
>
> thanks for the advice! I'm a bit confused now, though. ;-)
> I thought DB and WAL were supposed to go on raw block
> devices, not file systems?
>
>
> Cheers,
>
> Oliver
>
>
> On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:
> > Hello Oliver,
> >
> > As far as I know yet, you can use the same DB device for about 4 or 5
> > OSDs, just need to be aware of the free space. I'm also developing a
> > bluestore cluster, and our DB and WAL will be in the same SSD of about
> > 480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its just a
> > feeling because I couldn't find yet any clear rule about how to measure
> > the requirements.
> >
> > * The only concern that took me some time to realize is that you should
> > create a XFS partition if using ceph-deploy because if you don't it will
> > simply give you a RuntimeError that doesn't give any hint about what's
> > going on.
> >
> > So, answering your question, you could do something like:
> > $ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db
> > /dev/nvme0n1p1 $HOSTNAME
> > $ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db
> > /dev/nvme0n1p1 $HOSTNAME
> >
> > On Fri, May 11, 2018 at 10:35 AM Oliver Schulz
> > >
> wrote:
> >
> > Dear Ceph Experts,
> >
> > I'm trying to set up some new OSD storage nodes, now with
> > bluestore (our existing nodes still use filestore). I'm
> > a bit unclear on how to specify WAL/DB devices: Can
> > several OSDs share one WAL/DB partition? So, can I do
> >
> >   ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> > --data=/dev/sdb HOSTNAME
> >
> >   ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> > --data=/dev/sdc HOSTNAME
> >
> >   ...
> >
> > Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> > osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
> >
> > And just to make sure - if I specify "--osd-db", I don't need
> > to set "--osd-wal" as well, since the WAL will end up on the
> > DB partition automatically, correct?
> >
> >
> > Thanks for any hints,
> >
> > Oliver
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > --
> >
> > João Paulo Sacchetto Ribeiro Bastos
> > +55 31 99279-7092 <+55%2031%2099279-7092>
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 

João Paulo Bastos
DevOps Engineer at Mav Tecnologia
Belo Horizonte - Brazil
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-11 Thread David C
Hi Oliver

Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4
and still get a similar error:

load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: undefined symbol: _Z14common_preinitRK18CephInitParameters18code_environment_ti
load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared library

I'm on CentOS 7.4, using the following package versions:

# rpm -qa | grep ganesha
nfs-ganesha-2.6.1-0.1.el7.x86_64
nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

# rpm -qa | grep ceph
libcephfs2-12.2.4-0.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

I don't have the ceph user space components installed, assuming they're not
necessary apart from libcephfs2? Any idea why it's giving me this error?

Thanks,

On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Hi David,
>
> for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph repos
> on CentOS 7.4 with the following set of versions:
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> Of course, we plan to upgrade to 12.2.5 soon-ish...
>
> Am 11.05.2018 um 00:05 schrieb David C:
> > Hi All
> >
> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from
> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/
> >
> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
> >
> > With libcephfs-12.2.1 installed I get the following error in my ganesha
> log:
> >
> > load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> > /usr/lib64/ganesha/libfsalceph.so: undefined symbol:
> ceph_set_deleg_timeout
> > load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> > : Can not access a needed shared library
>
> That looks like an ABI incompatibility, probably the nfs-ganesha packages
> should block this libcephfs2-version (and older ones).
>
> >
> >
> > With libcephfs-12.2.5 installed I get:
> >
> > load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> > /lib64/libcephfs.so.2: undefined symbol: _ZNK5FSMap10parse_
> roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> > load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> > : Can not access a needed shared library
>
> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
> Interestingly, we do not have that symbol on 12.2.4:
> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>  U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
> and NFS-Ganesha works fine.
>
> Looking at:
> https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h
> versus
> https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h
> it seems this commit:
> https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd9
> 8a472b9a3d
> changed libcephfs2 ABI.
>
> I've no idea how that's usually handled and whether ABI breakage should
> occur within point releases (I would not have expected that...).
> At least, this means either:
> - ABI needs to be reverted to the old state.
> - A new NFS Ganesha build is needed. Probably, if this is a common thing,
> builds should be automated and be synchronized to ceph releases,
>   and old versions should be kept around.
>
> I'll hold back our update to 12.2.5 until this is resolved, so many thanks
> from my side!
>
> Let's see who jumps in to resolve it...
>
> Cheers,
> Oliver
> >
> >
> > My cluster is running 12.2.1
> >
> > All package versions:
> >
> > nfs-ganesha-2.6.1-0.1.el7.x86_64
> > nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> > libcephfs2-12.2.5-0.el7.x86_64
> >
> > Can anyone point me in the right direction?
> >
> > Thanks,
> > David
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread Oliver Schulz

Hi,

thanks for the advice! I'm a bit confused now, though. ;-)
I thought DB and WAL were supposed to go on raw block
devices, not file systems?


Cheers,

Oliver


On 11.05.2018 16:01, João Paulo Sacchetto Ribeiro Bastos wrote:

Hello Oliver,

As far as I know yet, you can use the same DB device for about 4 or 5 
OSDs, just need to be aware of the free space. I'm also developing a 
bluestore cluster, and our DB and WAL will be in the same SSD of about 
480GB serving 4 OSD HDDs of 4 TB each. About the sizes, its just a 
feeling because I couldn't find yet any clear rule about how to measure 
the requirements.


* The only concern that took me some time to realize is that you should 
create a XFS partition if using ceph-deploy because if you don't it will 
simply give you a RuntimeError that doesn't give any hint about what's 
going on.


So, answering your question, you could do something like:
$ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db 
/dev/nvme0n1p1 $HOSTNAME
$ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db 
/dev/nvme0n1p1 $HOSTNAME


On Fri, May 11, 2018 at 10:35 AM Oliver Schulz 
> wrote:


Dear Ceph Experts,

I'm trying to set up some new OSD storage nodes, now with
bluestore (our existing nodes still use filestore). I'm
a bit unclear on how to specify WAL/DB devices: Can
several OSDs share one WAL/DB partition? So, can I do

      ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
--data=/dev/sdb HOSTNAME

      ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
--data=/dev/sdc HOSTNAME

      ...

Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?

And just to make sure - if I specify "--osd-db", I don't need
to set "--osd-wal" as well, since the WAL will end up on the
DB partition automatically, correct?


Thanks for any hints,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--

João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread Oliver Schulz


Hi Jaroslaw,

I tried that (using /dev/nvme0n1), but no luck:

[ceph_deploy.osd][ERROR ] Failed to execute command:
/usr/sbin/ceph-volume --cluster ceph lvm create --bluestore
--data /dev/sdb --block.wal /dev/nvme0n1

When I run "/usr/sbin/ceph-volume ..." on the storage node, it fails
with:

--> blkid could not detect a PARTUUID for device: /dev/nvme0n1

There is an LVM PV on /dev/nvme0n1p1 (for the node OS), could
that be a problem?

I'd be glad for any advice. If all else fails, I should be fine
if I create a 10GB DB partition for each OSD manually, right?
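I.e. pointing --block.db at a partition rather than at the whole device,
something along these lines (partition names here are just an example):

  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2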


Cheers,

Oliver


On 11.05.2018 15:40, Jaroslaw Owsiewski wrote:
> Hi,
>
>
> ceph-deploy is smart enough:
>
> ceph-deploy --overwrite-conf osd prepare --bluestore --block-db 
/dev/nvme0n1 --block-wal /dev/nvme0n1 hostname:/dev/sd{b..m}

>
> Working example.
>
> $ lsblk
> NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda8:00 278.9G  0 disk
> └─sda1 8:10 278.9G  0 part /
> sdb8:16   0   9.1T  0 disk
> ├─sdb1 8:17   0   100M  0 part /var/lib/ceph/osd/ceph-204
> └─sdb2 8:18   0   9.1T  0 part
> sdc8:32   0   9.1T  0 disk
> ├─sdc1 8:33   0   100M  0 part /var/lib/ceph/osd/ceph-205
> └─sdc2 8:34   0   9.1T  0 part
> sdd8:48   0   9.1T  0 disk
> ├─sdd1 8:49   0   100M  0 part /var/lib/ceph/osd/ceph-206
> └─sdd2 8:50   0   9.1T  0 part
> sde8:64   0   9.1T  0 disk
> ├─sde1 8:65   0   100M  0 part /var/lib/ceph/osd/ceph-207
> └─sde2 8:66   0   9.1T  0 part
> sdf8:80   0   9.1T  0 disk
> ├─sdf1 8:81   0   100M  0 part /var/lib/ceph/osd/ceph-208
> └─sdf2 8:82   0   9.1T  0 part
> sdg8:96   0   9.1T  0 disk
> ├─sdg1 8:97   0   100M  0 part /var/lib/ceph/osd/ceph-209
> └─sdg2 8:98   0   9.1T  0 part
> sdh8:112  0   9.1T  0 disk
> ├─sdh1 8:113  0   100M  0 part /var/lib/ceph/osd/ceph-210
> └─sdh2 8:114  0   9.1T  0 part
> sdi8:128  0   9.1T  0 disk
> ├─sdi1 8:129  0   100M  0 part /var/lib/ceph/osd/ceph-211
> └─sdi2 8:130  0   9.1T  0 part
> sdj8:144  0   9.1T  0 disk
> ├─sdj1 8:145  0   100M  0 part /var/lib/ceph/osd/ceph-212
> └─sdj2 8:146  0   9.1T  0 part
> sdk8:160  0   9.1T  0 disk
> ├─sdk1 8:161  0   100M  0 part /var/lib/ceph/osd/ceph-213
> └─sdk2 8:162  0   9.1T  0 part
> sdl8:176  0   9.1T  0 disk
> ├─sdl1 8:177  0   100M  0 part /var/lib/ceph/osd/ceph-214
> └─sdl2 8:178  0   9.1T  0 part
> sdm8:192  0   9.1T  0 disk
> ├─sdm1 8:193  0   100M  0 part /var/lib/ceph/osd/ceph-215
> └─sdm2 8:194  0   9.1T  0 part
> nvme0n1  259:00 349.3G  0 disk
> ├─nvme0n1p1  259:20 1G  0 part
> ├─nvme0n1p2  259:40   576M  0 part
> ├─nvme0n1p3  259:10 1G  0 part
> ├─nvme0n1p4  259:30   576M  0 part
> ├─nvme0n1p5  259:50 1G  0 part
> ├─nvme0n1p6  259:60   576M  0 part
> ├─nvme0n1p7  259:70 1G  0 part
> ├─nvme0n1p8  259:80   576M  0 part
> ├─nvme0n1p9  259:90 1G  0 part
> ├─nvme0n1p10 259:10   0   576M  0 part
> ├─nvme0n1p11 259:11   0 1G  0 part
> ├─nvme0n1p12 259:12   0   576M  0 part
> ├─nvme0n1p13 259:13   0 1G  0 part
> ├─nvme0n1p14 259:14   0   576M  0 part
> ├─nvme0n1p15 259:15   0 1G  0 part
> ├─nvme0n1p16 259:16   0   576M  0 part
> ├─nvme0n1p17 259:17   0 1G  0 part
> ├─nvme0n1p18 259:18   0   576M  0 part
> ├─nvme0n1p19 259:19   0 1G  0 part
> ├─nvme0n1p20 259:20   0   576M  0 part
> ├─nvme0n1p21 259:21   0 1G  0 part
> ├─nvme0n1p22 259:22   0   576M  0 part
> ├─nvme0n1p23 259:23   0 1G  0 part
> └─nvme0n1p24 259:24   0   576M  0 part
>
> Regards
>
> --
> Jarek
>
> 2018-05-11 15:35 GMT+02:00 Oliver Schulz 
>:

>
> Dear Ceph Experts,
>
> I'm trying to set up some new OSD storage nodes, now with
> bluestore (our existing nodes still use filestore). I'm
> a bit unclear on how to specify WAL/DB devices: Can
> several OSDs share one WAL/DB partition? So, can I do
>
>  ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> --data=/dev/sdb HOSTNAME
>
>  ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> --data=/dev/sdc HOSTNAME
>
>  ...
>
> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
>
> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
>
>
> Thanks for any hints,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> 

Re: [ceph-users] howto: multiple ceph filesystems

2018-05-11 Thread Webert de Souza Lima
Basically, what we're trying to figure out looks like what is being done
here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020958.html

But instead of using librados to store emails directly in RADOS, we're
still using CephFS for it; we're just figuring out if it makes sense to
separate them into different workloads.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, May 11, 2018 at 2:07 AM, Marc Roos  wrote:

>
>
> If I would like to use an erasurecode pool for a cephfs directory how
> would I create these placement rules?
>
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: vrijdag 11 mei 2018 1:54
> To: João Paulo Sacchetto Ribeiro Bastos
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] howto: multiple ceph filesystems
>
> Another option you could do is to use a placement rule. You could create
> a general pool for most data to go to and a special pool for specific
> folders on the filesystem. Particularly I think of a pool for replica vs
> EC vs flash for specific folders in the filesystem.
>
> If the pool and OSDs wasn't the main concern for multiple filesystems
> and the mds servers are then you could have multiple active mds servers
> and pin the metadata for the indexes to one of them while the rest is
> served by the other active mds servers.
>
> I really haven't come across a need for multiple filesystems in ceph
> with the type of granularity you can achieve with mds pinning, folder
> placement rules, and cephx authentication to limit a user to a specific
> subfolder.
>
>
> On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos
>  wrote:
>
>
> Hey John, thanks for you answer. For sure the hardware robustness
> will be nice enough. My true concern was actually the two FS ecosystem
> coexistence. In fact I realized that we may not use this as well because
> it may be represent a high overhead, despite the fact that it's a
> experiental feature yet.
>
> On Thu, 10 May 2018 at 15:48 John Spray  wrote:
>
>
> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto
> Ribeiro
> Bastos
>  wrote:
> > Hello guys,
> >
> > My company is about to rebuild its whole infrastructure,
> so
> I was called in
> > order to help on the planning. We are essentially an
> corporate mail
> > provider, so we handle daily lots of clients using
> dovecot
> and roundcube and
> > in order to do so we want to design a better plant of
> our
> cluster. Today,
> > using Jewel, we have a single cephFS for both index and
> mail
> from dovecot,
> > but we want to split it into an index_FS and a mail_FS
> to
> handle the
> > workload a little better, is it profitable nowadays?
> From my
> research I
> > realized that we will need data and metadata individual
> pools for each FS
> > such as a group of MDS for each of then, also.
> >
> > The one thing that really scares me about all of this
> is: we
> are planning to
> > have four machines at full disposal to handle our MDS
> instances. We started
> > to think if an idea like the one below is valid, can
> anybody
> give a hint on
> > this? We basically want to handle two MDS instances on
> each
> machine (one for
> > each FS) and wonder if we'll be able to have them
> swapping
> between active
> > and standby simultaneously without any trouble.
> >
> > index_FS: (active={machines 1 and 3}, standby={machines
> 2
> and 4})
> > mail_FS: (active={machines 2 and 4}, standby={machines 1
> and
> 3})
>
> Nothing wrong with that setup, but remember that those
> servers
> are
> going to have to be well-resourced enough to run all four
> at
> once
> (when a failure occurs), so it might not matter very much
> exactly
> which servers are running which daemons.
>
> With a filesystem's MDS daemons (i.e. daemons with the same
> standby_for_fscid setting), Ceph will activate whichever
> daemon comes
> up first, so if it's important to you to have particular
> daemons
> active then you would need to take care of that at the
> point
> you're
> starting them up.
>
> John
>
> >
> > Regards,
> > --
> >
> > João Paulo Sacchetto Ribeiro Bastos
> > +55 31 99279-7092
> >
> >
> > 

Re: [ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread João Paulo Sacchetto Ribeiro Bastos
Hello Oliver,

As far as I know, you can use the same DB device for about 4 or 5 OSDs;
you just need to be aware of the free space. I'm also developing a
bluestore cluster, and our DB and WAL will be on the same SSD of about
480GB serving 4 OSD HDDs of 4 TB each. About the sizes, it's just a
feeling, because I couldn't yet find any clear rule about how to measure
the requirements.

* The only concern that took me some time to realize is that you should
create an XFS partition if using ceph-deploy, because if you don't it will
simply give you a RuntimeError that doesn't give any hint about what's
going on.

So, answering your question, you could do something like:
$ ceph-deploy osd create --bluestore --data=/dev/sdb --block-db
/dev/nvme0n1p1 $HOSTNAME
$ ceph-deploy osd create --bluestore --data=/dev/sdc --block-db
/dev/nvme0n1p1 $HOSTNAME

On Fri, May 11, 2018 at 10:35 AM Oliver Schulz 
wrote:

> Dear Ceph Experts,
>
> I'm trying to set up some new OSD storage nodes, now with
> bluestore (our existing nodes still use filestore). I'm
> a bit unclear on how to specify WAL/DB devices: Can
> several OSDs share one WAL/DB partition? So, can I do
>
>  ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> --data=/dev/sdb HOSTNAME
>
>  ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2
> --data=/dev/sdc HOSTNAME
>
>  ...
>
> Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
> osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?
>
> And just to make sure - if I specify "--osd-db", I don't need
> to set "--osd-wal" as well, since the WAL will end up on the
> DB partition automatically, correct?
>
>
> Thanks for any hints,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 

João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Shared WAL/DB device partition for multiple OSDs?

2018-05-11 Thread Oliver Schulz

Dear Ceph Experts,

I'm trying to set up some new OSD storage nodes, now with
bluestore (our existing nodes still use filestore). I'm
a bit unclear on how to specify WAL/DB devices: Can
several OSDs share one WAL/DB partition? So, can I do

ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 
--data=/dev/sdb HOSTNAME


ceph-deploy osd create --bluestore --osd-db=/dev/nvme0n1p2 
--data=/dev/sdc HOSTNAME


...

Or do I need to use osd-db=/dev/nvme0n1p2 for data=/dev/sdb,
osd-db=/dev/nvme0n1p3 for data=/dev/sdc, and so on?

And just to make sure - if I specify "--osd-db", I don't need
to set "--osd-wal" as well, since the WAL will end up on the
DB partition automatically, correct?


Thanks for any hints,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding pool to cephfs, setfattr permission denied

2018-05-11 Thread John Spray
On Fri, May 11, 2018 at 8:10 AM, Marc Roos  wrote:
>
>
> Thanks! That did it. This 'tag cephfs' is probably a restriction you can
> add when you have multiple filesystems? And I don't need x permission on
> the OSDs?

The "tag cephfs data " bit is authorising the client to access
any pools that are part of that filesystem.  If you have "allow rw"
without any other qualifier, then the client will be able to access
any non-cephfs pools as well.
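
For illustration only (assuming the filesystem is simply called "cephfs"
and client.foo is a made-up client name), such a restricted cap looks
something like:

caps osd = "allow rw tag cephfs data=cephfs"

which is roughly what "ceph fs authorize cephfs client.foo / rw" would
generate for the osd part.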

John

>
>
>
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: vrijdag 11 mei 2018 14:05
> To: Marc Roos
> Cc: ceph-users
> Subject: Re: [ceph-users] Adding pool to cephfs, setfattr permission
> denied
>
> On Fri, May 11, 2018 at 7:40 AM, Marc Roos 
> wrote:
>>
>> I have added a data pool by:
>>
>> ceph osd pool set fs_data.ec21 allow_ec_overwrites true ceph osd pool
>> application enable fs_data.ec21 cephfs ceph fs add_data_pool cephfs
>> fs_data.ec21
>>
>> setfattr -n ceph.dir.layout.pool -v fs_data.ec21 folder
>> setfattr: folder: Permission denied
>
> You need "rwp" mds auth caps to modify layouts (see
> http://docs.ceph.com/docs/master/cephfs/client-auth/#layout-and-quota-restriction-the-p-flag)
>
> John
>
>>
>> Added the pool also to client auth
>>
>>  caps mds = "allow rw"
>>  caps mgr = "allow r"
>>  caps mon = "allow r"
>>  caps osd = "allow rwx pool=fs_meta,allow rwx pool=fs_data,allow
>> rwx pool=fs_data.ec21"
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding pool to cephfs, setfattr permission denied

2018-05-11 Thread Marc Roos
 

Thanks! That did it. This 'tag cephfs' is probably a restriction you can 
add when you have multiple filesystems? And I don't need x permission on 
the OSDs?




-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: vrijdag 11 mei 2018 14:05
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Adding pool to cephfs, setfattr permission 
denied

On Fri, May 11, 2018 at 7:40 AM, Marc Roos  
wrote:
>
> I have added a data pool by:
>
> ceph osd pool set fs_data.ec21 allow_ec_overwrites true ceph osd pool 
> application enable fs_data.ec21 cephfs ceph fs add_data_pool cephfs 
> fs_data.ec21
>
> setfattr -n ceph.dir.layout.pool -v fs_data.ec21 folder
> setfattr: folder: Permission denied

You need "rwp" mds auth caps to modify layouts (see
http://docs.ceph.com/docs/master/cephfs/client-auth/#layout-and-quota-restriction-the-p-flag)

John

>
> Added the pool also to client auth
>
>  caps mds = "allow rw"
>  caps mgr = "allow r"
>  caps mon = "allow r"
>  caps osd = "allow rwx pool=fs_meta,allow rwx pool=fs_data,allow 
> rwx pool=fs_data.ec21"
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding pool to cephfs, setfattr permission denied

2018-05-11 Thread John Spray
On Fri, May 11, 2018 at 7:40 AM, Marc Roos  wrote:
>
> I have added a data pool by:
>
> ceph osd pool set fs_data.ec21 allow_ec_overwrites true
> ceph osd pool application enable fs_data.ec21 cephfs
> ceph fs add_data_pool cephfs fs_data.ec21
>
> setfattr -n ceph.dir.layout.pool -v fs_data.ec21 folder
> setfattr: folder: Permission denied

You need "rwp" mds auth caps to modify layouts (see
http://docs.ceph.com/docs/master/cephfs/client-auth/#layout-and-quota-restriction-the-p-flag)
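
For example (the client name is a placeholder, pool names taken from the
caps quoted below), adding the 'p' flag could look like:

ceph auth caps client.fsuser mon 'allow r' mgr 'allow r' mds 'allow rwp' \
    osd 'allow rwx pool=fs_meta, allow rwx pool=fs_data, allow rwx pool=fs_data.ec21'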

John

>
> Added the pool also to client auth
>
>  caps mds = "allow rw"
>  caps mgr = "allow r"
>  caps mon = "allow r"
>  caps osd = "allow rwx pool=fs_meta,allow rwx pool=fs_data,allow rwx
> pool=fs_data.ec21"
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding pool to cephfs, setfattr permission denied

2018-05-11 Thread Marc Roos
 
I have added a data pool by:

ceph osd pool set fs_data.ec21 allow_ec_overwrites true
ceph osd pool application enable fs_data.ec21 cephfs
ceph fs add_data_pool cephfs fs_data.ec21

setfattr -n ceph.dir.layout.pool -v fs_data.ec21 folder
setfattr: folder: Permission denied

Added the pool also to client auth

 caps mds = "allow rw"
 caps mgr = "allow r"
 caps mon = "allow r"
 caps osd = "allow rwx pool=fs_meta,allow rwx pool=fs_data,allow rwx 
pool=fs_data.ec21"


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inaccurate client io stats

2018-05-11 Thread John Spray
On Fri, May 11, 2018 at 4:51 AM, Horace  wrote:
> Hi everyone,
>
> I've got a 3-node cluster running without any issues. However, I found out
> that since upgrading to Luminous the reported client IO stats are way off
> from the real ones. I have no idea how to troubleshoot this after going
> through all the logs. Any help would be appreciated.

The ratio from logical IO (from clients) to raw IO (to disks) depends
on configuration:
 - Are you using filestore or bluestore?  Any SSD journals?
 - What replication level is in use?  3x?

If you're using filestore with no SSD journals and 3x replication, then
there will be a factor of six amplification between the client IO and
the disk IO: each client write goes to three OSDs, and filestore writes
every copy twice, once to the journal and once to the data partition.
The cluster IO stats do still look rather low though...
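
As a rough illustration with the numbers quoted below: ~2.7 MB/s of client
writes times six would be roughly 16 MB/s of raw writes expected across the
whole cluster, while the atop output from a single node alone already shows
on the order of 30 MB/s of disk writes, so the reported client IO does look
too low.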

John

> Got more than 10 client hosts connecting to the cluster, running around 300 
> VMs.
>
> ceph version 12.2.4
>
> #ceph -s
>
>   cluster:
> id: xxx
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph0,ceph1,ceph2
> mgr: ceph1(active), standbys: ceph0, ceph2
> osd: 24 osds: 24 up, 24 in
> rgw: 1 daemon active
>
>   data:
> pools:   17 pools, 956 pgs
> objects: 4225k objects, 14495 GB
> usage:   43424 GB used, 16231 GB / 59656 GB avail
> pgs: 956 active+clean
>
>   io:
> client:   123 kB/s rd, 2677 kB/s wr, 38 op/s rd, 278 op/s wr
>
> (at one of the node)
> #atop
>
> DSK | sdb | busy 42% | read  268 | write  519 | KiB/w 109 | MBr/s 2.4 | MBw/s 5.6 | avio 5.26 ms |
> DSK | sde | busy 26% | read  129 | write  313 | KiB/w 150 | MBr/s 0.7 | MBw/s 4.6 | avio 5.94 ms |
> DSK | sdg | busy 24% | read   90 | write  230 | KiB/w  86 | MBr/s 0.5 | MBw/s 1.9 | avio 7.50 ms |
> DSK | sdf | busy 21% | read  109 | write  148 | KiB/w 162 | MBr/s 0.8 | MBw/s 2.3 | avio 8.12 ms |
> DSK | sdh | busy 19% | read  100 | write  221 | KiB/w 118 | MBr/s 0.5 | MBw/s 2.5 | avio 5.78 ms |
> DSK | sda | busy 18% | read  170 | write  163 | KiB/w  83 | MBr/s 1.6 | MBw/s 1.3 | avio 5.35 ms |
> DSK | sdc | busy  3% | read    0 | write 1545 | KiB/w  58 | MBr/s 0.0 | MBw/s 8.8 | avio 0.21 ms |
> DSK | sdd | busy  3% | read    0 | write 1195 | KiB/w  57 | MBr/s 0.0 | MBw/s 6.7 | avio 0.24 ms |
>
> Regards,
> Horace Ng
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-11 Thread Marc Roos
 

If I would like to use an erasure-coded pool for a cephfs directory, how
would I create these placement rules?




-Original Message-
From: David Turner [mailto:drakonst...@gmail.com] 
Sent: vrijdag 11 mei 2018 1:54
To: João Paulo Sacchetto Ribeiro Bastos
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] howto: multiple ceph filesystems

Another option is to use a placement rule. You could create a general pool 
for most data to go to and special pools for specific folders on the 
filesystem; in particular I'm thinking of replicated vs EC vs flash pools 
for specific folders in the filesystem.

If the pools and OSDs aren't the main motivation for multiple filesystems, 
and the MDS servers are, then you could have multiple active MDS servers 
and pin the metadata for the indexes to one of them, while the rest is 
served by the other active MDS servers.

I really haven't come across a need for multiple filesystems in Ceph given 
the type of granularity you can achieve with MDS pinning, folder placement 
rules, and cephx authentication to limit a user to a specific subfolder.
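
For the record, a rough sketch of those three mechanisms (mount point, 
directory, pool and client names are made up for illustration):

# pin the metadata of one directory tree to MDS rank 1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/indexes

# place new files under a directory into a specific (e.g. EC) data pool,
# after adding it with "ceph fs add_data_pool cephfs fs_data.ec21"
setfattr -n ceph.dir.layout.pool -v fs_data.ec21 /mnt/cephfs/mail

# restrict a client to one subfolder
ceph fs authorize cephfs client.dovecot /mail rw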


On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos 
 wrote:


Hey John, thanks for your answer. For sure the hardware robustness 
will be nice enough. My true concern was actually the two FS ecosystem 
coexistence. In fact I realized that we may not use this anyway, because 
it may represent a high overhead, besides the fact that it's still an 
experimental feature.

On Thu, 10 May 2018 at 15:48 John Spray  wrote:


On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto Ribeiro Bastos
 wrote:
> Hello guys,
>
> My company is about to rebuild its whole infrastructure, so I was called
> in order to help on the planning. We are essentially a corporate mail
> provider, so we handle lots of clients daily using dovecot and roundcube,
> and in order to do so we want to design a better plan of our cluster.
> Today, using Jewel, we have a single cephFS for both index and mail from
> dovecot, but we want to split it into an index_FS and a mail_FS to handle
> the workload a little better; is it profitable nowadays? From my research
> I realized that we will need individual data and metadata pools for each
> FS, as well as a group of MDS daemons for each of them.
>
> The one thing that really scares me about all of this is: we are planning
> to have four machines at full disposal to handle our MDS instances. We
> started to think if an idea like the one below is valid, can anybody give
> a hint on this? We basically want to handle two MDS instances on each
> machine (one for each FS) and wonder if we'll be able to have them
> swapping between active and standby simultaneously without any trouble.
>
> index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
> mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})

Nothing wrong with that setup, but remember that those servers are
going to have to be well-resourced enough to run all four at once
(when a failure occurs), so it might not matter very much exactly
which servers are running which daemons.

With a filesystem's MDS daemons (i.e. daemons with the same
standby_for_fscid setting), Ceph will activate whichever daemon comes
up first, so if it's important to you to have particular daemons
active then you would need to take care of that at the point you're
starting them up.
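
For illustration, pinning a particular daemon's standby behaviour to a
particular filesystem could look roughly like this in ceph.conf (daemon
name and fscid are made up, option names as of Luminous):

[mds.machine2]
    mds standby for fscid = 1    # e.g. the index_FS
    mds standby replay = true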

John

>
> Regards,
> --
>
> João Paulo Sacchetto Ribeiro Bastos
> +55 31 99279-7092
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 


João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] RBD Cache and rbd-nbd

2018-05-11 Thread Marc Schöchlin
Hello Jason,

thanks for your response.


Am 10.05.2018 um 21:18 schrieb Jason Dillaman:

>> If i configure caches like described at
>> http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/, are there dedicated
>> caches per rbd-nbd/krbd device or is there a only a single cache area.
> The librbd cache is per device, but if you aren't performing direct
> IOs to the device, you would also have the unified Linux pagecache on
> top of all the devices.
XenServer uses the nbd devices directly; as I understand it, they are
connected to the virtual machines by blkback (dom-0) and blkfront (dom-U).
In my understanding the pagecache only comes into play if I use data on
mounted filesystems (VFS usage).
Therefore it would be a good thing to use the rbd cache for rbd-nbd
(/dev/nbdX).
>> How can i identify the rbd cache with the tools provided by the operating
>> system?
> Identify how? You can enable the admin sockets and use "ceph
> --admin-deamon config show" to display the in-use settings.

Ah OK, I discovered that I can gather configuration settings by executing
the following (xen_test is the identity of the Xen rbd-nbd user):

ceph --id xen_test --admin-daemon
/var/run/ceph/ceph-client.xen_test.asok config show | less -p rbd_cache

Sorry, my question was a bit imprecise: I was looking for usage statistics
of the rbd cache.
Is there also a way to gather rbd cache usage statistics as a source of
verification for optimizing the cache settings?
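
For reference, the same admin socket can also be asked for performance
counters; whether rbd cache statistics show up there seems to depend on the
release, but the general form would be:

ceph --id xen_test --admin-daemon /var/run/ceph/ceph-client.xen_test.asok perf dump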

Since an rbd cache is created for every device, I assume that the cache is
simply part of the rbd-nbd process memory.


>> Can you provide some hints how to about adequate cache settings for a write
>> intensive environment (70% write, 30% read)?
>> Is it a good idea to specify a huge rbd cache of 1 GB with a max dirty age
>> of 10 seconds?
> The librbd cache is really only useful for sequential read-ahead and
> for small writes (assuming writeback is enabled). Assuming you aren't
> using direct IO, I'd suspect your best performance would be to disable
> the librbd cache and rely on the Linux pagecache to work its magic.
As described, XenServer uses the nbd devices directly.

Over 70 percent of our typical workload originates from database write
operations in the virtual machines.
Therefore collecting write operations in the rbd cache and writing them to
Ceph in chunks might be a good thing, and a higher limit for "rbd cache max
dirty" might be adequate here.
On the other hand, our read workload typically reads huge files
sequentially.

Therefore it might be useful to do start with a configuration like that:

rbd cache size = 64MB
rbd cache max dirty = 48MB
rbd cache target dirty = 32MB
rbd cache max dirty age = 10
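
Presumably these settings would live in the [client] section of ceph.conf
on the XenServer hosts; spelled out in bytes (a sketch, in case size
suffixes like "64MB" are not accepted by our release), that would be:

[client]
    rbd cache = true
    rbd cache size = 67108864          # 64 MB
    rbd cache max dirty = 50331648     # 48 MB
    rbd cache target dirty = 33554432  # 32 MB
    rbd cache max dirty age = 10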

What is librbd's strategy for writing data from the rbd cache to the
storage once "rbd cache max dirty = 48MB" is reached?
Is there a reduction of IO operations (merging of IOs) compared to the
granularity of the writes issued by my virtual machines?

Additionally, I would not apply any non-default readahead settings at the
nbd level, so that this can still be configured at the operating system
level inside the VMs.

The operating systems in our virtual machines currently use a readahead of
256 (256*512 bytes = 128 KB).
From my point of view it would be good for sequential reads of big files to
increase the readahead to a higher value.
We haven't changed the default rbd object size of 4 MB; nevertheless it
might be a good idea to increase the readahead to 1024 (= 512 KB) to
decrease the number of read requests by a factor of 4 for sequential reads.
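
For reference, checking and raising the readahead on an nbd device would be
something like this (the device name is just an example):

blockdev --getra /dev/nbd0       # current readahead, in 512-byte sectors
blockdev --setra 1024 /dev/nbd0  # 1024 sectors = 512 KB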

What do you think about this?

Regards
Marc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com