[ceph-users] Inaccurate client io stats

2018-05-10 Thread Horace
Hi everyone,

I've got a 3-node cluster running without any issues. However, I found out that 
since upgrading to Luminous, the client io stats are way off from the real 
ones. I have no idea how to troubleshoot this after going through all the logs. Any 
help would be appreciated.

We have more than 10 client hosts connecting to the cluster, running around 300 VMs.

ceph version 12.2.4 

#ceph -s

  cluster:
id: xxx
health: HEALTH_OK

  services:
mon: 3 daemons, quorum ceph0,ceph1,ceph2
mgr: ceph1(active), standbys: ceph0, ceph2
osd: 24 osds: 24 up, 24 in
rgw: 1 daemon active

  data:
pools:   17 pools, 956 pgs
objects: 4225k objects, 14495 GB
usage:   43424 GB used, 16231 GB / 59656 GB avail
pgs: 956 active+clean

  io:
client:   123 kB/s rd, 2677 kB/s wr, 38 op/s rd, 278 op/s wr  

(on one of the nodes)
#atop

DSK |  sdb | busy 42% | read  268 | write  519 | KiB/w 109 | MBr/s 2.4 | MBw/s 5.6 | avio 5.26 ms |
DSK |  sde | busy 26% | read  129 | write  313 | KiB/w 150 | MBr/s 0.7 | MBw/s 4.6 | avio 5.94 ms |
DSK |  sdg | busy 24% | read   90 | write  230 | KiB/w  86 | MBr/s 0.5 | MBw/s 1.9 | avio 7.50 ms |
DSK |  sdf | busy 21% | read  109 | write  148 | KiB/w 162 | MBr/s 0.8 | MBw/s 2.3 | avio 8.12 ms |
DSK |  sdh | busy 19% | read  100 | write  221 | KiB/w 118 | MBr/s 0.5 | MBw/s 2.5 | avio 5.78 ms |
DSK |  sda | busy 18% | read  170 | write  163 | KiB/w  83 | MBr/s 1.6 | MBw/s 1.3 | avio 5.35 ms |
DSK |  sdc | busy  3% | read    0 | write 1545 | KiB/w  58 | MBr/s 0.0 | MBw/s 8.8 | avio 0.21 ms |
DSK |  sdd | busy  3% | read    0 | write 1195 | KiB/w  57 | MBr/s 0.0 | MBw/s 6.7 | avio 0.24 ms |
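
(For comparison, the same client rates broken down per pool can be read from the mgr; this is not an independent measurement, just the same PG counters split by pool:)

#ceph osd pool stats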

Regards,
Horace Ng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-10 Thread Oliver Freyermuth
Hi David,

for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph repos on 
CentOS 7.4 with the following set of versions:
libcephfs2-12.2.4-0.el7.x86_64
nfs-ganesha-2.6.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
Of course, we plan to upgrade to 12.2.5 soon-ish... 

Am 11.05.2018 um 00:05 schrieb David C:
> Hi All
> 
> I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from 
> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/
> 
> It's failing to load /usr/lib64/ganesha/libfsalceph.so
> 
> With libcephfs-12.2.1 installed I get the following error in my ganesha log:
> 
> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> /usr/lib64/ganesha/libfsalceph.so: undefined symbol: 
> ceph_set_deleg_timeout
> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> : Can not access a needed shared library

That looks like an ABI incompatibility; the nfs-ganesha packages should 
probably block this libcephfs2 version (and older ones). 

> 
> 
> With libcephfs-12.2.5 installed I get:
> 
> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> /lib64/libcephfs.so.2: undefined symbol: 
> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because
> : Can not access a needed shared library

That looks ugly and makes me fear for our planned 12.2.5 upgrade. 
Interestingly, we do not have that symbol on 12.2.4:
# nm -D /lib64/libcephfs.so.2 | grep FSMap
 U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
 U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
and NFS-Ganesha works fine. 

Looking at:
https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h
versus
https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h
it seems this commit:
https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d
changed libcephfs2 ABI. 
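
(A quick way to check a given libcephfs2 build against the FSAL before restarting Ganesha; the paths are the ones from David's log, the interpretation is mine:)

# list any symbols the Ceph FSAL cannot resolve against the installed libraries
ldd -r /usr/lib64/ganesha/libfsalceph.so 2>&1 | grep 'undefined symbol'
# check whether libcephfs2 exports the delegation call Ganesha 2.6 needs
nm -D /lib64/libcephfs.so.2 | grep ceph_set_deleg_timeout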

I've no idea how that's usually handled and whether ABI breakage should occur 
within point releases (I would not have expected that...). 
At least, this means either:
- ABI needs to be reverted to the old state. 
- A new NFS-Ganesha build is needed. If this is a common thing, builds should 
probably be automated and synchronized to Ceph releases,
  and old versions should be kept around. 

I'll hold back our update to 12.2.5 until this is resolved, so many thanks from 
my side! 

Let's see who jumps in to resolve it... 

Cheers,
Oliver
> 
> 
> My cluster is running 12.2.1
> 
> All package versions:
> 
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> libcephfs2-12.2.5-0.el7.x86_64
> 
> Can anyone point me in the right direction?
> 
> Thanks,
> David
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-10 Thread João Paulo Sacchetto Ribeiro Bastos
Hey David, thanks for your answer. You're probably right, my friend.

This idea of multiple FSs came up after we realized that in general we have
a great amount of workload on /mnt/metadata and a considerably lower amount
on /mnt/data (just an example to clarify our case).

Despite the fact that the data dir depends on the metadata dir, we thought
about splitting them apart in order to provide some sort of high
availability, because part of our systems can still go on even if the data
dir goes down.

The main concern, then, is to be able to handle the workloads separately,
each in its own way. In fact we may well get to a better approach that does
not have such a high overhead, and for sure I'll read more about your
suggestions; maybe we can simply use the placement rules xD
On Thu, 10 May 2018 at 20:54 David Turner  wrote:

> Another option you could do is to use a placement rule. You could create a
> general pool for most data to go to and a special pool for specific folders
> on the filesystem. Particularly I think of a pool for replica vs EC vs
> flash for specific folders in the filesystem.
>
> If the pool and OSDs wasn't the main concern for multiple filesystems and
> the mds servers are then you could have multiple active mds servers and pin
> the metadata for the indexes to one of them while the rest is served by the
> other active mds servers.
>
> I really haven't come across a need for multiple filesystems in ceph with
> the type of granularity you can achieve with mds pinning, folder placement
> rules, and cephx authentication to limit a user to a specific subfolder.
>
>
> On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos <
> joaopaulos...@gmail.com> wrote:
>
>> Hey John, thanks for you answer. For sure the hardware robustness will be
>> nice enough. My true concern was actually the two FS ecosystem coexistence.
>> In fact I realized that we may not use this as well because it may be
>> represent a high overhead, despite the fact that it's a experiental feature
>> yet.
>> On Thu, 10 May 2018 at 15:48 John Spray  wrote:
>>
>>> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto Ribeiro Bastos
>>>  wrote:
>>> > Hello guys,
>>> >
>>> > My company is about to rebuild its whole infrastructure, so I was
>>> called in
>>> > order to help on the planning. We are essentially an corporate mail
>>> > provider, so we handle daily lots of clients using dovecot and
>>> roundcube and
>>> > in order to do so we want to design a better plant of our cluster.
>>> Today,
>>> > using Jewel, we have a single cephFS for both index and mail from
>>> dovecot,
>>> > but we want to split it into an index_FS and a mail_FS to handle the
>>> > workload a little better, is it profitable nowadays? From my research I
>>> > realized that we will need data and metadata individual pools for each
>>> FS
>>> > such as a group of MDS for each of then, also.
>>> >
>>> > The one thing that really scares me about all of this is: we are
>>> planning to
>>> > have four machines at full disposal to handle our MDS instances. We
>>> started
>>> > to think if an idea like the one below is valid, can anybody give a
>>> hint on
>>> > this? We basically want to handle two MDS instances on each machine
>>> (one for
>>> > each FS) and wonder if we'll be able to have them swapping between
>>> active
>>> > and standby simultaneously without any trouble.
>>> >
>>> > index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
>>> > mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})
>>>
>>> Nothing wrong with that setup, but remember that those servers are
>>> going to have to be well-resourced enough to run all four at once
>>> (when a failure occurs), so it might not matter very much exactly
>>> which servers are running which daemons.
>>>
>>> With a filesystem's MDS daemons (i.e. daemons with the same
>>> standby_for_fscid setting), Ceph will activate whichever daemon comes
>>> up first, so if it's important to you to have particular daemons
>>> active then you would need to take care of that at the point you're
>>> starting them up.
>>>
>>> John
>>>
>>> >
>>> > Regards,
>>> > --
>>> >
>>> > João Paulo Sacchetto Ribeiro Bastos
>>> > +55 31 99279-7092
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>> --
>>
>> João Paulo Sacchetto Ribeiro Bastos
>> +55 31 99279-7092
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> --

João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-10 Thread David Turner
Another option is to use placement rules. You could create a
general pool for most data to go to and a special pool for specific folders
on the filesystem. In particular I'm thinking of a pool for replica vs EC vs
flash for specific folders in the filesystem.

If the pools and OSDs aren't the main concern for multiple filesystems and
the MDS servers are, then you could have multiple active MDS servers and pin
the metadata for the indexes to one of them while the rest is served by the
other active MDS servers.

I really haven't come across a need for multiple filesystems in ceph with
the type of granularity you can achieve with mds pinning, folder placement
rules, and cephx authentication to limit a user to a specific subfolder.
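
As a sketch of those three mechanisms (the directory, pool, filesystem and client names below are only illustrative):

# send everything under one folder to a different data pool (file layouts)
setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs/archive
# pin a subtree to MDS rank 1 (subtree pinning)
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/indexes
# restrict a client to a subfolder (cephx path restriction, Luminous syntax)
ceph fs authorize cephfs client.mailstore /mail rw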

On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos <
joaopaulos...@gmail.com> wrote:

> Hey John, thanks for you answer. For sure the hardware robustness will be
> nice enough. My true concern was actually the two FS ecosystem coexistence.
> In fact I realized that we may not use this as well because it may be
> represent a high overhead, despite the fact that it's a experiental feature
> yet.
> On Thu, 10 May 2018 at 15:48 John Spray  wrote:
>
>> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto Ribeiro Bastos
>>  wrote:
>> > Hello guys,
>> >
>> > My company is about to rebuild its whole infrastructure, so I was
>> called in
>> > order to help on the planning. We are essentially an corporate mail
>> > provider, so we handle daily lots of clients using dovecot and
>> roundcube and
>> > in order to do so we want to design a better plant of our cluster.
>> Today,
>> > using Jewel, we have a single cephFS for both index and mail from
>> dovecot,
>> > but we want to split it into an index_FS and a mail_FS to handle the
>> > workload a little better, is it profitable nowadays? From my research I
>> > realized that we will need data and metadata individual pools for each
>> FS
>> > such as a group of MDS for each of then, also.
>> >
>> > The one thing that really scares me about all of this is: we are
>> planning to
>> > have four machines at full disposal to handle our MDS instances. We
>> started
>> > to think if an idea like the one below is valid, can anybody give a
>> hint on
>> > this? We basically want to handle two MDS instances on each machine
>> (one for
>> > each FS) and wonder if we'll be able to have them swapping between
>> active
>> > and standby simultaneously without any trouble.
>> >
>> > index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
>> > mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})
>>
>> Nothing wrong with that setup, but remember that those servers are
>> going to have to be well-resourced enough to run all four at once
>> (when a failure occurs), so it might not matter very much exactly
>> which servers are running which daemons.
>>
>> With a filesystem's MDS daemons (i.e. daemons with the same
>> standby_for_fscid setting), Ceph will activate whichever daemon comes
>> up first, so if it's important to you to have particular daemons
>> active then you would need to take care of that at the point you're
>> starting them up.
>>
>> John
>>
>> >
>> > Regards,
>> > --
>> >
>> > João Paulo Sacchetto Ribeiro Bastos
>> > +55 31 99279-7092
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
> --
>
> João Paulo Sacchetto Ribeiro Bastos
> +55 31 99279-7092
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-10 Thread David C
Hi All

I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from
http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/

It's failing to load /usr/lib64/ganesha/libfsalceph.so

With libcephfs-12.2.1 installed I get the following error in my ganesha log:

load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> /usr/lib64/ganesha/libfsalceph.so: undefined symbol: ceph_set_deleg_timeout
> load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> : Can not access a needed shared library
>

With libcephfs-12.2.5 installed I get:

load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> /lib64/libcephfs.so.2: undefined symbol:
> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> : Can not access a needed shared library
>

My cluster is running 12.2.1

All package versions:

nfs-ganesha-2.6.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
libcephfs2-12.2.5-0.el7.x86_64

Can anyone point me in the right direction?

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-10 Thread João Paulo Sacchetto Ribeiro Bastos
Hey John, thanks for your answer. For sure the hardware robustness will be
good enough. My true concern was actually the coexistence of the two FS
ecosystems. In fact I realized that we may not use this after all, because it
may represent a high overhead, given that it's still an experimental feature.
On Thu, 10 May 2018 at 15:48 John Spray  wrote:

> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto Ribeiro Bastos
>  wrote:
> > Hello guys,
> >
> > My company is about to rebuild its whole infrastructure, so I was called
> in
> > order to help on the planning. We are essentially an corporate mail
> > provider, so we handle daily lots of clients using dovecot and roundcube
> and
> > in order to do so we want to design a better plant of our cluster. Today,
> > using Jewel, we have a single cephFS for both index and mail from
> dovecot,
> > but we want to split it into an index_FS and a mail_FS to handle the
> > workload a little better, is it profitable nowadays? From my research I
> > realized that we will need data and metadata individual pools for each FS
> > such as a group of MDS for each of then, also.
> >
> > The one thing that really scares me about all of this is: we are
> planning to
> > have four machines at full disposal to handle our MDS instances. We
> started
> > to think if an idea like the one below is valid, can anybody give a hint
> on
> > this? We basically want to handle two MDS instances on each machine (one
> for
> > each FS) and wonder if we'll be able to have them swapping between active
> > and standby simultaneously without any trouble.
> >
> > index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
> > mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})
>
> Nothing wrong with that setup, but remember that those servers are
> going to have to be well-resourced enough to run all four at once
> (when a failure occurs), so it might not matter very much exactly
> which servers are running which daemons.
>
> With a filesystem's MDS daemons (i.e. daemons with the same
> standby_for_fscid setting), Ceph will activate whichever daemon comes
> up first, so if it's important to you to have particular daemons
> active then you would need to take care of that at the point you're
> starting them up.
>
> John
>
> >
> > Regards,
> > --
> >
> > João Paulo Sacchetto Ribeiro Bastos
> > +55 31 99279-7092
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
-- 

João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph osd crush weight to utilization incorrect on one node

2018-05-10 Thread Pardhiv Karri
Hi,

We have a large 1PB Ceph cluster. We recently added 6 nodes with 16 2TB
disks each to the cluster. Five of the nodes rebalanced well without any
issues, but the OSDs on the sixth/last node started acting weird: as I
increase the weight of one OSD its utilization doesn't change, but the
utilization of a different OSD on the same node increases. Rebalancing
completes fine, but the utilization is not right.


I increased the weight of OSD 610 to 0.2 from 0.0, but the utilization of OSD
611 started increasing even though its weight is still 0.0. If I increase the
weight of OSD 611 to 0.2, its utilization grows as if its weight were 0.4. And
if I increase the weight of 610 and 615 towards their full weight, utilization
on OSD 610 stays at 1% while OSD 611 inches towards 100%, at which point I had
to stop and set the crush weight back down to 0.0 to avoid any impact on the
cluster. It's not just one OSD but different OSDs on that one node. The only
correlation I found is that the journal partitions of OSDs 610 and 611 are on
the same SSD drive; all the OSDs are SAS drives. Any help on how to debug or
resolve this would be appreciated.
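
(For reference, the reweight command used and one way to cross-check crush weight against utilization per OSD; the OSD ids are the ones from the description above:)

#ceph osd crush reweight osd.610 0.2
#ceph osd df tree | egrep 'osd\.(610|611|615) '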


Attached is a screenshot, which shows that the crush weight of OSDs 610, 612 and
620 was increased to 0.2, while OSDs 611, 615 and 623 show increased utilization
despite having a crush weight of 0.




Thanks,
Pardhiv K
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Cache and rbd-nbd

2018-05-10 Thread Jason Dillaman
On Thu, May 10, 2018 at 12:03 PM, Marc Schöchlin  wrote:
> Hello list,
>
> i map ~30 rbds  per xenserver host by using rbd-nbd to run virtual machines
> on these devices.
>
> I have the following questions:
>
> Is it possible to use rbd cache for rbd-nbd? I assume that this is true, but
> the documentation does not make a clear statement about this.
> (http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/)

It's on by default since it's a librbd client and that's the default setting.

> If i configure caches like described at
> http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/, are there dedicated
> caches per rbd-nbd/krbd device or is there a only a single cache area.

The librbd cache is per device, but if you aren't performing direct
IOs to the device, you would also have the unified Linux pagecache on
top of all the devices.

> How can i identify the rbd cache with the tools provided by the operating
> system?

Identify how? You can enable the admin sockets and use "ceph
--admin-daemon config show" to display the in-use settings.
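
For example, something along these lines on the client side (the socket path and client name are illustrative, not a recommendation):

[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
    rbd cache = true

# then query the socket created by the rbd-nbd process:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.94732184.asok config show | grep rbd_cache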

> Can you provide some hints about adequate cache settings for a write
> intensive environment (70% write, 30% read)?
> Is it a good idea to specify a huge rbd cache of 1 GB with a max dirty age
> of 10 seconds?

The librbd cache is really only useful for sequential read-ahead and
for small writes (assuming writeback is enabled). Assuming you aren't
using direct IO, I'd suspect your best performance would be to disable
the librbd cache and rely on the Linux pagecache to work its magic.

>
> Regards
> Marc
>
> Our system:
>
> Luminous/12.2.5
> Ubuntu 16.04
> 5 OSD Nodes (24*8 TB HDD OSDs, 48*1TB SSD OSDS, Bluestore, 6Gb Cache per
> OSD)
> Size per OSD, 192GB RAM, 56 HT CPUs)
> 3 Mons (64 GB RAM, 200GB SSD, 4 visible CPUs)
> 2 * 10 GBIT, SFP+, bonded xmit_hash_policy layer3+4
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> ceph      1841  3.5 94.3 133703308 124425384 ?   Ssl  Apr04 1808:32
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph
>
>
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> {
> "pool": {
> "items": 173261056,
> "bytes": 76504108600
> }
> }
>
> So, 80GB is my configured limit for the cache and it appears the mds is
> following that limit. But, the mds process is using over 100GB RAM in my
> 128GB host. I thought I was playing it safe by configuring at 80. What other
> things consume a lot of RAM for this process?
>
> Let me know if I need to create a new thread.

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade ASAP.

[1] https://tracker.ceph.com/issues/22972

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Cache and rbd-nbd

2018-05-10 Thread Marc Schöchlin
Hello list,

I map ~30 RBDs per XenServer host by using rbd-nbd to run virtual machines
on these devices.

I have the following questions:

  * Is it possible to use rbd cache for rbd-nbd? I assume that this is
true, but  the documentation does not make a clear statement about this.
(http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/)
  * If i configure caches like described at
http://docs.ceph.com/docs/luminous/rbd/rbd-config-ref/, are there
dedicated caches per rbd-nbd/krbd device or is there a only a single
cache area.
How can i identify the rbd cache with the tools provided by the
operating system?
  * Can you provide some hints about adequate cache settings for
a write intensive environment (70% write, 30% read)?
Is it a good idea to specify a huge rbd cache of 1 GB with a max
dirty age of 10 seconds?

Regards
Marc

Our system:

  * Luminous/12.2.5
  * Ubuntu 16.04
  * 5 OSD Nodes (24*8 TB HDD OSDs, 48*1TB SSD OSDS, Bluestore, 6Gb Cache
per OSD)
  * Size per OSD, 192GB RAM, 56 HT CPUs)
  * 3 Mons (64 GB RAM, 200GB SSD, 4 visible CPUs)
  * 2 * 10 GBIT, SFP+, bonded xmit_hash_policy layer3+4

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Brady Deetz
[ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
ceph      1841  3.5 94.3 133703308 124425384 ?   Ssl  Apr04 1808:32
/usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph


[ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
{
"pool": {
"items": 173261056,
"bytes": 76504108600
}
}

So, 80GB is my configured limit for the cache and it appears the mds is
following that limit. But, the mds process is using over 100GB RAM in my
128GB host. I thought I was playing it safe by configuring at 80. What
other things consume a lot of RAM for this process?

Let me know if I need to create a new thread.




On Thu, May 10, 2018 at 12:40 PM, Patrick Donnelly 
wrote:

> Hello Brady,
>
> On Thu, May 10, 2018 at 7:35 AM, Brady Deetz  wrote:
> > I am now seeing the exact same issues you are reporting. A heap release
> did
> > nothing for me.
>
> I'm not sure it's the same issue...
>
> > [root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
> > {
> > "mds_cache_memory_limit": "80530636800"
> > }
>
> 80G right? What was the memory use from `ps aux | grep ceph-mds`?
>
> > [root@mds0 ~]# ceph daemon mds.mds0 perf dump
> > {
> > ...
> > "inode_max": 2147483647,
> > "inodes": 35853368,
> > "inodes_top": 23669670,
> > "inodes_bottom": 12165298,
> > "inodes_pin_tail": 18400,
> > "inodes_pinned": 2039553,
> > "inodes_expired": 142389542,
> > "inodes_with_caps": 831824,
> > "caps": 881384,
>
> Your cap count is 2% of the inodes in cache; the inodes pinned 5% of
> the total. Your cache should be getting trimmed assuming the cache
> size (as measured by the MDS, there are fixes in 12.2.5 which improve
> its precision) is larger than your configured limit.
>
> If the cache size is larger than the limit (use `cache status` admin
> socket command) then we'd be interested in seeing a few seconds of the
> MDS debug log with higher debugging set (`config set debug_mds 20`).
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-10 Thread John Spray
On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto Ribeiro Bastos
 wrote:
> Hello guys,
>
> My company is about to rebuild its whole infrastructure, so I was called in
> order to help on the planning. We are essentially an corporate mail
> provider, so we handle daily lots of clients using dovecot and roundcube and
> in order to do so we want to design a better plant of our cluster. Today,
> using Jewel, we have a single cephFS for both index and mail from dovecot,
> but we want to split it into an index_FS and a mail_FS to handle the
> workload a little better, is it profitable nowadays? From my research I
> realized that we will need data and metadata individual pools for each FS
> such as a group of MDS for each of then, also.
>
> The one thing that really scares me about all of this is: we are planning to
> have four machines at full disposal to handle our MDS instances. We started
> to think if an idea like the one below is valid, can anybody give a hint on
> this? We basically want to handle two MDS instances on each machine (one for
> each FS) and wonder if we'll be able to have them swapping between active
> and standby simultaneously without any trouble.
>
> index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
> mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})

Nothing wrong with that setup, but remember that those servers are
going to have to be well-resourced enough to run all four at once
(when a failure occurs), so it might not matter very much exactly
which servers are running which daemons.

With a filesystem's MDS daemons (i.e. daemons with the same
standby_for_fscid setting), Ceph will activate whichever daemon comes
up first, so if it's important to you to have particular daemons
active then you would need to take care of that at the point you're
starting them up.
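
(As an illustration of tying particular daemons to particular filesystems, a minimal ceph.conf sketch; the daemon names are made up and it assumes index_FS has fscid 1 and mail_FS fscid 2:)

[mds.machine1-index]
    mds_standby_for_fscid = 1
[mds.machine3-index]
    mds_standby_for_fscid = 1
[mds.machine2-mail]
    mds_standby_for_fscid = 2
[mds.machine4-mail]
    mds_standby_for_fscid = 2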

John

>
> Regards,
> --
>
> João Paulo Sacchetto Ribeiro Bastos
> +55 31 99279-7092
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] howto: multiple ceph filesystems

2018-05-10 Thread João Paulo Sacchetto Ribeiro Bastos
Hello guys,

My company is about to rebuild its whole infrastructure, so I was called in
to help with the planning. We are essentially a corporate mail provider, so we
handle lots of clients daily using Dovecot and Roundcube, and we want to
design a better layout for our cluster. Today, using Jewel, we have a single
CephFS for both index and mail from Dovecot, but we want to split it into an
index_FS and a mail_FS to handle the workload a little better. Is that
worthwhile nowadays? From my research I realized that we will need individual
data and metadata pools for each FS, as well as a group of MDSes for each of
them.

The one thing that really scares me about all of this is: we are planning
to have four machines at full disposal to handle our MDS instances. We
started to wonder whether an idea like the one below is valid; can anybody
give a hint on this? We basically want to handle two MDS instances on each
machine (one for each FS) and wonder if we'll be able to have them swapping
between active and standby simultaneously without any trouble.

index_FS: (active={machines 1 and 3}, standby={machines 2 and 4})
mail_FS: (active={machines 2 and 4}, standby={machines 1 and 3})
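
(For reference, the kind of setup this implies; the commands below are only a sketch, and the pool names and PG counts are placeholders:)

ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create index_metadata 64
ceph osd pool create index_data 256
ceph fs new index_FS index_metadata index_data
ceph osd pool create mail_metadata 64
ceph osd pool create mail_data 256
ceph fs new mail_FS mail_metadata mail_data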

Regards,
-- 

João Paulo Sacchetto Ribeiro Bastos
+55 31 99279-7092
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Patrick Donnelly
Hello Brady,

On Thu, May 10, 2018 at 7:35 AM, Brady Deetz  wrote:
> I am now seeing the exact same issues you are reporting. A heap release did
> nothing for me.

I'm not sure it's the same issue...

> [root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
> {
> "mds_cache_memory_limit": "80530636800"
> }

80G right? What was the memory use from `ps aux | grep ceph-mds`?

> [root@mds0 ~]# ceph daemon mds.mds0 perf dump
> {
> ...
> "inode_max": 2147483647,
> "inodes": 35853368,
> "inodes_top": 23669670,
> "inodes_bottom": 12165298,
> "inodes_pin_tail": 18400,
> "inodes_pinned": 2039553,
> "inodes_expired": 142389542,
> "inodes_with_caps": 831824,
> "caps": 881384,

Your cap count is 2% of the inodes in cache; the inodes pinned 5% of
the total. Your cache should be getting trimmed assuming the cache
size (as measured by the MDS, there are fixes in 12.2.5 which improve
its precision) is larger than your configured limit.

If the cache size is larger than the limit (use `cache status` admin
socket command) then we'd be interested in seeing a few seconds of the
MDS debug log with higher debugging set (`config set debug_mds 20`).
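
Concretely, something along these lines (daemon name as in your output, debug level dropped back to the documented 1/5 default afterwards):

ceph daemon mds.mds0 cache status
ceph daemon mds.mds0 config set debug_mds 20
# capture a few seconds of the MDS log, then:
ceph daemon mds.mds0 config set debug_mds 1/5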

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Buffer I/O errors cleared by flatten?

2018-05-10 Thread Jason Dillaman
Also, I should point out that if you've already upgraded to
Luminous, you can just use the new RBD caps profiles (a la mon
'profile rbd' osd 'profile rbd') [1]. The explicit blacklist caps
mentioned in the upgrade guide are only required since pre-Luminous
clusters didn't support the RBD caps profiles.

[1] 
http://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication
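
For example, for a typical OpenStack client (the client name and pool are placeholders):

ceph auth caps client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes'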

On Thu, May 10, 2018 at 10:11 AM, Jason Dillaman  wrote:
> It only bites you if you have a hard failure of a VM (i.e. the RBD
> image wasn't cleanly closed and the lock wasn't cleanly released). In
> that case, the next librbd client to attempt to acquire the lock will
> notice the dead lock owner and will attempt to blacklist it from the
> cluster to ensure it cannot write to the image.
>
> On Thu, May 10, 2018 at 10:08 AM, Jonathan Proulx  wrote:
>> On Thu, May 10, 2018 at 09:55:15AM -0700, Jason Dillaman wrote:
>> :My immediate guess is that your caps are incorrect for your OpenStack
>> :Ceph user. Please refer to step 6 from the Luminous upgrade guide to
>> :ensure your RBD users have permission to blacklist dead peers [1]
>> :
>> :[1] 
>> http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
>>
>> Good spotting!  Thanks for fastreply.  Next question is why did this
>> take so long to bite me we've been on luminous for 6 months, not going
>> to worry too myc about that last quetion though.
>>
>> Hoepfully that was the problem (it definitely was a problem).
>>
>> Thanks,
>> -Jon
>>
>> :On Thu, May 10, 2018 at 9:49 AM, Jonathan Proulx  wrote:
>> :> Hi All,
>> :>
>> :> recently I saw a number of rbd backed VMs in my openstack cloud fail
>> :> to reboot after a hypervisor crash with errors simialr to:
>> :>
>> :> [5.279393] blk_update_request: I/O error, dev vda, sector 2048
>> :> [5.281427] Buffer I/O error on dev vda1, logical block 0, lost async 
>> page write
>> :> [5.284114] Buffer I/O error on dev vda1, logical block 1, lost async 
>> page write
>> :> [5.286600] Buffer I/O error on dev vda1, logical block 2, lost async 
>> page write
>> :> [5.289022] Buffer I/O error on dev vda1, logical block 3, lost async 
>> page write
>> :> [5.291515] Buffer I/O error on dev vda1, logical block 4, lost async 
>> page write
>> :> [5.338981] blk_update_request: I/O error, dev vda, sector 3088
>> :>
>> :> for many blocks and sectors. I was able to export the rbd images and
>> :> they seemed fine, also 'rbd flatten' made them boot again with no
>> :> errors.
>> :>
>> :> I found this puzzling and concerning but given the crash and limited
>> :> time didn't really follow up.
>> :>
>> :> Today I intetionally rebooted a VM on a health hypervisor and had it
>> :> land in the same condition, now I'm really worried.
>> :>
>> :> running:
>> :> Ubuntu16.04
>> :> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
>> (stable) (on hypervisor)
>> :> {
>> :> "mon": {
>> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
>> luminous (stable)": 3
>> :> },
>> :> "mgr": {
>> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
>> luminous (stable)": 3
>> :> },
>> :> "osd": {
>> :> "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
>> luminous (stable)": 102,
>> :> "ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) 
>> luminous (stable)": 10,
>> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
>> luminous (stable)": 62
>> :> }
>> :> }
>> :> libvirt-bin1.3.1-1ubuntu10.21
>> :> qemu-system1:2.5+dfsg-5ubuntu10.24
>> :> OpenStack Mitaka
>> :>
>> :> Any one seen anything like this or have suggestions where to look for 
>> more details?
>> :>
>> :> -Jon
>> :> --
>> :> ___
>> :> ceph-users mailing list
>> :> ceph-users@lists.ceph.com
>> :> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> :
>> :
>> :
>> :--
>> :Jason
>>
>> --
>
>
>
> --
> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked

2018-05-10 Thread David Turner
2. When logging the 1/5 is what's written to the log file/what's
temporarily stored in memory.  If you want to increase logging, you need to
increase both numbers to 20/20 or 10/10.  You can also just set it to 20 or
10 and ceph will set them to the same number.  I personally do both numbers
to remind myself that the defaults aren't the same to set it back to.

3. You are not accessing something stored within the ceph cluster, so it
isn't using your admin cephx key that you have.  It is accessing the daemon
socket in the OS so you need to have proper permissions to be able to
access it.  The daemon sockets are located in /var/run/ceph/ which is a
folder without any public permissions.  You could use `sudo -u ceph ceph
daemon osd.15 dump_historic_ops` or just `sudo ceph daemon osd.15
dump_historic_ops` as root can access the daemon socket as well.
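
Putting both points together, a short sketch against osd.15 from this thread:

# raise OSD logging (file and in-memory levels):
ceph tell osd.15 injectargs '--debug-osd 20/20'
# query the daemon socket with enough privileges to reach /var/run/ceph:
sudo ceph daemon osd.15 dump_ops_in_flight
sudo ceph daemon osd.15 dump_historic_ops
# back to the default when done:
ceph tell osd.15 injectargs '--debug-osd 1/5'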

On Thu, May 10, 2018 at 7:14 AM Grigory Murashov 
wrote:

> Hi JC!
>
> Thanks for your answer first.
>
> 1. I have added output of  ceph health detail to Zabbix in case of
> warning. So every time I will see with which OSD the problem is.
>
> 2. I have default level of all logs. As I see here
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
> default log level for OSD is 1/5. Should I try
>
> debug osd = 1/20, or would 1/10 be enough here?
>
> 3. Any thoughts why do I have Permission denied? All of my sockets are
> also defaults.
>
> [cephuser@osd3 ~]$ ceph daemon osd.15 dump_historic_ops
> admin_socket: exception getting command descriptions: [Errno 13]
> Permission denied
>
> Thanks in advance.
>
> Grigory Murashov
> Voximplant
>
> 08.05.2018 17:31, Jean-Charles Lopez пишет:
> > Hi Grigory,
> >
> > are these lines the only lines in your log file for OSD 15?
> >
> > Just for sanity, what are the log levels you have set, if any, in your
> config file away from the default? If you set all log levels to 0 like some
> people do you may want to simply go back to the default by commenting out
> the debug_ lines in your config file. If you want to see something more
> detailed you can indeed increase the log level to 5 or 10.
> >
> > What you can also do is to use the admin socket on the machine to see
> what operations are actually blocked: ceph daemon osd.15 dump_ops_in_flight
> and ceph daemon osd.15 dump_historic_ops.
> >
> > These two commands and their output will show you what exact operations
> are blocked and will also point you to the other OSDs this OSD is working
> with to serve the IO. May be the culprit is actually one of the OSDs
> handling the subops or it could be a network problem.
> >
> > Regards
> > JC
> >
> >> On May 8, 2018, at 03:11, Grigory Murashov 
> wrote:
> >>
> >> Hello Jean-Charles!
> >>
> >> I have finally catch the problem, It was at 13-02.
> >>
> >> [cephuser@storage-ru1-osd3 ~]$ ceph health detail
> >> HEALTH_WARN 18 slow requests are blocked > 32 sec
> >> REQUEST_SLOW 18 slow requests are blocked > 32 sec
> >>  3 ops are blocked > 65.536 sec
> >>  15 ops are blocked > 32.768 sec
> >>  osd.15 has blocked requests > 65.536 sec
> >> [cephuser@storage-ru1-osd3 ~]$
> >>
> >>
> >> But surprise - there is no information in ceph-osd.15.log that time
> >>
> >>
> >> 2018-05-08 12:54:26.105919 7f003f5f9700  4 rocksdb: (Original Log Time
> 2018/05/08-12:54:26.105843) EVENT_LOG_v1 {"time_micros": 1525773266105834,
> "job": 2793, "event": "trivial_move", "dest
> >> ination_level": 3, "files": 1, "total_files_size": 68316970}
> >> 2018-05-08 12:54:26.105926 7f003f5f9700  4 rocksdb: (Original Log Time
> 2018/05/08-12:54:26.105854)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> >>
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537]
> [default] Moved #1 files to level-3 68316970 bytes OK
> >> : base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max
> score 0.98
> >>
> >> 2018-05-08 13:07:29.711425 7f004f619700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
> >>
> elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684]
> reusing log 8051 from recycle list
> >>
> >> 2018-05-08 13:07:29.711497 7f004f619700  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
> >>
> elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725]
> [default] New memtable created with log file: #8089. Immutable memtables: 0.
> >>
> >> 2018-05-08 13:07:29.726107 7f003fdfa700  4 rocksdb: (Original Log Time
> 2018/05/08-13:07:29.711524)
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> >>
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:11

Re: [ceph-users] RBD Buffer I/O errors cleared by flatten?

2018-05-10 Thread Jason Dillaman
It only bites you if you have a hard failure of a VM (i.e. the RBD
image wasn't cleanly closed and the lock wasn't cleanly released). In
that case, the next librbd client to attempt to acquire the lock will
notice the dead lock owner and will attempt to blacklist it from the
cluster to ensure it cannot write to the image.

On Thu, May 10, 2018 at 10:08 AM, Jonathan Proulx  wrote:
> On Thu, May 10, 2018 at 09:55:15AM -0700, Jason Dillaman wrote:
> :My immediate guess is that your caps are incorrect for your OpenStack
> :Ceph user. Please refer to step 6 from the Luminous upgrade guide to
> :ensure your RBD users have permission to blacklist dead peers [1]
> :
> :[1] 
> http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
>
> Good spotting!  Thanks for fastreply.  Next question is why did this
> take so long to bite me we've been on luminous for 6 months, not going
> to worry too myc about that last quetion though.
>
> Hoepfully that was the problem (it definitely was a problem).
>
> Thanks,
> -Jon
>
> :On Thu, May 10, 2018 at 9:49 AM, Jonathan Proulx  wrote:
> :> Hi All,
> :>
> :> recently I saw a number of rbd backed VMs in my openstack cloud fail
> :> to reboot after a hypervisor crash with errors simialr to:
> :>
> :> [5.279393] blk_update_request: I/O error, dev vda, sector 2048
> :> [5.281427] Buffer I/O error on dev vda1, logical block 0, lost async 
> page write
> :> [5.284114] Buffer I/O error on dev vda1, logical block 1, lost async 
> page write
> :> [5.286600] Buffer I/O error on dev vda1, logical block 2, lost async 
> page write
> :> [5.289022] Buffer I/O error on dev vda1, logical block 3, lost async 
> page write
> :> [5.291515] Buffer I/O error on dev vda1, logical block 4, lost async 
> page write
> :> [5.338981] blk_update_request: I/O error, dev vda, sector 3088
> :>
> :> for many blocks and sectors. I was able to export the rbd images and
> :> they seemed fine, also 'rbd flatten' made them boot again with no
> :> errors.
> :>
> :> I found this puzzling and concerning but given the crash and limited
> :> time didn't really follow up.
> :>
> :> Today I intetionally rebooted a VM on a health hypervisor and had it
> :> land in the same condition, now I'm really worried.
> :>
> :> running:
> :> Ubuntu16.04
> :> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
> (stable) (on hypervisor)
> :> {
> :> "mon": {
> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 3
> :> },
> :> "mgr": {
> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 3
> :> },
> :> "osd": {
> :> "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
> luminous (stable)": 102,
> :> "ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) 
> luminous (stable)": 10,
> :> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 62
> :> }
> :> }
> :> libvirt-bin1.3.1-1ubuntu10.21
> :> qemu-system1:2.5+dfsg-5ubuntu10.24
> :> OpenStack Mitaka
> :>
> :> Any one seen anything like this or have suggestions where to look for more 
> details?
> :>
> :> -Jon
> :> --
> :> ___
> :> ceph-users mailing list
> :> ceph-users@lists.ceph.com
> :> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> :
> :
> :
> :--
> :Jason
>
> --



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Buffer I/O errors cleared by flatten?

2018-05-10 Thread Jonathan Proulx
On Thu, May 10, 2018 at 09:55:15AM -0700, Jason Dillaman wrote:
:My immediate guess is that your caps are incorrect for your OpenStack
:Ceph user. Please refer to step 6 from the Luminous upgrade guide to
:ensure your RBD users have permission to blacklist dead peers [1]
:
:[1] 
http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken

Good spotting!  Thanks for the fast reply.  The next question is why this
took so long to bite me, since we've been on Luminous for 6 months, but I'm
not going to worry too much about that last question.

Hopefully that was the problem (it definitely was a problem).

Thanks,
-Jon

:On Thu, May 10, 2018 at 9:49 AM, Jonathan Proulx  wrote:
:> Hi All,
:>
:> recently I saw a number of rbd backed VMs in my openstack cloud fail
:> to reboot after a hypervisor crash with errors simialr to:
:>
:> [5.279393] blk_update_request: I/O error, dev vda, sector 2048
:> [5.281427] Buffer I/O error on dev vda1, logical block 0, lost async 
page write
:> [5.284114] Buffer I/O error on dev vda1, logical block 1, lost async 
page write
:> [5.286600] Buffer I/O error on dev vda1, logical block 2, lost async 
page write
:> [5.289022] Buffer I/O error on dev vda1, logical block 3, lost async 
page write
:> [5.291515] Buffer I/O error on dev vda1, logical block 4, lost async 
page write
:> [5.338981] blk_update_request: I/O error, dev vda, sector 3088
:>
:> for many blocks and sectors. I was able to export the rbd images and
:> they seemed fine, also 'rbd flatten' made them boot again with no
:> errors.
:>
:> I found this puzzling and concerning but given the crash and limited
:> time didn't really follow up.
:>
:> Today I intetionally rebooted a VM on a health hypervisor and had it
:> land in the same condition, now I'm really worried.
:>
:> running:
:> Ubuntu16.04
:> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable) (on hypervisor)
:> {
:> "mon": {
:> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 3
:> },
:> "mgr": {
:> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 3
:> },
:> "osd": {
:> "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)": 102,
:> "ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) 
luminous (stable)": 10,
:> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 62
:> }
:> }
:> libvirt-bin1.3.1-1ubuntu10.21
:> qemu-system1:2.5+dfsg-5ubuntu10.24
:> OpenStack Mitaka
:>
:> Any one seen anything like this or have suggestions where to look for more 
details?
:>
:> -Jon
:> --
:> ___
:> ceph-users mailing list
:> ceph-users@lists.ceph.com
:> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
:
:
:
:-- 
:Jason

-- 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Buffer I/O errors cleared by flatten?

2018-05-10 Thread Jason Dillaman
My immediate guess is that your caps are incorrect for your OpenStack
Ceph user. Please refer to step 6 from the Luminous upgrade guide to
ensure your RBD users have permission to blacklist dead peers [1]

[1] 
http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken

On Thu, May 10, 2018 at 9:49 AM, Jonathan Proulx  wrote:
> Hi All,
>
> recently I saw a number of rbd backed VMs in my openstack cloud fail
> to reboot after a hypervisor crash with errors simialr to:
>
> [5.279393] blk_update_request: I/O error, dev vda, sector 2048
> [5.281427] Buffer I/O error on dev vda1, logical block 0, lost async page 
> write
> [5.284114] Buffer I/O error on dev vda1, logical block 1, lost async page 
> write
> [5.286600] Buffer I/O error on dev vda1, logical block 2, lost async page 
> write
> [5.289022] Buffer I/O error on dev vda1, logical block 3, lost async page 
> write
> [5.291515] Buffer I/O error on dev vda1, logical block 4, lost async page 
> write
> [5.338981] blk_update_request: I/O error, dev vda, sector 3088
>
> for many blocks and sectors. I was able to export the rbd images and
> they seemed fine, also 'rbd flatten' made them boot again with no
> errors.
>
> I found this puzzling and concerning but given the crash and limited
> time didn't really follow up.
>
> Today I intetionally rebooted a VM on a health hypervisor and had it
> land in the same condition, now I'm really worried.
>
> running:
> Ubuntu16.04
> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
> (stable) (on hypervisor)
> {
> "mon": {
> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 3
> },
> "osd": {
> "ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
> luminous (stable)": 102,
> "ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) 
> luminous (stable)": 10,
> "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
> luminous (stable)": 62
> }
> }
> libvirt-bin1.3.1-1ubuntu10.21
> qemu-system1:2.5+dfsg-5ubuntu10.24
> OpenStack Mitaka
>
> Any one seen anything like this or have suggestions where to look for more 
> details?
>
> -Jon
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Buffer I/O errors cleared by flatten?

2018-05-10 Thread Jonathan Proulx
Hi All,

recently I saw a number of rbd backed VMs in my openstack cloud fail
to reboot after a hypervisor crash with errors similar to:

[5.279393] blk_update_request: I/O error, dev vda, sector 2048
[5.281427] Buffer I/O error on dev vda1, logical block 0, lost async page 
write
[5.284114] Buffer I/O error on dev vda1, logical block 1, lost async page 
write
[5.286600] Buffer I/O error on dev vda1, logical block 2, lost async page 
write
[5.289022] Buffer I/O error on dev vda1, logical block 3, lost async page 
write
[5.291515] Buffer I/O error on dev vda1, logical block 4, lost async page 
write
[5.338981] blk_update_request: I/O error, dev vda, sector 3088

for many blocks and sectors. I was able to export the rbd images and
they seemed fine, also 'rbd flatten' made them boot again with no
errors.

I found this puzzling and concerning, but given the crash and limited
time I didn't really follow up.

Today I intentionally rebooted a VM on a healthy hypervisor and had it
land in the same condition, so now I'm really worried.

running:
Ubuntu16.04
ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable) (on hypervisor) 
{
"mon": {
"ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)": 102,
"ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) 
luminous (stable)": 10,
"ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) 
luminous (stable)": 62
}
}
libvirt-bin1.3.1-1ubuntu10.21
qemu-system1:2.5+dfsg-5ubuntu10.24
OpenStack Mitaka

Any one seen anything like this or have suggestions where to look for more 
details?

-Jon
-- 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to normally expand OSD’s capacity?

2018-05-10 Thread Ronny Aasen

On 10.05.2018 12:24, Yi-Cian Pu wrote:


Hi All,

We are wondering if there is any way to expand an OSD's capacity. We have been 
studying this and conducted an experiment. However, in the result, the 
expanded capacity is counted in the USED part rather than the AVAIL one. The 
following shows the process of our experiment:


1. We prepare a small cluster of luminous v12.2.4 and write some data into a 
pool. The osd.1 is manually deployed and it uses a disk partition of size 
100GB (the whole disk size is 320GB).


=

[root@workstation /]# ceph osd df

ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0   hdd 0.28999  1.0 297G 27062M   271G  8.89 0.67  32
 1   hdd 0.0      1.0 100G 27062M 76361M 26.17 1.97  32
             TOTAL 398G 54125M   345G 13.27
MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63

=

2. Then, we expand the disk partition used by osd.1 by the following steps:

(1) Stop osd.1 daemon
(2) Use “parted” command to expand 50GB of the disk partition.
(3) Restart osd.1 daemon

3. After we do the above steps, we have the result that the expanded size is 
counted on the USED part.

=

[root@workstation /]# ceph osd df

ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0   hdd 0.28999  1.0 297G 27063M   271G  8.89 0.39  32
 1   hdd 0.0      1.0 150G 78263M 76360M 50.62 2.21  32
             TOTAL 448G   102G   345G 22.94
MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95

=

This is what we have tried, and the result looks very confusing. We’d really 
want to know if there is any way to normally expand OSD’s capacity. Any 
feedback or suggestions would be much appreciated.




You do not do this in Ceph.

You would normally not partition the OSD drive; you use the whole drive, so 
you would never get into the position of needing to grow it. You add space by 
adding OSDs and adding nodes, so increasing an OSD's size is not the usual 
approach.

If you must, for some oddball reason: you can remove the OSD (drain or 
destroy it), repartition, re-add the OSD and let Ceph backfill the drive.

Or you can just make a new OSD with the remaining disk space.
Since the space increase will change the crushmap, there is no way to 
avoid some data movement anyway.
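
(A rough sketch of the drain-and-recreate path on Luminous, using osd.1 from the example; the exact recreate step depends on how the OSD was originally deployed, and /dev/sdX2 is a placeholder:)

ceph osd out 1                            # let data drain off the OSD
# wait until all PGs are active+clean again
systemctl stop ceph-osd@1
ceph osd purge 1 --yes-i-really-mean-it
# repartition, then recreate the OSD on the larger partition, e.g.:
ceph-volume lvm create --data /dev/sdX2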


mvh
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scrubbing impacting write latency since Luminous

2018-05-10 Thread Nick Fisk
Hi All,

I've just upgraded our main cluster to Luminous and have noticed that where
before the cluster 64k write latency was always hovering around 2ms
regardless of what scrubbing was going on, since the upgrade to Luminous,
scrubbing takes the average latency up to around 5-10ms and deep scrubbing
pushes it into the 30ms region.

No other changes apart from the upgrade have taken place. Is anyone aware of
any major changes in the way scrubbing is carried out Jewel->Luminous, which
may be causing this?

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Brady Deetz
I am now seeing the exact same issues you are reporting. A heap release did
nothing for me.

The only odd thing I'm doing is migrating data in cephfs from one pool to
another. The process looks something like the following:
TARGET_DIR=/media/cephfs/labs/
TARGET_POOL="cephfs_ec_data"
setfattr -n ceph.dir.layout.pool -v ${TARGET_POOL} ${TARGET_DIR}
# for every file under ${TARGET_DIR}:
find "${TARGET_DIR}" -type f | while read -r file; do
    NEWFILE="${file}.ec"
    cp "${file}" "${NEWFILE}"   # the copy is written to the new layout pool
    mv "${NEWFILE}" "${file}"   # rename over the original
done

I have a fear that this process may not be releasing the inode of ${file}
and deleting the objects from RADOS. But I'm not sure that would have much
to do with the MDS beyond tracking an inode that isn't accessible anymore.
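
(One way to watch whether the replaced files are actually being purged is the stray counters in the same perf dump shown below:)

ceph daemon mds.mds0 perf dump | grep strays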



[root@mds0 ~]# rpm -qa | grep ceph
ceph-mgr-12.2.4-0.el7.x86_64
ceph-12.2.4-0.el7.x86_64
ceph-osd-12.2.4-0.el7.x86_64
ceph-release-1-1.el7.noarch
libcephfs2-12.2.4-0.el7.x86_64
ceph-base-12.2.4-0.el7.x86_64
ceph-mds-12.2.4-0.el7.x86_64
ceph-deploy-2.0.0-0.noarch
ceph-common-12.2.4-0.el7.x86_64
ceph-mon-12.2.4-0.el7.x86_64
ceph-radosgw-12.2.4-0.el7.x86_64
python-cephfs-12.2.4-0.el7.x86_64
ceph-selinux-12.2.4-0.el7.x86_64


[root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
{
"mds_cache_memory_limit": "80530636800"
}


[root@mds0 ~]# ceph daemon mds.mds0 perf dump
{
"AsyncMessenger::Worker-0": {
"msgr_recv_messages": 48568037,
"msgr_send_messages": 51895350,
"msgr_recv_bytes": 50001752194,
"msgr_send_bytes": 59667899407,
"msgr_created_connections": 28522,
"msgr_active_connections": 939,
"msgr_running_total_time": 9158.145665485,
"msgr_running_send_time": 3270.445768873,
"msgr_running_recv_time": 8951.883602486,
"msgr_running_fast_dispatch_time": 684.964408603
},
"AsyncMessenger::Worker-1": {
"msgr_recv_messages": 81557461,
"msgr_send_messages": 88149491,
"msgr_recv_bytes": 59543645402,
"msgr_send_bytes": 99790426210,
"msgr_created_connections": 28705,
"msgr_active_connections": 881,
"msgr_running_total_time": 14513.332929088,
"msgr_running_send_time": 5214.994372044,
"msgr_running_recv_time": 13891.320681575,
"msgr_running_fast_dispatch_time": 682.921363330
},
"AsyncMessenger::Worker-2": {
"msgr_recv_messages": 104018424,
"msgr_send_messages": 117265828,
"msgr_recv_bytes": 70248474177,
"msgr_send_bytes": 175930469394,
"msgr_created_connections": 30034,
"msgr_active_connections": 1043,
"msgr_running_total_time": 18836.813930876,
"msgr_running_send_time": 7227.884643396,
"msgr_running_recv_time": 17825.385233846,
"msgr_running_fast_dispatch_time": 692.710777921
},
"finisher-PurgeQueue": {
"queue_len": 0,
"complete_latency": {
"avgcount": 22554047,
"sum": 2515.425093728,
"avgtime": 0.000111528
}
},
"mds": {
"request": 156766118,
"reply": 156766111,
"reply_latency": {
"avgcount": 156766111,
"sum": 337276.533677320,
"avgtime": 0.002151463
},
"forward": 0,
"dir_fetch": 6468158,
"dir_commit": 539656,
"dir_split": 0,
"dir_merge": 0,
"inode_max": 2147483647,
"inodes": 35853368,
"inodes_top": 23669670,
"inodes_bottom": 12165298,
"inodes_pin_tail": 18400,
"inodes_pinned": 2039553,
"inodes_expired": 142389542,
"inodes_with_caps": 831824,
"caps": 881384,
"subtrees": 2,
"traverse": 167546977,
"traverse_hit": 53323050,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 4853,
"traverse_remote_ino": 0,
"traverse_lock": 39597,
"load_cent": 15676533928,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
"mds_cache": {
"num_strays": 1369,
"num_strays_delayed": 12,
"num_strays_enqueuing": 0,
"strays_created": 2667808,
"strays_enqueued": 2666306,
"strays_reintegrated": 246,
"strays_migrated": 0,
"num_recovering_processing": 0,
"num_recovering_enqueued": 0,
"num_recovering_prioritized": 0,
"recovery_started": 524,
"recovery_completed": 524,
"ireq_enqueue_scrub": 0,
"ireq_exportdir": 0,
"ireq_flush": 0,
"ireq_fragmentdir": 0,
"ireq_fragstats": 0,
"ireq_inodestats": 0
},
"mds_log": {
"evadd": 34813343,
"evex": 34809732,
"evtrm": 34809732,
"ev": 22489,
"evexg": 0,
"evexd": 728,
"segadd": 47980,
"segex": 47980,
"segtrm": 47980,
"seg": 31,
"segexg": 0,
"segexd": 1,
"expos": 8687078876712,
"wrpos": 8687143594883,
"rd

Re: [ceph-users] How to normally expand OSD’s capacity?

2018-05-10 Thread David Turner
I do not believe there is any way to change the size of any part of a
bluestore OSD configuration.

On Thu, May 10, 2018 at 6:37 AM Paul Emmerich 
wrote:

> You usually don't do that because you are supposed to use the whole disk.
>
>
> Paul
>
> 2018-05-10 12:31 GMT+02:00 Yi-Cian Pu :
>
>> Hi All,
>>
>>
>>
>> We are wondering if there is any way to expand OSD’s capacity. We are
>> studying about this and conducted an experiment. However, in the result,
>> the size of expanded capacity is counted on the USED part rather than the
>> AVAIL one. The following shows the process of our experiment:
>>
>>
>>
>> 1.   We prepare a small cluster of luminous v12.2.4 and write some
>> data into pool. The osd.1 is manually deployed and it uses a disk partition
>> of size 100GB (the whole disk size is 320GB).
>>
>> =
>>
>> [root@workstation /]# ceph osd df
>>
>> ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
>>
>>  0   hdd 0.28999  1.0 297G 27062M   271G  8.89 0.67  32
>>
>>  1   hdd 0.0  1.0 100G 27062M 76361M 26.17 1.97  32
>>
>> TOTAL 398G 54125M   345G 13.27
>>
>> MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63
>>
>> =
>>
>>
>>
>> 2.   Then, we expand the disk partition used by osd.1 by the
>> following steps:
>>
>> (1) Stop osd.1 daemon
>>
>> (2) Use “parted” command to expand 50GB of the disk partition.
>>
>> (3) Restart osd.1 daemon
>>
>>
>>
>> 3.   After we do the above steps, we have the result that the
>> expanded size is counted on USED part.
>>
>> =
>>
>> [root@workstation /]# ceph osd df
>>
>> ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
>>
>>  0   hdd 0.28999  1.0 297G 27063M   271G  8.89 0.39  32
>>
>>  1   hdd 0.0  1.0 150G 78263M 76360M 50.62 2.21  32
>>
>> TOTAL 448G   102G   345G 22.94
>>
>> MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95
>>
>> =
>>
>>
>>
>> This is what we have tried, and the result looks very confusing. We’d
>> really want to
>>
>> know if there is any way to normally expand OSD’s capacity. Any feedback
>> or suggestions would be much appreciated.
>>
>>
>>
>> Regards,
>> Yi-Cian
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90 <+49%2089%20189658590>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy: is it a requirement that the name of each node of the ceph cluster must be resolved to the public IP ?

2018-05-10 Thread Massimo Sgaravatto
This [*] is my ceph.conf

10.70.42.9 is the public address

And it is indeed the IP used by the MON daemon:

[root@c-mon-02 ~]# netstat -anp | grep 6789
tcp        0      0 10.70.42.9:6789    0.0.0.0:*           LISTEN      3835/ceph-mon
tcp        0      0 10.70.42.9:33592   10.70.42.10:6789    ESTABLISHED 3835/ceph-mon
tcp        0      0 10.70.42.9:41786   10.70.42.8:6789     ESTABLISHED 3835/ceph-mon
tcp   106008      0 10.70.42.9:33210   10.70.42.10:6789    CLOSE_WAIT  1162/ceph-mgr
tcp   100370      0 10.70.42.9:33218   10.70.42.10:6789    CLOSE_WAIT  1162/ceph-mgr
tcp        0      0 10.70.42.9:33578   10.70.42.10:6789    ESTABLISHED 1162/ceph-mgr


But the command:

/usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon.
--keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
exported keyring for client.admin

fails with c-mon-02 resolved to the management IP.

As a workaround I can add to /etc/hosts the mapping to the public address:

10.70.42.9  c-mon-02


but I wonder if this is the expected behavior


Cheers, Massimo

[*]

[global]
fsid = 7a8cb8ff-562b-47da-a6aa-507136587dcf
public network = 10.70.42.0/24
cluster network = 10.69.42.0/24


auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

osd pool default size = 3  # Write an object 3 times.
osd pool default min size = 2


osd pool default pg num = 128
osd pool default pgp num = 128


[mon]
mon host =  c-mon-01, c-mon-02, c-mon-03
mon addr =  10.70.42.10:6789, 10.70.42.9:6789, 10.70.42.8:6789

[mon.c-mon-01]
mon addr = 10.70.42.10:6789
host = c-mon-01

[mon.c-mon-02]
mon addr = 10.70.42.9:6789
host = c-mon-02

[mon.c-mon-03]
mon addr = 10.70.42.8:6789
host = c-mon-03

[osd]
osd mount options xfs = rw,noatime,inode64,logbufs=8,logbsize=256k



On Thu, May 10, 2018 at 1:12 PM, Paul Emmerich 
wrote:

> check ceph.conf, it controls to which mon IP the client tries to connect.
>
> 2018-05-10 12:57 GMT+02:00 Massimo Sgaravatto <
> massimo.sgarava...@gmail.com>:
>
>> I configured the "public network" attribute in the ceph configuration
>> file.
>>
>> But it looks like to me that in the "auth get client.admin" command [*]
>> issued by ceph-deploy the address of the management network is used (I
>> guess because c-mon-02 gets resolved to the IP management address)
>>
>> Cheers, Massimo
>>
>> [*]
>> /usr/bin/ceph  --connect-timeout=25 --cluster=ceph --name mon.
>> --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
>>
>> On Thu, May 10, 2018 at 12:49 PM, Paul Emmerich 
>> wrote:
>>
>>> Monitors can use only exactly one IP address. ceph-deploy uses some
>>> heuristics
>>> based on hostname resolution and ceph public addr configuration to guess
>>> which
>>> one to use during setup. (Which I've always found to be a quite annoying
>>> feature.)
>>>
>>> The mon's IP must be reachable from all ceph daemons and clients, so it
>>> should be
>>> on your "public" network. Changing the IP of a mon is possible but
>>> annoying, it is
>>> often easier to remove and then re-add with a new IP (if possible):
>>>
>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-
>>> mons/#changing-a-monitor-s-ip-address
>>>
>>>
>>> Paul
>>>
>>> 2018-05-10 12:36 GMT+02:00 Massimo Sgaravatto <
>>> massimo.sgarava...@gmail.com>:
>>>
 I have a ceph cluster that I manually deployed, and now I am trying to
 see if I can use ceph-deploy to deploy new nodes (in particular the object
 gw).

 The network configuration is the following:

 Each MON node has two network IP: one on a "management network" (not
 used for ceph related stuff) and one on a "public network",
 The MON daemon listens to on the pub network

 Each OSD node  has three network IPs: one on a "management network"
 (not used for ceph related stuff), one on a "public network" and the third
 one is an internal network to be used as ceph cluster network (for ceph
 internal traffic: replication, recovery, etc)


 Name resolution works, but names are resolved to the IP address of the
 management network.
 And it looks like this is a problem. E.g. the following command (used
 in ceph-deploy gatherkeys) issued on a MON host (c-mon-02) doesn't work:

 /usr/bin/ceph --verbose --connect-timeout=25 --cluster=ceph --name mon.
 --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin

 unless I change the name resolution of c-mon-02 to the public address


 Is it a requirement (at least for ceph-deploy) that the name of each
 node of the ceph cluster must be resolved to the public IP address ?


 Thanks, Massimo

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>>
>>> --
>>> --
>>> Paul Emmerich
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io

Re: [ceph-users] slow requests are blocked

2018-05-10 Thread Grigory Murashov

Hi JC!

First of all, thanks for your answer.

1. I have added the output of ceph health detail to Zabbix in case of a
warning, so every time I will see which OSD the problem is with.


2. I have the default level for all logs. As I see here
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
the default log level for OSD is 1/5. Should I try

debug osd = 1/20, or would 1/10 be enough here?

3. Any thoughts on why I get Permission denied? All of my sockets are
also at their defaults.


[cephuser@osd3 ~]$ ceph daemon osd.15 dump_historic_ops
admin_socket: exception getting command descriptions: [Errno 13] 
Permission denied
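
For reference, the admin socket under /var/run/ceph is only accessible to
root and the ceph user, and it only exists on the node that actually runs
the OSD, so the calls usually look something like this (default socket
path assumed):

  sudo ceph daemon osd.15 dump_ops_in_flight
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_historic_ops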


Thanks in advance.

Grigory Murashov
Voximplant

On 08.05.2018 17:31, Jean-Charles Lopez wrote:

Hi Grigory,

are these lines the only lines in your log file for OSD 15?

Just for sanity, what are the log levels you have set, if any, in your config 
file away from the default? If you set all log levels to 0 like some people do 
you may want to simply go back to the default by commenting out the debug_ 
lines in your config file. If you want to see something more detailed you can 
indeed increase the log level to 5 or 10.

What you can also do is to use the admin socket on the machine to see what 
operations are actually blocked: ceph daemon osd.15 dump_ops_in_flight and ceph 
daemon osd.15 dump_historic_ops.

These two commands and their output will show you what exact operations are 
blocked and will also point you to the other OSDs this OSD is working with to 
serve the IO. May be the culprit is actually one of the OSDs handling the 
subops or it could be a network problem.

Regards
JC


On May 8, 2018, at 03:11, Grigory Murashov  wrote:

Hello Jean-Charles!

I have finally caught the problem. It was at 13:02.

[cephuser@storage-ru1-osd3 ~]$ ceph health detail
HEALTH_WARN 18 slow requests are blocked > 32 sec
REQUEST_SLOW 18 slow requests are blocked > 32 sec
 3 ops are blocked > 65.536 sec
 15 ops are blocked > 32.768 sec
 osd.15 has blocked requests > 65.536 sec
[cephuser@storage-ru1-osd3 ~]$


But surprise - there is no information in ceph-osd.15.log at that time:


2018-05-08 12:54:26.105919 7f003f5f9700  4 rocksdb: (Original Log Time 2018/05/08-12:54:26.105843) EVENT_LOG_v1 
{"time_micros": 1525773266105834, "job": 2793, "event": "trivial_move", "dest
ination_level": 3, "files": 1, "total_files_size": 68316970}
2018-05-08 12:54:26.105926 7f003f5f9700  4 rocksdb: (Original Log Time 
2018/05/08-12:54:26.105854) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537]
 [default] Moved #1 files to level-3 68316970 bytes OK
: base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max score 0.98

2018-05-08 13:07:29.711425 7f004f619700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] 
reusing log 8051 from recycle list

2018-05-08 13:07:29.711497 7f004f619700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725] 
[default] New memtable created with log file: #8089. Immutable memtables: 0.

2018-05-08 13:07:29.726107 7f003fdfa700  4 rocksdb: (Original Log Time 
2018/05/08-13:07:29.711524) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1158]
 Calling FlushMemTableToOutputFile with column family
[default], flush slots available 1, compaction slots allowed 1, compaction 
slots scheduled 1
2018-05-08 13:07:29.726124 7f003fdfa700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/flush_job.cc:264] 
[default] [JOB 2794] Flushing memtable with next log file: 8089

Should I enable some deeper logging?


Grigory Murashov
Voximplant

On 07.05.2018 18:59, Jean-Charles Lopez wrote:

Hi,

ceph health detail

This will tell you which OSDs are experiencing the problem so you can then go 
and inspect the logs and use the admin socket to find out which requests are at 
the source.

Regards
JC


On May 7, 2018, at 03:52, Grigory Murashov  wrote:

Hello!

I'm not very experienced in Ceph troubleshooting, which is why I'm asking for help.

I have multiple warnings coming from zabbix as a result of ceph -s

REQUEST_SLOW: HEALTH_WARN : 21 slow requests are blocked > 32 sec

I don't see any hardware problems at that time.

I'm able to find the same strings in ceph.log and ceph-mon.log

Re: [ceph-users] ceph-deploy: is it a requirement that the name of each node of the ceph cluster must be resolved to the public IP ?

2018-05-10 Thread Paul Emmerich
check ceph.conf, it controls to which mon IP the client tries to connect.

2018-05-10 12:57 GMT+02:00 Massimo Sgaravatto 
:

> I configured the "public network" attribute in the ceph configuration file.
>
> But it looks like to me that in the "auth get client.admin" command [*]
> issued by ceph-deploy the address of the management network is used (I
> guess because c-mon-02 gets resolved to the IP management address)
>
> Cheers, Massimo
>
> [*]
> /usr/bin/ceph  --connect-timeout=25 --cluster=ceph --name mon.
> --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
>
> On Thu, May 10, 2018 at 12:49 PM, Paul Emmerich 
> wrote:
>
>> Monitors can use only exactly one IP address. ceph-deploy uses some
>> heuristics
>> based on hostname resolution and ceph public addr configuration to guess
>> which
>> one to use during setup. (Which I've always found to be a quite annoying
>> feature.)
>>
>> The mon's IP must be reachable from all ceph daemons and clients, so it
>> should be
>> on your "public" network. Changing the IP of a mon is possible but
>> annoying, it is
>> often easier to remove and then re-add with a new IP (if possible):
>>
>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-
>> mons/#changing-a-monitor-s-ip-address
>>
>>
>> Paul
>>
>> 2018-05-10 12:36 GMT+02:00 Massimo Sgaravatto <
>> massimo.sgarava...@gmail.com>:
>>
>>> I have a ceph cluster that I manually deployed, and now I am trying to
>>> see if I can use ceph-deploy to deploy new nodes (in particular the object
>>> gw).
>>>
>>> The network configuration is the following:
>>>
>>> Each MON node has two network IP: one on a "management network" (not
>>> used for ceph related stuff) and one on a "public network",
>>> The MON daemon listens to on the pub network
>>>
>>> Each OSD node  has three network IPs: one on a "management network" (not
>>> used for ceph related stuff), one on a "public network" and the third one
>>> is an internal network to be used as ceph cluster network (for ceph
>>> internal traffic: replication, recovery, etc)
>>>
>>>
>>> Name resolution works, but names are resolved to the IP address of the
>>> management network.
>>> And it looks like this is a problem. E.g. the following command (used in
>>> ceph-deploy gatherkeys) issued on a MON host (c-mon-02) doesn't work:
>>>
>>> /usr/bin/ceph --verbose --connect-timeout=25 --cluster=ceph --name mon.
>>> --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
>>>
>>> unless I change the name resolution of c-mon-02 to the public address
>>>
>>>
>>> Is it a requirement (at least for ceph-deploy) that the name of each
>>> node of the ceph cluster must be resolved to the public IP address ?
>>>
>>>
>>> Thanks, Massimo
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>> --
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 
>> 81247 München
>> 
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>
>


-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] GDPR encryption at rest

2018-05-10 Thread Vik Tara
On 02/05/18 16:12, David Turner wrote:
> I've heard conflicting opinions if GDPR requires data to be encrypted
> at rest
Encryption both in transit and at rest is part of data protection by
design: it is about making sure that you have control over the data that
you hold/are processing and that if you lose physical control over the
storage medium (at rest) or the communication channel (in transit) that
you do not also have a loss of control (a data breach). Encrypted data,
whether it includes personal data or not, is 'protected' secure data.

GDPR doesn't particularly describe encryption but the ICO guidance does
and in particular

"Where appropriate, you should look to use measures such as
pseudonymisation and encryption."

We're currently working on a Ceph based Document Management System with
object encryption which needs to comply with GDPR for users - and we're
opting for encrypting everything!
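
For the at-rest part at the OSD layer (rather than in the application), a
minimal sketch with dm-crypt, assuming a Luminous ceph-volume that supports
the flag and an illustrative device name:

  ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdb

The dm-crypt keys are then held by the cluster rather than on the disk
itself, so a drive that leaves the data centre is not readable on its own.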

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy: is it a requirement that the name of each node of the ceph cluster must be resolved to the public IP ?

2018-05-10 Thread Massimo Sgaravatto
I configured the "public network" attribute in the ceph configuration file.

But it looks like to me that in the "auth get client.admin" command [*]
issued by ceph-deploy the address of the management network is used (I
guess because c-mon-02 gets resolved to the IP management address)

Cheers, Massimo

[*]
/usr/bin/ceph  --connect-timeout=25 --cluster=ceph --name mon.
--keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin

On Thu, May 10, 2018 at 12:49 PM, Paul Emmerich 
wrote:

> Monitors can use only exactly one IP address. ceph-deploy uses some
> heuristics
> based on hostname resolution and ceph public addr configuration to guess
> which
> one to use during setup. (Which I've always found to be a quite annoying
> feature.)
>
> The mon's IP must be reachable from all ceph daemons and clients, so it
> should be
> on your "public" network. Changing the IP of a mon is possible but
> annoying, it is
> often easier to remove and then re-add with a new IP (if possible):
>
> http://docs.ceph.com/docs/master/rados/operations/add-
> or-rm-mons/#changing-a-monitor-s-ip-address
>
>
> Paul
>
> 2018-05-10 12:36 GMT+02:00 Massimo Sgaravatto <
> massimo.sgarava...@gmail.com>:
>
>> I have a ceph cluster that I manually deployed, and now I am trying to
>> see if I can use ceph-deploy to deploy new nodes (in particular the object
>> gw).
>>
>> The network configuration is the following:
>>
>> Each MON node has two network IP: one on a "management network" (not used
>> for ceph related stuff) and one on a "public network",
>> The MON daemon listens to on the pub network
>>
>> Each OSD node  has three network IPs: one on a "management network" (not
>> used for ceph related stuff), one on a "public network" and the third one
>> is an internal network to be used as ceph cluster network (for ceph
>> internal traffic: replication, recovery, etc)
>>
>>
>> Name resolution works, but names are resolved to the IP address of the
>> management network.
>> And it looks like this is a problem. E.g. the following command (used in
>> ceph-deploy gatherkeys) issued on a MON host (c-mon-02) doesn't work:
>>
>> /usr/bin/ceph --verbose --connect-timeout=25 --cluster=ceph --name mon.
>> --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
>>
>> unless I change the name resolution of c-mon-02 to the public address
>>
>>
>> Is it a requirement (at least for ceph-deploy) that the name of each node
>> of the ceph cluster must be resolved to the public IP address ?
>>
>>
>> Thanks, Massimo
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 
> 81247 München
> 
> www.croit.io
> Tel: +49 89 1896585 90
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy: is it a requirement that the name of each node of the ceph cluster must be resolved to the public IP ?

2018-05-10 Thread Paul Emmerich
Monitors can use only exactly one IP address. ceph-deploy uses some
heuristics
based on hostname resolution and ceph public addr configuration to guess
which
one to use during setup. (Which I've always found to be a quite annoying
feature.)

The mon's IP must be reachable from all ceph daemons and clients, so it
should be
on your "public" network. Changing the IP of a mon is possible but
annoying, it is
often easier to remove and then re-add with a new IP (if possible):

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address
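
A minimal sketch of that remove/re-add route (host name and IP are taken
from this thread purely as an example, and a ceph-deploy managed mon is
assumed):

  ceph-deploy mon destroy c-mon-02
  # update "mon addr" / "mon host" for c-mon-02 in ceph.conf, then:
  ceph-deploy mon add c-mon-02 --address 10.70.42.9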


Paul

2018-05-10 12:36 GMT+02:00 Massimo Sgaravatto 
:

> I have a ceph cluster that I manually deployed, and now I am trying to see
> if I can use ceph-deploy to deploy new nodes (in particular the object gw).
>
> The network configuration is the following:
>
> Each MON node has two network IP: one on a "management network" (not used
> for ceph related stuff) and one on a "public network",
> The MON daemon listens to on the pub network
>
> Each OSD node  has three network IPs: one on a "management network" (not
> used for ceph related stuff), one on a "public network" and the third one
> is an internal network to be used as ceph cluster network (for ceph
> internal traffic: replication, recovery, etc)
>
>
> Name resolution works, but names are resolved to the IP address of the
> management network.
> And it looks like this is a problem. E.g. the following command (used in
> ceph-deploy gatherkeys) issued on a MON host (c-mon-02) doesn't work:
>
> /usr/bin/ceph --verbose --connect-timeout=25 --cluster=ceph --name mon.
> --keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin
>
> unless I change the name resolution of c-mon-02 to the public address
>
>
> Is it a requirement (at least for ceph-deploy) that the name of each node
> of the ceph cluster must be resolved to the public IP address ?
>
>
> Thanks, Massimo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy: is it a requirement that the name of each node of the ceph cluster must be resolved to the public IP ?

2018-05-10 Thread Massimo Sgaravatto
I have a ceph cluster that I manually deployed, and now I am trying to see
if I can use ceph-deploy to deploy new nodes (in particular the object gw).

The network configuration is the following:

Each MON node has two network IPs: one on a "management network" (not used
for ceph related stuff) and one on a "public network".
The MON daemon listens on the public network.

Each OSD node  has three network IPs: one on a "management network" (not
used for ceph related stuff), one on a "public network" and the third one
is an internal network to be used as ceph cluster network (for ceph
internal traffic: replication, recovery, etc)


Name resolution works, but names are resolved to the IP address of the
management network.
And it looks like this is a problem. E.g. the following command (used in
ceph-deploy gatherkeys) issued on a MON host (c-mon-02) doesn't work:

/usr/bin/ceph --verbose --connect-timeout=25 --cluster=ceph --name mon.
--keyring=/var/lib/ceph/mon/ceph-c-mon-02/keyring auth get client.admin

unless I change the name resolution of c-mon-02 to the public address


Is it a requirement (at least for ceph-deploy) that the name of each node
of the ceph cluster must be resolved to the public IP address ?


Thanks, Massimo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to normally expand OSD’s capacity?

2018-05-10 Thread Paul Emmerich
You usually don't do that because you are supposed to use the whole disk.
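
i.e. a new bluestore OSD is normally given the whole device, something like
(the device name is only an example):

  ceph-volume lvm create --bluestore --data /dev/sdb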


Paul

2018-05-10 12:31 GMT+02:00 Yi-Cian Pu :

> Hi All,
>
>
>
> We are wondering if there is any way to expand OSD’s capacity. We are
> studying about this and conducted an experiment. However, in the result,
> the size of expanded capacity is counted on the USED part rather than the
> AVAIL one. The following shows the process of our experiment:
>
>
>
> 1.   We prepare a small cluster of luminous v12.2.4 and write some
> data into pool. The osd.1 is manually deployed and it uses a disk partition
> of size 100GB (the whole disk size is 320GB).
>
> =
>
> [root@workstation /]# ceph osd df
>
> ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
>
>  0   hdd 0.28999  1.0 297G 27062M   271G  8.89 0.67  32
>
>  1   hdd 0.0  1.0 100G 27062M 76361M 26.17 1.97  32
>
> TOTAL 398G 54125M   345G 13.27
>
> MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63
>
> =
>
>
>
> 2.   Then, we expand the disk partition used by osd.1 by the
> following steps:
>
> (1) Stop osd.1 daemon
>
> (2) Use “parted” command to expand 50GB of the disk partition.
>
> (3) Restart osd.1 daemon
>
>
>
> 3.   After we do the above steps, we have the result that the
> expanded size is counted on USED part.
>
> =
>
> [root@workstation /]# ceph osd df
>
> ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
>
>  0   hdd 0.28999  1.0 297G 27063M   271G  8.89 0.39  32
>
>  1   hdd 0.0  1.0 150G 78263M 76360M 50.62 2.21  32
>
> TOTAL 448G   102G   345G 22.94
>
> MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95
>
> =
>
>
>
> This is what we have tried, and the result looks very confusing. We’d
> really want to
>
> know if there is any way to normally expand OSD’s capacity. Any feedback
> or suggestions would be much appreciated.
>
>
>
> Regards,
> Yi-Cian
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to normally expand OSD’s capacity?

2018-05-10 Thread Yi-Cian Pu
Hi All,



We are wondering if there is any way to expand OSD’s capacity. We are
studying about this and conducted an experiment. However, in the result,
the size of expanded capacity is counted on the USED part rather than the
AVAIL one. The following shows the process of our experiment:



1.   We prepare a small cluster of luminous v12.2.4 and write some data
into pool. The osd.1 is manually deployed and it uses a disk partition of
size 100GB (the whole disk size is 320GB).

=

[root@workstation /]# ceph osd df

ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS

 0   hdd 0.28999  1.0 297G 27062M   271G  8.89 0.67  32

 1   hdd 0.0  1.0 100G 27062M 76361M 26.17 1.97  32

TOTAL 398G 54125M   345G 13.27

MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63

=



2.   Then, we expand the disk partition used by osd.1 by the following
steps:

(1) Stop osd.1 daemon

(2) Use “parted” command to expand 50GB of the disk partition.

(3) Restart osd.1 daemon



3.   After we do the above steps, we have the result that the expanded
size is counted on USED part.

=

[root@workstation /]# ceph osd df

ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS

 0   hdd 0.28999  1.0 297G 27063M   271G  8.89 0.39  32

 1   hdd 0.0  1.0 150G 78263M 76360M 50.62 2.21  32

TOTAL 448G   102G   345G 22.94

MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95

=



This is what we have tried, and the result looks very confusing. We’d
really want to

know if there is any way to normally expand OSD’s capacity. Any feedback or
suggestions would be much appreciated.



Regards,
Yi-Cian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public network faster than cluster network

2018-05-10 Thread Gandalf Corvotempesta
On Thu, 10 May 2018 at 09:48 Christian Balzer 
wrote:
> Without knowing what your use case is (lots of large reads or writes, or
> the more typical smallish I/Os) it's hard to give specific advice.

99% VM hosting.
Everything else would be negligible and I don't care if not optimized.

> Which would give you 24 servers with up to 20Gb/s per server when both
> switches are working, something that's likely to be very close to 100%
> of the time.

24 servers between hypervisors and storage nodes, right?
Thus, are you saying to split it this way:

switch0.port0 to port 12 as hypervisor, network1
switch0.port13 to 24 as storage, network1

switch1.port0 to port 12 as hypervisor, network2
switch1.port13 to 24 as storage, network2

In this case, with 2 switches I can have a fully redundant network,
but I also need an ISL to aggregate bandwidth.

> That's a very optimistic number, assuming journal/WAL/DB on SSDs _and_ no
> concurrent write activity.
> Since you said hypervisors up there one assumes VMs on RBDs and a mixed
> I/O pattern, saturating your disks with IOPS long before bandwidth becomes
> an issue.

Based on a real use case, how much bandwidth should I expect from 12 SATA
spinning disks (7200rpm) in a mixed workload? Obviously, a purely sequential
read would need about 12 * 100 MB/s * 8 = ~9.6 Gbit/s.

> The biggest argument against the 1GB/s links is the latency as mentioned.

10GbE should have 1/10 of the latency, right?

Now, as I'm evaluating many SDS solutions and Ceph is, on paper, the most
expensive in terms of required hardware: what would you suggest for a small
(scalable) storage setup, starting with just 3 storage servers (12 disks
each but not fully populated), 1x 16-port 10GBaseT switch, (many) 24-port
gigabit switches and about 5 hypervisor servers?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public network faster than cluster network

2018-05-10 Thread Christian Balzer

Hello,

On Thu, 10 May 2018 07:24:20 + Gandalf Corvotempesta wrote:

> 
> > Lastly, more often than not segregated networks are not needed, add
> > unnecessary complexity and the resources spent on them would be better
> > used to have just one fast and redundant network instead.  
> 
> Biggest concern here is that I don't have enough 10Gbe ports/switches (due
> to their cost),
> thus having 4 10GbE (for a fully redundant environment) is not possible and
> our current switches
> are only with 16 10GBe ports.
> 
Without knowing what your use case is (lots of large reads or writes, or
the more typical smallish I/Os) it's hard to give specific advice. 

In general, switches that support MC-LAG (often called stacked switches,
V-LAG) are preferable, giving you 2x bandwidth _and_ redundancy.
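
As a concrete sketch, the host side of an LACP bond towards such a switch
pair would look roughly like this (ifupdown syntax; interface names and the
address are purely illustrative):

  auto bond0
  iface bond0 inet static
      address 10.70.42.21
      netmask 255.255.255.0
      bond-slaves enp3s0f0 enp3s0f1
      bond-mode 802.3ad
      bond-xmit-hash-policy layer3+4
      bond-miimon 100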

> Probably, I can buy 2x 24 10GBe switches and use half ports for public
> network and the other half for cluster
> but this will reduce the environment to only 12 usable ports (so, 12
> "servers" at max, between hypervisors and storages)
> 
As David and I tried to point out, you don't really need separate networks.
Especially not with bonding and MC-LAG (vlag, etc) switches.

Which would give you 24 servers with up to 20Gb/s per server when both
switches are working, something that's likely to be very close to 100%
of the time.

> Our current storage servers are made with 12 slots (not all used) with SATA
> disks, they should provide 12*100MB/s = 1.2 GB/s when reading from all
> disks at once,

That's a very optimistic number, assuming journal/WAL/DB on SSDs _and_ no
concurrent write activity. 
Since you said hypervisors up there one assumes VMs on RBDs and a mixed
I/O pattern, saturating your disks with IOPS long before bandwidth becomes
an issue.

> thus, a 10GB network would be needed, right ? Maybe a dual gigabit port
> bonded together could do the job.
> A single gigabit link would be saturated by a single disk.
> 
> Is my assumption correct ?
>
The biggest argument against the 1GB/s links is the latency as mentioned.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public network faster than cluster network

2018-05-10 Thread Gandalf Corvotempesta
On Thu, 10 May 2018 at 02:30 Christian Balzer 
wrote:
> This cosmic imbalance would clearly lead to the end of the universe.
> Seriously, think it through, what do you _think_ will happen?

I thought it would be as David described:

"For a write on a replicated pool with size 3 the client writes to the
primary osd across the public network and then the primary osd sends the
other 2 copies across the cluster network to the secondary OSDs. So for
writes the public network uses N bandwidth while the cluster use 2N
bandwidth for the replica copies. Seeing as the write isn't acknowledged
until all 3 copies are written it makes no sense to have a faster public
network"

This is exactly what I've imagined

> Lastly, more often than not segregated networks are not needed, add
> unnecessary complexity and the resources spent on them would be better
> used to have just one fast and redundant network instead.

Biggest concern here is that I don't have enough 10Gbe ports/switches (due
to their cost),
thus having 4 10GbE (for a fully redundant environment) is not possible and
our current switches
are only with 16 10GBe ports.

Probably, I can buy 2x 24 10GBe switches and use half ports for public
network and the other half for cluster
but this will reduce the environment to only 12 usable ports (so, 12
"servers" at max, between hypervisors and storages)

Our current storage servers are made with 12 slots (not all used) with SATA
disks, they should provide 12*100MB/s = 1.2 GB/s when reading from all
disks at once,
thus, a 10GB network would be needed, right ? Maybe a dual gigabit port
bonded together could do the job.
A single gigabit link would be saturated by a single disk.

Is my assumption correct ?
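
(Roughly: 12 * 100 MB/s = 1.2 GB/s, or about 9.6 Gbit/s sequential best
case, while a single 1 Gbit/s link carries around 115-120 MB/s - so one
disk can indeed saturate it, and even a 2x1GbE bond tops out near
230-240 MB/s.)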
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com