Re: [ceph-users] hardware requirements for metadata server

2019-05-01 Thread Martin Verges
Hello,

you should put your metadata on a fast (SSD/NVMe) pool. The size depends on
your data, but you can scale it at any time, as you would expect from Ceph.
Maybe just start with 3 SSDs in 3 servers and see how it goes.
For CPU / RAM it's much the same: a few GB is enough for smaller deployments,
and more for bigger ones. Maybe you can provide some insight into your
typical data (size, count, ...), and don't forget that you can also scale by
adding additional active MDS daemons.
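
A rough sketch of the commands involved (pool, filesystem and cache values
are hypothetical examples, not a recommendation for your workload):

# pin the CephFS metadata pool to SSDs via a device-class CRUSH rule
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-rule

# give the MDS more cache if RAM allows (bytes; Mimic+ syntax, otherwise ceph.conf)
ceph config set mds mds_cache_memory_limit 8589934592

# scale out later by allowing a second active MDS
ceph fs set cephfs max_mds 2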

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Wed, 1 May 2019 at 02:08, Manuel Sopena Ballesteros <
manuel...@garvan.org.au> wrote:

> Dear ceph users,
>
>
>
> I would like to ask: does the metadata server need many block devices for
> storage, or does it only need RAM? How could I calculate the amount of
> disks and/or memory needed?
>
>
>
> Thank you very much
>
>
>
>
>
> Manuel Sopena Ballesteros
>
> Big Data Engineer | Kinghorn Centre for Clinical Genomics
>
>
>
> *a:* 384 Victoria Street, Darlinghurst NSW 2010
> *p:* +61 2 9355 5760  |  +61 4 12 123 123
> *e:* manuel...@garvan.org.au
>
>
>


[ceph-users] hardware requirements for metadata server

2019-05-01 Thread Manuel Sopena Ballesteros
Dear Ceph users,

I would like to ask: does the metadata server need many block devices for
storage, or does it only need RAM? How could I calculate the amount of disks
and/or memory needed?

Thank you very much


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Mark Nelson

On 5/1/19 12:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), can you please elaborate on this matter?



FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it has
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.  Transparent huge
pages can potentially have an effect here, both with tcmalloc and with
jemalloc, so it's not certain that switching the allocator will fix it
entirely.



First I would just get the heap stats, and after that I would be very
curious whether disabling transparent huge pages helps. Alternatively,
it's always possible it's a memory leak. :D
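
Something along these lines, as a sketch (the OSD id is just an example):

# dump tcmalloc's view of its own allocations for one OSD
ceph tell osd.0 heap stats

# check, and then disable, transparent huge pages on the OSD host
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled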



Mark



Re: [ceph-users] rbd ssd pool for (windows) vms

2019-05-01 Thread Jason Dillaman
On Wed, May 1, 2019 at 5:00 PM Marc Roos  wrote:
>
>
> Do you need to tell the VMs that they are on an SSD RBD pool? Or do
> Ceph and the libvirt drivers do this automatically for you?

Like discard, any advanced QEMU options would need to be manually specified.
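
As a hedged sketch of what that can look like on a plain QEMU command line
(drive, pool and device names are hypothetical; libvirt would need the
equivalent XML), exposing an RBD-backed disk as non-rotational with discard
enabled:

qemu-system-x86_64 ... \
  -drive file=rbd:ssd-pool/vm-disk,format=raw,if=none,id=drive0,discard=unmap \
  -device virtio-scsi-pci,id=scsi0 \
  -device scsi-hd,drive=drive0,bus=scsi0.0,rotation_rate=1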

> When testing a Nutanix Acropolis virtual install, I had to 'cheat' it by
> adding this
>  
> to make the installer think there was an SSD drive.
>
> I only see 'Thin provisioned drive' mentioned, regardless of whether the VM
> is on an HDD RBD pool or an SSD RBD pool.
>
>



-- 
Jason


[ceph-users] rbd ssd pool for (windows) vms

2019-05-01 Thread Marc Roos


Do you need to tell the VMs that they are on an SSD RBD pool? Or do
Ceph and the libvirt drivers do this automatically for you?

When testing a Nutanix Acropolis virtual install, I had to 'cheat' it by
adding this
 
to make the installer think there was an SSD drive.

I only see 'Thin provisioned drive' mentioned, regardless of whether the VM
is on an HDD RBD pool or an SSD RBD pool.




Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

2019-05-01 Thread Joe Ryner
I think I have figured out the issue.

 POOL     SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 images   28523G               3.0   68779G        1.2441                1000                warn

My images pool holds 28523G at a replication level of 3, and I have a total
of 68779G of raw capacity.

 According to the documentation
http://docs.ceph.com/docs/master/rados/operations/placement-groups/

"*SIZE* is the amount of data stored in the pool. *TARGET SIZE*, if
present, is the amount of data the administrator has specified that
they expect to eventually be stored in this pool. The system uses the
larger of the two values for its calculation.

*RATE* is the multiplier for the pool that determines how much raw storage
capacity is consumed. For example, a 3 replica pool will have a ratio of
3.0, while a k=4,m=2 erasure coded pool will have a ratio of 1.5.

*RAW CAPACITY* is the total amount of raw storage capacity on the OSDs that
are responsible for storing this pool’s (and perhaps other pools’) data.
*RATIO* is the ratio of that total capacity that this pool is consuming
(i.e., ratio = size * rate / raw capacity)."

So ratio = "28523G * 3.0/68779G" = 1.2441x


So I'm oversubscribing by 1.2441x, thus the warning.


But ... looking at #ceph df

POOL     ID  STORED   OBJECTS  USED    %USED  MAX AVAIL

images   3   9.3 TiB  2.82M    28 TiB  57.94  6.7 TiB


I believe the 9.3TiB is the amount I have that is thinly provisioned
vs a fully provisioned 28 TiB?

The raw capacity of the cluster is sitting at about 50% used.


Shouldn't the ratio be the amount STORED (from ceph df) * RATE (from
ceph osd pool autoscale-status) / RAW CAPACITY, since Ceph uses thin
provisioning in rbd?

Otherwise, this ratio will only work for people who don't thin
provision, which goes against what Ceph is doing with rbd:

http://docs.ceph.com/docs/master/rbd/
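
A quick back-of-the-envelope check of that alternative formula, assuming
STORED = 9.3 TiB, RATE = 3.0 and RAW CAPACITY = 68779G:

# 9.3 TiB * 3.0 / 68779 G -> well below 1.0, i.e. no overcommit warning
python3 -c 'print(9.3 * 1024 * 3.0 / 68779)'   # ~0.415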





On Wed, May 1, 2019 at 11:44 AM Joe Ryner  wrote:

> I have found a little more information.
> When I turn off pg_autoscaler the warning goes away; turn it back on and
> the warning comes back.
>
> I have run the following:
> # ceph osd pool autoscale-status
>  POOL      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
>  images    28523G               3.0   68779G        1.2441                1000                warn
>  locks     676.5M               3.0   68779G        0.0000                8                   warn
>  rbd       0                    3.0   68779G        0.0000                8                   warn
>  data      0                    3.0   68779G        0.0000                8                   warn
>  metadata  3024k                3.0   68779G        0.0000                8                   warn
>
> # ceph df
> RAW STORAGE:
> CLASS  SIZE    AVAIL    USED     RAW USED  %RAW USED
> hdd    51 TiB  26 TiB   24 TiB   24 TiB    48.15
> ssd    17 TiB  8.5 TiB  8.1 TiB  8.1 TiB   48.69
> TOTAL  67 TiB  35 TiB   32 TiB   32 TiB    48.28
>
> POOLS:
> POOL      ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> data      0   0 B      0        0 B      0      6.7 TiB
> metadata  1   6.3 KiB  21       3.0 MiB  0      6.7 TiB
> rbd       2   0 B      2        0 B      0      6.7 TiB
> images    3   9.3 TiB  2.82M    28 TiB   57.94  6.7 TiB
> locks     4   215 MiB  517      677 MiB  0      6.7 TiB
>
>
> It looks to me like the autoscale-status is getting the images pool
> wrong.
>
> Below is an osd crush tree:
> # ceph osd crush tree
> ID  CLASS WEIGHT   (compat) TYPE NAME
>  -1   66.73337  root default
>  -3   22.28214 22.28214 rack marack
>  -8    7.27475  7.27475 host abacus
>  19   hdd  1.81879  1.81879 osd.19
>  20   hdd  1.81879  1.42563 osd.20
>  21   hdd  1.81879  1.81879 osd.21
>  50   hdd  1.81839  1.81839 osd.50
> -10    7.76500  6.67049 host gold
>   7   hdd  0.86299  0.83659 osd.7
>   9   hdd  0.86299  0.78972 osd.9
>  10   hdd  0.86299  0.72031 osd.10
>  14   hdd  0.86299  0.65315 osd.14
>  15   hdd  0.86299  0.72586 osd.15
>  22   hdd  0.86299  0.80528 osd.22
>  23   hdd  0.86299  0.63741 osd.23
>  24   hdd  0.86299  0.77718 osd.24
>  25   hdd  0.86299  0.72499 osd.25
>  -5    7.24239  7.24239 host hassium
>   0   hdd  1.80800  1.52536 osd.0
>   1   hdd  1.80800  1.65421 osd.1
>  26   hdd  1.80800  1.65140 osd.26
>  51   hdd  1.81839  1.81839 osd.51
>  -2   21.30070 21.30070 rack marack2
> -12    7.76999  8.14474 host hamms
>  27   ssd  0.86299  0.99367 osd.27
>  28   ssd  0.86299  0.95961 osd.28
>  

Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

2019-05-01 Thread Joe Ryner
I have found a little more information.
When I turn off pg_autoscaler the warning goes away; turn it back on and the
warning comes back.

I have run the following:
# ceph osd pool autoscale-status
 POOL      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
 images    28523G               3.0   68779G        1.2441                1000                warn
 locks     676.5M               3.0   68779G        0.0000                8                   warn
 rbd       0                    3.0   68779G        0.0000                8                   warn
 data      0                    3.0   68779G        0.0000                8                   warn
 metadata  3024k                3.0   68779G        0.0000                8                   warn

# ceph df
RAW STORAGE:
CLASS  SIZE    AVAIL    USED     RAW USED  %RAW USED
hdd    51 TiB  26 TiB   24 TiB   24 TiB    48.15
ssd    17 TiB  8.5 TiB  8.1 TiB  8.1 TiB   48.69
TOTAL  67 TiB  35 TiB   32 TiB   32 TiB    48.28

POOLS:
POOL      ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
data      0   0 B      0        0 B      0      6.7 TiB
metadata  1   6.3 KiB  21       3.0 MiB  0      6.7 TiB
rbd       2   0 B      2        0 B      0      6.7 TiB
images    3   9.3 TiB  2.82M    28 TiB   57.94  6.7 TiB
locks     4   215 MiB  517      677 MiB  0      6.7 TiB


It looks to me like the autoscale-status is getting the images pool wrong.

Below is an osd crush tree:
# ceph osd crush tree
ID  CLASS WEIGHT   (compat) TYPE NAME
 -1   66.73337  root default
 -3   22.28214 22.28214 rack marack
 -8    7.27475  7.27475 host abacus
 19   hdd  1.81879  1.81879 osd.19
 20   hdd  1.81879  1.42563 osd.20
 21   hdd  1.81879  1.81879 osd.21
 50   hdd  1.81839  1.81839 osd.50
-10    7.76500  6.67049 host gold
  7   hdd  0.86299  0.83659 osd.7
  9   hdd  0.86299  0.78972 osd.9
 10   hdd  0.86299  0.72031 osd.10
 14   hdd  0.86299  0.65315 osd.14
 15   hdd  0.86299  0.72586 osd.15
 22   hdd  0.86299  0.80528 osd.22
 23   hdd  0.86299  0.63741 osd.23
 24   hdd  0.86299  0.77718 osd.24
 25   hdd  0.86299  0.72499 osd.25
 -5    7.24239  7.24239 host hassium
  0   hdd  1.80800  1.52536 osd.0
  1   hdd  1.80800  1.65421 osd.1
 26   hdd  1.80800  1.65140 osd.26
 51   hdd  1.81839  1.81839 osd.51
 -2   21.30070 21.30070 rack marack2
-12    7.76999  8.14474 host hamms
 27   ssd  0.86299  0.99367 osd.27
 28   ssd  0.86299  0.95961 osd.28
 29   ssd  0.86299  0.80768 osd.29
 30   ssd  0.86299  0.86893 osd.30
 31   ssd  0.86299  0.92583 osd.31
 32   ssd  0.86299  1.00227 osd.32
 33   ssd  0.86299  0.73099 osd.33
 34   ssd  0.86299  0.80766 osd.34
 35   ssd  0.86299  1.04811 osd.35
 -7    5.45636  5.45636 host parabola
  5   hdd  1.81879  1.81879 osd.5
 12   hdd  1.81879  1.81879 osd.12
 13   hdd  1.81879  1.81879 osd.13
 -6    2.63997  3.08183 host radium
  2   hdd  0.87999  1.05594 osd.2
  6   hdd  0.87999  1.10501 osd.6
 11   hdd  0.87999  0.92088 osd.11
 -9    5.43439  5.43439 host splinter
 16   hdd  1.80800  1.80800 osd.16
 17   hdd  1.81839  1.81839 osd.17
 18   hdd  1.80800  1.80800 osd.18
-11   23.15053 23.15053 rack marack3
-13    8.63300  8.98921 host helm
 36   ssd  0.86299  0.71931 osd.36
 37   ssd  0.86299  0.92601 osd.37
 38   ssd  0.86299  0.79585 osd.38
 39   ssd  0.86299  1.08521 osd.39
 40   ssd  0.86299  0.89500 osd.40
 41   ssd  0.86299  0.92351 osd.41
 42   ssd  0.86299  0.89690 osd.42
 43   ssd  0.86299  0.92480 osd.43
 44   ssd  0.86299  0.84467 osd.44
 45   ssd  0.86299  0.97795 osd.45
-40    7.27515  7.89609 host samarium
 46   hdd  1.81879  1.90242 osd.46
 47   hdd  1.81879  1.86723 osd.47
 48   hdd  1.81879  1.93404 osd.48
 49   hdd  1.81879  2.19240 osd.49
 -4    7.24239  7.24239 host scandium
  3   hdd  1.80800  1.76680 osd.3
  4   hdd  1.80800  1.80800 osd.4
  8   hdd  1.80800  1.80800 osd.8
 52   hdd  1.81839  1.81839 osd.52


Any ideas?





On Wed, May 1, 2019 at 9:32 AM Joe Ryner  wrote:

> Hi,
>
> I have an old ceph cluster and have upgraded recently from Luminous to
> Nautilus.  After converting 

Re: [ceph-users] HEALTH_WARN - 3 modules have failed dependencies

2019-05-01 Thread Ranjan Ghosh

Ah, after researching some more I think I got hit by this bug:

https://github.com/ceph/ceph/pull/25585

At least that's exactly what I see in the logs: "Interpreter change 
detected - this module can only be loaded into one interpreter per process."


Ceph modules don't seem to work at all with the newest Ubuntu version. 
Only one module can be loaded. Sad :-(


Hope this will be fixed soon...


On 30.04.19 21:18, Ranjan Ghosh wrote:


Hi my beloved Ceph list,

After an upgrade from Ubuntu Cosmic to Ubuntu Disco (and the corresponding
Ceph packages updated from 13.2.2 to 13.2.4), I now get this when I
enter "ceph health":


HEALTH_WARN 3 modules have failed dependencies

"ceph mgr module ls" only reports those 3 modules enabled:

"enabled_modules": [
    "dashboard",
    "restful",
    "status"
    ],
...

Then I found this page here:

docs.ceph.com/docs/master/rados/operations/health-checks

Under "MGR_MODULE_DEPENDENCY" it says:

"An enabled manager module is failing its dependency check. This 
health check should come with an explanatory message from the module 
about the problem."


What is "this health check"? If the page talks about "ceph health" or 
"ceph -s" then, no, there is no explanatory message there on what's wrong.


Furthermore, it says:

"This health check is only applied to enabled modules. If a module is 
not enabled, you can see whether it is reporting dependency issues in 
the output of ceph module ls."


The command "ceph module ls", however, doesn't exist. If "ceph mgr 
module ls" is really meant, then I get this:


{
    "enabled_modules": [
    "dashboard",
    "restful",
    "status"
    ],
    "disabled_modules": [
    {
    "name": "balancer",
    "can_run": true,
    "error_string": ""
    },
    {
    "name": "hello",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "influx",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "iostat",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "localpool",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "prometheus",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "selftest",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "smart",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "telegraf",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "telemetry",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "zabbix",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    }
    ]
}

Usually the Ceph documentation is great, very detailed and helpful. 
But I can find nothing on how to resolve this problem. Any help is 
much appreciated.


Thank you / Best regards

Ranjan







[ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

2019-05-01 Thread Joe Ryner
Hi,

I have an old ceph cluster and have upgraded recently from Luminous to
Nautilus.  After converting to Nautilus I decided it was time to convert to
bluestore.

Before I converted the cluster was healthy but after I have a HEALTH_WARN

#ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1
subtrees have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool
target_size_bytes
Pools ['data', 'metadata', 'rbd', 'images', 'locks'] overcommit
available storage by 1.244x due to target_size_bytes 0 on pools []
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool
target_size_ratio
Pools ['data', 'metadata', 'rbd', 'images', 'locks'] overcommit
available storage by 1.244x due to target_size_ratio 0.000 on pools []

I started with a target_size_ratio of 0.85 on the images pool and reduced it
to 0 in the hope that the warning would go away.  The cluster seems to be
running fine; I just can't figure out what the problem is and how to make
the message go away.  I restarted the monitors this morning in hopes of
fixing it.  Does anyone have any ideas?
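
For reference, a sketch of the per-pool knobs involved (pool name is just an
example), which can be set back to zero, or the autoscaler disabled per pool
while debugging:

ceph osd pool set images target_size_ratio 0
ceph osd pool set images target_size_bytes 0
ceph osd pool set images pg_autoscale_mode off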

Thanks in advance


-- 
Joe Ryner
Associate Director
Center for the Application of Information Technologies (CAIT) -
http://www.cait.org
Western Illinois University - http://www.wiu.edu


P: (309) 298-1804
F: (309) 298-2806


Re: [ceph-users] Inodes on /cephfs

2019-05-01 Thread Oliver Freyermuth
Dear Yury,

On 01.05.19 08:07, Yury Shevchuk wrote:
> cephfs is not alone in this; there are other inode-less filesystems
> around.  They all go with zeroes:
> 
> # df -i /nfs-dir
> Filesystem  Inodes IUsed IFree IUse% Mounted on
> xxx.xxx.xx.x:/xxx/xxx/x  0 0 0 - /xxx
> 
> # df -i /reiserfs-dir
> FilesystemInodes   IUsed   IFree IUse% Mounted on
> /xxx//x0   0   0-  /xxx/xxx//x
> 
> # df -i /btrfs-dir
> Filesystem   Inodes IUsed IFree IUse% Mounted on
> /xxx/xx/  0 0 0 - /

you are right, thanks for pointing me to these examples! 

> 
> Would YUM refuse to install on them all, including mainstream btrfs?
> I doubt that.  Perhaps YUM is confused by the Inodes count that
> cephfs (alone!) reports as non-zero.  Look at the YUM sources?

Indeed, Yum works on all these file systems. 
Here's the place in the sources:
https://github.com/rpm-software-management/rpm/blob/6913360d66510e60d7b6399cd338425d663a051b/lib/transaction.c#L172
That's actually in RPM, since Yum calls RPM and the complaint comes from RPM. 

Reading the sources, they just interpret the results from the statfs call. If a
file system reports:
sfb.f_ffree == 0 && sfb.f_files == 0
i.e. zero free and zero total inodes, then it's assumed the file system has no
notion of inodes, and the check is disabled.
However, since CephFS reports something non-zero for the total count (f_files),
RPM assumes it has a notion of Inodes, and a check should be performed.
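
For illustration, the same statfs fields RPM looks at can be inspected from a
shell (mount point is just an example; %c is f_files, %d is f_ffree):

stat -f --format='total inodes: %c  free inodes: %d' /cephfs
df -i /cephfs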

So indeed, another solution would be to change f_files to also report 0, as all 
other file systems without actual inodes seem to do. 
That would (in my opinion) also be more correct than what is currently done, 
since reporting something non-zero as f_files but zero as f_free
from a logical point of view seems "full". 
Even df shows a more useful output with both being zero - it just shows a 
"dash", highlighting that this is not information to be monitored. 

What do you think? 

Cheers,
Oliver

> 
> 
> -- Yury
> 
> On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote:
>> Am 01.05.19 um 00:51 schrieb Patrick Donnelly:
>>> On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
>>>  wrote:

 Dear Cephalopodians,

 we have a classic libvirtd / KVM based virtualization cluster using 
 Ceph-RBD (librbd) as backend and sharing the libvirtd configuration 
 between the nodes via CephFS
 (all on Mimic).

 To share the libvirtd configuration between the nodes, we have symlinked 
 some folders from /etc/libvirt to their counterparts on /cephfs,
 so all nodes see the same configuration.
 In general, this works very well (of course, there's a "gotcha": Libvirtd 
 needs reloading / restart for some changes to the XMLs, we have automated 
 that),
 but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
 Whenever there's a libvirtd update, unattended upgrades fail, and we see:

Transaction check error:
  installing package 
 libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on 
 the /cephfs filesystem
  installing package 
 libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
 the /cephfs filesystem

 So it seems yum follows the symlinks and checks the available inodes on 
 /cephfs. Sadly, that reveals:
[root@kvm001 libvirt]# LANG=C df -i /cephfs/
Filesystem Inodes IUsed IFree IUse% Mounted on
    ceph-fuse      68    68     0  100% /cephfs

 I think that's just because there is no real "limit" on the maximum inodes 
 on CephFS. However, returning 0 breaks some existing tools (notably, Yum).

 What do you think? Should CephFS return something different than 0 here to 
 not break existing tools?
 Or should the tools behave differently? But one might also argue that if 
 the total number of Inodes matches the used number of Inodes, the FS is 
 indeed "full".
 It's just unclear to me who to file a bug against ;-).

 Right now, I am just using:
 yum -y --setopt=diskspacecheck=0 update
 as a manual workaround, but this is naturally rather cumbersome.
>>>
>>> This is fallout from [1]. See discussion on setting f_free to 0 here
>>> [2]. In summary, userland tools are trying to be too clever by looking
>>> at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
>>> there are other instances of this.]
>>>
>>> [1] https://github.com/ceph/ceph/pull/23323
>>> [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911
>>
>> Thanks for the references! That certainly enlightens me on why this decision 
>> was taken, and of course I congratulate upon trying to prevent false 
>> monitoring. 
>> Still, even though I don't have other instances at hand (yet), I am not yet 
>> convinced "0" 

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-05-01 Thread Jan Pekař - Imatic

Today the problem reappeared.

Restarting the mon helps, but it does not solve the issue.

Is there any way to debug this? Can I dump these keys from the MON, from the OSDs or
other components? Can I debug the key exchange?
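
(A sketch of what raising the auth debug level would look like, in case that
is the right direction -- daemon ids are hypothetical and the commands go
through the local admin socket on each host:)

ceph daemon osd.0 config set debug_auth 20
ceph daemon mon.$(hostname -s) config set debug_auth 20
ceph time-sync-status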

Thank you

On 27/04/2019 10.56, Jan Pekař - Imatic wrote:


On 26/04/2019 21.50, Gregory Farnum wrote:

On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:

Hi,

yesterday my cluster reported slow requests for minutes, and after restarting the
OSDs (the ones reporting slow requests) it got stuck with peering PGs. The whole
cluster was not responding and IO stopped.

I also noticed that the problem was with cephx - all OSDs were reporting the same
error (even the same secret_id number):

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping work (with cephx enabled)?
How quickly do keys rotate/expire, and can I check for problems with that
somehow?

I'm running NTP on the nodes (and also on the ceph monitors), so time should not
be the issue. I noticed that some monitor nodes have no timezone set,
but I hope MONs use UTC to distribute keys to clients. Or can a different
timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)


I "fixed" the problem by restarting monitors.

It happened for the second time during last 3 months, so I'm reporting it as 
issue, that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.65)

approximately 7 hours before the problem occurred. I can see that it is related to
the issue. But why 7 hours? Is there some timeout or grace
period for old key usage before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating
key is issued every hour, it gives out the current and next key on
request, and daemons accept keys within a half-hour offset of what
they believe the current time to be. Something like that.
-Greg


If it is as you wrote, UTC is not the problem. I've been running the cluster with this configuration for over a year, and there were only 2-3 incidents of
this kind.

I changed the timezone, restarted the services and will wait...
I forgot to mention that I'm running Luminous.

JP


Thank you

With regards

Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Fedotov

Hi Igor,

yeah, BlueStore allocators are absolutely interchangeable. You can 
switch between them for free.
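
A minimal sketch of the switch (set in ceph.conf on the OSD nodes, then
restart the OSDs one at a time):

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap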



Thanks,

Igor


On 5/1/2019 8:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), can you please elaborate on this matter?




Re: [ceph-users] Inodes on /cephfs

2019-05-01 Thread Yury Shevchuk
cephfs is not alone in this; there are other inode-less filesystems
around.  They all go with zeroes:

# df -i /nfs-dir
Filesystem  Inodes IUsed IFree IUse% Mounted on
xxx.xxx.xx.x:/xxx/xxx/x  0 0 0 - /xxx

# df -i /reiserfs-dir
FilesystemInodes   IUsed   IFree IUse% Mounted on
/xxx//x0   0   0-  /xxx/xxx//x

# df -i /btrfs-dir
Filesystem   Inodes IUsed IFree IUse% Mounted on
/xxx/xx/  0 0 0 - /

Would YUM refuse to install on them all, including mainstream btrfs?
I doubt that.  Perhaps YUM is confused by the Inodes count that
cephfs (alone!) reports as non-zero.  Look at the YUM sources?


-- Yury

On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote:
> Am 01.05.19 um 00:51 schrieb Patrick Donnelly:
> > On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
> >  wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> we have a classic libvirtd / KVM based virtualization cluster using 
> >> Ceph-RBD (librbd) as backend and sharing the libvirtd configuration 
> >> between the nodes via CephFS
> >> (all on Mimic).
> >>
> >> To share the libvirtd configuration between the nodes, we have symlinked 
> >> some folders from /etc/libvirt to their counterparts on /cephfs,
> >> so all nodes see the same configuration.
> >> In general, this works very well (of course, there's a "gotcha": Libvirtd 
> >> needs reloading / restart for some changes to the XMLs, we have automated 
> >> that),
> >> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
> >> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
> >>
> >>Transaction check error:
> >>  installing package 
> >> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on 
> >> the /cephfs filesystem
> >>  installing package 
> >> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
> >> the /cephfs filesystem
> >>
> >> So it seems yum follows the symlinks and checks the available inodes on 
> >> /cephfs. Sadly, that reveals:
> >>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
> >>Filesystem Inodes IUsed IFree IUse% Mounted on
> >>ceph-fuse      68    68     0  100% /cephfs
> >>
> >> I think that's just because there is no real "limit" on the maximum inodes 
> >> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
> >>
> >> What do you think? Should CephFS return something different than 0 here to 
> >> not break existing tools?
> >> Or should the tools behave differently? But one might also argue that if 
> >> the total number of Inodes matches the used number of Inodes, the FS is 
> >> indeed "full".
> >> It's just unclear to me who to file a bug against ;-).
> >>
> >> Right now, I am just using:
> >> yum -y --setopt=diskspacecheck=0 update
> >> as a manual workaround, but this is naturally rather cumbersome.
> > 
> > This is fallout from [1]. See discussion on setting f_free to 0 here
> > [2]. In summary, userland tools are trying to be too clever by looking
> > at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
> > there are other instances of this.]
> > 
> > [1] https://github.com/ceph/ceph/pull/23323
> > [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911
> 
> Thanks for the references! That certainly enlightens me on why this decision 
> was taken, and of course I congratulate upon trying to prevent false 
> monitoring. 
> Still, even though I don't have other instances at hand (yet), I am not yet 
> convinced "0" is a better choice than "ULONG_MAX". 
> It certainly alerts users / monitoring software about doing something wrong, 
> but it prevents a check which any file system (or rather, any file system I 
> encountered so far) allows. 
> 
> Yum (or other package managers doing things in a safe manner) need to ensure 
> they can fully install a package in an "atomic" way before doing so,
> since rolling back may be complex or even impossible (for most file systems). 
> So they need a way to check if a file system can store the additional files 
> in terms of space and inodes, before placing the data there,
> or risk installing something only partially, and potentially being unable to 
> roll back. 
> 
> In most cases, the free number of inodes allows for that check. Of course, 
> that has no (direct) meaning for CephFS, so one might argue the tools should 
> add an exception for CephFS - 
> but as the discussion correctly stated, there's no defined way to find out 
> where the file system has a notion of "free inodes", and - if we go for an 
> exceptional treatment for a list of file systems - 
> not even a "clean" way to find out if the file system is CephFS (the tools 
> will only see it is FUSE for ceph-fuse) [1]. 
> 
> So my question is: 
> How are tools which need to ensure that a file system can accept a given 
> number of bytes and inodes before 

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:
> On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
> [..]
> > Any suggestions ?
>
> -- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), can you please elaborate on this matter?

-- 
End of message. Next message?