[ceph-users] How do I setpolicy to deny deletes for a bucket

2019-05-29 Thread Priya Sehgal
I want to deny deletes on one of my buckets, so I tried "s3cmd setpolicy" with
two different configs (JSON files). I do not get any error code, and when I run
getpolicy I see the same JSON I uploaded. However, I am still able to delete
objects present in the bucket. Please let me know where I am going wrong.

Here are the two policy json files:
1. POLICY FILE 1
{
  "Version": "2012-10-17",
  "Statement": [{
"Effect": "Deny",
"Principal": "*",
"Action": "s3:DeleteObject",
"Resource": [
  "arn:aws:s3:::my-new-bucket3/*"
]
  }]
}

2. POLICY FILE 2
{
"Version": "2012-10-17",

"Statement": [

{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:GetObjectAcl",
"s3:PutObjectAcl",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:PutBucketAcl",
"s3:GetBucketLocation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "s3:ListAllMyBuckets",
"Resource": "*"
},
{
"Effect": "Deny",
"Action": [
"s3:DeleteBucket",
"s3:DeleteBucketPolicy",
"s3:DeleteBucketWebsite",
"s3:DeleteObject",
"s3:DeleteObjectVersion"
],
"Resource": "arn:aws:s3:::my-new-bucket3/*"
}
]
}

Command used: s3cmd setpolicy examplepol s3://my-new-bucket3

where the examplepol file contains either (1) or (2) of the policy statements above.
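
For completeness, the exact sequence I run is roughly the following (the object
name is just an example):

s3cmd setpolicy examplepol s3://my-new-bucket3
s3cmd getpolicy s3://my-new-bucket3        # returns the same JSON I uploaded
s3cmd del s3://my-new-bucket3/test.txt     # still succeeds, I expected AccessDenied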

-- 
Regards,
Priya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Using Ceph Ansible to Add Nodes to Cluster at Weight 0

2019-05-29 Thread Mike Cave
Good afternoon,

I’m about to expand my cluster from 380 to 480 OSDs (5 nodes with 20 disks per 
node) and am trying to determine the best way to go about this task.

I deployed the cluster with ceph ansible and everything worked well. So I’d 
like to add the new nodes with ceph ansible as well.

The issue I have is that adding that many OSDs at once will likely cause major 
disruption in the cluster if they come in fully weighted.

I was hoping to use ceph ansible and set the initial weight to zero and then 
gently bring them up to the correct weight for each OSD.
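
Roughly what I have in mind is the following sketch (the override snippet and
the osd id are only placeholders, I have not tested this yet):

# in ceph-ansible group_vars, so new OSDs come up with zero CRUSH weight
ceph_conf_overrides:
  osd:
    osd crush initial weight: 0

# then, once the new OSDs are up and in, raise them step by step, e.g.
ceph osd crush reweight osd.123 0.5    # repeat/increase until full weight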

I will be doing this with a total of 380 OSDs over the next while. My plan is 
to bring in groups of 6 nodes (I have six racks and the map is rack-redundant) 
until all the additions are complete.

In dev I tried bringing in a node while the cluster was in ‘no rebalance’ mode 
and there was still significant movement with some stuck pgs and other oddities 
until I reweighted and then unset ‘no rebalance’.

I’d like as little friction for the cluster as possible, as it is in heavy use 
right now.

I’m running mimic (13.2.5) on CentOS.

Any suggestions on best practices for this?

Thank you for reading and any help you might be able to provide. I’m happy to 
provide any details you might want.

Cheers,
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP object in RGW GC pool

2019-05-29 Thread J. Eric Ivancich
Hi Wido,

When you run `radosgw-admin gc list`, I assume you are *not* using the
"--include-all" flag, right? If you're not using that flag, then
everything listed should be expired and ready for clean-up. If the same
entries still appear in `radosgw-admin gc list` after running `radosgw-admin
gc process`, then gc has apparently stalled.
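
A quick way to sanity-check that is something like the following sketch
(counting the "tag" fields is only a convenient proxy for the number of
pending entries):

radosgw-admin gc list | grep -c '"tag"'    # entries ready for clean-up
radosgw-admin gc process
radosgw-admin gc list | grep -c '"tag"'    # if this does not drop, gc is stuck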

There were a few bugs within gc processing that could prevent it from
making forward progress. They were resolved with a PR (master:
https://github.com/ceph/ceph/pull/26601 ; mimic backport:
https://github.com/ceph/ceph/pull/27796). Unfortunately that code was
backported after the 13.2.5 release, but it is in place for the 13.2.6
release of mimic.

Eric


On 5/29/19 3:19 AM, Wido den Hollander wrote:
> Hi,
> 
> I've got a Ceph cluster with this status:
> 
> health: HEALTH_WARN
> 3 large omap objects
> 
> After looking into it I see that the issue comes from objects in the
> '.rgw.gc' pool.
> 
> Investigating it I found that the gc.* objects have a lot of OMAP keys:
> 
> for OBJ in $(rados -p .rgw.gc ls); do
>   echo $OBJ
>   rados -p .rgw.gc listomapkeys $OBJ|wc -l
> done
> 
> I then found out that on average these objects have about 100k of OMAP
> keys each, but two stand out and have about 3M OMAP keys.
> 
> I can list the GC with 'radosgw-admin gc list' and this yields a JSON
> which is a couple of MB in size.
> 
> I ran:
> 
> $ radosgw-admin gc process
> 
> That runs for hours and then finishes, but the large list of OMAP keys
> stays.
> 
> Running Mimic 13.3.5 on this cluster.
> 
> Has anybody seen this before?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha with rados_kv backend

2019-05-29 Thread Jeff Layton
On Wed, 2019-05-29 at 13:49 +, Stolte, Felix wrote:
> Hi,
> 
> is anyone running an active-passive nfs-ganesha cluster with cephfs backend 
> and using the rados_kv recovery backend? My setup runs fine, but takeover is 
> giving me a headache. On takeover I see the following messages in ganeshas 
> log file:
> 

Note that there are significant problems with the rados_kv recovery
backend. In particular, it does not properly handle the case where the
server crashes during the grace period. The rados_ng and rados_cluster
backends do handle those situations properly.

> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server 
> Now IN GRACE, duration 5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server 
> recovery event 5 nodeid -1 ip 10.0.0.5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] rados_kv_traverse :CLIENT ID :EVENT :Failed 
> to lst kv ret=-2
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] rados_kv_read_recov_clids_takeover :CLIENT 
> ID :EVENT :Failed to takeover
> 29/05/2019 15:38:26 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now 
> NOT IN GRACE
> 
> The result is clients hanging for up to 2 Minutes. Has anyone ran into the 
> same problem?
> 
> Ceph Version: 12.2.11
> nfs-ganesha: 2.7.3
> 

If I had to guess, the hanging is probably due to state that is being
held by the other node's MDS session that hasn't expired yet. Ceph v12
doesn't have the client reclaim interfaces that make more instantaneous
failover possible. That's new in v14 (Nautilus). See pages 12 and 13
here:

https://static.sched.com/hosted_files/cephalocon2019/86/Rook-Deployed%20NFS%20Clusters%20over%20CephFS.pdf

> ganesha.conf (identical on both nodes besides nodeid in rados_kv:
> 
> NFS_CORE_PARAM {
> Enable_RQUOTA = false;
> Protocols = 3,4;
> }
> 
> CACHEINODE {
> Dir_Chunk = 0;
> NParts = 1;
> Cache_Size = 1;
> }
> 
> NFS_krb5 {
> Active_krb5 = false;
> }
> 
> NFSv4 {
> Only_Numeric_Owners = true;
> RecoveryBackend = rados_kv;
> Grace_Period = 5;
> Lease_Lifetime = 5;

Yikes! That's _way_ too short a grace period and lease lifetime. Ganesha
will probably exit the grace period before the clients ever realize the
server has restarted, and they will fail to reclaim their state.

> Minor_Versions = 1,2;
> }
> 
> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "ganesha";
> pool = "cephfs_metadata";
> namespace = "ganesha";
> nodeid = "cephgw-k2-1";
> }
> 
> Any hint would be appreciated.

I consider ganesha's dbus-based takeover mechanism to be broken by
design, as it requires the recovery backend to do things that can't be
done atomically. If a crash occurs at the wrong time, the recovery
database can end up trashed and no one can reclaim anything.

If you really want an active/passive setup then I'd move away from that
and just have whatever clustering software you're using start up the
daemon on the active node after ensuring that it's shut down on the
passive one. With that, you can also use the rados_ng recovery backend,
which is more resilient in the face of multiple crashes.

In that configuration you would want to have the same config file on
both nodes, including the same nodeid so that you can potentially take
advantage of the RECLAIM_RESET interface to kill off the old session
quickly after the server restarts.

You also need a much longer grace period.
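
For illustration only, the NFSv4 block might then look something like this
(the 90/60 values are roughly ganesha's defaults, not tuned recommendations):

NFSv4 {
        Only_Numeric_Owners = true;
        RecoveryBackend = rados_ng;
        Grace_Period = 90;
        Lease_Lifetime = 60;
        Minor_Versions = 1,2;
}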

Cheers,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Gregory Farnum
These OSDs are far too small at only 10GiB for the balancer to try and
do any work. It's not uncommon for metadata like OSDMaps to exceed
that size in error states and in any real deployment a single PG will
be at least that large.
There are probably parameters you can tweak to try and make it work,
but I wouldn't bother since the behavior will be nothing like what
you'd see in anything of size.
-Greg

On Wed, May 29, 2019 at 8:52 AM Tarek Zegar  wrote:
>
> Can anyone help with this? Why can't I optimize this cluster, the pg counts 
> and data distribution is way off.
> __
>
> I enabled the balancer plugin and even tried to manually invoke it but it 
> won't allow any changes. Looking at ceph osd df, it's not even at all. 
> Thoughts?
>
> root@hostadmin:~# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
> 6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
> 0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
> 7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
> 2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
> 4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
> 8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
> TOTAL 90 GiB 53 GiB 37 GiB 72.93
> MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67
>
>
> root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
> osdmaptool: osdmap file 'om'
> writing upmap command output to: out.txt
> checking for upmap cleanups
> upmap, max-count 100, max deviation 0.01 <---really? It's not even close to 
> 1% across the drives
> limiting to pools rbd (1)
> no upmaps proposed
>
>
> ceph balancer optimize myplan
> Error EALREADY: Unable to find further optimization,or distribution is 
> already perfect
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth
Hi Tarek,

that's good news, glad my hunch was correct :-). 

Am 29.05.19 um 19:31 schrieb Tarek Zegar:
> Hi Oliver
> 
> Here is the output of the active mgr log after I toggled balancer off / on. I 
> grep'd out only "balancer" as it was far too verbose (see below). When I look 
> at ceph osd df I see it optimized :)
> I would like to understand two things, however: why is "prepared 0/10 changes" 
> zero if it actually did something, and what in the log can I look for, before I 
> toggled it, that said basically "hey balancer isn't going to work because I 
> still think min-client-compact-level < luminous"

I can sadly not answer the first question, maybe somebody else on the list can 
- but I can at least answer the second one. Since I did not remember the exact 
wording of the message we saw October last year,
I checked the sources:
https://github.com/ceph/ceph/blob/5111f6df16b106e4e7105e88b88c6eeceb770c4f/src/pybind/mgr/balancer/module.py#L420
So you should find something like:
  min_compat_client "%s" < "luminous", which is required for pg-upmap. Try 
"ceph osd set-require-min-compat-client luminous" before enabling this mode
in the mgr log. So the message by itself is very helpful, it's just very hidden 
in the mgr logs ;-). 

The "prepared x/y changes" message is also generated here:
https://github.com/ceph/ceph/blob/5111f6df16b106e4e7105e88b88c6eeceb770c4f/src/pybind/mgr/balancer/module.py#L940
but I do not understand why it shows 0 in your case. Maybe somebody else on 
this list can explain ;-). 

Cheers,
Oliver

> 
> Thanks for helping me in getting this working!
> 
> 
> 
> root@hostmonitor1:/var/log/ceph# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 3 hdd 0.00980 1.0 10 GiB 5.3 GiB 4.7 GiB 53.25 0.97 150
> 6 hdd 0.00980 1.0 10 GiB 5.6 GiB 4.4 GiB 56.07 1.03 150
> 0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 5 hdd 0.00980 1.0 10 GiB 5.7 GiB 4.3 GiB 56.97 1.04 151
> 7 hdd 0.00980 1.0 10 GiB 5.2 GiB 4.8 GiB 52.35 0.96 149
> 2 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 4 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.25 1.01 150
> 8 hdd 0.00980 1.0 10 GiB 5.4 GiB 4.6 GiB 54.07 0.99 150
> TOTAL 70 GiB 34 GiB 36 GiB 54.66
> MIN/MAX VAR: 0.96/1.04 STDDEV: 1.60
> 
> 
> 2019-05-29 17:06:49.324 7f40ce42a700 0 log_channel(audit) log [DBG] : 
> from='client.11262 192.168.0.12:0/4104979884' entity='client.admin' 
> cmd=[{"prefix": "balancer off", "target": ["mgr", ""]}]: dispatch
> *2019-05-29 17:06:49.324 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer status'*
> *2019-05-29 17:06:49.324 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer mode'*
> *2019-05-29 17:06:49.324 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer on'*
> *2019-05-29 17:06:49.324 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer off'*
> 2019-05-29 17:06:49.324 7f40cec2b700 1 mgr[balancer] Handling command: 
> '{'prefix': 'balancer off', 'target': ['mgr', '']}'
> 2019-05-29 17:06:49.388 7f40d747a700 4 mgr[py] Loaded module_config entry 
> mgr/balancer/max_misplaced:.50
> 2019-05-29 17:06:49.388 7f40d747a700 4 mgr[py] Loaded module_config entry 
> mgr/balancer/mode:upmap
> 2019-05-29 17:06:49.539 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/active
> 2019-05-29 17:06:49.539 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/begin_time
> 2019-05-29 17:06:49.539 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/end_time
> 2019-05-29 17:06:49.539 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/sleep_interval
> 2019-05-29 17:06:54.279 7f40ce42a700 4 mgr.server handle_command 
> prefix=balancer on
> 2019-05-29 17:06:54.279 7f40ce42a700 0 log_channel(audit) log [DBG] : 
> from='client.11268 192.168.0.12:0/1339099349' entity='client.admin' 
> cmd=[{"prefix": "balancer on", "target": ["mgr", ""]}]: dispatch
> *2019-05-29 17:06:54.279 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer status'*
> *2019-05-29 17:06:54.279 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer mode'*
> *2019-05-29 17:06:54.279 7f40ce42a700 1 mgr.server handle_command pyc_prefix: 
> 'balancer on'*
> 2019-05-29 17:06:54.279 7f40cec2b700 1 mgr[balancer] Handling command: 
> '{'prefix': 'balancer on', 'target': ['mgr', '']}'
> 2019-05-29 17:06:54.287 7f40d747a700 4 mgr[py] Loaded module_config entry 
> mgr/balancer/active:1
> 2019-05-29 17:06:54.287 7f40d747a700 4 mgr[py] Loaded module_config entry 
> mgr/balancer/max_misplaced:.50
> 2019-05-29 17:06:54.287 7f40d747a700 4 mgr[py] Loaded module_config entry 
> mgr/balancer/mode:upmap
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/active
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/begin_time
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: 
> mgr/balancer/end_time
> 2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr 

Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Paul Emmerich
On Wed, May 29, 2019 at 11:37 AM Robert Sander 
wrote:

> Hi,
>
> Am 29.05.19 um 11:19 schrieb Martin Verges:
> >
> > We have identified the performance settings in the BIOS as a major
> > factor
> >
> > could you share your insights what options you changed to increase
> > performance and could you provide numbers to it?
>
> Most default performance settings nowadays seem to be geared towards
> power savings. This decreases CPU frequencies and does not play well
> with Ceph (and virtualization).
>

Agreed, disabling C states can help, disabling dynamic underclocking can
also help.

No need to do that in the BIOS; setting it via linux-cpupower and similar
tools is enough.

Another thing that can help is this:

net.ipv4.tcp_low_latency=1

But all of these are for getting the last drop of IOPS out when you are already
getting lots of IOPS; it's not something that helps if your disk is only getting
1000 IOPS.
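
For illustration, the kind of thing I mean (exact tools and options depend on
your distro, so treat this only as a sketch):

cpupower frequency-set -g performance    # pin the frequency governor
cpupower idle-set -D 0                   # disable deeper C-states
sysctl -w net.ipv4.tcp_low_latency=1     # the TCP knob mentioned above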

Paul


>
> There was just one setting in the BIOS of these machines called "Host
> Performance" that was set to "Balanced". We changed that to "Max
> Performance" and immediately the throughput doubled.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Paul Emmerich
On Wed, May 29, 2019 at 9:36 AM Robert Sander 
wrote:

> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
>
> The customer currently has 12 Micron 5100 1,92TB (Micron_5100_MTFDDAK1)
> SSDs and will get a batch of Micron 5200 in the next days
>

And there's your bottleneck ;)
The Micron 5100 performs horribly in Ceph, I've seen similar performance in
another cluster with these disks.
Basically they max out at around 1000 IOPS and report 100% utilization and
feel slow.

Haven't seen the 5200 yet.


Paul


>
> We have identified the performance settings in the BIOS as a major
> factor. Ramping that up we got a remarkable performance increase.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Tarek Zegar

Hi Oliver

Here is the output of the active mgr log after I toggled balancer off / on.
I grep'd out only "balancer" as it was far too verbose (see below). When I
look at ceph osd df I see it optimized :)
I would like to understand two things, however: why is "prepared 0/10
changes" zero if it actually did something, and what in the log can I look
for, before I toggled it, that said basically "hey balancer isn't going to work
because I still think min-client-compact-level < luminous"

Thanks for helping me in getting this working!



root@hostmonitor1:/var/log/ceph# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL   %USE  VAR  PGS
 1   hdd 0.0098000 B 0 B 0 B 00   0
 3   hdd 0.00980  1.0 10 GiB 5.3 GiB 4.7 GiB 53.25 0.97 150
 6   hdd 0.00980  1.0 10 GiB 5.6 GiB 4.4 GiB 56.07 1.03 150
 0   hdd 0.0098000 B 0 B 0 B 00   0
 5   hdd 0.00980  1.0 10 GiB 5.7 GiB 4.3 GiB 56.97 1.04 151
 7   hdd 0.00980  1.0 10 GiB 5.2 GiB 4.8 GiB 52.35 0.96 149
 2   hdd 0.0098000 B 0 B 0 B 00   0
 4   hdd 0.00980  1.0 10 GiB 5.5 GiB 4.5 GiB 55.25 1.01 150
 8   hdd 0.00980  1.0 10 GiB 5.4 GiB 4.6 GiB 54.07 0.99 150
TOTAL 70 GiB  34 GiB  36 GiB 54.66
MIN/MAX VAR: 0.96/1.04  STDDEV: 1.60


2019-05-29 17:06:49.324 7f40ce42a700  0 log_channel(audit) log [DBG] :
from='client.11262 192.168.0.12:0/4104979884' entity='client.admin' cmd=
[{"prefix": "balancer off", "target": ["mgr", ""]}]: dispatch
2019-05-29 17:06:49.324 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer status'
2019-05-29 17:06:49.324 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer mode'
2019-05-29 17:06:49.324 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer on'
2019-05-29 17:06:49.324 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer off'
2019-05-29 17:06:49.324 7f40cec2b700  1 mgr[balancer] Handling command:
'{'prefix': 'balancer off', 'target': ['mgr', '']}'
2019-05-29 17:06:49.388 7f40d747a700  4 mgr[py] Loaded module_config entry
mgr/balancer/max_misplaced:.50
2019-05-29 17:06:49.388 7f40d747a700  4 mgr[py] Loaded module_config entry
mgr/balancer/mode:upmap
2019-05-29 17:06:49.539 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/active
2019-05-29 17:06:49.539 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/begin_time
2019-05-29 17:06:49.539 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/end_time
2019-05-29 17:06:49.539 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/sleep_interval
2019-05-29 17:06:54.279 7f40ce42a700  4 mgr.server handle_command
prefix=balancer on
2019-05-29 17:06:54.279 7f40ce42a700  0 log_channel(audit) log [DBG] :
from='client.11268 192.168.0.12:0/1339099349' entity='client.admin' cmd=
[{"prefix": "balancer on", "target": ["mgr", ""]}]: dispatch
2019-05-29 17:06:54.279 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer status'
2019-05-29 17:06:54.279 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer mode'
2019-05-29 17:06:54.279 7f40ce42a700  1 mgr.server handle_command
pyc_prefix: 'balancer on'
2019-05-29 17:06:54.279 7f40cec2b700  1 mgr[balancer] Handling command:
'{'prefix': 'balancer on', 'target': ['mgr', '']}'
2019-05-29 17:06:54.287 7f40d747a700  4 mgr[py] Loaded module_config entry
mgr/balancer/active:1
2019-05-29 17:06:54.287 7f40d747a700  4 mgr[py] Loaded module_config entry
mgr/balancer/max_misplaced:.50
2019-05-29 17:06:54.287 7f40d747a700  4 mgr[py] Loaded module_config entry
mgr/balancer/mode:upmap
2019-05-29 17:06:54.299 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/active
2019-05-29 17:06:54.299 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/begin_time
2019-05-29 17:06:54.299 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/end_time
2019-05-29 17:06:54.299 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/sleep_interval
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr[balancer] Optimize plan
auto_2019-05-29_17:06:54
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/mode
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/max_misplaced
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr[balancer] Mode upmap, max
misplaced 0.50
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr[balancer] do_upmap
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/upmap_max_iterations
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr get_config get_config key:
mgr/balancer/upmap_max_deviation
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr[balancer] pools ['rbd']
2019-05-29 17:06:54.327 7f40cd3e8700  4 mgr[balancer] prepared 0/10 changes




From:   Oliver Freyermuth 
To: Tarek Zegar 
Cc: ceph-users@lists.ceph.com
Date:   05/29/2019 11:59 AM
Subject:[EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs



Hi Tarek,

Am 29.05.19 um 18:49 schrieb Tarek Zegar:
> Hi 

Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth
Hi Tarek,

Am 29.05.19 um 18:49 schrieb Tarek Zegar:
> Hi Oliver,
> 
> Thank you for the response, I did ensure that min-client-compact-level is 
> indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph 
> versions reports mimic. Also below is the output of ceph balancer status. One 
> thing to note, I did enable the balancer after I already filled the cluster, 
> not from the onset. I had hoped that it wouldn't matter, though your comment 
> "if the compat-level is too old for upmap, you'll only find a small warning 
> about that in the logfiles" leaves me to believe that it will *not* work in 
> doing it this way, please confirm and let me know what message to look for in 
> /var/log/ceph.

it should also work well on existing clusters - we have also used it on a 
Luminous cluster after it was already half-filled, and it worked well - that's 
what it was made for ;-). 
The only issue we encountered was that the client-compat-level needed to be set 
to Luminous before enabling the balancer plugin, but since you can always 
disable and re-enable a plugin,
this is not a "blocker". 

Do you see anything in the logs of the active mgr when disabling and 
re-enabling the balancer plugin? 
That's how we initially found the message that we needed to raise the 
client-compat-level. 

Cheers,
Oliver

> 
> Thank you!
> 
> root@hostadmin:~# ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "upmap"
> }
> 
> 
> 
> root@hostadmin:~# ceph features
> {
> "mon": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 3
> }
> ],
> "osd": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 7
> }
> ],
> "client": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 1
> }
> ],
> "mgr": [
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 3
> }
> ]
> }
> 
> 
> 
> 
> Inactive hide details for Oliver Freyermuth ---05/29/2019 11:13:51 AM---Hi 
> Tarek, what's the output of "ceph balancer status"?Oliver Freyermuth 
> ---05/29/2019 11:13:51 AM---Hi Tarek, what's the output of "ceph balancer 
> status"?
> 
> From: Oliver Freyermuth 
> To: ceph-users@lists.ceph.com
> Date: 05/29/2019 11:13 AM
> Subject: [EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
> Sent by: "ceph-users" 
> 
> --
> 
> 
> 
> Hi Tarek,
> 
> what's the output of "ceph balancer status"?
> In case you are using "upmap" mode, you must make sure to have a 
> min-client-compat-level of at least Luminous:
> http://docs.ceph.com/docs/mimic/rados/operations/upmap/
> Of course, please be aware that your clients must be recent enough 
> (especially for kernel clients).
> 
> Sadly, if the compat-level is too old for upmap, you'll only find a small 
> warning about that in the logfiles,
> but no error on terminal when activating the balancer or any other kind of 
> erroneous / health condition.
> 
> Cheers,
> Oliver
> 
> Am 29.05.19 um 17:52 schrieb Tarek Zegar:
>> Can anyone help with this? Why can't I optimize this cluster, the pg counts 
>> and data distribution is way off.
>> __
>>
>> I enabled the balancer plugin and even tried to manually invoke it but it 
>> won't allow any changes. Looking at ceph osd df, it's not even at all. 
>> Thoughts?
>>
>> root@hostadmin:~# ceph osd df
>> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
>> 1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
>> 3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
>> 6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
>> 0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
>> 5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
>> 7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
>> 2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
>> 4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
>> 8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
>> TOTAL 90 GiB 53 GiB 37 GiB 72.93
>> MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67
>>
>>
>> root@hostadmin:~# osdmaptool om 

Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Tarek Zegar

Hi Oliver,

Thank you for the response, I did ensure that min-client-compact-level is
indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph
versions reports mimic. Also below is the output of ceph balancer status.
One thing to note, I did enable the balancer after I already filled the
cluster, not from the onset. I had hoped that it wouldn't matter, though
your comment "if the compat-level is too old for upmap, you'll only find a
small warning about that in the logfiles" leads me to believe that it will
*not* work when done this way; please confirm and let me know what
message to look for in /var/log/ceph.

Thank you!

root@hostadmin:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}



root@hostadmin:~# ceph features
{
"mon": [
{
"features": "0x3ffddff8ffacfffb",
"release": "luminous",
"num": 3
}
],
"osd": [
{
"features": "0x3ffddff8ffacfffb",
"release": "luminous",
"num": 7
}
],
"client": [
{
"features": "0x3ffddff8ffacfffb",
"release": "luminous",
"num": 1
}
],
"mgr": [
{
"features": "0x3ffddff8ffacfffb",
"release": "luminous",
"num": 3
}
]
}






From:   Oliver Freyermuth 
To: ceph-users@lists.ceph.com
Date:   05/29/2019 11:13 AM
Subject:[EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
Sent by:"ceph-users" 



Hi Tarek,

what's the output of "ceph balancer status"?
In case you are using "upmap" mode, you must make sure to have a
min-client-compat-level of at least Luminous:
http://docs.ceph.com/docs/mimic/rados/operations/upmap/
Of course, please be aware that your clients must be recent enough
(especially for kernel clients).

Sadly, if the compat-level is too old for upmap, you'll only find a small
warning about that in the logfiles,
but no error on terminal when activating the balancer or any other kind of
erroneous / health condition.

Cheers,
 Oliver

Am 29.05.19 um 17:52 schrieb Tarek Zegar:
> Can anyone help with this? Why can't I optimize this cluster, the pg
counts and data distribution is way off.
> __
>
> I enabled the balancer plugin and even tried to manually invoke it but it
won't allow any changes. Looking at ceph osd df, it's not even at all.
Thoughts?
>
> root@hostadmin:~# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
> 6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
> 0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
> 7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
> 2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
> 4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
> 8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
> TOTAL 90 GiB 53 GiB 37 GiB 72.93
> MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67
>
>
> root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
> osdmaptool: osdmap file 'om'
> writing upmap command output to: out.txt
> checking for upmap cleanups
> upmap, max-count 100, max*deviation 0.01 <---really? It's not even close
to 1% across the drives*
> limiting to pools rbd (1)
> *no upmaps proposed*
>
>
> ceph balancer optimize myplan
> Error EALREADY: Unable to find further optimization,or distribution is
already perfect
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

(See attached file: smime.p7s)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




smime.p7s
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Oliver Freyermuth

Hi Tarek,

what's the output of "ceph balancer status"?
In case you are using "upmap" mode, you must make sure to have a 
min-client-compat-level of at least Luminous:
http://docs.ceph.com/docs/mimic/rados/operations/upmap/
Of course, please be aware that your clients must be recent enough (especially 
for kernel clients).

Sadly, if the compat-level is too old for upmap, you'll only find a small 
warning about that in the logfiles,
but no error on terminal when activating the balancer or any other kind of 
erroneous / health condition.

Cheers,
Oliver

Am 29.05.19 um 17:52 schrieb Tarek Zegar:

Can anyone help with this? Why can't I optimize this cluster, the pg counts and 
data distribution is way off.
__

I enabled the balancer plugin and even tried to manually invoke it but it won't 
allow any changes. Looking at ceph osd df, it's not even at all. Thoughts?

root@hostadmin:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
TOTAL 90 GiB 53 GiB 37 GiB 72.93
MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67


root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
osdmaptool: osdmap file 'om'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 100, max*deviation 0.01 <---really? It's not even close to 1% 
across the drives*
limiting to pools rbd (1)
*no upmaps proposed*


ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization,or distribution is already 
perfect


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Marc Roos
 

I had this with balancer active and "crush-compat"
MIN/MAX VAR: 0.43/1.59  STDDEV: 10.81

And by increasing the pg of some pools (from 8 to 64) and deleting empty 
pools, I went to this

MIN/MAX VAR: 0.59/1.28  STDDEV: 6.83
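
For reference, that was just the usual (the pool name below is a placeholder):

ceph osd pool set rbd pg_num 64
ceph osd pool set rbd pgp_num 64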

(Do not want to go to this upmap yet)




-Original Message-
From: Tarek Zegar [mailto:tze...@us.ibm.com] 
Sent: woensdag 29 mei 2019 17:52
To: ceph-users
Subject: *SPAM* [ceph-users] Balancer: uneven OSDs

Can anyone help with this? Why can't I optimize this cluster, the pg 
counts and data distribution is way off.
__

I enabled the balancer plugin and even tried to manually invoke it but 
it won't allow any changes. Looking at ceph osd df, it's not even at 
all. Thoughts?

root@hostadmin:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
TOTAL 90 GiB 53 GiB 37 GiB 72.93
MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67


root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
osdmaptool: osdmap file 'om'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01 <---really? It's not even close to 1% across the drives
limiting to pools rbd (1)
no upmaps proposed


ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization,or distribution is 
already perfect



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Tarek Zegar

Can anyone help with this? Why can't I optimize this cluster? The pg counts
and data distribution are way off.
__

I enabled the balancer plugin and even tried to manually invoke it but it
won't allow any changes. Looking at ceph osd df, it's not even at all.
Thoughts?

root@hostadmin:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL%USE  VAR  PGS
 1   hdd 0.0098000 B 0 B  0 B 00   0
 3   hdd 0.00980  1.0 10 GiB 8.3 GiB  1.7 GiB 82.83 1.14 156
 6   hdd 0.00980  1.0 10 GiB 8.4 GiB  1.6 GiB 83.77 1.15 144
 0   hdd 0.0098000 B 0 B  0 B 00   0
 5   hdd 0.00980  1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
 7   hdd 0.00980  1.0 10 GiB 7.7 GiB  2.3 GiB 76.57 1.05 141
 2   hdd 0.00980  1.0 10 GiB 5.5 GiB  4.5 GiB 55.42 0.76  90
 4   hdd 0.00980  1.0 10 GiB 5.9 GiB  4.1 GiB 58.78 0.81  99
 8   hdd 0.00980  1.0 10 GiB 6.3 GiB  3.7 GiB 63.12 0.87 111
TOTAL 90 GiB  53 GiB   37 GiB 72.93
MIN/MAX VAR: 0.76/1.23  STDDEV: 12.67


root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
osdmaptool: osdmap file 'om'
writing upmap command output to: out.txt
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01  <---really? It's not even close
to 1% across the drives
 limiting to pools rbd (1)
no upmaps proposed


ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization,or distribution is
already perfect
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Meaning of Ceph MDS / Rank in "Stopped" state.

2019-05-29 Thread Wesley Dillingham
On further thought, I'm now thinking this is telling me which rank is stopped 
(2), not that two ranks are stopped. I guess I am still curious why this 
information is retained here, and whether rank 2 can be made active again. If so, 
would it be cleaned up out of "stopped"?

The state diagram here: http://docs.ceph.com/docs/master/cephfs/mds-states/

seems to indicate that once a rank is "Stopped" it has no path to move out of 
that state. Perhaps I am reading it wrong.

We have upgraded multi-active-MDS clusters, pushing max_mds down to 1 and then 
back to 2 during those upgrades, and on none of those clusters is anything 
listed in "stopped", so I am guessing those ranks go back to active.

Thanks for the clarity.

From: ceph-users  on behalf of Wesley 
Dillingham 
Sent: Tuesday, May 28, 2019 5:15 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Meaning of Ceph MDS / Rank in "Stopped" state.

Notice: This email is from an external sender.



I am working to develop some monitoring for our file clusters, and as part of 
the check I inspect `ceph mds stat` for damaged, failed, or stopped MDS ranks. 
Initially I set my check to alarm if any of these states was discovered, but as 
I distributed it out I noticed that one of our clusters had the following:

 "failed": [],
   "damaged": [],
   "stopped": [
   2
   ],

However the cluster health is good and the mds state is: cephfs-2/2/2 up  
{0=p3plcephmds001=up:active,1=p3plcephmds002=up:active}, 1 up:standby

A little further digging and I found that a stopped state doesn't apply to an 
MDS but rather to a rank, and may indicate that max_mds was previously set higher 
than its current setting of 2; the "Stopped" ranks are simply ranks which 
were active and have offloaded their state to other ranks.

My question is: how can I inspect further which ranks are "stopped", and would 
it be appropriate to "clear" those stopped ranks if possible, or should I modify 
my check to ignore stopped ranks and only focus on damaged/failed ranks?
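
For reference, the kind of check I have in mind is roughly the following (the
jq path is approximate and may differ between releases; the field names match
the JSON fragment above):

ceph fs dump --format=json | jq '.filesystems[].mdsmap | {failed, damaged, stopped}'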

The cluster is running 12.2.12

Thanks.

Respectfully,

Wes Dillingham
wdilling...@godaddy.com
Site Reliability Engineer IV - Platform Storage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [events] Ceph Day Netherlands July 2nd - CFP ends June 3rd

2019-05-29 Thread Mike Perez
Hi everyone,

This is the last week to submit for the Ceph Day Netherlands CFP
ending June 3rd:

https://ceph.com/cephdays/netherlands-2019/
https://zfrmz.com/E3ouYm0NiPF1b3NLBjJk

--
Mike Perez (thingee)

On Thu, May 23, 2019 at 10:12 AM Mike Perez  wrote:
>
> Hi everyone,
>
> We will be having Ceph Day Netherlands July 2nd!
>
> https://ceph.com/cephdays/netherlands-2019/
>
> The CFP will be ending June 3rd, so there is still time to get your
> Ceph related content in front of the Ceph community ranging from all
> levels of expertise:
>
> https://zfrmz.com/E3ouYm0NiPF1b3NLBjJk
>
> If your company is interested in sponsoring the event, we would be
> delighted to have you. Please contact me directly for further
> information.
>
> Hosted by the Ceph community (and our friends) in select cities around
> the world, Ceph Days are full-day events dedicated to fostering our
> vibrant community.
>
> In addition to Ceph experts, community members, and vendors, you’ll
> hear from production users of Ceph who’ll share what they’ve learned
> from their deployments.
>
> Each Ceph Day ends with a Q&A session and cocktail reception. Join us!
>
> --
> Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Jake Grimmett
Thank you for a lot of detailed and useful information :)

I'm tempted to ask a related question on SSD endurance...

If 60GB is the sweet spot for each DB/WAL partition, and the SSD has
spare capacity, for example, I'd budgeted 266GB per DB/WAL.

Would it then be better to make 60GB "sweet spot" sized DB/WAL partitions, and
leave the remaining SSD unused, as this would maximise the lifespan of
the SSD and speed up garbage collection?

many thanks

Jake



On 5/29/19 9:56 AM, Mattia Belluco wrote:
> On 5/29/19 5:40 AM, Konstantin Shalygin wrote:
>> block.db should be 30Gb or 300Gb - anything between is pointless. There
>> is described why:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html
> 
> Following some discussions we had at the past Cephalocon I beg to differ
> on this point: when RocksDB needs to compact a layer it rewrites it
> *before* deleting the old data; if you'd like to be sure you db does not
> spill over to the spindle you should allocate twice the size of the
> biggest layer to allow for compaction. I guess ~60 GB would be the sweet
> spot assuming you don't plan to mess with size and multiplier of the
> rocksDB layers and don't want to go all the way to 600 GB (300 GB x2)
> 
> regards,
> Mattia
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nfs-ganesha with rados_kv backend

2019-05-29 Thread Stolte, Felix
Hi,

is anyone running an active-passive nfs-ganesha cluster with cephfs backend and 
using the rados_kv recovery backend? My setup runs fine, but takeover is giving 
me a headache. On takeover I see the following messages in ganeshas log file:

29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server Now 
IN GRACE, duration 5
29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server 
recovery event 5 nodeid -1 ip 10.0.0.5
29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
ganesha.nfsd-9793[dbus_heartbeat] rados_kv_traverse :CLIENT ID :EVENT :Failed 
to lst kv ret=-2
29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
ganesha.nfsd-9793[dbus_heartbeat] rados_kv_read_recov_clids_takeover :CLIENT ID 
:EVENT :Failed to takeover
29/05/2019 15:38:26 : epoch 5cee88c4 : cephgw-e2-1 : ganesha.nfsd-9793[reaper] 
nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE

The result is clients hanging for up to 2 minutes. Has anyone run into the same 
problem?

Ceph Version: 12.2.11
nfs-ganesha: 2.7.3

ganesha.conf (identical on both nodes besides nodeid in rados_kv:

NFS_CORE_PARAM {
Enable_RQUOTA = false;
Protocols = 3,4;
}

CACHEINODE {
Dir_Chunk = 0;
NParts = 1;
Cache_Size = 1;
}

NFS_krb5 {
Active_krb5 = false;
}

NFSv4 {
Only_Numeric_Owners = true;
RecoveryBackend = rados_kv;
Grace_Period = 5;
Lease_Lifetime = 5;
Minor_Versions = 1,2;
}

RADOS_KV {
ceph_conf = '/etc/ceph/ceph.conf';
userid = "ganesha";
pool = "cephfs_metadata";
namespace = "ganesha";
nodeid = "cephgw-k2-1";
}

Any hint would be appreciated.

Best regards 
Felix
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-
-
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Trigger (hot) reload of ceph.conf

2019-05-29 Thread Wido den Hollander


On 5/29/19 11:41 AM, Johan Thomsen wrote:
> Hi,
> 
> It doesn't look like SIGHUP causes the osd's to trigger conf reload from
> files? Is there any other way I can do that, without restarting? 
> 

No, there isn't. I suggest you look into the new config store, available in
Ceph since the Mimic release, where daemons can fetch (additional)
configuration from the Monitors.

These settings are live, persistent and don't need a daemon restart in
most cases.
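
For example (the option name is just illustrative):

$ ceph config assimilate-conf -i /etc/ceph/ceph.conf   # import an existing conf file
$ ceph config set osd osd_max_backfills 2              # change a value live
$ ceph config dump                                     # what the monitors now hold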

Wido

> I prefer having most of my config in files, but it's annoying that I
> need to cause the cluster to go in HEALTH_WARN in order to reload them.
> 
> Thanks for response in advance.
> 
> /Johan
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Trigger (hot) reload of ceph.conf

2019-05-29 Thread Johan Thomsen
Hi,

It doesn't look like SIGHUP causes the osd's to trigger conf reload from
files? Is there any other way I can do that, without restarting?

I prefer having most of my config in files, but it's annoying that I need
to cause the cluster to go in HEALTH_WARN in order to reload them.

Thanks for response in advance.

/Johan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Robert Sander
Hi,

Am 29.05.19 um 11:19 schrieb Martin Verges:
> 
> We have identified the performance settings in the BIOS as a major
> factor
> 
> could you share your insights what options you changed to increase
> performance and could you provide numbers to it?

Most default performance settings nowadays seem to be geared towards
power savings. This decreases CPU frequencies and does not play well
with Ceph (and virtualization).

There was just one setting in the BIOS of these machines called "Host
Performance" that was set to "Balanced". We changed that to "Max
Performance" and immediately the throughput doubled.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Andrei Mikhailovsky
It would be interesting to learn the types of improvements and the BIOS changes 
that helped you. 

Thanks 

> From: "Martin Verges" 
> To: "Robert Sander" 
> Cc: "ceph-users" 
> Sent: Wednesday, 29 May, 2019 10:19:09
> Subject: Re: [ceph-users] performance in a small cluster

> Hello Robert,

>> We have identified the performance settings in the BIOS as a major factor

> could you share your insights what options you changed to increase performance
> and could you provide numbers to it?

> Many thanks in advance

> --
> Martin Verges
> Managing director

> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges

> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263

> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx

> Am Mi., 29. Mai 2019 um 09:36 Uhr schrieb Robert Sander :

>> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
>> > * SSD model? Lots of cheap SSDs simply can't handle more than that

>> The customer currently has 12 Micron 5100 1,92TB (Micron_5100_MTFDDAK1)
>> SSDs and will get a batch of Micron 5200 in the next days

>> We have identified the performance settings in the BIOS as a major
>> factor. Ramping that up we got a remarkable performance increase.

>> Regards
>> --
>> Robert Sander
>> Heinlein Support GmbH
>> Linux: Akademie - Support - Hosting
>> http://www.heinlein-support.de

>> Tel: 030-405051-43
>> Fax: 030-405051-19

>> Zwangsangaben lt. §35a GmbHG:
>> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
>> Geschäftsführer: Peer Heinlein -- Sitz: Berlin

>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Martin Verges
Hello Robert,

We have identified the performance settings in the BIOS as a major factor
>

could you share your insights what options you changed to increase
performance and could you provide numbers to it?

Many thanks in advance

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Mi., 29. Mai 2019 um 09:36 Uhr schrieb Robert Sander <
r.san...@heinlein-support.de>:

> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
>
> The customer currently has 12 Micron 5100 1,92TB (Micron_5100_MTFDDAK1)
> SSDs and will get a batch of Micron 5200 in the next days
>
> We have identified the performance settings in the BIOS as a major
> factor. Ramping that up we got a remarkable performance increase.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-29 Thread Jan Fajerski

On Tue, May 28, 2019 at 11:50:01AM -0700, Gregory Farnum wrote:

  You’re the second report I’ve seen of this, and while it’s confusing,
  you should be able to resolve it by restarting your active manager
  daemon.

Maybe this is related? http://tracker.ceph.com/issues/40011


  On Sun, May 26, 2019 at 11:52 PM Lars Täuber <[1]taeu...@bbaw.de>
  wrote:

Fri, 24 May 2019 21:41:33 +0200
Michel Raabe <[2]rmic...@devnu11.net> ==> Lars Täuber
<[3]taeu...@bbaw.de>, [4]ceph-users@lists.ceph.com :
>
> You can also try
>
> $ rados lspools
> $ ceph osd pool ls
>
> and verify that with the pgs
>
> $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut
-d.
> -f1 | uniq
>
Yes, now I know but I still get this:
$ sudo ceph -s
[…]
  data:
pools:   5 pools, 1153 pgs
[…]
and with all other means I get:
$ sudo ceph osd lspools | wc -l
3
Which is what I expect, because all other pools are removed.
But since this has no bad side effects I can live with it.
Cheers,
Lars
___
ceph-users mailing list
[5]ceph-users@lists.ceph.com
[6]http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

References

  1. mailto:taeu...@bbaw.de
  2. mailto:rmic...@devnu11.net
  3. mailto:taeu...@bbaw.de
  4. mailto:ceph-users@lists.ceph.com
  5. mailto:ceph-users@lists.ceph.com
  6. http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Mattia Belluco
On 5/29/19 5:40 AM, Konstantin Shalygin wrote:
> block.db should be 30Gb or 300Gb - anything between is pointless. There
> is described why:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html

Following some discussions we had at the past Cephalocon I beg to differ
on this point: when RocksDB needs to compact a layer it rewrites it
*before* deleting the old data; if you'd like to be sure your db does not
spill over to the spindle you should allocate twice the size of the
biggest layer to allow for compaction. I guess ~60 GB would be the sweet
spot assuming you don't plan to mess with size and multiplier of the
rocksDB layers and don't want to go all the way to 600 GB (300 GB x2)
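
For illustration, assuming the RocksDB defaults (max_bytes_for_level_base =
256 MB, level multiplier = 10), the level sizes work out to roughly

  L1 ~ 0.25 GB, L2 ~ 2.5 GB, L3 ~ 25 GB, L4 ~ 250 GB

so the DB partition only really pays off once it can hold a complete level
(hence the ~30 GB and ~300 GB figures), and doubling the largest level that
fits, to leave room for compaction, is where the ~60 GB estimate comes from.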

regards,
Mattia


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Global Data Deduplication

2019-05-29 Thread Felix Hüttner
Hi everyone,

We are currently using Ceph as the backend for our OpenStack blockstorage. For 
backup of these disks we thought about also using ceph (just with hdd's instead 
of ssd's). As we will have some volumes that will be backed up daily and that 
will probably not change too often, I searched for any possible deduplication 
methods for ceph.

There I noticed this paper regarding "Global Data Deduplication" 
(https://ceph.com/wp-content/uploads/2018/07/ICDCS_2018_mwoh.pdf). It says "We 
implemented the proposed design upon open source distributed storage system, 
Ceph".

Unfortunately I was not able to find any documentation for this anywhere. The 
only thing that seems related is the cephdeduptool.
Is there something that I just missed? Or is it implicitly done in the 
background and I don't need to care about it?

Thanks for your help

Felix

Information on data protection can be found here.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-29 Thread Robert Sander
Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> * SSD model? Lots of cheap SSDs simply can't handle more than that

The customer currently has 12 Micron 5100 1,92TB (Micron_5100_MTFDDAK1)
SSDs and will get a batch of Micron 5200 in the next days

We have identified the performance settings in the BIOS as a major
factor. Ramping that up we got a remarkable performance increase.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] is rgw crypt default encryption key long term supported ?

2019-05-29 Thread Scheurer François
Hello Casey


Thank you for your reply.
To close this subject, one last question.

Do you know if it is possible to rotate the key defined by 
"rgw_crypt_default_encryption_key=" ?


Best Regards
Francois Scheurer




From: Casey Bodley 
Sent: Tuesday, May 28, 2019 5:37 PM
To: Scheurer François; ceph-users@lists.ceph.com
Subject: Re: is rgw crypt default encryption key long term supported ?

On 5/28/19 11:17 AM, Scheurer François wrote:
> Hi Casey
>
>
> I greatly appreciate your quick and helpful answer :-)
>
>
>> It's unlikely that we'll do that, but if we do it would be announced with a 
>> long deprecation period and migration strategy.
> Fine, just the answer we wanted to hear ;-)
>
>
>> However, I would still caution against using either as a strategy for
>> key management, especially when (as of mimic) the ceph configuration is
>> centralized in the ceph-mon database [1][2]. If there are gaps in our
>> sse-kms integration that makes it difficult to use in practice, I'd
>> really like to address those.
> sse-kms is working great, no issue or gaps with it.
> We already use it in our openstack (rocky) with barbican and ceph/radosgw 
> (luminous).
>
> But we have customers that want encryption by default, something like SSE-S3 
> (cf. below).
> Do you know if there are plans to implement something similar?
I would love to see support for sse-s3. We've talked about building
something around vault (which I think is what minio does?), but so far
nobody has taken it up as a project.
>
> Using dm-crypt would cost too much time for the conversion (72x 8TB SATA 
> disks...) .
> And dm-crypt is also storing its key on the monitors (cf. 
> https://www.spinics.net/lists/ceph-users/msg52402.html).
>
>
> Best Regards
> Francois Scheurer
>
>
> Amazon SSE-3 description:
>
> https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html
> Protecting Data Using Server-Side Encryption with Amazon S3-Managed 
> Encryption Keys (SSE-S3)
> Server-side encryption protects data at rest. Amazon S3 encrypts each object 
> with a unique key. As an additional safeguard, it encrypts the key itself 
> with a master key that it rotates regularly. Amazon S3 server-side encryption 
> uses one of the strongest block ciphers available, 256-bit Advanced 
> Encryption Standard (AES-256), to encrypt your data.
>
>
> https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketPUTencryption.html
> The following is an example of the request body for setting SSE-S3.
> <ServerSideEncryptionConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
>   <Rule>
>     <ApplyServerSideEncryptionByDefault>
>       <SSEAlgorithm>AES256</SSEAlgorithm>
>     </ApplyServerSideEncryptionByDefault>
>   </Rule>
> </ServerSideEncryptionConfiguration>
>
>
>
>
>
>
>
>
> 
> From: Casey Bodley 
> Sent: Tuesday, May 28, 2019 3:55 PM
> To: Scheurer François; ceph-users@lists.ceph.com
> Subject: Re: is rgw crypt default encryption key long term supported ?
>
> Hi François,
>
>
> Removing support for either of rgw_crypt_default_encryption_key or
> rgw_crypt_s3_kms_encryption_keys would mean that objects encrypted with
> those keys would no longer be accessible. It's unlikely that we'll do
> that, but if we do it would be announced with a long deprecation period
> and migration strategy.
>
>
> However, I would still caution against using either as a strategy for
> key management, especially when (as of mimic) the ceph configuration is
> centralized in the ceph-mon database [1][2]. If there are gaps in our
> sse-kms integration that makes it difficult to use in practice, I'd
> really like to address those.
>
>
> Casey
>
>
> [1]
> https://ceph.com/community/new-mimic-centralized-configuration-management/
>
> [2]
> http://docs.ceph.com/docs/mimic/rados/configuration/ceph-conf/#monitor-configuration-database
>
>
> On 5/28/19 6:39 AM, Scheurer François wrote:
>> Dear Casey, Dear Ceph Users The following is written in the radosgw
>> documentation
>> (http://docs.ceph.com/docs/luminous/radosgw/encryption/): rgw crypt
>> default encryption key = 4YSmvJtBv0aZ7geVgAsdpRnLBEwWSWlMIGnRS8a9TSA=
>>
>>Important: This mode is for diagnostic purposes only! The ceph
>> configuration file is not a secure method for storing encryption keys.
>>
>>  Keys that are accidentally exposed in this way should be
>> considered compromised.
>>
>>
>>
>>
>> Is the warning only about the key exposure risk or does it mean also
>> that the feature could be removed in future?
>>
>>
>> The is also another similar parameter "rgw crypt s3 kms encryption
>> keys" (cf. usage example in
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030679.html).
>> 
>>
>>
>> Both parameters are still interesting (provided the ceph.conf is
>> encrypted) but we want to be sure that they will not be dropped in future.
>>
>>
>>
>>
>> Best Regards
>>
>> Francois
>>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Large OMAP object in RGW GC pool

2019-05-29 Thread Wido den Hollander
Hi,

I've got a Ceph cluster with this status:

health: HEALTH_WARN
3 large omap objects

After looking into it I see that the issue comes from objects in the
'.rgw.gc' pool.

Investigating it I found that the gc.* objects have a lot of OMAP keys:

for OBJ in $(rados -p .rgw.gc ls); do
  echo $OBJ
  rados -p .rgw.gc listomapkeys $OBJ|wc -l
done

I then found out that on average these objects have about 100k of OMAP
keys each, but two stand out and have about 3M OMAP keys.
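
A variant of the loop above that sorts the GC shards by key count makes the
outliers easy to spot:

for OBJ in $(rados -p .rgw.gc ls); do
  echo "$(rados -p .rgw.gc listomapkeys "$OBJ" | wc -l) $OBJ"
done | sort -rn | head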

I can list the GC with 'radosgw-admin gc list' and this yields a JSON
which is a couple of MB in size.

I ran:

$ radosgw-admin gc process

That runs for hours and then finishes, but the large list of OMAP keys
stays.

Running Mimic 13.2.5 on this cluster.

Has anybody seen this before?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Burkhard Linke

Hi,

On 5/29/19 8:25 AM, Konstantin Shalygin wrote:



We have a similar setup, but 24 disks and 2x P4800X. And the 375GB NVME
drives are _not_ large enough:

*snipsnap*



Your block.db is 29Gb, should be 30Gb to prevent spillover to slow 
backend.




Well, it's the usual gigabyte vs. gibibyte fuck up.


The drive has exactly 366292584 KiB, which is ~350 GB (with GB = the 
computer scientist's GB, 1024^3). Since rocksdb also seems to be written 
by computer scientists, we are 10 GB short for a working setup...
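
The arithmetic behind being 10 GB short, for reference:

# the device's 366292584 KiB expressed in GiB, and the share per DB partition
echo "scale=1; 366292584 / 1024 / 1024" | bc        # ~349.3 GiB on the device
echo "scale=1; 366292584 / 1024 / 1024 / 12" | bc   # ~29.1 GiB per partition, i.e. < 30 GiB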



There are options to reduce the level size in rocksdb. Does anyone have 
experience with changing them, and what are sane values (e.g. powers of 2)?
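
For the record, the knobs in question would look roughly like this (a sketch
only; note that bluestore_rocksdb_options replaces the whole default option
string, so the remaining default keys would have to be carried over as well):

# ceph.conf fragment; with base = 192 MiB and multiplier = 8 the first three
# levels sum to roughly 0.2 + 1.5 + 12 = ~14 GiB, which fits a 29 GiB partition
[osd]
bluestore rocksdb options = max_bytes_for_level_base=201326592,max_bytes_for_level_multiplier=8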



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Konstantin Shalygin

We have a similar setup, but 24 disks and 2x P4800X. And the 375GB NVME
drives are _not_ large enough:


2019-05-29 07:00:00.000108 mon.bcf-03 [WRN] overall HEALTH_WARN BlueFS
spillover detected on 22 OSD(s)

root@bcf-10:~# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 375GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End Size    File system  Name  Flags
   1  1049kB  31.1GB  31.1GB
   2  31.1GB  62.3GB  31.1GB
   3  62.3GB  93.4GB  31.1GB
   4  93.4GB  125GB   31.1GB
   5  125GB   156GB   31.1GB
   6  156GB   187GB   31.1GB
   7  187GB   218GB   31.1GB
   8  218GB   249GB   31.1GB
   9  249GB   280GB   31.1GB
10  280GB   311GB   31.1GB
11  311GB   343GB   31.1GB
12  343GB   375GB   32.6GB


The second NVME has the same partition layout. The twelfth partition is
actually large enough to hold all the data, but the other 11 partitions
on this drive are a little bit too small. I'm still trying to calculate
the exact sweet spot


With 24 OSDs and two of them having a just-large-enough-db-partition, I
end up with 22 OSD not fully using their db partition and spilling over
into the slow disk...exactly as reported by ceph.

Details for one of the affected OSDs:

      "bluefs": {
      "gift_bytes": 0,
      "reclaim_bytes": 0,
      "db_total_bytes": 31138504704,
      "db_used_bytes": 2782912512,
      "wal_total_bytes": 0,
      "wal_used_bytes": 0,
      "slow_total_bytes": 320062095360,
      "slow_used_bytes": 5838471168,
      "num_files": 135,
      "log_bytes": 13295616,
      "log_compactions": 9,
      "logged_bytes": 338104320,
      "files_written_wal": 2,
      "files_written_sst": 5066,
      "bytes_written_wal": 375879721287,
      "bytes_written_sst": 227201938586,
      "bytes_written_slow": 6516224,
      "max_bytes_wal": 0,
      "max_bytes_db": 5265940480,
      "max_bytes_slow": 7540310016
      },

Maybe it's just a matter of shifting some megabytes. We are about to
deploy more of these nodes, so I would be grateful if anyone can comment
on the correct size of the DB partitions. Otherwise I'll have to use a
RAID-0 for two drives.


Regards,




Your block.db is 29Gb, should be 30Gb to prevent spillover to slow backend.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Burkhard Linke

Hi,

On 5/29/19 5:23 AM, Frank Yu wrote:

Hi Jake,

I have the same question about the size of DB/WAL per OSD. My situation: 12 
OSDs per node, 8 TB (maybe 12 TB later) per OSD, Intel NVMe SSD 
(Optane P4800X), 375 GB per OSD node, which means DB/WAL can use about 
30 GB per OSD (8 TB). I mainly use CephFS to serve the HPC cluster for ML.
(I plan to separate the CephFS metadata to a pool based on NVMe SSD; BTW, does 
this improve performance a lot? Any comparisons?)



We have a similar setup, but 24 disks and 2x P4800X. And the 375GB NVME 
drives are _not_ large enough:



2019-05-29 07:00:00.000108 mon.bcf-03 [WRN] overall HEALTH_WARN BlueFS 
spillover detected on 22 OSD(s)


root@bcf-10:~# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 375GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End Size    File system  Name  Flags
 1  1049kB  31.1GB  31.1GB
 2  31.1GB  62.3GB  31.1GB
 3  62.3GB  93.4GB  31.1GB
 4  93.4GB  125GB   31.1GB
 5  125GB   156GB   31.1GB
 6  156GB   187GB   31.1GB
 7  187GB   218GB   31.1GB
 8  218GB   249GB   31.1GB
 9  249GB   280GB   31.1GB
10  280GB   311GB   31.1GB
11  311GB   343GB   31.1GB
12  343GB   375GB   32.6GB


The second NVME has the same partition layout. The twelfth partition is 
actually large enough to hold all the data, but the other 11 partitions 
on this drive are a little bit too small. I'm still trying to calculate 
the exact sweet spot



With 24 OSDs and two of them having a just-large-enough-db-partition, I 
end up with 22 OSD not fully using their db partition and spilling over 
into the slow disk...exactly as reported by ceph.


Details for one of the affected OSDs:

    "bluefs": {
    "gift_bytes": 0,
    "reclaim_bytes": 0,
    "db_total_bytes": 31138504704,
    "db_used_bytes": 2782912512,
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,
    "slow_total_bytes": 320062095360,
    "slow_used_bytes": 5838471168,
    "num_files": 135,
    "log_bytes": 13295616,
    "log_compactions": 9,
    "logged_bytes": 338104320,
    "files_written_wal": 2,
    "files_written_sst": 5066,
    "bytes_written_wal": 375879721287,
    "bytes_written_sst": 227201938586,
    "bytes_written_slow": 6516224,
    "max_bytes_wal": 0,
    "max_bytes_db": 5265940480,
    "max_bytes_slow": 7540310016
    },

Maybe it's just a matter of shifting some megabytes. We are about to 
deploy more of these nodes, so I would be grateful if anyone can comment 
on the correct size of the DB partitions. Otherwise I'll have to use a 
RAID-0 for two drives.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com