Re: [ceph-users] Investigating Config Error, 300x reduction in IOPs performance on RGW layer

2019-07-17 Thread Robert LeBlanc
I'm pretty new to RGW, but I need to get max performance as well. Have you tried moving your RGW metadata pools to NVMe? Carve out a bit of NVMe space and then pin the pool to the SSD class in CRUSH; that way the small metadata ops aren't on slow media. Robert LeBlanc PGP
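
For reference, a minimal sketch of pinning an RGW metadata pool to a flash device class (the rule name is illustrative, and the index pool name depends on the zone):

  # replicated CRUSH rule restricted to the 'ssd' device class
  ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
  # point the RGW bucket index pool at that rule
  ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-ssd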

[ceph-users] Investigating Config Error, 300x reduction in IOPs performance on RGW layer

2019-07-17 Thread Ravi Patel
Hello, We have deployed a Ceph cluster and we are trying to debug a massive drop in performance between the RADOS layer and the RGW layer. ## Cluster config 4 OSD nodes (12 drives each, NVMe journals, 1 SSD drive) 40GbE NIC 2 RGW nodes (DNS RR load balancing) 40GbE NIC 3 MON nodes 1 GbE NIC ##
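
To narrow down where the drop happens, one common approach (pool name and parameters below are placeholders) is to benchmark the raw RADOS layer with small objects and compare that against an S3-level benchmark run through the RGW nodes:

  # small-object write/read test directly against RADOS
  rados bench -p testbench 60 write -b 4096 -t 16 --no-cleanup
  rados bench -p testbench 60 rand -t 16
  rados -p testbench cleanup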

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread Jeff Layton
Ahh, I just noticed you were running nautilus on the client side. This patch went into v14.2.2, so once you update to that you should be good to go. -- Jeff On Wed, 2019-07-17 at 17:10 -0400, Jeff Layton wrote: > This is almost certainly the same bug that is fixed here: > >

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread Jeff Layton
This is almost certainly the same bug that is fixed here: https://github.com/ceph/ceph/pull/28324 It should get backported soon-ish but I'm not sure which luminous release it'll show up in. Cheers, Jeff On Wed, 2019-07-17 at 10:36 +0100, David C wrote: > Thanks for taking a look at this,

[ceph-users] MON DNS Lookup & Version 2 Protocol

2019-07-17 Thread DHilsbos
All; I'm trying to firm up my understanding of how Ceph works, and of its ease-of-management tools and capabilities. I stumbled upon this: http://docs.ceph.com/docs/nautilus/rados/configuration/mon-lookup-dns/ It got me wondering: how do you convey protocol version 2 capabilities in this format?
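
For context, the linked page describes SRV records of roughly the following shape (example.com and mon1 are placeholders). Whether adding records that point at the msgr2 port (3300 by default, vs 6789 for v1) is how v2 capability gets conveyed is exactly the open question here; treat the second line as an assumption to verify against the docs:

  _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
  _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 3300 mon1.example.com.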

[ceph-users] Allocation recommendations for separate blocks.db and WAL

2019-07-17 Thread Robert LeBlanc
So, I see the recommendation for 4% of OSD space for blocks.db/WAL and the corresponding discussion regarding the 3/30/300GB vs 6/60/600GB allocation. How does this change when the WAL is separate from blocks.db? Reading [0] it seems that 6/60/600 is not correct. It seems that to compact a 300GB DB,
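
For reference, a separate WAL device generally only pays off when it is faster than the block.db device; a sketch of creating such an OSD with ceph-volume (device paths are placeholders):

  # data on HDD, block.db on NVMe, WAL on a separate (faster) flash partition
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db /dev/nvme0n1p1 --block.wal /dev/nvme1n1p1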

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov
Fix is on its way too... See https://github.com/ceph/ceph/pull/28978 On 7/17/2019 8:55 PM, Paul Mezzanini wrote: Oh my. That's going to hurt with 788 OSDs. Time for some creative shell scripts and stepping through the nodes. I'll report back. -- Paul Mezzanini Sr Systems Administrator /

Re: [ceph-users] enterprise support

2019-07-17 Thread Void Star Nill
Thanks everyone. Appreciate the inputs. Any feedback on support quality of these vendors? Croit, Mirantis, Redhat, Ubuntu? Anyone already using them (other than Robert)? Thanks, Shridhar On Mon, 15 Jul 2019 at 13:30, Robert LeBlanc wrote: > We recently used Croit (https://croit.io/) and they

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Paul Mezzanini
Oh my. That's going to hurt with 788 OSDs. Time for some creative shell scripts and stepping through the nodes. I'll report back. -- Paul Mezzanini Sr Systems Administrator / Engineer, Research Computing Information & Technology Services Finance & Administration Rochester Institute of

Re: [ceph-users] Multisite RGW - endpoints configuration

2019-07-17 Thread Peter Eisch
Hi, I have also been looking at solutions for improving sync. I have two clusters, 25 ms RTT, with RGW multi-site configured and all nodes running 12.2.12. I have three RGW nodes behind haproxy at each site. There is a 1G circuit between the sites and bandwidth usage

[ceph-users] MGR module config from ceph.conf

2019-07-17 Thread Oskar Malnowicz
Hello, is it possible to set key/values for mgr modules from a file (e.g. ceph.conf) instead of e.g. ceph config set mgr mgr/influx/ ? Thx, Oskar
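
One possible route (a sketch; whether a given mgr/<module>/ key is accepted this way should be verified) is to import a conf-style file into the cluster's central config database instead of issuing individual set commands:

  ceph config assimilate-conf -i /etc/ceph/mgr-settings.conf
  ceph config dump | grep mgr/influx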

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov
Forgot to provide a workaround... If that's the case then you need to repair each OSD with the corresponding command in ceph-objectstore-tool... Thanks, Igor. On 7/17/2019 6:29 PM, Paul Mezzanini wrote: Sometime after our upgrade to Nautilus our disk usage statistics went off the rails
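
A rough sketch of stepping through the OSDs on one node, assuming the repair invocation Igor refers to (run it only with the OSD stopped, and verify the exact command for your release first):

  ceph osd set noout
  # stop, repair and restart each OSD on this node in turn
  for id in $(ls /var/lib/ceph/osd | sed 's/ceph-//'); do
      systemctl stop ceph-osd@$id
      ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$id --op repair
      systemctl start ceph-osd@$id
  done
  ceph osd unset noout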

Re: [ceph-users] disk usage reported incorrectly

2019-07-17 Thread Igor Fedotov
Hi Paul, there was a post from Sage named "Pool stats issue with upgrades to nautilus" recently. Perhaps that's the case if you add a new OSD or repair an existing one... Thanks, Igor On 7/17/2019 6:29 PM, Paul Mezzanini wrote: Sometime after our upgrade to Nautilus our disk usage statistics

[ceph-users] disk usage reported incorrectly

2019-07-17 Thread Paul Mezzanini
Sometime after our upgrade to Nautilus our disk usage statistics went badly off the rails. I can't tell you exactly when it broke, but I know that after the initial upgrade it worked at least for a bit. Correct numbers should be something similar to: (These are copy/pasted from the
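
For anyone comparing their own numbers, the statistics in question are the pool-level ones reported by:

  ceph df detail
  rados df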

Re: [ceph-users] ceph mon crash - ceph mgr module ls -f plain

2019-07-17 Thread Oskar Malnowicz
thx! Am 17.07.19 um 16:28 schrieb Sage Weil: > Thanks, opened bug https://tracker.ceph.com/issues/40804. Fix should be > trivial. > > sage > > On Wed, 17 Jul 2019, Oskar Malnowicz wrote: > >> Hello, >> when i execute the following command on one of my three ceph-mon, all >> ceph-mon crashes. >>

Re: [ceph-users] ceph mon crash - ceph mgr module ls -f plain

2019-07-17 Thread Sage Weil
Thanks, opened bug https://tracker.ceph.com/issues/40804. Fix should be trivial. sage On Wed, 17 Jul 2019, Oskar Malnowicz wrote: > Hello, > when i execute the following command on one of my three ceph-mon, all > ceph-mon crashes. > > ceph mgr module ls -f plain > >  ceph version 14.2.1

Re: [ceph-users] Multisite RGW - endpoints configuration

2019-07-17 Thread Casey Bodley
On 7/17/19 8:04 AM, P. O. wrote: Hi, Is there any mechanism inside the rgw that can detect faulty endpoints for a configuration with multiple endpoints? No, replication requests that fail just get retried using round robin until they succeed. If an endpoint isn't available, we assume it
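
For completeness, a sketch of how multiple replication endpoints per zone are configured (zone name, hosts and ports are placeholders); the peer zone then round-robins across whatever is listed:

  radosgw-admin zone modify --rgw-zone=us-east \
      --endpoints=http://rgw1:8080,http://rgw2:8080,http://rgw3:8080
  radosgw-admin period update --commit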

Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Maged Mokhtar
In most cases a write-back cache does help a lot with HDD write latency; either RAID-0, or some Areca cards support write-back in JBOD mode. Our observation is that they can help by a 3-5x factor with Bluestore, whereas db/wal on flash will be about 2x. It does depend on hardware, but in general we see

[ceph-users] ceph mon crash - ceph mgr module ls -f plain

2019-07-17 Thread Oskar Malnowicz
Hello, when I execute the following command on one of my three ceph-mons, all of them crash. ceph mgr module ls -f plain  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)  1: (()+0x12890) [0x7fcc5e5e3890]  2: (gsignal()+0xc7) [0x7fcc5d6dbe97]  3: (abort()+0x141)

Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Lars Marowsky-Bree
On 2019-07-17T08:27:46, John Petrini wrote: The main problem we've observed is that not all HBAs can just efficiently and easily pass through disks 1:1. Some of those from a more traditional server background insist on having some form of mapping via RAID. In that case it depends on whether 1

Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Mark Nelson
Some of the first performance studies we did back at Inktank were looking at RAID-0 vs JBOD setups! :)  You are absolutely right that the controller cache (especially write-back with a battery or supercap) can help with HDD-only configurations.  Where we typically saw problems was when you

Re: [ceph-users] Random slow requests without any load

2019-07-17 Thread Kees Meijs
Hi, We experienced similar issues. Our cluster-internal network (completely separated) now has NOTRACK (no connection state tracking) iptables rules. In full: > # iptables-save > # Generated by xtables-save v1.8.2 on Wed Jul 17 14:57:38 2019 > *filter > :FORWARD DROP [0:0] > :OUTPUT ACCEPT [0:0] >
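
The essential part is the raw table; a minimal sketch of NOTRACK rules for a dedicated cluster network (the subnet is a placeholder):

  # skip connection tracking for cluster-network traffic in both directions
  iptables -t raw -A PREROUTING -s 192.168.10.0/24 -j NOTRACK
  iptables -t raw -A OUTPUT -d 192.168.10.0/24 -j NOTRACK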

Re: [ceph-users] New best practices for osds???

2019-07-17 Thread John Petrini
Dell has a whitepaper that compares Ceph performance using JBOD and RAID-0 per disk and recommends RAID-0 for HDDs: en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download After switching from JBOD to RAID-0 we saw a huge reduction in latency; the difference was much

Re: [ceph-users] Multisite RGW - endpoints configuration

2019-07-17 Thread P. O.
Hi, Is there any mechanism inside RGW that can detect faulty endpoints in a configuration with multiple endpoints? Is there any advantage related to the number of replication endpoints? Can I expect improved replication performance (more synchronization RGWs = faster replication)?

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread David C
Thanks for taking a look at this, Daniel. Below is the only interesting bit from the Ceph MDS log at the time of the crash but I suspect the slow requests are a result of the Ganesha crash rather than the cause of it. Copying the Ceph list in case anyone has any ideas. 2019-07-15 15:06:54.624007

Re: [ceph-users] Random slow requests without any load

2019-07-17 Thread Maximilien Cuony
Hello, Just a quick update about this in case somebody else gets the same issue: The problem was with the firewall. Port ranges and established connections are allowed, but for some reason it seems the tracking of connections is lost, leading to a strange state where one machine refuses data (RST

Re: [ceph-users] cephfs snapshot scripting questions

2019-07-17 Thread Marc Roos
Hmm, OK, test it first; I can't remember if it is finished. It also checks whether it is useful to create a snapshot, by checking the size of the directory. [@ cron.daily]# cat backup-archive-mail.sh #!/bin/bash cd /home/ for account in `ls -c1 /home/mail-archive/ | sort` do
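
For anyone adapting the script: a CephFS snapshot is just a directory created under the hidden .snap directory of the directory being snapshotted (assuming snapshots are enabled on the filesystem); a minimal sketch with placeholder paths:

  snapname=$(date +%Y-%m-%d)
  mkdir /home/mail-archive/example-account/.snap/$snapname   # take the snapshot
  rmdir /home/mail-archive/example-account/.snap/$snapname   # drop it again when expired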

[ceph-users] deep-scrub : stat mismatch

2019-07-17 Thread Ashley Merrick
Hey, I have a PG that shows the following output after a deep-scrub: 3.0 deep-scrub : stat mismatch, got 23/24 objects, 0/0 clones, 23/24 dirty, 23/24 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. This is on a
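
The usual follow-up for a stat mismatch, assuming no deeper inconsistency, is to have Ceph repair the PG and then confirm with another deep-scrub:

  ceph pg repair 3.0
  ceph pg deep-scrub 3.0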

Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-17 Thread Dietmar Rieder
Hi, thanks for the hint!! This did it. I indeed found stuck requests using "ceph daemon mds.xxx objecter_requests". I then restarted the osds involved in those requests one by one and now the problems are gone and the status is back to HEALTH_OK. Thanks again Dietmar On 7/17/19 9:08 AM,
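
For reference, the checks described come down to something like this (MDS name and OSD id are placeholders):

  ceph daemon mds.mds01 objecter_requests   # look for long-stuck ops and note their target OSDs
  systemctl restart ceph-osd@17             # restart the involved OSDs one at a time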

Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-17 Thread Yan, Zheng
Check if there are any hung requests in 'ceph daemon mds.xxx objecter_requests' On Tue, Jul 16, 2019 at 11:51 PM Dietmar Rieder wrote: > > On 7/16/19 4:11 PM, Dietmar Rieder wrote: > > Hi, > > > > We are running ceph version 14.1.2 with cephfs only. > > > > I just noticed that one of our pgs had

Re: [ceph-users] HEALTH_WARN 1 MDSs report slow metadata IOs

2019-07-17 Thread Dietmar Rieder
On 7/16/19 5:34 PM, Dietmar Rieder wrote: > On 7/16/19 4:11 PM, Dietmar Rieder wrote: >> Hi, >> >> We are running ceph version 14.1.2 with cephfs only. >> >> I just noticed that one of our pgs had scrub errors which I could repair >> >> # ceph health detail >> HEALTH_ERR 1 MDSs report slow