Re: [ceph-users] Revert a CephFS snapshot?

2019-11-14 Thread Sage Weil
On Thu, 14 Nov 2019, Patrick Donnelly wrote: > On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee wrote: > > > > On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly wrote: > > > > > > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee wrote: > > > > Recently, I'm evaluating the snapshot feature of CephFS from kernel

Re: [ceph-users] Nautilus 14.2.2 release announcement

2019-07-19 Thread Sage Weil
On Fri, 19 Jul 2019, Alex Litvak wrote: > Dear Ceph developers, > > Please forgive me if this post offends anyone, but it would be nice if this > and all other releases would be announced before or shortly after they hit the > repos. Yep, my fault. Abhishek normally does this but he's out on

Re: [ceph-users] Legacy BlueStore stats reporting?

2019-07-19 Thread Sage Weil
On Fri, 19 Jul 2019, Stig Telfer wrote: > > On 19 Jul 2019, at 10:01, Konstantin Shalygin wrote: > >> Using Ceph-Ansible stable-4.0 I did a rolling update from latest Mimic to > >> Nautilus 14.2.2 on a cluster yesterday, and the update ran to completion > >> successfully. > >> > >> However, in

Re: [ceph-users] ceph mon crash - ceph mgr module ls -f plain

2019-07-17 Thread Sage Weil
Thanks, opened bug https://tracker.ceph.com/issues/40804. Fix should be trivial. sage On Wed, 17 Jul 2019, Oskar Malnowicz wrote: > Hello, > when I execute the following command on one of my three ceph-mon, all > ceph-mon crashes. > > ceph mgr module ls -f plain > > ceph version 14.2.1

Re: [ceph-users] Changing the release cadence

2019-07-15 Thread Sage Weil
On Mon, 15 Jul 2019, Kaleb Keithley wrote: > On Mon, Jul 15, 2019 at 10:10 AM Sage Weil wrote: > > > On Mon, 15 Jul 2019, Kaleb Keithley wrote: > > > > > > If Octopus is really an LTS release like all the others, and you want > > > bleeding edge users

Re: [ceph-users] Changing the release cadence

2019-07-15 Thread Sage Weil
On Mon, 15 Jul 2019, Kaleb Keithley wrote: > On Wed, Jun 5, 2019 at 11:58 AM Sage Weil wrote: > > > ... > > > > This has mostly worked out well, except that the mimic release received > > less attention that we wanted due to the fact that multiple downstream &

Re: [ceph-users] Pool stats issue with upgrades to nautilus

2019-07-12 Thread Sage Weil
On Fri, 12 Jul 2019, Nathan Fish wrote: > Thanks. Speaking of 14.2.2, is there a timeline for it? We really want > some of the fixes in it as soon as possible. I think it's basically ready now... probably Monday? sage > > On Fri, Jul 12, 2019 at 11:22 AM Sage Weil wrote: > &g

[ceph-users] Pool stats issue with upgrades to nautilus

2019-07-12 Thread Sage Weil
Hi everyone, All current Nautilus releases have an issue where deploying a single new (Nautilus) BlueStore OSD on an upgraded cluster (i.e. one that was originally deployed pre-Nautilus) breaks the pool utilization stats reported by ``ceph df``. Until all OSDs have been reprovisioned or
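A hedged sketch of the in-place alternative to reprovisioning that the nautilus documentation describes for this situation; the OSD id is a placeholder and the OSD must be stopped while ceph-bluestore-tool runs:

  systemctl stop ceph-osd@12
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12   # updates the OSD to per-pool stats accounting
  systemctl start ceph-osd@12
  ceph df    # pool utilization becomes accurate once every OSD has been updated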

Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Sage Weil
On Sun, 30 Jun 2019, Bryan Henderson wrote: > > I'm not sure why the monitor did not mark it _out_ after 600 seconds > > (default) > > Well, that part I understand. The monitor didn't mark the OSD out because the > monitor still considered the OSD up. No reason to mark an up OSD out. > > I

[ceph-users] Octopus release target: March 1 2020

2019-07-03 Thread Sage Weil
Hi everyone, The target release date for Octopus is March 1, 2020. The freeze will be January 1, 2020. As a practical matter, that means any features need to be in before people leave for the holidays, ensuring the features get in in time and also that we can run tests over the holidays

[ceph-users] Tech Talk tomorrow: Intro to Ceph

2019-06-26 Thread Sage Weil
Hi everyone, Tomorrow's Ceph Tech Talk will be an updated "Intro to Ceph" talk by Sage Weil. This will be based on a newly refreshed set of slides and provide a high-level introduction to the overall Ceph architecture, RGW, RBD, and CephFS. Our plan is to follow-up later t

Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
people out for vacations) right in the middle of the lead-up to the freeze. Thoughts? sage On Wed, 26 Jun 2019, Sage Weil wrote: > On Wed, 26 Jun 2019, Alfonso Martinez Hidalgo wrote: > > I think March is a good idea. > > Spring had a slight edge over fall in the twitter pol

Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
For example, Nautilus was set to release in February and we got it out > > late in late March (Almost April) > > > > Would love to see more of a discussion around solving the problem of > > releasing when we say we are going to - so that we can then choose > > what the cadence is.

Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
On Tue, 25 Jun 2019, Alfredo Deza wrote: > On Mon, Jun 17, 2019 at 4:09 PM David Turner wrote: > > > > This was a little long to respond with on Twitter, so I thought I'd share > > my thoughts here. I love the idea of a 12 month cadence. I like October > > because admins aren't upgrading

Re: [ceph-users] Changing the release cadence

2019-06-17 Thread Sage Weil
On Wed, 5 Jun 2019, Sage Weil wrote: > That brings us to an important decision: what time of year should we > release? Once we pick the timing, we'll be releasing at that time *every > year* for each release (barring another schedule shift, which we want to > avoid), so let's choo

Re: [ceph-users] mutable health warnings

2019-06-14 Thread Sage Weil
On Thu, 13 Jun 2019, Neha Ojha wrote: > Hi everyone, > > There has been some interest in a feature that helps users to mute > health warnings. There is a trello card[1] associated with it and > we've had some discussion[2] in the past in a CDM about it. In > general, we want to understand a few

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
On Thu, 13 Jun 2019, Harald Staub wrote: > On 13.06.19 15:52, Sage Weil wrote: > > On Thu, 13 Jun 2019, Harald Staub wrote: > [...] > > I think that increasing the various suicide timeout options will allow > > it to stay up long enough to clean up the ginormous objects: &g

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
depending on the nature of the problem; I suggested new OSDs as import > target) > > Paul > > On Thu, Jun 13, 2019 at 3:52 PM Sage Weil wrote: > > > On Thu, 13 Jun 2019, Harald Staub wrote: > > > Idea received from Wido den Hollander: > > > bluestore

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-13 Thread Sage Weil
de note that since you started the OSD read-write using the internal copy of rocksdb, don't forget that the external copy you extracted (/mnt/ceph/db?) is now stale!) sage > > Any opinions? > > Thanks! > Harry > > On 13.06.19 09:32, Harald Staub wrote: > > On 13.0

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Sage Weil wrote: > On Thu, 13 Jun 2019, Simon Leinen wrote: > > Sage Weil writes: > > >> 2019-06-12 23:40:43.555 7f724b27f0c0 1 rocksdb: do_open column > > >> families: [default] > > >> Unrecognized command: stats > > >

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Thu, 13 Jun 2019, Simon Leinen wrote: > Sage Weil writes: > >> 2019-06-12 23:40:43.555 7f724b27f0c0 1 rocksdb: do_open column families: > >> [default] > >> Unrecognized command: stats > >> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/ver

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote: > We hope that we can get some access to S3 bucket indexes back, possibly > by somehow dropping and re-creating those indexes. Are all 3 OSDs crashing in the same way? My guess is that the reshard process triggered some massive rocksdb transaction that

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote: > Sage Weil writes: > > What happens if you do > > > ceph-kvstore-tool rocksdb /mnt/ceph/db stats > > (I'm afraid that our ceph-kvstore-tool doesn't know about a "stats" > command; but it still tries to open

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Simon Leinen wrote: > Dear Sage, > > > Also, can you try ceph-bluestore-tool bluefs-export on this osd? I'm > > pretty sure it'll crash in the same spot, but just want to confirm > > it's a bluefs issue. > > To my surprise, this actually seems to have worked: > > $ time

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Harald Staub wrote: > On 12.06.19 17:40, Sage Weil wrote: > > On Wed, 12 Jun 2019, Harald Staub wrote: > > > Also opened an issue about the rocksdb problem: > > > https://tracker.ceph.com/issues/40300 > > > > Thanks! > > &g

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Sage Weil
On Wed, 12 Jun 2019, Harald Staub wrote: > Also opened an issue about the rocksdb problem: > https://tracker.ceph.com/issues/40300 Thanks! The 'rocksdb: Corruption: file is too short' is the root of the problem here. Can you try starting the OSD with 'debug_bluestore=20' and 'debug_bluefs=20'?
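One way to capture the requested output, assuming the OSD is managed by systemd and the id (48) is a placeholder: add the two debug settings under that OSD's section in ceph.conf, restart it, and collect the log.

  # in ceph.conf:
  #   [osd.48]
  #   debug bluestore = 20
  #   debug bluefs = 20
  systemctl restart ceph-osd@48
  less /var/log/ceph/ceph-osd.48.log    # log to attach to the tracker issue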

Re: [ceph-users] RFC: relicence Ceph LGPL-2.1 code as LGPL-2.1 or LGPL-3.0

2019-06-12 Thread Sage Weil
On Fri, 10 May 2019, Sage Weil wrote: > Hi everyone, > > -- What -- > > The Ceph Leadership Team[1] is proposing a change of license from > *LGPL-2.1* to *LGPL-2.1 or LGPL-3.0* (dual license). The specific changes > are described by this pull request: > > h

[ceph-users] typical snapmapper size

2019-06-06 Thread Sage Weil
Hello RBD users, Would you mind running this command on a random OSD on your RBD-oriented cluster? ceph-objectstore-tool \ --data-path /var/lib/ceph/osd/ceph-NNN \ '["meta",{"oid":"snapmapper","key":"","snapid":0,"hash":2758339587,"max":0,"pool":-1,"namespace":"","max":0}]' \ list-omap |
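The quoted pipeline is cut off at "list-omap |"; a plausible completion (my assumption, not the original text) simply counts the omap keys, with the OSD stopped while the tool runs:

  ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-NNN \
    '["meta",{"oid":"snapmapper","key":"","snapid":0,"hash":2758339587,"max":0,"pool":-1,"namespace":"","max":0}]' \
    list-omap | wc -l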

[ceph-users] Changing the release cadence

2019-06-05 Thread Sage Weil
Hi everyone, Since luminous, we have had the following release cadence and policy: - release every 9 months - maintain backports for the last two releases - enable upgrades to move either 1 or 2 releases ahead (e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...) This

Re: [ceph-users] v12.2.5 Luminous released

2019-06-04 Thread Sage Weil
e one active MDS, > >upgrade the single active MDS, then upgrade/start standbys. Finally, > >restore the previous max_mds. > > > >See also: https://tracker.ceph.com/issues/23172 > > > > > > Other Notable Changes > > - &g

Re: [ceph-users] RFC: relicence Ceph LGPL-2.1 code as LGPL-2.1 or LGPL-3.0

2019-05-24 Thread Sage Weil
On Fri, 10 May 2019, Robin H. Johnson wrote: > On Fri, May 10, 2019 at 02:27:11PM +0000, Sage Weil wrote: > > If you are a Ceph developer who has contributed code to Ceph and object to > > this change of license, please let us know, either by replying to this > > mes

[ceph-users] RFC: relicence Ceph LGPL-2.1 code as LGPL-2.1 or LGPL-3.0

2019-05-10 Thread Sage Weil
Hi everyone, -- What -- The Ceph Leadership Team[1] is proposing a change of license from *LGPL-2.1* to *LGPL-2.1 or LGPL-3.0* (dual license). The specific changes are described by this pull request: https://github.com/ceph/ceph/pull/22446 If you are a Ceph developer who has

Re: [ceph-users] upgrade to nautilus: "require-osd-release nautilus" required to increase pg_num

2019-05-02 Thread Sage Weil
On Mon, 29 Apr 2019, Alexander Y. Fomichev wrote: > Hi, > > I just upgraded from mimic to nautilus(14.2.0) and stumbled upon a strange > "feature". > I tried to increase pg_num for a pool. There was no errors but also no > visible effect: > > # ceph osd pool get foo_pool01 pg_num > pg_num: 256 >
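The subject line gives the resolution away; a hedged sketch of checking and then setting the flag once every OSD really runs nautilus (the pool name comes from the quoted message, the target pg_num is arbitrary):

  ceph osd dump | grep require_osd_release
  ceph osd require-osd-release nautilus
  ceph osd pool set foo_pool01 pg_num 512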

Re: [ceph-users] rgw, nss: dropping the legacy PKI token support in RadosGW (removed in OpenStack Ocata)

2019-04-19 Thread Sage Weil
[Adding ceph-users for better usability] On Fri, 19 Apr 2019, Radoslaw Zarzynski wrote: > Hello, > > RadosGW can use OpenStack Keystone as one of its authentication > backends. Keystone in turn had been offering many token variants > over the time with PKI/PKIz being one of them. Unfortunately,

[ceph-users] Cephalocon Barcelona, May 19-20

2019-04-05 Thread Sage Weil
Hi everyone, This is a reminder that Cephalocon Barcelona is coming up next month (May 19-20), and it's going to be great! We have two days of Ceph content over four tracks, including: - A Rook tutorial for deploying Ceph over SSD instances - Several other Rook and Kubernetes related talks,

Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-04 Thread Sage Weil
69c4-e7e4-47d3-8fb7-475ea4cfe14a > > This should have the information you need. > > On Wed, Apr 3, 2019 at 5:49 PM Sage Weil wrote: > > > This OSD also appears on the accepting end of things, and probably > > has newer keys than the OSD connecting (tho it's hard to te

Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-03 Thread Sage Weil
ast twice and am still getting the > > same error. > > > > I'll send a log file with confirmed interesting bad behavior shortly > > > > On Wed, Apr 3, 2019, 17:17 Sage Weil wrote: > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1:10.36.9.46:6813/500

Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-03 Thread Sage Weil
ing nautilus? Does 'ceph versions' show everything has upgraded? sage On Wed, 3 Apr 2019, Shawn Edwards wrote: > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7 > > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil wrote: > > > Hmm, that doesn't help. > > > > Can you se

Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-03 Thread Sage Weil
e: > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil wrote: > > > On Wed, 3 Apr 2019, Shawn Edwards wrote: > > > Recent nautilus upgrade from mimic. No issues on mimic. > > > > > > Now g

Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-03 Thread Sage Weil
On Wed, 3 Apr 2019, Shawn Edwards wrote: > Recent nautilus upgrade from mimic. No issues on mimic. > > Now getting this or similar in all osd logs, there is very little osd > communication, and most of the PGs are either 'down' or 'unknown', even > though I can see the data on the filestores. >

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Sage Weil
I have a ticket open for this: http://tracker.ceph.com/issues/38745 Please comment there with the health warning you're seeing and any other details so we can figure out why it's happening. I wouldn't reprovision those OSDs yet, until we know why it happens. Also, it's likely that
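A hedged way to see how much DB metadata has spilled onto the slow device before deciding anything; osd.7 is a placeholder and the "ceph daemon" command has to run on the host where that OSD lives:

  ceph health detail | grep -i spillover
  ceph daemon osd.7 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'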

Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

2019-02-28 Thread Sage Weil
ll be 14.2.0 in a week or two. sage > > > From: ceph-users on behalf of admin > > Sent: Thursday, February 28, 2019 4:22 AM > To: Pritha Srivastava; Sage Weil; ceph-us...@ceph.com > Subject: Re: [ceph-users] [Ceph-community] How does ceph use

Re: [ceph-users] [Ceph-community] How does ceph use the STS service?

2019-02-27 Thread Sage Weil
Moving this to ceph-users. On Wed, 27 Feb 2019, admin wrote: > I want to use the STS service to generate temporary credentials for use by > third-party clients. > > I configured STS lite based on the documentation. > http://docs.ceph.com/docs/master/radosgw/STSLite/ > > This is my
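For reference, my recollection of the two ceph.conf settings the linked STSLite document centres on; treat both as assumptions and verify against the document itself (the RGW instance name and key are placeholders):

  # [client.rgw.gateway1]
  # rgw sts key = abcdefghijklmnop      # 16-character secret, placeholder
  # rgw s3 auth use sts = true
  systemctl restart ceph-radosgw@rgw.gateway1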

Re: [ceph-users] ceph osd journal disk in RAID#1?

2019-02-14 Thread Sage Weil
On Thu, 14 Feb 2019, John Petrini wrote: > Cost and available disk slots are also worth considering since you'll > burn a lot more by going RAID-1, which again really isn't necessary. > This may be the most convincing reason not to bother. Generally speaking, if the choice is between a 2 RAID-1

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-14 Thread Sage Weil
are all active+clean, > > > the old maps should be trimmed and the disk space freed. > > > > > > However, several people have noted that (at least in luminous > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all > > > mons are restarted. T

Re: [ceph-users] change OSD IP it uses

2019-02-08 Thread Sage Weil
The IP that an OSD (or other non-monitor daemon) uses normally depends on what IP is used by the local host to reach the monitor(s). If you want your OSDs to be on a different network, generally the way to do that is to move the monitors to that network too. You can also try the
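The message is cut off mid-sentence; my guess at where it was heading is the per-daemon address options in ceph.conf, e.g. pinning the address for one OSD and restarting it (the id and address are placeholders):

  # [osd.3]
  # public addr = 192.168.10.21
  systemctl restart ceph-osd@3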

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread Sage Weil
Hi Swami, The limit is somewhat arbitrary, based on cluster sizes we had seen when we picked it. In your case it should be perfectly safe to increase it. sage On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote: > Hello - Are there any limits for mon_data_size for a cluster with 2PB > (with 2000+
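A hedged example of raising the threshold as suggested; 30 GiB is an arbitrary value, and the same number should be persisted under [mon] in ceph.conf so it survives restarts:

  ceph tell mon.* injectargs '--mon_data_size_warn=32212254720'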

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-04 Thread Sage Weil
On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > So i restarted the osd but he stop after some time. But this is an effect on > the cluster and cluster is on a partial recovery process. > > please find here log file of osd 49 after this restart >

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-03 Thread Sage Weil
596887 [121,24]121 > [121,24]121 69295'19811665 2019-02-01 12:48:41.343144 > 66131'19810044 2019-01-30 11:44:36.006505 > > cp done. > > So i can make ceph-objecstore-tool --op remove command ? yep! > > ____ >

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-03 Thread Sage Weil
/current/11.182_head to a safe location and then use the ceph-objecstore-tool --op remove command. But first confirm that 'ceph pg ls' shows the PG as active. sage > > Kr > > Philippe. > > ____ > From: Sage Weil > Sent: 04 February 2019
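A hedged sketch of the sequence described above; the OSD id, PG id and paths come from the quoted thread, the backup destination is a placeholder, and --force may be required on some versions:

  ceph pg ls | grep '^11.182'      # confirm the PG shows as active elsewhere first
  systemctl stop ceph-osd@49
  cp -a /var/lib/ceph/osd/ceph-49/current/11.182_head /root/11.182_head.bak
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-49 \
      --pgid 11.182 --op remove --force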

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-03 Thread Sage Weil
On Mon, 4 Feb 2019, Sage Weil wrote: > On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > > Hi Sage, First of all tanks for your help > > > > Please find here > > https://filesender.belnet.be/?s=download=dea0edda-5b6a-4284-9ea1-c1fdf88b65e9 Something caused the version

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-03 Thread Sage Weil
On Mon, 4 Feb 2019, Philippe Van Hecke wrote: > Hi Sage, First of all tanks for your help > > Please find here > https://filesender.belnet.be/?s=download=dea0edda-5b6a-4284-9ea1-c1fdf88b65e9 > the osd log with debug info for osd.49. and indeed if all buggy osd can > restart that can may be

Re: [ceph-users] Luminous cluster in very bad state need some assistance.

2019-02-03 Thread Sage Weil
On Sun, 3 Feb 2019, Philippe Van Hecke wrote: > Hello, > I'm working for BELNET, the Belgian National Research Network > > We currently manage a luminous ceph cluster on ubuntu 16.04 > with 144 hdd osd spread across two data centers with 6 osd nodes > on each datacenter. Osd(s) are 4 TB sata

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Sage Weil
; >From what I see in diff, the biggest difference is in tcmalloc, but maybe > >I'm wrong. > > (I'm using tcmalloc 2.5-2.2) > > > - Original message - From: "Sage Weil" To: "aderumier" Cc: "ceph-users" , "ceph-devel"

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-25 Thread Sage Weil
Can you capture a perf top or perf record to see where the CPU time is going on one of the OSDs with a high latency? Thanks! sage On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: > > Hi, > > I have a strange behaviour of my osd, on multiple clusters, > > All clusters are running mimic
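One way to capture what is being asked for; the pgrep pattern and OSD id are assumptions about the host's process layout:

  pid=$(pgrep -f 'ceph-osd.*--id 12')
  perf top -p "$pid"                       # live view of the hottest functions
  perf record -g -p "$pid" -- sleep 60     # or record ~60s for later analysis
  perf report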

[ceph-users] Ceph tech talk tomorrow: NooBaa data platform for distributed hybrid clouds

2019-01-16 Thread Sage Weil
Hi everyone, First, this is a reminder that there is a Tech Talk tomorrow from Guy Margalit about NooBaa, a multi-cloud object data services platform: Jan 17 at 19:00 UTC https://bluejeans.com/908675367 Why, you might ask? There is a lot of interest among many Ceph developers and vendors to

[ceph-users] dropping python 2 for nautilus... go/no-go

2019-01-16 Thread Sage Weil
Hi everyone, This has come up several times before, but we need to make a final decision. Alfredo has a PR prepared that drops Python 2 support entirely in master, which will mean nautilus is Python 3 only. All of our distro targets (el7, bionic, xenial) include python 3, so that isn't an

Re: [ceph-users] OSDs crashing in EC pool (whack-a-mole)

2019-01-08 Thread Sage Weil
I've seen this on luminous, but not on mimic. Can you generate a log with debug osd = 20 leading up to the crash? Thanks! sage On Tue, 8 Jan 2019, Paul Emmerich wrote: > I've seen this before a few times but unfortunately there doesn't seem > to be a good solution at the moment :( > > See
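Capturing the requested log, with osd.5 as a placeholder: injectargs only works while the daemon is still up, so if it crashes at startup the safer path is setting "debug osd = 20" under that OSD's section in ceph.conf and restarting it.

  ceph tell osd.5 injectargs '--debug_osd=20'
  # ...reproduce the crash, then collect /var/log/ceph/ceph-osd.5.log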

Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Sage Weil
I think that code was broken by ea723fbb88c69bd00fefd32a3ee94bf5ce53569c and should be fixed like so: diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc index 8376a40668..12f468636f 100644 --- a/src/mon/OSDMonitor.cc +++ b/src/mon/OSDMonitor.cc @@ -1006,7 +1006,8 @@ void

Re: [ceph-users] Help with setting device-class rule on pool without causing data to move

2018-12-30 Thread Sage Weil
On Sun, 30 Dec 2018, David C wrote: > Hi All > > I'm trying to set the existing pools in a Luminous cluster to use the hdd > device-class but without moving data around. If I just create a new rule > using the hdd class and set my pools to use that new rule it will cause a > huge amount of data

Re: [ceph-users] Cephalocon Barcelona 2019 CFP now open!

2018-12-10 Thread Sage Weil
On Mon, 10 Dec 2018, Wido den Hollander wrote: > On 12/10/18 5:00 PM, Mike Perez wrote: > > Hello everyone! > > > > It gives me great pleasure to announce the CFP for Cephalocon Barcelona > > 2019 is now open [1]! > > > > Cephalocon Barcelona aims to bring together more than 800 technologists >

Re: [ceph-users] Mimic offline problem

2018-10-05 Thread Sage Weil
re is something wrong. But we are not sure if we cant use the tool or > there is something wrong with OSD. > > > > On 4 Oct 2018, at 06:17, Sage Weil wrote: > > > > On Thu, 4 Oct 2018, Goktug Yildirim wrote: > >> This is our cluster state right now. I can rea

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
sdb CheckConsistency code. Not sure what to make of that. > https://paste.ubuntu.com/p/SY3576dNbJ/ > https://paste.ubuntu.com/p/smyT6Y976b/ These are failing in BlueStore code. The ceph-bluestore-tool fsck may help here, can you give it a shot? sage > > > On 3 Oct 2018, at 21:37,
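The fsck being suggested, run with the OSD stopped; the OSD id is a placeholder taken from the quoted logs:

  systemctl stop ceph-osd@150
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-150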

Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread Sage Weil
On Tue, 2 Oct 2018, jes...@krogh.cc wrote: > Hi. > > Based on some recommendations we have setup our CephFS installation using > bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS > server - 100TB-ish size. > > Current setup is - a sizeable Linux host with 512GB of memory -

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
ing pgs properly. If you take that 30 byte file I sent earlier (as hex) and update the osdmap epoch to the latest on the mon, confirm it decodes and dumps properly, and then inject it on the 3 mons, that should get you past this hump (and hopefully back up!). sage > > Sage Weil şunlar

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
>ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json >error: buffer::malformed_input: void >creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer >understand >old encoding version 2 < 111 > >My ceph version: 13.2.2 > >Wed, 3 Oct 2018, at 2

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
> "created_pools": [] > } > > You can find the "dump" link below. > > dump: > https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing > > > Sage Weil şunları yazdı (3 Eki 2018 18:45): > > >> On We

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
ch here "creating_pgs": [], "queue": [], "created_pools": [ 66 ] } sage > > > On 3 Oct 2018, at 17:52, Sage Weil wrote: > > > > On Wed, 3 Oct 2018, Goktug Yildirim wrote: > >> Sage, > >> > >> P

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
ons 8. start all mons 4-6 will probably be an iterative process... let's start by getting the structure out and dumping the current value? The code to refer to to understand the structure is src/mon/CreatingPGs.h encode/decode methods. sage > > > > On 3 Oct 2018, at

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
t +create_info) prio 255 cost 10 e72642) queued > >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 > >> to_process >> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> > >> waiting <> waiting_peering {} >

Re: [ceph-users] Mimic offline problem

2018-10-02 Thread Sage Weil
osd_find_best_info_ignore_history_les is a dangerous option and you should only use it in very specific circumstances when directed by a developer. In such cases it will allow a stuck PG to peer. But you're not getting to that point...you're seeing some sort of resource exhaustion. The noup
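The message is cut off at "The noup"; for reference, the noup flag it refers to is toggled cluster-wide like this:

  ceph osd set noup      # keep OSDs from being marked up while you investigate
  ceph osd unset noup    # let them come back up afterwards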

Re: [ceph-users] Mimic upgrade failure

2018-09-24 Thread Sage Weil
Hi Kevin, Do you have an update on the state of the cluster? I've opened a ticket http://tracker.ceph.com/issues/36163 to track the likely root cause we identified, and have a PR open at https://github.com/ceph/ceph/pull/24247 Thanks! sage On Thu, 20 Sep 2018, Sage Weil wrote: > On Thu,

[ceph-users] crush map reclassifier

2018-09-21 Thread Sage Weil
Hi everyone, In luminous we added the crush device classes that automagically categorize your OSDs as hdd, ssd, etc., and allow you to write CRUSH rules that target a subset of devices. Prior to this it was necessary to make custom edits to your CRUSH map with parallel hierarchies for each OSD
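A hedged sketch of the reclassify workflow this work became in nautilus's crushtool; the root and bucket-name pattern are placeholders, so check the crushtool documentation before running it against a real map:

  ceph osd getcrushmap -o cm.orig
  crushtool -i cm.orig --reclassify \
      --reclassify-root default hdd \
      --reclassify-bucket %-ssd ssd default \
      -o cm.new
  crushtool -i cm.orig --compare cm.new    # verify PG mappings are preserved
  ceph osd setcrushmap -i cm.new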

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread Sage Weil
1.27% 1.27% libceph-common.so.0 [.] ceph::encode std::vector >, > std::less, mempool::pool_allocator<(mempool::pool_index_t)15, > std::pair mempool::pool_allocator<(mempool::pool_index_t) > 1.13% 1.13% ceph-mon [.] std::_Rb_tree std::pair memp

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread Sage Weil
small mimic test cluster actually shows similar in its features, > mon,mgr,mds,osd all report luminous features yet have 13.2.1 installed, so > maybe that is normal. > > Kevin > > On 09/19/2018 09:35 AM, Sage Weil wrote: > > It's hard to tell exactly from the below,

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Sage Weil
/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429 > stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", > {st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0 > mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000 > close(429)

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Sage Weil
conf for all of the mons. This is a bit of a band-aid but should help you keep the mons in quorum until we sort out what is going on. sage > Thanks > Kevin > > On 09/10/2018 07:06 AM, Sage Weil wrote: > > I took a look at the mon log you sent. A few things I noticed

Re: [ceph-users] Mimic upgrade failure

2018-09-14 Thread Sage Weil
ons about every minute so I let this > > > run for a few elections and saw this node become the leader a > > > couple times. Debug logs start around 23:27:30. I had managed to > > > get about 850/857 osds up, but it seems that within the last 30 > > >

Re: [ceph-users] [Ceph-community] Multisite replication jewel and luminous

2018-09-12 Thread Sage Weil
[Moving this to ceph-users where it will get more eyeballs.] On Wed, 12 Sep 2018, Andrew Cassera wrote: > Hello, > > Any help would be appreciated. I just created two clusters in the lab. One > cluster is running jewel 10.2.10 and the other cluster is running luminous > 12.2.8. After creating

Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Sage Weil
ed the file with my email address for the user. It is > > > with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The > > > mons are calling for elections about every minute so I let this > > > run for a few elections and saw this node become the lead

Re: [ceph-users] Mimic upgrade failure

2018-09-08 Thread Sage Weil
Hi Kevin, I can't think of any major luminous->mimic changes off the top of my head that would impact CPU usage, but it's always possible there is something subtle. Can you ceph-post-file the full log from one of your mons (preferably the leader)? You might try adjusting the rocksdb cache
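Uploading the requested log; the path assumes the default log location on the mon host:

  ceph-post-file /var/log/ceph/ceph-mon.$(hostname -s).log
  # the command prints a tag to quote back on the list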

Re: [ceph-users] ceph-fuse using excessive memory

2018-09-05 Thread Sage Weil
On Wed, 5 Sep 2018, Andras Pataki wrote: > Hi cephers, > > Every so often we have a ceph-fuse process that grows to rather large size (up > to eating up the whole memory of the machine). Here is an example of a 200GB > RSS size ceph-fuse instance: > > # ceph daemon
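The quoted command is cut off at "ceph daemon"; a typical next step for an oversized ceph-fuse process (the admin-socket path is an assumption about the client's layout) would be:

  ceph daemon /var/run/ceph/ceph-client.admin.asok dump_mempools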

Re: [ceph-users] New Ceph community manager: Mike Perez

2018-08-29 Thread Sage Weil
Correction: Mike's new email is actually mipe...@redhat.com (sorry, mperez!). sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] New Ceph community manager: Mike Perez

2018-08-28 Thread Sage Weil
Hi everyone, Please help me welcome Mike Perez, the new Ceph community manager! Mike has a long history with Ceph: he started at DreamHost working on OpenStack and Ceph back in the early days, including work on the original RBD integration. He went on to work in several roles in the OpenStack

Re: [ceph-users] removing auids and auid-based cephx capabilities

2018-08-11 Thread Sage Weil
On Fri, 10 Aug 2018, Gregory Farnum wrote: > On Wed, Aug 8, 2018 at 1:33 PM, Sage Weil wrote: > > There is an undocumented part of the cephx authentication framework called > > the 'auid' (auth uid) that assigns an integer identifier to cephx users > > and to rados pools an

Re: [ceph-users] RBD image "lightweight snapshots"

2018-08-10 Thread Sage Weil
On Fri, 10 Aug 2018, Paweł Sadowski wrote: > On 08/09/2018 04:39 PM, Alex Elder wrote: > > On 08/09/2018 08:15 AM, Sage Weil wrote: > >> On Thu, 9 Aug 2018, Piotr Dałek wrote: > >>> Hello, > >>> > >>> At OVH we're heavily utilizing s

Re: [ceph-users] RBD image "lightweight snapshots"

2018-08-09 Thread Sage Weil
On Thu, 9 Aug 2018, Piotr Dałek wrote: > Hello, > > At OVH we're heavily utilizing snapshots for our backup system. We think > there's an interesting optimization opportunity regarding snapshots I'd like > to discuss here. > > The idea is to introduce a concept of a "lightweight" snapshots -

[ceph-users] removing auids and auid-based cephx capabilities

2018-08-08 Thread Sage Weil
There is an undocumented part of the cephx authentication framework called the 'auid' (auth uid) that assigns an integer identifier to cephx users and to rados pools and allows you to craft cephx capabilities that apply to those pools. This is leftover infrastructure from an ancient time in

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread Sage Weil
s to every >OSD like back in the hammer days? I seem to remember that OSD maps >should be a lot smaller now, so maybe this isn't as big of a problem as >it was back then? > >Thanks, >Bryan > > >From: ceph-users on behalf of Sage >Weil >Date: Friday, July 27, 2018 at

[ceph-users] v13.2.1 Mimic released

2018-07-27 Thread Sage Weil
subject to replay attack (issue#24836 http://tracker.ceph.com/issues/24836, Sage Weil) * CVE 2018-1129: auth: cephx signature check is weak (issue#24837 http://tracker.ceph.com/issues/24837, Sage Weil) * CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838 * <http://tracker.ceph.

Re: [ceph-users] download.ceph.com repository changes

2018-07-25 Thread Sage Weil
On Tue, 24 Jul 2018, Alfredo Deza wrote: > Hi all, > > After the 12.2.6 release went out, we've been thinking on better ways > to remove a version from our repositories to prevent users from > upgrading/installing a known bad release. > > The way our repos are structured today means every single

Re: [ceph-users] [Ceph-maintainers] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Linh Vu wrote: > Thanks for all your hard work in putting out the fixes so quickly! :) > > We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, > not RGW. In the release notes, it says RGW is a risk especially the > garbage collection, and the

[ceph-users] Ceph Community Manager

2018-07-18 Thread Sage Weil
Hi everyone, Leo Vaz has moved on from his community manager role. I'd like to take this opportunity to thank him for his efforts over the past year, and to wish him the best in his future ventures. We've accomplished a lot during his tenure (including our first Cephalocon!) and Leo's

Re: [ceph-users] 10.2.6 upgrade

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Glen Baars wrote: > Hello Ceph Users, > > We installed 12.2.6 on a single node in the cluster ( new node added, 80TB > moved ) > Disabled scrub/deepscrub once the issues with 12.2.6 were discovered. > > > Today we upgrade the one affected node to 12.2.7 today, set osd skip

Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Oliver Freyermuth wrote: > Am 18.07.2018 um 14:20 schrieb Sage Weil: > > On Wed, 18 Jul 2018, Linh Vu wrote: > >> Thanks for all your hard work in putting out the fixes so quickly! :) > >> > >> We have a cluster on 12.2.5 with

Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Linh Vu wrote: > Thanks for all your hard work in putting out the fixes so quickly! :) > > We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, > not RGW. In the release notes, it says RGW is a risk especially the > garbage collection, and the

Re: [ceph-users] v12.2.7 Luminous released

2018-07-17 Thread Sage Weil
On Tue, 17 Jul 2018, Stefan Kooman wrote: > Quoting Abhishek Lekshmanan (abhis...@suse.com): > > > *NOTE* The v12.2.5 release has a potential data corruption issue with > > erasure coded pools. If you ran v12.2.5 with erasure coding, please see ^^^ > > below. > > < snip > >

Re: [ceph-users] 12.2.6 CRC errors

2018-07-16 Thread Sage Weil
repair these PGs with ceph pg repair and it reports the error is fixed. > But is it really fixed? > Do I have to be afraid to have now corrupted data? > Would it be an option to noout these bluestore OSDs and stop them? > When do you expect the new 12.2.7 Release? Will it fix all the errors

Re: [ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Sage Weil
On Sat, 14 Jul 2018, Glen Baars wrote: > Hello Ceph users! > > Note to users, don't install new servers on Friday the 13th! > > We added a new ceph node on Friday and it has received the latest 12.2.6 > update. I started to see CRC errors and investigated hardware issues. I > have since found

[ceph-users] IMPORTANT: broken luminous 12.2.6 release in repo, do not upgrade

2018-07-13 Thread Sage Weil
Hi everyone, tl;dr: Please avoid the 12.2.6 packages that are currently present on download.ceph.com. We will have a 12.2.7 published ASAP (probably Monday). If you do not use bluestore or erasure-coded pools, none of the issues affect you. Details: We built 12.2.6 and pushed it to the
