Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Philippe D'Anjou
I don't think this has anything to do with CephFS; the mon crashes for the same
reason even without the MDS running. I still have the old rocksdb files, but they
had a corruption issue, so I'm not sure whether that's easier to fix. There haven't
been any changes on the cluster in between.
This is a disaster rebuild: we managed to get all CephFS data back online, apart
from some metadata, and we have been copying it off for the last few weeks. But now
the mon has died, first from the rocksdb corruption and now, after the repair,
because of this osdmap issue.
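
For reference, a rough way to check whether the pinned full map is actually present
in the old store files is to point ceph-monstore-tool at a copy of them; the paths
below are placeholders and the exact option syntax may differ between releases, so
treat this as a sketch rather than a verified procedure:

# always work on a copy of the store, never the live one
$ cp -a /var/lib/ceph/mon/ceph-<id> /root/mon-store-backup
$ ceph-monstore-tool /root/mon-store-backup dump-keys | grep osdmap | head
# try to extract the full map the monitor says is missing (252615 in the log below)
$ ceph-monstore-tool /root/mon-store-backup get osdmap -- --version 252615 --out /tmp/osdmap.252615

If the key is absent from both the repaired and the old store, the map really is
gone and the surgery Greg describes below is what remains.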

On Wednesday, 9 October 2019 at 20:19:42 OESZ, Gregory Farnum wrote:
 
 On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this cluster
> and it's just being used to copy off data now. We weren't able to add more
> mons because once a second mon was added it crashed the first one (there's a
> bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but they had
> the rocksdb corruption issue (not sure why that happened; it had run fine for
> 2 months).
>
> Any options? I mean everything still works, data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg


> On Monday, 7 October 2019 at 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool earlier because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash-loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant with "turn it off and rebuild from remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
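
For reference, the remove/re-add cycle Greg mentions looks roughly like this (a
sketch of the documented manual procedure; <id> is a placeholder and orchestrated
deployments wrap these steps differently):

$ ceph mon remove <id>
$ rm -rf /var/lib/ceph/mon/ceph-<id>          # wipe the broken store
$ ceph mon getmap -o /tmp/monmap
$ ceph auth get mon. -o /tmp/mon.keyring
$ ceph-mon -i <id> --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
$ systemctl start ceph-mon@<id>

That only helps when healthy peers exist, which is not the case with a single mon here.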
>
>
> >
> > On Saturday, 5 October 2019 at 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > >4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > >lease_expire=0.00 has v0 lc 4549352
> > >    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > >closest pinned map ver 252615 not available! error: (2) No such file or 
> > >directory
> > >    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > >/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > >OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > >7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > >(stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > >const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > >const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > >ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> > >  5: 
> > >(OSDMonitor::encode_trim_extra(std::shared_ptr,
> > > unsigned long)+0x8c) [0x717c3c]
> > >  6: (PaxosService::maybe_trim()+0x473) 

Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Wed, Oct 9, 2019 at 10:58 AM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Best I can tell, automatic cache sizing is enabled and all related
> settings are at their default values.
>
> Looking through cache tunables, I came across
> osd_memory_expected_fragmentation, which the docs define as "estimate
> the percent of memory fragmentation". What's the formula to compute
> actual percentage of memory fragmentation?
>
> Based on /proc/buddyinfo, I suspect that our memory fragmentation is a
> lot worse than osd_memory_expected_fragmentation default of 0.15. Could
> this be related to many OSDs' RSSes far exceeding osd_memory_target?
>
> So far high memory consumption hasn't been a problem for us. (I guess
> it's possible that the kernel simply sees no need to reclaim unmapped
> memory until there is actually real memory pressure?)


Oh, that you can check on the admin socket using the “heap” family of
commands. It'll tell you how much the daemon is actually using out of
what's allocated and, IIRC, what it has given back to the OS but which
maybe hasn't actually been reclaimed.
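
For anyone following along, the "heap" commands are tcmalloc statistics exposed via
the admin socket; roughly, using osd.0 as an example id:

$ ceph tell osd.0 heap stats       # mapped vs. freed-but-not-yet-returned memory
$ ceph tell osd.0 heap release     # ask tcmalloc to hand free memory back to the OS
$ ceph daemon osd.0 dump_mempools  # per-pool breakdown of what the OSD thinks it uses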

> It's just a little
> scary not understanding why this started happening when memory usage had
> been so stable before.


>
> Thanks,
>
> Vlad
>
>
>
> On 10/9/19 11:51 AM, Gregory Farnum wrote:
> > On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
> >  wrote:
> >>
> >>   > Do you have statistics on the size of the OSDMaps or count of them
> >>   > which were being maintained by the OSDs?
> >> No, I don't think so. How can I find this information?
> >
> > Hmm I don't know if we directly expose the size of maps. There are
> > perfcounters which expose the range of maps being kept around but I
> > don't know their names off-hand.
> >
> > Maybe it's something else involving the bluestore cache or whatever;
> > if you're not using the newer memory limits I'd switch to those but
> > otherwise I dunno.
> > -Greg
> >
> >>
> >> Memory consumption started to climb again:
> >> https://icecube.wisc.edu/~vbrik/graph-3.png
> >>
> >> Some more info (not sure if relevant or not):
> >>
> >> I increased size of the swap on the servers to 10GB and it's being
> >> completely utilized, even though there is still quite a bit of free
> memory.
> >>
> >> It appears that memory is highly fragmented on the NUMA node 0 of all
> >> the servers. Some of the servers have no free pages higher than order 0.
> >> (Memory on NUMA node 1 of the servers appears much less fragmented.)
> >>
> >> The servers have 192GB of RAM, 2 NUMA nodes.
> >>
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> >>> Do you have statistics on the size of the OSDMaps or count of them
> >>> which were being maintained by the OSDs? I'm not sure why having noout
> >>> set would change that if all the nodes were alive, but that's my bet.
> >>> -Greg
> >>>
> >>> On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >>>  wrote:
> 
>  And, just as unexpectedly, things have returned to normal overnight
>  https://icecube.wisc.edu/~vbrik/graph-1.png
> 
>  The change seems to have coincided with the beginning of Rados Gateway
>  activity (before, it was essentially zero). I can see nothing in the
>  logs that would explain what happened though.
> 
>  Vlad
> 
> 
> 
>  On 10/2/19 3:43 PM, Vladimir Brik wrote:
> > Hello
> >
> > I am running a Ceph 14.2.2 cluster and a few days ago, memory
> > consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> > after being stable for about 6 months.
> >
> > Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> > Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >
> > I am not sure what changed to cause this. Cluster usage has been very
> > light (typically <10 iops) during this period, and the number of
> objects
> > stayed about the same.
> >
> > The only unusual occurrence was the reboot of one of the nodes the
> day
> > before (a firmware update). For the reboot, I ran "ceph osd set
> noout",
> > but forgot to unset it until several days later. Unsetting noout did
> not
> > stop the increase in memory consumption.
> >
> > I don't see anything unusual in the logs.
> >
> > Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
> > 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> > don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> > utilized, with 101-104 PGs.
> >
> > Does anybody know what might be the problem here and how to address
> or
> > debug it?
> >
> >
> > Thanks very much,
> >
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  ___
>  ceph-users mailing 

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-09 Thread Brad Hubbard
Awesome! Sorry it took so long.

On Thu, Oct 10, 2019 at 12:44 AM Marc Roos  wrote:
>
>
> Brad, many thanks!!! My cluster finally has HEALTH_OK after 1.5 years or so! :)
>
>
> -Original Message-
> Subject: Re: Ceph pg repair clone_missing?
>
> On Fri, Oct 4, 2019 at 6:09 PM Marc Roos 
> wrote:
> >
> >  >
> >  >Try something like the following on each OSD that holds a copy of
> >  >rbd_data.1f114174b0dc51.0974 and see what output you
> get.
> >  >Note that you can drop the bluestore flag if they are not bluestore
>
> > >osds and you will need the osd stopped at the time (set noout). Also
>
> > >note, snapids are displayed in hexadecimal in the output (but then
> '4'
>  >is '4' so not a big issue here).
> >  >
> >  >$ ceph-objectstore-tool --type bluestore --data-path
> > >/var/lib/ceph/osd/ceph-XX/ --pgid 17.36 --op list
> >  >rbd_data.1f114174b0dc51.0974
> >
> > I got these results
> >
> > osd.7
> > Error getting attr on : 17.36_head,#-19:6c00:::scrub_17.36:head#,
> > (61) No data available
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
>
> Ah, so of course the problem is the snapshot is missing. You may need to
> try something like the following on each of those osds.
>
> $ ceph-objectstore-tool --type bluestore --data-path
> /var/lib/ceph/osd/ceph-XX/ --pgid 17.36
> '{"oid":"rbd_data.1f114174b0dc51.0974","key":"","snapid":-2,
> "hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}'
> remove-clone-metadata 4
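
For completeness, once the clone metadata has been removed on each OSD, the usual
follow-up is to start the OSDs again, clear noout, and re-scrub the PG so the primary
re-checks the object (standard commands, shown here as a sketch for pg 17.36):

$ systemctl start ceph-osd@7       # repeat for osd.12 and osd.29
$ ceph osd unset noout
$ ceph pg deep-scrub 17.36
$ ceph pg repair 17.36             # if the deep scrub still reports an inconsistency
$ ceph health detail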
>
> >
> > osd.12
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> >
> > osd.29
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> > ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> > na
> > pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> >
> >
> >  >
> >  >The likely issue here is the primary believes snapshot 4 is gone but
>
> > >there is still data and/or metadata on one of the replicas which is
> > >confusing the issue. If that is the case you can use the
> > >ceph-objectstore-tool to delete the relevant snapshot(s).
>
>
>
> --
> Cheers,
> Brad
>
>
>


-- 
Cheers,
Brad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-09 Thread Patrick Donnelly
Looks like this bug: https://tracker.ceph.com/issues/41148

On Wed, Oct 9, 2019 at 1:15 PM David C  wrote:
>
> Hi Daniel
>
> Thanks for looking into this. I hadn't installed ceph-debuginfo, here's the 
> bt with line numbers:
>
> #0  operator uint64_t (this=0x10) at 
> /usr/src/debug/ceph-14.2.2/src/include/object.h:123
> #1  Client::fill_statx (this=this@entry=0x274b980, in=0x0, 
> mask=mask@entry=341, stx=stx@entry=0x7fccdbefa210) at 
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:7336
> #2  0x7fce4ea1d4ca in fill_statx (stx=0x7fccdbefa210, mask=341, in=..., 
> this=0x274b980) at /usr/src/debug/ceph-14.2.2/src/client/Client.h:898
> #3  Client::_readdir_cache_cb (this=this@entry=0x274b980, 
> dirp=dirp@entry=0x7fcb7d0e7860,
> cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*, 
> ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, caps=caps@entry=341,
> getref=getref@entry=true) at 
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:7999
> #4  0x7fce4ea1e865 in Client::readdir_r_cb (this=0x274b980, 
> d=0x7fcb7d0e7860,
> cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*, 
> ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, want=want@entry=1775,
> flags=flags@entry=0, getref=true) at 
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:8138
> #5  0x7fce4ea1f3dd in Client::readdirplus_r (this=, 
> d=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry=0x7fccdbefa730, 
> want=want@entry=1775,
> flags=flags@entry=0, out=0x7fccdbefa720) at 
> /usr/src/debug/ceph-14.2.2/src/client/Client.cc:8307
> #6  0x7fce4e9c92d8 in ceph_readdirplus_r (cmount=, 
> dirp=, de=de@entry=0x7fccdbefa8c0, 
> stx=stx@entry=0x7fccdbefa730,
> want=want@entry=1775, flags=flags@entry=0, out=out@entry=0x7fccdbefa720) 
> at /usr/src/debug/ceph-14.2.2/src/libcephfs.cc:629
> #7  0x7fce4ece7b0e in fsal_ceph_readdirplus (dir=, 
> cred=, out=0x7fccdbefa720, flags=0, want=1775, 
> stx=0x7fccdbefa730, de=0x7fccdbefa8c0,
> dirp=, cmount=) at 
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/statx_compat.h:314
> #8  ceph_fsal_readdir (dir_pub=, whence=, 
> dir_state=0x7fccdbefaa30, cb=0x522640 , 
> attrmask=122830,
> eof=0x7fccdbefac0b) at 
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/handle.c:211
> #9  0x005256e1 in mdcache_readdir_uncached 
> (directory=directory@entry=0x7fcaa8bb84a0, whence=, 
> dir_state=, cb=,
> attrmask=, eod_met=) at 
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1654
> #10 0x00517a88 in mdcache_readdir (dir_hdl=0x7fcaa8bb84d8, 
> whence=0x7fccdbefab18, dir_state=0x7fccdbefab30, cb=0x432db0 
> , attrmask=122830,
> eod_met=0x7fccdbefac0b) at 
> /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:551
> #11 0x0043434a in fsal_readdir 
> (directory=directory@entry=0x7fcaa8bb84d8, cookie=cookie@entry=0, 
> nbfound=nbfound@entry=0x7fccdbefac0c,
> eod_met=eod_met@entry=0x7fccdbefac0b, attrmask=122830, 
> cb=cb@entry=0x46f600 , 
> opaque=opaque@entry=0x7fccdbefac20)
> at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/fsal_helper.c:1164
> #12 0x004705b9 in nfs4_op_readdir (op=0x7fcb7fed1f80, 
> data=0x7fccdbefaea0, resp=0x7fcb7d106c40)
> at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_readdir.c:664
> #13 0x0045d120 in nfs4_Compound (arg=, req= out>, res=0x7fcb7e001000)
> at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
> #14 0x004512cd in nfs_rpc_process_request (reqdata=0x7fcb7e1d1950) at 
> /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
> #15 0x00450766 in nfs_rpc_decode_request (xprt=0x7fcaf17fb0e0, 
> xdrs=0x7fcb7e1ddb90) at 
> /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> #16 0x7fce6165707d in svc_rqst_xprt_task (wpe=0x7fcaf17fb2f8) at 
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
> #17 0x7fce6165759a in svc_rqst_epoll_events (n_events=, 
> sr_rec=0x56a24c0) at 
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
> #18 svc_rqst_epoll_loop (sr_rec=) at 
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
> #19 svc_rqst_run_task (wpe=0x56a24c0) at 
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
> #20 0x7fce6165f123 in work_pool_thread (arg=0x7fcd381c77b0) at 
> /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
> #21 0x7fce5fc17dd5 in start_thread (arg=0x7fccdbefe700) at 
> pthread_create.c:307
> #22 0x7fce5ed8eead in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> On Mon, Oct 7, 2019 at 3:40 PM Daniel Gryniewicz  wrote:
>>
>> Client::fill_statx() is a fairly large function, so it's hard to know
>> what's causing the crash.  Can you get line numbers from your backtrace?
>>
>> Daniel
>>
>> On 10/7/19 9:59 AM, David C wrote:
>> > Hi All
>> >
>> > Further to my previous messages, I upgraded
>> > to 

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-10-09 Thread David C
Hi Daniel

Thanks for looking into this. I hadn't installed ceph-debuginfo, here's the
bt with line numbers:

#0  operator uint64_t (this=0x10) at
/usr/src/debug/ceph-14.2.2/src/include/object.h:123
#1  Client::fill_statx (this=this@entry=0x274b980, in=0x0, mask=mask@entry=341,
stx=stx@entry=0x7fccdbefa210) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:7336
#2  0x7fce4ea1d4ca in fill_statx (stx=0x7fccdbefa210, mask=341, in=...,
this=0x274b980) at /usr/src/debug/ceph-14.2.2/src/client/Client.h:898
#3  Client::_readdir_cache_cb (this=this@entry=0x274b980, dirp=dirp@entry
=0x7fcb7d0e7860,
cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*,
ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, caps=caps@entry=341,
getref=getref@entry=true) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:7999
#4  0x7fce4ea1e865 in Client::readdir_r_cb (this=0x274b980,
d=0x7fcb7d0e7860,
cb=cb@entry=0x7fce4e9d0950 <_readdir_single_dirent_cb(void*, dirent*,
ceph_statx*, off_t, Inode*)>, p=p@entry=0x7fccdbefa6a0, want=want@entry
=1775,
flags=flags@entry=0, getref=true) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:8138
#5  0x7fce4ea1f3dd in Client::readdirplus_r (this=,
d=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry=0x7fccdbefa730,
want=want@entry=1775,
flags=flags@entry=0, out=0x7fccdbefa720) at
/usr/src/debug/ceph-14.2.2/src/client/Client.cc:8307
#6  0x7fce4e9c92d8 in ceph_readdirplus_r (cmount=,
dirp=, de=de@entry=0x7fccdbefa8c0, stx=stx@entry
=0x7fccdbefa730,
want=want@entry=1775, flags=flags@entry=0, out=out@entry=0x7fccdbefa720)
at /usr/src/debug/ceph-14.2.2/src/libcephfs.cc:629
#7  0x7fce4ece7b0e in fsal_ceph_readdirplus (dir=,
cred=, out=0x7fccdbefa720, flags=0, want=1775,
stx=0x7fccdbefa730, de=0x7fccdbefa8c0,
dirp=, cmount=) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/statx_compat.h:314
#8  ceph_fsal_readdir (dir_pub=, whence=,
dir_state=0x7fccdbefaa30, cb=0x522640 ,
attrmask=122830,
eof=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/handle.c:211
#9  0x005256e1 in mdcache_readdir_uncached
(directory=directory@entry=0x7fcaa8bb84a0, whence=,
dir_state=, cb=,
attrmask=, eod_met=) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1654
#10 0x00517a88 in mdcache_readdir (dir_hdl=0x7fcaa8bb84d8,
whence=0x7fccdbefab18, dir_state=0x7fccdbefab30, cb=0x432db0
, attrmask=122830,
eod_met=0x7fccdbefac0b) at
/usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:551
#11 0x0043434a in fsal_readdir
(directory=directory@entry=0x7fcaa8bb84d8,
cookie=cookie@entry=0, nbfound=nbfound@entry=0x7fccdbefac0c,
eod_met=eod_met@entry=0x7fccdbefac0b, attrmask=122830, cb=cb@entry=0x46f600
, opaque=opaque@entry=0x7fccdbefac20)
at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/fsal_helper.c:1164
#12 0x004705b9 in nfs4_op_readdir (op=0x7fcb7fed1f80,
data=0x7fccdbefaea0, resp=0x7fcb7d106c40)
at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_readdir.c:664
#13 0x0045d120 in nfs4_Compound (arg=,
req=, res=0x7fcb7e001000)
at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
#14 0x004512cd in nfs_rpc_process_request (reqdata=0x7fcb7e1d1950)
at /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
#15 0x00450766 in nfs_rpc_decode_request (xprt=0x7fcaf17fb0e0,
xdrs=0x7fcb7e1ddb90) at
/usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#16 0x7fce6165707d in svc_rqst_xprt_task (wpe=0x7fcaf17fb2f8) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
#17 0x7fce6165759a in svc_rqst_epoll_events (n_events=,
sr_rec=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
#18 svc_rqst_epoll_loop (sr_rec=) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
#19 svc_rqst_run_task (wpe=0x56a24c0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
#20 0x7fce6165f123 in work_pool_thread (arg=0x7fcd381c77b0) at
/usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
#21 0x7fce5fc17dd5 in start_thread (arg=0x7fccdbefe700) at
pthread_create.c:307
#22 0x7fce5ed8eead in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
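
For reference, a symbolized trace like the one above can be obtained roughly as
follows, assuming a CentOS/RHEL host and an available core dump from the ganesha
crash (the debuginfo package names are assumptions and vary by repository):

$ yum install ceph-debuginfo nfs-ganesha-debuginfo
$ gdb /usr/bin/ganesha.nfsd /path/to/core
(gdb) bt
(gdb) thread apply all bt          # if the crashing thread is not the one selected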

On Mon, Oct 7, 2019 at 3:40 PM Daniel Gryniewicz  wrote:

> Client::fill_statx() is a fairly large function, so it's hard to know
> what's causing the crash.  Can you get line numbers from your backtrace?
>
> Daniel
>
> On 10/7/19 9:59 AM, David C wrote:
> > Hi All
> >
> > Further to my previous messages, I upgraded
> > to libcephfs2-14.2.2-0.el7.x86_64 as suggested and things certainly seem
> > a lot more stable. I have had some crashes, though; could someone assist
> > in debugging this latest crash, please?
> >
> > (gdb) bt
> > #0  0x7fce4e9fc1bb in Client::fill_statx(Inode*, unsigned int,
> > ceph_statx*) () from /lib64/libcephfs.so.2
> > #1  

Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Vladimir Brik
Best I can tell, automatic cache sizing is enabled and all related 
settings are at their default values.


Looking through cache tunables, I came across 
osd_memory_expected_fragmentation, which the docs define as "estimate 
the percent of memory fragmentation". What's the formula to compute 
actual percentage of memory fragmentation?


Based on /proc/buddyinfo, I suspect that our memory fragmentation is a 
lot worse than osd_memory_expected_fragmentation default of 0.15. Could 
this be related to many OSDs' RSSes far exceeding osd_memory_target?
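
(A crude way to put a number on it from /proc/buddyinfo is the fraction of free
memory sitting in small blocks; this is only a heuristic and not necessarily the
definition the osd_memory_expected_fragmentation option uses:)

$ awk '{ t=0; l=0; for (i=5; i<=NF; i++) { o=i-5; b=$i*4096*(2^o); t+=b; if (o>=4) l+=b } \
      if (t>0) printf "%s %s zone %-8s frag ~ %.2f\n", $1, $2, $4, 1-l/t }' /proc/buddyinfo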


So far high memory consumption hasn't been a problem for us. (I guess 
it's possible that the kernel simply sees no need to reclaim unmapped 
memory until there is actually real memory pressure?) It's just a little 
scary not understanding why this started happening when memory usage had 
been so stable before.


Thanks,

Vlad



On 10/9/19 11:51 AM, Gregory Farnum wrote:

On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:


  > Do you have statistics on the size of the OSDMaps or count of them
  > which were being maintained by the OSDs?
No, I don't think so. How can I find this information?


Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
-Greg



Memory consumption started to climb again:
https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure if relevant or not):

I increased size of the swap on the servers to 10GB and it's being
completely utilized, even though there is still quite a bit of free memory.

It appears that memory is highly fragmented on the NUMA node 0 of all
the servers. Some of the servers have no free pages higher than order 0.
(Memory on NUMA node 1 of the servers appears much less fragmented.)

The servers have 192GB of RAM, 2 NUMA nodes.


Vlad



On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:


And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway
activity (before, it was essentially zero). I can see nothing in the
logs that would explain what happened though.

Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory
consumption of our OSDs started to unexpectedly grow on all 5 nodes,
after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very
light (typically <10 iops) during this period, and the number of objects
stayed about the same.

The only unusual occurrence was the reboot of one of the nodes the day
before (a firmware update). For the reboot, I ran "ceph osd set noout",
but forgot to unset it until several days later. Unsetting noout did not
stop the increase in memory consumption.

I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
don't know why there is such a big spread. All HDDs are 10TB, 72-76%
utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or
debug it?


Thanks very much,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's a single mon, because we had a major outage on this cluster
> and it's just being used to copy off data now. We weren't able to add more
> mons because once a second mon was added it crashed the first one (there's a
> bug tracker ticket).
> I still have the old rocksdb files from before I ran a repair on it, but they had
> the rocksdb corruption issue (not sure why that happened; it had run fine for
> 2 months).
>
> Any options? I mean everything still works, data is accessible, RBDs run,
> only the CephFS mount is obviously not working. For the short time the mon
> stays up it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg
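
For reference, the disaster-recovery tooling mentioned here is the
cephfs-journal-tool / cephfs-table-tool / cephfs-data-scan family. Very roughly, and
only after reading the upstream CephFS disaster-recovery documentation (these
operations are destructive; <fs> and <data-pool> are placeholders):

$ cephfs-journal-tool --rank=<fs>:0 journal export /root/mds-journal-backup.bin
$ cephfs-journal-tool --rank=<fs>:0 event recover_dentries summary
$ cephfs-journal-tool --rank=<fs>:0 journal reset
$ cephfs-table-tool all reset session
# only if metadata has to be rebuilt from the data pool:
$ cephfs-data-scan scan_extents <data-pool>
$ cephfs-data-scan scan_inodes <data-pool>

Whether any of this helps depends on why the MDS is asking for that old OSD map in
the first place.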


> On Monday, 7 October 2019 at 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use the rocksdb repair tool earlier because the rocksdb files got
> > corrupted, for another reason (possibly another bug). Maybe that is why it
> > crash-loops now, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant with "turn it off and rebuild from remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, 5 October 2019 at 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >-3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > > 4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > > lease_expire=0.00 has v0 lc 4549352
> > >-2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > > closest pinned map ver 252615 not available! error: (2) No such file or 
> > > directory
> > >-1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > > OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' 
> > > thread 7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > > (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > > const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> > >  5: 
> > > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> > >  unsigned long)+0x8c) [0x717c3c]
> > >  6: (PaxosService::maybe_trim()+0x473) [0x707443]
> > >  7: (Monitor::tick()+0xa9) [0x5ecf39]
> > >  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> > >  9: (Context::complete(int)+0x9) [0x6070d9]
> > >  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> > >  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> > >  12: (()+0x76ba) [0x7f6e67cab6ba]
> > >  13: (clone()+0x6d) [0x7f6e674d441d]
> > >
> > >  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal 
> > > (Aborted) **
> > >  in thread 7f6e5d461700 thread_name:safe_timer
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > > 

Re: [ceph-users] Ceph multi site outage question

2019-10-09 Thread Melzer Pinto
Thanks - yeah, Jewel is old. But I meant to say Nautilus, not Luminous.

The first option probably won't work for me, since both sides are active and
application1 needs to write to both places as http://application1.something.com.

The second one should work in theory. I'm using haproxy and it does have an option
to rewrite host headers. I can also replace it with nginx, since I think it will
handle this kind of thing better. In such a setup, I'd set one site's radosgw to
application1-master and the second one to application1-slave. The reverse proxy
will then rewrite application1 to application1-master or application1-slave
depending on the site.
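
For what it's worth, the rewrite itself is a one-liner in either proxy; something
like the following, using the hypothetical per-site names from above (haproxy 1.6+
syntax, RGW listening on its default civetweb port):

backend rgw_local
    http-request set-header Host application1-master.something.com
    server rgw1 127.0.0.1:7480 check

or, in nginx:

location / {
    proxy_set_header Host application1-master.something.com;
    proxy_pass http://127.0.0.1:7480;
}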

Thanks

From: Ed Fisher 
Sent: Wednesday, October 9, 2019 11:13 AM
To: Melzer Pinto 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Ceph multi site outage question

Boy, Jewel is pretty old. Even Luminous is getting up there. There have been a 
lot of multisite improvements in Mimic and now Nautilus, so you might want to 
consider upgrading all the way to 14.2.4.

Anyway, the way we solve this is by giving each zone a different name (eg 
application1a/application1b), and then having a virtual IP for application1. We 
then move the virtual IP around whichever zone we want to have accepting 
traffic for that zonegroup. In our case we're advertising the virtual IP on all 
of the radosgw instances using bgp and then letting our routers do per-stream 
ECMP to load balance the traffic. Each RGW in each cluster checks the realm's 
period every few seconds and decides to announce/withdraw the IP based on 
whether that rgw's zone is the master zone for the zonegroup (plus whether the 
rgw instance is healthy, etc).

We have both application1.example.com and 
application1a/application1b.example.com as 
hostnames in the zonegroup config, but just 
application1.example.com for the endpoint. I'm 
not sure what the equivalent settings are on Jewel's multisite, if any. If 
you're routing radosgw traffic through a reverse proxy or load balancer you can 
also rewrite the host header on the fly.

Hope this helps,
Ed

On Oct 9, 2019, at 10:02 AM, Melzer Pinto <melzer.pi...@mezocliq.com> wrote:

Hello,
I have a question about multi-site configuration. I have 2 clusters configured
in a single realm and zonegroup. One cluster is the master zone and the other
the slave. Let's assume the first cluster can be reached at
http://application1.something.com and the 2nd one is
http://application1-slave.something.com. My application has a number of config
files that reference http://application1.something.com. So if there is a site
outage I'd need to change all of these files to
http://application1-slave.something.com and restart. I was wondering if there
are any alternatives where I don't have to change the config files. The best
solution would be to use the same name in both clusters -
http://application1.something.com. But I'm not sure if that is recommended or
even doable. Any suggestions? I'm using the latest version of Ceph Jewel,
10.2.11, but I am planning to upgrade to Luminous soon.

Thanks
M
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:
>
>  > Do you have statistics on the size of the OSDMaps or count of them
>  > which were being maintained by the OSDs?
> No, I don't think so. How can I find this information?

Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
-Greg
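
For reference, a couple of places to look, using osd.0 as an example: the admin
socket "status" output includes the oldest/newest map epochs the OSD is holding, and
osd_memory_target is the newer memory limit in 14.2.x (value in bytes; treat the
exact counter names as approximate):

$ ceph daemon osd.0 status | egrep 'oldest_map|newest_map'
$ ceph daemon osd.0 perf dump | grep -i map     # map-cache related perf counters
$ ceph config set osd osd_memory_target 4294967296   # e.g. 4 GiB per OSD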

>
> Memory consumption started to climb again:
> https://icecube.wisc.edu/~vbrik/graph-3.png
>
> Some more info (not sure if relevant or not):
>
> I increased size of the swap on the servers to 10GB and it's being
> completely utilized, even though there is still quite a bit of free memory.
>
> It appears that memory is highly fragmented on the NUMA node 0 of all
> the servers. Some of the servers have no free pages higher than order 0.
> (Memory on NUMA node 1 of the servers appears much less fragmented.)
>
> The servers have 192GB of RAM, 2 NUMA nodes.
>
>
> Vlad
>
>
>
> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> > Do you have statistics on the size of the OSDMaps or count of them
> > which were being maintained by the OSDs? I'm not sure why having noout
> > set would change that if all the nodes were alive, but that's my bet.
> > -Greg
> >
> > On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >  wrote:
> >>
> >> And, just as unexpectedly, things have returned to normal overnight
> >> https://icecube.wisc.edu/~vbrik/graph-1.png
> >>
> >> The change seems to have coincided with the beginning of Rados Gateway
> >> activity (before, it was essentially zero). I can see nothing in the
> >> logs that would explain what happened though.
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> >>> Hello
> >>>
> >>> I am running a Ceph 14.2.2 cluster and a few days ago, memory
> >>> consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> >>> after being stable for about 6 months.
> >>>
> >>> Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> >>> Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >>>
> >>> I am not sure what changed to cause this. Cluster usage has been very
> >>> light (typically <10 iops) during this period, and the number of objects
> >>> stayed about the same.
> >>>
> >>> The only unusual occurrence was the reboot of one of the nodes the day
> >>> before (a firmware update). For the reboot, I ran "ceph osd set noout",
> >>> but forgot to unset it until several days later. Unsetting noout did not
> >>> stop the increase in memory consumption.
> >>>
> >>> I don't see anything unusual in the logs.
> >>>
> >>> Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
> >>> 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> >>> don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> >>> utilized, with 101-104 PGs.
> >>>
> >>> Does anybody know what might be the problem here and how to address or
> >>> debug it?
> >>>
> >>>
> >>> Thanks very much,
> >>>
> >>> Vlad
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph multi site outage question

2019-10-09 Thread Ed Fisher
Boy, Jewel is pretty old. Even Luminous is getting up there. There have been a 
lot of multisite improvements in Mimic and now Nautilus, so you might want to 
consider upgrading all the way to 14.2.4.

Anyway, the way we solve this is by giving each zone a different name (eg 
application1a/application1b), and then having a virtual IP for application1. We 
then move the virtual IP around whichever zone we want to have accepting 
traffic for that zonegroup. In our case we're advertising the virtual IP on all 
of the radosgw instances using bgp and then letting our routers do per-stream 
ECMP to load balance the traffic. Each RGW in each cluster checks the realm's 
period every few seconds and decides to announce/withdraw the IP based on 
whether that rgw's zone is the master zone for the zonegroup (plus whether the 
rgw instance is healthy, etc). 

We have both application1.example.com and
application1a/application1b.example.com as hostnames in the zonegroup config,
but just application1.example.com for the endpoint. I'm not sure what the
equivalent settings are on Jewel's multisite, if any. If you're routing radosgw
traffic through a reverse proxy or load balancer you can also rewrite the host
header on the fly.
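
For reference, on recent releases the zonegroup hostnames are managed roughly like
this (a sketch; on Jewel the hostnames field exists but behaviour differs, so check
the 10.2.x docs before relying on it):

$ radosgw-admin zonegroup get > zg.json
# edit zg.json, e.g. "hostnames": ["application1.example.com", "application1a.example.com"]
$ radosgw-admin zonegroup set < zg.json
$ radosgw-admin period update --commit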

Hope this helps,
Ed

> On Oct 9, 2019, at 10:02 AM, Melzer Pinto  wrote:
> 
> Hello,
> I have a question about multi-site configuration. I have 2 clusters
> configured in a single realm and zonegroup. One cluster is the master zone
> and the other the slave. Let's assume the first cluster can be reached at
> http://application1.something.com and the 2nd one is
> http://application1-slave.something.com. My application has a number of
> config files that reference http://application1.something.com. So if there is
> a site outage I'd need to change all of these files to
> http://application1-slave.something.com and restart. I was wondering if
> there are any alternatives where I don't have to change the config files. The
> best solution would be to use the same name in both clusters -
> http://application1.something.com. But I'm not sure if that is recommended or
> even doable. Any suggestions? I'm using the latest version of Ceph Jewel,
> 10.2.11, but I am planning to upgrade to Luminous soon.
> 
> Thanks
> M
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph multi site outage question

2019-10-09 Thread Melzer Pinto
Hello,
I have a question about multi-site configuration. I have 2 clusters configured
in a single realm and zonegroup. One cluster is the master zone and the other
the slave. Let's assume the first cluster can be reached at
http://application1.something.com and the 2nd one is
http://application1-slave.something.com. My application has a number of config
files that reference http://application1.something.com. So if there is a site
outage I'd need to change all of these files to
http://application1-slave.something.com and restart. I was wondering if there
are any alternatives where I don't have to change the config files. The best
solution would be to use the same name in both clusters -
http://application1.something.com. But I'm not sure if that is recommended or
even doable. Any suggestions? I'm using the latest version of Ceph Jewel,
10.2.11, but I am planning to upgrade to Luminous soon.

Thanks
M
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-09 Thread Marc Roos
 
Brad, many thanks!!! My cluster finally has HEALTH_OK after 1.5 years or so! :)


-Original Message-
Subject: Re: Ceph pg repair clone_missing?

On Fri, Oct 4, 2019 at 6:09 PM Marc Roos  
wrote:
>
>  >
>  >Try something like the following on each OSD that holds a copy of
>  >rbd_data.1f114174b0dc51.0974 and see what output you 
get.
>  >Note that you can drop the bluestore flag if they are not bluestore  

> >osds and you will need the osd stopped at the time (set noout). Also  

> >note, snapids are displayed in hexadecimal in the output (but then 
'4'
>  >is '4' so not a big issue here).
>  >
>  >$ ceph-objectstore-tool --type bluestore --data-path  
> >/var/lib/ceph/osd/ceph-XX/ --pgid 17.36 --op list
>  >rbd_data.1f114174b0dc51.0974
>
> I got these results
>
> osd.7
> Error getting attr on : 17.36_head,#-19:6c00:::scrub_17.36:head#,
> (61) No data available
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]

Ah, so of course the problem is the snapshot is missing. You may need to 
try something like the following on each of those osds.

$ ceph-objectstore-tool --type bluestore --data-path 
/var/lib/ceph/osd/ceph-XX/ --pgid 17.36 
'{"oid":"rbd_data.1f114174b0dc51.0974","key":"","snapid":-2,
"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}'
remove-clone-metadata 4

>
> osd.12
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
>
> osd.29
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
> ["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","s
> na 
> pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
>
>
>  >
>  >The likely issue here is the primary believes snapshot 4 is gone but 
 
> >there is still data and/or metadata on one of the replicas which is  
> >confusing the issue. If that is the case you can use the
> >ceph-objectstore-tool to delete the relevant snapshot(s).



--
Cheers,
Brad



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com