On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini <gustavoton...@gmail.com> wrote:
>
> No. The cluster was just rebalancing.
>
> The journal seems damaged:
>
> ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal inspect
> 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket couldn't
> create socket (97) Address family not supported by protocol

A corrupted journal shouldn't cause an error like this. This looks more
like a network issue; please double-check the network configuration of
your cluster.
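Errno 97 (EAFNOSUPPORT) typically points to an address-family mismatch
between ceph.conf and the host running the command, e.g. the cluster
configured for IPv6 while the deployer node has IPv6 disabled. A minimal
ceph.conf fragment to force IPv4 bindings, assuming that mismatch is the
cause here (adjust to the actual deployment):

    [global]
    # bind messenger sockets to IPv4 only; remove this if the cluster
    # really is deployed on IPv6
    ms_bind_ipv6 = false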
> Overall journal integrity: DAMAGED
> Corrupt regions:
>   0x1c5e4d904ab-1c5e4d9ddbc
> ceph@deployer:~$
>
> Could a journal reset help with this?
>
> I could snapshot all FS pools and export the journal beforehand to
> guarantee a rollback to this state if something goes wrong with the
> journal reset.
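For reference, a minimal sketch of that backup-before-reset sequence with
cephfs-journal-tool and pool snapshots (pool names in angle brackets are
placeholders, and whether a reset is actually safe here is still the open
question in this thread):

    # export the rank 0 journal to a local file
    cephfs-journal-tool --rank=fs_padrao:0 journal export backup.journal.bin

    # snapshot the metadata and data pools (repeat per FS pool; mksnap
    # can be refused on pools that already carry self-managed CephFS
    # snapshots)
    ceph osd pool mksnap <metadata_pool> pre-reset
    ceph osd pool mksnap <data_pool> pre-reset

    # only after the backups, and only if a reset is advised:
    cephfs-journal-tool --rank=fs_padrao:0 journal reset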
> On Thu, Oct 17, 2019, 09:07 Yan, Zheng <uker...@gmail.com> wrote:
>>
>> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini <gustavoton...@gmail.com>
>> wrote:
>> >
>> > Dear ceph users,
>> > we're experiencing a segfault during MDS startup (replay process) which
>> > is making our FS inaccessible.
>> >
>> > MDS log messages:
>> >
>> > Oct 15 03:41:39.894584 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c08f49700 1 -- 192.168.8.195:6800/3181891717 <== osd.26
>> > 192.168.8.209:6821/2419345 3 ==== osd_op_reply(21 1.00000000 [getxattr]
>> > v0'0 uv0 ondisk = -61 ((61) No data available)) v8 ==== 154+0+0
>> > (3715233608 0 0) 0x2776340 con 0x18bd500
>> > Oct 15 03:41:39.894584 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> > Oct 15 03:41:39.894658 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 and 544
>> > Oct 15 03:41:39.894658 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.ino(0x100) magic is 'ceph fs volume v011'
>> > (expecting 'ceph fs volume v011')
>> > Oct 15 03:41:39.894735 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.snaprealm(0x100 seq 1 0x1799c00) open_parents
>> > [1,head]
>> > Oct 15 03:41:39.894735 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 0x100 [...2,head]
>> > ~mds0/ auth v275131 snaprealm=0x1799c00 f(v0 1=1+0) n(v76166 rc2020-07-17
>> > 15:29:27.000000 b41838692297 -3184=-3168+-16)/n() (iversion lock)
>> > 0x18bf800]
>> > Oct 15 03:41:39.894821 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> > Oct 15 03:41:39.894821 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 482
>> > Oct 15 03:41:39.894891 mds1 ceph-mds: -472> 2019-10-15 00:40:30.201
>> > 7f3c00589700 10 mds.0.cache.ino(0x1) magic is 'ceph fs volume v011'
>> > (expecting 'ceph fs volume v011')
>> > Oct 15 03:41:39.894958 mds1 ceph-mds: -472> 2019-10-15 00:40:30.205
>> > 7f3c00589700 -1 *** Caught signal (Segmentation fault) **
>> >  in thread 7f3c00589700 thread_name:fn_anonymous
>> >
>> >  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>> >  1: (()+0x11390) [0x7f3c0e48a390]
>> >  2: (operator<<(std::ostream&, SnapRealm const&)+0x42) [0x72cb92]
>> >  3: (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]
>> >  4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]
>> >  5: (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]
>> >  6: (CInode::_fetched(ceph::buffer::list&, ceph::buffer::list&,
>> >  Context*)+0x1b2) [0x702da2]
>> >  7: (MDSIOContextBase::complete(int)+0x119) [0x74fcc9]
>> >  8: (Finisher::finisher_thread_entry()+0x12e) [0x7f3c0ebffece]
>> >  9: (()+0x76ba) [0x7f3c0e4806ba]
>> >  10: (clone()+0x6d) [0x7f3c0dca941d]
>> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >  needed to interpret this.
>> > Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
>> > Oct 15 03:41:39.895473 mds1 ceph-mds: 0/ 5 none
>> > Oct 15 03:41:39.895473 mds1 ceph-mds: 0/ 1 lockdep
>> >
>>
>> Looks like the snap info for the root inode is corrupted. Did you do any
>> unusual operations before this happened?
>>
>> >
>> > Cluster status information:
>> >
>> >   cluster:
>> >     id:     b8205875-e56f-4280-9e52-6aab9c758586
>> >     health: HEALTH_WARN
>> >             1 filesystem is degraded
>> >             1 nearfull osd(s)
>> >             11 pool(s) nearfull
>> >
>> >   services:
>> >     mon: 3 daemons, quorum mon1,mon2,mon3
>> >     mgr: mon1(active), standbys: mon2, mon3
>> >     mds: fs_padrao-1/1/1 up {0=mds1=up:replay(laggy or crashed)}
>> >     osd: 90 osds: 90 up, 90 in
>> >
>> >   data:
>> >     pools:   11 pools, 1984 pgs
>> >     objects: 75.99 M objects, 285 TiB
>> >     usage:   457 TiB used, 181 TiB / 639 TiB avail
>> >     pgs:     1896 active+clean
>> >              87   active+clean+scrubbing+deep+repair
>> >              1    active+clean+scrubbing
>> >
>> >   io:
>> >     client: 89 KiB/s wr, 0 op/s rd, 3 op/s wr
>> >
>> > Has anyone seen anything like this?
>> >
>> > Regards,
>> > Gustavo.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com