On Fri, Oct 25, 2019 at 9:42 PM Gustavo Tonini <gustavoton...@gmail.com> wrote:
>
> Running "cephfs-data-scan init --force-init" solved the problem.
>
> Then I had to run "cephfs-journal-tool event recover_dentries summary" and
> truncate the journal to repair the corruption.
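>
> In case it helps anyone else, the sequence I ran was roughly this
> (rank fs_padrao:0, as in my earlier messages below; adjust to your
> filesystem):
>
>     cephfs-data-scan init --force-init
>     cephfs-journal-tool --rank=fs_padrao:0 event recover_dentries summary
>     cephfs-journal-tool --rank=fs_padrao:0 journal reset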
>
> CephFS worked well for approximately 3 hours and then our MDS crashed again, 
> apparently due to the bug described at https://tracker.ceph.com/issues/38452
>

Does the method in issue #38452 work for you? If not, please set
debug_mds to 10 and send us the log from around the crash.
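
For example, either of these should raise the log level (using mds1
from your earlier log; adjust the daemon name):

    # at runtime, from a node with an admin keyring:
    ceph tell mds.mds1 injectargs '--debug_mds 10'

    # or on the MDS host, via the admin socket:
    ceph daemon mds.mds1 config set debug_mds 10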


Yan, Zheng

> On Wed, Oct 23, 2019, 02:24 Yan, Zheng <uker...@gmail.com> wrote:
>>
>> On Tue, Oct 22, 2019 at 1:49 AM Gustavo Tonini <gustavoton...@gmail.com> 
>> wrote:
>> >
>> > Is there any possibility of losing data if I use "cephfs-data-scan init
>> > --force-init"?
>> >
>>
>> It only causes an incorrect stat on the root inode; it can't cause data loss.
>>
>> Running 'ceph daemon mds.a scrub_path / force repair' after the MDS
>> restarts can fix the incorrect stat.
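>>
>> i.e. something like this on the MDS host once the daemon is active
>> again (replace the daemon name; mds.mds1 is from your earlier log):
>>
>>     ceph daemon mds.mds1 scrub_path / force repair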
>>
>> > On Mon, Oct 21, 2019 at 4:36 AM Yan, Zheng <uker...@gmail.com> wrote:
>> >>
>> >> On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini <gustavoton...@gmail.com> 
>> >> wrote:
>> >> >
>> >> > Hi Zheng,
>> >> > the cluster is running Ceph mimic. The network warning only appears
>> >> > when using the nautilus cephfs-journal-tool.
>> >> >
>> >> > "cephfs-data-scan scan_links" does not report any issue.
>> >> >
>> >> > How could the variable "newparent" be NULL at
>> >> > https://github.com/ceph/ceph/blob/master/src/mds/SnapRealm.cc#L599 ?
>> >> > Is there a way to fix this?
>> >> >
>> >>
>> >>
>> >> Try 'cephfs-data-scan init'. It will set up the root inode's snaprealm.
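>> >>
>> >> Something like this (it rewrites the root and MDS directory inodes;
>> >> I believe it will refuse to touch objects that already exist unless
>> >> you pass --force-init):
>> >>
>> >>     cephfs-data-scan init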
>> >>
>> >> > On Thu, Oct 17, 2019 at 9:58 PM Yan, Zheng <uker...@gmail.com> wrote:
>> >> >>
>> >> >> On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini 
>> >> >> <gustavoton...@gmail.com> wrote:
>> >> >> >
>> >> >> > No. The cluster was just rebalancing.
>> >> >> >
>> >> >> > The journal seems damaged:
>> >> >> >
>> >> >> > ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal 
>> >> >> > inspect
>> >> >> > 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket 
>> >> >> > couldn't create socket (97) Address family not supported by protocol
>> >> >>
>> >> >> A corrupted journal shouldn't cause an error like this; it looks more
>> >> >> like a network issue. Please double-check your cluster's network
>> >> >> configuration.
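>> >> >>
>> >> >> For what it's worth, that "(97) Address family not supported by
>> >> >> protocol" message often means the client tried to open an IPv6
>> >> >> socket on a host with IPv6 disabled. If that's the case here,
>> >> >> something like this in the client's ceph.conf may silence it (an
>> >> >> assumption, not a confirmed diagnosis):
>> >> >>
>> >> >>     [global]
>> >> >>         ms bind ipv6 = false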
>> >> >>
>> >> >> > Overall journal integrity: DAMAGED
>> >> >> > Corrupt regions:
>> >> >> > 0x1c5e4d904ab-1c5e4d9ddbc
>> >> >> > ceph@deployer:~$
>> >> >> >
>> >> >> > Could a journal reset help with this?
>> >> >> >
>> >> >> > I could snapshot all FS pools and export the journal beforehand, to
>> >> >> > guarantee a rollback to this state if something goes wrong with the
>> >> >> > journal reset.
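>> >> >> >
>> >> >> > Roughly this, so the reset could be rolled back (the export path is
>> >> >> > just an example):
>> >> >> >
>> >> >> >     cephfs-journal-tool --rank=fs_padrao:0 journal export /root/fs_padrao.journal.bin
>> >> >> >     cephfs-journal-tool --rank=fs_padrao:0 journal reset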
>> >> >> >
>> >> >> > On Thu, Oct 17, 2019, 09:07 Yan, Zheng <uker...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini 
>> >> >> >> <gustavoton...@gmail.com> wrote:
>> >> >> >> >
>> >> >> >> > Dear ceph users,
>> >> >> >> > we're experiencing a segfault during MDS startup (the replay
>> >> >> >> > process), which is making our FS inaccessible.
>> >> >> >> >
>> >> >> >> > MDS log messages:
>> >> >> >> >
>> >> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== 
>> >> >> >> > osd.26 192.168.8.209:6821/2419345 3 ==== osd_op_reply(21 
>> >> >> >> > 1.00000000 [getxattr] v0'0 uv0 ondisk = -61 ((61) No data 
>> >> >> >> > available)) v8 ==== 154+0+0 (3715233608 0 0) 0x2776340 con 
>> >> >> >> > 0x18bd500
>> >> >> >> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 
>> >> >> >> > 18C_IO_Inode_Fetched
>> >> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 
>> >> >> >> > 0 and 544
>> >> >> >> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 
>> >> >> >> > 'ceph fs volume v011' (expecting 'ceph fs volume v011')
>> >> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 
>> >> >> >> > 0x1799c00) open_parents [1,head]
>> >> >> >> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched 
>> >> >> >> > [inode 0x100 [...2,head] ~mds0/ auth v275131 snaprealm=0x1799c00 
>> >> >> >> > f(v0 1=1+0) n(v76166 rc2020-07-17 15:29:27.000000 b41838692297 
>> >> >> >> > -3184=-3168+-16)/n() (iversion lock) 0x18bf800]
>> >> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 
>> >> >> >> > 18C_IO_Inode_Fetched
>> >> >> >> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 
>> >> >> >> > and 482
>> >> >> >> > Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15 
>> >> >> >> > 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph 
>> >> >> >> > fs volume v011' (expecting 'ceph fs volume v011')
>> >> >> >> > Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.205 7f3c00589700 -1 *** Caught signal (Segmentation fault) **
>> >> >> >> >  in thread 7f3c00589700 thread_name:fn_anonymous
>> >> >> >> >
>> >> >> >> >  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>> >> >> >> >  1: (()+0x11390) [0x7f3c0e48a390]
>> >> >> >> >  2: (operator<<(std::ostream&, SnapRealm const&)+0x42) [0x72cb92]
>> >> >> >> >  3: (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]
>> >> >> >> >  4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]
>> >> >> >> >  5: (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]
>> >> >> >> >  6: (CInode::_fetched(ceph::buffer::list&, ceph::buffer::list&, Context*)+0x1b2) [0x702da2]
>> >> >> >> >  7: (MDSIOContextBase::complete(int)+0x119) [0x74fcc9]
>> >> >> >> >  8: (Finisher::finisher_thread_entry()+0x12e) [0x7f3c0ebffece]
>> >> >> >> >  9: (()+0x76ba) [0x7f3c0e4806ba]
>> >> >> >> >  10: (clone()+0x6d) [0x7f3c0dca941d]
>> >> >> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >> >> >> > Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
>> >> >> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:    0/ 5 none
>> >> >> >> > Oct 15 03:41:39.895473 mds1 ceph-mds:    0/ 1 lockdep
>> >> >> >> >
>> >> >> >>
>> >> >> >> Looks like the snap info for the root inode is corrupted. Did you do
>> >> >> >> any unusual operations before this happened?
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> > Cluster status information:
>> >> >> >> >
>> >> >> >> >   cluster:
>> >> >> >> >     id:     b8205875-e56f-4280-9e52-6aab9c758586
>> >> >> >> >     health: HEALTH_WARN
>> >> >> >> >             1 filesystem is degraded
>> >> >> >> >             1 nearfull osd(s)
>> >> >> >> >             11 pool(s) nearfull
>> >> >> >> >
>> >> >> >> >   services:
>> >> >> >> >     mon: 3 daemons, quorum mon1,mon2,mon3
>> >> >> >> >     mgr: mon1(active), standbys: mon2, mon3
>> >> >> >> >     mds: fs_padrao-1/1/1 up  {0=mds1=up:replay(laggy or crashed)}
>> >> >> >> >     osd: 90 osds: 90 up, 90 in
>> >> >> >> >
>> >> >> >> >   data:
>> >> >> >> >     pools:   11 pools, 1984 pgs
>> >> >> >> >     objects: 75.99 M objects, 285 TiB
>> >> >> >> >     usage:   457 TiB used, 181 TiB / 639 TiB avail
>> >> >> >> >     pgs:     1896 active+clean
>> >> >> >> >              87   active+clean+scrubbing+deep+repair
>> >> >> >> >              1    active+clean+scrubbing
>> >> >> >> >
>> >> >> >> >   io:
>> >> >> >> >     client:   89 KiB/s wr, 0 op/s rd, 3 op/s wr
>> >> >> >> >
>> >> >> >> > Has anyone seen anything like this?
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> > Gustavo.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Gustavo.
>> >
>> >
>> >
>> > --
>> > Gustavo.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
