Do I apply this against the v9.2.0 git tag?
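If so, I'm assuming it's roughly the following (just a sketch; "mds-addfailed.patch" is a placeholder name for whatever the patch file ends up being called):

    git clone https://github.com/ceph/ceph.git
    cd ceph
    git checkout v9.2.0
    git submodule update --init --recursive
    git apply ~/mds-addfailed.patch     # placeholder path; or patch -p1, depending on the patch format
    ./install-deps.sh                   # pull in build dependencies
    ./autogen.sh && ./configure
    make -j$(nproc)

and then, after replacing ceph-mon on the monitor hosts with the patched build, run the new command (per Dyweni's note below, the patch itself spells it "addfailed", no space):

    ceph mds addfailed 1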
On Thu, Jan 14, 2016 at 4:48 PM, Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:

> Your patch lists the command as "addfailed", but the email lists the
> command as "add failed". (Note the space.)
>
> On 2016-01-14 18:46, Yan, Zheng wrote:
>
>> Here is a patch for v9.2.0. After installing the modified version of
>> ceph-mon, run "ceph mds add failed 1".
>>
>> On Jan 15, 2016, at 08:20, Mike Carlson <m...@bayphoto.com> wrote:
>>>
>>> Okay, that sounds really good.
>>>
>>> Would it help if you had access to our cluster?
>>>
>>> On Thu, Jan 14, 2016 at 4:19 PM, Yan, Zheng <z...@redhat.com> wrote:
>>>
>>> > On Jan 15, 2016, at 08:16, Mike Carlson <m...@bayphoto.com> wrote:
>>> >
>>> > Did I just lose all of my data?
>>> >
>>> > If we were able to export the journal, could we create a brand new mds
>>> > out of that and retrieve our data?
>>>
>>> No, it's easy to fix, but you need to re-compile ceph-mon from source
>>> code. I'm writing the patch.
>>>
>>> >
>>> > On Thu, Jan 14, 2016 at 4:15 PM, Yan, Zheng <z...@redhat.com> wrote:
>>> >
>>> > > On Jan 15, 2016, at 08:01, Gregory Farnum <gfar...@redhat.com> wrote:
>>> > >
>>> > > On Thu, Jan 14, 2016 at 3:46 PM, Mike Carlson <m...@bayphoto.com> wrote:
>>> > >> Hey Zheng,
>>> > >>
>>> > >> I've been in the #ceph irc channel all day about this.
>>> > >>
>>> > >> We did that: we set max_mds back to 1, but instead of stopping mds 1,
>>> > >> we ran "ceph mds rmfailed 1". Running ceph mds stop 1 produces:
>>> > >>
>>> > >> # ceph mds stop 1
>>> > >> Error EEXIST: mds.1 not active (???)
>>> > >>
>>> > >> Our mds is in a state of resolve, and will not come back.
>>> > >>
>>> > >> We then tried to roll back the mds map to the epoch just before we set
>>> > >> max_mds to 2, but that command crashes all but one of our monitors and
>>> > >> never completes.
>>> > >>
>>> > >> We do not know what to do at this point. If there were a way to get the
>>> > >> mds back up just so we could back it up, we would be okay with
>>> > >> rebuilding. We just need the data back.
>>> > >
>>> > > It's not clear to me how much you've screwed up your monitor cluster.
>>> > > If that's still alive, you should just need to set max mds to 2, turn
>>> > > on an mds daemon, and let it resolve. Then you can follow the steps
>>> > > Zheng outlined for reducing the number of nodes cleanly.
>>> > > (That assumes that your MDS state is healthy and that the reason for
>>> > > your mounts hanging was a problem elsewhere, like with directory
>>> > > fragmentation confusing NFS.)
>>> > >
>>> > > If your monitor cluster is actually in trouble (i.e., the crashing
>>> > > problem made it to disk), that's a whole other thing now. But I
>>> > > suspect/hope it didn't, and you just need to shut down the client
>>> > > trying to do the setmap and then turn the monitors all back on.
>>> > > Meanwhile, please post a bug at tracker.ceph.com with the actual
>>> > > monitor commands you ran and as much of the backtrace/log as you can;
>>> > > we don't want to have commands which break the system! ;)
>>> > > -Greg
>>> >
>>> > The problem is that he ran 'ceph mds rmfailed 1' and there is no
>>> > command to undo this. I think we need a command 'ceph mds addfailed <rank>'.
>>> >
>>> > Regards
>>> > Yan, Zheng
>>> >
>>> > >>
>>> > >> Mike C
>>> > >>
>>> > >> On Thu, Jan 14, 2016 at 3:33 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> > >>>
>>> > >>> On Fri, Jan 15, 2016 at 3:28 AM, Mike Carlson <m...@bayphoto.com> wrote:
>>> > >>>> Thank you for the reply Zheng.
>>> > >>>>
>>> > >>>> We tried setting mds bal frag to true, but the end result was less
>>> > >>>> than desirable. All nfs and smb clients could no longer browse the
>>> > >>>> share; they would hang on a directory with anything more than a few
>>> > >>>> hundred files.
>>> > >>>>
>>> > >>>> We then tried to back out the active/active mds change, with no luck:
>>> > >>>> stopping one of the mds's (mds 1) prevented us from mounting the
>>> > >>>> cephfs filesystem.
>>> > >>>>
>>> > >>>> So we failed and removed the secondary MDS, and now our primary mds
>>> > >>>> is stuck in a "resolve" state:
>>> > >>>>
>>> > >>>> # ceph -s
>>> > >>>>     cluster cabd1728-2eca-4e18-a581-b4885364e5a4
>>> > >>>>      health HEALTH_WARN
>>> > >>>>             clock skew detected on mon.lts-mon
>>> > >>>>             mds cluster is degraded
>>> > >>>>             Monitor clock skew detected
>>> > >>>>      monmap e1: 4 mons at
>>> > >>>> {lts-mon=10.5.68.236:6789/0,lts-osd1=10.5.68.229:6789/0,lts-osd2=10.5.68.230:6789/0,lts-osd3=10.5.68.203:6789/0}
>>> > >>>>             election epoch 1282, quorum 0,1,2,3 lts-osd3,lts-osd1,lts-osd2,lts-mon
>>> > >>>>      mdsmap e7892: 1/2/1 up {0=lts-mon=up:resolve}
>>> > >>>>      osdmap e10183: 102 osds: 101 up, 101 in
>>> > >>>>       pgmap v6714309: 4192 pgs, 7 pools, 31748 GB data, 23494 kobjects
>>> > >>>>             96188 GB used, 273 TB / 367 TB avail
>>> > >>>>                 4188 active+clean
>>> > >>>>                    4 active+clean+scrubbing+deep
>>> > >>>>
>>> > >>>> Now we are really down for the count. We cannot get our MDS back up
>>> > >>>> in an active state, and none of our data is accessible.
>>> > >>>
>>> > >>> You can't remove an active mds this way; you need to:
>>> > >>>
>>> > >>> 1. make sure all active mds are running
>>> > >>> 2. run 'ceph mds set max_mds 1'
>>> > >>> 3. run 'ceph mds stop 1'
>>> > >>>
>>> > >>> Step 3 changes the second mds's state to stopping. Wait a while and
>>> > >>> the second mds will go to the standby state. Occasionally, the second
>>> > >>> MDS can get stuck in the stopping state. If that happens, restart all
>>> > >>> MDS daemons, then repeat step 3.
>>> > >>>
>>> > >>> Regards
>>> > >>> Yan, Zheng
>>> > >>>
>>> > >>>>
>>> > >>>> On Wed, Jan 13, 2016 at 7:05 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> > >>>>>
>>> > >>>>> On Thu, Jan 14, 2016 at 3:37 AM, Mike Carlson <m...@bayphoto.com> wrote:
>>> > >>>>>> Hey Greg,
>>> > >>>>>>
>>> > >>>>>> The inconsistent view is only over nfs/smb on top of our /ceph mount.
>>> > >>>>>>
>>> > >>>>>> When I look directly at the /ceph mount (which is using the cephfs
>>> > >>>>>> kernel module), everything looks fine.
>>> > >>>>>>
>>> > >>>>>> It is possible that this issue just went unnoticed, and it only
>>> > >>>>>> being an infernalis problem is just a red herring. With that, it is
>>> > >>>>>> oddly coincidental that we just started seeing issues.
>>> > >>>>>
>>> > >>>>> This seems like a seekdir bug in the kernel client; could you try a
>>> > >>>>> 4.0+ kernel?
>>> > >>>>>
>>> > >>>>> Besides, did you enable "mds bal frag" for ceph-mds?
>>> > >>>>>
>>> > >>>>> Regards
>>> > >>>>> Yan, Zheng
>>> > >>>>>
>>> > >>>>>>
>>> > >>>>>> On Wed, Jan 13, 2016 at 11:30 AM, Gregory Farnum <gfar...@redhat.com> wrote:
>>> > >>>>>>>
>>> > >>>>>>> On Wed, Jan 13, 2016 at 11:24 AM, Mike Carlson <m...@bayphoto.com> wrote:
>>> > >>>>>>>> Hello.
>>> > >>>>>>>>
>>> > >>>>>>>> Since we upgraded to Infernalis, we have noticed a severe problem
>>> > >>>>>>>> with cephfs when we have it shared over Samba and NFS.
>>> > >>>>>>>>
>>> > >>>>>>>> Directory listings are showing an inconsistent view of the files:
>>> > >>>>>>>>
>>> > >>>>>>>> $ ls /lts-mon/BD/xmlExport/ | wc -l
>>> > >>>>>>>> 100
>>> > >>>>>>>> $ sudo umount /lts-mon
>>> > >>>>>>>> $ sudo mount /lts-mon
>>> > >>>>>>>> $ ls /lts-mon/BD/xmlExport/ | wc -l
>>> > >>>>>>>> 3507
>>> > >>>>>>>>
>>> > >>>>>>>> The only workaround I have found is un-mounting and re-mounting
>>> > >>>>>>>> the nfs share; that seems to clear it up. Same with samba. I'd
>>> > >>>>>>>> post the output here, but it's thousands of lines; I can add
>>> > >>>>>>>> additional details on request.
>>> > >>>>>>>>
>>> > >>>>>>>> This happened after our upgrade to infernalis. Is it possible the
>>> > >>>>>>>> MDS is in an inconsistent state?
>>> > >>>>>>>
>>> > >>>>>>> So this didn't happen to you until after you upgraded? Are you
>>> > >>>>>>> seeing missing files when looking at cephfs directly, or only over
>>> > >>>>>>> the NFS/Samba re-exports? Are you also sharing Samba by
>>> > >>>>>>> re-exporting the kernel cephfs mount?
>>> > >>>>>>>
>>> > >>>>>>> Zheng, any ideas about kernel issues which might cause this or be
>>> > >>>>>>> more visible under infernalis?
>>> > >>>>>>> -Greg
>>> > >>>>>>>
>>> > >>>>>>>> We have cephfs mounted on a server using the built-in cephfs
>>> > >>>>>>>> kernel module:
>>> > >>>>>>>>
>>> > >>>>>>>> lts-mon:6789:/ /ceph ceph name=admin,secretfile=/etc/ceph/admin.secret,noauto,_netdev
>>> > >>>>>>>>
>>> > >>>>>>>> We are running all of our ceph nodes on ubuntu 14.04 LTS. Samba is
>>> > >>>>>>>> up to date, 4.1.6, and we export nfsv3 to linux and freebsd
>>> > >>>>>>>> systems. All seem to exhibit the same behavior.
>>> > >>>>>>>>
>>> > >>>>>>>> system info:
>>> > >>>>>>>>
>>> > >>>>>>>> # uname -a
>>> > >>>>>>>> Linux lts-osd1 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> > >>>>>>>>
>>> > >>>>>>>> root@lts-osd1:~# lsb_release -a
>>> > >>>>>>>> No LSB modules are available.
>>> > >>>>>>>> Distributor ID: Ubuntu
>>> > >>>>>>>> Description:    Ubuntu 14.04.3 LTS
>>> > >>>>>>>> Release:        14.04
>>> > >>>>>>>> Codename:       trusty
>>> > >>>>>>>>
>>> > >>>>>>>> package info:
>>> > >>>>>>>>
>>> > >>>>>>>> # dpkg -l | grep ceph
>>> > >>>>>>>> ii  ceph            9.2.0-1trusty  amd64  distributed storage and file system
>>> > >>>>>>>> ii  ceph-common     9.2.0-1trusty  amd64  common utilities to mount and interact with a ceph storage cluster
>>> > >>>>>>>> ii  ceph-fs-common  9.2.0-1trusty  amd64  common utilities to mount and interact with a ceph file system
>>> > >>>>>>>> ii  ceph-mds        9.2.0-1trusty  amd64  metadata server for the ceph distributed file system
>>> > >>>>>>>> ii  libcephfs1      9.2.0-1trusty  amd64  Ceph distributed file system client library
>>> > >>>>>>>> ii  python-ceph     9.2.0-1trusty  amd64  Meta-package for python libraries for the Ceph libraries
>>> > >>>>>>>> ii  python-cephfs   9.2.0-1trusty  amd64  Python libraries for the Ceph libcephfs library
>>> > >>>>>>>>
>>> > >>>>>>>> What is interesting is that a directory or file will not show up
>>> > >>>>>>>> in a listing; however, if we directly access the file, it shows up
>>> > >>>>>>>> in that instance:
>>> > >>>>>>>>
>>> > >>>>>>>> # ls -al | grep SCHOOL
>>> > >>>>>>>> # ls -alnd SCHOOL667055
>>> > >>>>>>>> drwxrwsr-x 1 21695 21183 2962751438 Jan 13 09:33  SCHOOL667055
>>> > >>>>>>>>
>>> > >>>>>>>> Any tips are appreciated!
>>> > >>>>>>>>
>>> > >>>>>>>> Thanks,
>>> > >>>>>>>> Mike C
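(For the archives: the clean way to drop back to a single MDS, per Zheng's steps quoted above, is roughly

    ceph mds set max_mds 1    # cap the filesystem at one active MDS
    ceph mds stop 1           # ask rank 1 to stop; it should go stopping -> standby
    ceph mds stat             # watch until rank 1 has left the active map

and if rank 1 gets stuck in the stopping state, restart the MDS daemons and repeat the stop command. Running "ceph mds rmfailed 1" instead, as we did, is what got us into this mess.)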
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com