Hi Steve,

This could be LU-5152 (https://jira.whamcloud.com/browse/LU-5152), which attempted to fix unprivileged chgrp -R. The patch introduced a dependency between servers in the quota handling. It was reverted in 2.10.6, but it's not clear to me what the plan for chgrp -R is at this point; perhaps someone at Whamcloud could clarify. We definitely have users doing chgrp -R occasionally.

In your case, I would recommend upgrading to 2.10.6. In my experience it's painless to upgrade between 2.10.x releases; we do it as a rolling upgrade, failing targets over to their partner servers to avoid any significant downtime.
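For reference, a rough sketch of one failover step (the hostnames, device paths, and mount points are placeholders for your own configuration, and this assumes shared storage with failover pairs already set up):

   # On oss1, the node about to be upgraded: unmount its OST.
   [oss1]# umount /mnt/lustre/ost0

   # On oss2, the failover partner: mount the same target so it
   # keeps being served while oss1 is down.
   [oss2]# mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

   # Back on oss1: upgrade the Lustre packages and reboot
   # (exact package names depend on your build).
   [oss1]# yum update "lustre*"
   [oss1]# reboot

   # Once oss1 is back up, reverse the steps to fail the target
   # back, then repeat for the next server.

Clients should reconnect to the partner automatically, assuming failover NIDs were configured on the targets.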
Stephane

> On Jan 23, 2019, at 11:48 AM, Steve Barnet <bar...@icecube.wisc.edu> wrote:
>
> Hi all,
>
> Since early last summer, we have been running a 2.10.4 filesystem
> pretty much without incident. Then about 2 weeks ago, it started
> crashing for no immediately obvious reason. There are no indications
> of hardware-related problems in the logs, and the workloads have
> not changed significantly as far as we can tell.
>
> I can't rule out hardware or system performance problems, but if
> that is the case, there are no obvious pointers as to what those
> would be. We had one workload that seemed to trigger the problem
> (a couple dozen jobs running du on parts of the filesystem), but
> that had been running for months, and even after we killed it
> we had a couple more crashes.
>
> Since the first crash (on 7 January) we have experienced
> these crashes sporadically: sometimes days between crashes,
> other times, hours.
>
> The symptoms are the filesystem becoming unresponsive, and a
> load spike on the MDS and one OSS (we have 8x OSS). The OSS
> affected seems to be somewhat random. In the system logs, we see
> hung_task timeouts and stack traces, followed shortly by lustre-log
> dumps. The only real commonality I have seen is that on the MDS,
> the first hung task is in jbd2_journal_commit_transaction.
>
> To recover the filesystems, I have run e2fsck on the MDT
> and any affected OSTs. They have come back clean every time.
>
> I have attached snippets of the log files from the time of
> the most recent crash.
>
> A high-level summary of our system:
>
> MDS (1x) & OSS (8x)
> OS: CentOS Linux release 7.6.1810 (Core)
> kernel: 3.10.0-862.2.3.el7_lustre.x86_64
> Lustre: 2.10.4 (ldiskfs)
>
> Clients: a mix, but predominantly CentOS 7.x running 2.10.4
>
> Any insights would be greatly appreciated. There are lots of
> logs, so if they would be helpful, I can certainly make them
> available. In particular, that first lustre-log is pretty big,
> so I just grabbed the lines in closest proximity to the crash.
>
> Also, if there's a way to get more debugging-level
> information from Lustre, I'm happy to try that as well.
>
> And I realize this is all at a very high level, so I'll be
> happy to provide any additional info needed to help me figure
> this out.
>
> Thanks much for taking the time!
>
> Best,
>
> ---Steve
>
> <oss-messages.txt><mds-messages.txt><lustre-log.1547807192.63647-snippet.gz>

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org