On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
> On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi wrote:
>
> Anyway, especially regarding the OSSes, you may eventually need some ZFS
> module parameter optimizations, raising the vdev_write and vdev_read max
> values above their defaults. You may also disable ZIL, change the [...]
There is a somewhat hidden danger with eviction: you can get silent data loss.
The simplest example is buffered writes (i.e., any that aren't direct I/O).
Lustre reports completion (i.e., your write() syscall returns) once the data
is in the page cache on the client (like any modern file system).
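In practical terms, an application that needs to know its data actually left the page cache can force the flush and check the result itself, so an error surfaces instead of silent loss. A minimal sketch using ordinary POSIX tools, nothing Lustre-specific (the temp file stands in for a file on the Lustre mount):

```shell
f=$(mktemp)
printf 'important record\n' > "$f"   # write() completes once the data is in the page cache
if sync "$f"; then                   # sync FILE does an fsync(), surfacing any write-back error
  echo "flushed"
else
  echo "flush failed"
fi
rm -f "$f"
```

Note that `sync FILE` (as opposed to bare `sync`) requires coreutils 8.24 or later; calling fsync() directly from the application achieves the same thing.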
> On Oct 19, 2018, at 10:42 AM, Marion Hakanson wrote:
>
> Thanks for the feedback. You're both confirming what we've learned so far,
> that we had to unmount all the clients (which required rebooting most of
> them), then reboot all the storage servers, to get things unstuck until the
> problem recurred.
Sigh. Instructions that I've found for that have been a bit on the slim side
(:-). We'll give it a try.
Thanks and regards,
Marion
On Oct 19, 2018, at 07:59, Peter Bortas wrote:

So, that is at least not a syntax for abort_recovery I'm familiar
with. To take an example from the last time I did this, I first determined
which device wasn't completing the recovery, then logged in on the
server (an OST in this case) and ran:

  # lctl dl | grep obdfilter | grep fouo5-OST
  3 UP [...]
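The message is truncated here in the archive. For context, the first column of that `lctl dl` output is the device index that lctl's --device flag addresses, and the usual next step (per the Lustre manual, not shown in this snippet) is to abort recovery on that device. A sketch, with a hypothetical full output line since only "3 UP" survives above:

```shell
# One line of "lctl dl" output; the fields after the index are hypothetical
line="  3 UP obdfilter fouo5-OST0000 fouo5-OST0000_UUID 5"
# The first field is the device index that lctl commands address
dev=$(echo "$line" | awk '{print $1}')
echo "lctl --device $dev abort_recovery"
```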
On Oct 19, 2018, at 08:42, Marion Hakanson wrote:

Thanks for the feedback. You're both confirming what we've learned so far,
that we had to unmount all the clients (which required rebooting most of them),
then reboot all the storage servers, to get things unstuck until the problem
recurred.
I tried abort_recovery on the clients last night,
That should fix it, but I'd like to advocate for using abort_recovery.
Compared to unmounting thousands of clients, abort_recovery is a quick
operation that takes a few minutes to do. Wouldn't say it gets used a
lot, but I've done it on NSC's live environment six times since 2016,
solving the [...]
Marion,
You note the deadlock recurs on server reboot, so you're really stuck. This
is most likely due to recovery, where operations from the clients are replayed.
If you're fine with letting any pending I/O fail in order to get the system
back up, I would suggest a client-side action:
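The suggested action itself is cut off in the archive. One client-side step that fits this description (an assumption, since the original text is truncated) is a forced unmount, which aborts outstanding I/O with an error instead of queuing it for replay:

```
# Forced unmount on a stuck client: in-flight requests are aborted and
# return EIO to applications instead of being replayed after recovery.
# The mount point /mnt/lustre is hypothetical.
umount -f /mnt/lustre
```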
This issue is really kicking our behinds:
https://jira.whamcloud.com/browse/LU-11465
While we're waiting for the issue to get some attention from Lustre developers,
are there suggestions on how we can recover our cluster from this kind of
deadlocked, stuck-threads-on-the-MDS (or OSS) situation?
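For anyone landing in the same state, a first diagnostic step (a general sketch using standard Lustre and Linux facilities, not advice taken from this thread) is to check whether recovery itself is what is stuck, and to capture backtraces of the blocked server threads:

```
# On the MDS: recovery state (status, clients connected/evicted, time remaining)
lctl get_param mdt.*.recovery_status
# Equivalent on an OSS:
lctl get_param obdfilter.*.recovery_status
# Dump backtraces of all blocked (D-state) kernel threads into dmesg
echo w > /proc/sysrq-trigger
```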