Re: [lustre-discuss] [EXTERNAL] Re: Use of lazystatfs

2023-07-06 Thread Mike Mosley via lustre-discuss
Andreas,

Thank you for the information.  We appreciate it.

Mike



On Wed, Jul 5, 2023 at 8:46 PM Andreas Dilger  wrote:

> [*Caution*: Email from External Sender. Do not click or open links or
> attachments unless you know this sender.]
>
>
> On Jul 5, 2023, at 07:14, Mike Mosley via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
> Hello everyone,
>
> We have drained some of our OSS/OSTs and plan to deactivate them soon.
> The process ahead leads us to a couple of questions that we hope somebody
> can advise us on.
>
> Scenario
> We have fully drained the target OSTs by using 'lfs find' to identify all
> files located on the targets and then feeding the list to 'lfs migrate'.
> A final scan shows there are no files left on the targets.
>
> Questions
> 1) Running 'lfs df -h' still shows some space being used even though we
> have drained all of the data.  Is that normal?  For example:
>
> UUID                     bytes     Used  Available  Use%  Mounted on
> hydra-OST0010_UUID       84.7T   583.8M      80.5T    1%  /dfs/hydra[OST:16]
> hydra-OST0011_UUID       84.7T   581.4M      80.5T    1%  /dfs/hydra[OST:17]
> hydra-OST0012_UUID       84.7T   581.7M      80.5T    1%  /dfs/hydra[OST:18]
> hydra-OST0013_UUID       84.7T   582.4M      80.5T    1%  /dfs/hydra[OST:19]
> hydra-OST0014_UUID       84.7T   584.1M      80.5T    1%  /dfs/hydra[OST:20]
> hydra-OST0015_UUID       84.7T   583.4M      80.5T    1%  /dfs/hydra[OST:21]
> hydra-OST0016_UUID       84.7T   583.6M      80.5T    1%  /dfs/hydra[OST:22]
> hydra-OST0017_UUID       84.7T   581.8M      80.5T    1%  /dfs/hydra[OST:23]
> hydra-OST0018_UUID       84.7T   582.6M      80.5T    1%  /dfs/hydra[OST:24]
> hydra-OST0019_UUID       84.7T   582.7M      80.5T    1%  /dfs/hydra[OST:25]
> hydra-OST001a_UUID       84.7T   580.0M      80.5T    1%  /dfs/hydra[OST:26]
> hydra-OST001b_UUID       84.7T   580.4M      80.5T    1%  /dfs/hydra[OST:27]
> hydra-OST001c_UUID       84.7T   582.1M      80.5T    1%  /dfs/hydra[OST:28]
> hydra-OST001d_UUID       84.7T   583.2M      80.5T    1%  /dfs/hydra[OST:29]
> hydra-OST001e_UUID       84.7T   583.7M      80.5T    1%  /dfs/hydra[OST:30]
> hydra-OST001f_UUID       84.7T   587.7M      80.5T    1%  /dfs/hydra[OST:31]
>
>
> I would suggest unmounting the OSTs from Lustre and mounting them via
> ldiskfs, then running "find $MOUNT/O -type f -ls" to see whether any in-use
> files are left.  It is likely that the ~580M used on each of the OSTs is
> just residual logs and large directories under O/*.  There might be some
> hundreds or thousands of zero-length object files that were precreated but
> never used; these will typically have an unusual file access mode of 07666
> and can be ignored.
>
> 2) According to some comments, prior to deactivating the OSS/OSTs, we
> should add the 'lazystatfs' option to all of our client mounts so that
> they do not hang once we deactivate some of the OSTs.  Is that correct?
> If so, why would you not just always have that option set?  What are the
> ramifications of doing it well in advance of the OST deactivations?
>
>
> The lazystatfs feature has been enabled by default since Lustre 2.9 so I
> don't think you need to do anything with it anymore.  The "lfs df" command
> will automatically skip unconfigured OSTs.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>
>
>
>
>
>
>
>
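
For anyone searching the archives later, a minimal sketch of the ldiskfs check
Andreas describes above; the OST device and scratch mount point are
placeholders, and the OST has to be stopped (unmounted from Lustre) before the
ldiskfs mount:

# umount /mnt/lustre/ost0010
# mount -t ldiskfs /dev/sdX /mnt/ost_ldiskfs
# find /mnt/ost_ldiskfs/O -type f -ls
# umount /mnt/ost_ldiskfs

Precreated-but-unused objects show up here as zero-length files with the
unusual 07666 mode he mentions and can be ignored.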
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Use of lazystatfs

2023-07-05 Thread Mike Mosley via lustre-discuss
Hello everyone,

We have drained some of our OSS/OSTs and plan to deactivate them soon.  The
process ahead leads us to a couple of questions that we hope somebody can
advise us on.

Scenario
We have fully drained the target OSTs by using 'lfs find' to identify all
files located on the targets and then feeding the list to 'lfs migrate'.
A final scan shows there are no files left on the targets.
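
For reference, one way such a drain is commonly scripted, using the
lfs_migrate wrapper script rather than calling 'lfs migrate' per file (mount
point and OST UUID as in the output below; -y skips lfs_migrate's
confirmation prompt):

# lfs find /dfs/hydra --obd hydra-OST0010_UUID | lfs_migrate -y
# lfs find /dfs/hydra --obd hydra-OST0010_UUID | wc -l    # final scan, expect 0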

Questions
1) Running 'lfs df -h' still shows some space being used even though we
have drained all of the data.  Is that normal?  For example:

UUID                     bytes     Used  Available  Use%  Mounted on
hydra-OST0010_UUID       84.7T   583.8M      80.5T    1%  /dfs/hydra[OST:16]
hydra-OST0011_UUID       84.7T   581.4M      80.5T    1%  /dfs/hydra[OST:17]
hydra-OST0012_UUID       84.7T   581.7M      80.5T    1%  /dfs/hydra[OST:18]
hydra-OST0013_UUID       84.7T   582.4M      80.5T    1%  /dfs/hydra[OST:19]
hydra-OST0014_UUID       84.7T   584.1M      80.5T    1%  /dfs/hydra[OST:20]
hydra-OST0015_UUID       84.7T   583.4M      80.5T    1%  /dfs/hydra[OST:21]
hydra-OST0016_UUID       84.7T   583.6M      80.5T    1%  /dfs/hydra[OST:22]
hydra-OST0017_UUID       84.7T   581.8M      80.5T    1%  /dfs/hydra[OST:23]
hydra-OST0018_UUID       84.7T   582.6M      80.5T    1%  /dfs/hydra[OST:24]
hydra-OST0019_UUID       84.7T   582.7M      80.5T    1%  /dfs/hydra[OST:25]
hydra-OST001a_UUID       84.7T   580.0M      80.5T    1%  /dfs/hydra[OST:26]
hydra-OST001b_UUID       84.7T   580.4M      80.5T    1%  /dfs/hydra[OST:27]
hydra-OST001c_UUID       84.7T   582.1M      80.5T    1%  /dfs/hydra[OST:28]
hydra-OST001d_UUID       84.7T   583.2M      80.5T    1%  /dfs/hydra[OST:29]
hydra-OST001e_UUID       84.7T   583.7M      80.5T    1%  /dfs/hydra[OST:30]
hydra-OST001f_UUID       84.7T   587.7M      80.5T    1%  /dfs/hydra[OST:31]


2) According to some comments, prior to deactivating the OSS/OSTs, we
should add the 'lazystatfs' option to all of our client mounts so that
they do not hang once we deactivate some of the OSTs.  Is that correct?
If so, why would you not just always have that option set?  What are the
ramifications of doing it well in advance of the OST deactivations?
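
For reference, this is roughly how we would check or set it on a client; the
llite parameter name and the MGS NID below are our best guess, not verified:

# lctl get_param llite.*.lazystatfs                # 1 = lazystatfs active
# mount -t lustre -o lazystatfs mgsnode@o2ib:/hydra /dfs/hydra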

Thanks in advance for feedback,

Mike
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-22 Thread Mike Mosley via lustre-discuss
Rick,

You were on the right track!

We were fortunate enough to get an expert from Cambridge Computing to take
a look at things and he managed to get us back into a normal state.

He remounted the MDTs with the 'abort_recov' option and we were finally
able to get things going again.
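
For anyone who hits this later, the remount is essentially a one-liner; the
device path and mount point below are illustrative, not our exact values:

# mount -t lustre -o abort_recov /dev/sdb /mnt/mdt0

The abort_recov option makes the target skip waiting for clients to reconnect
and replay, which is the recovery step Rick suspected below.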

Thanks to all who responded and special shout out to Brad at Cambridge
Computing for making time to help us get this fixed.

Mike




On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick  wrote:

> Mike,
>
> On the off chance that the recovery process is causing the issue, you
> could try mounting the mdt with the "abort_recov" option and see if the
> behavior changes.
>
> --Rick
>
>
>
> On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
> <lustre-discuss-boun...@lists.lustre.org on behalf of
> jeff.john...@aeoncomputing.com> wrote:
>
>
> Maybe someone else in the list can add clarity but I don't believe a
> recovery process on mount would keep the MDS read-only or trigger that
> trace. Something else may be going on.
>
>
> I would start from the ground up. Bring your servers up, unmounted. Ensure
> lnet is loaded and configured properly. Test lnet using ping or
> lnet_selftest from your MDS to all of your OSS nodes. Then mount your
> combined MGS/MDT volume on the MDS and see what happens.
>
>
>
>
> Is your MDS in a high-availability pair?
> What version of Lustre are you running?
>
>
>
>
> ...just a few things readers on the list might want to know.
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <mike.mos...@charlotte.edu> wrote:
>
>
> Jeff,
>
>
> At this point we have the OSS shut down.  We were coming back from a full
> outage, so we are trying to get the MDS up before starting to bring up
> the OSS.
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
>
> Mike,
>
>
> Have you made sure that the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
>
>
> Rick, 172.16.100.4 is the IB address of one of the OSS servers.  I
> believe the mgt and mdt0 are the same target.  My understanding is that we
> have a single instance of the MGT, which is on the first MDT server, i.e. it
> was created via a command similar to:
>
>
>
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
>
>
>
>
>
> Does that make sense?
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
>
> --Rick
>
>
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>
>
>
>
> Hi Rick,
>
>
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
>
>
> Mike,
>
>
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
>
>
> --Rick
>
>
>
>
>
>
>
>
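
For reference, a minimal lnet_selftest run along the lines Jeff suggests above
might look like this; 172.16.100.4@o2ib is the OSS NID from this thread, and
172.16.100.1@o2ib stands in for the MDS NID:

# modprobe lnet_selftest                 # on both nodes
# export LST_SESSION=$$
# lst new_session rw_check
# lst add_group oss 172.16.100.4@o2ib
# lst add_group mds 172.16.100.1@o2ib
# lst add_batch bulk
# lst add_test --batch bulk --from mds --to oss brw read size=1M
# lst run bulk
# lst stat mds oss                       # watch a few samples, then Ctrl-C
# lst end_session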

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Rick,

Thanks, we are going to try some of these suggestions later this evening or
tomorrow.   We are currently backing up the MDT (as described in the Lustre
manual).   I will post further once we get there.
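
A device-level image is the simplest variant of the backup the manual
describes: with the MDT unmounted, copy the whole device (the destination
path below is illustrative; /dev/sdb is the device from the mkfs command
quoted later in this thread):

# umount /mnt/mdt0                       # MDT must not be mounted during the copy
# dd if=/dev/sdb of=/backup/hydra-MDT0000.img bs=4M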

Thanks for the suggestions.

Mike

On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick  wrote:

> Mike,
>
> On the off chance that the recovery process is causing the issue, you
> could try mounting the mdt with the "abort_recov" option and see if the
> behavior changes.
>
> --Rick
>
>
>
> On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
> <lustre-discuss-boun...@lists.lustre.org on behalf of
> jeff.john...@aeoncomputing.com> wrote:
>
>
> Maybe someone else in the list can add clarity but I don't believe a
> recovery process on mount would keep the MDS read-only or trigger that
> trace. Something else may be going on.
>
>
> I would start from the ground up. Bring your servers up, unmounted. Ensure
> lnet is loaded and configured properly. Test lnet using ping or
> lnet_selftest from your MDS to all of your OSS nodes. Then mount your
> combined MGS/MDT volume on the MDS and see what happens.
>
>
>
>
> Is your MDS in a high-availability pair?
> What version of Lustre are you running?
>
>
>
>
> ...just a few things readers on the list might want to know.
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <mike.mos...@charlotte.edu> wrote:
>
>
> Jeff,
>
>
> At this point we have the OSS shut down.  We were coming back from a full
> outage, so we are trying to get the MDS up before starting to bring up
> the OSS.
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
>
> Mike,
>
>
> Have you made sure that the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
>
>
> Rick, 172.16.100.4 is the IB address of one of the OSS servers.  I
> believe the mgt and mdt0 are the same target.  My understanding is that we
> have a single instance of the MGT, which is on the first MDT server, i.e. it
> was created via a command similar to:
>
>
>
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
>
>
>
>
>
> Does that make sense?
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
>
> --Rick
>
>
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>
>
>
>
> Hi Rick,
>
>
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
>
>
> Mike,
>
>
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
>
>
> --Rick
>
>
>
>
>
>
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Jeff,

At this point we have the OSS shut down.  We were coming back from a full
outage, so we are trying to get the MDS up before starting to bring up
the OSS.

Mike

On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson wrote:

> Mike,
>
> Have you made sure that the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
> --Jeff
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>> Rick,
>> 172.16.100.4 is the IB address of one of the OSS servers.  I believe the
>> mgt and mdt0 are the same target.  My understanding is that we have a
>> single instance of the MGT, which is on the first MDT server, i.e. it was
>> created via a command similar to:
>>
>> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>>
>> Does that make sense?
>>
>> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:
>>
>>> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
>>> target or are they two separate targets just on the same host?
>>>
>>> --Rick
>>>
>>>
>>> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>>>
>>> Hi Rick,
>>>
>>>
>>> The MGS/MDS are combined. The output I posted is from the primary.
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>>>
>>>
>>> Mike,
>>>
>>>
>>> It looks like the mds server is having a problem contacting the mgs
>>> server. I'm guessing the mgs is a separate host? I would start by looking
>>> for possible network problems that might explain the LNet timeouts. You can
>>> try using "lctl ping" to test the LNet connection between nodes, and you
>>> can also try regular "ping" between the IP addresses on the IB interfaces.
>>>
>>>
>>> --Rick
>>>
>>>
>>>
>>>
>>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
>>> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
>>> lustre-discuss@lists.lustre.org> wrote:
>>>
>>>
>>>
>>>
>>> Greetings,
>>>
>>>
>>>
>>>
>>> We have experienced some type of issue that is causing both of our MDS
>>> servers to only be able to mount the mdt device in read only mode. Here are
>>> some of the error messages we are seeing in the log files below. We lost
>>> our Lustre expert a while back and we are not sure how to proceed to
>>> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked
>>> for more than 120 seconds.
>>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123
>>> 1 0x0086
>>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> schedule_timeout+0x221/0x2d0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> tracing_is_on+0x15/0x30
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> tracing_record_cmdline+0x1d/0x120
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> probe_sched_wakeup+0x2b/0xa0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Rick,
172.16.100.4 is the IB address of one of the OSS servers.  I believe the
mgt and mdt0 are the same target.  My understanding is that we have a
single instance of the MGT, which is on the first MDT server, i.e. it was
created via a command similar to:

# mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb

Does that make sense?
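
For reference, the quick LNet checks from the MDS toward that OSS would be
(the NID is the address above):

# lctl list_nids                         # confirm the local o2ib NID is configured
# lctl ping 172.16.100.4@o2ib            # should return the peer's NIDs, not time out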

On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:

> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
> --Rick
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>
>
> Hi Rick,
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
> Thanks,
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Mike,
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
> --Rick
>
>
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
> lustre-discuss@lists.lustre.org> wrote:
>
>
>
>
> Greetings,
>
>
>
>
> We have experienced some type of issue that is causing both of our MDS
> servers to only be able to mount the mdt device in read only mode. Here are
> some of the error messages we are seeing in the log files below. We lost
> our Lustre expert a while back and we are not sure how to proceed to
> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
> 0x0086
> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
> Jun 20 15:12:14 hyd-mds1 kernel: []
> schedule_timeout+0x221/0x2d0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_is_on+0x15/0x30
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_record_cmdline+0x1d/0x120
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> probe_sched_wakeup+0x2b/0xa0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> ttwu_do_wakeup+0xb5/0xe0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> wait_for_completion+0xfd/0x140
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> wake_up_state+0x20/0x20
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process_or_fork+0x244/0x450 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process+0x14/0x20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> class_config_parse_llog+0x125/0x350 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_cfg_log+0x790/0xc40 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_log+0x3dc/0x8f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> config_recover_log_add+0x13f/0x280 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_config+0x88b/0x13f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_process_log+0x2d8/0xad0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> libcfs_debug_msg+0x57/0x80 [libcfs]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lprocfs_counter_add+0xf9/0x160 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_start_targets+0x13a4/0x2a20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_start_mgc+0x260/0x2510 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_fill_super+0x10cc/0x1890 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_fill_super+0x468/0x960 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Hi Rick,

The MGS/MDS are combined.   The output I posted is from the primary.

Thanks,

Mike

On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick  wrote:

> Mike,
>
> It looks like the mds server is having a problem contacting the mgs
> server.  I'm guessing the mgs is a separate host?  I would start by looking
> for possible network problems that might explain the LNet timeouts.  You
> can try using "lctl ping" to test the LNet connection between nodes, and
> you can also try regular "ping" between the IP addresses on the IB
> interfaces.
>
> --Rick
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
> lustre-discuss@lists.lustre.org> wrote:
>
>
> Greetings,
>
>
> We have experienced some type of issue that is causing both of our MDS
> servers to only be able to mount the mdt device in read only mode. Here are
> some of the error messages we are seeing in the log files below. We lost
> our Lustre expert a while back and we are not sure how to proceed to
> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>
>
>
>
> Thanks,
>
>
>
>
> Mike
>
>
>
>
> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
> 0x0086
> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
> Jun 20 15:12:14 hyd-mds1 kernel: []
> schedule_timeout+0x221/0x2d0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_is_on+0x15/0x30
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_record_cmdline+0x1d/0x120
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> probe_sched_wakeup+0x2b/0xa0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> ttwu_do_wakeup+0xb5/0xe0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> wait_for_completion+0xfd/0x140
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> wake_up_state+0x20/0x20
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process_or_fork+0x244/0x450 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process+0x14/0x20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> class_config_parse_llog+0x125/0x350 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_cfg_log+0x790/0xc40 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_log+0x3dc/0x8f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> config_recover_log_add+0x13f/0x280 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_config+0x88b/0x13f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_process_log+0x2d8/0xad0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> libcfs_debug_msg+0x57/0x80 [libcfs]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lprocfs_counter_add+0xf9/0x160 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_start_targets+0x13a4/0x2a20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_start_mgc+0x260/0x2510 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_fill_super+0x10cc/0x1890 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_fill_super+0x468/0x960 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_common_put_super+0x270/0x270 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_nodev+0x4f/0xb0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_mount+0x38/0x60 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_fs+0x3e/0x1b0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> vfs_kern_mount+0x67/0x110
> Jun 20 15:12:14 hyd-mds1 kernel: [] do_mount+0x1ef/0xd00
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> __check_object_size+0x1ca/0x250
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> kmem_cache_alloc_trace+0x3c/0x200
> Jun 20 15:12:14 hyd-mds1 kernel: [] SyS_mount+0x83/0xd0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> system_call_fastpath+0x25/0x2a
> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for
> 172.16.100.4@o2ib: 9 seconds
> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous
> similar messages
> Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_ti

[lustre-discuss] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Greetings,

We have experienced some type of issue that is causing both of our MDS
servers to only be able to mount the mdt device in read only mode.  Here
are some of the error messages we are seeing in the log files below.   We
lost our Lustre expert a while back and we are not sure how to proceed to
troubleshoot this issue.   Can anybody provide us guidance on how to
proceed?

Thanks,

Mike

Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
more than 120 seconds.

Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre    D 9f27a3bc5230     0  4123      1 0x0086

Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:

Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70

Jun 20 15:12:14 hyd-mds1 kernel: []
schedule_timeout+0x221/0x2d0

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
tracing_is_on+0x15/0x30

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
tracing_record_cmdline+0x1d/0x120

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
probe_sched_wakeup+0x2b/0xa0

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
ttwu_do_wakeup+0xb5/0xe0

Jun 20 15:12:14 hyd-mds1 kernel: []
wait_for_completion+0xfd/0x140

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
wake_up_state+0x20/0x20

Jun 20 15:12:14 hyd-mds1 kernel: []
llog_process_or_fork+0x244/0x450 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
llog_process+0x14/0x20 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
class_config_parse_llog+0x125/0x350 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
mgc_process_cfg_log+0x790/0xc40 [mgc]

Jun 20 15:12:14 hyd-mds1 kernel: []
mgc_process_log+0x3dc/0x8f0 [mgc]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
config_recover_log_add+0x13f/0x280 [mgc]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
class_config_dump_handler+0x7e0/0x7e0 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
mgc_process_config+0x88b/0x13f0 [mgc]

Jun 20 15:12:14 hyd-mds1 kernel: []
lustre_process_log+0x2d8/0xad0 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
libcfs_debug_msg+0x57/0x80 [libcfs]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
lprocfs_counter_add+0xf9/0x160 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
server_start_targets+0x13a4/0x2a20 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
lustre_start_mgc+0x260/0x2510 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
class_config_dump_handler+0x7e0/0x7e0 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
server_fill_super+0x10cc/0x1890 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: []
lustre_fill_super+0x468/0x960 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
lustre_common_put_super+0x270/0x270 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] mount_nodev+0x4f/0xb0

Jun 20 15:12:14 hyd-mds1 kernel: []
lustre_mount+0x38/0x60 [obdclass]

Jun 20 15:12:14 hyd-mds1 kernel: [] mount_fs+0x3e/0x1b0

Jun 20 15:12:14 hyd-mds1 kernel: []
vfs_kern_mount+0x67/0x110

Jun 20 15:12:14 hyd-mds1 kernel: [] do_mount+0x1ef/0xd00

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
__check_object_size+0x1ca/0x250

Jun 20 15:12:14 hyd-mds1 kernel: [] ?
kmem_cache_alloc_trace+0x3c/0x200

Jun 20 15:12:14 hyd-mds1 kernel: [] SyS_mount+0x83/0xd0

Jun 20 15:12:14 hyd-mds1 kernel: []
system_call_fastpath+0x25/0x2a

Jun 20 15:13:14 hyd-mds1 kernel: LNet:
4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for
172.16.100.4@o2ib: 9 seconds

Jun 20 15:13:14 hyd-mds1 kernel: LNet:
4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous
similar messages

Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
more than 120 seconds.

Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre    D 9f27a3bc5230     0  4123      1 0x0086

dumpe2fs seems to show that the file systems are clean i.e.

dumpe2fs 1.45.6.wc1 (20-Mar-2020)

Filesystem volume name:   hydra-MDT

Last mounted on:  /

Filesystem UUID:  3ae09231-7f2a-43b3-a4ee-7f36080b5a66

Filesystem magic number:  0xEF53

Filesystem revision #:1 (dynamic)

Filesystem features:  has_journal ext_attr resize_inode dir_index
filetype mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg
dir_nlink quota

Filesystem flags: signed_directory_hash

Default mount options:user_xattr acl

Filesystem state: clean

Errors behavior:  Continue

Filesystem OS type:   Linux

Inode count:  2247671504

Block count:  1404931944

Reserved block count: 70246597

Free blocks:  807627552

Free inodes:  2100036536

First block:  0

Block size:   4096

Fragment size:4096

Reserved GDT blocks:  1024

Blocks per group: 20472

Fragments per group:  20472

Inodes per group: 32752

Inode blocks per group:   8188

Flex block group size:16

Filesystem created:   Thu Aug  8 14:21:01 2019

Last mount time:  Tue Jun 20 15:19:03 2023

Las