Re: [lustre-discuss] new mounted client shows lower disk space

2018-11-20 Thread fırat yılmaz
Hi Thomas and Raj,

Thank you for the feedback.

Thomas,

I have checked the recovery_status on the OSSes.
I assume these recovery durations are in seconds and that I need to check them while
the mount operation is taking a long time. These values are updated every minute.
Good to learn that.

oss1
status: COMPLETE
recovery_start: 1539937227
recovery_duration: 97
completed_clients: 3/3
replayed_requests: 0
last_transno: 73019446366
VBR: DISABLED
IR: ENABLED

oss2
status: COMPLETE
recovery_start: 1540380323
recovery_duration: 436
completed_clients: 196/197
replayed_requests: 0
last_transno: 77309411331
VBR: ENABLED
IR: ENABLED

oss3

status: COMPLETE
recovery_start: 1539937210
recovery_duration: 150
completed_clients: 0/3
replayed_requests: 0
last_transno: 73019440310
VBR: ENABLED
IR: ENABLED

oss4
status: COMPLETE
recovery_start: 1539937234
recovery_duration: 151
completed_clients: 0/3
replayed_requests: 0
last_transno: 55839576629
VBR: ENABLED
IR: ENABLED

oss5
status: COMPLETE
recovery_start: 1539937257
recovery_duration: 96
completed_clients: 3/3
replayed_requests: 0
last_transno: 51544609437
VBR: DISABLED
IR: ENABLED

oss6
recovery_start: 1539937194
recovery_duration: 96
completed_clients: 3/3
replayed_requests: 0
last_transno: 47249690300
VBR: DISABLED
IR: ENABLED
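
For reference, here is how these can be collected in one pass on an OSS (a minimal
sketch, assuming the proc path Thomas mentioned and the fsname "lustre" seen in the
logs; lctl get_param reads the same parameter):

  # dump the recovery status of every OST served by this OSS
  for f in /proc/fs/lustre/obdfilter/lustre-OST00*/recovery_status; do
      echo "== $f =="
      cat "$f"
  done
  # same information via lctl
  lctl get_param "obdfilter.lustre-OST00*.recovery_status"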


The filesystem mounts during the system boot process, and it is consistent that after
each reboot, lctl ping from client to server completes with no errors, and vice
versa.

It seems that when there is high I/O on the filesystem, the mount operation
takes longer.

Best Regards.



Re: [lustre-discuss] new mounted client shows lower disk space

2018-11-14 Thread Raj
I would check whether the LNet address gets set up properly before mounting the Lustre
FS from the client. You can try manually loading the Lustre module, pinging
(lctl ping oss-nid) all the OSS nodes, and observing any abnormalities in
dmesg before mounting the FS.
It could be something as simple as a duplicate IP address on your IB interface or an
unstable IB fabric.
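
Something along these lines (a rough sketch; the NIDs and the MGS/mount point below
are placeholders, substitute your own):

  modprobe lustre                   # load the Lustre/LNet modules manually
  lctl network up                   # bring LNet up before mounting
  for nid in 10.0.0.5@o2ib 10.0.0.8@o2ib; do    # placeholder OSS NIDs
      lctl ping $nid || echo "LNet ping to $nid FAILED"
  done
  dmesg | tail -n 50                # look for LNet/o2iblnd errors
  mount -t lustre <mgs-nid>@o2ib:/lustre /mnt/lustre   # placeholder MGS NID and mount point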



Re: [lustre-discuss] new mounted client shows lower disk space

2018-11-14 Thread Thomas Roth
Hi,

your error messages are all well known - the one on the OSS will show up as
soon as the Lustre modules are loaded, provided you have some clients asking
for the OSTs (and your MDT, which should be up by then, is also looking for
the OSTs).
The kiblnd_check_conns message I have also seen quite often, never with any
OST problems.

Rather, it seems your OSTs take a lot of time to mount or to recover - did you check
/proc/fs/lustre/obdfilter/lustre-OST00*/recovery_status
?

Regards
Thomas



[lustre-discuss] new mounted client shows lower disk space

2018-11-12 Thread fırat yılmaz
Hi All
OS: Red Hat 7.4
Lustre Version: Intel® Manager for Lustre* software 4.0.3.0

I have 72 osts over 6 oss with HA and 1 mdt serving to 195 clients over
infiniband EDR.

After a reboot on a client, the Lustre filesystem mounts on startup. It should be
a 2.1 PB area, but it starts at 350 TB.

The lfs osts command shows about 90 percent of the even-numbered OSTs as ACTIVE and no
information about the other OSTs; as time passes (an hour or so), all OSTs
become active and the Lustre area can be seen as 2.1 PB.
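
(For example, this can be watched from the client - a small sketch, where /mnt/lustre
stands in for the actual mount point:

  lfs osts /mnt/lustre          # per-OST ACTIVE/INACTIVE status
  lfs df -h /mnt/lustre         # per-OST and total space; the total grows as OSTs reconnect
  watch -n 60 'lfs df -h /mnt/lustre | tail -n 2'    # repeat until the total reaches 2.1 PB)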


dmesg on lustre oss server:
LustreError: 137-5: lustre-OST0009_UUID: not available for connect from
10.0.0.130@o2ib (no target). If you are running an HA pair check that the
target is mounted on the other server.

dmesg on client:
LNet: 5419:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
10.0.0.5@o2ib: 15 seconds
Lustre: 5546:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent
has failed due to network error: [sent 1542009416/real 1542009426]
req@885f4761 x1616909446641136/t0(0)
o8->lustre-OST0030-osc-885f75219800@10.0.0.8@o2ib:28/4 lens 520/544 e 0
to 1 dl 1542009696 ref 1 fl Rpc:eXN/0/ rc 0/-1

I tested InfiniBand with ib_send_lat and ib_read_lat, and no errors occurred.
I tested LNet ping with lctl ping 10.0.0.8@o2ib; no errors occurred:
12345-0@lo
12345-10.51.22.8@o2ib

Why can some OSTs be accessible while others on the same server are not?
Best Regards.