Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-11-11 Thread Tung-Han Hsieh
Dear Nathan,

Thank you very much for sharing this info. LU-14124 and LU-14125
describe exactly the problem we have encountered.

At this moment we have upgraded to Lustre-2.12.5 and tried the
following workarounds:

1. Set grant_shrink=0 on the MGS only (not on the clients):
   /opt/lustre/sbin/lctl set_param -P osc.*.grant_shrink=0

2. Set grant_shrink=0 on the client only (not on the MGS):
   /opt/lustre/sbin/lctl set_param osc.*.grant_shrink=0

Setting it only on the MGS did not seem to help much: cur_grant_bytes
still kept decreasing under continuous I/O. Setting it on the clients,
however, does seem to help; cur_grant_bytes may still vary, but it is
restored to its full value within a short time.
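
For reference, a minimal sketch of the commands involved (the parameter
names are the ones used above; the get_param checks are our own addition
and assume the usual lctl tunable interface):

   # On the MGS: persistent default for all clients and new mounts
   # (requires LU-12759 on the clients, see below)
   lctl set_param -P osc.*.grant_shrink=0

   # On each client: takes effect immediately, but must be repeated
   # after every remount
   lctl set_param osc.*.grant_shrink=0

   # Verify on the client
   lctl get_param osc.*.grant_shrink
   lctl get_param osc.*.cur_grant_bytes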

From the discussion in LU-14124 and LU-14125, it seems that there is
probably a bug still to be fixed. We will be happy when the final
solution comes out.

Best Regards,

T.H.Hsieh


On Wed, Nov 11, 2020 at 07:25:50PM -0700, Nathan Dauchy - NOAA Affiliate wrote:
> Simon, Tung-Han,
> 
> You may also want to watch these tickets that seem to be related to the
> issue you describe:
> https://jira.whamcloud.com/browse/LU-14124
> https://jira.whamcloud.com/browse/LU-14125
> 
> -Nathan
> 
> 
> On Mon, Nov 2, 2020 at 7:18 AM Simon Guilbault <
> simon.guilba...@calculquebec.ca> wrote:
> 
> > Hi,
> >
> > If you set it on the MGS, it becomes the new default for all clients
> > and new mounts of the FS. The catch is that you need LU-12759 (fixed in
> > 2.12.4) on your clients, since there was a bug on older clients and that
> > setting was not working correctly.
> >
> > On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <
> > thhs...@twcp1.phys.ntu.edu.tw> wrote:
> >
> >> Dear Simon,
> >>
> >> Following your suggestions, now we confirmed that the problem of
> >> dropping I/O performance of a client when there is a continous
> >> I/O in the background is solved. It works charming. Thank you so
> >> much !!
> >>
> >> Here is a final question. We found that this command:
> >>
> >> lctl set_param osc.*.grant_shrink=0
> >>
> >> can be run the client, which fixed the value of "cur_grant_bytes"
> >> to be the highest value 1880752127, and thereby fixed the problem.
> >> Whenever we remount the file system (I mean, explicitly umount and
> >> mount the file system), we need to execute this command again to
> >> set it to zero.
> >>
> >> But this command:
> >>
> >> lctl set_param -P osc.*.grant_shrink=0
> >>
> >> has to be run in the MGS node. Only setting it in MGS but without
> >> setting in the client, it seems that the "cur_grant_bytes" of the
> >> testing client still dropping under the background continous I/O.
> >> So I am asking what's the meaning of this setting in MGS node.
> >>
> >> Thank you very much.
> >>
> >>
> >> T.H.Hsieh
> >>
> >> On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
> >> > Dear Simon,
> >> >
> >> > Thank you very much for your useful information. Now we are arranging
> >> > the system maintenance date in order to upgrade to Lustre-2.12.5. Then
> >> > we will follow your suggestion to see whether this problem could be
> >> > fixed.
> >> >
> >> > Here I report a test of under continuous I/O, how the cur_grant_bytes
> >> > changed overtime. Again the client runs the following script for
> >> > continuous reading in the background:
> >> >
> >> > # The Lustre file system was mounted under /home
> >> > while [ 1 ]; do
> >> > tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> >> > done
> >> >
> >> > And every 20 mins, in the same client we copied a 600MB file from one
> >> > directory to another within Lustre, and check the "cur_grant_bytes" by
> >> > the following command running in the same client:
> >> >
> >> > /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
> >> >
> >> > The result is (every line separated by around 20 mins):
> >> >
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1880752127
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1410564096
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1059201024
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=794400768
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=595800576
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=446850432
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=335137824
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=251353368
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=188515026
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=141386270
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=106039703
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=79529778
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=59647334
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=44735501
> >> > osc.chome-OST-osc-88a03915.cur_grant_bytes=33551626
> >> > osc.chome-OST-osc-88a03915.cur_grant_by

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-11-11 Thread Nathan Dauchy - NOAA Affiliate
Simon, Tung-Han,

You may also want to watch these tickets that seem to be related to the
issue you describe:
https://jira.whamcloud.com/browse/LU-14124
https://jira.whamcloud.com/browse/LU-14125

-Nathan


On Mon, Nov 2, 2020 at 7:18 AM Simon Guilbault <
simon.guilba...@calculquebec.ca> wrote:

> Hi,
>
> If you set it on the MGS, it becomes the new default for all clients
> and new mounts of the FS. The catch is that you need LU-12759 (fixed in
> 2.12.4) on your clients, since there was a bug on older clients and that
> setting was not working correctly.
>
> On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <
> thhs...@twcp1.phys.ntu.edu.tw> wrote:
>
>> Dear Simon,
>>
>> Following your suggestions, now we confirmed that the problem of
>> dropping I/O performance of a client when there is a continous
>> I/O in the background is solved. It works charming. Thank you so
>> much !!
>>
>> Here is a final question. We found that this command:
>>
>> lctl set_param osc.*.grant_shrink=0
>>
>> can be run the client, which fixed the value of "cur_grant_bytes"
>> to be the highest value 1880752127, and thereby fixed the problem.
>> Whenever we remount the file system (I mean, explicitly umount and
>> mount the file system), we need to execute this command again to
>> set it to zero.
>>
>> But this command:
>>
>> lctl set_param -P osc.*.grant_shrink=0
>>
>> has to be run in the MGS node. Only setting it in MGS but without
>> setting in the client, it seems that the "cur_grant_bytes" of the
>> testing client still dropping under the background continous I/O.
>> So I am asking what's the meaning of this setting in MGS node.
>>
>> Thank you very much.
>>
>>
>> T.H.Hsieh
>>
>> On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
>> > Dear Simon,
>> >
>> > Thank you very much for your useful information. Now we are arranging
>> > the system maintenance date in order to upgrade to Lustre-2.12.5. Then
>> > we will follow your suggestion to see whether this problem could be
>> > fixed.
>> >
>> > Here I report a test of under continuous I/O, how the cur_grant_bytes
>> > changed overtime. Again the client runs the following script for
>> > continuous reading in the background:
>> >
>> > # The Lustre file system was mounted under /home
>> > while [ 1 ]; do
>> > tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > done
>> >
>> > And every 20 mins, in the same client we copied a 600MB file from one
>> > directory to another within Lustre, and check the "cur_grant_bytes" by
>> > the following command running in the same client:
>> >
>> > /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
>> >
>> > The result is (every line separated by around 20 mins):
>> >
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1880752127
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1410564096
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1059201024
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=794400768
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=595800576
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=446850432
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=335137824
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=251353368
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=188515026
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=141386270
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=106039703
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=79529778
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=59647334
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=44735501
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=33551626
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=25163720
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=18872790
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=14154593
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=10615945
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=7961959
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=5971470
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=4478603
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=3358953
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=2519215
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1889412
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1417059
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1062795
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
>> > osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
>> > 
>> >
>> > The value 797097 seems to be the minimum. When it dropped to 1062795,
>> > the time of cp dramatically increased from around 1 sec to 1 min. In
>> > addition, during the test, the cluster is completely idling. And it
>> > is obvious 

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-11-02 Thread Simon Guilbault
Hi,

If you set it on the MGS, it becomes the new default for all clients
and new mounts of the FS. The catch is that you need LU-12759 (fixed in
2.12.4) on your clients, since there was a bug on older clients and that
setting was not working correctly.
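
As a quick sanity check (a sketch only, assuming the -P default has been
pushed out by the MGS; the mount point is the one used in this thread),
a freshly mounted client can be verified with:

   mount -t lustre -o flock mdt@o2ib:/chome /home
   lctl get_param osc.*.grant_shrink    # should report 0 once the -P default applies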

On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <
thhs...@twcp1.phys.ntu.edu.tw> wrote:

> Dear Simon,
>
> Following your suggestions, we have now confirmed that the problem of
> a client's I/O performance dropping when there is continuous I/O in
> the background is solved. It works like a charm. Thank you so
> much !!
>
> Here is a final question. We found that this command:
>
> lctl set_param osc.*.grant_shrink=0
>
> can be run on the client, which fixed the value of "cur_grant_bytes"
> to be the highest value 1880752127, and thereby fixed the problem.
> Whenever we remount the file system (I mean, explicitly umount and
> mount the file system), we need to execute this command again to
> set it to zero.
>
> But this command:
>
> lctl set_param -P osc.*.grant_shrink=0
>
> has to be run on the MGS node. When it is set only on the MGS but not
> on the client, the "cur_grant_bytes" of the testing client still keeps
> dropping under the background continuous I/O. So I am asking what the
> meaning of this setting on the MGS node is.
>
> Thank you very much.
>
>
> T.H.Hsieh
>
> On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
> > Dear Simon,
> >
> > Thank you very much for your useful information. Now we are arranging
> > the system maintenance date in order to upgrade to Lustre-2.12.5. Then
> > we will follow your suggestion to see whether this problem could be
> > fixed.
> >
> > Here I report a test of under continuous I/O, how the cur_grant_bytes
> > changed overtime. Again the client runs the following script for
> > continuous reading in the background:
> >
> > # The Lustre file system was mounted under /home
> > while [ 1 ]; do
> > tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> > done
> >
> > And every 20 mins, in the same client we copied a 600MB file from one
> > directory to another within Lustre, and check the "cur_grant_bytes" by
> > the following command running in the same client:
> >
> > /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
> >
> > The result is (every line separated by around 20 mins):
> >
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1880752127
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1410564096
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1059201024
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=794400768
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=595800576
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=446850432
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=335137824
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=251353368
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=188515026
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=141386270
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=106039703
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=79529778
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=59647334
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=44735501
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=33551626
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=25163720
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=18872790
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=14154593
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=10615945
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=7961959
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=5971470
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=4478603
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=3358953
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=2519215
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1889412
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1417059
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=1062795
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
> > osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
> > 
> >
> > The value 797097 seems to be the minimum. When it dropped to 1062795,
> > the time of cp dramatically increased from around 1 sec to 1 min. In
> > addition, during the test, the cluster is completely idling. And it
> > is obvious that this test does not saturate the loading of both network
> > and MDT / OST hardware (they have almost no loading).
> >
> > I am wondering whether this could be a bug to report to the development
> > team.
> >
> > Best Regards,
> >
> > T.H.Hsieh
> >
> > On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:
> > > Our current workaround was to use the following command on the MGS with
> > >

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-11-01 Thread Tung-Han Hsieh
Dear Simon,

Following your suggestions, we have now confirmed that the problem of
a client's I/O performance dropping when there is continuous I/O in
the background is solved. It works like a charm. Thank you so
much !!

Here is a final question. We found that this command:

lctl set_param osc.*.grant_shrink=0

can be run on the client, which fixed the value of "cur_grant_bytes"
to be the highest value 1880752127, and thereby fixed the problem.
Whenever we remount the file system (I mean, explicitly umount and
mount the file system), we need to execute this command again to
set it to zero.
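
In practice this means a small post-mount step on the client, roughly
(a sketch only; the mount options and mount point are the ones we use
elsewhere in this thread):

   umount /home
   mount -t lustre -o flock mdt@o2ib:/chome /home
   # re-apply the client-side setting after every mount
   lctl set_param osc.*.grant_shrink=0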

But this command:

lctl set_param -P osc.*.grant_shrink=0

has to be run on the MGS node. When it is set only on the MGS but not
on the client, the "cur_grant_bytes" of the testing client still keeps
dropping under the background continuous I/O. So I am asking what the
meaning of this setting on the MGS node is.

Thank you very much.


T.H.Hsieh

On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
> Dear Simon,
> 
> Thank you very much for your useful information. Now we are arranging
> the system maintenance date in order to upgrade to Lustre-2.12.5. Then
> we will follow your suggestion to see whether this problem could be
> fixed.
> 
> Here I report a test of how cur_grant_bytes changed over time under
> continuous I/O. Again the client runs the following script for
> continuous reading in the background:
> 
> # The Lustre file system was mounted under /home
> while [ 1 ]; do
> tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> done
> 
> And every 20 mins, on the same client we copied a 600MB file from one
> directory to another within Lustre, and checked "cur_grant_bytes" with
> the following command running on the same client:
> 
> /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
> 
> The result is (every line separated by around 20 mins):
> 
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1880752127
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1410564096
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1059201024
> osc.chome-OST-osc-88a03915.cur_grant_bytes=794400768
> osc.chome-OST-osc-88a03915.cur_grant_bytes=595800576
> osc.chome-OST-osc-88a03915.cur_grant_bytes=446850432
> osc.chome-OST-osc-88a03915.cur_grant_bytes=335137824
> osc.chome-OST-osc-88a03915.cur_grant_bytes=251353368
> osc.chome-OST-osc-88a03915.cur_grant_bytes=188515026
> osc.chome-OST-osc-88a03915.cur_grant_bytes=141386270
> osc.chome-OST-osc-88a03915.cur_grant_bytes=106039703
> osc.chome-OST-osc-88a03915.cur_grant_bytes=79529778
> osc.chome-OST-osc-88a03915.cur_grant_bytes=59647334
> osc.chome-OST-osc-88a03915.cur_grant_bytes=44735501
> osc.chome-OST-osc-88a03915.cur_grant_bytes=33551626
> osc.chome-OST-osc-88a03915.cur_grant_bytes=25163720
> osc.chome-OST-osc-88a03915.cur_grant_bytes=18872790
> osc.chome-OST-osc-88a03915.cur_grant_bytes=14154593
> osc.chome-OST-osc-88a03915.cur_grant_bytes=10615945
> osc.chome-OST-osc-88a03915.cur_grant_bytes=7961959
> osc.chome-OST-osc-88a03915.cur_grant_bytes=5971470
> osc.chome-OST-osc-88a03915.cur_grant_bytes=4478603
> osc.chome-OST-osc-88a03915.cur_grant_bytes=3358953
> osc.chome-OST-osc-88a03915.cur_grant_bytes=2519215
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1889412
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1417059
> osc.chome-OST-osc-88a03915.cur_grant_bytes=1062795
> osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
> osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
> 
> 
> The value 797097 seems to be the minimum. When it dropped to 1062795,
> the time of cp dramatically increased from around 1 sec to 1 min. In
> addition, during the test the cluster was completely idle, and it is
> obvious that this test does not saturate either the network or the
> MDT / OST hardware (they have almost no load).
> 
> I am wondering whether this could be a bug to report to the development
> team.
> 
> Best Regards,
> 
> T.H.Hsieh
> 
> On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:
> > Our current workaround was to use the following command on the MGS with
> > Lustre 2.12.5 that include the patches in LU-12651 and LU-12759 (we were
> > using a patched 2.12.4 a few months ago):
> > lctl set_param -P osc.*.grant_shrink=0
> > 
> > We could not find the root cause of the underlying problem, dynamic grant
> > shrinking seems to be useful when the OSTs are running out of free space.
> > 
> > On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <
> > thhs...@twcp1.phys.ntu.edu.tw> wrote:
> > 
> > > Dear Simon,
> > >
> > > Thank you very much for your hint. Yes, you are right. We compare

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-29 Thread Tung-Han Hsieh
Dear Simon,

Thank you very much for your useful information. Now we are arranging
the system maintenance date in order to upgrade to Lustre-2.12.5. Then
we will follow your suggestion to see whether this problem could be
fixed.

Here I report a test of how cur_grant_bytes changed over time under
continuous I/O. Again the client runs the following script for
continuous reading in the background:

# The Lustre file system was mounted under /home
while [ 1 ]; do
tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
done

And every 20 mins, on the same client we copied a 600MB file from one
directory to another within Lustre, and checked "cur_grant_bytes" with
the following command running on the same client:

/opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes

The result is (every line separated by around 20 mins):

osc.chome-OST-osc-88a03915.cur_grant_bytes=1880752127
osc.chome-OST-osc-88a03915.cur_grant_bytes=1410564096
osc.chome-OST-osc-88a03915.cur_grant_bytes=1059201024
osc.chome-OST-osc-88a03915.cur_grant_bytes=794400768
osc.chome-OST-osc-88a03915.cur_grant_bytes=595800576
osc.chome-OST-osc-88a03915.cur_grant_bytes=446850432
osc.chome-OST-osc-88a03915.cur_grant_bytes=335137824
osc.chome-OST-osc-88a03915.cur_grant_bytes=251353368
osc.chome-OST-osc-88a03915.cur_grant_bytes=188515026
osc.chome-OST-osc-88a03915.cur_grant_bytes=141386270
osc.chome-OST-osc-88a03915.cur_grant_bytes=106039703
osc.chome-OST-osc-88a03915.cur_grant_bytes=79529778
osc.chome-OST-osc-88a03915.cur_grant_bytes=59647334
osc.chome-OST-osc-88a03915.cur_grant_bytes=44735501
osc.chome-OST-osc-88a03915.cur_grant_bytes=33551626
osc.chome-OST-osc-88a03915.cur_grant_bytes=25163720
osc.chome-OST-osc-88a03915.cur_grant_bytes=18872790
osc.chome-OST-osc-88a03915.cur_grant_bytes=14154593
osc.chome-OST-osc-88a03915.cur_grant_bytes=10615945
osc.chome-OST-osc-88a03915.cur_grant_bytes=7961959
osc.chome-OST-osc-88a03915.cur_grant_bytes=5971470
osc.chome-OST-osc-88a03915.cur_grant_bytes=4478603
osc.chome-OST-osc-88a03915.cur_grant_bytes=3358953
osc.chome-OST-osc-88a03915.cur_grant_bytes=2519215
osc.chome-OST-osc-88a03915.cur_grant_bytes=1889412
osc.chome-OST-osc-88a03915.cur_grant_bytes=1417059
osc.chome-OST-osc-88a03915.cur_grant_bytes=1062795
osc.chome-OST-osc-88a03915.cur_grant_bytes=797097
osc.chome-OST-osc-88a03915.cur_grant_bytes=797097


The value 797097 seems to be the minimum. When it dropped to 1062795,
the time of cp dramatically increased from around 1 sec to 1 min. In
addition, during the test the cluster was completely idle, and it is
obvious that this test does not saturate either the network or the
MDT / OST hardware (they have almost no load).
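
For reference, the periodic check above can be scripted roughly as
follows (the 20-minute interval and log path are just illustrative):

# Sample cur_grant_bytes every 20 minutes while the background
# tar loop is running, appending to a hypothetical log file.
while [ 1 ]; do
    date >> /tmp/grant.log
    /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes >> /tmp/grant.log
    sleep 1200
done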

I am wondering whether this could be a bug to report to the development
team.

Best Regards,

T.H.Hsieh

On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:
> Our current workaround was to use the following command on the MGS with
> Lustre 2.12.5 that include the patches in LU-12651 and LU-12759 (we were
> using a patched 2.12.4 a few months ago):
> lctl set_param -P osc.*.grant_shrink=0
> 
> We could not find the root cause of the underlying problem, dynamic grant
> shrinking seems to be useful when the OSTs are running out of free space.
> 
> On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <
> thhs...@twcp1.phys.ntu.edu.tw> wrote:
> 
> > Dear Simon,
> >
> > Thank you very much for your hint. Yes, you are right. We compared
> > the grant size of two client by (running in each client):
> >
> > lctl get_param osc.*.cur_grant_bytes
> >
> > - Client A: It has run the following large data transfer for over 36 hrs.
> >
> > while [ 1 ]; do
> > tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> > done
> >
> >   The value of "cur_grant_bytes" is 796134.
> >
> > - Client B: It is almost idling during the action of Client A.
> >
> >   The value of "cur_grant_bytes" is 1715863552.
> >
> > If this is the reason that hit the I/O performance of Client A seriously,
> > is it possible to maintain it at a constant value at least for the head
> > node (since the head node is the most probable one to have large and long
> > time data I/O of the whole cluster, especially for a data center) ?
> >
> > I would be also like to ask: Why this value has to be dynamically adjusted
> > ?
> >
> > Thank you very much for your comment in advance.
> >
> > Best Regards,
> >
> > T.H.Hsieh
> >
> > On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
> > > Hi, we had a similar performance problem on our login/DTNs node a few
> > > months ago, the problem was the gran

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-29 Thread Simon Guilbault
Our current workaround is to use the following command on the MGS with
Lustre 2.12.5, which includes the patches from LU-12651 and LU-12759 (we
were using a patched 2.12.4 a few months ago):
lctl set_param -P osc.*.grant_shrink=0

We could not find the root cause of the underlying problem; dynamic grant
shrinking seems to be useful when the OSTs are running out of free space.

On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <
thhs...@twcp1.phys.ntu.edu.tw> wrote:

> Dear Simon,
>
> Thank you very much for your hint. Yes, you are right. We compared
the grant size of the two clients by running the following on each client:
>
> lctl get_param osc.*.cur_grant_bytes
>
> - Client A: It has run the following large data transfer for over 36 hrs.
>
> while [ 1 ]; do
> tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> done
>
>   The value of "cur_grant_bytes" is 796134.
>
> - Client B: It is almost idling during the action of Client A.
>
>   The value of "cur_grant_bytes" is 1715863552.
>
> If this is what hit the I/O performance of Client A so seriously, is it
> possible to maintain it at a constant value at least for the head node
> (since the head node is the most likely one in the whole cluster to have
> large and long-running data I/O, especially in a data center) ?
>
> I would also like to ask: why does this value have to be dynamically
> adjusted ?
>
> Thank you very much for your comment in advance.
>
> Best Regards,
>
> T.H.Hsieh
>
> On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
> > Hi, we had a similar performance problem on our login/DTNs node a few
> > months ago, the problem was the grant size was shrinking and was getting
> > stuck under 1MB. Once under 1MB, the client had to send every request to
> > the OST using sync IO.
> >
> > Check the output of the following command:
> > lctl get_param osc.*.cur_grant_bytes
> >
> > On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <
> > thhs...@twcp1.phys.ntu.edu.tw> wrote:
> >
> > > Dear All,
> > >
> > > Sorry that I am not sure whether this mail was successfully posted to
> > > the lustre-discuss mailing list or not. So I resent it again. Please
> > > ignore it if you already read it before.
> > >
> > >
> ===
> > >
> > > Dear Andreas,
> > >
> > > Thank you very much for your kindly suggestions. These days I got a
> chance
> > > to follow your suggestions for the test. This email is to report the
> > > results
> > > I have done so far. What I have done were:
> > >
> > > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
> > >(obtained from github). The compiling information is:
> > >
> > >- Linux kernel 4.19.123.
> > >- Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
> > >- ./configure --prefix=/opt/lustre \
> > >  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
> > >  --disable-server --enable-mpitests=no
> > >- make
> > >- make install
> > >
> > > 2. We mounted the lustre file system (lustre MDT/OST servers: version
> > >2.12.4 with Infiniband with ZFS backend) by this command:
> > >
> > >- mount -t lustre -o flock mdt@o2ib:/chome /home
> > >
> > > 3. The script to simulate large data transfer is following:
> > >(the directory "/home/large/data" contains 758 files, each size
> 600MB)
> > >
> > >while [ 1 ]; do
> > >tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> > >done
> > >
> > >ps. Note that this scenario is common in a large data center, while
> > >some users transferring large data out of the data center
> through
> > >the head node; while other users might copy files and do their
> > >normal works in the same head node.
> > >
> > > 4. During the data transfer in the background, I occationally ran this
> > >command in the same client to test whether there is any abnormality
> > >in I/O performance (where /home/dir1/file has size 600MB):
> > >
> > >cp /home/dir1/file /home/dir2/
> > >
> > >In the beginning this command can complete in about 1 sec. But after
> > >around 18 hours (not exactly, because the test ran overnight while
> > >I was sleeping), the problem appeared. The time to complete the same
> > >cp command was more than 1 minute.
> > >
> > >During the test, I am sure that the whole cluster was idling. The
> MDT
> > >and OST servers did not have other loading. The CPU usage of the
> testing
> > >client was below 0.3.
> > >
> > >Then I stopped the test, and let the whole system completely idle.
> But
> > >after 3 hours, the I/O abnormality of the same "cp" command was
> still
> > >there. Only after I unmounted /home and remounted /home, the
> abnormality
> > >of "cp" recovered to normal.
> > >
> > > Before and after remounting /home (which I call "reset"), I did the
> > > following tests:
> > >
> > > 1. Using "top" to check the memory usag

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-28 Thread Tung-Han Hsieh
Dear Simon,

Thank you very much for your hint. Yes, you are right. We compared
the grant size of the two clients by running the following on each client:

lctl get_param osc.*.cur_grant_bytes

- Client A: It has run the following large data transfer for over 36 hrs.

while [ 1 ]; do
tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
done

  The value of "cur_grant_bytes" is 796134.

- Client B: It is almost idling during the action of Client A.

  The value of "cur_grant_bytes" is 1715863552.

If this is what hit the I/O performance of Client A so seriously, is it
possible to maintain it at a constant value at least for the head node
(since the head node is the most likely one in the whole cluster to have
large and long-running data I/O, especially in a data center) ?

I would also like to ask: why does this value have to be dynamically adjusted ?

Thank you very much for your comment in advance.

Best Regards,

T.H.Hsieh

On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
> Hi, we had a similar performance problem on our login/DTN node a few
> months ago: the grant size was shrinking and getting stuck under 1MB.
> Once under 1MB, the client had to send every request to the OST using
> sync I/O.
> 
> Check the output of the following command:
> lctl get_param osc.*.cur_grant_bytes
> 
> On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <
> thhs...@twcp1.phys.ntu.edu.tw> wrote:
> 
> > Dear All,
> >
> > Sorry that I am not sure whether this mail was successfully posted to
> > the lustre-discuss mailing list or not. So I resent it again. Please
> > ignore it if you already read it before.
> >
> > ===
> >
> > Dear Andreas,
> >
> > Thank you very much for your kindly suggestions. These days I got a chance
> > to follow your suggestions for the test. This email is to report the
> > results
> > I have done so far. What I have done were:
> >
> > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
> >(obtained from github). The compiling information is:
> >
> >- Linux kernel 4.19.123.
> >- Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
> >- ./configure --prefix=/opt/lustre \
> >  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
> >  --disable-server --enable-mpitests=no
> >- make
> >- make install
> >
> > 2. We mounted the lustre file system (lustre MDT/OST servers: version
> >2.12.4 with Infiniband with ZFS backend) by this command:
> >
> >- mount -t lustre -o flock mdt@o2ib:/chome /home
> >
> > 3. The script to simulate large data transfer is following:
> >(the directory "/home/large/data" contains 758 files, each size 600MB)
> >
> >while [ 1 ]; do
> >tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> >done
> >
> >ps. Note that this scenario is common in a large data center, while
> >some users transferring large data out of the data center through
> >the head node; while other users might copy files and do their
> >normal works in the same head node.
> >
> > 4. During the data transfer in the background, I occationally ran this
> >command in the same client to test whether there is any abnormality
> >in I/O performance (where /home/dir1/file has size 600MB):
> >
> >cp /home/dir1/file /home/dir2/
> >
> >In the beginning this command can complete in about 1 sec. But after
> >around 18 hours (not exactly, because the test ran overnight while
> >I was sleeping), the problem appeared. The time to complete the same
> >cp command was more than 1 minute.
> >
> >During the test, I am sure that the whole cluster was idling. The MDT
> >and OST servers did not have other loading. The CPU usage of the testing
> >client was below 0.3.
> >
> >Then I stopped the test, and let the whole system completely idle. But
> >after 3 hours, the I/O abnormality of the same "cp" command was still
> >there. Only after I unmounted /home and remounted /home, the abnormality
> >of "cp" recovered to normal.
> >
> > Before and after remounting /home (which I call "reset"), I did the
> > following tests:
> >
> > 1. Using "top" to check the memory usage:
> >
> > Before reset:
> > =
> > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00,
> > 0.00
> > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> > 0.0 st
> > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
> > KiB Swap: 15631240 total, 15631240 free,0 used. 13076376+avail Mem
> >
> > After reset:
> > =
> > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01,
> > 0.00
> > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-28 Thread Simon Guilbault
Hi, we had a similar performance problem on our login/DTN node a few
months ago: the grant size was shrinking and getting stuck under 1MB.
Once under 1MB, the client had to send every request to the OST using
sync I/O.

Check the output of the following command:
lctl get_param osc.*.cur_grant_bytes
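
A one-liner to flag any OSC that has dropped below that threshold (a
sketch; 1048576 is just the 1MB boundary mentioned above):

   lctl get_param osc.*.cur_grant_bytes | awk -F= '$2 < 1048576'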

On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <
thhs...@twcp1.phys.ntu.edu.tw> wrote:

> Dear All,
>
> Sorry that I am not sure whether this mail was successfully posted to
> the lustre-discuss mailing list or not. So I resent it again. Please
> ignore it if you already read it before.
>
> ===
>
> Dear Andreas,
>
> Thank you very much for your kindly suggestions. These days I got a chance
> to follow your suggestions for the test. This email is to report the
> results
> I have done so far. What I have done were:
>
> 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
>(obtained from github). The compiling information is:
>
>- Linux kernel 4.19.123.
>- Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
>- ./configure --prefix=/opt/lustre \
>  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
>  --disable-server --enable-mpitests=no
>- make
>- make install
>
> 2. We mounted the lustre file system (lustre MDT/OST servers: version
>2.12.4 with Infiniband with ZFS backend) by this command:
>
>- mount -t lustre -o flock mdt@o2ib:/chome /home
>
> 3. The script to simulate large data transfer is following:
>(the directory "/home/large/data" contains 758 files, each size 600MB)
>
>while [ 1 ]; do
>tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>done
>
>ps. Note that this scenario is common in a large data center, while
>some users transferring large data out of the data center through
>the head node; while other users might copy files and do their
>normal works in the same head node.
>
> 4. During the data transfer in the background, I occationally ran this
>command in the same client to test whether there is any abnormality
>in I/O performance (where /home/dir1/file has size 600MB):
>
>cp /home/dir1/file /home/dir2/
>
>In the beginning this command can complete in about 1 sec. But after
>around 18 hours (not exactly, because the test ran overnight while
>I was sleeping), the problem appeared. The time to complete the same
>cp command was more than 1 minute.
>
>During the test, I am sure that the whole cluster was idling. The MDT
>and OST servers did not have other loading. The CPU usage of the testing
>client was below 0.3.
>
>Then I stopped the test, and let the whole system completely idle. But
>after 3 hours, the I/O abnormality of the same "cp" command was still
>there. Only after I unmounted /home and remounted /home, the abnormality
>of "cp" recovered to normal.
>
> Before and after remounting /home (which I call "reset"), I did the
> following tests:
>
> 1. Using "top" to check the memory usage:
>
> Before reset:
> =
> top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00,
> 0.00
> Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> 0.0 st
> KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
> KiB Swap: 15631240 total, 15631240 free,0 used. 13076376+avail Mem
>
> After reset:
> =
> top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01,
> 0.00
> Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> 0.0 st
> KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache
> KiB Swap: 15631240 total, 15631240 free,0 used. 13073571+avail Mem
>
>It seems that most of the memory were in "free" state. The amount of
>hidden memory was neglectable. So I did not further investigate the
>amount of slab memory.
>
> 2. Using "strace" with the following commands:
>
>- Before reset (took 1 min of each cp):
>  strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/
>
>- After reset (took 1 sec of each cp):
>  strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/
>
>From the log files, the major time consuming was read and write
> syscalls.
>The others are neglectable.
>
>   % time     seconds  usecs/call     calls    errors syscall
>   ------ ----------- ----------- --------- --------- ----------------
>   (Before reset)
>    71.46    0.278424        1920       145           write
>    28.06    0.109322         705       155           read
>   (After reset)
>    52.92    0.299091        2063       145           write
>    46.85    0.264777        1708       155           read

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-27 Thread Tung-Han Hsieh
Dear All,

Sorry, I am not sure whether this mail was successfully posted to
the lustre-discuss mailing list or not, so I am sending it again. Please
ignore it if you have already read it.

===

Dear Andreas,

Thank you very much for your kind suggestions. These days I got a chance
to follow them for the test. This email reports the results I have obtained
so far. What I have done is:

1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
   (obtained from github). The compiling information is:

   - Linux kernel 4.19.123.
   - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
   - ./configure --prefix=/opt/lustre \
 --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
 --disable-server --enable-mpitests=no
   - make
   - make install

2. We mounted the lustre file system (lustre MDT/OST servers: version
   2.12.4 with Infiniband with ZFS backend) by this command:

   - mount -t lustre -o flock mdt@o2ib:/chome /home

3. The script to simulate the large data transfer is the following:
   (the directory "/home/large/data" contains 758 files, each size 600MB)

   while [ 1 ]; do
   tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
   done

   P.S. Note that this scenario is common in a large data center, where
   some users transfer large data out of the data center through the
   head node while other users copy files and do their normal work on
   the same head node.

4. During the data transfer in the background, I occasionally ran this
   command on the same client to test whether there was any abnormality
   in I/O performance (where /home/dir1/file has size 600MB):

   cp /home/dir1/file /home/dir2/

   In the beginning this command completed in about 1 sec. But after
   around 18 hours (not exact, because the test ran overnight while
   I was sleeping), the problem appeared: the time to complete the same
   cp command was more than 1 minute.

   During the test, I am sure that the whole cluster was idle. The MDT
   and OST servers had no other load, and the CPU usage of the testing
   client was below 0.3.

   Then I stopped the test and let the whole system sit completely idle.
   But after 3 hours, the I/O abnormality of the same "cp" command was
   still there. Only after I unmounted and remounted /home did the
   behaviour of "cp" return to normal.

Before and after remounting /home (which I call "reset"), I did the
following tests:

1. Using "top" to check the memory usage:

Before reset:
=
top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00, 0.00
Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
KiB Swap: 15631240 total, 15631240 free,0 used. 13076376+avail Mem

After reset:
=
top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01, 0.00
Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache
KiB Swap: 15631240 total, 15631240 free,0 used. 13073571+avail Mem

   It seems that most of the memory was in the "free" state. The amount
   of hidden memory was negligible, so I did not investigate the slab
   memory further.

2. Using "strace" with the following commands:

   - Before reset (took 1 min of each cp):
 strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/

   - After reset (took 1 sec of each cp):
 strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/

   From the log files, the main time consumers were the read and write
   syscalls; the others are negligible.

   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
   (Before reset)
    71.46    0.278424        1920       145           write
    28.06    0.109322         705       155           read
   (After reset)
    52.92    0.299091        2063       145           write
    46.85    0.264777        1708       155           read

   Before the reset, since we had done the cp test on the same file a
   few times, the file was already cached. So the reading time is
   smaller before the reset than after it (since the reset remounted
   /home).

   Hence from this result the time spent in syscalls looks normal. The
   performance drop seems to occur somewhere else.

I have not yet investigated the Lustre kernel debug log by enabling
Lustre debug=-1. We will find another chance to do it.

Up to now, any comments or suggestions are very welcome.

Thanks for your help in advance.


Best Regards,

T.H.Hsieh

Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-08 Thread Andreas Dilger
On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh  
wrote:
> 
> Dear All,
> 
> In the past months, we encountered several times of Lustre I/O abnormally
> slowing down. It is quite mysterious that there seems no problem on the
> network hardware, nor the lustre itself since there is no error message
> at all in MDT/OST/client sides.
> 
> Recently we probably found a way to reproduce it, and we now have some
> suspicions. We found that if we continuously perform I/O on a client
> without stopping, then after some time threshold (probably more than 24
> hours) the additional file I/O bandwidth of that client shrinks
> dramatically.
> 
> Our configuration is the following:
> - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
> - The OST is served by a RAID 5 system with 15 SAS hard disks.
> - Some clients connect to MDT/OST through Infiniband, some through
>  gigabit ethernet.
> 
> Our test was focused on the clients using infiniband, which is described
> in the following:
> 
> We have a huge (several TB) amount of data stored in the Lustre file
> system to be transferred to outside network. In order not to exhaust
> the network bandwidth of our institute, we transfer the data with limited
> bandwidth via the following command:
> 
> rsync -av --bwlimit=1000  ://
> 
> That is, the transferring rate is 1 MB per second, which is relatively
> low. The client read the data from Lustre through infiniband. So during
> data transmission, presumably there is no problem to do other data I/O
> on the same client. On average, when copy a 600 MB file from one directory
> to another directory (both in the same Lustre file system), it took about
> 1.0 - 2.0 secs, even when the rsync process still working.
> 
> But after about 24 hours of continuously sending data via rsync, the
> additional I/O on the same client was dramatically shrinked. When it happens,
> it took more than 1 minute to copy a 600 MB from somewhere to another place
> (both in the same Lustre) while rsync is still running.
> 
> Then, we stopped the rsync process, and wait for a while (about one
> hour). The I/O performance of copying that 600 MB file returns normal.
> 
> Based on this observation, we suspect there may be a hidden QoS
> mechanism built into Lustre. When a process occupies the I/O bandwidth
> for a long time and exceeds some limit, does Lustre automatically
> shrink the I/O bandwidth for all processes running on the same client ?
> 
> I am not against such QoS design, if it does exist. But the amount of
> shrinking seems to be too large for infiniband (QDR and above). Then
> I am further suspecting that whether this is due to that our system is
> mixed with clients in which some have infiniband but some do not ?
> 
> Could anyone help to fix this problem ? Any suggestions will be very
> appreciated.

There is no "hidden QOS", unless it is so well hidden that I don't know
about it.

You could investigate several different things to isolate the problem:
- try with a 2.13.56 client to see if the problem is already fixed
- check if the client is using a lot of CPU when it becomes slow
- run strace on your copy process to see which syscalls are slow
- check memory/slab usage
- enable Lustre debug=-1 and dump the kernel debug log to see where
  the process is taking a long time to complete a request

It is definitely possible that there is some kind of problem, since this
is not a very common workload to be continuously writing to the same file
descriptor for over a day.  You'll have to do the investigation on your
system to isolate the source of the problem.
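
For the debug-log suggestion above, a minimal capture sequence on the
client looks roughly like this (a sketch only; the dump path is a
placeholder):

    lctl set_param debug=-1          # enable full debugging
    lctl clear                       # clear the existing debug buffer
    cp /home/dir1/file /home/dir2/   # reproduce the slow copy (path as used in the tests in this thread)
    lctl dk /tmp/lustre-debug.log    # dump the kernel debug log
    lctl set_param debug=0           # turn extended debugging back off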

Cheers, Andreas









[lustre-discuss] Hidden QoS in Lustre ?

2020-10-08 Thread Tung-Han Hsieh
Dear All,

In the past months we have encountered several episodes of Lustre I/O
abnormally slowing down. It is quite mysterious, since there seems to be
no problem with the network hardware, nor with Lustre itself: there is no
error message at all on the MDT/OST/client sides.

Recently we probably found a way to reproduce it, and we now have some
suspicions. We found that if we continuously perform I/O on a client
without stopping, then after some time threshold (probably more than 24
hours) the additional file I/O bandwidth of that client shrinks
dramatically.

Our configuration is the following:
- One MDT and one OST server, based on ZFS + Lustre-2.12.4.
- The OST is served by a RAID 5 system with 15 SAS hard disks.
- Some clients connect to MDT/OST through Infiniband, some through
  gigabit ethernet.

Our test was focused on the clients using infiniband, which is described
in the following:

We have a huge (several TB) amount of data stored in the Lustre file
system to be transferred to an outside network. In order not to exhaust
the network bandwidth of our institute, we transfer the data with a
limited bandwidth via the following command:

rsync -av --bwlimit=1000  ://

That is, the transfer rate is 1 MB per second, which is relatively low.
The client reads the data from Lustre through Infiniband, so during the
transfer there should presumably be no problem doing other data I/O on
the same client. On average, copying a 600 MB file from one directory
to another (both in the same Lustre file system) took about 1.0 - 2.0
secs, even while the rsync process was still working.

But after about 24 hours of continuously sending data via rsync, the
additional I/O on the same client shrank dramatically. When this happens,
it takes more than 1 minute to copy a 600 MB file from one place to
another (both within the same Lustre) while rsync is still running.

Then we stopped the rsync process and waited for a while (about one
hour). The I/O performance of copying that 600 MB file returned to normal.

Based on this observation, we suspect there may be a hidden QoS mechanism
built into Lustre. When a process occupies the I/O bandwidth for a long
time and exceeds some limit, does Lustre automatically shrink the I/O
bandwidth for all processes running on the same client?

I am not against such a QoS design, if it does exist. But the amount of
shrinking seems too large for Infiniband (QDR and above). I further
suspect that this may be because our system mixes clients, some with
Infiniband and some without.

Could anyone help us fix this problem? Any suggestions would be very
much appreciated.

Thanks very much.

T.H.Hsieh