On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <thhs...@twcp1.phys.ntu.edu.tw> wrote:
>
> Dear All,
>
> In the past months we have encountered several episodes of Lustre I/O
> slowing down abnormally. It is quite mysterious: there seems to be no
> problem with the network hardware, nor with Lustre itself, since there
> are no error messages at all on the MDT, OST, or client sides.
>
> Recently we probably found a way to reproduce it, which leads to some
> suspicions. We found that if we continuously perform I/O on a client
> without stopping, then after some time threshold (probably more than 24
> hours), the I/O bandwidth available for additional work on that client
> shrinks dramatically.
>
> Our configuration is the following:
> - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
> - The OST is served by a RAID 5 system with 15 SAS hard disks.
> - Some clients connect to the MDT/OST through InfiniBand, some through
>   gigabit ethernet.
>
> Our test focused on the clients using InfiniBand, and is described
> below.
>
> We have a huge (several TB) amount of data stored in the Lustre file
> system to be transferred to an outside network. In order not to exhaust
> the network bandwidth of our institute, we transfer the data with limited
> bandwidth via the following command:
>
>     rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/
>
> That is, the transfer rate is 1 MB per second, which is relatively
> low. The client reads the data from Lustre through InfiniBand, so during
> the transfer there should presumably be no problem doing other I/O
> on the same client. On average, copying a 600 MB file from one directory
> to another (both in the same Lustre file system) took about
> 1.0 - 2.0 secs, even while the rsync process was still working.
>
> But after about 24 hours of continuously sending data via rsync, the
> bandwidth for additional I/O on the same client shrank dramatically. When
> this happens, it takes more than 1 minute to copy a 600 MB file from one
> place to another (both in the same Lustre file system) while rsync is
> still running.
>
> Then we stopped the rsync process and waited for a while (about one
> hour). The performance of copying that 600 MB file returned to normal.
>
> Based on this observation, we suspect there may be a hidden QoS
> mechanism built into Lustre. When a process occupies the I/O bandwidth
> for a long time and exceeds some limit, does Lustre automatically
> shrink the I/O bandwidth for all processes running on the same client?
>
> I am not against such a QoS design, if it does exist. But the amount of
> throttling seems too large for InfiniBand (QDR and above). So I further
> suspect this may be because our system mixes clients, some of which have
> InfiniBand and some of which do not.
>
> Could anyone help us fix this problem? Any suggestions would be much
> appreciated.
There is no "hidden QoS", unless it is so well hidden that I don't know about it. You could investigate several different things to isolate the problem (some example commands are sketched below):

- try with a 2.13.56 client to see if the problem is already fixed
- check whether the client is using a lot of CPU when it becomes slow
- run strace on your copy process to see which syscalls are slow
- check memory/slab usage
- enable Lustre debug=-1 and dump the kernel debug log to see where the process is taking a long time to complete a request

It is definitely possible that there is some kind of problem, since continuously doing I/O from the same client for over a day is not a very common workload. You'll have to do the investigation on your system to isolate the source of the problem.

Cheers, Andreas
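For concreteness, here is a minimal sketch of those checks as shell commands, assuming the copy is done with a plain cp and that lctl is available on the client. The file paths and output locations are placeholders, not taken from the thread:

    # Per-syscall latency summary for the slow copy (-f follows forked children):
    strace -f -c -o /tmp/cp-strace.txt cp /lustre/dir1/file600M /lustre/dir2/

    # Snapshot of CPU usage while the slowdown is happening:
    top -b -n 1 | head -n 20

    # Memory pressure and kernel slab usage:
    cat /proc/meminfo
    slabtop -o | head -n 20

    # Full Lustre debug trace around one slow copy:
    lctl set_param debug=-1         # enable all debug flags
    lctl clear                      # empty the kernel debug buffer
    time cp /lustre/dir1/file600M /lustre/dir2/
    lctl dk /tmp/lustre-debug.log   # dump the debug buffer to a file
    lctl set_param debug=0          # turn debugging back off

Comparing the strace summary and the debug log from a fast copy against a slow one should show whether the extra time is spent in particular syscalls or in specific Lustre requests.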