Dear All,

I am writing about a strange problem with the Lustre filesystem
that we have been encountering for many years.

We are using Lustre 1.8.7 with Linux kernel 2.6.32.19 on Debian
GNU/Linux, which we have upgraded over time from 5.x to, recently,
7.11. (We set up this Lustre filesystem many years ago for more
than 50 TB of data storage, and the system is busy with scientific
computation almost all the time, so we have not upgraded Lustre
to a recent version.) Sometimes we find that data has been
written incompletely.

The scenario is as follows. When a user runs "rsync -av" to
transfer a large amount of data from elsewhere to the Lustre
filesystem, we usually run the same "rsync -av" again after the
transfer to make sure that all data has been transferred (the
second run does not repeat the transfer for files that are OK).
However, we often find that many files are not OK and need
to be transferred again.

After the rsync command we run "ls -l" to check the data files,
and we do see that many files have incorrect sizes. Strangely,
when we run "ls -l" again, we often find that all files now
have the correct sizes. Running the same "rsync -av" command at
this point shows that no files need to be re-transferred.
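
To make this observation easier to reproduce and report, the two
"ls -l" passes can be sketched as a small Python script that records
the size of every file twice and lists the files whose size changed
in between. This is only an illustrative sketch: the function names,
the 5-second delay, and the command-line usage are assumptions made
here, not something we already run in production.

```python
import os
import stat
import sys
import time


def snapshot_sizes(root):
    """Return {relative_path: size_in_bytes} for regular files under root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)  # same metadata lookup that "ls -l" does
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                sizes[os.path.relpath(path, root)] = st.st_size
    return sizes


def changed_sizes(before, after):
    """Return {path: (old_size, new_size)} for files whose size differs."""
    return {
        path: (before[path], after[path])
        for path in before
        if path in after and before[path] != after[path]
    }


# Hypothetical usage: python check_sizes.py /path/on/lustre
if __name__ == "__main__" and len(sys.argv) > 1:
    first = snapshot_sizes(sys.argv[1])
    time.sleep(5)  # arbitrary pause between the two passes (assumption)
    second = snapshot_sizes(sys.argv[1])
    for path, (old, new) in sorted(changed_sizes(first, second).items()):
        print(f"{path}: {old} -> {new} bytes")
```

Any file printed by the script had a different size on the second
pass even though nothing was writing to it, which is the behaviour
described above.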

It seems to me that during the first transfer, some files
stay in cache and are not completely written to storage.
When we run "ls -l", something is triggered and the cache
is flushed to storage. Data can stay in the cache without
being flushed for a long time: we have often seen that several
hours after the first rsync command, many data files are
still not flushed unless we issue an "ls -l" command.
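
If this guess about the cache is right, explicitly asking the kernel
to flush each file should have the same effect as the "ls -l". The
following is an untested sketch of that idea in Python, using only
the standard os.fsync() call. The function name fsync_tree is made
up for illustration, and whether fsync() on a read-only descriptor
really forces the data out on Lustre 1.8.7 is an assumption that
would have to be checked on the client that did the writing.

```python
import os


def fsync_tree(root):
    """Call fsync() on every regular file under root; return how many."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            fd = os.open(path, os.O_RDONLY)
            try:
                os.fsync(fd)  # ask the kernel to write dirty data out
                count += 1
            finally:
                os.close(fd)
    return count
```

Running this on the destination directory right after the first
rsync, and then checking the file sizes, would show whether the
missing data is really sitting in the client cache.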

This phenomenon causes a serious problem. When the system is
under heavy load and a lot of data I/O is being performed, we may
lose data because it is sitting in cache and has not been flushed
to disk. We have checked the RAID settings of the storage and made
sure that the I/O is in "write through" mode. We searched the Lustre
manual but did not find any useful information on how to tune
the Lustre filesystem to avoid this problem.

So I would like to ask whether anyone has had a similar experience,
and whether it is possible to fix this.

Thank you very much for your kind help.


Best Regards,

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
