John,

I wonder if this could be a credit issue.  Do you know the size of the other 
job that is doing the checkpointing?  It sounds like your job is a 
single-client job, so it is going to have a limited number of credits (the 
default used to be 8, but I don't know if that is still the case).  If the 
other job is using 100 nodes (just as an example), it could have 100x more 
outstanding IO requests than your job can have.  The spike in the server load 
makes me think that IO requests are getting backed up.
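
If you want a quick way to see what your client is allowed, and assuming the 
per-client limit I'm thinking of corresponds to the osc max_rpcs_in_flight 
tunable (the one whose default has historically been 8), a sketch like this 
would total it up across OSTs on your client (the parsing is illustrative, 
not a polished tool):

    import subprocess

    # Sum how many RPCs this client may keep in flight across all OSTs.
    # Assumes lctl is in PATH and that osc.*.max_rpcs_in_flight is the
    # relevant per-target limit.
    out = subprocess.run(
        ["lctl", "get_param", "osc.*.max_rpcs_in_flight"],
        capture_output=True, text=True, check=True,
    ).stdout

    limits = {}
    for line in out.splitlines():
        name, _, value = line.partition("=")
        if value.strip().isdigit():
            limits[name.strip()] = int(value.strip())

    for name, value in sorted(limits.items()):
        print(name, "=", value)
    print("total RPCs this client can keep in flight:", sum(limits.values()))

Multiply that by the node count of the checkpointing job to get a feel for 
how many requests it can have queued against the same servers.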

Lustre has a limit, peer_credits, on the number of outstanding IO requests 
per client, which helps prevent any one client from monopolizing a Lustre 
server.  But the nodes themselves also have a limit on the total number of 
credits, which caps the number of outstanding IO requests on the server (I 
think the number is related to the limitations of the network fabric, but it 
can also serve as a way to limit the number of requests that get queued on 
the server, to help prevent the server from getting overloaded).  If a large 
job is checkpointing, then maybe that job is chewing up the server's credits, 
so that your application is only getting a small number of IO requests added 
to a very large queue of outstanding requests.  My knowledge of credits may 
be flawed/out-dated (and perhaps someone else on the list can correct me if I 
am wrong), but it's one way that contention could exist on a server even if 
there isn't contention on the OSTs themselves.
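
The LNet-side numbers I am thinking of show up as module parameters on the 
nodes (and lnetctl net show -v reports them among the tunables).  This is 
only a sketch of where I would look; the module name depends on your fabric 
and, as I said, I may be mixing up which parameter maps to which limit:

    import pathlib

    # Peek at the LNet-level credit settings for whichever LND is loaded.
    # ko2iblnd is the InfiniBand/o2ib driver, ksocklnd the TCP one.
    for lnd in ("ko2iblnd", "ksocklnd"):
        params = pathlib.Path("/sys/module") / lnd / "parameters"
        if not params.is_dir():
            continue
        for name in ("peer_credits", "credits"):
            p = params / name
            if p.exists():
                print(f"{lnd}.{name} = {p.read_text().strip()}")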

If your application is using a single client which has some local SSD 
storage, the Persistent Client Cache (PCC) feature might be of some benefit 
to you (if it's available on your file system).
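
If you want to poke at it, the lfs pcc subcommands are the interface.  This 
is just a sketch; the path and the archive id are made-up values, and your 
admins would know whether PCC is configured and what id to use:

    import subprocess
    import sys

    # Show whether a file is currently cached by PCC; the attach call is
    # commented out because the id (-i 1) is only an example value.
    path = sys.argv[1] if len(sys.argv) > 1 else "/lustre/scratch/testfile"
    subprocess.run(["lfs", "pcc", "state", path], check=False)
    # subprocess.run(["lfs", "pcc", "attach", "-i", "1", path], check=False)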

--Rick


On 1/12/26, 7:52 PM, "lustre-discuss on behalf of John Bauer via 
lustre-discuss" <[email protected]> wrote:


All,
My recent questions are related to my trying to understand the following 
issue. I have an application that is writing, reading forwards, and reading 
backwards, a single file multiple times (as seen in the bottom frame of Image 
1). The file is striped 4x16M on 4 ssd OSTs on 2 OSSs. Everything runs along 
just great with transfer rates in the 5 GB/s range. At some point, another 
application triggers approximately 135 GB of writes to each of the 32 hdd 
OSTs on the 16 OSSs of the file system. When this happens my application's 
performance drops to 4.8 MB/s, a 99.9% loss of performance for the 33+ second 
duration of the other application's writes. My application is doing 16MB 
preads and pwrites in parallel using 4 pthreads, with O_DIRECT on the client.

The main question I have is: "Why do the writes from the other application 
affect my application so dramatically?" I am making demands of the 2 OSSs 
that are about the same order of magnitude (roughly 2.5 GB/s from each) as 
what the other application is getting from those same 2 OSSs (about 4 GB/s 
from each). There should be no competition for the OSTs, as I am using ssd 
and the other application is using hdd. If both applications are triggering 
Direct I/O on the OSSs, I would think there would be minimal competition for 
compute resources on the OSSs. But as seen below in Image 3, there is a huge 
spike in cpu load during the other application's writes. This is not a 
one-off event; I see this about 2 out of every 3 times I run this job.

I suspect the other application is one that checkpoints on a regular 
interval, but I am a non-root user and have no way to determine that. I am 
using PCP/pmapi to get the OSS data during my run. If the images get removed 
from the email, I have included alternate text with links to Dropbox for the 
images.
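
In case it helps, here is a rough sketch of the access pattern (a Python 
illustration only, not the actual application code; the path is made up, and 
the sizes and thread count just restate what I described above):

    import mmap
    import os
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/lustre/scratch/testfile"  # hypothetical path on the Lustre mount
    IO_SIZE = 16 * 1024 * 1024         # 16 MiB, matching the 16M stripe size
    NUM_THREADS = 4                    # one in-flight request per thread

    def read_chunk(fd, offset):
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, IO_SIZE)
        try:
            return os.preadv(fd, [buf], offset)
        finally:
            buf.close()

    def read_file_forward(path):
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        try:
            size = os.fstat(fd).st_size
            with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
                return sum(pool.map(lambda off: read_chunk(fd, off),
                                    range(0, size, IO_SIZE)))
        finally:
            os.close(fd)

    if __name__ == "__main__":
        print(f"read {read_file_forward(PATH)} bytes")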
Thanks,
John


