Hi!

We are running some x86_64 servers with large RAM (128GB). To put that in
perspective: with a memory bandwidth of a little more than 9GB/s, it takes more
than 10 seconds just to read all of RAM...
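(The arithmetic, for the record — 9 GB/s is a rounded figure for our hardware:)

```shell
# time to read 128 GB of RAM at ~9 GB/s memory bandwidth
awk 'BEGIN { printf "%.1f seconds\n", 128 / 9 }'
# -> 14.2 seconds
```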

In the past, and again recently, we have had problems with read() stalls while
the kernel was writing back large amounts (around 80GB) of dirty buffers to a
somewhat slow (40MB/s) device. The problem is old and well known, it seems, but
not really solved.

One recommendation was to limit the amount of dirty buffers, which did not
really avoid the problem, specifically if new dirty buffers are used as soon as
they become available (i.e. as soon as some were flushed). I had some success
limiting the memory used (including dirty pages) with control groups
(memory:iothrottle, SLES11 SP2), but the control framework (rccgconfig setting
up proper rights for /sys/fs/cgroup/mem/iothrottle/tasks) is quite incomplete
(no group write permission or ACL setup possible), so the end user can hardly
use it.
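For completeness, the dirty-limit knobs I mean are the vm sysctls; the absolute
*_bytes variants (available since 2.6.29, so present on our 3.0 kernel) are
easier to reason about on a 128GB box than the percentage-based *_ratio ones.
The values below are only examples, not a recommendation:

```shell
# Cap dirty memory at an absolute size instead of a percentage of RAM
# (example values; setting the *_bytes sysctls overrides the *_ratio ones)
sysctl -w vm.dirty_background_bytes=$((256*1024*1024))  # start background writeback at 256MB
sysctl -w vm.dirty_bytes=$((1024*1024*1024))            # throttle writers at 1GB dirty
```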

I still don't know whether read stalls are caused by the I/O channel or device 
being saturated, or whether the kernel is waiting for unused buffers to receive 
the read data, but I learned that I/O schedulers (and possibly the block layer 
optimizations) can cause extra delays, too.

We had one situation where a single sector could not be read with direct I/O 
for 10 seconds.

Recently we had the problem again, but this time it was clear that it was _not_
the device being overloaded, nor the I/O channel: the read problem was reported
for a device that was almost idle, and the I/O channel (FC) can handle much
more than the disk system can in both directions. So the problem seems to be
inside the kernel.

Oracle recommends (in article 1557478.1, without explaining the details)
turning off transparent huge pages. Before that I hadn't thought much about
that feature. It seems the kernel not only creates huge pages when they are
requested explicitly (which is what I had thought), but also implicitly, to
reduce the number of pages to be managed. Collecting smaller pages to combine
them into huge pages may also involve moving memory around (compaction), it
seems. I still don't know whether the kernel will also try to compact dirty
cache pages into huge pages, but we still see read stalls when there are many
dirty pages (like when copying 400GB of data to a somewhat slow (30MB/s) disk).
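For reference, what the Oracle note amounts to at runtime on a 3.0-era kernel
(the sysfs paths are the standard ones; exact accepted keywords can differ
between distribution kernels):

```shell
# Disable transparent huge pages, and THP defragmentation/compaction
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```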

Now I wonder what the real solution to the problem (as opposed to the numerous
work-arounds) would be. Obviously, simply pausing (yielding) the dirty buffer
flush to give reads a chance may not be sufficient when a read needs to wait
for unused pages, especially if the disks being read from are faster than those
being written to.
To my understanding, dirty pages have an "age" that is used to decide whether
to flush them or not. The I/O scheduler also seems to prefer read requests over
write requests. What I do not know is whether a read request is sent to the I/O
scheduler before buffer pages are assigned to it, or only after the pages have
been assigned. So a read request only gets an "age" once it has entered the I/O
scheduler, right?
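Incidentally, the "deadline" elevator already attaches an expiry time to each
request once it reaches the scheduler (reads default to 500ms, writes to 5s),
tunable per device — "sda" below is just a placeholder for the device in
question:

```shell
# Switch a device to the deadline elevator and inspect its expiry tunables
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/iosched/read_expire   # ms until a read is considered overdue
cat /sys/block/sda/queue/iosched/write_expire  # ms until a write is considered overdue
```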

So if both reads and writes had an "age", some EDF (earliest deadline first)
scheduling could be used to perform I/O (which would control buffer usage as a
side effect). For transparent huge pages, requests for a huge page should also
have an age, and a priority significantly below that of I/O buffers. If an
efficient algorithm and data model exist to perform these tasks, the problem
may be solved.
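A toy illustration of the selection rule (a pure sketch, nothing
kernel-specific): tag each pending request with a deadline and always service
the one that expires first:

```shell
# EDF selection sketch: each line is "<deadline_ms> <type> <id>";
# sorting numerically by deadline and taking the head picks the next request
printf '%s\n' '500 read r1' '5000 write w1' '300 read r2' | sort -n | head -n 1
# -> 300 read r2
```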

Unfortunately, if many buffers are dirtied at one moment and reads are
requested significantly later, there may be an additional need for time slices
when doing I/O (note: I'm not talking about quotas of some MB, but quotas of
time). I/O throughput may vary a lot, and time seems to be the only way to
manage latency correctly. To avoid a situation where reads stall writes (and
thus the age of dirty buffers grows without bound), the priority of writes
should be increased _carefully_, taking care not to create a "freight train of
dirty buffers" to be flushed. So maybe "smuggle in" a few dirty buffers between
read requests. As high-level flow control (like the cgroups mechanism),
processes with a high amount of dirty buffers should be suspended, or scheduled
with very low priority, to give the memory and I/O subsystems a chance to
process the dirty buffers.
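Until something like that exists, the closest knobs I know of for
deprioritizing a heavy writer are the CPU and I/O priorities (note that ionice
only has an effect under the CFQ elevator; the cp command and paths are just an
example):

```shell
# Run a bulk copy in the idle I/O class (CFQ only) with low CPU priority,
# so it only gets disk time when nothing else wants it
ionice -c 3 nice -n 19 cp -a /source/bigdata /slow-disk/
```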

For reference: the machine in question runs kernel 3.0.74-0.6.10-default; the
latest SLES11 SP2 kernel is 3.0.93-0.5.

I'd like to know what the gurus think about this. I believe that with
increasing RAM sizes this issue will soon become extremely important.

Regards,
Ulrich
P.S.: I'm not subscribed to linux-kernel, so please keep me on CC:
