On Sat, 28 Apr 2007, Mikulas Patocka wrote: > > > > Especially with lots of memory, allowing 40% of that memory to be dirty is > > just insane (even if we limit it to "just" 40% of the normal memory zone. > > That can be gigabytes. And no amount of IO scheduling will make it > > pleasant to try to handle the situation where that much memory is dirty. > > What about using different dirtypage limits for different processes?
Not good. We inadvertedly actually had a very strange case of that, in the sense that we had different dirtypage limits depending on the type of the allocation: if somebody used GFP_HIGHUSER, he'd be looking at the percentage as a percentage of _all_ memory, but if somebody used GFP_KERNEL he'd look at it as a percentage of just the normal low memory. So effectively they had different limits (the percentage may have been the same, but the _meaning_ of the percentage changed ;) And it's really problematic, because it means that the process that has a high tolerance for dirty memory will happily dirty a lot of RAM, and then when the process that has a _low_ tolerance comes along, it might write just a single byte, and go "oh, damn, I'm way over my dirty limits, I will now have to start doing writeouts like mad". Your form is much better: > --- i.e. every process has dirtypage activity counter, that is increased when > it dirties a page and decreased over time. ..but is really hard to do, and in particular, it's really hard to make any kinds of guarantees that when you have a hundred processes, they won't go over the total dirty limit together! And one of the reasons for the dirty limit is that the VM really wants to know that it always has enough clean memory that it can throw away that even if it needs to do allocations while under IO, it's not totally screwed. An example of this is using dirty mmap with a networked filesystem: with 2.6.20 and later, this should actually _work_ fairly reliably, exactly because we now also count the dirty mapped pages in the dirty limits, so we never get into the situation that we used to be able to get into, where some process had mapped all of RAM, and dirtied it without the kernel even realizing, and then when the kernel needed more memory (in order to write some of it back), it was totally screwed. So we do need the "global limit", as just a VM safety issue. We could do some per-process counters in addition to that, but generally, the global limit actually ends up doing the right thing: heavy writers are more likely to _hit_ the limit, so statistically the people who write most are also the people who end up havign to clean up - so it's all fair. > The main problem is that if the user extracts tar archive, tar eventually > blocks on writeback I/O --- O.K. But if bash attempts to write one page to > .bash_history file at the same time, it blocks too --- bad, the user is > annoyed. Right, but it's actually very unlikely. Think about it: the person who extracts the tar-archive is perhaps dirtying a thousand pages, while the .bash_history writeback is doing a single one. Which process do you think is going to hit the "oops, we went over the limit" case 99.9% of the time? The _really_ annoying problem is when you just have absolutely tons of memory dirty, and you start doing the writeback: if you saturate the IO queues totally, it simply doesn't matter _who_ starts the writeback, because anybody who needs to do any IO at all (not necessarily writing) is going to be blocked. This is why having gigabytes of dirty data (or even "just" hundreds of megs) can be so annoying. Even with a good software IO scheduler, when you have disks that do tagged queueing, if you fill up the disk queue with a few dozen (depends on the disk what the queue limit is) huge write requests, it doesn't really matter if the _software_ queuing then gives a big advantage to reads coming in. They'll _still_ be waiting for a long time, especially since you don't know what the disk firmware is going to do. It's possible that we could do things like refusing to use all tag entries on the disk for writing. That would probably help latency a _lot_. Right now, if we do writeback, and fill up all the slots on the disk, we cannot even feed the disk the read request immediately - we'll have to wait for some of the writes to finish before we can even queue the read to the disk. (Of course, if disks don't support tagged queueing, you'll never have this problem at all, but most disks do these days, and I strongly suspect it really can aggravate latency numbers a lot). Jens? Comments? Or do you do that already? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/