Hi,

On Tue, 12 Oct 1999 15:03:25 +0400, Hans Reiser <[EMAIL PROTECTED]> said:

>> With journaling, however, we have a new problem.  We can have large
>> amounts of dirty data pinned in memory, but we cannot actually write
>> that data to disk without first allocating more memory.

> Trivia: I don't think this is a feature of journaling, but rather a
> feature of a particular implementation of journaling.  Chris will
> correct me if I err, but Chris's journaling doesn't have this
> property.

From what I can see it does, in two places.  (Ext3 has similar
properties in both places.)  Likewise, Chris will correct me if I'm
wrong. :)

In Reiserfs's journal_end(), a commit results in getblk() being called
to produce one new journal block for every existing block in the entire
transaction, and the transaction blocks are then copied to the journal.
There is a strict ordering involved, and even if we do the copies and
writes one block at a time, that one-block allocation is still required
before any subsequent blocks in the journal can be freed.
(Reiserfs currently defaults to up to 128 such allocations being
required before anything from the transaction can be freed.)

Ext3 performs the copy-out of the log record when the block is initially
dirtied, not when it is committed, but it still needs to be prepared to
make a copy for every block in the transaction.

Secondly, when a transaction is still in progress --- well before we
start to commit it --- you cannot free up _any_ of the buffers involved
until the transaction is ultimately committed.  Doing so would violate
the write ordering requirement that the journal is written before the
transaction's buffers are flushed to disk.  That means that when memory
pressure starts to become significant, you still have to be able to
complete any transactions which have already started, and that requires
that you are able to satisfy any future memory requests generated as a
result of more buffer accesses as those transactions complete.
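
Put another way, any reclaim path has to treat such buffers as pinned
until commit.  Schematically (the types and fields below are invented
for illustration, not the real buffer_head):

/* Sketch: why reclaim cannot touch journaled dirty buffers.  The
 * fields and names are invented for illustration. */

enum buf_state { BUF_CLEAN, BUF_DIRTY };

struct jbuffer {
	enum buf_state state;
	int in_open_txn;	/* set while its transaction is uncommitted */
};

/* Returns 1 if the buffer may be freed to relieve memory pressure. */
static int can_reclaim(const struct jbuffer *b)
{
	if (b->state == BUF_CLEAN)
		return 1;
	/* Write ordering: the log record must hit disk before the
	 * buffer itself may, so until commit we can neither write it
	 * out nor drop it. */
	if (b->in_open_txn)
		return 0;
	return 1;	/* committed: ordinary writeback can release it */
}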

It's a fundamental property that as soon as you have write ordering,
any memory allocation required to complete a single write is going to
exert disproportionate memory pressure because of the number of other
writes which get blocked until that first one succeeds.  Deferred
allocation will just make matters much, much worse.

> Let us define a buffer's state as FLUSHTIME_NON_EXPANDING if flushing it
> requires no additional memory, and FLUSHTIME_EXPANDING otherwise.

Yes, the XFS people were using a similar "reserved" flag to indicate
whether a particular memory allocation was accounted against dirty
reserved memory, and they want Linux to support an upper bound on such
reserved memory so that they can guarantee memory deadlock freedom.
ext3 would certainly be able to benefit from such a feature in the VM.
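
The accounting itself could be very simple, something like the sketch
below; the names are invented and locking is elided for brevity:

/* Sketch of bounded "reserved"-memory accounting in the spirit of the
 * XFS flag.  Names invented; locking elided. */

static unsigned long reserved_bytes;
#define RESERVED_MAX	(8UL * 1024 * 1024)	/* arbitrary upper bound */

/* Called before pinning dirty data whose flush will itself need
 * memory.  Failure means: flush something first, then retry. */
static int reserve(unsigned long bytes)
{
	if (reserved_bytes + bytes > RESERVED_MAX)
		return 0;
	reserved_bytes += bytes;
	return 1;
}

static void unreserve(unsigned long bytes)
{
	reserved_bytes -= bytes;
}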

> I see the following separate issues:

> how to drive a kernel subsystem to flush some memory.  I advocate that
> the vm system push, and the subsystems give it calls for doing the
> pushing.

Yep.  The main problem is a matter of "cost": it is more expensive to
free up journaled buffer_heads than to free up unreferenced cache pages,
for example.  The "priority" counters in the vm try_to_free_page loop
are an obvious place to access such cost information, so that we don't
try too hard to wait for journal commits if we aren't all that short of
memory.
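
I'd picture each subsystem registering a flush callback with an
attached cost, with the expensive ones only run under real pressure.
A sketch, where the names and the cost model are invented and only the
falling "priority" is borrowed from try_to_free_page():

/* Sketch: cost-aware flushing driven from the VM side.  Everything
 * here is invented for illustration. */

struct flusher {
	const char *name;
	int cost;			/* higher == more expensive to run */
	int (*flush)(int count);	/* try to release 'count' pages */
};

/* 'priority' falls from 6 toward 0 as in try_to_free_page(); allow
 * progressively more expensive flushers as it drops. */
static int cost_budget(int priority)
{
	return 6 - priority;
}

static int shrink_memory(struct flusher *fl, int nr, int priority)
{
	int i, freed = 0;

	for (i = 0; i < nr; i++) {
		/* Don't wait on journal commits etc. unless we are
		 * genuinely short of memory. */
		if (fl[i].cost > cost_budget(priority))
			continue;
		freed += fl[i].flush(32);
	}
	return freed;
}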

> How to ensure that there is at least largest_reservation buffers of
> FLUSHTIME_NON_EXPANDING memory at all times, where largest_reservation
> is the sum of the amount every kernel subsystem says it might need at
> maximum.  

Would it be better to place an upper limit on the EXPANDING memory
instead?  That way, you wouldn't care where the extra memory required to
flush those buffers comes from, and you'd leave the VM free to evict
whatever stuff in core was least useful at the time you needed the
memory.
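
Roughly: instead of reserving up front, charge every
FLUSHTIME_EXPANDING buffer against a cap and throttle the dirtier when
the cap is hit.  Again a sketch only, with invented names and locking
elided:

static unsigned long expanding_bytes;
#define EXPANDING_MAX	(16UL * 1024 * 1024)	/* arbitrary cap */

/* Stub: a real version would push EXPANDING buffers through the
 * filesystem flush path, shrinking expanding_bytes as they complete. */
static void flush_some_expanding(void)
{
	expanding_bytes /= 2;	/* pretend half got written out */
}

/* Called before a buffer is marked FLUSHTIME_EXPANDING. */
static void charge_expanding(unsigned long bytes)
{
	/* Over the cap: make the dirtier do flush work now, while the
	 * memory needed to complete those flushes still exists. */
	while (expanding_bytes && expanding_bytes + bytes > EXPANDING_MAX)
		flush_some_expanding();
	expanding_bytes += bytes;
}

static void uncharge_expanding(unsigned long bytes)
{
	expanding_bytes -= bytes;
}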

> There would be a reserve() and unreserve() for the kernel subsystems
> to call.  I hypothesize that if largest_reservation is unnecessarily
> large then, so long as it is not completely obscene, performance will
> not suffer (and might gain), and code simplicity/performance will be
> improved by reserving the maximum one could possibly need rather than
> tracking the amount actually needed.

The counter-argument is that if you have deferred allocation of written
data, then the more reserved data you allow, the later you allow the
allocation to take place and the more flexibility you have to let the
filesystem choose appropriate allocations for data which is being
written sequentially.  This is definitely something the XFS people want.

> For reiserfs, it would simplify our balancing code (fix_nodes() in
> particular) and improve our performance if we could efficiently
> reserve.  Roma, think about this.

It should definitely be possible to establish a fairly clean common
kernel API for this.  Doing so would have the extra advantage that if
you had mixed ReiserFS and XFS partitions on the same machine, the
VM's memory reservation would be able to cope cleanly with multiple
users of reserved memory.
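
Nothing like this exists today, so purely as a sketch: one bounded
pool that several filesystems draw reservations from (all names
hypothetical):

#include <stdio.h>

struct mem_pool {
	unsigned long max_bytes;
	unsigned long used_bytes;
};

struct reservation {
	struct mem_pool *pool;
	unsigned long bytes;
};

static struct mem_pool vm_reserve = { 32UL * 1024 * 1024, 0 };

static int reserve_mem(struct reservation *r, struct mem_pool *p,
		       unsigned long bytes)
{
	if (p->used_bytes + bytes > p->max_bytes)
		return 0;		/* caller must flush and retry */
	p->used_bytes += bytes;
	r->pool = p;
	r->bytes = bytes;
	return 1;
}

static void unreserve_mem(struct reservation *r)
{
	r->pool->used_bytes -= r->bytes;
	r->bytes = 0;
}

int main(void)
{
	struct reservation reiserfs, xfs;

	/* Two filesystems drawing on one bounded pool. */
	if (reserve_mem(&reiserfs, &vm_reserve, 4UL * 1024 * 1024))
		printf("reiserfs reserved 4MB\n");
	if (reserve_mem(&xfs, &vm_reserve, 8UL * 1024 * 1024))
		printf("xfs reserved 8MB\n");
	unreserve_mem(&xfs);
	unreserve_mem(&reiserfs);
	return 0;
}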

--Stephen
