Rik van Riel <[EMAIL PROTECTED]> writes:


> I know this could involve an extra copy of the page for
> writing, but it fits in really well with the transactioning
> scheme and could, with the proper API, give us a way to
> implement a high-performance transactioning interface for
> userspace. 

The user space side of this idea seems completely silly.
Plus user space has more issues to contend with.

For filesystem metadata, journalling is used exclusively to
be able to recover a consistent filesystem after an unexpected
software disappearance (power outage, kernel crash, etc.).

In user space, transactions (i.e. what a user space program
would use in a database) are also used to sort out
multiple processes that perform simultaneous actions
without locking.

Consider the following points:
(a) Transactions usually succeed.
(b) Transactions can be huge and involve multiple files.
(c) Commits happen after the end of every transaction,
    and their length affects the latency of operations.

In a large transaction, especially one that creates
new data (so the rollback logs can be very small),
you want the data to go to disk like a normal file write.

Then when the commit eventually happens you hardly need
to wait at all, and only the rollback log needs to be 
cleaned up.

Consider loading a gigabyte of data into your database
in one transaction.
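
As a rough illustration of the rollback-log approach described above
(a toy sketch, not code from any real filesystem or database; the
class and method names are made up for this example): data is written
in place like a normal write, only the information needed to undo it
is logged, and commit merely throws the log away.

```python
_MISSING = object()  # marker meaning "this key did not exist before"


class UndoLogStore:
    """Toy store: writes go in place, a rollback log records how to undo them."""

    def __init__(self):
        self.data = {}   # stands in for the data on disk
        self.undo = []   # rollback log: (key, previous value or _MISSING)

    def write(self, key, value):
        # Log the old state first, then write in place.
        # For brand new data the log entry is tiny: just "key was absent".
        self.undo.append((key, self.data.get(key, _MISSING)))
        self.data[key] = value

    def commit(self):
        # The data is already in place, so commit only cleans up the log.
        self.undo.clear()

    def rollback(self):
        # Replay the log backwards to restore the previous state.
        while self.undo:
            key, old = self.undo.pop()
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
```

In this sketch the cost of commit does not depend on how much data
the transaction wrote, which is the property that matters for a
gigabyte-sized load.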

Equally there is the question of what happens if you
have two records on the same page, being updated
in different transactions.  Your scheme would
either force one transaction to wait for the other, or 
force the transactions to merge.  Not nice.
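
The page-granularity problem can be shown with a small sketch. The
page size, record size, and the PageOwnerTable class are assumptions
made up for this example, not part of the proposed scheme; the point
is only that tracking ownership per page makes unrelated records
collide.

```python
PAGE_SIZE = 4096    # assumed page size in bytes
RECORD_SIZE = 128   # assumed fixed record size in bytes


def page_of(record_no):
    """Return the page number that holds the given record."""
    return (record_no * RECORD_SIZE) // PAGE_SIZE


class PageOwnerTable:
    """Tracks which transaction currently owns each dirty page."""

    def __init__(self):
        self.owner = {}  # page number -> transaction id

    def try_claim(self, txn, record_no):
        # A page belongs to the first transaction that touches it.
        # Any other transaction touching the same page is refused,
        # i.e. it would have to wait (or be merged with the owner).
        holder = self.owner.setdefault(page_of(record_no), txn)
        return holder == txn

    def release(self, txn):
        # Called at commit or abort: give up every page the transaction owns.
        for page in [p for p, t in self.owner.items() if t == txn]:
            del self.owner[page]
```

With these numbers, records 0 and 1 share page 0, so a transaction
updating record 1 is blocked by one updating record 0 even though
the two records have nothing to do with each other.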

I won't argue that building a nice user space API isn't
a good idea.  But we need to get fs level journalling into 
the kernel first, and we need to think about things more carefully.

This isn't like implementing vfork where the question was 
what is the best way to stop the parent process, because 
clone could already cleanly handle the tricky address space
sharing.

Eric
