Re: [sqlite] light weight write barriers

杨苏立 Yang Su Li Thu, 15 Nov 2012 08:14:49 -0800

On Thu, Nov 15, 2012 at 6:07 AM, David Lang <da...@lang.hm> wrote:

> On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:
>
>  Nico Williams, on 11/13/2012 02:13 PM wrote:
>>
>>> declaring groups of internally-unordered writes where the groups are
>>> ordered with respect to each other... is practically the same as
>>> barriers.
>>>
>>
>> Which barriers? Barriers meaning cache flush or barriers meaning commands
>> order, or barriers meaning both?
>>
>> There's no such thing as "barrier". It is fully artificial abstraction.
>> After all, at the bottom of your stack, you will have to translate it
>> either to cache flush, or commands order enforcement, or both.
>>
>
> When people talk about barriers, they are talking about order enforcement.
>
>
>  Your mistake is that you are considering barriers as something real,
>> which can do something real for you, while it is just a artificial
>> abstraction apparently invented by people with limited knowledge how
>> storage works, hence having very foggy vision how barriers supposed to be
>> processed by it. A simple wrong answer.
>>
>> Generally, you can invent any abstraction convenient for you, but farther
>> your abstractions from reality of your hardware => less you will get from
>> it with bigger effort.
>>
>> There are no barriers in Linux and not going to be. Accept it. And start
>> instead thinking about offload capabilities your storage can offer to you.
>>
>
> the hardware capabilities are not directly accessable from userspace (and
> they probably shouldn't be)
>
> barriers keep getting mentioned because they are a easy concept to
> understand. "do this set of stuff before doing any of this other set of
> stuff, but I don't care when any of this gets done" and they fit well with
> the requirements of the users.
>


Well, I think there are two questions to be answered here: what primitive
should be offered to the user by the file system (currently we have fsync);
and what primitive should be offered by the lower level and used by the
file system (currently we have barrier, or flushing and FUA).

I do agree that we should keep what is accessible from user-space simple
and stupid. However if you look into fsync semantics a bit closer, I think
there are two things to be noted:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently. However, logically they are two
distinctive functionalities, and user might want one but not the other.
Particularly, users might want a barrier, but doesn't care about durability
(much). I have no idea why ordering and durability, which seem quite
different, ended up being bundled together in a single fsync call.

2. fsync semantic in POSIX is a bit vague (at least to me) in a concurrent
setting. What is the expected behavior when more than one thread write to
the same file descriptor, or different file descriptor associated with the
same file?

So I do think in the user space, we need some kind of barrier (or other)
primitive which is not tied to durability guarantees; and hopefully this
primitive could be implemented more efficiently than fsync. And of course,
this primitive should be simple and intuitive, abstracting the complexity
out.


On the other hand, we have the questions of what should file system use.
Traditionally block layer provides barrier primitive, and now I think they
are moving to flushing and FUA, or even ordered commands. (
http://lwn.net/Articles/400541/).

In terms of whether file system should be exposed with the hardware
capability, in this case, ordered commands. I personally think it should.
In modern file system we do all kind of stuff to ensure ordering, and I
think I can see how leveraging ordered commands (when it is available from
hardware) could potentially boost performance. And all the complexity of,
say, topological order, is dealt within the file system, and is not visible
to the user.

Of course, there are challenges in when you want to do ordered writes in
file system. As Ts'o mentioned, *when you have entagled metadata updates,
i.e., *you update file A, and file B, and file A and B might share
metadata, it could be difficult to get the ordering right without
sacrificing performance. But I personally think it is worth exploring.

Suli


>
> Users readily accept that if the system crashes, they will loose the most
> recent stuff that they did, but they get annoyed when things get corrupted
> to the point that they loose the entire file.
>
> this includes things like modifying one option and a crash resulting in
> the config file being blank. Yes, you can do the 'write to temp file, sync
> file, sync directory, rename file" dance, but the fact that to do so the
> user must sit and wait for the syncs to take place can be a problem. It
> would be far better to be able to say "write to temp file, and after it's
> on disk, rename the file" and not have the user wait. The user doesn't
> really care if the changes hit disk immediately, or several seconds (or
> even 10s of seconds) later, as long as there is not any possibility of the
> rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing
> hardware does not mean that there need to be multiple ways exposed to
> userspace, it just means that the cost of doing the operation will vary
> depending on the hardware that you have. This also means that if new
> hardware introduces a new way of implementing this, that improvement can be
> passed on to the users without needing application changes.
>
> David Lang
>
> ______________________________**_________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users<http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users>
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] light weight write barriers

Reply via email to