Re: [sqlite] light weight write barriers

2012-11-28 Thread Vladislav Bolkhovitin


Nico Williams, on 11/26/2012 03:05 PM wrote:

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this. Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.


Well, your understanding of memory barriers is wrong, and you are illustrating 
that the memory barriers concept is not so well understood on practice.


Simplifying, memory barrier instructions are not "cache flush" of this CPU as it 
is often thought. They set order how reads or writes from other CPUs are visible 
on this CPU. And nothing else. Locally on each CPU reads and writes are always 
seen in order. So, (1) on a single CPU system memory barrier instructions don't 
make any sense and (2) they should go at least in a pair for each participating in 
the interaction CPU, otherwise it's an apparent sign of a mistake.


There's nothing similar in storage, because storage has strong consistency 
requirements even if it is distributed. All those clouds and hadoops with weak 
consistency requirements are outside of this discussion, although even they don't 
have anything similar to memory barriers.


As I already wrote, concept of a flat Earth and Sun revolving around is also very 
simple to understand. Are you still using this concept?



So just give us a barrier.


Similarly to the flat Earth, I'd strongly suggest you to start using adequate 
concept of what you want to achieve starting from what I proposed few e-mails ago 
in this thread.


If you look at it, it offers exactly what you want, only named correctly.

Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-19 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and requests ordering abstractions to the
user space:

For async IO (io_submit() and friends) I would extend struct iocb by flags, 
which
would allow to set the required capabilities, i.e. if this request is FUA, or 
full
cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per
each iocb.

For the regular read()/write() I would add to "flags" parameter of
sync_file_range() one more flag: if this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make
the latest submitted write in this fd ORDERED.


Correction. To avoid possible races better that the new fcntl() command would 
specify that N subsequent read()/write()/sync() calls as ORDERED.


For instance, in the simplest case of N=1, one next after fcntl() write() would be 
handled as ORDERED.


(Unfortunately, it doesn't look like this old read()/write() interface has space 
for a more elegant solution)


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin


Chris Friesen, on 11/15/2012 05:35 PM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and requests ordering abstractions to the 
user space:


For async IO (io_submit() and friends) I would extend struct iocb by flags, which 
would allow to set the required capabilities, i.e. if this request is FUA, or full 
cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per 
each iocb.


For the regular read()/write() I would add to "flags" parameter of 
sync_file_range() one more flag: if this sync is immediate or not.


To enforce ordering rules I would add one more command to fcntl(). It would make 
the latest submitted write in this fd ORDERED.


All together those should provide the requested functionality in a simple, 
effective, unambiguous and backward compatible manner.


Vlad

1. See my other today's e-mail about what is immediate cache sync.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently.


Exactly!


However, logically they are two distinctive functionalities


Exactly!

Those two points are exactly why concept of barriers must be forgotten for sake of 
productivity and be replaced by a finer grained abstractions as well as why they 
where removed from the Linux kernel


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin

David Lang, on 11/15/2012 07:07 AM wrote:

There's no such thing as "barrier". It is fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to cache
flush, or commands order enforcement, or both.


When people talk about barriers, they are talking about order enforcement.


Not correct. When people are talking about barriers, they are meaning different 
things. For instance, Alan Cox few e-mails ago was meaning cache flush.


That's the problem with the barriers concept: barriers are ambiguous. There's no 
barrier which can fit all requirements.



the hardware capabilities are not directly accessable from userspace (and they
probably shouldn't be)


The discussion is not about to directly provide storage hardware capabilities to 
the user space. The discussion is to replace fully inadequate barriers 
abstractions to a set of other, adequate abstractions.


For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media


1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.


2. ORDERED attribute for requests. It provides the following behavior rules:

A.  All requests without this attribute can be executed in parallel and be freely 
reordered.


B. No ORDERED command can be completed before any previous not-ORDERED or ORDERED 
command completed.


Those abstractions can naturally fit all storage capabilities. For instance:

 - On simple WT cache hardware not supporting ordering commands, (1) translates 
to NOP and (2) to queue draining.


 - On full features HW, both (1) and (2) translates to the appropriate storage 
capabilities.


On FTL storage (B) can be further optimized by doing data transfers for ORDERED 
commands in parallel, but commit them in the requested order.



barriers keep getting mentioned because they are a easy concept to understand.


Well, concept of flat Earth and Sun rotating around it is also easy to understand. 
So, why isn't it used?


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-15 Thread Vladislav Bolkhovitin


Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush or barriers meaning commands order, 
or barriers meaning both?


There's no such thing as "barrier". It is fully artificial abstraction. After all, 
at the bottom of your stack, you will have to translate it either to cache flush, 
or commands order enforcement, or both.


Are you going to invent 3 types of barriers?


There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.


Your mistake is that you are considering barriers as something real, which can do 
something real for you, while it is just a artificial abstraction apparently 
invented by people with limited knowledge how storage works, hence having very 
foggy vision how barriers supposed to be processed by it. A simple wrong answer.


Generally, you can invent any abstraction convenient for you, but farther your 
abstractions from reality of your hardware => less you will get from it with 
bigger effort.


There are no barriers in Linux and not going to be. Accept it. And start instead 
thinking about offload capabilities your storage can offer to you.


Vlad

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-15 Thread Vladislav Bolkhovitin


Alan Cox, on 11/13/2012 12:40 PM wrote:

Barriers are pretty much universal as you need them for power off !


I'm afraid, no storage (drives, if you like this term more) at the moment 
supports
barriers and, as far as I know the storage history, has never supported.


The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.


The cache flush is cache flush. You can call it barrier, if you want to continue 
confusing yourself and others.



Instead, what storage does support in this area are:


Yes - the devil is in the detail once you go beyond simple capabilities.


None of those details brings anything not solvable. For instance, I already 
described in this thread a simple way how requested order of commands can be 
carried through the stack and implemented that algorithm in SCST.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does it mean that nobody else can't do it
too? I know a plenty of non-SATA devices, which can do the ordering
requirements you need.



I would be very much interested in what kind of device support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think storage in smartphone can support such advanced functionality, 
because it tends to be the cheapest, hence the simplest.


But many modern Enterprise SAS drives can do it, because for those customers 
performance is the key requirement. Unfortunately, I'm not sure I can name exact 
brands and models, because I had my knowledge from NDA'ed docs, so this info can 
be also NDA'ed.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transaction, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to be have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you to forget word "barrier" for productivity sake. You don't want 
barriers and confusion they bring. You want instead access to storage accelerated 
cache sync, commands ordering and atomic attributes/operations. See my other 
today's e-mail about those.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin


Alan Cox, on 11/02/2012 08:33 AM wrote:

b) most drives will internally re-order requests anyway


They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.


c) cheap drives won't support barriers


Barriers are pretty much universal as you need them for power off !


I'm afraid, no storage (drives, if you like this term more) at the moment supports 
barriers and, as far as I know the storage history, has never supported.


Instead, what storage does support in this area are:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Commands ordering facilities: commands attributes (ORDERED, SIMPLE, etc.), ACA, 
etc.


3. Atomic commands, e.g. scattered writes, which allow to write data in several 
separate not adjacent  blocks in an atomic manner, i.e. guarantee that either all 
blocks are written or none at all. This is a relatively new functionality, natural 
for flash storage with its COW internals.


Obviously, using such atomic write commands, an application or a file system don't 
need any journaling anymore. FusionIO reported that after they modified MySQL to 
use them, they had 50% performance increase.



Note, that those 3 facilities are ORTHOGONAL, i.e. can be used independently, 
including on the same request. That is the root cause why barrier concept is so 
evil. If you specify a barrier, how can you say what kind actual action you really 
want from the storage: cache flush? Or ordered write? Or both?


This is why relatively recent removal of barriers from the Linux kernel 
(http://lwn.net/Articles/400541/) was a big step ahead. The next logical step 
should be to allow ORDERED attribute for requests be accelerated by ORDERED 
commands of the storage, if it supports them. If not, fall back to the existing 
queue draining.


Actually, I'm wondering, why barriers concept is so sticky in the Linux world? A 
simple Google search shows that only Linux uses this concept for storage. And 2 
years passed, since they were removed from the kernel, but people still discuss 
barriers as if they are here.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin


Howard Chu, on 11/01/2012 08:38 PM wrote:

Alan Cox wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers


This is why it is so important for performance to use all storage capabilities. 
Particularly, ORDERED commands instead of trying to pretend be smarter, than the 
storage, doing queue draining.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-02 Thread Vladislav Bolkhovitin


Alan Cox, on 10/31/2012 05:54 AM wrote:

I don't want to flame on this topic, but you are not right here. As far as I can
see, a big chunk of Linux storage and file system developers are/were employed 
by
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.

You know, RedHat from recent times also stepped to this market, at least I saw
their advertisement on SDC 2012. So, you can add here all RedHat employees.


Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point


Those doesn't contradict the point that high performance storage vendors are also 
funding Linux kernel storage development.



Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.


How about that recently preliminary infrastructure to send ORDERED commands 
instead of queue draining was deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-31 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:

What different in our positions is that you are considering storage
as something you can connect to your desktop, while in my view
storage is something, which stores data and serves them the best
possible way with the best performance.


I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.


I don't want to flame on this topic, but you are not right here. As far as I can 
see, a big chunk of Linux storage and file system developers are/were employed by 
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.


You know, RedHat from recent times also stepped to this market, at least I saw 
their advertisement on SDC 2012. So, you can add here all RedHat employees.



As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better it's not happening?


Price doesn't matter here, because it's completely different topic.


It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destablize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.


I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful, if you want to get 
high performance. Then ones, who can afford them, will use them, which might be 
your favorite bank, for instance, hence they will be indirectly working for you.


Of course, you don't have to work on those features, especially for free, but you 
similarly don't have then to call them useless only because they are not 
affordable to be put in a desktop [1].


Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which is ignored by the 
Linux storage developers for YEARS as not useful, right?


Vlad

[1] If you or somebody else want to put something supporting all necessary 
features to perform ORDERED commands, including ACA, in a desktop, you can look at 
modern SAS SSDs. I can't call price for those devices "high-end".



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:

Yeah  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on flash, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.


Here I agree with you.

Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:

Yes, SCSI has full support for ordered/simple commands designed
exactly for that task: to have steady flow of commands even in case
when some of them are ordered.


SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.


What different in our positions is that you are considering storage as something 
you can connect to your desktop, while in my view storage is something, which 
stores data and serves them the best possible way with the best performance.


Hence, for you the least common denominator of all storage features is the most 
important, while for me to get the best of what possible from storage is the most 
important.


In my view storage should offload from the host system as much as possible: data 
movements, ordered operations requirements, atomic operations, deduplication, 
snapshots, reliability measures (eg RAIDs), load balancing, etc.


It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload from it as much as possible. In 
case of video - to the video hardware, in case of storage - to the storage. The 
same as with video, for storage better offload - better performance. On hundreds 
of thousands IOPS it's clearly visible.


Price doesn't matter here, because it's completely different topic.


SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.


And so? If SATA can't do it, does it mean that nobody else can't do it too? I know 
a plenty of non-SATA devices, which can do the ordering requirements you need.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Nico Williams, on 10/24/2012 05:17 PM wrote:

Yes, SCSI has full support for ordered/simple commands designed exactly for
that task: [...]

[...]

But historically for some reason Linux storage developers were stuck with
"barriers" concept, which is obviously not the same as ORDERED commands,
hence had a lot troubles with their ambiguous semantic. As far as I can tell
the reason of that was some lack of sufficiently deep SCSI understanding
(how to handle errors, believe that ACA is something legacy from parallel
SCSI times, etc.).


Barriers are a very simple abstraction, so there's that.


It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.



Before that happens, people will keep returning again and again with those
simple questions: why the queue must be flushed for any ordered operation?
Isn't is an obvious overkill?


That [cache flushing]


It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.


Often there's a big difference where it's done: on the system side, or on the 
storage side.


Actually, performance improvements from NCQ in many cases are not because it 
allows the drive to reorder requests, as it's commonly thought, but because it 
allows to have internal drive's processing stages stay always busy without any 
idle time. Drives often have a long internal pipeline.. Hence the need to keep 
every stage of it always busy and hence why using ORDERED commands is important 
for performance.



is not what's being asked for here. Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known ubberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, d) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.


I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:


Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...


Right?

(ORDERED means that it is guaranteed that this ordered command never in any 
circumstances will be executed before any previous command completed AND after any 
subsequent command completed.)


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:

I am not quite whether I should ask this question here, but in terms
of light weight barrier/fsync, could anyone tell me why the device
driver / OS provide the barrier interface other than some other
abstractions anyway? I am sorry if this sounds like a stupid questions
or it has been discussed before

I mean, most of the time, we only need some ordering in writes; not
complete order, but partial,very simple topological order. And a
barrier seems to be a heavy weighted solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is some ordering which is much
stronger than what we need, isn't it?

As most of the time the order we need do not involve too many blocks
(certainly a lot less than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I image it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. Particularly, it seems not
too hard to be implemented on top of SCSI's ordered/simple task mode?


Yes, SCSI has full support for ordered/simple commands designed exactly for that 
task: to have steady flow of commands even in case when some of them are ordered. 
It also has necessary facilities to handle commands errors without unexpected 
reorders of their subsequent commands (ACA, etc.). Those allow to get full storage 
performance by fully "fill the pipe", using networking terms. I can easily imaging 
real life configs, where it can bring 2+ times more performance, than with queue 
flushing.


In fact, AFAIK, AIX requires from storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally 
have link as the point of serialization, so all you need in multithreaded 
environment is to pass some SN from the point when each ORDERED command created to 
the point when it sent to the link and make sure that no SIMPLE commands can ever 
cross ORDERED commands. You can see how it is implemented in SCST in an elegant 
and lockless manner (for SIMPLE commands).


But historically for some reason Linux storage developers were stuck with 
"barriers" concept, which is obviously not the same as ORDERED commands, hence had 
a lot troubles with their ambiguous semantic. As far as I can tell the reason of 
that was some lack of sufficiently deep SCSI understanding (how to handle errors, 
believe that ACA is something legacy from parallel SCSI times, etc.).


Hopefully, eventually the storage developers will realize the value behind ordered 
commands and learn corresponding SCSI facilities to deal with them. It's quite 
easy to demonstrate this value, if you know where to look at and not blindly 
refusing such possibility. I have already tried to explain it a couple of times, 
but was not successful.


Before that happens, people will keep returning again and again with those simple 
questions: why the queue must be flushed for any ordered operation? Isn't is an 
obvious overkill?


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users