Re: [sqlite] light weight write barriers
Nico Williams, on 11/26/2012 03:05 PM wrote:
> Vlad, you keep saying that programmers don't understand "barriers". You've provided no
> evidence of this. Meanwhile memory barriers are generally well understood, and every
> programmer I know understands that a "barrier" is a synchronization primitive that says
> that all operations of a certain type will have completed prior to the barrier returning
> control to its caller.

Well, your understanding of memory barriers is wrong, and you are illustrating that the memory barrier concept is not so well understood in practice.

Simplifying, memory barrier instructions are not a "cache flush" of the local CPU, as is often thought. They set the order in which reads or writes from other CPUs become visible on this CPU, and nothing else. Locally, each CPU always sees its own reads and writes in order. So, (1) on a single-CPU system memory barrier instructions don't make any sense, and (2) they must come at least in a pair, one for each CPU participating in the interaction; otherwise their use is an apparent sign of a mistake.

There's nothing similar in storage, because storage has strong consistency requirements even when it is distributed. All those clouds and hadoops with weak consistency requirements are outside of this discussion, although even they don't have anything similar to memory barriers.

As I already wrote, the concept of a flat Earth with the Sun revolving around it is also very simple to understand. Are you still using this concept?

> So just give us a barrier.

Similarly to the flat Earth, I'd strongly suggest you start using an adequate concept of what you want to achieve, starting from what I proposed a few e-mails ago in this thread. If you look at it, it offers exactly what you want, only named correctly.

Vlad

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
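[Editorial note: the visibility-ordering point above can be illustrated with a toy model. This is a sketch, not real hardware: it simulates only store reordering through a per-CPU store buffer, so only the producer side needs a barrier here, whereas real CPUs also reorder loads, which is exactly why barriers must come in pairs. The name `smp_wmb` mirrors the Linux kernel primitive, but everything below is a simulation.]

```python
# Toy model of a weakly ordered two-CPU system: stores go into a private
# per-CPU store buffer and become visible to the other CPU only when
# drained, possibly out of order. Locally, each CPU still sees its own
# stores in order (the load path checks the buffer first).

class WeakCPU:
    def __init__(self, memory):
        self.memory = memory        # shared "RAM"
        self.store_buffer = []      # privately buffered (addr, value) pairs

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def drain_one_out_of_order(self):
        # Hardware is free to drain the *newest* buffered store first.
        if self.store_buffer:
            addr, value = self.store_buffer.pop()
            self.memory[addr] = value

    def smp_wmb(self):
        # Write barrier: all earlier stores become globally visible,
        # in order, before any later store can.
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def load(self, addr):
        # This CPU's own stores are always visible to itself, in order.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

def run(producer_uses_wmb):
    memory = {}
    cpu0, cpu1 = WeakCPU(memory), WeakCPU(memory)
    # CPU0 (producer): data = 42; [smp_wmb()]; flag = 1
    cpu0.store("data", 42)
    if producer_uses_wmb:
        cpu0.smp_wmb()
    cpu0.store("flag", 1)
    cpu0.drain_one_out_of_order()   # "flag" may reach memory first...
    # CPU1 (consumer): if flag is set, read data.
    if cpu1.load("flag") == 1:
        return cpu1.load("data")    # stale 0 without the barrier
    return None

print(run(producer_uses_wmb=False))  # 0: flag visible, data still stale
print(run(producer_uses_wmb=True))   # 42: data made visible before flag
```

On a single simulated CPU the barrier indeed changes nothing, since `load` always sees the CPU's own buffered stores in order, matching point (1) above.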
Re: [sqlite] light weight write barriers
Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:
>>> The easiest way to implement this fsync would involve three things:
>>> 1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
>>> 2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
>>> 3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.
>>
>> This sounds interesting as a way to expose some useful semantics to userspace. I assume we'd need to come up with a new syscall or something, since it doesn't match the behaviour of POSIX fsync().
>
> This is how I would export cache sync and request ordering abstractions to user space:
>
> For async IO (io_submit() and friends) I would extend struct iocb with flags, which would allow setting the required capabilities, i.e. whether this request is FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per iocb.
>
> For regular read()/write() I would add one more flag to the "flags" parameter of sync_file_range(): whether this sync is immediate or not.
>
> To enforce ordering rules I would add one more command to fcntl(). It would make the latest submitted write in this fd ORDERED.

Correction. To avoid possible races, it would be better for the new fcntl() command to specify that the N subsequent read()/write()/sync() calls are ORDERED. For instance, in the simplest case of N=1, the single write() following the fcntl() would be handled as ORDERED.

(Unfortunately, it doesn't look like this old read()/write() interface has space for a more elegant solution.)

Vlad
Re: [sqlite] light weight write barriers
Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to the affected file, wait for the device to report success, issue a cache flush to the device (or request ordering commands, if available) to make it tell the truth, and wait for the device to report success. AFAIK this already happens, but without taking advantage of any request ordering commands.
>> 2. The requesting thread returns as soon as the kernel has identified all data that will be written back. This is new, but pretty similar to what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that involve the same file, until all outstanding fsyncs complete [3]. This is new.
>
> This sounds interesting as a way to expose some useful semantics to userspace. I assume we'd need to come up with a new syscall or something, since it doesn't match the behaviour of POSIX fsync().

This is how I would export cache sync and request ordering abstractions to user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags, which would allow setting the required capabilities, i.e. whether this request is FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per iocb.

For regular read()/write() I would add one more flag to the "flags" parameter of sync_file_range(): whether this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make the latest submitted write in this fd ORDERED.

All together, those should provide the requested functionality in a simple, effective, unambiguous and backward-compatible manner.

Vlad

1. See my other e-mail today about what an immediate cache sync is.
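[Editorial note: the per-request capabilities proposed above can be sketched as a toy model. Every name below is hypothetical; the thread proposes *extending* struct iocb and fcntl(), so this models the proposal rather than any existing Linux API. The point it demonstrates is that the capabilities are independent flag bits, freely combinable per request.]

```python
# Toy model of the proposed per-request capability flags. All names are
# hypothetical sketches of the proposal, not an existing kernel interface.
FUA        = 1 << 0   # force unit access for this one write
CACHE_SYNC = 1 << 1   # a full cache sync accompanies this request
IMMEDIATE  = 1 << 2   # return once the sync has started, not finished
ORDERED    = 1 << 3   # may not pass, nor be passed by, other commands

def submit(queue, data, flags=0):
    """Append a request descriptor; flags combine freely per request."""
    queue.append({"data": data, "flags": flags})

def must_serialize(req):
    """Only ORDERED requests constrain dispatch order; everything else
    may be freely reordered and executed in parallel."""
    return bool(req["flags"] & ORDERED)

queue = []
submit(queue, "payload-1")
submit(queue, "payload-2")
submit(queue, "journal-commit", flags=ORDERED | FUA)  # orthogonal: both at once
submit(queue, "payload-3")

print([r["data"] for r in queue if must_serialize(r)])  # ['journal-commit']
```

The commit record carries both ORDERED and FUA at once, which is exactly the orthogonality a single opaque "barrier" flag cannot express.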
Re: [sqlite] light weight write barriers
杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:
> 1. fsync actually does two things at the same time: ordering writes (in a barrier-like manner), and forcing cached writes to disk. This makes it very difficult to implement fsync efficiently.

Exactly!

> However, logically they are two distinct functionalities.

Exactly! Those two points are exactly why the barrier concept must be forgotten for the sake of productivity and replaced by finer-grained abstractions, and why barriers were removed from the Linux kernel.

Vlad
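[Editorial note: the cost of conflating ordering with durability can be made concrete with a toy device model. This is an illustration, not a benchmark; the "ORDERED write" is the hypothetical primitive discussed in this thread, and the flush counts are properties of the model only.]

```python
# Toy device with a volatile write cache. When fsync is the only ordering
# tool, every transaction pays full cache flushes; a hypothetical ORDERED
# write would let transactions commit in order with one flush at the point
# where durability is actually required.

class ToyDevice:
    def __init__(self):
        self.flushes = 0
    def write(self, block):
        pass                    # lands in the volatile cache
    def flush(self):
        self.flushes += 1       # expensive: drain the cache to media

def commit_with_fsync(dev, n_txns):
    for _ in range(n_txns):
        dev.write("journal-record")
        dev.flush()             # fsync: the only way to order...
        dev.write("commit-record")
        dev.flush()             # ...and to make it durable

def commit_with_ordering(dev, n_txns):
    for _ in range(n_txns):
        dev.write("journal-record")
        dev.write("commit-record")  # hypothetical ORDERED write: may not
                                    # pass the journal record, no flush
    dev.flush()                 # one flush at the real durability point

a, b = ToyDevice(), ToyDevice()
commit_with_fsync(a, 10)
commit_with_ordering(b, 10)
print(a.flushes, b.flushes)  # 20 1
```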
Re: [sqlite] light weight write barriers
David Lang, on 11/15/2012 07:07 AM wrote:
>> There's no such thing as a "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack, you will have to translate it either to a cache flush, or to command order enforcement, or both.
>
> When people talk about barriers, they are talking about order enforcement.

Not correct. When people talk about barriers, they mean different things. For instance, Alan Cox a few e-mails ago meant cache flush. That's the problem with the barrier concept: barriers are ambiguous. There's no barrier which can fit all requirements.

> the hardware capabilities are not directly accessible from userspace (and they probably shouldn't be)

The discussion is not about directly providing storage hardware capabilities to user space. The discussion is about replacing the fully inadequate barrier abstraction with a set of other, adequate abstractions. For instance:

1. Cache flush primitives:
   1.1. FUA
   1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile media
   1.3. Immediate cache flush, i.e. return ASAP after the cache sync has started, possibly before all data hit non-volatile media

2. An ORDERED attribute for requests. It provides the following behavior rules:
   A. All requests without this attribute can be executed in parallel and be freely reordered.
   B. No ORDERED command can be completed before any previous not-ORDERED or ORDERED command has completed.

Those abstractions can naturally fit all storage capabilities. For instance:

- On simple write-through cache hardware not supporting ordering commands, (1) translates to a NOP and (2) to queue draining.
- On full-featured hardware, both (1) and (2) translate to the appropriate storage capabilities. On FTL storage, (B) can be further optimized by doing data transfers for ORDERED commands in parallel, but committing them in the requested order.

> barriers keep getting mentioned because they are an easy concept to understand.

Well, the concept of a flat Earth with the Sun rotating around it is also easy to understand. So, why isn't it used?

Vlad
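[Editorial note: rules A and B above can be checked mechanically with a small model. This is a sketch under exactly those stated rules, using the stronger two-sided reading of ORDERED that is spelled out later in the thread: an ORDERED command completes after everything submitted before it and before everything submitted after it.]

```python
# A reordering of the submitted queue is legal iff every ORDERED command
# keeps its position relative to all other commands; SIMPLE (unordered)
# commands may be freely reordered among themselves.

def legal(submitted, completed):
    """submitted: list of (name, is_ordered); completed: names in
    completion order. Returns True if the completion order is allowed."""
    pos = {cmd: i for i, cmd in enumerate(completed)}
    for i, (cmd, is_ordered) in enumerate(submitted):
        if not is_ordered:
            continue
        before = [c for c, _ in submitted[:i]]
        after = [c for c, _ in submitted[i + 1:]]
        if any(pos[c] > pos[cmd] for c in before):
            return False        # something older completed after the fence
        if any(pos[c] < pos[cmd] for c in after):
            return False        # something newer completed before the fence
    return True

queue = [("w1", False), ("w2", False), ("commit", True), ("w3", False)]

print(legal(queue, ["w2", "w1", "commit", "w3"]))  # True: w1/w2 swap is rule A
print(legal(queue, ["w1", "commit", "w2", "w3"]))  # False: w2 passed the commit
print(legal(queue, ["w1", "w2", "w3", "commit"]))  # False: w3 passed the commit
```

Note what the checker does *not* require: the device never has to drain its queue, it only has to respect the fence positions.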
Re: [sqlite] light weight write barriers
Nico Williams, on 11/13/2012 02:13 PM wrote:
> declaring groups of internally-unordered writes where the groups are ordered with respect to each other... is practically the same as barriers.

Which barriers? Barriers meaning cache flush, or barriers meaning command order, or barriers meaning both?

There's no such thing as a "barrier". It is a fully artificial abstraction. After all, at the bottom of your stack, you will have to translate it either to a cache flush, or to command order enforcement, or both. Are you going to invent 3 types of barriers?

> There's a lot to be said for simplicity... as long as the system is not so simple as to not work at all. My p.o.v. is that a filesystem write barrier is effectively the same as fsync() with the ability to return sooner (before writes hit stable storage) when the filesystem and hardware support on-disk layouts and primitives which can be used to order writes preceding and succeeding the barrier.

Your mistake is that you consider barriers as something real, which can do something real for you, while they are just an artificial abstraction apparently invented by people with limited knowledge of how storage works, hence with a very foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.

Generally, you can invent any abstraction convenient for you, but the farther your abstractions are from the reality of your hardware, the less you will get from it, and with bigger effort. There are no barriers in Linux, and there are not going to be. Accept it. And start instead thinking about the offload capabilities your storage can offer you.

Vlad
Re: [sqlite] light weight write barriers
Alan Cox, on 11/13/2012 12:40 PM wrote:
>>> Barriers are pretty much universal as you need them for power off!
>>
>> I'm afraid no storage (drives, if you like that term more) at the moment supports barriers and, as far as I know the storage history, ever has.
>
> The ATA cache flush is a write barrier, and given you have no NV cache visible to the controller it's the same thing.

A cache flush is a cache flush. You can call it a barrier, if you want to continue confusing yourself and others.

>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

None of those details brings anything unsolvable. For instance, I already described in this thread a simple way how the requested order of commands can be carried through the stack, and implemented that algorithm in SCST.

Vlad
Re: [sqlite] light weight write barriers
杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:
>>> SATA's Native Command Queuing (NCQ) is not equivalent; this allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.
>>
>> And so? If SATA can't do it, does that mean nobody else can do it either? I know plenty of non-SATA devices which can meet the ordering requirements you need.
>
> I would be very much interested in what kind of device supports this kind of "topological order", and in what settings they are typically used. Does modern flash/SSD (esp. as used in smartphones) support this? If you could point me to some information about this, that would be very much appreciated.

I don't think the storage in a smartphone can support such advanced functionality, because it tends to be the cheapest, hence the simplest. But many modern enterprise SAS drives can do it, because for those customers performance is the key requirement. Unfortunately, I'm not sure I can name exact brands and models, because my knowledge comes from NDA'ed docs, so this info may also be NDA'ed.

Vlad
Re: [sqlite] light weight write barriers
Richard Hipp, on 11/02/2012 08:24 AM wrote:
> SQLite cares. SQLite is an in-process, transactional, zero-configuration database that is estimated to be used by over 1 million distinct applications and to have over 2 billion deployments. SQLite uses ordinary disk files in ordinary directories, often selected by the end-user. There is no system administrator with SQLite, so there is no opportunity to use a dedicated filesystem with special mount options.
>
> SQLite uses fsync() as a write barrier to assure consistency following a power loss. In addition, we do everything we can to maximize the amount of time after the fsync() before we actually do another write where order matters, in the hopes that the writes will still be ordered on platforms where fsync() is ignored for whatever reason. Even so, we believe we could get a significant performance boost and reliability improvement if we had a reliable write barrier.

I would suggest you forget the word "barrier" for productivity's sake. You don't want barriers and the confusion they bring. What you want instead is access to storage-accelerated cache sync, command ordering, and atomic attributes/operations. See my other e-mail today about those.

Vlad
Re: [sqlite] light weight write barriers
Alan Cox, on 11/02/2012 08:33 AM wrote:
>> b) most drives will internally re-order requests anyway
>
> They will but only as permitted by the commands queued, so you have some control depending upon the interface capabilities.
>
>> c) cheap drives won't support barriers
>
> Barriers are pretty much universal as you need them for power off!

I'm afraid no storage (drives, if you like that term more) at the moment supports barriers and, as far as I know the storage history, ever has. Instead, what storage does support in this area are:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA, etc.

3. Atomic commands, e.g. scattered writes, which allow writing data to several separate, non-adjacent blocks in an atomic manner, i.e. with a guarantee that either all blocks are written or none at all. This is relatively new functionality, natural for flash storage with its COW internals. Obviously, using such atomic write commands, an application or a file system doesn't need any journaling anymore. FusionIO reported that after they modified MySQL to use them, they saw a 50% performance increase.

Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently, including on the same request. That is the root cause of why the barrier concept is so evil. If you specify a barrier, how can you say what actual action you really want from the storage: a cache flush? An ordered write? Both?

This is why the relatively recent removal of barriers from the Linux kernel (http://lwn.net/Articles/400541/) was a big step ahead. The next logical step should be to allow the ORDERED attribute for requests to be accelerated by the ORDERED commands of the storage, if it supports them; if not, fall back to the existing queue draining.

Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A simple Google search shows that only Linux uses this concept for storage. And 2 years have passed since barriers were removed from the kernel, but people still discuss them as if they were here.

Vlad
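[Editorial note: the all-or-nothing guarantee of an atomic scattered write, and why it removes the need for journaling, can be modeled in a few lines. This is a toy crash model with an invented interface; real scattered-write commands are vendor-specific, and the COW-plus-single-publish mechanism below is just one plausible way such a device could implement the guarantee.]

```python
# Toy model of an atomic scattered write: several non-adjacent blocks are
# staged copy-on-write style, then published with one atomic pointer swap,
# so a crash at any step leaves either all-old or all-new data and no
# journal is needed to recover. Hypothetical interface, for illustration.

def atomic_scattered_write(device, updates, crash_after=None):
    """updates: {lba: data}. crash_after: simulate power loss after
    staging that many blocks (None = no crash)."""
    staged = dict(device["blocks"])       # COW shadow of the block map
    for steps, (lba, data) in enumerate(updates.items()):
        if crash_after is not None and steps >= crash_after:
            return                        # crash while staging: nothing published
        staged[lba] = data
    if crash_after is not None and crash_after <= len(updates):
        return                            # crash just before the publish step
    device["blocks"] = staged             # the single atomic publish step

device = {"blocks": {0: "old0", 7: "old7", 42: "old42"}}
updates = {0: "new0", 7: "new7", 42: "new42"}

atomic_scattered_write(device, updates, crash_after=2)  # power loss mid-write
print(device["blocks"][0], device["blocks"][42])  # old0 old42: all-old state

atomic_scattered_write(device, updates)                 # completes normally
print(device["blocks"][0], device["blocks"][42])  # new0 new42: all-new state
```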
Re: [sqlite] light weight write barriers
Howard Chu, on 11/01/2012 08:38 PM wrote:
> Alan Cox wrote:
>>> How about that recently the preliminary infrastructure to send ORDERED commands instead of queue draining was deleted from the kernel, because "there's no difference where to drain the queue, on the kernel or the storage side"?
>>
>> Send patches.
>
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
> c) cheap drives won't support barriers

This is why it is so important for performance to use all storage capabilities. Particularly, ORDERED commands, instead of trying to pretend to be smarter than the storage by doing queue draining.

Vlad
Re: [sqlite] light weight write barriers
Alan Cox, on 10/31/2012 05:54 AM wrote:
>> I don't want to flame on this topic, but you are not right here. As far as I can see, a big chunk of Linux storage and file system developers are/were employed by the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle. You know, Red Hat has also recently stepped into this market; at least I saw their advertisement at SDC 2012. So you can add all Red Hat employees here.
>
> Booleans generally should be reserved for logic operators. Most of the Linux companies work on both low and high end storage. The two are not mutually exclusive, nor do they divide neatly by market. Many big clouds use cheap low end drives by the crate, some high end desktops are using SAS, although given you can get six 2.5" hotplug drives in a 5.25" bay I'm not sure personally there is much point.

That doesn't contradict the point that high performance storage vendors are also funding Linux kernel storage development.

> Send patches with benchmarks demonstrating it is useful. It's really quite simple. Code talks.

How about that recently the preliminary infrastructure to send ORDERED commands instead of queue draining was deleted from the kernel, because "there's no difference where to drain the queue, on the kernel or the storage side"?

Vlad
Re: [sqlite] light weight write barriers
Theodore Ts'o, on 10/27/2012 12:44 AM wrote:
> On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
>> What differs in our positions is that you consider storage as something you can connect to your desktop, while in my view storage is something which stores data and serves it the best possible way with the best performance.
>
> I don't get paid to make Linux storage work well for gold-plated storage, and as far as I know, none of the purveyors of said gold-plated storage systems are currently employing Linux file system developers to make Linux file systems work well on said gold-plated hardware.

I don't want to flame on this topic, but you are not right here. As far as I can see, a big chunk of Linux storage and file system developers are/were employed by the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle. You know, Red Hat has also recently stepped into this market; at least I saw their advertisement at SDC 2012. So you can add all Red Hat employees here.

> As for what I might do on my own time, for fun, I can't afford said gold-plated hardware, and personally I get a lot more satisfaction if I know there will be a large number of people who benefit from my work (it was really cool when I found out that millions and millions of Android devices were going to be using ext4 :-), as opposed to a very small number of people who have paid $$$ to storage vendors who don't feel it's worthwhile to pay core Linux file system developers to leverage their hardware. Earlier, you were bemoaning why Linux file system developers weren't paying attention to using said fancy SCSI features. Perhaps now you'll understand better why it's not happening?
>
>> Price doesn't matter here, because it's a completely different topic.
>
> It matters if you think I'm going to do it on my own time, out of my own budget. And if you think my employer is going to choose to use said hardware, price definitely matters. I consider engineering to be the art of making tradeoffs, and price is absolutely one of the things that we need to trade off against other goals. It's rare that you get to design something where performance matters above all else. Maybe it's that way if you're paid by folks whose job it is to destabilize the world's financial markets by pushing the poles into the right half plane (i.e., high frequency trading :-). But for the rest of the world, price absolutely matters.

I fully understand your position. But "affordable" and "useful" are completely orthogonal things. The "high end" features are very useful if you want to get high performance. Those who can afford them will use them, which might include your favorite bank, for instance, hence they would be indirectly working for you. Of course, you don't have to work on those features, especially for free, but you similarly don't have to call them useless only because they are not affordable enough to be put in a desktop [1].

Our discussion started not from "value-for-money", but from a constant demand to perform ordered commands without full queue draining, which has been ignored by the Linux storage developers for YEARS as not useful, right?

Vlad

[1] If you or somebody else want to put something supporting all the necessary features to perform ORDERED commands, including ACA, in a desktop, you can look at modern SAS SSDs. I can't call the price of those devices "high-end".
Re: [sqlite] light weight write barriers
Theodore Ts'o, on 10/25/2012 09:50 AM wrote:
> Yeah, I don't buy that. One, flash is still too expensive. Two, the capital costs to build enough silicon foundries to replace the current production volume of HDDs is way too expensive for any company to afford (the cloud providers are buying *huge* numbers of HDDs) --- and that's assuming companies wouldn't choose to use those foundries for products with larger margins --- such as, for example, CPU/GPU chips. :-)
>
> And third and finally, if you study the long-term trends in terms of Data Retention Time (going down), Program and Read Disturb (going up), and Write Endurance (going down) as a function of feature size and/or time, you'd be wise to treat flash as nothing more than a short-term cache, and not as a long-term stable store. If end users completely give up on disks, and store all of their precious family pictures on flash storage, after a couple of years they are likely going to be very disappointed.
>
> Speaking personally, I wouldn't want to have anything on flash for more than a few months at *most* before I made sure I had another copy saved on spinning rust platters for long-term retention.

Here I agree with you.

Vlad
Re: [sqlite] light weight write barriers
Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for that task: to have a steady flow of commands even in the case when some of them are ordered.
>
> SCSI does, yes --- *if* the device actually implements Tagged Command Queuing (TCQ). Not all devices do. More importantly, SATA drives do *not* have this capability, and when you compare the price of SATA drives to uber-expensive "enterprise drives", it's not surprising that most people don't actually use SCSI/SAS drives that have implemented TCQ.

What differs in our positions is that you consider storage as something you can connect to your desktop, while in my view storage is something which stores data and serves it the best possible way with the best performance. Hence, for you the least common denominator of all storage features is the most important, while for me getting the best of what is possible from storage is the most important.

In my view storage should offload as much as possible from the host system: data movements, ordered operation requirements, atomic operations, deduplication, snapshots, reliability measures (e.g. RAIDs), load balancing, etc. It's the same as with 2D/3D video acceleration hardware. If you want the best performance from your system, you should offload from it as much as possible: in the case of video, to the video hardware; in the case of storage, to the storage. As with video, better offload means better performance. At hundreds of thousands of IOPS it's clearly visible.

Price doesn't matter here, because it's a completely different topic.

> SATA's Native Command Queuing (NCQ) is not equivalent; this allows the drive to reorder requests (in particular read requests) so they can be serviced more efficiently, but it does *not* allow the OS to specify a partial, relative ordering of requests.

And so? If SATA can't do it, does that mean nobody else can do it either? I know plenty of non-SATA devices which can meet the ordering requirements you need.

Vlad
Re: [sqlite] light weight write barriers
Nico Williams, on 10/24/2012 05:17 PM wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for that task: [...]
>>
>> [...] But historically for some reason Linux storage developers were stuck with the "barriers" concept, which is obviously not the same as ORDERED commands, hence had a lot of trouble with its ambiguous semantics. As far as I can tell the reason for that was some lack of sufficiently deep SCSI understanding (how to handle errors, the belief that ACA is something legacy from parallel SCSI times, etc.).
>
> Barriers are a very simple abstraction, so there's that.

It isn't simple at all. If you think about barriers for some time from the storage point of view, you will soon realize how bad and ambiguous they are. Before that happens, people will keep coming back again and again with those simple questions: why must the queue be flushed for any ordered operation? Isn't it an obvious overkill?

> That [cache flushing]

It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if you like. Often there's a big difference in where it's done: on the system side, or on the storage side.

Actually, the performance improvements from NCQ are in many cases not because it allows the drive to reorder requests, as is commonly thought, but because it allows the internal processing stages of the drive to stay busy without any idle time. Drives often have a long internal pipeline, hence the need to keep every stage of it always busy, and hence why using ORDERED commands is important for performance.

> is not what's being asked for here. Just a light-weight barrier. My proposal works without having to add new system calls: a) use a COW format, b) have background threads doing fsync()s, c) in each transaction's root block note the last known-committed (from a completed fsync()) transaction's root block, d) have an array of well-known uberblocks large enough to accommodate as many transactions as possible without having to wait for any one fsync() to complete, e) do not reclaim space from any one past transaction until at least one subsequent transaction is fully committed. This obtains ACI- transaction semantics (survives power failures but without durability for the last N transactions at power-failure time) without requiring changes to the OS at all, and with support for delayed D (durability) notification.

I believe what you really want is to be able to send to the storage a sequence of your favorite operations (FS operations, async IO operations, etc.) like:

Write-back caching disabled:

  data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write-back caching enabled:

  data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, ..., data op2M, ...

Right?

(ORDERED means that it is guaranteed that this ordered command will never, in any circumstances, be executed before any previous command has completed, nor after any subsequent command has completed.)

Vlad
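[Editorial note: Nico's COW scheme above can be sketched as a toy on-disk model. The layout below is invented for illustration and greatly simplified (the "durable" flag stands in for a completed background fsync()): transactions write data copy-on-write, root blocks rotate through a small well-known array, and recovery picks the newest root whose fsync had completed, losing durability of the last transactions but never consistency.]

```python
# Toy model of the COW scheme: an array of root blocks ("uberblocks") is
# written round-robin; recovery picks the newest root whose backing
# fsync() had completed before the crash. Invented layout, illustration only.

N_ROOTS = 4

def write_txn(disk, txn_id, data, fsync_completed):
    slot = txn_id % N_ROOTS                 # round-robin root slot
    disk["data"][txn_id] = data             # COW: never overwrites old data
    disk["roots"][slot] = {"txn": txn_id, "durable": fsync_completed}

def recover(disk):
    """After a crash: newest root with a completed fsync wins."""
    committed = [r for r in disk["roots"] if r and r["durable"]]
    newest = max(committed, key=lambda r: r["txn"])
    return newest["txn"], disk["data"][newest["txn"]]

disk = {"roots": [None] * N_ROOTS, "data": {}}
write_txn(disk, 1, "balance=100", fsync_completed=True)
write_txn(disk, 2, "balance=150", fsync_completed=True)
write_txn(disk, 3, "balance=90",  fsync_completed=False)  # fsync still in flight
# ...power failure here...
print(recover(disk))  # (2, 'balance=150'): txn 3 lost, state still consistent
```

This is the ACI-without-immediate-D behavior Nico describes: the caller can be notified of durability later, once the background fsync for a transaction's root completes.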
Re: [sqlite] light weight write barriers
杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite sure whether I should ask this question here, but in terms of light weight barrier/fsync, could anyone tell me why the device driver / OS provides the barrier interface rather than some other abstraction? I am sorry if this sounds like a stupid question or it has been discussed before.
>
> I mean, most of the time we only need some ordering in writes; not a complete order, but a partial, very simple topological order. And a barrier seems to be a heavyweight solution to achieve this: you have to finish all writes before the barrier, then start all writes issued after the barrier. That is an ordering which is much stronger than what we need, isn't it?
>
> As most of the time the order we need does not involve too many blocks (certainly a lot fewer than all the cached blocks in the system or in the disk's cache), that topological order isn't likely to be very complicated, and I imagine it could be implemented efficiently in a modern device, which already has complicated caching/garbage collection/whatever going on internally. In particular, it seems not too hard to implement on top of SCSI's ordered/simple task mode?

Yes, SCSI has full support for ordered/simple commands designed exactly for that task: to have a steady flow of commands even in the case when some of them are ordered. It also has the necessary facilities to handle command errors without unexpected reordering of subsequent commands (ACA, etc.). Those allow getting full storage performance by fully "filling the pipe", to use a networking term. I can easily imagine real-life configs where this can bring 2+ times more performance than with queue flushing. In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally have the link as the point of serialization, so all you need in a multithreaded environment is to pass some SN from the point where each ORDERED command is created to the point where it is sent to the link, and make sure that no SIMPLE commands can ever cross ORDERED commands. You can see how it is implemented in SCST in an elegant and lockless manner (for SIMPLE commands).

But historically, for some reason, Linux storage developers were stuck with the "barriers" concept, which is obviously not the same as ORDERED commands, hence had a lot of trouble with its ambiguous semantics. As far as I can tell, the reason for that was some lack of sufficiently deep SCSI understanding (how to handle errors, the belief that ACA is something legacy from parallel SCSI times, etc.).

Hopefully, eventually the storage developers will realize the value behind ordered commands and learn the corresponding SCSI facilities to deal with them. It's quite easy to demonstrate this value, if you know where to look and aren't blindly refusing the possibility. I have already tried to explain it a couple of times, but was not successful.

Before that happens, people will keep coming back again and again with those simple questions: why must the queue be flushed for any ordered operation? Isn't it an obvious overkill?

Vlad
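[Editorial note: the SN-based implementation idea above, stamping each command with a sequence number at creation and enforcing at the link that SIMPLE commands never cross an ORDERED one, can be sketched as follows. This is a much-simplified single-queue model of the approach, not SCST's actual (lockless) code.]

```python
import itertools

# Each command gets a sequence number (SN) at creation. At the link, a
# SIMPLE command may be sent as long as no older ORDERED command is still
# unsent, while an ORDERED command must wait for *all* older commands.
# SIMPLE commands therefore keep flowing; nothing ever drains the queue.

_sn = itertools.count()

def make_cmd(name, ordered=False):
    return {"sn": next(_sn), "name": name, "ordered": ordered, "sent": False}

def can_send(cmd, all_cmds):
    older_unsent = [c for c in all_cmds if c["sn"] < cmd["sn"] and not c["sent"]]
    if cmd["ordered"]:
        return not older_unsent                  # ORDERED waits for everything older
    return not any(c["ordered"] for c in older_unsent)  # SIMPLE must not cross ORDERED

cmds = [make_cmd("w1"), make_cmd("w2"),
        make_cmd("commit", ordered=True), make_cmd("w3")]
w1, w2, commit, w3 = cmds

print(can_send(w2, cmds))      # True: SIMPLE commands flow freely
print(can_send(w3, cmds))      # False: would cross the unsent ORDERED commit
print(can_send(commit, cmds))  # False: w1/w2 are still unsent
w1["sent"] = w2["sent"] = True
print(can_send(commit, cmds))  # True: flow resumes with no queue drain
```

Note that w3 is merely *delayed* behind the commit point, while w1 and w2 proceed in any order; that is the "steady flow with some commands ordered" behavior described above.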