Re: [sqlite] light weight write barriers

2012-11-16 Thread David Lang

On Fri, 16 Nov 2012, Howard Chu wrote:


> David Lang wrote:
>> barriers keep getting mentioned because they are an easy concept to
>> understand: "do this set of stuff before doing any of this other set
>> of stuff, but I don't care when any of this gets done", and they fit
>> well with the requirements of the users.
>>
>> Users readily accept that if the system crashes, they will lose the
>> most recent stuff that they did,
>
> *some* users may accept that. *None* should.


When users are given a choice between having all their work be very slow, or having it be fast but, in the unlikely event of a crash, losing their most recent changes, they are willing to lose their most recent changes.


If you think about it, this is not much different from the fact that you lose all changes since the last time you saved the thing you are working on. Many programs save state periodically so that if the application crashes the user hasn't lost everything, but any application that tried to save after every single change would be so slow that nobody would use it.


There is always going to be a window after a user hits 'save' where the data can 
be lost, because it's not yet on disk.



> There are a couple of industry failures here:
>
> 1) The drive manufacturers sell drives that lie, and consumers accept
> it because they don't know better. We programmers, who know better,
> have failed to raise a stink and demand that this be fixed.
>
>  A) Drives should not lose data on power failure. If a drive accepts a
> write request and says "OK, done", then that data should get written to
> stable storage, period. Whether it requires capacitors or some other
> onboard power supply, or whatever, they should just do it. Keep in mind
> that today, most of the difference between enterprise drives and
> consumer desktop drives is just a firmware change; the hardware is
> already identical. Nobody should accept a product that doesn't offer
> this guarantee. It's inexcusable.


This is an option available to you. However, if you have enabled write caching and reordering, you have explicitly told the system to be faster at the expense of losing data under some conditions. The fact that you then lose data under those conditions should not surprise you.


The idea that the drive must have enough power to write all pending data to disk is problematic, as it severely limits the amount of cache the drive can have.


>  B) It should go without saying: drives should reliably report back to
> the host when something goes wrong. E.g., if a write request has been
> accepted, cached, and reported complete, but then during the actual
> write an ECC failure is detected in the cache line, the drive needs to
> tell the host "oh, by the way, block XXX didn't actually make it to
> disk like I told you it did 10ms ago."


The issue isn't a drive having a write error; it's the system shutting down (or crashing) before the data is written. No OS-level tricks will help you here.



The real problem here isn't the drive claiming the data has been written when it hasn't; the real problem is that the application has said "write this data" to the OS, and the OS has not done so yet.


The OS delays the writes for many legitimate reasons (the disk may be busy, it can get things done more efficiently by combining and reordering the writes, etc.).


Unless the system crashes, this is not a problem; the data will eventually be written out, and on system shutdown everything is good.


But if the system crashes, some of this postponed work doesn't get done, and that can be a problem.


Applications can call fsync() if they want to be sure that their data is safe on disk NOW, but they currently have no way of saying "I want to make sure that A happens before B, but I don't care if A happens now or 10 seconds from now."


That is the gap that it would be useful to provide a mechanism to deal with. It doesn't matter whether your disk system lies or not; there still isn't a way to deal with this today.


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-15 Thread David Lang

On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:


> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush, barriers meaning command
> ordering, or barriers meaning both?
>
> There's no such thing as a "barrier". It is a fully artificial
> abstraction. After all, at the bottom of your stack, you will have to
> translate it to either a cache flush, or command order enforcement, or
> both.


When people talk about barriers, they are talking about order enforcement.

> Your mistake is that you are considering barriers as something real,
> which can do something real for you, while it is just an artificial
> abstraction apparently invented by people with limited knowledge of how
> storage works, and hence with a very foggy vision of how barriers are
> supposed to be processed by it. A simple wrong answer.
>
> Generally, you can invent any abstraction convenient for you, but the
> farther your abstractions are from the reality of your hardware, the
> less you will get from them, and with greater effort.
>
> There are no barriers in Linux, and there are not going to be. Accept
> it. And instead start thinking about the offload capabilities your
> storage can offer you.


The hardware capabilities are not directly accessible from userspace (and they probably shouldn't be).


Barriers keep getting mentioned because they are an easy concept to understand: "do this set of stuff before doing any of this other set of stuff, but I don't care when any of this gets done", and they fit well with the requirements of the users.


Users readily accept that if the system crashes, they will lose the most recent stuff that they did, but they get annoyed when things get corrupted to the point that they lose the entire file.


This includes things like modifying one option and a crash resulting in the config file being blank. Yes, you can do the "write to temp file, sync file, sync directory, rename file" dance, but the fact that the user must sit and wait for the syncs to take place can be a problem. It would be far better to be able to say "write to temp file, and after it's on disk, rename the file" and not make the user wait. The user doesn't really care if the changes hit disk immediately or several seconds (or even tens of seconds) later, as long as there is no possibility of the rename hitting disk before the file contents.


The fact that this could be implemented in multiple ways in the existing 
hardware does not mean that there need to be multiple ways exposed to userspace, 
it just means that the cost of doing the operation will vary depending on the 
hardware that you have. This also means that if new hardware introduces a new 
way of implementing this, that improvement can be passed on to the users without 
needing application changes.


David Lang