This could be pretty complicated and depends a lot on the manufacturer of the 
SSD.  The first thing to be said is that the most accurate portrayal of the 
life of the device is probably what the manufacturer is willing to warranty, 
and in most cases the warranty will be very conservative.  On a traditional 
"spinning disk", the warranty usually reflects the longevity design of the 
moving parts (motors, actuators, bearings) and not so much the actual "data 
preservation" period of the magnetic media itself -- once you get past the 
mechanical parts, the lifetime of the data stored on the media is typically 
not relevant (i.e., it is much longer than the physical/mechanical 
lifetime).  Special considerations apply to devices used for non-online storage 
(for example backup media written to infrequently and mostly kept offline in a 
vault) where the mechanical lifetime is mostly irrelevant (or, more accurately, 
is determined by cycle counts and shelf-lifetime) and the lifetime of the 
stored data is the limiting factor (plus, as a real consideration, whether or 
not an entire computer system capable of reading the device is archived as well 
since that is quite often the factor most limiting access lifetime).

Not so with entirely solid state devices, for which mechanical considerations 
are largely irrelevant (though mechanical failure due to induced stress is 
often a failure mode, it largely cannot be predicted accurately).

For both mechanical (spinning) and solid state devices (of all types including 
Flash, RAM, CPUs, etc) the best way to deal with mechanical stress is to ensure:
 (1) They are powered up once and only once
 (2) They are permitted to reach normal and stable operating temperature before 
being used
 (3) They remain powered on and at that stable operating temperature until death

In all cases the "Hardware Destroyer" modes (colloquially called "Power Saving" 
modes) place the most stress on the devices.  Each thermal cycle due to a power 
cycle (or Power Saving) reduces the mechanical lifespan by half.  Operating 
outside the designed temperature range also reduces the mechanical lifespan.  
For both "spinning" and solid-state devices you want to minimize the thermal 
changes and always operate at the designed operating temperature with the 
minimum fluctuations possible in order to maximize the physical mechanical 
lifetime.  For most devices the mechanical stress due to thermal cycling are 
already taken into account since they have such a great effect of expected 
serviceable lifetime.  This is why most "consumer" spinning disks have only 90 
day warranties -- they are known to operate in "Hardware Destroyer" 
configurations and usages and that is the limiting factor for their lifetime.  
In "continuous service" the useful life usually follows the MTBF which is a 
calculation based on "continuous service".
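
As a practical aside, on a Linux host the ATA power-saving features can 
usually be switched off with hdparm.  The sketch below is only illustrative: 
the device path is a placeholder, it needs root, and some drives will refuse 
the settings.

#!/usr/bin/env python3
"""Minimal sketch: turn off drive power-saving ("Hardware Destroyer") modes
on a Linux host via hdparm.  Assumes hdparm is installed and the script is
run as root; /dev/sda is a placeholder device name."""

import subprocess

DEVICE = "/dev/sda"  # placeholder; substitute your actual block device

def run(cmd):
    """Run a command, echoing it first so the effect is auditable."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raises if the drive rejects the setting

# -B 255 disables Advanced Power Management entirely (no aggressive head
# parking / low-power states); use 254 for drives that refuse 255.
run(["hdparm", "-B", "255", DEVICE])

# -S 0 disables the standby (spin-down) timer so the drive never spins
# down on its own.
run(["hdparm", "-S", "0", DEVICE])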

For FLASH / SSD / NVRAM type devices the limiting lifetime factor (once the 
above mechanical issues are addressed) is the gradual breakdown of the tunnel 
oxide layer: trapped electrons eventually prevent the cell from being "erased" 
(the bit becomes "stuck").  Various methods are used by different 
manufacturers to reduce or eliminate this effect, but basically it is a 
limitation on how many times a cell can be "erased" successfully.  Generally 
speaking the number of erase operations per cell has increased to the order of 
100,000, or even millions, of erase cycles per cell.
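
To get a feel for what a per-cell erase-cycle rating means in practice, here 
is a rough back-of-envelope calculation.  Every figure is an assumption for 
illustration, not a spec for any particular device:

# Back-of-envelope sketch: translating a per-cell erase-cycle rating into
# total writable bytes.  All numbers are illustrative assumptions.

capacity_bytes = 512 * 10**9   # assumed 512 GB drive
pe_cycles = 3_000              # assumed cycles for a dense consumer part;
                               # SLC-class cells (as above) are rated far higher
write_amplification = 2.0      # assumed controller write amplification

# Ideally the whole capacity could be rewritten pe_cycles times; the
# host-visible endurance is lower by the write amplification factor.
endurance_bytes = capacity_bytes * pe_cycles / write_amplification
print(f"~{endurance_bytes / 1e12:,.0f} TB of host writes")   # ~768 TB here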

Once one cell becomes "stuck" and cannot be erased, the entire containing 
block must be removed from service.  (In some devices each block carries ECC 
codes which can correct for a few "stuck" bits, but the need to use the ECC 
hardware is itself a sign of the impending peril of the block -- just as the 
BER on a spinning disk is a good indicator of the deterioration of the oxide 
layer -- though that signal has lately been confounded by spinning disk 
technology which is inherently unreliable and depends on error correction for 
normal operation.)
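
To illustrate the per-block ECC idea only: real controllers use much stronger 
BCH/LDPC codes over much larger codewords, but a toy Hamming(7,4) code already 
shows how a single flipped or "stuck" bit in a codeword can be corrected.

# Toy illustration of per-block ECC: a Hamming(7,4) codeword corrects one bad
# bit.  Real SSD controllers use far stronger codes; this is only the principle.

def encode(d):                      # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):
    # The three syndrome bits spell out the 1-based position of a single
    # bad bit (0 means no error detected).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1             # flip the bad bit back
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = encode([1, 0, 1, 1])
word[5] ^= 1                        # simulate one stuck/flipped bit
assert correct(word) == [1, 0, 1, 1]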

The size of this block is determined by the manufacturer and has a profound 
effect on the "speed" of the device.  Generally speaking "slow" devices (NVRAM, 
"Flash Drives", etc.) will use small block sizes whereas SSD devices will use 
rather large block sizes (for example, SmartCard Flash may use a single line 
of 8 bytes (64 bits) as the block size, compared to some SSDs which use block 
sizes up to about 2 MB (16,777,216 bits)).  As the probability of an erase 
failure is evenly distributed (one would hope, barring manufacturing defects) 
through the entire block -- since the entire block is erased as a unit -- the 
failure is at the erase-block level.  The "storage controller" of most SSD 
type devices continuously "re-allocates" blocks and copies data internally 
from one block to another, trying to ensure that the "erase failure" 
probability is even across all blocks of the device.  This is called "wear 
levelling": data is periodically moved from blocks that change infrequently 
into blocks that have already been erased many times, so that the low 
erase-count blocks can be erased and re-used, spreading the erasures evenly 
across all blocks.  Additional compensation for "stuck" bits in a block is 
usually provided by having an excess of blocks (spare blocks) and ECC 
hardware, so that failing blocks can be removed from service and replaced 
with a "spare" (similar technology is used in spinning disks to compensate 
for manufacturing defects and degradation of the oxide layer on the platters 
-- though on spinning disks where these "spares" are located, and how they 
are used, has a very great impact on performance).  Unlike spinning disks, 
though, when the pool of "spare blocks" on an SSD becomes exhausted the 
device effectively becomes useless.
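
A highly simplified sketch of the wear-levelling idea described above: new 
writes go to the least-erased free block, and cold data is relocated when the 
erase-count spread grows too wide.  Real controllers are far more 
sophisticated; the thresholds and block counts below are made up.

# Simplified wear-levelling sketch: dynamic allocation to the least-erased
# free block, plus static relocation of cold data.  Illustrative only.

NUM_BLOCKS = 64
WEAR_GAP = 20                    # assumed threshold for static wear levelling

erase_count = [0] * NUM_BLOCKS
free_blocks = set(range(NUM_BLOCKS))
data_blocks = {}                 # physical block -> "hot" or "cold" payload tag

def allocate():
    """Pick the least-erased free block (dynamic wear levelling)."""
    blk = min(free_blocks, key=lambda b: erase_count[b])
    free_blocks.remove(blk)
    return blk

def erase(blk):
    erase_count[blk] += 1
    free_blocks.add(blk)

def rewrite(tag):
    """Host overwrites some data: old block is erased, data lands elsewhere."""
    old = next(b for b, t in data_blocks.items() if t == tag)
    del data_blocks[old]
    erase(old)
    data_blocks[allocate()] = tag

def static_wear_level():
    """Move cold data off barely-worn blocks so they can absorb future writes."""
    if max(erase_count) - min(erase_count) < WEAR_GAP:
        return
    coldest = min(data_blocks, key=lambda b: erase_count[b])
    tag = data_blocks.pop(coldest)
    erase(coldest)
    data_blocks[allocate()] = tag

# Populate one cold block and one hot block, then hammer the hot data.
data_blocks[allocate()] = "cold"
data_blocks[allocate()] = "hot"
for _ in range(2000):
    rewrite("hot")
    static_wear_level()

print("erase counts: min", min(erase_count), "max", max(erase_count))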

The next factor is the internal write multiplication ("write amplification") 
factor.  Let's say you have a device which is divided into 2 MB blocks, and 
you update 1 sector (512 bytes) somewhere in one of those blocks.  The device 
must (a) read out the entire 2 MB block, (b) update the data within the 
block, then (c) write a new 2 MB block to replace the old.  This places a 
heavy erase-and-reuse load on blocks containing things that are updated 
frequently (like directory blocks).  In many cases the storage controller 
compensates by allowing blocks to have "holes": the changed sector is written 
into a new block alongside other changed sectors, and the block containing 
the now-stale "hole" is only reclaimed later by background wear levelling.  
This means there needs to be a table mapping "logical blocks" to "physical 
blocks", which must be stored somewhere as well, and which is even more 
volatile than the data blocks themselves (so it is usually stored in RAM and 
only periodically backed up to the persistent flash storage).  Additionally, 
data may be deleted without being overwritten, and this is optimized by use 
of TRIM, which allows the storage controller to know which data it does NOT 
need to copy.
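
Putting rough numbers on the example above (the 2 MB block and 512-byte 
sector are from the example; the 75% stale figure is purely illustrative):

# Worked numbers for the read-modify-write example: one 512-byte update
# inside a 2 MB erase block, with and without the "hole" optimisation.

SECTOR = 512
ERASE_BLOCK = 2 * 1024 * 1024            # 2 MB erase block, as in the example

# Naive in-place update: the whole block is read, modified and rewritten.
naive_write_amplification = ERASE_BLOCK / SECTOR
print(f"naive write amplification: {naive_write_amplification:.0f}x")  # 4096x

# Log-structured update: the changed sector is appended to a block shared
# with other changed sectors; the stale copy becomes a "hole" reclaimed later
# by background wear levelling.  If a block being reclaimed is, say, 75%
# stale, only the remaining 25% has to be copied forward.
assumed_stale_fraction = 0.75            # purely illustrative
gc_write_amplification = 1 / assumed_stale_fraction
print(f"log-structured write amplification: ~{gc_write_amplification:.2f}x")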

All this is generally why SSDs are rated with "bytes written" as the limiting 
factor and *not* erasure count.  Not because the "bytes written" is of any 
direct significance, but because it is the best raw measure of the lifetime 
of the device, which depends heavily on the internal management implemented 
by the storage controller.  Assuming that you (a) manage your physical 
thermal stress, (b) use the device with the access patterns for which it was 
designed, and (c) utilize TRIM, you will likely find that the "total bytes 
written" is the most accurate portrayal of SSD lifetime.  Whether an SSD has 
a "shelf-life" I do not know (a mechanical disk does -- for example the 
lubricant used in the moving parts will degrade at a fixed rate, and perhaps 
faster if not used).  For spinning disks, assuming that you have managed 
physical stresses and disabled the "Hardware Destroyer", the MTBF is the most 
reliable indicator of lifetime.

So my NVMe SSD, which is now 1 year old, has a rated lifetime of 1,200 TBW, 
and whose counters currently say "24 TB written", should, assuming that 
future access patterns are the same as those over the past year, last 50 
years (which is 10 times the warranty period).  For the SATA SSD the 
calculation gives an even longer lifetime, exceeding the warrantied lifetime 
by 100 times (the estimate is over 1000 years).
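
For what it's worth, that projection is nothing more than the following 
arithmetic (the numbers are the ones quoted for the NVMe drive; the "written" 
figure would normally come from the drive's SMART data, e.g. via smartctl):

# Lifetime projection from rated endurance and observed write rate.
# Numbers are the NVMe example quoted above.

rated_tbw = 1200      # terabytes written, from the drive's datasheet
written_tb = 24       # host writes so far (reported by the drive's SMART data)
age_years = 1.0

write_rate = written_tb / age_years              # TB per year
projected_life_years = rated_tbw / write_rate    # assumes the same workload
print(f"projected lifetime: ~{projected_life_years:.0f} years")   # ~50 years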

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of Gerlando Falauto
>Sent: Sunday, 28 October, 2018 08:06
>To: SQLite mailing list
>Subject: [sqlite] SQLite for datalogging - best practices
>
>Hi everyone,
>
>as I mentioned a few months ago, I'm using SQLite to continuously log
>data
>collected from several sources on a linux system:
>
>This is the current approach:
>- There is just one single-threaded writer process, which also
>periodically
>issues DELETE statements to remove older data.
>- In order to prevent long-running reading queries from blocking the
>writer, I'm using WAL mode.
>- The database is opened with SYNCHRONOUS=1 to prevent database
>corruption.
>- Not doing any manual checkpoints, just limiting journal size to
>100MB.
>- Page size is at its default
>- Underlying storage is a F2FS partition on a commodity 32GB SSD.
>
>The idea would be to end up with a ~20GB database with about 10 days
>worth
>of rolling data.
>
>There are two (apparently) opposing requirements:
>- in case of a power outage, all data collected up to at most N
>seconds
>prior to the power failure should be readable. Ideally N=0, but
>what's
>important is that the database never gets corrupted.
>- the SSD's wear due to continuous writes should be reduced to a
>minimum
>
>Of course there's no silver bullet and some sort of compromise must
>be
>accepted.
>However, it's not clear to me how to control or even understand the
>overall
>behavior of the current system -- in terms of potential data loss and
>SSD
>wearout rate -- apart from empirical testing.
>
>There's just too many layers which will affect the end result:
>- how the application interacts with SQLite (pragmas, statements,
>transactions, explicit checkpoints, etc...)
>- how SQLite interacts with the OS (write(), sync()...)
>- how the F2FS filesystem interacts with the SSD (block writes,
>TRIM...)
>- how the SSD controller interacts with the underlying flash chips
>(buffering, erases, writes, wear leveling...)
>
>Any suggestion on how to proceed, where to start?
>What should I be assuming as "already-taken-care-of" and what should
>I
>rather concentrate on?
>Final database size? Commit/checkpoint frequency? Page size?
>
>For instance, one crazy idea might be to put the WAL file on a
>ramdisk
>instead (if at all possible) and manually run checkpoints at periodic
>intervals. But that would make no sense if I knew for a fact that the
>OS
>will never actually write to disk while the WAL file is open, until a
>sync() occurs.
>
>I found the "Total_Erase_Count" reported by smartmontools to be an
>interesting end-to-end metric in the long run and I believe its
>growth rate
>is what needs to be minimized in the end.
>But I have no idea what the theoretical minimum would be (as a
>function of
>data rate and acceptable amount of data loss), and if there's a more
>immediate way of measuring SSD wear rate.
>
>I could start fiddling with everything that comes into mind and
>measure the
>end result, but
>I believe this should be a well-known problem and I'm just missing
>the
>basic assumptions here...
>
>Thank you!
>Gerlando
>_______________________________________________
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


