Re: [sqlite] SQLite for datalogging - best practices

Keith Medcalf Sun, 28 Oct 2018 19:09:19 -0700

>>The next factor is the internal write multiplication factor.  Let's
>>say you have a device which is divided into 2 MB blocks.  And you update 1
>>sector (512 bytes) somewhere in this block.  The device must (a) read out
>>the entire 2MB block (b) update the data within the block then (c) re-
>>write a new 2 MB block to replace the old.


>>That I don't get. Are you sure about that? My understanding from what
>>I've been reading was that the technology behind SSDs would force you to
>>*erase* 2 MB blocks but also allow you to *write* e.g. 4KB pages.

This is true, more or less.  As you probably know, NV storage (no matter what
type, be it flash, nvram, prom, eprom, eeprom, whatever) all basically works
the same.  You can "erase" it, which results in all the bits being in the same
off state -- let's call it a 0 (though it is possible for the logic states to be
reversed, where "off" or clear is 1 and on or programmed is 0).  You can then
"program" it by switching the state of some of the bits from "clear" to
"programmed" (0 -> 1).  You cannot, however, ever return a "programmed" (1) bit
back to the "cleared" (0) state, except by erasing the whole block.  Depending on
the particular device, the "erase" size may be the same as the "program" size or
it may be bigger, up to the entire device -- UV-erasable PROM is an example of
this, where you can only "erase" the device as a whole; there are others.
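
To make the erase/program asymmetry concrete, here is a minimal sketch in
Python.  The EraseBlock class and its method names are made up for
illustration and do not correspond to any real device interface; it only
models the rule that programming can set bits and only an erase can clear them.

```python
class EraseBlock:
    """Toy model of one erase unit (hypothetical, for illustration only)."""

    def __init__(self, size_bytes: int) -> None:
        # Every bit starts in the "clear" (0) state.
        self.cells = bytearray(size_bytes)

    def erase(self) -> None:
        # Erase always applies to the whole unit.
        for i in range(len(self.cells)):
            self.cells[i] = 0

    def program(self, offset: int, data: bytes) -> None:
        # OR-ing can turn a 0 into a 1 but never a 1 back into a 0.
        for i, byte in enumerate(data):
            self.cells[offset + i] |= byte


blk = EraseBlock(4)
blk.program(0, b"\x0f")
blk.program(0, b"\xf0")      # sets more bits: 0x0f | 0xf0 == 0xff
blk.program(0, b"\x00")      # cannot clear anything
first_after_program = blk.cells[0]
blk.erase()                  # the only way back to 0
first_after_erase = blk.cells[0]
```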

So yes, there are in fact TWO block sizes, the ERASE block size and the PROGRAM
block size.  The ERASE size is often bigger than the PROGRAM size.  A PROGRAM
operation programs an entire PROGRAM block, and an ERASE erases an entire
erase block, which encompasses multiple program blocks.  For most SSD/Flash
storage devices the size of the PROGRAM block is the minimum I/O block size,
and is usually 4 KB or so.  The ERASE size may be much bigger (and it usually
is) at, say, 2 MB.

For simple management processing there is a total number of "program blocks" on
the device, addressed by their "physical block number".  Each physical block
resides within an "erase block", which is usually larger than the "program block"
size, and the "erase block" number can be derived from the "physical block
number" (usually by a simple binary shift).  For storage management at the most
basic level, the hardware storage controller maintains a mapping between the
"Logical Block Number" and "Physical Block Number", and a "Physical Block
Allocation Table" (BAT) containing information about the usage of physical
program blocks.  There will also be a list of free physical blocks, and a table
of some statistics about various block erase/program operations.  At least the
mapping (logical->physical) and BAT must be persistent.  The statistics are
usually also persistent.  The lists and any other tables are only needed during
"operation" and are usually rebuilt entirely in device RAM when the device is
powered on, their contents being derived from the persistent data.
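
As a worked example of the "simple binary shift", assuming the illustrative
sizes used above (4 KB program blocks, 2 MB erase blocks), there are 512
program blocks per erase block, so the shift is 9 bits.  The function names
are mine, not anything a controller exposes.

```python
PROGRAM_BLOCK_SIZE = 4 * 1024            # 4 KB, example figure from the text
ERASE_BLOCK_SIZE = 2 * 1024 * 1024       # 2 MB, example figure from the text

BLOCKS_PER_ERASE = ERASE_BLOCK_SIZE // PROGRAM_BLOCK_SIZE   # 512
SHIFT = BLOCKS_PER_ERASE.bit_length() - 1                   # 9, since 512 == 2**9


def erase_block_of(physical_block: int) -> int:
    """Erase block number that contains the given physical block number."""
    return physical_block >> SHIFT


def index_within_erase_block(physical_block: int) -> int:
    """Position of the physical block inside its erase block."""
    return physical_block & (BLOCKS_PER_ERASE - 1)
```

With these numbers, physical blocks 0 through 511 fall in erase block 0 and
physical block 512 is the first one in erase block 1.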

All access is by "Logical Block Number" (which may reside at any "physical 
block number").  

There are basically three operations that take place on Logical Blocks: Read,
Write, and Delete.

Read simply translates the logical block number to the physical one, reads the
physical block, and returns it to the "requestor".

Write will mark the current physical block that holds the logical block as 
"deleted", find a free physical block to write the data to (and write it), then 
update the logical->physical mapping table to map the logical block to its 
physical location.

Delete is the same as write except that there is no writing of a physical data 
block, and the logical block is marked as "unallocated" in the 
logical->physical mapping table, and the actual physical data block is marked 
as "deleted" in the BAT.
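
The three operations can be sketched roughly as follows.  This is a toy
model under my own assumptions (a dict for the mapping, a Python list for the
free list, the name ToyFTL); real controllers are considerably more involved
and nothing here reflects a specific device.

```python
FREE, IN_USE, DELETED = "free", "in_use", "deleted"


class ToyFTL:
    """Hypothetical logical->physical mapper, for illustration only."""

    def __init__(self, physical_blocks: int) -> None:
        self.data = [None] * physical_blocks       # contents of each physical block
        self.bat = [FREE] * physical_blocks        # block allocation table
        self.l2p = {}                              # logical -> physical mapping
        self.free_list = list(range(physical_blocks))

    def read(self, lbn: int):
        # Translate logical -> physical, then return the physical block.
        pbn = self.l2p.get(lbn)
        return None if pbn is None else self.data[pbn]

    def write(self, lbn: int, payload: bytes) -> None:
        if not self.free_list:
            # A real device would wait here for the pool manager.
            raise RuntimeError("free pool depleted")
        old = self.l2p.get(lbn)
        if old is not None:
            self.bat[old] = DELETED                # old copy is now garbage
        new = self.free_list.pop(0)
        self.data[new] = payload
        self.bat[new] = IN_USE
        self.l2p[lbn] = new                        # remap logical block

    def delete(self, lbn: int) -> None:
        # Same bookkeeping as write, but nothing new is programmed.
        old = self.l2p.pop(lbn, None)
        if old is not None:
            self.bat[old] = DELETED


ftl = ToyFTL(physical_blocks=4)
ftl.write(7, b"first")
ftl.write(7, b"second")    # lands in a different physical block
```

Note that the second write never touches the physical block holding
b"first"; that block is only marked "deleted" and waits for a later erase.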

This process depends on there being a "pool" of "ready to program" physical 
storage blocks.  This is managed by a separate process running at the hardware 
level.  If the free pool is depleted then the equivalent of an interrupt to the 
pool management process must be generated to get the pool manager to put some 
blocks on the free list and the process of writing has to wait until there is a 
block in the free pool which can be used.  Sometimes the BAT updates will 
generate an interrupt to the block management process (for example, all 
physical blocks in an erase block are now "deleted" so the entire erase block 
can be erased and all the physical blocks it contains put on the free block 
list).
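
A rough sketch of that last case, where the pool manager finds erase blocks
whose physical blocks are all "deleted" and returns them to the free list.
The reclaim function, the state names, and the tiny BLOCKS_PER_ERASE value
are assumptions chosen to keep the example readable (the sizes discussed
earlier would give 512).

```python
FREE, IN_USE, DELETED = "free", "in_use", "deleted"
BLOCKS_PER_ERASE = 4    # tiny illustrative value


def reclaim(bat: list, free_list: list) -> int:
    """Erase every erase block that holds only DELETED physical blocks.

    Returns the number of erase blocks that were erased.
    """
    erased = 0
    for start in range(0, len(bat), BLOCKS_PER_ERASE):
        group = range(start, min(start + BLOCKS_PER_ERASE, len(bat)))
        if all(bat[p] == DELETED for p in group):
            for p in group:
                bat[p] = FREE            # erased, ready to program again
                free_list.append(p)
            erased += 1
    return erased


# Erase block 0 is entirely deleted; erase block 1 still holds live data.
bat = [DELETED] * 4 + [DELETED, IN_USE, DELETED, FREE]
free_list = [7]
erased = reclaim(bat, free_list)
```

Erase block 1 is left alone here.  A real pool manager could also copy its
one live block elsewhere and then erase it, which is not modelled.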

The high level TRIM operation is really nothing more than "delete" against a 
logical block.

>In other words, I was expecting the SSD controller and/or the
>filesystem to be smart enough to cleverly allocate and move pages around
>within the available blocks.
>So if a 2MB-block is made of 512 4KB-pages, just overwriting the same
>4KB page 512 times will only cause one block erasure (or something in
>that order of magnitude), not 512.  If that is correct, my conclusion
>would be that you should always write in multiples of the page size (e.g.
>4KB), assuming you somehow get to know that value.
>Perhaps you're actually saying the same thing in the following
>paragraph?

More or less.  Basically the I/O size presented by the OS driver is usually 
equal to the program block size (but does not have to be).  If it is not, then 
"data editing" is carried out on the device in RAM the same as it is for 
spinning disks (retrieve the block, edit the data, write the new block, done at 
the hardware level).
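
The "data editing" step amounts to a read-modify-write.  A minimal sketch,
assuming a 512-byte sector being updated inside a 4 KB program block (both
sizes and the function name edit_sector are illustrative):

```python
PROGRAM_BLOCK_SIZE = 4096   # assumed program block size
SECTOR_SIZE = 512           # assumed I/O size presented to the host


def edit_sector(block: bytes, sector_index: int, sector: bytes) -> bytes:
    """Return a new program block with one sector replaced."""
    if len(block) != PROGRAM_BLOCK_SIZE or len(sector) != SECTOR_SIZE:
        raise ValueError("unexpected block or sector size")
    if not 0 <= sector_index < PROGRAM_BLOCK_SIZE // SECTOR_SIZE:
        raise ValueError("sector index out of range")
    buf = bytearray(block)                      # retrieve the block into RAM
    start = sector_index * SECTOR_SIZE
    buf[start:start + SECTOR_SIZE] = sector     # edit the data
    return bytes(buf)                           # the new block to be written


old_block = bytes(PROGRAM_BLOCK_SIZE)
new_block = edit_sector(old_block, 3, b"\xaa" * SECTOR_SIZE)
```

The whole 4 KB block is rewritten even though only 512 bytes changed, which
is where the extra write traffic comes from.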

The efficiency of writing to SSDs and minimizing "erase" operations is
dependent on having a pool of blocks available on the free list.  There is no
need to ever "erase" unless this pool is depleted; however, the background
manager does this anyway to manage the layout of blocks, coalesce free blocks,
and try to optimize the ordering of logical blocks.  If you can optimize the
layout of the logical blocks in physical blocks you can optimize access speed,
especially since it takes time to "open" a "line" (which is again usually
somewhere between an erase and a program block size) for access.

So yes, updating/deleting a logical block will eventually result in an erase 
operation but the urgency (now, in a minute, next week, etc) depends on the 
size of the free list.
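
To put a number on the "overwrite the same 4 KB page 512 times" question,
here is a small simulation.  It assumes the simplest possible allocator (each
write takes the next sequential physical block, with no other traffic on the
device) and 512 program blocks per erase block; a real controller will not
behave this neatly.

```python
BLOCKS_PER_ERASE = 512   # 2 MB / 4 KB, the example sizes from this thread


def erasable_after_overwrites(overwrites: int) -> int:
    """Count erase blocks left fully 'deleted' after repeated overwrites
    of ONE logical block, under sequential allocation (an assumption)."""
    deleted = set()
    current = 0                      # the initial write lands in physical block 0
    for _ in range(overwrites):
        deleted.add(current)         # old copy becomes garbage
        current += 1                 # new copy goes to the next free block
    full = 0
    for eb in range(current // BLOCKS_PER_ERASE + 1):
        first = eb * BLOCKS_PER_ERASE
        if all(p in deleted for p in range(first, first + BLOCKS_PER_ERASE)):
            full += 1
    return full


after_511 = erasable_after_overwrites(511)
after_512 = erasable_after_overwrites(512)
after_1024 = erasable_after_overwrites(1024)
```

Under these assumptions 512 overwrites leave exactly one erase block's worth
of garbage, so one erase rather than 512, and that erase can wait until the
free pool actually needs the space.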

So really the secret is to have lots of free blocks available at all times and 
keep the thermal limits in mind.  Once you get close to the edge (either in not 
having blocks available or pushing the thermal envelope) performance will 
suffer and the device will degrade faster.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



