dennis.wu wrote:
> Dependency:
> [PATCH] nvdimm: Add NVDIMM_NO_DEEPFLUSH flag to control btt
> data deepflush
> https://lore.kernel.org/nvdimm/[email protected]/T/#u
>
> Reason:
> In BTT, each write writes the sector data, updates a 4-byte btt_map
> entry and updates 16 bytes of bflog (two 8-byte atomic writes). The
> metadata write overhead is big, and we can optimize the algorithm
> to not use the bflog. Then each write only updates the sector
> data and the 4-byte btt_map entry.
>
> How:
> 1. Scan the btt_map to generate the ABA mapping bitmap; if an
>    internal ABA is used, its bit is set.
> 2. Generate the in-memory freelist according to the ABA bitmap. The
>    freelist is an array that records all the free ABAs, like:
>    | 340 | 422 | 578 |...
>    which means ABAs 340, 422 and 578 are free. The last nfree (nlane)
>    records in the array are used for each lane at the beginning.
> 3. Get a free ABA of a lane and write data to the ABA. If the premap
>    btt_map entry is in the initialization state (e_flag=0, z_flag=0),
>    get a free ABA from the free ABA array for the lane. If the premap
>    btt_map entry is not in the initialization state, the ABA in the
>    btt_map entry is treated as the free ABA of the lane. Once
>    the free ABAs = nfree, the arena is fully written and
>    we can free the whole freelist (not implemented yet).
> 4. In the code, "version_major ==2" is the new algorithm and
>    the logic in else is the old algorithm.
>
> Result:
> 1. The write performance can improve ~50% and the latency also
>    reduces to 60% of the original algorithm.

How does this improvement affect a real-world workload vs a
microbenchmark?

> 2. During initialization, scanning the btt_map and generating the
>    freelist takes time and makes namespace enable take longer. With
>    4K sectors and a 1TB namespace, the enable time is less than 4s.
>    This only happens once during initialization.
> 3. It takes 4 bytes of memory per sector to store the freelist. But
>    once the arena is fully written, the freelist can be freed. As we
>    know, in the storage case the disk is always fully written in use,
>    so we don't have memory space overhead.
>
> Compatibility:
> 1. The new algorithm keeps the layout of the bflog and only ignores
>    its logic, which means no bflog updates with the new algorithm.
> 2. If a namespace was created with the old algorithm and layout, you
>    can switch to the new algorithm seamlessly w/o any specific
>    operation.
> 3. The bflog is not updated once you move to the new algorithm, so
>    after you write data with the new algorithm you can't switch back
>    from the new algorithm to the old algorithm.

Before digging deeper into the implementation, this needs a better
compatibility story. It is not acceptable to break the on-media format
like this. Consider someone bisecting a kernel problem over this change,
or someone reverting to an older kernel after encountering a regression.
As far as I can see this would need to be a BTT3 layout and require
explicit opt-in to move to the new format.
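
For reference while discussing the quoted "How" steps 1 and 2, here is a
minimal userspace sketch of the map scan and freelist construction. It is
not taken from the patch: build_freelist() and the MAP_* macro names are
hypothetical, and it assumes a 32-bit map entry with two flag bits above a
30-bit ABA, where an entry with both flags clear (the initialization
state) is treated as an identity mapping.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed entry layout (illustrative): Z flag, E flag, 30-bit ABA. */
#define MAP_Z_FLAG   (1u << 31)
#define MAP_E_FLAG   (1u << 30)
#define MAP_ABA_MASK 0x3fffffffu

/*
 * Hypothetical helper: scan 'map' (external_nlba premap entries), mark
 * every ABA that is referenced in a bitmap, then collect the unreferenced
 * ABAs in ascending order into 'freelist'. The caller must provide room
 * for internal_nlba entries in 'freelist'.
 *
 * Returns the number of free ABAs found, or -1 if the bitmap cannot be
 * allocated.
 */
long build_freelist(const uint32_t *map, uint32_t external_nlba,
                    uint32_t internal_nlba, uint32_t *freelist)
{
    size_t words = ((size_t)internal_nlba + 63) / 64;
    uint64_t *used = calloc(words ? words : 1, sizeof(*used));
    long nfree = 0;

    if (!used)
        return -1;

    /* Step 1: build the "ABA in use" bitmap from the map. */
    for (uint32_t pre = 0; pre < external_nlba; pre++) {
        uint32_t ent = map[pre];
        /* Both flags clear: assume premap == postmap (identity). */
        uint32_t aba = (ent & (MAP_Z_FLAG | MAP_E_FLAG))
                       ? (ent & MAP_ABA_MASK) : pre;

        assert(aba < internal_nlba);
        used[aba / 64] |= 1ull << (aba % 64);
    }

    /* Step 2: every ABA whose bit is still clear goes on the freelist. */
    for (uint32_t aba = 0; aba < internal_nlba; aba++) {
        if (!(used[aba / 64] & (1ull << (aba % 64))))
            freelist[nfree++] = aba;
    }

    free(used);
    return nfree;
}
```

The scan is linear in the number of map entries, which is the
namespace-enable cost raised in the quoted Result 2, and the array holds
one 4-byte entry per free ABA. Nothing in this sketch records on media
that the bflog has gone stale, which is the compatibility gap raised
above.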
