Hi Dan,

Thank you!

We are currently working with one customer to evaluate ClickHouse and RocketMQ with the optimization. The preliminary performance data shows an improvement. We plan to do some pathfinding work in Q3.

Regarding compatibility, the limitation is that we cannot switch back from the new algorithm to the old one. I agree that a new BTT layout version is the right approach. I will look into how to make it happen.

Thank you very much!

Dennis Wu

On 7/12/22 13:06, Dan Williams wrote:
dennis.wu wrote:
Dependency:
[PATCH] nvdimm: Add NVDIMM_NO_DEEPFLUSH flag to control btt
data deepflush
https://lore.kernel.org/nvdimm/[email protected]/T/#u

Reason:
In BTT, each write writes the sector data, updates a 4-byte btt_map
entry, and updates a 16-byte bflog entry (two 8-byte atomic writes).
The metadata write overhead is large, so we can optimize the
algorithm to not use the bflog. Each write then updates only the
sector data and the 4-byte btt_map entry.

How:
1. Scan the btt_map to generate the ABA mapping bitmap; if an
internal ABA is in use, its bit is set.
2. Generate the in-memory freelist from the ABA bitmap. The
freelist is an array that records all the free ABAs, like:
| 340 | 422 | 578 |...
which means ABAs 340, 422, and 578 are free. The last nfree (nlane)
records in the array are assigned to the lanes at the beginning.
3. Get the free ABA of a lane and write the data to that ABA. If
the premap btt_map entry is in the initialization state (e_flag=0,
z_flag=0), get a free ABA from the free ABA array for the lane. If
the premap btt_map entry is not in the initialization state, the
ABA in the btt_map entry is treated as the free ABA of the lane.
Once the number of free ABAs equals nfree, the arena is fully
written and we can free the whole freelist (not implemented yet).
4. In the code, "version_major == 2" selects the new algorithm and
the else branch holds the old algorithm.
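
Steps 1 to 3 above can be modeled in userspace roughly as follows. This is only a sketch, not code from the patch: the names build_freelist, Arena, lane_free and write are hypothetical, the flag and ABA masks assume the existing 32-bit map entry format (two flag bits on top, 30-bit postmap ABA below), and it assumes that an entry in the initialization state does not claim any ABA during the scan.

```python
# Userspace model of steps 1-3; an illustration, not the kernel patch.
MAP_FLAGS_MASK = 0xC0000000  # top two bits of a map entry hold the flags
MAP_LBA_MASK = 0x3FFFFFFF    # low 30 bits hold the postmap ABA
MAP_ENT_NORMAL = 0xC0000000  # both flags set: entry has a valid postmap ABA


def build_freelist(btt_map, internal_nlba):
    """Steps 1 and 2: scan the map into a bitmap, then list the free ABAs."""
    used = bytearray((internal_nlba + 7) // 8)
    for entry in btt_map:
        if (entry & MAP_FLAGS_MASK) == 0:
            continue  # assumption: an initial-state entry claims no ABA
        aba = entry & MAP_LBA_MASK
        used[aba >> 3] |= 1 << (aba & 7)
    return [aba for aba in range(internal_nlba)
            if not (used[aba >> 3] & (1 << (aba & 7)))]


class Arena:
    def __init__(self, btt_map, internal_nlba, nfree):
        self.btt_map = list(btt_map)
        self.freelist = build_freelist(btt_map, internal_nlba)
        # The last nfree records of the freelist seed one free ABA per lane.
        self.lane_free = [self.freelist.pop() for _ in range(nfree)]

    def write(self, lane, premap):
        """Step 3: return the ABA that the sector data would be written to."""
        new_aba = self.lane_free[lane]
        old = self.btt_map[premap]
        if (old & MAP_FLAGS_MASK) == 0:
            # Initial state: refill the lane from the free ABA array.
            self.lane_free[lane] = self.freelist.pop()
        else:
            # Otherwise the ABA being replaced becomes the lane's free ABA.
            self.lane_free[lane] = old & MAP_LBA_MASK
        # Only the 4-byte map entry is updated; no bflog write is modeled.
        self.btt_map[premap] = MAP_ENT_NORMAL | new_aba
        return new_aba
```

In this model the freelist becomes empty exactly when every premap entry has left the initialization state, which corresponds to the "free ABAs = nfree" condition in step 3.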

Result:
1. Write performance can improve by ~50%, and latency drops to
60% of that of the original algorithm.

How does this improvement affect a real-world workload vs a
microbenchmark?

2. During initialization, scanning the btt_map and generating the
freelist takes time and makes namespace enable slower. With 4K
sectors and a 1TB namespace, the enable time is less than 4s. This
happens only once, during initialization.
3. The freelist takes 4 bytes of memory per sector. But once the
arena is fully written, the freelist can be freed. In the storage
case the disk typically ends up fully written, and then there is
no memory space overhead.
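
For scale, the 4-bytes-per-sector figure can be worked out for the 1TB, 4K-sector example. This is a rough estimate only: it assumes 1TB means 2^40 bytes, ignores the space BTT reserves for its own metadata, and freelist_bytes is a hypothetical helper name.

```python
def freelist_bytes(namespace_bytes, sector_bytes, entry_bytes=4):
    """Upper-bound size of a freelist with one entry per sector."""
    return (namespace_bytes // sector_bytes) * entry_bytes


# 2^40 / 4096 = 2^28 sectors, times 4 bytes each = 2^30 bytes (1 GiB)
print(freelist_bytes(1 << 40, 4096))
```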

Compatibility:
1. The new algorithm keeps the layout of the bflog and only ignores
its logic; that is, the bflog is not updated under the new algorithm.
2. If a namespace was created with the old algorithm and layout, you
can switch to the new algorithm seamlessly w/o any specific operation.
3. Since the bflog is not updated under the new algorithm, once you
have written data with the new algorithm you can't switch back
from the new algorithm to the old algorithm.

Before digging deeper into the implementation, this needs a better
compatibility story. It is not acceptable to break the on-media format
like this.  Consider someone bisecting a kernel problem over this
change, or someone reverting to an older kernel after encountering a
regression. As far as I can see this would need to be a BTT3 layout and
require explicit opt-in to move to the new format.
