Gavin, this idea looks promising. As Dave mentioned, it could also pave the way for moving cold data to cheaper cloud storage.
Enrico

On Fri, 26 May 2023 at 06:17, Dave Fisher <wave4d...@comcast.net> wrote:
>
> > On May 25, 2023, at 7:37 PM, Gavin gao <gaozhang...@gmail.com> wrote:
> >
> > In a typical BookKeeper deployment, SSD disks are used to store Journal
> > log data, while HDD disks are used to store Ledger data.
>
> What is used is a deployment choice. I know that when OMB is run, locally
> attached SSDs are used for both.
>
> I do agree that the choice of SSD and HDD disks can impact BookKeeper
> performance. Increasing IOPS and throughput will impact performance
> significantly. For example, in AWS a default gp3 attached disk will have
> large latencies, and even with increased provisioned performance it may
> still be roughly 4x slower than a locally attached SSD.
>
> > Data writes are initially stored in memory and then asynchronously
> > flushed to the HDD disk in the background. However, due to memory
> > limitations, the amount of data that can be cached is restricted.
> > Consequently, requests for historical data ultimately rely on the HDD
> > disk, which becomes a bottleneck for the entire BookKeeper cluster.
> > Moreover, during data recovery following node failures, a substantial
> > amount of historical data needs to be read from the HDD disk, driving
> > the disk's I/O utilization to maximum capacity and resulting in
> > significant read request delays or failures.
> >
> > To address these challenges, a new architecture is proposed: the
> > introduction of a disk cache between the memory cache and the HDD disk,
> > using an SSD disk as an intermediate medium to significantly extend the
> > data caching duration. The data flow is as follows: journal -> write
> > cache -> SSD cache -> HDD disk. The SSD disk cache functions as a
> > regular LedgerStorage layer and is compatible with all existing
> > LedgerStorage implementations.
>
> A different way to look at this is to consider the cold layer as being
> optional, and on HDD or even in S3. In S3 you could have advantages with
> recovery into different AZs. You could also significantly improve replay.
>
> > The following outlines the process:
> >
> > 1. Data eviction from the disk cache to the Ledger data disk occurs on
> > a per-log-file basis.
> > 2. A new configuration parameter, diskCacheRetentionTime, is added to
> > set the duration for which hot data is retained. Files with write
> > timestamps older than the retention time will be evicted to the Ledger
> > data disk.
>
> If you can adjust this to use a recent-use approach, then very long
> ledgers can be read easily by predictively moving ledgers from cold to
> hot.
>
> > 3. A new configuration parameter, diskCacheThreshold, is added. If the
> > disk cache utilization exceeds the threshold, the eviction process is
> > accelerated: data is evicted to the Ledger data disk in the order the
> > files were written, until utilization drops back below the threshold.
> > 4. A new thread, ColdStorageArchiveThread, is introduced to
> > periodically evict data from the disk cache to the Ledger data disk.
>
> Another thread is also needed - ColdStorageRetrievalThread.
>
> Just some thoughts.
>
> Best,
> Dave
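To make the proposed tiered path (journal -> write cache -> SSD cache -> HDD) concrete, here is a minimal Python sketch of how writes would cascade through the tiers and how reads would fall through them. The class, its fields, and the synchronous flush are purely illustrative assumptions, not actual BookKeeper APIs:

```python
class TieredStorage:
    """Illustrative sketch of the proposed tiered layout; not BookKeeper code."""

    def __init__(self, write_cache_limit):
        self.journal = []         # durable write-ahead log (SSD journal disk)
        self.write_cache = {}     # bounded in-memory write cache
        self.ssd_cache = {}       # the proposed intermediate SSD disk cache
        self.hdd = {}             # Ledger data disk (HDD)
        self.write_cache_limit = write_cache_limit

    def add_entry(self, entry_id, data):
        # Every write hits the journal first for durability.
        self.journal.append((entry_id, data))
        self.write_cache[entry_id] = data
        # The real design flushes asynchronously in the background;
        # this sketch flushes the oldest entry synchronously for simplicity.
        if len(self.write_cache) > self.write_cache_limit:
            oldest = next(iter(self.write_cache))
            self.ssd_cache[oldest] = self.write_cache.pop(oldest)

    def read_entry(self, entry_id):
        # Reads fall through the tiers: memory -> SSD cache -> HDD.
        for tier in (self.write_cache, self.ssd_cache, self.hdd):
            if entry_id in tier:
                return tier[entry_id]
        raise KeyError(entry_id)
```

The point of the extra tier is visible in `read_entry`: historical reads that today miss memory and go straight to the HDD would instead be served from the SSD cache for as long as the data is retained there.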
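The eviction rules in steps 2-3 above could be sketched as follows. The parameter names mirror the proposed diskCacheRetentionTime and diskCacheThreshold, but the selection logic and the fixed per-file space estimate are assumptions made for illustration only:

```python
RETENTION_SECONDS = 3600   # stand-in for diskCacheRetentionTime
CACHE_THRESHOLD = 0.85     # stand-in for diskCacheThreshold (fraction of SSD used)

def select_files_to_evict(files, now, used_fraction):
    """Pick cache log files to move to the Ledger data disk.

    `files` is a list of (name, write_timestamp) tuples. Files older than
    the retention time are always evicted; if SSD utilization exceeds the
    threshold, additional files are evicted in write order until utilization
    is estimated to drop back below the threshold.
    """
    by_age = sorted(files, key=lambda f: f[1])  # oldest write first
    evict = [name for name, ts in by_age if now - ts > RETENTION_SECONDS]
    if used_fraction > CACHE_THRESHOLD:
        for name, _ts in by_age:
            if name not in evict:
                evict.append(name)
                # Assumption: each evicted file frees a fixed 5% of the disk;
                # a real implementation would use actual file sizes.
                used_fraction -= 0.05
                if used_fraction <= CACHE_THRESHOLD:
                    break
    return evict
```

A ColdStorageArchiveThread along the lines of step 4 would simply call such a selection function on a timer and move the chosen files to the HDD.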