Igniters, I've spent some time analyzing performance of rebalancing process. The initial goal was to understand, what limits it's throughput, because it is significantly slower than network and storage device can theoretically handle.
Turns out, our current implementation has a number of issues caused by a single fundamental problem. During rebalance data is sent in batches called GridDhtPartitionSupplyMessages. Batch size is configurable, default value is 512KB, which could mean thousands of key-value pairs. However, we don't take any advantage over this fact and process each entry independently: - checkpointReadLock is acquired multiple times for every entry, leading to unnecessary contention - this is clearly a bug; - for each entry we write (and fsync, if configuration assumes it) a separate WAL record - so, if batch contains N entries, we might end up doing N fsyncs; - adding every entry into CacheDataStore also happens completely independently. It means, we will traverse and modify each index tree N times, we will allocate space in FreeList N times and we will have to additionally store in WAL O(N*log(N)) page delta records. I've created a few tickets in JIRA with very different levels of scale and complexity. Ways to reduce impact of independent processing: - https://issues.apache.org/jira/browse/IGNITE-8019 - aforementioned bug, causing contention on checkpointReadLock; - https://issues.apache.org/jira/browse/IGNITE-8018 - inefficiency in GridCacheMapEntry implementation; - https://issues.apache.org/jira/browse/IGNITE-8017 - automatically disable WAL during preloading. Ways to solve problem on more global level: - https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to introduce batch modification; - https://issues.apache.org/jira/browse/IGNITE-8020 - complete redesign of rebalancing process for persistent caches, based on file transfer. Everyone is welcome to criticize above ideas, suggest new ones or participate in implementation. -- Best regards, Ilya