Igniters,

I've spent some time analyzing the performance of the rebalancing
process. The initial goal was to understand what limits its throughput,
because it is significantly slower than what the network and the storage
device can theoretically handle.

It turns out that our current implementation has a number of issues, all
caused by a single fundamental problem.

During rebalancing, data is sent in batches called
GridDhtPartitionSupplyMessages. The batch size is configurable; the
default is 512KB, which can mean thousands of key-value pairs per batch.
However, we don't take advantage of this fact and process each entry
independently (a rough sketch of this pattern follows the list below):
- checkpointReadLock is acquired multiple times for every entry, leading to
unnecessary contention - this is clearly a bug;
- for each entry we write (and fsync, if the configuration requires it) a
separate WAL record - so, if a batch contains N entries, we might end up
doing N fsyncs;
- adding each entry to the CacheDataStore also happens completely
independently. This means we traverse and modify each index tree N times,
we allocate space in the FreeList N times, and we additionally have to
store O(N*log(N)) page delta records in the WAL.
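To make the cost concrete, here is a minimal, purely illustrative Java
sketch. The CheckpointLock, Wal and DataStore interfaces are hypothetical
stand-ins for the internals mentioned above (checkpointReadLock, the
write-ahead log, CacheDataStore), not real APIs, and the batched variant
only hints at the direction IGNITE-7935 proposes:

    import java.util.List;
    import java.util.Map;

    /** Illustrative sketch only: these types stand in for Ignite internals. */
    interface CheckpointLock {
        void readLock();
        void readUnlock();
    }

    interface Wal {
        void log(Object record); // append a record
        void fsync();            // flush appended records to disk
    }

    interface DataStore {
        void put(Object key, Object val);        // one tree traversal + FreeList allocation
        void putAll(Map<Object, Object> batch);  // hypothetical bulk update
    }

    public class SupplyMessageProcessing {
        /** Roughly what happens today: per-entry locking, logging and fsync. */
        static void perEntry(List<Map.Entry<Object, Object>> batch,
                             CheckpointLock lock, Wal wal, DataStore store) {
            for (Map.Entry<Object, Object> e : batch) {
                lock.readLock();                      // acquired N times -> contention (IGNITE-8019)
                try {
                    wal.log(e);                       // N data records ...
                    wal.fsync();                      // ... and up to N fsyncs
                    store.put(e.getKey(), e.getValue()); // N tree traversals / FreeList allocations
                }
                finally {
                    lock.readUnlock();
                }
            }
        }

        /** Possible batched processing (sketch of the IGNITE-7935 direction). */
        static void batched(Map<Object, Object> batch,
                            CheckpointLock lock, Wal wal, DataStore store) {
            lock.readLock();                          // acquired once per batch
            try {
                wal.log(batch);                       // one logical WAL record for the whole batch
                store.putAll(batch);                  // bulk insert amortizes tree/FreeList work
                wal.fsync();                          // a single fsync per batch
            }
            finally {
                lock.readUnlock();
            }
        }
    }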

I've created a few JIRA tickets that vary widely in scope and complexity.

Ways to reduce the impact of independent processing:
- https://issues.apache.org/jira/browse/IGNITE-8019 - the aforementioned
bug causing contention on checkpointReadLock;
- https://issues.apache.org/jira/browse/IGNITE-8018 - an inefficiency in
the GridCacheMapEntry implementation;
- https://issues.apache.org/jira/browse/IGNITE-8017 - automatically disable
WAL during preloading (a sketch of the manual WAL toggle this would
automate follows this list).
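For context, a per-cache WAL switch already exists in the public API
(IgniteCluster.disableWal/enableWal); IGNITE-8017 would effectively apply
the same idea automatically while a node is preloading. A minimal sketch
of the manual approach - the config path and cache name below are just
placeholders:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class ManualWalToggle {
        public static void main(String[] args) {
            // Placeholder config; assumes a cache named "myCache" already exists.
            try (Ignite ignite = Ignition.start("ignite-config.xml")) {
                String cacheName = "myCache";

                // Switch the cache's WAL off before a heavy initial load / preloading.
                ignite.cluster().disableWal(cacheName);

                try {
                    // ... load or rebalance data here ...
                }
                finally {
                    // Re-enabling WAL triggers a checkpoint, making the data durable.
                    ignite.cluster().enableWal(cacheName);
                }
            }
        }
    }

Note that the data only becomes crash-safe once WAL is re-enabled and the
checkpoint completes, which is part of why doing this automatically during
rebalancing deserves its own ticket.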

Ways to solve the problem at a more global level:
- https://issues.apache.org/jira/browse/IGNITE-7935 - a ticket to introduce
batch modification (the batched path in the sketch above hints at the idea);
- https://issues.apache.org/jira/browse/IGNITE-8020 - a complete redesign of
the rebalancing process for persistent caches, based on file transfer.

Everyone is welcome to criticize the above ideas, suggest new ones, or
participate in the implementation.

-- 
Best regards,
Ilya
