[
https://issues.apache.org/jira/browse/IGNITE-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946651#comment-17946651
]
Roman Puchkovskiy commented on IGNITE-25030:
--------------------------------------------
The patch looks good to me
> Metastorage compaction blocks other threads by reading from drive
> -----------------------------------------------------------------
>
> Key: IGNITE-25030
> URL: https://issues.apache.org/jira/browse/IGNITE-25030
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Time Spent: 3h
> Remaining Estimate: 0h
>
> h3. Problem description
> This happened:
> {code:java}
> 2025-04-02 17:30:25:607 +0000
> [WARNING][%node18%common-scheduler-0][FailureManager] Possible failure
> suppressed according to a configured handler [hnd=NoOpFailureHandler
> [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException:
> IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread
> is blocked for 46737 ms that is more than the allowed 500 ms, it is
> "%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned
> by "%node18%metastorage-compaction-executor-0" Id=324{code}
> The fact that we're doing it in a network thread will be addressed in a
> separate issue. All we need to know in this context is that:
> * When we're doing {{{}RocksDbKeyValueStorage.timestampByRevision{}}}, we're
> waiting for a read lock, which is unnecessary. The same problem actually applies
> to all the other read operations: meta-storage is a multi-versioned storage by
> design, so such locks should not exist at all.
> * When we're holding a write lock in the compaction thread, or potentially in
> any other write operation in a meta-storage FSM thread, we might spend an
> unexpectedly large amount of time there, which is simply unacceptable.
> Reading data from the storage should be avoided while holding a lock
> like that.
> I created this issue as a *Bug* because it is a serious design flaw, and it
> makes the system completely unresponsive in certain cases.
> h3. Potential improvements
> Here I will present a semi-organized list of thoughts that might be
> converted into a design later on.
> * RW lock in its current form must be removed.
> * Every time we read the data, we:
> ** Fix the revision number for the operation.
> ** Read the consistent cut.
> ** Read optimistically, assuming that compaction doesn't happen.
> ** If we find that compaction has indeed happened by the time we finished
> reading the data, and the data might be invalid, we throw a compaction exception.
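> The optimistic read described above could be sketched roughly as follows. All
> names here ({{readAt}}, {{markCompacted}}, the exception choice) are
> illustrative, not the actual Ignite API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a lock-free read that validates against compaction
// after the fact, instead of holding a read lock for the whole operation.
class OptimisticReadSketch {
    // Highest revision that compaction has removed so far.
    private final AtomicLong compactedRevision = new AtomicLong(-1);

    void markCompacted(long upToRevision) {
        compactedRevision.set(upToRevision);
    }

    byte[] read(String key, long revision) {
        // Fix the revision, then read optimistically without any lock.
        byte[] value = readAt(key, revision);

        // If compaction overtook the fixed revision while we were reading,
        // the bytes may belong to a dropped revision; fail the read.
        if (compactedRevision.get() >= revision) {
            throw new IllegalStateException("Revision " + revision + " was compacted");
        }
        return value;
    }

    private byte[] readAt(String key, long revision) {
        return new byte[0]; // placeholder for the actual RocksDB read
    }
}
```

> A real implementation would throw the storage's dedicated compaction
> exception and re-check the consistent cut, but the validate-after-read shape
> stays the same.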
> * Resolving a timestamp into a revision (or the opposite) should never require
> a lock either.
> ** We should use the same optimistic strategy as with regular reads.
> ** We should introduce an on-heap cache for the most recent values, like we
> did for terms in the raft log manager.
> *** While doing that, we should ensure that it'll actually pay off: we need to
> check that recent timestamps/revisions are the ones most likely to be
> resolved.
> *** Unlike term values in the raft log manager, here we might want to cache a
> few hundred values. We also need concurrent access for reading the cache,
> meaning that the implementation will be different. Just make sure to make it
> {*}non-blocking{*}.
> *** _Should be done in a separate Jira_
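> A non-blocking cache of that shape could look roughly like this sketch (the
> size bound and the map choice are assumptions; a production version would
> likely use a more specialized lock-free structure):

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of an on-heap cache for recent revision -> timestamp resolutions.
// Reads never block; eviction drops the oldest revisions.
class RecentRevisionCache {
    private static final int MAX_ENTRIES = 256; // "a few hundred values"

    private final ConcurrentSkipListMap<Long, Long> revToTs = new ConcurrentSkipListMap<>();

    void put(long revision, long timestamp) {
        revToTs.put(revision, timestamp);
        // Trim from the oldest end; concurrent readers are not blocked by this.
        while (revToTs.size() > MAX_ENTRIES) {
            revToTs.pollFirstEntry();
        }
    }

    /** Returns the cached timestamp, or null on a miss (fall back to storage). */
    Long timestamp(long revision) {
        return revToTs.get(revision);
    }
}
```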
> * The same goes for reading the index/term/configuration/etc. There must be a
> simple way to make these non-blocking as well.
> * The fun part - FSM operations. Several ways to improve them:
> ** Remove the write lock. Use read-evaluate-write approach.
> ** When we read the data in the FSM, we should not fix the revision; reading
> the "latest" revision is more efficient. _Might be done in a
> separate Jira._
> ** We should always remember that there are only two writing agents here: the
> state machine and the compaction process.
> ** The state machine is a latency-sensitive part of the system. We should
> avoid blocking it with concurrent compaction. This restriction dictates the
> rest of the design.
> ** When we read the list of existing revisions for an entry and prepend a new
> revision to it, we don't lock anything. This means that by the time we write
> this list back into the storage, it might contain some compacted revisions.
> *This is fine.* The compactor itself should keep this race in mind and expect
> such trash in the revisions list. If a value is not found, it has been
> compacted previously. In my understanding, this is the only adequate approach
> for the task.
> ** By the way, we should also expect some trash there while
> just reading the data. If something is in the process of being compacted, we
> shouldn't even try to read it, but if we do, we should be ready for data
> that has concurrently disappeared.
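> The trash-tolerant lookup could be sketched like this (illustrative names
> only; the per-revision storage is modeled as a plain map):

```java
import java.util.List;
import java.util.Map;

// Sketch of reading a revisions list that may mention already-compacted
// revisions: a missing value means "compacted earlier", and we just skip it.
class RevisionListSketch {
    /** Latest revision whose value still exists, or -1 if all were compacted. */
    static long latestLiveRevision(List<Long> revisions, Map<Long, byte[]> values) {
        // The list is ordered from oldest to newest; scan from the newest end.
        for (int i = revisions.size() - 1; i >= 0; i--) {
            long rev = revisions.get(i);
            if (values.containsKey(rev)) {
                return rev;
            }
            // No value => compacted concurrently; tolerate it and keep scanning.
        }
        return -1;
    }
}
```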
> * Compactor:
> ** Here we can't allow writing trash data into the revisions list of any
> entry, because (unlike in the FSM) that would lead to inconsistent data.
> ** I propose doing the {{db.write(writeOptions, batch);}} calls while
> holding an exclusive lock.
> ** The FSM just acquires the lock and writes the data, without any retries,
> because it has priority.
> ** Compactor should have a check, like "is it safe to write a batch that I
> prepared":
> *** If the answer is "yes", we proceed with the write.
> *** If the answer is "no" or "idk", we:
> **** Don't proceed with writing the batch.
> **** Refresh the cursor.
> **** Re-read the data and prepare a new batch.
> **** Repeat.
> *** The goal is to have a check with a small likelihood of
> "maybe". I propose having a sufficiently large bloom filter and updating it
> synchronously. It should be cleared every time we refresh the iterator.
> ** Such an approach will save us from constant "refresh" calls on the
> iterator, which are slow. They will become rare, and will almost disappear if
> there's not much concurrent load.
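> Putting the pieces together, the compactor's check-then-write loop might look
> like this sketch. A trivial bit set stands in for the bloom filter, and all
> names ({{recordFsmWrite}}, {{prepareBatch}}, etc.) are illustrative:

```java
import java.util.BitSet;
import java.util.List;

// Sketch of the compactor's "is it safe to write this batch" loop. The filter
// tracks keys the FSM wrote since the compactor last refreshed its iterator;
// a possible overlap forces a re-read instead of a blind write.
class CompactorSketch {
    private static final int BITS = 1 << 16;

    private final BitSet fsmWrites = new BitSet(BITS); // cleared on refresh

    int refreshCount = 0; // exposed for illustration only

    // Called synchronously on the FSM write path.
    void recordFsmWrite(String key) {
        fsmWrites.set((key.hashCode() & 0x7fffffff) % BITS);
    }

    void compact(List<String> candidateKeys) {
        while (true) {
            List<String> batchKeys = prepareBatch(candidateKeys);
            if (isSafe(batchKeys)) {
                // "Yes": proceed, i.e. db.write(writeOptions, batch)
                // under the exclusive lock.
                writeBatch(batchKeys);
                return;
            }
            // "No" or "idk": drop the batch, refresh, re-read, repeat.
            refreshIteratorAndClearFilter();
        }
    }

    private boolean isSafe(List<String> keys) {
        for (String key : keys) {
            if (fsmWrites.get((key.hashCode() & 0x7fffffff) % BITS)) {
                return false; // the FSM may have touched this key concurrently
            }
        }
        return true;
    }

    private void refreshIteratorAndClearFilter() {
        refreshCount++;
        fsmWrites.clear();
    }

    private List<String> prepareBatch(List<String> keys) {
        return keys; // placeholder for re-reading via the RocksDB iterator
    }

    private void writeBatch(List<String> keys) {
        // placeholder for db.write(writeOptions, batch)
    }
}
```

> A real bloom filter would replace the bit set to keep the false-positive
> ("maybe") rate low at a bounded memory cost.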
--
This message was sent by Atlassian Jira
(v8.20.10#820010)