[I] Lookup updates that leads inconsistencies (druid)

via GitHub Thu, 10 Aug 2023 09:36:33 -0700


pranavbhole opened a new issue, #14796:
URL: https://github.com/apache/druid/issues/14796


### Affected Version: Master branch


### Description
Lookup instantiation works mainly by broadcasting two notices, AddNotice and
DropNotice. When we create a fresh lookup, we issue an AddNotice; when we update
an existing lookup, we issue both an AddNotice and a DropNotice.
We have been seeing inconsistent lookup state in the cluster, especially with
JDBC lookups, caused by the following scenario. I was able to reproduce this
locally as well.

1. Post a create request for a JDBC lookup. Assuming the JDBC server is in a
consistent state, the lookup loads and is ready to serve.
2. The next polling run fails because the JDBC server is unavailable or there
is some issue with the lookup's JDBC connection, but the old lookup is still
serving correctly.
3. A Druid user edits the lookup JSON and posts an update request. The old,
working lookup is killed as well, and queries fail with cache state
CACHE_NOT_INITIALIZED.

The bug is the behavior in Step 3: we currently issue the [AddNotice for the
new lookup and
DropNotice](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupListeningResource.java#L96)
for the old lookup blindly, without making sure that the new lookup's cache
was populated successfully.

There is another bug at Step 2: we do not have resiliency when dealing with
JDBC handles for lookups, [we do not retry the handle on transient
errors](https://github.com/apache/druid/blob/master/extensions-core/lookups-cached-global/src/main/java/org/apache/druid/server/lookup/namespace/JdbcCacheGenerator.java#L134C18-L134C27).
If a transient error occurs, we have to wait for the next polling period to
populate the lookup. During this time, the lookup's state remains
CACHE_NOT_INITIALIZED if there was no previous successful load.
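
To make the Step 2 gap concrete, here is a minimal sketch of retrying a load on transient errors with exponential backoff. It is plain Java and does not use Druid's own classes: `TransientRetry`, `isTransient`, `retry`, and the choice of which exception types count as transient are all assumptions made for illustration, not the code in `JdbcCacheGenerator`.

```java
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Druid.
class TransientRetry {
    /** Assumption: only these JDBC exception types are treated as transient. */
    static boolean isTransient(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SQLTransientException || c instanceof SQLRecoverableException) {
                return true;
            }
        }
        return false;
    }

    /** Runs task up to maxAttempts times, sleeping between transient failures. */
    static <T> T retry(Callable<T> task, int maxAttempts, long baseDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-transient failure.
                if (attempt >= maxAttempts || !isTransient(e)) {
                    throw e;
                }
                // Exponential backoff; the exponent is capped to avoid overflow.
                Thread.sleep(baseDelayMillis * (1L << Math.min(attempt - 1, 10)));
            }
        }
    }
}
```

With a wrapper like this around the cache population call, a short JDBC outage would be absorbed within one polling run instead of leaving the lookup uninitialized until the next period.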

**Proposals for addressing the Step 2 and Step 3 bugs:**

1. Step 2: Create a resilient handle and retry on transient errors.
2. Step 3: Delay execution of the DropNotice until the AddNotice has loaded
the lookup on the current node, so that we know the latest lookup loaded
successfully before dropping the previous one. This can be done by starting
a scheduled executor thread here
https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupReferencesManager.java#L675
that runs the check after delay D, up to N times. Each run gets the latest
stateRef from LookupReferencesManager and verifies that the latest ref's cache
loaded successfully (and also that the LookupExtractorFactoryContainer we are
trying to remove is not the same as the current stateRef container).
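
A rough sketch of the deferred drop in proposal 2, under stated assumptions: `DeferredDrop`, `Container`, `current`, `dropIfReplaced`, and `scheduleDrop` are hypothetical stand-ins, not the actual `LookupReferencesManager` or `LookupExtractorFactoryContainer` API. It models only the check itself, and it assumes the old lookup is simply left in place if all N checks fail, which the proposal does not specify.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the proposal, not Druid code.
class DeferredDrop {
    /** Stand-in for a lookup container and its cache state. */
    static final class Container {
        volatile boolean initialized;
        volatile boolean closed;

        Container(boolean initialized) {
            this.initialized = initialized;
        }

        void close() {
            closed = true;
        }
    }

    /** Lookup name -> latest container (stand-in for the manager's stateRef). */
    final Map<String, Container> current = new ConcurrentHashMap<>();

    /** Closes old only if a different, successfully loaded container is current. */
    boolean dropIfReplaced(String name, Container old) {
        Container latest = current.get(name);
        if (latest == null || latest == old || !latest.initialized) {
            return false;
        }
        old.close();
        return true;
    }

    /** Re-runs the check every delayMillis (delay D), at most maxChecks (N) times. */
    void scheduleDrop(ScheduledExecutorService exec, String name, Container old,
                      long delayMillis, int maxChecks) {
        AtomicInteger remaining = new AtomicInteger(maxChecks);
        exec.schedule(new Runnable() {
            @Override
            public void run() {
                // Reschedule only while the drop has not happened and checks remain.
                if (!dropIfReplaced(name, old) && remaining.decrementAndGet() > 0) {
                    exec.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
                }
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

Keeping the check in `dropIfReplaced` separate from the scheduling makes the condition easy to unit test without timing dependencies.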
