pranavbhole opened a new issue, #14796:
URL: https://github.com/apache/druid/issues/14796

   ### Affected Version: Master branch
   
   ### Description
   Lookup instantiation works by broadcasting two notices, AddNotice and DropNotice. When we create a fresh lookup, we issue an AddNotice; when we update an existing lookup, we issue both an AddNotice and a DropNotice.
   We have been seeing inconsistent lookup state in the cluster, especially with JDBC lookups, caused by the following scenario. I was able to reproduce it locally as well.
   
   1. Post a create request for a JDBC lookup. Assuming the JDBC server is available, the lookup loads and is ready to serve.
   2. The next poll fails because the JDBC server is unavailable or the lookup's JDBC connection has some other issue, but the old lookup is still serving correctly.
   3. A Druid user updates the lookup JSON and posts an update request. The old, good lookup is killed as well, and queries fail with cache state CACHE_NOT_INITIALIZED.
   
   The bug is the behavior in Step 3: currently we blindly issue the [AddNotice for the new lookup and DropNotice](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupListeningResource.java#L96) for the old lookup without making sure that the new lookup's cache population succeeded.
   
   There is another bug at Step 2: we do not have resiliency in handling JDBC lookups, since [we do not retry the handle on transient errors](https://github.com/apache/druid/blob/master/extensions-core/lookups-cached-global/src/main/java/org/apache/druid/server/lookup/namespace/JdbcCacheGenerator.java#L134C18-L134C27). If a transient error occurs, we have to wait for the next polling period before the lookup is populated. During this time, the lookup's state remains CACHE_NOT_INITIALIZED if there has been no previous successful load.
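
   To illustrate the kind of retry that is missing at Step 2, here is a minimal sketch. It is not Druid code: `TransientRetry`, `call`, and `isTransient` are hypothetical names, and treating `java.sql.SQLTransientException` (anywhere in the cause chain) as the marker of a transient error is an assumption made for the example. The real generator would likely need its own classification of which errors are worth retrying.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Druid): retries a load on transient JDBC errors. */
final class TransientRetry {
    private TransientRetry() {}

    /** Assumption: an error is transient if SQLTransientException appears in its cause chain. */
    static boolean isTransient(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SQLTransientException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the task up to maxAttempts times, sleeping delayMillis between attempts.
     * Non-transient failures are not retried. Any final failure is rethrown
     * wrapped in IllegalStateException.
     */
    static <T> T call(Callable<T> task, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (!isTransient(e) || attempt >= maxAttempts) {
                    throw new IllegalStateException(
                        "load failed after " + attempt + " attempt(s)", e);
                }
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException ie) {
                // Restore the interrupt flag and give up instead of retrying.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
        }
    }
}
```

   With something like this wrapped around the JDBC query, a brief outage would be absorbed within the current polling cycle instead of leaving the lookup unpopulated until the next one.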
   
   **Proposals for addressing the Step 2 and Step 3 bugs:**
   
   1. Step 2: Create a resilient handle and retry on transient errors.
   2. Step 3: Delay execution of the DropNotice until the AddNotice has loaded the lookup on the current node, so that we are sure the latest lookup has loaded successfully and it is safe to drop the previous one. This can be done by starting a scheduled executor thread here https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/query/lookup/LookupReferencesManager.java#L675 that runs the check after delay D, up to N times. Each run gets the latest stateRef from LookupReferencesManager and makes sure that the latest ref's cache has loaded successfully (it should also make sure that the LookupExtractorFactoryContainer we are trying to remove is not the same as the current stateRef container).
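
   A rough sketch of the delayed drop from proposal 2 follows. It is a standalone model, not a patch: `DelayedDrop`, `Container`, and the `latest` supplier are hypothetical stand-ins for LookupExtractorFactoryContainer and for reading the current stateRef, and the "loaded" check is reduced to a boolean supplier. It only shows the control flow: check up to N times, D milliseconds apart, and drop only when a different, successfully loaded container is current.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical model of the proposed delayed drop (not Druid code). */
final class DelayedDrop {
    private DelayedDrop() {}

    /** Stand-in for a lookup container; 'loaded' reports whether its cache is populated. */
    static final class Container {
        final String version;
        final BooleanSupplier loaded;

        Container(String version, BooleanSupplier loaded) {
            this.version = version;
            this.loaded = loaded;
        }
    }

    /**
     * Checks up to maxChecks times, delayMillis apart, whether the latest container
     * is loaded and is not the one being dropped. Runs 'drop' and completes with true
     * when that holds; completes with false (old lookup kept) if it never does.
     */
    static CompletableFuture<Boolean> schedule(
            ScheduledExecutorService exec,
            Supplier<Container> latest,
            Container toDrop,
            Runnable drop,
            long delayMillis,
            int maxChecks) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        check(exec, latest, toDrop, drop, delayMillis, maxChecks, result);
        return result;
    }

    private static void check(
            ScheduledExecutorService exec,
            Supplier<Container> latest,
            Container toDrop,
            Runnable drop,
            long delayMillis,
            int remaining,
            CompletableFuture<Boolean> result) {
        if (remaining <= 0) {
            result.complete(false); // give up; the previous lookup keeps serving
            return;
        }
        exec.schedule(() -> {
            try {
                Container current = latest.get();
                // Identity check: never drop the container that is still current.
                if (current != null && current != toDrop && current.loaded.getAsBoolean()) {
                    drop.run();
                    result.complete(true);
                } else {
                    check(exec, latest, toDrop, drop, delayMillis, remaining - 1, result);
                }
            } catch (RuntimeException e) {
                result.completeExceptionally(e);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

   In this model, if the new lookup never loads within N checks, the drop simply does not happen and the old lookup keeps serving. Whether the failed new lookup should then be rolled back is a separate question that this sketch does not address.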
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

