[ 
https://issues.apache.org/jira/browse/RANGER-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramachandran Krishnan resolved RANGER-5654.
-------------------------------------------
    Resolution: Fixed

> Solr audit dispatcher fails to index after Kerberos TGT relogin (No key to 
> store) with default useTicketCache=true
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: RANGER-5654
>                 URL: https://issues.apache.org/jira/browse/RANGER-5654
>             Project: Ranger
>          Issue Type: Task
>          Components: Ranger
>            Reporter: Ramachandran Krishnan
>            Assignee: Ramachandran Krishnan
>            Priority: Major
>             Fix For: 3.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Solr audit dispatcher ({{ranger-audit-dispatcher-solr}}) can stop 
> indexing audits into Kerberos-protected Solr after TGT refresh/relogin. Logs 
> show repeated batch failures:
> Error sending message to Solr
> Login failed due to: Unable to login with rangerauditserver/<host>@<REALM> 
> due to: No key to store
> Kafka consumption continues, but Solr doc counts do not increase and Ranger 
> Admin Audit → Access does not show new events.
> h3. Affected components (master)
>  * {{audit-server/audit-dispatcher/dispatcher-solr}} → 
> {{SolrAuditDestination}} / {{KerberosAction}}
>  * {{agents-audit/core}} → {{AbstractKerberosUser.checkTGTAndRelogin()}}
>  * Shipped config templates:
>  ** 
> {{audit-server/audit-dispatcher/dispatcher-solr/src/main/resources/conf/ranger-audit-dispatcher-solr-site.xml}}
>  ** 
> {{dev-support/ranger-docker/scripts/audit-dispatcher/ranger-audit-dispatcher-solr-site.xml}}
> Both ship:
> xasecure.audit.jaas.Client.option.useTicketCache = true
> with {{useKeyTab=true}} and {{storeKey=true}}.
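
One way to spot this combination in a deployed template is to read the JAAS client options back out of the site XML. The sketch below is illustrative only. It assumes the file uses the Hadoop-style {{<configuration>/<property>/<name>/<value>}} layout and that {{useKeyTab}} and {{storeKey}} sit under the same {{xasecure.audit.jaas.Client.option.}} prefix as {{useTicketCache}}. The helper names {{read_jaas_options}} and {{has_risky_combination}} are hypothetical, not Ranger APIs.

```python
import xml.etree.ElementTree as ET

# Prefix taken from the property named in this issue; the other option
# names under it are an assumption made for illustration.
PREFIX = "xasecure.audit.jaas.Client.option."


def read_jaas_options(site_xml: str) -> dict:
    """Collect JAAS client options from Hadoop-style <property> entries."""
    root = ET.fromstring(site_xml)
    options = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith(PREFIX):
            options[name[len(PREFIX):]] = value.lower()
    return options


def has_risky_combination(options: dict) -> bool:
    """True when the ticket cache is enabled alongside keytab login with storeKey."""
    return (
        options.get("useTicketCache") == "true"
        and options.get("useKeyTab") == "true"
        and options.get("storeKey") == "true"
    )


# Hypothetical sample mirroring the shipped values described above.
SAMPLE = """<configuration>
  <property><name>xasecure.audit.jaas.Client.option.useKeyTab</name><value>true</value></property>
  <property><name>xasecure.audit.jaas.Client.option.storeKey</name><value>true</value></property>
  <property><name>xasecure.audit.jaas.Client.option.useTicketCache</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    print(has_risky_combination(read_jaas_options(SAMPLE)))
```

Running the same check against a template with {{useTicketCache}} set to {{false}} should report no risky combination.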
> h3. Root cause
>  # Config: {{useTicketCache=true}} encourages reuse of the default credential 
> cache. In Docker and other long-running deployments, tickets already in that 
> cache (e.g. from {{kinit}} or {{KRB5CCNAME}}) can mix with the keytab-based 
> JAAS login. On relogin, {{Krb5LoginModule}} may then have no storable key 
> and fails with “No key to store”.
> h2. The pipeline (normal path)
> HDFS plugin → audit ingestor → Kafka (ranger_audits) → Solr dispatcher → Solr 
> → Ranger Admin UI
>  # HDFS plugin: A user (e.g. {{testuser1}}) does something audited, such as 
> {{hdfs dfs -ls}} on a path they are denied on. The plugin sends the audit to 
> the ingestor.
>  # Ingestor: Accepts the audit and writes it to the Kafka topic 
> {{ranger_audits}}.
>  # Kafka: Holds the audit record. You can see the topic’s end offset go up.
>  # Solr dispatcher: Reads from Kafka, then indexes each batch into the 
> Kerberos-protected Solr collection {{ranger_audits}}.
>  # Solr: Stores the document. A query like {{reqUser:testuser1}} returns 
> more docs.
>  # Ranger Admin: Audit → Access reads from Solr and shows the new event.
> So a growing Kafka end offset only shows that steps 2–3 worked. Solr and 
> Admin updating shows that steps 4–6 worked.
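
The hop-by-hop reasoning above can be sketched as a small classifier over "before" and "after" counters. This is a hypothetical helper for illustration, not part of Ranger or its E2E scripts. The {{Snapshot}} fields and {{diagnose}} are assumed names, and the counters are assumed to come from the Kafka end offset, the Solr {{numFound}} for the test user, and the Admin {{totalCount}}.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Counters sampled at one point in time (field names are illustrative)."""
    kafka_end_offset: int
    solr_num_found: int
    admin_total_count: int


def diagnose(before: Snapshot, after: Snapshot) -> str:
    """Return the first hop whose counter did not advance after a new audit."""
    if after.kafka_end_offset <= before.kafka_end_offset:
        return "plugin/ingestor/Kafka: no new record reached the topic"
    if after.solr_num_found <= before.solr_num_found:
        return "Solr dispatcher/Solr: Kafka grew but Solr did not"
    if after.admin_total_count <= before.admin_total_count:
        return "Ranger Admin: Solr grew but Admin did not"
    return "ok: all hops advanced"
```

With a Kafka offset going from 4 to 5 and a Solr count staying at 63, this returns the "Solr dispatcher/Solr" branch, which is the symptom described in this issue.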
> h3. Docker Tier 3 audit stack:
> ||Container||Role||
> |{{ranger-kdc}}|Kerberos for Kafka, plugins, ingestor, Solr|
> |{{ranger}} + {{ranger-postgres}}|Ranger Admin + policies|
> |{{ranger-solr}} + {{ranger-zk}}|Audit search backend ({{ranger_audits}} collection)|
> |{{ranger-kafka}}|Topic {{ranger_audits}}|
> |{{ranger-audit-ingestor}}|Plugins POST audits here ({{:7081}})|
> |{{ranger-audit-dispatcher-solr}}|Kafka → Solr (Kerberos to Solr)|
> |{{ranger-hadoop}}|HDFS + Ranger HDFS plugin ({{dev_hdfs}})|
> h2. What you do to reproduce the bug
> h3. Step 1 — Run the stack with Kerberos + Solr dispatcher
> Bring up the Tier 3 Docker audit stack: ingestor, Kafka, 
> ranger-audit-dispatcher-solr, Solr, HDFS, etc., all using Kerberos (keytabs, 
> not simple auth).
> Solr is locked down; the dispatcher must log in as {{rangerauditserver/...}} 
> to write to Solr.
> h3. Step 2 — Trigger real audits
> Run something that produces audits end-to-end, e.g.:
>  * the HDFS deny-traverse flow ({{testuser1}} tries to traverse a path that 
> Ranger denies, which generates an audited DENY).
> At first this often works: plugin ✓, ingestor ✓, Kafka offset ✓.
> h3. Step 3 — Stress Kerberos / login state
> Do one or more of:
>  * Restart {{ranger-audit-ingestor}} (common during E2E {{--fresh-plan}} when 
> topics are recreated).
>  * Delete and recreate {{ranger_audits}} (dispatchers restart, consumers 
> rewind).
>  * Wait long enough for the dispatcher’s Kerberos TGT to need refresh/relogin 
> (or hit the 80% TGT lifetime window in {{AbstractKerberosUser}}).
> These don’t break Kafka itself; they change tickets, caches, and JVM login 
> state in the Solr dispatcher.
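
To estimate how long to wait for the relogin window, the 80% figure can be turned into a timestamp. The sketch below assumes the window is measured as a fraction of the ticket's start-to-end lifetime. That is an assumption about how {{AbstractKerberosUser}} applies the 80% rule, not something confirmed here, and {{refresh_time}} and {{needs_relogin}} are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# The 0.80 fraction comes from the "80% TGT lifetime window" mentioned above;
# applying it to (end - start) is an assumption made for this sketch.
REFRESH_FRACTION = 0.80


def refresh_time(start: datetime, end: datetime,
                 fraction: float = REFRESH_FRACTION) -> datetime:
    """Point in the ticket lifetime after which a relogin would be attempted."""
    return start + (end - start) * fraction


def needs_relogin(now: datetime, start: datetime, end: datetime) -> bool:
    """True once 'now' has reached the refresh point."""
    return now >= refresh_time(start, end)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=24)
    print(refresh_time(issued, expires).isoformat())
```

For a 24-hour ticket this places the refresh point 19 hours and 12 minutes after issue, so shortening the ticket lifetime in the test KDC is likely a faster way to reach the window than waiting.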
> h3. Step 4 — Trigger audits again
> Run the same HDFS audit trigger again. Now watch each hop.
> ----
> h2. What you observe when the bug hits
> *Solr dispatcher logs (the smoking gun)*
> ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5
> java.lang.Exception: Failure in sending audits into Solr
>  
> ERROR - Error sending message to Solr
> Login failed due to: Unable to login with rangerauditserver/...@... due to: 
> No key to store
> h3. Meaning:
>  * The dispatcher still consumes from Kafka.
>  * When it tries to send the batch to Solr, Kerberos login/relogin fails.
>  * Every batch fails → nothing new in Solr.
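
A small log scan can confirm this pattern without reading the whole file. The sketch below matches only the message fragments quoted in this issue. The exact log layout (timestamps, logger names, line wrapping) is assumed, and {{summarize}} is a hypothetical helper.

```python
import re

# Patterns built from the log lines quoted above; real log lines may carry
# extra prefixes, which search() tolerates.
BATCH_ERROR = re.compile(
    r"Error processing batch in worker '([^']+)', batch size: (\d+)"
)
LOGIN_ERROR = "No key to store"

SAMPLE_LOG = """\
ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5
java.lang.Exception: Failure in sending audits into Solr
ERROR - Error sending message to Solr
Login failed due to: Unable to login with rangerauditserver/...@... due to: No key to store
"""


def summarize(log_text: str) -> dict:
    """Count failed batches, the records they carried, and login failures."""
    failed_batches = 0
    records = 0
    login_failures = 0
    for line in log_text.splitlines():
        match = BATCH_ERROR.search(line)
        if match:
            failed_batches += 1
            records += int(match.group(2))
        if LOGIN_ERROR in line:
            login_failures += 1
    return {
        "failed_batches": failed_batches,
        "records_in_failed_batches": records,
        "login_failures": login_failures,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

A non-zero {{login_failures}} count alongside failed batches points at the Kerberos relogin rather than at Solr itself.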
> h3. Kafka — looks healthy (misleading)
> end offset 4 → 5 ✓
> The ingestor → Kafka path is fine. New audits land on the topic. That’s why 
> the bug is easy to miss if you only check Kafka.
> h3. Solr — stuck
> waiting for Solr docs (reqUser:testuser1)...
> Solr count did not increase (before=63, after=63) ✗
> Query {{reqUser:testuser1}} (via Kerberos from inside the dispatcher 
> container): count unchanged.
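
The count check can be expressed as a Solr select request with {{rows=0}}, reading {{numFound}} from the JSON response. The sketch below only builds the URL and parses a response body; it does not perform the Kerberos (SPNEGO) authentication the real query needs. The base URL, including port 8983, is an assumption, and {{select_url}} and {{num_found}} are hypothetical helpers.

```python
import json
from urllib.parse import urlencode


def select_url(base_url: str, collection: str, user: str) -> str:
    """Build a count-only select URL for audits of one requesting user."""
    params = {"q": f"reqUser:{user}", "rows": 0, "wt": "json"}
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


def num_found(response_body: str) -> int:
    """Extract response.numFound from a Solr JSON select response."""
    return int(json.loads(response_body)["response"]["numFound"])


# Illustrative response body shaped like a Solr select result.
SAMPLE_RESPONSE = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 63, "start": 0, "docs": []}}'
)

if __name__ == "__main__":
    # Host and port are assumptions for the Docker stack.
    print(select_url("http://ranger-solr:8983/solr", "ranger_audits", "testuser1"))
    print(num_found(SAMPLE_RESPONSE))
```

Comparing {{num_found}} before and after the audit trigger gives the same before/after check shown in the output above.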
> h3. Ranger Admin — often unchanged too
> {{totalCount}} may not move; {{testuser1}} doesn’t appear in recent audits 
> because Admin reads Solr, not Kafka.
> ----
> h2. Why it happens 
> ||Component||Mechanism||{{useTicketCache}}||Proactive relogin||
> |Ingestor Kafka producer|JAAS string|false|Kafka client handles refresh|
> |Kafka plugin|JAAS string|false|connection-time|
> |HDFS dispatcher|UGI keytab|N/A|{{checkTGTAndReloginFromKeytab()}}|
> |Plugin → ingestor audits|UGI keytab|N/A|{{checkTGTAndReloginFromKeytab()}}|
> |Admin Solr (postgres docker)|JAAS via site XML|false|on-demand queries|
> |SPNEGO acceptor (ingestor HTTP)|JAAS acceptor|true|different role ({{isInitiator=false}})|
> |Solr dispatcher|JAAS client + {{KerberosAction}}|true (bug)|every Solr write|
> So Ranger already applies the same setting ({{useTicketCache=false}}) in the 
> ingestor, Kafka plugin, schema-registry, and Docker Admin. The Solr 
> dispatcher’s shipped XML was the one place in the audit-server path that did 
> not follow that convention.
>  
> *Proposed fix:*
> Config (dispatcher Solr site XML):
> xasecure.audit.jaas.Client.option.useTicketCache = false
> This forces keytab-based login for the keytab-backed service principal.
> *Verification*
>  * HDFS audit pipeline E2E: plugin → ingestor → Kafka → Solr dispatcher → 
> Solr → Admin API
>  * Solr {{numFound}} increases for {{reqUser:testuser1}}
>  * Dispatcher logs show {{Successful login for rangerauditserver/...}} 
> without repeated {{No key to store}}
> *Notes*
>  * Not specific to dynamic Kafka partition plan; reproduces on master with 
> standard Solr dispatcher + Kerberos.
>  * {{AuditServerConstants.JAAS_USER_TICKET_CACHE}} already documents 
> {{useTicketCache=false}} for some Kafka paths; Solr dispatcher template is 
> inconsistent.
> h3. How we proved the fix
>  * Set {{useTicketCache=false}} → login always from keytab.
>  * Restart Solr dispatcher before pipeline checks in E2E.
> After that: Solr count {{63 → 64}}, Admin shows {{testuser1}}, logs show 
> {{Successful login for rangerauditserver/...}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
