[
https://issues.apache.org/jira/browse/RANGER-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091167#comment-18091167
]
Ramachandran Krishnan commented on RANGER-5654:
-----------------------------------------------
Merged into master. Commit details:
https://github.com/apache/ranger/commit/7017225e604c3506b52db1d87c44325794204056
> Solr audit dispatcher fails to index after Kerberos TGT relogin (No key to
> store) with default useTicketCache=true
> ------------------------------------------------------------------------------------------------------------------
>
> Key: RANGER-5654
> URL: https://issues.apache.org/jira/browse/RANGER-5654
> Project: Ranger
> Issue Type: Task
> Components: Ranger
> Reporter: Ramachandran Krishnan
> Assignee: Ramachandran Krishnan
> Priority: Major
> Fix For: 3.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> The Solr audit dispatcher ({{ranger-audit-dispatcher-solr}}) can stop
> indexing audits into Kerberos-protected Solr after TGT refresh/relogin. Logs
> show repeated batch failures:
> Error sending message to Solr
> Login failed due to: Unable to login with rangerauditserver/<host>@<REALM>
> due to: No key to store
> Kafka consumption continues, but Solr doc counts do not increase and Ranger
> Admin Audit → Access does not show new events.
> h3. Affected components (master)
> * {{audit-server/audit-dispatcher/dispatcher-solr}} →
> {{SolrAuditDestination}} / {{KerberosAction}}
> * {{agents-audit/core}} → {{AbstractKerberosUser.checkTGTAndRelogin()}}
> * Shipped config templates:
> ** {{audit-server/audit-dispatcher/dispatcher-solr/src/main/resources/conf/ranger-audit-dispatcher-solr-site.xml}}
> ** {{dev-support/ranger-docker/scripts/audit-dispatcher/ranger-audit-dispatcher-solr-site.xml}}
> Both ship:
> xasecure.audit.jaas.Client.option.useTicketCache = true
> with {{useKeyTab=true}} and {{storeKey=true}}.
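> A minimal sketch of how the shipped combination could be spotted in a site
> XML. It assumes a Hadoop-style {{<property><name>/<value>}} layout and that
> {{useKeyTab}} and {{storeKey}} live under the same
> {{xasecure.audit.jaas.Client.option.}} prefix as {{useTicketCache}}; the
> function names and the inline sample are hypothetical, not taken from Ranger.
```python
import xml.etree.ElementTree as ET

# Assumed prefix: only the useTicketCache key is spelled out in full in the
# report; the other option keys are assumed to share the same prefix.
PREFIX = "xasecure.audit.jaas.Client.option."


def jaas_client_options(site_xml: str) -> dict:
    """Collect name/value pairs under PREFIX from a Hadoop-style site XML."""
    options = {}
    for prop in ET.fromstring(site_xml).iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith(PREFIX):
            options[name[len(PREFIX):]] = value
    return options


def mixes_ticket_cache_with_keytab(options: dict) -> bool:
    """True when a keytab login is also allowed to read the ticket cache."""
    def is_true(key: str) -> bool:
        return options.get(key, "false").lower() == "true"
    return is_true("useKeyTab") and is_true("useTicketCache")


# Hypothetical stand-in for the shipped template, reduced to three options.
SHIPPED = """<configuration>
  <property><name>xasecure.audit.jaas.Client.option.useKeyTab</name><value>true</value></property>
  <property><name>xasecure.audit.jaas.Client.option.storeKey</name><value>true</value></property>
  <property><name>xasecure.audit.jaas.Client.option.useTicketCache</name><value>true</value></property>
</configuration>"""

print(mixes_ticket_cache_with_keytab(jaas_client_options(SHIPPED)))
```
> A check like this only flags the configuration; it does not prove that a
> given deployment will hit the relogin failure.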
> h3. Root cause
> # Config: {{useTicketCache=true}} encourages reuse of the default credential
> cache. In Docker/long-running deployments, cache tickets (e.g. from {{kinit}}
> / {{KRB5CCNAME}}) can mix with keytab-based JAAS login. On relogin,
> {{Krb5LoginModule}} may not have a storable key and fails with “No key to
> store”.
> h2. The pipeline (normal path)
> HDFS plugin → audit ingestor → Kafka (ranger_audits) → Solr dispatcher → Solr
> → Ranger Admin UI
> # HDFS plugin: a user (e.g. {{testuser1}}) does something audited, like
> {{hdfs dfs -ls}} on a path they are denied access to. The plugin sends the
> audit to the ingestor.
> # Ingestor: accepts the audit and writes it to Kafka topic {{ranger_audits}}.
> # Kafka: holds the audit record. You can see the topic’s end offset go up.
> # Solr dispatcher: reads from Kafka, then POSTs/indexes each batch into the
> Kerberos-protected Solr collection {{ranger_audits}}.
> # Solr: stores the document. A query like {{reqUser:testuser1}} returns
> more docs.
> # Ranger Admin: Audit → Access reads from Solr and shows the new event.
> So: a growing Kafka end offset only means steps 2–3 worked. Solr/Admin
> updating means steps 4–6 worked.
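> The rule above can be sketched as a small classifier over before/after
> counters. The function name and the segment labels are hypothetical; the
> sample numbers are the ones quoted later in this report.
```python
from typing import Optional


def failing_segment(kafka_before: int, kafka_after: int,
                    solr_before: int, solr_after: int) -> Optional[str]:
    """Return the first pipeline segment that shows no progress, or None."""
    if kafka_after <= kafka_before:
        # Nothing new on the topic: the audit never got past steps 1-3.
        return "plugin -> ingestor -> Kafka"
    if solr_after <= solr_before:
        # Topic grew but the collection did not: steps 4-6 are stuck.
        return "Solr dispatcher -> Solr"
    return None


# Kafka end offset 4 -> 5, Solr count 63 -> 63.
print(failing_segment(4, 5, 63, 63))
```
> Checking both counters together is the point: a Kafka-only check would
> report this run as healthy.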
> h3. Docker Tier 3 audit stack:
> ||Container||Role||
> |{{ranger-kdc}}|Kerberos for Kafka, plugins, ingestor, Solr|
> |{{ranger}} + {{ranger-postgres}}|Ranger Admin + policies|
> |{{ranger-solr}} + {{ranger-zk}}|Audit search backend ({{ranger_audits}} collection)|
> |{{ranger-kafka}}|Topic {{ranger_audits}}|
> |{{ranger-audit-ingestor}}|Plugins POST audits here ({{:7081}})|
> |{{ranger-audit-dispatcher-solr}}|Kafka → Solr (Kerberos to Solr)|
> |{{ranger-hadoop}}|HDFS + Ranger HDFS plugin ({{dev_hdfs}})|
> h2. What you do to reproduce the bug
> h3. Step 1 — Run the stack with Kerberos + Solr dispatcher
> Bring up the Tier 3 Docker audit stack: ingestor, Kafka,
> ranger-audit-dispatcher-solr, Solr, HDFS, etc., all using Kerberos (keytabs,
> not simple auth).
> Solr is locked down; the dispatcher must log in as {{rangerauditserver/...}}
> to write to Solr.
> h3. Step 2 — Trigger real audits
> Run something that produces audits end-to-end, e.g.:
> * the HDFS deny-traverse flow ({{testuser1}} tries to traverse a path
> Ranger denies, which generates an audited DENY).
> At first this often works: plugin ✓, ingestor ✓, Kafka offset ✓.
> h3. Step 3 — Stress Kerberos / login state
> Do one or more of:
> * Restart {{ranger-audit-ingestor}} (common during E2E {{--fresh-plan}} when
> topics are recreated).
> * Delete and recreate {{ranger_audits}} (dispatchers restart, consumers
> rewind).
> * Wait long enough for the dispatcher’s Kerberos TGT to need refresh/relogin
> (or hit the 80% TGT lifetime window in {{AbstractKerberosUser}}).
> These don’t break Kafka itself; they change tickets, caches, and JVM login
> state in the Solr dispatcher.
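> To estimate how long "long enough" is, here is a rough sketch of an 80%
> lifetime window. It assumes the window is measured from the ticket start
> time as a fraction of (end - start); the actual arithmetic in
> {{AbstractKerberosUser}} is not shown in this report, and the timestamps and
> function names below are hypothetical.
```python
from datetime import datetime, timedelta

RENEW_WINDOW = 0.80  # fraction of the TGT lifetime, per the 80% figure above


def relogin_due_at(start: datetime, end: datetime,
                   window: float = RENEW_WINDOW) -> datetime:
    """Point in time after which a relogin would be attempted (assumed rule)."""
    return start + (end - start) * window


def needs_relogin(now: datetime, start: datetime, end: datetime,
                  window: float = RENEW_WINDOW) -> bool:
    """True once `now` has reached the relogin point for this ticket."""
    return now >= relogin_due_at(start, end, window)


# Hypothetical ticket valid for 10 hours.
start = datetime(2024, 1, 1, 0, 0)
end = start + timedelta(hours=10)
print(relogin_due_at(start, end))
```
> Under that assumption, a shorter ticket lifetime in the KDC makes the relogin
> path, and therefore the bug, reachable sooner in a test run.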
> h3. Step 4 — Trigger audits again
> Run the same HDFS audit trigger again. Now watch each hop.
> ----
> h2. What you observe when the bug hits
> *Solr dispatcher logs (the smoking gun)*
> ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5
> java.lang.Exception: Failure in sending audits into Solr
>
> ERROR - Error sending message to Solr
> Login failed due to: Unable to login with rangerauditserver/...@... due to:
> No key to store
> h3. Meaning:
> * The dispatcher still consumes from Kafka.
> * When it tries to send the batch to Solr, Kerberos login/relogin fails.
> * Every batch fails → nothing new in Solr.
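> A small sketch for counting this signature in dispatcher log output. The
> matched substrings are taken from the log lines above; the function name and
> the result keys are hypothetical, and the sample joins the wrapped login
> message onto one line.
```python
from typing import Dict, Iterable

BATCH_FAILURE = "Error processing batch in worker"
NO_KEY = "No key to store"


def summarize_failures(log_lines: Iterable[str]) -> Dict[str, int]:
    """Count batch failures and 'No key to store' login errors in log lines."""
    counts = {"batch_failures": 0, "no_key_to_store": 0}
    for line in log_lines:
        if BATCH_FAILURE in line:
            counts["batch_failures"] += 1
        if NO_KEY in line:
            counts["no_key_to_store"] += 1
    return counts


SAMPLE = [
    "ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5",
    "java.lang.Exception: Failure in sending audits into Solr",
    "ERROR - Error sending message to Solr",
    "Login failed due to: Unable to login with rangerauditserver/...@... "
    "due to: No key to store",
]

print(summarize_failures(SAMPLE))
```
> A steadily rising {{no_key_to_store}} count alongside a growing Kafka offset
> matches the pattern described in this section.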
> h3. Kafka — looks healthy (misleading)
> end offset 4 → 5 ✓
> The ingestor → Kafka path is fine. New audits land on the topic. That’s why
> the bug is easy to miss if you only check Kafka.
> h3. Solr — stuck
> waiting for Solr docs (reqUser:testuser1)...
> Solr count did not increase (before=63, after=63) ✗
> Query {{reqUser:testuser1}} (via Kerberos from inside the dispatcher
> container): count unchanged.
> h3. Ranger Admin — often unchanged too
> {{totalCount}} may not move; {{testuser1}} doesn’t appear in recent audits
> because Admin reads Solr, not Kafka.
> ----
> h2. Why it happens
> ||Component||Mechanism||{{useTicketCache}}||Proactive relogin||
> |Ingestor Kafka producer|JAAS string|false|Kafka client handles refresh|
> |Kafka plugin|JAAS string|false|connection-time|
> |HDFS dispatcher|UGI keytab|N/A|{{checkTGTAndReloginFromKeytab()}}|
> |Plugin → ingestor audits|UGI keytab|N/A|{{checkTGTAndReloginFromKeytab()}}|
> |Admin Solr (postgres docker)|JAAS via site XML|false|on-demand queries|
> |SPNEGO acceptor (ingestor HTTP)|JAAS acceptor|true|different role ({{isInitiator=false}})|
> |Solr dispatcher|JAAS client + {{KerberosAction}}|true (bug)|every Solr write|
> So Ranger already applies the same setting ({{useTicketCache=false}}) in the
> ingestor, Kafka plugin, schema-registry, and Docker Admin; the Solr
> dispatcher's shipped XML was the one place in the audit-server path that did
> not follow that convention.
>
> *Proposed fix:*
> Config (dispatcher Solr site XML):
> xasecure.audit.jaas.Client.option.useTicketCache = false
> Force keytab-based login for a keytab service principal.
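> For illustration, the effect of the change can be pictured as the JAAS
> {{Client}} entry the options would amount to. This sketch assumes the
> {{option.*}} properties are passed through as {{Krb5LoginModule}} options
> with a {{required}} control flag; the {{render_jaas_entry}} helper is
> hypothetical and keytab/principal options are left out.
```python
from typing import Dict


def render_jaas_entry(name: str, module: str, flag: str,
                      options: Dict[str, str]) -> str:
    """Render one entry in JAAS login configuration file syntax."""
    lines = [f"{name} {{", f"  {module} {flag}"]
    for key, value in options.items():
        # Booleans are conventionally left bare; other values are quoted.
        rendered = value if value in ("true", "false") else f'"{value}"'
        lines.append(f"  {key}={rendered}")
    lines[-1] += ";"   # the last option (or module line) ends the module spec
    lines.append("};")
    return "\n".join(lines)


FIXED_OPTIONS = {
    "useKeyTab": "true",
    "storeKey": "true",
    "useTicketCache": "false",  # the proposed change
}

print(render_jaas_entry("Client",
                        "com.sun.security.auth.module.Krb5LoginModule",
                        "required", FIXED_OPTIONS))
```
> With {{useTicketCache=false}}, the login module has no credential cache to
> fall back on, so the key it stores should come from the keytab.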
> *Verification*
> * HDFS audit pipeline E2E: plugin → ingestor → Kafka → Solr dispatcher →
> Solr → Admin API
> * Solr {{numFound}} increases for {{reqUser:testuser1}}
> * Dispatcher logs show {{Successful login for rangerauditserver/...}}
> without repeated {{No key to store}}
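> The {{numFound}} check can be sketched against the JSON body of a Solr
> select response. Fetching the body over Kerberos is out of scope here; the
> function names and the canned responses are hypothetical, with the counts
> taken from this report.
```python
import json


def num_found(select_response_body: str) -> int:
    """Extract response.numFound from a Solr JSON select response."""
    return int(json.loads(select_response_body)["response"]["numFound"])


def new_docs(before_body: str, after_body: str) -> int:
    """Number of documents added between two identical queries."""
    return num_found(after_body) - num_found(before_body)


# Canned bodies standing in for q=reqUser:testuser1 before and after a trigger.
BEFORE = '{"response": {"numFound": 63, "start": 0, "docs": []}}'
AFTER = '{"response": {"numFound": 64, "start": 0, "docs": []}}'

print(new_docs(BEFORE, AFTER))
```
> A result of zero after a fresh audit trigger is the stuck-Solr symptom
> described earlier; a positive result matches the verified behaviour.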
> *Notes*
> * Not specific to dynamic Kafka partition plan; reproduces on master with
> standard Solr dispatcher + Kerberos.
> * {{AuditServerConstants.JAAS_USER_TICKET_CACHE}} already documents
> {{useTicketCache=false}} for some Kafka paths; Solr dispatcher template is
> inconsistent.
> h3. How we proved the fix
> * Set {{useTicketCache=false}} → login always from keytab.
> * Restart Solr dispatcher before pipeline checks in E2E.
> After that: Solr {{63 → 64}}, Admin shows {{testuser1}}, logs show
> {{Successful login for rangerauditserver/...}}.
>