ramackri opened a new pull request, #1030:
URL: https://github.com/apache/ranger/pull/1030

   Fixes [RANGER-5654](https://issues.apache.org/jira/browse/RANGER-5654): Solr 
audit dispatcher stops indexing audits into Kerberos-protected Solr after a TGT 
refresh/relogin when `useTicketCache=true` (the shipped default).
   
   ### What changes were proposed in this pull request?
   
   **Problem**
   
   The Solr audit dispatcher consumes audits from Kafka but eventually stops 
writing to Solr when Kerberos is enabled. Dispatcher logs show repeated 
failures such as `Failure in sending audits into Solr` and `No key to store`. 
Kafka consumer offsets continue to advance while Solr document counts remain 
flat.
   
   Root cause:
   
   1. Shipped Solr dispatcher config sets 
`xasecure.audit.jaas.Client.option.useTicketCache=true` together with 
keytab-based login.
   2. `AbstractKerberosUser.checkTGTAndRelogin()` performs `logout(); login()` 
when the TGT nears expiry.
   3. With `useTicketCache=true`, the relogin can fail with `No key to store` 
because, after the logout, the ticket cache provides no key material to store. 
The dispatcher is then left in a broken authentication state until it is 
restarted.
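
To make the configuration in step 1 concrete, here is a minimal Java sketch of a programmatic JAAS configuration that logs in from a keytab with the ticket cache disabled. It is not the dispatcher's actual code: the class name `KeytabJaasConfig`, the principal, and the keytab path are hypothetical, and `storeKey=true` is an assumption suggested by the `No key to store` message rather than something this PR states. The option names are those of the JDK's `Krb5LoginModule`.

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.AppConfigurationEntry.LoginModuleControlFlag;
import javax.security.auth.login.Configuration;

/** Illustrative only: an in-memory JAAS configuration for keytab-based login. */
final class KeytabJaasConfig extends Configuration {
    private final AppConfigurationEntry entry;

    KeytabJaasConfig(String principal, String keytabPath) {
        Map<String, String> options = new HashMap<>();
        options.put("useKeyTab", "true");       // log in from the keytab
        options.put("keyTab", keytabPath);
        options.put("principal", principal);
        options.put("useTicketCache", "false"); // the default this PR proposes
        options.put("storeKey", "true");        // assumption, not taken from the PR
        options.put("doNotPrompt", "true");     // a service cannot answer prompts
        entry = new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                LoginModuleControlFlag.REQUIRED,
                options);
    }

    /** Returns the same single entry for every application name. */
    @Override
    public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
        return new AppConfigurationEntry[] { entry };
    }

    public static void main(String[] args) {
        // Hypothetical principal and keytab path, for illustration only.
        KeytabJaasConfig cfg = new KeytabJaasConfig(
                "svc/host.example.com@EXAMPLE.COM", "/etc/keytabs/svc.keytab");
        System.out.println(cfg.getAppConfigurationEntry("Client")[0].getOptions());
    }
}
```

With `useTicketCache=false`, every login in this sketch goes through the keytab, which matches the PR's stated goal of using keytab login consistently.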
   
   **Solution**
   
   | Area | File | Change |
   |------|------|--------|
   | Relogin recovery | `agents-audit/core/.../AbstractKerberosUser.java` | On relogin `LoginException`, recreate `Subject` and `LoginContext`, then retry `login()` instead of leaving the user logged out |
   | Shipped config | `audit-server/audit-dispatcher/dispatcher-solr/.../ranger-audit-dispatcher-solr-site.xml` | Set `useTicketCache=false` so keytab login is used consistently (avoids ticket-cache relogin failure) |
   | Docker config | `dev-support/ranger-docker/scripts/audit-dispatcher/ranger-audit-dispatcher-solr-site.xml` | Same `useTicketCache=false` default for Tier 3 audit stack |
   
   This complements 
[RANGER-5643](https://issues.apache.org/jira/browse/RANGER-5643) (JAAS `_HOST` 
expansion and Solr URL rewrite for SPNEGO). RANGER-5643 fixed initial 
SPNEGO/JAAS principal alignment; this patch fixes the **long-running** 
dispatcher failure after TGT relogin.
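
The relogin recovery listed in the table can be sketched roughly as follows. This is not the code in `AbstractKerberosUser`: `ReloginSketch` and its `Session` interface are hypothetical stand-ins for a `Subject`/`LoginContext` pair, introduced so the example runs without a KDC. The idea shown is only the control flow: try `logout(); login()`, and on `LoginException` build a fresh session and retry once instead of staying logged out.

```java
import java.util.function.Supplier;
import javax.security.auth.login.LoginException;

/** Illustrative relogin-with-recovery flow; all names here are hypothetical. */
final class ReloginSketch {

    /** Stand-in for a JAAS session; in practice this would wrap a LoginContext. */
    interface Session {
        void login() throws LoginException;
        void logout() throws LoginException;
    }

    private final Supplier<Session> freshSession; // builds a new Subject/LoginContext pair
    private Session current;

    ReloginSketch(Supplier<Session> freshSession) throws LoginException {
        this.freshSession = freshSession;
        this.current = freshSession.get();
        this.current.login();
    }

    /** Attempts logout(); login(). On failure, recreates the session and retries once. */
    void relogin() throws LoginException {
        try {
            current.logout();
            current.login();
        } catch (LoginException first) {
            Session replacement = freshSession.get();
            try {
                replacement.login();
            } catch (LoginException second) {
                // Keep the original failure attached for diagnosis.
                second.addSuppressed(first);
                throw second;
            }
            // Swap only after the retry succeeded.
            current = replacement;
        }
    }

    Session current() {
        return current;
    }
}
```

A caller would pass a supplier that constructs a new `Subject` and `LoginContext` each time it is invoked. Retrying exactly once is a choice made for this sketch; the PR does not say how many attempts the real patch makes.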
   
   ### How was this patch tested?
   
   #### Code review / static verification
   
   - Confirmed `checkTGTAndRelogin()` is invoked from `KerberosAction` before 
Solr operations, so relogin failures directly block indexing.
   - Verified shipped and Docker Solr dispatcher site XML both defaulted to 
`useTicketCache=true` on master.
   
   #### Manual testing (Docker Tier 3 audit stack with Kerberos)
   
   Environment: `dev-support/ranger-docker` Tier 3 compose 
(`docker-compose.ranger-audit-tier3.yml`) — Ranger Admin, KDC, Postgres, Solr, 
ZooKeeper, Kafka, audit ingestor, Solr audit dispatcher, and HDFS plugin with 
Kerberos enabled.
   
   **Reproduce failure (master behavior, before patch):**
   
   1. Start Tier 3 stack and wait for audit health 
(`./scripts/audit/wait-for-audit-health.sh --tier 3`).
   2. Trigger HDFS audits (e.g. `hdfs dfs -ls /` as a test user).
   3. Confirm audits reach Kafka (ingestor offset / topic growth).
   4. After TGT refresh window or forced relogin cycle, observe Solr dispatcher 
logs:
      - `Failure in sending audits into Solr`
      - `No key to store`
   5. Solr query count for test user (`reqUser:testuser1`) stops increasing; 
Kafka offset continues to grow.
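
The symptom in step 5 (Kafka offset growing while the Solr count stays flat) can be expressed as a simple check over two observations. This is a sketch with hypothetical names (`IndexingStallCheck`, `Sample`); how the consumer offset and the Solr document count are actually sampled is left out and would depend on the environment.

```java
/** Illustrative check for the "offsets advance, Solr count flat" symptom. */
final class IndexingStallCheck {

    /** One observation of the pipeline taken at a single point in time. */
    static final class Sample {
        final long kafkaOffset;
        final long solrDocCount;

        Sample(long kafkaOffset, long solrDocCount) {
            this.kafkaOffset = kafkaOffset;
            this.solrDocCount = solrDocCount;
        }
    }

    /** True when the consumer offset moved forward but the Solr count did not grow. */
    static boolean isStalled(Sample before, Sample after) {
        return after.kafkaOffset > before.kafkaOffset
                && after.solrDocCount <= before.solrDocCount;
    }

    public static void main(String[] args) {
        Sample before = new Sample(100, 40);
        Sample after = new Sample(160, 40);
        System.out.println("stalled: " + isStalled(before, after));
    }
}
```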
   
   **Verify fix (with this patch applied):**
   
   1. Rebuild audit-dispatcher tarball with patched `agents-audit/core` and 
redeploy Solr dispatcher container with updated site XML 
(`useTicketCache=false`).
   2. Restart Solr dispatcher and confirm clean JAAS login in logs (`Successful 
login for rangerauditserver/...`).
   3. Trigger additional HDFS audits.
   4. Confirm Solr document count increases (e.g. `reqUser:testuser1` count 
incremented).
   5. Confirm Ranger Admin audit UI / Solr `totalCount` reflects new audits.
   6. Full HDFS → ingestor → Kafka → Solr dispatcher → Solr pipeline **PASS** 
(12/12 checks in dynamic-partition E2E harness; Solr indexing hop green after 
dispatcher restart with patched config).
   
   **Observed after fix:**
   
   - No `No key to store` errors during normal operation or after relogin.
   - Solr dispatcher resumes indexing without a manual `kinit` from the keytab 
inside the container.
   - End-to-end audit delivery to Solr stable under Kerberos.
   
   ### Related
   
   - Jira: [RANGER-5654](https://issues.apache.org/jira/browse/RANGER-5654)
   - Related Kerberos SPNEGO fix: 
[RANGER-5643](https://issues.apache.org/jira/browse/RANGER-5643)
   
   


