[ 
https://issues.apache.org/jira/browse/RANGER-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramachandran Krishnan updated RANGER-5655:
------------------------------------------
    Description:     (was: Implement a *dynamic unified ingestor registry* for 
Ranger audit-ingestor so operators can change Kafka partition routing and 
per-repo service allowlists at runtime — without restarting ingestor pods.

The registry is stored in a Kafka compacted topic 
({{ranger_audit_partition_plan}}) and managed via REST 
({{/api/audit/partition-plan}}). All ingestor replicas converge on the same 
versioned plan through a background watcher; {{AuditPartitioner}} routes audit 
records on the hot path from in-memory state only.

Feature flag (default off): 
{{ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled=false}}

 
----
h2. *Problem*

Today (static mode), audit-ingestor loads two kinds of configuration from XML 
at startup only:
||Job||Question||Static behavior||
|Service allowlist|May this Kerberos principal POST audits for repo 
*R*?|{{ranger.audit.ingestor.service.<repo>.allowed.users}} in site XML|
|Partition routing|After accept, which {{ranger_audits}} 
partition?|{{kafka.configured.plugins}} + per-plugin overrides in site XML|

Changing either requires editing XML and *restarting every ingestor 
replica*. Contiguous-range static allocation can also reshuffle later plugins 
when an early plugin's partition count changes.

*Goals:*
 * Onboard new plugins/repos and scale hot plugins without ingestor restart
 * Append-only partition growth (no reshuffle of existing plugin assignments)
 * One shared source of truth across all ingestor pods
 * No new infra (no Postgres / ZooKeeper for the registry)

----
h2. Solution

Introduce a *unified partition plan document* (versioned JSON) in Kafka topic 
{{ranger_audit_partition_plan}} (1 partition, compacted). One document holds:
 * {{plugins}} — dedicated partition IDs per plugin id (Kafka record key / 
agent id)
 * {{buffer}} — partition pool for not-yet-promoted plugins (sticky hash)
 * {{services}} — per-repo {{allowedUsers}} for {{POST /api/audit/access}}
 * {{topicPartitionCount}} — must match live {{ranger_audits}} partition count
 * {{version}} — optimistic locking for REST mutations

*Control plane:* REST + registry topic + {{PartitionPlanWatcher}}
*Data plane:* plugins POST {{/api/audit/access}} → allowlist check → 
{{AuditPartitioner}} → {{ranger_audits}}

Solr/HDFS dispatchers are unchanged; they consume all partitions of 
{{ranger_audits}}.
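
The invariants implied by the plan document can be sketched as a small consistency check. This is an illustrative sketch only, not the ingestor's actual validation: the function name {{validate_plan}} and the {{<buffer>}} owner label are hypothetical, and it assumes the JSON shape listed above ({{plugins}}, {{buffer}}, {{topicPartitionCount}}).

```python
from typing import Any


def validate_plan(plan: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the plan looks consistent.

    Hypothetical helper: checks that every partition ID is inside the topic
    and that no partition is owned by more than one plugin or by the buffer.
    """
    errors: list[str] = []
    count = plan["topicPartitionCount"]
    # Treat the buffer pool as one more "owner" so overlap checks cover it too.
    owners = {name: spec["partitions"] for name, spec in plan["plugins"].items()}
    owners["<buffer>"] = plan["buffer"]["partitions"]

    seen: dict[int, str] = {}
    for owner, partitions in owners.items():
        for p in partitions:
            if not 0 <= p < count:
                errors.append(f"{owner}: partition {p} outside 0..{count - 1}")
            elif p in seen:
                errors.append(f"partition {p} assigned to both {seen[p]} and {owner}")
            else:
                seen[p] = owner
    return errors
```

A plan that passes returns an empty list; an overlapping or out-of-range assignment yields one message per offending partition, which a REST layer could map to a 400 response.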
----
h2. Key deliverables
 * *Kafka registry*: {{KafkaPartitionPlanRegistry}}, bootstrap from XML on 
greenfield, brownfield pre-seed support
 * *REST API*: {{GET/PATCH /partition-plan}}, {{POST 
/partition-plan/plugins}}, {{POST /partition-plan/services}}, {{PATCH 
/partition-plan/plugins/{pluginId}}} with {{expectedVersion}} (409 on stale)
 * *Dynamic partitioner*: {{AuditPartitioner}} reads 
{{PartitionPlanHolder}}; round-robin for promoted plugins; buffer sticky 
hash for unknown plugins; post-scale routing bounds partitions by 
{{max(cluster, plan)}} when Kafka metadata lags the plan
 * *Unified allowlist*: {{services}} map in the same registry; 
{{auth_to_local}} rules recomposed from allowlists in dynamic mode
 * *Topic grow*: grow {{ranger_audits}} before the registry write when 
promoting/scaling
 * *E2E harness*: Docker Tier 3 scripts for partition-plan REST, plugin 
onboard/routing, auth_to_local, full plugin→Solr pipelines (HDFS, Ozone, Hive, 
etc.)
 * *Documentation*: implementation guide, ops runbook, brownfield migration, 
E2E test plan
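
The routing rules in the dynamic partitioner bullet can be sketched as follows. This is a simplified illustration, not the {{AuditPartitioner}} code: the class {{PlanRouter}} and the helper {{effective_partition_count}} are hypothetical names, CRC32 is an assumed stand-in for whatever sticky hash the ingestor uses, and the reading of {{max(cluster, plan)}} as an upper bound on the partition count is an assumption.

```python
import itertools
import zlib


class PlanRouter:
    """Hypothetical in-memory router built from one plan document."""

    def __init__(self, plan):
        self._buffer = plan["buffer"]["partitions"]
        # One endless round-robin cursor per promoted plugin.
        self._cursors = {
            name: itertools.cycle(spec["partitions"])
            for name, spec in plan["plugins"].items()
        }

    def route(self, plugin_id: str) -> int:
        cursor = self._cursors.get(plugin_id)
        if cursor is not None:
            # Promoted plugin: rotate through its dedicated partitions.
            return next(cursor)
        # Unknown plugin: a stable hash keeps it on one buffer partition.
        index = zlib.crc32(plugin_id.encode("utf-8")) % len(self._buffer)
        return self._buffer[index]


def effective_partition_count(cluster_count: int, plan_count: int) -> int:
    """Assumed bound: trust the larger count while Kafka metadata lags the plan."""
    return max(cluster_count, plan_count)
```

With {{hdfs}} promoted to partitions 0-2 and a buffer of 3-4, successive {{route("hdfs")}} calls cycle 0, 1, 2, 0, while a not-yet-promoted id such as {{storm}} always lands on the same buffer partition.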

----
h2. Configuration
{code:xml}
<property>
  <name>ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ranger.audit.ingestor.kafka.partition.plan.topic</name>
  <value>ranger_audit_partition_plan</value>
</property>
{code}
When dynamic mode is on and the registry is empty, the first ingestor pod 
bootstraps an initial plan from existing XML properties 
({{kafka.configured.plugins}}, buffer/per-plugin counts, service 
allowlists).
----
h2. Example plan JSON
{code:json}
{
  "topic": "ranger_audits",
  "version": 12,
  "topicPartitionCount": 48,
  "plugins": {
    "hdfs":        { "partitions": [0, 1, 2, 3, 4, 5] },
    "hiveServer2": { "partitions": [6, 7, 8, 9, 10, 11] }
  },
  "buffer": { "partitions": [12, 13, "..."] },
  "services": {
    "dev_hive":  { "allowedUsers": ["hive"] },
    "dev_ozone": { "allowedUsers": ["om", "ozone"] },
    "dev_hdfs":  { "allowedUsers": ["hdfs", "nn"] }
  }
}
{code}
----
h2. Testing
h3. Validated scenarios
h4. 1. Static vs dynamic mode

What it proves: The feature flag works and existing deployments are not broken.
||Mode||Config||Expected behavior||
|Static (default)|{{kafka.partition.plan.dynamic.enabled=false}}|Routing and 
allowlists come from XML at startup only. {{GET /api/audit/partition-plan}} 
returns 503. Ingestor health is 200. HDFS/plugin audits still flow to Solr.|
|Dynamic|{{dynamic.enabled=true}}|Ingestor starts {{PartitionPlanWatcher}}, 
reads plan from Kafka topic {{ranger_audit_partition_plan}}, and routes 
audits from in-memory plan. {{GET /partition-plan}} returns JSON.|

 
----
h4. 2. Greenfield bootstrap

What it proves: On a new cluster with an empty plan topic, the first ingestor 
pod creates version 1 of the partition plan automatically — no manual REST call.

Flow:
 # Enable dynamic mode; restart ingestor.
 # Plan topic {{ranger_audit_partition_plan}} is created (1 partition, 
compacted).
 # Ingestor publishes v1 from existing XML ({{kafka.configured.plugins}}, 
buffer size, per-plugin overrides, service allowlists).
 # Plan’s {{topicPartitionCount}} matches live {{ranger_audits}} partition 
count (e.g. 48 = 13 plugins × 3 + 9 buffer on full lab layout).

Success: {{GET /partition-plan}} shows v1, correct plugin list, partition count 
aligned with Kafka. Audits still work after enable.
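
The partition arithmetic in step 4 can be reproduced with a short sketch. It assumes a contiguous layout (plugins first, buffer last), which matches the example plan JSON but is not confirmed as the bootstrap algorithm; {{bootstrap_plan}} and the {{plugin-N}} ids are hypothetical.

```python
def bootstrap_plan(plugin_ids, per_plugin, buffer_size, topic="ranger_audits"):
    """Build a version-1 plan with contiguous ranges: plugins first, buffer last.

    Hypothetical helper; the layout order is an assumption for illustration.
    """
    plugins = {}
    next_id = 0
    for plugin_id in plugin_ids:
        plugins[plugin_id] = {"partitions": list(range(next_id, next_id + per_plugin))}
        next_id += per_plugin
    buffer = list(range(next_id, next_id + buffer_size))
    return {
        "topic": topic,
        "version": 1,
        "topicPartitionCount": next_id + buffer_size,
        "plugins": plugins,
        "buffer": {"partitions": buffer},
        "services": {},  # allowlists omitted in this sketch
    }


# 13 plugins with 3 partitions each, plus a 9-partition buffer.
plan = bootstrap_plan([f"plugin-{i}" for i in range(13)], per_plugin=3, buffer_size=9)
print(plan["topicPartitionCount"])
```

For the full lab layout this gives 13 × 3 + 9 = 48, the value {{topicPartitionCount}} must match on the live {{ranger_audits}} topic.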
----
h4. 3. REST promote / scale (200 / 409 / 400)

What it proves: Operators can change routing at runtime via REST without 
restarting ingestor. The API rejects bad requests.
||Action||REST||Success||Failure cases||
|Promote new plugin (e.g. {{storm}}) from buffer → dedicated 
partitions|{{POST /partition-plan/plugins}}|200, {{version}} increments, plugin 
in {{plugins}} map|400 if plugin already promoted (e.g. {{hdfs}} twice)|
|Scale hot plugin (+N partitions)|{{PATCH 
/partition-plan/plugins/{pluginId}}}|200, new partition IDs appended at tail; 
{{ranger_audits}} grown first if needed|400 if scaling a buffer-only plugin|
|Stale edit|Any mutation with wrong {{expectedVersion}}|—|409 + current plan 
body (forces refresh and retry)|
|Invalid plan|Overlapping partitions, reshuffle existing IDs, {{partitionCount: 
0}}|—|400|

Key property: Changes are append-only — existing plugin partition lists are 
never reshuffled.
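
The optimistic-locking and append-only behavior in the table can be sketched as a pure function over the plan document. This is a sketch under stated assumptions, not the REST handler: {{scale_plugin}}, {{StaleVersion}} and {{InvalidRequest}} are hypothetical names, and the sketch always takes new IDs from the tail of the topic, whereas the real flow grows {{ranger_audits}} only if needed.

```python
import copy


class StaleVersion(Exception):
    """Would map to HTTP 409 in a REST layer."""


class InvalidRequest(Exception):
    """Would map to HTTP 400 in a REST layer."""


def scale_plugin(plan, plugin_id, add_partitions, expected_version):
    """Return a new plan with add_partitions appended to a promoted plugin."""
    if expected_version != plan["version"]:
        raise StaleVersion(f"expected {expected_version}, current {plan['version']}")
    if plugin_id not in plan["plugins"]:
        raise InvalidRequest(f"{plugin_id} is buffer-only; promote it first")
    if add_partitions <= 0:
        raise InvalidRequest("partition count must be positive")

    updated = copy.deepcopy(plan)  # never mutate the plan other readers hold
    start = updated["topicPartitionCount"]
    new_ids = list(range(start, start + add_partitions))
    # Append-only: existing IDs keep their position, new IDs go at the tail.
    updated["plugins"][plugin_id]["partitions"].extend(new_ids)
    updated["topicPartitionCount"] = start + add_partitions
    updated["version"] += 1
    return updated
```

A caller that receives {{StaleVersion}} would re-read the current plan and retry with the fresh {{version}}, which is the refresh-and-retry loop the 409 row describes.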
----
h4. 4. Multi-pod convergence

What it proves: In a real cluster with multiple ingestor replicas behind a load 
balancer, all pods see the same plan after an admin change.

Flow:
 # Primary ingestor on :7081, second replica on :7082 (same Kafka plan topic, 
different Kerberos identity).
 # Both pods report the same {{version}} at startup.
 # Admin promotes a plugin on primary only.
 # Within one watcher cycle (~30s), replica on :7082 shows the same new version 
without any REST call on the replica.

Success: No drift between pods; routing is consistent cluster-wide.
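
One way to picture the convergence property: each replica replays the same plan records and keeps the highest version it has seen. The sketch below is illustrative only. {{PlanHolder}} and {{replay}} are hypothetical stand-ins for the watcher and holder, and the "adopt only a strictly newer version" rule is an assumption rather than documented behavior.

```python
import threading


class PlanHolder:
    """Hypothetical thread-safe holder for the current in-memory plan."""

    def __init__(self):
        self._lock = threading.Lock()
        self._plan = None

    def offer(self, candidate) -> bool:
        """Adopt candidate only if it is newer than what is held."""
        with self._lock:
            if self._plan is None or candidate["version"] > self._plan["version"]:
                self._plan = candidate
                return True
            return False

    def current(self):
        with self._lock:
            return self._plan


def replay(holder, records):
    """Feed plan records (as read from the plan topic) into a holder."""
    for record in records:
        holder.offer(record)
    return holder.current()
```

Two holders fed the same records end on the same version regardless of which replica served the REST mutation, and a late, older record is ignored instead of rolling the plan back.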
----
h4. 5. Brownfield pre-seed

What it proves: Existing production clusters can cut over to dynamic mode 
safely by writing the plan into Kafka before enabling the feature — ingestor 
must not overwrite it with a fresh XML bootstrap.

Flow (Path A migration):
 # Capture current plan JSON while dynamic is briefly on.
 # Turn dynamic off; delete plan topic (simulate “still on static”).
 # Operator pre-seeds plan to Kafka with {{version=1}} and marker 
{{updatedBy=brownfield-e2e-seed}}.
 # Enable dynamic; restart ingestor.
 # Ingestor adopts the pre-seeded plan; it does not auto-bootstrap from XML.
 # Rollback: turn dynamic off → partition-plan API 503, health 200, audits 
still OK in static mode.

----
h4. 6. Plugin onboard + Kafka partition routing

What it proves: A real plugin can be onboarded via REST and its audit events 
land on the correct Kafka partitions defined in the plan.

Flow:
 # Enable dynamic mode (greenfield buffer-only layout).
 # For each running plugin container (HDFS, Ozone, Hive, etc.): {{POST 
/partition-plan/services}} with {{serviceName}}, {{pluginId}}, 
{{partitionCount}}, {{allowedUsers}}.
 # Plugin authenticates with Kerberos and POSTs to {{/api/audit/access}}.
 # Read Kafka record for that event → partition number must be in the plugin’s 
assigned list in the plan (not buffer, after promote).

----
h4. 7. auth_to_local recomposition

What it proves: The unified registry {{services}} map controls who may POST 
audits, and Kerberos principals are mapped to short usernames correctly.

Flow (per plugin repo):
 # Plugin calls {{POST /access}} with Kerberos principal (e.g. 
{{hdfs/[email protected]}}) → ingestor maps via 
{{auth_to_local}} → short name {{hdfs}} → 200 if in 
{{services[dev_hdfs].allowedUsers}}.
 # Remove user from allowlist via {{PATCH /partition-plan}} (services delta) → 
same principal gets 403.
 # Re-add allowlist → 200 again.
 # Cross-repo denial: HDFS principal posting to {{dev_kms}} repo → 403.

Why it matters: In dynamic mode, allowlists live in the registry and change 
without an XML edit and restart. {{auth_to_local}} rules must stay in sync with 
{{services[].allowedUsers}}.
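
The allow/deny decisions in the flow above can be sketched as a lookup against the {{services}} map. This is a deliberately simplified illustration: {{short_name}} only takes the primary component of the principal, whereas real {{auth_to_local}} rules are rule-based and more expressive; the function names, the example host and realm, and the {{dev_kms}} allowlist are hypothetical.

```python
def short_name(principal: str) -> str:
    """Simplified mapping: 'hdfs/host@REALM' -> 'hdfs' (not real auth_to_local)."""
    primary = principal.split("@", 1)[0]
    return primary.split("/", 1)[0]


def is_allowed(plan, repo: str, principal: str) -> bool:
    """True maps to 200, False to 403 in the flow described above."""
    service = plan["services"].get(repo)
    if service is None:
        return False  # unknown repo: deny
    return short_name(principal) in service["allowedUsers"]


plan = {
    "services": {
        "dev_hdfs": {"allowedUsers": ["hdfs", "nn"]},
        "dev_kms": {"allowedUsers": ["kms"]},  # hypothetical allowlist
    }
}
principal = "hdfs/nn1.example.com@EXAMPLE.COM"  # hypothetical principal
print(is_allowed(plan, "dev_hdfs", principal))
print(is_allowed(plan, "dev_kms", principal))
```

Removing {{hdfs}} from {{services[dev_hdfs].allowedUsers}} flips the first check to a denial, and the cross-repo check against {{dev_kms}} is denied because the short name is not on that repo's list.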
----
h4. 8. HDFS / Ozone / Hive full audit pipelines

What it proves: Dynamic partition plan changes do not break the end-to-end 
audit path: plugin → ingestor → Kafka → dispatcher → Solr (and Admin Audit UI 
when {{audit_store=solr}}).
||Plugin||What is exercised||
|HDFS|Real NameNode operation → plugin audit → ingestor → correct partition → 
Solr doc|
|Ozone|OM principal ({{om}}, {{ozone}}) → {{dev_ozone}} repo → full 
pipeline|
|Hive|HS2 principal ({{hive}}) → {{dev_hive}} repo → full pipeline|

Validated after: promote/scale (partition plan changed live), not only at 
bootstrap.

Verify Solr:
curl -s 'http://localhost:8983/solr/ranger_audits/select?q=repo:dev_hdfs&rows=3&wt=json'
----
h3. Summary table (for Jira)
||Scenario||Plain question||How verified||Pass criteria||
|Static vs dynamic|Does default-off behavior still work?|Feature flag + REST 
503/200|Static unchanged; dynamic enables plan API|
|Greenfield bootstrap|Who creates v1 on empty topic?|First ingestor start|Plan 
v1 in Kafka; counts match|
|REST promote/scale|Can ops change routing live?|REST mutations|200 on valid; 
409 stale; 400 illegal|
|Multi-pod convergence|Do all replicas agree?|2 ingestors, promote on one|Same 
version ≤35s on both|
|Brownfield pre-seed|Safe production cutover?|Pre-write plan, then 
enable|Pre-seed preserved; rollback OK|
|Plugin onboard + routing|Do audits hit right partitions?|POST access + Kafka 
inspect|Partition ∈ plan list|
|auth_to_local|Does allowlist enforcement work?|Allow/deny via services 
map|200/403 per principal+repo|
|Full pipelines|Is audit delivery intact?|HDFS/Ozone/Hive → Solr|Docs in 
{{ranger_audits}} Solr collection|
h2. Acceptance criteria
 # With {{dynamic.enabled=false}}, behavior matches existing static XML 
partitioning; {{GET /partition-plan}} returns 503
 # With {{dynamic.enabled=true}}, registry topic created (1 partition, 
compacted); bootstrap plan published on greenfield
 # REST promote/scale updates plan version; stale {{expectedVersion}} returns 
409
 # All ingestor replicas converge to same plan within watcher refresh interval
 # Audits for promoted plugin land only on assigned {{ranger_audits}} partitions
 # {{POST /partition-plan/services}} updates allowlist; unauthorized principal 
returns 403
 # Post-scale routing works when Kafka metadata lags the plan 
({{AuditPartitioner}} bound logic))

> Implement dynamic unified ingestor registry for audit-ingestor: runtime Kafka 
> partition routing and per-repo service allowlists via compacted topic + REST, 
> without ingestor restarts. Feature flag default off.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: RANGER-5655
>                 URL: https://issues.apache.org/jira/browse/RANGER-5655
>             Project: Ranger
>          Issue Type: Improvement
>          Components: Ranger
>            Reporter: Ramachandran Krishnan
>            Assignee: Ramachandran Krishnan
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: Dynamic Ingestor Registry Guide (Ranger Audit 
> Ingestor).pdf
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
