GitHub user weiqingy created a discussion: [Feature] Per-Event-Type Configurable Log Levels for Event Log
## Context

This document describes the existing event log architecture and the proposed enhancement.

GitHub issue: https://github.com/apache/flink-agents/issues/541

Discussion: https://github.com/apache/flink-agents/discussions/516 (Planning Flink Agents 0.3)

## Existing Event Log Architecture

### Overview

The event log system captures every event flowing through an agent for debugging, auditing, and observability. It consists of API interfaces in the `api` module and a file-based implementation in the `runtime` module.

### Core Components

- **EventLogger** (`api/.../logger/EventLogger.java`): Logging interface with `open`, `append`, `flush`, `close` lifecycle.
- **EventLoggerConfig** (`api/.../logger/EventLoggerConfig.java`): Builder-based config holding logger type, event filter, and properties.
- **EventFilter** (`api/.../EventFilter.java`): Functional interface for binary accept/reject filtering.
- **EventLoggerFactory** (`api/.../logger/EventLoggerFactory.java`): Factory supporting built-in (`"file"`) and custom logger registration via `registerFactory()`.
- **FileEventLogger** (`runtime/.../eventlog/FileEventLogger.java`): JSON Lines file-based implementation.
- **EventLogRecord** (`runtime/.../eventlog/EventLogRecord.java`): Wrapper combining `EventContext` + `Event` for serialization.

### Limitations Addressed

1. Only binary accept/reject via `EventFilter`; no concept of levels.
2. No per-event-type granularity without writing custom lambda filters.
3. No level metadata in JSON output for downstream querying.
4. Cannot configure "log ChatRequest at VERBOSE but ToolRequest at STANDARD."
5. STANDARD and VERBOSE had identical behavior (no detail omission).
6. `eventType` only available nested inside the `event` object, not at top level.
7. Typos in per-type config silently ignored with no feedback.

## Enhancement Design

### Design Summary

- New `EventLogLevel` enum: `OFF`, `STANDARD`, `VERBOSE`.
- Per-event-type levels configured in `EventLoggerConfig` via a `Map<String, EventLogLevel>`.
- Config key pattern for per-type overrides: `event-log.level.<eventType>`. Each event type has its own independently overridable config key.
- Global default level via `event-log.level.default`.
- Per-event overall length limit via `event-log.max-length` (default 0, disabled).
- STANDARD semantics: "details might be omitted to keep the logs concise." This includes truncating long strings, large lists, and deeply nested structures. The specific truncation strategy can evolve over time without breaking the semantic contract.
- VERBOSE logs everything unmodified.
- `eventType` emitted as a top-level JSON field for easier downstream filtering.
- Startup validation warns about unrecognized event type names in config.
- Log level recorded in JSON output for downstream filtering.
- Fully backward compatible.

### New Files

**`EventLogLevel.java`** (`api/.../logger/`):

```java
public enum EventLogLevel {
    OFF,      // Do not log
    STANDARD, // Log with concise detail (details might be omitted)
    VERBOSE;  // Log with full detail (no omission)

    public boolean isEnabled() {
        return this != OFF;
    }
}
```

### Modified Files

**`EventLoggerConfig.java`** (added fields and builder methods):

- `defaultLogLevel` (defaults to `STANDARD`)
- `eventLogLevels` map (event type name → level)
- `maxEventLength` (default 0, disabled): max chars for the serialized event at STANDARD level
- `getEffectiveLogLevel(Event)`: resolves the log level for a given event (see [Event Type Resolution](#event-type-resolution) for lookup logic)
- `shouldLog(Event, EventContext)`: composes level check with event filter (see [Level and Filter Composition](#level-and-filter-composition) for semantics)
- `getMaxEventLength()`: overall event length threshold
- Builder offers both `eventLogLevel(Class, EventLogLevel)` (stores `class.getSimpleName()`) and `eventLogLevel(String typeName, EventLogLevel)` for Python event types or
  custom types that cannot be referenced as Java classes. The string variant rejects empty/blank strings and the reserved name `"default"` at build time with `IllegalArgumentException`.

**`AgentConfigOptions.java`** (config options using key pattern):

- `event-log.level.default` (String, default `"STANDARD"`): global default level. Note: `default` is a **reserved key name** and cannot be used as an event type name.
- `event-log.level.<eventType>` (String): per-type override, one key per event type
- `event-log.max-length` (Integer, default `0`): max serialized event length at STANDARD level; 0 means no truncation

**`EventLogRecord.java`**: Added `EventLogLevel logLevel` and `int maxEventLength` fields. The existing 2-arg constructor `EventLogRecord(context, event)` is preserved and defaults to `logLevel = STANDARD` and `maxEventLength = 0` (no truncation), which matches the current behavior exactly.

**`EventLogRecordJsonSerializer.java`**:

- Writes top-level `"eventType"` field alongside `"timestamp"` and `"logLevel"`
- At STANDARD level with `maxEventLength > 0`, applies truncation to keep the serialized event within the overall length limit. Truncation strategies include:
  - Truncating long string fields, appending `"... [truncated]"`
  - Trimming large lists/arrays beyond a threshold number of elements
  - Capping deeply nested structures beyond a threshold depth
- The specific truncation strategy is an implementation detail that may evolve over time. The semantic guarantee is: at STANDARD level, details might be omitted.
- **Truncation operates at the JsonNode tree level.** The existing serializer already builds a `JsonNode` tree via `mapper.valueToTree()` (which returns a fresh tree per call) and writes it with `gen.writeTree()`. Truncation inserts between these steps: build tree → estimate size → truncate tree if oversized → write. This avoids any double-serialization cost.
  Because `valueToTree()` produces a new tree on every invocation, in-place mutation of the tree during truncation is safe even when multiple subtasks share the same static `ObjectMapper`.
- **`maxEventLength` is an approximate cap**, not a strict byte-precise guarantee. Size estimation from the `JsonNode` tree does not account for JSON escaping or structural overhead, so actual serialized output may exceed the configured limit by a modest margin. This is acceptable for a logging feature: strict enforcement would require an extra serialization pass for oversized events, which is not worth the cost.
- **Truncated string fields may contain partial content** that is not independently parseable (e.g., a JSON-encoded tool call argument cut mid-value). This is inherent to any truncation scheme. Consumers needing complete structured content from a specific event type should configure that type at VERBOSE level.
- VERBOSE level writes all fields unmodified regardless of `maxEventLength`

**`EventLogRecordJsonDeserializer.java`**:

- Reads `logLevel` from JSON, falls back to `STANDARD` for old records
- Reads top-level `eventType` if present; falls back to nested `event.eventType`. When both exist, the top-level field takes precedence.

**`FileEventLogger.java`**:

- Uses `config.shouldLog()` (level + filter) and passes effective level and `maxEventLength` to `EventLogRecord`
- On `open()`, validates configured event type names and logs warnings for unrecognized names (see [Event Type Name Validation](#event-type-name-validation))
- On `open()`, if `maxEventLength > 0`, logs a one-time INFO message: `"Event log truncation enabled: events at STANDARD level will be truncated to N chars"`

**`ActionExecutionOperator.java`**: Reads `event-log.level.default`, `event-log.level.<eventType>` keys, and `event-log.max-length` from agent config in `createEventLogger()`.
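As a rough illustration of how the `event-log.level.*` keys could be turned into a default level plus a per-type map, here is a minimal sketch. It assumes the agent config is visible as a flat `Map<String, String>`; the `EventLogLevelKeys` class and its methods are hypothetical names for illustration only, the enum is a local stand-in for the proposed `EventLogLevel`, and case-insensitive parsing of level values is an assumption rather than something the proposal specifies.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Local stand-in for the proposed EventLogLevel enum.
enum EventLogLevel { OFF, STANDARD, VERBOSE }

// Hypothetical helper for illustration; not part of the proposed API.
final class EventLogLevelKeys {
    static final String PREFIX = "event-log.level.";
    static final String DEFAULT_KEY = PREFIX + "default";

    // Assumption: level values are parsed case-insensitively.
    private static EventLogLevel parseLevel(String raw) {
        return EventLogLevel.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Reads the global default, falling back to STANDARD when the key is absent. */
    static EventLogLevel parseDefault(Map<String, String> conf) {
        return parseLevel(conf.getOrDefault(DEFAULT_KEY, "STANDARD"));
    }

    /** Collects per-type overrides; the reserved "default" key is never treated as a type. */
    static Map<String, EventLogLevel> parseOverrides(Map<String, String> conf) {
        Map<String, EventLogLevel> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith(PREFIX) || key.equals(DEFAULT_KEY)) {
                continue;
            }
            // Everything after the prefix is the type name, dots included,
            // so "event-log.level.my_module.MyEvent" yields "my_module.MyEvent".
            String typeName = key.substring(PREFIX.length());
            if (!typeName.trim().isEmpty()) {
                overrides.put(typeName, parseLevel(entry.getValue()));
            }
        }
        return overrides;
    }
}
```

Keeping the whole remainder after the prefix as the type name is what lets full Python names such as `my_module.MyOtherEvent` work as config keys without any extra escaping.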
### Config Key Naming Convention

Existing `AgentConfigOptions` uses a mix of camelCase (`baseLogDir`, `kafkaBootstrapServers`) and kebab-case (`job-identifier`). The new event log keys use **kebab-case with dot-separated hierarchy** (`event-log.level.default`, `event-log.level.<eventType>`, `event-log.max-length`) because:

- Hierarchical keys require dot separators, making camelCase less readable (e.g., `eventLog.level.ChatRequestEvent` mixes two casing styles).
- Kebab-case with dots is the standard Flink convention for hierarchical config (e.g., `state.backend.type`, `execution.checkpointing.interval`).
- The `event-log.*` prefix groups all event log settings together for discoverability.

### Event Type Resolution

`getEffectiveLogLevel(Event)` resolves the log level for a given event by matching its type name against the `eventLogLevels` map:

1. **Java events**: Match by **simple class name** (e.g., `ChatRequestEvent`). This is the name used in config keys: `event-log.level.ChatRequestEvent`.
2. **Python events** (`PythonEvent`): The Java class is always `PythonEvent`, so matching by Java class name would be useless. Instead, the logical event type from `PythonEvent.getEventType()` is used, checked in this order:
   - **Full name first** (e.g., `my_module.MyEvent`): matches `event-log.level.my_module.MyEvent`
   - **Simple suffix second** (part after the last `.`, e.g., `MyEvent`): matches `event-log.level.MyEvent`
   - The first match wins. This means a full-name key always takes precedence over a simple-suffix key if both are configured.
   - If `PythonEvent.getEventType()` is `null`, resolution falls through directly to the default level.
   - **Name collision note**: A simple-name config key (e.g., `event-log.level.InputEvent`) applies to both Java and Python events sharing that simple name. Use a full-name key (e.g., `event-log.level.utils.InputEvent`) to target only the Python event.
3. **Fallback**: If no per-type override matches, the `defaultLogLevel` is used.
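The lookup order above can be sketched as follows. This is only an illustration under stated assumptions: `EventTypeResolutionSketch`, `resolveJava`, and `resolvePython` are hypothetical names, the enum is a local stand-in for the proposed `EventLogLevel`, and the type names are passed in as plain strings instead of being read from real `Event` or `PythonEvent` objects.

```java
import java.util.Map;

// Local stand-in for the proposed EventLogLevel enum.
enum EventLogLevel { OFF, STANDARD, VERBOSE }

// Hypothetical sketch of the resolution order; not the actual getEffectiveLogLevel().
final class EventTypeResolutionSketch {

    /** Java events: match by simple class name, else fall back to the default. */
    static EventLogLevel resolveJava(
            Map<String, EventLogLevel> levels, EventLogLevel defaultLevel, String simpleClassName) {
        return levels.getOrDefault(simpleClassName, defaultLevel);
    }

    /** Python events: full name first, simple suffix second, default last. */
    static EventLogLevel resolvePython(
            Map<String, EventLogLevel> levels, EventLogLevel defaultLevel, String pythonEventType) {
        if (pythonEventType == null) {
            return defaultLevel; // no logical type available
        }
        EventLogLevel byFullName = levels.get(pythonEventType);
        if (byFullName != null) {
            return byFullName; // full-name key wins over a simple-suffix key
        }
        int lastDot = pythonEventType.lastIndexOf('.');
        if (lastDot >= 0) {
            EventLogLevel bySuffix = levels.get(pythonEventType.substring(lastDot + 1));
            if (bySuffix != null) {
                return bySuffix;
            }
        }
        return defaultLevel;
    }
}
```

With `MyEvent` set to `OFF` and `my_module.MyEvent` set to `VERBOSE`, a Python event of type `my_module.MyEvent` resolves to `VERBOSE`, while `other_module.MyEvent` resolves to `OFF` through the suffix match.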
Example config for mixed Java and Python events:

```yaml
event-log.level.default: STANDARD
event-log.level.ChatRequestEvent: VERBOSE          # Java event
event-log.level.MyPythonEvent: OFF                 # Python event (simple name)
event-log.level.my_module.MyOtherEvent: VERBOSE    # Python event (full name)
```

### Event Type Names: Config vs JSON

Config keys use **simple class names** for brevity (`event-log.level.ChatRequestEvent`), while the JSON `eventType` field uses **fully-qualified class names** (`org.apache.flink.agents.api.event.ChatRequestEvent`) for unambiguous identification in downstream processing. This is intentional:

- Config keys are typed by humans and benefit from short names.
- JSON output is consumed by tools and benefits from unambiguous FQCNs.

### Level and Filter Composition

`shouldLog(Event, EventContext)` composes the log level check with the existing `EventFilter`. Both must pass for an event to be logged (AND semantics):

| `EventFilter.accept()` | `EventLogLevel` | Event logged? |
|---|---|---|
| `true` | `STANDARD` or `VERBOSE` | Yes |
| `true` | `OFF` | No |
| `false` | any level | No |

The `EventFilter` is evaluated first. If the filter rejects, the level is not checked. This preserves the existing filter behavior: any `EventFilter` configured today continues to work unchanged.

### STANDARD vs VERBOSE Behavior

| Aspect | STANDARD | VERBOSE |
|---|---|---|
| Event recorded | Yes | Yes |
| Detail omission | Details might be omitted to keep logs concise | No omission |
| Truncation scope | Long strings, large lists, deep nesting | N/A |
| Overall event limit | Capped at `event-log.max-length` (if > 0) | No limit |
| Use case | Production monitoring, low disk usage | Debugging, full payload inspection |

The key semantic distinction: STANDARD means "details might be omitted to keep the logs concise." The specific truncation strategies (string length, list size, nesting depth) are implementation details that can evolve over time without breaking this contract.
Setting `event-log.max-length` to 0 or negative disables truncation at STANDARD level, making STANDARD behave identically to VERBOSE (except for the metadata label).

### Usage Examples

**Java API:**

```java
EventLoggerConfig config = EventLoggerConfig.builder()
    .loggerType("file")
    .property("baseLogDir", "/tmp/agent-logs")
    .defaultLogLevel(EventLogLevel.STANDARD)
    .maxEventLength(4096)
    .eventLogLevel(ChatRequestEvent.class, EventLogLevel.VERBOSE)
    .eventLogLevel(ChatResponseEvent.class, EventLogLevel.VERBOSE)
    .eventLogLevel(ContextRetrievalRequestEvent.class, EventLogLevel.OFF)
    .build();
```

**String-based configuration (config files):**

```yaml
# Global default level
event-log.level.default: STANDARD

# Per-event-type overrides: each type has its own independently overridable key
event-log.level.ChatRequestEvent: VERBOSE
event-log.level.ChatResponseEvent: VERBOSE
event-log.level.ContextRetrievalRequestEvent: OFF

# Overall event length limit at STANDARD level (0 = no truncation)
event-log.max-length: 4096
```

**Override a single event type at job submission without affecting other settings:**

```bash
# Shared config.yaml sets defaults for all jobs.
# Override just one event type for debugging; other per-type levels from
# config.yaml are preserved because each type is its own config key.
flink run ... -Devent-log.level.ChatRequestEvent=VERBOSE
```

### JSON Output

Events now include top-level `logLevel` and `eventType` fields:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "logLevel": "STANDARD",
  "eventType": "org.apache.flink.agents.api.InputEvent",
  "event": {
    "eventType": "org.apache.flink.agents.api.InputEvent",
    "id": "...",
    "attributes": {},
    "input": "This is a short input that fits within the limit"
  }
}
```

At STANDARD level with truncation (event exceeds `event-log.max-length`):

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "logLevel": "STANDARD",
  "eventType": "org.apache.flink.agents.api.event.ChatResponseEvent",
  "event": {
    "eventType": "org.apache.flink.agents.api.event.ChatResponseEvent",
    "id": "...",
    "attributes": {},
    "response": "The beginning of a very long LLM response... [truncated]"
  }
}
```

For Python events, the top-level `eventType` field contains the logical Python type string from `PythonEvent.getEventType()` (e.g., `my_module.MyEvent`), not the Java FQCN `org.apache.flink.agents.runtime.python.event.PythonEvent`. The existing serializer already handles this: it reads from the parsed `eventJsonStr` or falls back to `event.getEventType()`.

Old records without `logLevel` or top-level `eventType` are deserialized correctly with `STANDARD` as the default level.

### Backward Compatibility

- Default log level is `STANDARD`, matching existing behavior (all events logged).
- Default `event-log.max-length` is `0` (disabled), so no truncation occurs unless explicitly configured. This preserves the current behavior where all events are logged in full.
- `EventLogRecord` 2-arg constructor preserved with defaults `logLevel = STANDARD`, `maxEventLength = 0`, matching existing behavior exactly.
- JSON without `logLevel` or top-level `eventType` fields deserializes correctly.
- Existing `EventFilter` continues to work alongside log levels (AND composition; see [Level and Filter Composition](#level-and-filter-composition)).
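The AND composition mentioned in the last bullet, with the filter evaluated first, might look roughly like the sketch below. The `Filter` interface is a hypothetical stand-in for `EventFilter` (its real signature is not shown in this document), the level is passed as a `Supplier` purely to make the "level is not checked when the filter rejects" behavior visible, and `ShouldLogSketch` is an illustrative name rather than the proposed `EventLoggerConfig.shouldLog()`.

```java
import java.util.function.Supplier;

// Local stand-in for the proposed EventLogLevel enum.
enum EventLogLevel {
    OFF, STANDARD, VERBOSE;

    boolean isEnabled() {
        return this != OFF;
    }
}

// Hypothetical sketch of the filter-then-level composition.
final class ShouldLogSketch {

    /** Hypothetical stand-in for EventFilter; the real interface may differ. */
    interface Filter<E, C> {
        boolean accept(E event, C context);
    }

    static <E, C> boolean shouldLog(
            Filter<E, C> filter, Supplier<EventLogLevel> effectiveLevel, E event, C context) {
        // && short-circuits: when the filter rejects, the level is never resolved.
        return filter.accept(event, context) && effectiveLevel.get().isEnabled();
    }
}
```

Because the default level is `STANDARD`, which is enabled, a configuration that sets no levels produces the same accept/reject decisions as the filter alone.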
### Event Type Name Validation

On `FileEventLogger.open()`, event type names from `event-log.level.<eventType>` keys are validated against a **hardcoded registry** of known event classes from the `org.apache.flink.agents.api` and `org.apache.flink.agents.api.event` packages.

A hardcoded registry is used rather than classpath scanning because Java has no reliable built-in mechanism for enumerating classes in a package at runtime. The registry is maintained as a simple `Set<String>` of known event class simple names. A unit test asserts that this registry matches the actual set of `Event` subclasses on the classpath, so any new event class added without updating the registry causes a test failure.

Validation skips names that appear to be Python event types (names containing `.` or following Python naming conventions such as leading lowercase or underscores), since these cannot be validated against the Java registry.

Unrecognized Java-style names produce a WARN-level log:

```
WARN FileEventLogger - Configured event log level for 'ChatRequstEvent' (event-log.level.ChatRequstEvent) but no matching event class was found. Check for typos in the config key.
```

The full list of known event types is logged at DEBUG level to avoid noise:

```
DEBUG FileEventLogger - Known event types: [ChatRequestEvent, ChatResponseEvent, ...]
```

This is a warning, not an error: custom event types not in the built-in registry will trigger the warning but still function correctly at runtime.

### Custom EventLogger Implementations

`EventLoggerFactory` supports registering custom loggers via `registerFactory()`. The new `logLevel` and `maxEventLength` fields are carried in `EventLogRecord`, and truncation is performed in `EventLogRecordJsonSerializer` (the Jackson serializer bound to `EventLogRecord` via `@JsonSerialize`).

- Custom loggers that serialize `EventLogRecord` via Jackson get truncation for free.
- Custom loggers that use their own serialization are responsible for honoring `EventLogRecord.getLogLevel()` and `EventLogRecord.getMaxEventLength()` if they want to support truncation.

### Observability

When truncation is active (`event-log.max-length > 0`), a counter metric `eventLogTruncatedEvents` is incremented each time an event is truncated at STANDARD level. This helps operators decide whether to increase the length limit or switch specific event types to VERBOSE.

**Metric ownership**: `FileEventLogger` registers the counter during `open()` using the `StreamingRuntimeContext` available via `EventLoggerOpenParams`. The serializer signals truncation via a mutable `boolean truncated` field on `EventLogRecord`, which the serializer sets during `serialize()`. After each `append()`, `FileEventLogger` checks this flag and increments the counter. (Jackson's `JsonSerializer.serialize()` has a `void` return type, so a return value is not an option; a field on the record being serialized is the cleanest mechanism.)

### Config Key Pattern Rationale

The per-type config key pattern `event-log.level.<eventType>` was chosen over a single comma-separated config option (e.g., `eventLogLevels=ChatRequestEvent=VERBOSE,...`) for composability:

- **Independent overrides**: When a shared `config.yaml` defines per-type levels for all jobs, a single job can override just one event type at submission time (e.g., `-Devent-log.level.ChatRequestEvent=VERBOSE`) without re-specifying the entire list.
- **No loss of other settings**: Overriding one key does not affect other `event-log.level.*` keys from the shared config.
- **Flink convention**: Aligns with how Flink handles other hierarchical config keys.

GitHub link: https://github.com/apache/flink-agents/discussions/552
