Re: [PR] [Improve](Streamingjob) support exclude_columns for Postgres streaming job [doris]

via GitHub Thu, 12 Mar 2026 02:53:24 -0700


github-actions[bot] commented on code in PR #61267:
URL: https://github.com/apache/doris/pull/61267#discussion_r2923469075



##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java:
##########
@@ -116,6 +122,31 @@ public static boolean isJson(String str) {
         }
     }
 
+    /**
+     * Parse the exclude-column set for a specific table from config.
+     *
+     * <p>Looks for key {@code "table.<tableName>.exclude_columns"} whose 
value is a comma-separated
+     * column list, e.g. {@code "secret,internal_note"}.
+     *
+     * @return lower-cased column name set; empty set when the key is absent
+     */

Review Comment:
   **[Incorrect Javadoc]** The `@return` says "lower-cased column name set" but 
the implementation does not apply `String::toLowerCase` -- it only trims. The 
returned set preserves the original casing from the config value.
   
   Either fix the Javadoc to say `column name set` (recommended, since 
preserving case is correct for case-sensitive databases), or add 
`.map(String::toLowerCase)` if lowercase normalization is actually intended.



##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java:
##########
@@ -93,35 +95,43 @@ public DeserializeResult deserialize(Map<String, String> 
context, SourceRecord r
             throws IOException {
         if (RecordUtils.isDataChangeRecord(record)) {
             LOG.trace("Process data change record: {}", record);
-            List<String> rows = deserializeDataChangeRecord(record);
+            List<String> rows = deserializeDataChangeRecord(record, context);
             return DeserializeResult.dml(rows);
         } else {
             return DeserializeResult.empty();
         }
     }
 
-    private List<String> deserializeDataChangeRecord(SourceRecord record) 
throws IOException {
+    private List<String> deserializeDataChangeRecord(
+            SourceRecord record, Map<String, String> context) throws 
IOException {
         List<String> rows = new ArrayList<>();
+        String tableName = extractTableName(record);
+        Set<String> excludeColumns = ConfigUtil.parseExcludeColumns(context, 
tableName);
         Envelope.Operation op = Envelope.operationFor(record);

Review Comment:
   **[Performance]** `ConfigUtil.parseExcludeColumns(context, tableName)` is 
called on **every DML record** in the CDC hot path. Each call builds a string 
key via concatenation, does a map lookup, splits the comma-separated value, 
trims each element, filters empties, and collects into a new `HashSet`.
   
   For high-throughput CDC streams this is wasteful. The exclude column set is 
static for the lifetime of the deserializer -- the config never changes between 
records.
   
   Suggested fix: cache the parsed `Set<String>` per table name in the 
deserializer, e.g. a `Map<String, Set<String>>` instance field populated lazily 
or eagerly in `init()`. This follows the existing pattern used for 
`serverTimeZone` (parsed once in `init()` and reused).
   
   ```java
   // In DebeziumJsonDeserializer, add instance field:
   private Map<String, Set<String>> excludeColumnsCache = new HashMap<>();
   
   // In deserializeDataChangeRecord, replace the current call with:
   Set<String> excludeColumns = excludeColumnsCache.computeIfAbsent(
       tableName, t -> ConfigUtil.parseExcludeColumns(context, t));
   ```



##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java:
##########
@@ -116,6 +122,31 @@ public static boolean isJson(String str) {
         }
     }
 
+    /**
+     * Parse the exclude-column set for a specific table from config.

Review Comment:
   **[Code Duplication]** This method is an exact duplicate of 
`StreamingJobUtils.parseExcludeColumns()` in `fe-core`. Both modules (`fe-core` 
and `cdc_client`) already depend on the `fe-common` module where 
`DataSourceConfigKeys` lives. Consider moving this utility method to 
`DataSourceConfigKeys` (or a sibling utility class in `fe-common`) to avoid the 
duplication and risk of the two copies diverging.



##########
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java:
##########
@@ -43,11 +43,28 @@ public class DataSourceConfigValidator {
             DataSourceConfigKeys.SSL_ROOTCERT
     );
 
+    // Known suffixes for per-table config keys (format: 
"table.<tableName>.<suffix>")
+    private static final Set<String> ALLOW_TABLE_LEVEL_SUFFIXES = 
Sets.newHashSet(
+            DataSourceConfigKeys.TABLE_EXCLUDE_COLUMNS_SUFFIX
+    );
+
+    private static final String TABLE_LEVEL_PREFIX = 
DataSourceConfigKeys.TABLE + ".";
+
     public static void validateSource(Map<String, String> input) throws 
IllegalArgumentException {
         for (Map.Entry<String, String> entry : input.entrySet()) {
             String key = entry.getKey();
             String value = entry.getValue();
 
+            if (key.startsWith(TABLE_LEVEL_PREFIX)) {
+                // per-table config key: table.<tableName>.<suffix>

Review Comment:
   **[Validation Gap]** This validation has two issues:
   
   1. **Missing table name segment not detected:** A key like 
`table.exclude_columns` (with no table name in the middle) passes validation 
because `lastIndexOf('.')` finds the dot between `table` and `exclude_columns`, 
and the suffix `exclude_columns` is in the allowlist. The intended format 
requires 3 segments (`table.<tableName>.<suffix>`), but this only validates 2. 
A user who writes `"table.exclude_columns" = "col1"` instead of 
`"table.my_table.exclude_columns" = "col1"` gets no error -- the config is 
silently ignored at runtime.
   
   2. **Value not validated:** The `continue` skips the `isValidValue` check, 
so null/empty values for per-table keys pass without error.
   
   Suggested fix:
   ```java
   if (key.startsWith(TABLE_LEVEL_PREFIX)) {
       String rest = key.substring(TABLE_LEVEL_PREFIX.length());
       int dot = rest.lastIndexOf('.');
       if (dot <= 0) { // no table name or empty table name
           throw new IllegalArgumentException(
               "Invalid per-table config key: '" + key + "'. Expected format: 
table.<tableName>.<suffix>");
       }
       String suffix = rest.substring(dot + 1);
       if (!ALLOW_TABLE_LEVEL_SUFFIXES.contains(suffix)) {
           throw new IllegalArgumentException("Unknown per-table config key: '" 
+ key + "'");
       }
       if (value == null || value.isEmpty()) {
           throw new IllegalArgumentException("Invalid value for key '" + key + 
"': value cannot be empty");
       }
       continue;
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [Improve](Streamingjob) support exclude_columns for Postgres streaming job [doris]

Reply via email to