github-actions[bot] commented on code in PR #61267:
URL: https://github.com/apache/doris/pull/61267#discussion_r2923469075
##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java:
##########
@@ -116,6 +122,31 @@ public static boolean isJson(String str) {
}
}
+ /**
+ * Parse the exclude-column set for a specific table from config.
+ *
+ * <p>Looks for key {@code "table.<tableName>.exclude_columns"} whose
value is a comma-separated
+ * column list, e.g. {@code "secret,internal_note"}.
+ *
+ * @return lower-cased column name set; empty set when the key is absent
+ */
Review Comment:
**[Incorrect Javadoc]** The `@return` says "lower-cased column name set" but
the implementation does not apply `String::toLowerCase` -- it only trims. The
returned set preserves the original casing from the config value.
Either fix the Javadoc to say `column name set` (recommended, since
preserving case is correct for case-sensitive databases), or add
`.map(String::toLowerCase)` if lowercase normalization is actually intended.
##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/source/deserialize/DebeziumJsonDeserializer.java:
##########
@@ -93,35 +95,43 @@ public DeserializeResult deserialize(Map<String, String>
context, SourceRecord r
throws IOException {
if (RecordUtils.isDataChangeRecord(record)) {
LOG.trace("Process data change record: {}", record);
- List<String> rows = deserializeDataChangeRecord(record);
+ List<String> rows = deserializeDataChangeRecord(record, context);
return DeserializeResult.dml(rows);
} else {
return DeserializeResult.empty();
}
}
- private List<String> deserializeDataChangeRecord(SourceRecord record)
throws IOException {
+ private List<String> deserializeDataChangeRecord(
+ SourceRecord record, Map<String, String> context) throws
IOException {
List<String> rows = new ArrayList<>();
+ String tableName = extractTableName(record);
+ Set<String> excludeColumns = ConfigUtil.parseExcludeColumns(context,
tableName);
Envelope.Operation op = Envelope.operationFor(record);
Review Comment:
**[Performance]** `ConfigUtil.parseExcludeColumns(context, tableName)` is
called on **every DML record** in the CDC hot path. Each call builds a string
key via concatenation, does a map lookup, splits the comma-separated value,
trims each element, filters empties, and collects into a new `HashSet`.
For high-throughput CDC streams this is wasteful. The exclude column set is
static for the lifetime of the deserializer -- the config never changes between
records.
Suggested fix: cache the parsed `Set<String>` per table name in the
deserializer, e.g. a `Map<String, Set<String>>` instance field populated lazily
or eagerly in `init()`. This follows the existing pattern used for
`serverTimeZone` (parsed once in `init()` and reused).
```java
// In DebeziumJsonDeserializer, add instance field:
private Map<String, Set<String>> excludeColumnsCache = new HashMap<>();
// In deserializeDataChangeRecord, replace the current call with:
Set<String> excludeColumns = excludeColumnsCache.computeIfAbsent(
tableName, t -> ConfigUtil.parseExcludeColumns(context, t));
```
##########
fs_brokers/cdc_client/src/main/java/org/apache/doris/cdcclient/utils/ConfigUtil.java:
##########
@@ -116,6 +122,31 @@ public static boolean isJson(String str) {
}
}
+ /**
+ * Parse the exclude-column set for a specific table from config.
Review Comment:
**[Code Duplication]** This method is an exact duplicate of
`StreamingJobUtils.parseExcludeColumns()` in `fe-core`. Both modules (`fe-core`
and `cdc_client`) already depend on the `fe-common` module where
`DataSourceConfigKeys` lives. Consider moving this utility method to
`DataSourceConfigKeys` (or a sibling utility class in `fe-common`) to avoid the
duplication and risk of the two copies diverging.
##########
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/DataSourceConfigValidator.java:
##########
@@ -43,11 +43,28 @@ public class DataSourceConfigValidator {
DataSourceConfigKeys.SSL_ROOTCERT
);
+ // Known suffixes for per-table config keys (format:
"table.<tableName>.<suffix>")
+ private static final Set<String> ALLOW_TABLE_LEVEL_SUFFIXES =
Sets.newHashSet(
+ DataSourceConfigKeys.TABLE_EXCLUDE_COLUMNS_SUFFIX
+ );
+
+ private static final String TABLE_LEVEL_PREFIX =
DataSourceConfigKeys.TABLE + ".";
+
public static void validateSource(Map<String, String> input) throws
IllegalArgumentException {
for (Map.Entry<String, String> entry : input.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
+ if (key.startsWith(TABLE_LEVEL_PREFIX)) {
+ // per-table config key: table.<tableName>.<suffix>
Review Comment:
**[Validation Gap]** This validation has two issues:
1. **Missing table name segment not detected:** A key like
`table.exclude_columns` (with no table name in the middle) passes validation
because `lastIndexOf('.')` finds the dot between `table` and `exclude_columns`,
and the suffix `exclude_columns` is in the allowlist. The intended format
requires 3 segments (`table.<tableName>.<suffix>`), but this only validates 2.
A user who writes `"table.exclude_columns" = "col1"` instead of
`"table.my_table.exclude_columns" = "col1"` gets no error -- the config is
silently ignored at runtime.
2. **Value not validated:** The `continue` skips the `isValidValue` check,
so null/empty values for per-table keys pass without error.
Suggested fix:
```java
if (key.startsWith(TABLE_LEVEL_PREFIX)) {
String rest = key.substring(TABLE_LEVEL_PREFIX.length());
int dot = rest.lastIndexOf('.');
if (dot <= 0) { // no table name or empty table name
throw new IllegalArgumentException(
"Invalid per-table config key: '" + key + "'. Expected format:
table.<tableName>.<suffix>");
}
String suffix = rest.substring(dot + 1);
if (!ALLOW_TABLE_LEVEL_SUFFIXES.contains(suffix)) {
throw new IllegalArgumentException("Unknown per-table config key: '"
+ key + "'");
}
if (value == null || value.isEmpty()) {
throw new IllegalArgumentException("Invalid value for key '" + key +
"': value cannot be empty");
}
continue;
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]